This project presents a comprehensive search engine designed to return accurate and relevant results by leveraging concepts from Natural Language Processing (NLP), Machine Learning (ML), and Information Retrieval (IR). It integrates ranking models such as TF-IDF, BM25, and BERT to build an efficient search solution, with a streamlined graphical user interface (GUI) for displaying results, developed in Python with Streamlit.
The project utilizes web-scraped data from Wikipedia and additional documents from Google, amounting to a corpus of 529 documents across 115 topics, to build a high-quality index for search retrieval. By combining preprocessing and indexing techniques with query expansion and relevance feedback, the project improves both search precision and recall. Evaluation metrics, such as normalized Discounted Cumulative Gain (nDCG), further validate the accuracy and effectiveness of the search results.
In the realm of information retrieval, the efficient extraction of relevant data from large document collections is essential. This project introduces a robust search engine built on NLP, ML, and IR techniques to provide users with accurate search results. By utilizing a combination of classical (TF-IDF, BM25) and advanced (BERT) models, it offers a multi-layered approach to data retrieval, ensuring a user-friendly experience through an intuitive Streamlit interface.
The search engine is built using Python-based tools such as PyTerrier and Streamlit, enabling users to run queries and retrieve relevant documents with ease. Data preprocessing, query expansion, and indexing play a critical role in this search engine's ability to handle diverse queries effectively.
The dataset consists of 529 documents gathered through web scraping, including Wikipedia articles and documents from Google searches. The corpus spans 115 topics, each associated with approximately five documents.
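The report does not reproduce the scraping code. The following is a minimal sketch of how such a collection step might look, assuming the requests and BeautifulSoup libraries; the scrape_wikipedia helper and the topic list are illustrative, not the project's actual crawl set:

```python
# Hypothetical sketch of the data-collection step: fetch one Wikipedia
# article per topic and keep its paragraph text. The helper name, URL
# pattern, and topic list are illustrative.
import requests
from bs4 import BeautifulSoup

def scrape_wikipedia(title: str) -> str:
    """Return the concatenated paragraph text of a Wikipedia article."""
    url = f"https://en.wikipedia.org/wiki/{title}"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return "\n".join(p.get_text(" ", strip=True) for p in soup.select("p"))

# One document per topic here; the project pairs each of its 115 topics
# with roughly five documents.
corpus = {topic: scrape_wikipedia(topic) for topic in ["Gaza", "Egypt", "COVID-19"]}
```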
The preprocessing stage involved cleaning and normalizing the scraped text before indexing.
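The exact preprocessing steps are not enumerated here; the following is a minimal sketch assuming a typical normalization pipeline (lowercasing, tokenization, stopword removal, and Porter stemming via NLTK), with preprocess being a hypothetical helper name:

```python
# Assumed preprocessing pipeline: lowercase, tokenize, drop stopwords, stem.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    """Normalize a raw document into a cleaned term string."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in STOPWORDS)
```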
Indexing was performed using the PyTerrier library, where an inverted index was built. Unique terms were compiled, and frequency data was collected for each document.
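A sketch of this indexing step with PyTerrier's IterDictIndexer, assuming the documents live in the corpus mapping from the scraping sketch above (the docno scheme and index path are illustrative):

```python
# Build an inverted index over the corpus with PyTerrier.
import pyterrier as pt

if not pt.started():
    pt.init()

docs = [{"docno": f"d{i}", "text": text}
        for i, text in enumerate(corpus.values())]

indexer = pt.IterDictIndexer("./search_index", overwrite=True)
index_ref = indexer.index(iter(docs))
index = pt.IndexFactory.of(index_ref)

# Reports unique terms, document count, and posting statistics.
print(index.getCollectionStatistics().toString())
```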
Queries underwent the same preprocessing as the documents, followed by retrieval and ranking with the TF-IDF algorithm: the system returns documents containing matching terms, ranked by relevance score, and displays the top results for each query.
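In PyTerrier, this TF-IDF ranking can be expressed as a BatchRetrieve transformer over the index built above (the query string is only an example):

```python
# Rank documents with Terrier's built-in TF_IDF weighting model.
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")
results = tfidf.search("zewail city")
print(results[["docno", "score", "rank"]].head())  # top-ranked documents
```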
To enhance retrieval, the BM25 model was applied, supplemented by relevance feedback using the Rocchio algorithm with the RM3 model. Additionally, a search_bert function re-ranks documents using BERT for improved relevance.
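A sketch of this pipeline: BM25 with RM3 query expansion chained as PyTerrier transformers, followed by a BERT re-ranking stage. The report does not show the internals of search_bert, so the cross-encoder below is an assumed stand-in, and the docno parsing relies on the illustrative scheme used above:

```python
# BM25 retrieval with RM3 pseudo-relevance feedback, then BERT re-ranking.
# Note: pt.rewrite.RM3 requires PyTerrier's terrier-prf plugin to be loaded.
from sentence_transformers import CrossEncoder

bm25 = pt.BatchRetrieve(index, wmodel="BM25")
rm3_pipeline = bm25 >> pt.rewrite.RM3(index) >> bm25  # expand, then re-score

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def search_bert(query: str, k: int = 10):
    """Retrieve with BM25+RM3, then re-rank the top-k hits with a BERT cross-encoder."""
    candidates = rm3_pipeline.search(query).head(k).copy()
    texts = [docs[int(d[1:])]["text"] for d in candidates["docno"]]  # "d0", "d1", ...
    candidates["score"] = reranker.predict([(query, t) for t in texts])
    return candidates.sort_values("score", ascending=False)
```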
A GUI was developed with Streamlit, allowing users to input queries and view the ranked results generated by the search_bert function. The entire application runs within Google Colab.
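A minimal sketch of such a front end, reusing the search_bert sketch above (widget labels and the app.py filename are illustrative):

```python
# app.py -- run with: streamlit run app.py
# (In Colab, the app is typically exposed through a tunnel such as ngrok.)
import streamlit as st

st.title("Search Engine")
query = st.text_input("Enter your query")

if query:
    results = search_bert(query)  # BM25+RM3 retrieval, BERT re-ranking
    st.dataframe(results[["docno", "score"]])
```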
The project evaluates accuracy through nDCG metrics. Test queries such as "Standup Comedy," "Gaza," "Egypt," "Zewail City," and "Covid" assess retrieval precision and speed, measured using Python’s time library.
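A sketch of how such an evaluation might be run with pt.Experiment, using placeholder topics and qrels (the project's actual relevance judgments are not reproduced here) and time.perf_counter for latency:

```python
# Evaluate nDCG for two rankers and time a single query.
import time
import pandas as pd

topics = pd.DataFrame([
    {"qid": "1", "query": "standup comedy"},
    {"qid": "2", "query": "gaza"},
])
qrels = pd.DataFrame([          # placeholder relevance judgments
    {"qid": "1", "docno": "d0", "label": 1},
    {"qid": "2", "docno": "d1", "label": 1},
])

print(pt.Experiment([tfidf, rm3_pipeline], topics, qrels,
                    eval_metrics=["ndcg"], names=["TF-IDF", "BM25+RM3"]))

start = time.perf_counter()
rm3_pipeline.search("covid")
print(f"query latency: {time.perf_counter() - start:.3f}s")
```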
The search engine demonstrated reliable performance across a variety of test queries. The nDCG scores reflect the model's ability to present relevant documents effectively and efficiently, validating its use in real-world applications.
This search engine project successfully combines traditional and advanced IR techniques, yielding a reliable, accurate retrieval tool suitable for diverse queries. The integration of TF-IDF, BM25, and BERT enables robust retrieval, while the GUI offers a seamless user experience. This project lays the groundwork for further exploration in NLP-driven search solutions, with potential expansions in dataset diversity and query complexity.
This project demonstrates the effective application of NLP, ML, and IR principles in building a functional search engine, showcasing promising results in information retrieval accuracy and relevance.