The rapid growth of artificial intelligence (AI) and machine learning (ML) research has produced an overwhelming volume of academic literature, making efficient document retrieval crucial for researchers. In response to this challenge, we developed a Mini-Retrieval-Augmented Generation (Mini-RAG) system built on a dataset compiled from major AI and ML conferences, including NeurIPS, ICML, ICLR, AAAI, and IJCAI, spanning 2010 to 2023. This dataset, which records paper titles, abstracts, authors, publication years, and source URLs, enables users to perform document similarity searches and explore research trends. The system uses the SentenceTransformer model "all-MiniLM-L6-v2" to generate sentence embeddings and FastAPI to serve similarity queries over them. Designed to be scalable and adaptable, this project aims to streamline research by improving access to relevant literature through natural language processing techniques.
Our dataset is a compiled collection of research papers from top-tier AI and ML conferences such as NeurIPS, ICML, ICLR, AAAI, and IJCAI, covering publications from 2010 to 2023. It serves as the foundation for our document similarity system, giving users access to a wide range of research topics and trends within the field.
The dataset contains the following information for each paper: title, abstract, authors, publication year, and source URL.
This dataset supports historical analysis of AI and ML advancements as well as a variety of applications in academic research and industry, which makes it a useful resource for document retrieval systems.
You can find and download the dataset in the resources section.
This section describes a system that embeds a dataset of documents and performs similarity searches, so that users can quickly retrieve the top n most similar documents for a given query. The backbone of this system is the SentenceTransformer model, specifically the "all-MiniLM-L6-v2" variant, a compact model that generates sentence embeddings efficiently.
The process begins with the creation of a database containing the document embeddings. We used a straightforward Python script to load the dataset, generate embeddings for each document using the SentenceTransformer model, and then store these embeddings in a dictionary for easy retrieval.
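The database-building step described above might look roughly like the following. This is a minimal sketch rather than the project's actual script: the helper names (`load_abstracts`, `build_embedding_db`, `load_encoder`), the CSV layout, and the `abstract` column name are all assumptions made for illustration. The encoder is passed in as a function so the dictionary-building logic can be exercised without downloading the model.

```python
import csv
from typing import Callable, Dict, List, Sequence


def load_abstracts(csv_path: str, text_column: str = "abstract") -> List[str]:
    """Read one text column from a CSV file (the column name is an assumption)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row[text_column] for row in csv.DictReader(f)]


def build_embedding_db(texts: Sequence[str], encode: Callable) -> Dict[int, dict]:
    """Map each document index to its text and its embedding vector."""
    vectors = encode(list(texts))  # one vector per document, in input order
    return {
        i: {"text": text, "embedding": vector}
        for i, (text, vector) in enumerate(zip(texts, vectors))
    }


def load_encoder(model_name: str = "all-MiniLM-L6-v2") -> Callable:
    """Return the model's encode method.

    The import is done lazily so the helpers above can be used without
    the sentence-transformers package installed.
    """
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name).encode
```

With the real model, a call such as `build_embedding_db(load_abstracts("papers.csv"), load_encoder())` would produce the dictionary, where `papers.csv` is a placeholder file name.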
The FastAPI framework powers the inference service, enabling users to input a document and receive a list of the most similar documents from the database. The service computes the cosine similarity between the query document and the documents in the database, returning the top n matches.
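The retrieval step can be sketched as below. This is an illustration under stated assumptions, not the project's service code: the function names, the `/similar` route, and the `text` and `n` request fields are hypothetical, and `db` is assumed to map document ids directly to non-zero embedding vectors. The similarity is computed in plain Python for clarity; for a large corpus a vectorized implementation would be the more natural choice.

```python
import heapq
import math
from typing import Callable, Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return float(dot / (norm_a * norm_b))


def top_n_similar(
    query_vec: Sequence[float], db: Dict[int, Sequence[float]], n: int = 5
) -> List[Tuple[int, float]]:
    """Return (doc_id, score) pairs for the n highest-scoring documents."""
    scored = ((doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in db.items())
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


def create_app(db: Dict[int, Sequence[float]], encode: Callable):
    """Wire the search into a FastAPI app (route and field names are illustrative)."""
    from fastapi import FastAPI  # imported lazily; only needed to serve requests
    from pydantic import BaseModel

    class Query(BaseModel):
        text: str
        n: int = 5

    app = FastAPI()

    @app.post("/similar")
    def similar(query: Query):
        query_vec = encode([query.text])[0]  # embed the query document
        hits = top_n_similar(query_vec, db, query.n)
        return [{"doc_id": doc_id, "score": score} for doc_id, score in hits]

    return app
```

The app returned by `create_app` can be run with any ASGI server. Keeping `top_n_similar` separate from the web layer means the ranking logic can be checked on small hand-made vectors before the model or the server is involved.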
To demonstrate the power of this system, here are a couple of examples showcasing the results:
We propose a method for producing ensembles of predictors based on holdout estimations of their generalization performances. This approach uses a prior directly on the performance of predictors taken from a finite set of candidates and attempts to infer which one is best. Using Bayesian inference, we can thus obtain a posterior that represents our uncertainty about that choice and construct a weighted ensemble of predictors accordingly. This approach has the advantage of not requiring that the predictors be probabilistic themselves, can deal with arbitrary measures of performance and does not assume that the data was actually generated from any of the predictors in the ensemble. Since the problem of finding the best (as opposed to the true) predictor among a class is known as agnostic PAC-learning, we refer to our method as agnostic Bayesian learning. We also propose a method to address the case where the performance estimate is obtained from k-fold cross validation. While being efficient and easily adjustable to any loss function, our experiments confirm that the agnostic Bayes approach is state of the art compared to common baselines such as model selection based on k-fold cross-validation or a linear combination of predictor outputs.
Ensembling is among the most popular tools in machine learning (ML) due to its effectiveness in minimizing variance and thus improving generalization. Most ensembling methods for black-box base learners fall under the umbrella of "stacked generalization," namely training an ML algorithm that takes the inferences from the base learners as input. While stacking has been widely applied in practice, its theoretical properties are poorly understood. In this paper, we prove a novel result, showing that choosing the best stacked generalization from a (finite or finite-dimensional) family of stacked generalizations based on cross-validated performance does not perform "much worse" than the oracle best. Our result strengthens and significantly extends the results in Van der Laan et al. (2007). Inspired by the theoretical analysis, we further propose a particular family of stacked generalizations in the context of probabilistic forecasting, each one with a different sensitivity for how much the ensemble weights are allowed to vary across items, timestamps in the forecast horizon, and quantiles. Experimental results demonstrate the performance gain of the proposed method.
Virtually any model we use in machine learning to make predictions does not perfectly represent reality. So, most of the learning happens under model misspecification. In this work, we present a novel analysis of the generalization performance of Bayesian model averaging under model misspecification and i.i.d. data using a new family of second-order PAC-Bayes bounds. This analysis shows, in simple and intuitive terms, that Bayesian model averaging provides suboptimal generalization performance when the model is misspecified. In consequence, we provide strong theoretical arguments showing that Bayesian methods are not optimal for learning predictive models, unless the model class is perfectly specified. Using novel second-order PAC-Bayes bounds, we derive a new family of Bayesian-like algorithms, which can be implemented as variational and ensemble methods. The output of these algorithms is a new posterior distribution, different from the Bayesian posterior, which induces a posterior predictive distribution with better generalization performance. Experiments with Bayesian neural networks illustrate these findings.
To use Mini-RAG on your own dataset, refer to the GitHub repository and follow the instructions in its README.md file.
The dataset we have compiled covers more than a decade of research from premier AI and machine learning conferences, making it a useful asset for the academic and industrial research communities. Beyond being a collection of records, it offers an overview of how the field of artificial intelligence has evolved over that period. It lets users explore and analyze the trajectory of AI research, and it provides historical context and a benchmark for future studies.
Furthermore, integrating this dataset with the Mini-RAG system shows how NLP techniques can be applied in practice to improve document retrieval. Using the SentenceTransformer model, the system retrieves relevant documents based on semantic similarity, which speeds up the research process by giving quicker access to pertinent studies. The project illustrates how a well-curated dataset combined with a capable embedding model can yield a practical tool for the academic and research community.