This project develops a chatbot that queries Uruguay's ECH (Encuesta Continua de Hogares, the Continuous Household Survey), using the survey's variable dictionary and OpenAI's embedding and NLP models.
The first step takes the original ECH variable dictionary (the 'json_ech.json' file), extracts each variable name and its associated question, and generates embeddings for the questions only. At query time, the model is then given, as its variable dictionary, only the variables whose questions most closely resemble the user's query. This pre-processing is done in Pre-processing.ipynb, and the result is the file 'variables_ech_emb.json'.
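As a rough illustration of this pre-processing step, the sketch below pairs each variable with its question and attaches an embedding. It is not the code from Pre-processing.ipynb: the entry layout (`variable` and `question` keys), the variable names, and the `toy_embed` placeholder are all assumptions made for the example. In the real pipeline, `embed` would call an OpenAI embedding model instead.

```python
import json


def build_question_index(entries, embed):
    """Attach an embedding to every (variable, question) pair.

    `entries` is assumed to be a list of dicts with "variable" and
    "question" keys; the real layout of json_ech.json may differ.
    `embed` is any callable that maps a string to a list of floats.
    """
    index = []
    for entry in entries:
        question = entry["question"]
        index.append(
            {
                "variable": entry["variable"],
                "question": question,
                # Only the question text is embedded, not the variable name.
                "embedding": embed(question),
            }
        )
    return index


def toy_embed(text):
    # Placeholder embedding (hypothetical): a stand-in for a real
    # embedding model so that the sketch runs without network access.
    return [float(len(text)), float(text.count(" "))]


# Hypothetical dictionary entries; these are not real ECH variables.
sample_entries = [
    {"variable": "var_a", "question": "What is your age?"},
    {"variable": "var_b", "question": "In which department do you live?"},
]

index = build_question_index(sample_entries, toy_embed)

# The result can be serialized to JSON, as with 'variables_ech_emb.json'.
serialized = json.dumps(index, ensure_ascii=False, indent=2)
```

Passing the embedding function as a parameter keeps the indexing logic independent of any particular embedding provider.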
The second step builds a retrieval-augmented generation (RAG) pipeline that uses the variable dictionary to answer user queries about the ECH. It runs in several parts:
a. The user query is taken and compared to the ECH questions using the cosine distance between embeddings. The 30 most similar questions are retained.
b. An NLP model is asked to write Python code that performs the filters and calculations needed to answer the user's query. For this purpose, it is given a dictionary with the 30 previously selected questions, plus extra information such as the names and structure of essential variables: department (province), gender, and age.
c. Once the Python code is obtained, a different NLP model is used to modify it by incorporating the ECH semi-annual weights.
d. Finally, the Python code is run and a response is generated for the user with the information they asked for, as well as the Python code used to generate it.
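The retrieval in part (a) can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function names, the index layout, and the two-dimensional vectors are assumptions. It ranks by cosine similarity, which is equivalent to ranking by smallest cosine distance, since the distance is one minus the similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Treat a zero vector as having no similarity to anything.
    return dot / norm if norm else 0.0


def top_k_questions(query_embedding, index, k=30):
    """Return the k index entries most similar to the query embedding."""
    ranked = sorted(
        index,
        key=lambda entry: cosine_similarity(query_embedding, entry["embedding"]),
        reverse=True,  # highest similarity first
    )
    return ranked[:k]


# Hypothetical index with tiny 2-D embeddings, for illustration only.
example_index = [
    {"variable": "var_a", "embedding": [1.0, 0.0]},
    {"variable": "var_b", "embedding": [0.0, 1.0]},
    {"variable": "var_c", "embedding": [1.0, 1.0]},
]

top_two = top_k_questions([1.0, 0.0], example_index, k=2)
selected_variables = [entry["variable"] for entry in top_two]
```

With the default `k=30`, the selected entries would form the reduced variable dictionary handed to the code-generating model in part (b).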
Some examples of queries that the chatbot is able to answer correctly:
The chatbot can be used in the notebook 'chatbot.ipynb'.
Uruguay's ECH has a methodology whereby an initial implementation survey is carried out, followed by a longitudinal monitoring panel. This version of the chatbot works with the structure of the implementation survey and its weights.
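To illustrate why the weights matter, the sketch below compares an unweighted share with a weighted one. The `weight` and `gender` column names and the values are made up for the example; they are not the ECH's actual variable names or data, and the code the chatbot generates may look different.

```python
def weighted_share(rows, predicate, weight_key="weight"):
    """Share of the weighted population for which `predicate` holds.

    `rows` is a list of dicts, one per respondent; `weight_key` names
    the (hypothetical) column that holds the survey expansion weight.
    """
    total = sum(row[weight_key] for row in rows)
    matching = sum(row[weight_key] for row in rows if predicate(row))
    return matching / total if total else 0.0


# Made-up respondents; each weight is the number of people represented.
rows = [
    {"gender": "F", "weight": 100},
    {"gender": "M", "weight": 300},
    {"gender": "F", "weight": 600},
]


def is_female(row):
    return row["gender"] == "F"


# Unweighted: 2 of 3 respondents. Weighted: 700 of 1000 people represented.
unweighted = sum(1 for row in rows if is_female(row)) / len(rows)
weighted = weighted_share(rows, is_female)
```

The two figures differ because each respondent stands for a different number of people in the population, which is why the generated code is adjusted to use the survey weights before it is run.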