Environment:
This program is designed to run on Linux, and has been tested on Ubuntu 22.04.1 LTS.
If you do not have a Linux machine, you may install WSL (Windows Subsystem for Linux) on Windows.
If you use a MacBook, you may install Ubuntu 22.04 LTS through VirtualBox. Follow the tutorial step by step to finish the installation.
Verify Python and Pip:
Verify that you have the correct version of Python. It should be above 3.10.
python3 --version
Verify you have pip3 installed.
which pip3
If it does not show anything and you use Ubuntu, install pip3:
sudo apt-get install python3-pip
Otherwise, you may install it manually; refer here.
Python Virtual Environment:
Set up a Python virtual environment named opgee.
Follow this tutorial on setting up a Python virtual environment. Preferably, use virtualenvwrapper.
Switch to the opgee environment using the command:
workon opgee
Ideally, put the above line at the end of your ~/.bashrc file so you do not need to switch manually every time.
Repository Setup:
Clone this repository in your home directory or any directory you prefer.
git clone https://github.com/Richardczl98/OPGEE-LLM-Training.git
cd OPGEE-LLM-Training
If you do not have a GitHub personal access token, please apply for one by following this tutorial.
Install Python libraries:
Navigate to the deploy directory and install the necessary packages by following the Python Development Environment Setup Instructions.
Create .env
Ask the admin to give you the .env file. Please note that its filename starts with a dot. It looks like the example below, but with real values.
PROJECT='opgee'
OPENAI_API_KEY='your openai api key'
ANTHROPIC_API_KEY='your claude api key'
AZURE_OPENAI=True
AZURE_OPENAI_API_KEY='ask admin for the key'
AZURE_OPENAI_API_BASE='ask admin for the endpoint. e.g. https://xyz.openai.azure.com/'
AZURE_OPENAI_API_VERSION='2023-08-01-preview check azure API doc if you would like higher version'
AZURE_OPENAI_GPT-4-DEPLOYMENT_NAME='ask admin for the name'
AZURE_OPENAI_GPT-3dot5-DEPLOYMENT_NAME='ask admin for the name'
Please never check your .env into the repo.
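These values are read as environment variables at runtime. As a minimal sketch of how they can be loaded, assuming the project uses python-dotenv (an assumption, not confirmed by the repo):
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file from the current directory
openai_key = os.getenv("OPENAI_API_KEY")
use_azure = os.getenv("AZURE_OPENAI") == "True"  # values are stored as strings in .env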
Run the CLI
Use -h to learn how to run the CLI.
python opgee_cli.py -h
For example, a possible call is:
python opgee_cli.py -p ./data/zips/spe-115712-ms.zip -m gpt-4 -g ./data/spe/SPE-115712-ms.xlsx
The test zip and xlsx above are checked into the repo, so you should be able to run this command as-is. The results are placed in ./result/paper-name/time-model/, which contains the following files:
paper-name.log
extract.xlsx
extract_ref.xlsx
eval_matrix.png (if ground truth is provided)
eval_matrix_report.txt (if ground truth is provided)
cost.txt (if an OpenAI model is used)
If you want to display and evaluate only the values the model successfully extracted in extract.xlsx, add the -so parameter:
python opgee_cli.py -p ./data/zips/spe-115712-ms.zip -m gpt-4 -g ./data/spe/SPE-115712-ms.xlsx -so
Evaluate the Results
The table below is our definition of TP/TN/FP/FN. Please note this is NOT a confusion matrix. (See metric.png.)
The confusion matrix built from the definitions above is shown in confusion-matrix.png.
There is another version of the confusion matrix that may be a little easier to understand; see alternative-confusion-matrix.png.
We look at some metrics defined below.
TP% = TP/(TP+FP+FN). Not a standard metric. Since our TN is usually very large, this is sometimes used in place of accuracy.
FP% = FP/(TP+FP+FN). The lower the better. A high value means the model tends to hallucinate. In our extraction, a type-1-red error is worse than a type-1-orange error, which is worse than a type-2-blue error. A high rate of type-1-orange errors means the model's reasoning is not correct. We distinguish type-1-red and type-1-orange errors in extract_ref.xlsx, but have not counted them separately to calculate any metrics.
FN% = FN/(TP+FP+FN). The lower the better. A high rate of type-2-blue errors means the model is too conservative to make extractions.
TPR/recall = TP/(TP+FN) = TP/P. Standard metric: for a given POSITIVE, the probability that it is extracted correctly. If we want papers to be extracted as fully as possible, we need to watch this value closely.
FPR = FP/(FP+TN) = FP/N. For a given NEGATIVE, the probability that it turns into a type-1 error. This can be used to reflect hallucination.
TNR = TN/(FP+TN) = TN/N = 1 - FPR. For a given NEGATIVE, the probability that it is correctly extracted.
FNR = FN/(TP+FN) = FN/P = 1 - recall. For a given POSITIVE, the probability that it becomes a type-2 error.
Accuracy = (TP+TN)/(P+N). Standard metric. Usually very high, since our TN is very large compared with TP.
Accuracy* = TP%. Not a standard metric.
Precision = TP/(TP+FP). Standard metric: when there is a POSITIVE extraction, the probability that the extraction is right. From a data consumer's perspective, this should be the most important metric.
F1 = 2*recall*precision/(recall+precision). The standard metric showing the balance between recall and precision.
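For quick reference, here is a direct transcription of the formulas above into Python (an illustration only; this is not taken from the repo's evaluation code):
def metrics(tp, tn, fp, fn):
    p, n = tp + fn, fp + tn  # actual positives / negatives
    e = tp + fp + fn         # denominator for the non-standard % metrics
    return {
        "TP%": tp / e,
        "FP%": fp / e,
        "FN%": fn / e,
        "TPR/recall": tp / p,
        "FPR": fp / n,
        "TNR": tn / n,
        "FNR": fn / p,
        "accuracy": (tp + tn) / (p + n),
        "precision": tp / (tp + fp),
        "F1": 2 * tp / (2 * tp + fp + fn),  # equals 2*recall*precision/(recall+precision)
    }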
Demo Script
demo.sh will run spe-115712-ms, spe-210009-ms and spe-28002-ms in one shot.
git pull
bash script/demo.sh
More sample CLIs
python opgee_cli.py -p ./data/zips/spe-115712-ms.zip -m gpt-4 -g ./data/spe/SPE-115712-ms.xlsx
python opgee_cli.py -p ./data/zips/spe-210009-ms.zip -m gpt-4 -g ./data/spe/SPE-210009-ms.xlsx
python opgee_cli.py -p ./data/zips/spe-28002-ms.zip -m gpt-4 -g ./data/spe/SPE-28002-ms.xlsx
python opgee_cli.py -p ./data/zips/spe-179612-ms.zip -m gpt-4 -g ./data/spe/SPE-179612-ms-WL.xlsx
python opgee_cli.py -p ./data/zips/spe-9478-pa.zip -m gpt-4 -g ./data/spe/spe-9478-pa-RZ.xlsx
python opgee_cli.py -p ./data/zips/spe-38926-ms.zip -m gpt-4 -g ./data/spe/spe-38926-ms_intermediate.xlsx
python opgee_cli.py -p ./data/zips/iptc-18834-ms.zip -m gpt-4 -g ./data/iptc/iptc-18834-ms_intermediate.xlsx
Playground
There are a few separate CLIs provided to give users the flexibility to run various tasks.
Prepare a Paper
The first step is to convert a PDF to the zip format so it can be further parsed into txt files, because the model can only accept text. Always use -h to check how the CLI works.
python script/prepare_paper.py -h
A possible example is as follows.
python script/prepare_paper.py -p data/spe/spe-87016-ms.pdf
The above command generates the paper's zip format at result/spe-87016-ms/spe-87016-ms.zip. It also saves an unzipped directory with the content.
It also generates the txt format of the paper in result/spe-87016-ms/txt. It contains paper.txt (all pages), paper-1.txt (the first page), ..., paper-n.txt (the last page). If you notice a txt file named paper-15-18.txt, that means a big table spans these pages, and they always have to be processed together.
Please refrain from using this command for now, since there is a 500/month limit on the current key. We'll need multiple keys in production mode later.
Once you have prepared a paper, you can refer to it by its paper name alone; there is no need to provide a path or file extension.
Show Tokens
This command gives you an idea of the number of tokens on each page (or pages, if they cannot be separated) so you can plan the next steps accordingly. Always use -h to check how the CLI works.
python script/show_tokens.py -h
A possible example is as follows.
python script/show_tokens.py -p spe-87016-ms -m gpt-4
All GPT models use the same tokenizer, so their token counts are identical. You can call this CLI as many times as you want; it runs locally and does not consume any API credits.
However, for Claude it might consume API credits, so please refrain from using it with Claude.
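For reference, local GPT token counting can be reproduced with the tiktoken library (a sketch of the likely mechanism; the script's actual implementation may differ):
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # same tokenizer across GPT models
text = open("result/spe-87016-ms/txt/paper-1.txt").read()
print(len(enc.encode(text)))  # token count for this page, computed locally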
Export System Prompt
If you do not want to start your playground prompt (script/playground/prompt.txt) from scratch, you can export the prompts used in the system.
python script/export_prompt.py -h
A possible example is as follows.
python script/export_prompt.py -f Bangko -m gpt-4
The system prompts will be saved in script/prompt/Bangko-gpt/.
Ask LLM
This is the main CLI you use to interact with the LLM and ask questions to make proper extractions.
Always use -h to check how the CLI works.
python script/ask_llm.py -h
A possible example is as follows.
python script/ask_llm.py -rl page-manual -p spe-87016-ms -m gpt-4 -sp 6 -ep 6
There are two ways to load the paper as the reference:
-rl sys (The paper is loaded and split the same way the system does it. The question in playground/prompt.txt will be sent to the LLM multiple times, once per split. You'll get multiple responses.)
-rl page-manual (You manually choose the start and end pages. The CLI will make sure your choice can be packed into a single request. The question in playground/prompt.txt will be sent to the LLM once with the pages you've chosen. You'll get one response.)
If you want GPT to output answers in JSON format, add the '-j' parameter:
python script/ask_llm.py -rl page-manual -p spe-87016-ms -m gpt-4 -sp 6 -ep 6 -j
Note: because JSON output mode is only supported on GPT-4-Turbo, the input model will be forcibly replaced with GPT-4-Turbo.
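Under the hood, JSON mode corresponds to the response_format parameter of the OpenAI chat completions API, roughly as follows (a sketch, not the repo's exact call):
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},  # JSON mode
    # the API requires the word "json" to appear somewhere in the messages
    messages=[{"role": "user", "content": "Return the field name as JSON."}],
)
print(resp.choices[0].message.content)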
No matter which ref-loader you choose, the final prompt sent to the LLM is as follows:
{paper split}
{content of playground/prompt.txt or the file you choose through CLI}
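In other words, the reference text is simply prepended to your question. A minimal sketch of that assembly (paths taken from this README; the actual code may differ):
from pathlib import Path

paper_split = Path("result/spe-87016-ms/txt/paper-6.txt").read_text()  # a paper split
question = Path("script/playground/prompt.txt").read_text()            # your prompt
final_prompt = f"{paper_split}\n{question}"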
There are three files in the script/playground/ directory you need to be aware of:
script/playground/sys_msg.txt (defines how the model behaves overall)
script/playground/prompt.txt (this is your prompt playground; edit as you wish)
script/playground/output.txt (output appears in the console, but is also appended to this file)
If you would like to track more debug details, they are located in result/paper-name/time-model-playground/paper-name.log.
The cost of running this command is also printed in the console at the end, and saved to result/paper-name/time-model-playground/cost.txt.
Evaluate Historical LLM Output
There are CLIs to run evaluation on historical LLM outputs.
python opgee_local_cli.py -h
python opgee_local_cli.py -d result
python opgee_local_cli.py
python opgee_local_cli.py -f result/spe-182043-ms/231103_1340-gpt-4-individual/extract_raw.xlsx
python opgee_local_cli.py -f result/spe-182043-ms/231103_1340-gpt-4-individual/extract_raw.xlsx -so
Options:
python opgee_local_cli.py -h
options:
-h, --help show this help message and exit
-f RAW_FILE, --raw_file RAW_FILE
Path of the raw output file from the LLM model that needs to be re-evaluated. The file name must be 'extract_raw.xlsx'.
-r RAW_FILE_DIR, --raw_file_dir RAW_FILE_DIR
Result dir containing raw output files from the LLM model that need to be re-evaluated. Default is '/opgee/result'.
-d DESTINATION_DIR, --destination_dir DESTINATION_DIR
Path to store re-evaluated results. If not provided, the default is '/opgee/result_history'.
-so, --success_only Only display and evaluate successfully parsed values in the result excels.
Generate "TN,TP,FN,FP" statistics & Output multiple excels to one excel containing multiple sheets
There are CLIs pare provided for users to generate statistics about "TN, TP, FP, FN". Output is in the input directory, Also available output multiple excels to one excel containing multiple sheets
python eval/eval_aggr.py -h
python eval/eval_aggr.py -c result
python eval/eval_aggr.py -c result/spe-182043-ms
python eval/eval_aggr.py -t 230101
# Use the latest folder containing 'eval_matrix.csv' in the default result directory to generate statistics and merged results.
python eval/eval_aggr.py
python eval/eval_aggr.py -e result/otc-26509-ms/231110_1626-gpt-4-individual/extract_ref.xlsx result/otc-26509-ms/231110_1626-gpt-4-individual/extract_ref.xlsx
python eval/eval_aggr.py -m result
python eval/eval_aggr.py -d <result_path_1> <result_path_2> ... <result_path_n>
Options:
python eval/eval_aggr.py -h
-h, --help show this help message and exit
-c CSV_PATH, --csv-path CSV_PATH
Path of the dir in which to generate 'TN TP FP FN' statistics; default is 'result'.
-t TIME, --time TIME Produce 'TN TP FP FN' statistics from the given time; the format is '%y%m%d', like '231019'.
-e EXCEL_PATH [EXCEL_PATH ...], --excel-path EXCEL_PATH [EXCEL_PATH ...]
Paths of the excels used to generate the combined excel, like 'result/otc-26509-ms/231110_1626-gpt-4-individual/extract_ref.xlsx'. Please note that the data table in each excel must be in the sheet named Sheet1.
-m MERGE_RESULTS, --merge_results MERGE_RESULTS
Path of the dir in which to generate a combined excel with ground truth, extracted_ref, and extracted_raw values, like 'result'. The generated excel file, extracted_summary.xlsx, will also be saved in this path.
-d DIFF_RESULTS [DIFF_RESULTS ...], --diff_results DIFF_RESULTS [DIFF_RESULTS ...]
Path list of your results. To generate the difference table, at least 2 batches of results need to be provided. The difference table will be saved in 'result_diff/extract_diff.xlsx'.
"Deep Mode"
Deep Mode means no background information is given when asking the LLM. In other words, the field's name and the field's age (year) are all answered by the LLM.
In this mode, a ground truth file is not required. If ground truth is provided, it will also be used for evaluation; otherwise, the evaluation step is skipped.
To enable "Deep Mode" in the opgee CLI, add the -d parameter to the command line.
Here is an example:
python opgee_cli.py -p ./data/zips/spe-9478-pa.zip -m gpt-4 -g ./data/spe/spe-9478-pa-v2.xlsx -d
python opgee_cli.py -p ./data/zips/spe-9478-pa.zip -m gpt-4 -d