Info_Extraction: Automated Structured Data Extraction from Unstructured Documents

Project: Info_Extraction

Repository: https://github.com/hudasaleh97188/info_Extraction

Author: Huda Saleh

TL;DR

Info_Extraction is a streamlined, intelligent framework designed to transform unstructured document data into clean, structured formats. By leveraging modern Large Language Models (LLMs) and robust parsing techniques, it automates the tedious process of manual data entry. The system ingests various document types (such as PDFs or images), extracts key information fields with high precision, and outputs the data in standardized formats like JSON or CSV, making it immediately ready for downstream analysis or database integration.

Introduction: The Unstructured Data Bottleneck

In the modern data ecosystem, valuable information is often locked away in unstructured formats. Businesses process thousands of invoices, receipts, contracts, and resumes daily. While databases require structured inputs (rows and columns), the real world operates in PDFs, scanned images, and free-text emails.

Bridging this gap has traditionally required either brittle rule-based systems (Regex) that break with the slightest format change, or expensive human manual entry. Info_Extraction addresses this challenge by providing a flexible, code-centric solution that utilizes the reasoning capabilities of AI to understand and extract data contextually, regardless of the document's layout.

The Problem: Limitations of Traditional Extraction

Standard approaches to document processing face significant hurdles:

Template Rigidity: Traditional OCR tools often require defining specific "zones" on a page. If a vendor changes their invoice layout, the extraction fails.
Context Blindness: Regex can find a date pattern, but it cannot easily distinguish between an "Invoice Date" and a "Due Date" without complex logic.
Scalability: Manual data entry is slow, error-prone, and impossible to scale effectively with growing data volumes.

Info_Extraction: A Dynamic Solution

Info_Extraction is designed to be a generalized, adaptable extraction pipeline. Instead of relying on rigid templates, it treats information extraction as a semantic understanding task.

The workflow consists of the following stages:

Input Ingestion: User uploads unstructured documents (such as PDFs and images).
Schema Definition: The user defines what to extract (e.g., Names, Dates, Specific Clauses, or Summary Points) using a schema (often defined in Pydantic or JSON).
Process: Uses Mistral OCR to convert visual data into machine-readable text.
Extract: Employs an LLM-driven engine to identify and extract specific fields based on user-defined schemas.
Structured Output: The extracted data is validated and formatted into a structured output like JSON, CSV, or a Database entry.
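
As an illustration of the Schema Definition stage, a schema could be expressed as a Pydantic model along these lines. This is a minimal sketch: the `Invoice` and `LineItem` classes and their field names are hypothetical and are not taken from the repository.

```python
from datetime import date
from typing import List

from pydantic import BaseModel, ValidationError


class LineItem(BaseModel):
    # Hypothetical nested record for one invoice row.
    description: str
    amount: float


class Invoice(BaseModel):
    # Hypothetical schema; the field names are illustrative only.
    vendor_name: str
    invoice_date: date  # ISO strings such as "2023-10-25" are parsed into a date
    total: float        # numeric strings are coerced to float
    line_items: List[LineItem] = []


# Loosely typed values, as an LLM might return them.
raw = {
    "vendor_name": "Acme Ltd",
    "invoice_date": "2023-10-25",
    "total": "500.00",
    "line_items": [{"description": "Consulting", "amount": "500.00"}],
}
invoice = Invoice(**raw)
print(invoice.invoice_date.isoformat(), invoice.total)

# Values that cannot be coerced are rejected instead of passed downstream.
try:
    Invoice(vendor_name="Acme Ltd", invoice_date="not a date", total="500.00")
except ValidationError as exc:
    print("rejected:", len(exc.errors()), "error(s)")
```

Declaring the types once lets the same model both describe what to extract and check what came back.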

Key Features

Format Agnostic: Capable of handling various document layouts without the need for pre-training on specific templates.
LLM-Powered Precision: Uses the semantic understanding of Large Language Models to distinguish between ambiguous fields (e.g., differentiating "Billing Address" from "Shipping Address").
Structured Output: Automatically maps unstructured text to rigid data schemas (e.g., forcing dates into YYYY-MM-DD format).
Automated Validation: Ensures extracted data meets required types (integers vs. strings) before outputting.
Batch Processing: Designed to handle multiple documents in sequence, suitable for high-volume pipelines.

Architecture Deep Dive: The LangGraph Workflow

The workflow is orchestrated as a StateGraph with the following lifecycle:

1. Initialization & OCR (`prepare_document_node`)

Entry Point: The graph begins here.
Input Validation: It validates that the user has provided a file and a list of ExtractionTask objects.
Vision-Language Processing: It passes the file to Mistral OCR. Unlike standard text extractors, Mistral "sees" the document layout, returning a clean Markdown representation that preserves headers, tables, and form structures.
State Update: The resulting Markdown and the queue of tasks are saved to the global state (ExtractionGraphState).
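
The node's behavior can be sketched in plain Python as follows. The state keys other than `tasks_to_process`, `current_task`, and `completed_results` are assumptions, and the OCR call is injected as a hypothetical `ocr` callable rather than calling the real Mistral client.

```python
from typing import Any, Callable, Dict, List, Optional, TypedDict


class ExtractionGraphState(TypedDict, total=False):
    # Assumed shape of the shared state; only some keys are named in the article.
    file_path: str
    markdown: str
    tasks_to_process: List[Dict[str, Any]]
    current_task: Optional[Dict[str, Any]]
    completed_results: List[Dict[str, Any]]


def prepare_document_node(
    state: ExtractionGraphState,
    ocr: Callable[[str], str],
) -> ExtractionGraphState:
    """Validate inputs, run OCR once, and seed the task queue."""
    if not state.get("file_path"):
        raise ValueError("a file is required")
    tasks = state.get("tasks_to_process") or []
    if not tasks:
        raise ValueError("at least one extraction task is required")
    return {
        **state,
        "markdown": ocr(state["file_path"]),  # OCR happens exactly once
        "tasks_to_process": list(tasks),
        "current_task": None,
        "completed_results": [],
    }


def fake_ocr(path: str) -> str:
    # Stand-in for the OCR call; returns canned Markdown.
    return "# Invoice\n\n| Item | Amount |\n|---|---|\n| Consulting | 500.00 |"


state = prepare_document_node(
    {"file_path": "invoice_101.pdf", "tasks_to_process": [{"name": "totals"}]},
    ocr=fake_ocr,
)
print(state["markdown"].splitlines()[0])
```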

2. The Task Dispatcher (The Loop Controller)

Node: task_dispatcher_node
Logic: This node acts as a traffic controller. It checks the tasks_to_process queue in the state.
If tasks remain: It pops the next task, loads it into current_task, and routes the flow to the Analysis phase.
If queue is empty: It routes the flow to Finalization.
Benefit: This cyclic design allows you to run 10 different extraction queries on a single document upload without reloading or re-OCR'ing the file.
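
The dispatcher logic above can be sketched without any framework. In this simplified version the routing function and the phase names it returns (`"analyze_schema"`, `"finalize"`) are assumptions, and the `while` loop stands in for the graph's cyclic edges.

```python
from typing import Any, Dict, List


def task_dispatcher_node(state: Dict[str, Any]) -> Dict[str, Any]:
    """Pop the next task into current_task, or clear it when the queue is empty."""
    queue: List[Any] = list(state.get("tasks_to_process", []))
    current = queue.pop(0) if queue else None
    return {**state, "tasks_to_process": queue, "current_task": current}


def route_after_dispatch(state: Dict[str, Any]) -> str:
    """Name the next phase: analysis while a task is loaded, else finalization."""
    return "analyze_schema" if state.get("current_task") is not None else "finalize"


# One OCR result, two extraction tasks: the Markdown is reused for both.
state = {
    "markdown": "# Invoice",
    "tasks_to_process": ["dates", "totals"],
    "completed_results": [],
}
visited = []
while True:
    state = task_dispatcher_node(state)
    if route_after_dispatch(state) == "finalize":
        break
    visited.append(state["current_task"])
    # Placeholder for the analyze/extract/validate steps of the real graph.
    state["completed_results"].append({"task": state["current_task"], "status": "ok"})

print(visited)
```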

3. The Extraction Cycle (Per Task)

Once a task is dispatched, it moves through a strict three-step pipeline:

A. Schema Analysis (analyze_schema_node):
Before extracting, the system analyzes the user's requested schema. It breaks down complex requirements (e.g., nested JSON objects or multi-row tables) into a clear "Extraction Aim" that helps the LLM understand intent before it sees the data.
B. Extraction (extract_data_node):
Model: Google Gemini 2.5 Flash.
Process: The system combines the Mistral Markdown (Context) + Analysis Result (Instructions) into a specialized prompt. Gemini generates a raw JSON candidate.
C. Validation & Cleaning (validate_data_node):
Self-Correction: The raw output is not trusted blindly. It is passed back to the LLM with a validation prompt.
Standardization: The model corrects formatting errors (e.g., converting "Five Hundred Dollars" to 500.00 or fixing broken JSON syntax) and casts data to the strict Pydantic types defined in your schema.
Loop Back: Once validated, the result is appended to completed_results, and the graph loops back to the Task Dispatcher for the next task.
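
To make the three steps concrete, here is a rough sketch of one pass through the cycle. The prompts are invented for illustration, and the model is replaced by a scripted `fake_llm` callable, so nothing here reflects the repository's actual prompts or the Gemini client API.

```python
import json
from typing import Any, Callable, Dict


def run_extraction_cycle(
    markdown: str,
    task: Dict[str, Any],
    llm: Callable[[str], str],
) -> Dict[str, Any]:
    """Run analyze -> extract -> validate for one task and return parsed JSON."""
    schema_text = json.dumps(task["schema"])

    # A. Schema analysis: turn the schema into a plain-language extraction aim.
    aim = llm(f"Summarize what should be extracted for this schema:\n{schema_text}")

    # B. Extraction: combine the aim (instructions) with the Markdown (context).
    raw = llm(f"{aim}\n\nDocument:\n{markdown}\n\nReturn JSON only.")

    # C. Validation: ask the model to repair and normalize its own output.
    cleaned = llm(f"Correct this JSON so it matches {schema_text}:\n{raw}")
    return json.loads(cleaned)


# Scripted stand-in for the model so the sketch runs without any API key.
scripted = iter([
    "Extract the invoice total as a number.",
    '{"total": "Five Hundred Dollars"}',
    '{"total": 500.0}',
])


def fake_llm(prompt: str) -> str:
    return next(scripted)


result = run_extraction_cycle(
    "# Invoice\n\nTotal: Five Hundred Dollars",
    {"name": "totals", "schema": {"total": "number"}},
    fake_llm,
)
print(result)
```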

4. Finalization (`finalize_graph_node`)

Aggregation: When all tasks are finished, this node aggregates the successes and failures.
Output: It constructs the FinalExtractionOutput object, returning a clean, unified JSON response containing results for every requested task, execution metadata, and token usage stats.
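
A possible shape for this aggregation step is sketched below. The keys of the per-task records (`status`, `data`, `error`, `tokens`) and of the returned dictionary are assumptions; the real `FinalExtractionOutput` object may be structured differently.

```python
from typing import Any, Dict, List


def finalize_graph_node(completed_results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Split results into successes and failures and attach summary metadata."""
    succeeded = [r for r in completed_results if r.get("status") == "success"]
    failed = [r for r in completed_results if r.get("status") != "success"]
    return {
        "results": {r["task"]: r["data"] for r in succeeded},
        "errors": {r["task"]: r.get("error", "unknown error") for r in failed},
        "metadata": {
            "tasks_total": len(completed_results),
            "tasks_succeeded": len(succeeded),
            "tokens_used": sum(r.get("tokens", 0) for r in completed_results),
        },
    }


output = finalize_graph_node([
    {"task": "dates", "status": "success", "data": {"date": "2023-10-25"}, "tokens": 120},
    {"task": "totals", "status": "failed", "error": "invalid JSON", "tokens": 95},
])
print(output["metadata"])
```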

Testing and Validation

A critical aspect of the info_Extraction repository is its approach to reliability. Testing stochastic systems requires a layered strategy that separates code logic (speed/stability) from model intelligence (quality).

1. Unit Testing (Components)

We test the individual deterministic functions to ensure the "plumbing" works before any AI is involved.

Scope: File loaders, schema analyzers, and JSON parsers.
Goal: Confirm that a valid PDF loads correctly and a broken JSON string is repaired without crashing.
Command: pytest tests/unit
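
A unit test for the JSON-repair goal might look like the sketch below. The `repair_json` helper is hypothetical and deliberately simple: it only strips text around the outermost braces and removes trailing commas, and the comma rule could alter string values that happen to contain `,}`.

```python
import json
import re
from typing import Any, Dict


def repair_json(text: str) -> Dict[str, Any]:
    """Best-effort repair: keep the outermost {...} and drop trailing commas."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    candidate = text[start : end + 1]
    # Remove a comma that directly precedes a closing brace or bracket.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    return json.loads(candidate)


def test_repairs_trailing_comma_and_wrapper_text():
    broken = 'Here is the result:\n{"date": "2023-10-25", "total": 500.00,}\nDone.'
    assert repair_json(broken) == {"date": "2023-10-25", "total": 500.0}


def test_unrecoverable_input_raises_value_error():
    # A controlled ValueError is the "does not crash" contract in this sketch.
    try:
        repair_json("no braces here")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Both functions follow pytest's `test_` naming convention, so they would be collected automatically if placed under `tests/unit`.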

2. Integration Testing (Workflow Logic)

These tests validate the LangGraph wiring.

Scope: The full create_extraction_graph workflow.
Goal: Verify that data flows correctly from prepare_document through task_dispatcher to finalize_graph. It ensures the state machine correctly loops through multiple tasks and handles errors gracefully.

3. System Performance Testing (Latency & Overhead)

Since LLM APIs are inherently slow, we need to ensure our internal code doesn't add unnecessary delay.

Methodology: We use @pytest.mark.performance together with mocking. By replacing the slow Mistral and Gemini API calls with instant fake responses (MagicMock), we measure the execution time of the graph logic only.
Threshold: The test asserts execution_time < MAX_EXECUTION_TIME (e.g., 2.0s).
Why it matters: If this test fails, it means the LangGraph orchestration or Python logic is inefficient, regardless of how fast the LLM is.
Command: pytest -m performance
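
The mocking approach can be sketched with the standard library alone. Here `run_pipeline` is a hypothetical stand-in for invoking the compiled graph, and the `process` and `generate` method names exist only on the mocks; they are not the real Mistral or Gemini SDK calls. In the actual suite the test function would also carry the `@pytest.mark.performance` marker.

```python
import time
from unittest.mock import MagicMock

MAX_EXECUTION_TIME = 2.0  # seconds; mirrors the example threshold above


def run_pipeline(ocr_client, llm_client, tasks):
    """Hypothetical stand-in for the graph: one OCR call, one LLM call per task."""
    markdown = ocr_client.process(b"%PDF-")
    return [llm_client.generate(f"{task}\n{markdown}") for task in tasks]


def test_orchestration_overhead():
    # Instant fake responses replace the slow network calls.
    ocr_client = MagicMock()
    ocr_client.process.return_value = "# Invoice"
    llm_client = MagicMock()
    llm_client.generate.return_value = '{"total": 500.0}'

    start = time.perf_counter()
    results = run_pipeline(ocr_client, llm_client, tasks=["dates", "totals"])
    execution_time = time.perf_counter() - start

    assert len(results) == 2
    assert ocr_client.process.call_count == 1   # the document is OCR'd once
    assert llm_client.generate.call_count == 2  # one call per task in this sketch
    assert execution_time < MAX_EXECUTION_TIME
```

Because the mocks return immediately, any time measured here is attributable to the orchestration code rather than to the model providers.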

4. Accuracy Evaluation

The evaluation process ensures the reliability of the extracted data by leveraging DeepEval, an open-source testing framework designed for Large Language Models. Rather than relying on manual verification, the workflow implements automated unit tests where the extraction engine’s results (the "Actual Output") are rigorously compared against validated ground truths.

This framework quantitatively scores performance using key metrics:

Faithfulness: Verifies that the extracted information is derived strictly from the source document, preventing hallucinations.
Correctness: Ensures the structured data aligns perfectly with the required schema and logic.

To maintain high standards, the system uses a "Golden Dataset": a collection of documents with known, manually verified values. The testing pipeline runs in three steps:

Run Pipeline: The system processes the test documents to generate fresh extractions.
Compare: These extracted values are validated against the Golden Set.
Metrics: The system calculates Precision (the share of extracted values that are correct) and Recall (the share of expected values that were actually extracted).
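
For intuition, field-level Precision and Recall against a golden record can be computed as in the sketch below. This uses plain exact-match comparison and is not how DeepEval scores outputs; the `field_metrics` helper and the sample values are illustrative assumptions.

```python
from typing import Any, Dict, Tuple


def field_metrics(expected: Dict[str, Any], actual: Dict[str, Any]) -> Tuple[float, float]:
    """Return (precision, recall) using exact match on each field."""
    correct = sum(
        1 for key, value in actual.items() if key in expected and expected[key] == value
    )
    precision = correct / len(actual) if actual else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Golden record with three fields; the model returned two, one of them wrong.
golden = {"date": "2023-10-25", "total": 500.00, "vendor": "Acme Ltd"}
actual = {"date": "2023-10-25", "total": 505.00}

precision, recall = field_metrics(golden, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this sample one of two extracted fields is correct (precision 0.5) and one of three expected fields was recovered (recall of about 0.33).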

Example Test Case

Input: invoice_101.pdf

Expected Output (Golden): {"date": "2023-10-25", "total": 500.00}

Actual Output (Model): {"date": "2023-10-25", "total": 500.00}

Result: ✅ PASS

Use Cases & Examples

The flexibility of Info_Extraction makes it suitable for various domains:

Financial Automation

Task: Invoice Processing.
Action: Automatically extract Vendor Name, Invoice Number, Line Items, and Total Amount from mixed-format invoices.
Benefit: Reduces accounts payable processing time by 80%.

HR & Recruitment

Task: Resume Parsing.
Action: Extract Candidate Name, Education, Skills, and Last Employer from resumes in PDF or Word format.
Benefit: Populates Applicant Tracking Systems (ATS) instantly without manual tagging.

Legal Compliance

Task: Contract Analysis.
Action: Extract Effective Dates, Expiry Dates, and Party Names from legal contracts.
Benefit: Automates the creation of a contract renewal calendar.

Implementation Highlights

The project is built using a modern Python stack designed for efficiency and readability:

Document Processing (OCR): Mistral OCR
Used to ingest the raw documents (PDFs, images). Mistral's OCR is particularly strong at "reading" complex layouts and converting them into clean text that the LLM can understand.
Orchestration & Flow: LangGraph
Instead of a simple linear chain, LangGraph is used to create a stateful workflow. It manages the steps of the extraction process. This allows the system to loop back and fix errors if the validation fails.
Data Validation: Pydantic
Defines the strict "Schema" (the blueprint) for the data. It forces the output to be structured JSON (ensuring you get a specific date format or integer for amounts) rather than unstructured text.
Intelligence (LLM): Google Gemini
The core reasoning engine. It takes the text from Mistral OCR and the rules from Pydantic to extract the specific information required.
Evaluation & Testing: DeepEval & Pytest
DeepEval serves as the "Judge," scoring the extraction quality (Faithfulness, Recall).
Pytest runs these evaluations automatically to ensure the pipeline is working correctly.

Conclusion

As organizations continue to amass unstructured data, tools like Info_Extraction become essential infrastructure. By combining the power of LLMs with rigorous software engineering practices, this project offers a robust framework for turning documents into data. We invite the community to explore the repository, contribute new parsers, and help standardize the future of information extraction.

*Check out the code and contribute at: https://github.com/hudasaleh97188/info_Extraction*
