
SustainaRAG: A Retrieval-Augmented Assistant for CSR & ESG Strategy Design

Abstract

This project presents SustainaRAG, a lightweight Retrieval-Augmented Generation (RAG) assistant designed to help SMEs explore and design sustainable business strategies. By leveraging LangChain, FAISS, and Hugging Face LLMs, the tool retrieves answers from a curated ESG/CSR document set via a simple Streamlit UI.

Overview

SustainaRAG indexes sustainability reports and answers user questions using vector similarity search and LLM responses. It’s ideal for researchers, CSR consultants, and SMEs looking to accelerate sustainability insights without deep technical know-how.

“Figure 1: SustainaRAG answering a query about examples of corporate responsibility best practices.”

Industry Insights

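The EU Corporate Sustainability Reporting Directive (CSRD) has applied in phases since the 2024 financial year, starting with the largest companies that were already subject to EU non-financial reporting rules and extending in later phases to further large companies and to non-EU companies with substantial activity in the EU. SMEs, however, face significant challenges accessing structured guidance.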

This tool offers early-stage support for SMEs attempting to align with ESG frameworks such as the EU Taxonomy, GRI, and SFDR. By providing a simple retrieval-based interface, SustainaRAG helps bridge the gap between regulatory expectations and operational implementation.

Target Audience

Sustainability officers
SME strategy teams
CSR/MBA students
AI developers interested in RAG tools

Problem Statement

CSR and ESG reports are long, inconsistent, and fragmented. SMEs often lack the tools or expertise to parse and operationalize them effectively. SustainaRAG addresses this by offering a semantic search interface powered by a RAG pipeline.

Technical Design

The system was tested on a small curated dataset of 12 PDF reports (~4 MB in total) from EU institutions, CSR policy frameworks, and industry case studies.

Documents were loaded using LangChain’s PyPDFLoader, chunked into overlapping segments (500 tokens, 50 overlap), embedded using HuggingFace SentenceTransformers (all-MiniLM-L6-v2), and stored in a FAISS index. Metadata (title, source) was retained for future traceability.
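The overlapping chunking step can be illustrated without any dependencies. The sketch below is not the project's ingest.py: it assumes whitespace-separated words as a stand-in for model tokens, and the function name chunk_tokens is hypothetical. It only shows how a 500-token window with a 50-token overlap advances through a document.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], size: int = 500, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # each new window starts this many tokens after the previous one
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    # Assumption: words approximate tokens; a real tokenizer would give different counts.
    words = [f"w{i}" for i in range(1200)]
    print([len(chunk) for chunk in chunk_tokens(words)])  # prints [500, 500, 300]
```

With these settings each chunk repeats the last 50 tokens of the previous one, so a sentence that straddles a chunk boundary is still seen whole in at least one chunk.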

Embedding & Indexing: FAISS with SentenceTransformers
LLM: HuggingFace Zephyr-7B via HuggingFaceInference
Interface: Streamlit
Core Files: ingest.py, retriever.py, app.py
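At query time, the retrieved passages and the user question have to be combined into a single prompt for the LLM. The project's actual prompt is not documented here, so the following is only a minimal sketch: the function name build_prompt, the prompt wording, and the max_passages parameter are all assumptions.

```python
from typing import List


def build_prompt(question: str, passages: List[str], max_passages: int = 3) -> str:
    """Assemble a grounded prompt from the top retrieved passages and the question."""
    # Number each passage so the model (and a reader) can refer back to it.
    context = "\n\n".join(
        f"[{i}] {passage.strip()}"
        for i, passage in enumerate(passages[:max_passages], start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\nAnswer:"
    )


if __name__ == "__main__":
    demo = build_prompt(
        "What are the 3 pillars of sustainability?",
        ["Hypothetical passage one.", "Hypothetical passage two."],
    )
    print(demo)
```

Capping the number of passages keeps the prompt within the model's context window; the value 3 is illustrative, not a project setting.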

“Figure 2: Architecture diagram of the SustainaRAG pipeline.”

Dataset Description

SustainaRAG uses a curated corpus of 12 sustainability documents drawn from publicly available EU CSR frameworks, ESG guidelines, and industry-specific reports. These include regulatory texts, case studies from multinational companies, and implementation roadmaps.

The FAISS index stores the resulting embeddings together with their associated metadata; the processing steps are described in the next section.

Dataset Processing Methodology

Documents were loaded using LangChain’s PyPDFLoader, chunked with overlap (500 tokens, 50 overlap), embedded using all-MiniLM-L6-v2, and stored in a FAISS index. Metadata fields (title, source) were preserved for traceability.
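To make the role of the index concrete, the sketch below shows a brute-force, in-memory stand-in for it: vectors are stored next to their metadata and ranked by cosine similarity. This is not FAISS and not the project's code. The class name TinyIndex and the two-dimensional toy vectors are assumptions chosen for readability (all-MiniLM-L6-v2 produces 384-dimensional embeddings).

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyIndex:
    """Exact (brute-force) search that keeps title/source metadata beside each vector."""

    def __init__(self) -> None:
        self._vectors: List[List[float]] = []
        self._metadata: List[Dict[str, str]] = []

    def add(self, vector: Sequence[float], metadata: Dict[str, str]) -> None:
        self._vectors.append(list(vector))
        self._metadata.append(dict(metadata))

    def search(self, query: Sequence[float], k: int = 3) -> List[Tuple[float, Dict[str, str]]]:
        """Return the k most similar entries as (score, metadata), best first."""
        scored = [(cosine(query, v), m) for v, m in zip(self._vectors, self._metadata)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


if __name__ == "__main__":
    index = TinyIndex()
    # Hypothetical titles and file names, for illustration only.
    index.add([1.0, 0.0], {"title": "Doc A", "source": "a.pdf"})
    index.add([0.0, 1.0], {"title": "Doc B", "source": "b.pdf"})
    print(index.search([1.0, 0.1], k=1)[0][1]["title"])  # prints Doc A
```

Because the metadata travels with each hit, a later step can show the reader which document a passage came from.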

Implementation & Usage

Install dependencies: uv pip install -r requirements.txt
Run: streamlit run app.py
Query examples:
- “What are the 3 pillars of sustainability?”
- “How can SMEs align with EU taxonomy?”

Evaluation Strategy

The assistant was evaluated manually on a set of 20 ESG/CSR test queries. Answer quality was judged on three criteria: semantic correctness, relevance to the retrieved documents, and completeness. Future work includes integrating user feedback loops and automatic relevance scoring.
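Manual judgments of this kind are easy to aggregate in a few lines. The sketch below assumes each answer is marked pass/fail on the three criteria; the Judgment record and the summarize function are hypothetical, and the sample judgments are placeholders rather than the project's actual results.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Judgment:
    question: str
    correct: bool   # semantically correct
    relevant: bool  # consistent with the retrieved documents
    complete: bool  # covers the whole question


def summarize(judgments: List[Judgment]) -> Dict[str, float]:
    """Return the share of answers passing each criterion, and all three at once."""
    n = len(judgments)
    if n == 0:
        raise ValueError("no judgments to summarize")
    return {
        "correct": sum(j.correct for j in judgments) / n,
        "relevant": sum(j.relevant for j in judgments) / n,
        "complete": sum(j.complete for j in judgments) / n,
        "all_three": sum(j.correct and j.relevant and j.complete for j in judgments) / n,
    }


if __name__ == "__main__":
    # Placeholder data, not taken from the evaluation described above.
    sample = [
        Judgment("q1", True, True, True),
        Judgment("q2", True, True, False),
    ]
    print(summarize(sample))
```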

Performance Observations:

Retrieval latency: ~150ms/query
Corpus size: 12 documents (~4MB)
Semantic accuracy (manual): ~85%
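A latency figure like the one above can be reproduced with a small timing harness. This is a sketch under assumptions: measure_latency_ms is a hypothetical helper, and it accepts any callable that takes a query string, so the project's own retriever would need to be wrapped accordingly.

```python
import statistics
import time
from typing import Callable, Iterable, List


def measure_latency_ms(
    retrieve: Callable[[str], object],
    queries: Iterable[str],
    repeats: int = 5,
) -> float:
    """Return the median wall-clock time per retrieval call, in milliseconds."""
    samples: List[float] = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            retrieve(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    if not samples:
        raise ValueError("no queries were timed")
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in retriever; replace with a call into the real index.
    def fake_retrieve(query: str) -> str:
        return query.upper()

    print(f"{measure_latency_ms(fake_retrieve, ['esg', 'csrd']):.4f} ms")
```

The median is used rather than the mean so that a single slow first call (for example, loading the index) does not dominate the result.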

Early Testing Results

Success: Correct retrieval for “3 pillars of sustainability” with concise LLM-generated summary.
Failure: Queries specific to the 2024 CSRD requirements failed because the corpus did not include the latest source documents.

Monitoring and Maintenance

User query logging and retrieval trace analysis planned
Monthly FAISS index refresh with new EU publications
Retraining and performance tuning roadmap in progress
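The planned query logging could be as simple as appending one JSON record per query to a file, which is enough for later retrieval trace analysis. The sketch below is one possible shape for this, not the project's implementation; log_query, load_log, and the record fields are assumptions.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Dict, List


def log_query(log_path: Path, query: str, sources: List[str], scores: List[float]) -> None:
    """Append one JSON line describing a query and what was retrieved for it."""
    record = {
        "ts": time.time(),
        "query": query,
        "retrieved": [
            {"source": source, "score": round(score, 4)}
            for source, score in zip(sources, scores)
        ],
    }
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_log(log_path: Path) -> List[Dict]:
    """Read the log back as a list of records for offline analysis."""
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        path = Path(folder) / "queries.jsonl"
        # Hypothetical file name and score, for illustration only.
        log_query(path, "How can SMEs align with EU taxonomy?", ["report.pdf"], [0.8123])
        print(len(load_log(path)))  # prints 1
```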

Key Benefits

LLM-powered responses grounded in ESG sources
Semantic search using vector embeddings
No API keys or hosted services required for local use
Modular architecture, ready for extension

Access

GitHub repo: https://github.com/kostas696/sustainarag
License: MIT
Status: Proof-of-concept (actively maintained)

Limitations Discussion

No citation tracing in current outputs
Limited multilingual support
Document set is small and manually curated
No retrieval score thresholds or ranking confidence displayed
Not tested in production environments
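The missing score threshold is one of the simpler limitations to address. The sketch below assumes that each hit carries a similarity score where higher means more similar; if the index returns distances instead, the comparison would need to be reversed. The names filter_hits and answer_or_abstain and the cut-off of 0.3 are illustrative assumptions, not tuned values.

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (passage text, similarity score; higher is better)


def filter_hits(hits: List[Hit], min_score: float = 0.3) -> List[Hit]:
    """Keep only hits at or above the threshold, best first."""
    kept = [hit for hit in hits if hit[1] >= min_score]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)


def answer_or_abstain(hits: List[Hit], min_score: float = 0.3) -> str:
    """Return the best passage, or an explicit message when nothing is relevant enough."""
    kept = filter_hits(hits, min_score)
    if not kept:
        return "No sufficiently relevant passage was found in the corpus."
    return kept[0][0]


if __name__ == "__main__":
    print(answer_or_abstain([("weak match", 0.1)]))
```

Abstaining in this way would also make failures on topics missing from the corpus visible to the user instead of producing an ungrounded answer.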

Future Work

Add source citations in output
Incorporate memory/chaining
Deploy on Hugging Face Spaces
Planned enhancements include basic logging, user feedback tracking, and automatic report ingestion from new EU documents.
Scheduled updates will ensure the vector index stays fresh and relevant.
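Since title and source metadata are already retained at indexing time, adding source citations could amount to a small formatting step after generation. The following is a sketch of that idea only; format_with_sources and the output layout are assumptions, not a committed design.

```python
from typing import Dict, List


def format_with_sources(answer: str, metadatas: List[Dict[str, str]]) -> str:
    """Append a de-duplicated, numbered source list to a generated answer."""
    seen = set()
    lines: List[str] = []
    for meta in metadatas:
        key = (meta.get("title", "untitled"), meta.get("source", "unknown"))
        if key in seen:
            continue  # several chunks often come from the same document
        seen.add(key)
        lines.append(f"[{len(lines) + 1}] {key[0]} ({key[1]})")
    if not lines:
        return answer.strip()
    return answer.strip() + "\n\nSources:\n" + "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical title and file name, for illustration only.
    print(format_with_sources("Example answer.", [{"title": "Doc A", "source": "a.pdf"}]))
```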

Contact:
https://github.com/kostas696