This section outlines the datasets used to test and validate the Toy Compiler and the multi-agent system components. These datasets support robust testing, error handling, and realistic simulation, all aligned with the Module 3 requirements.
## 1. Toy Compiler Datasets
These datasets simulate input scripts for the toy language and are used in unit tests, integration tests, and error handling validation.
### Folder Structure

```text
datasets/
├── hello_world.txt
├── math_demo.txt
├── error_cases.txt
└── stress_test.txt
```
### Sample: hello_world.txt

```text
PRINT Hello, Nokwazi!
ADD 5 7
PRINT Compilation complete.
```
### Sample: error_cases.txt

Each line is intended to trigger a different compiler error (invalid operand, missing quote, unknown command):

```text
ADD five 7
PRINT Missing quote
UNKNOWN_CMD test
```
### Usage in Code

```python
# Load a toy-language source file from the datasets folder
with open("datasets/hello_world.txt") as f:
    source = f.read()
```
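The same files can drive automated tests. The sketch below is illustrative only: `toy_compiler`, `compile_source`, and `CompilerError` are assumed names, so adjust them to the compiler's actual API.

```python
# Illustrative pytest sketch; toy_compiler, compile_source, and CompilerError are assumed names.
import pytest
from toy_compiler import compile_source, CompilerError  # hypothetical API


def read_dataset(name):
    with open(f"datasets/{name}") as f:
        return f.read()


def test_hello_world_compiles():
    # A valid script should compile without raising.
    compile_source(read_dataset("hello_world.txt"))


def test_error_cases_are_rejected():
    # Every line in error_cases.txt is an intentionally invalid statement.
    for line in read_dataset("error_cases.txt").splitlines():
        with pytest.raises(CompilerError):
            compile_source(line)
```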
## 2. Multi-Agent System Datasets
These datasets support document search, summarization, collaboration, and experiment tracking.
### Research & Document Processing

- **CORD-19 Dataset**: biomedical papers for document search and summarization. Source: Kaggle (CORD-19).
- **Semantic Scholar Corpus**: academic papers with metadata (see the query sketch below). Source: Semantic Scholar API.
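A minimal sketch of a paper search against the Semantic Scholar Graph API follows; the endpoint and field names reflect the public documentation at the time of writing, and the query string is only a placeholder, so verify both before relying on them.

```python
# Sketch: search the Semantic Scholar Graph API for papers (query is a placeholder).
import requests

resp = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    params={"query": "multi-agent systems", "fields": "title,year,abstract", "limit": 5},
    timeout=30,
)
resp.raise_for_status()
for paper in resp.json().get("data", []):
    print(paper.get("year"), paper.get("title"))
```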
### Collaboration & Task Management

- **GitHub Archive**: issues and pull requests for simulating agent coordination (see the loading sketch below). Source: gharchive.org.
- **Trello JSON Export**: task boards for project tracking. Source: Trello Guide.
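GH Archive publishes public GitHub activity as hourly gzipped JSON Lines files. The sketch below streams one such hour and tallies issue and pull-request events; the URL follows gharchive.org's documented `YYYY-MM-DD-H.json.gz` naming, and the specific date is only an example.

```python
# Sketch: stream one hour of GH Archive events and tally issue/PR activity.
import gzip
import json
import urllib.request
from collections import Counter

URL = "https://data.gharchive.org/2024-01-15-12.json.gz"  # example hour, pick any real one

counts = Counter()
with urllib.request.urlopen(URL) as resp, gzip.open(resp, "rt", encoding="utf-8") as events:
    for line in events:
        event = json.loads(line)
        if event["type"] in ("IssuesEvent", "PullRequestEvent"):
            counts[event["type"]] += 1

print(counts)
```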
### Experiment Tracking & Logs

- **MLflow Example Logs**: simulated experiment tracking data (a local logging sketch follows below). Source: MLflow Examples.
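Equivalent tracking data can also be generated locally with MLflow itself (`pip install mlflow`). A minimal sketch using MLflow's standard tracking calls; the experiment name and values are placeholders.

```python
# Sketch: record one simulated run with MLflow's tracking API (values are placeholders).
import mlflow

mlflow.set_experiment("agent-summarization-demo")  # placeholder experiment name
with mlflow.start_run():
    mlflow.log_param("model", "BERT")
    mlflow.log_metric("accuracy", 0.87)
```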
- **Synthetic JSON Logs**, for example:

```json
{
  "experimentid": "exp001",
  "model": "BERT",
  "accuracy": 0.87,
  "timestamp": "2025-10-28T08:45:00Z"
}
```
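Agents that read these logs can validate the expected fields before using them. A minimal sketch follows; the required field names simply mirror the example record above.

```python
# Sketch: parse a synthetic log record and check the fields the agents rely on.
import json

REQUIRED_FIELDS = {"experimentid", "model", "accuracy", "timestamp"}

record = json.loads(
    '{"experimentid": "exp001", "model": "BERT", '
    '"accuracy": 0.87, "timestamp": "2025-10-28T08:45:00Z"}'
)
missing = REQUIRED_FIELDS - set(record)
if missing:
    raise ValueError(f"log record is missing fields: {sorted(missing)}")
print(record["model"], record["accuracy"])
```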
### Mock Data for Testing Agents

- **Mockaroo**: generate realistic CSV/JSON datasets. Source: mockaroo.com.
- **Faker (Python Library)**: generate fake names, emails, and timestamps.

```bash
pip install faker
```

```python
from faker import Faker

fake = Faker()
print(fake.name(), fake.email(), fake.date_time())
```
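For repeatable test runs the generator can be seeded, and records can be shaped however the agents expect; the task-record fields below are purely illustrative, not a required schema.

```python
# Sketch: reproducible mock task records for agent-coordination tests.
from faker import Faker

Faker.seed(1234)  # fixed seed so the generated test data is repeatable
fake = Faker()

tasks = [
    {
        "assignee": fake.name(),
        "email": fake.email(),
        "due": fake.date_time().isoformat(),
    }
    for _ in range(3)
]
print(tasks)
```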
## Integration Tips