
Predicting whether a clinical trial will succeed or fail is notoriously difficult. Even experienced pharmaceutical analysts struggle to achieve accuracy much better than a coin flip. But what if we could train an AI model to learn from thousands of past trials and improve its predictions?
I recently built a dataset of 1,366 clinical trial predictions and fine-tuned an 8B parameter language model to predict trial outcomes. This resulted in a jump from 56% accuracy (barely better than guessing) to 73% accuracy, a 30% relative improvement. Here’s how I did it, what I learned, and why this matters for anyone working with prediction tasks.
The pharmaceutical industry runs on uncertainty. When Eli Lilly announces a Phase 3 trial for a new obesity drug, analysts, investors, and competitors all ask the same question: Will it succeed?
Traditionally, answering this question requires:
Deep domain expertise in pharmacology
Knowledge of the company’s track record
Understanding of regulatory pathways
Access to clinical trial databases
Lots of time to research each case
Even then, human experts achieve only modest accuracy. The question I wanted to answer was: could an AI model learn these patterns automatically from historical data?

The biggest hurdle in building prediction models is getting labeled training data. Hiring medical experts to label thousands of clinical trial outcomes would cost tens of thousands of dollars and take months.
This is where I discovered Lightning Rod’s approach to data generation. Instead of manual labeling, their SDK uses what they call the “Future-as-Label” methodology: the future outcome of a historical event becomes its label.
Here’s how it works:
Find old news: Articles from 2023 about clinical trials starting
Generate questions: “Will Novo Nordisk’s Phase 3 trial meet endpoints by Q4 2024?”
Auto-label outcomes: Search recent news (late 2024/2025) to find what actually happened
Build dataset: Pair questions with verified outcomes
It does this without any human labelers: the Lightning Rod Python SDK automatically finds the answers by searching for what happened later.
Using Lightning Rod’s Python SDK, I generated the dataset with a simple pipeline:
from datetime import datetime

import lightningrod as lr
from lightningrod import (
    QuestionPipeline,
    NewsSeedGenerator,
    ForwardLookingQuestionGenerator,
    WebSearchLabeler,
)

pipeline = QuestionPipeline(
    # Seed the pipeline with 2023-2024 news about late-stage trials
    seed_generator=NewsSeedGenerator(
        start_date=datetime(2023, 1, 1),
        end_date=datetime(2024, 12, 31),
        search_query=["clinical trial Phase 3", "FDA approval"],
    ),
    # Turn each article into a binary, forward-looking question
    question_generator=ForwardLookingQuestionGenerator(
        instructions="Generate binary questions about trial outcomes",
        examples=[
            "Will Eli Lilly's obesity drug trial meet endpoints by Q4 2024?",
            "Will the FDA approve Drug X by June 2024?",
        ],
    ),
    # Label each question by searching for what actually happened later
    labeler=WebSearchLabeler(confidence_threshold=0.7),
)

dataset = lr.transforms.run(pipeline, max_questions=2000)
The SDK pulled news articles about clinical trials, generated forward-looking questions, and then searched for later outcomes. In about 10 minutes of compute time, I had 1,882 questions, with 72.6% successfully labeled, giving me 1,366 high-quality training examples.

Each example looked like this:
Question: "Will Novo Nordisk's CagriSema Phase 3 trial meet its primary endpoints by December 31, 2024?" Answer: YES (1) Confidence: 0.98
The labels weren’t guesses; they were verified facts from published trial results and FDA announcements.
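Before fine-tuning, each labeled question has to be flattened into plain text the model can learn from. Here’s a minimal sketch of that preprocessing step, assuming simple question/answer fields; the prompt template and file names are my own illustrative choices, not the SDK’s output format.

import json
import random

# Hypothetical record fields; adapt to the actual dataset schema.
PROMPT = (
    "You are a pharmaceutical analyst. Answer with 1 (success) or 0 (failure).\n"
    "Question: {question}\n"
    "Answer:"
)

def to_text(record: dict) -> dict:
    # One training example = prompt plus the verified label as the completion.
    return {"text": PROMPT.format(question=record["question"]) + f" {record['answer']}"}

def split_and_write(records: list, test_frac: float = 0.15, seed: int = 42):
    # Shuffle, take an 85/15 train/test split, and write JSONL files.
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    with open("train.jsonl", "w") as f:
        for r in shuffled[:cut]:
            f.write(json.dumps(to_text(r)) + "\n")
    with open("test.jsonl", "w") as f:
        for r in shuffled[cut:]:
            f.write(json.dumps(r) + "\n")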
I split the data into training (85%) and test (15%) sets, then ran two experiments:
First, I tested Llama-3-8B without any fine-tuning. I gave it each question and asked it to predict 0 (failure) or 1 (success).
Result: 56.3% Accuracy
The model was essentially guessing, with a slight optimistic bias (it predicted “success” too often). I wasn’t surprised because the base model has no special knowledge of pharmaceutical industry patterns.
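For reference, the baseline evaluation is just a zero-shot classification loop over the held-out questions. Below is a minimal sketch using Hugging Face transformers; the checkpoint name and prompt wording are assumptions rather than my exact script.

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name is assumed; any Llama-3-8B variant works the same way.
MODEL = "meta-llama/Meta-Llama-3-8B"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

def predict(question: str) -> int:
    # Zero-shot: ask for a single digit, 1 = success, 0 = failure.
    prompt = (
        "You are a pharmaceutical analyst. Answer with 1 (success) or 0 (failure).\n"
        f"Question: {question}\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=2, do_sample=False)
    completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:])
    return 1 if "1" in completion else 0

test_set = [json.loads(line) for line in open("test.jsonl")]
preds = [predict(r["question"]) for r in test_set]
accuracy = sum(p == r["answer"] for p, r in zip(preds, test_set)) / len(test_set)
success_rate = sum(preds) / len(preds)  # how often the model predicts "success"
print(f"accuracy={accuracy:.1%}, predicted-success rate={success_rate:.1%}")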
Next, I fine-tuned the model using LoRA (Low-Rank Adaptation) on the training data. LoRA is a parameter-efficient method that adds small adapter layers instead of retraining the entire model; a minimal training sketch follows the list below.
The setup I used:
Model: Llama-3-8B with 4-bit quantization
Method: LoRA fine-tuning via the Unsloth library
Hardware: Free Google Colab T4 GPU
Training time: ~21 minutes (3 epochs)
Trainable parameters: Only 16M (0.2% of the model)
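Here is roughly what that setup looks like in code. This follows the standard Unsloth + TRL recipe rather than my exact notebook; the checkpoint name, LoRA rank, and hyperparameters are assumptions, and argument names shift a bit between trl versions.

from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Load Llama-3-8B in 4-bit via Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=512,
    load_in_4bit=True,
)

# Attach small LoRA adapters; only these ~16M parameters are trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

train_dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,   # expects a "text" column
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(
        per_device_train_batch_size=8,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()
model.save_pretrained("clinical-trial-lora")  # saves only the adapter weights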
Result: 73.3% Accuracy
The fine-tuned model correctly answered 151 out of 206 test questions, achieving 73.3% accuracy. This represents a 17-percentage-point improvement over the baseline, a 30% relative performance gain achieved in just 21 minutes of training. Notably, this was done using only 0.2% of the model’s parameters over 3 training epochs, demonstrating highly efficient improvement with minimal compute.

The most interesting part wasn’t just the numbers; it was understanding what patterns the model discovered in the data.
The model learned that pharmaceutical companies have different success rates. Questions mentioning Eli Lilly, Novo Nordisk, or Merck were more likely to be “YES” (success), while smaller biotech startups showed higher failure rates.
This makes sense because established companies have more resources, experience, and proven track records. The model picked this up automatically from the data.
Obesity and diabetes drugs showed ~68% success rates in the training data, while oncology trials succeeded only ~48% of the time. The model learned these differences without being explicitly told.
Cancer is harder to treat. Metabolic diseases have clearer biomarkers. The model internalized these domain patterns solely from examples.
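These aggregate rates are easy to check directly against the labeled data. Below is a rough pandas sketch; since the questions are free text, the keyword tagging and the toy records are illustrative assumptions, not how the dataset is actually annotated.

import pandas as pd

# Toy stand-ins for the full labeled dataset (question/answer dicts).
records = [
    {"question": "Will Novo Nordisk's obesity drug Phase 3 trial meet endpoints?", "answer": 1},
    {"question": "Will Small Biotech's oncology trial succeed by Q2 2024?", "answer": 0},
]
df = pd.DataFrame(records)

# Crude keyword tagging; the keyword lists are illustrative, not exhaustive.
AREAS = {
    "metabolic": ["obesity", "diabetes", "glp-1"],
    "oncology": ["cancer", "tumor", "oncology"],
}

def tag(text: str) -> str:
    lowered = text.lower()
    for label, words in AREAS.items():
        if any(w in lowered for w in words):
            return label
    return "other"

df["area"] = df["question"].map(tag)
# Mean of the 0/1 labels per area = empirical success rate.
print(df.groupby("area")["answer"].agg(["mean", "count"]))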
One surprising discovery: the model learned to spot unrealistic timelines.
Example questions the model corrected:
“Will [Small Biotech] complete Phase 3 in 6 months?” → Predicted NO (correctly)
“Will [Unproven Drug] get FDA approval in 3 months?” → Predicted NO (correctly)
The baseline model didn’t know that Phase 3 trials typically take 18–24 months. The fine-tuned version learned this pattern from the data.
The baseline model showed an optimistic bias, predicting “success” 63% of the time. The fine-tuned model was better calibrated at 52%, closer to the actual distribution.
More importantly, it learned to identify red flags like:
Aggressive timelines
Unproven mechanisms
Companies with poor track records
Challenging therapeutic areas
All five examples of “most improved” predictions involved the baseline incorrectly predicting success, while the fine-tuned model correctly predicted failure.
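A quick way to quantify both the calibration shift and these flipped predictions is to compare the two models’ outputs on the same test set. The sketch below assumes you already have parallel prediction lists (base_preds, tuned_preds) from the evaluation loop shown earlier.

def compare(test_set, base_preds, tuned_preds):
    # Predicted "success" rates, to compare against the true label distribution.
    n = len(test_set)
    summary = {
        "baseline_success_rate": sum(base_preds) / n,
        "tuned_success_rate": sum(tuned_preds) / n,
        "true_success_rate": sum(r["answer"] for r in test_set) / n,
    }
    # Questions the baseline got wrong but the fine-tuned model got right.
    summary["fixed_questions"] = [
        r["question"]
        for r, b, t in zip(test_set, base_preds, tuned_preds)
        if b != r["answer"] and t == r["answer"]
    ]
    return summary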
While this experiment focused on pharmaceutical trial outcomes, the real contribution goes beyond healthcare. It demonstrates a practical, repeatable workflow for building specialized prediction models from real-world data.
At its core, the approach is straightforward: find historical events in the news, generate forward-looking questions about them, and label each question with what actually happened.
Any domain with historical news or announcements, clear and verifiable outcomes, and sufficient examples can benefit from Lightning Rod’s Future-as-Label methodology.
It’s not limited to clinical trials; it serves as a scalable template for temporal prediction tasks across industries.
For developers interested in reproducing this:
The full code is available on GitHub, and the dataset is published on Hugging Face for anyone who wants to reproduce or build on this work.
This isn’t a perfect crystal ball, and the model still has clear limitations.
The 73% accuracy is a meaningful improvement over guessing, but it’s not prophecy. Think of it as moving from “coin flip” to “informed probability estimate.” For context, even experienced pharmaceutical analysts struggle to achieve accuracy above 65–70% in predicting trial outcomes.
There is also plenty of room for future improvement.
Three lessons from this project:
1. Automated labeling scales: Manually labeling 1,366 examples would have taken weeks and cost thousands. Lightning Rod’s Future-as-Label approach did it in 3 minutes.
2. Small models can specialize: You don’t need GPT-4 or Claude for domain-specific tasks. An 8B model, fine-tuned on focused data, achieved 73% accuracy on a challenging prediction problem.
3. Historical data contains learnable patterns: Company track records, therapeutic area success rates, and timeline realism all emerged naturally from the training data. The model discovered what experts know from experience.
The tools I used are all publicly available:
Lightning Rod SDK: Open-source Python library for dataset generation
Pre-trained Model: Skip training and use the fine-tuned model directly via Hugging Face (a minimal loading sketch follows this list)
Unsloth: Free library for efficient LoRA fine-tuning
Google Colab: Free GPU access for training
Hugging Face: Free dataset and model hosting
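If you’d rather use the published adapter than retrain, loading it looks roughly like this. The repository IDs below are placeholders (the real ones are linked from the GitHub repo), and the prompt format should match whatever the model was fine-tuned on.

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "unsloth/llama-3-8b-bnb-4bit"           # base checkpoint (assumed)
ADAPTER = "your-username/clinical-trial-lora"  # placeholder adapter repo

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, ADAPTER)  # attach LoRA weights

prompt = (
    "You are a pharmaceutical analyst. Answer with 1 (success) or 0 (failure).\n"
    "Question: Will Drug X's Phase 3 trial meet its primary endpoint by Q4 2025?\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=2, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))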
The barrier to entry for specialized AI models has never been lower. If you have a prediction task with historical data, you can build a custom model in a weekend.
The future of AI isn’t just giant general-purpose models; it’s also specialized models trained on focused, high-quality datasets for specific domains. This experiment is one example of what becomes possible when you combine automated data generation with efficient fine-tuning.
What prediction task would you build a model for?
The dataset is available on Hugging Face, and all code is on GitHub. Special thanks to Lightning Rod Labs for their SDK that made this project possible.
Interested in building prediction datasets? Check out Lightning Rod or explore their examples on Hugging Face.