This project implements a Bayesian causal inference model to analyze the relationship between review length and sentiment polarity using the Yelp Polarity dataset. The model leverages pre-trained BERT embeddings to account for the content of reviews while estimating the causal effect of review length on sentiment.
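As a rough illustration of the two feature types involved, the sketch below computes word counts and content embeddings for a couple of toy reviews. It is a minimal stand-in, not the project's code: a real pipeline would obtain Z from a pre-trained BERT model (e.g. via the `transformers` library), whereas here a deterministic hash-seeded stub plays that role, and `word_count`, `embed_stub`, and the example reviews are all hypothetical names.

```python
import zlib
import numpy as np

def word_count(review: str) -> int:
    # Treatment variable X: simple whitespace word count.
    return len(review.split())

def embed_stub(review: str, dim: int = 8) -> np.ndarray:
    # Stand-in for a BERT sentence embedding (confounder Z).
    # Seeded from a CRC32 of the text so it is deterministic across runs;
    # a real implementation would run the review through BERT instead.
    rng = np.random.default_rng(zlib.crc32(review.encode()))
    return rng.normal(size=dim)

reviews = ["Great food and friendly staff", "Terrible service"]
X = np.array([word_count(r) for r in reviews])   # lengths (treatment)
Z = np.stack([embed_stub(r) for r in reviews])   # content features (confounder)
```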
The model uses:
- X: review lengths (word counts).
- Z: semantic features extracted from BERT embeddings.
- Y: binary sentiment polarity (0 for negative, 1 for positive).
- Model parameters: alpha, beta, sigma, and the weights for the BERT embeddings.

The pipeline proceeds as follows:

1. Semantic features (Z) are extracted using BERT embeddings.
2. Review lengths (X) are calculated as the word count of each review.
3. The model estimates the causal effect of X (length) on Y (sentiment), controlling for Z (content).
4. Posterior distributions are inferred for the model parameters (alpha, sigma, beta).

The model was evaluated on a subset (1%) of the Yelp Polarity dataset, and the following key findings were observed:
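The estimation step above can be sketched as a Bayesian logistic regression, Y ~ Bernoulli(sigmoid(alpha + beta·X + Z·w)), fit with a minimal random-walk Metropolis sampler on synthetic stand-in data. This is an illustrative sketch only: the project's actual likelihood, priors, and inference engine may differ, the noise parameter sigma does not appear in this Bernoulli formulation, and all data here are simulated so that length depends on content but sentiment does not depend on length.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (the real project uses Yelp reviews and BERT Z).
n, d = 200, 3
Z = rng.normal(size=(n, d))                   # content features (confounder)
X = 20 + 5 * Z[:, 0] + rng.normal(size=n)     # length depends on content
logits = 1.5 * Z[:, 0] - Z[:, 1]              # sentiment driven by content only
Y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

def log_post(theta):
    alpha, beta, w = theta[0], theta[1], theta[2:]
    eta = alpha + beta * X + Z @ w
    # Bernoulli log-likelihood plus standard-normal priors on all parameters.
    ll = np.sum(Y * eta - np.logaddexp(0.0, eta))
    return ll - 0.5 * np.sum(theta**2)

# Random-walk Metropolis over (alpha, beta, w); keep the second half as samples.
theta = np.zeros(2 + d)
lp = log_post(theta)
samples = []
for step in range(4000):
    prop = theta + 0.05 * rng.normal(size=theta.size)
    lp_prop = log_post(prop)
    if np.log(rng.random()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    if step >= 2000:
        samples.append(theta.copy())

beta_draws = np.array(samples)[:, 1]
print(f"posterior mean of beta: {beta_draws.mean():.3f}")
```

Because sentiment in the simulated data depends only on content, the posterior for beta concentrates near zero once Z is conditioned on, mirroring the finding reported below.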
The posterior distributions of key model parameters (alpha, sigma, and beta) provide insights into the model's learned beliefs:
- alpha: controls the prior strength in the generative process. The posterior distribution shows well-defined values, indicating the model learns this parameter effectively.
- sigma: represents the noise or variability in sentiment predictions. A narrow distribution suggests the model captures the variability in the data accurately.
- beta: the causal effect of review length (X) on sentiment (Y). The posterior distribution of beta is concentrated near zero, indicating a negligible direct effect of review length on sentiment once the review content (Z) is accounted for.

A second plot compares the actual test sentiments (Y_test) against the predicted sentiments.

Key takeaways:

- Review length has a negligible direct effect on sentiment (beta close to zero); sentiment polarity is primarily driven by the review content rather than its length.
- Conditioning on the semantic features (Z) effectively accounts for confounders, isolating the true relationship between review length and sentiment.

| Posterior Distributions of Hyperparameters | Actual vs Predicted Sentiments |
|---|---|
| *(figure not included)* | *(figure not included)* |
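One common way to back the "beta close to zero" reading is a posterior credible interval: if the interval tightly brackets zero, the data give no credible direct effect of length. A minimal sketch, using hypothetical posterior draws in place of real MCMC output:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical posterior draws for beta (stand-in for real sampler output).
beta_draws = rng.normal(loc=0.01, scale=0.05, size=4000)

# 94% equal-tailed credible interval from the empirical percentiles.
lo, hi = np.percentile(beta_draws, [3, 97])
print(f"94% credible interval for beta: [{lo:.3f}, {hi:.3f}]")
contains_zero = lo < 0 < hi   # True -> no credible direct effect of length
```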
These results highlight the model's ability to classify sentiments effectively while providing credible causal insights into the relationship between review length and sentiment polarity.