A lightweight NLP baseline that detects spoilers in Russian movie comments.
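The project has two parts: a training notebook that fits the classifier on a labeled CSV (text, label in {spoiler, no_spoiler}) and a Django REST Framework API that serves the trained model.
Task. Binary text classification of Russian movie comments: spoiler vs no_spoiler.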
Input. Short user comments only (Russian).
No movie scripts, plot summaries, subtitles, or metadata are used in this baseline.
Labels.
spoiler: the comment reveals a plot twist, ending, key death/identity, or other major event.
no_spoiler: opinion/emotion/production quality, but no plot reveal.
Data source & format. A small labeled CSV:
text: raw comment in Russian
label: spoiler or no_spoiler
Preprocessing & split. Lowercasing and Russian stop-word removal (NLTK).
Features: CountVectorizer (bag-of-words). Model: MultinomialNB.
Split: train_test_split(test_size=0.2, random_state=42).
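The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the stop-word list is a small hand-written stand-in for NLTK's Russian list, and apart from the first three comments (taken from the sample rows) the texts are invented toy examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stand-in stop-word list (assumption). The baseline uses NLTK's Russian list,
# available as nltk.corpus.stopwords.words("russian") after
# nltk.download("stopwords").
RUSSIAN_STOP_WORDS = ["но", "не", "на", "что", "это", "он", "она", "всем"]

# First three comments come from the sample rows; the rest are invented
# toy examples for illustration only.
texts = [
    "Фильм отличный, но в конце герой...",
    "Очень красивая картинка, саунд супер",
    "Антагонист на самом деле отец главного героя",
    "В финале главный герой погибает",
    "Актёры играют отлично, музыка супер",
    "Оказывается, убийца это сосед",
    "Скучный сюжет, но красивая картинка",
    "В конце героиня оказывается шпионом",
    "Отличный фильм, всем советую",
    "Саундтрек супер, операторская работа отличная",
]
labels = [
    "spoiler", "no_spoiler", "spoiler", "spoiler", "no_spoiler",
    "spoiler", "no_spoiler", "spoiler", "no_spoiler", "no_spoiler",
]

# Same split parameters as the baseline.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# Lowercasing and stop-word removal happen inside CountVectorizer;
# MultinomialNB is then fit on the resulting word counts.
model = make_pipeline(
    CountVectorizer(lowercase=True, stop_words=RUSSIAN_STOP_WORDS),
    MultinomialNB(),
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
for comment, predicted in zip(X_test, predictions):
    print(predicted, "|", comment)
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted on the training split only, which is one simple guard against leaking test text into the features.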
What is out of scope (baseline). No lemmatization/stemming, no n-grams/TF-IDF, no external context (plots), no cross-validation.
Limitations. Small dataset; BoW features; scores may be optimistic; Russian-specific normalization is minimal.
Next steps. Add lemmatization; try TF-IDF and n-grams (1–2/1–3); evaluate embeddings (e.g., fastText/Sentence-BERT); optional context-aware variant (comment + short plot summary) as an ablation.
Sample (first rows).
text,label
"Фильм отличный, но в конце герой...",spoiler
"Очень красивая картинка, саунд супер",no_spoiler
"Антагонист на самом деле отец главного героя",spoiler
Dataset: movie_comments.csv
Features: CountVectorizer (bag-of-words)
Model: MultinomialNB
Split: train_test_split(test_size=0.2, random_state=42)
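Before training, a dataset in the text,label format can be read and checked against the two allowed labels. The sketch below uses only the standard library and parses the three sample rows from an in-memory string; the helper name load_comments is hypothetical and not taken from the repository.

```python
import csv
import io
from collections import Counter

ALLOWED_LABELS = {"spoiler", "no_spoiler"}


def load_comments(handle):
    """Read (text, label) pairs and reject labels outside the two-class scheme."""
    reader = csv.DictReader(handle)
    rows = []
    for row in reader:
        text = (row.get("text") or "").strip()
        label = (row.get("label") or "").strip()
        if label not in ALLOWED_LABELS:
            raise ValueError(f"line {reader.line_num}: unexpected label {label!r}")
        rows.append((text, label))
    return rows


# The three sample rows, embedded so the example runs without the file.
SAMPLE = '''text,label
"Фильм отличный, но в конце герой...",spoiler
"Очень красивая картинка, саунд супер",no_spoiler
"Антагонист на самом деле отец главного героя",spoiler
'''

rows = load_comments(io.StringIO(SAMPLE))
label_counts = Counter(label for _, label in rows)
print(label_counts)
```

For the real file, the same function could be given `open("movie_comments.csv", encoding="utf-8", newline="")`; the UTF-8 encoding is an assumption about how the CSV was saved. Printing the label counts also gives a quick view of class balance.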
Insert the classification report screenshot here.
Accuracy: 1.00 (i.e., 100%)
Note: very high scores can indicate an easy dataset or leakage; future work adds stronger validation.
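One inexpensive check for the leakage mentioned in the note is to look for comments that appear in both the train and test splits. The sketch below is an illustration with invented toy comments; the function names are hypothetical, and this check is not part of the baseline.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so near-identical comments compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def overlapping_comments(train_texts, test_texts):
    """Return normalized comments that occur in both the train and test split."""
    train_seen = {normalize(t) for t in train_texts}
    test_seen = {normalize(t) for t in test_texts}
    return sorted(train_seen & test_seen)


# Invented toy splits for illustration only.
train_texts = ["В финале герой погибает", "Красивая картинка"]
test_texts = ["в  финале герой ПОГИБАЕТ", "Отличный саундтрек"]

duplicates = overlapping_comments(train_texts, test_texts)
print(f"{len(duplicates)} comment(s) shared between train and test: {duplicates}")
```

An empty result does not rule out leakage (paraphrases and near-duplicates pass this check), but a non-empty one is a clear sign that the reported score is inflated.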
API (Django REST Framework)
Model artifacts spoiler_detector.pkl and vec.pkl are loaded inside the DRF serializer.
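The loading and prediction step might look roughly like the framework-free sketch below. It assumes both artifacts were written with the standard pickle module and sit in one directory; the directory, the function names, and the caching choice are hypothetical and not taken from serializers.py.

```python
import pickle
from functools import lru_cache
from pathlib import Path

# Hypothetical location of the artifacts; adjust to the project layout.
ARTIFACT_DIR = Path(".")


@lru_cache(maxsize=1)
def load_artifacts(directory=ARTIFACT_DIR):
    """Load the fitted vectorizer and classifier once and reuse them afterwards."""
    with open(Path(directory) / "vec.pkl", "rb") as fh:
        vectorizer = pickle.load(fh)
    with open(Path(directory) / "spoiler_detector.pkl", "rb") as fh:
        model = pickle.load(fh)
    return vectorizer, model


def predict_label(text, vectorizer, model):
    """Vectorize one comment and return the predicted label as a plain string."""
    features = vectorizer.transform([text])
    return str(model.predict(features)[0])
```

A serializer could call `predict_label(comment_text, *load_artifacts())` when a comment is validated or saved. Caching the loaded objects avoids unpickling on every request. Pickle files should only be loaded from trusted sources, and the installed scikit-learn version should match the one used to create them.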
Run locally:
cd .\moviesite\
pip install -r requirements.txt
python manage.py migrate
python manage.py runserver
Open http://127.0.0.1:8000/ru/rating/ and submit a comment with a POST request.
Then open http://127.0.0.1:8000/ru/1/ and scroll down to see the result.
Repo & Resources
GitHub (root): https://github.com/Alymbaek/Movie_comments_ML_NLP
Training notebook: https://github.com/Alymbaek/Movie_comments_ML_NLP/blob/main/Movie_comments_NLP.ipynb
DRF integration (serializers.py): https://github.com/Alymbaek/Movie_comments_ML_NLP/blob/main/Movie-Site/moviesite/movie/serializers.py
Raw dataset / sample: https://github.com/Alymbaek/Movie_comments_ML_NLP/blob/main/movie_comments.csv
Limitations
Small dataset; no lemmatization; no char/n-grams; no cross-validation; minimal Russian-specific normalization.