Spoiler Detection in Movie Comments (Russian NLP)

Overview

A lightweight NLP baseline that detects spoilers in Russian movie comments.
The project has two parts:

Training & evaluation notebook (scikit-learn + NLTK)
Django REST Framework integration that serves the trained model

Data

CSV with columns: text, label in {spoiler, no_spoiler}
Language: Russian
Class balance: ~50/50 (replace with exact counts if known)
Dataset link: see Repo & Resources

Context

Task. Binary text classification of Russian movie comments: spoiler vs no_spoiler.

Input. Short user comments only (Russian).
No movie scripts, plot summaries, subtitles, or metadata are used in this baseline.

Labels.

spoiler: the comment reveals a plot twist, ending, key death/identity, or other major event.
no_spoiler: opinion/emotion/production quality, but no plot reveal.

Data source & format. A small labeled CSV:

text — raw comment in Russian
label — spoiler or no_spoiler
Approx. size: ~N rows (replace with your count). Class balance ≈ 50/50.
Dataset link: see Repo & Resources.

Preprocessing & split. Lowercasing, Russian stop-words (NLTK).
Features: CountVectorizer (bag-of-words). Model: MultinomialNB.
Split: train_test_split(test_size=0.2, random_state=42).

What is out of scope (baseline). No lemmatization/stemming, no n-grams/TF-IDF, no external context (plots), no cross-validation.

Limitations. Small dataset; BoW features; scores may be optimistic; Russian-specific normalization is minimal.

Next steps. Add lemmatization; try TF-IDF and n-grams (1–2/1–3); evaluate embeddings (e.g., fastText/Sentence-BERT); optional context-aware variant (comment + short plot summary) as an ablation.
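Two of the next steps (TF-IDF with 1–2-grams, and cross-validation instead of a single split) can be sketched as follows. This is a minimal illustration, not the project's code: the toy comments are invented for the example, and the 5-fold setup and the `make_pipeline` wiring are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy comments for illustration only, NOT the project's dataset.
texts = [
    "В конце главный герой умирает",
    "Убийцей оказался его брат",
    "Антагонист на самом деле отец героя",
    "В финале она предаёт всю команду",
    "Корабль тонет и почти все погибают",
    "Очень красивая картинка и музыка",
    "Актёры играют отлично",
    "Скучный фильм, не советую",
    "Саундтрек просто супер",
    "Отличная операторская работа",
]
labels = ["spoiler"] * 5 + ["no_spoiler"] * 5

# TF-IDF over unigrams and bigrams, followed by the same Naive Bayes model.
pipeline = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    MultinomialNB(),
)

# Stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```

Because the vectorizer sits inside the pipeline, it is refitted on the training part of each fold, so vocabulary from the held-out fold never leaks into training.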

Sample (first rows, CSV).
text,label
"Фильм отличный, но в конце герой...",spoiler
"Очень красивая картинка, саунд супер",no_spoiler
"Антагонист на самом деле отец главного героя",spoiler

Dataset: movie_comments.csv
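To replace the approximate class balance with exact counts, the labels can be tallied with the standard library. The sketch below runs on the three sample rows shown above; the helper name `label_counts` is hypothetical.

```python
import csv
import io
from collections import Counter

# The three sample rows from this README; the full data is in movie_comments.csv.
SAMPLE_CSV = '''text,label
"Фильм отличный, но в конце герой...",spoiler
"Очень красивая картинка, саунд супер",no_spoiler
"Антагонист на самом деле отец главного героя",spoiler
'''

VALID_LABELS = {"spoiler", "no_spoiler"}


def label_counts(csv_file):
    """Count rows per label, rejecting labels outside the documented set."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        label = row["label"].strip()
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        counts[label] += 1
    return counts


counts = label_counts(io.StringIO(SAMPLE_CSV))
total = sum(counts.values())
for label, n in sorted(counts.items()):
    print(f"{label}: {n} ({n / total:.0%})")
```

For the real file, pass an open file handle instead, for example `open("movie_comments.csv", encoding="utf-8", newline="")`; the UTF-8 encoding is an assumption about how the CSV was saved.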

Method

Preprocessing: lowercasing, NLTK Russian stopwords
Features: CountVectorizer (bag-of-words)
Model: MultinomialNB
Split: train_test_split(test_size=0.2, random_state=42)
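The four steps above can be sketched end to end as follows. This is an approximation of the notebook, not a copy of it: the toy comments are invented, and the short stop-word list is a hand-written stand-in so the example runs without downloading NLTK data (the project itself uses NLTK's Russian list).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Stand-in stop words; the project uses nltk.corpus.stopwords.words("russian").
RUSSIAN_STOPWORDS = ["во", "не", "что", "он", "на", "как", "то", "все", "но", "же"]

# Invented toy comments for illustration only, NOT the project's dataset.
texts = [
    "В конце главный герой умирает",
    "Убийцей оказался его брат",
    "Антагонист на самом деле отец героя",
    "В финале она предаёт всю команду",
    "Корабль тонет и почти все погибают",
    "Очень красивая картинка и музыка",
    "Актёры играют отлично",
    "Скучный фильм, не советую",
    "Саундтрек просто супер",
    "Отличная операторская работа",
]
labels = ["spoiler"] * 5 + ["no_spoiler"] * 5

# Same split parameters as documented above.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# Bag-of-words features: lowercasing plus stop-word removal.
vectorizer = CountVectorizer(lowercase=True, stop_words=RUSSIAN_STOPWORDS)
model = MultinomialNB()
model.fit(vectorizer.fit_transform(X_train), y_train)

# The vectorizer is only transformed (never refitted) on the test split.
predictions = model.predict(vectorizer.transform(X_test))
accuracy = accuracy_score(y_test, predictions)
print(classification_report(y_test, predictions, zero_division=0))
```

The vectorizer and the model are kept as two separate objects here because the project stores them as two artifacts (vec.pkl and spoiler_detector.pkl).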

Results

[Screenshot placeholder: classification_report]

Accuracy: 1.00 (that is, 100% on the held-out test split)

Note: a perfect score on a small test split more likely indicates an easy dataset or train/test leakage than real generalization; future work adds stronger validation.
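One cheap way to look for leakage is to check whether any test comment also appears in the training split after light normalization. The sketch below uses only the standard library; `normalize` and `find_overlap` are hypothetical helper names, and the example comments are invented.

```python
def normalize(text):
    """Lowercase and collapse whitespace so near-identical comments compare equal."""
    return " ".join(text.lower().split())


def find_overlap(train_texts, test_texts):
    """Return the test comments whose normalized form also occurs in training."""
    train_seen = {normalize(t) for t in train_texts}
    return [t for t in test_texts if normalize(t) in train_seen]


# Invented example: the first test comment duplicates a training comment
# up to letter case and spacing.
train_texts = ["В конце главный герой умирает", "Саундтрек просто супер"]
test_texts = ["в конце  главный герой умирает", "Актёры играют отлично"]

leaked = find_overlap(train_texts, test_texts)
print(f"{len(leaked)} of {len(test_texts)} test comments also appear in training")
```

An empty result does not rule out leakage (paraphrases and template-like comments would pass), but a non-empty one means the reported score is inflated.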

API (Django REST Framework)
Model artifacts spoiler_detector.pkl and vec.pkl are loaded inside the DRF serializer.
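A framework-independent sketch of that loading step is shown below. It is not the code from serializers.py: the helper names are hypothetical, and it assumes both files were written with the standard `pickle` module (if they were saved with joblib, `joblib.load` would be needed instead).

```python
import pickle
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_artifacts(artifact_dir):
    """Load the vectorizer and model once per directory and cache the pair."""
    base = Path(artifact_dir)
    with open(base / "vec.pkl", "rb") as f:
        vectorizer = pickle.load(f)
    with open(base / "spoiler_detector.pkl", "rb") as f:
        model = pickle.load(f)
    return vectorizer, model


def predict_label(text, artifact_dir="."):
    """Vectorize one comment and return the predicted label."""
    vectorizer, model = load_artifacts(str(artifact_dir))
    features = vectorizer.transform([text])
    return model.predict(features)[0]
```

Caching the loaded objects avoids re-reading both files on every request. A serializer could call `predict_label` from its validation or create step. Only unpickle files you trust, since `pickle.load` can execute arbitrary code.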

Run locally:

cd .\moviesite\

pip install -r requirements.txt
python manage.py migrate
python manage.py runserver

Open this URL in a browser: http://127.0.0.1:8000/ru/rating/
Example POST request: see the screenshot "Снимок экрана 2025-08-09 182803.png"

Then open http://127.0.0.1:8000/ru/1/ and scroll down.
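The same POST can also be prepared from a script. The sketch below only builds the request and does not send it. The endpoint URL comes from the steps above, but the payload key `text` is a hypothetical placeholder: the real field names are defined in the project's serializers.py and may differ or include more fields.

```python
import json
import urllib.request

RATING_URL = "http://127.0.0.1:8000/ru/rating/"


def build_rating_request(comment_text):
    """Build (but do not send) a JSON POST request for one comment."""
    payload = {"text": comment_text}  # hypothetical field name, check serializers.py
    return urllib.request.Request(
        RATING_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_rating_request("Очень красивая картинка, саунд супер")

# To send it while the development server is running:
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```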

Repo & Resources
GitHub (root): https://github.com/Alymbaek/Movie_comments_ML_NLP

Training notebook: https://github.com/Alymbaek/Movie_comments_ML_NLP/blob/main/Movie_comments_NLP.ipynb

DRF integration (serializers.py): https://github.com/Alymbaek/Movie_comments_ML_NLP/blob/main/Movie-Site/moviesite/movie/serializers.py

Raw dataset / sample: https://github.com/Alymbaek/Movie_comments_ML_NLP/blob/main/movie_comments.csv

Limitations
Small dataset; no lemmatization; no char/n-grams; no cross-validation; minimal Russian-specific normalization.
