This report details a text classification task performed on a subset of the 20 Newsgroups dataset. The objective was to automatically categorize newsgroup posts into predefined categories, specifically 'sci.space' and 'rec.sport.baseball'. The methodology involved preprocessing the text data, followed by feature extraction using the Term Frequency-Inverse Document Frequency (TF-IDF) vectorization technique. A Logistic Regression classifier was trained on the vectorized training data. The model's performance was evaluated on a held-out test set using standard classification metrics, including precision, recall, and F1-score, as presented in a classification report.
Text classification, the task of assigning predefined categories to text documents, is a fundamental problem in Natural Language Processing (NLP) with numerous applications, such as spam detection, sentiment analysis, and topic categorization. This study focuses on topic categorization, specifically classifying newsgroup posts from the well-known 20 Newsgroups dataset. The goal is to build a model capable of distinguishing between discussions related to space science ('sci.space') and baseball ('rec.sport.baseball'). We employ a supervised machine learning approach, utilizing a Logistic Regression model trained on features extracted from the text content of the posts.
2.1. Data Acquisition and Preparation:
The dataset used for this study is the 20 Newsgroups collection, a popular benchmark for text classification tasks. We specifically selected posts from two categories: 'sci.space' and 'rec.sport.baseball'. The fetch_20newsgroups utility from the scikit-learn library was used to load all posts from these categories. The data was then divided into a training set and a test set, with 80% of the data used for training the model and the remaining 20% reserved for evaluating its performance. This split was performed with a fixed random state to ensure reproducibility.
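This step can be sketched as follows. The snippet performs the same 80/20 split on a tiny hand-written stand-in corpus so it is self-contained; the toy posts, the variable names, and the random_state value of 42 are illustrative assumptions (the source only states that a fixed random state was used), and the real experiment loads the posts with fetch_20newsgroups as shown in the comment.

```python
from sklearn.model_selection import train_test_split

# In the actual study the posts are loaded with:
#   from sklearn.datasets import fetch_20newsgroups
#   data = fetch_20newsgroups(subset="all",
#                             categories=["sci.space", "rec.sport.baseball"])
# A tiny invented corpus stands in here so the sketch runs offline.
docs = [
    "the rocket launch was delayed by weather",
    "orbital mechanics of the space station",
    "telescope observations of distant galaxies",
    "astronauts completed the spacewalk on schedule",
    "satellite imagery of the lunar surface",
    "the pitcher threw a perfect game last night",
    "home run in the bottom of the ninth inning",
    "the batter struck out twice in the series",
    "baseball season opens with a double header",
    "the shortstop made an impressive double play",
]
labels = ["sci.space"] * 5 + ["rec.sport.baseball"] * 5

# 80% train / 20% test; random_state=42 is an assumed value for the
# "fixed random state" mentioned in the text.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=42, stratify=labels)

print(len(X_train), len(X_test))  # 8 2
```

With ten documents, the 20% test split holds out two posts; stratifying on the labels keeps one post per category in the test set.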
2.2. Feature Extraction: TF-IDF Vectorization:
Raw text data cannot be directly fed into machine learning algorithms. Therefore, a feature extraction step is necessary to convert the text into a numerical representation. We employed the Term Frequency-Inverse Document Frequency (TF-IDF) technique. TF-IDF reflects the importance of a word in a document relative to its frequency in the entire corpus.
The TfidfVectorizer from scikit-learn was utilized with the following configurations:
Stop Words: Common English stop words (e.g., "the", "is", "in") were removed as they generally do not contribute significantly to distinguishing between topics.
Max Document Frequency (max_df): Terms that appeared in more than 70% of the documents were excluded. This helps remove terms that are too common across all categories to be discriminative.
The vectorizer was first fitted on the training data to learn the vocabulary and IDF weights, and then used to transform both the training and test data into TF-IDF matrices.
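A minimal sketch of this fit/transform pattern, using the configuration described above on a few invented texts (the documents here are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["the rocket reached orbit today",
               "the batter hit a long home run"]
test_texts = ["orbit insertion was nominal"]

# stop_words="english" drops common words; max_df=0.7 drops terms that
# appear in more than 70% of the training documents.
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)

# fit_transform learns the vocabulary and IDF weights from the training
# data; transform reuses them, so unseen test-set words are ignored.
X_train_tfidf = vectorizer.fit_transform(train_texts)
X_test_tfidf = vectorizer.transform(test_texts)

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Both matrices share the same column space (the vocabulary learned from the training set), which is why the vectorizer must be fitted only on the training data.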
2.3. Classification Model: Logistic Regression:
For the classification task, a Logistic Regression model was chosen. Logistic Regression is a linear model that is commonly used for binary and multi-class classification problems. It models the probability of a binary outcome using a logistic function. In this case, it learns to predict whether a given newsgroup post belongs to 'sci.space' or 'rec.sport.baseball' based on its TF-IDF features. The LogisticRegression classifier from scikit-learn was used with its default parameters.
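The snippet below sketches this model on a few invented posts (the texts and labels are illustrative assumptions; in the study the features come from the vectorizer of Section 2.2 applied to the real newsgroup data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["rocket launch into orbit",
               "orbital telescope survey of galaxies",
               "the pitcher threw a curveball",
               "home run and a stolen base"]
y_train = ["sci.space", "sci.space",
           "rec.sport.baseball", "rec.sport.baseball"]

vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
X_train_tfidf = vectorizer.fit_transform(train_texts)

# Default parameters, as in the study.
clf = LogisticRegression()
clf.fit(X_train_tfidf, y_train)

# "orbit" only occurs in the 'sci.space' training posts, so the model
# should lean towards that class for this unseen document.
pred = clf.predict(vectorizer.transform(["the shuttle reached orbit"]))
print(pred[0])
```

Internally the model learns one weight per TF-IDF feature and passes the weighted sum through the logistic function to obtain a class probability.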
3.1. Experimental Setup:
The experiment was conducted using Python and the scikit-learn library. The key steps involved were:
Loading the specified categories from the 20 Newsgroups dataset.
Splitting the data into training and testing subsets.
Initializing and fitting the TfidfVectorizer on the training text.
Transforming both training and testing text into TF-IDF matrices.
Initializing and training the LogisticRegression classifier using the training TF-IDF matrix and corresponding labels.
Making predictions on the test TF-IDF matrix.
Evaluating the predictions using a classification report.
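The steps above chain together as in the following sketch. It substitutes a tiny invented corpus for the real fetch_20newsgroups data so that it is self-contained, and assumes random_state=42 and the zero_division option purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Toy stand-in for the two newsgroups; every post is invented.
docs = [
    "the rocket launch was delayed by weather",
    "orbital mechanics of the space station",
    "telescope observations of distant galaxies",
    "astronauts completed the spacewalk on schedule",
    "satellite imagery of the lunar surface",
    "the pitcher threw a perfect game last night",
    "home run in the bottom of the ninth inning",
    "the batter struck out twice in the series",
    "baseball season opens with a double header",
    "the shortstop made an impressive double play",
]
labels = ["sci.space"] * 5 + ["rec.sport.baseball"] * 5

# Steps 1-2: load and split (random_state=42 is an assumed value).
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=42, stratify=labels)

# Steps 3-4: fit the vectorizer on training text, transform both sets.
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Steps 5-6: train the classifier and predict on the test matrix.
clf = LogisticRegression()
clf.fit(X_train_tfidf, y_train)
y_pred = clf.predict(X_test_tfidf)

# Step 7: evaluate. zero_division=0 suppresses warnings on tiny samples.
print(classification_report(y_test, y_pred, zero_division=0))
```

On the real dataset the same sequence applies unchanged; only the data-loading step differs.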
3.2. Training and Prediction:
The Logistic Regression classifier was trained using the fit method, taking the TF-IDF matrix of the training data (X_train_tfidf) and the corresponding target labels (y_train) as input. Once trained, the model's predict method was used to generate predictions (y_pred) for the TF-IDF matrix of the test data (X_test_tfidf).
The performance of the trained Logistic Regression model was evaluated using the classification_report function from scikit-learn. This report provides key classification metrics for each class ('sci.space' and 'rec.sport.baseball'), including:
Precision: The ratio of correctly predicted positive observations to the total predicted positive observations (tp / (tp + fp)).
Recall (Sensitivity): The ratio of correctly predicted positive observations to all observations in the actual class (tp / (tp + fn)).
F1-score: The harmonic mean of Precision and Recall (2 * (Precision * Recall) / (Precision + Recall)).
Support: The number of actual occurrences of the class in the test set.
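These definitions can be checked by hand. The snippet below computes all three metrics from confusion-matrix counts that are invented purely for illustration:

```python
# Hypothetical counts for one class: 8 true positives,
# 2 false positives, 4 false negatives.
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)   # 8 / 10 = 0.8
recall = tp / (tp + fn)      # 8 / 12 ≈ 0.667

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(precision, round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, penalizing models that trade one metric heavily for the other.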
The classification report generated by the code displays these metrics, allowing an assessment of how well the model distinguishes between the two newsgroup categories; a high F1-score for both classes would indicate good overall performance. (The specific numerical results are printed when the code is executed.)
This study demonstrated the application of a Logistic Regression model to classifying text documents from the 20 Newsgroups dataset into the 'sci.space' and 'rec.sport.baseball' categories. The process relied on standard NLP techniques, notably TF-IDF vectorization for feature extraction. The classification report quantifies the model's precision, recall, and F1-score for each category, indicating its ability to categorize the newsgroup posts accurately. Future work could explore other classification algorithms, more advanced feature engineering, or hyperparameter tuning to potentially improve performance.