This project was developed during an internship at Brainwave Matrix and focuses on building a machine learning model to classify news articles as real or fake. With the increasing spread of misinformation, it is crucial to have tools that can help detect and prevent the dissemination of fake news. This repository contains a comprehensive Jupyter Notebook that details the entire process, from data preprocessing to model evaluation.
The WELFake dataset is used for this project. It is a comprehensive collection of news articles labeled as fake or real, merged from four popular datasets to provide a robust dataset for training and evaluating machine learning models.
Each record includes the following fields:

- `title`: The headline of the news article
- `text`: The main content of the news article
- `label`: Binary label indicating fake (0) or real (1) news

Dataset Reference: IEEE Transactions on Computational Social Systems
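For a quick first look at the data before running the notebook, a minimal sketch with pandas (assuming `WELFake_Dataset.csv` has been placed in the working directory, as described in the setup steps below) might look like this:

```python
import pandas as pd

# Load the WELFake dataset (the CSV must be present in the project directory).
df = pd.read_csv("WELFake_Dataset.csv")

# Inspect the columns described above: title, text, and label.
print(df[["title", "text", "label"]].head())
print(df["label"].value_counts())
```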
This repository contains:

- `notebook.ipynb`: Jupyter Notebook containing all code for data preprocessing, model training, evaluation, and prediction.
- `README.md`: Project documentation (this file).
- `WELFake_Dataset.csv`: The dataset file (not included due to size constraints; instructions to obtain it are provided below).

To run this project locally, please follow these steps:
Clone the repository:
git clone https://github.com/your-username/Brainwave_Matrix_Intern_Fake_News_Classification.git
Navigate to the project directory:
cd Brainwave_Matrix_Intern_Fake_News_Classification
Create and activate a virtual environment (recommended):
python -m venv venv
# Activate the virtual environment:
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
Install the required dependencies:
pip install -r requirements.txt
Note: If requirements.txt is not available, install the dependencies manually:
pip install pandas numpy matplotlib nltk scikit-learn
Download the NLTK data:
Open a Python shell or include the following in your code:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
Obtain the WELFake dataset:
Download the dataset and place the WELFake_Dataset.csv file in the project directory.
Open the Jupyter Notebook:
jupyter notebook notebook.ipynb
Run the notebook cells:
Execute the cells in order to run the data preprocessing, model training, and evaluation steps.
Make Predictions:
Use the trained model to classify new articles as real or fake.
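As an illustration of the prediction step, here is a minimal, self-contained sketch. It uses a tiny toy corpus purely so the snippet runs on its own; in the notebook, the `TfidfVectorizer` and classifier are trained on the full WELFake dataset, and the variable names here are placeholders rather than the notebook's exact code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data so this snippet is runnable on its own; in the notebook
# the vectorizer and model are trained on the full WELFake dataset instead.
train_texts = [
    "government announces new economic policy after parliament vote",
    "miracle cure that doctors do not want you to know about",
]
train_labels = [1, 0]  # 1 = real, 0 = fake (per the dataset description above)

vectorizer = TfidfVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(train_texts), train_labels)

# Classify a new, unseen article with the fitted vectorizer and model.
new_article = ["celebrity shares miracle cure doctors refuse to talk about"]
prediction = model.predict(vectorizer.transform(new_article))[0]
print("REAL" if prediction == 1 else "FAKE")
```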
The notebook walks through the following steps:

- Handling Missing Values: Missing entries in the dataset are handled before modeling.
- Text Cleaning: The article text is tokenized using `word_tokenize`.
- Exploratory Data Analysis: The data is explored, and the text is converted into TF-IDF features using `TfidfVectorizer`.
- Data Splitting: The data is split into training and test sets.
- Models Used:
  - Multinomial Naive Bayes Classifier
  - Random Forest Classifier
The Random Forest Classifier was selected as the preferred model due to its higher accuracy and robustness.
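A rough end-to-end sketch of this workflow is shown below. It is illustrative only: the column names `title`, `text`, and `label` come from the dataset description above, while choices such as combining title and text, `max_features`, `test_size`, and `n_estimators` are assumptions rather than the notebook's exact settings, and the notebook's NLTK-based cleaning is simplified away here.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the dataset and drop rows with missing title or text.
df = pd.read_csv("WELFake_Dataset.csv")
df = df.dropna(subset=["title", "text"])

# Combining title and text into one field is an illustrative choice,
# not necessarily what the notebook does.
df["content"] = df["title"] + " " + df["text"]

# TF-IDF features; max_features is an arbitrary example value.
vectorizer = TfidfVectorizer(stop_words="english", max_features=50000)
X = vectorizer.fit_transform(df["content"])  # kept as a sparse matrix
y = df["label"]

# Hold out a test set for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train and compare the two models used in the project.
for name, model in [
    ("Multinomial Naive Bayes", MultinomialNB()),
    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42)),
]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, model.predict(X_test)))
```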
Several practical challenges were addressed along the way:

- Memory Limitations: Mitigated by configuring `TfidfVectorizer` carefully and converting feature arrays to sparse matrices (see the sketch after this list).
- Processing Time: Vectorizing and training on the full dataset is time-consuming.
- Data Preprocessing Decisions: Choices made during cleaning and preprocessing influence the final results.
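On the memory point, here is a small sketch of the idea: keep the TF-IDF output as a SciPy sparse matrix rather than densifying it, and convert any dense feature arrays back to sparse form. The texts and `max_features=50000` are arbitrary example values, not the notebook's settings.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "breaking news about the national economy",
    "shocking secret that they do not want you to know",
]

# Capping max_features bounds the vocabulary size, which bounds memory use.
vectorizer = TfidfVectorizer(max_features=50000)
X = vectorizer.fit_transform(texts)
print(type(X))  # a SciPy sparse matrix -- keep it sparse rather than calling X.toarray()

# If an intermediate step has produced a dense array, it can be converted back.
dense = X.toarray()  # dense copy, for demonstration only
X_sparse = sparse.csr_matrix(dense)
print(X_sparse.nnz, "stored non-zeros vs.", dense.size, "dense entries")
```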
The project successfully demonstrates the use of machine learning techniques for fake news detection. By preprocessing the data effectively and selecting appropriate models, we achieved high accuracy in classifying news articles. The Random Forest Classifier, in particular, showed superior performance and can be considered a reliable model for this task.
WELFake Dataset Publication: P. K. Verma, P. Agrawal, I. Amorim, and R. Prodan, "WELFake: Word Embedding Over Linguistic Features for Fake News Detection," IEEE Transactions on Computational Social Systems, 2021.
Libraries and Frameworks: pandas, NumPy, Matplotlib, NLTK, and scikit-learn.
Additional Resources:
For any questions, suggestions, or contributions, please feel free to open an issue or submit a pull request.