This project automates text analysis across a dataset of documents, extracting sentiment, readability, and other textual metrics. Using Python libraries, it calculates sentiment polarity, readability indexes, and content-specific scores for each document, so that large collections can be scored in a single consistent pass. The tool is designed to support content assessment, readability improvement, and sentiment evaluation in domains such as media, academia, and business.
The dataset comprises multiple text documents provided in a standard format, assumed to represent various themes and topics. The data was cleansed and prepared for analysis.
Text preprocessing is crucial for accurate analysis. Steps include:
Tokenization: Splitting text into individual words or tokens.
Normalization: Converting text to lowercase.
Cleaning: Removing special characters, punctuation, and numbers.
Stopword Removal: Excluding common words that do not carry significant meaning.
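The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name `preprocess` and the tiny `STOPWORDS` set are assumptions made for the example, and the sketch normalizes and cleans the text before splitting it, which gives the same tokens for these particular steps.

```python
import re

# Illustrative stopword set (an assumption); the project's real list is not shown.
STOPWORDS = {"the", "a", "an", "is", "are", "and", "or", "of", "to", "in", "it"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, strip non-letters, split on whitespace, drop stopwords."""
    # Normalization: convert everything to lowercase.
    lowered = text.lower()
    # Cleaning: replace anything that is not a lowercase ASCII letter
    # or whitespace (punctuation, digits, symbols) with a space.
    cleaned = re.sub(r"[^a-z\s]", " ", lowered)
    # Tokenization: split on runs of whitespace.
    tokens = cleaned.split()
    # Stopword removal: keep only tokens that are not in the stopword set.
    return [token for token in tokens if token not in stopwords]


tokens = preprocess("The product is GREAT, and it costs $20!")
```

With the illustrative stopword set, `tokens` contains only `product`, `great`, and `costs`. A real run would likely use a much larger stopword list, and the cleaning rule would need adjusting for text that is not plain ASCII.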
Sentiment is assessed by using predefined positive and negative word dictionaries. For each document, scores are calculated as follows:
Positive Score: Count of words from the positive word list.
Negative Score: Count of words from the negative word list.
Polarity Score: (Positive Score - Negative Score) / ((Positive Score + Negative Score) + 0.000001), a normalized measure of sentiment ranging from -1 (negative) to +1 (positive). The small constant in the denominator prevents division by zero when a document contains no words from either list.
Subjectivity Score: (Positive Score + Negative Score) / Total Words, the share of words in the document that carry sentiment.
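The four scores can be computed directly from a token list and the two dictionaries. The sketch below follows the formulas as stated; the function name `sentiment_scores`, the sample word sets, and the zero-word guard on the subjectivity score are assumptions added for illustration.

```python
def sentiment_scores(tokens, positive_words, negative_words):
    """Return positive, negative, polarity, and subjectivity scores."""
    positive = sum(1 for token in tokens if token in positive_words)
    negative = sum(1 for token in tokens if token in negative_words)
    # The small constant keeps the division defined when no sentiment words occur.
    polarity = (positive - negative) / ((positive + negative) + 0.000001)
    total_words = len(tokens)
    # Assumption: an empty document gets a subjectivity of 0.0 instead of an error.
    subjectivity = (positive + negative) / total_words if total_words else 0.0
    return {
        "positive": positive,
        "negative": negative,
        "polarity": polarity,
        "subjectivity": subjectivity,
    }


# Hypothetical word lists, used only for this example.
POSITIVE = {"great", "excellent"}
NEGATIVE = {"bad"}

scores = sentiment_scores(["great", "product", "bad", "excellent"], POSITIVE, NEGATIVE)
```

Here two positive words and one negative word among four tokens give a polarity of about 0.33 and a subjectivity of 0.75.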
Average Sentence Length: Total word count divided by the number of sentences, indicating document complexity.
Complex Word Count: Number of words with three or more syllables, providing insight into readability.
Fog Index: 0.4 * (Average Sentence Length + Percentage of Complex Words), an estimate of the years of formal education needed to understand the text on a first reading.
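A rough sketch of the three readability measures follows. Several choices here are assumptions, not the project's confirmed method: syllables are estimated by counting runs of vowels (a heuristic that miscounts some words), sentences are split on `.`, `!`, and `?`, and the percentage of complex words is taken on a 0 to 100 scale, as in the standard Gunning Fog formula.

```python
import re


def count_syllables(word):
    """Heuristic estimate: one syllable per run of vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text):
    """Return average sentence length, complex word count, and Fog Index."""
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return {"avg_sentence_length": 0.0, "complex_words": 0, "fog_index": 0.0}
    avg_sentence_length = len(words) / len(sentences)
    # Complex words: three or more (estimated) syllables.
    complex_words = sum(1 for word in words if count_syllables(word) >= 3)
    pct_complex = 100 * complex_words / len(words)
    fog_index = 0.4 * (avg_sentence_length + pct_complex)
    return {
        "avg_sentence_length": avg_sentence_length,
        "complex_words": complex_words,
        "fog_index": fog_index,
    }


simple = readability("The cat sat on the mat. The dog ran.")
dense = readability("Education matters.")
```

The first sample has nine one-syllable words in two sentences, so its Fog Index is 0.4 * 4.5, about 1.8. The second has one complex word out of two, which pushes the index to roughly 20.8 and shows how strongly the complex-word percentage dominates short texts.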
Content Metrics
Word Count: Measures the total number of words, offering a baseline metric for document size.
Syllable Count per Word: Counts the syllables in each word; the result is used to identify complex words for the readability metrics.
Personal Pronouns Count: Measures engagement, as higher pronoun usage often correlates with more conversational and personalized content.
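Word and pronoun counts are simple to sketch with regular expressions. The pronoun list below is hypothetical, since the project's actual list is not given, and skipping the all-caps token "US" (more likely the country than the pronoun) is likewise an assumption made for illustration.

```python
import re

# Hypothetical pronoun list; the project's actual list is not shown.
PRONOUN_PATTERN = re.compile(r"\b(I|we|my|ours|us)\b", re.IGNORECASE)


def count_personal_pronouns(text):
    """Count pronoun matches, ignoring the all-caps token 'US'."""
    matches = PRONOUN_PATTERN.findall(text)
    return sum(1 for match in matches if match != "US")


def word_count(text):
    """Count alphabetic words; apostrophes are kept inside words."""
    return len(re.findall(r"[A-Za-z']+", text))


sample = "We think the US market suits us, and I agree with my team."
pronouns = count_personal_pronouns(sample)
words = word_count(sample)
```

For the sample sentence this counts "We", "us", "I", and "my" but not "US". Matching on the raw text, before lowercasing, is what makes that distinction possible.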
Each document’s sentiment is represented through positive, negative, polarity, and subjectivity scores. A higher positive score and a polarity value close to +1 suggest positive sentiment, whereas a negative polarity value suggests criticism or unfavorable opinions.
Readability scores such as the Fog Index reveal the complexity level of each document. Longer sentences and a larger share of complex words raise the score, indicating material that may require a higher education level for comprehension.
Personal pronoun count reveals content personalization, which can be critical in marketing and customer-focused communications. Higher usage of personal pronouns typically implies direct engagement.
The resulting dataset with calculated scores for each document was saved in a CSV file, facilitating further analysis or integration into larger workflows.
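Writing the per-document scores to CSV could look like the sketch below, which uses the standard library `csv` module. The column names, the `write_scores` helper, and the file name `scores.csv` are all hypothetical; the project's actual output layout is not described.

```python
import csv

# Hypothetical column names; the real output columns are not specified.
FIELDNAMES = [
    "document_id",
    "positive_score",
    "negative_score",
    "polarity_score",
    "subjectivity_score",
]


def write_scores(rows, file_obj, fieldnames=FIELDNAMES):
    """Write one CSV row per document to an open text file object."""
    writer = csv.DictWriter(file_obj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Example usage (file name is an assumption):
# with open("scores.csv", "w", newline="", encoding="utf-8") as f:
#     write_scores(rows, f)
```

Passing an open file object instead of a path keeps the helper easy to test and lets the caller decide where the output goes.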