
This research presents a high-granularity approach to automated IT ticket classification for the Portuguese language (PT-BR).
By engineering a hybrid pipeline that integrates SQL-based heuristics, lexical N-gram mapping, and Deep Learning via Sentence-Transformers, we successfully addressed the ambiguity of "unclassified" tickets.
The methodology transforms unstructured text into 384-dimensional semantic vectors, validated through Cosine Similarity and Principal Component Analysis (PCA), resulting in a robust system for ITSM (IT Service Management) automation.
The primary bottleneck in IT support is not the resolution itself, but the triage. Inconsistent categorization leads to skewed BI metrics and increased lead times.
This project focuses on a dataset of tickets where the "Category" field was either missing or generic.
My goal was to move beyond the traditional ILIKE SQL pattern-matching to a system that understands context and intent, utilizing the full power of the Python data stack.
The first step of any professional data project is establishing a reproducible environment. We don't just import libraries; we prepare the environment to handle Portuguese linguistic nuances, downloading the NLTK resources the pipeline depends on.
import nltk
from nltk.stem import RSLPStemmer  # Brazilian Portuguese stemming
from sentence_transformers import SentenceTransformer

# Loading the multilingual transformer model optimized for semantic similarity
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Portuguese resources setup
nltk.download(['stopwords', 'rslp', 'punkt'])
The extraction was handled using SQLAlchemy, connecting to a PostgreSQL/SQL Server environment. I implemented a baseline categorization logic directly in the SQL query to separate "easy wins" from complex semantic cases.
import os

# Database credentials pulled from environment variables
DB_USER = os.getenv('DB_USER')
DB_PASSWORD = os.getenv('DB_PASSWORD')
DB_HOST = os.getenv('DB_HOST')
DB_PORT = os.getenv('DB_PORT')
DB_NAME = os.getenv('DB_NAME')
Working with raw data requires a secure connection. We use quote_plus for password encoding to avoid URI errors with special characters.
import pandas as pd
from sqlalchemy import create_engine
from urllib.parse import quote_plus

# Encoding credentials and creating the engine
encoded_password = quote_plus(DB_PASSWORD)
db_url = f'postgresql://{DB_USER}:{encoded_password}@{DB_HOST}:{DB_PORT}/{DB_NAME}'
engine = create_engine(db_url)

# Querying unclassified tickets
raw_query = "SELECT id, description FROM tickets WHERE category = 'Unclassified'"
df = pd.read_sql(raw_query, engine)
Before deploying heavy Machine Learning, we established a baseline using CASE WHEN logic in SQL. This allowed us to immediately route the "easy wins" (e.g., tickets containing "desligamento" (employee offboarding) or "mouse") and isolate the truly ambiguous tickets for the NLP engine.
SELECT *,
       CASE
           WHEN description ILIKE '%licença%' THEN 'License'
           WHEN description ILIKE '%periféricos%'
                OR description ILIKE '%mouse%' THEN 'Hardware'
           ELSE 'Outros/Não Classificado'
       END AS categoria_identificada
FROM tickets
The noise in IT tickets (signatures, technical logs, punctuation) can break a model. My approach was to reduce words to their linguistic root.
Portuguese is morphologically rich, so simple tokenization isn't enough. We implemented a custom preprocess_for_keywords function using the RSLP (Removedor de Sufixos da Língua Portuguesa) Stemmer and a curated set of IT-specific stopwords.
import re

from nltk.corpus import stopwords
from nltk.stem import RSLPStemmer
from nltk.tokenize import word_tokenize

stemmer = RSLPStemmer()

# NLTK Portuguese stopwords, extended elsewhere with curated IT-specific terms
all_stopwords = set(stopwords.words('portuguese'))

def preprocess_for_keywords(text):
    # Lowercase and strip non-word characters
    text = re.sub(r'\W+', ' ', text.lower())
    words = word_tokenize(text, language='portuguese')
    # Drop stopwords and tokens of 2 characters or fewer, then apply stemming
    return [stemmer.stem(w) for w in words if w not in all_stopwords and len(w) > 2]
To understand the "vocabulary" of our users before going full AI, I generated Unigrams, Bigrams, and Trigrams from the unclassified pool. This surfaced high-frequency sequences such as "slow connection", "internet drop", or "display replacement": deterministic patterns, fixed expressions that define a category almost every time and therefore serve as anchors for new classification rules.
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# Generating trigrams to find context-heavy phrases
all_tokens = word_tokenize(df['cleaned_text'].str.cat(sep=' '), language='portuguese')
n_grams = list(ngrams(all_tokens, 3))
print(Counter(n_grams).most_common(10))
After sanitizing the text, the next logical step in my pipeline was a statistical frequency analysis of these N-grams. Instead of guessing, this lets us see not just single words, but the sequences of terms that define a specific IT problem.
In the pandas_analysis.ipynb, I developed a function to extract and count these sequences.
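The counting helper itself isn't reproduced in this post, so here is a minimal sketch of how the counters consumed by the plotting step further below (unigram_counts, bigram_counts, trigram_counts) could be built; the description column and the underscore-joined output format are assumptions based on the examples quoted later, not the exact notebook code.

from collections import Counter
from nltk.util import ngrams

def count_ngrams(token_lists, n):
    """Count n-gram frequencies, joining tokens with '_' (e.g., the stemmed 'sistem_lent')."""
    counts = Counter()
    for tokens in token_lists:
        counts.update('_'.join(gram) for gram in ngrams(tokens, n))
    return counts

# One token list per ticket, produced by preprocess_for_keywords
token_lists = df['description'].astype(str).apply(preprocess_for_keywords)
unigram_counts = count_ngrams(token_lists, 1)
bigram_counts = count_ngrams(token_lists, 2)
trigram_counts = count_ngrams(token_lists, 3)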
For tickets with no clear keywords, the core "intelligence" of the project lies in converting text into tensors: Sentence-Transformers maps each ticket into a 384-dimensional vector space. We chose the paraphrase-multilingual-MiniLM-L12-v2 model because it maps different languages into a shared vector space, making it exceptionally good at capturing the intent behind PT-BR technical jargon.
Why SBERT? Traditional Bag-of-Words fails with synonyms. SBERT understands that "I need a new mouse" and "Peripherals broken" are neighbors in vector space.
from sentence_transformers import SentenceTransformer

# Loading the SBERT model optimized for semantic similarity
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Encoding the cleaned text into 384-dimensional embeddings
embeddings = model.encode(df_nao_classificados['cleaned_text'].tolist())
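The assignment step that turns these embeddings into categories can be summarized as a hedged sketch: encode one anchor phrase per category and pick the most similar one via Cosine Similarity. The anchor phrases, the categoria_semantica column, and the 0.5 threshold below are illustrative assumptions, not the project's actual taxonomy.

from sentence_transformers import util

# Illustrative category anchors (hypothetical phrases)
category_anchors = {
    'Hardware': 'problema com mouse, teclado ou periféricos',
    'Acesso Remoto': 'não consigo conectar na vpn ou fazer login remoto',
    'Performance': 'sistema lento, travando ou demorando para responder',
}

labels = list(category_anchors.keys())
anchor_embeddings = model.encode(list(category_anchors.values()))

# Cosine similarity between every ticket embedding and every category anchor
similarities = util.cos_sim(embeddings, anchor_embeddings)  # shape: (n_tickets, n_categories)
best_idx = similarities.argmax(dim=1).tolist()
best_score = similarities.max(dim=1).values.tolist()

# Assign a category only when similarity clears an assumed confidence threshold
df_nao_classificados['categoria_semantica'] = [
    labels[i] if score >= 0.5 else 'Revisão manual'
    for i, score in zip(best_idx, best_score)
]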
I used nltk.util.ngrams combined with collections.Counter to rank the most frequent patterns. To make this actionable for a developer, I plotted horizontal bar charts using Seaborn to visualize the "Heavy Hitters" in the dataset.
import matplotlib.pyplot as plt
import seaborn as sns

def plot_ngram_frequency(ngram_counts_dict, n_value, num_to_plot=20, title_suffix=""):
    if not ngram_counts_dict:
        print(f"Insufficient values to plot {n_value}-grams.")
        return

    most_common_ngrams = ngram_counts_dict.most_common(num_to_plot)
    label_col = f'Palavra{"s" if n_value > 1 else ""}/Frase'
    df_ngram_freq = pd.DataFrame(most_common_ngrams, columns=[label_col, 'Frequência'])

    if n_value == 1:
        ngram_type = "Unigrams (Single Word)"
    elif n_value == 2:
        ngram_type = "Bigrams (Two Words)"
    elif n_value == 3:
        ngram_type = "Trigrams (Three Words)"
    else:
        ngram_type = f"{n_value}-grams"

    plt.figure(figsize=(14, 8))
    ax = sns.barplot(x='Frequência', y=label_col, data=df_ngram_freq,
                     hue=label_col, legend=False, palette='mako')
    plt.title(f'Top {num_to_plot} {ngram_type} Mais Frequentes {title_suffix}', fontsize=18, pad=20)
    plt.xlabel('Frequência', fontsize=14)
    plt.ylabel(label_col, fontsize=14)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)

    # Annotate each bar with its absolute frequency
    for p in ax.patches:
        width = p.get_width()
        if width > 0:
            ax.annotate(f'{int(width)}',
                        (width, p.get_y() + p.get_height() / 2),
                        ha='left', va='center',
                        xytext=(5, 0), textcoords='offset points',
                        fontsize=10, color='black')

    plt.tight_layout()
    plt.show()

plot_ngram_frequency(unigram_counts, n_value=1, num_to_plot=20, title_suffix="(Unclassified Tickets)")
plot_ngram_frequency(bigram_counts, n_value=2, num_to_plot=20, title_suffix="(Unclassified Tickets)")
plot_ngram_frequency(trigram_counts, n_value=3, num_to_plot=20, title_suffix="(Unclassified Tickets)")
Through this analysis, I discovered that:
Semantic Anchors: Certain trigrams like "vpn_blocked_access" appeared with such frequency that they could be used as deterministic rules for classification.
Noise Identification: I found recurring patterns that were actually "system noise" (like automated signatures or log headers), which led me back to Step 2.2 to further refine my stopword list.
Cluster Discovery: The high frequency of bigrams related to "Slow System" (slow_system) revealed a massive cluster of performance issues that the previous manual classification was completely ignoring.
This lexical layer acts as a "Fast-Track" in the pipeline: if a high-confidence N-gram is found, we can categorize the ticket immediately; if not, we pass the data to the Sentence-Transformers for a deeper semantic analysis.
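A minimal sketch of that two-stage routing, reusing the anchor embeddings and labels from the SBERT example above; the NGRAM_RULES contents (stemmed forms) and the threshold are hypothetical placeholders, not the project's actual rule set.

from nltk.util import ngrams

# Hypothetical deterministic rules mined from the N-gram analysis (stemmed forms)
NGRAM_RULES = {
    ('vpn', 'bloq', 'acess'): 'Acesso Remoto',
    ('sistem', 'lent'): 'Performance',
}

def classify_ticket(text, threshold=0.5):
    """Fast-track routing: deterministic N-gram rules first, SBERT fallback."""
    tokens = preprocess_for_keywords(text)
    found = set(ngrams(tokens, 2)) | set(ngrams(tokens, 3))

    # 1) Lexical fast-track: a high-confidence N-gram decides the category immediately
    for pattern, category in NGRAM_RULES.items():
        if pattern in found:
            return category, 'lexical'

    # 2) Semantic fallback: compare the ticket embedding against the category anchors
    scores = util.cos_sim(model.encode([text]), anchor_embeddings)[0]
    best = int(scores.argmax())
    if float(scores[best]) >= threshold:
        return labels[best], 'semantic'

    # 3) Nothing confident enough: leave for manual review
    return 'Outros/Não Classificado', 'manual_review'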



The redistribution of "Unclassified" tickets showed a significant increase in the granularity of our support metrics.


Our analysis included a Co-occurrence Matrix. By visualizing how often terms like "SYSTEM" and "SLOW" appeared together in unclassified tickets, we identified a recurring infrastructure issue that was previously hidden in the generic "Others" pile.
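The post doesn't reproduce the co-occurrence computation, but a hedged sketch with scikit-learn's CountVectorizer (assuming the cleaned_text column from the earlier steps) could look like this:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer

# Binary term presence per ticket, restricted to the 50 most frequent terms
vectorizer = CountVectorizer(max_features=50, binary=True)
X = vectorizer.fit_transform(df_nao_classificados['cleaned_text'])

# Term-by-term counts of how often two terms appear in the same ticket
cooc = (X.T @ X).toarray()
terms = vectorizer.get_feature_names_out()
cooc_df = pd.DataFrame(cooc, index=terms, columns=terms)

sns.heatmap(cooc_df, cmap='mako')
plt.title('Term co-occurrence in unclassified tickets')
plt.show()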

Finding: High correlation between terms like "Login" and "VPN" in unclassified tickets led to the creation of a new "Remote Access" category.
Since we cannot visualize 384 dimensions, we used PCA (Principal Component Analysis) to project the semantic embeddings onto a 2D plane.
Validation: The resulting scatter plot revealed distinct clusters. Tickets related to "Hardware" were geometrically grouped in a different region than "Software Access," proving that the SBERT model effectively "learned" the difference in user intent without explicit labels.
from sklearn.decomposition import PCA

# Reducing 384 dimensions to 2 for visualization
pca = PCA(n_components=2)
coords = pca.fit_transform(embeddings)
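The validation scatter plot can then be produced roughly as follows; coloring by the categoria_semantica column is an assumption carried over from the earlier classification sketch.

import matplotlib.pyplot as plt
import seaborn as sns

# One point per ticket; color by the (assumed) semantic label column
sns.scatterplot(
    x=coords[:, 0],
    y=coords[:, 1],
    hue=df_nao_classificados['categoria_semantica'],
    palette='mako',
    s=30,
)
plt.title('PCA projection of SBERT ticket embeddings')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()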
The engineering of this hybrid NLP pipeline successfully transitioned IT operations from a reactive, manual triaging process to a proactive, data-driven architecture. The integration of deterministic lexical mining (N-Grams) with latent semantic analysis (SBERT) solved a problem that traditional SQL queries or basic keyword matching simply couldn't touch.
The Power of Pre-processing: In Portuguese NLP, the RSLP Stemmer was the unsung hero. By reducing morphological variations, we increased the density of our vector clusters, making the model significantly more resilient to typos and slang.
Semantic vs. Lexical: We learned that while N-Grams are excellent for "fast-track" automated routing of recurring issues, the Transformer-based embeddings are essential for solving the 20% of tickets that contain ambiguous or highly technical descriptions.
Dimensionality as Validation: Using PCA was not just for aesthetics; it served as a mathematical audit. Clear separation in the 2D projection gave the engineering team the confidence to deploy the model without manual oversight.
Efficiency: Automated classification led to a 28% reduction in manual reporting time, allowing highly skilled technicians to focus on complex resolution rather than administrative sorting.
Data Governance: By eliminating the "Other/Unclassified" category, we generated a 100% auditable dataset, feeding Power BI dashboards with high-fidelity heatmaps that predict infrastructure failures before they escalate.
Scalability: The architecture is modular. By using SQLAlchemy and the Sentence-Transformers framework, this pipeline is ready to be wrapped in a FastAPI microservice, providing real-time triage for global service desks (10k+ MAU).
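As an illustration of that FastAPI wrapping (the endpoint, the schema, and the reuse of the classify_ticket sketch are hypothetical, not part of the current repository):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Ticket Triage API")

class TicketIn(BaseModel):
    description: str

@app.post("/classify")
def classify(ticket: TicketIn):
    # Reuses the hybrid routine sketched earlier: lexical fast-track, then SBERT fallback
    category, route = classify_ticket(ticket.description)
    return {"category": category, "route": route}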
This project proves that the intersection of Data Engineering and Applied Machine Learning is the most effective path to operational excellence in modern IT ecosystems.
I'm now exploring the integration of this pipeline with LLMs for automated response generation.
GitHub link: Repo