This project implements and compares various approaches for Part-of-Speech (POS) tagging, including Rule-Based, Conditional Random Fields (CRF), BiLSTM-CRF, and fine-tuned BERT models. The evaluation is performed on the UD_English-GUM (English) and UD_Telugu_English-TECT (Telugu-English code-mixed) datasets.
Part-of-speech tagging is a fundamental task in natural language processing that involves automatically assigning a grammatical category (noun, verb, adjective, etc.) to each word in a sentence. This project focuses primarily on English text, with a supplementary code-mixed Telugu-English evaluation, and compares approaches ranging from simple rule-based methods to neural networks.
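As a concrete illustration, a tagger maps each token of a sentence to a label from the Universal POS (UPOS) tagset used by the UD datasets. The sentence below is a toy example, not drawn from the corpora:

```python
# Toy example: a tokenized sentence and its Universal POS (UPOS) labels.
sentence = ["The", "quick", "fox", "jumps", "over", "the", "lazy", "dog"]
upos_tags = ["DET", "ADJ", "NOUN", "VERB", "ADP", "DET", "ADJ", "NOUN"]

# A tagger's job is to predict `upos_tags` given only `sentence`.
tagged = [f"{w}/{t}" for w, t in zip(sentence, upos_tags)]
print(" ".join(tagged))
```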
We evaluate four distinct approaches:

- A rule-based tagger (lexicon lookup plus morphological heuristics)
- Conditional Random Fields (CRF) with handcrafted features
- A BiLSTM-CRF neural sequence model
- Fine-tuned BERT, used as a reference benchmark
The primary goals of this project are to:

- Implement and compare POS taggers spanning rule-based, statistical, and neural paradigms
- Quantify how handcrafted features compare with learned representations at this dataset size
- Examine trade-offs in accuracy, interpretability, training-data needs, and computational cost
The project utilizes the Universal Dependencies English GUM dataset, which offers several advantages:

- Gold-standard Universal POS (UPOS) annotations in CoNLL-U format
- Coverage of multiple genres rather than a single text type
- Standard train/dev/test splits for reproducible comparison
The preprocessing pipeline involves several key steps:

- Parsing the CoNLL-U files and extracting (token, UPOS tag) pairs per sentence
- Building word and tag vocabularies with special `<PAD>` and `<UNK>` tokens
- Converting tokens and tags to index sequences and padding batches for the neural models
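The first step, reading the CoNLL-U files, can be sketched with a simplified parser (a hypothetical stand-in for the project's `load_conllu_data`; it skips comment lines, multiword-token ranges, and empty nodes, keeping only the FORM and UPOS columns):

```python
def parse_conllu(text):
    """Parse CoNLL-U text into a list of [(form, upos), ...] sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                          # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):              # comment / metadata line
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:  # multiword token or empty node
            continue
        current.append((cols[1], cols[3]))    # FORM and UPOS columns
    if current:
        sentences.append(current)
    return sentences

sample = ("# text = Dogs bark\n"
          "1\tDogs\tdog\tNOUN\t_\t_\t0\t_\t_\t_\n"
          "2\tbark\tbark\tVERB\t_\t_\t1\t_\t_\t_\n")
print(parse_conllu(sample))  # → [[('Dogs', 'NOUN'), ('bark', 'VERB')]]
```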
The rule-based approach combines two strategies:

- A lexicon lookup that tags words seen in the training data
- Suffix-based morphological heuristics for out-of-vocabulary words
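These two strategies, a lexicon lookup over training words plus suffix heuristics for unseen words, can be sketched as follows (the helper names and the most-frequent-tag lexicon are illustrative, not the project's exact implementation):

```python
from collections import Counter, defaultdict

def build_lexicon(train_sentences, train_tags):
    """Map each training word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent, tags in zip(train_sentences, train_tags):
        for word, tag in zip(sent, tags):
            counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def rule_based_tag(sentence, lexicon):
    """Lexicon lookup with suffix heuristics as a fallback."""
    tags = []
    for word in sentence:
        w = word.lower()
        if w in lexicon:
            tags.append(lexicon[w])
        elif w.endswith(("ing", "ed")):   # morphological heuristics
            tags.append("VERB")
        elif w.endswith("ly"):
            tags.append("ADV")
        else:
            tags.append("NOUN")           # default fallback for unknown words
    return tags

lexicon = build_lexicon([["the", "dog"]], [["DET", "NOUN"]])
print(rule_based_tag(["the", "dog", "sprinted", "quickly"], lexicon))
# → ['DET', 'NOUN', 'VERB', 'ADV']
```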
Example suffix rules:

- `-ing`, `-ed` → VERB
- `-ly` → ADV

The CRF model leverages a rich set of handcrafted features:
The CRF implementation uses a linear-chain structure:
- Uses `sklearn-crfsuite` for efficient training and inference
- Regularization parameters (`c1`, `c2`) and iteration limits

The neural model combines representation learning with structured prediction:
Comprehensive evaluation using multiple metrics:
```python
# Load and process CoNLL-U files
train_data = load_conllu_data('en_gum-ud-train.conllu')
train_sentences, train_pos_tags = extract_sentences_and_tags(train_data)

# Build vocabularies with special tokens
word_vocab = defaultdict(lambda: len(word_vocab))
word_vocab["<PAD>"] = 0
word_vocab["<UNK>"] = 1

pos_vocab = defaultdict(lambda: len(pos_vocab))
pos_vocab["<PAD>"] = 0

# Populate vocabularies from training data
for sent, tags in zip(train_sentences, train_pos_tags):
    for word, tag in zip(sent, tags):
        _ = word_vocab[word]
        _ = pos_vocab[tag]
```
```python
# Prepare feature-rich dataset
X_train, y_train = prepare_dataset(train_sentences, train_pos_tags)

# Configure and train CRF model
crf = CRF(
    algorithm='lbfgs',
    c1=0.005,              # L1 regularization
    c2=0.1,                # L2 regularization
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)
```
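The `prepare_dataset` helper is assumed to turn each token into a feature dictionary, the input format `sklearn-crfsuite` expects. A minimal per-token feature extractor along those lines (the feature set shown is illustrative, not the project's exact one) could be:

```python
def word2features(sentence, i):
    """Handcrafted features for the token at position i of a sentence."""
    word = sentence[i]
    features = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "prefix3": word[:3],
    }
    # Context features from neighboring tokens
    if i > 0:
        features["prev.lower"] = sentence[i - 1].lower()
    else:
        features["BOS"] = True          # beginning-of-sentence marker
    if i < len(sentence) - 1:
        features["next.lower"] = sentence[i + 1].lower()
    else:
        features["EOS"] = True          # end-of-sentence marker
    return features

feats = word2features(["She", "runs", "fast"], 1)
print(feats["suffix3"], feats["prev.lower"], feats["next.lower"])
```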
```python
class BiLSTM_CRF(nn.Module):
    def __init__(self, vocab_size, tag_size, embedding_dim=128, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(
            vocab_size, embedding_dim,
            padding_idx=word_vocab["<PAD>"]
        )
        self.lstm = nn.LSTM(
            embedding_dim, hidden_dim // 2,
            num_layers=2, bidirectional=True, batch_first=True
        )
        self.hidden2tag = nn.Linear(hidden_dim, tag_size)
        self.crf = CRF(tag_size, batch_first=True)

    def forward(self, x, tags=None, mask=None):
        embeds = self.embedding(x)
        lstm_out, _ = self.lstm(embeds)
        emissions = self.hidden2tag(lstm_out)
        if tags is not None:
            # Training mode: return negative log-likelihood
            loss = -self.crf(emissions, tags, mask=mask, reduction='mean')
            return loss
        else:
            # Inference mode: return best tag sequence
            return self.crf.decode(emissions, mask=mask)
```
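Inside the CRF layer, `decode` runs Viterbi search over the emission and transition scores. Conceptually, for a single unbatched sequence, it works like this pure-Python sketch (an illustration of the algorithm, not the library's implementation):

```python
def viterbi_decode(emissions, transitions):
    """emissions: [T][K] per-position tag scores; transitions: [K][K] scores
    for moving from tag i to tag j. Returns the highest-scoring tag sequence."""
    T, K = len(emissions), len(emissions[0])
    score = list(emissions[0])   # best score of any path ending in each tag
    backptr = []
    for t in range(1, T):
        new_score, ptrs = [], []
        for j in range(K):
            # Best previous tag to transition into tag j
            best_i = max(range(K), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptrs.append(best_i)
        score = new_score
        backptr.append(ptrs)
    # Follow back-pointers from the best final tag
    best_last = max(range(K), key=lambda j: score[j])
    path = [best_last]
    for ptrs in reversed(backptr):
        path.append(ptrs[path[-1]])
    return list(reversed(path))

print(viterbi_decode([[3, 0], [0, 3]], [[0, 0], [0, 0]]))  # → [0, 1]
```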
```python
model = BiLSTM_CRF(len(word_vocab), len(pos_vocab))
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for sentences, tags, lengths, mask in train_loader:
        # Forward pass and loss computation
        loss = model(sentences, tags, mask)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(train_loader):.4f}")
```
```python
def evaluate_model(model, test_loader):
    model.eval()
    correct_predictions = 0
    total_predictions = 0
    with torch.no_grad():
        for sentences, true_tags, lengths, mask in test_loader:
            predicted_tags = model(sentences, mask=mask)
            # Calculate accuracy considering only non-padded tokens
            for pred_seq, true_seq, length in zip(predicted_tags, true_tags, lengths):
                correct_predictions += sum(
                    p == t for p, t in zip(pred_seq[:length], true_seq[:length])
                )
                total_predictions += length
    return correct_predictions / total_predictions

# Evaluate all models
test_accuracy = evaluate_model(model, test_loader)
print(f"Test Accuracy: {test_accuracy:.4f}")
```
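Overall token accuracy can hide weaknesses on rare tags, so per-tag precision and recall are also worth inspecting. A small pure-Python helper for that (illustrative, separate from the evaluation code above):

```python
from collections import Counter

def per_tag_scores(true_seqs, pred_seqs):
    """Per-tag precision and recall over aligned gold/predicted tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        for t, p in zip(true_tags, pred_tags):
            if t == p:
                tp[t] += 1
            else:
                fp[p] += 1   # predicted tag p where gold was t
                fn[t] += 1   # missed gold tag t
    tags = set(tp) | set(fp) | set(fn)
    return {
        tag: {
            "precision": tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0,
            "recall": tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0,
        }
        for tag in tags
    }

scores = per_tag_scores([["DET", "NOUN", "VERB"]], [["DET", "VERB", "VERB"]])
print(scores["VERB"])  # → {'precision': 0.5, 'recall': 1.0}
```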
The comparative evaluation reveals interesting performance patterns across different approaches:
| Model | Test Accuracy | Key Strengths | Limitations |
|---|---|---|---|
| Rule-based (basic) | 85.9% | Fast, interpretable, no training required | Limited coverage, rigid rules |
| Rule-based (enhanced) | 86.8% | Improved with affix heuristics | Still struggles with ambiguity |
| CRF | 93.2% | Excellent feature integration, structured prediction | Manual feature engineering required |
| BiLSTM-CRF | 90.6% | Automatic feature learning, contextual understanding | Requires more data and tuning |
| BERT (reference) | 99.2% | State-of-the-art performance, pre-trained knowledge | Computationally expensive |
Rule-based Performance: The baseline rule-based tagger achieved respectable accuracy for such a simple approach. The addition of morphological heuristics provided a modest but meaningful improvement, particularly for handling out-of-vocabulary words.
CRF Excellence: The CRF model demonstrated superior performance among our implemented approaches. Its success stems from the ability to effectively combine multiple feature types while maintaining global sequence consistency through structured prediction.
Neural Model Insights: While the BiLSTM-CRF showed strong performance, it fell slightly short of the CRF model. This suggests that for this particular task and dataset size, carefully engineered features can compete effectively with learned representations.
BERT Benchmark: The transformer-based model set a high performance ceiling, demonstrating the potential of large-scale pre-training and attention mechanisms for sequence labeling tasks.
This comprehensive study of POS tagging approaches reveals several important insights about the evolution of natural language processing techniques. The progression from rule-based systems to statistical models and finally to neural networks illustrates the field's development trajectory.
Key findings include:
Feature Engineering Remains Valuable: The CRF model's superior performance over the BiLSTM-CRF highlights that thoughtful feature engineering can be highly effective, especially with limited training data.
Context Matters: All models that incorporated contextual information significantly outperformed simple lookup-based approaches, emphasizing the importance of considering word sequences rather than isolated tokens.
Trade-offs Exist: Each approach offers different advantages in terms of computational requirements, interpretability, training data needs, and deployment complexity.
Modern Transformers Excel: While not implemented in detail here, the BERT results demonstrate the substantial gains possible with pre-trained transformer models, albeit at increased computational cost.
Several promising directions could extend and improve this work, including deeper analysis of the code-mixed Telugu-English setting and fuller exploration of pre-trained transformer models.
This project demonstrates the rich landscape of approaches available for fundamental NLP tasks and provides a solid foundation for exploring more advanced techniques in computational linguistics.