This project implements and compares various approaches for Part-of-Speech (POS) tagging, including Rule-Based, Conditional Random Fields (CRF), BiLSTM-CRF, and fine-tuned BERT models. The evaluation is performed on the UD_English-GUM (English) and UD_Telugu_English-TECT (Telugu-English code-mixed) datasets.
Part-of-speech tagging is a fundamental task in natural language processing that involves automatically assigning a grammatical category (noun, verb, adjective, etc.) to each word in a sentence. This project focuses primarily on English text, with a supplementary code-mixed Telugu-English evaluation, and compares approaches ranging from simple rule-based methods to neural networks.
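As a concrete illustration, a tagger maps each token of a sentence to a label from the Universal POS (UPOS) tagset used by the UD datasets. The sentence below is a toy example, not drawn from the corpora:

```python
# Toy example: a tokenized sentence and its Universal POS (UPOS) labels.
sentence = ["The", "quick", "fox", "jumps", "over", "the", "lazy", "dog"]
upos_tags = ["DET", "ADJ", "NOUN", "VERB", "ADP", "DET", "ADJ", "NOUN"]

# A tagger's job is to predict `upos_tags` given only `sentence`.
tagged = [f"{w}/{t}" for w, t in zip(sentence, upos_tags)]
print(" ".join(tagged))
```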
We evaluate four distinct approaches:

- A rule-based tagger (lexicon lookup plus morphological heuristics)
- Conditional Random Fields (CRF) with handcrafted features
- A BiLSTM-CRF neural sequence model
- Fine-tuned BERT, used as a reference benchmark
The primary goals of this project are to:

- Implement and compare POS taggers spanning rule-based, statistical, and neural paradigms
- Quantify how handcrafted features compare with learned representations at this dataset size
- Examine trade-offs in accuracy, interpretability, training-data needs, and computational cost
The project utilizes the Universal Dependencies English GUM dataset, which offers several advantages:

- Gold-standard Universal POS (UPOS) annotations in CoNLL-U format
- Coverage of multiple genres rather than a single text type
- Standard train/dev/test splits for reproducible comparison
The preprocessing pipeline involves several key steps:

- Parsing the CoNLL-U files and extracting (token, UPOS tag) pairs per sentence
- Building word and tag vocabularies with special `<PAD>` and `<UNK>` tokens
- Converting tokens and tags to index sequences and padding batches for the neural models
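The first step, reading the CoNLL-U files, can be sketched with a simplified parser (a hypothetical stand-in for the project's `load_conllu_data`; it skips comment lines, multiword-token ranges, and empty nodes, keeping only the FORM and UPOS columns):

```python
def parse_conllu(text):
    """Parse CoNLL-U text into a list of [(form, upos), ...] sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                          # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):              # comment / metadata line
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:  # multiword token or empty node
            continue
        current.append((cols[1], cols[3]))    # FORM and UPOS columns
    if current:
        sentences.append(current)
    return sentences

sample = ("# text = Dogs bark\n"
          "1\tDogs\tdog\tNOUN\t_\t_\t0\t_\t_\t_\n"
          "2\tbark\tbark\tVERB\t_\t_\t1\t_\t_\t_\n")
print(parse_conllu(sample))  # → [[('Dogs', 'NOUN'), ('bark', 'VERB')]]
```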
The rule-based approach combines two strategies:

- A lexicon lookup that tags words seen in the training data
- Suffix-based morphological heuristics for out-of-vocabulary words
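These two strategies, a lexicon lookup over training words plus suffix heuristics for unseen words, can be sketched as follows (the helper names and the most-frequent-tag lexicon are illustrative, not the project's exact implementation):

```python
from collections import Counter, defaultdict

def build_lexicon(train_sentences, train_tags):
    """Map each training word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent, tags in zip(train_sentences, train_tags):
        for word, tag in zip(sent, tags):
            counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def rule_based_tag(sentence, lexicon):
    """Lexicon lookup with suffix heuristics as a fallback."""
    tags = []
    for word in sentence:
        w = word.lower()
        if w in lexicon:
            tags.append(lexicon[w])
        elif w.endswith(("ing", "ed")):   # morphological heuristics
            tags.append("VERB")
        elif w.endswith("ly"):
            tags.append("ADV")
        else:
            tags.append("NOUN")           # default fallback for unknown words
    return tags

lexicon = build_lexicon([["the", "dog"]], [["DET", "NOUN"]])
print(rule_based_tag(["the", "dog", "sprinted", "quickly"], lexicon))
# → ['DET', 'NOUN', 'VERB', 'ADV']
```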
Example suffix rules:

- `-ing`, `-ed` → VERB
- `-ly` → ADV

The CRF model leverages a rich set of handcrafted features:
The CRF implementation uses a linear-chain structure:
- Uses `sklearn-crfsuite` for efficient training and inference
- Regularization parameters (`c1`, `c2`) and iteration limits

The neural model combines representation learning with structured prediction:
Comprehensive evaluation using multiple metrics:
```python
# Load and process CoNLL-U files
train_data = load_conllu_data('en_gum-ud-train.conllu')
train_sentences, train_pos_tags = extract_sentences_and_tags(train_data)

# Build vocabularies with special tokens
word_vocab = defaultdict(lambda: len(word_vocab))
word_vocab["<PAD>"] = 0
word_vocab["<UNK>"] = 1

pos_vocab = defaultdict(lambda: len(pos_vocab))
pos_vocab["<PAD>"] = 0

# Populate vocabularies from training data
for sent, tags in zip(train_sentences, train_pos_tags):
    for word, tag in zip(sent, tags):
        _ = word_vocab[word]
        _ = pos_vocab[tag]
```
```python
# Prepare feature-rich dataset
X_train, y_train = prepare_dataset(train_sentences, train_pos_tags)

# Configure and train CRF model
crf = CRF(
    algorithm='lbfgs',
    c1=0.005,              # L1 regularization
    c2=0.1,                # L2 regularization
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)
```
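The `prepare_dataset` helper is assumed to turn each token into a feature dictionary, the input format `sklearn-crfsuite` expects. A minimal per-token feature extractor along those lines (the feature set shown is illustrative, not the project's exact one) could be:

```python
def word2features(sentence, i):
    """Handcrafted features for the token at position i of a sentence."""
    word = sentence[i]
    features = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "prefix3": word[:3],
    }
    # Context features from neighboring tokens
    if i > 0:
        features["prev.lower"] = sentence[i - 1].lower()
    else:
        features["BOS"] = True          # beginning-of-sentence marker
    if i < len(sentence) - 1:
        features["next.lower"] = sentence[i + 1].lower()
    else:
        features["EOS"] = True          # end-of-sentence marker
    return features

feats = word2features(["She", "runs", "fast"], 1)
print(feats["suffix3"], feats["prev.lower"], feats["next.lower"])
```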
```python
class BiLSTM_CRF(nn.Module):
    def __init__(self, vocab_size, tag_size, embedding_dim=128, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(
            vocab_size, embedding_dim,
            padding_idx=word_vocab["<PAD>"]
        )
        self.lstm = nn.LSTM(
            embedding_dim, hidden_dim // 2,
            num_layers=2, bidirectional=True, batch_first=True
        )
        self.hidden2tag = nn.Linear(hidden_dim, tag_size)
        self.crf = CRF(tag_size, batch_first=True)

    def forward(self, x, tags=None, mask=None):
        embeds = self.embedding(x)
        lstm_out, _ = self.lstm(embeds)
        emissions = self.hidden2tag(lstm_out)
        if tags is not None:
            # Training mode: return negative log-likelihood
            loss = -self.crf(emissions, tags, mask=mask, reduction='mean')
            return loss
        else:
            # Inference mode: return best tag sequence
            return self.crf.decode(emissions, mask=mask)
```
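Inside the CRF layer, `decode` runs Viterbi search over the emission and transition scores. Conceptually, for a single unbatched sequence, it works like this pure-Python sketch (an illustration of the algorithm, not the library's implementation):

```python
def viterbi_decode(emissions, transitions):
    """emissions: [T][K] per-position tag scores; transitions: [K][K] scores
    for moving from tag i to tag j. Returns the highest-scoring tag sequence."""
    T, K = len(emissions), len(emissions[0])
    score = list(emissions[0])   # best score of any path ending in each tag
    backptr = []
    for t in range(1, T):
        new_score, ptrs = [], []
        for j in range(K):
            # Best previous tag to transition into tag j
            best_i = max(range(K), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptrs.append(best_i)
        score = new_score
        backptr.append(ptrs)
    # Follow back-pointers from the best final tag
    best_last = max(range(K), key=lambda j: score[j])
    path = [best_last]
    for ptrs in reversed(backptr):
        path.append(ptrs[path[-1]])
    return list(reversed(path))

print(viterbi_decode([[3, 0], [0, 3]], [[0, 0], [0, 0]]))  # → [0, 1]
```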
```python
model = BiLSTM_CRF(len(word_vocab), len(pos_vocab))
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for sentences, tags, lengths, mask in train_loader:
        # Forward pass and loss computation
        loss = model(sentences, tags, mask)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(train_loader):.4f}")
```
```python
def evaluate_model(model, test_loader):
    model.eval()
    correct_predictions = 0
    total_predictions = 0
    with torch.no_grad():
        for sentences, true_tags, lengths, mask in test_loader:
            predicted_tags = model(sentences, mask=mask)
            # Calculate accuracy considering only non-padded tokens
            for pred_seq, true_seq, length in zip(predicted_tags, true_tags, lengths):
                correct_predictions += sum(
                    p == t for p, t in zip(pred_seq[:length], true_seq[:length])
                )
                total_predictions += length
    return correct_predictions / total_predictions

# Evaluate all models
test_accuracy = evaluate_model(model, test_loader)
print(f"Test Accuracy: {test_accuracy:.4f}")
```
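Overall token accuracy can hide weaknesses on rare tags, so per-tag precision and recall are also worth inspecting. A small pure-Python helper for that (illustrative, separate from the evaluation code above):

```python
from collections import Counter

def per_tag_scores(true_seqs, pred_seqs):
    """Per-tag precision and recall over aligned gold/predicted tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        for t, p in zip(true_tags, pred_tags):
            if t == p:
                tp[t] += 1
            else:
                fp[p] += 1   # predicted tag p where gold was t
                fn[t] += 1   # missed gold tag t
    tags = set(tp) | set(fp) | set(fn)
    return {
        tag: {
            "precision": tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0,
            "recall": tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0,
        }
        for tag in tags
    }

scores = per_tag_scores([["DET", "NOUN", "VERB"]], [["DET", "VERB", "VERB"]])
print(scores["VERB"])  # → {'precision': 0.5, 'recall': 1.0}
```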
The comparative evaluation reveals interesting performance patterns across different approaches:
| Model | Test Accuracy | Key Strengths | Limitations |
|---|---|---|---|
| Rule-based (basic) | 85.9% | Fast, interpretable, no training required | Limited coverage, rigid rules |
| Rule-based (enhanced) | 86.8% | Improved with affix heuristics | Still struggles with ambiguity |
| CRF | 93.2% | Excellent feature integration, structured prediction | Manual feature engineering required |
| BiLSTM-CRF | 90.6% | Automatic feature learning, contextual understanding | Requires more data and tuning |
| BERT (reference) | 99.2% | State-of-the-art performance, pre-trained knowledge | Computationally expensive |
Rule-based Performance: The baseline rule-based tagger achieved respectable accuracy for such a simple approach. The addition of morphological heuristics provided a modest but meaningful improvement, particularly for handling out-of-vocabulary words.
CRF Excellence: The CRF model demonstrated superior performance among our implemented approaches. Its success stems from the ability to effectively combine multiple feature types while maintaining global sequence consistency through structured prediction.
Neural Model Insights: While the BiLSTM-CRF showed strong performance, it fell slightly short of the CRF model. This suggests that for this particular task and dataset size, carefully engineered features can compete effectively with learned representations.
BERT Benchmark: The transformer-based model set a high performance ceiling, demonstrating the potential of large-scale pre-training and attention mechanisms for sequence labeling tasks.
This comprehensive study of POS tagging approaches reveals several important insights about the evolution of natural language processing techniques. The progression from rule-based systems to statistical models and finally to neural networks illustrates the field's development trajectory.
Key findings include:
Feature Engineering Remains Valuable: The CRF model's superior performance over the BiLSTM-CRF highlights that thoughtful feature engineering can be highly effective, especially with limited training data.
Context Matters: All models that incorporated contextual information significantly outperformed simple lookup-based approaches, emphasizing the importance of considering word sequences rather than isolated tokens.
Trade-offs Exist: Each approach offers different advantages in terms of computational requirements, interpretability, training data needs, and deployment complexity.
Modern Transformers Excel: While not implemented in detail here, the BERT results demonstrate the substantial gains possible with pre-trained transformer models, albeit at increased computational cost.
Several promising directions could extend and improve this work, including deeper analysis of the code-mixed Telugu-English setting and fuller exploration of pre-trained transformer models.
This project demonstrates the rich landscape of approaches available for fundamental NLP tasks and provides a solid foundation for exploring more advanced techniques in computational linguistics.