N-Gram Models: The Basics That Kickstart Your NLP Journey Toward LLMs

N-gram models are fundamental building blocks in natural language processing that capture the sequential nature of human language. Unlike simpler approaches that ignore word order, N-grams preserve local context and help us model how language flows naturally from one word to the next.

What Are N-gram Models?

An N-gram model is a probabilistic language model based on the Markov chain assumption, which states that the probability of a word depends only on the previous N-1 words. The "N" in N-gram refers to the number of consecutive words grouped together as a single unit.
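
In probability terms, the chain rule writes the probability of a sentence as P(w1) · P(w2 | w1) · P(w3 | w1, w2) · …, and the Markov assumption truncates each conditioning history to the last N-1 words:

P(w_i | w_1, …, w_{i-1}) ≈ P(w_i | w_{i-N+1}, …, w_{i-1})

For a bigram model (N = 2) this reduces to P(w_i | w_{i-1}), which is estimated from a corpus simply as count(w_{i-1} w_i) / count(w_{i-1}).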

Types of N-grams

N-gram Type | Description           | Example (from "data science is fascinating")
Unigram     | Single word units     | "data", "science", "is", "fascinating"
Bigram      | Two-word sequences    | "data science", "science is", "is fascinating"
Trigram     | Three-word sequences  | "data science is", "science is fascinating"
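
As a quick sanity check, the bigrams and trigrams in the table can be reproduced with a few lines of standard-library Python (a minimal sketch, separate from the full model code later in this post):

words = "data science is fascinating".split()
bigrams = list(zip(words, words[1:]))
trigrams = list(zip(words, words[1:], words[2:]))
print(bigrams)   # [('data', 'science'), ('science', 'is'), ('is', 'fascinating')]
print(trigrams)  # [('data', 'science', 'is'), ('science', 'is', 'fascinating')]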

Why N-grams Matter: Preserving Context

"MI won over RCB" vs "RCB won over MI"
A Bag-of-Words model treats these as identical. An N-gram model recognizes the sequence and knows they mean different things!
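
A minimal sketch makes the difference concrete (standard library only; the variable names are just for illustration):

from collections import Counter

s1 = "MI won over RCB".lower().split()
s2 = "RCB won over MI".lower().split()

# Bag-of-Words: both sentences contain the same words, so they look identical
print(Counter(s1) == Counter(s2))                      # True

# Bigrams: word order is preserved, so the two sequences differ
print(list(zip(s1, s1[1:])) == list(zip(s2, s2[1:])))  # False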

Building an N-gram Model from Scratch in Python

import re
import random
from collections import defaultdict, Counter

def tokenize(text):
    # Lowercase the text and split it into word tokens
    text = text.lower()
    return re.findall(r'\b\w+\b', text)

def generate_ngrams(tokens, n):
    # Slide a window of size n over the token list and return each window as a tuple
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def build_ngram_model(text, n):
    # Map each (n-1)-word context to a Counter of the words that follow it
    tokens = tokenize(text)
    ngrams = generate_ngrams(tokens, n)
    model = defaultdict(Counter)
    for ngram in ngrams:
        context, target = ngram[:-1], ngram[-1]
        model[context][target] += 1
    return model

def generate_text(model, context, num_words=20):
    # Start from the seed context and repeatedly sample the next word,
    # weighting each candidate by how often it followed the current context
    generated = list(context)
    for _ in range(num_words):
        current_context = tuple(generated[-(len(context)):])
        if current_context not in model:
            break  # unseen context: nothing left to sample from
        next_word = random.choices(
            list(model[current_context].keys()),
            weights=list(model[current_context].values()),
            k=1
        )[0]
        generated.append(next_word)
    return ' '.join(generated)

text = """The future is already here, it's just not evenly distributed. 
The future belongs to those who believe in the beauty of their dreams. 
The present is theirs; the future, for which I have really worked, is mine. 
The best way to predict the future is to create it. 
The only thing we know about the future is that it will be different. 
The truth is incontrovertible. Malice may attack it, 
ignorance may deride it, but in the end, there it is."""

# Build a bigram model and compare it with plain unigram counts
bigram_model = build_ngram_model(text, 2)
unigram_model = Counter(tokenize(text))
top_unigrams = unigram_model.most_common(3)

# Normalize the counts of words that follow "the" into probabilities
bigram_preds = dict(bigram_model[("the",)])
total = sum(bigram_preds.values())
bigram_probs = sorted([(word, count / total) for word, count in bigram_preds.items()],
                      key=lambda x: x[1], reverse=True)[:3]

print(f"Unigram top predictions (context-free): {top_unigrams}")
print(f"Bigram predictions after 'the': {bigram_probs}")
print("Generated text using bigram model:", generate_text(bigram_model, ('the',), 10))

Sample Output

  • Unigram top predictions (context-free): [('the', 11), ('is', 7), ('it', 6)]
  • Bigram predictions after 'the': [('future', 0.45), ('beauty', 0.09), ('present', 0.09)] (values rounded; the script prints full-precision floats)
  • Generated text using bigram model: the future is to create it the only thing we know (sampling is random, so this line changes from run to run)

Applications of N-gram Models

  • Text prediction (autocomplete features, sketched after this list)
  • Speech recognition
  • Machine translation
  • Spelling correction
  • Document classification
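
The autocomplete use case, for example, can be sketched directly on top of the bigram_model built above (a minimal illustration that assumes the variables from the earlier script are still in scope; suggest_next is just a hypothetical helper name):

def suggest_next(model, word, k=3):
    # Return up to k of the most frequent continuations of `word` under the bigram model
    candidates = model.get((word.lower(),))
    if not candidates:
        return []
    return [w for w, _ in candidates.most_common(k)]

print(suggest_next(bigram_model, "the"))     # e.g. ['future', 'beauty', 'present']
print(suggest_next(bigram_model, "future"))  # e.g. ['is', 'belongs', 'for']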

Limitations

  • Only captures local context
  • Struggles with long-range dependencies
  • Data sparsity issues (illustrated after this list)
  • Higher-order N-grams need large datasets
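
To see the sparsity problem concretely, a higher-order model built from the same tiny corpus already has many contexts it has never observed (a small sketch reusing build_ngram_model and text from the script above):

trigram_model = build_ngram_model(text, 3)

# A context that occurs in the corpus has counts to predict from...
print(trigram_model[("the", "future")])   # Counter({'is': 3, 'belongs': 1, 'for': 1})

# ...but a perfectly reasonable unseen context has nothing at all
print(trigram_model[("the", "past")])     # Counter()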

Conclusion

N-gram models provide a simple yet effective approach to modeling sequential data like text. By preserving word order and local context, they overcome limitations of bag-of-words models and form the foundation for more advanced language models. While modern deep learning approaches have surpassed traditional N-gram models in many NLP tasks, understanding N-grams remains essential for anyone working in natural language processing.
