Bag of Words Explained: Mastering the Fundamentals of NLP

Welcome to Part 2 of the 'NLP Engineering' series, where I'll guide you through essential NLP concepts, from theory to practice, with clear explanations and hands-on code examples. Perfect for beginners and seasoned engineers alike!

Natural Language Processing (NLP) requires converting text into numerical formats that machines can understand. One of the fundamental techniques for this transformation is the Bag of Words (BoW) model. In this blog post, I'll explain how BoW works, why it's useful, and demonstrate it with Python code.

What is Bag of Words?

Bag of Words is a method of representing text data as numerical features. At its core, BoW involves building a vocabulary of the unique words in a corpus and describing each document by how often each of those words appears in it. The resulting representation can then be used for various NLP tasks like sentiment analysis, text classification, or document clustering.

The name "Bag of Words" comes from the fact that this model disregards grammar and word order, treating text as an unordered collection of words - like words randomly pulled from a bag.

How Bag of Words Works

The BoW process can be broken down into three main steps:

1. Tokenization

First, we divide the corpus of text into smaller chunks or tokens (usually words). This process, called tokenization, transforms lengthy text into manageable pieces:

def preprocess(corpus):
    return [doc.lower().split() for doc in corpus]

This function:

  • Takes each document in the corpus
  • Converts it to lowercase
  • Splits it into individual words
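
For example, running preprocess on a tiny made-up corpus (the two sentences below are just for illustration) produces one lowercase token list per document:

sample = ["The cat sat", "The cat ran"]
print(preprocess(sample))
# [['the', 'cat', 'sat'], ['the', 'cat', 'ran']]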

2. Vocabulary Creation

Next, we build a vocabulary containing all unique words found across the entire corpus:

def build_vocab(tokenized_corpus):
    vocab = set()
    for doc in tokenized_corpus:
        vocab.update(doc)
    return sorted(list(vocab))

This function:

  • Collects every unique word from all documents
  • Creates a sorted vocabulary list
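
For example, feeding the tokenized toy corpus from the previous step into build_vocab yields a small sorted vocabulary:

tokenized_sample = preprocess(["The cat sat", "The cat ran"])
print(build_vocab(tokenized_sample))
# ['cat', 'ran', 'sat', 'the']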

3. Vectorization

Finally, we represent each document as a numerical vector based on our vocabulary:

def vectorize(tokenized_corpus, vocab):
    vectors = []
    for doc in tokenized_corpus:
        vector = [doc.count(word) for word in vocab]
        vectors.append(vector)
    return vectors

This function:

  • Creates a vector for each document
  • Each position in the vector corresponds to a word in our vocabulary
  • The value represents how many times that word appears in the document
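
For example, vectorizing the toy corpus against its four-word vocabulary gives one count vector per document:

sample_vocab = build_vocab(tokenized_sample)
print(vectorize(tokenized_sample, sample_vocab))
# [[1, 0, 1, 1], [1, 1, 0, 1]]  (columns: 'cat', 'ran', 'sat', 'the')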

Putting It All Together

Let's see how this works with a sample corpus about Bag of Words itself:

corpus = [
    "Bag of Words (BoW) is a method used in NLP to represent text data as numerical features. It works by:",
    "Creating a Vocabulary: Identifying all the unique words in a given set of text documents (corpus).",
    "Counting Word Occurrences: Creating a bag (like a set) for each document containing the frequency (or count) of each word in the vocabulary. The order of the words is not considered (hence, bag).",
    "Representing as Vectors: Converting these bags into numerical vectors, where each element of the vector corresponds to the count of a specific word in the vocabulary."
]

# Step 1: Tokenize
tokenized = preprocess(corpus)

# Step 2: Build vocabulary
vocab = build_vocab(tokenized)

# Step 3: Vectorize
vectors = vectorize(tokenized, vocab)

print("Vocabulary:", vocab)
for i, vec in enumerate(vectors):
    print(f"Doc {i+1} vector:", vec)

Output of Our Code:

Vocabulary: ['(bow)', '(corpus).', '(hence,', '(like', '(or', 'a', 'all', 'as', 'bag', 'bag).', 'bags', 'by:', 'considered', 'containing', 'converting', 'corresponds', 'count', 'count)', 'counting', 'creating', 'data', 'document', 'documents', 'each', 'element', 'features.', 'for', 'frequency', 'given', 'identifying', 'in', 'into', 'is', 'it', 'method', 'nlp', 'not', 'numerical', 'occurrences:', 'of', 'order', 'represent', 'representing', 'set', 'set)', 'specific', 'text', 'the', 'these', 'to', 'unique', 'used', 'vector', 'vectors,', 'vectors:', 'vocabulary.', 'vocabulary:', 'where', 'word', 'words', 'works']

Doc 1 vector: [1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]

Doc 2 vector: [0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0]

Doc 3 vector: [0, 0, 1, 1, 1, 2, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 2, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 2, 1, 0, 0, 0, 1, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0]

Doc 4 vector: [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 1, 0, 3, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0]

Understanding the Output

Let's break down what this output means:

The vocabulary is a list of 61 unique tokens from our corpus, sorted alphabetically. These tokens form the "dimensions" of our vector space. Note that entries like '(bow)' and 'features.' keep their punctuation because our simple whitespace split does not strip it, a common side effect of naive tokenization.

The document vectors show how many times each word appears in each document:

  • Document 1 contains the word "(bow)" once [1 in position 0], "bag" once [1 in position 8], "nlp" once [1 in position 35], and so on.
  • Document 2 contains the word "a" twice [2 in position 5], "all" once [1 in position 6], and "vocabulary:" once [1 in position 56].
  • Document 3 has the word "the" four times [4 in position 47], showing it's the most frequent word in this document.
  • Document 4 mentions "the" three times [3 in position 47] and "of" twice [2 in position 39].
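
You can verify any of these counts directly from the vocab and vectors variables produced above:

# Look up how often 'the' occurs in Document 3
position = vocab.index('the')   # 47 in this vocabulary
print(vectors[2][position])     # 4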

What's interesting: We can immediately see patterns in these vectors. For example, Document 3 (about counting word occurrences) contains tokens like "frequency", "counting", and "bag", which appear as 1's in its vector. Document 4 (about vector representation) has 1's for tokens like "converting", "element", and "numerical".

These numerical representations allow machine learning algorithms to find patterns, similarities, and differences between documents that would be difficult to detect in raw text.
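
As a concrete, if minimal, illustration, the sketch below reuses the vectors list built above and compares documents with cosine similarity; documents that share more vocabulary score closer to 1:

import math

def cosine_similarity(v1, v2):
    # Dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Compare every pair of documents from the example above
for i in range(len(vectors)):
    for j in range(i + 1, len(vectors)):
        print(f"Doc {i + 1} vs Doc {j + 1}: {cosine_similarity(vectors[i], vectors[j]):.2f}")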

Applications of Bag of Words

This simple yet powerful representation enables many NLP applications (a small classification sketch follows this list):

  • Text Classification: Categorizing documents based on their content
  • Sentiment Analysis: Determining whether text expresses positive or negative sentiment
  • Document Clustering: Grouping similar documents together
  • Information Retrieval: Finding relevant documents for a given query
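
As a sketch of the first two applications, here is a minimal sentiment classifier built with scikit-learn, whose CountVectorizer is a production-ready Bag of Words implementation (the toy reviews and labels below are invented purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy data: 1 = positive sentiment, 0 = negative sentiment
train_texts = [
    "great movie, loved the acting",
    "terrible plot and boring film",
    "wonderful story, really enjoyed it",
    "awful movie, waste of time",
]
train_labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()               # builds the vocabulary and count vectors
X_train = vectorizer.fit_transform(train_texts)

classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

# Classify a new, unseen review using the same vocabulary
X_new = vectorizer.transform(["really boring film"])
print(classifier.predict(X_new))

Unlike our minimal preprocess function, CountVectorizer strips punctuation and drops single-character tokens by default, so it is a good drop-in once you move beyond toy examples.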

Limitations

While powerful, BoW has some limitations:

  • It loses word order information ("dog bites man" and "man bites dog" have identical BoW representations, as the sketch after this list demonstrates)
  • It doesn't capture semantics (meaning) of words
  • It creates sparse vectors with many zeros for large vocabularies
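
The word order limitation is easy to verify with the helper functions defined earlier (the two sentences below are the classic illustrative pair):

pair = preprocess(["Dog bites man", "Man bites dog"])
print(vectorize(pair, build_vocab(pair)))
# [[1, 1, 1], [1, 1, 1]] -- both sentences map to the same vector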

Conclusion

Bag of Words provides a straightforward way to convert text into numerical features that machine learning algorithms can process. While more advanced techniques like word embeddings (Word2Vec, GloVe) and transformers (BERT, GPT) have been developed, BoW remains a foundational concept in NLP that's important to understand.

The next time you encounter text data, remember that behind sophisticated NLP systems often lies this simple yet effective concept - representing text as numerical vectors by counting word frequencies.

Next in the series: [Coming Soon] Part 3: TF-IDF - Taking Bag of Words to the Next Level
