Bag of Words Explained: Mastering the Fundamentals of NLP
Natural Language Processing (NLP) requires converting text into numerical formats that machines can understand. One of the fundamental techniques for this transformation is the Bag of Words (BoW) model. In this blog post, I'll explain how BoW works and why it's useful, and demonstrate it with Python code.
What is Bag of Words?
Bag of Words is a method of representing text data as numerical features. At its core, BoW builds a vocabulary of the unique words in a corpus of text and describes each document by how often those words appear. The resulting features can then be used for various NLP tasks like sentiment analysis, text classification, or document clustering.
The name "Bag of Words" comes from the fact that this model disregards grammar and word order, treating text as an unordered collection of words - like words randomly pulled from a bag.
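To make the metaphor concrete, here is a minimal sketch using Python's standard collections.Counter: two sentences containing the same words in different orders produce exactly the same bag.
from collections import Counter
# Same words, different order
bag_a = Counter("the cat chased the dog".split())
bag_b = Counter("the dog chased the cat".split())
print(bag_a == bag_b)  # True: only the counts survive, not the order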
How Bag of Words Works
The BoW process can be broken down into three main steps:
1. Tokenization
First, we divide the corpus of text into smaller chunks or tokens (usually words). This process, called tokenization, transforms lengthy text into manageable pieces:
def preprocess(corpus):
    return [doc.lower().split() for doc in corpus]
This function:
- Takes each document in the corpus
- Converts it to lowercase
- Splits it into individual words
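One thing to keep in mind: a bare split() leaves punctuation attached, so "(bow)" and "bow" would count as different tokens. If you want cleaner tokens, a common variation is to extract word characters with a regular expression. Here is a sketch with a hypothetical preprocess_clean helper; the rest of this post sticks with the simple version above.
import re

def preprocess_clean(corpus):
    # \w+ matches runs of letters, digits, and underscores,
    # so parentheses, periods, and colons are dropped
    return [re.findall(r"\w+", doc.lower()) for doc in corpus]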
2. Vocabulary Creation
Next, we build a vocabulary containing all unique words found across the entire corpus:
def build_vocab(tokenized_corpus):
    vocab = set()
    for doc in tokenized_corpus:
        vocab.update(doc)
    return sorted(vocab)
This function:
- Collects every unique word from all documents
- Creates a sorted vocabulary list
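As a quick sanity check, running the first two steps on a tiny two-document corpus gives the expected result (output shown in the comment):
tiny = preprocess(["the cat sat", "the dog sat down"])
print(build_vocab(tiny))  # ['cat', 'dog', 'down', 'sat', 'the']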
3. Vectorization
Finally, we represent each document as a numerical vector based on our vocabulary:
def vectorize(tokenized_corpus, vocab):
    vectors = []
    for doc in tokenized_corpus:
        vector = [doc.count(word) for word in vocab]
        vectors.append(vector)
    return vectors
This function:
- Creates a vector for each document
- Each position in the vector corresponds to a word in our vocabulary
- The value represents how many times that word appears in the document
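A small caveat: doc.count(word) rescans the document once per vocabulary word, so vectorization costs O(len(doc) × len(vocab)). That's fine for our four sentences, but for larger corpora a common optimization (just a sketch, using collections.Counter) counts each document in a single pass:
from collections import Counter

def vectorize_fast(tokenized_corpus, vocab):
    vectors = []
    for doc in tokenized_corpus:
        counts = Counter(doc)  # one pass over the document
        vectors.append([counts[word] for word in vocab])  # missing words count as 0
    return vectors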
Putting It All Together
Let's see how this works with a sample corpus about Bag of Words itself:
corpus = [
    "Bag of Words (BoW) is a method used in NLP to represent text data as numerical features. It works by:",
    "Creating a Vocabulary: Identifying all the unique words in a given set of text documents (corpus).",
    "Counting Word Occurrences: Creating a bag (like a set) for each document containing the frequency (or count) of each word in the vocabulary. The order of the words is not considered (hence, bag).",
    "Representing as Vectors: Converting these bags into numerical vectors, where each element of the vector corresponds to the count of a specific word in the vocabulary."
]
# Step 1: Tokenize
tokenized = preprocess(corpus)
# Step 2: Build vocabulary
vocab = build_vocab(tokenized)
# Step 3: Vectorize
vectors = vectorize(tokenized, vocab)
print("Vocabulary:", vocab)
for i, vec in enumerate(vectors):
    print(f"Doc {i+1} vector:", vec)
Output of Our Code:
Vocabulary: ['(bow)', '(corpus).', '(hence,', '(like', '(or', 'a', 'all', 'as', 'bag', 'bag).', 'bags', 'by:', 'considered', 'containing', 'converting', 'corresponds', 'count', 'count)', 'counting', 'creating', 'data', 'document', 'documents', 'each', 'element', 'features.', 'for', 'frequency', 'given', 'identifying', 'in', 'into', 'is', 'it', 'method', 'nlp', 'not', 'numerical', 'occurrences:', 'of', 'order', 'represent', 'representing', 'set', 'set)', 'specific', 'text', 'the', 'these', 'to', 'unique', 'used', 'vector', 'vectors,', 'vectors:', 'vocabulary.', 'vocabulary:', 'where', 'word', 'words', 'works']
Doc 1 vector: [1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
Doc 2 vector: [0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0]
Doc 3 vector: [0, 0, 1, 1, 1, 2, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 2, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 2, 1, 0, 0, 0, 1, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0]
Doc 4 vector: [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 1, 0, 3, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0]
Understanding the Output
Let's break down what this output means:
The vocabulary is a list of 61 unique words from our corpus, sorted alphabetically (punctuation sorts before letters, which is why '(bow)' comes first). Because our simple whitespace tokenizer keeps punctuation attached, tokens like '(bow)', 'count)', and 'bag).' appear as distinct entries. These words form the "dimensions" of our vector space.
The document vectors show how many times each word appears in each document:
- Document 1 contains the word "(bow)" once [1 in position 0], "bag" once [1 in position 8], "nlp" once [1 in position 35], and so on.
- Document 2 contains the word "a" twice [2 in position 5], "all" once [1 in position 6], and "vocabulary:" once [1 in position 56].
- Document 3 has the word "the" four times [4 in position 47], showing it's the most frequent word in this document.
- Document 4 mentions "the" three times [3 in position 47] and "of" twice [2 in position 39].
What's interesting: We can immediately see patterns in these vectors. For example, Document 3 (about counting word occurrences) contains words like 'frequency', 'counting', and 'bag', which appear as 1's in its vector. Document 4 (about vector representation) has 1's for words like 'vectors,', 'element', and 'numerical'.
These numerical representations allow machine learning algorithms to find patterns, similarities, and differences between documents that would be difficult to detect in raw text.
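In practice you usually wouldn't hand-roll these steps. If you have scikit-learn installed (this sketch assumes version 1.0 or newer), its CountVectorizer implements the same tokenize/build-vocabulary/count pipeline; note that its default tokenizer drops punctuation and single-character tokens, so its vocabulary won't exactly match ours.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()              # lowercases and tokenizes by default
X = vectorizer.fit_transform(corpus)        # sparse document-term count matrix
print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # one count vector per document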
Applications of Bag of Words
This simple yet powerful representation enables many NLP applications:
- Text Classification: Categorizing documents based on their content
- Sentiment Analysis: Determining whether text expresses positive or negative sentiment
- Document Clustering: Grouping similar documents together
- Information Retrieval: Finding relevant documents for a given query
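As one concrete example, information retrieval and clustering often compare documents by the cosine similarity of their count vectors. Here is a minimal pure-Python sketch using the vectors computed earlier:
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# How similar is each document to Document 3?
for i, vec in enumerate(vectors):
    print(f"sim(doc 3, doc {i + 1}) = {cosine_similarity(vectors[2], vec):.2f}")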
Limitations
While powerful, BoW has some limitations:
- It loses word order information ("dog bites man" and "man bites dog" have identical BoW representations, as the sketch after this list demonstrates)
- It doesn't capture semantics (meaning) of words
- It creates sparse vectors with many zeros for large vocabularies
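The first limitation is easy to demonstrate with the functions from this post: the two sentences below get identical vectors, so no downstream model could tell them apart.
docs = ["dog bites man", "man bites dog"]
tokens = preprocess(docs)
vecs = vectorize(tokens, build_vocab(tokens))
print(vecs[0] == vecs[1])  # True: BoW cannot distinguish these sentences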
Conclusion
Bag of Words provides a straightforward way to convert text into numerical features that machine learning algorithms can process. While more advanced techniques like word embeddings (Word2Vec, GloVe) and transformers (BERT, GPT) have been developed, BoW remains a foundational concept in NLP that's important to understand.
The next time you encounter text data, remember that behind sophisticated NLP systems often lies this simple yet effective concept - representing text as numerical vectors by counting word frequencies.
Next in the series: [Coming Soon] Part 3: TF-IDF - Taking Bag of Words to the Next Level