Understanding Text Representation: From Basic Counting to Semantic Understanding
Explore the fundamental techniques that transform human language into machine-understandable representations. From simple word counting to advanced semantic embeddings, these methods form the backbone of modern NLP applications.
The Bag of Words model converts text into numerical features by counting word occurrences. It ignores grammar and word order but keeps frequency information. Each unique word in the corpus becomes a feature, creating a vocabulary-based vector representation.
["advances", "deep", "fascinating", "is", "language", "learning", "machine", "natural", "processing", "uses"]
Sentence: "Machine learning is fascinating"
Tokens: ["machine", "learning", "is", "fascinating"]
Sentence 1: "Machine learning is fascinating"
[0, 0, 1, 1, 0, 1, 1, 0, 0, 0]
Sentence 2: "Deep learning advances machine learning"
[1, 1, 0, 0, 0, 2, 1, 0, 0, 0]
Sentence 3: "Natural language processing uses machine learning"
[0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
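A minimal sketch of this counting step, using scikit-learn's CountVectorizer (the library choice is an assumption; any tokenize-and-count routine would produce the same vectors):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Machine learning is fascinating",
    "Deep learning advances machine learning",
    "Natural language processing uses machine learning",
]

# Lowercases, tokenizes, and counts each vocabulary word per sentence
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
# ['advances' 'deep' 'fascinating' 'is' 'language' 'learning' 'machine'
#  'natural' 'processing' 'uses']
print(counts.toarray())
# [[0 0 1 1 0 1 1 0 0 0]
#  [1 1 0 0 0 2 1 0 0 0]   <- "learning" occurs twice, so its count is 2
#  [0 0 0 0 1 1 1 1 1 1]]
```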
TF-IDF measures how important a word is in a document relative to the entire corpus. Common words that appear in many documents get low scores, while unique words that appear in few documents get high scores. This helps identify distinctive terms for each document.
$$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{n_t}\right) $$
Where:
- TF(t, d): the frequency of term t in document d, often normalized by the document's length
- N: the total number of documents in the corpus
- n_t: the number of documents that contain term t
- log: the natural logarithm (base e), as used in the worked example below
Suppose the word "fascinating" appears once in a 4-word document, and in only 1 of the 3 documents in the corpus:
TF(fascinating) = 1 / 4 = 0.25
IDF(fascinating) = log(3 / 1) ≈ 1.099
TF-IDF(fascinating) = 0.25 × 1.099 ≈ 0.275
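These numbers are easy to verify directly (Python's math.log is the natural logarithm):

```python
import math

tf = 1 / 4             # "fascinating" once in a 4-word document
idf = math.log(3 / 1)  # N = 3 documents, n_t = 1 contains the term
print(idf)             # 1.0986...
print(tf * idf)        # 0.2746... ≈ 0.275
```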
Applying TF-IDF to the three sentences gives the vectors below, computed with scikit-learn's default conventions: a smoothed IDF, log((1 + N) / (1 + n_t)) + 1, plus L2 normalization of each row. Under the plain formula above, "machine" and "learning" would score exactly zero because they appear in every document; the smoothed variant keeps them at a small nonzero weight.
Sentence 1:
[0, 0, 0.61, 0.61, 0, 0.36, 0.36, 0, 0, 0]
Sentence 2:
[0.52, 0.52, 0, 0, 0, 0.61, 0.31, 0, 0, 0]
Sentence 3:
[0, 0, 0, 0, 0.46, 0.27, 0.27, 0.46, 0.46, 0.46]
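A sketch that reproduces these vectors with scikit-learn's TfidfVectorizer, whose defaults match the smoothed, L2-normalized convention noted above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

sentences = [
    "Machine learning is fascinating",
    "Deep learning advances machine learning",
    "Natural language processing uses machine learning",
]

# Defaults: smoothed IDF (log((1 + N) / (1 + n_t)) + 1) and L2-normalized rows
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(sentences)

print(np.round(matrix.toarray(), 2))
# [[0.   0.   0.61 0.61 0.   0.36 0.36 0.   0.   0.  ]
#  [0.52 0.52 0.   0.   0.   0.61 0.31 0.   0.   0.  ]
#  [0.   0.   0.   0.   0.46 0.27 0.27 0.46 0.46 0.46]]
```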
Word2Vec learns dense numeric representations of words (embeddings) by analyzing their contexts in large text corpora. Words that occur in similar surroundings end up with similar vectors in a multi-dimensional space. It uses either the Skip-Gram (predict context from word) or CBOW (Continuous Bag of Words - predict word from context) approach.
Word2Vec creates dense vectors (typically 100–300 dimensions) where semantic relationships are encoded geometrically. Similar words have vectors that point in similar directions, and mathematical operations on vectors preserve meaningful relationships.
Sentence: "The king and queen rule the kingdom"
Both "king" and "queen" appear in similar contexts (with words like "rule" and "kingdom"), so their vectors become similar in the embedding space.
Word Vector Representation (illustrative values)
machine [ 0.12, -0.23, 0.44, 0.01, -0.35, ...]
learning [ 0.15, -0.20, 0.47, 0.05, -0.30, ...]
deep [ 0.42, -0.55, 0.33, 0.12, -0.09, ...]
natural [-0.18, 0.22, 0.66, -0.04, 0.20, ...]
processing [-0.15, 0.25, 0.63, -0.03, 0.18, ...]
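Cosine similarity makes this geometry concrete. A small numpy sketch over just the five dimensions shown above (illustrative values, not trained embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

machine  = np.array([0.12, -0.23, 0.44, 0.01, -0.35])
learning = np.array([0.15, -0.20, 0.47, 0.05, -0.30])
natural  = np.array([-0.18, 0.22, 0.66, -0.04, 0.20])

print(cosine_similarity(machine, learning))  # ≈ 0.99, near-parallel vectors
print(cosine_similarity(machine, natural))   # ≈ 0.32, much less aligned
```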
Each vector encodes semantic relationships, enabling powerful operations:
cosine_similarity("machine", "learning") → high similarity (words used together)cosine_similarity("machine", "banana") → low similarity (unrelated concepts)king - man + woman ≈ queenParis - France + Italy ≈ Rome| Technique | Representation Type | What It Captures | Example Vector | Advantages | Limitations |
|---|---|---|---|---|---|
| Bag of Words | Sparse count vector | Word presence and frequency | [0, 0, 1, 1, 0, ...] |
Simple, fast, interpretable, works well for small datasets | Ignores meaning, context, word order, and semantic relationships |
| TF-IDF | Weighted sparse vector | Word importance relative to corpus | [0.58, 0, 0.33, ...] |
Highlights unique words, reduces common word impact, better for search and classification | Still ignores semantic meaning, context, and word relationships |
| Word2Vec | Dense semantic vector | Contextual meaning and semantic relations | [0.12, -0.23, 0.44, ...] |
Captures deep semantics, enables analogies, context-aware, supports transfer learning | Requires large datasets, computationally expensive to train, less interpretable |
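As promised above, here is a sketch checking the similarity and analogy operations against real pretrained embeddings, using gensim's downloader API. It assumes the glove-wiki-gigaword-50 model (roughly a 66 MB one-time download); these are GloVe vectors rather than Word2Vec-trained ones, but they support the same operations:

```python
import gensim.downloader as api

# Downloads and caches 50-dimensional GloVe vectors on first use
vectors = api.load("glove-wiki-gigaword-50")

print(vectors.similarity("machine", "learning"))  # relatively high
print(vectors.similarity("machine", "banana"))    # much lower

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# paris - france + italy ≈ rome (this model's vocabulary is lowercased)
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
```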
The evolution from BoW to Word2Vec represents the journey from simple statistical methods to deep learning approaches in Natural Language Processing, each serving different use cases and computational requirements.