Fundamental Concepts of NLP

Text Representation: From Basic Counting to Semantic Understanding

Natural Language Processing

Explore the fundamental techniques that transform human language into machine-understandable representations. From simple word counting to advanced semantic embeddings, these methods form the backbone of modern NLP applications.

Example Text Corpus

  1. Machine learning is fascinating
  2. Deep learning advances machine learning
  3. Natural language processing uses machine learning

Bag of Words (BoW)

Concept

The Bag of Words model converts text into numerical features by counting word occurrences. It ignores grammar and word order but keeps frequency information. Each unique word in the corpus becomes a feature, creating a vocabulary-based vector representation.

Vocabulary

["advances", "deep", "fascinating", "is", "language", "learning", "machine", "natural", "processing", "uses"]

Example

Sentence: "Machine learning is fascinating"
Tokens: ["machine", "learning", "is", "fascinating"]

Vector Representation

Sentence 1: "Machine learning is fascinating"
[0, 0, 1, 1, 0, 1, 1, 0, 0, 0]

Sentence 2: "Deep learning advances machine learning"
[1, 1, 0, 0, 0, 2, 1, 0, 0, 0]

Sentence 3: "Natural language processing uses machine learning"
[0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
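
As a rough sketch of how these count vectors can be reproduced (assuming scikit-learn is available; its CountVectorizer is one common BoW implementation and sorts the vocabulary alphabetically, matching the list above):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "Machine learning is fascinating",
        "Deep learning advances machine learning",
        "Natural language processing uses machine learning",
    ]

    vectorizer = CountVectorizer()             # lowercases and tokenizes by default
    bow = vectorizer.fit_transform(corpus)     # sparse document-term count matrix

    print(vectorizer.get_feature_names_out())  # the alphabetical vocabulary shown above
    print(bow.toarray())                       # one count vector per sentence
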
Key Points:
  • Only word counts matter - no context or meaning captured
  • "learning" appears in all sentences with a count of 1
  • No sense of word order or grammar
  • Creates sparse vectors (many zeros)
  • Simple and fast but loses semantic information

TF-IDF (Term Frequency–Inverse Document Frequency)

Concept

TF-IDF measures how important a word is in a document relative to the entire corpus. Common words that appear in many documents get low scores, while unique words that appear in few documents get high scores. This helps identify distinctive terms for each document.

Formula

$$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{n_t}\right) $$

Where:
  • TF(t, d) is the frequency of term t in document d (here, its count divided by the document's length)
  • N is the total number of documents in the corpus
  • n_t is the number of documents that contain term t

Example Calculation

Suppose the word "fascinating" occurs once in a 4-word document and is found in only 1 of the 3 documents in the corpus:

TF(fascinating) = 1 / 4 = 0.25
IDF(fascinating) = log(3 / 1) = 1.098
TF-IDF = 0.25 × 1.098 = 0.2745
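
The same arithmetic as a small Python sketch (using the natural logarithm, as in the numbers above):

    import math

    tf  = 1 / 4            # "fascinating" occurs once in a 4-word document
    idf = math.log(3 / 1)  # 3 documents in total, 1 contains the word -> 1.0986...
    print(tf * idf)        # 0.2746..., the TF-IDF weight of "fascinating"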

Vector Representation (Approximate Values)

The vectors below use the smoothed-IDF, L2-normalized variant of TF-IDF (the default in scikit-learn's TfidfVectorizer), so the numbers differ slightly from the basic formula above:

Sentence 1:
[0, 0, 0.61, 0.61, 0, 0.36, 0.36, 0, 0, 0]

Sentence 2:
[0.52, 0.52, 0, 0, 0, 0.61, 0.31, 0, 0, 0]

Sentence 3:
[0, 0, 0, 0, 0.46, 0.27, 0.27, 0.46, 0.46, 0.46]
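
A minimal sketch for reproducing these weighted vectors, again assuming scikit-learn (its TfidfVectorizer applies the smoothed IDF and L2 normalization mentioned above by default):

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "Machine learning is fascinating",
        "Deep learning advances machine learning",
        "Natural language processing uses machine learning",
    ]

    vectorizer = TfidfVectorizer()              # smooth_idf=True, norm="l2" by default
    tfidf = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())   # same alphabetical vocabulary as BoW
    print(tfidf.toarray().round(2))             # rows match the vectors listed above
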
Key Points:
  • "fascinating" (unique word) has high weight → 0.58
  • "learning" and "machine" (common words) have lower weight → 0.29–0.40
  • Produces weighted sparse vectors
  • Highlights document-specific and unique terms
  • Better than BoW for information retrieval tasks

Word2Vec (Semantic Embeddings)

Concept

Word2Vec learns dense numeric representations of words (embeddings) by analyzing their contexts in large text corpora. Words that occur in similar surroundings end up with similar vectors in a multi-dimensional space. It uses either the Skip-Gram (predict context from word) or CBOW (Continuous Bag of Words - predict word from context) approach.

How It Works

Word2Vec creates dense vectors (typically 100–300 dimensions) where semantic relationships are encoded geometrically. Similar words have vectors that point in similar directions, and mathematical operations on vectors preserve meaningful relationships.

Example Context Learning

Sentence: "The king and queen rule the kingdom"
Both "king" and "queen" appear in similar contexts (with words like "rule" and "kingdom"), so their vectors become similar in the embedding space.

Individual Word Vectors (First 5 of ~100 dimensions)

Word         Vector Representation
machine      [ 0.12, -0.23,  0.44,  0.01, -0.35, ...]
learning     [ 0.15, -0.20,  0.47,  0.05, -0.30, ...]
deep         [ 0.42, -0.55,  0.33,  0.12, -0.09, ...]
natural      [-0.18,  0.22,  0.66, -0.04,  0.20, ...]
processing   [-0.15,  0.25,  0.63, -0.03,  0.18, ...]
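
To make "similar direction" concrete, here is a small NumPy sketch of cosine similarity, the standard way to compare embeddings (it uses the truncated 5-dimensional vectors from the table above; real comparisons would use all dimensions):

    import numpy as np

    def cosine(a, b):
        # Cosine similarity: 1.0 = same direction, around 0.0 = unrelated
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    natural    = np.array([-0.18,  0.22, 0.66, -0.04,  0.20])
    processing = np.array([-0.15,  0.25, 0.63, -0.03,  0.18])
    machine    = np.array([ 0.12, -0.23, 0.44,  0.01, -0.35])

    print(cosine(natural, processing))  # close to 1.0 -> very similar directions
    print(cosine(natural, machine))     # noticeably lower -> less closely related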

Semantic Relationships

Each vector encodes semantic relationships, enabling powerful operations:
  • Similarity: words that share contexts (like "natural" and "processing" in the table above) end up with nearby vectors, typically compared with cosine similarity
  • Analogy: vector("king") - vector("man") + vector("woman") ≈ vector("queen")

Key Points:
  • Dense vectors (every dimension has a non-zero value)
  • Meaningful geometry — direction and distance encode semantic relationships
  • Learned automatically from large text corpora (requires substantial data)
  • Captures context and semantic meaning, not just frequency
  • Enables transfer learning and powers many modern NLP applications
  • Can handle synonyms and word analogies naturally
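
A minimal training sketch, assuming the gensim library is installed; the three-sentence corpus above is far too small to produce meaningful embeddings, so this only illustrates the API:

    from gensim.models import Word2Vec

    sentences = [
        ["machine", "learning", "is", "fascinating"],
        ["deep", "learning", "advances", "machine", "learning"],
        ["natural", "language", "processing", "uses", "machine", "learning"],
    ]

    # sg=1 selects Skip-Gram; sg=0 would use CBOW instead
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

    print(model.wv["machine"][:5])           # first 5 dimensions of one embedding
    print(model.wv.most_similar("machine"))  # nearest neighbours by cosine similarity
    # Analogy-style queries (only meaningful after training on a large corpus):
    # model.wv.most_similar(positive=["king", "woman"], negative=["man"])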

Comprehensive Comparison

Bag of Words
  • Representation type: Sparse count vector
  • What it captures: Word presence and frequency
  • Example vector: [0, 0, 1, 1, 0, ...]
  • Advantages: Simple, fast, interpretable, works well for small datasets
  • Limitations: Ignores meaning, context, word order, and semantic relationships

TF-IDF
  • Representation type: Weighted sparse vector
  • What it captures: Word importance relative to the corpus
  • Example vector: [0, 0, 0.61, 0.61, ...]
  • Advantages: Highlights unique words, reduces the impact of common words, better for search and classification
  • Limitations: Still ignores semantic meaning, context, and word relationships

Word2Vec
  • Representation type: Dense semantic vector
  • What it captures: Contextual meaning and semantic relations
  • Example vector: [0.12, -0.23, 0.44, ...]
  • Advantages: Captures deep semantics, enables analogies, context-aware, supports transfer learning
  • Limitations: Requires large datasets, computationally expensive to train, less interpretable

Summary

The evolution from BoW to Word2Vec marks the shift from simple statistical counting toward learned, neural representations in Natural Language Processing. Each technique still has its place, depending on the use case and the computational budget.