How To Train Your Own Language Model - Part 1: Tokenization
TL;DR: Modern LLMs use subword tokenization (BPE, WordPiece, or Unigram) to balance vocabulary size with sequence length. Tokenization directly affects API costs, training compute, and model capabilities. It’s why LLMs struggle with arithmetic and spelling. For custom domains, train your own tokenizer using HuggingFace’s tokenizers library. Use Tiktoken for speed, SentencePiece for multilingual.
Table of Contents
- What is tokenization?
- Token IDs and Embeddings
- Building Intuition: How Tokenizers Work
- Deep Dive: BPE, WordPiece, Unigram
- Real World Comparisons
- Real-World Tokenization Challenges
- Building Your Own Tokenizer
- Tokenization Tools Comparison
- Conclusion
The wrong tokenization strategy could be costing you $500K+ annually.
Here’s why: language model APIs charge by the token. OpenAI, Anthropic, Google, all of them. What most people don’t realize is that the same sentence can be 10 tokens or 50 tokens depending on how you tokenize it. For a company processing billions of API calls, that difference compounds into a serious hole in the budget.
But it goes deeper than API costs. If you are training a model, tokenization affects:
- Training time - more tokens means more compute
- Context window utilization - inefficient tokenization wastes your precious context length
- Model quality - a tokenizer that butchers your domain vocabulary will handicap your model from day one
I have been meaning to understand how language models are trained for a while now. That curiosity led me down the rabbit hole of building one from scratch. This is Part 1 of what I hope becomes a comprehensive series and we are starting with tokenization because it is the foundation everything else sits on.
By the end of this post, you will understand why different tokenization methods exist, how the numbers actually work, and how to build your own tokenizer from scratch.
Let’s get into it.
What is tokenization?
Neural networks don’t understand words, sentences, or characters; they understand numbers. Tokenization is the process of converting text into numbers. We will explain what these numbers represent shortly. But here’s the key insight: how you convert text to numbers matters enormously.
Tokenization pipeline
Tokenization happens in stages. Each stage transforms text before passing it to the next:
| Stage | Purpose | Example |
|---|---|---|
| Normalizer | Standardizes text (lowercasing, unicode normalization, whitespace cleanup) | "HELLO World" -> "hello world" |
| Pre-tokenizer | Splits text into preliminary chunks (usually by whitespace/punctuation) | "hello world" -> ["hello", " world"] |
| Model | Applies the tokenization algorithm (BPE, WordPiece, Unigram) | ["hello", " world"] -> [9906, 1917] |
| Post-processor | Adds special tokens (BOS, EOS, padding) | [9906, 1917] -> [1, 9906, 1917, 2] |
| Decoder | Converts token IDs back to text | [9906, 1917] -> "hello world" |
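To see these stages on a real tokenizer, you can inspect the components a HuggingFace fast tokenizer exposes through its tokenizers backend. A minimal sketch (assuming bert-base-uncased is available; the printed values are shown as comments and may differ slightly by version):
from transformers import AutoTokenizer
# bert-base-uncased ships a normalizer, pre-tokenizer, WordPiece model, and post-processor
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
backend = tokenizer.backend_tokenizer
# Stage 1: normalization (lowercasing, accent stripping, ...)
print(backend.normalizer.normalize_str("HELLO World"))
# hello world
# Stage 2: pre-tokenization (split into preliminary chunks, with character offsets)
print(backend.pre_tokenizer.pre_tokenize_str("hello world"))
# [('hello', (0, 5)), ('world', (6, 11))]
# Stages 3-4: the model and post-processor run inside encode()
encoding = backend.encode("hello world")
print(encoding.tokens)
# ['[CLS]', 'hello', 'world', '[SEP]']
# Stage 5: decoding token IDs back to text
print(backend.decode(encoding.ids))
# hello world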
Three levels of granularity
1. Character-level: Every character is a token.
"Hello World" -> ["H", "e", "l", "l", "o", " ", "W", "o", "r", "l", "d"] -> [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]
Pros:
- Tiny vocabulary (~256 for ASCII, ~65K for Unicode)
- Can represent any text, no “unknown” tokens
Cons:
- Sequences become very long (expensive!)
- Model must learn spelling from scratch
2. Word-level: Every word is a token.
"Hello world" -> ["Hello", "world"] -> [15496, 995]
Pros:
- Semantically meaningful units
- Short sequences
Cons:
- Huge vocabulary (English has 170K+ words)
- Can’t handle typos, new words, or morphology (“running” != “run”)
3. Subword-level: Breaks words into meaningful pieces.
"unhappiness" -> ["un", "happiness"] -> [403, 26019]
Pros:
- Reasonable vocabulary size (32K–100K typical)
- Handles rare words by decomposition
- Captures morphological patterns (“running” -> “run” + “ning”)
- This is what modern LLMs use
Cons:
- More complex to implement
Modern tokenizers (BPE, WordPiece, Unigram) are all trying to find the optimal balance by maximizing compression while maintaining meaningful units.
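Before getting into the algorithms, a quick sketch makes the trade-off tangible: the same sentence at all three granularities, using plain Python for the character and word levels and GPT-2's tokenizer (via tiktoken, introduced later in this post) for the subword level. Exact counts depend on the sentence you pick.
import tiktoken
text = "Tokenization balances vocabulary size against sequence length."
# Character-level: one token per character (IDs could simply be Unicode code points)
char_tokens = list(text)
# Word-level: naive whitespace split
word_tokens = text.split()
# Subword-level: GPT-2's byte-level BPE
enc = tiktoken.get_encoding("gpt2")
subword_tokens = enc.encode(text)
print(f"Characters: {len(char_tokens)} tokens")
print(f"Words:      {len(word_tokens)} tokens")
print(f"Subwords:   {len(subword_tokens)} tokens")
# Subword counts land between the two extremes: far shorter than
# character-level, only slightly longer than word-level.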
The Numbers Explained: Token IDs and Embeddings
Before we dive into the algorithms, let’s understand what these numbers actually mean. When you see "Hello" -> [15496], what does that 15496 represent?
Token IDs: The Vocabulary Index
A token ID is simply an index into the vocabulary. Think of it like a dictionary:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# The vocabulary is a dictionary: token -> ID
print(f"Vocabulary size: {len(tokenizer.vocab)}")
# Vocabulary size: 50257
# Look up a token
token = "hello"
token_id = tokenizer.convert_tokens_to_ids(token)
print(f"'{token}' -> ID {token_id}")
# 'hello' -> ID 31373
# Reverse lookup: ID -> token
print(f"ID {token_id} -> '{tokenizer.convert_ids_to_tokens([token_id])[0]}'")
# ID 31373 -> 'hello'
Key insight: Token IDs are just integers from 0 to vocab_size - 1. They have no inherent meaning; they are just positions in the vocabulary array.
From Token IDs to Embeddings
When you pass token IDs to a language model, they get converted to embeddings, dense vector representations that the model can work with.
Text: "Hello World"
↓ Tokenization
Token IDs: [15496, 995]
↓ Embedding Lookup
Embeddings: [[0.1, 0.3, -0.2, ...], [0.5, -0.1, 0.8, ...]]
↓ Model Processing
Output: ...
The Embedding Layer
Every language model has an embedding layer (also called a lookup table):
import torch
import torch.nn as nn
# Simplified embedding layer
vocab_size = 50257
embedding_dim = 768 # GPT-2's embedding dimension
# Create embedding layer
embedding = nn.Embedding(vocab_size, embedding_dim)
# Token IDs
token_ids = torch.tensor([15496, 995]) # "Hello World"
# Convert to embeddings
embeddings = embedding(token_ids)
print(embeddings.shape)
# torch.Size([2, 768]) # 2 tokens, each with 768-dimensional vector
What’s happening:
- Token ID 15496 -> Look up row 15496 in the embedding matrix
- Each row is a learned vector (e.g. 768 dimensions for GPT-2)
- The model learns these embeddings during training
- Similar tokens end up with similar embeddings (closer in vector space)
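Because a token ID is literally a row index, the embedding lookup is nothing more than slicing the weight matrix. A tiny sketch to convince yourself (same shapes as the nn.Embedding example above):
import torch
import torch.nn as nn
embedding = nn.Embedding(50257, 768)
token_id = 15496
# Output of the embedding layer for one token ...
via_layer = embedding(torch.tensor([token_id]))[0]
# ... is exactly row `token_id` of the weight matrix
via_indexing = embedding.weight[token_id]
print(torch.equal(via_layer, via_indexing))
# True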
Why Token IDs Matter
- Memory efficiency:
- Storing token IDs (integers) is much smaller than storing token strings
- A 4-byte integer vs potentially 10+ bytes per token string
- Model input:
- Models expect integer tensors, not strings
- Token IDs are the interface between text and neural networks
- Embedding lookup:
- Token IDs directly index into the embedding matrix
- Fast, efficient array indexing
- Batch processing:
- All sequences in a batch must have the same length
- Token IDs make padding/truncation straightforward
Padding & Truncation
When processing multiple sequences, they need to be the same length:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
texts = [
"Hello",
"Hello world",
"Hello world, how are you?"
]
# Encode with padding
encodings = tokenizer(
texts,
padding=True, # Pad shorter sequences
truncation=True, # Truncate longer sequences
max_length=10, # Maximum length
return_tensors="pt" # Return PyTorch tensors
)
print(encodings["input_ids"])
# tensor([[15496, 50257, 50257, 50257, 50257, 50257, 50257],
# [15496, 995, 50257, 50257, 50257, 50257, 50257],
# [15496, 995, 11, 703, 389, 345, 30]])
# The padding token ID is typically 0 or a special token
print(f"Padding token ID: {tokenizer.pad_token_id}")
# Padding token ID: 50257
Padding token:
- Usually ID 0 or a newly added special token like `[PAD]` (here it gets ID 50257, the first free slot after GPT-2's original vocabulary)
- The model learns to ignore padding during attention (see the sketch below)
- Padding doesn’t contribute to loss calculations
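The "ignore padding" part works through the attention mask that the tokenizer returns alongside input_ids. A short sketch repeating the padding setup above (the mask layout shown assumes right-padding, GPT-2's default):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
encodings = tokenizer(
    ["Hello", "Hello world", "Hello world, how are you?"],
    padding=True,
    return_tensors="pt",
)
# attention_mask marks real tokens with 1 and padding with 0;
# attention to the 0 positions is masked out inside the model
print(encodings["attention_mask"])
# tensor([[1, 0, 0, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0, 0],
#         [1, 1, 1, 1, 1, 1, 1]])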
Building Intuition: How Tokenizers Work
Before diving into BPE, WordPiece, and Unigram, let’s build a simple tokenizer from scratch. This will give you a concrete understanding of what tokenization actually does.
A Simple Word-Level Tokenizer
Let’s start with the simplest possible approach: word-level tokenization. Every unique word gets its own token ID.
Step 1: Building the Vocabulary
First, we need to create a vocabulary, a mapping from words to numbers. We will:
- Read a corpus of text
- Extract all unique words
- Assign each word a unique integer ID
You can use the text at this link as a training corpus: copy it into a .txt file and save it as training-corpus.txt.
import re
# Read a text file (our training corpus)
with open("training-corpus.txt", "r", encoding="utf-8") as file:
raw_text = file.read()
# Preprocess: split on punctuation and whitespace, keep the delimiters
# This regex splits on punctuation/whitespace but keeps them as separate tokens
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s+)', raw_text)
# Remove empty strings and strip whitespace
preprocessed = [item.strip() for item in preprocessed if item.strip()]
# Get all unique tokens (words + punctuation)
all_words = sorted(set(preprocessed))
# Add special tokens
# <|endoftext|> marks the end of a document
# <|unk|> represents unknown words (words not in vocabulary)
all_words.extend(["<|endoftext|>", "<|unk|>"])
# Create vocabulary: word -> integer ID
vocab = {token: integer for integer, token in enumerate(all_words)}
print(f"Vocabulary size: {len(vocab)}")
print(f"First 10 tokens: {list(vocab.items())[:10]}")
# Vocabulary size: 709
# First 10 tokens: [('!', 0), (',', 1), ('.', 2), (':', 3), (';', 4), ('?', 5), ('A', 6), ('Again', 7), ('Agnia', 8), ('Agnia’s', 9)]
What just happened?
- We split text into tokens (words and punctuation)
- Created a vocabulary mapping each unique token to an integer
- Added special tokens for control flow (`<|endoftext|>` and `<|unk|>`)
Step 2: The Tokenizer Class
class SimpleTokenizerV1:
def __init__(self, vocab):
"""
Initialize with a vocabulary.
vocab: dictionary mapping token strings to integer IDs
"""
# Forward mapping: token -> ID (for encoding)
self.str_to_int = vocab
# Reverse mapping: ID -> token (for decoding)
self.int_to_str = {v: k for k, v in vocab.items()}
def encode(self, text):
"""
Convert text into a list of token IDs.
Process:
1. Split text into tokens (same preprocessing as vocabulary building)
2. Replace unknown tokens with <|unk|>
3. Convert each token to its ID
"""
# Same preprocessing as when building vocabulary
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s+)', text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
# Handle unknown words: if token not in vocab, use <|unk|>
preprocessed = [
item if item in self.str_to_int else "<|unk|>"
for item in preprocessed
]
# Convert tokens to IDs
ids = [self.str_to_int[item] for item in preprocessed]
return ids
def decode(self, ids):
"""
Convert token IDs back into text.
Process:
1. Map each ID to its token string
2. Join tokens with spaces
3. Clean up spacing around punctuation
"""
# Convert IDs back to tokens
tokens = [self.int_to_str[id] for id in ids]
# Join with spaces
text = " ".join(tokens)
# Fix spacing: remove space before punctuation
# "Hello , world" -> "Hello, world"
text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
return text
Step 3: Using the Tokenizer
# Initialize tokenizer with our vocabulary
tokenizer = SimpleTokenizerV1(vocab)
# Example texts
text1 = "Hello, do you like tea?"
text2 = "<|endoftext|> In the sunlit terraces of the palace."
text = " ".join([text1, text2])
# Encode: text -> numbers
ids = tokenizer.encode(text)
print("Encoded IDs:", ids)
# Output (IDs depend on your vocabulary):
# [708, 1, 200, 670, 351, 708, 5, 707, 34, 576, 708, 708, 401, 576, 708, 2]
# Decode: numbers -> text
decoded = tokenizer.decode(ids)
print("Decoded text:", decoded)
# Output: "<|unk|>, do you like <|unk|>? <|endoftext|> In the <|unk|> <|unk|> of the <|unk|>."
What’s happening under the hood?
Text: "Hello, do you like tea?"
Step 1: Preprocess (split on punctuation/whitespace)
-> ["Hello", ",", " ", "do", " ", "you", " ", "like", " ", "tea", "?"]
Step 2: Strip and filter empty strings
-> ["Hello", ",", "do", "you", "like", "tea", "?"]
Step 3: Map to IDs using vocabulary
-> [708, 1, 200, 670, 351, 708, 5] (IDs from our vocabulary lookup)
Step 4: Decode back (reverse mapping)
-> ["<|unk|>", ",", "do", "you", "like", "<|unk|>", "?"]
Step 5: Join and clean spacing
-> "<|unk|>, do you like <|unk|>?" (words missing from the vocabulary stay as <|unk|>)
Now that you understand the core mechanics of tokenization, building a vocabulary, encoding text to IDs, and decoding back, let’s tackle the real question: how do modern tokenizers solve the vocabulary-size vs. sequence-length tradeoff?
Word-level tokenization is simple but problematic: huge vocabularies, no handling of new words, and no morphological awareness. Character-level goes to the other extreme: tiny vocabulary, but extremely long sequences.
Subword tokenization finds the sweet spot. Let’s dive into the three algorithms that power today’s language models: BPE, WordPiece, and Unigram.
Deep Dive: The Three Major Subword Tokenizers
Byte Pair Encoding (BPE)
Byte Pair Encoding (BPE) is a subword tokenization algorithm that builds a vocabulary by iteratively merging the most frequent pairs of adjacent tokens.
The core idea: start with a base vocabulary of individual characters (or bytes), repeatedly merge the most frequent adjacent pair into a new token, and stop when you reach your target vocabulary size. The result is a vocabulary that naturally captures common subwords, morphological patterns, and even full words for frequent terms.
Let’s walk through an example to understand how BPE works.
Training Phase: Building the Vocabulary
Training text: "low low low low lowest newest"
Step 0: Initialisation
First, we split each word into characters and add an end of word marker (typically `</w>`)
Vocabulary: {l, o, w, n, e, s, t, </w>}
Word Frequencies: {"l o w </w>": 4, "l o w e s t </w>": 1, "n e w e s t </w>": 1}
Step 1: Count all adjacent pairs
We count how many times each pair of adjacent tokens appears:
Pair Counts: {("l", "o"): 5, ("o", "w"): 5, ("w", "</w>"): 4, ("w", "e"): 2, ("e", "s"): 2, ("s", "t"): 2, ("t", "</w>"): 2, ("n", "e"): 1, ("e", "w"): 1}
The most frequent pair is (l, o) with 5 occurrences (tied with (o, w); ties are broken arbitrarily). Merge it into a new token:
Step 2: Merge the most frequent pair
Create a new token `lo` by merging `l` and `o`:
Vocabulary: {l, o, w, n, e, s, t, </w>, lo}
Word Frequencies: {"lo w </w>": 4, "lo w e s t </w>": 1, "n e w e s t </w>": 1}
Step 3: Recount and merge again
Pair Counts: {("lo", "w"): 5, ("w", "</w>"): 4, ("w", "e"): 2, ("e", "s"): 2, ("s", "t"): 2, ("t", "</w>"): 2, ("n", "e"): 1, ("e", "w"): 1}
Merge the most frequent pair "lo" and "w" with 5 occurrences:
Word Frequencies: {"low </w>": 4, "low e s t </w>": 1, "n e w e s t </w>": 1}
Step 4: Continue merging
Next most frequent pairs might be:
- `(e, s)` -> `es` (appears 2 times)
- `(es, t)` -> `est` (appears 2 times)
- `(n, e)` -> `ne` (appears 1 time)
- `(ne, w)` -> `new` (appears 1 time)
After several more iterations, our vocabulary might look like:
Vocabulary: {l, o, w, e, s, t, n, </w>, lo, low, es, est, ne, new}
Notice what happened:
- Common words became single tokens: `low` (appeared 4 times)
- Morphological patterns emerged: `est` (suffix for superlatives)
- Rare words stay decomposed: `newest` might become `["new", "est"]`
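Here is a minimal, illustrative implementation of that training loop on the same toy corpus (a sketch in the spirit of the original Sennrich et al. formulation; real tokenizer libraries add byte-level handling, caching, and much faster data structures):
from collections import Counter
def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs
def apply_merge(pair, word_freqs):
    """Apply one merge rule to every word, symbol by symbol."""
    merged = {}
    for word, freq in word_freqs.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged
# Words pre-split into characters, with the end-of-word marker
word_freqs = {"l o w </w>": 4, "l o w e s t </w>": 1, "n e w e s t </w>": 1}
merges = []
for _ in range(6):  # in practice: until the vocabulary reaches its target size
    pair_counts = get_pair_counts(word_freqs)
    if not pair_counts:
        break
    best = max(pair_counts, key=pair_counts.get)  # ties broken arbitrarily
    merges.append(best)
    word_freqs = apply_merge(best, word_freqs)
print(merges)
# First merges on this corpus: ('l', 'o'), ('lo', 'w'), ... matching the walkthrough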
Encoding Phase: Tokenizing New Text
Once training is complete, we have a list of merge rules in the order they were learned. To encode new text, we apply these merges iteratively until no more can be applied.
- Split the text into characters
- Iterate through merge rules in the order they were learned
- For each merge rule, check if the pair exists in the current token sequence
- If found, merge the pair
- Continue until no more merges can be applied
- Return the final tokens
Example: encoding `"lowest"` (assume it never got merged into a single token during training)
Let's say our trained BPE tokenizer learned these merge rules (in the order they were learned):
(l, o) -> "lo"
(lo, w) -> "low"
(e, s) -> "es"
(es, t) -> "est"
(n, e) -> "ne"
(ne, w) -> "new"
... (and many more)
Step-by-step encoding:
Initial: Split "lowest" into characters
["l", "o", "w", "e", "s", "t", "</w>"]
Iteration 1: Check merge rule #1: (l, o) -> "lo"
Found! Merge the first occurrence
["lo", "w", "e", "s", "t", "</w>"]
Iteration 2: Check merge rule #2: (lo, w) -> "low"
Found! Merge the pair
["low", "e", "s", "t", "</w>"]
Iteration 3: Check merge rule #3: (e, s) -> "es"
Found! Merge the pair
["low", "es", "t", "</w>"]
Iteration 4: Check merge rule #4: (es, t) -> "est"
Found! Merge the pair
["low", "est", "</w>"]
Iteration 5: Check merge rule #5: (n, e) -> "ne"
Not found in current sequence, skip
Iteration 6: Check merge rule #6: (ne, w) -> "new"
Not found in current sequence, skip
... Continue checking all remaining merge rules ...
No more applicable merges found.
Final result: ["low", "est"]
Even though "lowest" wasn’t in training, BPE handles it by decomposing into known subwords.
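The encoding procedure is just as mechanical. A small sketch that applies a learned merge list to a new word, in the order the merges were learned (assuming a merges list like the one produced by the training sketch above):
def bpe_encode(word, merges):
    """Apply learned merge rules, in training order, to a single word."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair:
                # Collapse the matching pair into one symbol and re-check this position
                symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
            else:
                i += 1
    return symbols
# Merge rules from the walkthrough above
merges = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t")]
print(bpe_encode("lowest", merges))
# ['low', 'est', '</w>']
Stripping the `</w>` marker gives `["low", "est"]`, exactly the decomposition from the walkthrough.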
Why BPE Works
BPE discovers linguistic structure automatically:
- Frequency-based compression: Common words and phrases become single tokens, reducing sequence length
- Morphological awareness: Suffixes like `-ing`, `-tion`, `-est` naturally emerge as tokens
- Graceful degradation: Rare words decompose into known subwords rather than becoming unknown tokens
- Deterministic: Same input always produces same output (important for reproducibility)
Byte-Level BPE: Popularized by GPT-2
Standard BPE operates on characters. Byte-level BPE (used by GPT-2, GPT-3, GPT-4) operates on bytes (0-255).
Why bytes? Because any UTF-8 character is 1-4 bytes, so ANY text can be encoded.
Standard BPE might start with:
base_vocab = ['a', 'b', 'c', ..., 'z', 'A', 'B', ..., 'Z', '0', '1', ...]
Problem: What about Chinese characters? Emojis? Code?
Byte-level BPE starts with:
base_vocab = list(range(256)) # All possible bytes
# Any UTF-8 character is 1-4 bytes, so ANY text can be encoded
Benefits:
- No unknown tokens ever: Any UTF-8 text can be encoded
- Language-agnostic: Works for any language without modification
- Fixed base vocabulary: Always exactly 256 tokens
- Handles everything: Emojis, code, special characters, etc.
Code Example
import tiktoken
# GPT-2 uses byte-level BPE
tokenizer = tiktoken.get_encoding("gpt2")
# Tokenize some text
text = "tokenization is fascinating"
tokens = tokenizer.encode(text)
print(f"Text: {text}")
# Text: tokenization is fascinating
print(f"Token IDs: {tokens}")
# Token IDs: [30001, 1634, 318, 13899]
print("\nToken breakdown:")
for i, token_id in enumerate(tokens):
token_str = tokenizer.decode([token_id])
print(f" Token {i+1}: ID={token_id:5d} -> '{token_str}'")
# Token breakdown:
# Token 1: ID=30001 -> 'token'
# Token 2: ID= 1634 -> 'ization'
# Token 3: ID= 318 -> ' is'
# Token 4: ID=13899 -> ' fascinating'
text = "supercalifragilisticexpialidocious"
tokens = tokenizer.encode(text)
print(f"Text: {text}")
# Text: supercalifragilisticexpialidocious
print(f"Token IDs: {tokens}")
# Token IDs: [16668, 9948, 361, 22562, 346, 396, 501, 42372, 498, 312, 32346]
print("\nToken breakdown:")
for i, token_id in enumerate(tokens):
token_str = tokenizer.decode([token_id])
print(f" Token {i+1}: ID={token_id:5d} -> '{token_str}'")
# Token breakdown:
# Token 1: ID=16668 -> 'super'
# Token 2: ID= 9948 -> 'cal'
# Token 3: ID= 361 -> 'if'
# Token 4: ID=22562 -> 'rag'
# Token 5: ID= 346 -> 'il'
# Token 6: ID= 396 -> 'ist'
# Token 7: ID= 501 -> 'ice'
# Token 8: ID=42372 -> 'xp'
# Token 9: ID= 498 -> 'ial'
# Token 10: ID= 312 -> 'id'
# Token 11: ID=32346 -> 'ocious'
Limitations of BPE
While BPE is powerful, it has some limitations:
- Greedy algorithm: Locally optimal merges may not be globally optimal
- No probabilistic reasoning: Doesn’t consider context or multiple valid segmentations
- Morphologically rich languages: Languages like Turkish or Finnish with complex inflections may get suboptimal splits
- Rare word handling: Very rare words might decompose into single characters
Who uses BPE?
BPE (especially byte-level BPE) is the most widely used tokenization method: GPT-2, GPT-3, and GPT-4 use byte-level BPE, as do LLaMA 1/2/3 and Mistral; Claude (Anthropic’s models) uses BPE variants; and most open-source LLMs default to BPE.
WordPiece
WordPiece is a subword tokenization algorithm developed at Google, originally for Japanese and Korean speech recognition, then adapted for BERT in 2018. It’s similar to BPE in structure but uses a different criterion for selecting which pairs to merge.
The core idea: start with a base vocabulary of individual characters, repeatedly merge the pair that maximizes the likelihood of the training data (rather than just frequency), and stop when you reach your target vocabulary size. This subtle change leads to better handling of rare words and more linguistically meaningful merges.
Training text: "low low low low lowest newest"
Step 0: Initialisation
First, we split each word into characters. Tokens that don't start a word get the ## prefix.
Initial vocabulary: {l, o, w, n, e, s, t, ##o, ##w, ##e, ##s, ##t, ...}
Word representations:
"low" (4 times) -> [l, ##o, ##w]
"lowest" (1 time) -> [l, ##o, ##w, ##e, ##s, ##t]
"newest" (1 time) -> [n, ##e, ##w, ##e, ##s, ##t]
Step 1: Count individual token frequencies
We need to know how often each token appears individually:
Token Frequencies: {l: 5, ##o: 5, ##w: 6, ##e: 3, ##s: 2, ##t: 2, n: 1, ...}
Step 2: Count all adjacent pairs and compute WordPiece scores
We count pair frequencies and compute: score = count(ab) / (count(a) × count(b))
Pair (l, ##o):
count(l, ##o) = 5
count(l) = 5, count(##o) = 5
score = 5 / (5 × 5) = 0.2
Pair (##o, ##w):
count(##o, ##w) = 5
count(##o) = 5, count(##w) = 6
score = 5 / (5 × 6) = 0.167
Pair (##e, ##s):
count(##e, ##s) = 2
count(##e) = 3, count(##s) = 2
score = 2 / (3 × 2) = 0.333
Pair (##s, ##t):
count(##s, ##t) = 2
count(##s) = 2, count(##t) = 2
score = 2 / (2 × 2) = 0.5 -> Highest score!
The pair (##s, ##t) has the highest WordPiece score (0.5), so we merge it first.
Step 3: Merge the highest scoring pair
Create a new token ##st by merging ##s and ##t:
Vocabulary now includes: {..., ##st}
Word representations update:
"lowest" -> [l, ##o, ##w, ##e, ##st]
"newest" -> [n, ##e, ##w, ##e, ##st]
Step 4: Recount and merge again
Recount token frequencies and pair scores, then merge the next highest-scoring pair.
This process continues until the vocabulary reaches the target size.
After several more iterations, our vocabulary might look like:
{l, o, w, e, s, t, n, ##o, ##w, ##e, ##s, ##t, ##st, ##est, lo, low, ne, new, ...}
Note: Tokens starting words have no prefix (l, low, new), while
continuation tokens have ## prefix (##est, ##ing, ##s).
Notice what happened:
- WordPiece merged (##s, ##t) before (l, ##o) even though (l, ##o) is more frequent, because (##s, ##t) has a higher score: those tokens appear together more consistently
- More informative merges: The scoring formula favors pairs where individual tokens are rare but the combination is common
- Morphological patterns still emerge: `##est`, `low`, `new`
- The `##` prefix explicitly marks continuation tokens vs. word-starting tokens
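To make the scoring concrete, here is a small sketch that computes the WordPiece score count(ab) / (count(a) × count(b)) for every adjacent pair in the toy corpus (illustrative only; the real trainer re-counts and re-scores after every merge):
from collections import Counter
# Words split WordPiece-style: continuation pieces carry the ## prefix
word_freqs = {
    ("l", "##o", "##w"): 4,
    ("l", "##o", "##w", "##e", "##s", "##t"): 1,
    ("n", "##e", "##w", "##e", "##s", "##t"): 1,
}
token_counts = Counter()
pair_counts = Counter()
for word, freq in word_freqs.items():
    for tok in word:
        token_counts[tok] += freq
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += freq
# score = count(ab) / (count(a) * count(b))
scores = {
    pair: count / (token_counts[pair[0]] * token_counts[pair[1]])
    for pair, count in pair_counts.items()
}
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
# ('##s', '##t') 0.5   <- highest score, merged first
# ... lower-scoring pairs like ('l', '##o') at 0.2 follow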
Encoding Phase: Tokenizing New Text
WordPiece uses a longest-match-first (greedy left-to-right) algorithm for encoding:
- Start from the beginning of the word
- Find the longest token that matches from the current position
- Mark it with `##` if it’s not at word start
- Move to the next position
- Repeat until the entire word is tokenized
Example: encoding "tokenization" (not in training data)
Let's say our trained WordPiece tokenizer has this vocabulary:
{token, ##ization, ##tion, ##ing, to, ken, ...}
Step-by-step encoding:
Start: "tokenization"
Iteration 1: Find longest match from start
Check: "tokenization" -> not in vocab
Check: "tokenizatio" -> not in vocab
Check: "tokenizati" -> not in vocab
...
Check: "token" -> found in vocab!
Result: ["token"], remaining: "ization"
Iteration 2: Find longest match from "ization"
Check: "ization" -> found in vocab!
Mark with ## (not at word start)
Result: ["token", "##ization"]
Even though "tokenization" wasn’t in training, WordPiece handles it by decomposing into known subwords. The ## prefix indicates that "##ization" is a continuation token (not at word start).
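A minimal sketch of that longest-match-first loop (the vocabulary here is hand-written for illustration; a real tokenizer also normalizes the text and caps the word length before falling back to [UNK]):
def wordpiece_encode(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece encoding of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it appears in the vocabulary
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no prefix matched: the whole word becomes [UNK]
        tokens.append(match)
        start = end
    return tokens
vocab = {"token", "##ization", "##tion", "##ing", "to", "##ken"}
print(wordpiece_encode("tokenization", vocab))
# ['token', '##ization']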
Why WordPiece Works
WordPiece discovers linguistic structure with a more principled approach:
- Likelihood-based compression: Merges are chosen to maximize the probability of the training data, not just frequency
- Better rare word handling: The scoring formula favors informative pairs, leading to better decomposition of rare words
- Morphological awareness: Suffixes and prefixes naturally emerge, similar to BPE
- The `##` convention: Explicitly marks word boundaries, helping the model understand morphology
The ## Convention
WordPiece uses a special marker to distinguish word-starting tokens from word-continuing tokens:
- Tokens at the start of a word have no prefix: `"token"`
- Tokens inside a word are prefixed with `##`: `"##ization"`
This helps the model understand:
"token"can start a word"##ization"is a continuation (suffix)"##s"is a continuation (plural marker)
Example: “I like tokens” -> [“I”, “like”, “token”, “##s”]
Code Example
from transformers import AutoTokenizer
# BERT uses WordPiece
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Tokenize some text
text = "tokenization is fascinating"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text, add_special_tokens=False)
print(f"Text: {text}")
# Text: tokenization is fascinating
print(f"Tokens: {tokens}")
# Tokens: ['token', '##ization', 'is', 'fascinating']
print(f"Token IDs: {ids}")
# Token IDs: [19204, 3989, 2003, 17160]
print("\nToken breakdown:")
for i, token in enumerate(tokens):
token_id = tokenizer.convert_tokens_to_ids(token)
print(f" Token {i+1}: '{token}' -> ID={token_id}")
# Token breakdown:
# Token 1: 'token' -> ID=19204
# Token 2: '##ization' -> ID=3989
# Token 3: 'is' -> ID=2003
# Token 4: 'fascinating' -> ID=17160
# Decode back
print(f"\nDecoded: {tokenizer.decode(ids)}")
# Decoded: tokenization is fascinating
# BERT adds special tokens automatically
full_encoding = tokenizer.encode(text)
print(f"\nWith special tokens: {tokenizer.convert_ids_to_tokens(full_encoding)}")
# ['[CLS]', 'token', '##ization', 'is', 'fascinating', '[SEP]']
Limitations of WordPiece
While WordPiece is powerful, it has some limitations:
- Can have unknown tokens: Unlike byte-level BPE, WordPiece can’t encode everything; unknown characters become `[UNK]`
- Greedy encoding: The longest-match-first algorithm may not always be optimal
- More complex scoring: The likelihood formula is more expensive to compute than frequency
- Language-specific: Works best for languages similar to the training data
Who uses WordPiece?
WordPiece is primarily used in the BERT ecosystem: BERT (base and large, all variants), DistilBERT (distilled BERT), ELECTRA (efficiently trained BERT variant), and most encoder-only models in the BERT family.
Unigram
Unigram is a subword tokenization algorithm introduced by Kudo (2018) as part of SentencePiece. Unlike BPE and WordPiece which build vocabulary bottom-up (starting small and growing), Unigram works top-down: start with a large vocabulary of candidate tokens and iteratively prune the least useful ones until reaching the target vocabulary size.
The core idea: assign a probability to each token, compute how much removing each token would hurt the overall likelihood of the training data, and remove the tokens with lowest impact. This probabilistic approach allows for multiple valid segmentations of the same word, which can be used for data augmentation during training.
Training Phase: Building the Vocabulary
Training text: "low low low low lowest newest"
Step 0: Initialisation - Build Large Candidate Vocabulary
Start by generating all possible substrings from the training corpus up to a maximum length (e.g., 16 characters).
For our training corpus, we extract all unique substrings:
All single characters: {l, o, w, e, s, t, n}
All 2-grams: {lo, ow, we, es, st, ne, ew}
All 3-grams: {low, owe, wes, est, new, ewe}
All 4-grams: {lowe, owes, west, este, ...}
All 5-grams: {lowes, owest, ...}
... up to maximum length
Initial Vocabulary (simplified):
{l, o, w, e, s, t, n, lo, ow, we, es, st, ne, ew, low, owe, wes, est, new, lowe, owest, west, este, lowest, newest, ...}
Size: on a real corpus, easily 10,000+ candidate tokens
Step 1: Estimate Token Probabilities Using EM Algorithm
We need to assign probabilities to each token. The Expectation-Maximization (EM) algorithm does this:
E-step: For each word in training data, find all possible segmentations and their probabilities
M-step: Update token probabilities based on how often they appear in the best segmentations
Initial probabilities (uniform or frequency-based):
P(low) = 0.04 (appears 4 times out of 6 words in the training corpus)
P(lowest) = 0.01 (appears 1 time)
P(newest) = 0.01 (appears 1 time)
P(l) = 0.05
P(o) = 0.08
P(w) = 0.10
... (all other tokens have small probabilities)
After EM iterations, probabilities converge to:
P(low) = 0.03
P(est) = 0.02
P(new) = 0.01
P(lowest) = 0.005
P(newest) = 0.005
P(l) = 0.05
P(o) = 0.08
... (probabilities reflect actual usefulness)
Step 2: Compute Loss Impact for Each Token
For each token, we compute: "If we remove this token, how much does the total log-likelihood decrease?"
The loss impact formula:
L(token) = -log(P(alternative_segmentation)) - (-log(P(current_segmentation)))
Where:
P(current_segmentation) uses the token
P(alternative_segmentation) is the best segmentation without the token
Example 1: Removing "low" (appears 4 times)
Current segmentation: ["low"] -> P = 0.03
Alternative (without "low"): ["l", "o", "w"] -> P = P(l) × P(o) × P(w) = 0.05 × 0.08 × 0.10 = 0.0004
Loss impact = -log(0.0004) - (-log(0.03))
= 7.82 - 3.51
= 4.31 (high impact - keep this token!)
Example 2: Removing "newest" (appears 1 time, rare)
Current segmentation: ["newest"] -> P = 0.005
Alternative: ["new", "est"] -> P = P(new) × P(est) = 0.01 × 0.02 = 0.0002
Loss impact = -log(0.0002) - (-log(0.005))
= 8.52 - 5.30
= 3.22 (medium impact)
Example 3: Removing "xyz" (rare, appears once in a different word)
Current: ["xyz"] -> P = 0.0001
Alternative: ["x", "y", "z"] -> P = P(x) × P(y) × P(z) = 0.001 × 0.001 × 0.001 = 0.000000001
Loss impact = -log(0.000000001) - (-log(0.0001))
= 20.72 - 9.21
= 11.51 per occurrence (but the token is so rare that removing it barely affects overall corpus likelihood)
Actually, we compute loss impact across the ENTIRE corpus:
L(token) = Σ [loss when removing token from word_i] for all words in corpus
Step 3: Remove Bottom Tokens
Sort all tokens by loss impact (ascending). Remove the bottom 20% (lowest impact tokens).
Tokens sorted by loss impact:
"low" -> 4.31 (keep - high impact)
"est" -> 3.50 (keep - high impact)
"new" -> 2.80 (keep - medium impact)
"newest" -> 3.22 (keep - medium impact)
"lowest" -> 2.50 (keep - medium impact)
...
"xyz" -> 0.01 (remove - very low impact)
"abc" -> 0.005 (remove - very low impact)
"qwerty" -> 0.001 (remove - very low impact)
Remove bottom 20%: ~2,000 tokens removed
Remaining vocabulary: ~8,000 tokens
Step 4: Re-estimate Probabilities
After pruning, we need to update probabilities because:
Some tokens are gone, so segmentations change
Remaining tokens might be used more/less frequently
Run EM algorithm again on the pruned vocabulary:
E-step: Find best segmentations using remaining tokens
M-step: Update probabilities based on new segmentations
New probabilities:
P(low) = 0.035 (increased - now more important)
P(est) = 0.025 (increased)
P(new) = 0.015 (increased)
... (probabilities rebalanced)
Step 5: Repeat Pruning
Go back to Step 2. Compute loss impacts again with updated probabilities.
Remove another 20% of lowest-impact tokens.
Re-estimate probabilities.
Repeat until vocabulary reaches target size (e.g., 8,000 tokens).
After several iterations:
Vocabulary: {l, o, w, e, s, t, n, lo, low, es, est, ne, new, lowest, ...}
Size: 8,000 tokens (down from 10,000+ candidate tokens)
Notice what happened:
- Top-down approach: Started large, pruned down (opposite of BPE/WordPiece)
- Probabilistic: Each token has a probability, not just a frequency
- Loss-based pruning: Removes tokens that don’t help compression
- Common patterns preserved:
low,est,newremain because they’re useful
Encoding Phase: Tokenizing New Text
Unigram uses the Viterbi algorithm to find the most probable segmentation. Unlike BPE/WordPiece which have a single deterministic segmentation, Unigram can have multiple valid segmentations with different probabilities.
- For each possible way to segment the word, compute the total probability
- The probability of a segmentation is the product of individual token probabilities
- Choose the segmentation with highest probability
- Optionally, sample from the distribution (for data augmentation)
Example: encoding "lowest" (assume the full word was pruned from the vocabulary, so only subwords remain)
Let's say our trained Unigram tokenizer has these tokens with probabilities:
P(low) = 0.03
P(est) = 0.02
P(lo) = 0.01
P(we) = 0.005
P(st) = 0.008
P(l) = 0.05
P(o) = 0.08
P(w) = 0.10
P(e) = 0.06
P(s) = 0.04
P(t) = 0.04
Possible segmentations and their probabilities:
Option 1: ["low", "est"]
P = P(low) × P(est) = 0.03 × 0.02 = 0.0006
Option 2: ["lo", "we", "st"]
P = P(lo) × P(we) × P(st) = 0.01 × 0.005 × 0.008 = 0.0000004
Option 3: ["l", "o", "w", "e", "s", "t"]
P = P(l) × P(o) × P(w) × P(e) × P(s) × P(t)
P = 0.05 × 0.08 × 0.10 × 0.06 × 0.04 × 0.04 = 0.0000000384
Option 1 has the highest probability, so we choose: ["low", "est"]
The Viterbi algorithm efficiently finds this optimal segmentation using dynamic programming, avoiding the need to check all possible segmentations.
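A compact sketch of that dynamic program: for every prefix of the word we keep the best (highest log-probability) segmentation found so far, using the illustrative token probabilities from above.
import math
# Illustrative unigram probabilities from the example above
probs = {
    "low": 0.03, "est": 0.02, "lo": 0.01, "we": 0.005, "st": 0.008,
    "l": 0.05, "o": 0.08, "w": 0.10, "e": 0.06, "s": 0.04, "t": 0.04,
}
def viterbi_segment(word, probs):
    """Return the most probable segmentation under a unigram token model."""
    n = len(word)
    # best[i] = (best log-probability of word[:i], segmentation achieving it)
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[n][1]
print(viterbi_segment("lowest", probs))
# ['low', 'est']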
Why Unigram Works
Unigram’s probabilistic approach offers several advantages:
- Principled optimization: Directly optimizes for compression quality (likelihood) rather than greedy frequency based merging
- Multiple valid segmentations: Can sample different tokenizations during training, acting as data augmentation
- Better rare word handling: The loss-based pruning naturally keeps tokens that help with rare words
- Flexible: Can trade off between compression and vocabulary size more explicitly
Code Example
from transformers import AutoTokenizer
# T5 uses Unigram via SentencePiece
tokenizer = AutoTokenizer.from_pretrained("t5-base")
# Tokenize some text
text = "tokenization is fascinating"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text, add_special_tokens=False)
print(f"Text: {text}")
# Text: tokenization is fascinating
print(f"Tokens: {tokens}")
# Tokens: ['▁token', 'ization', '▁is', '▁fascinating']
print(f"Token IDs: {ids}")
# Token IDs: [14145, 1707, 19, 8899]
print("\nToken breakdown:")
for i, token in enumerate(tokens):
token_id = tokenizer.convert_tokens_to_ids(token)
print(f" Token {i+1}: '{token}' -> ID={token_id}")
# Token breakdown:
# Token 1: '▁token' -> ID=14145
# Token 2: 'ization' -> ID=1707
# Token 3: '▁is' -> ID=19
# Token 4: '▁fascinating' -> ID=8899
# Decode back
print(f"\nDecoded: {tokenizer.decode(ids)}")
# Decoded: tokenization is fascinating
# Notice the ▁ prefix marks word boundaries
text2 = "I love tokenization"
print(f"\n'{text2}' tokenized:")
print(tokenizer.tokenize(text2))
# ['▁I', '▁love', '▁token', 'ization']
Limitations of Unigram
While Unigram is powerful, it has some limitations:
- More complex to implement: Requires EM algorithm for probability estimation and Viterbi for encoding
- Slower training: The iterative pruning and probability re-estimation is computationally expensive
- Non-deterministic by default: Multiple valid segmentations can be confusing (though this is also a feature)
- Requires initial vocabulary: Need to generate candidate tokens upfront (though SentencePiece handles this)
Who uses Unigram?
Unigram is primarily used via SentencePiece in multilingual and encoder-decoder models by T5, mT5 (text-to-text transfer transformer), ALBERT (A Lite BERT), XLNet (generalized autoregressive pretraining), mBART (multilingual BART), and most multilingual models (SentencePiece is language agnostic).
Note: Unigram is less common than BPE for decoder only models (GPT, LLaMA). It’s more popular for encoder-decoder architectures and multilingual applications where SentencePiece’s language-agnostic design shines.
Real World Comparisons: BPE vs WordPiece vs Unigram
Theory is one thing, but seeing how different tokenizers handle real-world text reveals their practical differences. Let’s compare BPE, WordPiece, and Unigram on a complete short story to see the actual impact.
Compression Ratio
We will use “A Transgression” by Anton Chekhov as our test case.
from transformers import AutoTokenizer
# Load the story
with open("a-transgression.txt", "r", encoding="utf-8") as f:
story = f.read()
print(f"Story length: {len(story):,} characters")
print(f"Story preview: {story[:200]}...")
# Initialize tokenizers
tokenizer_bpe = AutoTokenizer.from_pretrained("gpt2")
tokenizer_wp = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer_uni = AutoTokenizer.from_pretrained("t5-base")
# Tokenize the entire story
tokens_bpe = tokenizer_bpe.tokenize(story)
tokens_wp = tokenizer_wp.tokenize(story)
tokens_uni = tokenizer_uni.tokenize(story)
ids_bpe = tokenizer_bpe.encode(story, add_special_tokens=False)
ids_wp = tokenizer_wp.encode(story, add_special_tokens=False)
ids_uni = tokenizer_uni.encode(story, add_special_tokens=False)
print(f"\n{'Tokenizer':<15} {'Tokens':<10} {'Characters/Token':<20} {'Compression Ratio'}")
print("-" * 70)
print(f"{'BPE (GPT-2)':<15} {len(tokens_bpe):<10,} {len(story)/len(tokens_bpe):<20.2f} {len(tokens_bpe)/len(story.split()):.2f}x")
print(f"{'WordPiece (BERT)':<15} {len(tokens_wp):<10,} {len(story)/len(tokens_wp):<20.2f} {len(tokens_wp)/len(story.split()):.2f}x")
print(f"{'Unigram (T5)':<15} {len(tokens_uni):<10,} {len(story)/len(tokens_uni):<20.2f} {len(tokens_uni)/len(story.split()):.2f}x")
Output:
Story length: 9,941 characters
Story preview: A collegiate assessor called Miguev stopped at a telegraph-post in the course of his evening walk and heaved a deep sigh. A week before, as he was returning home from his evening walk, he had been ove...
Tokenizer Tokens Characters/Token Tokens per Word
----------------------------------------------------------------------
BPE (GPT-2) 2,829 3.51 1.48x
WordPiece (BERT) 2,584 3.85 1.35x
Unigram (T5) 3,068 3.24 1.61x
Results:
- WordPiece produces the fewest tokens: 2,584 (9% fewer than BPE)
- Unigram produces the most tokens: 3,068 (19% more than WordPiece)
- BPE is in the middle: 2,829 tokens
This contradicts common assumptions! Let’s understand why.
Why WordPiece Wins
WordPiece’s likelihood based merging creates more efficient merges for common English text. The ## convention might seem verbose, but it actually helps WordPiece create better subword units:
# Let's examine some specific words from the story
sample_words = ["collegiate", "assessor", "telegraph-post", "returning"]
print(f"{'Word':<20} {'BPE':<25} {'WordPiece':<25} {'Unigram':<25}")
print("-" * 100)
for word in sample_words:
bpe_tokens = tokenizer_bpe.tokenize(word)
wp_tokens = tokenizer_wp.tokenize(word)
uni_tokens = tokenizer_uni.tokenize(word)
print(f"{word:<20} {str(bpe_tokens):<25} {str(wp_tokens):<25} {str(uni_tokens):<25}")
Output:
Word BPE WordPiece Unigram
----------------------------------------------------------------------------------------------------
collegiate ['col', 'leg', 'iate'] ['collegiate'] ['▁', 'collegiate']
assessor ['ass', 'essor'] ['assess', '##or'] ['▁assess', 'or']
telegraph-post ['te', 'legraph', '-', 'post'] ['telegraph', '-', 'post'] ['▁', 'tele', 'graph', '-', 'post']
returning ['return', 'ing'] ['returning'] ['▁returning']
Key insight: For well-trained tokenizers on similar text, WordPiece’s likelihood based merging can create more efficient vocabulary for common English patterns.
Why Unigram Produces More Tokens
Unigram’s probabilistic approach can be less efficient for this type of text because:
- It was trained on different data (T5’s training corpus)
- The vocabulary might not be optimized for literary/narrative text
- The pruning process may have removed tokens that would help with this specific domain
Cost Impact
# API pricing: $0.002 per 1K tokens (example rate)
cost_per_1k = 0.002
# Scenario: Processing 10,000 stories per month
stories_per_month = 10_000
bpe_cost = (len(tokens_bpe) * cost_per_1k / 1000) * stories_per_month
wp_cost = (len(tokens_wp) * cost_per_1k / 1000) * stories_per_month
uni_cost = (len(tokens_uni) * cost_per_1k / 1000) * stories_per_month
print(f"\n{'Scenario':<30} {'BPE':<15} {'WordPiece':<15} {'Unigram':<15}")
print("-" * 75)
print(f"{'Cost per story':<30} ${bpe_cost/stories_per_month:.4f} ${wp_cost/stories_per_month:.4f} ${uni_cost/stories_per_month:.4f}")
print(f"{'Monthly cost (10K stories)':<30} ${bpe_cost:,.2f} ${wp_cost:,.2f} ${uni_cost:,.2f}")
print(f"{'Annual cost (10K stories)':<30} ${bpe_cost*12:,.2f} ${wp_cost*12:,.2f} ${uni_cost*12:,.2f}")
print(f"\nSavings:")
print(f" WordPiece vs BPE: ${bpe_cost - wp_cost:,.2f}/month (${(bpe_cost - wp_cost)*12:,.2f}/year)")
print(f" WordPiece vs Unigram: ${uni_cost - wp_cost:,.2f}/month (${(uni_cost - wp_cost)*12:,.2f}/year)")
Output:
Scenario BPE WordPiece Unigram
---------------------------------------------------------------------------
Cost per story $0.0057 $0.0052 $0.0061
Monthly cost (10K stories) $56.58 $51.68 $61.36
Annual cost (10K stories) $678.96 $620.16 $736.32
Savings:
WordPiece vs BPE: $4.90/month ($58.80/year)
WordPiece vs Unigram: $9.68/month ($116.16/year)
The Key Lesson: Context Matters
These results show that the “best” tokenizer depends on your text:
- WordPiece excels here because:
- BERT was trained on similar English text (Wikipedia, books)
- The vocabulary is well-suited for narrative prose
- Likelihood based merging works well for common English patterns
- Unigram struggles here because:
- T5 was trained on different data (web text, C4 corpus)
- The vocabulary may not be optimized for literary text
- Different domain = different optimal vocabulary
- BPE is middle ground because:
- GPT-2 was trained on web text (diverse but different from literature)
- Byte-level encoding is robust but not domain-optimized
Real-World Tokenization Challenges
Understanding tokenization algorithms is one thing. Understanding how tokenization affects real-world model behavior is another. Let’s explore the challenges and quirks that come with tokenization.
Why LLMs Struggle with Certain Tasks
Many of the “strange” behaviors of language models trace back to tokenization. When people say “GPT can’t do math” or “LLMs struggle with spelling,” tokenization is often the culprit.
The Arithmetic Problem
import tiktoken
enc = tiktoken.get_encoding("gpt2")
# How GPT-2 sees arithmetic
print(enc.encode("317+492=809"))
# [34125, 10, 40256, 28, 34583]
# Check how each number tokenizes on its own (these 3-digit numbers happen to be single tokens; longer numbers split into arbitrary chunks):
for num in ["317", "492", "809"]:
tokens = enc.encode(num)
print(f"{num} -> {[enc.decode([t]) for t in tokens]}")
# 317 -> ['317']
# 492 -> ['492']
# 809 -> ['809']
The model doesn’t see 317 + 492 = 809 as digits with place value. Here the 3-digit numbers happen to be single opaque tokens; longer or less common numbers split into arbitrary multi-digit chunks. Learning addition becomes much harder when the representation varies like this.
The Spelling Problem
# Can a model spell "hello" backwards?
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.encode("hello"))
# [31373] - Just one token
# The model never sees individual letters, it sees "hello" as an atomic unit
# To spell backwards, it would need to decompose something it treats as indivisible
This is why LLMs struggle with:
- Counting letters in words
- Reversing strings
- Rhyming (which requires phoneme-level understanding)
- Anagram solving
The Trailing Whitespace Problem
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
# These are DIFFERENT tokens
print(tokenizer.encode("hello")) # [31373]
print(tokenizer.encode("hello ")) # [31373, 220]
print(tokenizer.encode(" hello")) # [23748]
# " hello" (with leading space) is a completely different token than "hello"!
This is why prompts can be sensitive to whitespace. A trailing space changes the tokenization entirely.
Multilingual Tokenization: The Hidden Cost
Tokenizers trained primarily on English text create a tokenization tax for other languages:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Same meaning, vastly different token counts
texts = {
"English": "Hello, how are you today?",
"Chinese": "你好,今天你好吗?",
"Arabic": "مرحبا، كيف حالك اليوم؟",
"Hindi": "नमस्ते, आज आप कैसे हैं?",
}
for lang, text in texts.items():
tokens = tokenizer.encode(text)
chars_per_token = len(text) / len(tokens)
print(f"{lang:10} | {len(tokens):3} tokens | {chars_per_token:.2f} chars/token | {text}")
Output:
English | 7 tokens | 3.57 chars/token | Hello, how are you today?
Chinese | 19 tokens | 0.47 chars/token | 你好,今天你好吗?
Arabic | 24 tokens | 0.92 chars/token | مرحبا، كيف حالك اليوم؟
Hindi | 36 tokens | 0.64 chars/token | नमस्ते, आज आप कैसे हैं?
The implications:
- API costs: Non-English users pay 3-5x more for the same semantic content
- Context window: Less information fits in the same context length
- Model quality: Rare tokens have weaker embeddings (less training data)
Solutions:
- Use multilingual tokenizers (SentencePiece, mT5)
- Train domain-specific tokenizers for your target language
- Consider byte-level models for universal coverage
Vocabulary Size Trade-offs
Why does GPT-2 use exactly 50,257 tokens? The number isn’t arbitrary: it’s 256 byte tokens + 50,000 BPE merges + 1 end-of-text token. But why 50,000 merges instead of 10,000 or 100,000?
Consider the sentence “The tokenization process is fascinating”:
- With a small vocabulary:
["The", "token", "ization", "pro", "cess", "is", "fas", "cin", "ating"]-> 9 tokens - With a large vocabulary:
["The", "tokenization", "process", "is", "fascinating"]-> 5 tokens
Same text, different token counts. This difference compounds across millions of documents.
The core trade-off is between two costs:
| Cost Type | Small Vocab (8K) | Large Vocab (100K+) |
|---|---|---|
| Sequence length | Longer (more attention compute) | Shorter (less attention compute) |
| Embedding matrix | Smaller (less memory) | Larger (more memory) |
| Softmax computation | Faster | Slower (O(vocab_size) per token) |
Where the costs come from:
- Embedding matrix memory = vocab_size × embedding_dim × bytes_per_param
- GPT-2 (50K vocab, 768-dim): ~150 MB
- GPT-3 (50K vocab, 12,288-dim): ~2.4 GB
- Double the vocab -> double the embedding memory
- Softmax bottleneck: Every token prediction requires computing probabilities over the entire vocabulary. With 100K tokens, that’s 100K operations per generated token -> this becomes the inference bottleneck for large vocabularies.
- Sequence length vs attention cost: Attention is O(n²) where n is sequence length. A smaller vocabulary means more tokens per sentence, which quadratically increases attention computation. This often outweighs the savings from a smaller embedding matrix.
The sweet spot (32K-64K):
- Keeps sequences reasonably short (good compression)
- Embedding matrix stays manageable
- Softmax doesn’t dominate inference time
- 32K specifically fits in 16-bit integers, simplifying implementation
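A quick back-of-the-envelope sketch of the embedding-memory side of this trade-off (fp32 parameters assumed; the dimensions are the GPT-2/GPT-3 widths quoted above):
def embedding_memory_mb(vocab_size, embedding_dim, bytes_per_param=4):
    """Memory used by the token-embedding matrix alone, in megabytes."""
    return vocab_size * embedding_dim * bytes_per_param / 1e6
print(f"{embedding_memory_mb(50257, 768):.0f} MB")           # GPT-2: ~150 MB
print(f"{embedding_memory_mb(50257, 12288) / 1000:.1f} GB")  # GPT-3 width: ~2.5 GB
print(f"{embedding_memory_mb(100_000, 768):.0f} MB")         # double the vocab, roughly double the memory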
Special Tokens Explained
Every tokenizer has special tokens that control model behavior:
| Token | Purpose | Example Models |
|---|---|---|
| `[CLS]` | Classification token (sentence representation) | BERT |
| `[SEP]` | Separator between segments | BERT |
| `[MASK]` | Masked position for MLM training | BERT |
| `[PAD]` | Padding for batch processing | Most models |
| `<s>`, `</s>` | Beginning/end of sequence | LLaMA, T5 |
| `<\|endoftext\|>` | Document boundary / end of text | GPT-2, GPT-3 |
| `<\|im_start\|>`, `<\|im_end\|>` | Chat message boundaries (ChatML) | OpenAI chat models |
from transformers import AutoTokenizer
# BERT's special tokens
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = bert_tok.encode("Hello world")
print(bert_tok.convert_ids_to_tokens(encoded))
# ['[CLS]', 'hello', 'world', '[SEP]']
# GPT-2's special tokens
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
print(f"End of text token: {gpt2_tok.eos_token} (ID: {gpt2_tok.eos_token_id})")
# End of text token: <|endoftext|> (ID: 50256)
Why special tokens matter:
- [CLS] aggregates sentence meaning for classification tasks
- [SEP] tells the model where one segment ends and another begins
- [PAD] is ignored during attention (via attention masks)
- <|endoftext|> signals document boundaries during training
Tokenization for Code
Code has unique tokenization challenges:
import tiktoken
enc = tiktoken.get_encoding("gpt2")
code = '''def hello_world():
print("Hello, World!")
'''
tokens = enc.encode(code)
print(f"Token count: {len(tokens)}")
for t in tokens:
print(f" {t:5d} -> {repr(enc.decode([t]))}")
Output:
Token count: 17
4299 -> 'def'
23748 -> ' hello'
62 -> '_'
6894 -> 'world'
33529 -> '():'
198 -> '\n'
220 -> ' '
220 -> ' '
220 -> ' '
3601 -> ' print'
7203 -> '("'
15496 -> 'Hello'
11 -> ','
2159 -> ' World'
2474 -> '!"'
8 -> ')'
198 -> '\n'
Key observations:
- Whitespace handling varies: Indentation may split into multiple single-space tokens rather than one combined whitespace token
- Common patterns merge: `'():'` and `'("'` become single tokens
- Snake_case splits: `hello_world` becomes `['hello', '_', 'world']`
- CamelCase varies: Depends on frequency in training data
Code-specific challenges:
- Variable names often split unpredictably
- Rare function names decompose to characters
- Different languages (Python vs Rust vs Go) tokenize differently
- Comments in non-English get heavily penalized
Best practices for code:
- Use tokenizers trained on code (Codex, StarCoder tokenizers)
- Be aware that renaming variables can change token count
- Consider the tokenization when crafting prompts for code generation
Building Your Own Tokenizer
Now that we have a better understanding of how different tokenizers work, let’s build our own tokenizer from scratch.
Step 1: Setup and Data Preparation
First, let’s set up our environment and prepare the training data:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import BertProcessing
from tokenizers.decoders import BPEDecoder
from transformers import PreTrainedTokenizerFast
with open("a-transgression.txt", "r", encoding="utf-8") as f:
corpus = f.read()
print(f"Corpus size: {len(corpus):,} characters")
print(f"Corpus preview: {corpus[:200]}...")
# Output
# Corpus size: 9,941 characters
# Corpus preview: A collegiate assessor called Miguev stopped at a telegraph-post in the course of his evening walk and heaved a deep sigh. A week before, as he was returning home from his evening walk, he had been ove...
Step 2: Initialize the Tokenizer
# Initialize a blank BPE tokenizer
tokenizer = Tokenizer(BPE(unk_token="<|unk|>"))
# Set pre-tokenizer (splits on whitespace)
tokenizer.pre_tokenizer = Whitespace()
What’s happening:
- `BPE(unk_token="<|unk|>")` creates a BPE model with an unknown token
- The `Whitespace()` pre-tokenizer splits text on whitespace before BPE
- This is the simplest setup; you can customize later
Step 3: Configure the Trainer
# Define special tokens
special_tokens = [
"<|unk|>", # Unknown token
"<|pad|>", # Padding token
"<|bos|>", # Beginning of sequence
"<|eos|>", # End of sequence
]
# Create trainer with parameters
trainer = BpeTrainer(
vocab_size=8000, # Target vocabulary size
min_frequency=2, # Minimum frequency for a token to be included
special_tokens=special_tokens,
show_progress=True, # Show training progress
)
print(f"Trainer configured: vocab_size={8000}, min_frequency={2}")
Key parameters:
- `vocab_size`: Target vocabulary size (common: 8K, 16K, 32K, 50K)
- `min_frequency`: Tokens must appear at least this many times
- `special_tokens`: Control tokens for your model
Step 4: Train the Tokenizer
# Prepare training data
# tokenizers expects a list of strings or file paths
training_files = ["a-transgression.txt"]
# Train the tokenizer
tokenizer.train(training_files, trainer)
print("Training complete!")
print(f"Vocabulary size: {len(tokenizer.get_vocab())}")
Step 5: Add Post-Processing
# Add post-processor (adds special tokens)
# This adds <|bos|> at start and <|eos|> at end
tokenizer.post_processor = BertProcessing(
("<|eos|>", tokenizer.token_to_id("<|eos|>")),
("<|bos|>", tokenizer.token_to_id("<|bos|>")),
)
# Set decoder (converts tokens back to text)
tokenizer.decoder = BPEDecoder()
print("Post-processing configured")
Step 6: Test The Tokenizer
# Test encoding
text = "The tokenization process converts text into numbers."
encoding = tokenizer.encode(text)
print(f"Original text: {text}")
print(f"Token IDs: {encoding.ids}")
print(f"Tokens: {encoding.tokens}")
print(f"Number of tokens: {len(encoding.ids)}")
# Test decoding
decoded = tokenizer.decode(encoding.ids)
print(f"Decoded text: {decoded}")
Output:
Original text: The tokenization process converts text into numbers.
Token IDs: [2, 181, 77, 40, 81, 38, 55, 76, 197, 507, 32, 84, 48, 165, 143, 49, 48, 125, 53, 49, 191, 43, 50, 318, 205, 7, 3]
Tokens: ['<|bos|>', 'The', 'to', 'k', 'en', 'i', 'z', 'at', 'ion', 'pro', 'c', 'es', 's', 'con', 'ver', 't', 's', 'te', 'x', 't', 'into', 'n', 'u', 'mb', 'ers', '.', '<|eos|>']
Number of tokens: 27
Decoded text: Thetokenizationprocessconvertstextintonumbers.
Note: Token IDs and counts will vary with your training corpus and vocabulary size. Because our corpus is a single short story, unfamiliar words like "tokenization" decompose into small fragments. Also notice that the decoded text has no spaces: the Whitespace pre-tokenizer discards whitespace, so the BPE decoder cannot restore it (more on fixing this below).
Step 7: Save and Load
# Save tokenizer
tokenizer.save("my_tokenizer.json")
print("Tokenizer saved to my_tokenizer.json")
# Load it back
loaded_tokenizer = Tokenizer.from_file("my_tokenizer.json")
# Test that it works
test_text = "Hello world"
encoding = loaded_tokenizer.encode(test_text)
print(f"Loaded tokenizer works: {loaded_tokenizer.decode(encoding.ids)}")
Step 8: Use with Transformers
# Wrap in PreTrainedTokenizerFast for Transformers compatibility
wrapped_tokenizer = PreTrainedTokenizerFast(
tokenizer_object=tokenizer,
bos_token="<|bos|>",
eos_token="<|eos|>",
unk_token="<|unk|>",
pad_token="<|pad|>",
)
# Now you can use it like any HuggingFace tokenizer
text = "The tokenization process"
tokens = wrapped_tokenizer.tokenize(text)
ids = wrapped_tokenizer.encode(text)
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
# Save for Transformers
wrapped_tokenizer.save_pretrained("./my_tokenizer")
Complete Code
Here’s all the steps combined into a single script:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import BertProcessing
from tokenizers.decoders import BPEDecoder
from transformers import PreTrainedTokenizerFast
# Step 1: Setup and Data Preparation
with open("a-transgression.txt", "r", encoding="utf-8") as f:
corpus = f.read()
print(f"Corpus size: {len(corpus):,} characters")
print(f"Corpus preview: {corpus[:200]}...")
# Step 2: Initialize the Tokenizer
tokenizer = Tokenizer(BPE(unk_token="<|unk|>"))
tokenizer.pre_tokenizer = Whitespace()
# Step 3: Configure the Trainer
special_tokens = [
"<|unk|>", # Unknown token
"<|pad|>", # Padding token
"<|bos|>", # Beginning of sequence
"<|eos|>", # End of sequence
]
trainer = BpeTrainer(
vocab_size=8000,
min_frequency=2,
special_tokens=special_tokens,
show_progress=True,
)
print(f"Trainer configured: vocab_size={8000}, min_frequency={2}")
# Step 4: Train the Tokenizer
training_files = ["a-transgression.txt"]
tokenizer.train(training_files, trainer)
print("Training complete!")
print(f"Vocabulary size: {len(tokenizer.get_vocab())}")
# Step 5: Add Post-Processing
tokenizer.post_processor = BertProcessing(
("<|eos|>", tokenizer.token_to_id("<|eos|>")),
("<|bos|>", tokenizer.token_to_id("<|bos|>")),
)
tokenizer.decoder = BPEDecoder()
print("Post-processing configured")
# Step 6: Test The Tokenizer
text = "The tokenization process converts text into numbers."
encoding = tokenizer.encode(text)
print(f"Original text: {text}")
print(f"Token IDs: {encoding.ids}")
print(f"Tokens: {encoding.tokens}")
print(f"Number of tokens: {len(encoding.ids)}")
decoded = tokenizer.decode(encoding.ids)
print(f"Decoded text: {decoded}")
# Step 7: Save and Load
tokenizer.save("my_tokenizer.json")
print("Tokenizer saved to my_tokenizer.json")
loaded_tokenizer = Tokenizer.from_file("my_tokenizer.json")
test_text = "Hello world"
encoding = loaded_tokenizer.encode(test_text)
print(f"Loaded tokenizer works: {loaded_tokenizer.decode(encoding.ids)}")
# Step 8: Use with Transformers
wrapped_tokenizer = PreTrainedTokenizerFast(
tokenizer_object=tokenizer,
bos_token="<|bos|>",
eos_token="<|eos|>",
unk_token="<|unk|>",
pad_token="<|pad|>",
)
text = "The tokenization process"
tokens = wrapped_tokenizer.tokenize(text)
ids = wrapped_tokenizer.encode(text)
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
wrapped_tokenizer.save_pretrained("./my_tokenizer")
Output:
Corpus size: 9,941 characters
Corpus preview: A collegiate assessor called Miguev stopped at a telegraph-post in the course of his evening walk and heaved a deep sigh. A week before, as he was returning home from his evening walk, he had been ove...
Trainer configured: vocab_size=8000, min_frequency=2
[00:00:00] Pre-processing files (0 Mo) ████████████████████████████████ 100%
[00:00:00] Tokenize words ████████████████████████████████ 684 / 684
[00:00:00] Count pairs ████████████████████████████████ 684 / 684
[00:00:00] Compute merges ████████████████████████████████ 603 / 603
Training complete!
Vocabulary size: 664
Post-processing configured
Original text: The tokenization process converts text into numbers.
Token IDs: [2, 181, 77, 40, 81, 38, 55, 76, 197, 507, 32, 84, 48, 165, 143, 49, 48, 125, 53, 49, 191, 43, 50, 318, 205, 7, 3]
Tokens: ['<|bos|>', 'The', 'to', 'k', 'en', 'i', 'z', 'at', 'ion', 'pro', 'c', 'es', 's', 'con', 'ver', 't', 's', 'te', 'x', 't', 'into', 'n', 'u', 'mb', 'ers', '.', '<|eos|>']
Number of tokens: 27
Decoded text: Thetokenizationprocessconvertstextintonumbers.
Tokenizer saved to my_tokenizer.json
Loaded tokenizer works: Helloworld
Tokens: ['The', 'to', 'k', 'en', 'i', 'z', 'at', 'ion', 'pro', 'c', 'es', 's']
IDs: [2, 181, 77, 40, 81, 38, 55, 76, 197, 507, 32, 84, 48, 3]
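One thing worth calling out in the output: the decoded text comes back with all the spaces squashed out (Thetokenizationprocess...). That is expected with this setup, because the Whitespace pre-tokenizer discards the spaces before the BPE model ever sees them, so BPEDecoder has nothing to restore. If a clean round-trip matters for your use case, one option (not part of the script above, just a sketch) is to swap in a Metaspace pre-tokenizer/decoder pair, which turns word boundaries into a marker character that survives decoding:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Metaspace
from tokenizers.decoders import Metaspace as MetaspaceDecoder

# Same BPE setup as before, but word boundaries are kept as a "▁" marker
tokenizer = Tokenizer(BPE(unk_token="<|unk|>"))
tokenizer.pre_tokenizer = Metaspace()
tokenizer.decoder = MetaspaceDecoder()

trainer = BpeTrainer(
    vocab_size=8000,
    min_frequency=2,
    special_tokens=["<|unk|>", "<|pad|>", "<|bos|>", "<|eos|>"],
)
tokenizer.train(["a-transgression.txt"], trainer)

encoding = tokenizer.encode("The tokenization process converts text into numbers.")
print(tokenizer.decode(encoding.ids))  # spaces are preserved this time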
Training WordPiece or Unigram Instead
The HuggingFace tokenizers library also supports WordPiece and Unigram. Here’s how to swap algorithms:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece, Unigram
from tokenizers.trainers import WordPieceTrainer, UnigramTrainer
from tokenizers.pre_tokenizers import Whitespace
# For WordPiece (BERT-style)
tokenizer_wp = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer_wp.pre_tokenizer = Whitespace()  # split into words before learning subwords
trainer_wp = WordPieceTrainer(
    vocab_size=8000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
)
# For Unigram (T5/SentencePiece-style)
tokenizer_uni = Tokenizer(Unigram())
tokenizer_uni.pre_tokenizer = Whitespace()
trainer_uni = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<pad>", "<s>", "</s>"],
    unk_token="<unk>",  # tell the trainer which special token is the unknown token
)
# Training works the same way
tokenizer_wp.train(["corpus.txt"], trainer_wp)
tokenizer_uni.train(["corpus.txt"], trainer_uni)
Choose based on your use case:
- BPE: Best for general-purpose, especially if extending GPT-family models
- WordPiece: Best for BERT-family models or encoder-only architectures
- Unigram: Best for multilingual models or when you want subword regularization
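To get a feel for how the algorithms actually differ, encode the same sentence with each freshly trained tokenizer and compare the pieces. A small sketch, continuing from the snippet above (the exact splits depend entirely on your corpus):
sample = "The tokenization process converts text into numbers."
print("WordPiece:", tokenizer_wp.encode(sample).tokens)   # word-internal pieces are prefixed with ##
print("Unigram:  ", tokenizer_uni.encode(sample).tokens)  # segmentation chosen by unigram likelihood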
Tokenization Tools Comparison
Quick Comparison Table
| Tool | Speed | Algorithms | Custom Training | Pre-trained | Multilingual | Best For | When to Use |
|---|---|---|---|---|---|---|---|
| HuggingFace Tokenizers | Very Fast (~1M tok/s) | BPE, WordPiece, Unigram | Yes | Many available | Good | Custom tokenizers, flexibility | Building custom tokenizers, training models |
| Tiktoken | Extremely Fast (~2-3M tok/s) | BPE only | No | GPT models | Good (byte-level) | GPT models, token counting | Working with GPT-2/3/4, fast counting |
| SentencePiece | Moderate (~500K tok/s) | BPE, Unigram | Yes | Some available | Excellent | Multilingual models | T5/ALBERT/XLNet, multilingual text |
| Transformers AutoTokenizer | Fast (~500K-1M tok/s) | All (via tokenizers) | Yes | Thousands available | Good | Transformers models | Using HuggingFace models, convenience |
Performance Benchmarks
import time
from transformers import AutoTokenizer
import tiktoken
from tokenizers import Tokenizer
import sentencepiece as spm
text = "The tokenization process converts text into numbers. " * 1000 # ~10K chars
iterations = 100
# HuggingFace Transformers (GPT-2)
# Note: the timer starts before each tokenizer is loaded, so every measurement
# includes the one-time load cost as well as the encode loop (end-to-end usage).
start = time.time()
tokenizer_hf = AutoTokenizer.from_pretrained("gpt2")
for _ in range(iterations):
tokenizer_hf.encode(text)
hf_time = time.time() - start
hf_speed = (len(text) * iterations) / hf_time / 1000 # chars per ms
# Tiktoken
start = time.time()
enc_tiktoken = tiktoken.get_encoding("gpt2")
for _ in range(iterations):
enc_tiktoken.encode(text)
tiktoken_time = time.time() - start
tiktoken_speed = (len(text) * iterations) / tiktoken_time / 1000
# HuggingFace Tokenizers (low-level)
start = time.time()
tokenizer_low = Tokenizer.from_pretrained("gpt2")
for _ in range(iterations):
tokenizer_low.encode(text)
low_time = time.time() - start
low_speed = (len(text) * iterations) / low_time / 1000
# SentencePiece (via T5 tokenizer)
start = time.time()
tokenizer_sp = AutoTokenizer.from_pretrained("t5-base")
for _ in range(iterations):
tokenizer_sp.encode(text)
sp_time = time.time() - start
sp_speed = (len(text) * iterations) / sp_time / 1000
print(f"{'Tool':<30} {'Time (s)':<12} {'Speed (K chars/ms)':<20} {'Relative Speed'}")
print("-" * 75)
print(f"{'HuggingFace Transformers':<30} {hf_time:<12.3f} {hf_speed:<20.2f} 1.0x")
print(f"{'Tiktoken':<30} {tiktoken_time:<12.3f} {tiktoken_speed:<20.2f} {hf_time/tiktoken_time:.2f}x")
print(f"{'HuggingFace Tokenizers (low)':<30} {low_time:<12.3f} {low_speed:<20.2f} {hf_time/low_time:.2f}x")
print(f"{'SentencePiece (T5)':<30} {sp_time:<12.3f} {sp_speed:<20.2f} {hf_time/sp_time:.2f}x")
Output:
Tool                           Time (s)     Speed (chars/ms)     Relative Speed
---------------------------------------------------------------------------
HuggingFace Transformers 1.963 2699.53 1.0x
Tiktoken 0.224 23652.91 8.76x
HuggingFace Tokenizers (low) 0.947 5594.55 2.07x
SentencePiece (T5) 2.828 1874.37 0.69x
Insights:
- Tiktoken is the fastest - 8.76x faster than the Transformers wrapper for GPT tokenization
- The low-level tokenizers library is 2.07x faster - skipping the Transformers wrapper avoids its Python-side overhead
- SentencePiece is the slowest here - Python bindings and more complex processing bring it to 0.69x relative speed
- Transformers adds overhead - but provides model-aware convenience (1.0x baseline)
Decision Guide
Choose HuggingFace Tokenizers if:
- Building custom tokenizers
- Need multiple algorithms (BPE/WordPiece/Unigram)
- Want maximum flexibility
Choose Tiktoken if:
- Working with GPT models (GPT-2/3/4)
- Need fastest tokenization
- Counting tokens for API costs
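For the token-counting use case, a few lines of tiktoken go a long way. A minimal sketch; the price per 1K tokens is a made-up placeholder, so substitute your provider's actual rate:
import tiktoken

PRICE_PER_1K_TOKENS = 0.01  # USD -- placeholder rate, not a real price

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4-era models
prompt = "The tokenization process converts text into numbers. " * 100

n_tokens = len(enc.encode(prompt))
estimated_cost = n_tokens / 1000 * PRICE_PER_1K_TOKENS
print(f"{n_tokens} tokens -> ~${estimated_cost:.4f} per request at the assumed rate")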
Choose SentencePiece if:
- Building multilingual models
- Need Unigram tokenization
- Working with T5/ALBERT/XLNet
- Want language agnostic tokenization
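And if SentencePiece is the better fit, training a Unigram model directly through its Python API looks roughly like this. A sketch under assumptions: corpus.txt and the vocab size of 2000 are placeholders, and the vocab size has to be achievable on your corpus:
import sentencepiece as spm

# Train a Unigram model; writes sp_model.model and sp_model.vocab to disk
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="sp_model",
    vocab_size=2000,          # must not exceed what the corpus can support
    model_type="unigram",     # or "bpe"
    character_coverage=1.0,   # 0.9995 is the usual choice for CJK-heavy corpora
)

sp = spm.SentencePieceProcessor(model_file="sp_model.model")
print(sp.encode("The tokenization process", out_type=str))  # pieces carry a leading ▁ marker
print(sp.encode("The tokenization process"))                # the same thing as token IDs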
Choose Transformers AutoTokenizer if:
- Using HuggingFace Transformers models
- Want pre-trained tokenizers
- Prefer convenience over raw speed
Conclusion
Tokenization is an underappreciated part of language models. It’s the foundation that everything else sits on, and as we have seen, the choices you make here have real consequences for cost, performance, and model quality.
What we covered:
- Why tokenization matters - It affects API costs, training time, and model performance
- The three major algorithms - BPE (merge frequent pairs), WordPiece (merge likely pairs), and Unigram (prune unlikely tokens)
- How the numbers work - Token IDs map to embeddings that models actually process
- Real-world challenges - Why LLMs struggle with math, spelling, and multilingual text
- Practical comparisons - Tokenizer choice depends on your text; always test on your data
- Building your own - You can train custom tokenizers (BPE, WordPiece, or Unigram) tailored to your domain
- Which tool to use - Tiktoken for speed, tokenizers for flexibility, SentencePiece for multilingual
Key takeaways:
- Subword tokenization is the standard - It balances vocabulary size with compression
- Tokenization shapes model capabilities - Many LLM “failures” (arithmetic, spelling) trace back to tokenization
- There’s no universal “best” tokenizer - Performance depends on your text, language, and use case
- Non-English pays a token tax - Multilingual considerations matter for global applications
- Domain-specific tokenizers help - If your domain has specialized vocabulary, train your own
- Test on your actual data - Don’t assume; measure token counts and compare
If you found this interesting, I would love to hear your thoughts. Share it on Twitter, LinkedIn, or reach out at guptaamanthan01[at]gmail[dot]com.
References
Books:
- Raschka, Sebastian. Build a Large Language Model (From Scratch). Manning Publications, 2024.
Blog Posts & Tutorials:
- Tokenization in Transformers v5: Simpler, Clearer, and More Modular - HuggingFace, December 2025
- Subword Tokenization Algorithms - Luminary Blog
Research Papers:
- Sennrich, Rico, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units - Original BPE paper (2016)
- Kudo, Taku and John Richardson. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing - SentencePiece/Unigram paper (2018)
- Context-Dependent Tokenizer Performance Analysis - EUSIPCO 2024
- Tokenization Strategies for Large Language Models - arXiv 2024
Documentation: