Tokenizers Explained — How GPT, Claude, LLaMA and BERT Actually Read Your Text
A visual deep-dive into BPE, WordPiece, SentencePiece, and why your Hindi prompts cost more than English

When you send a prompt to ChatGPT, Claude, or any large language model, the model never actually sees your text.
It sees numbers. Just numbers.
The thing that converts your text into numbers — and back — is called a tokenizer. It is the most overlooked piece of the entire LLM stack, yet it silently controls your API bill, your context window, and even how fair the model is to non-English languages.
Let me show you how it actually works.
What Is a Token?
A token is a chunk of text. Sometimes a whole word, sometimes part of a word, sometimes a single character.
Here is how a GPT-style tokenizer might split the sentence IraManage is amazing (the IDs are illustrative; the real values depend on which encoding you use):
"Ira" → 41514
"Manage" → 26789
" is" → 3083
" amazing" → 4998
4 tokens. 4 numbers. The model sees [41514, 26789, 3083, 4998] and nothing else.
When the model generates a response, it outputs token IDs, which the tokenizer converts back into text.
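A toy round trip makes this concrete. The sketch below uses a made-up four-entry vocabulary (the pieces and IDs are invented, not taken from any real tokenizer) and a simple longest-match lookup to show text going to IDs and back.

```python
# Hypothetical toy vocabulary: the pieces and IDs are invented for illustration.
TOY_VOCAB = {"Ira": 0, "Manage": 1, " is": 2, " amazing": 3}
ID_TO_PIECE = {i: piece for piece, i in TOY_VOCAB.items()}


def encode(text: str) -> list[int]:
    """Greedily match the longest known piece at each position."""
    ids = []
    pos = 0
    while pos < len(text):
        match = max(
            (p for p in TOY_VOCAB if text.startswith(p, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no vocabulary piece matches at position {pos}")
        ids.append(TOY_VOCAB[match])
        pos += len(match)
    return ids


def decode(ids: list[int]) -> str:
    """Map each ID back to its piece and join the pieces."""
    return "".join(ID_TO_PIECE[i] for i in ids)


ids = encode("IraManage is amazing")
print(ids)          # [0, 1, 2, 3]
print(decode(ids))  # IraManage is amazing
```

Real tokenizers have vocabularies of tens or hundreds of thousands of pieces, but the contract is the same: encode produces integers, decode reverses it.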
Why Not Just Use Words?
The obvious approach is to map each word to a number. cat = 1, dog = 2, and so on.
This breaks for three reasons:
1. Vocabulary explosion. English has roughly 170,000 words. Add inflections (run, runs, running, ran), proper nouns, scientific terms, slang, and typos, and your vocabulary balloons to millions. Larger vocab = bigger model = slower inference.
2. Out-of-vocabulary words. What happens when the model sees IraManage? Or a brand new word like ChatGPT that did not exist during training? With word-level tokenization, you get an <UNK> token. The model has no idea what you are talking about.
3. Multiple languages. Now multiply by every language on earth. Word-level tokenization simply does not scale.
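The out-of-vocabulary failure is easy to reproduce. Here is a minimal sketch of a word-level tokenizer, assuming a tiny two-sentence training corpus invented for this example; any word it has not seen collapses to the same <UNK> ID.

```python
# Word-level tokenizer sketch: every unseen word collapses to one <UNK> ID.
UNK = "<UNK>"


def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an ID to each distinct whitespace-separated word."""
    vocab = {UNK: 0}
    for sentence in corpus:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_words(text: str, vocab: dict[str, int]) -> list[int]:
    """Look up each word; anything unseen becomes the <UNK> ID."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]


vocab = build_vocab(["the cat is amazing", "the dog is here"])
print(encode_words("the cat is amazing", vocab))    # [1, 2, 3, 4]
print(encode_words("IraManage is amazing", vocab))  # [0, 3, 4]
```

Once IraManage has become ID 0, nothing downstream can tell it apart from any other unknown word.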
Why Not Just Use Characters?
The opposite extreme — map each character to a number.
This solves the vocab explosion (only ~100 unique characters in English), but creates a new problem: sequence length.
The sentence IraManage is amazing becomes 20 characters = 20 tokens. A 1000-word document becomes 5000+ characters. Your context window fills up instantly, training takes forever, and the model struggles to learn meaningful patterns from individual letters.
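The trade-off shows up in a few lines of Python. This sketch simply compares sequence lengths for the example sentence; the character-to-ID mapping is an arbitrary one built on the spot.

```python
text = "IraManage is amazing"

# Character-level: one token per character.
char_tokens = list(text)
# Word-level, for comparison: one token per whitespace-separated word.
word_tokens = text.split()

print(len(char_tokens))  # 20
print(len(word_tokens))  # 3

# The vocabulary stays tiny, but every sequence gets long.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in text]
print(len(char_vocab))   # 12 distinct characters in this sentence
```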
The Sweet Spot: Subword Tokenization
What if we could keep common words whole but break rare words into pieces?
the → 1 token (super common)
IraManage → Ira + Manage (rare word, broken into known pieces)
tokenization → token + ization (suffix recognized)
This is subword tokenization. It is the basis for every modern LLM tokenizer.
How BPE Works — A Walkthrough
BPE (Byte-Pair Encoding) is the most popular tokenizer algorithm today. It is used by GPT, Claude, LLaMA, Mistral, and most modern LLMs.
The idea is brutally simple: start with characters, repeatedly merge the most common pair.
Let's train a tiny BPE on this corpus:
low low low low low
lower lower
newest newest newest
widest widest widest
Step 1 — Split everything into characters
l o w l o w l o w l o w l o w
l o w e r l o w e r
n e w e s t n e w e s t n e w e s t
w i d e s t w i d e s t w i d e s t
Step 2 — Count adjacent pairs
| Pair | Count |
|---|---|
| l o | 7 |
| o w | 7 |
| e s | 6 |
| s t | 6 |
| n e | 3 |
| w i | 3 |
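The counts can be checked with a short script. This is a sketch that assumes pairs are counted inside words only, with each word weighted by how often it appears in the toy corpus. It also surfaces the pair w e with a count of 5, which the table leaves out.

```python
from collections import Counter

# Word frequencies from the toy corpus above.
word_freqs = {"low": 5, "lower": 2, "newest": 3, "widest": 3}

pair_counts = Counter()
for word, freq in word_freqs.items():
    symbols = list(word)  # start from single characters
    for left, right in zip(symbols, symbols[1:]):
        pair_counts[(left, right)] += freq  # pairs never cross word boundaries

for pair, count in pair_counts.most_common(5):
    print(pair, count)
```

Note that l o and o w tie at 7, so which one gets merged first is a tie-breaking choice rather than something the counts decide.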
Step 3 — Merge the most common pair (l o → lo)
lo w lo w lo w lo w lo w
lo w e r lo w e r
n e w e s t n e w e s t n e w e s t
w i d e s t w i d e s t w i d e s t
Step 4 — Repeat until vocabulary reaches target size
Each iteration merges another popular pair. After many iterations:
lo + w → low
e + s → es
es + t → est
low + er → lower
Eventually, common words like low, lower, newest, widest become single tokens. Rare words remain broken into pieces.
That's the entire algorithm. No deep learning, no neural networks — just frequency counting and merging.
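To back that up, here is a minimal training loop for the toy corpus. It is a sketch, not how production tokenizers are written: it breaks ties by taking the first pair encountered, works on characters rather than bytes, and the function names train_bpe and merge_pair are my own.

```python
from collections import Counter


def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with the joined symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(word_freqs: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from word frequencies."""
    # Each word starts as a tuple of single-character symbols.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Ties go to the pair seen first; real implementations pick their own rule.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        words = {merge_pair(symbols, best): freq for symbols, freq in words.items()}
    return merges


corpus = {"low": 5, "lower": 2, "newest": 3, "widest": 3}
print(train_bpe(corpus, num_merges=4))
# [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')]
```

The learned merge list is the tokenizer. To encode new text, you split it into characters and replay the merges in the order they were learned.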
WordPiece — BERT's Tokenizer
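WordPiece is very similar to BPE with one key difference: instead of merging the most frequent pair, it merges the pair that most increases the likelihood of the training data. In the usual formulation, each pair is scored as count(pair) / (count(first) × count(second)), so two pieces that are individually rare but almost always appear together beat a pair that is merely frequent.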
It also marks subword continuations with ##:
"tokenization" → ["token", "##ization"]
"IraManage" → ["Ira", "##Manage"]
The ## prefix tells the model "this is a continuation of the previous token, not a new word".
Used by: BERT, DistilBERT, ELECTRA, and a few RoBERTa variants (the original RoBERTa uses byte-level BPE).
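At encoding time, BERT-style WordPiece applies its vocabulary with a greedy longest-match-first scan. The sketch below shows that scan for a single word; the mini-vocabulary is hypothetical and invented for this example, and the function name is mine.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first split of a single word, BERT style."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical mini-vocabulary, invented for this example.
toy_vocab = {"token", "##ization", "##ize", "Ira", "##Manage"}
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("IraManage", toy_vocab))     # ['Ira', '##Manage']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

This covers only how a trained vocabulary is applied. Training the vocabulary, which is where the likelihood criterion comes in, is a separate step.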
SentencePiece — The Multilingual Champion
SentencePiece is what Google built for languages that don't use spaces (Japanese, Chinese, Thai). It treats the input as a raw stream of Unicode characters, with spaces included as ordinary characters, so the text never has to be pre-split into words.
Spaces become ▁ (a special underscore character):
"IraManage is amazing"
→ ["▁Ira", "Manage", "▁is", "▁amazing"]
This makes detokenization trivial: just replace ▁ with a space and concatenate. No language-specific rules needed.
Used by: T5, mT5, ALBERT, XLNet, Gemini, LLaMA 1 and 2.
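The space-marker trick is simple enough to sketch without the library. The code below only imitates the ▁ convention with plain string operations; it is not the SentencePiece API, and the helper names are my own.

```python
WS = "\u2581"  # "▁", the marker SentencePiece uses for a space


def mark_spaces(text: str) -> str:
    """Prefix the text with the marker and replace every space with it."""
    return WS + text.replace(" ", WS)


def detokenize(pieces: list[str]) -> str:
    """Concatenate pieces, turn markers back into spaces, drop the leading one."""
    text = "".join(pieces).replace(WS, " ")
    return text[1:] if text.startswith(" ") else text


print(mark_spaces("IraManage is amazing"))  # ▁IraManage▁is▁amazing
pieces = ["▁Ira", "Manage", "▁is", "▁amazing"]
print(detokenize(pieces))                   # IraManage is amazing
```

Because the space lives inside the tokens, the same two lines of detokenization work for English, Japanese, or any other script.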
Which Model Uses Which Tokenizer?
| Model Family | Tokenizer | Vocab Size | Library |
|---|---|---|---|
| GPT-2, GPT-3 | BPE | 50,257 | tiktoken |
| GPT-3.5, GPT-4 | BPE (cl100k) | 100,277 | tiktoken |
| GPT-4o (Omni) | BPE (o200k) | 200,019 | tiktoken |
| Claude (all versions) | BPE (proprietary) | ~100k | anthropic SDK |
| BERT | WordPiece | 30,522 | transformers |
| LLaMA, LLaMA 2 | SentencePiece (BPE) | 32,000 | sentencepiece |
| LLaMA 3 | tiktoken-based BPE | 128,256 | tiktoken |
| Mistral | SentencePiece (BPE) | 32,000 | sentencepiece |
| T5 | SentencePiece (Unigram) | 32,128 | sentencepiece |
| Gemini | SentencePiece variant | ~256k | google-genai |
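Key insight: a larger vocabulary generally packs more text into fewer tokens, which means cheaper API calls and a bigger effective context window. The trade-off is a larger embedding table inside the model. This is why GPT-4o's o200k tokenizer is more efficient than GPT-3's r50k.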
Why This Matters in Practice
1. Your API Bill Depends on the Tokenizer
OpenAI, Anthropic, and Google all charge per token. The same text can produce dramatically different token counts on different tokenizers.
The English sentence Hello, how are you doing today?:
| Tokenizer | Tokens |
|---|---|
| GPT-4o (o200k) | 8 |
| Claude | 9 |
| LLaMA 2 | 11 |
| BERT | 9 |
Same meaning, different cost.
2. Non-English Languages Are Penalised
This is the dirty secret of modern LLMs. Tokenizers are trained primarily on English text, so they tokenize English efficiently and other languages inefficiently.
The same sentence in different languages on GPT-4:
| Language | Text | Tokens |
|---|---|---|
| English | Hello, how are you? | 6 |
| Hindi | नमस्ते, आप कैसे हैं? | 18 |
| Tamil | வணக்கம், எப்படி இருக்கிறீர்கள்? | 32 |
Hindi takes 3x as many tokens as English, and Tamil more than 5x, for the exact same meaning.
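The arithmetic behind those multipliers is simple. In this sketch the token counts come from the table above, while the price per million tokens is a hypothetical figure chosen only to make the numbers visible.

```python
# Token counts quoted in the table above.
token_counts = {"English": 6, "Hindi": 18, "Tamil": 32}

# Hypothetical price, chosen only to make the arithmetic visible.
PRICE_PER_MILLION_TOKENS_USD = 5.00

baseline = token_counts["English"]
ratios = {language: tokens / baseline for language, tokens in token_counts.items()}

for language, tokens in token_counts.items():
    cost = tokens * PRICE_PER_MILLION_TOKENS_USD / 1_000_000
    print(f"{language}: {tokens} tokens, {ratios[language]:.1f}x English, ${cost:.6f}")
```

The same ratio applies to the context window: a prompt that needs 5x the tokens also uses up 5x the space.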
3. The Strawberry Problem
You may have seen viral tweets about GPT-4 failing to count the letter r in strawberry. This is largely a tokenizer problem.
strawberry is tokenized as ["str", "aw", "berry"]. The model never sees individual characters. It sees three sub-word chunks. Asking it to count individual letters is like asking you to count the brush strokes in a painted word. The information is not directly visible at that level.
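A few lines of Python show the gap between what your code can see and what the model receives. The split is the one quoted above; the IDs are made up for illustration.

```python
word = "strawberry"
pieces = ["str", "aw", "berry"]  # the split quoted above
assert "".join(pieces) == word

# With character access, counting is trivial.
print(word.count("r"))  # 3

# The model instead receives one opaque ID per piece (IDs here are made up).
toy_ids = {"str": 101, "aw": 102, "berry": 103}
print([toy_ids[p] for p in pieces])  # [101, 102, 103]

# The per-piece letter counts exist only if something looks inside each piece.
print({p: p.count("r") for p in pieces})  # {'str': 1, 'aw': 0, 'berry': 2}
```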
4. Special Tokens Matter
Every tokenizer reserves special tokens:
<|endoftext|>: marks document boundaries
<|im_start|>, <|im_end|>: chat message boundaries
<|tool_call|>: function calling markers (newer models)
[CLS], [SEP], [MASK]: BERT specifics
If user input contains text that looks like these tokens, naive systems can be tricked. This is one root cause of prompt injection vulnerabilities.
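One defensive option is to scan user input for special-token strings before it reaches the tokenizer. The sketch below is an assumption about how such a guard could look, not a complete defence: the token list is only the examples named above, and the helper names are hypothetical.

```python
import re

# Special-token strings to guard against; extend this for your model's tokenizer.
SPECIAL_TOKENS = ["<|endoftext|>", "<|im_start|>", "<|im_end|>"]
_SPECIAL_RE = re.compile("|".join(re.escape(t) for t in SPECIAL_TOKENS))


def find_special_tokens(user_text: str) -> list[str]:
    """Return every special-token string that appears literally in the input."""
    return _SPECIAL_RE.findall(user_text)


def strip_special_tokens(user_text: str) -> str:
    """Remove special-token strings, repeating until none remain."""
    previous = None
    while previous != user_text:
        # One pass is not enough: removing a token can splice a new one together.
        previous = user_text
        user_text = _SPECIAL_RE.sub("", user_text)
    return user_text


print(find_special_tokens("hello <|im_end|> world"))      # ['<|im_end|>']
print(strip_special_tokens("hi <|im_<|im_end|>end|> !"))  # hi  !
```

Some libraries guard against this themselves. tiktoken's encode, for instance, refuses text containing special tokens unless you opt in through its allowed_special and disallowed_special parameters.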
Try It Yourself
Three tools that make this concrete:
1. OpenAI's official tokenizer: platform.openai.com/tokenizer. Paste any text and see GPT's tokens with color coding.
2. tiktokenizer.vercel.app: switch between GPT, Claude, LLaMA, and Mistral tokenizers and see the difference live.
3. Python (locally)
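import tiktoken
# "gpt-4o" resolves to the o200k_base encoding; use "gpt-4" for cl100k_base
enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("IraManage is amazing")
print(tokens)       # a list of integer token IDs; exact values depend on the encoding
print(len(tokens))  # the number of tokens you are billed for
# Decode back to the original string
print(enc.decode(tokens))  # IraManage is amazing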
Install with: pip install tiktoken
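For Hugging Face models (pip install transformers):
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("IraManage is amazing"))
# Something like ['ira', '##man', '##age', 'is', 'amazing']
# (the uncased model lowercases first; the exact split depends on its vocabulary)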
Key Takeaways
LLMs don't see text — they see token IDs (just numbers)
Subword tokenization is the sweet spot between word-level and character-level
BPE is the dominant algorithm today (GPT, Claude, LLaMA, Mistral)
WordPiece powers BERT; SentencePiece powers T5 and Gemini and handles multilingual text well
Tokenizers directly affect your API cost, context window, and fairness across languages
Non-English languages cost 2–5x more tokens for the same meaning — this is a known limitation
Tokenizers are the invisible layer that everyone uses but few understand. Once you see it, you start noticing it everywhere — in your API bills, in benchmark numbers, in why certain model quirks exist.
