Tokenizers Explained — How GPT, Claude, LLaMA and BERT Actually Read Your Text

When you send a prompt to ChatGPT, Claude, or any large language model, the model never actually sees your text.

It sees numbers. Just numbers.

The thing that converts your text into numbers — and back — is called a tokenizer. It is the most overlooked piece of the entire LLM stack, yet it silently controls your API bill, your context window, and even how fair the model is to non-English languages.

Let me show you how it actually works.

What Is a Token?

A token is a chunk of text. Sometimes a whole word, sometimes part of a word, sometimes a single character.

Here is the sentence IraManage is amazing tokenized by GPT-4:

"Ira"      → 41,514
"Manage"   → 26,789
" is"      →  3,083
" amazing" →  4,998

4 tokens. 4 numbers. The model sees [41514, 26789, 3083, 4998] and nothing else.

When the model generates a response, it outputs token IDs, which the tokenizer converts back into text.

Why Not Just Use Words?

The obvious approach is to map each word to a number. cat = 1, dog = 2, and so on.

This breaks for three reasons:

1. Vocabulary explosion English has roughly 170,000 words. Add inflections (run, runs, running, ran), proper nouns, scientific terms, slang, typos — and your vocabulary balloons to millions. Larger vocab = bigger model = slower inference.

2. Out-of-vocabulary words What happens when the model sees IraManage? Or a brand new word like ChatGPT that did not exist during training? With word-level tokenization, you get an <UNK> token. The model has no idea what you are talking about.

3. Multiple languages Now multiply by every language on earth. Word-level tokenization simply does not scale.

Why Not Just Use Characters?

The opposite extreme — map each character to a number.

This solves the vocab explosion (only ~100 unique characters in English), but creates a new problem: sequence length.

The sentence IraManage is amazing becomes 20 characters = 20 tokens. A 1000-word document becomes 5000+ characters. Your context window fills up instantly, training takes forever, and the model struggles to learn meaningful patterns from individual letters.

The Sweet Spot: Subword Tokenization

What if we could split common words whole but break rare words into pieces?

the → 1 token (super common)
IraManage → Ira + Manage (rare word, broken into known pieces)
tokenization → token + ization (suffix recognized)

This is subword tokenization. It is the basis for every modern LLM tokenizer.

How BPE Works — A Walkthrough

BPE (Byte-Pair Encoding) is the most popular tokenizer algorithm today. It is used by GPT, Claude, LLaMA, Mistral, and most modern LLMs.

The idea is brutally simple: start with characters, repeatedly merge the most common pair.

Let's train a tiny BPE on this corpus:

low low low low low
lower lower
newest newest newest
widest widest widest

Step 1 — Split everything into characters

l o w  l o w  l o w  l o w  l o w
l o w e r  l o w e r
n e w e s t  n e w e s t  n e w e s t
w i d e s t  w i d e s t  w i d e s t

Step 2 — Count adjacent pairs

Pair	Count
`l o`	7
`o w`	7
`e s`	6
`s t`	6
`n e`	3
`w i`	3

Step 3 — Merge the most common pair (`l o` → `lo`)

lo w  lo w  lo w  lo w  lo w
lo w e r  lo w e r
n e w e s t  n e w e s t  n e w e s t
w i d e s t  w i d e s t  w i d e s t

Step 4 — Repeat until vocabulary reaches target size

Each iteration merges another popular pair. After many iterations:

lo + w → low
e + s → es
es + t → est
low + er → lower

Eventually, common words like low, lower, newest, widest become single tokens. Rare words remain broken into pieces.

That's the entire algorithm. No deep learning, no neural networks — just frequency counting and merging.

WordPiece — BERT's Tokenizer

WordPiece is very similar to BPE with one key difference: instead of merging the most frequent pair, it merges the pair that maximizes likelihood of the training data.

It also marks subword continuations with ##:

"tokenization" → ["token", "##ization"]
"IraManage"    → ["Ira", "##Manage"]

The ## prefix tells the model "this is a continuation of the previous token, not a new word".

Used by: BERT, DistilBERT, Electra, RoBERTa (a few variants).

SentencePiece — The Multilingual Champion

SentencePiece is what Google built for languages that don't use spaces (Japanese, Chinese, Thai). It treats the input as a raw stream of bytes — spaces included as actual characters.

Spaces become ▁ (a special underscore character):

"IraManage is amazing"
→ ["▁Ira", "Manage", "▁is", "▁amazing"]

This makes detokenization trivial: just replace ▁ with a space and concatenate. No language-specific rules needed.

Used by: T5, mT5, ALBERT, XLNet, Gemini, LLaMA (variant).

Which Model Uses Which Tokenizer?

Model Family	Tokenizer	Vocab Size	Library
GPT-2, GPT-3	BPE	50,257	tiktoken
GPT-4, GPT-4o	BPE (cl100k)	100,277	tiktoken
GPT-4o (Omni)	BPE (o200k)	200,019	tiktoken
Claude (all versions)	BPE (proprietary)	~100k	anthropic SDK
BERT	WordPiece	30,522	transformers
LLaMA, LLaMA 2	SentencePiece (BPE)	32,000	sentencepiece
LLaMA 3	tiktoken-based BPE	128,256	tiktoken
Mistral	SentencePiece (BPE)	32,000	sentencepiece
T5	SentencePiece (Unigram)	32,128	sentencepiece
Gemini	SentencePiece variant	~256k	google-genai

Key insight: larger vocab = more text fits in fewer tokens = cheaper API calls + bigger effective context window. This is why GPT-4o's o200k tokenizer is better than GPT-3's r50k.

Why This Matters in Practice

1. Your API Bill Depends on the Tokenizer

OpenAI, Anthropic, and Google all charge per token. The same text can produce dramatically different token counts on different tokenizers.

The English sentence Hello, how are you doing today?:

Tokenizer	Tokens
GPT-4o (o200k)	8
Claude	9
LLaMA 2	11
BERT	9

Same meaning, different cost.

2. Non-English Languages Are Penalised

This is the dirty secret of modern LLMs. Tokenizers are trained primarily on English text, so they tokenize English efficiently and other languages inefficiently.

The same sentence in different languages on GPT-4:

Language	Text	Tokens
English	`Hello, how are you?`	6
Hindi	`नमस्ते, आप कैसे हैं?`	18
Tamil	`வணக்கம், எப்படி இருக்கிறீர்கள்?`	32

Hindi costs 3x more than English. Tamil costs 5x more. For the exact same meaning.

3. The Strawberry Problem

You may have seen viral tweets about GPT-4 failing to count the letter r in strawberry. This is a tokenizer problem.

strawberry is tokenized as ["str", "aw", "berry"]. The model never sees individual characters — it sees three sub-word chunks. Asking it to count individual letters is like asking you to count the brush strokes in a painted word. The information is just not there at that level.

4. Special Tokens Matter

Every tokenizer reserves special tokens:

<|endoftext|> — marks document boundaries
<|im_start|>, <|im_end|> — chat message boundaries
<|tool_call|> — function calling markers (newer models)
[CLS], [SEP], [MASK] — BERT specifics

If user input contains text that looks like these tokens, naive systems can be tricked. This is one root cause of prompt injection vulnerabilities.

Try It Yourself

Three tools that make this concrete:

1. OpenAI's official tokenizer platform.openai.com/tokenizer — paste any text, see GPT's tokens with color coding.

2. tiktokenizer.vercel.app Switch between GPT, Claude, LLaMA, Mistral tokenizers and see the difference live.

3. Python (locally)

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("IraManage is amazing")
print(tokens)        # [41514, 26789, 3083, 4998]
print(len(tokens))   # 4

# Decode back
print(enc.decode(tokens))  # "IraManage is amazing"

Install with: pip install tiktoken

For Hugging Face models:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("IraManage is amazing"))
# ['ira', '##man', '##age', 'is', 'amazing']

Key Takeaways

LLMs don't see text — they see token IDs (just numbers)
Subword tokenization is the sweet spot between word-level and character-level
BPE is the dominant algorithm today (GPT, Claude, LLaMA, Mistral)
WordPiece powers BERT, SentencePiece powers T5/Gemini and is great for multilingual
Tokenizers directly affect your API cost, context window, and fairness across languages
Non-English languages cost 2–5x more tokens for the same meaning — this is a known limitation

Tokenizers are the invisible layer that everyone uses but few understand. Once you see it, you start noticing it everywhere — in your API bills, in benchmark numbers, in why certain model quirks exist.

Building with LLMs in your product? IraManage uses LLM-powered features for natural language queries on society data. Visit iramanage.com or reach out at contact@iramanage.com.

Tokenizers Explained — How GPT, Claude, LLaMA and BERT Actually Read Your Text

What Is a Token?

Why Not Just Use Words?

Why Not Just Use Characters?

The Sweet Spot: Subword Tokenization

How BPE Works — A Walkthrough

Step 1 — Split everything into characters

Step 2 — Count adjacent pairs

Step 3 — Merge the most common pair (`l o` → `lo`)

Step 4 — Repeat until vocabulary reaches target size

WordPiece — BERT's Tokenizer

SentencePiece — The Multilingual Champion

Which Model Uses Which Tokenizer?

Why This Matters in Practice

1. Your API Bill Depends on the Tokenizer

2. Non-English Languages Are Penalised

3. The Strawberry Problem

4. Special Tokens Matter

Try It Yourself

Key Takeaways

Comments

More from this blog

I Named My SaaS After My Daughter — Here's Why I Built IraManage

Command Palette

What Is a Token?

Why Not Just Use Words?

Why Not Just Use Characters?

The Sweet Spot: Subword Tokenization

How BPE Works — A Walkthrough

Step 1 — Split everything into characters

Step 2 — Count adjacent pairs

Step 3 — Merge the most common pair (l o → lo)

Step 4 — Repeat until vocabulary reaches target size

WordPiece — BERT's Tokenizer

SentencePiece — The Multilingual Champion

Which Model Uses Which Tokenizer?

Why This Matters in Practice

1. Your API Bill Depends on the Tokenizer

2. Non-English Languages Are Penalised

3. The Strawberry Problem

4. Special Tokens Matter

Try It Yourself

Key Takeaways

Comments

More from this blog

Step 3 — Merge the most common pair (`l o` → `lo`)