Skip to main content

Command Palette

Search for a command to run...

Tokenizers Explained — How GPT, Claude, LLaMA and BERT Actually Read Your Text

A visual deep-dive into BPE, WordPiece, SentencePiece, and why your Hindi prompts cost more than English

Published
9 min read
Tokenizers Explained — How GPT, Claude, LLaMA and BERT Actually Read Your Text

When you send a prompt to ChatGPT, Claude, or any large language model, the model never actually sees your text.

It sees numbers. Just numbers.

The thing that converts your text into numbers — and back — is called a tokenizer. It is the most overlooked piece of the entire LLM stack, yet it silently controls your API bill, your context window, and even how fair the model is to non-English languages.

Let me show you how it actually works.


What Is a Token?

A token is a chunk of text. Sometimes a whole word, sometimes part of a word, sometimes a single character.

Here is the sentence IraManage is amazing tokenized by GPT-4:

"Ira"      → 41,514
"Manage"   → 26,789
" is"      →  3,083
" amazing" →  4,998

4 tokens. 4 numbers. The model sees [41514, 26789, 3083, 4998] and nothing else.

When the model generates a response, it outputs token IDs, which the tokenizer converts back into text.


Why Not Just Use Words?

The obvious approach is to map each word to a number. cat = 1, dog = 2, and so on.

This breaks for three reasons:

1. Vocabulary explosion English has roughly 170,000 words. Add inflections (run, runs, running, ran), proper nouns, scientific terms, slang, typos — and your vocabulary balloons to millions. Larger vocab = bigger model = slower inference.

2. Out-of-vocabulary words What happens when the model sees IraManage? Or a brand new word like ChatGPT that did not exist during training? With word-level tokenization, you get an <UNK> token. The model has no idea what you are talking about.

3. Multiple languages Now multiply by every language on earth. Word-level tokenization simply does not scale.


Why Not Just Use Characters?

The opposite extreme — map each character to a number.

This solves the vocab explosion (only ~100 unique characters in English), but creates a new problem: sequence length.

The sentence IraManage is amazing becomes 20 characters = 20 tokens. A 1000-word document becomes 5000+ characters. Your context window fills up instantly, training takes forever, and the model struggles to learn meaningful patterns from individual letters.


The Sweet Spot: Subword Tokenization

What if we could split common words whole but break rare words into pieces?

  • the → 1 token (super common)

  • IraManageIra + Manage (rare word, broken into known pieces)

  • tokenizationtoken + ization (suffix recognized)

This is subword tokenization. It is the basis for every modern LLM tokenizer.


How BPE Works — A Walkthrough

BPE (Byte-Pair Encoding) is the most popular tokenizer algorithm today. It is used by GPT, Claude, LLaMA, Mistral, and most modern LLMs.

The idea is brutally simple: start with characters, repeatedly merge the most common pair.

Let's train a tiny BPE on this corpus:

low low low low low
lower lower
newest newest newest
widest widest widest

Step 1 — Split everything into characters

l o w  l o w  l o w  l o w  l o w
l o w e r  l o w e r
n e w e s t  n e w e s t  n e w e s t
w i d e s t  w i d e s t  w i d e s t

Step 2 — Count adjacent pairs

Pair Count
l o 7
o w 7
e s 6
s t 6
n e 3
w i 3

Step 3 — Merge the most common pair (l olo)

lo w  lo w  lo w  lo w  lo w
lo w e r  lo w e r
n e w e s t  n e w e s t  n e w e s t
w i d e s t  w i d e s t  w i d e s t

Step 4 — Repeat until vocabulary reaches target size

Each iteration merges another popular pair. After many iterations:

  • lo + wlow

  • e + ses

  • es + test

  • low + erlower

Eventually, common words like low, lower, newest, widest become single tokens. Rare words remain broken into pieces.

That's the entire algorithm. No deep learning, no neural networks — just frequency counting and merging.


WordPiece — BERT's Tokenizer

WordPiece is very similar to BPE with one key difference: instead of merging the most frequent pair, it merges the pair that maximizes likelihood of the training data.

It also marks subword continuations with ##:

"tokenization" → ["token", "##ization"]
"IraManage"    → ["Ira", "##Manage"]

The ## prefix tells the model "this is a continuation of the previous token, not a new word".

Used by: BERT, DistilBERT, Electra, RoBERTa (a few variants).


SentencePiece — The Multilingual Champion

SentencePiece is what Google built for languages that don't use spaces (Japanese, Chinese, Thai). It treats the input as a raw stream of bytes — spaces included as actual characters.

Spaces become (a special underscore character):

"IraManage is amazing"
→ ["▁Ira", "Manage", "▁is", "▁amazing"]

This makes detokenization trivial: just replace with a space and concatenate. No language-specific rules needed.

Used by: T5, mT5, ALBERT, XLNet, Gemini, LLaMA (variant).


Which Model Uses Which Tokenizer?

Model Family Tokenizer Vocab Size Library
GPT-2, GPT-3 BPE 50,257 tiktoken
GPT-4, GPT-4o BPE (cl100k) 100,277 tiktoken
GPT-4o (Omni) BPE (o200k) 200,019 tiktoken
Claude (all versions) BPE (proprietary) ~100k anthropic SDK
BERT WordPiece 30,522 transformers
LLaMA, LLaMA 2 SentencePiece (BPE) 32,000 sentencepiece
LLaMA 3 tiktoken-based BPE 128,256 tiktoken
Mistral SentencePiece (BPE) 32,000 sentencepiece
T5 SentencePiece (Unigram) 32,128 sentencepiece
Gemini SentencePiece variant ~256k google-genai

Key insight: larger vocab = more text fits in fewer tokens = cheaper API calls + bigger effective context window. This is why GPT-4o's o200k tokenizer is better than GPT-3's r50k.


Why This Matters in Practice

1. Your API Bill Depends on the Tokenizer

OpenAI, Anthropic, and Google all charge per token. The same text can produce dramatically different token counts on different tokenizers.

The English sentence Hello, how are you doing today?:

Tokenizer Tokens
GPT-4o (o200k) 8
Claude 9
LLaMA 2 11
BERT 9

Same meaning, different cost.

2. Non-English Languages Are Penalised

This is the dirty secret of modern LLMs. Tokenizers are trained primarily on English text, so they tokenize English efficiently and other languages inefficiently.

The same sentence in different languages on GPT-4:

Language Text Tokens
English Hello, how are you? 6
Hindi नमस्ते, आप कैसे हैं? 18
Tamil வணக்கம், எப்படி இருக்கிறீர்கள்? 32

Hindi costs 3x more than English. Tamil costs 5x more. For the exact same meaning.

3. The Strawberry Problem

You may have seen viral tweets about GPT-4 failing to count the letter r in strawberry. This is a tokenizer problem.

strawberry is tokenized as ["str", "aw", "berry"]. The model never sees individual characters — it sees three sub-word chunks. Asking it to count individual letters is like asking you to count the brush strokes in a painted word. The information is just not there at that level.

4. Special Tokens Matter

Every tokenizer reserves special tokens:

  • <|endoftext|> — marks document boundaries

  • <|im_start|>, <|im_end|> — chat message boundaries

  • <|tool_call|> — function calling markers (newer models)

  • [CLS], [SEP], [MASK] — BERT specifics

If user input contains text that looks like these tokens, naive systems can be tricked. This is one root cause of prompt injection vulnerabilities.


Try It Yourself

Three tools that make this concrete:

1. OpenAI's official tokenizer platform.openai.com/tokenizer — paste any text, see GPT's tokens with color coding.

2. tiktokenizer.vercel.app Switch between GPT, Claude, LLaMA, Mistral tokenizers and see the difference live.

3. Python (locally)

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("IraManage is amazing")
print(tokens)        # [41514, 26789, 3083, 4998]
print(len(tokens))   # 4

# Decode back
print(enc.decode(tokens))  # "IraManage is amazing"

Install with: pip install tiktoken

For Hugging Face models:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("IraManage is amazing"))
# ['ira', '##man', '##age', 'is', 'amazing']

Key Takeaways

  • LLMs don't see text — they see token IDs (just numbers)

  • Subword tokenization is the sweet spot between word-level and character-level

  • BPE is the dominant algorithm today (GPT, Claude, LLaMA, Mistral)

  • WordPiece powers BERT, SentencePiece powers T5/Gemini and is great for multilingual

  • Tokenizers directly affect your API cost, context window, and fairness across languages

  • Non-English languages cost 2–5x more tokens for the same meaning — this is a known limitation

Tokenizers are the invisible layer that everyone uses but few understand. Once you see it, you start noticing it everywhere — in your API bills, in benchmark numbers, in why certain model quirks exist.


Building with LLMs in your product? IraManage uses LLM-powered features for natural language queries on society data. Visit iramanage.com or reach out at contact@iramanage.com.