Tokens per Word: GPT-5 vs Claude vs GPT-4, Measured (2026)
We ran the same seven-language passage, plus code, JSON, Markdown, emoji, and CSV samples, through five tokenizers, taking exact counts from tiktoken for the GPT family and from Anthropic's official count-tokens API for Claude. Here is what a word really costs, and the full dataset is free to download.
- Why tokens per word decides your bill
- The dataset and how it was measured
- Tokens per word by language
- Same meaning, different price
- The o200k effect: three GPT generations
- Claude counts twice: Opus 4.8 vs Sonnet 4.6
- Code, JSON, and CSV cost more than prose
- Emoji are expensive
- What a million words costs
- Reproduce the numbers
Why tokens per word decides your bill
Every large language model bills by the token, never by the word. The exchange rate between those two units is where API budgets quietly drift. Most planning guides repeat the same rule of thumb: one token is about three quarters of an English word. That figure is roughly right for English on a modern tokenizer, and increasingly wrong for everything else: other languages, source code, structured data, and emoji all convert at their own rates.
Published numbers on this are surprisingly thin, so we measured it. This article reports exact token counts for the same content across five tokenizers and three model families, with the corpus and results downloadable below. If you budget LLM usage in any language other than English, the differences are large enough to change your projections.
The dataset and how it was measured
The corpus has 13 samples. Seven are human translations of the same 94-word passage about editing, in English, Spanish, Portuguese, French, German, Chinese, and Japanese, so the cross-language comparison holds meaning constant rather than length. The other six cover the text developers actually send to models: Python, JavaScript, a JSON order record, a Markdown document, an emoji-heavy social post, and CSV numeric data.
Counts for the GPT family come from tiktoken, OpenAI's published tokenizer, so they are exact: o200k_base (GPT-5, GPT-4o, the o-series), cl100k_base (GPT-4, GPT-3.5), and the GPT-3 era p50k_base for historical contrast. Claude counts come from Anthropic's official count-tokens API endpoint, which reports the billable figure per model. The endpoint counts the whole request, so we measured the fixed message envelope (6 tokens on Opus 4.8, 7 on Sonnet 4.6 and Haiku 4.5) and subtracted it, then verified the calibration with a doubling check that came back with zero drift. Absolute Claude counts carry about one token of uncertainty; ratios are unaffected.
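The envelope subtraction and doubling check described above can be sketched as follows. This is one plausible reading of the procedure, not the exact script behind the dataset: `count_request` is a hypothetical callable standing in for a call to the count-tokens endpoint, and `fake_count_request` is a toy counter used only so the example runs offline.

```python
from typing import Callable

# Hypothetical signature: takes the message text and returns the raw,
# request-level token count (content plus the fixed message envelope).
CountFn = Callable[[str], int]


def content_tokens(raw_count: int, envelope: int) -> int:
    """Subtract the fixed per-request envelope from a raw count."""
    return raw_count - envelope


def doubling_drift(count_request: CountFn, text: str, envelope: int) -> int:
    """Net count of the doubled text minus twice the net count of the text.

    Zero drift suggests the envelope value is calibrated correctly.
    Assumes the sample ends on a clean boundary (for example a trailing
    newline) so that no tokens merge across the join.
    """
    single = content_tokens(count_request(text), envelope)
    doubled = content_tokens(count_request(text + text), envelope)
    return doubled - 2 * single


def fake_count_request(text: str) -> int:
    """Toy stand-in, NOT a real tokenizer: one token per word plus 7."""
    return len(text.split()) + 7


sample = "the quick brown fox\n"
drift = doubling_drift(fake_count_request, sample, envelope=7)
wrong_drift = doubling_drift(fake_count_request, sample, envelope=6)
print(drift, wrong_drift)
```

With the correct envelope the drift is zero; an envelope that is off by one token shows up as a drift of one, which is what makes the check useful as a calibration test.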
Gemini is excluded from the measurements because Google does not publish its tokenizer and we had no countTokens access to verify against; we would rather scope the data honestly than estimate.
Tokens per word by language
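The headline table shows the same passage, with the same meaning, on four of the five tokenizers (the GPT-3 era p50k_base is left out here):
| Language | Words | GPT-5 (o200k) tokens | GPT-5 tokens/word | GPT-4 (cl100k) tokens | Claude Sonnet 4.6 tokens | Claude Opus 4.8 tokens |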
|---|---|---|---|---|---|---|
| English | 94 | 110 | 1.17 | 110 | 116 | 177 |
| Spanish | 107 | 143 | 1.34 | 172 | 184 | 256 |
| Portuguese | 102 | 137 | 1.34 | 176 | 188 | 241 |
| French | 109 | 153 | 1.40 | 194 | 207 | 275 |
| German | 93 | 159 | 1.71 | 203 | 245 | 324 |
| Chinese | n/a | 159 | n/a | 223 | 217 | 216 |
| Japanese | n/a | 205 | n/a | 268 | 241 | 240 |
English is the cheapest language in every column: 110 tokens for 94 words on GPT-5, or about 1.17 tokens per word. That is roughly 0.85 words per token, so the popular 0.75-words-per-token rule is close for English prose but slightly overstates its cost. Spanish runs 1.34 tokens per word on the same encoding, Portuguese 1.34, French 1.40, and German, with its long compounds, 1.71. Chinese and Japanese have no whitespace word boundaries, so per-word figures are not applicable; the next section compares them on equal meaning instead.
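For readers who want to check the per-word figures, here is a minimal sketch of the conversion in both directions. The word and token counts are copied from the GPT-5 column of the table; the names `SAMPLES`, `tokens_per_word`, and `words_per_token` are illustrative.

```python
# (words, GPT-5 o200k tokens) for the parallel passage, from the table above.
SAMPLES = {
    "English": (94, 110),
    "Spanish": (107, 143),
    "Portuguese": (102, 137),
    "French": (109, 153),
    "German": (93, 159),
}


def tokens_per_word(words: int, tokens: int) -> float:
    """How many tokens one word costs on average."""
    return tokens / words


def words_per_token(words: int, tokens: int) -> float:
    """The inverse rate, comparable to the 0.75 rule of thumb."""
    return words / tokens


for language, (words, tokens) in SAMPLES.items():
    print(
        f"{language:<11}"
        f"{tokens_per_word(words, tokens):.2f} tokens/word  "
        f"{words_per_token(words, tokens):.2f} words/token"
    )
```

Chinese and Japanese are left out because the source reports no word counts for them.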
Same meaning, different price
Because all seven passages say the same thing, the fairest question is: what does it cost to express identical meaning in each language? Taking English as the baseline:
| Language | vs English, GPT-5 (o200k) | vs English, GPT-4 (cl100k) | vs English, Claude Sonnet 4.6 |
|---|---|---|---|
| Spanish | +30% | +56% | +59% |
| Portuguese | +25% | +60% | +62% |
| French | +39% | +76% | +78% |
| German | +45% | +85% | +111% |
| Chinese | +45% | +103% | +87% |
| Japanese | +86% | +144% | +108% |
On GPT-5, expressing this passage in Spanish costs 30% more tokens than in English; Portuguese costs 25% more, and Japanese 86% more. The penalty grows on older encodings: the Spanish premium that is 30% on o200k was 56% on GPT-4's cl100k, and the GPT-3 era p50k encoding needed 222 tokens for the same passage, more than double its English equivalent. Anyone whose intuition for multilingual workloads was formed on those legacy encodings is working from ratios that are now badly out of date.
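The percentages in this section follow directly from the raw counts in the headline table. As a sketch, with the counts copied from that table and `overhead_pct` as an illustrative helper name:

```python
# Token counts for the same passage, copied from the headline table.
COUNTS = {
    "GPT-5 (o200k)": {
        "English": 110, "Spanish": 143, "Portuguese": 137, "French": 153,
        "German": 159, "Chinese": 159, "Japanese": 205,
    },
    "GPT-4 (cl100k)": {
        "English": 110, "Spanish": 172, "Portuguese": 176, "French": 194,
        "German": 203, "Chinese": 223, "Japanese": 268,
    },
    "Claude Sonnet 4.6": {
        "English": 116, "Spanish": 184, "Portuguese": 188, "French": 207,
        "German": 245, "Chinese": 217, "Japanese": 241,
    },
}


def overhead_pct(tokens: int, baseline: int) -> int:
    """Extra tokens relative to the baseline, as a rounded percentage."""
    return round((tokens / baseline - 1) * 100)


for tokenizer, counts in COUNTS.items():
    english = counts["English"]
    for language, tokens in counts.items():
        if language != "English":
            print(f"{tokenizer:<18}{language:<11}+{overhead_pct(tokens, english)}%")
```

Each tokenizer is compared against its own English count, which is why the Sonnet column uses 116 rather than 110 as the baseline.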
The o200k effect: three GPT generations
The encoding history explains the shift. p50k and cl100k were trained heavily on English; o200k doubled the vocabulary to around 200,000 tokens and allocated far more of it to non-English text. For Spanish, the progression is 222 tokens (GPT-3 era) to 172 (GPT-4) to 143 (GPT-5) for the identical passage. Chinese improved even more sharply: 223 tokens on cl100k against 159 on o200k, a 29% drop.
The improvement is not universal. Our JavaScript sample is one honest counterexample: it costs 140 tokens on cl100k and 149 on o200k, slightly more on the newer encoding. English prose and Python were essentially flat. o200k's gains went to human languages, not to code.
Claude counts twice: Opus 4.8 vs Sonnet 4.6
The least documented result in the dataset: Anthropic's count-tokens endpoint reports two distinct counting regimes across its current models. Sonnet 4.6 and Haiku 4.5 return identical counts for every sample in the corpus. Opus 4.8 reports substantially higher figures for the same text, which matches Anthropic's own migration notes that Opus 4.7 and later count tokens differently.
| Sample | Sonnet 4.6 / Haiku 4.5 | Opus 4.8 | Opus vs Sonnet |
|---|---|---|---|
| English prose | 116 | 177 | 1.53x |
| Spanish prose | 184 | 256 | 1.39x |
| German prose | 245 | 324 | 1.32x |
| Python code | 208 | 254 | 1.22x |
| JSON | 249 | 284 | 1.14x |
| Chinese | 217 | 216 | 1.00x |
| Japanese | 241 | 240 | 1.00x |
The inflation is concentrated in Latin-script text, where Opus reports roughly 1.3 to 1.5 times the Sonnet count. On Chinese and Japanese the two regimes nearly coincide. This matters for budgeting because the billable unit differs by model: Opus 4.8 at $5 per million input tokens does not cost 1.67 times Sonnet 4.6 at $3 for English prose; measured end to end it costs about 2.5 times as much per word, because each word registers as more tokens. The cost table below uses each model's own measured counts.
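The 2.5x figure is the price ratio multiplied by the token-count ratio. A short worked version, using the English prose counts and list prices quoted above (`per_word_cost_ratio` is an illustrative name):

```python
def per_word_cost_ratio(
    price_a: float, tokens_a: int, price_b: float, tokens_b: int
) -> float:
    """Cost of model A relative to model B for the same text.

    Prices are per million input tokens; token counts are each model's
    own measured count for the identical sample.
    """
    return (price_a * tokens_a) / (price_b * tokens_b)


price_ratio = 5.00 / 3.00  # Opus 4.8 vs Sonnet 4.6, per token
count_ratio = 177 / 116    # Opus vs Sonnet tokens, English prose
effective = per_word_cost_ratio(5.00, 177, 3.00, 116)

print(f"price ratio      {price_ratio:.2f}x")
print(f"count ratio      {count_ratio:.2f}x")
print(f"effective ratio  {effective:.2f}x")
```

The same function applied to the Chinese row (216 against 217 tokens) returns a ratio close to the plain 1.67x price gap, since the two counting regimes nearly coincide there.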
Code, JSON, and CSV cost more than prose
Per character, structured text consumes more tokens than prose: code and Markdown run about 20 to 30% higher, JSON about double, and CSV about triple. Punctuation, brackets, quotes, and digits fragment into many small tokens:
| Sample | Characters | GPT-5 tokens | Tokens per 100 chars |
|---|---|---|---|
| English prose | 572 | 110 | 19.2 |
| Markdown document | 639 | 162 | 25.4 |
| Python code | 667 | 167 | 25.0 |
| JavaScript code | 636 | 149 | 23.4 |
| Social text with emoji | 283 | 88 | 31.1 |
| JSON order record | 521 | 214 | 41.1 |
| CSV numeric data | 416 | 237 | 57.0 |
CSV numeric data is the most expensive input in the corpus at 57 tokens per 100 characters, three times the density of English prose. Dates, IDs, decimals, and percent signs tokenize one fragment at a time. The practical advice: when you pipe spreadsheets or logs into a model, the character count will mislead you; count tokens on a representative chunk first, and consider summarizing or sampling numeric tables before sending them whole.
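The advice to count a representative chunk first amounts to a simple extrapolation. The sketch below assumes the chunk really is representative of the whole input, which is the main source of error in practice; the function names are illustrative, and the sample figures are the CSV and English prose rows from the table.

```python
def tokens_per_100_chars(tokens: int, chars: int) -> float:
    """Token density of a measured sample."""
    return 100 * tokens / chars


def estimate_tokens(total_chars: int, sample_tokens: int, sample_chars: int) -> int:
    """Scale a measured sample's token count up to the full input size."""
    return round(total_chars * sample_tokens / sample_chars)


# One million characters of input, estimated from each measured sample.
csv_estimate = estimate_tokens(1_000_000, sample_tokens=237, sample_chars=416)
prose_estimate = estimate_tokens(1_000_000, sample_tokens=110, sample_chars=572)

print(f"CSV density    {tokens_per_100_chars(237, 416):.1f} tokens per 100 chars")
print(f"CSV estimate   {csv_estimate:,} tokens")
print(f"Prose estimate {prose_estimate:,} tokens")
```

A budget built on the prose density would underestimate a CSV payload of the same size by roughly a factor of three.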
Emoji are expensive
The social-media sample packs 11 emoji into 283 characters. Each emoji costs one to three tokens on o200k, and skin-tone or compound variants cost more. The sample lands at 88 GPT-5 tokens, or 31.1 tokens per 100 characters: denser than prose or code, though still below JSON. For chat products that process social text at scale, emoji are a real line item, not a rounding error.
What a million words costs
Converting measured tokens per word into input cost at current published prices (GPT-5 $1.25, GPT-5 mini $0.25, GPT-4o $2.50, Claude Haiku 4.5 $1.00, Sonnet 4.6 $3.00, Opus 4.8 $5.00 per million input tokens) gives the number a budget owner actually wants, the cost to process one million words:
| Language | GPT-5 | GPT-5 mini | GPT-4o | Haiku 4.5 | Sonnet 4.6 | Opus 4.8 |
|---|---|---|---|---|---|---|
| English | $1.46 | $0.29 | $2.93 | $1.23 | $3.70 | $9.41 |
| Spanish | $1.67 | $0.33 | $3.34 | $1.72 | $5.16 | $11.96 |
| Portuguese | $1.68 | $0.34 | $3.36 | $1.84 | $5.53 | $11.81 |
| French | $1.75 | $0.35 | $3.51 | $1.90 | $5.70 | $12.61 |
| German | $2.14 | $0.43 | $4.27 | $2.63 | $7.90 | $17.42 |
Two readings of this table. First, language overhead compounds with model choice: a million German words through Opus 4.8 costs $17.42 against $1.46 for English through GPT-5, a 12x spread for the same volume of meaning. Second, input pricing is cheap everywhere in absolute terms; the ratios matter when you multiply by output tokens, which typically cost four to five times the input rate and follow similar per-language inflation.
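Each cell in the cost table is the tokens-per-word rate multiplied by the price per million input tokens. A minimal sketch, using counts and prices quoted in this article (`cost_per_million_words` is an illustrative name):

```python
def cost_per_million_words(
    words: int, tokens: int, price_per_million_tokens: float
) -> float:
    """Input cost in dollars to process one million words.

    One million words at (tokens / words) tokens per word is
    (tokens / words) million tokens, billed at the per-million price.
    """
    return tokens / words * price_per_million_tokens


english_gpt5 = cost_per_million_words(94, 110, 1.25)    # English, GPT-5
german_opus = cost_per_million_words(93, 324, 5.00)     # German, Opus 4.8
spanish_sonnet = cost_per_million_words(107, 184, 3.00)  # Spanish, Sonnet 4.6

print(f"English on GPT-5       ${english_gpt5:.2f}")
print(f"Spanish on Sonnet 4.6  ${spanish_sonnet:.2f}")
print(f"German on Opus 4.8     ${german_opus:.2f}")
print(f"Spread                 {german_opus / english_gpt5:.0f}x")
```

Swapping in an output price gives the equivalent figure for generated text, assuming the output follows a similar tokens-per-word rate.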
Reproduce the numbers
The full dataset and corpus are free to download and reuse with attribution (CC BY 4.0):
- tokenizer-comparison-2026.csv, every measurement in one flat table
- tokenizer-comparison-2026.json, measurements plus methodology and derived metrics
- tokenizer-corpus-2026.json, the 13-sample corpus, so you can verify every count
To check the GPT figures, run any sample through tiktoken with the o200k_base or cl100k_base encoding. To check Claude, call Anthropic's count-tokens endpoint with the sample as a single user message and subtract the envelope as described above. To get a feel for the numbers interactively, paste any corpus sample into our browser-local Token Counter: it runs the real o200k encoding client side, so the GPT counts match this dataset exactly and your text never leaves the page. For background on what a token is in the first place, see the Token Counter complete guide.
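A minimal sketch of the tiktoken check, assuming the third-party `tiktoken` package is installed and the sample text has already been loaded into a string. The layout of the corpus JSON is not described here, so loading is left to the reader; `count_tokens`, `count_all`, and `mismatches` are illustrative names.

```python
from typing import Dict, Iterable, List


def count_tokens(text: str, encoding_name: str = "o200k_base") -> int:
    """Count tokens for one encoding using tiktoken."""
    import tiktoken  # third-party; imported lazily so the helpers below work without it

    encoding = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() treats strings such as "<|endoftext|>" as plain text.
    return len(encoding.encode(text, disallowed_special=()))


def count_all(
    text: str,
    encoding_names: Iterable[str] = ("o200k_base", "cl100k_base", "p50k_base"),
) -> Dict[str, int]:
    """Count the same text under several encodings."""
    return {name: count_tokens(text, name) for name in encoding_names}


def mismatches(measured: Dict[str, int], published: Dict[str, int]) -> List[str]:
    """Names of encodings whose measured count differs from the published one."""
    return [
        name
        for name, expected in published.items()
        if measured.get(name) != expected
    ]


# Published English counts from the headline table.
PUBLISHED_ENGLISH = {"o200k_base": 110, "cl100k_base": 110}
```

With the English sample loaded as `text`, `mismatches(count_all(text), PUBLISHED_ENGLISH)` should return an empty list if the counts reproduce. The first call to `tiktoken.get_encoding` downloads the encoding data, so it needs network access once.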
Frequently asked questions
How many tokens is one English word?
About 1.17 tokens on GPT-5's o200k encoding, measured on standard prose. Claude Sonnet 4.6 reports about 1.23 tokens per English word, and Claude Opus 4.8 reports about 1.88 because its counting changed from the 4.7 generation onward. The old rule that a token is three quarters of a word is a slightly conservative estimate for English on modern GPT encodings, where we measured about 0.85 words per token.
Does Spanish use more tokens than English?
Yes. Expressing the same meaning in Spanish costs about 30% more tokens than English on GPT-5, about 56% more on GPT-4's cl100k encoding, and roughly 59% more on Claude Sonnet 4.6, all measured on a parallel passage. Portuguese behaves similarly at 25% to 62% depending on the tokenizer.
Why is GPT-5 so much better at non-English text than GPT-4?
GPT-5 uses the o200k encoding, which roughly doubled the vocabulary to 200,000 tokens and allocated much more of it to non-English words. The same Spanish passage that needed 172 tokens on GPT-4's cl100k needs 143 on o200k, and Chinese dropped 29%. Code saw little or no improvement.
Why does Claude Opus 4.8 report more tokens than Sonnet 4.6?
Anthropic updated token counting from Opus 4.7 onward, and the official count-tokens endpoint reflects it: Opus 4.8 reports roughly 1.3 to 1.5 times the Sonnet 4.6 count for the same Latin-script text, while Chinese and Japanese counts stay nearly identical. Since billing follows each model's own count, Opus costs more per word than its price per token suggests.
Is CSV data really more expensive than prose?
Per character, yes, by about three times. Our CSV sample measured 57 GPT-5 tokens per 100 characters against 19 for English prose, because digits, decimals, dates, and separators fragment into many small tokens. Count a representative chunk before sending large tables to a model.
Can I download and reuse this dataset?
Yes. The corpus and all measurements are published under CC BY 4.0 at textkit.tech/data, in CSV and JSON form. Cite textkit.tech when you reuse them. Every number is reproducible with tiktoken and Anthropic's free count-tokens endpoint using the method described in the article.
Written by SAVI. We build the tools we write about. Try the Token Counter used in this post.