How many tokens is one English word?

About 1.17 tokens on GPT-5's o200k encoding, measured on standard prose. Claude Sonnet 4.6 reports about 1.23 tokens per English word, and Claude Opus 4.8 reports about 1.88 because its counting changed from the 4.7 generation onward. The old rule that a token is three quarters of a word is roughly right for English on modern GPT encodings, where we measured about 0.85 words per token.

Does Spanish use more tokens than English?

Yes. Expressing the same meaning in Spanish costs about 30% more tokens than English on GPT-5, about 56% more on GPT-4's cl100k encoding, and roughly 59% more on Claude Sonnet 4.6, all measured on a parallel passage. Portuguese behaves similarly at 25% to 62% depending on the tokenizer.

Why is GPT-5 so much better at non-English text than GPT-4?

GPT-5 uses the o200k encoding, which roughly doubled the vocabulary to 200,000 tokens and allocated much more of it to non-English words. The same Spanish passage that needed 172 tokens on GPT-4's cl100k needs 143 on o200k, and Chinese dropped 29%. Code saw little or no improvement.

Why does Claude Opus 4.8 report more tokens than Sonnet 4.6?

Anthropic updated token counting from Opus 4.7 onward, and the official count-tokens endpoint reflects it: Opus 4.8 reports roughly 1.3 to 1.5 times the Sonnet 4.6 count for the same Latin-script text, while Chinese and Japanese counts stay nearly identical. Since billing follows each model's own count, Opus costs more per word than its price per token suggests.

Is CSV data really more expensive than prose?

Per character, yes, by about three times. Our CSV sample measured 57 GPT-5 tokens per 100 characters against 19 for English prose, because digits, decimals, dates, and separators fragment into many small tokens. Count a representative chunk before sending large tables to a model.

Can I download and reuse this dataset?

Yes. The corpus and all measurements are published under CC BY 4.0 at textkit.tech/data, in CSV and JSON form. Cite textkit.tech when you reuse them. Every number is reproducible with tiktoken and Anthropic's free count-tokens endpoint using the method described in the article.

Tokens per Word: GPT-5 vs Claude, Measured (2026)

We ran the same seven-language passage, plus code, JSON, Markdown, emoji, and CSV samples, through five tokenizers, using exact counts from tiktoken for the GPT family and from Anthropic's official count-tokens API for Claude. Here is what a word really costs, and the full dataset is free to download.

Why tokens per word decides your bill

Every large language model bills by the token, never by the word. The exchange rate between those two units is where API budgets quietly drift. Most planning guides repeat the same rule of thumb: one token is about three quarters of an English word. That figure is roughly right for English on a modern tokenizer, and increasingly wrong for everything else: other languages, source code, structured data, and emoji all convert at their own rates.

Published numbers on this are surprisingly thin, so we measured it. This article reports exact token counts for the same content across five tokenizers and three model families, with the corpus and results downloadable below. If you budget LLM usage in any language other than English, the differences are large enough to change your projections.

The dataset and how it was measured

The corpus has 13 samples. Seven are human translations of the same 94-word passage about editing, in English, Spanish, Portuguese, French, German, Chinese, and Japanese, so the cross-language comparison holds meaning constant rather than length. The other six cover the text developers actually send to models: Python, JavaScript, a JSON order record, a Markdown document, an emoji-heavy social post, and CSV numeric data.

Counts for the GPT family come from tiktoken, OpenAI's published tokenizer, so they are exact: o200k_base (GPT-5, GPT-4o, the o-series), cl100k_base (GPT-4, GPT-3.5), and the GPT-3 era p50k_base for historical contrast. Claude counts come from Anthropic's official count-tokens API endpoint, which reports the billable figure per model. The endpoint counts the whole request, so we measured the fixed message envelope (6 tokens on Opus 4.8, 7 on Sonnet 4.6 and Haiku 4.5) and subtracted it, then verified the calibration with a doubling check that came back with zero drift. Absolute Claude counts carry about one token of uncertainty; ratios are unaffected.
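The envelope subtraction is plain arithmetic and can be sketched as follows. This is a minimal illustration, not the exact script behind the dataset: the model labels are shorthand rather than API model IDs, the function names are hypothetical, and the doubling check is assumed to mean "count the sample sent twice in one message and compare against twice the single count", which the article does not spell out.

```python
# Envelope sizes stated in the article (tokens added by the message wrapper).
# The keys are shorthand labels for this sketch, not API model identifiers.
ENVELOPE_TOKENS = {
    "opus-4.8": 6,
    "sonnet-4.6": 7,
    "haiku-4.5": 7,
}


def content_tokens(reported: int, model: str) -> int:
    """Strip the fixed message envelope from an endpoint-reported count."""
    return reported - ENVELOPE_TOKENS[model]


def doubling_drift(reported_single: int, reported_double: int, model: str) -> int:
    """Assumed form of the doubling check.

    Returns the content tokens of the doubled text minus twice the content
    tokens of the single text. Zero means the envelope estimate is consistent.
    """
    single = content_tokens(reported_single, model)
    double = content_tokens(reported_double, model)
    return double - 2 * single


# The article's English figure for Sonnet 4.6 is 116 content tokens, which
# implies the endpoint reported 116 + 7 = 123 for that request.
print(content_tokens(123, "sonnet-4.6"))
```

A nonzero drift would suggest either a wrong envelope size or a tokenization change at the point where the two copies join, so it is worth checking on more than one sample.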

Gemini is excluded from the measurements because Google does not publish its tokenizer and we had no countTokens access to verify against; we would rather scope the data honestly than estimate.

Tokens per word by language

The headline table: same passage, same meaning, four of the five tokenizers (the GPT-3 era p50k_base is left out of this table):

Language	Words	GPT-5 (o200k) tokens	GPT-5 tokens/word	GPT-4 (cl100k) tokens	Claude Sonnet 4.6 tokens	Claude Opus 4.8 tokens
English	94	110	1.17	110	116	177
Spanish	107	143	1.34	172	184	256
Portuguese	102	137	1.34	176	188	241
French	109	153	1.40	194	207	275
German	93	159	1.71	203	245	324
Chinese	n/a	159	n/a	223	217	216
Japanese	n/a	205	n/a	268	241	240

English is the cheapest language in every column: 110 tokens for 94 words on GPT-5, or about 1.17 tokens per word. That is roughly 0.85 words per token, so the popular 0.75-words-per-token rule is close for English prose and slightly overestimates the token count. Spanish runs 1.34 tokens per word on the same encoding, Portuguese 1.34, French 1.40, and German, with its long compounds, 1.71. Chinese and Japanese have no whitespace word boundaries, so per-word figures are not applicable; the next section compares them on equal meaning instead.
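The per-word figures are a single division, which the sketch below reproduces from the word and token counts in the table. The dictionary layout and function names are illustrative choices, not part of the published dataset.

```python
# (words, GPT-5 o200k tokens) copied from the table above. Chinese and
# Japanese are omitted because they have no whitespace word counts.
SAMPLES = {
    "English": (94, 110),
    "Spanish": (107, 143),
    "Portuguese": (102, 137),
    "French": (109, 153),
    "German": (93, 159),
}


def tokens_per_word(words: int, tokens: int) -> float:
    """How many tokens one word costs on average."""
    return tokens / words


def words_per_token(words: int, tokens: int) -> float:
    """The inverse rate, comparable with the 0.75-words-per-token rule."""
    return words / tokens


for language, (words, tokens) in SAMPLES.items():
    print(
        f"{language:<11} {tokens_per_word(words, tokens):.2f} tokens/word, "
        f"{words_per_token(words, tokens):.2f} words/token"
    )
```

Printing both directions of the rate makes it easier to compare against rules of thumb, which are usually quoted as words per token.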

Same meaning, different price

Because all seven passages say the same thing, the fairest question is: what does it cost to express identical meaning in each language? Taking English as the baseline:

Language	vs English, GPT-5 (o200k)	vs English, GPT-4 (cl100k)	vs English, Claude Sonnet 4.6
Spanish	+30%	+56%	+59%
Portuguese	+25%	+60%	+62%
French	+39%	+76%	+78%
German	+45%	+85%	+111%
Chinese	+45%	+103%	+87%
Japanese	+86%	+144%	+108%

On GPT-5, expressing this passage in Spanish costs 30% more tokens than in English; Portuguese costs 25% more, and Japanese 86% more. The penalty grows on older encodings: the same Spanish passage that costs +30% on o200k cost +56% on GPT-4's cl100k, and the GPT-3 era p50k encoding needed 222 tokens for it, more than double its English equivalent. Anyone who built their intuition for multilingual workloads on those older encodings is working from ratios that are now badly out of date.
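The percentages are each language's token count divided by the English count for the same tokenizer. A minimal sketch using the counts from the tokens-per-word table follows; the dictionary keys and the function name are illustrative.

```python
# Token counts for the parallel passage, from the tokens-per-word table.
COUNTS = {
    "GPT-5 (o200k)": {
        "English": 110, "Spanish": 143, "Portuguese": 137, "French": 153,
        "German": 159, "Chinese": 159, "Japanese": 205,
    },
    "GPT-4 (cl100k)": {
        "English": 110, "Spanish": 172, "Portuguese": 176, "French": 194,
        "German": 203, "Chinese": 223, "Japanese": 268,
    },
    "Claude Sonnet 4.6": {
        "English": 116, "Spanish": 184, "Portuguese": 188, "French": 207,
        "German": 245, "Chinese": 217, "Japanese": 241,
    },
}


def overhead_vs_english(counts: dict[str, int]) -> dict[str, int]:
    """Percent extra tokens relative to the English passage, rounded."""
    base = counts["English"]
    return {
        language: round((tokens / base - 1) * 100)
        for language, tokens in counts.items()
        if language != "English"
    }


for tokenizer, counts in COUNTS.items():
    print(tokenizer, overhead_vs_english(counts))
```

Each tokenizer is compared against its own English count, which is why the Claude column uses 116 as its baseline rather than 110.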

The o200k effect: three GPT generations

The encoding history explains the shift. p50k and cl100k were trained heavily on English; o200k doubled the vocabulary to around 200,000 tokens and allocated far more of it to non-English text. For Spanish, the progression is 222 tokens (GPT-3 era) to 172 (GPT-4) to 143 (GPT-5) for the identical passage. Chinese improved even more sharply: 223 tokens on cl100k against 159 on o200k, a 29% drop.

The improvement is not universal. Our JavaScript sample is one honest counterexample: it costs 140 tokens on cl100k and 149 on o200k, slightly more on the newer encoding. English prose and Python were essentially flat. o200k's gains went to human languages, not to code.
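The generation-to-generation change can be checked as a percent difference between the cl100k and o200k counts. The sketch below uses only the pairs quoted in this article; the function name is an illustrative choice.

```python
# (cl100k tokens, o200k tokens) for samples whose counts appear in the article.
PAIRS = {
    "English prose": (110, 110),
    "Spanish prose": (172, 143),
    "Chinese prose": (223, 159),
    "JavaScript code": (140, 149),
}


def percent_change(old: int, new: int) -> float:
    """Signed percent change from the older encoding to the newer one."""
    return (new - old) / old * 100


for sample, (cl100k, o200k) in PAIRS.items():
    print(f"{sample:<16} {percent_change(cl100k, o200k):+.1f}%")
```

A negative value means the sample got cheaper on o200k; the JavaScript sample is the one positive entry.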

Claude counts twice: Opus 4.8 vs Sonnet 4.6

The least documented result in the dataset: Anthropic's count-tokens endpoint reports two distinct counting regimes across its current models. Sonnet 4.6 and Haiku 4.5 return identical counts for every sample in the corpus. Opus 4.8 reports substantially higher figures for the same text, which matches Anthropic's own migration notes that Opus 4.7 and later count tokens differently.

Sample	Sonnet 4.6 / Haiku 4.5	Opus 4.8	Opus vs Sonnet
English prose	116	177	1.53x
Spanish prose	184	256	1.39x
German prose	245	324	1.32x
Python code	208	254	1.22x
JSON	249	284	1.14x
Chinese	217	216	1.00x
Japanese	241	240	1.00x

The inflation is concentrated in Latin-script text, where Opus reports roughly 1.3 to 1.5 times the Sonnet count. On Chinese and Japanese the two regimes nearly coincide. This matters for budgeting because the billable unit differs by model: Opus 4.8 at $5 per million input tokens does not cost 1.67 times Sonnet 4.6 at $3 for English prose; measured end to end it costs about 2.5 times as much per word, because each word registers as more tokens. The cost table below uses each model's own measured counts.
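The per-word comparison multiplies the price ratio by the token-count ratio. A minimal sketch with the figures quoted above follows; the function name is hypothetical.

```python
def cost_ratio(price_a: float, tokens_a: int, price_b: float, tokens_b: int) -> float:
    """Cost of model A relative to model B for the same text.

    Prices are USD per million input tokens; token counts are each model's
    own measured count for the identical sample.
    """
    return (price_a * tokens_a) / (price_b * tokens_b)


# Opus 4.8 ($5, 177 tokens) against Sonnet 4.6 ($3, 116 tokens), English prose.
english = cost_ratio(5.00, 177, 3.00, 116)

# Same comparison on the Chinese passage, where the counts nearly coincide.
chinese = cost_ratio(5.00, 216, 3.00, 217)

print(f"English: {english:.2f}x, Chinese: {chinese:.2f}x")
```

On the Chinese passage the ratio falls back to roughly the list-price ratio, because the two counting regimes agree there.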

Code, JSON, and CSV cost more than prose

Per character, structured text is far denser than prose. Punctuation, brackets, quotes, and digits fragment into many small tokens:

Sample	Characters	GPT-5 tokens	Tokens per 100 chars
English prose	572	110	19.2
Markdown document	639	162	25.4
Python code	667	167	25.0
JavaScript code	636	149	23.4
Social text with emoji	283	88	31.1
JSON order record	521	214	41.1
CSV numeric data	416	237	57.0

CSV numeric data is the most expensive input in the corpus at 57 tokens per 100 characters, three times the density of English prose. Dates, IDs, decimals, and percent signs tokenize one fragment at a time. The practical advice: when you pipe spreadsheets or logs into a model, the character count will mislead you; count tokens on a representative chunk first, and consider summarizing or sampling numeric tables before sending them whole.
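The density column and the "count a representative chunk" advice can be sketched as two small functions. The extrapolation assumes the chunk is typical of the whole file, which is the caller's responsibility to check; both function names are illustrative.

```python
def tokens_per_100_chars(chars: int, tokens: int) -> float:
    """Token density as used in the table above."""
    return tokens / chars * 100


def estimate_tokens(total_chars: int, chunk_chars: int, chunk_tokens: int) -> int:
    """Extrapolate a token count for a large input from one measured chunk.

    This is an estimate, not an exact count: it assumes the rest of the
    input tokenizes at the same density as the chunk.
    """
    return round(total_chars * chunk_tokens / chunk_chars)


# Densities for two rows of the table.
print(round(tokens_per_100_chars(416, 237), 1))  # CSV numeric data
print(round(tokens_per_100_chars(572, 110), 1))  # English prose

# A file 1,000 times the size of the CSV sample, at the same density.
print(estimate_tokens(416_000, 416, 237))
```

For mixed files, such as a log with both prose and numeric columns, measuring several chunks and averaging is likely to give a steadier estimate than a single sample.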

Emoji are expensive

The social-media sample packs 11 emoji into 283 characters. Each emoji costs one to three tokens on o200k, and skin-tone or compound variants cost more. The sample lands at 88 GPT-5 tokens, or 31.1 tokens per 100 characters, a density above both code samples and below JSON. For chat products that process social text at scale, emoji are a real line item, not a rounding error.

What a million words costs

Converting measured tokens per word into input cost at current published prices (GPT-5 $1.25, GPT-5 mini $0.25, GPT-4o $2.50, Claude Haiku 4.5 $1.00, Sonnet 4.6 $3.00, Opus 4.8 $5.00 per million input tokens) gives the number a budget owner actually wants, the cost to process one million words:

Language	GPT-5	GPT-5 mini	GPT-4o	Haiku 4.5	Sonnet 4.6	Opus 4.8
English	$1.46	$0.29	$2.93	$1.23	$3.70	$9.41
Spanish	$1.67	$0.33	$3.34	$1.72	$5.16	$11.96
Portuguese	$1.68	$0.34	$3.36	$1.84	$5.53	$11.81
French	$1.75	$0.35	$3.51	$1.90	$5.70	$12.61
German	$2.14	$0.43	$4.27	$2.63	$7.90	$17.42

Two things stand out in this table. First, language overhead compounds with model choice: a million German words through Opus 4.8 costs $17.42 against $1.46 for English through GPT-5, a 12x spread for the same volume of meaning. Second, input pricing is cheap everywhere in absolute terms; the ratios matter when you multiply by output tokens, which are priced at several times the input rate and follow similar per-language inflation.
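Each cell in the cost table is tokens per word multiplied by the price per million tokens. The sketch below rebuilds it from the measured counts, assuming (as the tables in this article imply) that GPT-5 mini and GPT-4o bill on the o200k count and that Haiku 4.5 bills on the same count as Sonnet 4.6. The column labels are shorthand for this sketch.

```python
# Input prices in USD per million tokens, as quoted in the article.
PRICE_PER_MILLION_TOKENS = {
    "GPT-5": 1.25,
    "GPT-5 mini": 0.25,
    "GPT-4o": 2.50,
    "Haiku 4.5": 1.00,
    "Sonnet 4.6": 3.00,
    "Opus 4.8": 5.00,
}

# Which measured count each model bills on (labels are shorthand).
COUNT_COLUMN = {
    "GPT-5": "o200k",
    "GPT-5 mini": "o200k",
    "GPT-4o": "o200k",
    "Haiku 4.5": "sonnet",
    "Sonnet 4.6": "sonnet",
    "Opus 4.8": "opus",
}

# Words and token counts for the parallel passage, from the per-word table.
PASSAGES = {
    "English": {"words": 94, "o200k": 110, "sonnet": 116, "opus": 177},
    "Spanish": {"words": 107, "o200k": 143, "sonnet": 184, "opus": 256},
    "Portuguese": {"words": 102, "o200k": 137, "sonnet": 188, "opus": 241},
    "French": {"words": 109, "o200k": 153, "sonnet": 207, "opus": 275},
    "German": {"words": 93, "o200k": 159, "sonnet": 245, "opus": 324},
}


def cost_per_million_words(language: str, model: str) -> float:
    """Input cost in USD to process one million words of the given language."""
    passage = PASSAGES[language]
    tokens_per_word = passage[COUNT_COLUMN[model]] / passage["words"]
    # One million words times tokens per word gives tokens in millions,
    # so multiplying by the per-million price yields dollars directly.
    return tokens_per_word * PRICE_PER_MILLION_TOKENS[model]


for language in PASSAGES:
    row = ", ".join(
        f"{model} ${cost_per_million_words(language, model):.2f}"
        for model in PRICE_PER_MILLION_TOKENS
    )
    print(f"{language}: {row}")
```

To budget for your own content, replace the passage counts with counts measured on a representative sample of your text, since the per-word rate depends on vocabulary and formatting.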

Reproduce the numbers

The full dataset and corpus are free to download and reuse with attribution (CC BY 4.0):

tokenizer-comparison-2026.csv, every measurement in one flat table
tokenizer-comparison-2026.json, measurements plus methodology and derived metrics
tokenizer-corpus-2026.json, the 13-sample corpus, so you can verify every count

To check the GPT figures, run any sample through tiktoken with the o200k_base or cl100k_base encoding. To check Claude, call Anthropic's count-tokens endpoint with the sample as a single user message and subtract the envelope as described above. To get a feel for the numbers interactively, paste any corpus sample into our browser-local Token Counter: it runs the real o200k encoding client side, so the GPT counts match this dataset exactly and your text never leaves the page. For background on what a token is in the first place, see the Token Counter complete guide.
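For the GPT side, a check with tiktoken can look like the sketch below. It relies on `tiktoken.get_encoding` and `Encoding.encode`, and needs the third-party tiktoken package installed along with access to its encoding files. The `encoder` parameter is a hypothetical addition for this sketch so that any object with an `encode` method can be substituted.

```python
def count_tokens(text: str, encoding_name: str = "o200k_base", encoder=None) -> int:
    """Count tokens in text with a named tiktoken encoding.

    If no encoder is passed, tiktoken is imported lazily and the named
    encoding is loaded. Passing an encoder lets callers reuse one instance
    across many samples.
    """
    if encoder is None:
        import tiktoken  # third-party package: pip install tiktoken

        encoder = tiktoken.get_encoding(encoding_name)
    return len(encoder.encode(text))


# Usage, once tiktoken is installed and sample_text holds a corpus sample:
#   count_tokens(sample_text, "o200k_base")
#   count_tokens(sample_text, "cl100k_base")
```

By default tiktoken's `encode` raises an error if the text contains special-token strings such as `<|endoftext|>`, so samples that include them need separate handling.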

Written by SAVI. We build the tools we write about. Try the Token Counter used in this post.
