{
  "name": "TextKit Tokenizer Comparison Corpus",
  "version": "1.0",
  "date": "2026-06-10",
  "license": "CC BY 4.0 — cite textkit.tech when reusing",
  "notes": "Samples L1-L7 are human translations of the same ~100-word passage (parallel corpus: same meaning, different language). Samples C1-C2 are code; samples S1-S4 are structured, social, and numeric text. Word counts are whitespace-delimited; for Chinese and Japanese (unsegmented scripts) per-word metrics are not applicable and per-100-character metrics are used instead.",
  "samples": [
    {
      "id": "L1-en",
      "label": "English prose",
      "category": "language-parallel",
      "script": "latin",
      "text": "Clear writing is a form of respect for the reader. Every unnecessary word costs attention, and attention is the scarcest resource a writer can ask for. Good editors cut without mercy: they remove filler, shorten sentences, and choose plain words over ornate ones. The result is prose that moves quickly and says exactly what it means. None of this happens by accident. It takes several passes, a willingness to delete favorite phrases, and the discipline to stop polishing once the meaning is clear. Readers rarely notice careful editing, but they always feel its absence."
    },
    {
      "id": "L2-es",
      "label": "Spanish prose",
      "category": "language-parallel",
      "script": "latin",
      "text": "Escribir con claridad es una forma de respeto hacia quien lee. Cada palabra innecesaria cuesta atención, y la atención es el recurso más escaso que un escritor puede pedir. Los buenos editores cortan sin piedad: eliminan el relleno, acortan las oraciones y prefieren las palabras sencillas a las rebuscadas. El resultado es una prosa que avanza con rapidez y dice exactamente lo que quiere decir. Nada de esto ocurre por accidente. Hacen falta varias pasadas, la disposición a borrar frases queridas y la disciplina de dejar de pulir cuando el significado ya es claro. Quien lee rara vez nota una edición cuidadosa, pero siempre siente su ausencia."
    },
    {
      "id": "L3-pt",
      "label": "Portuguese prose",
      "category": "language-parallel",
      "script": "latin",
      "text": "Escrever com clareza é uma forma de respeito por quem lê. Cada palavra desnecessária custa atenção, e a atenção é o recurso mais escasso que um escritor pode pedir. Bons editores cortam sem piedade: eliminam o enchimento, encurtam as frases e preferem palavras simples às rebuscadas. O resultado é uma prosa que avança depressa e diz exatamente o que pretende dizer. Nada disso acontece por acaso. São necessárias várias revisões, a disposição de apagar frases queridas e a disciplina de parar de polir quando o sentido já está claro. Quem lê raramente percebe uma edição cuidadosa, mas sempre sente a sua falta."
    },
    {
      "id": "L4-fr",
      "label": "French prose",
      "category": "language-parallel",
      "script": "latin",
      "text": "Écrire avec clarté est une forme de respect envers le lecteur. Chaque mot inutile coûte de l'attention, et l'attention est la ressource la plus rare qu'un écrivain puisse demander. Les bons éditeurs coupent sans pitié : ils suppriment le remplissage, raccourcissent les phrases et préfèrent les mots simples aux mots recherchés. Le résultat est une prose qui avance vite et dit exactement ce qu'elle veut dire. Rien de tout cela n'arrive par hasard. Il faut plusieurs relectures, la volonté de supprimer ses phrases préférées et la discipline d'arrêter de polir une fois le sens devenu clair. Les lecteurs remarquent rarement une édition soignée, mais ils en sentent toujours l'absence."
    },
    {
      "id": "L5-de",
      "label": "German prose",
      "category": "language-parallel",
      "script": "latin",
      "text": "Klares Schreiben ist eine Form von Respekt gegenüber den Lesenden. Jedes überflüssige Wort kostet Aufmerksamkeit, und Aufmerksamkeit ist die knappste Ressource, um die ein Autor bitten kann. Gute Lektoren kürzen ohne Gnade: Sie streichen Füllwörter, verkürzen Sätze und wählen schlichte Wörter statt geschraubter. Das Ergebnis ist Prosa, die schnell vorankommt und genau das sagt, was sie meint. Nichts davon geschieht zufällig. Es braucht mehrere Durchgänge, die Bereitschaft, Lieblingssätze zu streichen, und die Disziplin, mit dem Feilen aufzuhören, sobald die Bedeutung klar ist. Lesende bemerken sorgfältiges Lektorat selten, aber sie spüren sein Fehlen immer."
    },
    {
      "id": "L6-zh",
      "label": "Chinese prose (Simplified)",
      "category": "language-parallel",
      "script": "cjk",
      "text": "清晰的写作是对读者的一种尊重。每一个多余的词都在消耗注意力，而注意力是写作者所能请求的最稀缺的资源。优秀的编辑下手毫不留情：他们删去填充词，缩短句子，宁用朴素的词语而不用华丽的辞藻。最终的文字行进迅速，说的正是它想说的意思。这一切都不是偶然发生的。它需要反复修改，需要舍得删掉自己喜欢的句子，也需要在意思已经清楚时停止打磨的自律。读者很少注意到用心的编辑，但他们总能感觉到它的缺席。"
    },
    {
      "id": "L7-ja",
      "label": "Japanese prose",
      "category": "language-parallel",
      "script": "cjk",
      "text": "明晰な文章は、読者への敬意のかたちである。不要な語はひとつひとつ注意力を消費する。そして注意力は、書き手が求めうる最も希少な資源だ。優れた編集者は容赦なく削る。埋め草を取り除き、文を短くし、飾った言葉より平易な言葉を選ぶ。その結果、文章は速く進み、言いたいことを正確に言う。これは偶然には起こらない。何度も推敲を重ね、気に入った言い回しを捨てる覚悟を持ち、意味が明確になったら磨くのをやめる自制が要る。読者が丁寧な編集に気づくことはまれだが、その不在はいつも感じ取る。"
    },
    {
      "id": "C1-python",
      "label": "Python code",
      "category": "code",
      "script": "latin",
      "text": "def count_words(text: str) -> dict:\n    \"\"\"Count words, characters, and sentences in a text.\"\"\"\n    words = text.split()\n    sentences = [s for s in text.replace('!', '.').replace('?', '.').split('.') if s.strip()]\n    return {\n        'words': len(words),\n        'characters': len(text),\n        'characters_no_spaces': len(text.replace(' ', '')),\n        'sentences': len(sentences),\n        'avg_word_length': sum(len(w) for w in words) / len(words) if words else 0,\n    }\n\n\nif __name__ == '__main__':\n    sample = 'The quick brown fox jumps over the lazy dog.'\n    stats = count_words(sample)\n    for key, value in stats.items():\n        print(f'{key}: {value}')"
    },
    {
      "id": "C2-javascript",
      "label": "JavaScript code",
      "category": "code",
      "script": "latin",
      "text": "function debounce(fn, delay = 250) {\n  let timer = null;\n  return function (...args) {\n    clearTimeout(timer);\n    timer = setTimeout(() => fn.apply(this, args), delay);\n  };\n}\n\nconst input = document.querySelector('#search');\nconst results = document.querySelector('#results');\n\ninput.addEventListener('input', debounce(async (event) => {\n  const query = event.target.value.trim();\n  if (!query) { results.innerHTML = ''; return; }\n  const response = await fetch(`/api/search?q=${encodeURIComponent(query)}`);\n  const items = await response.json();\n  results.innerHTML = items.map(item => `<li>${item.title}</li>`).join('');\n}, 300));"
    },
    {
      "id": "S1-json",
      "label": "JSON data",
      "category": "structured",
      "script": "latin",
      "text": "{\n  \"order_id\": \"ORD-2026-08431\",\n  \"created_at\": \"2026-06-10T14:32:07Z\",\n  \"currency\": \"USD\",\n  \"customer\": {\n    \"id\": \"CUST-99217\",\n    \"name\": \"Jane Doe\",\n    \"email\": \"jane.doe@example.com\",\n    \"loyalty_tier\": \"gold\"\n  },\n  \"items\": [\n    { \"sku\": \"BK-1042\", \"title\": \"The Art of Plain Writing\", \"qty\": 1, \"unit_price\": 18.95 },\n    { \"sku\": \"NB-0207\", \"title\": \"Dot Grid Notebook A5\", \"qty\": 3, \"unit_price\": 7.5 }\n  ],\n  \"subtotal\": 41.45,\n  \"tax\": 3.32,\n  \"shipping\": 4.99,\n  \"total\": 49.76,\n  \"status\": \"paid\"\n}"
    },
    {
      "id": "S2-markdown",
      "label": "Markdown document",
      "category": "structured",
      "script": "latin",
      "text": "# Release Notes — v2.4.0\n\n## New features\n\n- **Dark mode** now follows the system preference and can be toggled manually.\n- Added a `--dry-run` flag to the import command.\n- Search results highlight the matched terms.\n\n## Fixes\n\n1. Fixed a crash when the config file was empty.\n2. Long file names no longer overflow the sidebar.\n3. Date parsing now handles ISO 8601 with offsets.\n\n## Upgrade guide\n\nRun `npm install` and restart the daemon:\n\n```bash\nnpm install\nsystemctl restart appd\n```\n\n> Note: configuration keys renamed in v2.3 are still read, but support ends in v3.0.\n\nSee the [changelog](https://example.com/changelog) for details."
    },
    {
      "id": "S3-social",
      "label": "Social media text with emoji",
      "category": "social",
      "script": "mixed",
      "text": "Just shipped our biggest update yet 🚀🎉 Dark mode 🌙, faster search ⚡, and offline support 📴 — all free! Huge thanks to the 12,847 beta testers who filed 3,200+ reports 🙏💜 Your feedback made this 10x better. Try it now 👉 link in bio. RT appreciated 🔁 #buildinpublic #devtools #SaaS 🛠️✨"
    },
    {
      "id": "S4-numeric",
      "label": "Numeric / tabular data",
      "category": "structured",
      "script": "latin",
      "text": "date,region,visits,signups,conversion,revenue\n2026-05-01,NA,48211,1247,2.59%,$18429.50\n2026-05-01,EU,39874,1031,2.59%,$15244.80\n2026-05-01,LATAM,21490,498,2.32%,$6890.25\n2026-05-02,NA,51038,1322,2.59%,$19551.10\n2026-05-02,EU,41226,1066,2.59%,$15762.40\n2026-05-02,LATAM,22817,531,2.33%,$7344.75\n2026-05-03,NA,46905,1198,2.55%,$17716.60\n2026-05-03,EU,38442,989,2.57%,$14623.90\n2026-05-03,LATAM,20973,476,2.27%,$6585.50"
    }
  ]
}
