
Remove Duplicates

Paste any list. Duplicates disappear instantly.


About the Remove Duplicates tool

Remove duplicate lines from any text. Case-sensitive or case-insensitive, with optional whitespace trimming and order preservation. Paste a list of emails, log entries, tags, or any line-based data; get back a deduplicated version in your browser. Nothing is uploaded.

What deduplication actually means

Duplicate removal sounds simple: keep one copy of each unique line, drop the rest. The complications come from how you define "duplicate." Are example@gmail.com and Example@Gmail.com duplicates? Are hello and hello (with a trailing space)? Are Hello and HELLO?

Different jobs need different definitions, and the wrong definition produces silently wrong results. This tool exposes three matching behaviors that cover almost every real case: case-sensitive matching (the default, strict equality), case-insensitive matching (treats foo and FOO as the same), and whitespace trimming (strips leading and trailing whitespace before comparing). Choose the combination that matches your data.
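One way to picture these definitions is as a function that turns each line into the key used for comparison. The sketch below is written in Python purely for illustration; it is not the tool's own code, and match_key and its parameter names are hypothetical.

```python
def match_key(line: str, case_insensitive: bool = False, trim: bool = False) -> str:
    """Return the form of `line` that gets compared when looking for duplicates."""
    key = line
    if trim:
        key = key.strip()   # drop leading and trailing whitespace
    if case_insensitive:
        key = key.lower()   # fold case so "foo" and "FOO" compare equal
    return key


# Strict equality (the default): a case difference keeps the lines distinct.
strict_equal = match_key("example@gmail.com") == match_key("Example@Gmail.com")

# Case-insensitive matching collapses the two addresses.
folded_equal = (match_key("example@gmail.com", case_insensitive=True)
                == match_key("Example@Gmail.com", case_insensitive=True))

# Trimming makes "hello" and "hello " (trailing space) compare equal.
trimmed_equal = match_key("hello", trim=True) == match_key("hello ", trim=True)
```

Two lines count as duplicates exactly when their keys are equal, so changing the options changes the answer without touching the data.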

Real use cases

Cleaning email lists. Newsletter exports, contact form submissions, and CRM extracts often contain duplicates from users who signed up multiple times. Email addresses are case-insensitive in practice: the domain part is case-insensitive by spec, and although the local part technically can be case-sensitive, no major mail provider enforces this. Dedupe with case-insensitive matching enabled. Trim whitespace too, because copy-paste from spreadsheets often introduces trailing spaces that defeat naive deduplication.

Deduplicating log entries. When two systems log the same event with the same payload, you end up with paired duplicates that need to collapse into single records. Paste the log into this tool with case-sensitive matching (logs may have meaningful case distinctions in IDs) and order preservation on, then export.

Tag list cleanup. Database tag fields, blog post taxonomies, and Slack channel topic lists accumulate near-duplicate variants over time: frontend, front-end, Frontend, FRONT-END. Run the list through this tool with case-insensitive matching to collapse the case variants (frontend and Frontend, front-end and FRONT-END); the spelling variants (frontend vs front-end) you'll need to fix manually.

CSV row deduplication. Spreadsheet exports sometimes contain duplicate rows that survived a merge or import. If your CSV is small enough to paste in (under 100,000 lines), this tool dedupes faster than wrestling Excel's "Remove Duplicates" feature, which is column-aware and trips up on whitespace.

Word list cleanup. Vocabulary lists, glossaries, search-keyword lists, and content-tag dictionaries all benefit from periodic deduplication. Case-insensitive matching is usually right here: you don't want both SQL and sql in a glossary.

Domain and IP address lists. Allowlists, blocklists, firewall rules, and analytics filters commonly contain duplicates from years of accumulated config. Whitespace trimming is essential: invisible trailing spaces from copy-paste defeat naive dedupe and cause subtle filter bugs.

Reference and citation cleanup. Academic papers, internal docs, and reading lists accumulate repeated references. URLs, DOIs, and citation strings are case-sensitive in some parts and case-insensitive in others, so case-insensitive dedupe can occasionally merge two distinct entries; run it, then review the result manually to catch the common duplicates while spotting any false collapses.

Match modes explained

Three independent options control how lines are matched and how the output is ordered: case sensitivity, whitespace trimming, and order preservation. They combine freely: you can have all three on, all three off, or any combination.

Case-sensitive (default). Hello and hello are different lines. Use this when case carries meaning: log files with mixed-case IDs, code variable names, JSON keys.

Case-insensitive. Hello and HELLO are the same line. The first occurrence is kept; later case variants are removed. Use this for emails, tags, domain names, and any natural-language list where capitalization is incidental.

Trim whitespace. Lines are compared after stripping leading and trailing spaces, tabs, and other whitespace. "hello" and "hello " (with a trailing space) are the same line. Almost always the right choice, since invisible whitespace is a common source of "why isn't this duplicate detected" bugs.

Preserve order. When on, the output appears in the same order as the first occurrence of each unique line. When off, output is sorted alphabetically. Order preservation is the better default for human-readable lists; alphabetical sort is better for further programmatic processing.

Common pitfalls

Trailing whitespace. The single most common cause of "why isn't this finding the duplicates?" is invisible trailing whitespace. Always enable Trim whitespace unless you have a specific reason not to.

BOM and zero-width characters. Files saved in some Windows editors contain a Byte Order Mark (U+FEFF) at the very start. Lines copied from the web sometimes contain zero-width spaces (U+200B) inserted by sloppy CMSes. Both produce "duplicates" that don't match because the first occurrence has the invisible character and the rest don't (or vice versa). The tool's Trim whitespace option doesn't currently handle these specifically; for stubborn invisible-character bugs, paste the input into the Find & Replace tool and search for \u200B.
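If you would rather clean the data in a script before pasting it in, a minimal Python sketch could look like the following. It removes only the two characters named above, and the names INVISIBLE and strip_invisible are hypothetical; extend the character list only if you know your data needs it.

```python
# Translation table that deletes the BOM (U+FEFF) and zero-width space (U+200B).
# The choice of exactly these two characters is an assumption for this sketch.
INVISIBLE = str.maketrans("", "", "\ufeff\u200b")


def strip_invisible(line: str) -> str:
    """Remove the invisible characters wherever they occur in the line."""
    return line.translate(INVISIBLE)


lines = ["\ufeffalpha", "alpha", "be\u200bta", "beta"]
cleaned = [strip_invisible(line) for line in lines]

# Before cleaning, all four strings are distinct; afterwards only two remain.
distinct_before = len(set(lines))
distinct_after = len(set(cleaned))
```

After this pass, the lines that looked identical on screen are also identical as strings, so ordinary deduplication will collapse them.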

Encoded vs decoded duplicates. URLs sometimes appear in your data both percent-encoded and decoded: https://example.com/foo%20bar and https://example.com/foo bar. These are the same URL functionally but different strings textually. The tool will treat them as different. Decode (or encode) consistently before deduplicating.
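As a rough illustration of decoding consistently, the Python sketch below uses the standard library's urllib.parse.unquote on every line before comparing. Treat it as a starting point under a simplifying assumption: decoding everything is safe for cases like %20, but it can change the meaning of URLs that contain encoded reserved characters such as %2F or %3F.

```python
from urllib.parse import unquote

urls = [
    "https://example.com/foo%20bar",
    "https://example.com/foo bar",
]

# Decode every URL so both spellings end up in the same textual form.
decoded = [unquote(url) for url in urls]

distinct_raw = len(set(urls))         # the raw strings differ
distinct_decoded = len(set(decoded))  # the decoded strings are identical
```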

Smart quotes and dumb quotes. Text pasted from Word, Pages, or Google Docs often replaces straight quotes with curly quotes (“ ”). If your data mixes both forms, dedupe will treat them as different. Normalize before deduping if this is a known issue.
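One possible normalization pass, sketched in Python, maps the four most common curly quote characters to their straight equivalents. QUOTE_MAP and straighten_quotes are hypothetical names, and the character list is an assumption; other typographic quote characters exist and are not covered here.

```python
# Map curly quotes to straight quotes (illustrative subset, not exhaustive).
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark, also used as an apostrophe
})


def straighten_quotes(line: str) -> str:
    """Replace curly quotes with straight ones so both forms compare equal."""
    return line.translate(QUOTE_MAP)


curly = "\u201chello\u201d"
straight = '"hello"'
same_after_normalizing = straighten_quotes(curly) == straighten_quotes(straight)
```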

Remove Duplicates vs Excel "Remove Duplicates" vs uniq

Three common ways to dedupe, three different trade-offs.

This tool. Fastest for plain-text lists in a browser, exposes case-sensitivity and whitespace trimming as explicit toggles, no software install, runs locally.

Excel "Remove Duplicates". Necessary when you need to dedupe based on specific columns of a multi-column spreadsheet (keep first row by Email, ignore other columns). Case-insensitive by default, doesn't trim whitespace, opaque about what it's doing. Reliable for spreadsheet-shaped data, painful for free-form text.

The uniq command. Fast for huge files, handles GBs without breaking a sweat. But uniq only removes adjacent duplicates, so you must sort first: sort file.txt | uniq. Sorting means the original line order is lost. Case-sensitive by default; case-insensitive needs -i (paired with sort -f so that case variants end up next to each other); whitespace handling is up to you. Best choice for files too large to paste into a browser.
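The adjacent-only behavior of uniq is easy to misread, so here is a small Python imitation of it. This is not how uniq is implemented; itertools.groupby simply happens to collapse runs of equal neighbors in the same way, and uniq_adjacent is a hypothetical name.

```python
from itertools import groupby


def uniq_adjacent(lines):
    """Collapse runs of identical neighboring lines, the way uniq does."""
    return [line for line, _run in groupby(lines)]


lines = ["b", "a", "b", "b", "a"]

# Without sorting, only the back-to-back "b" pair collapses.
unsorted_result = uniq_adjacent(lines)

# Sorting first brings every duplicate together, at the cost of input order.
sorted_result = uniq_adjacent(sorted(lines))
```

The unsorted result still contains repeated values, which is exactly the surprise people hit when they run uniq on unsorted input.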

How the tool works

Paste text into the input box. The tool splits it on line breaks, applies your selected normalization (trim, lowercase if case-insensitive enabled), and uses a JavaScript Set to track which normalized forms have been seen. The first occurrence of each unique form is kept; later occurrences are dropped. Output is rendered in the result box with the original (un-normalized) text of each kept line, in either input order or alphabetical order based on your choice.
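The steps above can be sketched in a few lines. The version below is Python, with a Python set standing in for the JavaScript Set; it illustrates the described approach and is not the tool's actual source. The function name and parameter names are hypothetical.

```python
def remove_duplicates(text, case_insensitive=False, trim=False, preserve_order=True):
    """Keep the first occurrence of each line, compared in normalized form."""
    seen = set()   # normalized forms already encountered
    kept = []      # original text of each kept line
    # splitlines() handles \n and \r\n (and a few rarer separators).
    for line in text.splitlines():
        key = line.strip() if trim else line
        if case_insensitive:
            key = key.lower()
        if key not in seen:
            seen.add(key)
            kept.append(line)   # keep the un-normalized text
    if not preserve_order:
        kept.sort()             # code-point order, a simplification of "alphabetical"
    return kept


text = "pear\nApple\npear\napple \nApple"

strict = remove_duplicates(text)
loose = remove_duplicates(text, case_insensitive=True, trim=True)
loose_sorted = remove_duplicates(text, case_insensitive=True, trim=True,
                                 preserve_order=False)
```

Each line costs one set lookup and at most one insertion, which is where the linear scaling comes from when order is preserved; the optional sort adds its own cost on top.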

Performance scales linearly with input size. Up to about 1 million lines (roughly 50MB of plain text) runs in under a second on typical hardware; past that, browser memory becomes the limit.

Workflow tips

Always check the counts. Before trusting deduplicated output, verify the count matches expectations. If you pasted 5,000 lines and got back 4,847, ask whether 153 duplicates is plausible. If it's far off, your match settings are probably wrong (case-sensitivity in the wrong direction, whitespace not trimmed when it should be).
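If you want to run the same sanity check in a script, a small report like the Python sketch below shows the totals and which values were repeated under a given combination of settings. The function name, parameters, and sample addresses are hypothetical.

```python
from collections import Counter


def dedupe_report(lines, case_insensitive=False, trim=False):
    """Count total, unique, and removed lines for the given match settings."""
    keys = []
    for line in lines:
        key = line.strip() if trim else line
        keys.append(key.lower() if case_insensitive else key)
    counts = Counter(keys)
    total, unique = len(keys), len(counts)
    return {
        "total": total,
        "unique": unique,
        "removed": total - unique,
        # Normalized forms that occurred more than once, with their counts.
        "repeated": {key: n for key, n in counts.items() if n > 1},
    }


lines = ["a@x.com", "A@x.com", "a@x.com ", "b@x.com"]

strict = dedupe_report(lines)
loose = dedupe_report(lines, case_insensitive=True, trim=True)
```

Comparing the strict and loose reports side by side is a quick way to see whether a surprising count comes from case differences or stray whitespace.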

For email lists, lowercase the output too. Email addresses are case-insensitive in practice. The tool's case-insensitive option handles the matching, but the output keeps the original text of each first occurrence, so for downstream tools that compare case-sensitively, run the deduped output through a lowercase pass with the Case Converter.

For ordered logs, preserve order. Time-series data loses meaning if you alphabetize it. Always enable the order-preservation toggle when working with sequential data.

Frequently asked questions

What's considered a "line"?

Anything separated by a newline character (\n on Unix and macOS, \r\n on Windows). Both line-ending styles are handled correctly. Empty lines are treated like any other line: the first empty line in the input is kept, and later empty lines are removed as duplicates.
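For the curious, here is a minimal Python sketch of splitting on both line-ending styles and then deduplicating, empty lines included. It illustrates the behavior described above rather than the tool's own code; split_lines is a hypothetical name.

```python
import re


def split_lines(text):
    """Split on \n or \r\n so both line-ending styles give the same lines."""
    return re.split(r"\r?\n", text)


unix_lines = split_lines("one\ntwo\n\ntwo")
windows_lines = split_lines("one\r\ntwo\r\n\r\ntwo")

# dict.fromkeys keeps the first occurrence of each line in input order,
# so the first empty line survives and later empty lines are dropped.
deduped = list(dict.fromkeys(split_lines("a\n\nb\n\nc")))
```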

Will it remove the original first occurrence too, or just the later ones?

Just the later ones. The first occurrence of each unique line is always kept. If you want every duplicate gone (so lines appearing twice or more are removed entirely), this isn't the right tool. You'd need a separate "remove all duplicates including originals" feature.
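If you do need the stricter behavior, it is a short script. The Python sketch below keeps only lines that occur exactly once in the input, using exact string matching; keep_only_singletons is a hypothetical name and this is not a feature of the tool.

```python
from collections import Counter


def keep_only_singletons(lines):
    """Drop every line that appears more than once, including its first copy."""
    counts = Counter(lines)
    return [line for line in lines if counts[line] == 1]


lines = ["a", "b", "a", "c", "b", "d"]
singletons = keep_only_singletons(lines)
```

Here "a" and "b" each appear twice, so both copies of each are removed and only "c" and "d" remain, in their original order.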

How is the order preserved?

Output appears in the same order as the first occurrence of each unique line in your input. If "apple" appears at line 3 and "banana" at line 1, "banana" comes first in the output.

What about partial-line duplicates?

The tool only matches whole lines. If line 1 is "hello world" and line 2 is "hello world today", they're treated as different. Substring or fuzzy duplicate detection is a different problem requiring a different tool.

Does it handle CSV with embedded newlines?

No. The tool treats every newline as a record separator, which breaks CSV cells that contain newlines inside quoted fields. For real CSV deduplication with quote-aware parsing, use a spreadsheet tool or a CSV library.
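As an example of the CSV-library route, the Python sketch below uses the standard csv module, which understands quoted fields that contain newlines. It dedupes on whole rows with exact matching; the function name and sample data are made up for illustration.

```python
import csv
import io


def dedupe_csv_rows(csv_text):
    """Parse CSV with quote-aware rules and keep the first copy of each row."""
    # newline="" lets the csv module handle line endings inside quoted fields.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    seen = set()
    kept = []
    for row in reader:
        key = tuple(row)   # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


data = (
    'name,note\n'
    'Ann,"line one\nline two"\n'
    'Bob,ok\n'
    'Ann,"line one\nline two"\n'
)
rows = dedupe_csv_rows(data)
```

The quoted note spans two physical lines but is parsed as a single field, so the repeated Ann row is recognized as a duplicate instead of being split into fragments.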

Is the input limit really 1 million lines?

Practical limit, not enforced. Browser memory determines the hard cap, and modern desktop browsers can handle several million short lines without trouble. Mobile browsers hit limits earlier.
