How to Remove Duplicate Lines (and Three Edge Cases)

Removing duplicates from a list looks trivial. Three edge cases — case sensitivity, whitespace, and order preservation — are where most workflows break. The complete pattern, with the gotchas spelled out.

The basic operation

Remove duplicate lines: a list goes in, every line that appeared more than once shows up only once on the way out. The first occurrence is kept; later occurrences are dropped.

For most lists, this is what you want and the tool just does it. Open Remove Duplicates, paste, copy the result. Done in five seconds.
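
What the tool is doing under the hood is a single pass with a "seen" set. A minimal sketch in TypeScript (illustrative, not the tool's actual source); the optional key function is where the edge cases below plug in:

function dedupeStable(lines: string[], key = (s: string) => s): string[] {
  const seen = new Set<string>();
  const out: string[] = [];
  for (const line of lines) {
    const k = key(line);      // comparison key; the line itself by default
    if (!seen.has(k)) {       // first occurrence of this key
      seen.add(k);
      out.push(line);         // keep the line as it first appeared
    }
  }
  return out;
}

dedupeStable(["a", "b", "a"]);  // → ["a", "b"]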

The interesting part is what counts as "duplicate." Three edge cases cover most of the questions that come up:

Edge case 1 — case sensitivity

Are Apple and apple the same line?

For human-readable text (names, words, addresses), almost always: yes. Treat them as duplicates and dedupe to one. For code-like text (case-sensitive identifiers, file names on Linux, URLs with case-sensitive paths), no — they're different and should both survive.

Default: case-insensitive on the TextKit tool, because that matches the most common need. Toggle it off if you need case-sensitive comparison.
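
In the sketch from earlier, case-insensitive comparison is just a different key. Note that the first spelling seen is the one that survives:

dedupeStable(["Apple", "apple", "APPLE"], (s) => s.toLowerCase());
// → ["Apple"]  (compared lowercase, original spelling kept)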

Edge case 2 — whitespace

Three lines that look identical:

alice
alice
 alice

Are these duplicates? With strict comparison, only the first two are; the third has a leading space. With trimming enabled, all three become alice and dedupe to one.

The right answer depends on what produced the leading space:

  • If it's a paste artifact (column from a spreadsheet, copied web text), trim and treat as duplicate.
  • If it's meaningful (indented config, formatted code), keep strict comparison.

For typical list-cleaning work, trim before dedup. Toggle "trim whitespace" on.
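
In the earlier sketch, the two behaviors are one line apart. Keying on the trimmed value ignores whitespace during comparison only; trimming before the dedup cleans the output too, which is what a "trim whitespace" toggle is usually expected to do:

dedupeStable(["alice", "alice", " alice"], (s) => s.trim());
// → ["alice"]  (the leading space is ignored for comparison only)

dedupeStable(["alice", "alice", " alice"].map((s) => s.trim()));
// → ["alice"]  (trim first, then dedupe: the output is cleaned as well)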

Edge case 3 — order preservation

You have a list. You want duplicates removed. You also want the remaining lines in the same order as the input.

This needs stable dedup. The first occurrence of each line is kept; the order of those first occurrences matches the input. Most browser tools do this; sort -u in shell does not (it sorts, which loses input order).

For shell stable dedup, use awk '!seen[$0]++' — this is the canonical one-liner. It tracks each line's count in the associative array seen; seen[$0]++ evaluates to 0 (false) the first time a line appears, so the pattern is true exactly once per distinct line, and awk's default action prints it. First occurrences survive, and order is preserved.

Browser tools handle this automatically. The TextKit Remove Duplicates tool defaults to stable dedup; you have to opt into sorting.
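
Sorting, when you do want it, is an explicit step after the stable pass rather than part of the dedup itself:

const lines = ["banana", "apple", "banana", "cherry"];
dedupeStable(lines).sort();  // → ["apple", "banana", "cherry"], like `sort -u`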

Worked example — cleaning a contact list

You have a CSV column of email addresses with these issues:

alice@example.com
Alice@example.com
alice@example.com 
bob@example.com
ALICE@EXAMPLE.COM

The cleanup pipeline:

  1. Trim whitespace. Removes the trailing space on row 3.
  2. Lowercase. Normalizes the case variations.
  3. Dedupe. Collapses the four alice@example.com rows into one.

The TextKit Remove Duplicates tool has all three as toggles — turn them on, paste, copy. The output:

alice@example.com
bob@example.com

Five rows down to two. This is the standard cleanup pattern for any contact or identifier list.
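
The same pipeline as code, reusing the dedupeStable sketch from earlier (the map chain is the whole cleanup):

const rows = [
  "alice@example.com",
  "Alice@example.com",
  "alice@example.com ",
  "bob@example.com",
  "ALICE@EXAMPLE.COM",
];
dedupeStable(rows.map((r) => r.trim().toLowerCase()));
// → ["alice@example.com", "bob@example.com"]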

One paste, deduplicated. The Remove Duplicates tool handles trimming, case-insensitive matching, and stable dedup. Local-only — nothing leaves the browser.

The right tool by list size

List size                Best tool
Up to 1,000 lines        Browser (TextKit)
1,000 - 100,000 lines    Browser still works; Excel slightly slower; shell faster than browser past ~50K
100,000 - 10M lines      Shell sort -u or awk '!seen[$0]++'
10M+ lines               Database table with UNIQUE index, or Spark/Pandas

For the typical workflow case (cleaning up a paste, deduping a column), the browser is the fastest end-to-end tool because the round-trip into Excel or shell costs more time than the dedup itself.

The inverse operation — find duplicates

Sometimes you want to see the duplicates rather than remove them. Two patterns:

  • "Show me only the lines that appeared more than once." Shell: sort | uniq -d.
  • "Show me the count for each unique line." Shell: sort | uniq -c | sort -rn — sorted by frequency descending.

For browser equivalents, frequency counting takes a short JavaScript snippet (a sketch below); in Excel, COUNTIF in a helper column does the same job.
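
A minimal counting sketch (illustrative TypeScript, not any particular tool's code):

function countLines(lines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    counts.set(line, (counts.get(line) ?? 0) + 1);
  }
  return counts;
}

const counts = countLines(["a", "b", "a", "a", "c"]);
[...counts].filter(([, n]) => n > 1);     // entries seen more than once, like `uniq -d` (with counts)
[...counts].sort((a, b) => b[1] - a[1]);  // entries by frequency descending, like `uniq -c | sort -rn`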

The two-line summary

For 95% of dedup work: open the Remove Duplicates tool, paste, toggle on trim and case-insensitive if needed, copy. For very large lists or pipelines: awk '!seen[$0]++' input.txt.

For the broader reference on list operations, see List Operations: The Complete Guide. For the related sort and shuffle operations, see How to Sort Lines Alphabetically and How to Shuffle Lines (Fisher-Yates).

Frequently asked questions

Does dedup change the order of remaining lines?

It depends on the tool. Stable dedup (the default in TextKit's Remove Duplicates) preserves the input order of the lines that survive. Unstable dedup (sometimes coupled with sort) reorders. Pick stable when input order matters.

How do I dedupe case-insensitively?

Toggle 'case-insensitive' in the dedup tool. The TextKit Remove Duplicates tool has this option. Manually in shell: sort -uf (case-insensitive sort + dedup; note it sorts, so input order is lost).

Can I keep duplicates and only see which lines are duplicated?

Yes — that's the inverse operation. sort | uniq -d in shell shows only duplicated lines. Some browser tools have a 'show duplicates only' mode.

How do I dedupe a CSV by a specific column?

Single-line dedup tools treat the entire line as the comparison key. For column-aware dedup, use Excel (Data → Remove Duplicates → choose column), Pandas (df.drop_duplicates(subset=['col'])), or an awk one-liner keyed on the field, e.g. awk -F, '!seen[$2]++' for the second comma-separated column.
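
In the terms of the earlier browser sketch, column-aware dedup is a key function over the chosen field. Naive comma splitting shown here; quoted fields need a real CSV parser:

const csvRows = ["id,email", "1,alice@example.com", "2,alice@example.com"];
dedupeStable(csvRows, (row) => row.split(",")[1] ?? "");
// keys: "email", "alice@example.com", "alice@example.com" → third row dropped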

Why does my dedup tool return more lines than expected?

Almost always invisible whitespace — trailing spaces or tabs — making lines that look identical actually different. Trim before dedup. The TextKit Remove Duplicates tool has a 'trim whitespace' toggle that fixes this.


Written by the TextKit team. We build the tools we write about — try the Remove Duplicates tool used in this post.