How to Extract Emails from Any Text (Step by Step)

Three steps from raw text to a deduplicated list. Works for resumes, web pages, CSV dumps, chat logs, and contact directories. The shortcut takes ten seconds in the browser. The script-based path scales to gigabyte files.

The three-step workflow

Whether the source is a resume, a web page, a CSV dump, or a chat log, the workflow is the same:

  1. Get the source as text. Copy from the browser, paste from the clipboard, drag a file in. The extractor needs raw characters, not formatted documents.
  2. Run the extraction. A regex pass pulls every address. A normalization pass lowercases. A dedup pass collapses duplicates.
  3. Export the result. Copy as a column, download as CSV, or paste into the next system.

Total time for a typical paste: ten seconds. The longest part is usually getting the source into a clipboard-friendly form.

The shortcut. Open the Email Extractor, paste your text, see the deduplicated list immediately. Toggle "Detect obfuscated emails" if your source uses [at]/(dot) patterns.

Step 1 — getting the source as text

Five common sources, each with the right approach:

  • Web page. Click in the page, Cmd+A / Ctrl+A, Cmd+C / Ctrl+C. Or right-click → Save As → Text where the browser offers a text format (not all do). Either way, the browser strips the HTML and gives you the visible text.
  • Resume PDF. Open in any PDF viewer, select all, copy. For image-only PDFs (scanned), run OCR first.
  • Word document. Open, select all, copy. Or File → Save As → Plain Text.
  • CSV file. Open in any text editor, select all, copy. Or open in Excel/Sheets and copy the email column.
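  • Chat-log or mail export (Slack, Discord, email MBOX). Open the export file in a text editor and copy the relevant range. An email MBOX file is plain text throughout, although bodies or attachments encoded as base64 or quoted-printable have to be decoded before any addresses inside them become visible.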
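
If you already have a saved .html file rather than a browser tab, the "visible text" step can be approximated in a script. The following is a minimal sketch using only Python's standard library; the names VisibleText and html_to_text are made up for this example, and it does not try to reproduce exactly what a browser would render.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps neighbouring elements from fusing together;
    # the trade-off is that an address split across tags will not survive.
    return " ".join(" ".join(parser.parts).split())
```

Because the parser decodes character references by default, an address written as alice&#64;example.com comes out as alice@example.com and is visible to the regex in Step 2.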

The one source that's hard: image-based emails. If the address is rendered as a PNG (a deliberate anti-scraping move), no text-extraction step will see it. OCR is the only path, and it's lossy on small text.

Step 2 — running the extraction

Three options, in order of effort:

Browser (everyday use). Paste into the TextKit Email Extractor. Toggle "Detect obfuscated emails" if needed. Click extract. The list appears immediately.

Command line (large files, scripting).

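grep -Eo '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' source.txt | tr '[:upper:]' '[:lower:]' | sort -u

The -E enables extended regex. -o prints only the match. No -i is needed because the pattern already lists both cases. tr lowercases every address so that sort -u, which compares case-sensitively, collapses Alice@Example.com and alice@example.com into one entry. Done in one pipeline.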

Python script (large files, custom rules).

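import re

# Same pattern as the grep example: local part, "@", domain, dot, TLD of 2+ letters.
pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
emails = set()
# Read line by line so a large file never has to fit in memory at once.
with open('source.txt', encoding='utf-8') as f:
    for line in f:
        emails.update(m.lower() for m in pattern.findall(line))
# Lowercased, deduplicated, sorted: one address per line.
print('\n'.join(sorted(emails)))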

The Python version gives you control over the dedup logic, the case handling, and any project-specific filtering (e.g., excluding internal domains, validating against a known TLD list).
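
As an illustration of that kind of project-specific filtering, here is a small sketch. The excluded domain and the TLD set are placeholder values invented for the example (a real TLD check would load the full IANA list), and keep is a hypothetical helper name.

```python
# Hypothetical values for illustration; substitute your own.
EXCLUDED_DOMAINS = {"example-corp.com"}
KNOWN_TLDS = {"com", "org", "net", "io"}  # a tiny sample, not the real TLD list


def keep(address):
    """Return True if the address is external and ends in a known TLD."""
    domain = address.rsplit("@", 1)[1].lower()
    tld = domain.rsplit(".", 1)[-1]
    # Treat subdomains of an excluded domain as excluded too.
    internal = any(
        domain == d or domain.endswith("." + d) for d in EXCLUDED_DOMAINS
    )
    return not internal and tld in KNOWN_TLDS


emails = ["alice@example.com", "bob@example-corp.com", "carol@host.invalidtld"]
filtered = [e for e in emails if keep(e)]
```

Here filtered keeps only alice@example.com: the second address is on the excluded domain and the third fails the TLD check.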

Step 3 — exporting the result

Three formats cover almost every downstream use:

  • Plain list, one per line. Paste into a CRM's bulk-import field, a spreadsheet column, or another extraction tool.
  • CSV. One column, one address per row. Accepted for import by Mailchimp, ConvertKit, HubSpot, Salesforce, and most other email tools.
  • Comma-separated string. Paste into the To/CC/BCC field of an email client. Mostly useful for small lists where you'll send manually.

The TextKit Email Extractor offers all three. Click the output type chip to switch.
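
When the list comes from a script rather than the browser tool, the same three formats take a few lines each. This is a sketch, not the extractor's own export code; the "email" column header is an assumption, so check what your import target expects.

```python
import csv
import io

emails = ["alice@example.com", "bob@example.org"]

# 1. Plain list, one address per line.
plain = "\n".join(emails)

# 2. CSV with a single column; csv.writer handles quoting and line endings.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["email"])  # assumed header name
writer.writerows([e] for e in emails)
csv_text = buffer.getvalue()

# 3. Comma-separated string for a To/CC/BCC field.
recipients = ", ".join(emails)
```

To write the CSV straight to disk instead, open the file with newline="" and pass the file object to csv.writer in place of the buffer.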

The cleanup step most workflows skip

After extraction, three common quality issues to fix before using the list:

  1. Trailing punctuation. Strip .,;:)>}] from the right side of every address. The extractor should do this automatically.
  2. Doubled domains. Sometimes manual data entry produces alice@example.com.example.com. These need human review — there's no automatic fix that's safe.
  3. Internal-only addresses. Most lists have a few addresses you shouldn't email — your own, your team's, system addresses (noreply@, postmaster@). Filter these out before any send.
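
The three checks above can be sketched in Python as follows. The helper names, the set of system local parts, and the doubled-domain heuristic (the domain's labels repeat exactly once) are assumptions for illustration. Doubled domains are only set aside for review, never rewritten.

```python
TRAILING = ".,;:)>}]"
SYSTEM_LOCAL_PARTS = {"noreply", "no-reply", "postmaster"}  # extend as needed


def looks_doubled(address):
    """Flag domains like example.com.example.com, where the labels repeat."""
    labels = address.rsplit("@", 1)[1].lower().split(".")
    half = len(labels) // 2
    return len(labels) % 2 == 0 and half >= 2 and labels[:half] == labels[half:]


def clean(addresses):
    """Return (kept, review): usable addresses and ones needing a human look."""
    kept, review = [], []
    for raw in addresses:
        # 1. Strip trailing punctuation, then normalize case.
        address = raw.rstrip(TRAILING).lower()
        # 3. Drop system addresses by their local part.
        if address.split("@", 1)[0] in SYSTEM_LOCAL_PARTS:
            continue
        # 2. Set doubled domains aside instead of guessing at a fix.
        (review if looks_doubled(address) else kept).append(address)
    return kept, review
```

Your own and your team's addresses can be removed the same way, by adding a check against a set of full addresses or domains.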

What to do when the regex misses

If the result has fewer addresses than expected, the source probably uses obfuscation. Toggle "Detect obfuscated emails" on the extractor — it adds a pass that recognizes [at], (at), [dot], (dot), and the literal words at and dot with whitespace boundaries.
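
For scripted workflows, a comparable pass can be approximated by rewriting the obfuscated tokens first and then running the normal pattern. This is a rough sketch under that assumption, not the extractor's actual implementation; the function names are invented for the example.

```python
import re

EMAIL = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
# Bracketed forms may have optional spaces around them;
# the bare words "at" and "dot" must have whitespace on both sides.
AT = re.compile(r'\s*[\[(]\s*at\s*[\])]\s*|\s+at\s+', re.IGNORECASE)
DOT = re.compile(r'\s*[\[(]\s*dot\s*[\])]\s*|\s+dot\s+', re.IGNORECASE)


def deobfuscate(text):
    """Rewrite [at]/(at)/at and [dot]/(dot)/dot tokens as @ and ."""
    return DOT.sub(".", AT.sub("@", text))


def extract_with_obfuscation(text):
    """Run the plain email pattern over the de-obfuscated text."""
    return sorted({m.lower() for m in EMAIL.findall(deobfuscate(text))})
```

Rewriting the bare words is aggressive: ordinary prose such as "look at example dot com" would also yield an address, which is a good reason to keep this pass optional and review its output.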

If addresses are still missing after the obfuscation pass, the source is probably using JavaScript assembly or image rendering. These can't be caught by text-based extraction — only headless-browser scraping (for JS) or OCR (for images) will reach them. Both require separate tools.

For the regex specifics with copy-paste-ready patterns, see Email Regex Cheatsheet. For the deeper reference on extraction and the legal limits, see Email Extraction: The Complete Guide.

Frequently asked questions

What if I have a PDF, not text?

Extract the text first. Most PDFs allow text selection — copy and paste into the extractor. For scanned PDFs (image-based), run OCR first via tools like macOS Preview, Tesseract, or Adobe Acrobat.

How do I extract emails from a Gmail mailbox?

Use Gmail's export (Google Takeout) to download as MBOX. Open the MBOX in a text editor or run grep -E with the email regex. The TextKit extractor handles MBOX text once pasted.
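
If you would rather read the MBOX in a script, Python's standard mailbox module can walk the messages and pull addresses from the headers. A minimal sketch, assuming you only care about the From, To, Cc, and Reply-To headers; the function name is made up, and addresses that appear only inside message bodies are not covered.

```python
import mailbox
from email.utils import getaddresses


def addresses_from_mbox(path):
    """Collect lowercased addresses from common headers of every message."""
    found = set()
    # create=False raises an error for a wrong path instead of
    # silently creating an empty mailbox file.
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            headers = []
            for name in ("From", "To", "Cc", "Reply-To"):
                headers.extend(message.get_all(name, []))
            # getaddresses splits "Name <addr>, other@host" into pairs.
            for _display_name, address in getaddresses(headers):
                if "@" in address:
                    found.add(address.lower())
    finally:
        box.close()
    return sorted(found)
```

Header parsing avoids a common grep side effect on mail files: matching Message-ID values, which contain an @ but are not mailboxes.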

Can I extract emails from a Word document?

Yes — copy the document text and paste. Or save the document as plain text first to strip formatting noise.

How do I extract emails from a Twitter or LinkedIn profile?

Twitter profiles rarely list an email address in plain text. LinkedIn shows contact emails only for profiles you're connected to. For both, the workflow is the same: view the page text, copy, and extract. Image-based addresses won't be caught.

Should I sort the result alphabetically?

Yes, almost always. Sorted lists are easier to scan, easier to dedupe across multiple extractions, and easier to compare against existing data. Most extractors offer this as a toggle.

Written by the TextKit team. We build the tools we write about — try the Email Extractor used in this post.