Email Extraction: The Complete Guide (2026)
Email extraction is regex plus discipline. The regex is the easy part; getting consistent results across resumes, emails, web pages, CSV dumps, and PDFs takes a workflow. This guide covers the patterns, the gotchas, the obfuscation tricks to watch for, and the legal and deliverability constraints that determine what you can do with the list once you have it.
The regex that does 95% of the work
The standard practical email regex — not the RFC-compliant one, the one people actually use:
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
Three parts: local (everything before the @), domain (everything between @ and the last dot), top-level domain (everything after the last dot). Each part has a character class that's permissive enough to catch real-world addresses without matching obvious garbage.
This regex correctly extracts:
- alice@example.com: basic case
- bob.smith+filter@example.co.uk: plus filtering, multi-part domain
- support_team@subdomain.company.io: underscores, subdomain, modern TLD
And correctly rejects:
- not-an-email: missing @
- @example.com: empty local
- alice@: missing domain
- alice@example: missing TLD
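The accept and reject cases above can be checked directly. This is a minimal sketch in Python; `extract_emails` and `samples` are names chosen for illustration, and the pattern is the practical one from this guide, not an RFC-complete parser.

```python
import re

# The practical pattern from this guide (not RFC-complete).
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def extract_emails(text: str) -> list[str]:
    """Return every substring of `text` that matches EMAIL_RE, in order."""
    return EMAIL_RE.findall(text)


samples = [
    "alice@example.com",
    "bob.smith+filter@example.co.uk",
    "support_team@subdomain.company.io",
    "not-an-email",
    "@example.com",
    "alice@",
    "alice@example",
]

for sample in samples:
    # The first three print a one-element list, the last four an empty list.
    print(f"{sample!r:40} -> {extract_emails(sample)}")
```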
Where the regex fails — the 5% you have to handle
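Trailing punctuation. With this exact pattern, the text "...email me at alice@example.com." extracts cleanly, because the TLD class [a-zA-Z]{2,} stops at the sentence-ending period. Punctuation sneaks in when the period runs straight into the next word ("alice@example.com.Thanks" extracts as alice@example.com.Thanks) or when the address was captured by a looser pattern such as \S+@\S+ or by whitespace tokenization. Fix: post-process by stripping trailing .,;:)>}].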
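The cleanup pass could look like the sketch below. `strip_trailing_punct` and `TRAILING` are illustrative names, and the character set is the one listed above; it handles candidates that arrive with trailing punctuation, however they were captured.

```python
# Characters to strip from the end of a candidate address.
TRAILING = ".,;:)>}]"


def strip_trailing_punct(candidate: str) -> str:
    """Remove sentence punctuation and closing brackets from the end only."""
    return candidate.rstrip(TRAILING)


candidates = ["alice@example.com.", "bob@example.org),"]
cleaned = [strip_trailing_punct(c) for c in candidates]
print(cleaned)
```

Only the tail is touched, so dots inside the local part or the domain are left alone.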
Internationalized addresses. Real RFC 6531 email addresses can contain non-ASCII characters in the local part: 用户@example.com. The standard regex won't catch them. In practice they're rare — most extraction tools (including the TextKit Email Extractor) skip them by default and surface a toggle for the rare projects where they matter.
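One way to expose such a toggle is to keep two patterns and pick one per call. This is a sketch under assumptions: `extract` and `allow_unicode_local` are hypothetical names, and the relaxed pattern only widens the local part, using the fact that `\w` is Unicode-aware for `str` patterns in Python 3. It does not attempt internationalized domain names.

```python
import re

# Default: the ASCII-only practical pattern.
ASCII_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Opt-in: \w accepts letters and digits from any script in the local part.
UNICODE_LOCAL_RE = re.compile(r"[\w.%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def extract(text: str, allow_unicode_local: bool = False) -> list[str]:
    """Extract addresses, optionally allowing non-ASCII local parts."""
    pattern = UNICODE_LOCAL_RE if allow_unicode_local else ASCII_RE
    return pattern.findall(text)


text = "write to 用户@example.com"
print(extract(text))                            # skipped by default
print(extract(text, allow_unicode_local=True))  # found with the toggle on
```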
Quoted local parts. The full RFC allows "strange address"@example.com with a quoted local. Almost no real address uses this. Skipping is fine.
Comments. The full RFC allows (comment)alice@example.com — embedded parenthetical comments. Never seen in the wild outside RFC test suites. Skipping is fine.
Obfuscation — the patterns to recognize
Webpages, social media bios, and resumes often hide email addresses to deter scrapers. The five common patterns:
| Pattern | Reconstructed |
|---|---|
| alice [at] example [dot] com | alice@example.com |
| alice (at) example (dot) com | alice@example.com |
| alice@example DOT com | alice@example.com |
| alice AT example.com | alice@example.com |
| alice{at}example{dot}com | alice@example.com |
A second extraction pass with relaxed pattern matching — replacing the obfuscated tokens before running the standard regex — recovers most of these. Toggle "Detect obfuscated emails" in the Email Extractor to enable this pass.
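A minimal sketch of that second pass, covering the five rows in the table. The function names are illustrative, and this is not the Email Extractor's actual implementation. One assumption to note: bare words are only rewritten when uppercase (AT, DOT), so ordinary prose such as "look at example.com" is left alone; loosening that will raise the false-positive rate.

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Bracketed tokens: [at], (at), {at} in any case, with optional spaces around.
BRACKET_AT = re.compile(r"\s*[\[({]\s*at\s*[\])}]\s*", re.IGNORECASE)
BRACKET_DOT = re.compile(r"\s*[\[({]\s*dot\s*[\])}]\s*", re.IGNORECASE)

# Bare words: uppercase only and whitespace-delimited (an assumption made
# here to avoid rewriting normal sentences).
BARE_AT = re.compile(r"\s+AT\s+")
BARE_DOT = re.compile(r"\s+DOT\s+")


def deobfuscate(text: str) -> str:
    """Replace obfuscation tokens with '@' and '.' before extraction."""
    text = BRACKET_AT.sub("@", text)
    text = BRACKET_DOT.sub(".", text)
    text = BARE_AT.sub("@", text)
    text = BARE_DOT.sub(".", text)
    return text


def extract_with_obfuscated(text: str) -> list[str]:
    """Run the standard pattern over the de-obfuscated text."""
    return EMAIL_RE.findall(deobfuscate(text))


rows = [
    "alice [at] example [dot] com",
    "alice (at) example (dot) com",
    "alice@example DOT com",
    "alice AT example.com",
    "alice{at}example{dot}com",
]
for row in rows:
    print(f"{row!r:35} -> {extract_with_obfuscated(row)}")
```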
Two patterns that can't be reliably reversed:
- Image-based emails. The address is rendered as a PNG inside the page. OCR is required, and the OCR is rarely perfect on small text.
- JavaScript-assembled. The address is concatenated at runtime from string fragments. Static text scraping won't see it; only headless-browser scraping will.
The deduplication step
A single source frequently contains the same email multiple times — in a header, a signature, a CC line, and an inline reference. Always dedupe.
The non-obvious dedupe: case-insensitive comparison. Alice@example.com and alice@example.com are the same address. Lowercase before comparing.
The trickier dedupe: dot-aliases on Gmail. alice.smith@gmail.com, alicesmith@gmail.com, and a.l.i.c.e.smith@gmail.com are all the same Gmail mailbox. For non-Gmail domains, dots are significant. The dot-collapse step is Gmail-specific and rarely worth the complexity unless the entire list is Gmail.
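Both dedupe rules fit in a few lines. In this sketch, `normalize`, `dedupe`, and the `collapse_gmail_dots` flag are hypothetical names; the flag is off by default, in line with the advice above that the Gmail step is rarely worth it.

```python
def normalize(address: str, collapse_gmail_dots: bool = False) -> str:
    """Build the comparison key: lowercase, optionally dot-free for Gmail."""
    address = address.lower()
    local, _, domain = address.rpartition("@")
    if collapse_gmail_dots and domain == "gmail.com":
        local = local.replace(".", "")
    return f"{local}@{domain}"


def dedupe(addresses: list[str], collapse_gmail_dots: bool = False) -> list[str]:
    """Keep the first occurrence of each address, lowercased, in input order."""
    seen: set[str] = set()
    unique: list[str] = []
    for address in addresses:
        key = normalize(address, collapse_gmail_dots)
        if key not in seen:
            seen.add(key)
            unique.append(address.lower())
    return unique


found = [
    "Alice@example.com",
    "alice@example.com",
    "alice.smith@gmail.com",
    "alicesmith@gmail.com",
]
print(dedupe(found))
print(dedupe(found, collapse_gmail_dots=True))
```

The output keeps the first-seen spelling rather than the collapsed key, so a Gmail address is never rewritten into a form its owner did not publish.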
The deliverability check — what's actually a real address
Extraction gives you a syntactically valid string. Real-world deliverability requires three checks:
- MX record exists. A DNS query for the domain. Cheap, can be done at scale, doesn't touch the recipient's server. About 5% of extracted addresses fail here — typo'd domains, expired domains, scratch addresses for sign-up forms.
- SMTP server accepts the address. Open a connection, send RCPT TO:, read the response. Expensive, gets you blacklisted if done at volume, and many servers lie (catch-all addresses accept anything).
- Address pattern matches the known scheme. If you've seen firstname.lastname@company.com work for one person at a company, similar addresses for others probably work. Free, scales infinitely, but no per-address confirmation.
Real bulk-validation services (NeverBounce, ZeroBounce, Hunter Email Verifier) combine all three. For small lists, MX-only is enough — it's cheap and catches the most common error mode.
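For the MX-only route, the main design point is to query each domain once, not once per address. The sketch below leaves the DNS lookup itself as an assumption: `has_mx` is a hypothetical callable you supply (it might wrap a resolver library such as dnspython's `dns.resolver.resolve(domain, "MX")`), and `fake_has_mx` is a stand-in used only to make the example runnable offline.

```python
from typing import Callable, Iterable


def filter_by_mx(
    addresses: Iterable[str],
    has_mx: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Split addresses into (passed, failed), looking up each domain once."""
    cache: dict[str, bool] = {}
    passed: list[str] = []
    failed: list[str] = []
    for address in addresses:
        domain = address.rpartition("@")[2].lower()
        if domain not in cache:
            cache[domain] = has_mx(domain)  # one lookup per distinct domain
        (passed if cache[domain] else failed).append(address)
    return passed, failed


def fake_has_mx(domain: str) -> bool:
    """Stand-in for a real DNS query; pretends only example.com has MX."""
    return domain == "example.com"


passed, failed = filter_by_mx(
    ["a@example.com", "b@example.com", "c@exmaple.invalid"],
    fake_has_mx,
)
print(passed)
print(failed)
```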
The legal layer — what you can actually do with the list
Three rules cover most jurisdictions, but consult a lawyer for any commercial program:
- CAN-SPAM (US). Unsolicited commercial email is allowed if you (a) include a clear unsubscribe link, (b) honor it within 10 business days, (c) don't use deceptive subject lines, and (d) include a physical postal address. No prior consent required.
- GDPR (EU). Marketing email to EU residents requires a lawful basis. For most outbound marketing, that means explicit prior consent. Extraction-then-spam to EU addresses is a per-recipient violation, with fines of up to €20 million or 4% of global annual turnover, whichever is higher.
- CASL (Canada) and Spam Act 2003 (Australia). Both require prior consent — implied or express — for commercial email. Both are stricter than CAN-SPAM.
The practical implication: extraction for internal use, contact lookup, deduplication, list cleanup, and one-to-one outreach is fine almost everywhere. Bulk cold outbound to scraped EU addresses is illegal in the EU; bulk cold outbound to scraped Canadian and Australian addresses is illegal there. Bulk cold outbound to scraped US addresses is technically legal under CAN-SPAM, but most ESPs (Mailchimp, ConvertKit, Brevo) prohibit it in their terms — sending will get the account suspended.
The browser-vs-server question
Email extraction can run in three places. Each has a different trade-off:
| Where | Best for | Privacy | Speed |
|---|---|---|---|
| Browser (regex in JS) | Pasted text up to ~50MB | Best — nothing leaves the device | ~1M chars/sec |
| Local script (Python, grep -E) | Files larger than 50MB, pipelines | Best — local-only | ~5M chars/sec |
| Server (paid API) | Continuous scraping, validation, scale | Worst — data leaves your control | Network-bound |
For the typical use case — clean up a CSV, parse a chat log, pull addresses out of a resume — the browser is the right tool. The TextKit Email Extractor handles paste, dedupe, sort, lowercase, and CSV export without uploading anything.
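For the local-script row in the table, a streaming approach keeps memory flat on files too large to paste. This is a minimal sketch: `iter_emails` is an illustrative name, and the file path in the usage comment is a placeholder.

```python
import re
from typing import Iterable, Iterator

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def iter_emails(lines: Iterable[str]) -> Iterator[str]:
    """Yield addresses one line at a time so the whole input is never held in memory."""
    for line in lines:
        yield from EMAIL_RE.findall(line)


# Usage with a large file (the path is a placeholder):
#   with open("dump.txt", encoding="utf-8", errors="replace") as fh:
#       for address in iter_emails(fh):
#           print(address)
```

Because a file object is itself an iterable of lines, the same function works on an open file, a list of strings, or standard input.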
The five workflow patterns
- Resume cleanup. Paste a candidate's resume text. Extract the one or two contact addresses. Confirm they match the contact section.
- Web page contact harvesting. View source, copy the page text, paste, extract. Useful for finding addresses on a contact page that have been spread across multiple sections.
- CSV deduplication. Paste a CSV column or a multi-column dump. Extract → dedupe → re-export. Cleans up lists that grew through multiple acquisitions and import paths.
- Chat-log parsing. Slack export, email thread, Discord backup. Extract addresses from sender lines, signatures, and inline mentions. The dedupe step matters most here.
- Form-submission audits. Form responses dumped as a text file. Extract the email field across all responses to verify the input was clean.
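As one worked instance of these workflows, the CSV pattern (extract, dedupe, re-export) can be sketched with the standard library. `emails_from_csv` and `to_csv` are hypothetical names, and the single `email` output column is an assumption about the export shape.

```python
import csv
import io
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def emails_from_csv(csv_text: str) -> list[str]:
    """Scan every cell, lowercase matches, and keep first occurrences."""
    seen: set[str] = set()
    unique: list[str] = []
    for row in csv.reader(io.StringIO(csv_text)):
        for cell in row:
            for match in EMAIL_RE.findall(cell):
                key = match.lower()
                if key not in seen:
                    seen.add(key)
                    unique.append(key)
    return unique


def to_csv(addresses: list[str]) -> str:
    """Re-export as a one-column CSV with an 'email' header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["email"])
    writer.writerows([address] for address in addresses)
    return buffer.getvalue()


raw = (
    "name,email\n"
    "Alice,Alice@Example.com\n"
    "Bob,bob@example.org\n"
    "Alice S,alice@example.com\n"
)
print(to_csv(emails_from_csv(raw)))
```

Scanning every cell, not one named column, is what lets this work on multi-column dumps where the address may sit in different positions.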
For the regex itself with worked examples, see Email Regex Cheatsheet. For the step-by-step workflow, see How to Extract Emails from Any Text. For the comparison with paid tools, see Hunter.io vs Free Email Extractors.
Frequently asked questions
Is it legal to extract emails from a website?
Extraction itself isn't illegal in most jurisdictions. Sending unsolicited bulk email to extracted addresses almost always is — CAN-SPAM, GDPR, CASL, and Australia's Spam Act 2003 all apply. Extraction for personal contact lookup, internal data cleanup, or research is generally fine; cold outbound to scraped lists is generally not.
How do I detect obfuscated emails like 'name [at] company [dot] com'?
Run a second regex pass that recognizes [at], (at), {at}, [dot], (dot), {dot}, and the literal words at and dot with whitespace boundaries, rewrites them to @ and ., and then applies the standard regex. The TextKit Email Extractor has this as a toggle.
Why do some emails get extracted with trailing punctuation?
Usually because the address was captured by a looser pattern or by whitespace tokenization, or because sentence punctuation runs straight into the next word with no space. After extraction, run a cleanup pass that strips trailing periods, commas, semicolons, and closing brackets.
How do I tell which extracted emails are real?
Three signals: domain MX records exist (cheap to check), the SMTP server accepts the address (more expensive, looks like spam), and the address pattern matches the known scheme for the domain (free if you've seen one valid address). Bulk validation services do all three.
Should I lowercase extracted emails?
The local part of an email address is technically case-sensitive per RFC 5321. In practice, every major provider treats it case-insensitively. Lowercasing on extraction prevents duplicate rows for Alice@example.com and alice@example.com.
How fast is email extraction in the browser?
Around 1 million characters per second for a single-pass regex on a modern laptop. A 10MB text file processes in roughly 10 seconds without leaving the browser.
Written by the TextKit team. We build the tools we write about — try the Email Extractor used in this post.