Is it legal to extract emails from a website?

Extraction itself isn't illegal in most jurisdictions. What you send to the extracted addresses is what gets regulated: GDPR, CASL, and Australia's Spam Act 2003 require consent for commercial email, and CAN-SPAM sets conditions on it in the US. Extraction for personal contact lookup, internal data cleanup, or research is generally fine; cold outbound to scraped lists is generally not.

How do I detect obfuscated emails like 'name [at] company [dot] com'?

Run a second regex pass that recognizes "[at]", "(at)", "{at}", "[dot]", "(dot)", and the literal words "at" and "dot" when they are surrounded by whitespace. The TextKit Email Extractor has this as a toggle.

Why do some emails get extracted with trailing punctuation?

Because loose patterns, such as splitting on whitespace around the @, capture the sentence punctuation that follows an address. After extraction, run a cleanup pass that strips trailing periods, commas, semicolons, and closing brackets.

How do I tell which extracted emails are real?

Three signals: domain MX records exist (cheap to check), the SMTP server accepts the address (more expensive, looks like spam), and the address pattern matches the known scheme for the domain (free if you've seen one valid address). Bulk validation services do all three.

Should I lowercase extracted emails?

The local part of an email address is technically case-sensitive per RFC 5321. In practice, every major provider treats it case-insensitively. Lowercasing on extraction prevents duplicate rows for "Alice@example.com" and "alice@example.com".

How fast is email extraction in the browser?

Around 1 million characters per second for a single-pass regex on a modern laptop. A 10MB text file processes in roughly 10 seconds without leaving the browser.

Email Extraction: The Complete Guide (2026)

Email extraction is regex plus discipline. The regex is the easy part; getting consistent results across resumes, emails, web pages, CSV dumps, and PDFs takes a workflow. This guide covers the patterns, the gotchas, the obfuscation tricks to watch for, and the legal and deliverability constraints that determine what you can do with the list once you have it.

On this page

The regex that does 95% of the work
Where the regex fails: the 5% you have to handle
Obfuscation: the patterns to recognize
The deduplication step
The deliverability check: what's actually a real address
The legal layer: what you can actually do with the list
The browser-vs-server question
The five workflow patterns

The regex that does 95% of the work

This is the standard practical email regex: not the RFC-compliant one, but the one people actually use.

[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

Three parts: local (everything before the @), domain (everything between @ and the last dot), top-level domain (everything after the last dot). Each part has a character class that's permissive enough to catch real-world addresses without matching obvious garbage.

This regex correctly extracts:

alice@example.com: basic case
bob.smith+filter@example.co.uk: plus filtering, multi-part domain
support_team@subdomain.company.io: underscores, subdomain, modern TLD

And correctly rejects:

not-an-email: missing @
@example.com: empty local
alice@: missing domain
alice@example: missing TLD
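
The accept and reject cases above can be checked directly. The following is a minimal sketch using Python's standard re module; the names EMAIL_RE and extract_emails are illustrative choices, not part of any tool mentioned here.

```python
import re

# The practical pattern from this section: local part, "@", domain, dot, TLD.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def extract_emails(text):
    """Return every substring of text that matches the practical pattern."""
    return EMAIL_RE.findall(text)


sample = (
    "Contact alice@example.com, bob.smith+filter@example.co.uk "
    "or support_team@subdomain.company.io"
)
found = extract_emails(sample)

# None of these strings contains a complete local@domain.tld sequence.
rejected = [extract_emails(s) for s in
            ["not-an-email", "@example.com", "alice@", "alice@example"]]
```

Here found holds the three addresses in order of appearance, and every entry in rejected is an empty list.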

Where the regex fails: the 5% you have to handle

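Trailing punctuation. In the text "...email me at alice@example.com." the sentence-ending period sits directly after the address. The pattern above stops before it, because a match has to end in [a-zA-Z]{2,}, which contains only letters. Looser patterns, such as splitting on whitespace or a domain class with no required TLD, do capture alice@example.com. with the trailing period. Fix: post-process by stripping any trailing .,;:)>}] characters.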

Internationalized addresses. Real RFC 6531 email addresses can contain non-ASCII characters in the local part: 用户@example.com. The standard regex won't catch them. In practice they're rare. Most extraction tools (including the TextKit Email Extractor) skip them by default and surface a toggle for the rare projects where they matter.

Quoted local parts. The full RFC allows "strange address"@example.com with a quoted local. Almost no real address uses this. Skipping is fine.

Comments. The full RFC allows embedded parenthetical comments, as in (comment)alice@example.com. They are essentially never seen in the wild outside RFC test suites. Skipping is fine.
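
The trailing-punctuation cleanup from this section might look like the sketch below. It assumes the candidates come from a loose, whitespace-delimited pattern (LOOSE_RE here, an illustrative choice), since that is where sentence punctuation tends to get captured. The function names are hypothetical.

```python
import re

# Assumption: a deliberately loose pattern that grabs any whitespace-delimited
# token containing "@". It captures trailing sentence punctuation.
LOOSE_RE = re.compile(r"\S+@\S+")

# The trailing characters this section suggests stripping.
TRAILING = ".,;:)>}]"


def strip_trailing_punctuation(candidate):
    """Remove any run of trailing punctuation from one candidate."""
    return candidate.rstrip(TRAILING)


def extract_loose(text):
    """Find loose candidates, then clean the right-hand edge of each."""
    return [strip_trailing_punctuation(tok) for tok in LOOSE_RE.findall(text)]
```

This only cleans the right-hand edge. A token that starts with an opening bracket or quote would need a matching left-hand strip, which is not shown here.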

Obfuscation: the patterns to recognize

Webpages, social media bios, and resumes often hide email addresses to deter scrapers. The five common patterns:

Pattern	Reconstructed
`alice [at] example [dot] com`	alice@example.com
`alice (at) example (dot) com`	alice@example.com
`alice@example DOT com`	alice@example.com
`alice AT example.com`	alice@example.com
`alice{at}example{dot}com`	alice@example.com

A second extraction pass with relaxed pattern matching, which replaces the obfuscated tokens before running the standard regex, recovers most of these. Toggle "Detect obfuscated emails" in the Email Extractor to enable this pass.
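
One way to sketch that second pass is shown below. This is not the Email Extractor's actual implementation. It assumes bracketed tokens are matched in any case, while bare AT and DOT are only rewritten when upper-case and surrounded by whitespace, a choice made here to limit false positives on ordinary prose. All names are illustrative.

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Bracketed tokens such as [at], (at), {at}, with optional surrounding spaces.
BRACKET_AT = re.compile(r"\s*[\[({]\s*at\s*[\])}]\s*", re.IGNORECASE)
BRACKET_DOT = re.compile(r"\s*[\[({]\s*dot\s*[\])}]\s*", re.IGNORECASE)

# Bare words: upper-case only and whitespace-delimited (assumption, see above).
BARE_AT = re.compile(r"\s+AT\s+")
BARE_DOT = re.compile(r"\s+DOT\s+")


def deobfuscate(text):
    """Rewrite obfuscation tokens to '@' and '.' before the normal regex runs."""
    text = BRACKET_AT.sub("@", text)
    text = BRACKET_DOT.sub(".", text)
    text = BARE_AT.sub("@", text)
    text = BARE_DOT.sub(".", text)
    return text


def extract_obfuscated(text):
    """Run the standard pattern over the de-obfuscated text."""
    return EMAIL_RE.findall(deobfuscate(text))
```

Because the replacement rewrites the whole text, it is safer to run this as a separate pass and merge its results with the plain extraction than to substitute it for the first pass.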

Two patterns that can't be reliably reversed:

Image-based emails. The address is rendered as a PNG inside the page. OCR is required, and the OCR is rarely perfect on small text.
JavaScript-assembled. The address is concatenated at runtime from string fragments. Static text scraping won't see it; only headless-browser scraping will.

The deduplication step

A single source frequently contains the same email multiple times: in a header, a signature, a CC line, and an inline reference. Always dedupe.

The non-obvious dedupe: case-insensitive comparison. Alice@example.com and alice@example.com are the same address. Lowercase before comparing.

The trickier dedupe: dot-aliases on Gmail. alice.smith@gmail.com, alicesmith@gmail.com, and a.l.i.c.e.smith@gmail.com are all the same Gmail mailbox. For non-Gmail domains, dots are significant. The dot-collapse step is Gmail-specific and rarely worth the complexity unless the entire list is Gmail.
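
The two dedupe rules above can be sketched as follows. The function names and the collapse_gmail_dots flag are illustrative; the flag is off by default because, as noted, the dot-collapse step only applies to Gmail.

```python
def normalize(address, collapse_gmail_dots=False):
    """Lowercase an address; optionally drop dots in a gmail.com local part."""
    local, _, domain = address.strip().lower().rpartition("@")
    if collapse_gmail_dots and domain == "gmail.com":
        local = local.replace(".", "")
    return f"{local}@{domain}"


def dedupe(addresses, collapse_gmail_dots=False):
    """Return normalized addresses, first occurrence first, no repeats."""
    seen = set()
    unique = []
    for address in addresses:
        key = normalize(address, collapse_gmail_dots)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

With the flag on, the three Gmail variants from the paragraph above collapse to a single alicesmith@gmail.com entry, while dotted addresses on other domains stay distinct.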

One paste, deduplicated list. The Email Extractor handles regex, lowercase normalization, deduplication, obfuscation detection, and CSV export, all locally in the browser with nothing uploaded.

The deliverability check: what's actually a real address

Extraction gives you a syntactically valid string. Real-world deliverability requires three checks:

MX record exists. A DNS query for the domain. Cheap, can be done at scale, doesn't touch the recipient's server. About 5% of extracted addresses fail here: typo'd domains, expired domains, and scratch addresses for sign-up forms.
SMTP server accepts the address. Open a connection, send RCPT TO:, read the response. Expensive, gets you blacklisted if done at volume, and many servers lie (catch-all addresses accept anything).
Address pattern matches the known scheme. If you've seen firstname.lastname@company.com work for one person at a company, similar addresses for others probably work. Free, scales infinitely, but no per-address confirmation.

Real bulk-validation services (NeverBounce, ZeroBounce, Hunter Email Verifier) combine all three. For small lists, MX-only is enough. It's cheap and catches the most common error mode.
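
For the MX-only approach, most of the saving comes from querying each domain once rather than once per address. The sketch below shows that bookkeeping only. It assumes the caller supplies has_mx, a hypothetical function that takes a domain and returns whether it has MX records (for instance a thin wrapper around a DNS resolver library); no DNS code is included here.

```python
def filter_by_mx(addresses, has_mx):
    """Split addresses into (deliverable, rejected) using one lookup per domain.

    has_mx is a caller-supplied callable: domain -> truthy if MX records exist.
    """
    cache = {}  # domain -> bool, so repeated domains cost one lookup
    deliverable, rejected = [], []
    for address in addresses:
        domain = address.rpartition("@")[2].lower()
        if domain not in cache:
            cache[domain] = bool(has_mx(domain))
        if cache[domain]:
            deliverable.append(address)
        else:
            rejected.append(address)
    return deliverable, rejected
```

Keeping the lookup injectable also makes the filter easy to test with a fake resolver before pointing it at real DNS.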

The legal layer: what you can actually do with the list

Three rules cover most jurisdictions, but consult a lawyer for any commercial program:

CAN-SPAM (US). Unsolicited commercial email is allowed if you (a) include a clear unsubscribe link, (b) honor opt-outs within 10 business days, (c) don't use deceptive subject lines, and (d) include a physical postal address. No prior consent required.
GDPR (EU). Marketing email to EU residents requires a lawful basis. For most outbound marketing, that means explicit prior consent. Extraction-then-spam to EU addresses is a violation that can draw fines of up to 4% of global annual revenue or EUR 20 million, whichever is higher.
CASL (Canada) and Spam Act 2003 (Australia). Both require prior consent, implied or express, for commercial email. Both are stricter than CAN-SPAM.

The practical implication: extraction for internal use, contact lookup, deduplication, list cleanup, and one-to-one outreach is fine almost everywhere. Bulk cold outbound to scraped EU addresses is illegal in the EU; bulk cold outbound to scraped Canadian and Australian addresses is illegal there. Bulk cold outbound to scraped US addresses is technically legal under CAN-SPAM, but most ESPs (Mailchimp, ConvertKit, Brevo) prohibit it in their terms. Sending will get the account suspended.

The browser-vs-server question

Email extraction can run in three places. Each has a different trade-off:

Where	Best for	Privacy	Speed
Browser (regex in JS)	Pasted text up to ~50MB	Best. Nothing leaves the device	~1M chars/sec
Local script (Python, grep -E)	Files larger than 50MB, pipelines	Best. Local-only	~5M chars/sec
Server (paid API)	Continuous scraping, validation, scale	Worst. Data leaves your control	Network-bound

For the typical use case (cleaning up a CSV, parsing a chat log, pulling addresses out of a resume), the browser is the right tool. The TextKit Email Extractor handles paste, dedupe, sort, lowercase, and CSV export without uploading anything.
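
For the local-script row of the table, a minimal sketch of a streaming extractor is shown below. It reads line by line, so a large file never has to fit in memory. It assumes no address spans a line break; the function names are illustrative.

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def extract_from_lines(lines):
    """Yield each distinct address (lowercased) the first time it appears."""
    seen = set()
    for line in lines:
        for match in EMAIL_RE.findall(line):
            address = match.lower()
            if address not in seen:
                seen.add(address)
                yield address


def extract_from_file(path, encoding="utf-8"):
    """Stream a text file through extract_from_lines without loading it whole."""
    with open(path, encoding=encoding, errors="replace") as handle:
        yield from extract_from_lines(handle)
```

Any iterable of strings works as input, so the same function covers an open file handle, a list of pasted lines, or a decoded network stream.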

The five workflow patterns

Resume cleanup. Paste a candidate's resume text. Extract the one or two contact addresses. Confirm they match the contact section.
Web page contact harvesting. View source, copy the page text, paste, extract. Useful for finding addresses on a contact page that have been spread across multiple sections.
CSV deduplication. Paste a CSV column or a multi-column dump. Extract → dedupe → re-export. Cleans up lists that grew through multiple acquisitions and import paths.
Chat-log parsing. Slack export, email thread, Discord backup. Extract addresses from sender lines, signatures, and inline mentions. The dedupe step matters most here.
Form-submission audits. Form responses dumped as a text file. Extract the email field across all responses to verify the input was clean.

For the regex itself with worked examples, see Email Regex Cheatsheet. For the step-by-step workflow, see How to Extract Emails from Any Text. For the comparison with paid tools, see Hunter.io vs Free Email Extractors.

Written by SAVI. We build the tools we write about. Try the Email Extractor used in this post.