
Email Extractor for PDF Files

Extract email addresses from PDF documents, including invoices, business cards, and contracts. Supports OCR for scanned documents. Available as a free Python script.

Python · Free Download · Cross-platform

Code Preview

Python
#!/usr/bin/env python3
"""
Email Extractor for PDF Files
Supports digital PDFs and OCR for scanned documents
"""
import re
from pathlib import Path

try:
    import pdfplumber
except ImportError:
    pdfplumber = None

# local part, "@", domain, then a top-level domain of two or more letters
EMAIL_PATTERN = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

def extract_from_digital_pdf(pdf_path):
    """Extract emails from text-based PDF"""
    if not pdfplumber:
        raise ImportError("Install pdfplumber: pip install pdfplumber")

    emails = set()

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if text:
                found = re.findall(EMAIL_PATTERN, text)
                emails.update(e.lower() for e in found)

    return sorted(emails)

def extract_with_ocr(pdf_path):
    """Extract emails from scanned PDF using OCR"""
    try:
        import pytesseract
        from pdf2image import convert_from_path
    except ImportError:
        raise ImportError("Install: pip install pytesseract pdf2image")

    emails = set()
    images = convert_from_path(pdf_path)

    for img in images:
        text = pytesseract.image_to_string(img)
        found = re.findall(EMAIL_PATTERN, text)
        emails.update(e.lower() for e in found)

    return sorted(emails)

def extract_emails(pdf_path, use_ocr=False):
    """Main extraction function"""
    if use_ocr:
        return extract_with_ocr(pdf_path)
    return extract_from_digital_pdf(pdf_path)

def batch_extract(folder_path, use_ocr=False):
    """Extract emails from all PDFs in folder"""
    all_emails = set()
    folder = Path(folder_path)

    for pdf_file in folder.glob('*.pdf'):
        try:
            emails = extract_emails(pdf_file, use_ocr)
            all_emails.update(emails)
            print(f"{pdf_file.name}: {len(emails)} emails")
        except Exception as e:
            print(f"{pdf_file.name}: Error - {e}")

    return sorted(all_emails)

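if __name__ == '__main__':
    import sys
    # Separate flags (e.g. --ocr) from the positional PDF path
    args = [a for a in sys.argv[1:] if not a.startswith('--')]
    if not args:
        sys.exit(f"Usage: {sys.argv[0]} <file.pdf> [--ocr]")
    use_ocr = '--ocr' in sys.argv
    emails = extract_emails(args[0], use_ocr)
    print(f"Found {len(emails)} unique emails")
    for email in emails:
        print(email)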
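
The matching logic can be tried on plain strings before pointing the script at a PDF. The sketch below is a standalone illustration rather than part of the script: `find_emails` is a hypothetical helper name, and the pattern is a tightened variant of `EMAIL_PATTERN` whose top-level-domain class is `[A-Za-z]`, so a stray `|` is never accepted as part of a domain.

```python
import re

# Assumed variant of the script's pattern; the TLD is limited to letters only.
EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')


def find_emails(text):
    """Return the unique, lower-cased addresses found in text, sorted."""
    return sorted({match.lower() for match in EMAIL_RE.findall(text)})


sample = "Contact Sales@Example.com or billing@example.co.uk. Not an email: user@localhost"
print(find_emails(sample))
# ['billing@example.co.uk', 'sales@example.com']
```

Addresses are lower-cased before de-duplication, so `Sales@Example.com` and `sales@example.com` count as one entry. A host without a dot, such as `user@localhost`, is not matched.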

About This Tool

The PDF Email Extractor parses PDF files to find email addresses. It supports both digital (text-based) PDFs and scanned documents, which are read via OCR.

Key Features

  • Digital PDFs - Fast extraction using pdfplumber (5-10 pages/sec)
  • OCR Support - Tesseract OCR for scanned documents
  • Multi-Page - Handles documents with 100+ pages
  • Table Detection - Extracts from complex layouts
  • Batch Processing - Process entire folders
  • Password Support - Handles encrypted PDFs
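
The code preview does not show table or password handling. As a rough sketch of how these two features might look, pdfplumber's `open()` accepts a `password` argument and each page offers `extract_tables()`. The helper names `flatten_tables` and `extract_text_with_tables` are hypothetical and are not taken from the downloadable script.

```python
def flatten_tables(tables):
    """Join the non-empty cells of extracted tables into searchable text.

    `tables` follows the shape returned by pdfplumber's page.extract_tables():
    a list of tables, each a list of rows, each a list of cells (str or None).
    """
    lines = []
    for table in tables:
        for row in table:
            cells = [cell.strip() for cell in row if cell and cell.strip()]
            if cells:
                lines.append(" ".join(cells))
    return "\n".join(lines)


def extract_text_with_tables(pdf_path, password=None):
    """Collect page text plus table cell text from a (possibly encrypted) PDF."""
    import pdfplumber  # imported lazily so flatten_tables works without it

    chunks = []
    with pdfplumber.open(pdf_path, password=password) as pdf:
        for page in pdf.pages:
            chunks.append(page.extract_text() or "")
            chunks.append(flatten_tables(page.extract_tables()))
    return "\n".join(chunk for chunk in chunks if chunk)
```

The returned string can then be searched with `re.findall(EMAIL_PATTERN, text)` in the same way as ordinary page text.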

Supported PDF Types

  • Text-based PDFs (Word, Excel exports)
  • Scanned documents (with OCR)
  • Business cards and invoices
  • Multi-column layouts
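
Some PDFs mix text pages with scanned pages, so one possible refinement is to decide per page whether OCR is needed instead of relying only on the `--ocr` flag. The sketch below is an assumption about how that could work, not a description of the script's behavior; `needs_ocr`, `pages_needing_ocr`, and the `min_chars` threshold are illustrative.

```python
def needs_ocr(page_text, min_chars=20):
    """Guess whether a page lacks a usable text layer.

    page_text is what page.extract_text() returned (may be None or empty).
    min_chars is an arbitrary illustrative threshold, not a tuned value.
    """
    if not page_text:
        return True
    # Count only non-whitespace characters so blank lines do not count as text
    visible = sum(1 for ch in page_text if not ch.isspace())
    return visible < min_chars


def pages_needing_ocr(page_texts, min_chars=20):
    """Return the 1-based page numbers that should be sent to OCR."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needs_ocr(text, min_chars)
    ]


print(pages_needing_ocr(["Invoice for ACME Ltd, contact billing@acme.example", None, "  \n "]))
# [2, 3]
```

The flagged page numbers could then be rendered one at a time with pdf2image's `convert_from_path(pdf_path, first_page=n, last_page=n)`, which avoids running OCR on pages that already have a text layer.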

Requirements

  • Python 3.7+
  • pdfplumber (pip install pdfplumber)
  • For OCR: pytesseract, pdf2image, Tesseract engine
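
Before running the extractor, it can help to confirm which of these pieces are installed. The following is a minimal sketch using only the standard library; `check_dependencies` is a hypothetical helper. It also checks for `pdftoppm`, because pdf2image relies on the Poppler utilities to render pages, and it assumes the external programs are on `PATH`.

```python
import importlib.util
import shutil


def check_dependencies():
    """Report which optional pieces are available, without importing them."""
    return {
        "pdfplumber": importlib.util.find_spec("pdfplumber") is not None,
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        "pdf2image": importlib.util.find_spec("pdf2image") is not None,
        # Tesseract is a separate program that pytesseract calls
        "tesseract": shutil.which("tesseract") is not None,
        # pdf2image relies on Poppler's pdftoppm to render pages
        "pdftoppm": shutil.which("pdftoppm") is not None,
    }


if __name__ == "__main__":
    for name, available in check_dependencies().items():
        print(f"{name}: {'found' if available else 'missing'}")
```

A `missing` result for `tesseract` or `pdftoppm` only means the program was not found on `PATH`; an installation in a custom location would need to be configured separately.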

Performance: roughly 5-10 pages/sec for digital PDFs and about 1 page/sec with OCR. OCR accuracy is 85-95% on clear scans.
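
Throughput depends on hardware and on the documents themselves, so it may be worth measuring on your own files. The sketch below is one simple way to do that; `measure_throughput` is a hypothetical helper that times any per-page function.

```python
import time


def measure_throughput(process_page, pages):
    """Time process_page over pages and return pages per second."""
    start = time.perf_counter()
    count = 0
    for page in pages:
        process_page(page)
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very fast runs
    return count / elapsed if elapsed > 0 else float("inf")
```

With pdfplumber, `pages` could be `pdf.pages` and `process_page` a function that calls `page.extract_text()`; for OCR, `pages` could be the images returned by `convert_from_path` and `process_page` could be `pytesseract.image_to_string`.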
