February 19, 2026 · 9 min
Automating Business Lead Extraction with AI and SaaS Platforms
How I built a Google Maps data scraper that extracts, enriches, and qualifies business leads automatically — cutting manual prospecting from hours to minutes.
TL;DR: Manual lead research is soul-crushing and slow. I built an automated pipeline that scrapes business data from Google Maps, enriches it with AI, and delivers qualified leads ready for outreach — automatically.
The Problem with Manual Lead Generation
Sales and marketing teams spend 30–40% of their time on manual prospecting tasks:
- Searching Google Maps for businesses in a niche
- Copying phone numbers, emails, and addresses
- Qualifying leads based on arbitrary criteria
- Deduplicating and cleaning data
- Importing into CRM
This is exactly the kind of repetitive, high-volume, rule-based work that automation excels at.
The Solution: Automated Lead Extraction Pipeline
I built a Python-based pipeline that handles the full flow:
Business Type + Location
│
▼
Google Maps Scraper (Python)
│
Raw Business Data
│
▼
AI Enrichment Layer (GPT-4o API)
│
Enriched + Qualified Leads
│
▼
Export (CSV / PostgreSQL / Notion / CRM)
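The stages above can be wired together as a simple sequential runner. This is an illustrative sketch, not the production code: `run_pipeline` and the stub stages are hypothetical, with stage names mirroring the helpers defined later in the post (scraper, dedupe, AI enrichment, score filter).

```python
import asyncio
from typing import Awaitable, Callable

async def run_pipeline(
    scrape: Callable[[], Awaitable[list[dict]]],
    dedupe: Callable[[list[dict]], list[dict]],
    enrich: Callable[[dict], dict],
    min_score: int = 7,
) -> list[dict]:
    """Scrape -> dedupe -> enrich -> filter, with each stage injected as a callable."""
    raw = await scrape()
    enriched = [enrich(lead) for lead in dedupe(raw)]
    return [lead for lead in enriched if lead.get('lead_score', 0) >= min_score]

# Demo with stub stages standing in for the real scraper and enricher
async def _demo() -> list[dict]:
    async def scrape() -> list[dict]:
        return [{'name': 'Acme Dental'}, {'name': 'Acme Dental'}, {'name': 'Beta Gym'}]

    def dedupe(leads: list[dict]) -> list[dict]:
        # Exact-name dedupe for the demo; the real pipeline uses fuzzy matching
        return list({lead['name']: lead for lead in leads}.values())

    def enrich(lead: dict) -> dict:
        return {**lead, 'lead_score': 9 if 'Acme' in lead['name'] else 4}

    return await run_pipeline(scrape, dedupe, enrich)

qualified = asyncio.run(_demo())  # [{'name': 'Acme Dental', 'lead_score': 9}]
```

Injecting each stage as a callable keeps the orchestration testable without a browser or an API key.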
The Google Maps Scraper
Google doesn't offer a bulk-export API for Maps business listings (the official Places API is paid per request and rate-limited), so I built a Playwright-based scraper:
import asyncio
import json
from playwright.async_api import async_playwright

async def scrape_google_maps(query: str, location: str, max_results: int = 100):
    """
    Scrapes business listings from Google Maps for a given query and location.
    Returns a list of business dictionaries.
    """
    businesses = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # Navigate to Google Maps search
        search_url = f"https://www.google.com/maps/search/{query}+{location}"
        await page.goto(search_url)
        await page.wait_for_selector('[role="feed"]', timeout=10000)

        # Scroll to load more results
        feed = page.locator('[role="feed"]')
        processed = 0  # listings already attempted, including failed extractions
        while len(businesses) < max_results:
            # Extract listings we haven't processed yet
            listings = await page.locator('[data-result-index]').all()
            for listing in listings[processed:]:
                business = await extract_listing_data(listing, page)
                processed += 1
                if business:
                    businesses.append(business)

            # Scroll down to load more
            await feed.evaluate('el => el.scrollTop += 800')
            await page.wait_for_timeout(1500)  # Rate limiting

            # Stop when Google reports no more results
            if await page.locator('text="You\'ve reached the end"').is_visible():
                break

        await browser.close()
    return businesses[:max_results]
async def extract_listing_data(listing, page):
    """Clicks into a listing and extracts detailed business data."""
    try:
        await listing.click()
        await page.wait_for_timeout(1000)

        # Extract all available data points. Note: Google's obfuscated class
        # names (DUwDvf, MW4etd, ...) change periodically and need re-checking.
        data = {
            'name': await safe_extract(page, 'h1.DUwDvf'),
            'rating': await safe_extract(page, '.MW4etd'),
            'reviews_count': await safe_extract(page, '.UY7F9'),
            'category': await safe_extract(page, '.DkEaL'),
            'address': await safe_extract(page, '[data-item-id="address"]'),
            'phone': await safe_extract(page, '[data-item-id*="phone"]'),
            'website': await safe_extract(page, '[data-item-id="authority"]'),
            'hours': await get_operating_hours(page),
            'google_maps_url': page.url,
        }
        return data if data['name'] else None
    except Exception as e:
        print(f"Failed to extract listing: {e}")
        return None

async def safe_extract(page, selector: str) -> str:
    """Safely extracts text from a CSS selector, returns empty string on failure."""
    try:
        element = page.locator(selector).first
        return await element.inner_text(timeout=2000)
    except Exception:
        return ''
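The `get_operating_hours` helper referenced above isn't shown in full; here is a sketch of how it could work. The `table tr` selector and the "Monday, 9 AM to 5 PM" row format are assumptions about the Maps DOM that may break when Google updates it, so the parsing lives in a pure function that's easy to adjust and test.

```python
import re

_DAY = r'(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)'

def parse_hours_rows(rows: list[str]) -> dict[str, str]:
    """Turn rows like 'Monday, 9 AM to 5 PM' into {'Monday': '9 AM to 5 PM'}."""
    hours = {}
    for row in rows:
        match = re.match(rf'^{_DAY}\s*,\s*(.+)$', row.strip())
        if match:
            hours[match.group(1)] = match.group(2).strip()
    return hours

async def get_operating_hours(page) -> dict[str, str]:
    """Best-effort pull of the hours table from the open listing (selector is a guess)."""
    try:
        rows = await page.locator('table tr').all_inner_texts()
        return parse_hours_rows(rows)
    except Exception:
        return {}
```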
AI Enrichment: From Data to Insights
Raw scraped data is just data. The AI layer converts it into actionable intelligence:
import json

from openai import OpenAI

client = OpenAI()

def enrich_lead_with_ai(business: dict) -> dict:
    """
    Uses GPT-4o-mini to analyze a business listing and add qualification signals.
    """
    prompt = f"""
    Analyze this business listing and provide qualification insights:

    Business: {business['name']}
    Category: {business['category']}
    Rating: {business['rating']} ({business['reviews_count']} reviews)
    Website: {business.get('website', 'None')}
    Address: {business['address']}

    Provide a JSON response with:
    1. "lead_score": 1-10 based on business health signals
    2. "size_estimate": "solo" | "small" | "medium" | "large"
    3. "tech_adoption": "low" | "medium" | "high" (based on website quality, presence)
    4. "pain_points": list of likely business pain points for this category
    5. "outreach_angle": best approach for cold outreach
    6. "email_guess": educated email format guess based on website/name
    7. "qualification_notes": brief notes on lead quality

    Return only valid JSON.
    """
    response = client.chat.completions.create(
        model='gpt-4o-mini',  # Use mini for cost efficiency at scale
        messages=[{'role': 'user', 'content': prompt}],
        response_format={'type': 'json_object'},
    )
    insights = json.loads(response.choices[0].message.content)
    return {**business, **insights}
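Even with `json_object` mode, the model can return out-of-range scores or the wrong types for a field, so it pays to normalize the output before it reaches the score filter. This validator is an illustrative addition (not part of the original pipeline), with the allowed values taken from the prompt above:

```python
ALLOWED_SIZES = {'solo', 'small', 'medium', 'large'}
ALLOWED_TECH = {'low', 'medium', 'high'}

def validate_insights(insights: dict) -> dict:
    """Clamp and normalize enrichment fields so downstream filters never see junk."""
    out = dict(insights)
    try:
        score = int(out.get('lead_score', 1))
    except (TypeError, ValueError):
        score = 1
    out['lead_score'] = max(1, min(10, score))       # force into the 1-10 range
    if out.get('size_estimate') not in ALLOWED_SIZES:
        out['size_estimate'] = 'small'
    if out.get('tech_adoption') not in ALLOWED_TECH:
        out['tech_adoption'] = 'low'
    if not isinstance(out.get('pain_points'), list):  # model sometimes returns a string
        out['pain_points'] = []
    return out
```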
Deduplication and Data Cleaning
import pandas as pd
from fuzzywuzzy import fuzz
def deduplicate_leads(leads: list[dict]) -> list[dict]:
"""Remove duplicate businesses using fuzzy name matching."""
df = pd.DataFrame(leads)
seen_names = []
unique_leads = []
for lead in leads:
name = lead['name'].lower().strip()
# Check fuzzy similarity against seen names
is_duplicate = any(
fuzz.ratio(name, seen) > 90 # >90% similar = duplicate
for seen in seen_names
)
if not is_duplicate:
seen_names.append(name)
unique_leads.append(lead)
return unique_leads
def clean_phone(phone: str) -> str:
"""Standardize phone number format."""
import re
digits = re.sub(r'\D', '', phone)
if len(digits) == 10:
return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
return phone
def extract_email_from_website(website_url: str) -> str | None:
"""Attempt to find a contact email on the business website."""
import httpx, re
try:
response = httpx.get(website_url, timeout=5, follow_redirects=True)
emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', response.text)
# Filter out common non-contact emails
filtered = [e for e in emails
if not any(skip in e.lower() for skip in ['example', 'noreply', 'sentry'])]
return filtered[0] if filtered else None
except:
return None
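The pipeline diagram lists CSV among the export targets; a minimal sketch of that step might look like the following. The column list is my own assumption about which fields matter for outreach, and any extra keys on a lead are simply dropped:

```python
import csv

EXPORT_FIELDS = ['name', 'category', 'phone', 'website', 'address',
                 'lead_score', 'outreach_angle']

def export_to_csv(leads: list[dict], path: str) -> int:
    """Write enriched leads to CSV, keeping only EXPORT_FIELDS; returns row count."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=EXPORT_FIELDS)
        writer.writeheader()
        for lead in leads:
            # Missing fields become empty cells; unknown keys are ignored
            writer.writerow({k: lead.get(k, '') for k in EXPORT_FIELDS})
    return len(leads)
```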
AutomateLanka Integration: Building This as a Workflow
This pipeline is now integrated into AutomateLanka as a workflow node:
[Trigger: Schedule Daily at 9am]
│
▼
[Node: Google Maps Scraper]
Config: query, location, limit
│
▼
[Node: AI Enrichment]
Config: enrichment fields, model
│
▼
[Node: Filter]
Condition: lead_score >= 7
│
▼
[Node: Export to Notion Database]
OR
[Node: Push to CRM via Webhook]
// AutomateLanka workflow node definition
const leadGenerationWorkflow = {
id: 'lead-gen-001',
name: 'Daily Lead Extraction',
nodes: [
{
id: 'scraper',
type: 'google-maps-scraper',
config: {
query: '{{env.LEAD_QUERY}}',
location: '{{env.TARGET_LOCATION}}',
limit: 50,
}
},
{
id: 'enricher',
type: 'ai-enrichment',
input: '{{nodes.scraper.output}}',
config: { model: 'gpt-4o-mini', fields: ['lead_score', 'pain_points', 'outreach_angle'] }
},
{
id: 'filter',
type: 'data-filter',
input: '{{nodes.enricher.output}}',
condition: 'item.lead_score >= 7',
},
{
id: 'export',
type: 'notion-export',
input: '{{nodes.filter.output}}',
config: { databaseId: '{{env.NOTION_DB_ID}}' }
}
]
};
Results and Impact
Running this pipeline for a client in the SaaS sales space:
| Metric | Manual | Automated |
|--------|--------|-----------|
| Leads per hour | ~15 | ~200 |
| Qualification time | Manual review | Instant (AI) |
| Data quality | Variable | Consistent |
| Cost per lead | ~$8 | ~$0.12 |
The roughly 13× improvement in lead throughput and 98.5% cost reduction demonstrate the power of targeted automation.
Ethical and Legal Considerations
Always ensure scraping complies with:
- Google Maps ToS — keep scraping to non-commercial research, and use the official Places API for commercial work
- GDPR/CCPA — business data for B2B outreach has different rules than consumer data
- Rate limiting — don't hammer servers; respect robots.txt
- Opt-out mechanisms — provide easy ways for businesses to remove their data
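The rate-limiting point deserves code, since it's the easiest one to get wrong. These helpers are illustrative (not part of the pipeline above): a jittered sleep so requests never land on a fixed beat, and an exponential-backoff retry wrapper for transient failures.

```python
import asyncio
import random

async def polite_wait(base: float = 1.5, jitter: float = 0.75) -> float:
    """Sleep for base +/- jitter seconds; returns the actual delay used."""
    delay = max(0.0, base + random.uniform(-jitter, jitter))
    await asyncio.sleep(delay)
    return delay

async def with_backoff(fn, retries: int = 3, base: float = 1.0):
    """Retry an async callable with exponential backoff plus a little jitter."""
    for attempt in range(retries):
        try:
            return await fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            await asyncio.sleep(base * (2 ** attempt) + random.uniform(0, 0.25))
```

In the scraper, `with_backoff` would wrap the per-listing extraction, and `polite_wait` would replace the fixed `wait_for_timeout(1500)` between scrolls.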