February 19, 2026 · 9 min
Automating Business Lead Extraction with AI and SaaS Platforms
How I built a Google Maps data scraper that extracts, enriches, and qualifies business leads automatically — cutting manual prospecting from hours to minutes.
TL;DR: Manual lead research is soul-crushing and slow. I built an automated pipeline that scrapes business data from Google Maps, enriches it with AI, and delivers qualified leads ready for outreach — automatically.
The Problem with Manual Lead Generation
Sales and marketing teams spend 30–40% of their time on manual prospecting tasks:
- Searching Google Maps for businesses in a niche
- Copying phone numbers, emails, and addresses
- Qualifying leads based on arbitrary criteria
- Deduplicating and cleaning data
- Importing into CRM
This is exactly the kind of repetitive, high-volume, rule-based work that automation excels at.
The Solution: Automated Lead Extraction Pipeline
I built a Python-based pipeline that handles the full flow:
Business Type + Location
│
▼
Google Maps Scraper (Python)
│
Raw Business Data
│
▼
AI Enrichment Layer (GPT-4o API)
│
Enriched + Qualified Leads
│
▼
Export (CSV / PostgreSQL / Notion / CRM)
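The stages above can be wired together as a simple sequential runner. This is an illustrative sketch, not the production code: `run_pipeline` and the stub stages are hypothetical, with stage names mirroring the helpers defined later in the post (scraper, dedupe, AI enrichment, score filter).

```python
import asyncio
from typing import Awaitable, Callable

async def run_pipeline(
    scrape: Callable[[], Awaitable[list[dict]]],
    dedupe: Callable[[list[dict]], list[dict]],
    enrich: Callable[[dict], dict],
    min_score: int = 7,
) -> list[dict]:
    """Scrape -> dedupe -> enrich -> filter, with each stage injected as a callable."""
    raw = await scrape()
    enriched = [enrich(lead) for lead in dedupe(raw)]
    return [lead for lead in enriched if lead.get('lead_score', 0) >= min_score]

# Demo with stub stages standing in for the real scraper and enricher
async def _demo() -> list[dict]:
    async def scrape() -> list[dict]:
        return [{'name': 'Acme Dental'}, {'name': 'Acme Dental'}, {'name': 'Beta Gym'}]

    def dedupe(leads: list[dict]) -> list[dict]:
        # Exact-name dedupe for the demo; the real pipeline uses fuzzy matching
        return list({lead['name']: lead for lead in leads}.values())

    def enrich(lead: dict) -> dict:
        return {**lead, 'lead_score': 9 if 'Acme' in lead['name'] else 4}

    return await run_pipeline(scrape, dedupe, enrich)

qualified = asyncio.run(_demo())  # [{'name': 'Acme Dental', 'lead_score': 9}]
```

Injecting each stage as a callable keeps the orchestration testable without a browser or an API key.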
The Google Maps Scraper
Google doesn't offer a bulk-export API for Maps business listings (the official Places API is paid per request and rate-limited), so I built a Playwright-based scraper:
import asyncio
import json
from playwright.async_api import async_playwright

async def scrape_google_maps(query: str, location: str, max_results: int = 100):
    """
    Scrapes business listings from Google Maps for a given query and location.
    Returns a list of business dictionaries.
    """
    businesses = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # Navigate to Google Maps search
        search_url = f"https://www.google.com/maps/search/{query}+{location}"
        await page.goto(search_url)
        await page.wait_for_selector('[role="feed"]', timeout=10000)

        # Scroll to load more results
        feed = page.locator('[role="feed"]')
        processed = 0  # listings already attempted, including failed extractions
        while len(businesses) < max_results:
            # Extract listings we haven't processed yet
            listings = await page.locator('[data-result-index]').all()
            for listing in listings[processed:]:
                business = await extract_listing_data(listing, page)
                processed += 1
                if business:
                    businesses.append(business)

            # Scroll down to load more
            await feed.evaluate('el => el.scrollTop += 800')
            await page.wait_for_timeout(1500)  # Rate limiting

            # Stop when Google reports no more results
            if await page.locator('text="You\'ve reached the end"').is_visible():
                break

        await browser.close()
    return businesses[:max_results]
async def extract_listing_data(listing, page):
    """Clicks into a listing and extracts detailed business data."""
    try:
        await listing.click()
        await page.wait_for_timeout(1000)

        # Extract all available data points. Note: Google's obfuscated class
        # names (DUwDvf, MW4etd, ...) change periodically and need re-checking.
        data = {
            'name': await safe_extract(page, 'h1.DUwDvf'),
            'rating': await safe_extract(page, '.MW4etd'),
            'reviews_count': await safe_extract(page, '.UY7F9'),
            'category': await safe_extract(page, '.DkEaL'),
            'address': await safe_extract(page, '[data-item-id="address"]'),
            'phone': await safe_extract(page, '[data-item-id*="phone"]'),
            'website': await safe_extract(page, '[data-item-id="authority"]'),
            'hours': await get_operating_hours(page),
            'google_maps_url': page.url,
        }
        return data if data['name'] else None
    except Exception as e:
        print(f"Failed to extract listing: {e}")
        return None

async def safe_extract(page, selector: str) -> str:
    """Safely extracts text from a CSS selector, returns empty string on failure."""
    try:
        element = page.locator(selector).first
        return await element.inner_text(timeout=2000)
    except Exception:
        return ''
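The `get_operating_hours` helper referenced above isn't shown in full; here is a sketch of how it could work. The `table tr` selector and the "Monday, 9 AM to 5 PM" row format are assumptions about the Maps DOM that may break when Google updates it, so the parsing lives in a pure function that's easy to adjust and test.

```python
import re

_DAY = r'(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)'

def parse_hours_rows(rows: list[str]) -> dict[str, str]:
    """Turn rows like 'Monday, 9 AM to 5 PM' into {'Monday': '9 AM to 5 PM'}."""
    hours = {}
    for row in rows:
        match = re.match(rf'^{_DAY}\s*,\s*(.+)$', row.strip())
        if match:
            hours[match.group(1)] = match.group(2).strip()
    return hours

async def get_operating_hours(page) -> dict[str, str]:
    """Best-effort pull of the hours table from the open listing (selector is a guess)."""
    try:
        rows = await page.locator('table tr').all_inner_texts()
        return parse_hours_rows(rows)
    except Exception:
        return {}
```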
AI Enrichment: From Data to Insights
Raw scraped data is just data. The AI layer converts it into actionable intelligence:
import json

from openai import OpenAI

client = OpenAI()

def enrich_lead_with_ai(business: dict) -> dict:
    """
    Uses GPT-4o-mini to analyze a business listing and add qualification signals.
    """
    prompt = f"""
    Analyze this business listing and provide qualification insights:

    Business: {business['name']}
    Category: {business['category']}
    Rating: {business['rating']} ({business['reviews_count']} reviews)
    Website: {business.get('website', 'None')}
    Address: {business['address']}

    Provide a JSON response with:
    1. "lead_score": 1-10 based on business health signals
    2. "size_estimate": "solo" | "small" | "medium" | "large"
    3. "tech_adoption": "low" | "medium" | "high" (based on website quality, presence)
    4. "pain_points": list of likely business pain points for this category
    5. "outreach_angle": best approach for cold outreach
    6. "email_guess": educated email format guess based on website/name
    7. "qualification_notes": brief notes on lead quality

    Return only valid JSON.
    """
    response = client.chat.completions.create(
        model='gpt-4o-mini',  # Use mini for cost efficiency at scale
        messages=[{'role': 'user', 'content': prompt}],
        response_format={'type': 'json_object'},
    )
    insights = json.loads(response.choices[0].message.content)
    return {**business, **insights}
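Even with `json_object` mode, the model can return out-of-range scores or the wrong types for a field, so it pays to normalize the output before it reaches the score filter. This validator is an illustrative addition (not part of the original pipeline), with the allowed values taken from the prompt above:

```python
ALLOWED_SIZES = {'solo', 'small', 'medium', 'large'}
ALLOWED_TECH = {'low', 'medium', 'high'}

def validate_insights(insights: dict) -> dict:
    """Clamp and normalize enrichment fields so downstream filters never see junk."""
    out = dict(insights)
    try:
        score = int(out.get('lead_score', 1))
    except (TypeError, ValueError):
        score = 1
    out['lead_score'] = max(1, min(10, score))       # force into the 1-10 range
    if out.get('size_estimate') not in ALLOWED_SIZES:
        out['size_estimate'] = 'small'
    if out.get('tech_adoption') not in ALLOWED_TECH:
        out['tech_adoption'] = 'low'
    if not isinstance(out.get('pain_points'), list):  # model sometimes returns a string
        out['pain_points'] = []
    return out
```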
Deduplication and Data Cleaning
import pandas as pd
from fuzzywuzzy import fuzz
def deduplicate_leads(leads: list[dict]) -> list[dict]:
"""Remove duplicate businesses using fuzzy name matching."""
df = pd.DataFrame(leads)
seen_names = []
unique_leads = []
for lead in leads:
name = lead['name'].lower().strip()
# Check fuzzy similarity against seen names
is_duplicate = any(
fuzz.ratio(name, seen) > 90 # >90% similar = duplicate
for seen in seen_names
)
if not is_duplicate:
seen_names.append(name)
unique_leads.append(lead)
return unique_leads
def clean_phone(phone: str) -> str:
"""Standardize phone number format."""
import re
digits = re.sub(r'\D', '', phone)
if len(digits) == 10:
return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
return phone
def extract_email_from_website(website_url: str) -> str | None:
"""Attempt to find a contact email on the business website."""
import httpx, re
try:
response = httpx.get(website_url, timeout=5, follow_redirects=True)
emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', response.text)
# Filter out common non-contact emails
filtered = [e for e in emails
if not any(skip in e.lower() for skip in ['example', 'noreply', 'sentry'])]
return filtered[0] if filtered else None
except:
return None
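The pipeline diagram lists CSV among the export targets; a minimal sketch of that step might look like the following. The column list is my own assumption about which fields matter for outreach, and any extra keys on a lead are simply dropped:

```python
import csv

EXPORT_FIELDS = ['name', 'category', 'phone', 'website', 'address',
                 'lead_score', 'outreach_angle']

def export_to_csv(leads: list[dict], path: str) -> int:
    """Write enriched leads to CSV, keeping only EXPORT_FIELDS; returns row count."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=EXPORT_FIELDS)
        writer.writeheader()
        for lead in leads:
            # Missing fields become empty cells; unknown keys are ignored
            writer.writerow({k: lead.get(k, '') for k in EXPORT_FIELDS})
    return len(leads)
```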
AutomateLanka Integration: Building This as a Workflow
This pipeline is now integrated into AutomateLanka as a workflow node:
[Trigger: Schedule Daily at 9am]
│
▼
[Node: Google Maps Scraper]
Config: query, location, limit
│
▼
[Node: AI Enrichment]
Config: enrichment fields, model
│
▼
[Node: Filter]
Condition: lead_score >= 7
│
▼
[Node: Export to Notion Database]
OR
[Node: Push to CRM via Webhook]
// AutomateLanka workflow node definition
const leadGenerationWorkflow = {
id: 'lead-gen-001',
name: 'Daily Lead Extraction',
nodes: [
{
id: 'scraper',
type: 'google-maps-scraper',
config: {
query: '{{env.LEAD_QUERY}}',
location: '{{env.TARGET_LOCATION}}',
limit: 50,
}
},
{
id: 'enricher',
type: 'ai-enrichment',
input: '{{nodes.scraper.output}}',
config: { model: 'gpt-4o-mini', fields: ['lead_score', 'pain_points', 'outreach_angle'] }
},
{
id: 'filter',
type: 'data-filter',
input: '{{nodes.enricher.output}}',
condition: 'item.lead_score >= 7',
},
{
id: 'export',
type: 'notion-export',
input: '{{nodes.filter.output}}',
config: { databaseId: '{{env.NOTION_DB_ID}}' }
}
]
};
Results and Impact
Running this pipeline for a client in the SaaS sales space:
| Metric | Manual | Automated |
|--------|--------|-----------|
| Leads per hour | ~15 | ~200 |
| Qualification time | Manual review | Instant (AI) |
| Data quality | Variable | Consistent |
| Cost per lead | ~$8 | ~$0.12 |
The roughly 13× improvement in lead throughput and 98.5% cost reduction demonstrate the power of targeted automation.
Ethical and Legal Considerations
Always ensure scraping complies with:
- Google Maps ToS — keep scraping to non-commercial research, and use the official Places API for commercial work
- GDPR/CCPA — business data for B2B outreach has different rules than consumer data
- Rate limiting — don't hammer servers; respect robots.txt
- Opt-out mechanisms — provide easy ways for businesses to remove their data
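The rate-limiting point deserves code, since it's the easiest one to get wrong. These helpers are illustrative (not part of the pipeline above): a jittered sleep so requests never land on a fixed beat, and an exponential-backoff retry wrapper for transient failures.

```python
import asyncio
import random

async def polite_wait(base: float = 1.5, jitter: float = 0.75) -> float:
    """Sleep for base +/- jitter seconds; returns the actual delay used."""
    delay = max(0.0, base + random.uniform(-jitter, jitter))
    await asyncio.sleep(delay)
    return delay

async def with_backoff(fn, retries: int = 3, base: float = 1.0):
    """Retry an async callable with exponential backoff plus a little jitter."""
    for attempt in range(retries):
        try:
            return await fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            await asyncio.sleep(base * (2 ** attempt) + random.uniform(0, 0.25))
```

In the scraper, `with_backoff` would wrap the per-listing extraction, and `polite_wait` would replace the fixed `wait_for_timeout(1500)` between scrolls.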