Automating company search and outreach: architecture, risks, and lessons from a prototype
Automating contact discovery sounds simple only in a sales deck. In practice it very quickly stops being a single script and starts becoming a small system: configuration, a database, limits, send history, deduplication, error handling, and a whole bag of API keys.
An official data source instead of scraping map views
The first decision was simple: the company list should come from an official API, not from scraping a map view. That is an important boundary. The system does not pretend to be a browser, does not click through Google Maps, does not bypass protections, and does not mass-download data from an interface built for humans.
The source of candidates is the official Google Places API. From there you can get the company name, address, phone number, profile link, and, if available, the website. Only then does that website become the place for further analysis.
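A minimal sketch of how one place result could be turned into a candidate record. The field names (`displayName`, `formattedAddress`, `nationalPhoneNumber`, `googleMapsUri`, `websiteUri`) are assumed to follow the Places API (New) response format. The `Candidate` class and the `to_candidate` helper are hypothetical names for illustration, not the prototype's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    address: Optional[str]
    phone: Optional[str]
    profile_url: Optional[str]
    website: Optional[str]  # None when the listing has no website


def to_candidate(place: dict) -> Candidate:
    """Map one place result (a plain dict) to a candidate record.

    Assumption: keys follow the Places API (New) field names.
    """
    display = place.get("displayName") or {}
    return Candidate(
        name=display.get("text", "").strip(),
        address=place.get("formattedAddress"),
        phone=place.get("nationalPhoneNumber"),
        profile_url=place.get("googleMapsUri"),
        website=place.get("websiteUri") or None,
    )


# Illustrative input, not real data.
sample = {
    "displayName": {"text": "Example Bakery"},
    "formattedAddress": "1 Main St, Springfield",
    "googleMapsUri": "https://maps.google.com/?cid=123",
}
candidate = to_candidate(sample)
```

Keeping `website` as `None` rather than an empty string makes it easy to route companies without a site to their own status instead of to the crawler.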
That changes the entire character of the solution. We are not building an aggressive scraper. We are building a process that uses an official source to find companies and then checks publicly available company pages in search of general business contact data.
The second stage: the public company website
A company being listed somewhere does not mean you already have a good contact. In the prototype, an important step was opening the public company website and checking whether it contains an email address.
In practice this is not trivial. Company sites differ a lot. One has a clean contact@ address in the footer. Another only has a form. A third has several addresses: office, service, sales, recruitment. A fourth is broken, redirects, or responds too slowly. A fifth publishes a personal address, which is not always the right choice for a first business contact.
That is why the prototype does not store only the value "email found". The source matters too: the page URL from which the email was extracted. Without that, it is hard later to assess data quality, improve the selection rules, or explain where a contact came from.
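One way to keep the source next to the value is to return both from the extraction step. The sketch below assumes the page HTML has already been fetched. The regular expression is deliberately simple, and `FoundEmail` and `extract_emails` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Simple pattern for illustration; it will also match some non-addresses
# (for example image names such as logo@2x.png), so real use needs filtering.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


@dataclass(frozen=True)
class FoundEmail:
    email: str
    source_url: str  # the page the address was extracted from


def extract_emails(html: str, source_url: str) -> list[FoundEmail]:
    """Return unique, lowercased addresses found in the HTML, with their source."""
    seen = set()
    found = []
    for match in EMAIL_RE.findall(html):
        email = match.lower()
        if email not in seen:
            seen.add(email)
            found.append(FoundEmail(email, source_url))
    return found


page = '<footer><a href="mailto:Office@Example.com">office@example.com</a></footer>'
results = extract_emails(page, "https://example.com/contact")
```

Here the address appears twice on the page (in the `mailto:` link and in the link text) but is stored once, together with the URL it came from.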
Result from the sample: roughly two thirds of sites had an email
One of the more interesting observations was the effectiveness of a simple, careful approach. In the tested sample, nearly two thirds of companies with a website had an email address that could be stored: 50 out of 79 websites in the local export, or 63.29%.
That is not a guarantee for every industry or every dataset. The result depends on the type of companies, the region, site quality, and the way companies publish contact data.
Database, CSV, and sheet as different process layers
In the prototype, data does not end its life in script memory. It goes into a local SQLite database and a CSV file, and optionally can be exported to Google Sheets. Each of those layers has a different role.
- SQLite — history, statuses, and deduplication
- CSV — quick inspection, export, and fallback analysis
- Google Sheets — human interface: filtering, review, comments
That split is healthy. A sheet does not need to be the only source of truth. The database does not need to be the only place where a human works. CSV does not need to pretend to be a CRM. Each tool does its own part.
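The split between the database layer and the CSV layer can be sketched with the standard library alone. The table layout, column names, and status values below are assumptions for illustration. An in-memory database stands in for the local SQLite file.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path in a real run
conn.execute(
    """
    CREATE TABLE leads (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        website TEXT,
        email TEXT,
        source_url TEXT,
        status TEXT NOT NULL
    )
    """
)
# Illustrative rows, not real data.
conn.executemany(
    "INSERT INTO leads (name, website, email, source_url, status) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("Example Bakery", "https://example.com", "office@example.com",
         "https://example.com/contact", "email_found"),
        ("Sample Garage", None, None, None, "no_website"),
    ],
)


def export_csv(connection: sqlite3.Connection, out) -> int:
    """Write all leads to a CSV stream and return the number of data rows."""
    cursor = connection.execute(
        "SELECT name, website, email, source_url, status FROM leads ORDER BY id"
    )
    writer = csv.writer(out)
    writer.writerow([column[0] for column in cursor.description])
    rows = cursor.fetchall()
    writer.writerows(rows)
    return len(rows)


buffer = io.StringIO()
exported = export_csv(conn, buffer)
```

The database stays the place where history and statuses live. The CSV is a read-only snapshot derived from it, so it can be regenerated at any time.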
Deduplication: defense against chaos
In lead systems, duplicates appear faster than dust on a black monitor. The same company may be found through different queries, may have multiple name variants, several URLs, or the same email on several pages.
That is why deduplication cannot rely on a single field. A sensible approach compares several signals: name and address, website, profile link, email, and other stable identifiers.
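A minimal sketch of multi-signal deduplication: each record yields a set of normalized keys, and a record counts as a duplicate if it shares any key with one seen earlier. The record fields and the "any shared key" rule are assumptions. A real system may want stricter rules, for example when several companies share one email address.

```python
import re
from urllib.parse import urlparse


def _norm_text(value: str) -> str:
    """Case-fold and collapse punctuation and whitespace."""
    return re.sub(r"\W+", " ", value.casefold()).strip()


def _host(url: str) -> str:
    """Return the host without a leading 'www.'."""
    host = urlparse(url if "://" in url else "//" + url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def dedup_keys(record: dict) -> set:
    """Build the set of comparison keys available for one record."""
    keys = set()
    if record.get("name") and record.get("address"):
        combined = _norm_text(record["name"]) + "|" + _norm_text(record["address"])
        keys.add(("name_address", combined))
    if record.get("website"):
        keys.add(("website", _host(record["website"])))
    if record.get("profile_url"):
        keys.add(("profile", record["profile_url"].strip()))
    if record.get("email"):
        keys.add(("email", record["email"].strip().lower()))
    return keys


def deduplicate(records):
    """Split records into (unique, duplicates), keeping first occurrences."""
    seen = set()
    unique, duplicates = [], []
    for record in records:
        keys = dedup_keys(record)
        if keys & seen:
            duplicates.append(record)
        else:
            unique.append(record)
        seen |= keys
    return unique, duplicates


# Illustrative records, not real data.
records = [
    {"name": "Example Bakery", "address": "1 Main St.", "website": "https://www.example.com/"},
    {"name": "EXAMPLE  bakery", "address": "1 Main St", "website": None},
    {"name": "Example Bakery Ltd", "address": "Other", "website": "http://example.com/contact"},
    {"name": "Sample Garage", "address": "2 Side Rd", "email": "office@sample.test"},
]
unique, duplicates = deduplicate(records)
```

The second record is caught by name and address, the third by website host, even though neither matches the first record on every field.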
Outreach as a separate stage, not an automatic trigger
One of the most important architectural decisions was this: lead discovery and email sending are separate. The system does not send automatically at the moment it finds an address. The campaign is a separate, manually triggered process with preview mode by default.
That is a strong safety boundary. Automatically found data should first go through a filter: industry, region, status, address type, earlier sends, and whether the contact makes sense. Only then does it make sense to think about the message.
Sending should have daily limits, history, statuses, and a safeguard against repeated contact with already handled records. That separation is what distinguishes a work-support tool from a bot that does not know when to stop talking.
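The safeguards described above (preview by default, a daily limit, and a send history) can be sketched as a small campaign object. The `Campaign` class, the `approved` flag, and the injected `send` callable are hypothetical. Actual delivery through Gmail API or SMTP is left out.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Campaign:
    daily_limit: int = 20
    dry_run: bool = True  # preview mode unless explicitly switched off
    sent_log: dict = field(default_factory=dict)  # email -> date it was sent

    def sent_today(self, today: date) -> int:
        return sum(1 for sent_on in self.sent_log.values() if sent_on == today)

    def run(self, leads: list, today: date, send=None) -> list:
        """Return the addresses selected for this run; send only if not dry_run."""
        if not self.dry_run and send is None:
            raise ValueError("a send callable is required outside preview mode")
        remaining = self.daily_limit - self.sent_today(today)
        selected = []
        for lead in leads:
            if remaining <= 0:
                break
            email = lead.get("email")
            # Skip records without an address, without human approval,
            # already contacted, or already picked in this run.
            if (not email or not lead.get("approved")
                    or email in self.sent_log or email in selected):
                continue
            selected.append(email)
            remaining -= 1
            if not self.dry_run:
                send(email)
                self.sent_log[email] = today
        return selected
```

In preview mode `run` only reports who would be contacted and leaves the history untouched. Rerunning a real campaign on the same day skips everyone already in the log and stops at the daily limit.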
Seven layers of responsible automation
Lead automation is not one script. It is a layered process where every layer has its own role and its own responsibility boundary.
The biggest hidden cost: authentication and secrets
The most difficult part of such a project is not always the search algorithm. Often it sits in the configuration of external services.
You need API keys, Google Cloud permissions, access to Google Sheets, a service account, Gmail API or SMTP configuration, a sender identity, a reply address, limits, permission scopes, and configuration files. Each of those can fail separately. Each can also become a security risk.
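Because each of these settings can fail separately, it helps to check them all at startup and report every missing one at once. The sketch below reads from environment variables. The setting names are hypothetical, and a real project may need more of them or a different secret store.

```python
import os

# Hypothetical setting names for illustration.
REQUIRED_SETTINGS = (
    "PLACES_API_KEY",
    "SHEETS_CREDENTIALS_FILE",
    "SENDER_EMAIL",
    "REPLY_TO_EMAIL",
)


class ConfigError(RuntimeError):
    """Raised when required configuration is missing."""


def load_settings(env=None) -> dict:
    """Validate required settings and return them as a dict."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise ConfigError("Missing settings: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED_SETTINGS}
    settings["DAILY_SEND_LIMIT"] = int(env.get("DAILY_SEND_LIMIT", "20"))
    return settings


def redact(value: str) -> str:
    """Shorten a secret for logs so the full value is never printed."""
    return value[:4] + "..." if len(value) > 8 else "***"
```

Failing early with a complete list is cheaper than discovering a missing reply address halfway through a run, and `redact` keeps full key values out of log files.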
Statuses matter more than they look
In the prototype, every record can have a status. That looks like a detail, but it is not. The status tells you whether an email was found, whether the site does not exist, whether data could not be fetched, whether the company has no website, or whether the record was skipped as a duplicate.
Without statuses, the system turns into a pile of rows. With statuses, it becomes a process. You can measure effectiveness, find problems, retry only the failed cases, and avoid mixing companies without websites with companies whose site simply timed out.
The same applies to outreach. A sent status or campaign history is a safety brake. It lets the system know that a message already went out to that contact and should not be repeated just because someone reran the campaign.
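The statuses described above map naturally onto an enumeration. The names below are assumptions, `NO_EMAIL` is added for sites that were fetched but contained no address, and excluding duplicates from the denominator is one possible choice rather than the prototype's rule.

```python
from collections import Counter
from enum import Enum


class Status(str, Enum):
    EMAIL_FOUND = "email_found"
    NO_EMAIL = "no_email"            # assumption: site fetched, no address on it
    NO_WEBSITE = "no_website"
    SITE_NOT_FOUND = "site_not_found"
    FETCH_FAILED = "fetch_failed"
    DUPLICATE = "duplicate"


# Only transient failures are worth trying again.
RETRYABLE = {Status.FETCH_FAILED}


def summarize(records) -> Counter:
    """Count records per status."""
    return Counter(record["status"] for record in records)


def email_rate(records) -> float:
    """Share of companies with a website for which an email was found."""
    with_site = [
        r for r in records
        if r["status"] not in (Status.NO_WEBSITE, Status.DUPLICATE)
    ]
    if not with_site:
        return 0.0
    found = sum(1 for r in with_site if r["status"] is Status.EMAIL_FOUND)
    return found / len(with_site)


def to_retry(records) -> list:
    """Select only the records whose failure may be temporary."""
    return [r for r in records if r["status"] in RETRYABLE]


# Illustrative records, not real data.
records = [
    {"name": "A", "status": Status.EMAIL_FOUND},
    {"name": "B", "status": Status.EMAIL_FOUND},
    {"name": "C", "status": Status.FETCH_FAILED},
    {"name": "D", "status": Status.NO_WEBSITE},
]
rate = email_rate(records)
```

With distinct statuses, a company without a website never lowers the measured rate, and a timed-out site is retried instead of being written off.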
A careful crawler instead of a hungry spider
Website checking needs limits. Reasonable choices are timeouts, a cap on the number of pages visited, no login flows, respect for the rules a site publishes for automated clients (for example robots.txt), and a focus on the places where business contact details are actually published: footer, contact page, about page, sometimes service pages.
The goal is not to crawl the entire internet. The goal is to find a publicly published business contact at acceptable cost and with minimal load on other people’s sites.
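A minimal sketch of the planning step of such a crawler: given the links found on the home page, it keeps only same-site pages that look like contact or about pages, checks them against the site's robots.txt rules, and stops at a small page cap. Reading "site rules" as robots.txt is an assumption, as are the hint words, the user agent string, and the `plan_pages` helper. Fetching itself is left out.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

CONTACT_HINTS = ("contact", "about", "kontakt", "impressum")  # assumed hints
USER_AGENT = "lead-prototype-bot"  # hypothetical user agent name


def plan_pages(base_url, links, robots_txt, max_pages=3):
    """Pick a small, allowed set of same-site pages worth checking."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    base_host = urlparse(base_url).netloc
    planned = []
    for link in [base_url, *links]:
        url = urljoin(base_url, link)
        parsed = urlparse(url)
        # Stay on the same site and skip mailto:, tel:, and similar links.
        if parsed.scheme not in ("http", "https") or parsed.netloc != base_host:
            continue
        is_home = url == base_url
        if not is_home and not any(h in parsed.path.lower() for h in CONTACT_HINTS):
            continue
        if url in planned or not robots.can_fetch(USER_AGENT, url):
            continue
        planned.append(url)
        if len(planned) >= max_pages:
            break
    return planned


robots_rules = "User-agent: *\nDisallow: /about\n"
found_links = [
    "/products",
    "/contact",
    "/about-us",
    "https://other.example/contact",
    "mailto:office@example.com",
    "/contact",
]
pages = plan_pages("https://example.com/", found_links, robots_rules)
```

When the planned pages are fetched, a timeout should be set on every request (for example `urllib.request.urlopen(url, timeout=5)`), so that one slow site cannot stall the whole run.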
What such a system should not do
- Log into third-party sites, bypass protections, or pull data from panels that are not publicly available.
- Automatically send messages to everything that looks like an email address.
- Hide data sources. If you do not know where a contact came from, you do not have a controlled process.
- Treat configuration as an afterthought. Secrets, logs, limits, and history are part of the product, not technical wallpaper.
Who this makes sense for
This approach makes sense for small businesses and teams that want to structure market research, find potential companies in a specific industry, and prepare a controlled first contact.
Especially when the goal is not mass mailing, but a selective process: find, check, store, filter, review, and only then send.
It is not a good fit for someone who wants to press one button and send thousands of messages without control. That path usually ends quickly with data quality problems, domain reputation damage, and plain human fatigue on the receiving side.
Summary
The prototype showed that careful automation can produce a real effect. In the tested sample, about 63% of companies with a website had an email address that could be stored. That is a strong result, especially because the process relied on official APIs and public websites, not aggressive scraping.
At the same time, the same prototype showed where the real difficulty lives: data quality, deduplication, statuses, sending limits, Google Cloud configuration, Gmail API, service accounts, secret files, and the responsible split between discovery and outreach.
The conclusion is less dramatic than the promise of a “lead bot,” but much more useful: good automation is not about the system doing everything by itself. Good automation handles the repeatable part of the work, leaves traces, enforces limits, and leaves human decisions where human decisions still matter.
Want to build a controlled process instead of an unbraked bot?
If you want a responsible approach to company search and outreach — with official APIs, statuses, deduplication, and a separate campaign stage — we can design it together.