
Automating company search and outreach: architecture, risks, and lessons from a prototype

MorenaTech · Technical users and small business owners · Technical · about 10 min

Automating contact discovery sounds simple only in a sales deck. In practice it very quickly stops being a single script and starts becoming a small system: configuration, a database, limits, send history, deduplication, error handling, and a whole bag of API keys.

Before you read

This is a technical article. It is useful if you care about architecture, integrations, or the implementation layer of AI solutions. The guide-style versions are in the “For small business” section.
This article describes a prototype system that searches for companies, checks their public websites, stores contact data, and lets you prepare an email campaign. It is not a step-by-step build guide. The point is different: show the architecture, technical decisions, risk points, and the issues that appear only when an idea leaves the whiteboard and meets real data.

An official data source instead of scraping map views

The first decision was simple: the company list should come from an official API, not from scraping a map view. That is an important boundary. The system does not pretend to be a browser, does not click through Google Maps, does not bypass protections, and does not mass-download data from an interface built for humans.

The source of candidates is the official Google Places API. From there you can get the company name, address, phone number, profile link, and, if available, the website. Only then does that website become the place for further analysis.
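As a rough illustration of this step, the sketch below builds a Text Search request and flattens one result into a candidate record. It assumes the newer Places API (the article does not say which version the prototype used), and the helper names `build_search_request` and `candidate_from_place` are hypothetical. Nothing is sent over the network here.

```python
import json

# Assumption: the newer Places API Text Search endpoint. The prototype's
# actual API version is not stated in the article.
SEARCH_URL = "https://places.googleapis.com/v1/places:searchText"

# Request only the fields the article lists; the field mask limits the response.
FIELD_MASK = ",".join([
    "places.id",
    "places.displayName",
    "places.formattedAddress",
    "places.nationalPhoneNumber",
    "places.googleMapsUri",
    "places.websiteUri",
])


def build_search_request(query: str, api_key: str) -> dict:
    """Describe the HTTP request without sending it."""
    return {
        "method": "POST",
        "url": SEARCH_URL,
        "headers": {
            "Content-Type": "application/json",
            "X-Goog-Api-Key": api_key,
            "X-Goog-FieldMask": FIELD_MASK,
        },
        "body": json.dumps({"textQuery": query}),
    }


def candidate_from_place(place: dict) -> dict:
    """Flatten one place object into the record the later stages work with."""
    return {
        "place_id": place.get("id"),
        "name": (place.get("displayName") or {}).get("text"),
        "address": place.get("formattedAddress"),
        "phone": place.get("nationalPhoneNumber"),
        "profile_url": place.get("googleMapsUri"),
        "website": place.get("websiteUri"),  # missing for companies without a site
    }
```

Keeping the website optional at this point matters later: a company without a site is a different case from a company whose site failed to load.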

That changes the entire character of the solution. We are not building an aggressive scraper. We are building a process that uses an official source to find companies and then checks publicly available company pages in search of general business contact data.

The second stage: the public company website

A company being listed somewhere does not mean you already have a good contact. In the prototype, an important step was opening the public company website and checking whether it contains an email address.

In practice this is not trivial. Company sites differ a lot. One has a clean contact@ address in the footer. Another only has a form. A third has several addresses: office, service, sales, recruitment. A fourth is broken, redirects, or responds too slowly. A fifth publishes a personal address, which is not always the right choice for a first business contact.
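One way to handle the "several addresses" case is a simple preference order over the part before the @ sign. The sketch below is an assumption, not the prototype's rule set: both lists are hypothetical and would need tuning per language and industry. Anything it does not recognize, including personal addresses, is left for human review.

```python
from typing import List, Optional

# Hypothetical preference order for a first business contact.
PREFERRED_LOCAL_PARTS = ["contact", "office", "info", "hello", "sales"]
# Mailboxes with a clearly different purpose are skipped in this sketch.
SKIPPED_LOCAL_PARTS = {"recruitment", "jobs", "careers", "hr", "noreply", "no-reply"}


def pick_business_email(emails: List[str]) -> Optional[str]:
    """Return the most suitable general address, or None to leave it for review."""
    best = None
    best_rank = len(PREFERRED_LOCAL_PARTS)
    for email in emails:
        local = email.split("@", 1)[0].lower()
        if local in SKIPPED_LOCAL_PARTS:
            continue
        if local in PREFERRED_LOCAL_PARTS:
            rank = PREFERRED_LOCAL_PARTS.index(local)
            if rank < best_rank:
                best, best_rank = email, rank
    return best
```

Returning None instead of guessing keeps a person in the loop for the ambiguous cases.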

That is why the prototype does not store only the value "email found". The source matters too: the page URL from which the email was extracted. Without that, it is hard later to assess data quality, improve the selection rules, or explain where a contact came from.
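A minimal sketch of that idea, assuming a regex-based extractor (the article does not describe how the prototype parses pages). The function name, the record fields, and the asset filter are illustrative choices.

```python
import re
from datetime import datetime, timezone
from typing import Dict, List

# Deliberately simple pattern; it will miss obfuscated addresses.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)
# File names such as "logo@2x.png" match the pattern by accident.
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")


def extract_contacts(html: str, source_url: str) -> List[Dict[str, str]]:
    """Return one record per distinct email, each tagged with its source page."""
    found: Dict[str, Dict[str, str]] = {}
    for match in EMAIL_RE.findall(html):
        email = match.lower()
        if email.endswith(ASSET_SUFFIXES):
            continue
        if email not in found:
            found[email] = {
                "email": email,
                "source_url": source_url,
                "found_at": datetime.now(timezone.utc).isoformat(),
            }
    return list(found.values())
```

Because the source URL travels with the address from the first moment, later questions such as "where did this contact come from?" can be answered from the stored record alone.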

  • 63% of companies in the sample had an email: 50 out of 79 companies with a website had a contact address that could be stored.
  • 7 layers of a responsible process: from the data source and deduplication all the way to the sending audit.
  • 0 automatic sends without control: the campaign is always triggered manually, with preview mode by default.

Result from the sample: roughly two thirds of sites had an email

One of the more interesting observations was the effectiveness of a simple, careful approach. In the tested sample, about two thirds of companies with a website had an email address that could be stored. In the local export, that was 50 out of 79 websites, or 63.29%.

That is not a guarantee for every industry or every dataset. The result depends on the type of companies, the region, site quality, and the way companies publish contact data.

The biggest value is not simply "finding an email." The value is building a process that knows where the data came from, when it was stored, whether the company was already handled earlier, and whether the contact is suitable for further communication.

Database, CSV, and sheet as different process layers

In the prototype, data does not end its life in script memory. It goes into a local SQLite database and a CSV file, and optionally can be exported to Google Sheets. Each of those layers has a different role.

  • SQLite — history, statuses, and deduplication
  • CSV — quick inspection, export, and fallback analysis
  • Google Sheets — human interface: filtering, review, comments

That split is healthy. A sheet does not need to be the only source of truth. The database does not need to be the only place where a human works. CSV does not need to pretend to be a CRM. Each tool does its own part.
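The first two layers can be sketched with the standard library alone. The table layout and column names below are assumptions made for illustration; the article does not show the prototype's schema.

```python
import csv
import io
import sqlite3

# Hypothetical schema: one row per company, with source and status kept together.
SCHEMA = """
CREATE TABLE IF NOT EXISTS companies (
    id          INTEGER PRIMARY KEY,
    place_id    TEXT UNIQUE,
    name        TEXT NOT NULL,
    website     TEXT,
    email       TEXT,
    source_url  TEXT,
    status      TEXT NOT NULL,
    updated_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""
EXPORT_COLUMNS = "place_id, name, website, email, source_url, status"


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_company(conn: sqlite3.Connection, row: dict) -> bool:
    """Insert a company; return False if this place_id is already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO companies "
        "(place_id, name, website, email, source_url, status) "
        "VALUES (:place_id, :name, :website, :email, :source_url, :status)",
        row,
    )
    conn.commit()
    return cur.rowcount == 1


def export_csv(conn: sqlite3.Connection, out) -> None:
    """Write the inspection copy; the database stays the source of history."""
    cur = conn.execute(f"SELECT {EXPORT_COLUMNS} FROM companies ORDER BY id")
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)
```

Passing `io.StringIO()` or an open file as `out` gives the CSV layer; the same rows could be pushed to a sheet by a separate exporter without changing the database code.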

Deduplication: defense against chaos

In lead systems, duplicates appear faster than dust on a black monitor. The same company may be found through different queries, may have multiple name variants, several URLs, or the same email on several pages.

That is why deduplication cannot rely on a single field. A sensible approach compares several signals: name and address, website, profile link, email, and other stable identifiers.

Deduplication matters especially before sending. Without it, it is easy to build a tool that accidentally writes to the same company several times. Technically it is just a logic error. In reputation terms it is already a small catastrophe wearing a polite hat.
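A minimal sketch of multi-signal deduplication, assuming that a match on any single signal is enough to flag a record. That rule is an illustrative choice, not necessarily the prototype's: it is strict, and shared domains or shared mailboxes (for example branches of one company) may need softer handling. Field names are hypothetical.

```python
import re
from typing import List, Set, Tuple
from urllib.parse import urlsplit


def _norm(text) -> str:
    """Lowercase and collapse punctuation so name variants compare equal."""
    return re.sub(r"\W+", " ", (text or "").lower()).strip()


def _domain(url) -> str:
    """Host name without a leading 'www.', or '' when there is no website."""
    if not url:
        return ""
    host = urlsplit(url if "//" in url else "//" + url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def dedup_keys(record: dict) -> Set[tuple]:
    """Collect every signal that could identify the same company."""
    keys: Set[tuple] = set()
    if record.get("place_id"):
        keys.add(("place", record["place_id"]))
    if record.get("profile_url"):
        keys.add(("profile", record["profile_url"]))
    if _domain(record.get("website")):
        keys.add(("domain", _domain(record.get("website"))))
    if record.get("email"):
        keys.add(("email", record["email"].lower()))
    name, address = _norm(record.get("name")), _norm(record.get("address"))
    if name and address:
        keys.add(("name_address", name, address))
    return keys


def deduplicate(records: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split records into first occurrences and suspected duplicates."""
    seen: Set[tuple] = set()
    unique, duplicates = [], []
    for record in records:
        keys = dedup_keys(record)
        (duplicates if keys & seen else unique).append(record)
        seen |= keys
    return unique, duplicates
```

Returning the duplicates instead of silently dropping them lets the system record a "skipped as duplicate" status for each one.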

Outreach as a separate stage, not an automatic trigger

One of the most important architectural decisions was this: lead discovery and email sending are separate. The system does not send automatically at the moment it finds an address. The campaign is a separate, manually triggered process with preview mode by default.

That is a strong safety boundary. Automatically found data should first go through a filter: industry, region, status, address type, earlier sends, and whether the contact makes sense. Only then does it make sense to think about the message.

Sending should have daily limits, history, statuses, and a safeguard against repeated contact with already handled records. That separation is what distinguishes a work-support tool from a bot that does not know when to stop talking.
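The safeguards described here can be sketched as a planning step that is separate from the sending step. All names and the default limit of 20 are assumptions for illustration; `already_sent` is expected to hold lowercase addresses from the send history.

```python
from typing import Callable, Iterable, Set


def plan_campaign(contacts: Iterable[dict], already_sent: Set[str],
                  sent_today: int, daily_limit: int = 20,
                  dry_run: bool = True) -> dict:
    """Decide what would be sent; preview mode is the default."""
    remaining = max(daily_limit - sent_today, 0)
    queued: Set[str] = set()
    plan = {"dry_run": dry_run, "to_send": [], "skipped": [], "deferred": []}
    for contact in contacts:
        email = contact["email"].lower()
        if email in already_sent or email in queued:
            plan["skipped"].append(contact)    # contacted before, or repeated in the list
        elif len(plan["to_send"]) >= remaining:
            plan["deferred"].append(contact)   # over today's limit, wait for another run
        else:
            plan["to_send"].append(contact)
            queued.add(email)
    return plan


def run_campaign(plan: dict, send: Callable[[dict], None]) -> int:
    """Send only when the plan was explicitly created with dry_run=False."""
    if plan["dry_run"]:
        return 0
    for contact in plan["to_send"]:
        send(contact)
    return len(plan["to_send"])
```

Because the plan is plain data, it can be shown to a person for review before anyone calls `run_campaign` with a real `send` function.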

Process architecture

Seven layers of responsible automation

Lead automation is not one script. It is a layered process where every layer has its own role and its own responsibility boundary.

1. Official source of companies: Google Places API as the only source of candidates. No scraping of the map view.
2. Public company websites: careful inspection of publicly available company pages to find business contact details.
3. Data and source storage: every contact is stored with the page URL it came from. No anonymous data.
4. Deduplication: comparing multiple signals (name, address, website, email), not just one field.
5. Manually controlled campaign: sending is a separate, manually triggered process. Preview mode by default.
6. Secret management: API keys outside the code, outside the repository, with a clear rotation procedure.
7. Audit (statuses, logs, history): every record has a status. The system knows what it did and why. It can be explained later.

The biggest hidden cost: authentication and secrets

The most difficult part of such a project is not always the search algorithm. Often it sits in the configuration of external services.

You need API keys, Google Cloud permissions, access to Google Sheets, a service account, Gmail API or SMTP configuration, a sender identity, a reply address, limits, permission scopes, and configuration files. Each of those can fail separately. Each can also become a security risk.

Keys should not be hardcoded. Secret files should not end up in the repository. If a key is exposed, deleting it from the file is not enough. Treat it as burned and rotate it. This is one of the most commonly ignored costs of automation.
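A minimal sketch of keeping keys out of the code, assuming they are supplied through environment variables. The variable names are hypothetical; a secret manager or an ignored local file would follow the same pattern of failing early when something is missing.

```python
import os
from typing import Mapping, Optional

# Hypothetical names; the real list depends on which services are configured.
REQUIRED_SECRETS = ("PLACES_API_KEY", "SMTP_PASSWORD")


class MissingSecretError(RuntimeError):
    """Raised at startup so the run fails before any request is made."""


def load_secrets(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read required secrets from the environment, never from source files."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        raise MissingSecretError("Missing secrets: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SECRETS}


def redact(value: str) -> str:
    """Safe form for logs: never print a whole key."""
    return value[:4] + "..." if len(value) > 8 else "***"
```

The error message names the missing variables but never their values, and `redact` gives logs enough to tell keys apart without exposing them.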

Statuses matter more than they look

In the prototype, every record can have a status. That looks like a detail, but it is not. The status tells you whether an email was found, whether the site does not exist, whether data could not be fetched, whether the company has no website, or whether the record was skipped as a duplicate.

Without statuses, the system turns into a pile of rows. With statuses, it becomes a process. You can measure effectiveness, find problems, retry only the failed cases, and avoid mixing companies without websites with companies whose site simply timed out.

The same applies to outreach. A sent status or campaign history is a safety brake. It lets the system know that a message already went out to that contact and should not be repeated just because someone reran the campaign.
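The statuses named above can be sketched as a small enumeration plus two helpers. The identifiers are assumptions; only the meanings come from the article. In this sketch only fetch failures are treated as worth retrying.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, List


class Status(str, Enum):
    EMAIL_FOUND = "email_found"
    SITE_NOT_FOUND = "site_not_found"
    FETCH_FAILED = "fetch_failed"        # timeout or temporary error
    NO_WEBSITE = "no_website"
    DUPLICATE_SKIPPED = "duplicate_skipped"
    SENT = "sent"                        # outreach history, blocks repeated sends


# Assumption: only temporary failures are retried.
RETRYABLE = {Status.FETCH_FAILED}


def records_to_retry(records: Iterable[dict]) -> List[dict]:
    """Select only the records whose last attempt failed temporarily."""
    return [r for r in records if Status(r["status"]) in RETRYABLE]


def summarize(records: Iterable[dict]) -> Counter:
    """Count records per status, for measuring effectiveness."""
    return Counter(Status(r["status"]).value for r in records)
```

With this split, a company without a website is never retried, while a company whose site timed out gets another chance on the next run.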

A careful crawler instead of a hungry spider

Website checking needs limits. Reasonable choices are timeouts, limits on the number of pages visited, no login flows, respect for the rules a site publishes for crawlers (such as robots.txt), and focus on places where business contact details are actually published: footer, contact page, about page, sometimes service pages.

The goal is not to crawl the entire internet. The goal is to find a publicly published business contact at acceptable cost and with minimal load on other people’s sites.
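A minimal sketch of such limits, assuming the site's robots.txt text has already been fetched. The path list, the page limit, the timeout, and the user agent string are all hypothetical values chosen for illustration.

```python
from typing import List
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical settings; tune them per project.
CONTACT_PATHS = ["/", "/contact", "/kontakt", "/about"]
MAX_PAGES = 3            # hard cap on pages visited per site
TIMEOUT_SECONDS = 10     # to be passed to whatever HTTP client does the fetching
USER_AGENT = "example-contact-checker"


def pages_to_check(base_url: str, robots_txt: str) -> List[str]:
    """List the few pages worth visiting, skipping anything robots.txt disallows."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    pages: List[str] = []
    for path in CONTACT_PATHS:
        url = urljoin(base_url, path)
        if rules.can_fetch(USER_AGENT, url):
            pages.append(url)
        if len(pages) >= MAX_PAGES:
            break
    return pages
```

The function only plans the visit. Keeping planning separate from fetching makes the limits easy to test without touching anyone's server.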

Less aggression means fewer errors, fewer blocks, and fewer legal or reputational problems. The system may work slower, but it works more steadily. And steadiness in automation is a harder currency than a flashy chart.

What such a system should not do

  • Log into third-party sites, bypass protections, or pull data from panels that are not publicly available.
  • Automatically send messages to everything that looks like an email address.
  • Hide data sources. If you do not know where a contact came from, you do not have a controlled process.
  • Treat configuration as an afterthought. Secrets, logs, limits, and history are part of the product, not technical wallpaper.

Who this makes sense for

This approach makes sense for small businesses and teams that want to structure market research, find potential companies in a specific industry, and prepare a controlled first contact.

It fits especially well when the goal is not mass mailing but a selective process: find, check, store, filter, review, and only then send.

It is not a good fit for someone who wants to press one button and send thousands of messages without control. That path usually ends quickly with data quality problems, domain reputation damage, and plain human fatigue on the receiving side.

Summary

The prototype showed that careful automation can produce a real effect. In the tested sample, about 63% of companies with a website had an email address that could be stored. That is a strong result, especially because the process relied on official APIs and public websites, not aggressive scraping.

At the same time, the same prototype showed where the real difficulty lives: data quality, deduplication, statuses, sending limits, Google Cloud configuration, Gmail API, service accounts, secret files, and the responsible split between discovery and outreach.

The conclusion is less dramatic than the promise of a “lead bot,” but much more useful: good automation is not about the system doing everything by itself. Good automation handles the repeatable part of the work, leaves traces, enforces limits, and leaves human decisions where human decisions still matter.

Want to build a controlled process instead of an unbraked bot?

If you want a responsible approach to company search and outreach — with official APIs, statuses, deduplication, and a separate campaign stage — we can design it together.
