Getting Accurate Results at Scale with Bulk Geocoding¶
Sometimes you need to geocode thousands or even millions of addresses. You may have used the search API already to make a few queries, or maybe even a few hundred. But you need to go fast, and making a bunch of individual requests won't cut it.
The Stadia Maps bulk geocoding API helps you execute these large jobs quickly, packing up to 5,000 queries in a single request! This guide will cover what you need to know to quickly and effectively do geocoding en masse.
Preparing to process your data¶
The first step in any large operation is planning, and bulk geocoding is no different. The main preparation step is to figure out what format your data is currently in and how you want to submit it to the API. Before sending your first batch geocoding query, you need to answer two important questions.
Question 1: Should you use structured or unstructured search?¶
The bulk geocoding API supports queries in two forms: structured and unstructured. The main difference is that with the structured API, you're in control of how the address components are interpreted. We usually recommend using the structured API if possible, but there are a few factors to consider before deciding what's best for your use case.
If your data looks like 123 Main Street, Some Town, USA, then your data is currently in unstructured form. However, it's important to understand how your data got into this format. If your data came from a form with three fields, which were then mechanically joined by commas, your data is effectively already structured. (But be careful; maybe the input contained commas in some fields!)
If your data comes from free-form user input, you should probably just assume the data is unstructured. Our APIs are pretty good at dealing with unstructured text input, so it's usually better to send it to us as-is unless you know for certain that you can reliably get it back into a structured form.
If you have (or can transform your data to) structured components, you then need to figure out how they map to our API. Check the API documentation for a list of fields that our geocoder understands.
Tip
The component that most people call a "city" (or town or village) in an address is a locality in our taxonomy.
Note that which fields to include is largely up to you! If you're just looking up postal codes, for example, you don't need to send any other fields. For address geocoding, you will need to decide which information to include. We recommend starting with address, locality, region, and country where possible.
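As a sketch of this mapping step, the helper below turns one CSV-style record into a dict of structured components, dropping blank values rather than sending empty strings. The source column names (street_address, city, state) are hypothetical; the output keys are the structured search fields described above.

```python
def to_structured_query(row):
    """Map a source record onto structured search components, dropping blanks."""
    components = {
        "address": row.get("street_address"),
        "locality": row.get("city"),   # a "city"/town/village is a locality
        "region": row.get("state"),    # e.g. a US state
        "country": row.get("country"),
    }
    # Omit empty fields entirely rather than sending blank strings.
    return {k: v for k, v in components.items() if v}

query = to_structured_query({
    "street_address": "123 Main Street",
    "city": "Some Town",
    "state": "",
    "country": "USA",
})
# The blank "state" column is dropped from the query
```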
Question 2: How can you narrow the search?¶
If you know you're looking for results in a specific area, you should also consider using the focus or boundary parameters. These let you bias or filter the results to a geographic area. You can set the area by country code, GID, or geometry (bounding box or radius from a point).
You can also specify which layers you're interested in for each query. If you're specifically looking for addresses or postal codes, for example, you can say so upfront!
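To make the narrowing options concrete, here is a sketch of the relevant query parameters at the REST level. The parameter names follow the search API's query-string conventions (focus.point.*, boundary.*, layers); the coordinates and country are made-up example values, and your SDK may expose these under different field names, so check the API reference.

```python
# Hedged sketch: biasing and filtering a search to a geographic area.
params = {
    "text": "Main Street",
    # Bias results toward a point (far-away results can still appear)...
    "focus.point.lat": 59.44,
    "focus.point.lon": 24.75,
    # ...or hard-filter by country and a circle around a point.
    "boundary.country": "EE",
    "boundary.circle.lat": 59.44,
    "boundary.circle.lon": 24.75,
    "boundary.circle.radius": 25,  # kilometers
    # Only return these result types.
    "layers": "address,street",
}
```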
Tip
Sometimes we can't find an address. When this happens, we'll try a few fallback requests internally to try to give you something. For example, with addresses, we try to interpolate missing house numbers along the street. If we still can't find an address, we'll fall back to things like the street or locality. Specifying which layers you want will determine what types of results the API will return.
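One way to act on this is to triage each result by its match_type property. The helper below is our own sketch, not part of the SDK; the value strings ("exact", "interpolated", "fallback") mirror the documented match types.

```python
def triage(properties):
    """Classify a result feature's properties dict by match quality."""
    match_type = properties.get("match_type")
    if match_type == "exact":
        return "use as-is"
    if match_type == "interpolated":
        return "house number estimated along the street"
    # A fallback means we couldn't find the address itself,
    # so the layer tells you what we found instead.
    return f"fallback to {properties.get('layer')} -- review manually"
```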
Always do a test run¶
When performing geocoding at large scale, it's important to be able to evaluate the quality of the results. We have a whole page on determining result quality, but there are a few rules of thumb which are specific to batch geocoding.
First, test on a small sample of your data! You can burn through a lot of API credits in a few seconds. The bulk API returns a status code for each request. It does some basic sanity checks up front, ensuring that your query is well-formed JSON and that it contains no more than the maximum number of requests. But it will not spot requests that are missing a required parameter or are otherwise incorrect. Since all of these have to be sent to the backend, you will be charged credits for these failed requests! So your first test should verify that there are no failure status codes.
How many credits will I use?
This API internally makes requests to other geocoding endpoints on your behalf. You will be charged for each request in the batch. For example, if your bulk request includes 200 forward geocodes and 300 structured geocodes, you will be charged for 200 forward geocodes and 300 structured geocodes according to the current credit schedule.
This endpoint is available on our Standard, Professional, and Enterprise plans.
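The credit math above is straightforward to estimate up front. The per-request costs below are placeholders, not real prices; substitute the values from the current credit schedule.

```python
# Hypothetical per-request credit costs -- check the credit schedule for real values.
CREDITS_PER_REQUEST = {"forward": 2, "structured": 2}

# The example batch from above: 200 forward + 300 structured geocodes.
batch_counts = {"forward": 200, "structured": 300}

total_credits = sum(
    CREDITS_PER_REQUEST[endpoint] * count
    for endpoint, count in batch_counts.items()
)
# With these placeholder costs: 200 * 2 + 300 * 2 = 1000 credits
```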
After spot checking a few results on a smaller sample to make sure your code is working, make a slightly larger request and be sure to check the match_type of the results. If you're getting a large proportion of fallback results, double-check your data and/or code.
Finally, if you have any inkling about the structure of your dataset (ex: there is a US state name or country in your data), it might be worth trying to make your sample diverse across these areas to identify any edge cases. If you encounter any issues, send us a message at support@stadiamaps.com. We're here to help!
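If your data does have such structure, stratified sampling is one way to build that diverse test batch. This is a generic sketch over plain dicts; the "region" column is a stand-in for whatever grouping your data actually has.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, per_group=5, seed=42):
    """Sample up to per_group rows from each distinct value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.get(key, "")].append(row)
    rng = random.Random(seed)  # seeded for a reproducible test batch
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# 120 fake rows spread evenly across three hypothetical regions
rows = [{"region": r, "id": i} for i, r in enumerate(["CA", "NY", "TX"] * 40)]
test_batch = stratified_sample(rows, "region", per_group=5)
# 3 regions x 5 records each = 15 rows covering every region
```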
Python sample code¶
Every bulk geocoding operation will probably look a bit different, but they all have some common elements. We've put together the following Python sample to illustrate the common patterns, and serve as a base for your batch geocoding scripts. Our example will get its records from a CSV file, geocode some or all of the records, and save the results to a new CSV file.
We'll use Python, but the general approach and principles are the same in any language. Check out our bulk geocoding endpoint documentation for code samples in other languages.
Installation Instructions
The Stadia Maps Python SDK is available through any package manager that supports PyPI.
pip install stadiamaps
poetry add stadiamaps
import csv
import os
import random
import sys

import stadiamaps
from stadiamaps.rest import ApiException

# The max number of records per batch (our API will reject larger requests)
MAX_BATCH_SIZE = 5_000

# You can also use our EU endpoint to keep traffic within the EU like so:
# configuration = stadiamaps.Configuration(host="https://api-eu.stadiamaps.com")
configuration = stadiamaps.Configuration()

# Get your file paths; here we'll load them as the first two arguments to the script,
# but you can hard-code them too for a one-off job
in_csv_path = sys.argv[1]
# We'll write the output as a new CSV file which you can load using familiar tools like Excel
out_csv_path = sys.argv[2]

# Configure API key authentication (ex: via environment variable or CLI argument). (1)
configuration.api_key['ApiKeyAuth'] = os.environ.get("API_KEY", sys.argv[3])

# The final CLI argument: if `--full` is present, process all records.
# Otherwise, only process 10%, or 100 records (whichever is smaller).
process_all_records = '--full' in sys.argv

# Load the data from the CSV file.
queries = []
row_data = []
with open(in_csv_path, newline='') as infile:
    reader = csv.DictReader(infile)  # NOTE: The sample code assumes your first row is a header.
    for row in reader:
        # Construct a bulk request query for every entry in the CSV file.
        # The sample code assumes that you have the columns address, locality, region, and country.
        queries.append(stadiamaps.BulkRequest(
            endpoint="/v1/search/structured",
            query=stadiamaps.BulkRequestQuery(stadiamaps.SearchStructuredQuery(
                address=row["address"],
                # "or None" lets us ignore blank values
                locality=row["locality"] or None,
                region=row["region"] or None,
                country=row["country"] or None,
                # Filter layers; we'll assume that we really want addresses,
                # but would accept a street fallback.
                layers=[stadiamaps.GeocodingLayer.ADDRESS, stadiamaps.GeocodingLayer.STREET]))))

        # Save a copy of the row data
        row_data.append(row)

# Select a random subset unless running in full mode
if not process_all_records:
    # NOTE: Integer division, since random.sample requires an int
    num_to_keep = max(1, min(100, len(queries) // 10))
    indices_to_keep = random.sample(range(len(queries)), num_to_keep)
    queries = [queries[i] for i in indices_to_keep]
    row_data = [row_data[i] for i in indices_to_keep]

with stadiamaps.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = stadiamaps.GeocodingApi(api_client)

    with open(out_csv_path, 'w', newline='') as outfile:
        # We'll write the original data PLUS the most important fields
        # of the geocoding results to the CSV file
        fieldnames = list(row_data[0].keys()) + ['status', 'match_type', 'layer', 'lat', 'lon']
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()

        for i in range(0, len(queries), MAX_BATCH_SIZE):
            # Operate on one batch at a time, up to MAX_BATCH_SIZE
            batch = queries[i:i + MAX_BATCH_SIZE]

            # Make the request!
            api_res = api_instance.search_bulk(batch)

            for orig_row, bulk_res in zip(row_data[i:i + MAX_BATCH_SIZE], api_res):
                if bulk_res.status != 200:
                    print(f"Request for row {orig_row} failed with status {bulk_res.status}")
                    writer.writerow({'status': bulk_res.status, **orig_row})
                    continue
                elif len(bulk_res.response.features) == 0:
                    print(f"No result for row {orig_row}")
                    writer.writerow({'status': bulk_res.status, **orig_row})
                    continue

                # In our example, we'll just use the first result, as this is what
                # the geocoder thinks is the "best" one.
                # There are legitimate cases where multiple addresses may be returned
                # (ex: records from multiple data sources with slightly differing names).
                # You can decide what to do with these.
                feature = bulk_res.response.features[0]

                writer.writerow({
                    'status': bulk_res.status,
                    'match_type': feature.properties.match_type,
                    'layer': feature.properties.layer,
                    'lat': feature.geometry.coordinates[1],
                    'lon': feature.geometry.coordinates[0],
                    **orig_row  # Neat Python trick that lets us merge dictionaries
                })

print(f"Done! Your geocoding results are in {out_csv_path}")
- Learn how to get an API key in our authentication guide.
Can I Store Geocoding API Results?
Unlike most vendors, we won't charge you 10x the standard fee per request to store geocoding results long-term! However, we do require an active Standard, Professional, or Enterprise subscription to permanently store results (e.g. in a database). Temporary storage in the normal course of your work is allowed on all plans. See our terms of service for the full legal terms.
Running the sample code¶
Make sure you've installed the latest version of the stadiamaps Python package using pip, poetry, uv, or your favorite other package manager. Then, you can run the Python script as follows:
python3 geocode-bulk-test.py in.csv out.csv YOUR-STADIA-API-KEY
You'll need a Stadia Maps API key to run the script (see below). By default, it will only look up 10% of your input data set, or 100 records, whichever is smaller. This should keep you from accidentally running out of credits in a single run. After you've verified that it works, you can add --full to your command line to process the whole file.
Next steps¶
Sign up for a free account to start your 14-day trial (no credit card required!) to get started geocoding your own data. During your trial, you'll have access to all of our APIs, plus enough API credits to test out your bulk geocoding.
Get Started With a Free Account
Once you're ready to give the API some real work, check out our guide to activate a paid subscription with bulk geocoding access.