Overview
EuroDealers is a global bicycle dealer directory that aggregates, deduplicates, and quality-scores retail locations from 10 distinct data sources into a single unified record set.
Data is collected by scraping brand locator APIs, open community databases, and industry directories. Raw records flow through a multi-stage pipeline that normalises contact details, blocks candidates for comparison, applies fuzzy name matching, and merges field values by source authority — producing a single Dealer record per physical location.
Data Sources
Ten sources are registered in the system, falling into four broad categories: OpenStreetMap, Brand Locators, Industry Directories, and Enrichment Only.
Listed roughly in descending merge priority (the higher-priority source wins conflicts). The exact order is given under Source Priority in the Glossary.
1. OSM Overpass
OpenStreetMap Global Highest Authority
Queries the Overpass API
for nodes and ways tagged shop=bicycle in a given bounding box. Returns
name, address, phone, website, opening hours, brand tags, and lat/lon from
OpenStreetMap community data.
Note: robots.txt disallows /api/ — scraper uses
check_robots=False per OSM's actual usage policy (Overpass is a
public read API). Rate-limited to 1 req/s. No API key required.
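As an illustration of the kind of query described above, the sketch below builds an Overpass QL string for shop=bicycle nodes and ways in a bounding box. The function name and the example coordinates (a rough box around Luxembourg) are assumptions for illustration, not the project's actual scraper code.

```python
def build_overpass_query(south: float, west: float, north: float, east: float,
                         timeout_s: int = 25) -> str:
    """Return an Overpass QL query for bicycle shops inside a bounding box."""
    # Overpass expects bounding boxes in the order south, west, north, east.
    bbox = f"{south},{west},{north},{east}"
    return (
        f"[out:json][timeout:{timeout_s}];"
        "("
        f'node["shop"="bicycle"]({bbox});'
        f'way["shop"="bicycle"]({bbox});'
        ");"
        # "out center" adds a representative lat/lon for ways as well as nodes.
        "out center;"
    )


# Illustrative bounding box only; real runs would take it from the region config.
query = build_overpass_query(49.44, 5.73, 50.19, 6.53)
```

The resulting string would be sent as the `data` parameter of a request to an Overpass endpoint, throttled to the 1 req/s limit noted above.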
2. Nominatim
Enrichment Only Global
OSM's geocoding service. Used exclusively to enrich existing Dealer records with coordinates when latitude/longitude are missing. Does not add new dealers.
Limit: 1 req/s, no bulk use. Requires a descriptive
User-Agent header per OSM policy.
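The two constraints above (1 req/s and a descriptive User-Agent) can be sketched as follows. This is a minimal illustration, not the project's geocoder: the `Throttle` class, the `build_geocode_request` helper, and the User-Agent string are all hypothetical. The request is only constructed here, not sent.

```python
import time
import urllib.parse
import urllib.request

NOMINATIM_SEARCH_URL = "https://nominatim.openstreetmap.org/search"
# Hypothetical value: use a string that identifies your own deployment.
USER_AGENT = "EuroDealers/1.0 (+https://example.com/contact)"


class Throttle:
    """Ensure consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous call was released

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay


def build_geocode_request(address: str) -> urllib.request.Request:
    """Build (but do not send) a Nominatim search request for one address."""
    query = urllib.parse.urlencode({"q": address, "format": "jsonv2", "limit": 1})
    return urllib.request.Request(
        f"{NOMINATIM_SEARCH_URL}?{query}", headers={"User-Agent": USER_AGENT}
    )
```

Calling `throttle.wait()` before each `urllib.request.urlopen(...)` keeps a single worker under the limit; multiple workers would need a shared limiter.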
3. NBDA Dealer Finder
Industry Directory US Only IBD Authoritative
National Bicycle Dealers Association member directory. IBD membership is
verified, making this the most authoritative source for ibd_status="ibd"
classification (confidence 0.95). Scraped via WP REST API with HTML fallback.
Coverage: US independent bike dealers only. Approximately 3,500 listings.
4. Yelp Fusion API
Industry Directory US + Some EU
Business listings via Yelp's Fusion API, category bicycleshops.
Returns name, address, phone, coordinates, rating, and review count. Useful
for US coverage gaps; EU coverage is sparse.
Requires: Yelp API key (YELP_API_KEY env var).
Free tier: 500 calls/day, 50 results/call.
5. Trek Dealer Locator
Brand Locator Global
Trek's official dealer finder, a reverse-engineered JSON API. Uses a grid-sampling strategy: divides the target bounding box into a lat/lon grid and fires one request per cell to overcome the per-request result cap. Returns Trek authorised dealers only.
IBD confidence: 0.70 (brand dealer assumption). No API key needed.
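The grid-sampling idea can be sketched as below: split a bounding box into rows × cols cells and issue one locator request per cell. The function name and the (south, west, north, east) tuple layout are assumptions; the actual grid size used by the scraper is not stated here.

```python
from typing import Iterator, Tuple

BBox = Tuple[float, float, float, float]  # (south, west, north, east)


def grid_cells(bbox: BBox, rows: int, cols: int) -> Iterator[BBox]:
    """Yield rows * cols sub-boxes that tile `bbox` without gaps."""
    if rows < 1 or cols < 1:
        raise ValueError("rows and cols must be positive")
    south, west, north, east = bbox
    lat_step = (north - south) / rows
    lon_step = (east - west) / cols
    for r in range(rows):
        for c in range(cols):
            cell_south = south + r * lat_step
            cell_west = west + c * lon_step
            # Snap the last row/column to the outer edge to avoid float drift.
            cell_north = north if r == rows - 1 else south + (r + 1) * lat_step
            cell_east = east if c == cols - 1 else west + (c + 1) * lon_step
            yield (cell_south, cell_west, cell_north, cell_east)


cells = list(grid_cells((0.0, 0.0, 2.0, 4.0), rows=2, cols=2))
```

A finer grid lowers the chance that any single cell hits the result cap, at the cost of more requests. This sketch does not handle boxes that cross the antimeridian.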
6. ENVE Composites (Locally.com)
Brand Locator Global
ENVE's dealer finder is powered by Locally.com. The underlying API endpoint was discovered via Playwright network interception. Returns 1,566 ENVE dealers worldwide (US=1,029, GB=77, FR=76, CA=67, AU=43, DE=31, JP=26 …).
Endpoint: enve-composites.locally.com/stores/conversion_data
Required params: has_data=true, dealers_company_id=185731, bounding box, sort_by=proximity&zoom_level=5&inline=1
Critical: omitting has_data=true returns Australian retailer data for a different company.
7. Specialized Dealer Locator
Brand Locator Global
Specialized's dealer finder, scraped via XHR interception. Falls back to Playwright-driven browser automation when the XHR endpoint is not accessible. Returns Specialized authorised retailers.
IBD confidence: 0.70 (brand dealer assumption).
8. Giant Dealer Locator
Brand Locator US + EU Regional
Giant operates separate regional domains for the US and major EU markets. The scraper iterates over configured regional endpoints to pull Giant authorised dealers.
IBD confidence: 0.70 (brand dealer assumption).
9. BikeExchange
Industry Directory US Only
Paid marketplace for bicycle retailers. Listings imply an actively trading independent dealer; IBD confidence 0.70. Scraped via HTML parsing.
Note: paid listings may lag — dealers can remain listed after closure.
10. OpenCorporates
Enrichment Only Global Lowest Authority
Company registry aggregator. Used solely to enrich existing dealers with
year_established. Does not add new dealers or affect name/address.
Requires: an OpenCorporates API key; without one, the scraper falls back to throttled anonymous requests.
The Pipeline
The pipeline transforms raw scraped records
(RawDealerRecord with status="unprocessed") into
unified, deduplicated Dealer rows. Run it with
manage.py run_pipeline or via the Celery pipeline queue.
1. Normalize
Standardises all text fields: strips HTML, collapses whitespace, converts phone numbers to E.164 format (+15551234567), extracts the root domain from website URLs, lower-cases and accent-strips the name into normalized_name, and normalises postal codes.
2. Block
Groups candidate pairs for comparison using four blocking keys, avoiding an O(n²) full comparison. Keys: zip+name-prefix, phone (E.164), website domain, and name 4-gram prefix. Two records must share at least one key to be compared.
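The blocking step can be sketched as follows. Records are plain dicts here, and the field names (`normalized_name`, `postal_code`, `phone`, `domain`) plus the 3-character name prefix used in the zip key are assumptions for illustration. Only the four key types come from the description above.

```python
from collections import defaultdict
from itertools import combinations


def blocking_keys(rec: dict) -> set:
    """Return the set of blocking keys a record falls under."""
    keys = set()
    name = rec.get("normalized_name") or ""
    if rec.get("postal_code") and name:
        # Assumed prefix length of 3 for the zip+name-prefix key.
        keys.add(("zip_name", rec["postal_code"], name[:3]))
    if rec.get("phone"):
        keys.add(("phone", rec["phone"]))
    if rec.get("domain"):
        keys.add(("domain", rec["domain"]))
    if len(name) >= 4:
        keys.add(("name4", name[:4]))
    return keys


def candidate_pairs(records: list) -> set:
    """Return index pairs (i, j), i < j, that share at least one blocking key."""
    buckets = defaultdict(list)
    for idx, rec in enumerate(records):
        for key in blocking_keys(rec):
            buckets[key].append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))
    return pairs
```

Only the pairs returned by `candidate_pairs` go on to fuzzy matching, so the comparison cost depends on bucket sizes rather than on the square of the record count.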
3. Match
Applies a composite similarity score to each blocked pair: RapidFuzz token-sort ratio on normalized_name (weight 0.5), plus exact-match bonuses for address, phone, and website (0.17 each). Pairs scoring ≥ 0.75 are treated as the same physical dealer.
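The scoring rule above can be sketched as below. The weights and threshold come from the text; the record field names are assumptions. Because 0.5 + 3 × 0.17 = 1.01, this sketch clamps the result to 1.0, which is also an assumption. If RapidFuzz is not installed, the sketch falls back to a `difflib`-based approximation of token-sort ratio, which will not give identical scores in general.

```python
from difflib import SequenceMatcher

try:
    from rapidfuzz import fuzz

    def name_ratio(a: str, b: str) -> float:
        """Token-sort similarity in the range 0-100 (RapidFuzz)."""
        return fuzz.token_sort_ratio(a, b)
except ImportError:
    def name_ratio(a: str, b: str) -> float:
        """Fallback approximation: sort tokens, then compare with difflib."""
        sa = " ".join(sorted(a.split()))
        sb = " ".join(sorted(b.split()))
        return 100.0 * SequenceMatcher(None, sa, sb).ratio()

NAME_WEIGHT = 0.5
EXACT_BONUS = 0.17
MATCH_THRESHOLD = 0.75


def composite_score(a: dict, b: dict) -> float:
    """Weighted name similarity plus bonuses for exact field matches."""
    score = NAME_WEIGHT * name_ratio(a["normalized_name"], b["normalized_name"]) / 100.0
    for field in ("address", "phone", "domain"):
        # Empty values never count as a match.
        if a.get(field) and a.get(field) == b.get(field):
            score += EXACT_BONUS
    return min(score, 1.0)  # clamp is an assumption, see lead-in


def is_match(a: dict, b: dict) -> bool:
    return composite_score(a, b) >= MATCH_THRESHOLD
```

Note that a perfect name match alone contributes only 0.5, so under these weights a pair needs at least two exact-match bonuses to reach the 0.75 threshold.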
4. Merge
For matched pairs, picks the best field values by source priority (OSM highest → OpenCorporates lowest). Existing Dealer records are updated in place; new records are created if no match exists. source_count is incremented for each distinct source that contributed data.
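Field-level merging by source priority can be sketched as follows. The priority order follows the Source Priority entry in the Glossary; the slugs other than `osm_overpass`, the record layout, and the function name are hypothetical. The sketch ignores the is_verified guard and database updates.

```python
# High -> low priority. Slugs other than "osm_overpass" are assumed names.
SOURCE_PRIORITY = [
    "osm_overpass", "nbda", "trek", "enve", "specialized",
    "giant", "yelp", "bikeexchange", "nominatim", "opencorporates",
]


def merge_records(records: list) -> dict:
    """For each field, keep the first non-empty value in priority order."""
    rank = {slug: i for i, slug in enumerate(SOURCE_PRIORITY)}
    # Unknown sources sort after every registered source.
    ordered = sorted(records, key=lambda r: rank.get(r["source"], len(rank)))
    merged = {}
    for rec in ordered:
        for field, value in rec["fields"].items():
            if value not in (None, "") and field not in merged:
                merged[field] = value
    # Distinct contributing sources, kept in priority order.
    source_ids = list(dict.fromkeys(r["source"] for r in ordered))
    merged["source_ids"] = source_ids
    merged["source_count"] = len(source_ids)
    return merged


merged = merge_records([
    {"source": "yelp", "fields": {"name": "City Bikes Inc", "phone": "+14155551234", "website": ""}},
    {"source": "osm_overpass", "fields": {"name": "City Bikes", "phone": "", "website": "https://citybikes.example"}},
])
```

In this example the name comes from OSM (higher priority), while the phone comes from Yelp because OSM left it empty: a lower-priority source still fills gaps.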
5. Classify & Score
Assigns ibd_status and ibd_confidence based on contributing sources and name-keyword rules (see IBD Status below). Computes data_quality_score from field completeness and source count (see Quality Score below).
Dealer Fields & Definitions
| Field | Type | Description |
|---|---|---|
| id | UUID | Primary key (UUID v4). Stable across pipeline runs. |
| name | str | Display name of the dealer as it appears in the source. |
| normalized_name | str | Lowercased, accent-stripped, whitespace-collapsed name. Used exclusively for deduplication matching — never displayed. |
| address | str | Street address line(s). |
| city | str | City or municipality. |
| state | str | State / province / region name or abbreviation. |
| postal_code | str | Post / ZIP code. Normalised (stripped whitespace, uppercased for UK codes). |
| country | FK | Foreign key to the Country model (ISO 3166-1 alpha-2). |
| phone | str | Primary phone in E.164 format where possible (+country-area-local). |
| email | str | Primary contact email. |
| website | URL | Full website URL. Root domain used as a blocking/matching key. |
| latitude / longitude | decimal | WGS-84 coordinates. Populated by scraper or by geocode_dealers command via Nominatim. |
| brands | M2M | Many-to-many to Brand. Set from brand-locator sources (e.g. "Trek", "ENVE", "Giant"). |
| ibd_status | choice | "ibd" / "chain" / "unknown" — see IBD Status. |
| ibd_confidence | 0.0–1.0 | Confidence level for the ibd_status classification. |
| data_quality_score | 0.0–1.0 | Composite completeness score — see Quality Score. Displayed as 5 dots. |
| source_count | int | Number of distinct data sources that have contributed data to this record. |
| source_ids | JSON | List of DataSource.slug values for all contributing sources. |
| is_verified | bool | Manually verified by staff. Prevents overwrite by pipeline merges. |
| is_active | bool | Soft-delete flag. Inactive dealers are excluded from all public views. |
| year_established | int? | Year business was registered (from OpenCorporates enrichment). May be null. |
IBD Status
IBD stands for Independent Bike Dealer — a privately owned, single-location (or small-chain) bicycle shop, as distinct from a large retail chain or big-box store.
Classification is applied during the Classify pipeline step. The highest-confidence signal wins when multiple sources contribute.
| Signal | Assigned Status | Confidence | Rationale |
|---|---|---|---|
| Record sourced from NBDA | ibd | 0.95 | NBDA membership is curated and independently verified. |
| Brand locator sources (Trek, ENVE, Specialized, Giant) | ibd | 0.70 | Authorised dealers are typically independents; chains sometimes excluded by brands. |
| BikeExchange paid listing | ibd | 0.70 | Paid marketplace participation implies an active, trading retailer. |
| Chain keywords in name | chain | 0.90 | Matched against: walmart, target, decathlon, rei, academy, dick's, sports authority, halfords, intersport and similar. |
| No signal | unknown | 0.0 | Insufficient information to classify. |
Note: is_verified=True records are never
reclassified by the pipeline. Staff can override IBD status through the CRM.
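The signal table above can be expressed as a small function in which the highest-confidence signal wins. This is a sketch under assumptions: the source slugs are hypothetical, the keyword list is only the subset named in the table, and whole-word matching on the normalised name is a design choice made here, not something the table specifies. The is_verified guard is left out.

```python
import re

BRAND_SOURCES = {"trek", "enve", "specialized", "giant"}  # assumed slugs
CHAIN_KEYWORDS = (
    "walmart", "target", "decathlon", "rei", "academy",
    "dick's", "sports authority", "halfords", "intersport",
)
# Whole-word match so that e.g. "reims" does not trigger the "rei" keyword.
_CHAIN_RE = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(k) for k in CHAIN_KEYWORDS) + r")(?!\w)"
)


def classify(normalized_name: str, source_ids: list) -> tuple:
    """Return (ibd_status, ibd_confidence); the strongest signal wins."""
    sources = set(source_ids)
    signals = []
    if "nbda" in sources:
        signals.append(("ibd", 0.95))
    if sources & BRAND_SOURCES:
        signals.append(("ibd", 0.70))
    if "bikeexchange" in sources:
        signals.append(("ibd", 0.70))
    if _CHAIN_RE.search(normalized_name):
        signals.append(("chain", 0.90))
    if not signals:
        return ("unknown", 0.0)
    return max(signals, key=lambda s: s[1])
```

Under this rule a Trek-listed "decathlon lyon" is classified as chain (0.90 beats 0.70), while an NBDA member always ends up as ibd (0.95 beats 0.90).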
Data Quality Score
The data_quality_score (0.0–1.0) reflects how completely a dealer
record is populated, boosted slightly for records confirmed by multiple independent
sources.
Formula
score = (filled_fields / 8) + min(source_count × 0.05, 0.20)
Result is clamped to [0.0, 1.0].
Fields counted (8 total):
- name
- address
- city
- postal_code
- phone
- email
- website
- latitude and longitude (counts as 1)
| Score | Dots | Meaning |
|---|---|---|
| 0.0–0.19 | | Very sparse: name only or nearly empty |
| 0.20–0.39 | | Partial: has address or phone but missing contact info |
| 0.40–0.59 | | Adequate: most key fields present |
| 0.60–0.79 | | Good: all core fields present, confirmed by 2+ sources |
| 0.80–1.0 | | Excellent: fully populated, confirmed by 4+ sources |
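The formula can be written directly in code. The function name and signature are assumptions for this sketch; the caller is expected to have counted the filled fields already.

```python
def data_quality_score(filled_fields: int, source_count: int,
                       total_fields: int = 8) -> float:
    """Completeness ratio plus a capped multi-source bonus, clamped to [0, 1]."""
    completeness = filled_fields / total_fields
    source_bonus = min(source_count * 0.05, 0.20)  # bonus stops growing at 4 sources
    return max(0.0, min(completeness + source_bonus, 1.0))
```

Worked examples: a record with 6 of 8 fields from 2 sources scores 6/8 + 0.10 = 0.85; a fully populated record from 4 sources would reach 1.0 + 0.20 = 1.20 and is clamped to 1.0; a name-only record from 1 source scores 1/8 + 0.05 = 0.175.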
Management Commands
All commands use the Django management framework: manage.py <command> [options]
| Command | Purpose | When to Run |
|---|---|---|
| load_regions | Loads EU country + US state fixtures into the database | Once after initial setup or to add new regions |
| register_sources | Registers all 10 DataSource records with their slugs and metadata | Once after setup; re-run to add newly coded sources |
| scrape_region | Runs a single scraper for a given country/region, e.g. --country LU --source osm_overpass | Ad-hoc testing; Celery handles production scraping |
| run_pipeline | Processes all unprocessed RawDealerRecords through the full pipeline | After any scrape completes; or on a nightly schedule |
| sync_latlon | Copies lat/lon from RawDealerRecord to matched Dealer records where coordinates are missing | After a fresh scrape or geocoding run |
| geocode_dealers | Calls Nominatim for dealers with a full address but no coordinates | After run_pipeline when coverage is low |
| report_coverage | Prints a breakdown of dealer count, geocoded %, IBD %, and quality distribution per country | Any time — read-only reporting command |
Tip: On Windows, set PYTHONIOENCODING=utf-8 in the environment before running commands to avoid encoding errors when output contains Unicode or emoji.
Glossary
- Raw Record
- A RawDealerRecord: the unprocessed output of a single scraper run for one physical store. May contain duplicates and inconsistencies. Has status "unprocessed" until the pipeline runs.
- Blocking
- A technique to avoid comparing every raw record against every other (O(n²)). Records are only compared if they share at least one blocking key (zip+name-prefix, phone, website domain, or name n-gram). Reduces comparison work by ~99%.
- Fuzzy Matching
- Name similarity is computed using RapidFuzz's token-sort ratio, which handles word-order variation ("Bike Shop City" vs "City Bike Shop"). Scores range 0–100; the pipeline threshold is 75 (equivalent to 0.75).
- Source Priority
- When two sources provide conflicting values for the same field, the higher-priority source wins. Priority order (high → low): OSM > NBDA > Trek > ENVE > Specialized > Giant > Yelp > BikeExchange > Nominatim > OpenCorporates.
- Geocoding
- Converting a text address into latitude/longitude coordinates. Done automatically by scrapers that return coordinates directly (OSM, brand locators) or retroactively via the Nominatim geocoder for address-only records.
- Overpass API
- A read-only API for querying OpenStreetMap data using the Overpass Query Language (QL). Queries can select nodes/ways by tags (e.g. shop=bicycle) within a bounding box or radius. Hosted at overpass-api.de.
- Nominatim
- OSM's official geocoding and reverse-geocoding service. Converts addresses to coordinates and vice-versa. Strict usage policy: max 1 req/s, no bulk geocoding without prior permission.
- IBD
- Independent Bike Dealer. A privately owned bicycle shop not affiliated with a national retail chain. The NBDA (National Bicycle Dealers Association) is the primary US trade body for IBDs.
- Data Quality Score
- A 0.0–1.0 metric reflecting how completely a Dealer record is populated, with a bonus for multi-source confirmation. Displayed as dots in the UI.
- Pipeline
- The multi-stage process (Normalize → Block → Match → Merge → Classify & Score) that converts raw scraped records into unified Dealer rows. Triggered by manage.py run_pipeline or the Celery pipeline queue.
- E.164
- International telephone number format standardised by the ITU. All phone numbers are stored as +[country code][area code][number] with no spaces or dashes (e.g. +14155551234). Used as a blocking key during deduplication.
- WGS-84
- World Geodetic System 1984 — the coordinate reference system used by GPS and all coordinates in this project. Latitude and longitude are stored as decimal degrees (e.g. 48.8566, 2.3522 for Paris).