Overview

EuroDealers is a global bicycle dealer directory that aggregates, deduplicates, and quality-scores retail locations from 10 distinct data sources into a single unified record set.

Data is collected by scraping brand locator APIs, open community databases, and industry directories. Raw records flow through a multi-stage pipeline that normalises contact details, blocks candidates for comparison, applies fuzzy name matching, and merges field values by source authority — producing a single Dealer record per physical location.

  • ENVE dealers (global): 1,566+
  • Registered sources: 10
  • Countries & territories: 77

Data Sources

Ten sources are registered in the system, falling into four broad categories: OpenStreetMap, Brand Locators, Industry Directories, and Enrichment Only.

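Sources are numbered 1 to 10 below. When sources disagree on a field, the higher-priority source wins; the exact priority order is given under Source Priority in the Glossary.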

1. OSM Overpass
OpenStreetMap · Global · Highest Authority

Queries the Overpass API for nodes and ways tagged shop=bicycle in a given bounding box. Returns name, address, phone, website, opening hours, brand tags, and lat/lon from OpenStreetMap community data.

Note: robots.txt disallows /api/ — scraper uses check_robots=False per OSM's actual usage policy (Overpass is a public read API). Rate-limited to 1 req/s. No API key required.
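
As a rough illustration of the query described above, the sketch below builds an Overpass QL string for shop=bicycle nodes and ways inside a bounding box. The function name, the timeout value, and the sample coordinates are assumptions for illustration, not taken from the scraper itself.

```python
def build_overpass_query(south: float, west: float, north: float, east: float,
                         timeout_s: int = 25) -> str:
    """Return an Overpass QL query for bicycle shops inside a bounding box."""
    # Overpass QL expects bounding boxes in (south, west, north, east) order.
    bbox = f"({south},{west},{north},{east})"
    return (
        f"[out:json][timeout:{timeout_s}];"
        "("
        f'node["shop"="bicycle"]{bbox};'
        f'way["shop"="bicycle"]{bbox};'
        ");"
        # "out center" adds a single representative lat/lon to each way.
        "out center;"
    )


# Approximate bounding box for Luxembourg (illustrative values).
query = build_overpass_query(49.44, 5.73, 50.19, 6.53)
print(query)
```

The resulting string would be sent as the query payload to an Overpass endpoint, at no more than one request per second.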

2. Nominatim
Enrichment Only · Global

OSM's geocoding service. Used exclusively to enrich existing Dealer records with coordinates when latitude/longitude are missing. Does not add new dealers.

Limit: 1 req/s, no bulk use. Requires a descriptive User-Agent header per OSM policy.
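
One way to respect the 1 req/s limit is a small throttle that is called before every request. The sketch below is an assumption about how this could be done, not the project's actual code; the class name and the User-Agent string are hypothetical.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Ensure successive calls to wait() are at least interval_s apart."""

    def __init__(self, interval_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval_s = interval_s
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.interval_s - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


# Hypothetical header value; OSM policy asks for one that identifies the app.
HEADERS = {"User-Agent": "EuroDealers/1.0 (contact: ops@example.org)"}
```

Calling `throttle.wait()` immediately before each geocoding request keeps the request rate at or below one per second. The clock and sleep functions are injectable so the behaviour can be tested without real delays.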

3. NBDA Dealer Finder
Industry Directory · US Only · IBD Authoritative

Member directory of the National Bicycle Dealers Association (NBDA). NBDA membership is verified, making this the most authoritative source for the ibd_status="ibd" (independent bike dealer) classification, with confidence 0.95. Scraped via the WordPress REST API, with an HTML fallback.

Coverage: US independent bike dealers only. Approximately 3,500 listings.

4. Yelp Fusion API
Industry Directory · US + Some EU

Business listings via Yelp's Fusion API, category bicycleshops. Returns name, address, phone, coordinates, rating, and review count. Useful for US coverage gaps; EU coverage is sparse.

Requires: Yelp API key (YELP_API_KEY env var). Free tier: 500 calls/day, 50 results/call.

5. Trek Dealer Locator
Brand Locator · Global

Trek's official dealer finder, reverse-engineered JSON API. Uses a grid-sampling strategy: divides the target bounding box into a lat/lon grid and fires one request per cell to overcome the per-request result cap. Returns Trek authorised dealers only.

IBD confidence: 0.70 (brand dealer assumption). No API key needed.
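
The grid-sampling strategy mentioned above can be sketched as follows. This is a minimal illustration under assumed names (`grid_cells`, `BBox`) and an assumed fixed grid size; the real scraper may choose cell sizes differently.

```python
from typing import Iterator, Tuple

BBox = Tuple[float, float, float, float]  # (south, west, north, east)


def grid_cells(bbox: BBox, rows: int, cols: int) -> Iterator[BBox]:
    """Split a bounding box into rows x cols equally sized cells."""
    south, west, north, east = bbox
    lat_step = (north - south) / rows
    lon_step = (east - west) / cols
    for r in range(rows):
        for c in range(cols):
            yield (
                south + r * lat_step,
                west + c * lon_step,
                south + (r + 1) * lat_step,
                west + (c + 1) * lon_step,
            )


# One locator request would be issued per cell.
cells = list(grid_cells((40.0, -10.0, 50.0, 10.0), rows=2, cols=4))
print(len(cells))
```

Dealers near cell borders can appear in more than one response, so results from adjacent cells still need deduplication downstream.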

6. ENVE Composites (Locally.com)
Brand Locator · Global

ENVE's dealer finder is powered by Locally.com. The underlying API endpoint was discovered via Playwright network interception. Returns 1,566 ENVE dealers worldwide (US=1,029, GB=77, FR=76, CA=67, AU=43, DE=31, JP=26 …).

enve-composites.locally.com/stores/conversion_data
Required params: has_data=true, dealers_company_id=185731, bounding box, sort_by=proximity&zoom_level=5&inline=1

Critical: omitting has_data=true returns Australian retailer data for a different company.
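
The sketch below assembles a request URL from the parameters listed above. The host, path, and fixed parameters come from this page; the https scheme is assumed, and the four bounding-box parameter names are hypothetical placeholders, because the real names are not given here.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Host and path are from the text above; the https scheme is an assumption.
BASE_URL = "https://enve-composites.locally.com/stores/conversion_data"


def build_conversion_data_url(south: float, west: float,
                              north: float, east: float) -> str:
    """Build the dealer query URL with all required parameters set."""
    params = {
        "has_data": "true",               # omitting this returns wrong-company data
        "dealers_company_id": "185731",
        # HYPOTHETICAL bounding-box parameter names, for illustration only.
        "bbox_south": south,
        "bbox_west": west,
        "bbox_north": north,
        "bbox_east": east,
        "sort_by": "proximity",
        "zoom_level": "5",
        "inline": "1",
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_conversion_data_url(49.44, 5.73, 50.19, 6.53)
sent = parse_qs(urlparse(url).query)
# Cheap guard against the wrong-company failure mode described above.
assert sent["has_data"] == ["true"]
```

Checking for has_data before sending is a small safeguard against silently ingesting another company's retailers.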

7. Specialized Dealer Locator
Brand Locator · Global

Specialized's dealer finder scraped via XHR interception. Falls back to Playwright-driven browser automation when the XHR endpoint is not accessible. Returns Specialized authorised retailers.

IBD confidence: 0.70 (brand dealer assumption).

8. Giant Dealer Locator
Brand Locator · US + EU Regional

Giant operates separate regional domains for the US and major EU markets. The scraper iterates over configured regional endpoints to pull Giant authorised dealers.

IBD confidence: 0.70 (brand dealer assumption).

9. BikeExchange
Industry Directory · US Only

Paid marketplace for bicycle retailers. Listings imply an actively trading independent dealer; IBD confidence 0.70. Scraped via HTML parsing.

Note: paid listings may lag — dealers can remain listed after closure.

10. OpenCorporates
Enrichment Only · Global · Lowest Authority

Company registry aggregator. Used solely to enrich existing dealers with year_established. Does not add new dealers or affect name/address.

Requires: an OpenCorporates API key. Without one, the scraper falls back to throttled anonymous requests.

The Pipeline

The pipeline transforms raw scraped records (RawDealerRecord with status="unprocessed") into unified, deduplicated Dealer rows. Run it with manage.py run_pipeline or via the Celery pipeline queue.

1. Normalize

Standardises all text fields: strips HTML, collapses whitespace, converts phone numbers to E.164 format (e.g. +15551234567, with no spaces or dashes), extracts the root domain from website URLs, lower-cases and accent-strips the name into normalized_name, and normalises postal codes.
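
A minimal sketch of three of these normalisations, using only the standard library. The function names are hypothetical, the domain helper only strips a leading "www." rather than doing true root-domain extraction, and the phone helper assumes a default country code and ignores national trunk prefixes, so a dedicated phone-number parsing library would be needed in practice.

```python
import re
import unicodedata
from urllib.parse import urlparse


def normalize_name(name: str) -> str:
    """Lower-case, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", stripped).strip().lower()


def website_domain(url: str) -> str:
    """Return the host without a leading 'www.' (simplified stand-in)."""
    host = urlparse(url if "://" in url else f"//{url}").hostname or ""
    return host[4:] if host.startswith("www.") else host


def to_e164(raw: str, default_country_code: str = "1") -> str:
    """Naive E.164 conversion: keep digits, prepend a country code if absent."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        return f"+{digits}"
    return f"+{default_country_code}{digits}"


print(normalize_name("  Café  Vélo\tParis "))
print(website_domain("https://www.example.com/shop?x=1"))
print(to_e164("(415) 555-1234"))
```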

2. Block

Groups candidate pairs for comparison using four blocking keys — avoiding an O(n²) full comparison. Keys: zip+name-prefix, phone (E.164), website domain, and name 4-gram prefix. Two records must share at least one key to be compared.
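
The blocking step can be sketched as below. The record layout (plain dicts), the 3-character name prefix in the zip key, and the reading of "name 4-gram prefix" as the first four characters of normalized_name are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def blocking_keys(rec: Dict[str, str]) -> Set[tuple]:
    """Return the blocking keys a record participates in."""
    keys: Set[tuple] = set()
    name = rec.get("normalized_name", "")
    if rec.get("postal_code") and name:
        keys.add(("zip_name", rec["postal_code"], name[:3]))  # assumed prefix length
    if rec.get("phone"):
        keys.add(("phone", rec["phone"]))
    if rec.get("domain"):
        keys.add(("domain", rec["domain"]))
    if len(name) >= 4:
        keys.add(("name4", name[:4]))
    return keys


def candidate_pairs(records: List[Dict[str, str]]) -> Set[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, that share at least one key."""
    blocks: Dict[tuple, List[int]] = defaultdict(list)
    for idx, rec in enumerate(records):
        for key in blocking_keys(rec):
            blocks[key].append(idx)
    pairs: Set[Tuple[int, int]] = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


records = [
    {"normalized_name": "city bike shop", "postal_code": "94110", "phone": "+14155551234"},
    {"normalized_name": "city bikes", "postal_code": "94110", "domain": "citybikes.example"},
    {"normalized_name": "velo paris", "postal_code": "75001", "phone": "+14155551234"},
    {"normalized_name": "alpine cycles", "postal_code": "80301"},
]
pairs = candidate_pairs(records)
print(sorted(pairs))
```

Only pairs that land in the same block are passed on to matching; the fourth record shares no key with any other and is never compared.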

3. Match

Applies a composite similarity score to each blocked pair: RapidFuzz token-sort ratio on normalized_name (weight 0.5), plus exact-match bonuses for address, phone, and website (0.17 each). Pairs scoring ≥ 0.75 are treated as the same physical dealer.
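
To make the weighting concrete, here is a sketch of the composite score. It uses a standard-library stand-in (difflib) for RapidFuzz's token-sort ratio so the example runs without third-party packages; the real pipeline's scores will differ slightly. The record layout and function names are assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict

NAME_WEIGHT = 0.5    # weight on the name similarity
EXACT_BONUS = 0.17   # bonus per exactly matching field
THRESHOLD = 0.75     # pairs at or above this are treated as the same dealer


def token_sort_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] after sorting tokens (stand-in for RapidFuzz)."""
    sorted_a = " ".join(sorted(a.split()))
    sorted_b = " ".join(sorted(b.split()))
    return SequenceMatcher(None, sorted_a, sorted_b).ratio()


def match_score(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Weighted name similarity plus bonuses for exact field matches."""
    score = NAME_WEIGHT * token_sort_ratio(a["normalized_name"], b["normalized_name"])
    for field in ("address", "phone", "domain"):
        if a.get(field) and a.get(field) == b.get(field):
            score += EXACT_BONUS
    return score


rec_a = {"normalized_name": "bike shop city", "phone": "+14155551234",
         "domain": "citybikes.example"}
rec_b = {"normalized_name": "city bike shop", "phone": "+14155551234",
         "domain": "citybikes.example"}
rec_c = {"normalized_name": "city bike shop", "phone": "+14155551234"}

print(match_score(rec_a, rec_b) >= THRESHOLD)
print(match_score(rec_a, rec_c) >= THRESHOLD)
```

Under these weights an identical name contributes only 0.5, and one exact-field bonus brings the total to 0.67, so a pair needs at least two matching fields among address, phone, and website to reach the 0.75 threshold.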

4. Merge

For matched pairs, picks the best field values by source priority (OSM highest → OpenCorporates lowest). Existing Dealer records are updated in place; new records are created if no match exists. source_count is incremented for each distinct source that contributed data.
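
A field-level merge by source priority might look like the sketch below. The priority order follows the Source Priority entry in the Glossary; apart from osm_overpass, the slugs are hypothetical, and the input shape (a list of slug/fields tuples) is an assumption.

```python
from typing import Dict, List, Tuple

# Order per the Glossary; every slug except "osm_overpass" is hypothetical.
SOURCE_PRIORITY = [
    "osm_overpass", "nbda", "trek", "enve", "specialized",
    "giant", "yelp", "bikeexchange", "nominatim", "opencorporates",
]
RANK = {slug: i for i, slug in enumerate(SOURCE_PRIORITY)}


def merge_records(records: List[Tuple[str, Dict[str, str]]]) -> Dict[str, object]:
    """Pick each field from the highest-priority source with a non-empty value."""
    merged: Dict[str, object] = {}
    for slug, fields in sorted(records, key=lambda r: RANK[r[0]]):
        for field, value in fields.items():
            if value not in (None, "") and field not in merged:
                merged[field] = value
    source_ids = sorted({slug for slug, _ in records}, key=RANK.get)
    merged["source_ids"] = source_ids
    merged["source_count"] = len(source_ids)
    return merged


merged = merge_records([
    ("yelp", {"name": "City Bikes SF", "phone": "+14155551234",
              "email": "hi@citybikes.example"}),
    ("osm_overpass", {"name": "City Bikes", "phone": "",
                      "website": "https://citybikes.example"}),
])
print(merged["name"], merged["source_count"])
```

Here the OSM name wins the conflict, while the phone and email come from Yelp because OSM has no value for them.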

5. Classify & Score

Assigns ibd_status and ibd_confidence based on contributing sources and name-keyword rules (see IBD Status below). Computes data_quality_score from field completeness and source count (see Quality Score below).

Dealer Fields & Definitions

Field | Type | Description
id | UUID | Primary key (UUID v4). Stable across pipeline runs.
name | str | Display name of the dealer as it appears in the source.
normalized_name | str | Lowercased, accent-stripped, whitespace-collapsed name. Used exclusively for deduplication matching; never displayed.
address | str | Street address line(s).
city | str | City or municipality.
state | str | State / province / region name or abbreviation.
postal_code | str | Post / ZIP code. Normalised (stripped whitespace, uppercased for UK codes).
country | FK | Foreign key to the Country model (ISO 3166-1 alpha-2).
phone | str | Primary phone in E.164 format where possible (+[country code][area code][number], no spaces or dashes).
email | str | Primary contact email.
website | URL | Full website URL. Root domain used as a blocking/matching key.
latitude / longitude | decimal | WGS-84 coordinates. Populated by the scraper or by the geocode_dealers command via Nominatim.
brands | M2M | Many-to-many to Brand. Set from brand-locator sources (e.g. "Trek", "ENVE", "Giant").
ibd_status | choice | "ibd" / "chain" / "unknown"; see IBD Status.
ibd_confidence | 0.0–1.0 | Confidence level for the ibd_status classification.
data_quality_score | 0.0–1.0 | Composite completeness score; see Quality Score. Displayed as 5 dots.
source_count | int | Number of distinct data sources that have contributed data to this record.
source_ids | JSON | List of DataSource.slug values for all contributing sources.
is_verified | bool | Manually verified by staff. Prevents overwrite by pipeline merges.
is_active | bool | Soft-delete flag. Inactive dealers are excluded from all public views.
year_established | int? | Year the business was registered (from OpenCorporates enrichment). May be null.

IBD Status

IBD stands for Independent Bike Dealer — a privately owned, single-location (or small-chain) bicycle shop, as distinct from a large retail chain or big-box store.

Classification is applied during the Classify pipeline step. The highest-confidence signal wins when multiple sources contribute.

Signal | Assigned Status | Confidence | Rationale
Record sourced from NBDA | ibd | 0.95 | NBDA membership is curated and independently verified.
Brand locator sources (Trek, ENVE, Specialized, Giant) | ibd | 0.70 | Authorised dealers are typically independents; chains sometimes excluded by brands.
BikeExchange paid listing | ibd | 0.70 | Paid marketplace participation implies an active, trading retailer.
Chain keywords in name | chain | 0.90 | Matched against: walmart, target, decathlon, rei, academy, dick's, sports authority, halfords, intersport and similar.
No signal | unknown | 0.0 | Insufficient information to classify.

Note: is_verified=True records are never reclassified by the pipeline. Staff can override IBD status through the CRM.
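
The classification rules above can be sketched as a small function in which the highest-confidence signal wins. The source slugs other than osm_overpass are hypothetical, and whole-word keyword matching is an assumption (chosen so that "rei" does not match a name like "reid cycles"); the pipeline's actual matching rule is not specified here.

```python
import re
from typing import Iterable, Tuple

CHAIN_KEYWORDS = (
    "walmart", "target", "decathlon", "rei", "academy",
    "dick's", "sports authority", "halfords", "intersport",
)

# Hypothetical slugs mapped to the (status, confidence) values from the table.
SOURCE_SIGNALS = {
    "nbda": ("ibd", 0.95),
    "trek": ("ibd", 0.70),
    "enve": ("ibd", 0.70),
    "specialized": ("ibd", 0.70),
    "giant": ("ibd", 0.70),
    "bikeexchange": ("ibd", 0.70),
}


def classify(normalized_name: str, source_ids: Iterable[str]) -> Tuple[str, float]:
    """Return (ibd_status, ibd_confidence) from the strongest available signal."""
    signals = [SOURCE_SIGNALS[s] for s in source_ids if s in SOURCE_SIGNALS]
    for keyword in CHAIN_KEYWORDS:
        # Assumed whole-word match against the normalised name.
        if re.search(rf"(?<!\w){re.escape(keyword)}(?!\w)", normalized_name):
            signals.append(("chain", 0.90))
            break
    if not signals:
        return ("unknown", 0.0)
    return max(signals, key=lambda signal: signal[1])


print(classify("city bike shop", ["osm_overpass", "nbda"]))
print(classify("decathlon lyon", ["trek"]))
```

With these confidences an NBDA listing (0.95) outranks a chain keyword (0.90), which in turn outranks a brand-locator listing (0.70).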

Data Quality Score

The data_quality_score (0.0–1.0) reflects how completely a dealer record is populated, boosted slightly for records confirmed by multiple independent sources.

Formula

score = (filled_fields / 8) + min(source_count × 0.05, 0.20)

Result is clamped to [0.0, 1.0].

Fields counted (8 total):

  • name
  • address
  • city
  • postal_code
  • phone
  • email
  • website
  • latitude and longitude (counts as 1)

Score | Meaning
0.0–0.19 | Very sparse: name only or nearly empty
0.20–0.39 | Partial: has address or phone but missing contact info
0.40–0.59 | Adequate: most key fields present
0.60–0.79 | Good: all core fields present, confirmed by 2+ sources
0.80–1.0 | Excellent: fully populated, confirmed by 4+ sources
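
The formula above translates directly into code. The sketch below assumes a dealer is represented as a plain dict keyed by the field names listed; the function name is hypothetical.

```python
from typing import Dict

# Seven single fields; latitude and longitude together count as the eighth.
SCORED_FIELDS = ("name", "address", "city", "postal_code", "phone", "email", "website")


def data_quality_score(dealer: Dict[str, object]) -> float:
    """score = filled_fields / 8 + min(source_count * 0.05, 0.20), clamped to [0, 1]."""
    filled = sum(1 for field in SCORED_FIELDS if dealer.get(field))
    if dealer.get("latitude") is not None and dealer.get("longitude") is not None:
        filled += 1
    bonus = min(dealer.get("source_count", 0) * 0.05, 0.20)
    return max(0.0, min(1.0, filled / 8 + bonus))


partial = {"name": "City Bikes", "address": "1 Main St", "city": "Springfield",
           "phone": "+14155551234", "source_count": 2}
print(data_quality_score(partial))
```

For the partial record, four of eight fields are filled (0.5) and two sources add 0.10, giving a score of about 0.6.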

Management Commands

All commands use the Django management framework: manage.py <command> [options]

Command | Purpose | When to Run
load_regions | Loads EU country + US state fixtures into the database | Once after initial setup or to add new regions
register_sources | Registers all 10 DataSource records with their slugs and metadata | Once after setup; re-run to add newly coded sources
scrape_region | Runs a single scraper for a given country/region (e.g. --country LU --source osm_overpass) | Ad-hoc testing; Celery handles production scraping
run_pipeline | Processes all unprocessed RawDealerRecords through the full pipeline | After any scrape completes; or on a nightly schedule
sync_latlon | Copies lat/lon from RawDealerRecord to matched Dealer records where coordinates are missing | After a fresh scrape or geocoding run
geocode_dealers | Calls Nominatim for dealers with a full address but no coordinates | After run_pipeline when coverage is low
report_coverage | Prints a breakdown of dealer count, geocoded %, IBD %, and quality distribution per country | Any time; read-only reporting command

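Tip: On Windows, set PYTHONIOENCODING=utf-8 before running commands to avoid encoding errors when output contains Unicode or emoji. In cmd use set PYTHONIOENCODING=utf-8; in PowerShell use $env:PYTHONIOENCODING = "utf-8". The VAR=value prefix form only works in Unix-style shells such as Git Bash.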

Glossary

Raw Record
A RawDealerRecord — the unprocessed output of a single scraper run for one physical store. May contain duplicates and inconsistencies. Has status "unprocessed" until the pipeline runs.
Blocking
A technique to avoid comparing every raw record against every other (O(n²)). Records are only compared if they share at least one blocking key (zip+name-prefix, phone, website domain, or name n-gram). Reduces comparison work by ~99%.
Fuzzy Matching
Name similarity is computed using RapidFuzz's token-sort ratio, which handles word-order variation ("Bike Shop City" vs "City Bike Shop"). RapidFuzz scores range 0–100. The name score carries weight 0.5 in the composite match score, and pairs whose composite score reaches 0.75 are treated as the same dealer.
Source Priority
When two sources provide conflicting values for the same field, the higher-priority source wins. Priority order (high → low): OSM > NBDA > Trek > ENVE > Specialized > Giant > Yelp > BikeExchange > Nominatim > OpenCorporates.
Geocoding
Converting a text address into latitude/longitude coordinates. Sources that return coordinates directly (OSM, brand locators) need no geocoding; address-only records are geocoded retroactively via Nominatim.
Overpass API
A read-only API for querying OpenStreetMap data using the Overpass Query Language (QL). Queries can select nodes/ways by tags (e.g. shop=bicycle) within a bounding box or radius. Hosted at overpass-api.de.
Nominatim
OSM's official geocoding and reverse-geocoding service. Converts addresses to coordinates and vice-versa. Strict usage policy: max 1 req/s, no bulk geocoding without prior permission.
IBD
Independent Bike Dealer. A privately owned bicycle shop not affiliated with a national retail chain. The NBDA (National Bicycle Dealers Association) is the primary US trade body for IBDs.
Data Quality Score
A 0.0–1.0 metric reflecting how completely a Dealer record is populated, with a bonus for multi-source confirmation. Displayed as dots in the UI.
Pipeline
The multi-stage process (Normalize → Block → Match → Merge → Classify & Score) that converts raw scraped records into unified Dealer rows. Triggered by manage.py run_pipeline or the Celery pipeline queue.
E.164
International telephone number format standardised by the ITU. All phone numbers are stored as +[country code][area code][number] with no spaces or dashes (e.g. +14155551234). Used as a blocking key during deduplication.
WGS-84
World Geodetic System 1984 — the coordinate reference system used by GPS and all coordinates in this project. Latitude and longitude are stored as decimal degrees (e.g. 48.8566, 2.3522 for Paris).