Dive Site Data Ingestion
Instructions
Build data ingestion pipelines that pull georeferenced dive site records from multiple external APIs and open databases, normalize them to a common schema, deduplicate across sources, and produce a clean merged database with SQL import scripts.
1. Available Data Sources
TheDiveAPI (RapidAPI — subscription required)
- Host: `world-scuba-diving-sites-api.p.rapidapi.com`
- Auth: RapidAPI key in the `x-rapidapi-key` header
- Endpoints:
  - `GET /divesites/gps?southWestLat=&northEastLat=&southWestLng=&northEastLng=`: bounding-box search, max 200 results
  - `GET /divesites?country=`: text search by country/region (query must be >4 characters)
- Fields: `id`, `name`, `region` (hierarchical comma-separated), `latitude`, `longitude`, `ocean`, `location`, `description`
- Coverage: ~17,000 sites worldwide
- Cloudflare note: Python `urllib`/`requests` may receive HTTP 403 (error code 1010). Use `curl` as a subprocess instead.
- Rate limit: Check your RapidAPI plan (e.g., 25 req/sec). Recommended: 2 req/sec to stay well within limits.
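The curl workaround noted above could look roughly like the sketch below. The host, header name, and query parameters come from the list above; the `x-rapidapi-host` header is an assumption based on RapidAPI's usual conventions, and the function names are hypothetical. The response shape is not documented here, so the parsed JSON is returned as-is.

```python
import json
import subprocess

API_HOST = "world-scuba-diving-sites-api.p.rapidapi.com"

def build_curl_cmd(api_key, sw_lat, ne_lat, sw_lng, ne_lng):
    """Build the curl argument list for one bounding-box query."""
    url = (
        f"https://{API_HOST}/divesites/gps"
        f"?southWestLat={sw_lat}&northEastLat={ne_lat}"
        f"&southWestLng={sw_lng}&northEastLng={ne_lng}"
    )
    return [
        "curl", "--silent", "--show-error", "--fail", "--max-time", "30",
        "-H", f"x-rapidapi-key: {api_key}",
        # Assumption: RapidAPI normally expects the host header as well.
        "-H", f"x-rapidapi-host: {API_HOST}",
        url,
    ]

def fetch_bbox(api_key, sw_lat, ne_lat, sw_lng, ne_lng):
    """Run curl and parse the JSON body.

    With --fail, curl exits non-zero on HTTP errors (such as 403),
    so check=True raises subprocess.CalledProcessError in that case.
    """
    cmd = build_curl_cmd(api_key, sw_lat, ne_lat, sw_lng, ne_lng)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with the API key and URL.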
OpenDiveMap (free, no auth)
- Base URL: `https://api.opendivemap.com/v1`
- Endpoints:
  - `GET /sites?limit=1000&offset=0`: paginated GeoJSON FeatureCollection
  - `GET /stats`: aggregate counts
  - `GET /enums`: environment, topology, entry enum values
- Fields: `name`, `coordinates` (GeoJSON [lng, lat]), `country_code`, `country_name`, `sea_name`, `environment`, `topologies[]`, `max_depth`, `entry`
- Coverage: ~3,100 sites across 59 countries
- License: ODbL 1.0 — requires attribution
- Rate limit: No documented limit, but use 2-second delays between paginated calls
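A minimal pagination sketch for the `/sites` endpoint, using the `limit`/`offset` parameters and the 2-second delay listed above. It assumes that a page shorter than `limit` marks the end of the collection, which the API description here does not confirm; `fetch_all_sites` and `http_get_json` are hypothetical names, and the fetcher is injectable so the loop can be exercised without network access.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.opendivemap.com/v1"

def http_get_json(url):
    """Default fetcher: plain GET returning parsed JSON."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

def fetch_all_sites(fetch=http_get_json, limit=1000, delay_sec=2.0):
    """Page through /sites and return the accumulated GeoJSON features."""
    features, offset = [], 0
    while True:
        page = fetch(f"{BASE_URL}/sites?limit={limit}&offset={offset}")
        batch = page.get("features", [])
        features.extend(batch)
        # Assumption: a short (or empty) page means there is nothing left.
        if len(batch) < limit:
            return features
        offset += limit
        time.sleep(delay_sec)  # be polite between paginated calls
```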
Dive Vibe Community (GitHub — free)
- Repository: `jbunderwater/dive-vibe-community`
- Data URL pattern: `https://raw.githubusercontent.com/jbunderwater/dive-vibe-community/main/divesites/{slug}/index.json`
- Destinations list: `https://raw.githubusercontent.com/jbunderwater/dive-vibe-community/main/destinations.json`
- Fields: `name`, `latitude`, `longitude`, `site_type`, `difficulty`, `depth`, `entry_type`
- Coverage: ~2,700 sites across 122+ destinations
- License: ODbL 1.0 (data) / MIT (code)
- Source: OpenStreetMap + AI-validated corrections
- Note: Parent-level slugs (e.g., `india`, `us`) return 404; use specific destination slugs (e.g., `bali`, `florida-keys`)
OpenStreetMap via Overpass API (free, no auth)
- Tags: `sport=scuba_diving`, `amenity=dive_centre`, `shop=scuba_diving`
- Use case: Gap-filling only, since OpenDiveMap and Dive Vibe already curate OSM data
- Note: Raw OSM includes dive shops mixed with dive sites; requires filtering
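One way to handle the shop/site mix is to query only `sport=scuba_diving` and then drop elements that also carry a shop or dive-centre tag. The sketch below builds an Overpass QL query for a bounding box and applies that filter; the function names are hypothetical, and treating any element with `amenity=dive_centre` or `shop=scuba_diving` as a shop is a simplifying assumption.

```python
SHOP_TAGS = {("amenity", "dive_centre"), ("shop", "scuba_diving")}

def build_overpass_query(south, west, north, east, timeout_sec=60):
    """Overpass QL for dive-site candidates inside a (south, west, north, east) box."""
    bbox = f"({south},{west},{north},{east})"
    return (
        f"[out:json][timeout:{timeout_sec}];\n"
        "(\n"
        f'  node["sport"="scuba_diving"]{bbox};\n'
        f'  way["sport"="scuba_diving"]{bbox};\n'
        ");\n"
        "out center;"  # "center" gives ways a single representative coordinate
    )

def is_dive_shop(tags):
    """True if the OSM tags mark the element as a shop or dive centre."""
    return any(tags.get(key) == value for key, value in SHOP_TAGS)

def keep_sites(elements):
    """Drop shop-like elements from an Overpass JSON 'elements' list."""
    return [e for e in elements if not is_dive_shop(e.get("tags", {}))]
```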
2. Common Target Schema
Normalize all sources to this structure:
{
  "name": "El Bajo Seamount",
  "country": "Mexico",
  "region": "Caribbean",
  "location": "24.21560, -110.39810",
  "lat": 24.2156,
  "lng": -110.3981,
  "dive_types": ["pinnacle"],
  "depth_min_m": null,
  "depth_max_m": 40,
  "cert_level": "AOW",
  "marine_life": [],
  "description": "",
  "source": "opendivemap"
}
Field mapping by source:
| Field | TheDiveAPI | OpenDiveMap | Dive Vibe |
|---|---|---|---|
| `name` | `name` | `properties.name` | `name` |
| `lat` | `latitude` | `geometry.coordinates[1]` | `latitude` |
| `lng` | `longitude` | `geometry.coordinates[0]` | `longitude` |
| `country` | Parsed from `region` (2nd comma-part) | `properties.country_name` | `_destination_meta.countryCode` → lookup |
| `region` | Mapped from country | Mapped from country | `_destination_meta.region` → mapped |
| `dive_types` | Not provided | `properties.topologies` | `site_type` |
| `depth_max_m` | Not provided | `properties.max_depth` | `depth` |
| `cert_level` | Derived from depth | Derived from depth | `difficulty` → mapped |
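As an illustration of the mapping, here is a sketch of a normalizer for one OpenDiveMap GeoJSON feature. The depth thresholds in `cert_from_depth` and the `"OW"`/`"TEC"` labels are assumptions (the document only says the level is derived from depth, and its example pairs 40 m with `AOW`); the country-to-region mapping is left to a later step.

```python
def cert_from_depth(depth_max_m):
    """Hypothetical depth-to-certification rule; thresholds are assumptions."""
    if depth_max_m is None:
        return None
    if depth_max_m <= 18:
        return "OW"
    if depth_max_m <= 40:
        return "AOW"
    return "TEC"

def normalize_opendivemap(feature):
    """Map one GeoJSON Feature onto the common target schema."""
    props = feature.get("properties", {})
    # GeoJSON stores coordinates as [lng, lat], the reverse of the schema order.
    lng, lat = feature["geometry"]["coordinates"][:2]
    depth = props.get("max_depth")
    return {
        "name": (props.get("name") or "").strip(),
        "country": props.get("country_name"),
        "region": None,  # filled in later by the country-to-region mapping
        "location": f"{lat:.5f}, {lng:.5f}",
        "lat": lat,
        "lng": lng,
        "dive_types": list(props.get("topologies") or []),
        "depth_min_m": None,
        "depth_max_m": depth,
        "cert_level": cert_from_depth(depth),
        "marine_life": [],
        "description": "",
        "source": "opendivemap",
    }
```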
3. GPS Grid Strategy for Bounding-Box APIs
When an API caps results per call (TheDiveAPI: 200), tile the world systematically:
- Define grid cells covering coastal/ocean regions — skip landlocked areas
- Use 5° cells for dense areas (Caribbean, SE Asia, Mediterranean, Red Sea)
- Use 10-20° cells for sparse areas (open Pacific, polar)
- Adaptive subdivision: If a cell returns max results (200), split into 4 quadrants and re-query each
- Supplement with country text queries to catch sites the grid may miss
Recommended grid: ~115 cells covers all major diving regions worldwide.
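The adaptive subdivision step can be sketched as a small recursive function. It assumes `fetch(south, north, west, east)` returns a list of site dicts carrying an `id` field; `pull_cell` and the `min_size_deg` guard (which stops recursion on pathologically dense spots) are hypothetical additions.

```python
MAX_RESULTS = 200  # TheDiveAPI cap per bounding-box call

def pull_cell(fetch, south, north, west, east, min_size_deg=0.5):
    """Return {id: site} for one cell, splitting saturated cells into quadrants."""
    sites = fetch(south, north, west, east)
    saturated = len(sites) >= MAX_RESULTS
    if not saturated or (north - south) <= min_size_deg:
        return {s["id"]: s for s in sites}

    mid_lat = (south + north) / 2
    mid_lng = (west + east) / 2
    quadrants = [
        (south, mid_lat, west, mid_lng),
        (south, mid_lat, mid_lng, east),
        (mid_lat, north, west, mid_lng),
        (mid_lat, north, mid_lng, east),
    ]
    merged = {}
    for s, n, w, e in quadrants:
        # Keying by id removes sites that sit on a shared quadrant edge.
        merged.update(pull_cell(fetch, s, n, w, e, min_size_deg))
    return merged
```

Each split costs four extra requests, so starting with smaller cells in known dense regions keeps the total request count down.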
4. Rate Limiting and Cost Management
REQUEST_DELAY_SEC = 0.5 # 2 req/sec — conservative
REPORT_EVERY_N = 50 # Progress report every 50 requests
# Track and report:
# - Requests made
# - Unique sites collected
# - Estimated data transfer (MB)
# - Estimated cost (for metered APIs)
# - Elapsed time
For paid APIs (TheDiveAPI), checkpoint progress to a file so runs can be resumed after interruption without repeating completed cells.
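A checkpointed grid loop might look like the following sketch. The checkpoint layout (`done_cells`, `sites`) and the function names are hypothetical; `fetch` is assumed to take a cell's four bounds and return site dicts with an `id`.

```python
import json
import os
import time

REQUEST_DELAY_SEC = 0.5

def load_checkpoint(path):
    """Load saved state, or start fresh if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"done_cells": [], "sites": {}}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)

def run_grid(cells, fetch, path, delay_sec=REQUEST_DELAY_SEC):
    """Pull every cell not already recorded in the checkpoint."""
    state = load_checkpoint(path)
    done = {tuple(c) for c in state["done_cells"]}
    for cell in cells:
        if tuple(cell) in done:
            continue  # already paid for this cell in an earlier run
        for site in fetch(*cell):
            state["sites"][str(site["id"])] = site
        state["done_cells"].append(list(cell))
        save_checkpoint(path, state)
        time.sleep(delay_sec)
    return state
```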
5. Deduplication Strategy
Multi-pass deduplication using name matching + geographic proximity:
Pass 1: Exact name match
- Normalize names: lowercase, strip punctuation, collapse whitespace
- If normalized names match exactly → duplicate
Pass 2: Substring name match + proximity
- If normalized name A contains B or B contains A, AND distance < 5 km → duplicate
- Catches: “Playa Los Tubos, Manatí, PR” vs “Playa Los Tubos, Manatí”
Pass 3: Very close proximity with similar names
- If distance < 300m AND first 5 characters of normalized names match → duplicate
- Catches: “Tabyanas” vs “Tabyannas”, “Old Isaacs” vs “Old Issac’s”
Pass 4: Pure proximity
- If distance < 50m AND any word overlap in names (words >3 chars) → duplicate
Always keep the record with more data (longer description, more dive_types, etc.).
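The four passes can be condensed into a single predicate, sketched below. The caller supplies the distance in km (for example from a haversine function). Pass 1 as written above is name-only, which sits uneasily with the "Blue Hole" warning under Anti-Patterns, so the sketch adds an optional `exact_max_km` bound; that parameter and the function names are hypothetical.

```python
import re

def normalize_name(name):
    """Lowercase, strip punctuation, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def is_duplicate(name_a, name_b, dist_km, exact_max_km=None):
    """Apply passes 1-4 in order; dist_km is the distance between the two sites."""
    a, b = normalize_name(name_a), normalize_name(name_b)
    if not a or not b:
        return False
    # Pass 1: exact name match (optionally bounded by distance; an assumption).
    if a == b and (exact_max_km is None or dist_km < exact_max_km):
        return True
    # Pass 2: one name contains the other and the sites are within 5 km.
    if (a in b or b in a) and dist_km < 5:
        return True
    # Pass 3: within 300 m and the first 5 characters agree.
    if dist_km < 0.3 and a[:5] == b[:5]:
        return True
    # Pass 4: within 50 m and the names share a word longer than 3 characters.
    words_a = {w for w in a.split() if len(w) > 3}
    words_b = {w for w in b.split() if len(w) > 3}
    return dist_km < 0.05 and bool(words_a & words_b)
```

Comparing every pair is quadratic in the number of sites; for tens of thousands of records, bucketing sites by rounded coordinates before comparing is a common way to cut the work.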
Haversine Distance Function
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; missing coordinates count as infinitely far."""
    if any(v is None for v in (lat1, lon1, lat2, lon2)):
        return float("inf")
    R = 6371.0  # mean Earth radius in km
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2 +
         math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) *
         math.sin(dlon / 2) ** 2)
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
6. Reverse Geocoding by Coordinates
When API data has ocean names instead of countries (TheDiveAPI often returns “Caribbean Sea” or “Straits Of Florida” as the country), resolve the actual country using coordinate bounding boxes:
- Build a lookup table of 100+ country/territory bounding boxes: `(name, region, s_lat, n_lat, w_lng, e_lng)`
- Order from most specific (small island territories) to least specific (large countries)
- For each unassigned site, find the first bounding box containing its lat/lng
- Fallback: broad ocean-based region assignment using lat/lng quadrants
This eliminates external geocoding API dependency entirely.
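A first-match lookup over ordered boxes could be sketched as follows. The two boxes and the region labels are rough illustrative values, not the actual lookup table; they are chosen so that a point on Bonaire falls inside both boxes, which shows why the small territory must come first.

```python
# (name, region, s_lat, n_lat, w_lng, e_lng), most specific first.
# Values are approximate and for illustration only.
BOXES = [
    ("Bonaire", "Caribbean", 12.0, 12.35, -68.45, -68.15),
    ("Venezuela", "South America", 0.6, 12.3, -73.4, -59.8),
]

def country_for(lat, lng, boxes=BOXES):
    """Return (country, region) for the first box containing the point, else None."""
    for name, region, s_lat, n_lat, w_lng, e_lng in boxes:
        if s_lat <= lat <= n_lat and w_lng <= lng <= e_lng:
            return name, region
    return None  # caller falls back to the ocean-quadrant assignment
```

Territories that straddle the antimeridian (longitude ±180°) do not fit a single `w_lng <= lng <= e_lng` test and would need to be entered as two boxes.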
7. Data Quality Audit
After every merge, run these checks:
| Check | Action |
|---|---|
| Exact duplicates (name + coords to 4dp) | Remove |
| Near-duplicates (same name, <500m) | Merge |
| Very close similar names (<50m) | Merge |
| Missing coordinates | Remove |
| Zero coordinates (0,0) | Remove |
| HTML entities in names (`&#39;`, `&amp;`) | Decode with `html.unescape()` |
| Very short names (≤2 chars) | Remove |
| Junk/test names (“test”, “fsd”, etc.) | Remove |
| Missing region assignment | Reverse-geocode |
| Excess whitespace | Normalize |
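Several of the per-record checks in the table can be combined into one cleaning function, sketched below. The junk-name set holds only the two examples given above and would need extending; `clean_site` is a hypothetical name. Duplicate detection and reverse geocoding are handled elsewhere and are not part of this sketch.

```python
import html
import re

JUNK_NAMES = {"test", "fsd"}  # examples from the table; extend as needed

def clean_site(site):
    """Return a cleaned copy of the site, or None if it should be removed."""
    # Decode HTML entities, then normalize whitespace.
    name = html.unescape(site.get("name") or "")
    name = re.sub(r"\s+", " ", name).strip()

    lat, lng = site.get("lat"), site.get("lng")
    if lat is None or lng is None:
        return None  # missing coordinates
    if lat == 0 and lng == 0:
        return None  # (0, 0) is a placeholder, not a real dive site
    if len(name) <= 2 or name.lower() in JUNK_NAMES:
        return None  # very short or junk/test name
    return {**site, "name": name}
```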
8. Output Artifacts
| Artifact | Format | Purpose |
|---|---|---|
| `dive-sites.json` | JSON with metadata + `sites[]` | Production database for plugin |
| `new-sites-import.sql` | SQL INSERT | WordPress DB import for new sites |
| `*_raw.json` | JSON | Cached API responses (rerunnable) |
| `*_normalized.json` | JSON | Intermediate normalized + deduped |
| `*_checkpoint.json` | JSON | Resume state for long-running pulls |
9. Attribution Requirements
| Source | License | Required Attribution |
|---|---|---|
| OpenDiveMap | ODbL 1.0 | “© OpenDiveMap contributors” |
| Dive Vibe Community | ODbL 1.0 / MIT | “© contributors” + OSM attribution |
| OpenStreetMap | ODbL 1.0 | “© OpenStreetMap contributors” |
| TheDiveAPI | Commercial (RapidAPI) | Per subscription terms |
Include attribution in database metadata and in any public-facing map tile layers.
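One possible shape for the database metadata is sketched below. The key names (`generated_at`, `source_counts`, `attribution`) and the `openstreetmap` source identifier are assumptions; only sources whose attribution string is spelled out in the table above are included in the lookup.

```python
from collections import Counter
from datetime import datetime, timezone

# Attribution strings taken from the table above, keyed by a
# hypothetical source identifier stored in each site's "source" field.
ATTRIBUTION = {
    "opendivemap": "© OpenDiveMap contributors",
    "openstreetmap": "© OpenStreetMap contributors",
}

def build_database(sites):
    """Wrap merged sites with metadata: counts, attribution, generation date."""
    counts = Counter(site["source"] for site in sites)
    return {
        "metadata": {
            "generated_at": datetime.now(timezone.utc).isoformat(),
            "total_sites": len(sites),
            "source_counts": dict(counts),
            "attribution": [
                ATTRIBUTION[src] for src in sorted(counts) if src in ATTRIBUTION
            ],
        },
        "sites": sites,
    }
```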
Inputs Required
- API credentials (RapidAPI key for TheDiveAPI)
- Existing dive site database path (`dive-sites.json`)
- Target output paths for merged database and SQL
Output Format
- Merged JSON database with metadata (source counts, attribution, generation date)
- SQL import script with CREATE TABLE and INSERT statements
- Progress report with request count, unique sites, estimated cost
- Data quality audit summary (issues found and fixed)
Anti-Patterns
- Using Python `urllib`/`requests` with Cloudflare-protected APIs: TheDiveAPI’s Cloudflare returns 403 for non-browser User-Agents. Use a `curl` subprocess instead.
- No checkpoint for long-running pulls: A 200+ request grid pull can take 5+ minutes. Always checkpoint completed cells.
- Name-only deduplication: Sites in different countries can share names (“Blue Hole” exists in 15+ countries). Always combine name matching with geographic proximity.
- Trusting API country fields: TheDiveAPI often puts ocean names in the country slot. Always validate or reverse-geocode.
- Ignoring HTML encoding: Data from web-scraped sources frequently contains `&#39;` (apostrophe), `&amp;` (ampersand), `&quot;` (quote). Decode before storing.
- Single-source dependency: No one source has complete coverage. The best database merges 3-4 sources after deduplication.
- Unbounded API pulls without cost tracking: Always estimate and report data transfer costs for metered APIs.
Reference Implementation
See products/scuba-gpt/data-pipelines/:
- `07_import_external_sites.py`: OpenDiveMap + Dive Vibe Community pull
- `08_import_thediveapi.py`: TheDiveAPI GPS grid pull with curl subprocess
