Dive Site Data Ingestion

name: dive-site-data-ingestion

description: Ingest, normalize, deduplicate, and merge dive site data from multiple external APIs and open databases (TheDiveAPI, OpenDiveMap, Dive Vibe Community, OpenStreetMap) into a unified georeferenced dive site database. Covers GPS grid-based API traversal, rate-limited bulk pulls, coordinate-based reverse geocoding, multi-source deduplication with haversine proximity matching, HTML entity cleanup, and data quality auditing. Use when expanding a dive site database from external sources, building data pipelines for georeferenced marine data, integrating TheDiveAPI via RapidAPI, pulling from OpenDiveMap’s GeoJSON API, or running quality audits on merged location datasets.

Instructions

Build data ingestion pipelines that pull georeferenced dive site records from multiple external APIs and open databases, normalize them to a common schema, deduplicate across sources, and produce a clean merged database with SQL import scripts.

1. Available Data Sources

TheDiveAPI (RapidAPI — subscription required)

  • Host: world-scuba-diving-sites-api.p.rapidapi.com
  • Auth: RapidAPI key in x-rapidapi-key header
  • Endpoints:
      • GET /divesites/gps?southWestLat=&northEastLat=&southWestLng=&northEastLng= : bounding-box search, max 200 results
      • GET /divesites?country= : text search by country/region (query must be >4 characters)
  • Fields: id, name, region (hierarchical comma-separated), latitude, longitude, ocean, location, description
  • Coverage: ~17,000 sites worldwide
  • Cloudflare note: Python urllib/requests may receive HTTP 403 (error code 1010). Use curl as a subprocess instead.
  • Rate limit: Check your RapidAPI plan (e.g., 25 req/sec). Recommended: 2 req/sec to stay well within limits.
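
The curl workaround from the Cloudflare note can be sketched as follows. This is a minimal illustration, not the reference implementation: the function names are hypothetical, the x-rapidapi-host header is an assumption based on how RapidAPI requests are usually formed (the notes above only mention x-rapidapi-key), and the response is returned as parsed JSON without assuming its shape.

```python
import json
import subprocess
from urllib.parse import urlencode

API_HOST = "world-scuba-diving-sites-api.p.rapidapi.com"


def build_curl_cmd(api_key, south, north, west, east):
    """Return the curl argument list for one bounding-box query."""
    query = urlencode({
        "southWestLat": south,
        "northEastLat": north,
        "southWestLng": west,
        "northEastLng": east,
    })
    return [
        "curl", "--silent", "--show-error", "--fail", "--max-time", "30",
        "-H", f"x-rapidapi-key: {api_key}",
        "-H", f"x-rapidapi-host: {API_HOST}",  # ASSUMPTION: usual RapidAPI header
        f"https://{API_HOST}/divesites/gps?{query}",
    ]


def fetch_bbox(api_key, south, north, west, east):
    """Run curl and parse the body; --fail makes HTTP errors raise CalledProcessError."""
    cmd = build_curl_cmd(api_key, south, north, west, east)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

Passing the command as a list (rather than a shell string) avoids quoting problems with the API key and query string.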

OpenDiveMap (free, no auth)

  • Base URL: https://api.opendivemap.com/v1
  • Endpoints:
      • GET /sites?limit=1000&offset=0 : paginated GeoJSON FeatureCollection
      • GET /stats : aggregate counts
      • GET /enums : environment, topology, entry enum values
  • Fields: name, coordinates (GeoJSON [lng, lat]), country_code, country_name, sea_name, environment, topologies[], max_depth, entry
  • Coverage: ~3,100 sites across 59 countries
  • License: ODbL 1.0 — requires attribution
  • Rate limit: No documented limit, but use 2-second delays between paginated calls
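
A paginated pull of /sites with the 2-second delay might look like the sketch below. It assumes the final page is signalled by a response with fewer than `limit` features, which the notes above do not confirm; `fetch_page` and `iter_features` are illustrative names.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.opendivemap.com/v1"


def fetch_page(limit, offset):
    """Fetch one page of the /sites GeoJSON FeatureCollection."""
    url = f"{BASE_URL}/sites?limit={limit}&offset={offset}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_features(fetch=fetch_page, limit=1000, delay_sec=2.0):
    """Yield features page by page; stop after the first short page (ASSUMED end marker)."""
    offset = 0
    while True:
        features = fetch(limit, offset).get("features", [])
        yield from features
        if len(features) < limit:
            break
        offset += limit
        time.sleep(delay_sec)  # be polite: no documented limit
```

Injecting `fetch` keeps the pagination logic testable without network access.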

Dive Vibe Community (GitHub — free)

  • Repository: jbunderwater/dive-vibe-community
  • Data URL pattern: https://raw.githubusercontent.com/jbunderwater/dive-vibe-community/main/divesites/{slug}/index.json
  • Destinations list: https://raw.githubusercontent.com/jbunderwater/dive-vibe-community/main/destinations.json
  • Fields: name, latitude, longitude, site_type, difficulty, depth, entry_type
  • Coverage: ~2,700 sites across 122+ destinations
  • License: ODbL 1.0 (data) / MIT (code)
  • Source: OpenStreetMap + AI-validated corrections
  • Note: Parent-level slugs (e.g., india, us) return 404; use specific destination slugs (e.g., bali, florida-keys)
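
The 404 behaviour of parent-level slugs can be handled as in this sketch. The helper names are hypothetical, and the structure of index.json is not assumed: the parsed document is returned as-is.

```python
import json
import urllib.error
import urllib.request

RAW_BASE = "https://raw.githubusercontent.com/jbunderwater/dive-vibe-community/main"


def destination_url(slug):
    """Build the raw GitHub URL for one destination's index.json."""
    return f"{RAW_BASE}/divesites/{slug}/index.json"


def fetch_destination(slug):
    """Return the parsed index.json, or None when the slug has no data file."""
    try:
        with urllib.request.urlopen(destination_url(slug), timeout=30) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # parent-level slug such as "india" or "us"
            return None
        raise  # anything else is a real failure
```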

OpenStreetMap via Overpass API (free, no auth)

  • Tags: sport=scuba_diving, amenity=dive_centre, shop=scuba_diving
  • Use case: Gap-filling only — OpenDiveMap and Dive Vibe already curate OSM data
  • Note: Raw OSM includes dive shops mixed with dive sites; requires filtering
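
For gap-filling, one possible Overpass QL query and shop filter are sketched below. The choice to query only sport=scuba_diving and then drop elements that also carry a shop or dive-centre tag is an assumption about what "requires filtering" means here; the function names are illustrative.

```python
# Tag pairs that mark a business rather than a dive site.
SHOP_TAGS = (("amenity", "dive_centre"), ("shop", "scuba_diving"))


def overpass_query(south, west, north, east, timeout_sec=60):
    """Build an Overpass QL query; Overpass bounding boxes are (south, west, north, east)."""
    bbox = f"({south},{west},{north},{east})"
    return (
        f"[out:json][timeout:{timeout_sec}];\n"
        "(\n"
        f'  node["sport"="scuba_diving"]{bbox};\n'
        f'  way["sport"="scuba_diving"]{bbox};\n'
        ");\n"
        "out center;"  # "center" adds a single point for ways
    )


def is_dive_site(tags):
    """Keep elements tagged as a diving spot that are not also tagged as a shop or centre."""
    if tags.get("sport") != "scuba_diving":
        return False
    return not any(tags.get(key) == value for key, value in SHOP_TAGS)
```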

2. Common Target Schema

Normalize all sources to this structure:


{
  "name": "El Bajo Seamount",
  "country": "Mexico",
  "region": "Caribbean",
  "location": "24.21560, -110.39810",
  "lat": 24.2156,
  "lng": -110.3981,
  "dive_types": ["pinnacle"],
  "depth_min_m": null,
  "depth_max_m": 40,
  "cert_level": "AOW",
  "marine_life": [],
  "description": "",
  "source": "opendivemap"
}

Field mapping by source:

| Field | TheDiveAPI | OpenDiveMap | Dive Vibe |
| --- | --- | --- | --- |
| name | name | properties.name | name |
| lat | latitude | geometry.coordinates[1] | latitude |
| lng | longitude | geometry.coordinates[0] | longitude |
| country | Parsed from region (2nd comma-part) | properties.country_name | _destination_meta.countryCode → lookup |
| region | Mapped from country | Mapped from country | _destination_meta.region → mapped |
| dive_types | Not provided | properties.topologies | site_type |
| depth_max_m | Not provided | properties.max_depth | depth |
| cert_level | Derived from depth | Derived from depth | difficulty → mapped |
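
As an illustration of the mapping, the sketch below normalizes one OpenDiveMap GeoJSON feature to the target schema. The depth thresholds in `cert_level_from_depth` are assumptions chosen so that a 40 m site maps to "AOW" as in the schema example; the source only states that the level is derived from depth. The region is left empty because the country-to-region table is not shown here.

```python
def cert_level_from_depth(depth_max_m):
    """Map maximum depth to a certification level (ASSUMED thresholds)."""
    if depth_max_m is None:
        return None
    if depth_max_m <= 18:
        return "OW"
    if depth_max_m <= 40:
        return "AOW"
    return "TEC"


def normalize_opendivemap(feature):
    """Convert one GeoJSON feature to the common target schema."""
    props = feature.get("properties", {})
    lng, lat = feature["geometry"]["coordinates"][:2]  # GeoJSON order is [lng, lat]
    depth = props.get("max_depth")
    return {
        "name": (props.get("name") or "").strip(),
        "country": props.get("country_name"),
        "region": None,  # filled later by the country -> region mapping
        "location": f"{lat:.5f}, {lng:.5f}",
        "lat": lat,
        "lng": lng,
        "dive_types": list(props.get("topologies") or []),
        "depth_min_m": None,
        "depth_max_m": depth,
        "cert_level": cert_level_from_depth(depth),
        "marine_life": [],
        "description": "",
        "source": "opendivemap",
    }
```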

3. GPS Grid Strategy for Bounding-Box APIs

When an API caps results per call (TheDiveAPI: 200), tile the world systematically:

  1. Define grid cells covering coastal/ocean regions — skip landlocked areas
  2. Use 5° cells for dense areas (Caribbean, SE Asia, Mediterranean, Red Sea)
  3. Use 10-20° cells for sparse areas (open Pacific, polar)
  4. Adaptive subdivision: If a cell returns max results (200), split into 4 quadrants and re-query each
  5. Supplement with country text queries to catch sites the grid may miss

Recommended grid: ~115 cells cover all major diving regions worldwide.
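
The adaptive subdivision in step 4 can be sketched as a small recursive function. Here `fetch` is a stand-in for the bounding-box request and is assumed to return a list of dicts with an `id` field; `min_span_deg` is an added safety stop (not part of the steps above) so that a cluster of more than 200 sites at one point cannot recurse forever.

```python
def split_quadrants(cell):
    """Split a (south, west, north, east) cell into four equal quadrants."""
    south, west, north, east = cell
    mid_lat, mid_lng = (south + north) / 2, (west + east) / 2
    return [
        (south, west, mid_lat, mid_lng),
        (south, mid_lng, mid_lat, east),
        (mid_lat, west, north, mid_lng),
        (mid_lat, mid_lng, north, east),
    ]


def collect_cell(cell, fetch, cap=200, min_span_deg=0.01):
    """Return {id: site} for a cell, re-querying quadrants when the result cap is hit."""
    south, west, north, east = cell
    sites = fetch(cell)
    too_small = min(north - south, east - west) <= min_span_deg
    if len(sites) < cap or too_small:
        return {site["id"]: site for site in sites}
    merged = {}
    for quadrant in split_quadrants(cell):
        # Keying by id removes sites that sit on a shared quadrant edge.
        merged.update(collect_cell(quadrant, fetch, cap, min_span_deg))
    return merged
```

A capped response is discarded rather than merged because the quadrant queries are expected to return a superset of it.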

4. Rate Limiting and Cost Management


REQUEST_DELAY_SEC = 0.5   # 2 req/sec — conservative
REPORT_EVERY_N = 50       # Progress report every 50 requests

# Track and report:
# - Requests made
# - Unique sites collected
# - Estimated data transfer (MB)
# - Estimated cost (for metered APIs)
# - Elapsed time

For paid APIs (TheDiveAPI), checkpoint progress to a file so runs can be resumed after interruption without repeating completed cells.
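
A minimal checkpointing loop, under the assumption that the checkpoint is a single JSON file holding completed cells and collected sites (the file layout and function names are illustrative, not those of the reference scripts):

```python
import json
import os
import time


def load_checkpoint(path):
    """Load saved state, or start fresh when no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"done_cells": [], "sites": {}}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_checkpoint(path, state):
    """Write to a temp file first so an interrupted write cannot corrupt the checkpoint."""
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp_path, path)


def run_grid(cells, fetch, path, delay_sec=0.5):
    """Pull every cell not yet recorded in the checkpoint, saving after each one."""
    state = load_checkpoint(path)
    done = {tuple(cell) for cell in state["done_cells"]}
    for cell in cells:
        if tuple(cell) in done:
            continue  # already paid for this request in an earlier run
        for site in fetch(cell):
            state["sites"][str(site["id"])] = site
        state["done_cells"].append(list(cell))
        save_checkpoint(path, state)
        time.sleep(delay_sec)
    return state
```

Saving after every cell costs one small file write per request, which is negligible next to the 0.5-second request delay.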

5. Deduplication Strategy

Multi-pass deduplication using name matching + geographic proximity:

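Pass 1: Exact name match

  • Normalize names: lowercase, strip punctuation, collapse whitespace
  • If normalized names match exactly → duplicate
  • Only compare records that are plausibly the same place (for example, same country): identical names also occur at unrelated sites, such as “Blue Hole”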

Pass 2: Substring name match + proximity

  • If normalized name A contains B or B contains A, AND distance < 5 km → duplicate
  • Catches: “Playa Los Tubos, Manatí, PR” vs “Playa Los Tubos, Manatí”

Pass 3: Very close proximity with similar names

  • If distance < 300 m AND first 5 characters of normalized names match → duplicate
  • Catches: “Tabyanas” vs “Tabyannas”, “Old Isaacs” vs “Old Issac’s”

Pass 4: Near-identical position with shared words

  • If distance < 50 m AND the names share at least one word longer than 3 characters → duplicate

Always keep the record with more data (longer description, more dive_types, etc.).
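
The four passes can be expressed as one pairwise predicate, sketched below. Two things are assumptions rather than part of the pass list: the `exact_name_max_km` guard on pass 1 (added because identical names such as "Blue Hole" occur in many countries; set it to None for name-only matching) and the crude completeness score in `richer`. Records are assumed to have non-null `lat` and `lng`.

```python
import math
import re


def normalize_name(name):
    """Lowercase, strip punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", name.lower()).split())


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (compact form of the function in this section)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def long_words(name):
    return {word for word in name.split() if len(word) > 3}


def is_duplicate(a, b, exact_name_max_km=50.0):
    """Apply passes 1-4 to two site records; exact_name_max_km is an ASSUMED guard."""
    na, nb = normalize_name(a["name"]), normalize_name(b["name"])
    if not na or not nb:
        return False
    km = haversine_km(a["lat"], a["lng"], b["lat"], b["lng"])
    if na == nb and (exact_name_max_km is None or km < exact_name_max_km):
        return True                                    # pass 1
    if (na in nb or nb in na) and km < 5.0:
        return True                                    # pass 2
    if km < 0.3 and na[:5] == nb[:5]:
        return True                                    # pass 3
    return km < 0.05 and bool(long_words(na) & long_words(nb))  # pass 4


def richer(a, b):
    """Keep the record with more data (description length plus dive_types count)."""
    def score(site):
        return len(site.get("description") or "") + len(site.get("dive_types") or [])
    return a if score(a) >= score(b) else b
```

Comparing every pair is quadratic in the number of sites; for tens of thousands of records it may be worth bucketing sites by rounded coordinates first, which is left out of this sketch.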

Haversine Distance Function


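import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in decimal degrees."""
    # Missing coordinates can never match: treat them as infinitely far apart.
    if any(v is None for v in (lat1, lon1, lat2, lon2)):
        return float("inf")
    R = 6371.0  # mean Earth radius in km
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat/2)**2 +
         math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) *
         math.sin(dlon/2)**2)
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))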

6. Reverse Geocoding by Coordinates

When API data has ocean names instead of countries (TheDiveAPI often returns “Caribbean Sea” or “Straits Of Florida” as the country), resolve the actual country using coordinate bounding boxes:

  1. Build a lookup table of ~100+ country/territory bounding boxes (name, region, s_lat, n_lat, w_lng, e_lng)
  2. Order from most specific (small island territories) to least specific (large countries)
  3. For each unassigned site, find the first bounding box containing its lat/lng
  4. Fallback: broad ocean-based region assignment using lat/lng quadrants

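This removes the dependency on an external geocoding API. Bounding boxes overlap and include neighboring waters, so the ordering in step 2 decides which country wins; spot-check sites near borders after assignment.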
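
Steps 1 to 3 can be sketched as a first-match lookup over boxes sorted by area. The three boxes below are rough, illustrative values rather than a usable table, sorting by area in square degrees is an assumed stand-in for "most specific first", and the quadrant fallback in step 4 is omitted. Boxes that cross the antimeridian are not handled.

```python
# (name, region, s_lat, n_lat, w_lng, e_lng): ROUGH illustrative boxes only
BOXES = [
    ("Mexico", "Caribbean", 14.5, 32.8, -118.5, -86.7),
    ("Venezuela", "Caribbean", 0.6, 12.3, -73.4, -59.8),
    ("Bonaire", "Caribbean", 12.0, 12.35, -68.45, -68.15),
]


def box_area(box):
    """Area in square degrees, used only to order boxes from small to large."""
    _, _, s_lat, n_lat, w_lng, e_lng = box
    return (n_lat - s_lat) * (e_lng - w_lng)


ORDERED_BOXES = sorted(BOXES, key=box_area)  # small territories first


def reverse_geocode(lat, lng, boxes=ORDERED_BOXES):
    """Return (country, region) of the first box containing the point, else (None, None)."""
    for name, region, s_lat, n_lat, w_lng, e_lng in boxes:
        if s_lat <= lat <= n_lat and w_lng <= lng <= e_lng:
            return name, region
    return None, None
```

Bonaire lies inside Venezuela's rough box, so it resolves correctly only because the smaller box is checked first.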

7. Data Quality Audit

After every merge, run these checks:

| Check | Action |
| --- | --- |
| Exact duplicates (name + coordinates to 4 decimal places) | Remove |
| Near-duplicates (same name, <500 m) | Merge |
| Very close similar names (<50 m) | Merge |
| Missing coordinates | Remove |
| Zero coordinates (0,0) | Remove |
| HTML entities in names (&#39;, &amp;) | Decode with html.unescape() |
| Very short names (≤2 chars) | Remove |
| Junk/test names (“test”, “fsd”, etc.) | Remove |
| Missing region assignment | Reverse-geocode |
| Excess whitespace | Normalize |
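
The removal and cleanup checks can be sketched as a single pass over the merged list. This covers only the "Remove", "Decode", and "Normalize" rows; the merge and reverse-geocode rows are left out. The junk-name set contains just the two examples given above, and the function names are illustrative.

```python
import html

JUNK_NAMES = {"test", "fsd"}  # extend with other placeholder names seen in the data


def clean_name(name):
    """Decode HTML entities, then trim and collapse whitespace."""
    return " ".join(html.unescape(name or "").split())


def audit(sites):
    """Return (kept_sites, removed) where removed is a list of (name, reason)."""
    kept, removed, seen = [], [], set()
    for site in sites:
        site = {**site, "name": clean_name(site.get("name"))}
        lat, lng = site.get("lat"), site.get("lng")
        key = None
        if lat is None or lng is None:
            reason = "missing coordinates"
        elif lat == 0 and lng == 0:
            reason = "zero coordinates"
        elif len(site["name"]) <= 2:
            reason = "very short name"
        elif site["name"].lower() in JUNK_NAMES:
            reason = "junk name"
        else:
            key = (site["name"].lower(), round(lat, 4), round(lng, 4))
            reason = "exact duplicate" if key in seen else None
        if reason:
            removed.append((site["name"], reason))
            continue
        seen.add(key)
        kept.append(site)
    return kept, removed
```

Returning the removal reasons makes it straightforward to produce the audit summary listed under Output Format.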

8. Output Artifacts

| Artifact | Format | Purpose |
| --- | --- | --- |
| dive-sites.json | JSON with metadata + sites[] | Production database for plugin |
| new-sites-import.sql | SQL INSERT | WordPress DB import for new sites |
| *_raw.json | JSON | Cached API responses (rerunnable) |
| *_normalized.json | JSON | Intermediate normalized + deduped |
| *_checkpoint.json | JSON | Resume state for long-running pulls |
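
Generating the INSERT statements for new-sites-import.sql might look like the sketch below. The table name `dive_sites` and the column subset are hypothetical, since the actual WordPress table layout is not described here. Escaping doubles single quotes and backslashes, which suits MySQL's default mode; when a live database connection is available, parameterized queries are the safer choice.

```python
COLUMNS = ("name", "country", "region", "lat", "lng", "depth_max_m", "source")


def sql_literal(value):
    """Render a Python value as a SQL literal (strings, numbers, and None only)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return repr(value)
    text = str(value).replace("\\", "\\\\").replace("'", "''")
    return f"'{text}'"


def insert_statement(site, table="dive_sites"):
    """Build one INSERT for a normalized site; `table` is a HYPOTHETICAL name."""
    values = ", ".join(sql_literal(site.get(column)) for column in COLUMNS)
    return f"INSERT INTO {table} ({', '.join(COLUMNS)}) VALUES ({values});"
```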

9. Attribution Requirements

| Source | License | Required Attribution |
| --- | --- | --- |
| OpenDiveMap | ODbL 1.0 | “© OpenDiveMap contributors” |
| Dive Vibe Community | ODbL 1.0 / MIT | “© contributors” + OSM attribution |
| OpenStreetMap | ODbL 1.0 | “© OpenStreetMap contributors” |
| TheDiveAPI | Commercial (RapidAPI) | Per subscription terms |

Include attribution in database metadata and in any public-facing map tile layers.
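
One way to carry attribution in the database metadata is sketched below. The metadata key names and the `openstreetmap` source key are assumptions; only the two attribution strings that the table states in full are included, and the remaining sources would be added per their terms.

```python
from collections import Counter
from datetime import date

# Attribution strings taken from the table above, keyed by the site's "source" value.
ATTRIBUTION = {
    "opendivemap": "© OpenDiveMap contributors",
    "openstreetmap": "© OpenStreetMap contributors",
}


def build_database(sites, today=None):
    """Wrap the merged sites with source counts, attribution, and generation date."""
    counts = Counter(site["source"] for site in sites)
    return {
        "metadata": {
            "generated": (today or date.today()).isoformat(),
            "total_sites": len(sites),
            "source_counts": dict(counts),
            "attribution": sorted(ATTRIBUTION[s] for s in counts if s in ATTRIBUTION),
        },
        "sites": sites,
    }
```

Deriving the attribution list from the sources actually present keeps the metadata accurate when a source is dropped from a later merge.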

Inputs Required

  • API credentials (RapidAPI key for TheDiveAPI)
  • Existing dive site database path (dive-sites.json)
  • Target output paths for merged database and SQL

Output Format

  • Merged JSON database with metadata (source counts, attribution, generation date)
  • SQL import script with CREATE TABLE and INSERT statements
  • Progress report with request count, unique sites, estimated cost
  • Data quality audit summary (issues found and fixed)

Anti-Patterns

  • Using Python urllib/requests with Cloudflare-protected APIs: TheDiveAPI’s Cloudflare returns 403 for non-browser User-Agents. Use curl subprocess instead.
  • No checkpoint for long-running pulls: A 200+ request grid pull can take 5+ minutes. Always checkpoint completed cells.
  • Name-only deduplication: Sites in different countries can share names (“Blue Hole” exists in 15+ countries). Always combine name matching with geographic proximity.
  • Trusting API country fields: TheDiveAPI often puts ocean names in the country slot. Always validate or reverse-geocode.
  • Ignoring HTML encoding: Data from web-scraped sources frequently contains &#39; (apostrophe), &amp; (ampersand), &quot; (quote). Decode before storing.
  • Single-source dependency: No one source has complete coverage. The best database merges 3-4 sources after deduplication.
  • Unbounded API pulls without cost tracking: Always estimate and report data transfer costs for metered APIs.

Reference Implementation

See products/scuba-gpt/data-pipelines/:

  • 07_import_external_sites.py — OpenDiveMap + Dive Vibe Community pull
  • 08_import_thediveapi.py — TheDiveAPI GPS grid pull with curl subprocess