
Email Parsing — Travel Bookings

name: email-parsing-travel-bookings

description: Parsing airline, hotel, restaurant, and car rental confirmation emails into structured reservation data. Covers pattern matching for major booking providers, date/time extraction with timezone handling, confirmation number identification, multi-leg itinerary parsing, and reservation data normalization into a canonical schema. Use when extracting travel bookings from email, building automatic itinerary builders, parsing confirmation numbers from booking emails, or normalizing reservation data across providers.


Instructions

Email Source Access

  1. Access methods (in order of preference):
  • Gmail API: structured access, search operators, rich metadata
  • IMAP: universal, works with any email provider
  • Email forwarding: user forwards confirmations to a dedicated address
  • Manual paste: user copies email text into the app
  2. Email identification (finding booking emails):
  • Search Gmail: from:(united.com OR delta.com OR aa.com) subject:(confirmation OR itinerary OR receipt)
  • IMAP search: SUBJECT "confirmation" FROM "bookings"
  • Filter by known sender domains (see provider registry below)
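As a rough sketch, the Gmail search string shown above can be assembled from the registry's sender domains rather than typed by hand. `buildGmailQuery` is a hypothetical helper, not part of the Gmail API; it only produces the text that would be passed as the `q` search parameter when listing messages.

```typescript
// Hypothetical helper: builds a Gmail search string from sender domains
// and subject keywords. The output is intended for the `q` search parameter.
function buildGmailQuery(domains: string[], keywords: string[]): string {
  if (domains.length === 0 || keywords.length === 0) {
    throw new Error("at least one domain and one keyword are required");
  }
  const from = "from:(" + domains.join(" OR ") + ")";
  const subject = "subject:(" + keywords.join(" OR ") + ")";
  return from + " " + subject;
}

const query = buildGmailQuery(
  ["united.com", "delta.com", "aa.com"],
  ["confirmation", "itinerary", "receipt"],
);
console.log(query);
```

Deriving the query from the registry keeps the search and the parser in step: adding a provider in one place extends both.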

Provider Registry

Maintain a registry of known booking providers with parsing patterns:

Airlines

Provider | Sender Domain(s) | Confirmation Pattern
United | united.com | [A-Z0-9]{6} (6-char alphanumeric)
Delta | delta.com | [A-Z0-9]{6}
American | aa.com, americanairlines.com | [A-Z]{6}
Southwest | southwest.com | [A-Z0-9]{6}
JetBlue | jetblue.com | [A-Z]{6}
Alaska | alaskaair.com | [A-Z]{6}

Hotels

Provider | Sender Domain(s) | Confirmation Pattern
Marriott | marriott.com | [0-9]{8,9}
Hilton | hilton.com | [0-9]{9}
Hyatt | hyatt.com | [A-Z0-9]+
IHG | ihg.com | [0-9]{8,}
Airbnb | airbnb.com | HM[A-Z0-9]+
Booking.com | booking.com | [0-9]{8,10}

Restaurants

Provider | Sender Domain(s) | Confirmation Pattern
OpenTable | opentable.com | Reservation ID in URL
Resy | resy.com | Reservation ID in URL

Car Rental

Provider | Sender Domain(s) | Confirmation Pattern
Enterprise | enterprise.com | [A-Z0-9]+
Hertz | hertz.com | [A-Z0-9]+
Avis | avis.com | [0-9]+

Aggregators

Provider | Sender Domain(s)
Expedia | expedia.com
Kayak | kayak.com
Google Travel | google.com
TripIt | tripit.com
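One possible in-code shape for the registry is sketched below. The `ProviderEntry` type, the `PROVIDERS` array, and both helper functions are illustrative names rather than an existing API, and only a few rows from the tables are included. The sketch assumes booking emails often come from subdomains of the registered domain, so lookups match subdomains as well.

```typescript
type BookingType = "flight" | "hotel" | "restaurant" | "car_rental" | "aggregator";

interface ProviderEntry {
  name: string;
  type: BookingType;
  senderDomains: string[];
  // Anchored, non-global pattern; omitted when the ID only appears in a URL.
  confirmationPattern?: RegExp;
}

// A few rows from the tables above; a fuller registry would carry all of them.
const PROVIDERS: ProviderEntry[] = [
  { name: "United", type: "flight", senderDomains: ["united.com"], confirmationPattern: /^[A-Z0-9]{6}$/ },
  { name: "American", type: "flight", senderDomains: ["aa.com", "americanairlines.com"], confirmationPattern: /^[A-Z]{6}$/ },
  { name: "Hilton", type: "hotel", senderDomains: ["hilton.com"], confirmationPattern: /^[0-9]{9}$/ },
  { name: "OpenTable", type: "restaurant", senderDomains: ["opentable.com"] },
  { name: "Expedia", type: "aggregator", senderDomains: ["expedia.com"] },
];

// Accepts "user@host" or "Name <user@host>" and matches the registered
// domain itself or any subdomain of it.
function findProvider(senderAddress: string): ProviderEntry | undefined {
  const at = senderAddress.lastIndexOf("@");
  if (at < 0) return undefined;
  const domain = senderAddress.slice(at + 1).replace(/[>\s].*$/, "").toLowerCase();
  const hits = PROVIDERS.filter((p) =>
    p.senderDomains.some((d) => domain === d || domain.slice(-(d.length + 1)) === "." + d),
  );
  return hits[0];
}

// Providers without a pattern cannot be validated this way, so they return false.
function isValidConfirmation(provider: ProviderEntry, value: string): boolean {
  return provider.confirmationPattern ? provider.confirmationPattern.test(value) : false;
}
```

Anchoring each pattern with `^` and `$` matters here: the table patterns describe the whole confirmation number, and an unanchored `[A-Z0-9]{6}` would also accept any longer string.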

Parsing Strategy

Use a multi-pass extraction approach:

Pass 1: Structured Data Extraction

  1. Check for schema.org markup (present in many airline and hotel emails):

   <script type="application/ld+json">
   {
     "@context": "http://schema.org",
     "@type": "FlightReservation",
     "reservationNumber": "ABC123",
     "reservationFor": {
       "@type": "Flight",
       "flightNumber": "UA1234",
       "departureAirport": { "@type": "Airport", "iataCode": "ATL" },
       "arrivalAirport": { "@type": "Airport", "iataCode": "CDG" }
     }
   }
   </script>
  • If present, this is the most reliable source; parse it first
  • Google surfaces this data in Gmail; it follows schema.org vocabulary
  2. Check the MIME parts for calendar attachments:
  • .ics attachments (MIME type text/calendar) contain structured event data
  • Parse them with an iCalendar parser to extract dates, times, and locations
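The JSON-LD check in Pass 1 could look roughly like the sketch below. `extractJsonLd` and `findReservations` are hypothetical names, and the regex-based scan is an assumption made to keep the example dependency-free; an HTML parser would be the sturdier choice in practice.

```typescript
interface JsonLdNode {
  "@type"?: string | string[];
  [key: string]: unknown;
}

// Collects every JSON-LD object embedded in an HTML email body.
function extractJsonLd(html: string): JsonLdNode[] {
  const nodes: JsonLdNode[] = [];
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let match: RegExpExecArray | null;
  while ((match = re.exec(html)) !== null) {
    try {
      const parsed: unknown = JSON.parse(String(match[1]));
      // A block may hold a single object or an array of objects.
      const items: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
      for (const item of items) {
        if (typeof item === "object" && item !== null) nodes.push(item as JsonLdNode);
      }
    } catch {
      // Malformed JSON-LD: skip it and leave the email to the later passes.
    }
  }
  return nodes;
}

// Keeps nodes whose @type ends in "Reservation" (FlightReservation,
// LodgingReservation, and so on).
function findReservations(nodes: JsonLdNode[]): JsonLdNode[] {
  return nodes.filter((node) => {
    const raw = node["@type"];
    const types: unknown[] = Array.isArray(raw) ? raw : [raw];
    return types.some((t) => typeof t === "string" && /Reservation$/.test(t));
  });
}
```

Swallowing the parse error is deliberate: a broken JSON-LD block should not abort extraction, since Pass 2 can still read the body text.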

Pass 2: Pattern-Based Text Extraction

When structured data is unavailable, parse the email body:

  1. Confirmation number extraction:
  • Search for labels: “Confirmation”, “Booking Reference”, “Record Locator”, “PNR”, “Reservation #”
  • Extract the adjacent alphanumeric string matching the provider’s known pattern
  • Validate length and format against the provider registry
  2. Flight details extraction:

   Patterns to match:
   - "Flight [A-Z0-9]{2}[0-9]{1,4}" → flight number (airline codes can contain a digit, e.g. B6 for JetBlue)
   - "[A-Z]{3}" near "depart" or "arrive" → airport codes
   - Date patterns: "April 18, 2026", "04/18/2026", "18 Apr 2026"
   - Time patterns: "7:00 PM", "19:00", "7:00pm"
   - "Terminal [A-Z0-9]+" → terminal
   - "Gate [A-Z0-9]+" → gate
   - "Seat [0-9]+[A-Z]" → seat assignment
  3. Hotel details extraction:

   Patterns to match:
   - Hotel name: text following "Hotel:", "Property:", or in subject line
   - Check-in/out: dates following "Check-in", "Check-out", "Arrival", "Departure"
   - Address: multi-line block with city, state, zip pattern
   - Room type: text following "Room Type:", "Accommodation:"
   - Rate: dollar amount following "Rate:", "Total:", "Nightly Rate:"
  4. Restaurant details extraction:

   Patterns to match:
   - Restaurant name: typically in subject line or header
   - Party size: number following "Party of", "Guests:", "Covers:"
   - Date/time: combined date-time extraction
   - Address: multi-line block or Google Maps link
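A minimal sketch of the label-then-validate approach for confirmation numbers, plus flight-number matching, might look like this. The function names are hypothetical. The sketch assumes confirmation codes are printed in uppercase, which is why the label match is case-insensitive but the token match is not; otherwise an ordinary word following "confirmation" could pass a 6-character pattern.

```typescript
const CONFIRMATION_LABELS = [
  "Confirmation", "Booking Reference", "Record Locator", "PNR", "Reservation #",
];

function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Looks for a known label, reads the uppercase token right after it, and
// accepts the token only if it matches the provider's anchored pattern.
function extractConfirmation(body: string, providerPattern: RegExp): string | null {
  for (const label of CONFIRMATION_LABELS) {
    const labelRe = new RegExp(
      escapeRegExp(label) + "(?:\\s+(?:number|code))?\\s*[:#-]?\\s*",
      "gi",
    );
    let hit: RegExpExecArray | null;
    while ((hit = labelRe.exec(body)) !== null) {
      const rest = body.slice(hit.index + hit[0].length);
      const token = /^[A-Z0-9]+\b/.exec(rest);
      if (token !== null && providerPattern.test(token[0])) return token[0];
    }
  }
  return null;
}

// The airline code is two characters with at least one letter, so a purely
// numeric string such as "Flight 1234" is not split into code and number.
function extractFlightNumbers(body: string): string[] {
  const re = /\bFlight\s+([A-Z][A-Z0-9]|[0-9][A-Z])\s?([0-9]{1,4})\b/g;
  const found: string[] = [];
  let hit: RegExpExecArray | null;
  while ((hit = re.exec(body)) !== null) {
    found.push(String(hit[1]) + String(hit[2]));
  }
  return found;
}
```

`providerPattern` is expected to be anchored and created without the `g` flag, since `test` on a global regex keeps state between calls.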

Pass 3: AI-Assisted Extraction (Fallback)

When pattern matching fails:

  1. Send email text to Claude with a structured extraction prompt:

   Extract travel booking details from this email. Return JSON:
   {
     "type": "flight|hotel|restaurant|car_rental|activity",
     "confirmation_number": "string",
     "provider": "string",
     "dates": { "start": "ISO-8601", "end": "ISO-8601" },
     "details": { ... type-specific fields ... }
   }

   Email text:
   {email_body}
  2. Validate AI output against the provider registry and date sanity checks
  3. Flag low-confidence extractions for user review
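The validation step could be sketched as below. It does not call any model API; it only checks a JSON string that is assumed to be the model's reply to the prompt above. `validateAiExtraction` is a hypothetical name, and the verbatim-substring check is an added assumption intended to catch invented confirmation numbers.

```typescript
interface AiExtraction {
  type: string;
  confirmation_number: string | null;
  provider: string;
  dates: { start: string; end: string };
}

const ALLOWED_TYPES = ["flight", "hotel", "restaurant", "car_rental", "activity"];

// Returns a list of problems; an empty list means every check passed.
function validateAiExtraction(rawJson: string, emailBody: string, providerPattern?: RegExp): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    return ["response is not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null) return ["response is not a JSON object"];
  const data = parsed as Partial<AiExtraction>;
  const problems: string[] = [];

  if (typeof data.type !== "string" || ALLOWED_TYPES.indexOf(data.type) < 0) {
    problems.push("unknown booking type");
  }
  const code = data.confirmation_number;
  if (typeof code === "string" && code.length > 0) {
    if (providerPattern && !providerPattern.test(code)) {
      problems.push("confirmation number does not match provider pattern");
    }
    // A real code should appear verbatim somewhere in the email.
    if (emailBody.indexOf(code) < 0) {
      problems.push("confirmation number not found in email text");
    }
  }
  const start = Date.parse(data.dates?.start ?? "");
  const end = Date.parse(data.dates?.end ?? "");
  if (isNaN(start) || isNaN(end)) {
    problems.push("dates are not parseable ISO-8601");
  } else if (start > end) {
    problems.push("start date is after end date");
  }
  return problems;
}
```

Any non-empty result is a natural trigger for the low-confidence flag in step 3.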

Date and Time Handling

  1. Timezone resolution:
  • Airport codes → timezone (map each IATA airport code to an IANA time zone, e.g. ATL → America/New_York, via an airport lookup table)
  • Hotel city → timezone (use Google Timezone API or static lookup)
  • If no timezone context: flag for user confirmation
  2. Date normalization: convert all dates to ISO-8601 with timezone offset
  3. Multi-leg timezone handling: departure time in departure timezone, arrival in arrival timezone
  4. Date sanity checks:
  • Check-in before check-out
  • Flight departure before arrival (accounting for timezone changes and date line)
  • Reservation date in the future (or recent past for receipts)
  • Duration within reasonable bounds (flights: 1–20 hours; hotels: 1–30 nights)
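The sanity checks above reduce to comparing instants in UTC once every timestamp carries its offset. The sketch below shows that idea with hypothetical helper names; the 1 to 20 hour and 1 to 30 night bounds are taken from the list above.

```typescript
// Both inputs are ISO-8601 strings with an explicit offset, so Date.parse
// yields comparable UTC instants whatever the local zones are.
function durationMinutes(startIso: string, endIso: string): number {
  const start = Date.parse(startIso);
  const end = Date.parse(endIso);
  if (isNaN(start) || isNaN(end)) throw new Error("invalid ISO-8601 timestamp");
  return Math.round((end - start) / 60000);
}

function checkFlight(departureIso: string, arrivalIso: string): string[] {
  const minutes = durationMinutes(departureIso, arrivalIso);
  if (minutes <= 0) return ["arrival is not after departure"];
  if (minutes < 60 || minutes > 20 * 60) return ["flight duration outside 1-20 hours"];
  return [];
}

// Check-in and check-out are date-only strings (YYYY-MM-DD), which
// Date.parse interprets as midnight UTC.
function checkHotelStay(checkIn: string, checkOut: string): string[] {
  const nights = Math.round(durationMinutes(checkIn, checkOut) / (24 * 60));
  if (nights <= 0) return ["check-out is not after check-in"];
  if (nights > 30) return ["stay longer than 30 nights"];
  return [];
}

// Atlanta 06:00 (UTC-4) to Paris 20:00 (UTC+2) is 8 hours of flying.
console.log(durationMinutes("2026-06-01T06:00:00-04:00", "2026-06-01T20:00:00+02:00"));

// Crossing the date line: the local arrival clock reads earlier than the
// local departure clock, yet the flight is valid when compared in UTC.
console.log(checkFlight("2026-06-02T19:00:00+12:00", "2026-06-02T06:00:00-10:00"));
```

Comparing local clock times directly would reject the second flight, which is why normalization to offsets has to happen before the checks run.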

Canonical Reservation Schema


{
  "reservation_id": "uuid",
  "type": "flight",
  "provider": "United Airlines",
  "confirmation_number": "ABC123",
  "status": "confirmed",
  "source_email_id": "gmail-message-id",
  "parsed_at": "ISO-8601",
  "parsing_method": "schema_org|pattern|ai_assisted",
  "parsing_confidence": "high|medium|low",
  "details": {
    "flights": [
      {
        "flight_number": "UA1234",
        "departure": {
          "airport": "ATL",
          "airport_name": "Hartsfield-Jackson Atlanta International",
          "terminal": "N",
          "gate": "A23",
          "datetime": "2026-06-01T06:00:00-04:00"
        },
        "arrival": {
          "airport": "CDG",
          "airport_name": "Charles de Gaulle Airport",
          "terminal": "2E",
          "datetime": "2026-06-01T20:00:00+02:00"
        },
        "duration_minutes": 480,
        "class": "economy",
        "seat": "24A",
        "passenger": "Peter Westerman"
      }
    ]
  },
  "cost": {
    "total": 850.00,
    "currency": "USD"
  }
}
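For the TypeScript model, the JSON above might translate into types along these lines. This is one reading of the example, not a definitive schema: which fields are optional is an assumption, and `isMultiLeg` is a hypothetical helper added to show how the `flights` array supports multi-leg itineraries.

```typescript
type ParsingMethod = "schema_org" | "pattern" | "ai_assisted";
type Confidence = "high" | "medium" | "low";

interface FlightEndpoint {
  airport: string;        // IATA code, e.g. "ATL"
  airport_name?: string;
  terminal?: string;
  gate?: string;
  datetime: string;       // ISO-8601 with offset
}

interface FlightSegment {
  flight_number: string;
  departure: FlightEndpoint;
  arrival: FlightEndpoint;
  duration_minutes?: number;
  class?: string;
  seat?: string;
  passenger?: string;
}

interface Reservation {
  reservation_id: string;
  type: "flight" | "hotel" | "restaurant" | "car_rental" | "activity";
  provider: string;
  confirmation_number: string | null;
  status: string;
  source_email_id: string;
  parsed_at: string;
  parsing_method: ParsingMethod;
  parsing_confidence: Confidence;
  // Type-specific payload; only the flight shape is spelled out here.
  details: { flights?: FlightSegment[]; [key: string]: unknown };
  cost?: { total: number; currency: string };
}

// True when a flight reservation carries more than one segment.
function isMultiLeg(reservation: Reservation): boolean {
  const flights = reservation.details.flights;
  return reservation.type === "flight" && flights !== undefined && flights.length > 1;
}
```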

Deduplication

  1. Detect duplicate emails: same confirmation number + same provider = same booking
  2. Handle updates: later emails for the same booking may contain updates (gate changes, room upgrades)
  3. Merge strategy: keep the most recent parsed data; preserve history of changes
  4. Cross-provider deduplication: Expedia booking + airline confirmation for the same flight — link them
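Steps 1 to 3 can be sketched as a group-and-merge over parsed bookings. The types and function names are hypothetical, and the `email_received_at` field is an assumption used to decide which version is most recent. Cross-provider linking (step 4) is not covered here, since it needs flight-level matching rather than a shared confirmation number.

```typescript
interface ParsedBooking {
  provider: string;
  confirmation_number: string;
  email_received_at: string;           // ISO-8601; assumed to be available
  fields: { [name: string]: string };
}

interface MergedBooking {
  current: ParsedBooking;    // most recent version
  history: ParsedBooking[];  // earlier versions, oldest first
}

// Same provider + same confirmation number = same booking.
function dedupeKey(booking: ParsedBooking): string {
  return booking.provider.trim().toLowerCase() + "|" + booking.confirmation_number.trim().toUpperCase();
}

function mergeBookings(bookings: ParsedBooking[]): MergedBooking[] {
  const byKey: { [key: string]: ParsedBooking[] } = {};
  for (const booking of bookings) {
    const key = dedupeKey(booking);
    (byKey[key] = byKey[key] || []).push(booking);
  }
  const merged: MergedBooking[] = [];
  for (const key of Object.keys(byKey)) {
    // Sort a copy oldest-to-newest, then peel off the newest as current.
    const versions = (byKey[key] || []).slice().sort(
      (a, b) => Date.parse(a.email_received_at) - Date.parse(b.email_received_at),
    );
    const current = versions.pop();
    if (current) merged.push({ current, history: versions });
  }
  return merged;
}
```

Keeping the older versions in `history` preserves changes such as a gate update instead of silently overwriting them.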

Error Handling

  1. Unparseable emails: save the raw email; flag for manual review; do not discard
  2. Partial extraction: if confirmation number found but dates missing, create a partial reservation and flag gaps
  3. Ambiguous dates: “04/05/2026” — is it April 5 or May 4? Use locale context and provider patterns to disambiguate
  4. Unknown providers: apply generic parsing patterns; flag as low confidence
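The ambiguous-date case in step 3 can be handled by checking both readings and falling back to locale order only when both are valid. The sketch below uses hypothetical names; `localeOrder` stands in for whatever locale or provider hint is available.

```typescript
type DateOrder = "MDY" | "DMY";

interface SlashDateResult {
  iso: string;         // YYYY-MM-DD
  ambiguous: boolean;  // true when both readings were valid and the locale decided
}

function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// Rejects impossible dates such as month 13 or February 30.
function isValidDate(year: number, month: number, day: number): boolean {
  const d = new Date(Date.UTC(year, month - 1, day));
  return d.getUTCFullYear() === year && d.getUTCMonth() === month - 1 && d.getUTCDate() === day;
}

function parseSlashDate(text: string, localeOrder: DateOrder): SlashDateResult | null {
  const match = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(text.trim());
  if (!match) return null;
  const a = Number(match[1]);
  const b = Number(match[2]);
  const year = Number(match[3]);
  const mdyValid = isValidDate(year, a, b);
  const dmyValid = isValidDate(year, b, a);
  if (!mdyValid && !dmyValid) return null;
  // Only one reading may be possible (e.g. 18/04/2026); then locale is ignored.
  const useMdy = mdyValid && (!dmyValid || localeOrder === "MDY");
  const month = useMdy ? a : b;
  const day = useMdy ? b : a;
  return {
    iso: year + "-" + pad2(month) + "-" + pad2(day),
    ambiguous: mdyValid && dmyValid && a !== b,
  };
}

console.log(parseSlashDate("04/05/2026", "MDY")); // April 5 under US ordering
console.log(parseSlashDate("04/05/2026", "DMY")); // May 4 under day-first ordering
```

The `ambiguous` flag gives the caller a way to lower parsing confidence or ask the user, rather than trusting the locale guess silently.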

Inputs Required

  • Email access method (Gmail API, IMAP, forwarding, manual paste)
  • Provider registry (can start with built-in defaults)
  • User’s timezone and locale (for date disambiguation)
  • Whether to use AI fallback parsing
  • User preferences for which booking types to extract

Output Format

Parsed Reservation Collection


{
  "trip_id": "uuid",
  "trip_name": "Paris Trip 2026",
  "reservations": [
    { "type": "flight", "confirmation_number": "ABC123", ... },
    { "type": "hotel", "confirmation_number": "12345678", ... },
    { "type": "restaurant", "confirmation_number": null, ... }
  ],
  "parsing_summary": {
    "emails_scanned": 15,
    "reservations_found": 6,
    "high_confidence": 4,
    "needs_review": 2
  }
}

Parser Module Structure


email-parser/
  EmailScanner.ts              — email source access and search
  ProviderRegistry.ts          — sender patterns and parsing rules
  parsers/
    SchemaOrgParser.ts         — JSON-LD structured data extraction
    FlightParser.ts            — airline email patterns
    HotelParser.ts             — hotel email patterns
    RestaurantParser.ts        — restaurant email patterns
    CarRentalParser.ts         — car rental email patterns
    GenericParser.ts           — fallback pattern matching
    AIParser.ts                — Claude-assisted extraction
  DateTimeResolver.ts          — timezone resolution and normalization
  Deduplicator.ts              — booking deduplication and merging
  models/
    Reservation.ts             — canonical reservation schema
    ParsingResult.ts           — extraction result with confidence
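How the parsers might be chained is sketched below: each pass is tried in order and the first one that returns data wins. All names are hypothetical, the passes are synchronous stubs (a real AI pass would be asynchronous), and tying a fixed confidence level to each pass is an assumption rather than something the layout above prescribes.

```typescript
type Method = "schema_org" | "pattern" | "ai_assisted";
type Level = "high" | "medium" | "low";

interface Extracted {
  confirmation_number: string;
}

interface PipelineResult {
  method: Method | null;
  confidence: Level;
  data: Extracted | null;
  needsReview: boolean;
}

interface Pass {
  method: Method;
  confidence: Level;
  // Returns null when the pass cannot extract anything from the email.
  run: (email: string) => Extracted | null;
}

function runPasses(email: string, passes: Pass[]): PipelineResult {
  for (const pass of passes) {
    const data = pass.run(email);
    if (data !== null) {
      return {
        method: pass.method,
        confidence: pass.confidence,
        data,
        needsReview: pass.confidence !== "high",
      };
    }
  }
  // Nothing matched: keep the email for manual review instead of discarding it.
  return { method: null, confidence: "low", data: null, needsReview: true };
}

// Stub passes standing in for SchemaOrgParser, the pattern parsers, and AIParser.
const examplePasses: Pass[] = [
  { method: "schema_org", confidence: "high", run: () => null },
  {
    method: "pattern",
    confidence: "medium",
    run: (email) => {
      const hit = /Confirmation:\s*([A-Z0-9]{6})\b/.exec(email);
      return hit ? { confirmation_number: String(hit[1]) } : null;
    },
  },
  { method: "ai_assisted", confidence: "low", run: () => null },
];

console.log(runPasses("Confirmation: ABC123", examplePasses));
```

Ordering the passes from most to least reliable means the cheaper, more trustworthy sources are always consulted before the fallback.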

Anti-Patterns

  • Parsing only the text/plain part — many booking emails use HTML with structured data (schema.org); parse HTML first
  • Ignoring schema.org markup — this is the most reliable data source when present; always check for it
  • Hardcoding date formats — providers use different date formats; support multiple patterns and use locale context
  • Assuming single-leg flights — many bookings contain multi-leg itineraries; parse all flight segments
  • Not handling timezone differences — a flight from Atlanta to Paris spans two timezones; departure and arrival must be in their respective zones
  • Discarding unparseable emails — save them for later improvement of parsing patterns or manual review
  • Trusting AI extraction without validation — AI can hallucinate confirmation numbers or dates; always validate against known patterns and sanity checks
  • Not deduplicating across sources — the same flight may appear in an airline email AND an Expedia email; detect and link them