Email Parsing — Travel Bookings
Instructions
Email Source Access
- Access methods (in order of preference):
- Gmail API: structured access, search operators, rich metadata
- IMAP: universal, works with any email provider
- Email forwarding: user forwards confirmations to a dedicated address
- Manual paste: user copies email text into the app
- Email identification (finding booking emails):
- Search Gmail:
from:(united.com OR delta.com OR aa.com) subject:(confirmation OR itinerary OR receipt)
- IMAP search:
SUBJECT "confirmation" FROM "bookings"
- Filter by known sender domains (see provider registry below)
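The Gmail search string above can be generated from the registry's sender domains rather than written by hand. The following is a minimal sketch; `buildGmailQuery` is a hypothetical helper name, not part of any Gmail client library, and it only assembles the query text.

```typescript
// Hypothetical helper: builds a Gmail search string from sender domains
// and subject keywords. It does not call the Gmail API itself.
function buildGmailQuery(domains: string[], subjectKeywords: string[]): string {
  if (domains.length === 0 || subjectKeywords.length === 0) {
    throw new Error("need at least one domain and one subject keyword");
  }
  const fromClause = domains.join(" OR ");
  const subjectClause = subjectKeywords.join(" OR ");
  return `from:(${fromClause}) subject:(${subjectClause})`;
}

// Reproduces the airline query shown above.
const bookingQuery = buildGmailQuery(
  ["united.com", "delta.com", "aa.com"],
  ["confirmation", "itinerary", "receipt"],
);
```

Generating the query from the registry keeps the search and the parser in sync when a provider is added.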
Provider Registry
Maintain a registry of known booking providers with parsing patterns:
Airlines
| Provider | Sender Domain(s) | Confirmation Pattern |
|---|---|---|
| United | united.com | [A-Z0-9]{6} (6-char alphanumeric) |
| Delta | delta.com | [A-Z0-9]{6} |
| American | aa.com, americanairlines.com | [A-Z]{6} |
| Southwest | southwestairlines.com | [A-Z0-9]{6} |
| JetBlue | jetblue.com | [A-Z]{6} |
| Alaska | alaskaairlines.com | [A-Z]{6} |
Hotels
| Provider | Sender Domain(s) | Confirmation Pattern |
|---|---|---|
| Marriott | marriott.com | [0-9]{8,9} |
| Hilton | hilton.com | [0-9]{9} |
| Hyatt | hyatt.com | [A-Z0-9]+ |
| IHG | ihg.com | [0-9]{8,} |
| Airbnb | airbnb.com | HM[A-Z0-9]+ |
| Booking.com | booking.com | [0-9]{8,10} |
Restaurants
| Provider | Sender Domain(s) | Confirmation Pattern |
|---|---|---|
| OpenTable | opentable.com | Reservation ID in URL |
| Resy | resy.com | Reservation ID in URL |
Car Rental
| Provider | Sender Domain(s) | Confirmation Pattern |
|---|---|---|
| Enterprise | enterprise.com | [A-Z0-9]+ |
| Hertz | hertz.com | [A-Z0-9]+ |
| Avis | avis.com | [0-9]+ |
Aggregators
| Provider | Sender Domain(s) |
|---|---|
| Expedia | expedia.com |
| Kayak | kayak.com |
| Google Travel | google.com |
| TripIt | tripit.com |
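One way to hold the registry in code is a list of entries keyed by sender domain, each with an anchored confirmation pattern. This is a sketch only: the `ProviderEntry` shape and the function names are assumptions for illustration, and only a few rows from the tables are included.

```typescript
type BookingType = "flight" | "hotel" | "car_rental";

// Assumed shape for one registry row (not prescribed by the tables above).
interface ProviderEntry {
  provider: string;
  type: BookingType;
  senderDomains: string[];
  confirmationPattern: RegExp; // anchored: the whole candidate must match
}

const PROVIDERS: ProviderEntry[] = [
  { provider: "United", type: "flight", senderDomains: ["united.com"], confirmationPattern: /^[A-Z0-9]{6}$/ },
  { provider: "American", type: "flight", senderDomains: ["aa.com", "americanairlines.com"], confirmationPattern: /^[A-Z]{6}$/ },
  { provider: "Hilton", type: "hotel", senderDomains: ["hilton.com"], confirmationPattern: /^[0-9]{9}$/ },
  { provider: "Airbnb", type: "hotel", senderDomains: ["airbnb.com"], confirmationPattern: /^HM[A-Z0-9]+$/ },
];

// Matches the sender's domain exactly or as a subdomain (email.aa.com -> aa.com).
function findProvider(senderAddress: string): ProviderEntry | undefined {
  const domain = senderAddress.toLowerCase().split("@").pop() ?? "";
  return PROVIDERS.find((p) =>
    p.senderDomains.some((d) => domain === d || domain.endsWith("." + d)),
  );
}

// Checks a candidate string against the provider's confirmation pattern.
function isValidConfirmation(entry: ProviderEntry, candidate: string): boolean {
  return entry.confirmationPattern.test(candidate.trim().toUpperCase());
}
```

Requiring a leading dot for subdomain matches keeps a lookalike domain such as `notaa.com` from resolving to American.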
Parsing Strategy
Use a multi-pass extraction approach:
Pass 1: Structured Data Extraction
- Check for schema.org markup (present in many airline and hotel emails):
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "FlightReservation",
  "reservationNumber": "ABC123",
  "reservationFor": {
    "@type": "Flight",
    "flightNumber": "UA1234",
    "departureAirport": { "iataCode": "ATL" },
    "arrivalAirport": { "iataCode": "CDG" }
  }
}
</script>
- If present, this is the most reliable source; parse it first
- Google surfaces this data in Gmail; it follows schema.org vocabulary
- Check the email's MIME parts for calendar attachments:
- .ics attachments contain structured event data
- Parse them with an iCalendar parser to extract dates, times, and locations
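The JSON-LD check in Pass 1 could look roughly like the sketch below. It uses a regular expression to find `application/ld+json` script blocks, which is a simplification; a production parser would more likely walk the HTML with a real DOM parser. The function names are hypothetical.

```typescript
// Returns every JSON-LD object found in the HTML body.
// Malformed blocks are skipped so later passes can still handle the email.
function extractJsonLd(html: string): unknown[] {
  const results: unknown[] = [];
  const scriptRe =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let match: RegExpExecArray | null;
  while ((match = scriptRe.exec(html)) !== null) {
    try {
      const parsed: unknown = JSON.parse(match[1] ?? "");
      if (Array.isArray(parsed)) results.push(...parsed);
      else results.push(parsed);
    } catch {
      // Not valid JSON: ignore this block.
    }
  }
  return results;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Collects reservationNumber values from FlightReservation objects.
function findFlightReservationNumbers(html: string): string[] {
  return extractJsonLd(html)
    .filter(isRecord)
    .filter((item) => item["@type"] === "FlightReservation")
    .map((item) => item["reservationNumber"])
    .filter((n): n is string => typeof n === "string");
}

const sampleHtml = `<html><body>
<script type="application/ld+json">
{ "@context": "http://schema.org", "@type": "FlightReservation",
  "reservationNumber": "ABC123",
  "reservationFor": { "@type": "Flight", "flightNumber": "UA1234" } }
</script>
<script type="application/ld+json">{ not valid json }</script>
</body></html>`;

const foundNumbers = findFlightReservationNumbers(sampleHtml);
```

The second script block in the sample is deliberately malformed to show that one bad block does not stop extraction of the others.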
Pass 2: Pattern-Based Text Extraction
When structured data is unavailable, parse the email body:
- Confirmation number extraction:
- Search for labels: “Confirmation”, “Booking Reference”, “Record Locator”, “PNR”, “Reservation #”
- Extract the adjacent alphanumeric string matching the provider’s known pattern
- Validate length and format against the provider registry
- Flight details extraction:
Patterns to match:
- "Flight [A-Z0-9]{2}[0-9]{1,4}" → flight number (two-character airline designators can contain a digit, e.g. B6 for JetBlue)
- "[A-Z]{3}" near "depart" or "arrive" → airport codes
- Date patterns: "April 18, 2026", "04/18/2026", "18 Apr 2026"
- Time patterns: "7:00 PM", "19:00", "7:00pm"
- "Terminal [A-Z0-9]+" → terminal
- "Gate [A-Z0-9]+" → gate
- "Seat [0-9]+[A-Z]" → seat assignment
- Hotel details extraction:
Patterns to match:
- Hotel name: text following "Hotel:", "Property:", or in subject line
- Check-in/out: dates following "Check-in", "Check-out", "Arrival", "Departure"
- Address: multi-line block with city, state, zip pattern
- Room type: text following "Room Type:", "Accommodation:"
- Rate: dollar amount following "Rate:", "Total:", "Nightly Rate:"
- Restaurant details extraction:
Patterns to match:
- Restaurant name: typically in subject line or header
- Party size: number following "Party of", "Guests:", "Covers:"
- Date/time: combined date-time extraction
- Address: multi-line block or Google Maps link
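As an illustration of Pass 2, the sketch below applies a subset of the flight patterns listed above to plain text. The label list, the optional "Number"/"Code" suffix, and the field names are assumptions; the airline designator is matched as two alphanumeric characters so that codes such as B6 are covered. Results should still be validated against the provider registry, since a label followed by any six-character uppercase token would match.

```typescript
interface FlightTextFields {
  confirmationNumber?: string;
  flightNumber?: string;
  terminal?: string;
  gate?: string;
  seat?: string;
}

// Returns the first capture group of the first match, if any.
function firstGroup(text: string, re: RegExp): string | undefined {
  const m = re.exec(text);
  return m?.[1];
}

function extractFlightFields(text: string): FlightTextFields {
  return {
    // Label, optional "Number"/"Code", optional ":" or "#", then a 6-char locator.
    confirmationNumber: firstGroup(
      text,
      /(?:Confirmation|Booking Reference|Record Locator|PNR|Reservation)(?: (?:[Nn]umber|[Cc]ode))?\s*[#:]*\s*([A-Z0-9]{6})\b/,
    ),
    flightNumber: firstGroup(text, /Flight\s+([A-Z0-9]{2}[0-9]{1,4})\b/),
    terminal: firstGroup(text, /Terminal\s+([A-Z0-9]+)\b/),
    gate: firstGroup(text, /Gate\s+([A-Z0-9]+)\b/),
    seat: firstGroup(text, /Seat\s+([0-9]+[A-Z])\b/),
  };
}

const sampleBody =
  "Confirmation Number: ABC123\n" +
  "Flight UA1234 departs ATL, Terminal N, Gate A23\n" +
  "Seat 24A";

const flightFields = extractFlightFields(sampleBody);
```

Fields that are not found stay `undefined`, which maps naturally onto a partial reservation with flagged gaps.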
Pass 3: AI-Assisted Extraction (Fallback)
When pattern matching fails:
- Send email text to Claude with a structured extraction prompt:
Extract travel booking details from this email. Return JSON:
{
  "type": "flight|hotel|restaurant|car_rental|activity",
  "confirmation_number": "string",
  "provider": "string",
  "dates": { "start": "ISO-8601", "end": "ISO-8601" },
  "details": { ... type-specific fields ... }
}
Email text:
{email_body}
- Validate AI output against the provider registry and date sanity checks
- Flag low-confidence extractions for user review
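Validation of the model's output might be sketched as follows. The function takes the raw JSON string returned by the model (the API call itself is out of scope here) and returns a list of issues. The check that the confirmation number appears verbatim in the email text is an added assumption, intended as a cheap guard against invented values; registry pattern checks would sit alongside it.

```typescript
const BOOKING_TYPES = ["flight", "hotel", "restaurant", "car_rental", "activity"];

interface AiCheck {
  accepted: boolean;
  issues: string[];
}

function checkAiExtraction(rawJson: string, emailBody: string): AiCheck {
  let data: unknown;
  try {
    data = JSON.parse(rawJson);
  } catch {
    return { accepted: false, issues: ["output is not valid JSON"] };
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return { accepted: false, issues: ["output is not a JSON object"] };
  }
  const rec = data as Record<string, unknown>;
  const issues: string[] = [];

  const bookingType = rec["type"];
  if (typeof bookingType !== "string" || BOOKING_TYPES.indexOf(bookingType) === -1) {
    issues.push("unknown booking type");
  }

  // null is allowed (e.g. restaurants without a confirmation number).
  const conf = rec["confirmation_number"];
  if (conf !== null && typeof conf !== "string") {
    issues.push("confirmation number has the wrong type");
  } else if (typeof conf === "string" && emailBody.indexOf(conf) === -1) {
    issues.push("confirmation number not found in email text");
  }

  const dates = rec["dates"];
  if (typeof dates === "object" && dates !== null) {
    const startRaw = (dates as Record<string, unknown>)["start"];
    const endRaw = (dates as Record<string, unknown>)["end"];
    const start = typeof startRaw === "string" ? Date.parse(startRaw) : NaN;
    const end = typeof endRaw === "string" ? Date.parse(endRaw) : NaN;
    if (isNaN(start) || isNaN(end)) issues.push("dates are not parseable");
    else if (start > end) issues.push("start date is after end date");
  } else {
    issues.push("missing dates");
  }

  return { accepted: issues.length === 0, issues };
}
```

Any non-empty `issues` list is a natural trigger for marking the extraction as low confidence and routing it to user review.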
Date and Time Handling
- Timezone resolution:
- Airport codes → timezone (map each IATA airport code to an IANA timezone name, e.g. ATL → America/New_York)
- Hotel city → timezone (use the Google Maps Time Zone API or a static lookup)
- If no timezone context: flag for user confirmation
- Date normalization: convert all dates to ISO-8601 with timezone offset
- Multi-leg timezone handling: departure time in departure timezone, arrival in arrival timezone
- Date sanity checks:
- Check-in before check-out
- Flight departure before arrival (accounting for timezone changes and date line)
- Reservation date in the future (or recent past for receipts)
- Duration within reasonable bounds (flights: 1–20 hours; hotels: 1–30 nights)
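Once timestamps carry their own UTC offsets, the sanity checks reduce to arithmetic on absolute instants. The sketch below assumes ISO-8601 strings with offsets for flights and date-only strings (YYYY-MM-DD) for hotel stays; the function names are hypothetical and the bounds are the ones listed above.

```typescript
// Elapsed minutes between two ISO-8601 timestamps with offsets.
function durationMinutes(startIso: string, endIso: string): number {
  const start = Date.parse(startIso);
  const end = Date.parse(endIso);
  if (Number.isNaN(start) || Number.isNaN(end)) {
    throw new Error("invalid ISO-8601 timestamp");
  }
  return Math.round((end - start) / 60000);
}

// Flight: arrival after departure, duration within 1 to 20 hours.
function flightIssues(departureIso: string, arrivalIso: string): string[] {
  const minutes = durationMinutes(departureIso, arrivalIso);
  if (minutes <= 0) return ["arrival is not after departure"];
  if (minutes < 60 || minutes > 20 * 60) return ["flight duration outside 1-20 hours"];
  return [];
}

// Hotel: check-out after check-in, stay within 1 to 30 nights.
// Date-only ISO strings are parsed as UTC midnight, so the difference is whole days.
function stayIssues(checkIn: string, checkOut: string): string[] {
  const nights = durationMinutes(checkIn, checkOut) / (24 * 60);
  if (nights <= 0) return ["check-out is not after check-in"];
  if (nights > 30) return ["stay longer than 30 nights"];
  return [];
}

// ATL 06:00 (UTC-4) to CDG 20:00 (UTC+2) is 10:00Z to 18:00Z: 480 minutes.
const atlToCdg = durationMinutes("2026-06-01T06:00:00-04:00", "2026-06-01T20:00:00+02:00");

// Crossing the date line: the local arrival time is earlier than the local
// departure time, but the elapsed time is still positive (08:00Z to 18:00Z).
const tokyoToLa = durationMinutes("2026-06-01T17:00:00+09:00", "2026-06-01T11:00:00-07:00");
```

Comparing instants rather than local wall-clock values is what makes the date-line case pass without special handling.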
Canonical Reservation Schema
{
  "reservation_id": "uuid",
  "type": "flight",
  "provider": "United Airlines",
  "confirmation_number": "ABC123",
  "status": "confirmed",
  "source_email_id": "gmail-message-id",
  "parsed_at": "ISO-8601",
  "parsing_method": "schema_org|pattern|ai_assisted",
  "parsing_confidence": "high|medium|low",
  "details": {
    "flights": [
      {
        "flight_number": "UA1234",
        "departure": {
          "airport": "ATL",
          "airport_name": "Hartsfield-Jackson Atlanta International",
          "terminal": "N",
          "gate": "A23",
          "datetime": "2026-06-01T06:00:00-04:00"
        },
        "arrival": {
          "airport": "CDG",
          "airport_name": "Charles de Gaulle Airport",
          "terminal": "2E",
          "datetime": "2026-06-01T20:00:00+02:00"
        },
        "duration_minutes": 480,
        "class": "economy",
        "seat": "24A",
        "passenger": "Peter Westerman"
      }
    ]
  },
  "cost": {
    "total": 850.00,
    "currency": "USD"
  }
}
Deduplication
- Detect duplicate emails: same confirmation number + same provider = same booking
- Handle updates: later emails for the same booking may contain updates (gate changes, room upgrades)
- Merge strategy: keep the most recent parsed data; preserve history of changes
- Cross-provider deduplication: Expedia booking + airline confirmation for the same flight — link them
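The same-provider rules above (duplicate detection, updates, and merge with history) could be sketched like this. The `ParsedEmail` and `MergedBooking` shapes are assumptions for illustration, and cross-provider linking is not covered because it needs flight-level matching rather than a shared confirmation number.

```typescript
interface ParsedEmail {
  provider: string;
  confirmationNumber: string;
  receivedAt: string; // ISO-8601
  fields: Record<string, string>;
}

interface FieldChange {
  receivedAt: string;
  field: string;
  from: string;
  to: string;
}

interface MergedBooking {
  key: string;
  current: Record<string, string>;
  history: FieldChange[];
}

// Same provider + same confirmation number = same booking.
function dedupKey(email: ParsedEmail): string {
  return `${email.provider.toLowerCase()}|${email.confirmationNumber.toUpperCase()}`;
}

// Applies emails oldest-first so the newest values win, recording each change.
function mergeEmails(emails: ParsedEmail[]): Map<string, MergedBooking> {
  const sorted = [...emails].sort(
    (a, b) => Date.parse(a.receivedAt) - Date.parse(b.receivedAt),
  );
  const bookings = new Map<string, MergedBooking>();
  for (const email of sorted) {
    const key = dedupKey(email);
    let booking = bookings.get(key);
    if (booking === undefined) {
      booking = { key, current: {}, history: [] };
      bookings.set(key, booking);
    }
    for (const [field, value] of Object.entries(email.fields)) {
      const previous = booking.current[field];
      if (previous !== undefined && previous !== value) {
        booking.history.push({ receivedAt: email.receivedAt, field, from: previous, to: value });
      }
      booking.current[field] = value;
    }
  }
  return bookings;
}
```

For example, an original confirmation with gate A23 followed by an update email with gate B7 yields one booking whose current gate is B7 and whose history holds a single A23 to B7 change.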
Error Handling
- Unparseable emails: save the raw email; flag for manual review; do not discard
- Partial extraction: if confirmation number found but dates missing, create a partial reservation and flag gaps
- Ambiguous dates: “04/05/2026” — is it April 5 or May 4? Use locale context and provider patterns to disambiguate
- Unknown providers: apply generic parsing patterns; flag as low confidence
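For the ambiguous-date case, one possible approach is to resolve what the numbers themselves allow and fall back to the locale order only when both readings are valid. The sketch below handles only the slash-separated numeric form; `resolveSlashDate` and the `DateOrder` type are hypothetical names.

```typescript
type DateOrder = "MDY" | "DMY";

interface ResolvedDate {
  iso: string;        // YYYY-MM-DD
  ambiguous: boolean; // true when the locale order decided the reading
}

function resolveSlashDate(text: string, localeOrder: DateOrder): ResolvedDate | undefined {
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(text.trim());
  if (m === null) return undefined;
  const a = Number(m[1]);
  const b = Number(m[2]);
  const year = Number(m[3]);

  let month: number;
  let day: number;
  let ambiguous = false;
  if (a > 12 && b <= 12) {
    // First number cannot be a month, so it must be the day.
    day = a;
    month = b;
  } else if (b > 12 && a <= 12) {
    month = a;
    day = b;
  } else {
    // Both readings are possible (or both impossible): use the locale order.
    ambiguous = a !== b;
    if (localeOrder === "MDY") {
      month = a;
      day = b;
    } else {
      month = b;
      day = a;
    }
  }

  // Reject impossible dates such as 02/30 by round-tripping through Date.
  const check = new Date(Date.UTC(year, month - 1, day));
  if (month < 1 || month > 12 || check.getUTCMonth() !== month - 1 || check.getUTCDate() !== day) {
    return undefined;
  }
  const pad = (n: number): string => (n < 10 ? "0" : "") + String(n);
  return { iso: `${year}-${pad(month)}-${pad(day)}`, ambiguous };
}
```

With this sketch, "04/05/2026" resolves to 2026-04-05 under "MDY" and 2026-05-04 under "DMY", both marked ambiguous, while "18/04/2026" resolves to 2026-04-18 regardless of locale. The `ambiguous` flag can feed the confidence level so that locale-dependent dates are surfaced for review.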
Inputs Required
- Email access method (Gmail API, IMAP, forwarding, manual paste)
- Provider registry (can start with built-in defaults)
- User’s timezone and locale (for date disambiguation)
- Whether to use AI fallback parsing
- User preferences for which booking types to extract
Output Format
Parsed Reservation Collection
{
  "trip_id": "uuid",
  "trip_name": "Paris Trip 2026",
  "reservations": [
    { "type": "flight", "confirmation_number": "ABC123", ... },
    { "type": "hotel", "confirmation_number": "12345678", ... },
    { "type": "restaurant", "confirmation_number": null, ... }
  ],
  "parsing_summary": {
    "emails_scanned": 15,
    "reservations_found": 6,
    "high_confidence": 4,
    "needs_review": 2
  }
}
Parser Module Structure
email-parser/
  EmailScanner.ts — email source access and search
  ProviderRegistry.ts — sender patterns and parsing rules
  parsers/
    SchemaOrgParser.ts — JSON-LD structured data extraction
    FlightParser.ts — airline email patterns
    HotelParser.ts — hotel email patterns
    RestaurantParser.ts — restaurant email patterns
    CarRentalParser.ts — car rental email patterns
    GenericParser.ts — fallback pattern matching
    AIParser.ts — Claude-assisted extraction
  DateTimeResolver.ts — timezone resolution and normalization
  Deduplicator.ts — booking deduplication and merging
  models/
    Reservation.ts — canonical reservation schema
    ParsingResult.ts — extraction result with confidence
Anti-Patterns
- Parsing only the text/plain part — many booking emails use HTML with structured data (schema.org); parse HTML first
- Ignoring schema.org markup — this is the most reliable data source when present; always check for it
- Hardcoding date formats — providers use different date formats; support multiple patterns and use locale context
- Assuming single-leg flights — many bookings contain multi-leg itineraries; parse all flight segments
- Not handling timezone differences — a flight from Atlanta to Paris spans two timezones; departure and arrival must be in their respective zones
- Discarding unparseable emails — save them for later improvement of parsing patterns or manual review
- Trusting AI extraction without validation — AI can hallucinate confirmation numbers or dates; always validate against known patterns and sanity checks
- Not deduplicating across sources — the same flight may appear in an airline email AND an Expedia email; detect and link them
