Patriot Source Scanner







Political content acquisition from RSS feeds and Tavily Search API with hash-based deduplication and content normalization for downstream analysis. Use when configuring source scanning, managing feed parsing, or implementing content deduplication.

Instructions

Acquire political content from configured sources, deduplicate against previously seen items, and normalize for downstream analysis by the Speech Analyzer.

RSS Feed Parsing

  • Parse RSS 2.0 and Atom feeds using standard feed parsing libraries
  • Extract: title, published date, author, full content (prefer content:encoded over description), source URL
  • Handle feed errors gracefully — log and skip malformed entries, do not halt the scan
  • Respect ttl and lastBuildDate to avoid redundant fetches
  • Track ETag and Last-Modified headers per feed for conditional GET requests
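
The extraction and error-handling rules above can be sketched with the Python standard library alone. This is a minimal illustration, not the scanner's actual implementation: the function names `conditional_headers` and `parse_rss_items` are hypothetical, only RSS 2.0 `<item>` elements are handled (Atom would need its own branch), and treating a missing title or link as "malformed" is an assumption.

```python
import logging
import xml.etree.ElementTree as ET

log = logging.getLogger("source_scanner")

# Namespace URI of the RSS content module, which defines <content:encoded>.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"


def conditional_headers(etag=None, last_modified=None):
    """Build conditional GET headers from values saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def parse_rss_items(xml_text):
    """Return one dict per usable <item>; log and skip anything malformed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        log.warning("feed is not well-formed XML, skipping: %s", exc)
        return []
    items = []
    for item in root.iter("item"):
        title = item.findtext("title")
        link = item.findtext("link")
        if not title or not link:
            # Assumption: an entry without title or link counts as malformed.
            log.warning("skipping entry without title or link")
            continue
        # Prefer the full content:encoded body over the shorter description.
        body = item.findtext(CONTENT_NS + "encoded") or item.findtext("description") or ""
        items.append({
            "title": title.strip(),
            "url": link.strip(),
            "author": item.findtext("author"),
            "published": item.findtext("pubDate"),
            "body": body,
        })
    return items
```

A server that honors the conditional headers answers 304 Not Modified when the feed is unchanged, so the scan can skip parsing that feed for the cycle.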

Source Categories

| Category | Examples | Priority |
| Wire services | AP, Reuters, UPI | High (scan every cycle) |
| Official statements | White House, Congress.gov, state gov press offices | High |
| Major outlets | NYT, WaPo, WSJ, CNN, Fox News, NPR | Medium (daily scan) |
| Analysis/opinion | Atlantic, National Review, Jacobin, Reason | Low (weekly scan) |
| Social media transcripts | Official account archives, press briefing transcripts | Medium |

Tavily Search Integration

  • Use Tavily Search API for targeted queries when monitoring specific figures or topics
  • Query templates: "{figure name}" AND ("executive order" OR "legislation" OR "policy")
  • Limit to news results from the past 24 hours for breaking scans, past 7 days for full scans
  • Extract the full article content using Tavily’s content extraction, not just snippets
  • Rate-limit API calls to stay within quota
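
As a rough sketch of the query template and rate limiting described above, the snippet below builds the query string and spaces out calls. It deliberately does not call the Tavily API; `build_query`, `SCAN_WINDOWS_DAYS`, and `MinIntervalLimiter` are hypothetical names, and a fixed minimum interval is only one possible way to stay within quota.

```python
import time

# Look-back windows from the text, in days (breaking: 24 hours, full: 7 days).
SCAN_WINDOWS_DAYS = {"breaking": 1, "full": 7}


def build_query(figure_name):
    """Fill the query template for one monitored figure."""
    terms = ["executive order", "legislation", "policy"]
    alternatives = " OR ".join(f'"{t}"' for t in terms)
    return f'"{figure_name}" AND ({alternatives})'


class MinIntervalLimiter:
    """Space calls at least `min_interval` seconds apart (assumed quota policy)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until the next call is allowed, then record the call time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `limiter.wait()` immediately before each search request keeps the request rate at or below one call per `min_interval` seconds.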

Deduplication

  • Compute SHA-256 hash of normalized content (lowercase, whitespace-collapsed, punctuation-stripped)
  • Check hash against the seen-content database before passing to analysis
  • Store hashes with source URL, fetch timestamp, and scan cycle ID
  • Near-duplicate detection: also hash the first 500 characters of the normalized content to catch minor edits of the same story (this catches edits that fall after the opening 500 characters)
  • Retain hash records for 90 days, then archive
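
The hashing steps above might look like the following sketch. The in-memory `SeenStore` is a hypothetical stand-in for the seen-content database, the function names are illustrative, and two details are assumptions: only ASCII punctuation is stripped, and the 500-character prefix is taken from the normalized text.

```python
import hashlib
import re
import string

# Translation table that deletes ASCII punctuation (assumption: typographic
# quotes and other non-ASCII punctuation are left in place).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_for_hash(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


def content_hashes(text, prefix_len=500):
    """Return (full_hash, prefix_hash) as SHA-256 hex digests."""
    norm = normalize_for_hash(text)
    full = hashlib.sha256(norm.encode("utf-8")).hexdigest()
    prefix = hashlib.sha256(norm[:prefix_len].encode("utf-8")).hexdigest()
    return full, prefix


class SeenStore:
    """In-memory stand-in for the seen-content database."""

    def __init__(self):
        self.full = {}
        self.prefix = {}

    def check_and_add(self, text, url, fetched_at, scan_cycle_id):
        """Return 'duplicate', 'near-duplicate', or 'new'; record only new items."""
        full, prefix = content_hashes(text)
        if full in self.full:
            return "duplicate"
        if prefix in self.prefix:
            return "near-duplicate"
        meta = {"url": url, "fetched_at": fetched_at, "scan_cycle_id": scan_cycle_id}
        self.full[full] = meta
        self.prefix[prefix] = meta
        return "new"
```

Note that items shorter than 500 normalized characters have identical full and prefix hashes, so for them the prefix check adds nothing beyond the exact-match check.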

Content Normalization

Before passing content to the Speech Analyzer, normalize each item:

  • Strip HTML tags and decode entities
  • Extract plain text body, preserving paragraph breaks
  • Standardize date formats to ISO 8601
  • Attach metadata: source name, source category, fetch timestamp, content hash
  • Truncate to 10,000 characters if longer (preserve the beginning, which contains the lede)
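
The normalization steps can be illustrated with the standard library as below. This is a sketch under stated assumptions: `html_to_text` and `to_iso8601` are hypothetical helpers, the set of tags treated as paragraph boundaries is a guess, script and style contents are not filtered out, and dates without a time zone are assumed to be UTC. RSS `pubDate` values use the RFC 822 style handled here; Atom timestamps are already in an ISO 8601 profile.

```python
import re
from datetime import timezone
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser

MAX_BODY_CHARS = 10_000
# Assumption: these tags mark paragraph boundaries.
_BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "blockquote"}


class _TextExtractor(HTMLParser):
    """Collect text and insert a paragraph break at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes entities such as &amp;
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in _BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in _BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Line wraps inside the markup are not paragraph breaks.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    """Strip tags, keep paragraph breaks, and truncate from the end."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r" *\n[ \n]*", "\n\n", text)  # one blank line per paragraph break
    text = re.sub(r" {2,}", " ", text)
    return text.strip()[:MAX_BODY_CHARS]


def to_iso8601(rfc822_date):
    """Convert an RSS pubDate string to ISO 8601 in UTC."""
    dt = parsedate_to_datetime(rfc822_date)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive dates are UTC
    return dt.astimezone(timezone.utc).isoformat()
```

Truncating with a slice keeps the start of the article, which matches the stated goal of preserving the lede.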

Output Format

Each scanned item produces a record with: content_hash (SHA-256), source_name, source_category, published_date (ISO 8601), title, body (normalized), url, and scan_cycle_id.
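
One way to represent this record is a small dataclass, shown here as an illustration only. The class name `ScannedItem` is hypothetical, and the field types (in particular a string `scan_cycle_id`) are assumptions, since the text lists the fields without typing them. The values in the usage lines are placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScannedItem:
    content_hash: str      # SHA-256 hex digest of the normalized content
    source_name: str
    source_category: str
    published_date: str    # ISO 8601
    title: str
    body: str              # normalized plain text
    url: str
    scan_cycle_id: str     # assumption: cycle IDs are strings


# Placeholder values, for illustration only.
item = ScannedItem(
    content_hash="0" * 64,
    source_name="Example Wire",
    source_category="Wire services",
    published_date="2025-06-03T14:30:00+00:00",
    title="Example headline",
    body="Example body text.",
    url="https://example.com/story",
    scan_cycle_id="cycle-0001",
)
serialized = json.dumps(asdict(item))
```

Freezing the dataclass keeps a record from being altered after its hash has been computed.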

Examples

RSS scan: Parse 15 configured feeds → retrieve 120 items → 73 match existing hashes (skip) → 47 new items normalized and queued for analysis.

Tavily search: Query for a specific figure’s recent statements → 12 results → 8 are duplicates of RSS-sourced content → 4 new items normalized and queued.
