Patriot Source Scanner

Posted May 27, 2026

Updated May 27, 2026

By Peter Westerman

Political content acquisition from RSS feeds and Tavily Search API with hash-based deduplication and content normalization for downstream analysis. Use when configuring source scanning, managing feed parsing, or implementing content deduplication.

Instructions

Acquire political content from configured sources, deduplicate against previously seen items, and normalize for downstream analysis by the Speech Analyzer.

RSS Feed Parsing

Parse RSS 2.0 and Atom feeds using standard feed parsing libraries
Extract: title, published date, author, full content (prefer content:encoded over description), source URL
Handle feed errors gracefully: log and skip malformed entries rather than halting the scan
Respect ttl and lastBuildDate to avoid redundant fetches
Track ETag and Last-Modified headers per feed for conditional GET requests
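
A minimal sketch of the extraction and conditional-GET points above. It uses only the Python standard library (`xml.etree.ElementTree`) in place of a dedicated feed parsing library, so it covers RSS 2.0 `<item>` elements only, not Atom. The function names `conditional_headers` and `parse_rss_items`, and the rule that an entry without a title or link counts as malformed, are assumptions made for illustration.

```python
import logging
import xml.etree.ElementTree as ET

log = logging.getLogger("source_scanner")

# Namespace of the RSS "content" module, which defines <content:encoded>.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"


def conditional_headers(etag=None, last_modified=None):
    """Build conditional-GET headers from values saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def parse_rss_items(xml_text):
    """Return one dict per usable <item>; bad feeds and entries are logged and skipped."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        log.warning("feed skipped, not well-formed XML: %s", exc)
        return []
    items = []
    for item in root.iter("item"):
        title = item.findtext("title")
        link = item.findtext("link")
        if not title or not link:  # assumed definition of "malformed"
            log.warning("entry skipped: missing title or link")
            continue
        # Prefer the full body in content:encoded over the description summary.
        body = item.findtext(CONTENT_NS + "encoded") or item.findtext("description") or ""
        items.append({
            "title": title.strip(),
            "url": link.strip(),
            "published": item.findtext("pubDate"),
            "author": item.findtext("author"),
            "body": body,
        })
    return items
```

When a server answers a conditional request with HTTP 304 Not Modified, the feed has not changed and parsing can be skipped for that cycle.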

Source Categories

Category	Examples	Priority
Wire services	AP, Reuters, UPI	High — scan every cycle
Official statements	White House, Congress.gov, state gov press offices	High
Major outlets	NYT, WaPo, WSJ, CNN, Fox News, NPR	Medium — daily scan
Analysis/opinion	Atlantic, National Review, Jacobin, Reason	Low — weekly scan
Social media transcripts	Official account archives, press briefing transcripts	Medium
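
One way the priorities in the table could drive scheduling is sketched below. The table states a cadence for only three categories, so the intervals marked ASSUMED are placeholders, and `SCAN_INTERVALS` and `is_due` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Minimum time between scans per category; "every cycle" is modeled as zero.
SCAN_INTERVALS = {
    "Wire services": timedelta(0),
    "Official statements": timedelta(0),            # ASSUMED: treated like other High sources
    "Major outlets": timedelta(days=1),
    "Analysis/opinion": timedelta(weeks=1),
    "Social media transcripts": timedelta(days=1),  # ASSUMED: treated like other Medium sources
}


def is_due(category, last_scan, now):
    """True if the category has never been scanned or its interval has elapsed."""
    if last_scan is None:
        return True
    return now - last_scan >= SCAN_INTERVALS[category]


now = datetime(2026, 5, 27, 12, 0, tzinfo=timezone.utc)
due = [c for c in SCAN_INTERVALS if is_due(c, now - timedelta(hours=3), now)]
# Three hours after the last scan, only the zero-interval categories are due.
```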

Tavily Search Integration

Use Tavily Search API for targeted queries when monitoring specific figures or topics
Query templates: "{figure name}" AND ("executive order" OR "legislation" OR "policy")
Limit to news results from the past 24 hours for breaking scans, past 7 days for full scans
Extract the full article content using Tavily’s content extraction, not just snippets
Rate-limit API calls to stay within quota
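
The query template and rate limiting above can be sketched as follows. The Tavily request itself is deliberately left out, since its parameter names should be taken from Tavily's own documentation. `build_query`, `lookback_days`, and `MinIntervalLimiter` are hypothetical helpers, and a fixed minimum gap between calls is just one simple way to stay under a quota.

```python
import time

QUERY_TEMPLATE = '"{figure}" AND ("executive order" OR "legislation" OR "policy")'


def build_query(figure):
    """Fill the template; embedded double quotes are dropped so the phrase stays intact."""
    return QUERY_TEMPLATE.format(figure=figure.replace('"', "").strip())


def lookback_days(scan_type):
    """Breaking scans look back 24 hours, full scans 7 days."""
    return {"breaking": 1, "full": 7}[scan_type]


class MinIntervalLimiter:
    """Allow at most one call per `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock    # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self._last = None

    def wait(self):
        """Block until the next call is allowed, then record the call time."""
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now
```

Calling `limiter.wait()` immediately before each API request keeps consecutive requests at least `min_interval` seconds apart.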

Deduplication

Compute SHA-256 hash of normalized content (lowercase, whitespace-collapsed, punctuation-stripped)
Check hash against the seen-content database before passing to analysis
Store hashes with source URL, fetch timestamp, and scan cycle ID
Near-duplicate detection: also hash the first 500 characters of the normalized content to catch minor edits of the same story (this only matches when the edit falls after that prefix)
Retain hash records for 90 days, then archive
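
The hashing steps above can be sketched with the standard library. `SeenStore` is a hypothetical in-memory stand-in for the seen-content database, and punctuation stripping here covers ASCII punctuation only, which is an assumption; 90-day retention and archiving are not shown.

```python
import hashlib
import re
import string

_PUNCT = str.maketrans("", "", string.punctuation)  # ASCII punctuation only (assumption)


def normalize_for_hash(text):
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(_PUNCT)
    return re.sub(r"\s+", " ", text).strip()


def content_hashes(text, prefix_len=500):
    """Return (full_hash, prefix_hash) as SHA-256 hex digests of the normalized text."""
    norm = normalize_for_hash(text)
    full = hashlib.sha256(norm.encode("utf-8")).hexdigest()
    prefix = hashlib.sha256(norm[:prefix_len].encode("utf-8")).hexdigest()
    return full, prefix


class SeenStore:
    """In-memory stand-in for the seen-content database (hypothetical)."""

    def __init__(self):
        self.records = {}      # full hash -> metadata
        self.prefixes = set()  # prefix hashes of everything seen so far

    def check_and_add(self, text, url, fetched_at, scan_cycle_id):
        """Return 'duplicate', 'near-duplicate', or 'new'; record anything not an exact duplicate."""
        full, prefix = content_hashes(text)
        if full in self.records:
            return "duplicate"
        status = "near-duplicate" if prefix in self.prefixes else "new"
        self.records[full] = {
            "url": url,
            "fetched_at": fetched_at,
            "scan_cycle_id": scan_cycle_id,
        }
        self.prefixes.add(prefix)
        return status
```

Whether a near-duplicate is skipped or still forwarded for analysis is left to the caller; the sketch only reports the match.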

Content Normalization

Before passing content to the Speech Analyzer, normalize each item:

Strip HTML tags and decode entities
Extract plain text body, preserving paragraph breaks
Standardize date formats to ISO 8601
Attach metadata: source name, source category, fetch timestamp, content hash
Truncate to 10,000 characters if longer (preserve the beginning, which contains the lede)
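
A minimal sketch of the text and date steps, using only the standard library. The set of tags treated as paragraph boundaries, the choice to drop `script` and `style` content, and the assumption that incoming dates are RFC 822 strings (as in RSS `pubDate`) are illustrative, not taken from the original.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "blockquote"}
MAX_CHARS = 10_000


class _TextExtractor(HTMLParser):
    """Collect text, marking block-level tags as paragraph breaks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes entities such as &amp;
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags, keep paragraph breaks, and truncate from the end."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    paragraphs = []
    for chunk in "".join(parser.parts).split("\n\n"):
        line = " ".join(chunk.split())  # collapse whitespace inside a paragraph
        if line:
            paragraphs.append(line)
    return "\n\n".join(paragraphs)[:MAX_CHARS]


def to_iso8601(rfc822_date):
    """Convert an RFC 822 date string to ISO 8601 in UTC."""
    dt = parsedate_to_datetime(rfc822_date)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive dates are UTC
    return dt.astimezone(timezone.utc).isoformat()
```

Truncating after tag stripping keeps the 10,000-character limit measured in readable text rather than markup.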

Output Format

Each scanned item produces a record with: content_hash (SHA-256), source_name, source_category, published_date (ISO 8601), title, body (normalized), url, and scan_cycle_id.
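
The record can be expressed as a small dataclass. The field names come from the list above; the class name `ScannedItem`, the string types, the JSON serialization, and the sample values are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScannedItem:
    content_hash: str     # SHA-256 hex digest
    source_name: str
    source_category: str
    published_date: str   # ISO 8601
    title: str
    body: str             # normalized plain text
    url: str
    scan_cycle_id: str

    def to_json(self):
        """Serialize the record for the downstream queue."""
        return json.dumps(asdict(self), ensure_ascii=False)


# Illustrative values only.
item = ScannedItem(
    content_hash="0" * 64,
    source_name="Example Wire",
    source_category="Wire services",
    published_date="2026-05-27T18:30:00+00:00",
    title="Example headline",
    body="Example body text.",
    url="https://example.com/story",
    scan_cycle_id="cycle-001",
)
```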

Examples

RSS scan: Parse 15 configured feeds → retrieve 120 items → 73 match existing hashes (skip) → 47 new items normalized and queued for analysis.

Tavily search: Query for a specific figure’s recent statements → 12 results → 8 are duplicates of RSS-sourced content → 4 new items normalized and queued.
