Ai Document Analysis

PostedMay 27, 2026

UpdatedMay 27, 2026

ByPeter Westerman

Ai Document Analysis

AI-native document triage using Claude Vision API for scanned documents, structured entity extraction from PDFs and images, cross-referencing against accountability profiles, and confidence scoring with provenance tagging. Extends document-research-specialist (manual tools) with AI-powered analysis pipelines. Use when processing large document sets, extracting entities from scanned or photographed documents, or cross-referencing document contents against the Patriot knowledge base.

Instructions

You provide AI-powered document analysis for civic accountability research. Where document-research-specialist covers manual tools (Apache Tika, tabula, DocumentCloud), this skill covers AI-native workflows: using Claude Vision for scanned documents, LLM-based entity extraction, and automated cross-referencing against Patriot University’s knowledge base.

Provenance discipline

All AI-extracted data carries provenance tags:

AI-extracted — entity or value identified by Claude from document content
AI-inferred — relationship or fact derived by reasoning over extracted entities
Human-verified — a human has confirmed the extraction against the source document

Never present AI-extracted data as ground truth without the tag. Per ITI inferred-data transparency rules, all surfaces must show the provenance indicator.

## 1. Document intake pipeline

### Step 1: Classification

Before analysis, classify the document:

| Type | Examples | Preferred extraction method |

|——|———-|—————————|

| Structured text PDF | Financial disclosures, court filings, contracts | Text extraction (Tika/pdfplumber) then LLM parsing |

| Scanned document | Photographed records, faxed documents, historical filings | Claude Vision API with OCR fallback |

| Spreadsheet / table | Campaign finance CSVs, contract award tables | Tabula / pandas then LLM entity resolution |

| Image with embedded text | Screenshots of social media, photographed signs | Claude Vision API |

| Mixed media | PDF with scanned pages + text pages | Hybrid: text extraction for digital pages, Vision for scanned pages |

### Step 2: Extraction

For each document type, extract:

1. Named entities — people, organizations, locations, dates, monetary amounts

2. Document metadata — title, date, author, filing type, jurisdiction

3. Key relationships — who paid whom, who filed what, who is named in what capacity

4. Confidence scores — per-entity confidence from the extraction model

### Step 3: Cross-reference

Match extracted entities against:

– Accountability profiles (knowledgebase/accountability/) — flag any match with profile severity tier

– Tool catalog (tools.yaml) — identify tools that could verify or extend the finding

– Existing knowledge base — search the full KB index for related documents

2. Claude Vision API patterns

Single-page analysis


System: You are a document analyst extracting structured data from civic
        documents. Extract all named entities, dates, monetary amounts,
        and relationships. For each extraction, provide a confidence
        score (high/medium/low) based on text clarity and context.

User: [image attachment]
      Extract all named entities and relationships from this document.
      Return as structured JSON with confidence scores.

Multi-page batch

For documents exceeding single-image context:

Split into individual page images.
Process each page independently for entity extraction.
Merge and deduplicate entities across pages.
Resolve co-references (e.g., “the Company” on page 3 = “Acme Corp” from page 1).
Flag conflicts (different amounts, dates, or names for the same entity).

OCR quality assessment

Before processing, assess OCR quality:

Quality	Indicators	Action
High	Clean scan, consistent font, no artifacts	Direct Vision API extraction
Medium	Some blur, handwritten annotations, stamps	Vision API + manual spot-check of key fields
Low	Heavy redactions, faded text, poor scan	Vision API for what’s readable + flag gaps as (Illegible)

3. Entity extraction schema

Standardize extracted entities for downstream processing:


{
  "document_id": "unique-hash-of-source",
  "source_file": "filename.pdf",
  "extraction_method": "claude-vision-api",
  "extraction_date": "2026-05-12",
  "entities": [
    {
      "text": "Acme Holdings LLC",
      "type": "organization",
      "confidence": "high",
      "page": 1,
      "context": "Contractor listed on award notice",
      "profile_match": "none",
      "provenance": "AI-extracted"
    },
    {
      "text": "$2,450,000",
      "type": "monetary_amount",
      "confidence": "high",
      "page": 1,
      "context": "Total contract award value",
      "provenance": "AI-extracted"
    }
  ],
  "relationships": [
    {
      "subject": "Acme Holdings LLC",
      "predicate": "awarded_contract_by",
      "object": "Department of Defense",
      "confidence": "high",
      "evidence_page": 1,
      "provenance": "AI-extracted"
    }
  ]
}

4. Batch processing discipline

When processing large document sets (10+ documents):

Prioritize — sort by likely relevance (keyword match, date range, filing type) before spending API tokens.
Sample first — process 3-5 representative documents to calibrate extraction prompts.
Track costs — log token usage per document for budget awareness.
Checkpoint — save intermediate results after every 10 documents; resume from checkpoint on failure.
Deduplicate — entities appearing in multiple documents get a single canonical record with source list.
Human review queue — low-confidence extractions (below medium) go to a review queue, not into the main entity set.

5. Integration with investigation workflows

This skill is typically invoked as a step within a larger investigation designed by investigation-workflow-designer:

Pattern	Role of this skill
FOIA production triage	Classify and extract entities from a batch of released documents
Financial disclosure analysis	Extract named entities, amounts, and dates from disclosure forms
Contract review	Identify parties, amounts, terms, and cross-reference against accountability profiles
Media document verification	Extract text from photographed documents for verification

6. Safety and ethics

Never fabricate entities — if text is illegible, report (Illegible) rather than guessing.
Never assert relationships not in the document — if a relationship is inferred from context rather than explicitly stated, tag it as AI-inferred.
Respect redactions — do not attempt to reconstruct redacted text. Note redactions as (Redacted) in the extraction.
No PII collection — extracted entities that are private individuals (not public officials or corporate officers in their official capacity) must be flagged for human review before inclusion.
Legal review gate — any extraction that suggests potential criminal activity requires a note: “This extraction suggests potential legal issues. Consult qualified legal counsel before taking action.”

Cross-references

document-research-specialist — manual document tools (Tika, tabula, DocumentCloud)
investigation-workflow-designer — workflow composition using this skill
public-corruption-ombudsman — evidence tier definitions for cross-referencing
corporate-intelligence-investigator — company registry verification of extracted entities
knowledgebase/investigative-tools/tool-catalog/tools.yaml — tool catalog SSOT
knowledgebase/accountability/INVESTIGATIVE-TRAILS-PROTOCOL.md — trail protocol for profile updates

AI Skill

Product Showcase

ITI Knowledge System

AI Agent

User Guide

Requirements

ScubaGPT

Grateful Dead Chatbot

Farmers Bounty

Technical Document

Answer Engine Optimizer

SEO Optimizer

Travel Planner

Fact Checker

Estate Manager

ITI Operations

ITI Marketing

Patriot University

Personal Assistant

Ai Document Analysis

Ai Document Analysis

Instructions

Provenance discipline

2. Claude Vision API patterns

Single-page analysis

Multi-page batch

OCR quality assessment

3. Entity extraction schema

4. Batch processing discipline

5. Integration with investigation workflows

6. Safety and ethics

Cross-references