Skip to main content
< All Topics
Print

Ai Document Analysis







Ai Document Analysis

AI-native document triage using Claude Vision API for scanned documents, structured entity extraction from PDFs and images, cross-referencing against accountability profiles, and confidence scoring with provenance tagging. Extends document-research-specialist (manual tools) with AI-powered analysis pipelines. Use when processing large document sets, extracting entities from scanned or photographed documents, or cross-referencing document contents against the Patriot knowledge base.

Instructions

You provide AI-powered document analysis for civic accountability research. Where document-research-specialist covers manual tools (Apache Tika, tabula, DocumentCloud), this skill covers AI-native workflows: using Claude Vision for scanned documents, LLM-based entity extraction, and automated cross-referencing against Patriot University’s knowledge base.

Provenance discipline

All AI-extracted data carries provenance tags:

  • AI-extracted — entity or value identified by Claude from document content
  • AI-inferred — relationship or fact derived by reasoning over extracted entities
  • Human-verified — a human has confirmed the extraction against the source document

Never present AI-extracted data as ground truth without the tag. Per ITI inferred-data transparency rules, all surfaces must show the provenance indicator.

## 1. Document intake pipeline

### Step 1: Classification

Before analysis, classify the document:

| Type | Examples | Preferred extraction method |

|——|———-|—————————|

| Structured text PDF | Financial disclosures, court filings, contracts | Text extraction (Tika/pdfplumber) then LLM parsing |

| Scanned document | Photographed records, faxed documents, historical filings | Claude Vision API with OCR fallback |

| Spreadsheet / table | Campaign finance CSVs, contract award tables | Tabula / pandas then LLM entity resolution |

| Image with embedded text | Screenshots of social media, photographed signs | Claude Vision API |

| Mixed media | PDF with scanned pages + text pages | Hybrid: text extraction for digital pages, Vision for scanned pages |

### Step 2: Extraction

For each document type, extract:

1. Named entities — people, organizations, locations, dates, monetary amounts

2. Document metadata — title, date, author, filing type, jurisdiction

3. Key relationships — who paid whom, who filed what, who is named in what capacity

4. Confidence scores — per-entity confidence from the extraction model

### Step 3: Cross-reference

Match extracted entities against:

Accountability profiles (knowledgebase/accountability/) — flag any match with profile severity tier

Tool catalog (tools.yaml) — identify tools that could verify or extend the finding

Existing knowledge base — search the full KB index for related documents

2. Claude Vision API patterns

Single-page analysis


System: You are a document analyst extracting structured data from civic
        documents. Extract all named entities, dates, monetary amounts,
        and relationships. For each extraction, provide a confidence
        score (high/medium/low) based on text clarity and context.

User: [image attachment]
      Extract all named entities and relationships from this document.
      Return as structured JSON with confidence scores.

Multi-page batch

For documents exceeding single-image context:

  1. Split into individual page images.
  2. Process each page independently for entity extraction.
  3. Merge and deduplicate entities across pages.
  4. Resolve co-references (e.g., “the Company” on page 3 = “Acme Corp” from page 1).
  5. Flag conflicts (different amounts, dates, or names for the same entity).

OCR quality assessment

Before processing, assess OCR quality:

Quality Indicators Action
High Clean scan, consistent font, no artifacts Direct Vision API extraction
Medium Some blur, handwritten annotations, stamps Vision API + manual spot-check of key fields
Low Heavy redactions, faded text, poor scan Vision API for what’s readable + flag gaps as (Illegible)

3. Entity extraction schema

Standardize extracted entities for downstream processing:


{
  "document_id": "unique-hash-of-source",
  "source_file": "filename.pdf",
  "extraction_method": "claude-vision-api",
  "extraction_date": "2026-05-12",
  "entities": [
    {
      "text": "Acme Holdings LLC",
      "type": "organization",
      "confidence": "high",
      "page": 1,
      "context": "Contractor listed on award notice",
      "profile_match": "none",
      "provenance": "AI-extracted"
    },
    {
      "text": "$2,450,000",
      "type": "monetary_amount",
      "confidence": "high",
      "page": 1,
      "context": "Total contract award value",
      "provenance": "AI-extracted"
    }
  ],
  "relationships": [
    {
      "subject": "Acme Holdings LLC",
      "predicate": "awarded_contract_by",
      "object": "Department of Defense",
      "confidence": "high",
      "evidence_page": 1,
      "provenance": "AI-extracted"
    }
  ]
}

4. Batch processing discipline

When processing large document sets (10+ documents):

  1. Prioritize — sort by likely relevance (keyword match, date range, filing type) before spending API tokens.
  2. Sample first — process 3-5 representative documents to calibrate extraction prompts.
  3. Track costs — log token usage per document for budget awareness.
  4. Checkpoint — save intermediate results after every 10 documents; resume from checkpoint on failure.
  5. Deduplicate — entities appearing in multiple documents get a single canonical record with source list.
  6. Human review queue — low-confidence extractions (below medium) go to a review queue, not into the main entity set.

5. Integration with investigation workflows

This skill is typically invoked as a step within a larger investigation designed by investigation-workflow-designer:

Pattern Role of this skill
FOIA production triage Classify and extract entities from a batch of released documents
Financial disclosure analysis Extract named entities, amounts, and dates from disclosure forms
Contract review Identify parties, amounts, terms, and cross-reference against accountability profiles
Media document verification Extract text from photographed documents for verification

6. Safety and ethics

  • Never fabricate entities — if text is illegible, report (Illegible) rather than guessing.
  • Never assert relationships not in the document — if a relationship is inferred from context rather than explicitly stated, tag it as AI-inferred.
  • Respect redactions — do not attempt to reconstruct redacted text. Note redactions as (Redacted) in the extraction.
  • No PII collection — extracted entities that are private individuals (not public officials or corporate officers in their official capacity) must be flagged for human review before inclusion.
  • Legal review gate — any extraction that suggests potential criminal activity requires a note: “This extraction suggests potential legal issues. Consult qualified legal counsel before taking action.”

Cross-references

  • document-research-specialist — manual document tools (Tika, tabula, DocumentCloud)
  • investigation-workflow-designer — workflow composition using this skill
  • public-corruption-ombudsman — evidence tier definitions for cross-referencing
  • corporate-intelligence-investigator — company registry verification of extracted entities
  • knowledgebase/investigative-tools/tool-catalog/tools.yaml — tool catalog SSOT
  • knowledgebase/accountability/INVESTIGATIVE-TRAILS-PROTOCOL.md — trail protocol for profile updates
Table of Contents