Skip to main content
< All Topics
Print

Estate Document Extraction

name: estate-document-extraction

description: AI-powered extraction of structured data from estate documents (wills, trusts, deeds, financial statements) using Claude Vision API. Document classification, key field extraction, entity resolution, and confidence scoring. Use when building document intake pipelines, extracting entities from scanned legal documents, or classifying estate paperwork.

Estate Document Extraction

Instructions

Extract structured data from estate documents using AI vision and text analysis. Handle scanned PDFs, photographed documents, and digital text with appropriate extraction strategies.

Document Classification

Before extraction, classify the document into one of these categories:

Category Document Types Key Indicators
Testamentary Last Will and Testament, Codicils, Holographic Wills “Last Will”, “I bequeath”, “hereby revoke”, attestation clause
Trust Revocable Living Trust, Irrevocable Trust, SNT “Trust Agreement”, “Settlor”, “Trustee”, “Trust Estate”
Real Property Deeds, Title Insurance, Property Tax Statements “Grantor/Grantee”, “Legal Description”, parcel numbers
Financial Bank Statements, Brokerage Statements, Insurance Policies Account numbers, balances, CUSIP numbers
Court Letters Testamentary, Court Orders, Petitions Case numbers, court stamps, judge signatures
Identity Death Certificates, Birth Certificates, Marriage Certificates Vital records formatting, certificate numbers
Tax Estate Tax Returns (706), Income Tax Returns, Gift Tax Returns IRS form numbers, EIN, SSN references

Claude Vision API Integration

  • Submit document images via the Claude Messages API with type: "image" content blocks
  • For multi-page PDFs, convert each page to a PNG/JPEG and submit as a sequence of images
  • Use a structured extraction prompt that requests JSON output matching the target schema
  • Set temperature to 0 for deterministic extraction
  • For documents >20 pages, extract in batches of 5 pages with overlap context from the previous page’s extracted data

Key Field Extraction by Document Type

Wills:

  • Testator name, date of execution, jurisdiction
  • Executor nominations (primary, successor)
  • Beneficiary names and bequests (specific, residuary)
  • Trust creation provisions
  • Guardian nominations for minors
  • Signature attestation (number of witnesses, notarization)

Trust Agreements:

  • Settlor, Trustee, Successor Trustee names
  • Trust type (revocable, irrevocable, testamentary, SNT)
  • Beneficiary names and distribution provisions
  • Trust assets schedule
  • Amendment and revocation provisions
  • Governing law jurisdiction

Deeds:

  • Grantor and Grantee names
  • Legal description (metes and bounds, lot/block, or section/township/range)
  • Recording information (book, page, instrument number)
  • Consideration amount
  • Deed type (warranty, quitclaim, trust transfer)

Financial Statements:

  • Institution name and account number (last 4 digits only in extracted data)
  • Account type and ownership
  • Balance as of statement date
  • Beneficiary designations if shown

Entity Resolution

  • Normalize person names: “John A. Smith”, “John Smith”, “J. Arthur Smith” should resolve to the same entity
  • Track name variants with confidence: exact match (1.0), partial match (0.8), inferred match (0.5)
  • Cross-reference entities across documents: the “John Smith” in the will should link to the “John Smith” on the deed
  • Flag ambiguous matches for human review rather than auto-resolving

Confidence Scoring

Every extracted field gets a confidence score:

Score Meaning Action
0.95–1.0 High confidence — clearly legible, unambiguous Auto-accept
0.80–0.94 Medium confidence — legible but could be misread Flag for review
0.50–0.79 Low confidence — partially legible or ambiguous Require human verification
<0.50 Very low — illegible or contradictory Mark as unextracted, request better scan

Data Security

  • Never store full SSNs, account numbers, or EINs in extracted data — mask to last 4 digits
  • Process documents in memory; do not cache raw images on disk after extraction
  • Log extraction events (document type, page count, field count) but never log extracted content
  • All extracted data inherits the CONFIDENTIAL classification from the source document

Inputs Required

  • Document image(s): PNG, JPEG, or PDF pages as base64-encoded images
  • Document type hint (optional): if the user pre-classifies the document, skip classification step
  • Extraction schema: which fields to extract (default: all fields for the document type)
  • Existing entity list: previously resolved entities for cross-reference matching

Output Format

  • Document classification with confidence score
  • Structured JSON object with extracted fields, each annotated with confidence score and source page number
  • Entity resolution results: new entities created, existing entities matched, ambiguous matches flagged
  • Extraction summary: total fields extracted, fields requiring review, fields unextracted
  • Processing metadata: pages processed, API calls made, total processing time

Anti-Patterns

  • Storing full PII in extraction output: Always mask SSNs, account numbers, and EINs — only store last 4 digits
  • Auto-resolving ambiguous entities: When “John Smith” appears in two documents, do not assume they are the same person without corroborating evidence — flag for human review
  • Ignoring document quality: A blurry scan produces unreliable extraction — detect low image quality and request a rescan before wasting API calls
  • Extracting without classification: Different document types require different extraction schemas — always classify first
  • Treating OCR output as ground truth: AI extraction can hallucinate fields that do not exist in the source document — always include source page references so humans can verify
  • Processing sensitive documents in bulk without audit trail: Every document extraction must be logged for fiduciary compliance
Table of Contents