Chapter 21: Knowledge Bases

Posted April 21, 2026

Updated April 22, 2026

By Peter Westerman

Last Updated: 2026-04-16

## 21.1 Overview

A knowledge base is a curated collection of documents that an AI system searches at query time to ground its responses in factual content. ITI uses three knowledge base systems.

### Architecture decision: Dify/pgvector vs Pinecone

New products use Dify Knowledge Bases backed by pgvector. This keeps all data within the ITI Docker stack, eliminates external vector DB costs, and provides a managed UI for content ingestion.

Legacy products (AI News Cafe, Scuba GPT, My Travel Planner, GD Chatbot) retain their Pinecone indexes. The shared library includes a Pinecone API client at `ITI/shared/wordpress/api-clients/class-iti-pinecone-api.php`. These products also have Python embedding pipelines for Pinecone ingestion.

> Note: Do not create new Pinecone indexes for new products. Use Dify Knowledge Bases instead. Existing Pinecone indexes will be migrated to pgvector as products are updated.

## 21.2 When to Build a Knowledge Base

Build a knowledge base when:

- A product needs to answer questions grounded in a specific document corpus (regulations, product docs, historical data).
- The relevant information changes frequently and must be kept current.
- The information is too large to include in a system prompt.
- Retrieval precision matters more than generalization.

Do not build a knowledge base when:

- The information is stable and small enough to include in a system prompt (< 2,000 tokens).
- Claude's general knowledge is sufficient (widely known facts, common procedures).
- The latency of RAG retrieval would be unacceptable for the use case.
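The size criterion above can be checked roughly before deciding. The sketch below is a heuristic, not a tokenizer: it assumes about four characters per token for English text, and the names `estimate_tokens` and `fits_in_system_prompt` are hypothetical helpers, not part of any ITI library.

```python
import math

SYSTEM_PROMPT_TOKEN_LIMIT = 2_000  # threshold taken from the guideline above


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. Assumes ~4 characters per token (heuristic only)."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_system_prompt(text: str, limit: int = SYSTEM_PROMPT_TOKEN_LIMIT) -> bool:
    """True if the estimated token count is below the system prompt limit."""
    return estimate_tokens(text) < limit
```

For a real decision, count tokens with the tokenizer of the model in use; the estimate is only good enough to separate clearly small content from clearly large content.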

## 21.3 Knowledge Base Design Principles

### One KB per knowledge domain

Do not put unrelated content in the same knowledge base. Mixing domains increases retrieval noise: irrelevant chunks appear in the results. Use separate KBs:

| KB | Content |
|---|---|
| `iti-expat-tax-laws` | Tax treaty documents, country-specific tax guides |
| `iti-expat-visa-requirements` | Visa and immigration documents |
| `iti-travel-destinations` | Destination guides, travel tips |
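One way to keep domains apart at query time is an explicit lookup from domain to KB, so a query never falls through to an unrelated dataset. This is a minimal sketch: the KB names come from the table above, while the domain keys (`"tax"`, `"visa"`, `"travel"`) and the function `kb_for_domain` are hypothetical.

```python
# Domain keys are illustrative assumptions; KB names are from the table above.
KB_BY_DOMAIN = {
    "tax": "iti-expat-tax-laws",
    "visa": "iti-expat-visa-requirements",
    "travel": "iti-travel-destinations",
}


def kb_for_domain(domain: str) -> str:
    """Return the KB name for a domain, failing loudly for unknown domains."""
    try:
        return KB_BY_DOMAIN[domain]
    except KeyError:
        raise ValueError(f"No knowledge base registered for domain {domain!r}") from None
```

Raising on an unknown domain, rather than defaulting to some KB, avoids silently retrieving from the wrong corpus.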

### Prefer quality over quantity

A KB with 50 high-quality, well-structured documents outperforms a KB with 500 poorly-formatted, redundant documents. Curate content before ingestion.
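Part of curation is removing redundant documents before ingestion. The sketch below flags exact duplicates after whitespace and case normalization. It is an assumption about what a pre-ingestion check could look like, not an existing ITI tool, and it does not detect near-duplicates with different wording.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_duplicates(docs: dict[str, str]) -> dict[str, str]:
    """Map each duplicate document name to the first document with the same normalized text."""
    seen: dict[str, str] = {}        # content hash -> first document name
    duplicates: dict[str, str] = {}  # duplicate name -> original name
    for name, text in docs.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates[name] = seen[digest]
        else:
            seen[digest] = name
    return duplicates
```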

### Chunk size matches content density

See Chapter 11, Section 11.4 for chunking strategy guidelines.

## 21.4 RAG Architecture

RAG (Retrieval-Augmented Generation) is the pipeline that combines knowledge base retrieval with LLM generation:


```text
User Query
    │
    ▼
Embedding Model (text-embedding-3-small)
    │  Query → 1536-dimension vector
    ▼
Vector Search (pgvector cosine similarity)
    │  Find top-K most similar document chunks
    ▼
Reranking (optional)
    │  Re-order results for precision
    ▼
Context Assembly
    │  Format chunks into readable context string
    ▼
LLM (Claude)
    │  System prompt + context + user query → response
    ▼
Response to User
```
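The stages in the pipeline above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions: `embed` is a bag-of-words stand-in for the real embedding model, the search is an in-memory loop rather than pgvector, reranking is omitted, and the LLM call is left out. All function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Vector search step: return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(q, embed(c)), reverse=True)
    return ranked[:top_k]


def assemble_prompt(system_prompt: str, chunks: list[str], query: str) -> str:
    """Context assembly step: format retrieved chunks into a single prompt string."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {query}"
```

In production the first three stages are handled by Dify; the sketch only makes the data flow between stages concrete.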

## 21.5 Embedding Model

ITI uses text-embedding-3-small (OpenAI) as the default embedding model. This model:

- Produces 1536-dimension vectors
- Balances cost and quality well for document retrieval
- Is supported natively by Dify and pgvector

> Note: If the embedding model is changed for a knowledge base, all existing embeddings must be regenerated (re-index the dataset). Mixing embeddings from different models in the same KB produces incorrect similarity scores.
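Custom tooling can guard against mixed embeddings by storing the model name alongside each vector and refusing to compare across models. The sketch below assumes a hypothetical `Embedding` record; Dify's own storage format is not shown here and may differ.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Embedding:
    model: str                 # e.g. "text-embedding-3-small"
    vector: tuple[float, ...]


def cosine_distance(a: Embedding, b: Embedding) -> float:
    """Cosine distance (1 - cosine similarity); rejects cross-model comparisons."""
    if a.model != b.model:
        raise ValueError(f"Embedding models differ: {a.model!r} vs {b.model!r}")
    if len(a.vector) != len(b.vector):
        raise ValueError("Embedding dimensions differ")
    norm = math.hypot(*a.vector) * math.hypot(*b.vector)
    if norm == 0:
        raise ValueError("Cannot compare a zero vector")
    dot = sum(x * y for x, y in zip(a.vector, b.vector))
    return 1.0 - dot / norm
```

Note that two models can share a dimension count, so the dimension check alone is not sufficient; the model name comparison is what catches that case.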

## 21.6 Managing the Knowledge Base Inventory

All knowledge base content is tracked in `ITI/operations/knowledgebase-inventory.md`.

Before creating a new knowledge base, check this inventory; a KB covering the same topic may already exist.

### What to track per KB

| Field | Description |
|---|---|
| KB Name | Descriptive name |
| Dify Dataset ID | UUID from Dify console |
| Products using it | Which products call this KB |
| Content source | Where source documents come from |
| Update frequency | How often content is refreshed |
| Last indexed | Date of most recent full index |
| Owner | Who is responsible for content quality |
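The fields above map naturally onto a small record type. The sketch below is one possible representation, assuming the inventory is kept as a Markdown table with one row per KB; the actual layout of the inventory file is not specified here, and `KBInventoryEntry` is a hypothetical name.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class KBInventoryEntry:
    kb_name: str
    dify_dataset_id: str      # UUID from the Dify console
    products: list[str]       # products that call this KB
    content_source: str
    update_frequency: str     # e.g. "weekly", "monthly"
    last_indexed: date
    owner: str

    def to_markdown_row(self) -> str:
        """Render the entry as one Markdown table row (assumed inventory format)."""
        cells = [
            self.kb_name,
            self.dify_dataset_id,
            ", ".join(self.products),
            self.content_source,
            self.update_frequency,
            self.last_indexed.isoformat(),
            self.owner,
        ]
        return "| " + " | ".join(cells) + " |"
```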

## 21.7 Keeping Knowledge Bases Current

Static KBs (one-time ingestion of stable documents): No maintenance required beyond periodic quality checks.

Dynamic KBs (frequently updated content):

1. Establish an update cadence (weekly, monthly) based on how quickly the source content changes.
2. When new source documents are available:
   - Add them to Dify via the UI or API.
   - Remove outdated documents.
3. After any significant update, run Retrieval Testing to verify quality.
4. Log the update in `knowledgebase-inventory.md`.
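A cadence is only useful if something checks it. The sketch below compares a KB's last indexed date against its cadence to decide whether an update is due. The day counts for each cadence (7 and 30) and the function name are assumptions for illustration.

```python
from datetime import date, timedelta

# Assumed day counts for the cadences named above.
CADENCE_DAYS = {"weekly": 7, "monthly": 30}


def is_due_for_update(last_indexed: date, cadence: str, today: date) -> bool:
    """True if at least one full cadence period has passed since the last index."""
    return today - last_indexed >= timedelta(days=CADENCE_DAYS[cadence])
```

Passing `today` explicitly, instead of calling `date.today()` inside the function, keeps the check deterministic and easy to test.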

### Automating KB updates with n8n

For KBs that pull from web sources (RSS feeds, website crawls):

1. Create an n8n workflow triggered by a Schedule Trigger node (weekly).
2. Use an HTTP Request node to fetch the updated content.
3. Use the Dify API to add new documents and delete outdated ones.
4. Log the update in a Dify dataset metadata field or n8n execution notes.
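The add-and-delete step in the workflow above comes down to comparing what the source currently contains with what is already indexed. The sketch below shows that comparison only, assuming each side can be summarized as a mapping from document name to content hash. It makes no Dify API calls, and `plan_sync` is a hypothetical helper.

```python
def plan_sync(source: dict[str, str], indexed: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare document name -> content hash mappings.

    Returns (to_add, to_delete). A document whose content changed appears in
    both lists: the stale copy is deleted and the new version is added.
    """
    to_add = sorted(name for name, digest in source.items() if indexed.get(name) != digest)
    to_delete = sorted(name for name, digest in indexed.items() if source.get(name) != digest)
    return to_add, to_delete
```

The two lists can then drive the add and delete requests in the workflow, leaving unchanged documents untouched so they are not re-embedded unnecessarily.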

## 21.8 Direct pgvector Access

For custom tooling or one-off analysis, you can query the Dify embeddings directly in PostgreSQL. Connect to the `dify` database on `iti-postgres`.

> Warning: Dify’s internal table names and schemas change between versions. The example below is illustrative; verify actual table names against your running Dify version before writing queries.


```sql
-- Illustrative: find similar chunks in the Dify embedding table
-- Table names vary by Dify version; check schema first:
--   docker exec iti-postgres psql -U postgres -d dify -c "\dt"
-- <=> is pgvector's cosine distance operator; smaller means more similar.
SELECT
    id,
    content,
    document_id,
    embedding <=> '[0.1, 0.2, ...]'::vector AS distance
FROM embeddings
WHERE dataset_id = 'your-dataset-uuid'
ORDER BY distance ASC
LIMIT 5;
```

> Note: Direct SQL queries bypass Dify’s retrieval pipeline, including metadata filtering and reranking. Use the Dify API for production retrieval.
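When running such a query from a script, the query vector has to be serialized into pgvector's text format and passed as a parameter instead of being pasted into the SQL string. The sketch below builds the parameterized statement and its arguments without connecting to a database. It reuses the illustrative `embeddings` table and column names from the example above, which may not match your Dify version, and it assumes a driver that uses `%s` placeholders (as psycopg does).

```python
def to_pgvector_literal(vector: list[float]) -> str:
    """Serialize a vector into pgvector's text input format, e.g. "[0.1,0.2]"."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


# Table and column names are illustrative; verify them against your schema.
SIMILAR_CHUNKS_SQL = """
SELECT id, content, document_id, embedding <=> %s::vector AS distance
FROM embeddings
WHERE dataset_id = %s
ORDER BY distance ASC
LIMIT %s
"""


def build_query(vector: list[float], dataset_id: str, limit: int = 5) -> tuple[str, tuple]:
    """Return the SQL text and its parameters, ready for cursor.execute(sql, params)."""
    return SIMILAR_CHUNKS_SQL, (to_pgvector_literal(vector), dataset_id, limit)
```

The query vector must come from the same embedding model used to index the dataset, otherwise the distances are not meaningful.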

## 21.9 Product-Local Knowledge Bases

Some products maintain knowledge base content within their own directory trees:

| Product | Location | Contents |
|---|---|---|
| Scuba GPT | `products/scuba-gpt/.../knowledgebase/` | 14,600+ dive sites, 6,900+ dive operators, marine life references |
| My TravelPlanner | `products/my-travelplanner.com/knowledgebase/` | Destination guides, scuba data (dive-operators.json), travel topics |
| Personal Assistant | `Personal/personal-assistant/knowledgebase/` | 40+ files: scuba data JSON, almanac, disambiguations, advisory content |
| Estate Manager | `products/estate-manager/wordpress/knowledgebase/` | Legal, tax, and procedural documents |
| GD Chatbot | `products/gd-chatbot/plugin/...` | Grateful Dead historical data |

These are source files for ingestion into Dify or Pinecone — they are not live knowledge bases themselves.
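Before ingesting from one of these directories, it helps to enumerate what is actually there. The sketch below walks a directory tree and lists candidate source files. The set of file suffixes is an assumption (the table above mentions JSON files but does not specify the other formats), and `list_source_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed set of ingestible file types; adjust to the product's actual content.
INGESTIBLE_SUFFIXES = {".md", ".txt", ".json"}


def list_source_files(root: Path) -> list[Path]:
    """Recursively list files under root whose suffix is in INGESTIBLE_SUFFIXES."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in INGESTIBLE_SUFFIXES
    )
```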

Previous: Chapter 20 — Agents, Skills & Pipelines | Next: Chapter 22 — Safety & Guardrails
