Chapter 21: Knowledge Bases

Last Updated: 2026-04-16

## 21.1 Overview

A knowledge base is a curated collection of documents that an AI can search over at query time to ground its responses in factual content. ITI uses three knowledge base systems:

| System | Technology | Status | Best For |
|--------|------------|--------|----------|
| Dify Knowledge Bases | Dify + pgvector | Active (new products) | Product-facing RAG; managed UI; Celery-processed |
| Direct pgvector | PostgreSQL + pgvector | Active | Custom pipelines needing direct DB access |
| Pinecone | Pinecone cloud + OpenAI embeddings | Legacy | Older products with existing vector indexes |

### Architecture decision: Dify/pgvector vs Pinecone

New products use Dify Knowledge Bases backed by pgvector. This keeps all data within the ITI Docker stack, eliminates external vector DB costs, and provides a managed UI for content ingestion.

Legacy products (AI News Cafe, Scuba GPT, My Travel Planner, GD Chatbot) retain their Pinecone indexes. The shared library includes a Pinecone API client at ITI/shared/wordpress/api-clients/class-iti-pinecone-api.php. These products also have Python embedding pipelines for Pinecone ingestion.

> Note: Do not create new Pinecone indexes for new products. Use Dify Knowledge Bases instead. Existing Pinecone indexes will be migrated to pgvector as products are updated.

## 21.2 When to Build a Knowledge Base

Build a knowledge base when:

  • A product needs to answer questions grounded in a specific document corpus (regulations, product docs, historical data).
  • The relevant information changes frequently and must be kept current.
  • The information is too large to include in a system prompt.
  • Retrieval precision matters more than generalization.

Do not build a knowledge base when:

  • The information is stable and small enough to include in a system prompt (< 2,000 tokens).
  • Claude's general knowledge is sufficient (widely known facts, common procedures).
  • The latency of RAG retrieval would be unacceptable for the use case.
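
The two lists above can be read as a rough checklist. The sketch below is only an illustration of that reading: the `CorpusProfile` fields and the `should_build_kb` function are hypothetical names, and only the 2,000-token threshold comes from this section.

```python
from dataclasses import dataclass

# Threshold taken from the guidance above; everything else is illustrative.
SYSTEM_PROMPT_TOKEN_LIMIT = 2_000


@dataclass
class CorpusProfile:
    """Hypothetical description of the content a product must answer from."""
    estimated_tokens: int             # rough size of the corpus
    changes_frequently: bool          # must the content be kept current?
    general_knowledge_suffices: bool  # would the model answer well without it?
    retrieval_latency_ok: bool        # can the use case tolerate a retrieval step?


def should_build_kb(profile: CorpusProfile) -> bool:
    """Return True when the checklist points toward a knowledge base."""
    if profile.general_knowledge_suffices:
        return False
    if not profile.retrieval_latency_ok:
        return False
    small_and_stable = (
        profile.estimated_tokens < SYSTEM_PROMPT_TOKEN_LIMIT
        and not profile.changes_frequently
    )
    # Small, stable content belongs in the system prompt instead.
    return not small_and_stable
```

A real decision will weigh these factors rather than apply them as hard rules; the function only makes the ordering of the checks explicit.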

## 21.3 Knowledge Base Design Principles

### One KB per knowledge domain

Do not put unrelated content in the same knowledge base. Mixing domains increases retrieval noise: irrelevant chunks surface in the results. Use separate KBs, for example:

| KB | Content |
|----|---------|
| iti-expat-tax-laws | Tax treaty documents, country-specific tax guides |
| iti-expat-visa-requirements | Visa and immigration documents |
| iti-travel-destinations | Destination guides, travel tips |

### Prefer quality over quantity

A KB with 50 high-quality, well-structured documents outperforms a KB with 500 poorly-formatted, redundant documents. Curate content before ingestion.
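
One small part of curation, removing redundant documents, can be sketched as follows. This is a minimal illustration that catches only documents identical after whitespace and case normalization; the function names and the file-name-to-text mapping are assumptions, and near-duplicates would need a different technique.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def drop_exact_duplicates(documents: dict[str, str]) -> dict[str, str]:
    """Keep the first document for each distinct normalized body.

    `documents` maps a (hypothetical) file name to its text.
    """
    seen: set[str] = set()
    kept: dict[str, str] = {}
    for name, text in documents.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content already kept under another name
        seen.add(digest)
        kept[name] = text
    return kept
```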

### Chunk size matches content density

See Chapter 11, Section 11.4 for chunking strategy guidelines.


## 21.4 RAG Architecture

RAG (Retrieval-Augmented Generation) is the pipeline that combines knowledge base retrieval with LLM generation:


```
User Query
    │
    ▼
Embedding Model (text-embedding-3-small)
    │  Query → 1536-dimension vector
    ▼
Vector Search (pgvector cosine similarity)
    │  Find top-K most similar document chunks
    ▼
Reranking (optional)
    │  Re-order results for precision
    ▼
Context Assembly
    │  Format chunks into readable context string
    ▼
LLM (Claude)
    │  System prompt + context + user query → response
    ▼
Response to User
```
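
The flow above can be sketched end to end in a few functions. This is a toy under stated assumptions, not the Dify implementation: `toy_embed` is a bag-of-words stand-in for the real embedding model, the search runs in memory instead of pgvector, the optional reranking step is omitted, and the LLM call is left out so the example stops at the assembled prompt. All function names are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for the embedding model: a sparse bag-of-words vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Vector search step: rank chunks by similarity to the query, keep top-K."""
    q = toy_embed(query)
    ranked = sorted(
        chunks, key=lambda c: cosine_similarity(q, toy_embed(c)), reverse=True
    )
    return ranked[:top_k]


def assemble_prompt(system_prompt: str, chunks: list[str], query: str) -> str:
    """Context assembly step: format chunks into one readable context string."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {query}"
```

Calling `retrieve` and then passing its result to `assemble_prompt` yields the string that would be sent to the LLM in the final step of the diagram.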

## 21.5 Embedding Model

ITI uses text-embedding-3-small (OpenAI) as the default embedding model. This model:

  • Produces 1536-dimension vectors
  • Balances cost and quality well for document retrieval
  • Is supported natively by Dify and pgvector

> Note: If you change the embedding model for a knowledge base, regenerate all existing embeddings by re-indexing the dataset. Vectors from different models are not comparable, so mixing them in the same KB produces meaningless similarity scores.
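
A custom pipeline can guard against mixed embeddings before writing or comparing vectors. The sketch below assumes each stored vector carries the name of the model that produced it; the `StoredEmbedding` record and `check_compatible` function are hypothetical, while the model name and dimension come from this section.

```python
from dataclasses import dataclass

EXPECTED_MODEL = "text-embedding-3-small"
EXPECTED_DIM = 1536


@dataclass(frozen=True)
class StoredEmbedding:
    """Hypothetical record: a vector plus the model that produced it."""
    model: str
    vector: tuple[float, ...]


def check_compatible(
    item: StoredEmbedding,
    model: str = EXPECTED_MODEL,
    dim: int = EXPECTED_DIM,
) -> None:
    """Raise ValueError if the embedding cannot be compared with the KB's vectors."""
    if item.model != model:
        raise ValueError(
            f"embedding from {item.model!r}, KB uses {model!r}: re-index required"
        )
    if len(item.vector) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(item.vector)}")
```

The model name is checked first because different models can share a dimension, so a length check alone would not catch every mix-up.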


## 21.6 Managing the Knowledge Base Inventory

All knowledge base content is tracked in: ITI/operations/knowledgebase-inventory.md

Before creating a new knowledge base, check this inventory — a KB covering the same topic may already exist.

### What to track per KB

| Field | Description |
|-------|-------------|
| KB Name | Descriptive name |
| Dify Dataset ID | UUID from Dify console |
| Products using it | Which products call this KB |
| Content source | Where source documents come from |
| Update frequency | How often content is refreshed |
| Last indexed | Date of most recent full index |
| Owner | Who is responsible for content quality |
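
For scripts that read or write the inventory, the fields above map naturally onto a small record type. This is a sketch only: the class name, attribute names, and the one-row-per-KB markdown layout are assumptions, since the actual format of knowledgebase-inventory.md is not specified here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class KBInventoryEntry:
    """One inventory record; attributes mirror the fields listed above."""
    name: str
    dify_dataset_id: str
    products: list[str]
    content_source: str
    update_frequency: str
    last_indexed: date
    owner: str

    def to_markdown_row(self) -> str:
        """Render the entry as a markdown table row (assumed layout)."""
        cells = [
            self.name,
            self.dify_dataset_id,
            ", ".join(self.products),
            self.content_source,
            self.update_frequency,
            self.last_indexed.isoformat(),
            self.owner,
        ]
        return "| " + " | ".join(cells) + " |"
```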

## 21.7 Keeping Knowledge Bases Current

Static KBs (one-time ingestion of stable documents): No maintenance required beyond periodic quality checks.

Dynamic KBs (frequently updated content):

  1. Establish an update cadence (weekly, monthly) based on how quickly the source content changes.
  2. When new source documents are available:
       • Add them to Dify via the UI or API.
       • Remove outdated documents.
  3. After any significant update, run Retrieval Testing to verify quality.
  4. Log the update in knowledgebase-inventory.md.
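
A cadence is only useful if something notices when it is missed. The sketch below flags a KB whose last index date is older than its cadence allows; the `CADENCE_DAYS` mapping and `is_overdue` function are hypothetical, and "monthly" is approximated as 30 days.

```python
from datetime import date, timedelta

# Assumed mapping from cadence label to a maximum age in days.
CADENCE_DAYS = {"weekly": 7, "monthly": 30}


def is_overdue(last_indexed: date, cadence: str, today: date) -> bool:
    """Return True if more time has passed since indexing than the cadence allows."""
    return today - last_indexed > timedelta(days=CADENCE_DAYS[cadence])
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the check easy to test.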

### Automating KB updates with n8n

For KBs that pull from web sources (RSS feeds, website crawls):

  1. Create an n8n workflow triggered by a Schedule Trigger node (weekly).
  2. Use an HTTP Request node to fetch the updated content.
  3. Use the Dify API to add new documents and delete outdated ones.
  4. Log the update in a Dify dataset metadata field or n8n execution notes.
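
The core of step 3 is deciding which documents to add and which to delete. A minimal sketch of that comparison is shown below, assuming each document can be identified by name and summarized by a content hash; the function name and the returned structure are hypothetical, and the actual Dify API calls are deliberately left out.

```python
def plan_sync(source: dict[str, str], indexed: dict[str, str]) -> dict[str, list[str]]:
    """Compare fetched source documents with what is already indexed.

    Both arguments map a document name to a content hash.
    """
    to_add = sorted(name for name in source if name not in indexed)
    to_delete = sorted(name for name in indexed if name not in source)
    to_replace = sorted(
        name for name in source
        if name in indexed and source[name] != indexed[name]
    )
    return {"add": to_add, "replace": to_replace, "delete": to_delete}
```

Whether this logic runs inside an n8n node or in an external script called by the workflow is left open; the plan it returns is what the Dify API step would act on.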

## 21.8 Direct pgvector Access

For custom tooling or one-off analysis, you can query the Dify embeddings directly in PostgreSQL. Connect to the dify database on iti-postgres.

> Warning: Dify’s internal table names and schemas change between versions. The example below is illustrative; verify actual table names against your running Dify version before writing queries.


```sql
-- Illustrative: find similar chunks in the Dify embedding table
-- Table names vary by Dify version; check schema first:
--   docker exec iti-postgres psql -U postgres -d dify -c "\dt"
SELECT
    id,
    content,
    document_id,
    embedding <=> '[0.1, 0.2, ...]'::vector AS distance
FROM embeddings
WHERE dataset_id = 'your-dataset-uuid'
ORDER BY distance ASC
LIMIT 5;
```

> Note: Direct SQL queries bypass Dify’s retrieval pipeline, including metadata filtering and reranking. Use the Dify API for production retrieval.
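
When running such a query from a script, the query vector has to be rendered in pgvector's text format, and it helps to know that the `<=>` operator returns cosine distance (1 minus cosine similarity), so smaller values mean closer matches. The sketch below shows both. It assumes a DB-API driver that uses `%s` placeholders, reuses the illustrative `embeddings` table from the example above, and does not open a connection; the helper names are hypothetical.

```python
import math

# Same illustrative table and columns as the SQL example above.
QUERY = """
SELECT id, content, document_id, embedding <=> %s::vector AS distance
FROM embeddings
WHERE dataset_id = %s
ORDER BY distance ASC
LIMIT %s;
"""


def to_pgvector_literal(vector: list[float]) -> str:
    """Render a list of numbers in pgvector's text input format: [x,y,z]."""
    return "[" + ",".join(repr(float(v)) for v in vector) + "]"


def build_params(query_vector: list[float], dataset_id: str, limit: int = 5) -> tuple:
    """Parameters for QUERY, in placeholder order."""
    return (to_pgvector_literal(query_vector), dataset_id, limit)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """What `<=>` computes: 1 - cosine similarity (inputs must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

Passing the vector as a parameter rather than formatting it into the SQL string keeps the query text constant across calls.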


## 21.9 Product-Local Knowledge Bases

Some products maintain knowledge base content within their own directory trees:

| Product | Location | Contents |
|---------|----------|----------|
| Scuba GPT | products/scuba-gpt/.../knowledgebase/ | 14,600+ dive sites, 6,900+ dive operators, marine life references |
| My TravelPlanner | products/my-travelplanner.com/knowledgebase/ | Destination guides, scuba data (dive-operators.json), travel topics |
| Personal Assistant | Personal/personal-assistant/knowledgebase/ | 40+ files: scuba data JSON, almanac, disambiguations, advisory content |
| Estate Manager | products/estate-manager/wordpress/knowledgebase/ | Legal, tax, and procedural documents |
| GD Chatbot | products/gd-chatbot/plugin/... | Grateful Dead historical data |

These are source files for ingestion into Dify or Pinecone — they are not live knowledge bases themselves.
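
Before ingestion it can help to list what a product's knowledgebase directory actually contains. The sketch below walks a directory and returns candidate source files; the function name and the suffix filter are assumptions (only the `.json` example comes from the table above), so adjust them to the real file types.

```python
from pathlib import Path


def list_source_files(
    root: Path, suffixes: tuple[str, ...] = (".md", ".json", ".txt")
) -> list[Path]:
    """Return files under `root` (recursively) whose suffix is in `suffixes`."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )
```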


Previous: Chapter 20 — Agents, Skills & Pipelines | Next: Chapter 22 — Safety & Guardrails
