Dify Knowledge Base Management

Posted April 21, 2026

Updated April 22, 2026

By Peter Westerman

name: dify-knowledge-base-management

description: Managing Dify knowledge bases including dataset creation, document ingestion, chunking strategy, embedding configuration, and semantic retrieval. Use when building or maintaining RAG pipelines, uploading documents to Dify, or troubleshooting retrieval quality.

Dify Knowledge Base Management

Instructions

Manage the full Dify knowledge base lifecycle for retrieval-augmented generation.

Dataset creation:

Create datasets via the Console API or the UI, with descriptive names and clear descriptions
Supported document formats: md, txt, csv, json, pdf, docx, xlsx, html, xml
One dataset per logical domain (e.g., “product-docs”, “support-tickets”, “policies”)
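A pre-upload check can catch unsupported files before they reach a dataset. The sketch below is illustrative only: the function names and the files_by_domain layout are assumptions, and the extension list is copied from the formats named above.

```python
from pathlib import Path

# Formats copied from the supported-format list above.
SUPPORTED_EXTENSIONS = {"md", "txt", "csv", "json", "pdf", "docx", "xlsx", "html", "xml"}


def is_supported(path: str) -> bool:
    """Return True if the file extension is one of the supported formats."""
    return Path(path).suffix.lower().lstrip(".") in SUPPORTED_EXTENSIONS


def plan_uploads(files_by_domain: dict[str, list[str]]) -> dict[str, dict[str, list[str]]]:
    """For each logical domain (one dataset each), split files into accepted and rejected."""
    plan = {}
    for domain, files in files_by_domain.items():
        plan[domain] = {
            "accepted": [f for f in files if is_supported(f)],
            "rejected": [f for f in files if not is_supported(f)],
        }
    return plan


# Hypothetical file names, grouped by the dataset they would be uploaded to.
plan = plan_uploads({
    "product-docs": ["guide.md", "slides.pptx"],
    "policies": ["handbook.PDF"],
})
```

Rejected files can then be converted to a supported format or dropped before ingestion.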

Chunking strategy:

Use automatic mode (recommended) — Dify handles paragraph and section splitting
For custom mode: set a maximum chunk size of 500-1000 tokens with a 50-100 token overlap
Pre-clean documents before upload: strip headers/footers, fix encoding, remove boilerplate
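To make the size and overlap settings concrete, here is a minimal sketch of pre-cleaning and fixed-size chunking with overlap. It is not Dify's own splitter: it approximates tokens by whitespace-separated words, and the function names and default values are assumptions chosen from the ranges above.

```python
import re


def pre_clean(text: str) -> str:
    """Normalize line endings, strip trailing whitespace, collapse runs of blank lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def chunk_tokens(tokens: list, max_size: int = 800, overlap: int = 80) -> list[list]:
    """Split tokens into windows of at most max_size that share `overlap` tokens."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    if not 0 <= overlap < max_size:
        raise ValueError("overlap must be in [0, max_size)")
    step = max_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_size])
        if start + max_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def chunk_text(text: str, max_size: int = 800, overlap: int = 80) -> list[str]:
    """Clean the text, then chunk it using words as a rough stand-in for tokens."""
    words = pre_clean(text).split()
    return [" ".join(chunk) for chunk in chunk_tokens(words, max_size, overlap)]
```

Real token counts depend on the embedding model's tokenizer, so word-based sizes are only a rough guide when choosing values.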

Embedding model selection:

Use text-embedding-3-small as the embedding model, with vectors stored in pgvector, for cost-effective semantic search
Embedding dimensions: 1536 (default) — sufficient for most retrieval tasks
pgvector stores embeddings in PostgreSQL with HNSW or IVFFlat indexing
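For reference, the two pgvector index types can be created with statements like the ones generated below. This is a sketch under assumptions: the table and column names are hypothetical (inspect your own schema for the real ones), cosine distance is assumed as the operator class, and Dify may already create an index for you.

```python
def build_index_sql(table: str, column: str, method: str = "hnsw", lists: int = 100) -> str:
    """Return a pgvector CREATE INDEX statement for cosine-distance search."""
    # Identifiers are interpolated into SQL, so accept only plain names.
    for name in (table, column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    index_name = f"{table}_{column}_{method}_idx"
    if method == "hnsw":
        return (
            f"CREATE INDEX IF NOT EXISTS {index_name} ON {table} "
            f"USING hnsw ({column} vector_cosine_ops);"
        )
    if method == "ivfflat":
        return (
            f"CREATE INDEX IF NOT EXISTS {index_name} ON {table} "
            f"USING ivfflat ({column} vector_cosine_ops) WITH (lists = {int(lists)});"
        )
    raise ValueError("method must be 'hnsw' or 'ivfflat'")


# "chunks" and "embedding" are placeholder names, not Dify's actual schema.
hnsw_sql = build_index_sql("chunks", "embedding")
ivfflat_sql = build_index_sql("chunks", "embedding", method="ivfflat", lists=100)
```

HNSW generally gives better query speed and recall at the cost of slower builds and more memory, while IVFFlat builds faster and works best when created after the table already holds data.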

Semantic retrieval testing:

Test queries against the dataset before wiring into applications
Evaluate top-k results for relevance; reduce chunk size if results are too broad, and increase it if they are too narrow to carry enough context
Use similarity score thresholds to filter low-confidence matches
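The testing loop above can be automated with a small harness. The sketch below assumes each result is a dict with "score" and "content" keys and that a search callable returns results sorted by score; both are illustrative assumptions, as are the stub data and the 0.5 threshold.

```python
from typing import Callable


def filter_by_score(results: list[dict], threshold: float) -> list[dict]:
    """Drop results whose similarity score is below the threshold."""
    return [r for r in results if r["score"] >= threshold]


def hit_rate(
    test_cases: list[tuple[str, str]],
    search: Callable[[str], list[dict]],
    top_k: int = 5,
    threshold: float = 0.5,
) -> float:
    """Fraction of queries whose expected snippet appears in the filtered top-k results."""
    if not test_cases:
        return 0.0
    hits = 0
    for query, expected in test_cases:
        results = filter_by_score(search(query)[:top_k], threshold)
        if any(expected.lower() in r["content"].lower() for r in results):
            hits += 1
    return hits / len(test_cases)


def fake_search(query: str) -> list[dict]:
    """Stand-in for a real retrieval call; returns fixed, made-up results."""
    return [
        {"score": 0.9, "content": "Refunds are issued within 14 days."},
        {"score": 0.2, "content": "Shipping takes 3 days."},
    ]


cases = [("refund window", "14 days"), ("shipping time", "3 days")]
rate = hit_rate(cases, fake_search, top_k=5, threshold=0.5)
```

Running the same test cases before and after a chunking change gives a simple way to compare configurations.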

Service API for runtime retrieval:

POST /v1/datasets/{dataset_id}/retrieve with Bearer token authentication
Request body: {"query": "search text", "retrieval_model": {"search_method": "semantic_search", "top_k": 5}}
Use dataset-scoped API tokens, not user-level tokens
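The retrieval call described above could be assembled as follows using only the standard library. The base URL, dataset ID, and token shown are placeholders, and the helper names are assumptions; the path and request body follow the lines above.

```python
import json
import urllib.request


def build_retrieve_request(
    base_url: str, dataset_id: str, api_token: str, query: str, top_k: int = 5
) -> urllib.request.Request:
    """Build (but do not send) the POST request for dataset retrieval."""
    payload = {
        "query": query,
        "retrieval_model": {"search_method": "semantic_search", "top_k": top_k},
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/datasets/{dataset_id}/retrieve",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def retrieve(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder values; substitute your own host, dataset ID, and dataset-scoped token.
request = build_retrieve_request("http://localhost", "abc123", "dataset-token", "refund policy")
```

Calling retrieve(request) sends it to a live instance; keeping the build step separate makes the payload easy to inspect and test offline.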

Console API authentication workaround:

Generate tokens via docker exec into the dify-api container when the Console API is unreliable
Direct DB token creation: insert dataset-scoped API tokens into the api_tokens table
Scope tokens to specific datasets to follow least-privilege principles

Maintenance:

Re-index after bulk document updates or embedding model changes
Monitor retrieval latency — slow queries may indicate missing pgvector indexes
Archive stale datasets rather than deleting them, to preserve the audit trail
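Latency monitoring can start as a simple timing wrapper around whatever function issues the retrieval query. This is a sketch with assumed names; the 1-second slow threshold is an arbitrary example, not a Dify default.

```python
import math
import statistics
import time
from typing import Callable


def measure_latencies(run_query: Callable[[str], object], queries: list[str]) -> list[float]:
    """Time each query call and return the elapsed seconds per query."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies: list[float], slow_threshold_s: float = 1.0) -> dict:
    """Report median, nearest-rank p95, and the number of calls over the threshold."""
    if not latencies:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "slow": sum(1 for value in ordered if value > slow_threshold_s),
    }
```

A rising p95 or slow count after a bulk upload is a reasonable cue to check whether the pgvector index exists and is being used.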
