
Dify Knowledge Base Management

name: dify-knowledge-base-management

description: Managing Dify knowledge bases including dataset creation, document ingestion, chunking strategy, embedding configuration, and semantic retrieval. Use when building or maintaining RAG pipelines, uploading documents to Dify, or troubleshooting retrieval quality.


Instructions

Manage the full Dify knowledge base lifecycle for retrieval-augmented generation.

Dataset creation:

  • Create datasets via the Console API or the UI, with descriptive names and clear descriptions
  • Supported document formats: md, txt, csv, json, pdf, docx, xlsx, html, xml
  • One dataset per logical domain (e.g., “product-docs”, “support-tickets”, “policies”)
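
As a minimal sketch of the rules above, the following helper checks file extensions against the supported list and groups files by target dataset before upload. The function names and the filename-to-domain mapping are hypothetical and are not part of Dify.

```python
from pathlib import Path

# Formats listed above as supported for upload.
SUPPORTED_EXTENSIONS = {"md", "txt", "csv", "json", "pdf", "docx", "xlsx", "html", "xml"}


def is_supported(filename: str) -> bool:
    """Return True if the file extension is in the supported list (case-insensitive)."""
    return Path(filename).suffix.lower().lstrip(".") in SUPPORTED_EXTENSIONS


def group_by_dataset(files: dict[str, str]) -> dict[str, list[str]]:
    """Group supported files by target dataset.

    `files` maps filename -> logical domain (a hypothetical planning structure).
    Unsupported files are skipped.
    """
    grouped: dict[str, list[str]] = {}
    for filename, domain in files.items():
        if is_supported(filename):
            grouped.setdefault(domain, []).append(filename)
    return grouped
```

For example, `group_by_dataset({"faq.md": "product-docs", "tool.exe": "product-docs"})` keeps only `faq.md` under "product-docs".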

Chunking strategy:

  • Use automatic mode (recommended): Dify handles paragraph and section splitting
  • For custom mode, set the maximum chunk size to 500-1000 tokens with a 50-100 token overlap
  • Pre-clean documents before upload: strip headers/footers, fix encoding, remove boilerplate
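
The custom settings and pre-cleaning step above can be illustrated with a rough sketch. It is not Dify's own splitter. It assumes whitespace-separated words as a stand-in for tokens, and treats bare "Page N" or number-only lines as footer residue, which may not suit every document.

```python
import re


def clean_text(text: str) -> str:
    """Normalize line endings, drop bare page-number lines, collapse blank-line runs."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [ln.rstrip() for ln in text.split("\n")]
    # Assumption: a line holding only "Page 3" or "3" is a header/footer remnant.
    lines = [ln for ln in lines if not re.fullmatch(r"\s*(Page\s+)?\d+\s*", ln)]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def chunk_tokens(tokens: list[str], max_size: int = 800, overlap: int = 80) -> list[list[str]]:
    """Split tokens into windows of at most max_size, each sharing `overlap` tokens with the previous one."""
    if max_size <= 0 or overlap < 0 or overlap >= max_size:
        raise ValueError("require 0 <= overlap < max_size")
    step = max_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_size])
        if start + max_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks
```

A typical call would be `chunk_tokens(clean_text(raw).split(), max_size=800, overlap=80)`, with the sizes tuned after retrieval testing.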

Embedding model selection:

  • Use text-embedding-3-small as the embedding model, with pgvector as the vector store, for cost-effective semantic search
  • Embedding dimensions: 1536 (default) — sufficient for most retrieval tasks
  • pgvector stores embeddings in PostgreSQL with HNSW or IVFFlat indexing
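
To make the two index types concrete, here is a small sketch that builds pgvector index statements. The `CREATE INDEX ... USING hnsw` and `USING ivfflat ... WITH (lists = N)` forms are standard pgvector syntax. The table and column names are hypothetical, since the actual schema Dify uses is not described here and Dify may manage these indexes itself.

```python
def index_ddl(table: str, column: str, method: str = "hnsw", lists: int = 100) -> str:
    """Build a pgvector index statement using cosine distance.

    `table` and `column` are placeholders for illustration, not Dify's real schema.
    """
    if not (table.isidentifier() and column.isidentifier()):
        raise ValueError("table and column must be plain identifiers")
    name = f"{table}_{column}_{method}"
    if method == "hnsw":
        return f"CREATE INDEX IF NOT EXISTS {name} ON {table} USING hnsw ({column} vector_cosine_ops);"
    if method == "ivfflat":
        # IVFFlat needs a list count; it is best built after the table has data.
        return (
            f"CREATE INDEX IF NOT EXISTS {name} ON {table} "
            f"USING ivfflat ({column} vector_cosine_ops) WITH (lists = {lists});"
        )
    raise ValueError(f"unknown index method: {method}")
```

HNSW generally gives better recall at query time at the cost of slower builds and more memory, while IVFFlat builds faster; which one fits depends on dataset size and update frequency.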

Semantic retrieval testing:

  • Test queries against the dataset before wiring into applications
  • Evaluate top-k results for relevance; reduce chunk size if results are too broad, and increase it if they are too narrow to carry useful context
  • Use similarity score thresholds to filter low-confidence matches
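
The threshold and top-k evaluation above can be sketched offline on retrieved results. The `Hit` type and the 0.5 default threshold are illustrative assumptions; a suitable threshold depends on the embedding model and the data.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """A retrieved chunk and its similarity score (hypothetical shape for illustration)."""
    content: str
    score: float


def filter_hits(hits: list[Hit], top_k: int = 5, score_threshold: float = 0.5) -> list[Hit]:
    """Drop hits below the threshold, then keep the top_k highest-scoring ones."""
    kept = [h for h in hits if h.score >= score_threshold]
    kept.sort(key=lambda h: h.score, reverse=True)
    return kept[:top_k]
```

Running a fixed set of test queries through a filter like this before and after a chunking change gives a simple way to compare retrieval quality.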

Service API for runtime retrieval:

  • POST /v1/datasets/{dataset_id}/retrieve with Bearer token authentication
  • Request body: {"query": "search text", "retrieval_model": {"search_method": "semantic_search", "top_k": 5}}
  • Use dataset-scoped API tokens, not user-level tokens
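
A minimal sketch of building this request with the Python standard library follows. The endpoint path, request body, and Bearer scheme come from the list above; the base URL, dataset ID, and token are placeholders supplied by the caller.

```python
import json
import urllib.request


def build_retrieve_request(
    base_url: str, dataset_id: str, api_token: str, query: str, top_k: int = 5
) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/datasets/{dataset_id}/retrieve."""
    url = f"{base_url.rstrip('/')}/v1/datasets/{dataset_id}/retrieve"
    body = {
        "query": query,
        "retrieval_model": {"search_method": "semantic_search", "top_k": top_k},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",  # dataset-scoped token
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(req)` and the response decoded with `json.load`; the response shape is not covered here, so inspect it against your own deployment.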

Console API authentication workaround:

  • Generate tokens via docker exec into the dify-api container when the Console API is unreliable
  • Direct DB token creation: insert dataset-scoped API tokens into the api_tokens table
  • Scope tokens to specific datasets to follow least-privilege principles

Maintenance:

  • Re-index after bulk document updates or embedding model changes
  • Monitor retrieval latency — slow queries may indicate missing pgvector indexes
  • Archive stale datasets rather than deleting them, to preserve the audit trail
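
One way to watch retrieval latency is to time a fixed set of test queries and flag the slow ones. This is a sketch only: `retrieve` is any callable you supply (for example, a wrapper around the Service API call), and the 500 ms default is an arbitrary assumption, not a Dify recommendation.

```python
import time
from statistics import median
from typing import Callable, Iterable


def time_queries(
    retrieve: Callable[[str], object], queries: Iterable[str], slow_ms: float = 500.0
) -> dict:
    """Time each query and report the median latency plus queries slower than slow_ms."""
    timings: dict[str, float] = {}
    for q in queries:
        start = time.perf_counter()
        retrieve(q)
        timings[q] = (time.perf_counter() - start) * 1000.0
    slow = [q for q, ms in timings.items() if ms > slow_ms]
    return {
        "median_ms": median(timings.values()) if timings else 0.0,
        "slow": slow,
    }
```

If the slow list grows after a bulk upload, checking that the pgvector index still exists is a reasonable first step.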