
Pinecone Embedding Management

name: pinecone-embedding-management

description: Manage Pinecone vector index lifecycles including content auditing, chunking strategies, embedding generation via OpenAI, batch upsert, index verification, and metadata schema design. Covers multi-source embedding pipelines (PDFs, markdown knowledgebases, almanac files, structured data), gap analysis against existing vectors, ID prefix conventions, and incremental update workflows. Use when embedding new knowledgebase content into Pinecone, auditing an existing index for coverage gaps, designing metadata schemas for filtered retrieval, building chunking pipelines for RAG, or verifying upsert integrity.


Instructions

Manage Pinecone vector indexes for RAG-powered applications. This skill covers the full lifecycle from content audit through embedding generation, upsert, and verification.

1. Index Audit — What’s Already Embedded?

Before adding content, always audit the existing index:


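import os

from pinecone import Pinecone

# Credentials and index name come from the environment (see section 10)
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
INDEX_NAME = os.environ.get("PINECONE_INDEX_NAME", "scubagpt-1536")

pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index(INDEX_NAME)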

# 1. Get total stats
stats = index.describe_index_stats()
print(f"Total vectors: {stats.total_vector_count}")
print(f"Dimension: {stats.dimension}")
print(f"Namespaces: {dict(stats.namespaces)}")

# 2. Sample by ID prefix to understand vector sources
#    (prefixes follow the conventions in section 4; index.list() is for serverless indexes)
prefixes = ["scubagpt-", "almanac-", "topic-", "dest-"]
for prefix in prefixes:
    count = sum(len(batch) for batch in index.list(prefix=prefix))
    print(f"  {prefix}: {count}")

# 3. Query to inspect metadata shapes
from openai import OpenAI
oai = OpenAI()  # with no api_key argument, the client reads OPENAI_API_KEY from the environment
emb = oai.embeddings.create(model="text-embedding-3-small", input=["sample query"])
results = index.query(vector=emb.data[0].embedding, top_k=5, include_metadata=True)
for m in results.matches:
    print(f"  {m.id}: keys={list(m.metadata.keys())}")

Gap Analysis Checklist

| Check | Method |
| --- | --- |
| Which content directories have vectors? | Sample by ID prefix |
| Which are missing? | Cross-reference KB file list vs. prefix counts |
| Were manifests generated but never upserted? | Check output/*.json for manifests, fetch IDs from Pinecone |
| What metadata shapes exist? | Query + inspect .metadata.keys() |
| Are pre-existing vectors from a different pipeline? | Compare metadata schemas |
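
The "generated but never upserted" check can be sketched as a set difference between the IDs a manifest lists and the IDs the index reports. This is a minimal sketch, not the reference pipeline: `find_missing_ids` is a hypothetical helper, the manifest is assumed to have the shape shown in section 9, and `listed_batches` stands in for the ID batches that `index.list(prefix=...)` yields.

```python
from typing import Iterable, List


def find_missing_ids(manifest: dict, listed_batches: Iterable[Iterable[str]]) -> List[str]:
    """Return manifest vector IDs that are absent from the index listing.

    `listed_batches` is an iterable of ID batches; it is assumed to have the
    same shape as what `index.list(prefix=...)` yields.
    """
    present = {vid for batch in listed_batches for vid in batch}
    expected = [v["id"] for v in manifest.get("vectors", [])]
    # Preserve manifest order so the missing vectors can be re-upserted in sequence.
    return [vid for vid in expected if vid not in present]


# In-memory stand-ins; no Pinecone call is made here.
manifest = {"vectors": [{"id": "topic-aaa"}, {"id": "topic-bbb"}, {"id": "topic-ccc"}]}
listed = [["topic-aaa"], ["topic-ccc", "topic-zzz"]]
missing = find_missing_ids(manifest, listed)
print(missing)
```

IDs that appear in the listing but not in any manifest (such as `topic-zzz` here) may point to vectors written by a different pipeline, which is the last check in the table.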

2. Content Chunking Strategies

Markdown (KB topics, destinations, almanac)

Split on heading boundaries, then apply paragraph-level chunking with overlap:


import re
import hashlib

MAX_CHUNK_TOKENS = 800
OVERLAP_TOKENS = 200

def estimate_tokens(text):
    return len(text) // 4

def chunk_by_sections(content):
    """Split markdown at ## headings, then chunk each section."""
    sections = re.split(r"\n(?=##\s)", content)
    all_chunks = []

    for section in sections:
        section = section.strip()
        if not section or estimate_tokens(section) < 50:
            continue

        paragraphs = re.split(r"\n\s*\n", section)
        current = []
        current_tokens = 0

        for para in paragraphs:
            para = para.strip()
            if not para:
                continue
            pt = estimate_tokens(para)

            if current_tokens + pt > MAX_CHUNK_TOKENS and current:
                all_chunks.append("\n\n".join(current))
                # Overlap: carry trailing paragraphs into next chunk
                overlap_paras = []
                overlap_t = 0
                for p in reversed(current):
                    t = estimate_tokens(p)
                    if overlap_t + t > OVERLAP_TOKENS:
                        break
                    overlap_paras.insert(0, p)
                    overlap_t += t
                current = overlap_paras
                current_tokens = overlap_t

            current.append(para)
            current_tokens += pt

        if current:
            all_chunks.append("\n\n".join(current))

    return all_chunks
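
One limitation of the function above is that it always appends a paragraph whole, so a single paragraph larger than MAX_CHUNK_TOKENS becomes an oversized chunk. A possible pre-processing step, sketched here under the same 4-characters-per-token estimate, splits such paragraphs at sentence boundaries first. `split_long_paragraph` is a hypothetical helper, not part of the reference pipelines.

```python
import re

MAX_CHUNK_TOKENS = 800


def estimate_tokens(text):
    # Same rough heuristic as the chunker: about 4 characters per token.
    return len(text) // 4


def split_long_paragraph(para, max_tokens=MAX_CHUNK_TOKENS):
    """Split one paragraph into pieces whose estimated size is <= max_tokens.

    Splits after sentence-ending punctuation followed by whitespace. A single
    sentence longer than the limit is kept whole rather than cut mid-sentence.
    """
    if estimate_tokens(para) <= max_tokens:
        return [para]

    sentences = re.split(r"(?<=[.!?])\s+", para)
    pieces, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and estimate_tokens(candidate) > max_tokens:
            # Adding this sentence would overflow: close the current piece.
            pieces.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    if current:
        pieces.append(" ".join(current))
    return pieces


long_para = " ".join(["This sentence is about forty characters."] * 200)
pieces = split_long_paragraph(long_para)
print(len(pieces), max(estimate_tokens(p) for p in pieces))
```

The pieces could then be fed to the paragraph loop in place of the original paragraph, so the overlap logic keeps working unchanged.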

PDF Extracted Text

Use the same paragraph-based chunking and the same token limits. Text extracted from PDFs tends to be denser than markdown, so expect more chunks per document.

Structured Data (JSON databases)

Generally NOT suitable for vector embedding. Structured data (dive sites, operators, analytics) is better served by direct lookup, SQL queries, or keyword-triggered injection. Only embed structured data if it contains narrative descriptions worth semantic retrieval.

3. Metadata Schema Design

Rich metadata enables filtered retrieval. Design metadata to support the query patterns your RAG system needs.

Recommended Fields

| Field | Type | Purpose | Example |
| --- | --- | --- | --- |
| source | string | Origin filename | "equipment-guide.md" |
| topic | string | Content category | "safety-medicine", "equipment", "destination" |
| region | string | Geographic region | "caribbean", "southeast-asia" |
| cert_level | string | Min certification | "OW", "AOW", "Technical" |
| dive_type | string | Primary dive type | "reef", "wreck", "cave" |
| content_type | string | Content classification | "factual", "procedural", "advisory", "scientific" |
| category | string | Pipeline source | "topic", "dest", "almanac", "pdf" |
| chunk_index | int | Position in document | 0, 1, 2 |
| chunk_count | int | Total chunks from doc | 15 |
| text | string | Chunk text (truncated) | First 1000 chars for display |
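
A small builder function can keep every pipeline writing the same field names and types. The sketch below is illustrative only: `build_metadata` is a hypothetical helper, and the choice to omit unset optional fields (instead of storing null values, which Pinecone metadata does not accept) is an assumption about how you want to handle missing values.

```python
TEXT_PREVIEW_CHARS = 1000  # matches the truncation described for the "text" field


def build_metadata(source, topic, category, chunk_index, chunk_count, text,
                   region=None, cert_level=None, dive_type=None, content_type=None):
    """Assemble a flat metadata dict, leaving out optional fields that are unset."""
    if not 0 <= chunk_index < chunk_count:
        raise ValueError("chunk_index must be in the range [0, chunk_count)")

    metadata = {
        "source": source,
        "topic": topic,
        "category": category,
        "chunk_index": int(chunk_index),
        "chunk_count": int(chunk_count),
        "text": text[:TEXT_PREVIEW_CHARS],
    }
    optional = {
        "region": region,
        "cert_level": cert_level,
        "dive_type": dive_type,
        "content_type": content_type,
    }
    # Skip unset fields so no key ever holds None.
    metadata.update({k: v for k, v in optional.items() if v is not None})
    return metadata


meta = build_metadata(
    source="equipment-guide.md", topic="equipment", category="topic",
    chunk_index=0, chunk_count=15, text="x" * 1500, region="caribbean",
)
print(sorted(meta))
```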

Metadata for Filtered Queries


# Retrieve only safety content for advanced divers
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={
        "topic": {"$eq": "safety-medicine"},
        "cert_level": {"$in": ["AOW", "Technical"]}
    },
    include_metadata=True
)
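
When filter criteria come from user input, some of them are often absent. One way to handle that, sketched below, is to build the filter dict only from the criteria that are set. `build_filter` is a hypothetical helper; it uses the same `$eq` and `$in` operators as the query above and assumes the field names from the Recommended Fields table.

```python
def build_filter(topic=None, cert_levels=None, region=None):
    """Build a metadata filter dict from optional criteria.

    Returns None when no criteria are given, so the caller can pass the
    result straight through as `filter=` without adding empty clauses.
    """
    clauses = {}
    if topic is not None:
        clauses["topic"] = {"$eq": topic}
    if cert_levels:
        clauses["cert_level"] = {"$in": list(cert_levels)}
    if region is not None:
        clauses["region"] = {"$eq": region}
    return clauses or None


safety_filter = build_filter(topic="safety-medicine", cert_levels=["AOW", "Technical"])
print(safety_filter)
print(build_filter())
```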

4. ID Prefix Conventions

Use consistent prefixes to identify vector sources without querying metadata:

| Prefix | Source | Example ID |
| --- | --- | --- |
| scubagpt- | PDF corpus extractions | scubagpt-751ccd6b02ed4107 |
| almanac- | Regional almanac files | almanac-b9c2d4cb0a60cccb |
| topic- | Topics KB markdown | topic-3a8f2c1d9e0b4567 |
| dest- | Destinations KB markdown | dest-7c4e8a2f1b3d6590 |

ID generation pattern:


chunk_id = hashlib.md5(
    f"{prefix}:{filename}:{chunk_index}".encode()
).hexdigest()[:16]
vector_id = f"{prefix}-{chunk_id}"

Deterministic IDs mean re-running a pipeline overwrites (upserts) existing vectors rather than creating duplicates.
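
Because these IDs are derived from the file name and chunk position (not from the chunk text), one edge case is worth spelling out: if a file is re-chunked into fewer chunks than before, the IDs for the old trailing positions are never overwritten and stay in the index. The sketch below wraps the pattern above in a function and computes those leftover IDs. Both `make_vector_id` and `stale_tail_ids` are hypothetical helper names.

```python
import hashlib


def make_vector_id(prefix, filename, chunk_index):
    """Deterministic ID: same prefix, file, and position always give the same ID."""
    digest = hashlib.md5(f"{prefix}:{filename}:{chunk_index}".encode()).hexdigest()[:16]
    return f"{prefix}-{digest}"


def stale_tail_ids(prefix, filename, old_chunk_count, new_chunk_count):
    """IDs left behind when a file now produces fewer chunks than it did before."""
    return [
        make_vector_id(prefix, filename, i)
        for i in range(new_chunk_count, old_chunk_count)
    ]


first = make_vector_id("topic", "equipment-guide.md", 0)
again = make_vector_id("topic", "equipment-guide.md", 0)
print(first == again)

# The file used to produce 5 chunks and now produces 3: positions 3 and 4 are stale.
leftovers = stale_tail_ids("topic", "equipment-guide.md", old_chunk_count=5, new_chunk_count=3)
print(len(leftovers))
```

The old chunk count would need to come from somewhere persistent, for example the `chunk_count` metadata field or the previous manifest.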

5. Embedding Generation

Model Selection

| Model | Dimensions | Max Tokens | Cost | Use Case |
| --- | --- | --- | --- | --- |
| text-embedding-3-small | 1536 | 8,192 | $0.02/1M tokens | Default: good quality, low cost |
| text-embedding-3-large | 3072 | 8,192 | $0.13/1M tokens | When retrieval precision is critical |
| text-embedding-ada-002 | 1536 | 8,191 | $0.10/1M tokens | Legacy: avoid for new projects |
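
The prices in the table make it easy to estimate the cost of a run before calling the API. The following is a rough sketch that reuses the 4-characters-per-token heuristic from the chunking section, so the result is an approximation; `estimate_embedding_cost` is a hypothetical helper and the prices are copied from the table above.

```python
# USD per 1M tokens, taken from the model table above.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "text-embedding-ada-002": 0.10,
}


def estimate_embedding_cost(chunk_texts, model="text-embedding-3-small"):
    """Return (estimated_tokens, estimated_cost_usd) for a list of chunk texts."""
    total_tokens = sum(len(text) // 4 for text in chunk_texts)
    cost = total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]
    return total_tokens, cost


# 272 chunks of 3,200 characters each, i.e. about 800 estimated tokens per chunk.
chunks = ["x" * 3200] * 272
tokens, cost = estimate_embedding_cost(chunks)
print(tokens, round(cost, 4))
```

For a corpus of that size the estimate comes to well under one cent with text-embedding-3-small, which is why re-embedding changed files is usually cheap compared with the engineering time spent avoiding it.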

Batch Embedding Pattern


import openai
import time

client = openai.OpenAI(api_key=OPENAI_API_KEY)
BATCH_SIZE = 100
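TEXT_TRUNCATION = 6000  # chars; a conservative cap meant to stay under the 8,192-token input limit

# `vectors` comes from the chunking step: a list of dicts with "id", "text", and "metadata"
embedded = []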
for batch_start in range(0, len(vectors), BATCH_SIZE):
    batch = vectors[batch_start:batch_start + BATCH_SIZE]
    texts = [v["text"][:TEXT_TRUNCATION] for v in batch]

    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )

    for j, emb in enumerate(response.data):
        v = batch[j]
        embedded.append({
            "id": v["id"],
            "values": emb.embedding,
            "metadata": {**v["metadata"], "text": v["text"][:1000]},
        })

    done = min(batch_start + BATCH_SIZE, len(vectors))
    print(f"  Embedded {done}/{len(vectors)} chunks")
    time.sleep(0.1)  # Rate-limit courtesy

Critical: Truncate text to ~6,000 characters (not 8,000) before sending it to the API. The character-to-token ratio varies with the content, and dense text can exceed 8,192 tokens even at 8,000 characters, which causes a 400 BadRequestError.
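
The fixed `time.sleep(0.1)` in the loop above does not help once a request actually fails with a rate-limit or transient network error. A common complement is to retry the call with exponential backoff. The sketch below is generic and uses only the standard library; `with_backoff` is a hypothetical helper, and the demo replaces the real API call with a stand-in function so it runs without network access.

```python
import random
import time


def with_backoff(call, max_attempts=5, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Run call() and retry on the given exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as 1s, 2s, 4s, ... plus up to 25% random jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.25))


# Demo: a stand-in for the embedding call that fails twice, then succeeds.
attempts = {"n": 0}


def flaky_embed():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = with_backoff(flaky_embed, retry_on=(TimeoutError,), sleep=delays.append)
print(result, len(delays))
```

In the batch loop, the call would be wrapped as `with_backoff(lambda: client.embeddings.create(model="text-embedding-3-small", input=texts), retry_on=(openai.RateLimitError,))`, assuming the 1.x OpenAI Python SDK, where that exception class is exposed at the package top level.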

6. Pinecone Upsert


import time

from pinecone import Pinecone

pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index(INDEX_NAME)

UPSERT_BATCH = 100
for batch_start in range(0, len(embedded), UPSERT_BATCH):
    batch = embedded[batch_start:batch_start + UPSERT_BATCH]
    index.upsert(vectors=batch)
    done = min(batch_start + UPSERT_BATCH, len(embedded))
    print(f"  Upserted {done}/{len(embedded)} vectors")
    time.sleep(0.2)

Upsert is idempotent — vectors with the same ID are overwritten, not duplicated. This makes re-runs safe.

7. Verification

After upsert, always verify:


# 1. Check total count increased as expected
stats = index.describe_index_stats()
print(f"Total vectors: {stats.total_vector_count}")

# 2. Spot-check with known IDs from manifest
sample_ids = [v["id"] for v in manifest["vectors"][:10]]
result = index.fetch(ids=sample_ids)
found = sum(1 for vid in sample_ids if vid in result.vectors)
print(f"Spot-check: {found}/{len(sample_ids)} found")

# 3. Semantic query to confirm retrieval works
#    (reuses the OpenAI `client` created in the embedding step)
emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=["test query relevant to new content"]
)
results = index.query(vector=emb.data[0].embedding, top_k=5, include_metadata=True)
for m in results.matches:
    src = m.metadata.get("source", "?")
    topic = m.metadata.get("topic", "?")
    print(f"  {m.score:.3f}  {src}  [{topic}]")

8. Incremental Update Workflow

When knowledgebase content changes:

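  1. Identify changed files: compare file modification times or content hashes against the last manifest
  2. Re-chunk only changed files: don’t re-process the entire corpus
  3. Delete stale vectors: if a file was removed, list vectors by ID prefix and delete the ones that belonged to it. Because the IDs are hashes, the file's IDs have to be looked up (for example in the last manifest). The same applies when a changed file now produces fewer chunks than before
  4. Generate new embeddings: only for new/changed chunks
  5. Upsert: deterministic IDs ensure changed content overwrites cleanly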

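# Delete the vectors that belonged to a removed source.
# should_delete() is a placeholder, e.g. a membership check against the removed file's IDs.
ids_to_delete = []
for batch in index.list(prefix="topic-"):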
    for vid in batch:
        if should_delete(vid):
            ids_to_delete.append(vid)

if ids_to_delete:
    for i in range(0, len(ids_to_delete), 100):
        index.delete(ids=ids_to_delete[i:i+100])
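
Step 1 of the workflow, identifying changed files by content hash, can be sketched as follows. This assumes the previous run stored a mapping of file name to SHA-256 hash somewhere, for example under a `file_hashes` key in the manifest. That key is hypothetical and does not appear in the manifest example in section 9. `file_sha256` and `diff_sources` are likewise illustrative names.

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Content hash of one file; unaffected by modification time."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def diff_sources(current_hashes, previous_hashes):
    """Classify files as new, changed, removed, or unchanged by content hash."""
    current, previous = set(current_hashes), set(previous_hashes)
    shared = current & previous
    return {
        "new": sorted(current - previous),
        "changed": sorted(f for f in shared if current_hashes[f] != previous_hashes[f]),
        "removed": sorted(previous - current),
        "unchanged": sorted(f for f in shared if current_hashes[f] == previous_hashes[f]),
    }


# In a real run, current_hashes would be built with file_sha256() over the KB directory.
previous_hashes = {"equipment.md": "aaa", "safety.md": "bbb", "old-guide.md": "ccc"}
current_hashes = {"equipment.md": "aaa", "safety.md": "bbb-edited", "wrecks.md": "ddd"}
plan = diff_sources(current_hashes, previous_hashes)
print(plan)
```

Files under "new" and "changed" go through chunking, embedding, and upsert; files under "removed" feed the deletion step above.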

9. Manifest Files

Every embedding pipeline should write a manifest JSON for auditability:


{
  "total_vectors": 272,
  "embedding_model": "text-embedding-3-small",
  "embedding_dim": 1536,
  "topics": { "safety-medicine": 31, "equipment": 17 },
  "regions": { "caribbean": 48, "general": 62 },
  "vectors": [
    {
      "id": "topic-3a8f2c1d9e0b4567",
      "metadata": { "source": "equipment-guide.md", "topic": "equipment" },
      "text_preview": "First 200 chars of chunk..."
    }
  ]
}

Manifests enable:

  • Comparing what should be in the index vs. what is there
  • Re-running embedding without re-chunking
  • Auditing coverage by topic/region
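
A manifest of the shape shown above can be assembled from the chunk list at the end of a pipeline run. The following is a minimal sketch, not the reference implementation: `build_manifest` is a hypothetical helper, and counting vectors without a `region` under "general" is an assumption suggested by the example, not something the pipelines are known to do.

```python
import json
from collections import Counter


def build_manifest(vectors, embedding_model="text-embedding-3-small", embedding_dim=1536):
    """Summarize a run: totals, per-topic and per-region counts, and vector previews."""
    topics = Counter(v["metadata"].get("topic", "unknown") for v in vectors)
    # Assumption: vectors with no region are counted as "general".
    regions = Counter(v["metadata"].get("region", "general") for v in vectors)
    return {
        "total_vectors": len(vectors),
        "embedding_model": embedding_model,
        "embedding_dim": embedding_dim,
        "topics": dict(topics),
        "regions": dict(regions),
        "vectors": [
            {"id": v["id"], "metadata": v["metadata"], "text_preview": v["text"][:200]}
            for v in vectors
        ],
    }


vectors = [
    {"id": "topic-1", "metadata": {"topic": "equipment", "region": "caribbean"}, "text": "a" * 300},
    {"id": "topic-2", "metadata": {"topic": "equipment"}, "text": "short chunk"},
]
manifest = build_manifest(vectors)
print(json.dumps(manifest, indent=2)[:120])
```

Writing the result with `json.dump` to a per-run file gives the audit step in section 1 something concrete to compare against.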

10. Environment Variables

All embedding scripts should use these environment variables:

| Variable | Required | Default | Purpose |
| --- | --- | --- | --- |
| OPENAI_API_KEY | Yes | (none) | OpenAI API for embedding generation |
| PINECONE_API_KEY | Yes | (none) | Pinecone API for vector operations |
| PINECONE_INDEX_NAME | No | scubagpt-1536 | Target index name |

Never hardcode API keys. Pass via environment variables or secure credential stores.
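
A script can check these variables once at startup and fail with a clear message instead of a confusing authentication error later. The sketch below is one way to do that; `load_config` is a hypothetical helper, and the `env` parameter exists only so the function can be exercised without touching the real environment.

```python
import os

REQUIRED_VARS = ("OPENAI_API_KEY", "PINECONE_API_KEY")
DEFAULT_INDEX_NAME = "scubagpt-1536"


def load_config(env=None):
    """Read pipeline settings from the environment and fail fast if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    return {
        "openai_api_key": env["OPENAI_API_KEY"],
        "pinecone_api_key": env["PINECONE_API_KEY"],
        "index_name": env.get("PINECONE_INDEX_NAME", DEFAULT_INDEX_NAME),
    }


# Demo with a plain dict standing in for os.environ (placeholder values, not real keys).
config = load_config({"OPENAI_API_KEY": "placeholder-a", "PINECONE_API_KEY": "placeholder-b"})
print(config["index_name"])
```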

Anti-Patterns

  • Embedding structured JSON data (dive site databases, analytics JSON): These are better served by direct lookup. Only embed text with narrative content.
  • 8,000-character truncation: Dense text (scientific papers, tables) can exceed 8,192 tokens at 8,000 chars. Use 6,000 characters as the safe ceiling.
  • No manifest: Without a manifest, you can’t audit what’s in the index or detect coverage gaps.
  • No verification after upsert: Always spot-check fetched IDs and run semantic queries to confirm the new content is retrievable.
  • Embedding without metadata: Bare vectors with no topic, region, or source metadata make filtered retrieval impossible and debugging very difficult.
  • Non-deterministic IDs: Random UUIDs mean re-running a pipeline doubles your vectors. Use deterministic hashes (derived from prefix, filename, and chunk index, as in section 4) so re-runs are idempotent.
  • Single monolithic pipeline: Separate pipelines by content type (PDF, markdown, almanac) so each can be re-run independently.
  • Embedding disambiguation/terminology files: JSON lookup tables for term disambiguation are consulted by exact match, not semantic search, so they should not be embedded.

Inputs Required

  • Pinecone API key and index name
  • OpenAI API key (for embedding model)
  • Content to embed: markdown files, extracted PDF text, or other narrative content
  • Existing index state (for gap analysis)

Output Format

  • Embedded vectors upserted to Pinecone index
  • Manifest JSON file per pipeline run (vector IDs, metadata, text previews)
  • Verification report (total count, spot-check results, sample queries)

Reference Implementation

See products/scuba-gpt/data-pipelines/:

  • 05_pinecone_reembed.py — PDF corpus chunking, embedding, and upsert with rich metadata
  • 12_embed_almanac.py — Almanac markdown section-splitting, embedding, and upsert
  • 14_embed_kb_markdown.py — Topics and destinations markdown embedding with heading-based chunking