Pinecone Embedding Management

Posted April 21, 2026

Updated May 4, 2026

By Peter Westerman

name: pinecone-embedding-management

description: Manage Pinecone vector index lifecycles including content auditing, chunking strategies, embedding generation via OpenAI, batch upsert, index verification, and metadata schema design. Covers multi-source embedding pipelines (PDFs, markdown knowledgebases, almanac files, structured data), gap analysis against existing vectors, ID prefix conventions, and incremental update workflows. Use when embedding new knowledgebase content into Pinecone, auditing an existing index for coverage gaps, designing metadata schemas for filtered retrieval, building chunking pipelines for RAG, or verifying upsert integrity.

Pinecone Embedding Management

Instructions

Manage Pinecone vector indexes for RAG-powered applications. This skill covers the full lifecycle from content audit through embedding generation, upsert, and verification.

1. Index Audit — What’s Already Embedded?

Before adding content, always audit the existing index:


from pinecone import Pinecone

pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index(INDEX_NAME)

# 1. Get total stats
stats = index.describe_index_stats()
print(f"Total vectors: {stats.total_vector_count}")
print(f"Dimension: {stats.dimension}")
print(f"Namespaces: {dict(stats.namespaces)}")

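# 2. Count vectors per ID prefix to see which sources are present
#    (index.list() pages through IDs; it is available on serverless indexes)
prefixes = ["scubagpt-", "almanac-", "topic-", "dest-"]  # see "ID Prefix Conventions"
for prefix in prefixes:
    count = sum(len(batch) for batch in index.list(prefix=prefix))
    print(f"  {prefix}: {count}")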

# 3. Query to inspect metadata shapes
from openai import OpenAI
oai = OpenAI(api_key=OPENAI_API_KEY)
emb = oai.embeddings.create(model="text-embedding-3-small", input=["sample query"])
results = index.query(vector=emb.data[0].embedding, top_k=5, include_metadata=True)
for m in results.matches:
    print(f"  {m.id}: keys={list(m.metadata.keys())}")

Gap Analysis Checklist

Check	Method
Which content directories have vectors?	Sample by ID prefix
Which are missing?	Cross-reference KB file list vs. prefix counts
Were manifests generated but never upserted?	Check `output/*.json` for manifests, fetch IDs from Pinecone
What metadata shapes exist?	Query + inspect `.metadata.keys()`
Are pre-existing vectors from a different pipeline?	Compare metadata schemas
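
The "manifests generated but never upserted" check can be sketched as a small helper. This is a minimal sketch under stated assumptions, not the reference pipeline's code: `find_missing_ids` and its `fetch_existing` parameter are hypothetical names, and `fetch_existing` stands in for a call such as `lambda ids: index.fetch(ids=ids).vectors.keys()`. The batch size of 100 is also an assumption.

```python
from typing import Callable, Iterable, List


def find_missing_ids(
    manifest_ids: List[str],
    fetch_existing: Callable[[List[str]], Iterable[str]],
    batch_size: int = 100,
) -> List[str]:
    """Return the manifest IDs that the index does not contain.

    fetch_existing is a hypothetical callable: given a batch of IDs,
    it returns the subset that exists in the index.
    """
    missing = []
    for start in range(0, len(manifest_ids), batch_size):
        batch = manifest_ids[start:start + batch_size]
        present = set(fetch_existing(batch))
        missing.extend(vid for vid in batch if vid not in present)
    return missing


# Usage with an in-memory stand-in for the index
fake_index = {"topic-aaa", "topic-bbb"}
missing = find_missing_ids(
    ["topic-aaa", "topic-bbb", "topic-ccc"],
    lambda ids: [i for i in ids if i in fake_index],
)
print(missing)  # ['topic-ccc']
```

A non-empty result means the manifest was written but some or all of its vectors never reached the index.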

2. Content Chunking Strategies

Markdown (KB topics, destinations, almanac)

Split on heading boundaries, then apply paragraph-level chunking with overlap:


import re
import hashlib

MAX_CHUNK_TOKENS = 800
OVERLAP_TOKENS = 200

def estimate_tokens(text):
    return len(text) // 4

def chunk_by_sections(content):
    """Split markdown at ## headings, then chunk each section."""
    sections = re.split(r"\n(?=##\s)", content)
    all_chunks = []

    for section in sections:
        section = section.strip()
        if not section or estimate_tokens(section) < 50:
            continue

        paragraphs = re.split(r"\n\s*\n", section)
        current = []
        current_tokens = 0

        for para in paragraphs:
            para = para.strip()
            if not para:
                continue
            pt = estimate_tokens(para)

            if current_tokens + pt > MAX_CHUNK_TOKENS and current:
                all_chunks.append("\n\n".join(current))
                # Overlap: carry trailing paragraphs into next chunk
                overlap_paras = []
                overlap_t = 0
                for p in reversed(current):
                    t = estimate_tokens(p)
                    if overlap_t + t > OVERLAP_TOKENS:
                        break
                    overlap_paras.insert(0, p)
                    overlap_t += t
                current = overlap_paras
                current_tokens = overlap_t

            current.append(para)
            current_tokens += pt

        if current:
            all_chunks.append("\n\n".join(current))

    return all_chunks

PDF Extracted Text

Use the same paragraph-based chunking and token limits as for markdown. Extracted PDF text tends to be denser, so expect more chunks per document.
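
Extracted PDF text sometimes arrives as very long paragraphs with no blank lines, and the paragraph chunker above passes an oversized paragraph through whole. One possible pre-processing step, sketched here as an assumption rather than part of the reference pipelines, is to split such paragraphs at sentence boundaries first. `split_long_paragraph` is a hypothetical helper; it reuses the same 4-characters-per-token estimate, and a single sentence longer than the limit still passes through unsplit.

```python
import re
from typing import List


def estimate_tokens(text: str) -> int:
    # Same rough heuristic as the markdown chunker: about 4 characters per token
    return len(text) // 4


def split_long_paragraph(para: str, max_tokens: int = 800) -> List[str]:
    """Split one oversized paragraph into pieces at sentence boundaries."""
    if estimate_tokens(para) <= max_tokens:
        return [para]

    sentences = re.split(r"(?<=[.!?])\s+", para.strip())
    pieces, current, current_tokens = [], [], 0
    for sentence in sentences:
        t = estimate_tokens(sentence)
        # Flush the running piece before it would exceed the budget
        if current and current_tokens + t > max_tokens:
            pieces.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += t
    if current:
        pieces.append(" ".join(current))
    return pieces
```

The pieces can then be fed to the paragraph loop as if they were separate paragraphs.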

Structured Data (JSON databases)

Generally NOT suitable for vector embedding. Structured data (dive sites, operators, analytics) is better served by direct lookup, SQL queries, or keyword-triggered injection. Only embed structured data if it contains narrative descriptions worth semantic retrieval.

3. Metadata Schema Design

Rich metadata enables filtered retrieval. Design metadata to support the query patterns your RAG system needs.

Recommended Fields

Field	Type	Purpose	Example
`source`	string	Origin filename	`"equipment-guide.md"`
`topic`	string	Content category	`"safety-medicine"`, `"equipment"`, `"destination"`
`region`	string	Geographic region	`"caribbean"`, `"southeast-asia"`
`cert_level`	string	Min certification	`"OW"`, `"AOW"`, `"Technical"`
`dive_type`	string	Primary dive type	`"reef"`, `"wreck"`, `"cave"`
`content_type`	string	Content classification	`"factual"`, `"procedural"`, `"advisory"`, `"scientific"`
`category`	string	Pipeline source	`"topic"`, `"dest"`, `"almanac"`, `"pdf"`
`chunk_index`	int	Position in document	`0`, `1`, `2`
`chunk_count`	int	Total chunks from doc	`15`
`text`	string	Chunk text (truncated)	First 1000 chars for display
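
As an illustration of the fields in the table, the sketch below assembles one chunk's metadata. `build_metadata` is a hypothetical helper, not a function from the reference pipelines. It omits unset optional fields because Pinecone does not accept null metadata values, and it truncates `text` to the first 1000 characters as the table describes.

```python
from typing import Any, Dict, Optional

TEXT_PREVIEW_CHARS = 1000  # "first 1000 chars" convention from the table


def build_metadata(
    source: str,
    category: str,
    chunk_index: int,
    chunk_count: int,
    text: str,
    topic: Optional[str] = None,
    region: Optional[str] = None,
    cert_level: Optional[str] = None,
    dive_type: Optional[str] = None,
    content_type: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the metadata dict for one chunk."""
    metadata = {
        "source": source,
        "topic": topic,
        "region": region,
        "cert_level": cert_level,
        "dive_type": dive_type,
        "content_type": content_type,
        "category": category,
        "chunk_index": chunk_index,
        "chunk_count": chunk_count,
        "text": text[:TEXT_PREVIEW_CHARS],
    }
    # Drop unset fields instead of storing None
    return {k: v for k, v in metadata.items() if v is not None}


meta = build_metadata(
    source="equipment-guide.md",
    category="topic",
    chunk_index=0,
    chunk_count=15,
    text="Regulators should be serviced regularly.",
    topic="equipment",
)
print(sorted(meta))
# ['category', 'chunk_count', 'chunk_index', 'source', 'text', 'topic']
```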

Metadata for Filtered Queries


# Retrieve only safety content for advanced divers
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={
        "topic": {"$eq": "safety-medicine"},
        "cert_level": {"$in": ["AOW", "Technical"]}
    },
    include_metadata=True
)
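
When filters are assembled from optional user or application inputs, a small builder keeps the query code uniform. The following is a sketch only: `build_filter` and its parameters are hypothetical, and it covers just the `$eq` and `$in` operators used in the example above.

```python
from typing import Any, Dict, Optional, Sequence


def build_filter(
    topic: Optional[str] = None,
    cert_levels: Optional[Sequence[str]] = None,
    region: Optional[str] = None,
) -> Optional[Dict[str, Any]]:
    """Build a Pinecone metadata filter from whichever constraints are set."""
    conditions: Dict[str, Any] = {}
    if topic:
        conditions["topic"] = {"$eq": topic}
    if cert_levels:
        conditions["cert_level"] = {"$in": list(cert_levels)}
    if region:
        conditions["region"] = {"$eq": region}
    # None means "no filter", which matches the query's default behavior
    return conditions or None


flt = build_filter(topic="safety-medicine", cert_levels=["AOW", "Technical"])
print(flt)
# {'topic': {'$eq': 'safety-medicine'}, 'cert_level': {'$in': ['AOW', 'Technical']}}
```

The result can be passed as the `filter` argument of `index.query`. Multiple top-level keys are combined with AND semantics.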

4. ID Prefix Conventions

Use consistent prefixes to identify vector sources without querying metadata:

Prefix	Source	Example ID
`scubagpt-`	PDF corpus extractions	`scubagpt-751ccd6b02ed4107`
`almanac-`	Regional almanac files	`almanac-b9c2d4cb0a60cccb`
`topic-`	Topics KB markdown	`topic-3a8f2c1d9e0b4567`
`dest-`	Destinations KB markdown	`dest-7c4e8a2f1b3d6590`

ID generation pattern:


chunk_id = hashlib.md5(
    f"{prefix}:{filename}:{chunk_index}".encode()
).hexdigest()[:16]
vector_id = f"{prefix}-{chunk_id}"

Deterministic IDs mean re-running a pipeline overwrites (upserts) existing vectors rather than creating duplicates.
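
The property is easy to check in isolation. The sketch below wraps the pattern above in a hypothetical function, `make_vector_id`, and shows that identical inputs always give the same ID while a different chunk index gives a different one.

```python
import hashlib


def make_vector_id(prefix: str, filename: str, chunk_index: int) -> str:
    """Derive a stable vector ID from the pipeline prefix, file, and chunk position."""
    digest = hashlib.md5(f"{prefix}:{filename}:{chunk_index}".encode()).hexdigest()
    return f"{prefix}-{digest[:16]}"


first = make_vector_id("topic", "equipment-guide.md", 0)
again = make_vector_id("topic", "equipment-guide.md", 0)
other = make_vector_id("topic", "equipment-guide.md", 1)

print(first == again)  # True: a re-run produces the same ID
print(first == other)  # False: each chunk position gets its own ID
```

Because the ID depends on chunk position rather than chunk text, editing a file changes what is stored under an ID but not the ID itself.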

5. Embedding Generation

Model Selection

Model	Dimensions	Max Tokens	Cost	Use Case
`text-embedding-3-small`	1536	8,192	$0.02/1M tokens	Default — good quality, low cost
`text-embedding-3-large`	3072	8,192	$0.13/1M tokens	When retrieval precision is critical
`text-embedding-ada-002`	1536	8,191	$0.10/1M tokens	Legacy — avoid for new projects
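
The prices in the table translate into a quick pre-run cost estimate. This is a rough sketch: `estimate_embedding_cost` is a hypothetical helper, the token count uses the same 4-characters-per-token approximation as the chunker, and the prices are copied from the table above and may change.

```python
from typing import Iterable, Tuple

# USD per 1M tokens, copied from the model table above
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "text-embedding-ada-002": 0.10,
}


def estimate_embedding_cost(
    chunk_texts: Iterable[str],
    model: str = "text-embedding-3-small",
) -> Tuple[int, float]:
    """Return (approximate token count, approximate cost in USD)."""
    total_tokens = sum(len(text) // 4 for text in chunk_texts)
    cost = total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]
    return total_tokens, cost


# 1,000 chunks of about 1,000 tokens each
chunks = ["x" * 4000] * 1000
tokens, cost = estimate_embedding_cost(chunks)
print(f"{tokens} tokens, about ${cost:.2f}")  # 1000000 tokens, about $0.02
```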

Batch Embedding Pattern


import openai
import time

client = openai.OpenAI(api_key=OPENAI_API_KEY)
BATCH_SIZE = 100
TEXT_TRUNCATION = 6000  # chars — stays safely under 8192 token limit

embedded = []
for batch_start in range(0, len(vectors), BATCH_SIZE):
    batch = vectors[batch_start:batch_start + BATCH_SIZE]
    texts = [v["text"][:TEXT_TRUNCATION] for v in batch]

    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )

    for j, emb in enumerate(response.data):
        v = batch[j]
        embedded.append({
            "id": v["id"],
            "values": emb.embedding,
            "metadata": {**v["metadata"], "text": v["text"][:1000]},
        })

    done = min(batch_start + BATCH_SIZE, len(vectors))
    print(f"  Embedded {done}/{len(vectors)} chunks")
    time.sleep(0.1)  # Rate-limit courtesy

Critical: Truncate text to about 6,000 characters (not 8,000) before sending it to the API. Dense text can exceed 8,192 tokens even at 8,000 characters, which causes a 400 `BadRequestError`.
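
A fixed `time.sleep` between batches does not help when a request actually fails on a rate limit or a transient network error. One common complement, sketched here as an assumption and not taken from the reference pipelines, is a retry wrapper with exponential backoff. `with_retries` is a hypothetical helper; with the OpenAI client you would likely pass `retry_on=(openai.RateLimitError,)` and leave `BadRequestError` unretried, since an oversized input fails the same way every time.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on the given exceptions with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")


# Hypothetical usage inside the batch loop:
# response = with_retries(
#     lambda: client.embeddings.create(model="text-embedding-3-small", input=texts),
#     retry_on=(openai.RateLimitError,),
# )
```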

6. Pinecone Upsert


import time

from pinecone import Pinecone

pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index(INDEX_NAME)

UPSERT_BATCH = 100
for batch_start in range(0, len(embedded), UPSERT_BATCH):
    batch = embedded[batch_start:batch_start + UPSERT_BATCH]
    index.upsert(vectors=batch)
    done = min(batch_start + UPSERT_BATCH, len(embedded))
    print(f"  Upserted {done}/{len(embedded)} vectors")
    time.sleep(0.2)

Upsert is idempotent — vectors with the same ID are overwritten, not duplicated. This makes re-runs safe.

7. Verification

After upsert, always verify:


# 1. Check total count increased as expected
stats = index.describe_index_stats()
print(f"Total vectors: {stats.total_vector_count}")

# 2. Spot-check with known IDs from the manifest
#    (`manifest` is the manifest JSON written by the pipeline, loaded with json.load)
sample_ids = [v["id"] for v in manifest["vectors"][:10]]
result = index.fetch(ids=sample_ids)
found = sum(1 for vid in sample_ids if vid in result.vectors)
print(f"Spot-check: {found}/{len(sample_ids)} found")

# 3. Semantic query to confirm retrieval works
#    (`client` is the OpenAI client created in the batch embedding step)
emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=["test query relevant to new content"]
)
results = index.query(vector=emb.data[0].embedding, top_k=5, include_metadata=True)
for m in results.matches:
    src = m.metadata.get("source", "?")
    topic = m.metadata.get("topic", "?")
    print(f"  {m.score:.3f}  {src}  [{topic}]")
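
The checks above can be collected into a single report, which is the "verification report" listed under Output Format. The sketch below is one possible shape: `verification_report` and its field names are hypothetical. `count_before` is assumed to be `total_vector_count` captured before the upsert, and `fetched_ids` would be `result.vectors.keys()` from the spot-check. Index stats can lag briefly after a write, so a count that looks low immediately after upsert is worth re-checking before treating it as a failure.

```python
from typing import Any, Dict, Iterable, List


def verification_report(
    count_before: int,
    count_after: int,
    sample_ids: List[str],
    fetched_ids: Iterable[str],
) -> Dict[str, Any]:
    """Summarize post-upsert checks in one JSON-serializable dict."""
    fetched = set(fetched_ids)
    missing = [vid for vid in sample_ids if vid not in fetched]
    return {
        "count_before": count_before,
        "count_after": count_after,
        "net_new_vectors": count_after - count_before,
        "spot_check_found": len(sample_ids) - len(missing),
        "spot_check_total": len(sample_ids),
        "missing_ids": missing,
        "passed": not missing,
    }


report = verification_report(
    count_before=1000,
    count_after=1272,
    sample_ids=["topic-a", "topic-b"],
    fetched_ids=["topic-a"],
)
print(report["net_new_vectors"], report["missing_ids"], report["passed"])
# 272 ['topic-b'] False
```

On a re-run that only overwrites existing IDs, `net_new_vectors` is expected to be 0, so the spot-check is the more reliable signal.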

8. Incremental Update Workflow

When knowledgebase content changes:

Identify changed files: compare file modification times or content hashes against the last manifest
Re-chunk only changed files: don’t re-process the entire corpus
Delete stale vectors: if a file was removed, or now produces fewer chunks than before, delete the vectors it no longer accounts for. IDs are hashed, so identify them from the last manifest, not from the ID string alone
Generate new embeddings: only for new/changed chunks
Upsert: deterministic IDs ensure changed content overwrites cleanly


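# Delete the vectors that belong to a removed source
# (should_delete is a placeholder, e.g. a membership test against the
#  removed file's IDs recorded in the last manifest)
ids_to_delete = []
for batch in index.list(prefix="topic-"):
    for vid in batch:
        if should_delete(vid):
            ids_to_delete.append(vid)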

if ids_to_delete:
    for i in range(0, len(ids_to_delete), 100):
        index.delete(ids=ids_to_delete[i:i+100])
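
The first step of the workflow, identifying changed files by content hash, can be sketched as follows. This assumes the previous run stored a filename-to-hash mapping somewhere (for example alongside the manifest); the manifest example in this document does not include such a field, so that storage, `content_hash`, and `diff_sources` are all hypothetical.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Hash file content so edits are detected regardless of modification time."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_sources(
    previous: Dict[str, str],
    current: Dict[str, str],
) -> Dict[str, List[str]]:
    """Compare filename -> hash maps from the last run and the current corpus."""
    return {
        "added": sorted(f for f in current if f not in previous),
        "changed": sorted(
            f for f in current if f in previous and current[f] != previous[f]
        ),
        "removed": sorted(f for f in previous if f not in current),
    }


previous = {
    "equipment-guide.md": content_hash("old text"),
    "bonaire.md": content_hash("same text"),
    "retired.md": content_hash("gone"),
}
current = {
    "equipment-guide.md": content_hash("new text"),
    "bonaire.md": content_hash("same text"),
    "new-topic.md": content_hash("fresh"),
}
print(diff_sources(previous, current))
# {'added': ['new-topic.md'], 'changed': ['equipment-guide.md'], 'removed': ['retired.md']}
```

Files under `added` and `changed` go through chunking and embedding again; files under `removed` feed the deletion step.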

9. Manifest Files

Every embedding pipeline should write a manifest JSON for auditability:


{
  "total_vectors": 272,
  "embedding_model": "text-embedding-3-small",
  "embedding_dim": 1536,
  "topics": { "safety-medicine": 31, "equipment": 17 },
  "regions": { "caribbean": 48, "general": 62 },
  "vectors": [
    {
      "id": "topic-3a8f2c1d9e0b4567",
      "metadata": { "source": "equipment-guide.md", "topic": "equipment" },
      "text_preview": "First 200 chars of chunk..."
    }
  ]
}

Manifests enable:

Comparing what _should_ be in the index vs. what _is_ there
Re-running embedding without re-chunking
Auditing coverage by topic/region
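
A manifest in the shape shown above can be produced directly from the chunk records. The sketch below assumes each record is a dict with `id`, `text`, and `metadata` keys, as in the batch embedding step. `build_manifest` is a hypothetical helper, and the `"unknown"` bucket for chunks without a topic or region is an assumption.

```python
import json
from collections import Counter
from typing import Any, Dict, List


def build_manifest(
    vectors: List[Dict[str, Any]],
    embedding_model: str,
    embedding_dim: int,
    preview_chars: int = 200,
) -> Dict[str, Any]:
    """Summarize a pipeline run: per-topic and per-region counts plus one entry per vector."""
    topics = Counter(v["metadata"].get("topic", "unknown") for v in vectors)
    regions = Counter(v["metadata"].get("region", "unknown") for v in vectors)
    return {
        "total_vectors": len(vectors),
        "embedding_model": embedding_model,
        "embedding_dim": embedding_dim,
        "topics": dict(topics),
        "regions": dict(regions),
        "vectors": [
            {
                "id": v["id"],
                "metadata": v["metadata"],
                "text_preview": v["text"][:preview_chars],
            }
            for v in vectors
        ],
    }


records = [
    {
        "id": "topic-3a8f2c1d9e0b4567",
        "text": "Regulators should be serviced regularly.",
        "metadata": {"source": "equipment-guide.md", "topic": "equipment"},
    },
]
manifest = build_manifest(records, "text-embedding-3-small", 1536)
print(json.dumps(manifest["topics"]))  # {"equipment": 1}
```

Writing the result with `json.dump` at the end of each run gives the audit and verification steps a fixed list of IDs to check against.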

10. Environment Variables

All embedding scripts should use these environment variables:

Variable	Required	Default	Purpose
`OPENAI_API_KEY`	Yes	—	OpenAI API for embedding generation
`PINECONE_API_KEY`	Yes	—	Pinecone API for vector operations
`PINECONE_INDEX_NAME`	No	`scubagpt-1536`	Target index name

Never hardcode API keys. Pass via environment variables or secure credential stores.
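
A minimal sketch of reading these variables at script start-up, failing fast when a required one is missing. `load_config` and the returned key names are hypothetical; the variable names and the default index name come from the table above.

```python
import os
from typing import Dict, Mapping


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read embedding pipeline settings from the environment."""
    required = ("OPENAI_API_KEY", "PINECONE_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return {
        "openai_api_key": env["OPENAI_API_KEY"],
        "pinecone_api_key": env["PINECONE_API_KEY"],
        "index_name": env.get("PINECONE_INDEX_NAME", "scubagpt-1536"),
    }


# Usage with an explicit mapping (placeholder values for illustration only)
config = load_config({"OPENAI_API_KEY": "placeholder-a", "PINECONE_API_KEY": "placeholder-b"})
print(config["index_name"])  # scubagpt-1536
```

Accepting the environment as a parameter keeps the function easy to test without touching the real process environment.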

Anti-Patterns

Embedding structured JSON data (dive site databases, analytics JSON): These are better served by direct lookup. Only embed text with narrative content.
8,000-character truncation: Dense text (scientific papers, tables) can exceed 8,192 tokens at 8,000 chars. Use 6,000 characters as the safe ceiling.
No manifest: Without a manifest, you can’t audit what’s in the index or detect coverage gaps.
No verification after upsert: Always spot-check fetched IDs and run semantic queries to confirm the new content is retrievable.
Embedding without metadata: Bare vectors with no topic, region, or source metadata make filtered retrieval impossible and debugging very difficult.
Non-deterministic IDs: Random UUIDs mean re-running a pipeline doubles your vectors. Derive IDs from a hash of stable inputs (prefix, filename, chunk index) so re-runs are idempotent.
Single monolithic pipeline: Separate pipelines by content type (PDF, markdown, almanac) so each can be re-run independently.
Embedding disambiguation/terminology files: JSON lookup tables for term disambiguation are used for exact-match lookup, not semantic search, so they do not belong in the index.

Inputs Required

Pinecone API key and index name
OpenAI API key (for embedding model)
Content to embed: markdown files, extracted PDF text, or other narrative content
Existing index state (for gap analysis)

Output Format

Embedded vectors upserted to Pinecone index
Manifest JSON file per pipeline run (vector IDs, metadata, text previews)
Verification report (total count, spot-check results, sample queries)

Reference Implementation

See products/scuba-gpt/data-pipelines/:

05_pinecone_reembed.py — PDF corpus chunking, embedding, and upsert with rich metadata
12_embed_almanac.py — Almanac markdown section-splitting, embedding, and upsert
14_embed_kb_markdown.py — Topics and destinations markdown embedding with heading-based chunking
