Pinecone Embedding Management
Instructions
Manage Pinecone vector indexes for RAG-powered applications. This skill covers the full lifecycle from content audit through embedding generation, upsert, and verification.
1. Index Audit — What’s Already Embedded?
Before adding content, always audit the existing index:
```python
import os

from openai import OpenAI
from pinecone import Pinecone

# Credentials come from the environment (see Environment Variables)
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
INDEX_NAME = os.environ.get("PINECONE_INDEX_NAME", "scubagpt-1536")

pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index(INDEX_NAME)

# 1. Get total stats
stats = index.describe_index_stats()
print(f"Total vectors: {stats.total_vector_count}")
print(f"Dimension: {stats.dimension}")
print(f"Namespaces: {dict(stats.namespaces)}")

# 2. Sample by ID prefix to understand vector sources
# (prefixes follow the ID Prefix Conventions table; list() needs a serverless index)
prefixes = ["scubagpt-", "almanac-", "topic-", "dest-"]
for prefix in prefixes:
    count = sum(len(batch) for batch in index.list(prefix=prefix))
    print(f"  {prefix}: {count}")

# 3. Query to inspect metadata shapes
oai = OpenAI(api_key=OPENAI_API_KEY)
emb = oai.embeddings.create(model="text-embedding-3-small", input=["sample query"])
results = index.query(vector=emb.data[0].embedding, top_k=5, include_metadata=True)
for m in results.matches:
    print(f"  {m.id}: keys={list((m.metadata or {}).keys())}")
```
Gap Analysis Checklist
| Check | Method |
|---|---|
| Which content directories have vectors? | Sample by ID prefix |
| Which are missing? | Cross-reference KB file list vs. prefix counts |
| Were manifests generated but never upserted? | Check output/*.json for manifests, fetch IDs from Pinecone |
| What metadata shapes exist? | Query + inspect .metadata.keys() |
| Are pre-existing vectors from a different pipeline? | Compare metadata schemas |
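The "manifests generated but never upserted" check can be sketched as a small helper that compares manifest IDs with what the index reports. This is a minimal sketch: `find_missing_ids` and `fetch_fn` are hypothetical names, and `fetch_fn` is assumed to take a batch of IDs and return the subset that exists. With the Pinecone client that adapter could be `lambda ids: set(index.fetch(ids=ids).vectors)`.

```python
from typing import Callable, Iterable, List, Set


def find_missing_ids(
    manifest_ids: Iterable[str],
    fetch_fn: Callable[[List[str]], Set[str]],
    batch_size: int = 100,
) -> List[str]:
    """Return manifest IDs that fetch_fn does not report as present.

    fetch_fn is a hypothetical adapter: it takes a batch of IDs and
    returns the subset that exists in the index.
    """
    ids = list(manifest_ids)
    missing = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        present = fetch_fn(batch)
        missing.extend(vid for vid in batch if vid not in present)
    return missing


# Stand-in for a real index: only two of the three IDs were upserted.
fake_index = {"topic-aaa", "dest-bbb"}
missing = find_missing_ids(
    ["topic-aaa", "dest-bbb", "almanac-ccc"],
    fetch_fn=lambda ids: {i for i in ids if i in fake_index},
)
print(missing)
```

Passing the fetch as a callable keeps the comparison logic testable without network access.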
2. Content Chunking Strategies
Markdown (KB topics, destinations, almanac)
Split on heading boundaries, then apply paragraph-level chunking with overlap:
```python
import re
import hashlib

MAX_CHUNK_TOKENS = 800
OVERLAP_TOKENS = 200


def estimate_tokens(text):
    return len(text) // 4


def chunk_by_sections(content):
    """Split markdown at ## headings, then chunk each section."""
    sections = re.split(r"\n(?=##\s)", content)
    all_chunks = []
    for section in sections:
        section = section.strip()
        if not section or estimate_tokens(section) < 50:
            continue
        paragraphs = re.split(r"\n\s*\n", section)
        current = []
        current_tokens = 0
        for para in paragraphs:
            para = para.strip()
            if not para:
                continue
            pt = estimate_tokens(para)
            if current_tokens + pt > MAX_CHUNK_TOKENS and current:
                all_chunks.append("\n\n".join(current))
                # Overlap: carry trailing paragraphs into next chunk
                overlap_paras = []
                overlap_t = 0
                for p in reversed(current):
                    t = estimate_tokens(p)
                    if overlap_t + t > OVERLAP_TOKENS:
                        break
                    overlap_paras.insert(0, p)
                    overlap_t += t
                current = overlap_paras
                current_tokens = overlap_t
            current.append(para)
            current_tokens += pt
        if current:
            all_chunks.append("\n\n".join(current))
    return all_chunks
```
PDF Extracted Text
Use the same paragraph-based chunking and token limits as for markdown. PDF extraction tends to produce denser text, so expect more chunks per document.
Structured Data (JSON databases)
Generally NOT suitable for vector embedding. Structured data (dive sites, operators, analytics) is better served by direct lookup, SQL queries, or keyword-triggered injection. Only embed structured data if it contains narrative descriptions worth semantic retrieval.
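One way to apply the "narrative descriptions only" rule is to keep just the string fields of a record that are long enough to read as prose. This is a rough sketch: `narrative_fields`, the `min_words` threshold, and the sample record are all illustrative assumptions, not part of the pipelines described here.

```python
from typing import Any, Dict


def narrative_fields(record: Dict[str, Any], min_words: int = 40) -> Dict[str, str]:
    """Return the string fields of a record long enough to read as prose.

    min_words is an illustrative threshold, not a tuned value.
    """
    return {
        key: value
        for key, value in record.items()
        if isinstance(value, str) and len(value.split()) >= min_words
    }


# Made-up record: short identifiers and numbers stay out of the index.
site = {
    "name": "Example Reef",
    "max_depth_m": 40,
    "description": " ".join(["word"] * 60),  # stands in for a long write-up
}
to_embed = narrative_fields(site)
print(sorted(to_embed))
```

Fields that fail the check remain available through direct lookup or SQL, as recommended above.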
3. Metadata Schema Design
Rich metadata enables filtered retrieval. Design metadata to support the query patterns your RAG system needs.
Recommended Fields
| Field | Type | Purpose | Example |
|---|---|---|---|
| source | string | Origin filename | "equipment-guide.md" |
| topic | string | Content category | "safety-medicine", "equipment", "destination" |
| region | string | Geographic region | "caribbean", "southeast-asia" |
| cert_level | string | Min certification | "OW", "AOW", "Technical" |
| dive_type | string | Primary dive type | "reef", "wreck", "cave" |
| content_type | string | Content classification | "factual", "procedural", "advisory", "scientific" |
| category | string | Pipeline source | "topic", "dest", "almanac", "pdf" |
| chunk_index | int | Position in document | 0, 1, 2 |
| chunk_count | int | Total chunks from doc | 15 |
| text | string | Chunk text (truncated) | First 1000 chars for display |
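A small builder keeps every pipeline emitting the same metadata shape. The sketch below is an assumption about how that could look: `build_metadata` and `TEXT_PREVIEW_CHARS` are hypothetical names, the required arguments are one possible choice, and optional fields such as topic or region are passed as keyword tags.

```python
from typing import Any, Dict

TEXT_PREVIEW_CHARS = 1000  # matches the truncated "text" field


def build_metadata(
    source: str,
    category: str,
    chunk_index: int,
    chunk_count: int,
    text: str,
    **tags: Any,
) -> Dict[str, Any]:
    """Assemble one chunk's metadata dict.

    tags holds optional fields (topic, region, cert_level, dive_type,
    content_type); which ones are required is a per-project decision.
    """
    metadata: Dict[str, Any] = {
        "source": source,
        "category": category,
        "chunk_index": chunk_index,
        "chunk_count": chunk_count,
        "text": text[:TEXT_PREVIEW_CHARS],
    }
    # Drop unset tags: Pinecone metadata does not accept null values.
    metadata.update({k: v for k, v in tags.items() if v is not None})
    return metadata


meta = build_metadata(
    "equipment-guide.md", "topic", 0, 15, "x" * 2500,
    topic="equipment", region=None,
)
print(sorted(meta))
```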
Metadata for Filtered Queries
```python
# Retrieve only safety content for advanced divers
# (query_embedding is the embedding vector of the user's question)
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={
        "topic": {"$eq": "safety-medicine"},
        "cert_level": {"$in": ["AOW", "Technical"]},
    },
    include_metadata=True,
)
```
4. ID Prefix Conventions
Use consistent prefixes to identify vector sources without querying metadata:
| Prefix | Source | Example ID |
|---|---|---|
| scubagpt- | PDF corpus extractions | scubagpt-751ccd6b02ed4107 |
| almanac- | Regional almanac files | almanac-b9c2d4cb0a60cccb |
| topic- | Topics KB markdown | topic-3a8f2c1d9e0b4567 |
| dest- | Destinations KB markdown | dest-7c4e8a2f1b3d6590 |
ID generation pattern:
```python
chunk_id = hashlib.md5(
    f"{prefix}:{filename}:{chunk_index}".encode()
).hexdigest()[:16]
vector_id = f"{prefix}-{chunk_id}"
```
Deterministic IDs mean re-running a pipeline overwrites (upserts) existing vectors rather than creating duplicates.
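To make the determinism concrete, here is the same pattern wrapped in a function and called twice. `make_vector_id` is a hypothetical name, and the prefix is passed without its trailing hyphen, as the pattern above implies.

```python
import hashlib


def make_vector_id(prefix: str, filename: str, chunk_index: int) -> str:
    """Derive a stable vector ID from the pipeline prefix, file, and position."""
    digest = hashlib.md5(f"{prefix}:{filename}:{chunk_index}".encode()).hexdigest()
    return f"{prefix}-{digest[:16]}"


first = make_vector_id("topic", "equipment-guide.md", 0)
again = make_vector_id("topic", "equipment-guide.md", 0)
other = make_vector_id("topic", "equipment-guide.md", 1)

# Same inputs give the same ID; a different chunk index gives a different one.
print(first == again, first == other)
```

Because the ID depends on the chunk index rather than the chunk text, edited content at the same position overwrites the old vector. The flip side is that a file which shrinks leaves its highest-numbered chunks behind until they are deleted.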
5. Embedding Generation
Model Selection
| Model | Dimensions | Max Tokens | Cost | Use Case |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | 8,192 | $0.02/1M tokens | Default: good quality, low cost |
| text-embedding-3-large | 3072 | 8,192 | $0.13/1M tokens | When retrieval precision is critical |
| text-embedding-ada-002 | 1536 | 8,191 | $0.10/1M tokens | Legacy: avoid for new projects |
Batch Embedding Pattern
```python
import time

import openai

client = openai.OpenAI(api_key=OPENAI_API_KEY)

BATCH_SIZE = 100
TEXT_TRUNCATION = 6000  # chars: stays safely under the 8,192-token limit

# vectors: list of {"id", "text", "metadata"} dicts produced by the chunking step
embedded = []
for batch_start in range(0, len(vectors), BATCH_SIZE):
    batch = vectors[batch_start:batch_start + BATCH_SIZE]
    texts = [v["text"][:TEXT_TRUNCATION] for v in batch]
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    for j, emb in enumerate(response.data):
        v = batch[j]
        embedded.append({
            "id": v["id"],
            "values": emb.embedding,
            "metadata": {**v["metadata"], "text": v["text"][:1000]},
        })
    done = min(batch_start + BATCH_SIZE, len(vectors))
    print(f"  Embedded {done}/{len(vectors)} chunks")
    time.sleep(0.1)  # Rate-limit courtesy
```
Critical: Truncate text to ~6,000 characters (not 8,000) before sending it to the API. Dense text can exceed 8,192 tokens even at 8,000 characters, which causes a 400 BadRequestError.
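A plain slice can cut the last word in half. The sketch below applies the same 6,000-character ceiling but backs up to the previous whitespace when the cut lands mid-word. `truncate_for_embedding` is a hypothetical helper, and the character ceiling remains a heuristic; an exact guard would count tokens with a tokenizer instead.

```python
TEXT_TRUNCATION = 6000  # chars, the ceiling recommended above


def truncate_for_embedding(text: str, max_chars: int = TEXT_TRUNCATION) -> str:
    """Cut text to at most max_chars without splitting the final word."""
    if max_chars <= 0:
        return ""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    next_char = text[max_chars]
    if not cut[-1].isspace() and not next_char.isspace():
        # The cut landed mid-word: drop the partial word, provided there
        # is earlier whitespace to fall back to.
        parts = cut.rsplit(None, 1)
        if len(parts) == 2:
            cut = parts[0]
    return cut.rstrip()


long_text = "word " * 2000  # 10,000 characters
out = truncate_for_embedding(long_text)
print(len(out), repr(out[-9:]))
```

Text with no whitespace at all falls back to a hard cut at `max_chars`, so the length limit always holds.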
6. Pinecone Upsert
```python
import time

from pinecone import Pinecone

pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index(INDEX_NAME)

UPSERT_BATCH = 100

# embedded: list of {"id", "values", "metadata"} dicts from the embedding step
for batch_start in range(0, len(embedded), UPSERT_BATCH):
    batch = embedded[batch_start:batch_start + UPSERT_BATCH]
    index.upsert(vectors=batch)
    done = min(batch_start + UPSERT_BATCH, len(embedded))
    print(f"  Upserted {done}/{len(embedded)} vectors")
    time.sleep(0.2)
```
Upsert is idempotent — vectors with the same ID are overwritten, not duplicated. This makes re-runs safe.
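Idempotency also makes it safe to retry a single failed batch rather than restart the run. A minimal sketch of a retry wrapper with exponential backoff follows. `with_retries` and its parameters are hypothetical, and catching every exception is a simplification; a real pipeline would likely retry only transient errors.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


# Stand-in for an upsert call that fails twice before succeeding.
calls = {"n": 0}


def flaky_upsert() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []
result = with_retries(flaky_upsert, sleep=delays.append)
print(result, delays)
```

In the upsert loop this would wrap the call as `with_retries(lambda: index.upsert(vectors=batch))`.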
7. Verification
After upsert, always verify:
```python
from openai import OpenAI

# Assumes `index` from the upsert step and `manifest` loaded from the
# pipeline's manifest JSON.
oai_client = OpenAI(api_key=OPENAI_API_KEY)

# 1. Check total count increased as expected
stats = index.describe_index_stats()
print(f"Total vectors: {stats.total_vector_count}")

# 2. Spot-check with known IDs from manifest
sample_ids = [v["id"] for v in manifest["vectors"][:10]]
result = index.fetch(ids=sample_ids)
found = sum(1 for vid in sample_ids if vid in result.vectors)
print(f"Spot-check: {found}/{len(sample_ids)} found")

# 3. Semantic query to confirm retrieval works
emb = oai_client.embeddings.create(
    model="text-embedding-3-small",
    input=["test query relevant to new content"],
)
results = index.query(vector=emb.data[0].embedding, top_k=5, include_metadata=True)
for m in results.matches:
    src = m.metadata.get("source", "?")
    topic = m.metadata.get("topic", "?")
    print(f"  {m.score:.3f} {src} [{topic}]")
```
8. Incremental Update Workflow
When knowledgebase content changes:
- Identify changed files: compare file modification times or content hashes against the last manifest
- Re-chunk only changed files: don’t re-process the entire corpus
- Delete stale vectors: if a file was removed, or a changed file now yields fewer chunks, delete the leftover vectors. IDs are hashes, so the prefix identifies only the pipeline; look up the file’s IDs in the last manifest
- Generate new embeddings: only for new/changed chunks
- Upsert: deterministic IDs ensure changed content overwrites cleanly
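```python
# Delete all vectors from a removed source.
# should_delete(vid) is a caller-supplied predicate, for example membership
# in the set of IDs the last manifest recorded for the removed file.
ids_to_delete = []
for batch in index.list(prefix="topic-"):
    for vid in batch:
        if should_delete(vid):
            ids_to_delete.append(vid)

if ids_to_delete:
    for i in range(0, len(ids_to_delete), 100):
        index.delete(ids=ids_to_delete[i:i + 100])
```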
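The first step of the workflow, finding what changed, can be sketched with content hashes. This assumes the previous run stored a filename-to-hash map alongside its manifest, which the manifest format in this document does not include by default. `content_hash` and `diff_files` are hypothetical names.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    """Hash file contents so edits are detected regardless of timestamps."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_files(
    previous: Dict[str, str], current: Dict[str, str]
) -> Tuple[List[str], List[str], List[str]]:
    """Compare filename -> hash maps; return (added, changed, removed)."""
    added = sorted(f for f in current if f not in previous)
    changed = sorted(
        f for f in current if f in previous and current[f] != previous[f]
    )
    removed = sorted(f for f in previous if f not in current)
    return added, changed, removed


# Made-up file contents for illustration.
previous = {"a.md": content_hash("old text"), "b.md": content_hash("same")}
current = {
    "a.md": content_hash("new text"),
    "b.md": content_hash("same"),
    "c.md": content_hash("brand new"),
}
added, changed, removed = diff_files(previous, current)
print(added, changed, removed)
```

Files in `added` and `changed` go through chunking and embedding again. Files in `removed` feed the delete step.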
9. Manifest Files
Every embedding pipeline should write a manifest JSON for auditability:
```json
{
  "total_vectors": 272,
  "embedding_model": "text-embedding-3-small",
  "embedding_dim": 1536,
  "topics": { "safety-medicine": 31, "equipment": 17 },
  "regions": { "caribbean": 48, "general": 62 },
  "vectors": [
    {
      "id": "topic-3a8f2c1d9e0b4567",
      "metadata": { "source": "equipment-guide.md", "topic": "equipment" },
      "text_preview": "First 200 chars of chunk..."
    }
  ]
}
```
Manifests enable:
- Comparing what _should_ be in the index vs. what _is_ there
- Re-running embedding without re-chunking
- Auditing coverage by topic/region
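A manifest in the shape shown above can be assembled from the chunked vectors before embedding. This is a minimal sketch: `build_manifest` is a hypothetical name, each input vector is assumed to be a dict with "id", "text", and "metadata" keys, and the "unknown" fallback for missing tags is an illustrative choice.

```python
import json
from collections import Counter
from typing import Any, Dict, List


def build_manifest(
    vectors: List[Dict[str, Any]],
    embedding_model: str,
    embedding_dim: int,
    preview_chars: int = 200,
) -> Dict[str, Any]:
    """Summarize a pipeline run in the manifest shape shown above."""
    topics = Counter(v["metadata"].get("topic", "unknown") for v in vectors)
    regions = Counter(v["metadata"].get("region", "unknown") for v in vectors)
    return {
        "total_vectors": len(vectors),
        "embedding_model": embedding_model,
        "embedding_dim": embedding_dim,
        "topics": dict(topics),
        "regions": dict(regions),
        "vectors": [
            {
                "id": v["id"],
                "metadata": v["metadata"],
                "text_preview": v["text"][:preview_chars],
            }
            for v in vectors
        ],
    }


# Two made-up chunks for illustration.
chunks = [
    {"id": "topic-1", "text": "a" * 500,
     "metadata": {"topic": "equipment", "region": "general"}},
    {"id": "topic-2", "text": "b" * 50,
     "metadata": {"topic": "equipment", "region": "caribbean"}},
]
manifest = build_manifest(chunks, "text-embedding-3-small", 1536)
print(json.dumps(manifest["topics"]), manifest["total_vectors"])
```

Writing the result with `json.dump` before the embedding call means a failed run still leaves a record of what should have been upserted.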
10. Environment Variables
All embedding scripts should use these environment variables:
| Variable | Required | Default | Purpose |
|---|---|---|---|
| OPENAI_API_KEY | Yes | (none) | OpenAI API for embedding generation |
| PINECONE_API_KEY | Yes | (none) | Pinecone API for vector operations |
| PINECONE_INDEX_NAME | No | scubagpt-1536 | Target index name |
Never hardcode API keys. Pass via environment variables or secure credential stores.
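Scripts can fail fast when a required variable is missing. A minimal sketch, where `load_config` is a hypothetical helper that takes the environment as a mapping so it can be exercised without real credentials:

```python
import os
from typing import Dict, Mapping

REQUIRED = ("OPENAI_API_KEY", "PINECONE_API_KEY")
DEFAULT_INDEX = "scubagpt-1536"


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read pipeline settings, raising if a required key is absent or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {
        "openai_api_key": env["OPENAI_API_KEY"],
        "pinecone_api_key": env["PINECONE_API_KEY"],
        "index_name": env.get("PINECONE_INDEX_NAME", DEFAULT_INDEX),
    }


# Placeholder values, not real keys.
config = load_config({"OPENAI_API_KEY": "sk-test", "PINECONE_API_KEY": "pc-test"})
print(config["index_name"])
```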
Anti-Patterns
- Embedding structured JSON data (dive site databases, analytics JSON): These are better served by direct lookup. Only embed text with narrative content.
- 8,000-character truncation: Dense text (scientific papers, tables) can exceed 8,192 tokens at 8,000 chars. Use 6,000 characters as the safe ceiling.
- No manifest: Without a manifest, you can’t audit what’s in the index or detect coverage gaps.
- No verification after upsert: Always spot-check fetched IDs and run semantic queries to confirm the new content is retrievable.
- Embedding without metadata: Bare vectors with no topic, region, or source metadata make filtered retrieval impossible and debugging very difficult.
- Non-deterministic IDs: Random UUIDs mean re-running a pipeline doubles your vectors. Use content-derived hashes so re-runs are idempotent.
- Single monolithic pipeline: Separate pipelines by content type (PDF, markdown, almanac) so each can be re-run independently.
- Embedding disambiguation/terminology files: JSON lookup tables for term disambiguation are consumed by exact-match lookup, not semantic search, so they don’t belong in the index.
Inputs Required
- Pinecone API key and index name
- OpenAI API key (for embedding model)
- Content to embed: markdown files, extracted PDF text, or other narrative content
- Existing index state (for gap analysis)
Output Format
- Embedded vectors upserted to Pinecone index
- Manifest JSON file per pipeline run (vector IDs, metadata, text previews)
- Verification report (total count, spot-check results, sample queries)
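The verification report could be a plain dict built from numbers the verification step already collects. The field names below are illustrative assumptions, not a fixed schema, and the figures in the example are made up. Index statistics may lag briefly after an upsert, so a count mismatch is a prompt to re-check, not proof of failure.

```python
from typing import Any, Dict, List


def verification_report(
    count_before: int,
    count_after: int,
    expected_new: int,
    spot_found: int,
    spot_sampled: int,
    sample_queries: List[str],
) -> Dict[str, Any]:
    """Summarize post-upsert checks; expected_new counts IDs not already indexed."""
    added = count_after - count_before
    return {
        "total_before": count_before,
        "total_after": count_after,
        "added": added,
        "count_ok": added == expected_new,
        "spot_check": f"{spot_found}/{spot_sampled}",
        "spot_check_ok": spot_found == spot_sampled,
        "sample_queries": sample_queries,
    }


report = verification_report(1000, 1272, 272, 10, 10, ["sample safety query"])
print(report["count_ok"], report["spot_check"])
```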
Reference Implementation
See products/scuba-gpt/data-pipelines/:
- 05_pinecone_reembed.py: PDF corpus chunking, embedding, and upsert with rich metadata
- 12_embed_almanac.py: Almanac markdown section-splitting, embedding, and upsert
- 14_embed_kb_markdown.py: Topics and destinations markdown embedding with heading-based chunking
