Embeddings API Guide: Turning Text into Vectors for Search & RAG
The embeddings API is one of the most quietly powerful primitives in modern AI development. You send text in, you get back a dense numeric vector — typically hundreds or thousands of floating-point numbers — and that vector captures the semantic meaning of the input in a form machines can operate on. Texts that mean similar things map to nearby points in vector space; unrelated texts land far apart. That single property unlocks semantic search, retrieval-augmented generation (RAG), clustering, classification, deduplication, and recommendation — all without a fine-tuned model for each task.
This guide covers how the embeddings API works, what you can build with it, how to call it, how to choose an embedding model, and practical tips for production use — including why cost tracking matters more than most developers expect.
What are text embeddings?
An embedding is a fixed-length vector of floats produced by an encoder model trained to preserve semantic relationships. The training objective forces the model to place semantically similar inputs close together in the vector space and dissimilar inputs far apart. The result is that "I need to cancel my subscription" and "How do I stop my billing?" land very near each other, even though they share no keywords.
A few properties worth knowing upfront:
- Dimensionality: Common embedding models output 384, 768, 1536, or 3072 dimensions depending on the model. More dimensions generally mean more expressive power, but also more memory and compute when searching at scale.
- Normalization: Many models return unit-normalized vectors, which makes cosine similarity equivalent to a dot product — useful for performance in vector databases.
- Max input length: Every model has a token limit for the input. Exceeding it typically means the text is truncated or the request errors. For long documents, you need to chunk before embedding.
- Language coverage: Some models are trained on multilingual data and produce compatible vectors across languages; others are English-centric. Check model cards before assuming cross-lingual search will work.
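To make the dimensionality trade-off concrete, here is a back-of-the-envelope estimate of raw vector storage. It assumes vectors are stored as float32 (4 bytes per value) and ignores index and metadata overhead; the function name and the one-million-vector corpus are illustrative, not taken from any provider.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for dense vectors, excluding index and metadata overhead."""
    return num_vectors * dimensions * bytes_per_value


# Compare the dimension counts mentioned above for a hypothetical 1M-vector corpus.
for dims in (384, 768, 1536, 3072):
    gib = raw_vector_bytes(1_000_000, dims) / 2**30
    print(f"{dims:>4} dims: {gib:.2f} GiB per million vectors")
```

Real vector indexes add their own overhead on top of this, so treat the numbers as a lower bound.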
What embeddings power
Once you have vectors, you can do a lot with distance math:
- Semantic search: Embed the user's query, search a vector index for the nearest document chunks. Unlike keyword search, it finds relevant results even when the user's phrasing differs from the indexed text.
- Retrieval-augmented generation (RAG): The dominant architecture for grounding LLM responses in private or recent data. You retrieve relevant chunks via semantic search, inject them into the prompt context, and the LLM synthesizes an answer.
- Clustering: Group documents by topic without predefined labels. Embed a corpus, run k-means or HDBSCAN on the vectors, inspect the clusters.
- Classification: Train a lightweight classifier (logistic regression, SVM) on top of embeddings. This often outperforms bag-of-words approaches while needing only a fraction of the labeled data.
- Deduplication: High cosine similarity between two document embeddings is a strong signal they contain the same content, even with minor rewording.
- Recommendations: Embed items and users (or their interaction history) into the same space. Surface items nearest to a user's embedding.
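As one example of "distance math" from the list above, here is a minimal deduplication sketch. It uses toy 3-dimensional vectors in place of real embeddings, and the 0.95 threshold is an assumption you would tune on your own data. It also assumes no vector is all zeros.

```python
import numpy as np


def find_near_duplicates(vectors, threshold=0.95):
    """Return (i, j, similarity) for every pair with cosine similarity >= threshold."""
    m = np.asarray(vectors, dtype=float)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)  # unit-normalize each row
    sims = m @ m.T                                    # pairwise cosine similarities
    pairs = []
    for i in range(len(m)):
        for j in range(i + 1, len(m)):
            if sims[i, j] >= threshold:
                pairs.append((i, j, float(sims[i, j])))
    return pairs


# Toy vectors: the first two point in nearly the same direction, the third does not.
toy = [[1.0, 0.0, 0.0], [0.99, 0.05, 0.0], [0.0, 1.0, 0.0]]
print(find_near_duplicates(toy))
```

The pairwise matrix grows quadratically with corpus size, so this brute-force form only suits small collections.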
The API shape
The embeddings endpoint follows a simple, OpenAI-compatible contract: POST /v1/embeddings with a JSON body containing input (a string or array of strings) and model. The response includes a data array with one entry per input; each entry carries the embedding itself and the index of the input it belongs to.
curl example
curl https://api.openai.com/v1/embeddings \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-3-small",
    "input": [
      "Semantic search finds results by meaning, not keywords.",
      "Vector databases store high-dimensional embeddings for fast nearest-neighbor queries."
    ]
  }'
The response looks like:
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0023, -0.0091, 0.0412, ...] // 1536 floats for text-embedding-3-small
    },
    {
      "object": "embedding",
      "index": 1,
      "embedding": [0.0019, -0.0083, 0.0389, ...]
    }
  ],
  "model": "text-embedding-3-small",
  "usage": { "prompt_tokens": 28, "total_tokens": 28 }
}
Python example
from openai import OpenAI
import numpy as np

client = OpenAI(api_key="sk-...")

texts = [
    "Semantic search finds results by meaning, not keywords.",
    "Vector databases store high-dimensional embeddings.",
    "The cat sat on the mat.",  # intentionally unrelated
]

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts,
)

# Extract vectors in index order
vectors = [item.embedding for item in sorted(response.data, key=lambda x: x.index)]

# Cosine similarity between first two (should be high)
def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(vectors[0], vectors[1]))  # ~0.88: similar topic
print(cosine_sim(vectors[0], vectors[2]))  # ~0.28: unrelated
Because the endpoint is OpenAI-compatible, the same code works against any provider that implements /v1/embeddings — just swap base_url and api_key. That's exactly what the OpenAI-compatible API pattern enables.
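To show what "swap base_url and api_key" means at the HTTP level, here is a standard-library sketch that builds, but does not send, the request described above. The localhost URL and both key strings are hypothetical placeholders, and in practice the model name usually changes with the provider as well.

```python
import json
import urllib.request


def build_embeddings_request(base_url, api_key, model, texts):
    """Build (without sending) a POST request for an OpenAI-compatible embeddings endpoint."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same builder, two targets: only base_url and api_key differ.
targets = [
    ("https://api.openai.com/v1", "OPENAI_KEY_PLACEHOLDER"),
    ("http://localhost:8080/v1", "LOCAL_KEY_PLACEHOLDER"),  # hypothetical self-hosted server
]
for base_url, key in targets:
    req = build_embeddings_request(base_url, key, "text-embedding-3-small", ["hello"])
    print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it; the sketch stops short of that so it runs without credentials or network access.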
Choosing an embedding model
The right model depends on your use case. Key axes to evaluate (verify current specs and pricing directly with each provider before committing):
- Dimensions: Higher dimensions can encode more nuance but cost more to store and search. Some models support Matryoshka representation learning, letting you truncate to fewer dimensions with graceful quality degradation.
- Max input tokens: If your documents are long, you need either a model with a large context window or a solid chunking strategy. Typical limits range from 512 to 8192 tokens.
- Multilingual support: If your users write in multiple languages and you want cross-lingual retrieval, confirm the model was trained on multilingual data and benchmark it on your languages specifically.
- Cost per token: Embedding models are generally much cheaper than chat models, but at high volume (millions of documents), cost adds up. Always check the provider's current pricing page.
- Open-weight options: Models like nomic-embed-text, bge-m3, and e5-mistral-7b-instruct are available through inference providers and can be self-hosted. See best open-source LLM APIs for a survey of where to run them.
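The Matryoshka-style truncation mentioned under Dimensions can be sketched as follows: keep the first N values, then rescale to unit length so dot-product scoring still behaves like cosine similarity. This only preserves quality for models trained to support it, and some providers expose a request parameter that does the shortening server-side, so check the docs first. The toy 4-dimensional vector stands in for a real embedding.

```python
import numpy as np


def truncate_embedding(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    v = np.asarray(vector, dtype=float)[:dims]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


full = np.array([0.6, 0.0, 0.8, 0.0])  # toy unit vector, not real model output
short = truncate_embedding(full, 2)
print(short, np.linalg.norm(short))
```

Queries and documents must be truncated to the same length, otherwise their vectors are no longer comparable.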
A reasonable default for English-only RAG at moderate scale: start with a well-supported hosted embedding model, benchmark retrieval quality on a held-out set from your own data, then explore open-weight alternatives if cost becomes a factor at your volume.
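Benchmarking retrieval quality on a held-out set can be as simple as recall@k: for each test query with a known relevant document, check whether that document shows up in the top k results. The sketch below assumes exactly one relevant document per query and uses toy 2-dimensional vectors; in practice you would pass embeddings from each model you are comparing.

```python
import numpy as np


def recall_at_k(query_vectors, doc_vectors, relevant_doc_ids, k=5):
    """Fraction of queries whose known-relevant document appears in the top-k results."""
    q = np.asarray(query_vectors, dtype=float)
    d = np.asarray(doc_vectors, dtype=float)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = q @ d.T                            # cosine similarity, shape (n_queries, n_docs)
    top_k = np.argsort(-scores, axis=1)[:, :k]  # best-scoring document indices per query
    hits = [rel in row for rel, row in zip(relevant_doc_ids, top_k)]
    return sum(hits) / len(hits)


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9]]
relevant = [0, 2]  # hand-labeled: query 0 -> doc 0, query 1 -> doc 2
print(recall_at_k(queries, docs, relevant, k=1), recall_at_k(queries, docs, relevant, k=2))
```

Running the same labeled queries through two candidate models gives a like-for-like comparison on your own data.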
Practical tips for production
Normalize your vectors
If your model doesn't return unit-normalized vectors, normalize them before storing. Cosine similarity then reduces to a plain dot product, which is cheaper to compute and is a metric that common approximate nearest-neighbor (ANN) index types (HNSW, IVF) support directly.
import numpy as np

def normalize(v):
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

normalized = [normalize(np.array(item.embedding)) for item in response.data]
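If you want to convince yourself that normalization really turns cosine similarity into a dot product, a quick check like the one below works. Random vectors stand in for real embeddings, and the helper name is illustrative.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity computed the long way, with explicit norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=8)  # random stand-ins for embeddings
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)

# After normalization, the plain dot product matches cosine similarity of the originals.
print(np.isclose(np.dot(a_unit, b_unit), cosine_sim(a, b)))
```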
Batch your inputs
The embeddings API accepts an array of strings. Sending a batch of 100 texts in a single request is far more efficient than 100 individual requests: less per-request overhead, fewer connections, and often better throughput. Providers cap batch size, commonly somewhere between 100 and 2048 inputs per request; check the docs and split your batches accordingly.
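A small helper for splitting a corpus into request-sized batches might look like this. The batch size of 100 is an assumption; use whatever limit your provider documents.

```python
from typing import Iterator, List, Sequence


def batched(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


texts = [f"document {i}" for i in range(250)]
sizes = [len(batch) for batch in batched(texts, 100)]
print(sizes)  # [100, 100, 50]

# Each batch would then go out as one request, for example:
#   client.embeddings.create(model="text-embedding-3-small", input=batch)
```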
Chunk documents thoughtfully
Long documents need to be split before embedding. Common strategies:
- Fixed-size chunks: Simple to implement. Overlap adjacent chunks by 10–20% of the chunk size so that text cut at a chunk boundary is less likely to lose its surrounding context.
- Sentence or paragraph splitting: Preserves natural semantic units. Libraries like langchain and llama-index include ready-made splitters.
- Semantic chunking: Split where embedding similarity drops sharply between adjacent sentences. This is more expensive but often improves retrieval quality for heterogeneous documents.
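The fixed-size strategy is the easiest to sketch. The version below counts whitespace-separated words as a rough proxy for tokens, which is an assumption: a real pipeline would count tokens with the embedding model's own tokenizer. The default sizes are illustrative.

```python
def chunk_words(text, chunk_size=200, overlap=30):
    """Split text into chunks of up to `chunk_size` words; neighbors share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```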
Pick the right vector database
Once you have vectors, you need somewhere to store and query them. Options range from purpose-built vector databases (Pinecone, Qdrant, Weaviate, Milvus) to extensions on existing stores (pgvector for Postgres, sqlite-vec for SQLite). For small datasets (<100k vectors), an in-memory index like FAISS or even a brute-force NumPy dot product is often sufficient and avoids operational overhead.
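For the small-dataset case, the brute-force NumPy option is only a few lines. This sketch assumes the vectors are already unit-normalized, so the dot product is the cosine similarity; the toy 2-dimensional vectors are placeholders for real embeddings.

```python
import numpy as np


def top_k(query_vector, doc_matrix, k=3):
    """Return (index, score) pairs for the k highest dot-product scores."""
    scores = np.asarray(doc_matrix, dtype=float) @ np.asarray(query_vector, dtype=float)
    k = min(k, len(scores))
    best = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in best]


docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
print(top_k(np.array([0.0, 1.0]), docs, k=2))
```

This scans every vector on each query, which is fine at this scale and is also a useful correctness baseline when you later move to an approximate index.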
Track embedding costs separately
Embedding tokens are cheap — often 10–20× cheaper per token than chat completions. But RAG pipelines can embed a lot: every document in your corpus at index time, and every user query at inference time. At scale, embedding costs are real. Run your LLM calls through a gateway that tracks usage per model, so you can see embedding token consumption alongside completion token consumption and make informed decisions about re-embedding frequency and model choice.
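A rough cost model helps decide how often re-embedding is affordable. In the sketch below the price is a hypothetical placeholder, not a quote from any provider, and the corpus and traffic figures are made up for illustration. In a live pipeline you would sum the total_tokens value from the usage field of each embeddings response instead of estimating.

```python
def embedding_cost_usd(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of embedding `total_tokens` tokens at a given price per million tokens."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


PRICE = 0.02  # hypothetical USD per 1M tokens; check your provider's pricing page

index_tokens = 50_000 * 400               # one-time: 50k chunks of ~400 tokens each
query_tokens_per_month = 1_000_000 * 20   # recurring: 1M queries of ~20 tokens each

print(f"index build:   ${embedding_cost_usd(index_tokens, PRICE):.2f}")
print(f"queries/month: ${embedding_cost_usd(query_tokens_per_month, PRICE):.2f}")
```

Every full re-index repeats the index-time cost, which is why re-embedding frequency and model choice are worth tracking explicitly.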
flo2 routes requests to any provider using your own keys with zero token markup and full per-request usage visibility — so you can see exactly what your embedding pipeline is costing you as it scales. flo2 is free during beta.
Putting it together: a minimal RAG pipeline
from openai import OpenAI
import numpy as np

client = OpenAI(
    base_url="https://flo2.com/v1",  # or any OpenAI-compatible gateway
    api_key="your-flo2-key",
)

# --- Index time ---
documents = [
    "flo2 is a developer-first LLM gateway with zero token markup.",
    "RAG grounds LLM responses in retrieved context.",
    "Vector databases store embeddings for fast similarity search.",
]
index_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=documents,
)
doc_vectors = np.array([item.embedding for item in sorted(index_response.data, key=lambda x: x.index)])

# --- Query time ---
query = "How does flo2 handle pricing?"
query_response = client.embeddings.create(model="text-embedding-3-small", input=[query])
query_vector = np.array(query_response.data[0].embedding)

# Cosine similarity (vectors already unit-normalized by this model)
scores = doc_vectors @ query_vector
top_idx = int(np.argmax(scores))
retrieved_context = documents[top_idx]

# Feed retrieved context to chat completions
chat_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{retrieved_context}"},
        {"role": "user", "content": query},
    ],
)
print(chat_response.choices[0].message.content)
This is a minimal but complete RAG loop: embed documents, embed query, find the nearest document, inject it as context, generate an answer. Real implementations add chunking, a proper vector database, metadata filtering, and reranking — but the core pattern is exactly this.
If you're building on multiple providers or want to swap embedding models without changing application code, flo2 gives you a single OpenAI-compatible endpoint that routes to whichever provider and model you configure — with your own API keys, zero markup, and usage tracking built in.