2026-06-03 · flo2 blog

Embeddings API Guide: Turning Text into Vectors for Search & RAG

The embeddings API is one of the most quietly powerful primitives in modern AI development. You send text in, you get back a dense numeric vector — typically hundreds or thousands of floating-point numbers — and that vector captures the semantic meaning of the input in a form machines can operate on. Texts that mean similar things map to nearby points in vector space; unrelated texts land far apart. That single property unlocks semantic search, retrieval-augmented generation (RAG), clustering, classification, deduplication, and recommendation — all without a fine-tuned model for each task.

This guide covers how the embeddings API works, what you can build with it, how to call it, how to choose an embedding model, and practical tips for production use — including why cost tracking matters more than most developers expect.

What are text embeddings?

An embedding is a fixed-length vector of floats produced by an encoder model trained to preserve semantic relationships. The training objective forces the model to place semantically similar inputs close together in the vector space and dissimilar inputs far apart. The result is that "I need to cancel my subscription" and "How do I stop my billing?" land very near each other, even though they share no keywords.

A few properties worth knowing upfront:

What embeddings power

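Once you have vectors, you can do a lot with distance math: semantic search, retrieval-augmented generation (RAG), clustering, classification, deduplication, and recommendation.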

The API shape

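The embeddings endpoint follows a simple, OpenAI-compatible contract: POST /v1/embeddings with a JSON body containing input (a string or array of strings) and model. The response includes a data array with one object per input; each object carries the vector in its embedding field and the position of the corresponding input in its index field.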

curl example

curl https://api.openai.com/v1/embeddings \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-3-small",
    "input": [
      "Semantic search finds results by meaning, not keywords.",
      "Vector databases store high-dimensional embeddings for fast nearest-neighbor queries."
    ]
  }'

The response looks like:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0023, -0.0091, 0.0412, ...]  // 1536 floats for text-embedding-3-small
    },
    {
      "object": "embedding",
      "index": 1,
      "embedding": [0.0019, -0.0083, 0.0389, ...]
    }
  ],
  "model": "text-embedding-3-small",
  "usage": { "prompt_tokens": 28, "total_tokens": 28 }
}

Python example

from openai import OpenAI
import numpy as np

client = OpenAI(api_key="sk-...")

texts = [
    "Semantic search finds results by meaning, not keywords.",
    "Vector databases store high-dimensional embeddings.",
    "The cat sat on the mat.",  # intentionally unrelated
]

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts,
)

# Extract vectors in index order
vectors = [item.embedding for item in sorted(response.data, key=lambda x: x.index)]

# Cosine similarity between first two (should be high)
def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

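# Exact scores depend on the model; compare them relative to each other
print(cosine_sim(vectors[0], vectors[1]))  # expected to be the higher score: related topics
print(cosine_sim(vectors[0], vectors[2]))  # expected to be the lower score: unrelated text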

Because the endpoint is OpenAI-compatible, the same code works against any provider that implements /v1/embeddings — just swap base_url and api_key. That's exactly what the OpenAI-compatible API pattern enables.
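As a rough sketch of that swap, the provider choice can live in a small lookup table so application code never hard-codes a base URL. The two URLs are the ones used in this guide; the provider labels, the FLO2_API_KEY variable name, and the client_kwargs helper are assumptions made for illustration.

```python
import os

# Base URLs taken from this guide. The labels and the FLO2_API_KEY
# environment variable name are hypothetical.
PROVIDERS = {
    "openai": {"base_url": "https://api.openai.com/v1", "key_env": "OPENAI_API_KEY"},
    "flo2": {"base_url": "https://flo2.com/v1", "key_env": "FLO2_API_KEY"},
}

def client_kwargs(provider: str) -> dict:
    """Return keyword arguments for OpenAI(...) for the named provider."""
    cfg = PROVIDERS[provider]
    # Read the key from the environment instead of embedding it in source code
    return {"base_url": cfg["base_url"], "api_key": os.environ.get(cfg["key_env"], "")}

# Usage (requires the openai package):
#   from openai import OpenAI
#   client = OpenAI(**client_kwargs("flo2"))
```

Keeping the key in an environment variable also avoids committing credentials alongside the code.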

Choosing an embedding model

The right model depends on your use case. Key axes to evaluate (verify current specs and pricing directly with each provider before committing):

A reasonable default for English-only RAG at moderate scale: start with a well-supported hosted embedding model, benchmark retrieval quality on a held-out set from your own data, then explore open-weight alternatives if cost becomes a factor at your volume.

Practical tips for production

Normalize your vectors

If your model doesn't return unit-normalized vectors, normalize them before storing. For unit-length vectors, cosine similarity reduces to a plain dot product, which is cheaper to compute and is a metric that common approximate nearest-neighbor (ANN) index types such as HNSW and IVF support directly.

import numpy as np

def normalize(v):
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

normalized = [normalize(np.array(item.embedding)) for item in response.data]

Batch your inputs

The embeddings API accepts an array of strings. Sending 100 texts in one request is far more efficient than sending 100 individual requests: you pay the per-request latency overhead once, open fewer connections, and usually get better throughput. Providers cap the number of inputs per request, commonly somewhere between 100 and 2048, and many also limit the total tokens per request, so check the docs and split your inputs into batches accordingly.
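A minimal sketch of that batching logic, assuming a cap of 100 inputs per request. The embed_batch argument is a hypothetical callable that takes a list of strings and returns one vector per string, for example a thin wrapper around client.embeddings.create.

```python
from typing import Callable, Iterator, List

def batched(texts: List[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of texts with at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

def embed_all(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed every text, one request per batch, preserving input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        # embed_batch is assumed to return vectors in the same order as its input
        vectors.extend(embed_batch(batch))
    return vectors
```

Passing the embedding call in as a function keeps the batching logic independent of any one provider and easy to test with a stub.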

Chunk documents thoughtfully

Long documents need to be split before embedding. Common strategies:
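One widely used approach is fixed-size chunks with overlap, so that a sentence cut at a chunk boundary still appears whole in the neighboring chunk. The sketch below is only an illustration: it counts whitespace-separated words rather than model tokens, and the default sizes are arbitrary assumptions you would tune on your own data.

```python
from typing import List

def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of chunk_size words that share overlap words."""
    if chunk_size < 1 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size >= 1 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

For example, chunk_words(doc, chunk_size=4, overlap=1) on a ten-word document yields three chunks, each starting with the last word of the previous one.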

Pick the right vector database

Once you have vectors, you need somewhere to store and query them. Options range from purpose-built vector databases (Pinecone, Qdrant, Weaviate, Milvus) to extensions for existing stores (pgvector for Postgres, sqlite-vec for SQLite). For small datasets (under roughly 100k vectors), an in-memory index built with a library like FAISS, or even a brute-force NumPy dot product over all vectors, is often sufficient and avoids operational overhead.
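For the small-dataset case, a brute-force search might look like the following sketch. The top_k helper is a hypothetical name, the vectors are toy 2-dimensional values rather than real embeddings, and the code assumes no vector has zero length.

```python
import numpy as np

def top_k(doc_vectors: np.ndarray, query_vector: np.ndarray, k: int = 3):
    """Return (index, score) pairs for the k highest cosine similarities."""
    # Normalize rows and the query so the dot product equals cosine similarity
    docs = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    query = query_vector / np.linalg.norm(query_vector)
    scores = docs @ query
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy data: three "documents" in a 2-dimensional space
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(top_k(docs, np.array([1.0, 0.1]), k=2))
```

This scans every vector on each query, which is fine at this scale; an ANN index becomes worthwhile once that scan is too slow for your latency budget.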

Track embedding costs separately

Embedding tokens are cheap — often 10–20× cheaper per token than chat completions. But RAG pipelines can embed a lot: every document in your corpus at index time, and every user query at inference time. At scale, embedding costs are real. Run your LLM calls through a gateway that tracks usage per model, so you can see embedding token consumption alongside completion token consumption and make informed decisions about re-embedding frequency and model choice.
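To make that concrete, here is a back-of-the-envelope estimate. Every number in it is a placeholder assumption (the price per million tokens, the corpus size, the query volume); substitute your provider's current pricing and the total_tokens values reported in the usage field of your own responses.

```python
def embedding_cost_usd(total_tokens: int, price_per_million_tokens: float) -> float:
    """Cost in USD for a token count at a given price per one million tokens."""
    return total_tokens / 1_000_000 * price_per_million_tokens

# All figures below are hypothetical placeholders, not real pricing
PRICE_PER_MILLION = 0.02                  # USD per 1M embedding tokens
corpus_tokens = 50_000 * 800              # 50k chunks of about 800 tokens each
monthly_query_tokens = 1_000_000 * 20     # 1M queries of about 20 tokens each

index_cost = embedding_cost_usd(corpus_tokens, PRICE_PER_MILLION)
query_cost = embedding_cost_usd(monthly_query_tokens, PRICE_PER_MILLION)

print(f"one-time indexing: ${index_cost:.2f}")
print(f"monthly queries:   ${query_cost:.2f}")
```

Note that the indexing cost recurs in full whenever you re-embed the corpus, for instance after switching models or changing your chunking strategy.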

flo2 routes requests to any provider using your own keys with zero token markup and full per-request usage visibility — so you can see exactly what your embedding pipeline is costing you as it scales. flo2 is free during beta.

Putting it together: a minimal RAG pipeline

from openai import OpenAI
import numpy as np

client = OpenAI(
    base_url="https://flo2.com/v1",  # or any OpenAI-compatible gateway
    api_key="your-flo2-key",
)

# --- Index time ---
documents = [
    "flo2 is a developer-first LLM gateway with zero token markup.",
    "RAG grounds LLM responses in retrieved context.",
    "Vector databases store embeddings for fast similarity search.",
]

index_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=documents,
)
doc_vectors = np.array([item.embedding for item in sorted(index_response.data, key=lambda x: x.index)])

# --- Query time ---
query = "How does flo2 handle pricing?"
query_response = client.embeddings.create(model="text-embedding-3-small", input=[query])
query_vector = np.array(query_response.data[0].embedding)

# Cosine similarity (vectors already unit-normalized by this model)
scores = doc_vectors @ query_vector
top_idx = int(np.argmax(scores))
retrieved_context = documents[top_idx]

# Feed retrieved context to chat completions
chat_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{retrieved_context}"},
        {"role": "user", "content": query},
    ],
)
print(chat_response.choices[0].message.content)

This is a minimal but complete RAG loop: embed documents, embed query, find the nearest document, inject it as context, generate an answer. Real implementations add chunking, a proper vector database, metadata filtering, and reranking — but the core pattern is exactly this.

If you're building on multiple providers or want to swap embedding models without changing application code, flo2 gives you a single OpenAI-compatible endpoint that routes to whichever provider and model you configure — with your own API keys, zero markup, and usage tracking built in.
