Engineering · Featured

Embedding + Rerank Gateway: Rust vs Python (28% Faster, 67% Less RAM)

We built the same embedding + rerank gateway in Python, Rust (ONNX), and a split architecture — then benchmarked all three on GCP. Rust hit 28% more RPS with 67% less memory. Same model, same API.

NavyaAI Engineering Team
10 min read
RAG · Embeddings · Search · AI Infrastructure · Scalability · Performance

Embedding + Rerank Gateways: Small Services, Big Performance Wins

The Tiny Service Everyone Depends On

Every RAG product, enterprise search tool, and "chat with docs" feature quietly depends on the same thing:

An Embedding + Rerank gateway.

It looks boring from the outside:

  • Accept chunks of text
  • Call an embeddings provider
  • Store vectors somewhere
  • Serve /search with retrieval + rerank + citations

But this tiny service is doing a lot of work:

  • Fanning out to embedding APIs
  • Normalizing and deduplicating content
  • Handling high-concurrency search traffic
  • Keeping vector storage and metadata consistent

It also sits directly on your hot path:

  • Every query hits it
  • Every ingestion job goes through it
  • Every bad performance decision shows up in p95 latency

So we asked a simple question:

What does a good Embedding + Rerank gateway look like — and how well can a single instance perform?


Three Implementations, One API

We built the same gateway in three ways: Python (embeddings in-process), Go (thin layer, no ML), and Rust + ONNX (same model as Python for apples-to-apples comparison). All expose the same four endpoints so we can benchmark them with the same harness:

  • Python: FastAPI + sentence-transformers/all-MiniLM-L6-v2 (CPU), in-memory NumPy index, real embeddings.
  • Go: Stdlib HTTP + in-memory index, deterministic pseudo-embeddings (no ML runtime); same request/response shapes.
  • Rust (ONNX): Same model (all-MiniLM-L6-v2) exported to ONNX and run via ONNX Runtime; HuggingFace tokenizers; same API. Use this when you want apples-to-apples throughput, latency, and footprint vs Python.
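The pseudo-embedding idea is worth a sketch: derive a deterministic unit vector from a hash of the text, so identical inputs always map to identical vectors. This is a hypothetical Python illustration of the concept, not the repo's exact scheme (the Go and Rust gateways implement their own):

```python
import hashlib
import math
import struct

def pseudo_embed(text: str, dim: int = 384) -> list[float]:
    """Deterministic stand-in for a real embedding: same text -> same unit vector."""
    values = []
    counter = 0
    while len(values) < dim:
        # Stretch the hash into as many bytes as we need via a counter suffix.
        digest = hashlib.sha256(f"{text}:{counter}".encode()).digest()
        for i in range(0, len(digest) - 3, 4):
            (n,) = struct.unpack("<I", digest[i:i + 4])
            values.append(n / 2**32 - 0.5)  # roughly uniform in [-0.5, 0.5)
        counter += 1
    values = values[:dim]
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]
```

Vectors like these carry no semantics, but they exercise the same code paths (tokenize-free ingest, index, search), which is all a footprint or throughput benchmark of the gateway layer needs.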

Shared API:

  • POST /v1/ingest – Batch ingest of documents (async job)
  • GET /v1/jobs/{id} – Job status
  • POST /v1/search – Query → top‑K results + scores + metadata
  • GET /healthz – Basic health probe
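In concrete terms, the /v1/search request and response shapes look roughly like this. The query, top_k, results, and took_ms fields come from the API above; the nested result fields (id, text, score) are illustrative assumptions:

```python
# Hypothetical payloads for POST /v1/search (field names inside each
# result are assumptions for illustration).
search_request = {
    "query": "how do I rotate API keys?",
    "top_k": 5,
}

search_response = {
    "results": [
        {
            "id": "doc-42#chunk-3",
            "text": "Rotate keys via ...",
            "score": 0.87,
        }
    ],
    "took_ms": 12.4,
}
```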

You can find the implementation in the repo:

github.com/xadnavyaai/NavyaAIBlogs/embedding-rerank-gateway-high-performance

The Python gateway search endpoint looks like this:

import time

from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

app = FastAPI(title="Embedding + Rerank Gateway")

# SearchRequest, SearchResponse, and the in-memory `index` are defined
# elsewhere in the service; this is the hot-path handler.
@app.post("/v1/search", response_model=SearchResponse)
def search(request: SearchRequest) -> SearchResponse:
    start = time.time()
    results = index.search(request.query, top_k=request.top_k)
    took_ms = (time.time() - start) * 1000.0
    return SearchResponse(results=results, took_ms=took_ms)

This setup is intentionally modest:

  • No external vector database
  • No distributed queue
  • No exotic runtime features

The result: a direct comparison of Python, Go, and Rust (ONNX) — and of Python vs Rust on the same model.


How We Measured It

We used a small harness that sends concurrent POST /v1/search requests, records per-request timestamps, and computes p50/p95/p99 latency and throughput. RSS for the gateway process is sampled under load.

Metrics: p95 latency for /v1/search, throughput (RPS) at different concurrency levels, approximate RSS under load, and cold start to a healthy /healthz. All numbers below are from a dedicated 4‑vCPU, 16‑GB RAM node with no competing workloads.
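A harness like this reduces to a few dozen lines of Python. Below is a simplified, hypothetical version (the `bench` helper is made up for illustration, not the repo's code); it benchmarks any callable the way we benchmark POST /v1/search:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def bench(fn, total=2000, concurrency=16):
    """Call `fn` `total` times across `concurrency` workers; report latency stats."""
    def timed_call(_):
        start = time.perf_counter()
        fn()
        return (time.perf_counter() - start) * 1000.0  # latency in ms

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total)))
    wall_seconds = time.perf_counter() - wall_start

    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    return {
        "rps": total / wall_seconds,
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }

# Benchmark a stub instead of a live gateway; swap in a real HTTP call
# to POST /v1/search to reproduce our setup.
stats = bench(lambda: time.sleep(0.001), total=200, concurrency=16)
```

RSS sampling is a separate loop (e.g. reading the gateway's /proc/PID/status while the harness runs), so throughput measurement never perturbs the process being measured.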


Results (single GCP node, c=16, 2000 req)

All four gateways were benchmarked on a single 4‑vCPU GCP node. Python: 356 RPS, p95 150 ms, 387 MiB. Go: 451 RPS, p95 114 ms, 6.5 MiB (pseudo-embeddings; compare on footprint only). Rust: 456 RPS, p95 123 ms, 126 MiB. Split: 450 RPS, p95 117 ms, 228 MiB (embed + gateway). Full results and setup are in the repo.

Apples-to-apples: Python, Rust, and Split use the same model (all-MiniLM-L6-v2); Go uses pseudo-embeddings — compare Go on footprint (image ~13 MB, RSS ~6.5 MiB) only.

Python vs Rust vs Split: results, improvement, and projected savings

Identical model (all-MiniLM-L6-v2) and workload (c=16, 2000 requests). Baseline: Python.

Summary (Rust vs Python, same model): +28% throughput (RPS), −67% memory per replica. At production load (e.g. 5K–20K RPS), that’s ~25–30% fewer nodes and ~$5K–$22K/year savings depending on scale (4‑vCPU nodes, GCP-style pricing).

| Metric   | Python (A) | Rust (B)       | Split (C)      |
|----------|------------|----------------|----------------|
| RPS      | 356        | 456 (+28%)     | 450 (+27%)     |
| p95 (ms) | 150        | 123 (−18%)     | 117 (−22%)     |
| RSS      | 387 MiB    | 126 MiB (−67%) | 228 MiB (−41%) |

What you gain with Rust or Split

  • Throughput: ~28% more requests per second per replica → fewer replicas for the same traffic, or headroom for growth.
  • Latency: ~18–22% lower p95 → better UX and easier SLOs.
  • Memory: Rust monolith uses 67% less RAM than Python; Split uses 41% less. On a 4‑vCPU node you can run more replicas or leave room for other workloads.

Business impact: production-grade estimate (Rust vs Python, same model)

| Metric     | Rust vs Python                       |
|------------|--------------------------------------|
| Throughput | +28% RPS per replica                 |
| Memory     | −67% per replica (387 MiB → 126 MiB) |

Replicas needed at sustained load (from our 4‑vCPU benchmark, same API and model):

| Target load | Python (356 RPS/replica) | Rust (456 RPS/replica) | Fewer replicas |
|-------------|--------------------------|------------------------|----------------|
| 5,000 RPS   | 15                       | 11                     | 4              |
| 10,000 RPS  | 29                       | 22                     | 7              |
| 20,000 RPS  | 57                       | 44                     | 13             |
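These replica counts are plain ceiling division of target load by per-replica throughput, which you can verify directly:

```python
import math

def replicas_needed(target_rps, per_replica_rps):
    """Minimum replicas to serve target_rps at the measured per-replica rate."""
    return math.ceil(target_rps / per_replica_rps)

for load in (5_000, 10_000, 20_000):
    py = replicas_needed(load, 356)   # Python: 356 RPS/replica
    rs = replicas_needed(load, 456)   # Rust: 456 RPS/replica
    print(load, py, rs, py - rs)
```

This assumes perfect load balancing and no headroom; real deployments usually size for peak plus a buffer, which scales both columns but not the ratio.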

Using a typical 4‑vCPU node cost of ~$1,200–$1,700/year (GCP e.g. n2-standard-4, on-demand or 1‑yr commit), switching the gateway from Python to Rust (same semantics) gives:

  • 5K RPS: ~$4.8K–$6.8K annual savings (4 fewer nodes).
  • 10K RPS: ~$8.4K–$12K annual savings (7 fewer nodes).
  • 20K RPS: ~$16K–$22K annual savings (13 fewer nodes).

That’s ~27% fewer nodes at a given load, so ~25–30% lower infrastructure cost for the gateway layer at production scale. Add ~67% less memory per replica for better packing or smaller instance types where applicable.
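The savings figures are just the node delta multiplied by the annual node cost range; a quick sanity check of the arithmetic, using the ~$1,200–$1,700/year range above:

```python
def annual_savings(fewer_nodes, cost_low=1_200, cost_high=1_700):
    """(low, high) annual savings in dollars for a given node delta."""
    return fewer_nodes * cost_low, fewer_nodes * cost_high

# Node deltas from the replica table above.
print(annual_savings(4))   # 5K RPS
print(annual_savings(7))   # 10K RPS
print(annual_savings(13))  # 20K RPS
```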

Projected savings (illustrative, low load)

At 500 RPS or 1,000 RPS you may need the same replica count (2–3) for both stacks; the gain is ~68% less memory (e.g. 1.16 GiB → 378 MiB at 1K RPS), so you can downsize instances or pack more services per node.

When to pick which

  • Rust monolith (B): One service, same API as Python, best footprint and latency. Use when you want to replace Python without changing topology.
  • Split (C): Embed service + gateway; scale each independently. Slightly better p95 than B; total RSS still 41% below Python.

Rust (ONNX): same model, apples-to-apples

To compare Python against a native stack on the exact same embedding model, we added a Rust gateway that runs all-MiniLM-L6-v2 via ONNX, with the same tokenizer and API. At c=16 on the same node: 456 RPS, p95 123 ms, ~126 MiB RSS. Full results are in the repo.

Production architectures: a case study

Three production architectures, all exposing the same API (/v1/search, /v1/ingest) with different deployment shapes.

flowchart LR
  subgraph A [A. Python monolith]
    ClientA[Client]
    PythonMonolith[PythonMonolith]
    ClientA --> PythonMonolith
  end
  subgraph B [B. Native monolith]
    ClientB[Client]
    RustMonolith[RustMonolith]
    ClientB --> RustMonolith
  end
  subgraph C [C. Split native]
    ClientC[Client]
    RustGateway[RustGateway]
    RustEmbed[RustEmbedService]
    ClientC --> RustGateway
    RustGateway --> RustEmbed
  end

  • A. Python monolith: One service (FastAPI + sentence-transformers). Embed + index + search in one process. Baseline.
  • B. Native monolith: One Rust service (ONNX). Same model, same API. Smaller footprint than A.
  • C. Split (native): Embedding service (Rust, ONNX, POST /embed only) + thin gateway (Rust, EMBEDDING_API_URL → embed). Client hits the gateway only; gateway calls the embed service for vectors. Scale embed and search independently; still beat Python on total footprint and latency.

Comparison:

| Architecture       | RPS (c=16) | p95 (ms) | Total RSS                 |
|--------------------|------------|----------|---------------------------|
| A. Python monolith | 356        | 150      | 387 MiB                   |
| B. Rust monolith   | 456        | 123      | 126 MiB                   |
| C. Split (Rust)    | 450        | 117      | 228 MiB (embed + gateway) |

Same node, identical workload.

Setup details for each architecture are in the repo: docs/CASE_STUDY.md.

Decision flow: Scale embedding and search independently? → Split (C). Prefer one service? → Python (A) or Rust (B). Best footprint and latency? → Rust monolith (B) or Split (C).

When to use which

  • Use the Python gateway when you want embeddings in-process: one service that loads the model, ingests docs, and serves search. You get real semantic search with a single deployment; you pay with a larger image and ~387 MiB RSS per replica.
  • Use the Go gateway when you want a thin orchestration layer: the gateway calls an external embedding API (or a dedicated embedding service), does retrieval and rerank, and returns results. You get the same API shape with a ~13 MB image and ~6.5 MiB RSS, so you can pack many more replicas per node or run on small instances. Throughput and latency then depend on your embedding backend, not on Go.
  • Use the Rust (ONNX) gateway when you want the same model as Python with a smaller footprint and native speed: a direct comparison on RPS, p95, and RSS.

All three implementations are in the repo; plug in your own targets and workloads as needed.

Super gateway: one Rust binary with three modes (merging the Rust and Go use cases)

We merged the Rust (ONNX) and Go use cases into one Rust binary that can run in three modes, chosen by env vars:

  • ONNX (MODEL_DIR set): Same model as Python — apples-to-apples, real embeddings.
  • Remote (EMBEDDING_API_URL set): Thin layer that calls your embedding API — same idea as the Go gateway, but in Rust.
  • Pseudo (neither set): Deterministic vectors, no ML — same idea as Go's no-ML mode for dev/bench.
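The mode decision comes down to two environment variables. Here is a Python sketch of the logic (the real gateway is Rust, and the precedence shown, MODEL_DIR before EMBEDDING_API_URL, is an assumption; see rust-gateway/README.md for the actual contract):

```python
import os

def gateway_mode(env):
    """Pick the super-gateway mode from environment variables."""
    if env.get("MODEL_DIR"):           # local ONNX model: real embeddings
        return "onnx"
    if env.get("EMBEDDING_API_URL"):   # remote embedding API: thin layer
        return "remote"
    return "pseudo"                    # neither set: deterministic vectors

mode = gateway_mode(dict(os.environ))
```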

Advantages of the super gateway: One codebase and one binary to maintain; you choose behavior at runtime. You can still build a slim image (no model, Remote or Pseudo only) or a full image (ONNX + model). The Go gateway remains the option if you want the smallest thin binary with zero C/FFI dependencies; the Rust super gateway is the option if you want one stack that does "with model," "thin," and "pseudo" without maintaining two gateways. See rust-gateway/README.md for env vars and the remote API contract.

Why no ML in Go? What about C/Rust?

We left the Go gateway without a real embedding model on purpose: a thin layer with the same API and minimal dependencies illustrates footprint (image size, RSS) when the gateway doesn't run the model. In the Go ecosystem there's no mature, drop-in equivalent of sentence-transformers; inference is dominated by Python (PyTorch/Transformers) and C++ runtimes.

The same model can run from C or Rust (and from Go via cgo), giving real embeddings with a smaller image than full PyTorch + Python:

  • ONNX Runtime (C API): export the model to ONNX, then run it from C, Rust (ort), or Go (cgo). Image stays much smaller than a Python stack, though you still ship the ONNX runtime and model weights.
  • Rust: crates like candle, tract, or ort let you run transformer models natively; you get a single binary and control over memory and deployment.
  • Go: link to ONNX Runtime or a C inference library via cgo; the binary is no longer "pure Go" and you'll add runtime dependencies, but you can keep the image and RSS well below the 2 GB Python setup.

So the choice isn't "Python or fake vectors." It's Python in-process vs thin Go (external embeddings) vs native inference (C/Rust/Go+ONNX) when you want real embeddings and a smaller footprint. We kept the Go version ML-free to make the thin-layer case and the numbers easy to compare; a follow-up could add an ONNX-backed or Rust-based gateway with the same API and real embeddings.


Why This Matters for Real Systems

A gateway like this becomes a shared dependency for many features:

  • Public search APIs
  • Internal "chat with docs"
  • Agentic workflows that call /search multiple times

If it's slow or memory-hungry, you pay for it everywhere:

  • Higher p95 at the user-facing layer
  • More pods to hit your SLOs
  • Larger instances just to fit the model and index

Our benchmarks give you clear options:

  • Python (in-process embeddings): a single CPU-only instance handled 356 RPS of real search traffic with p95 at 150 ms (c=16) and ~387 MiB RSS per replica. Good when the gateway is your embedding service.
  • Rust (ONNX): the same model at 456 RPS and p95 123 ms with ~126 MiB RSS per replica; the best single-service footprint.
  • Go (thin layer): ~13 MB image and ~6.5 MiB RSS for the same API shape, so you can pack many more replicas per node or use smaller instances when embeddings live in a separate service.

Choose by where your embedding model runs; then measure and tune from there.


Design Lessons for Embedding + Rerank Gateways

From this experiment, a few practical lessons stand out:

1. Keep the Gateway Focused

The gateway should do a few things extremely well:

  • Embedding calls
  • Vector search
  • Rerank and scoring
  • Simple metadata filters

Push everything else (authorization, billing, orchestration) to other layers. This keeps:

  • Latency low
  • Memory predictable
  • Failure modes simple

2. Make Performance Measurable

Your gateway isn't "fast" or "slow" — it has numbers:

  • p95 latency at realistic concurrency
  • RPS at your target p95
  • RSS memory under load
  • Cold start behavior

Treat those as APIs you own, just like /v1/search.

3. Optimize for the Common Path

Most queries:

  • Hit a small subset of the corpus
  • Don't need complex reranking logic
  • Are tolerant of ~100 ms of search latency

Design for that:

  • Use a small, efficient embedding model
  • Keep the index in-memory or on a fast local store
  • Avoid unnecessary round-trips or serialization
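Those three points combine into a very small core. As a pure-Python sketch of the in-memory cosine top-K idea (the repo's Python gateway does the same with a NumPy index):

```python
import heapq
import math

def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

class InMemoryIndex:
    """Tiny cosine-similarity index: store unit vectors, scan on search."""

    def __init__(self):
        self.docs = []  # list of (doc_id, unit_vector)

    def add(self, doc_id, vector):
        self.docs.append((doc_id, normalize(vector)))

    def search(self, query_vector, top_k=5):
        q = normalize(query_vector)
        scored = (
            (sum(a * b for a, b in zip(q, v)), doc_id)
            for doc_id, v in self.docs
        )
        best = heapq.nlargest(top_k, scored)
        return [(doc_id, score) for score, doc_id in best]
```

The linear scan is fine for thousands of chunks; as the corpus grows, the same interface can sit in front of a NumPy matrix multiply or an ANN library without changing the gateway's API.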

From Idea to Measurable Gateways

Every RAG system hides an Embedding + Rerank gateway. We made it visible — and measurable.

We implemented the gateway in Python (FastAPI + sentence-transformers) and Go (stdlib HTTP, same API), added a Rust (ONNX) variant for apples-to-apples comparison, and measured footprint, throughput, and latency on a single node. The full setup is here:

github.com/xadnavyaai/NavyaAIBlogs/embedding-rerank-gateway-high-performance


Why We Care at NavyaAI

At NavyaAI, we build agentic and RAG-based systems where:

  • Every millisecond of latency compounds through multi-hop agents
  • Every shared service shows up in many SLOs
  • Every extra pod or larger instance increases real cost

Embedding + Rerank gateways are one of those "small" services that sit at the center of it all.

By making them:

  • Simple to reason about
  • Measurable with concrete metrics
  • Efficient on modest hardware

…we can ship systems that are fast, observable, and economically sane.


Try It Yourself

Get started:

  • Clone the repo and bring up the gateways locally or via Docker.
  • Swap in your own documents and embedding models.
  • Adjust concurrency and request volume to match your production traffic.
  • Export metrics into your existing observability stack.

Then ask a simple question:

If a single, well-designed gateway can handle your current traffic — what else could you do with the pods you no longer need?

The answer starts with a small service, a few endpoints, and a gateway you can actually measure.


From the NavyaAI Network

This embedding gateway powers the retrieval layer behind VectraGPT — our secure, document-grounded AI chatbot platform. See how it all comes together in the VectraGPT blog post on how RAG chatbots eliminate hallucinations.