Codename · Forge
Serving · Foundation

Production AI serving.
Without the idle GPU bill.

Forge is a production-grade serving stack built on Kubernetes — GPU-aware autoscaling, intelligent node provisioning, spot-instance integration, and multi-model routing behind one standardized API gateway. Assembled from the best open-source infrastructure available, and already running real-time inference for consumer-facing AI at scale.

In production today, serving real-time inference for consumer products at millions of requests a day.
Orchestration: Kubernetes
Autoscaling: KEDA
Provisioning: Karpenter
Observability: Prometheus + Grafana
Service mesh: Istio
Gateway: Multi-model routing
01 · The problem

GPUs are expensive. Idle ones are inexcusable.

Most serving infrastructure forces a choice between cost and latency. Forge refuses it.

The moment a model hits production, the economics become the problem. Keep GPUs running around the clock and the bill compounds regardless of traffic. Scale them down and cold starts punish the first user after every quiet period. Stay on on-demand instances for safety and pay a premium on every single inference.

So teams pick a poison. They overprovision — paying for capacity that sits idle through nights, weekends, and off-peak hours — or they underprovision and eat the cold-start tax on user experience. In a consumer-facing product, neither is acceptable.

Forge exists because we faced exactly this set of tradeoffs in production, at scale, and resolved them. The infrastructure is built. The tradeoffs are already settled.

02 · Architecture

Every layer accounted for. Every component chosen on purpose.

Forge is assembled from purpose-built open-source tools, each owning a distinct layer — no overlap, no homegrown glue holding it together. Together they erase the cost–latency tradeoff that makes production serving painful.

[Architecture diagram: requests enter at the top and flow down through four layers.]
01 API gateway: single entry point; one standardized API surface (POST /v1/serve), routed by version · type · load
02 Multi-model routing: one gateway, many models (llm-serve v3 · GPU; vision v2 · GPU; embed v1 · CPU)
03 Serving pods: GPU and CPU pods scaled independently; KEDA adds pods as queue depth rises and removes them as it drops
04 Node groups: GPU nodes, on-demand + spot (g5.xlarge, g5.2xlarge) and CPU nodes (m6i.large); Karpenter provisions matching nodes and terminates them when idle
Fig. 1 — One gateway in. Requests route across models; KEDA scales serving pods to live queue depth; Karpenter provisions and releases the matching GPU and CPU nodes underneath, spot included. Istio secures every hop with mutual TLS; Prometheus scrapes every layer and Grafana makes each decision visible.
Named layers — each owning a distinct responsibility
KEDA

Event-driven autoscaling

KEDA watches the actual inference request queue, not CPU or memory thresholds. Queue depth rises, serving pods scale up; it drops, they scale down — the difference between scaling on what is happening and guessing at what might. GPU pods scale independently from CPU-bound work, each on its own demand signal. The same policy anticipates ramp-up and keeps the serving path warm enough to hold latency through the first request after a quiet period — without paying for idle capacity to do it.
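The queue-driven rule can be sketched in a few lines. This is a minimal illustration of the proportional calculation that the Kubernetes Horizontal Pod Autoscaler applies to an external metric such as the one KEDA exposes. It is not Forge's actual policy: the function name, `target_per_pod`, the replica bounds, and the sample queue depths are all assumptions made for the example.

```python
import math


def desired_replicas(queue_depth: int, target_per_pod: int,
                     min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Return how many serving pods a queue of the given depth calls for.

    Hypothetical helper: one pod per `target_per_pod` queued requests,
    rounded up, then clamped to the configured bounds.
    """
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    wanted = math.ceil(queue_depth / target_per_pod)
    return max(min_replicas, min(max_replicas, wanted))


# GPU and CPU pools each scale on their own signal (values are illustrative).
gpu_pods = desired_replicas(queue_depth=35, target_per_pod=10, min_replicas=1)
cpu_pods = desired_replicas(queue_depth=8, target_per_pod=20, min_replicas=1)
```

Because the input is queue depth rather than CPU or memory, the replica count follows the work actually waiting to be served; the two pools are computed separately, so a burst of GPU requests does not inflate the CPU pool.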

Karpenter

Intelligent node provisioning

A pod declares a GPU requirement; Karpenter provisions the right EC2 instance to satisfy it — right type, right size — and terminates it when the work is done. CPU steps get CPU nodes, GPU steps get GPU nodes, idle nodes do not persist. Cluster composition at any moment reflects live workload, not a baseline someone set six months ago and never revisited.
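To make the "right type, right size" idea concrete, here is a simplified sketch of matching a pod's resource request to the cheapest instance type that fits. It is an illustration only: real Karpenter also weighs zones, taints, capacity type, and packing several pods onto one node. The instance names come from the architecture figure; the class names, the `pick_instance` helper, and the relative costs are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gib: int
    gpus: int
    relative_cost: float  # invented cost units for illustration, not real prices


@dataclass(frozen=True)
class PodRequest:
    vcpus: int
    memory_gib: int
    gpus: int = 0


# Sizes follow AWS's published specs for these types; treat them, and
# especially the relative costs, as assumptions of this sketch.
CATALOG: List[InstanceType] = [
    InstanceType("m6i.large", vcpus=2, memory_gib=8, gpus=0, relative_cost=1.0),
    InstanceType("g5.xlarge", vcpus=4, memory_gib=16, gpus=1, relative_cost=10.0),
    InstanceType("g5.2xlarge", vcpus=8, memory_gib=32, gpus=1, relative_cost=12.0),
]


def pick_instance(pod: PodRequest,
                  catalog: List[InstanceType] = CATALOG) -> Optional[InstanceType]:
    """Return the cheapest instance type that satisfies the pod's requests."""
    fits = [
        it for it in catalog
        if it.vcpus >= pod.vcpus
        and it.memory_gib >= pod.memory_gib
        and it.gpus >= pod.gpus
    ]
    return min(fits, key=lambda it: it.relative_cost) if fits else None


cpu_choice = pick_instance(PodRequest(vcpus=2, memory_gib=4))
gpu_choice = pick_instance(PodRequest(vcpus=6, memory_gib=24, gpus=1))
```

A CPU-only request lands on the CPU node type and never occupies a GPU node, while the larger GPU request skips `g5.xlarge` because it does not fit. A request nothing in the catalog can satisfy returns `None`.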

Prometheus + Grafana

Full observability

Prometheus scrapes metrics from every component in the stack; Grafana renders real-time dashboards across inference latency, throughput, GPU utilization, pod health, scaling events, and cost signals. When traffic spikes, a cold start fires, or a spot node is reclaimed, the observability layer shows exactly what happened and when. Every serving decision the system makes is visible and auditable.

Istio

Secure service mesh

All pod-to-pod communication inside the cluster is encrypted under mutual TLS and governed by Istio, with network policy enforced at the mesh layer. Traffic between the gateway, the routing layer, and the serving pods is authenticated, observable, and policy-controlled by default — with no application-level changes. For regulated environments this is not optional infrastructure. It is the baseline.

What the wiring unlocks — capabilities the layers make possible
Multi-model routing

One gateway, many models

A single API gateway routes requests across multiple models — different versions, architectures, and hardware targets — without the calling application knowing which model serves which request. Routing is configurable by model version, request type, or load. The interface the application sees never changes, even as the serving layer underneath evolves.
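The routing behavior described above can be sketched as a small lookup: request type selects the candidate backends, an optional version pin narrows them, and load breaks the tie. This is a hedged illustration, not the gateway's implementation. The `Router` and `Backend` classes and the request-type string are hypothetical, and the second `llm-serve` version is added only to show version pinning.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Backend:
    name: str           # model service, e.g. "llm-serve"
    version: str        # model version, e.g. "v3"
    in_flight: int = 0  # load signal: requests currently being served


class Router:
    """Hypothetical gateway router: request type -> least-loaded backend."""

    def __init__(self) -> None:
        self._routes: Dict[str, List[Backend]] = {}

    def register(self, request_type: str, backend: Backend) -> None:
        self._routes.setdefault(request_type, []).append(backend)

    def route(self, request_type: str, version: Optional[str] = None) -> Backend:
        candidates = self._routes.get(request_type, [])
        if version is not None:
            # Honor an explicit version pin when the caller supplies one.
            candidates = [b for b in candidates if b.version == version]
        if not candidates:
            raise LookupError(f"no backend registered for {request_type!r}")
        # Among the remaining candidates, prefer the least-loaded backend.
        chosen = min(candidates, key=lambda b: b.in_flight)
        chosen.in_flight += 1
        return chosen


router = Router()
router.register("generate", Backend("llm-serve", "v3"))
router.register("generate", Backend("llm-serve", "v2"))  # illustrative older version
first = router.route("generate")
second = router.route("generate")
pinned = router.route("generate", version="v3")
```

The caller only ever names a request type, and optionally a version. Which backend answers is decided inside the router, so backends can be added or retired without changing the calling code.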

Spot instances

Fractional cost, fully handled

Spot instances cost a fraction of on-demand, but the provider can reclaim them at short notice (on EC2, a two-minute warning). Forge owns the interruption logic: rerouting in-flight requests, holding availability through replacement, and recovering gracefully. Karpenter manages the node lifecycle and KEDA the pod lifecycle; together they make spot economics achievable without taking on the reliability risk that spot capacity normally carries.
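As a rough picture of what absorbing an interruption involves, the toy model below takes a node out of rotation when a reclaim notice arrives and re-dispatches the requests it was serving. It is a sketch under stated assumptions, not Forge's interruption logic: the `ServingPool` class, its method names, and the node labels are hypothetical, and a real system would also request replacement capacity before the node disappears.

```python
from collections import defaultdict
from typing import Dict, List


class ServingPool:
    """Toy model of requests in flight on nodes (hypothetical, for illustration)."""

    def __init__(self, nodes: List[str]) -> None:
        self.ready: List[str] = list(nodes)  # nodes currently accepting traffic
        self.in_flight: Dict[str, List[str]] = defaultdict(list)

    def dispatch(self, request_id: str) -> str:
        if not self.ready:
            raise RuntimeError("no ready nodes")
        # Send the request to the ready node with the fewest in-flight requests.
        node = min(self.ready, key=lambda n: len(self.in_flight[n]))
        self.in_flight[node].append(request_id)
        return node

    def on_interruption_notice(self, node: str) -> List[str]:
        """Stop routing to `node` and re-dispatch whatever it was serving."""
        if node in self.ready:
            self.ready.remove(node)
        stranded = self.in_flight.pop(node, [])
        for request_id in stranded:
            self.dispatch(request_id)
        return stranded


pool = ServingPool(["spot-a", "ondemand-b"])
for rid in ("r1", "r2", "r3"):
    pool.dispatch(rid)
moved = pool.on_interruption_notice("spot-a")
```

The order matters: the node leaves the ready list before its requests are re-dispatched, so none of them can be routed back to the node that is about to be reclaimed.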

03 · Why not just…

Two easier paths. Both cost you later.

There are two obvious ways to serve a model without Forge. Each is reasonable on day one and expensive by the time you are at scale.

vs. a managed serving platform

SageMaker · Bedrock · hosted inference APIs

Fastest to start, and you pay for it forever. Per-token and per-hour markup compounds on every request, the GPU autoscaling logic is a black box you cannot tune, multi-model routing bends to the vendor's shape, and your traffic — and your data — lives in someone else's environment. The moment your volume is real, the convenience tax is the line item that will not stop growing.

vs. rolling your own from scratch

months of specialized infrastructure work

The right answer, attempted the hard way. Wiring KEDA, Karpenter, Prometheus, Istio, spot-interruption handling, and a routing gateway into something production-stable is months of specialized infra work — and the failure modes only surface under real load, at 3 a.m., after you have already shipped. Forge is that build, already done and already proven.

Forge is the third path: open-source economics, self-hosted control, and none of the assembly risk. No proprietary platform, no per-token markup, no data leaving your environment.

04 · Validation

Battle-tested at scale.

Not a reference implementation — production infrastructure, proven under real traffic and real cost pressure.

The infrastructure behind Forge has served real-time inference for consumer-facing AI products at millions of requests a day — across GPU- and CPU-bound workloads, across model generations, and under real production traffic with real cost pressure. Which tool owns which layer, how scaling events fire, how spot interruptions are absorbed — those decisions were made under that pressure, not in a design document.

Millions of requests a day · consumer-facing AI, in production
~35% lower GPU cost · spot economics, on-demand reliability
One API surface · across any number of models or versions
The cost of serving compounds every month. The infrastructure decisions you make on day one are the ones you live with at scale.

Every engagement that draws on Forge starts with the serving architecture already assembled and already in production. KEDA, Karpenter, Prometheus, Istio, and the gateway are wired together and proven. What changes is your model, your traffic pattern, and your cost target. That is the work worth doing together.

Start somewhere

Serving a model in production and paying for more GPU than you're using?

Tell us what you're running — your models, your traffic patterns, and where the infrastructure cost is coming from. We'll tell you honestly what Forge can do for it, and what it would take.