Forge is a production-grade serving stack built on Kubernetes: GPU-aware autoscaling, intelligent node provisioning, spot-instance integration, and multi-model routing behind one standardized API gateway. It is assembled from the best open-source infrastructure available and already runs real-time inference for consumer-facing AI at scale.
Most serving infrastructure forces a choice between cost and latency. Forge refuses it.
The moment a model hits production, the economics become the problem. Keep GPUs running around the clock and the bill compounds regardless of traffic. Scale them down and cold starts punish the first user after every quiet period. Stay on on-demand instances for safety and pay a premium on every single inference.
So teams pick a poison. They overprovision — paying for capacity that sits idle through nights, weekends, and off-peak hours — or they underprovision and eat the cold-start tax on user experience. In a consumer-facing product, neither is acceptable.
Forge exists because we faced exactly this set of tradeoffs in production, at scale, and resolved them. The infrastructure is built. The tradeoffs are already settled.
Forge is assembled from purpose-built open-source tools, each owning a distinct layer — no overlap, no homegrown glue holding it together. Together they erase the cost–latency tradeoff that makes production serving painful.
KEDA watches the actual inference request queue, not CPU or memory thresholds. Queue depth rises, serving pods scale up; it drops, they scale down — the difference between scaling on what is happening and guessing at what might. GPU pods scale independently from CPU-bound work, each on its own demand signal. The same policy anticipates ramp-up and keeps the serving path warm enough to hold latency through the first request after a quiet period — without paying for idle capacity to do it.
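To make the queue-driven rule concrete, here is a minimal Python sketch of the arithmetic behind it: replicas track queue depth divided by a per-pod target, clamped to a floor and a ceiling. The function name, parameters, and numbers are illustrative assumptions, not Forge's configuration or KEDA's API.

```python
import math


def desired_replicas(queue_depth: int,
                     target_per_replica: int,
                     min_replicas: int = 0,
                     max_replicas: int = 10) -> int:
    """Return how many serving pods a queue-depth signal asks for.

    Hypothetical helper: the replica count is the queue depth divided by
    the number of queued requests one pod should own, rounded up, then
    clamped to the configured floor and ceiling.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas cannot exceed max_replicas")
    raw = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))
```

Setting `min_replicas` to 1 instead of 0 is one simple way to keep a warm pod on the serving path; whether that matches a given latency and cost target is a tuning decision.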
A pod declares a GPU requirement; Karpenter provisions an EC2 instance of the right type and size to satisfy it, then terminates the instance when the work is done. CPU work gets CPU nodes, GPU work gets GPU nodes, and idle nodes do not persist. Cluster composition at any moment reflects live workload, not a baseline someone set six months ago and never revisited.
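The selection idea can be sketched in a few lines of Python: given what a pod asks for, pick the cheapest instance that fits. This is only an illustration of the principle, not Karpenter's actual scheduling logic; the catalog entries, names, and prices below are hypothetical placeholders, not real EC2 instance types or rates.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceType:
    name: str           # hypothetical label, not a real EC2 type
    gpus: int
    vcpus: int
    hourly_cost: float  # hypothetical price, for illustration only


# Hypothetical catalog used only for this example.
CATALOG = [
    InstanceType("cpu-small", gpus=0, vcpus=4, hourly_cost=0.20),
    InstanceType("gpu-single", gpus=1, vcpus=8, hourly_cost=1.00),
    InstanceType("gpu-quad", gpus=4, vcpus=32, hourly_cost=4.00),
]


def pick_instance(catalog: List[InstanceType],
                  gpus_needed: int,
                  vcpus_needed: int) -> Optional[InstanceType]:
    """Return the cheapest instance that satisfies the request, or None."""
    fits = [i for i in catalog
            if i.gpus >= gpus_needed and i.vcpus >= vcpus_needed]
    if not fits:
        return None
    return min(fits, key=lambda i: i.hourly_cost)
```

Because the cheapest fit wins, a CPU-only request in this sketch never lands on a GPU node as long as a CPU node can hold it.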
Prometheus scrapes metrics from every component in the stack; Grafana renders real-time dashboards across inference latency, throughput, GPU utilization, pod health, scaling events, and cost signals. When traffic spikes, a cold start fires, or a spot node is reclaimed, the observability layer shows exactly what happened and when. Every serving decision the system makes is visible and auditable.
All pod-to-pod communication inside the cluster is encrypted under mutual TLS and governed by Istio, with network policy enforced at the mesh layer. Traffic between the gateway, the routing layer, and the serving pods is authenticated, observable, and policy-controlled by default — with no application-level changes. For regulated environments this is not optional infrastructure. It is the baseline.
A single API gateway routes requests across multiple models — different versions, architectures, and hardware targets — without the calling application knowing which model serves which request. Routing is configurable by model version, request type, or load. The interface the application sees never changes, even as the serving layer underneath evolves.
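A rough Python sketch of that routing decision, under stated assumptions: the caller names a request type and optionally a model version, and the router picks the least-loaded matching backend. The `Backend` fields and the `route` function are hypothetical and stand in for whatever the gateway actually uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backend:
    name: str            # hypothetical serving-pod identifier
    model_version: str
    request_type: str
    in_flight: int = 0   # current load on this backend


def route(backends: List[Backend],
          request_type: str,
          model_version: Optional[str] = None) -> Backend:
    """Pick the least-loaded backend that matches the request.

    The caller never names a backend; it only states what it needs.
    """
    candidates = [
        b for b in backends
        if b.request_type == request_type
        and (model_version is None or b.model_version == model_version)
    ]
    if not candidates:
        raise LookupError(f"no backend serves {request_type!r}")
    return min(candidates, key=lambda b: b.in_flight)
```

Swapping a backend in or out changes only the `backends` list; the call the application makes stays the same.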
Spot instances cost a fraction of on-demand and get reclaimed unpredictably. Forge owns the interruption logic — rerouting in-flight requests, holding availability through replacement, and recovering gracefully. Karpenter manages the node lifecycle, KEDA the pod lifecycle: together they make spot economics achievable without spot-instance reliability risk.
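The rerouting step can be pictured with a small Python sketch: when a node is reclaimed, its in-flight requests are handed to whichever surviving node currently holds the fewest. Node names, request IDs, and the function itself are hypothetical; real interruption handling also has to deal with timeouts, retries, and replacement capacity, which this leaves out.

```python
from typing import Dict, List


def reroute_on_interruption(assignments: Dict[str, List[str]],
                            reclaimed: str) -> Dict[str, List[str]]:
    """Move a reclaimed node's in-flight requests onto surviving nodes.

    Returns a new mapping; the input is left unmodified.
    """
    survivors = {node: list(reqs)
                 for node, reqs in assignments.items()
                 if node != reclaimed}
    if not survivors:
        raise RuntimeError("no surviving nodes to absorb in-flight requests")
    for req in assignments.get(reclaimed, []):
        # Hand each request to the survivor with the fewest requests.
        target = min(survivors, key=lambda n: len(survivors[n]))
        survivors[target].append(req)
    return survivors
```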
There are two obvious ways to serve a model without Forge. Each is reasonable on day one and expensive by the time you are at scale.
A managed inference platform is the fastest way to start, and you pay for it forever. Per-token and per-hour markup compounds on every request, the GPU autoscaling logic is a black box you cannot tune, multi-model routing bends to the vendor's shape, and your traffic and your data live in someone else's environment. The moment your volume is real, the convenience tax is the line item that will not stop growing.
Building the stack yourself is the right answer, attempted the hard way. Wiring KEDA, Karpenter, Prometheus, Istio, spot-interruption handling, and a routing gateway into something production-stable is months of specialized infra work, and the failure modes only surface under real load, at 3 a.m., after you have already shipped. Forge is that build, already done and already proven.
Forge is the third path: open-source economics, self-hosted control, and none of the assembly risk. No proprietary platform, no per-token markup, no data leaving your environment.
Not a reference implementation — production infrastructure, proven under real traffic and real cost pressure.
The infrastructure behind Forge has served real-time inference for consumer-facing AI products at millions of requests a day — across GPU- and CPU-bound workloads, across model generations, and under real production traffic with real cost pressure. Which tool owns which layer, how scaling events fire, how spot interruptions are absorbed — those decisions were made under that pressure, not in a design document.
“The cost of serving compounds every month. The infrastructure decisions you make on day one are the ones you live with at scale.”
Every engagement that draws on Forge starts with the serving architecture already assembled and already in production. KEDA, Karpenter, Prometheus, Istio, and the gateway are wired together and proven. What changes is your model, your traffic pattern, and your cost target. That is the work worth doing together.
Tell us what you're running — your models, your traffic patterns, and where the infrastructure cost is coming from. We'll tell you honestly what Forge can do for it, and what it would take.