Codename · Forge

Serving · Foundation

Production AI serving.
Without the idle GPU bill.

Forge is a production-grade serving stack built on Kubernetes — GPU-aware autoscaling, intelligent node provisioning, spot-instance integration, and multi-model routing behind one standardized API gateway. Assembled from the best open-source infrastructure available, and already running real-time inference for consumer-facing AI at scale.

In production today: serving real-time inference for consumer products at millions of requests a day.

Orchestration · Kubernetes
Autoscaling · KEDA
Provisioning · Karpenter
Observability · Prometheus + Grafana
Service mesh · Istio
Gateway · Multi-model routing

01 · The problem

GPUs are expensive. Idle ones are inexcusable.

Most serving infrastructure forces a choice between cost and latency. Forge refuses it.

The moment a model hits production, the economics become the problem. Keep GPUs running around the clock and the bill compounds regardless of traffic. Scale them down and cold starts punish the first user after every quiet period. Stay on on-demand instances for safety and pay a premium on every single inference.

So teams pick a poison. They overprovision — paying for capacity that sits idle through nights, weekends, and off-peak hours — or they underprovision and eat the cold-start tax on user experience. In a consumer-facing product, neither is acceptable.

Forge exists because we faced exactly this set of tradeoffs in production, at scale, and resolved them. The infrastructure is built. The tradeoffs are already settled.

02 · Architecture

Every layer accounted for. Every component chosen on purpose.

Forge is assembled from purpose-built open-source tools, each owning a distinct layer — no overlap, no homegrown glue holding it together. Together they erase the cost–latency tradeoff that makes production serving painful.

01 API gateway · single entry point
requests → One standardized API surface · POST /v1/serve
↓ routed by version · type · load

02 Multi-model routing · one gateway · many models
llm-serve (v3 · GPU) · vision (v2 · GPU) · embed (v1 · CPU)
↓ KEDA scales pods to live queue depth

03 Serving pods · GPU + CPU · scaled independently
GPU pods · queue-driven · live ×3 · +1 scaling up
CPU pods · own signal · live ×2
KEDA: ↑ queue rises · ↓ queue drops
↓ Karpenter provisions matching nodes

04 Node groups · right node, right workload
GPU nodes · on-demand + spot · g5.xlarge · g5.2xlarge · spot · idle nodes do not persist
CPU nodes · m6i.large · released when idle
Karpenter: ↑ provisions · ↓ terminates

Istio · Mesh · mTLS every hop

Prometheus · Scrape · Grafana

Fig. 1 — One gateway in. Requests route across models; KEDA scales serving pods to live queue depth; Karpenter provisions and releases the matching GPU and CPU nodes underneath, spot included. Istio secures every hop with mutual TLS; Prometheus scrapes every layer and Grafana makes each decision visible.

Named layers — each owning a distinct responsibility

KEDA

Event-driven autoscaling

KEDA watches the actual inference request queue, not CPU or memory thresholds. Queue depth rises, serving pods scale up; it drops, they scale down — the difference between scaling on what is happening and guessing at what might. GPU pods scale independently from CPU-bound work, each on its own demand signal. The same policy anticipates ramp-up and keeps the serving path warm enough to hold latency through the first request after a quiet period — without paying for idle capacity to do it.
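
As a rough illustration of queue-driven scaling, the sketch below turns a queue depth into a replica count using the ceiling-of-ratio rule that the Kubernetes Horizontal Pod Autoscaler applies to average-value targets, which is the mechanism KEDA feeds with its queue metrics. The function name, the target of 10 queued requests per pod, and the replica bounds are assumptions made for illustration, not Forge's actual configuration.

```python
import math


def desired_replicas(queue_depth: int, target_per_pod: int,
                     min_replicas: int = 0, max_replicas: int = 8) -> int:
    """Return a replica count for the given queue depth.

    Uses the ceiling-of-ratio rule: enough pods that each one handles at
    most `target_per_pod` queued requests. All default values here are
    hypothetical.
    """
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    wanted = math.ceil(queue_depth / target_per_pod)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for depth in (0, 7, 25, 400):
        print(depth, "->", desired_replicas(depth, target_per_pod=10))
```

In this sketch a floor of zero lets the pool disappear when the queue is empty, while a floor of one keeps a single pod warm; which of the two fits a given model is a cost and latency decision.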

Karpenter

Intelligent node provisioning

A pod declares a GPU requirement; Karpenter provisions the right EC2 instance to satisfy it — right type, right size — and terminates it when the work is done. CPU steps get CPU nodes, GPU steps get GPU nodes, idle nodes do not persist. Cluster composition at any moment reflects live workload, not a baseline someone set six months ago and never revisited.
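
The matching step can be pictured as a smallest-fit search over candidate instance types. The sketch below is a simplified stand-in for that idea, not Karpenter's actual scheduling algorithm: the catalog, the `relative_cost` values, and the function and class names are hypothetical. Only the instance type names come from the node groups in Fig. 1, and the sketch ignores the capacity a real node reserves for system overhead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: int
    memory_gib: int
    gpus: int
    relative_cost: float  # hypothetical ranking value, not a real price


@dataclass(frozen=True)
class PodRequest:
    vcpu: int
    memory_gib: int
    gpus: int = 0


# Illustrative catalog; a real provisioner considers far more types.
CATALOG = [
    InstanceType("m6i.large", 2, 8, 0, 1.0),
    InstanceType("g5.xlarge", 4, 16, 1, 10.0),
    InstanceType("g5.2xlarge", 8, 32, 1, 12.0),
]


def pick_instance(pod: PodRequest, catalog=CATALOG) -> Optional[InstanceType]:
    """Return the cheapest instance type that satisfies the pod, or None."""
    fits = [
        it for it in catalog
        if it.vcpu >= pod.vcpu
        and it.memory_gib >= pod.memory_gib
        and it.gpus >= pod.gpus
    ]
    return min(fits, key=lambda it: it.relative_cost) if fits else None
```

A CPU-only request lands on the CPU type because it is the cheapest fit, and a GPU request never matches a type with no GPUs, which is the "right node, right workload" behavior in miniature.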

Prometheus + Grafana

Full observability

Prometheus scrapes metrics from every component in the stack; Grafana renders real-time dashboards across inference latency, throughput, GPU utilization, pod health, scaling events, and cost signals. When traffic spikes, a cold start fires, or a spot node is reclaimed, the observability layer shows exactly what happened and when. Every serving decision the system makes is visible and auditable.

Istio

Secure service mesh

All pod-to-pod communication inside the cluster is encrypted under mutual TLS and governed by Istio, with network policy enforced at the mesh layer. Traffic between the gateway, the routing layer, and the serving pods is authenticated, observable, and policy-controlled by default — with no application-level changes. For regulated environments this is not optional infrastructure. It is the baseline.

What the wiring unlocks — capabilities the layers make possible

Multi-model routing

One gateway, many models

A single API gateway routes requests across multiple models — different versions, architectures, and hardware targets — without the calling application knowing which model serves which request. Routing is configurable by model version, request type, or load. The interface the application sees never changes, even as the serving layer underneath evolves.
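
One way to picture routing by model version, request type, or load is a small lookup with a load-based tie-break. The sketch below is an illustration under assumptions: the `Backend` fields, the `route` function, the request-type labels, and the in-flight counts are all hypothetical, and the backend names are borrowed from Fig. 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Backend:
    model: str         # e.g. "llm-serve"
    version: str       # e.g. "v3"
    request_type: str  # hypothetical label: "generate", "vision", "embed"
    in_flight: int     # current load; a real gateway would read this live


BACKENDS = [
    Backend("llm-serve", "v3", "generate", in_flight=12),
    Backend("llm-serve", "v3", "generate", in_flight=4),
    Backend("vision", "v2", "vision", in_flight=1),
    Backend("embed", "v1", "embed", in_flight=0),
]


def route(request_type: str, version: Optional[str] = None,
          backends=BACKENDS) -> Backend:
    """Pick the least-loaded backend matching type and, if given, version."""
    candidates = [
        b for b in backends
        if b.request_type == request_type
        and (version is None or b.version == version)
    ]
    if not candidates:
        raise LookupError(f"no backend for {request_type!r} {version!r}")
    return min(candidates, key=lambda b: b.in_flight)
```

The caller only names a request type and, optionally, a version. Which replica or hardware target answers is decided inside `route`, so the backend list can change without the caller changing.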

Spot instances

Fractional cost, fully handled

Spot instances cost a fraction of on-demand and get reclaimed unpredictably. Forge owns the interruption logic — rerouting in-flight requests, holding availability through replacement, and recovering gracefully. Karpenter manages the node lifecycle, KEDA the pod lifecycle: together they make spot economics achievable without spot-instance reliability risk.
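
The interruption path can be sketched as two steps: stop sending new work to the reclaimed node, then hand its in-flight requests to the surviving pods. The code below is a toy model of that sequence under stated assumptions. The `Pod` class, `handle_interruption`, and the least-loaded reassignment rule are hypothetical and do not describe Forge's actual logic or any Karpenter or KEDA API.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    node: str
    accepting: bool = True
    in_flight: list = field(default_factory=list)


def handle_interruption(node: str, pods: list) -> list:
    """Drain pods on `node` and move their requests to surviving pods.

    Returns the request ids that could not be placed because no
    surviving pod was available.
    """
    doomed = [p for p in pods if p.node == node]
    survivors = [p for p in pods if p.node != node and p.accepting]
    for p in doomed:
        p.accepting = False  # stop routing new requests here first
    stranded = []
    for p in doomed:
        while p.in_flight:
            req = p.in_flight.pop(0)
            if not survivors:
                stranded.append(req)
                continue
            # Retry on whichever surviving pod currently has the least work.
            target = min(survivors, key=lambda s: len(s.in_flight))
            target.in_flight.append(req)
    return stranded
```

A non-empty return value is the case the replacement capacity has to cover: requests that had nowhere to go until a new node came up.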

03 · Why not just…

Two easier paths. Both cost you later.

There are two obvious ways to serve a model without Forge. Each is reasonable on day one and expensive by the time you are at scale.

vs. a managed serving platform

SageMaker · Bedrock · hosted inference APIs

Fastest to start, and you pay for it forever. Per-token and per-hour markup compounds on every request, the GPU autoscaling logic is a black box you cannot tune, multi-model routing bends to the vendor's shape, and your traffic — and your data — lives in someone else's environment. The moment your volume is real, the convenience tax is the line item that will not stop growing.

vs. rolling your own from scratch

months of specialized infrastructure work

The right answer, attempted the hard way. Wiring KEDA, Karpenter, Prometheus, Istio, spot-interruption handling, and a routing gateway into something production-stable is months of specialized infra work — and the failure modes only surface under real load, at 3 a.m., after you have already shipped. Forge is that build, already done and already proven.

Forge is the third path: open-source economics, self-hosted control, and none of the assembly risk. No proprietary platform, no per-token markup, no data leaving your environment.

04 · Validation

Battle-tested at scale.

Not a reference implementation — production infrastructure, proven under real traffic and real cost pressure.

The infrastructure behind Forge has served real-time inference for consumer-facing AI products at millions of requests a day — across GPU- and CPU-bound workloads, across model generations, and under real production traffic with real cost pressure. Which tool owns which layer, how scaling events fire, how spot interruptions are absorbed — those decisions were made under that pressure, not in a design document.

Millions

Requests a day

consumer-facing AI · in production

~35%

Lower GPU cost

spot economics · on-demand reliability

One

API surface

across any number of models or versions

“The cost of serving compounds every month. The infrastructure decisions you make on day one are the ones you live with at scale.”

Every engagement that draws on Forge starts with the serving architecture already assembled and already in production. KEDA, Karpenter, Prometheus, Istio, and the gateway are wired together and proven. What changes is your model, your traffic pattern, and your cost target. That is the work worth doing together.

Start somewhere

Serving a model in production and paying for more GPU than you're using?

Tell us what you're running — your models, your traffic patterns, and where the infrastructure cost is coming from. We'll tell you honestly what Forge can do for it, and what it would take.

Brief us on what you're running → or reach us at hello@coraltree.ai
