Codename · Cantata

Generative · Foundation

Built to generate everything, at once.

Cantata is a configurable, LLaMA-based generative backbone that produces many structured output streams at once — each one coherent with the others across long sequences, all from a single forward pass. Validated at 3B scale on a problem most architectures avoid: multi-stem music.

Hear the validation run ↓ · Brief us on what you're building →

Backbone: LLaMA-based

Parameters: 3B (validation)

Streams: Parallel · N-scalable

Conditioning: Language + segment

Validated on: Multi-stem music

01 · What it is

Built to generate in parallel, not in sequence.

Most generative models emit a single stream of tokens. Cantata emits many — in parallel, from one unified backbone. Rather than generating a mixture signal and decomposing it afterward, or producing each stream in sequence, it generates every stream at once, with each one remaining temporally aligned to all the others.

We validated it on one of the hardest structured-generation problems there is: coherent multi-stem music with synchronized vocals and accompaniment, conditioned on lyrics, musical structure, and style, across full songs several minutes long. The backbone itself is not music-specific.

02 · The core problem

Three problems, solved together.

Generating multiple coherent streams at once is harder than it looks — and solving one of these tends to make another worse.

The sequential inference problem

The obvious approach to multi-stream generation is to produce streams one after another: accompaniment first, then the vocal conditioned on it. It works for two streams and falls apart past that. Each added stream adds another full generation pass, so inference time grows with the stream count. At five or six stems, generating a single piece becomes too slow to ship.
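As a rough illustration of why a cascade slows down, the sketch below counts decoding steps only. The function names and the frame rate are assumptions made for the example, not figures from the validation run, and the count ignores the extra context each later stream has to attend to in a cascade.

```python
def sequential_steps(num_streams: int, frames: int) -> int:
    """Decoding steps for a cascade: each stream is a full pass over all frames."""
    return num_streams * frames


def parallel_steps(num_streams: int, frames: int) -> int:
    """Decoding steps when every stream is emitted together at each frame."""
    return frames


# Hypothetical numbers: a 3-minute song at 50 token frames per second.
FRAMES = 3 * 60 * 50

for n in (2, 4, 6):
    # Columns: stream count, cascade steps, parallel steps.
    print(n, sequential_steps(n, FRAMES), parallel_steps(n, FRAMES))
```

Under these assumptions the cascade's step count scales with the number of stems, while the parallel count stays fixed at the number of frames.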

The codebook pattern problem

Audio is represented as discrete tokens across multiple residual quantization codebooks. At every timestep the model predicts several codebook entries per stream. Flatten them naively and the sequence length explodes; interleave them wrong and the model never learns the within-timestep dependencies. The pattern decides whether training converges at all.
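To make the trade-off concrete, here is a minimal sketch of two common layouts for residual-codebook tokens: full flattening, and a "delay" style offset of the kind used for single-stream audio models. Neither is Cantata's pattern, which this page does not spell out. The function names, the padding value, and the toy codes are all assumptions for illustration.

```python
PAD = -1  # hypothetical padding token for empty slots


def flatten_length(timesteps: int, streams: int, codebooks: int) -> int:
    """Sequence length if every codebook entry of every stream gets its own position."""
    return timesteps * streams * codebooks


def delay_pattern(codes, pad=PAD):
    """Shift codebook k right by k steps, so K codebooks of length T span T + K - 1 positions.

    codes: list of K lists, each holding T token ids for one codebook.
    """
    num_codebooks = len(codes)
    return [
        [pad] * k + list(row) + [pad] * (num_codebooks - 1 - k)
        for k, row in enumerate(codes)
    ]


# Toy example: 3 codebooks, 3 timesteps, one stream.
codes = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]
delayed = delay_pattern(codes)
for row in delayed:
    print(row)

# Flattening the same 3 timesteps for 2 streams of 3 codebooks each.
print(flatten_length(timesteps=3, streams=2, codebooks=3))
```

Flattening keeps every within-timestep dependency explicit but multiplies sequence length by streams × codebooks. The delayed layout keeps length near T, but it was designed around one stream's codebooks, which is the gap a multi-stream pattern has to close.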

Model size versus stream count

The naive fix for more streams is more parameters. In practice that fails — larger models stop converging reliably on this task, and instability compounds with every stream added. Cantata had to support an arbitrary number of stems without an exponentially larger model or a redesign.

Fig. 1 — Three failure modes of naïve multi-stream generation: inference time grows with each stream, codebook tokens stagger and explode in length, and parameters balloon past the point of stable convergence.

03 · How Cantata solves it

Four decisions, made deliberately.

Each one is a departure from the standard single-stream playbook — and each is what makes the next one possible.

A custom codebook pattern

Instead of interleaving patterns built for single-stream audio, Cantata uses a token arrangement designed for the multi-stream case. It lets the model learn dependencies across all streams within a timestep without exploding sequence length — which is what makes training converge and inference fast enough to ship.

Single-stage parallel generation

Every stream is generated in one forward pass. This is architecturally distinct from cascades, where each stream conditions the next in sequence. The model attends to all streams at once, at every timestep — producing output that stays temporally coherent across as many stems as needed, without multiplying inference time.
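One way to picture single-pass generation is a shared hidden state per timestep feeding one output head per stream and codebook. The toy sketch below uses plain Python and random weights. The head layout, sizes, and names are assumptions for illustration and are not taken from Cantata's architecture.

```python
import random


def make_head(hidden_size, vocab_size, rng):
    """A hypothetical linear output head: one weight row per vocabulary entry."""
    return [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(vocab_size)]


def argmax_token(head, hidden):
    """Greedy pick: the vocabulary index with the largest dot product."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in head]
    return max(range(len(logits)), key=logits.__getitem__)


def decode_step(hidden, heads):
    """One shared hidden state in, one token per (stream, codebook) head out."""
    return {key: argmax_token(head, hidden) for key, head in heads.items()}


rng = random.Random(0)
HIDDEN, VOCAB, CODEBOOKS = 8, 16, 4  # toy sizes, chosen arbitrarily
streams = ["vocals", "accompaniment"]

heads = {
    (stream, k): make_head(HIDDEN, VOCAB, rng)
    for stream in streams
    for k in range(CODEBOOKS)
}

# Stand-in for the backbone output at one timestep.
hidden = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
tokens = decode_step(hidden, heads)
print(len(tokens), "tokens from one step")
```

In this sketch, adding a stream means adding heads that read the same hidden state, so the number of backbone steps per timestep does not change.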

Segment-level structural conditioning

The model generates knowing whether it is in an intro, verse, chorus, bridge, or outro — and its output reflects that structure rather than treating the sequence as a uniform token stream. Structural conditioning anchors generation at the segment level, not just the token level, which is what holds coherence together across a full-length piece.
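A simple way to expose structure to a model is to expand a segment plan into one label per token frame, which can then be embedded alongside the tokens. The sketch below shows only that expansion. The label ids, the frame rate, and the function name are hypothetical, and the page does not say how Cantata injects the labels.

```python
SEGMENT_IDS = {"intro": 0, "verse": 1, "chorus": 2, "bridge": 3, "outro": 4}


def frame_segment_ids(plan, frames_per_second):
    """Expand (segment_name, duration_seconds) pairs into one segment id per frame."""
    ids = []
    for name, seconds in plan:
        ids.extend([SEGMENT_IDS[name]] * round(seconds * frames_per_second))
    return ids


# Hypothetical plan: durations in seconds, 5 frames per second for readability.
plan = [("intro", 2), ("verse", 4), ("chorus", 3)]
segment_ids = frame_segment_ids(plan, frames_per_second=5)
print(len(segment_ids), segment_ids[:12])
```

Every stream shares the same per-frame labels, so a change of segment reaches all stems at the same timestep.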

Scales to more streams, no redesign

Adding streams does not mean retraining from scratch or rebuilding the model. The validation run used two — vocals and accompaniment. The same backbone extends to additional instrument stems, multiple vocal layers, or spatial streams.

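[Figure: stems on the vertical axis (+ more, Percussion, Bass, Keys, Lead synth, Atmosphere, Vocals) against time on the horizontal axis, divided into Intro, Verse, Chorus, Bridge, and Outro. Segment annotations, in order: atmospheric pads; tighter rhythm, minimal bass; layered synths, driving drums, drop feel; rising tension; reduced energy, long reverb tails.]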

Fig. 2 — Segment-driven, multi-stem transformer. Streams stay aligned across time while structure is anchored at the segment level; pill opacity encodes per-stem energy, and terracotta marks the chorus peak.

04 · Hear it

The proof, in full.

Direct outputs from the validation run — lyric-conditioned vocals over accompaniment, and instrumentals. Structured multi-stem generation, coherent lyric–music alignment, temporal stability.

Sample A · Vocals + Accompaniment

Duration: 0:08

Lyrics

Hey little girl, come on and dance with me.

I'm a fire in the night. You're a candle in my darkness.

Sample B · Vocals + Accompaniment

Duration: 0:17

Lyrics

City lights are calling my name

Heart on fire, I'm in the game

We're running wild, we're feeling alive

Hands up high, we touch the sky

Instrumental 01

Duration: 0:19

Instrumental 02

Duration: 0:19

Instrumental 03

Duration: 0:19

05 · Why it's hard to copy

Designed from first principles

Not a foundation model fine-tuned on a new dataset. The architecture, the codebook pattern, the conditioning strategy, and the training recipe were all chosen for this problem — not inherited from a model built for something else.

Single-pass parallelism

Generating every stream in one forward pass is harder to build correctly than a cascade. It is also what makes the output more coherent, lets the model scale to more streams, and keeps inference fast enough for production.

Structural conditioning

Most models generate sequences that are locally coherent but globally shapeless. Cantata can be guided at the level of form, not just prompted at the token level.

06 · Where it applies

Music is the proof. The backbone is general.

Any problem that needs multiple structured streams to stay coherent with each other over time is a candidate. A few of them:

Video

Visual and audio streams that have to stay synchronized as they are generated together.

Scientific sequences

Multiple molecular or protein properties generated jointly rather than one at a time.

Robotics

Several actuator control streams that need tight temporal coordination across a horizon.

“Any domain where single-stream architectures have been the only option by default — not by design.”

07 · In an engagement

A validated starting point.

For teams working on generative problems that don't fit standard single-stream architectures, Cantata is a validated starting point: a known training recipe and a proven answer to the codebook and inference problems. We adapt the architecture to your domain, retrain it on your data, and extend it to as many streams as the problem demands. The detailed architecture, the codebook methodology, the multi-stem expansion approach, and full training artifacts all come with the engagement.

Start somewhere

Working on a generative problem that needs more than one output stream?

Tell us what you're building. We'll tell you honestly whether — and how — Cantata's approach fits.

Brief us on what you're building →

Or reach us directly: hello@coraltree.ai