Melodia is a configurable, LLaMA-based generative backbone that produces many structured output streams at once — each one coherent with the others across long sequences, all from a single forward pass. Validated at 3B scale on a problem most architectures avoid: multi-stem music.
Most generative models emit a single stream of tokens. Melodia emits many — in parallel, from one unified backbone. Rather than generating a mixture signal and decomposing it afterward, or producing each stream in sequence, it generates every stream at once, with each one remaining temporally aligned to all the others.
We validated it on one of the hardest structured-generation problems there is: coherent multi-stem music with synchronized vocals and accompaniment, conditioned on lyrics, musical structure, and style, across full songs several minutes long. The backbone itself is not music-specific.
Generating multiple coherent streams at once is harder than it looks. Three problems stand in the way, and solving one of them tends to make another worse.
The obvious approach to multi-stream generation is to produce streams one after another: accompaniment first, then the vocal conditioned on it. That works for two streams and falls apart past that. Every added stream adds another full generation pass, so inference time grows with the stream count. At five or six stems, one end-to-end generation becomes too slow to ship.
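As a rough illustration of why cascades scale poorly, the sketch below counts autoregressive decoding steps under a deliberately simplified cost model: one step per timestep per generation pass. The function names, the 50 frames-per-second rate, and the three-minute song length are assumptions chosen for the example, not Melodia's published figures.

```python
def cascade_steps(num_streams: int, num_timesteps: int) -> int:
    """Decoding steps when streams are generated one after another.

    Each stream needs its own full pass over the timeline.
    """
    return num_streams * num_timesteps


def joint_steps(num_streams: int, num_timesteps: int) -> int:
    """Decoding steps when every stream is emitted at each timestep.

    The count is independent of num_streams under this simplified model.
    """
    return num_timesteps


if __name__ == "__main__":
    frame_rate = 50          # hypothetical token frames per second
    song_seconds = 180       # hypothetical three-minute song
    timesteps = frame_rate * song_seconds
    for streams in (2, 6):
        print(streams, cascade_steps(streams, timesteps), joint_steps(streams, timesteps))
```

Under these assumptions, going from two stems to six triples the cascade's step count while the joint count stays fixed. Real per-step cost also depends on model size and sequence length, which this sketch ignores.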
Audio is represented as discrete tokens across multiple residual quantization codebooks. At every timestep the model predicts several codebook entries per stream. Flatten them naively and the sequence length explodes; interleave them wrong and the model never learns the within-timestep dependencies. The pattern decides whether training converges at all.
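The sequence-length problem can be made concrete with a toy example. The sketch below compares naive flattening, where every codebook entry takes its own sequence position, with a layout that groups all entries of a timestep into one position. Both layouts are generic illustrations; neither is Melodia's actual token arrangement, and the `codes[stream][codebook][timestep]` shape is an assumption made for the example.

```python
from typing import List

# codes[stream][codebook][timestep] -> token id (hypothetical layout)
Codes = List[List[List[int]]]


def flatten_naive(codes: Codes) -> List[int]:
    """One sequence position per token: length = streams * codebooks * timesteps."""
    num_timesteps = len(codes[0][0])
    return [
        codebook[t]
        for t in range(num_timesteps)
        for stream in codes
        for codebook in stream
    ]


def group_by_timestep(codes: Codes) -> List[List[int]]:
    """One sequence position per timestep, each holding streams * codebooks tokens."""
    num_timesteps = len(codes[0][0])
    return [
        [codebook[t] for stream in codes for codebook in stream]
        for t in range(num_timesteps)
    ]


if __name__ == "__main__":
    # 2 streams, 4 codebooks, 5 timesteps; token ids encode their own coordinates.
    toy = [[[s * 100 + k * 10 + t for t in range(5)] for k in range(4)] for s in range(2)]
    print(len(flatten_naive(toy)))      # sequence positions when flattened
    print(len(group_by_timestep(toy)))  # sequence positions when grouped
```

With two streams and four codebooks, the flattened sequence is eight times longer than the grouped one. The grouped layout keeps the sequence short but leaves open how the model learns dependencies among the tokens inside a position, which is the trade-off the paragraph above describes.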
The naive fix for more streams is more parameters. In practice that fails — larger models stop converging reliably on this task, and instability compounds with every stream added. Melodia had to support an arbitrary number of stems without an exponentially larger model or a redesign.
Melodia rests on four design choices. Each departs from the standard single-stream playbook, and each makes the next one possible.
Instead of interleaving patterns built for single-stream audio, Melodia uses a token arrangement designed for the multi-stream case. It lets the model learn dependencies across all streams within a timestep without exploding sequence length — which is what makes training converge and inference fast enough to ship.
Every stream is generated in one forward pass. This is architecturally distinct from cascades, where each stream conditions the next in sequence. The model attends to all streams at once, at every timestep — producing output that stays temporally coherent across as many stems as needed, without multiplying inference time.
The model generates knowing whether it is in an intro, verse, chorus, bridge, or outro — and its output reflects that structure rather than treating the sequence as a uniform token stream. Structural conditioning anchors generation at the segment level, not just the token level, which is what holds coherence together across a full-length piece.
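One plausible way to feed section information to a model is to expand a song plan into one section label per timestep, which can then be embedded alongside the tokens. The sketch below shows only that expansion step. The section vocabulary comes from the paragraph above; the function name, the `(section, seconds)` plan format, and the frame rate are assumptions, and this is not a description of how Melodia actually implements its conditioning.

```python
from typing import List, Tuple

SECTIONS = ["intro", "verse", "chorus", "bridge", "outro"]


def section_ids_per_timestep(
    segments: List[Tuple[str, float]], frame_rate: int
) -> List[int]:
    """Expand (section name, duration in seconds) pairs into per-timestep ids.

    frame_rate is the hypothetical number of token frames per second.
    """
    ids: List[int] = []
    for name, seconds in segments:
        section_id = SECTIONS.index(name)  # raises ValueError on unknown names
        ids.extend([section_id] * round(seconds * frame_rate))
    return ids


if __name__ == "__main__":
    plan = [("intro", 2.0), ("verse", 3.0), ("chorus", 2.5)]
    ids = section_ids_per_timestep(plan, frame_rate=10)
    print(len(ids), ids[0], ids[20], ids[-1])
```

Because every timestep carries its section id, the label is available at each generation step rather than only at the start of the prompt.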
Adding streams does not mean retraining from scratch or rebuilding the model. The validation run used two — vocals and accompaniment. The same backbone extends to additional instrument stems, multiple vocal layers, or spatial streams.
Direct outputs from the validation run — lyric-conditioned vocals over accompaniment, and instrumentals. Structured multi-stem generation, coherent lyric–music alignment, temporal stability.
Hey little girl, come on and dance with me.
I'm a fire in the night. You're a candle in my darkness.
City lights are calling my name
Heart on fire, I'm in the game
We're running wild, we're feeling alive
Hands up high, we touch the sky
Not a foundation model fine-tuned on a new dataset. The architecture, the codebook pattern, the conditioning strategy, and the training recipe were all chosen for this problem — not inherited from a model built for something else.
Generating every stream in one forward pass is harder to build correctly than a cascade, but it is what makes the output more coherent, the system more scalable, and inference fast enough for production.
Most models generate sequences that are locally coherent but globally shapeless. Melodia can be guided at the level of form, not just prompted at the token level.
Any problem that needs multiple structured streams to stay coherent with each other over time is a candidate. A few of them:
Visual and audio streams that have to stay synchronized as they are generated together.
Multiple molecular or protein properties generated jointly rather than one at a time.
Several actuator control streams that need tight temporal coordination across a horizon.
“Any domain where single-stream architectures have been the only option by default — not by design.”
For teams working on generative problems that don't fit standard single-stream architectures, Melodia is a validated starting point: a known training recipe and a proven answer to the codebook and inference problems. We adapt the architecture to your domain, retrain it on your data, and extend it to as many streams as the problem demands. The engagement includes the detailed architecture, the codebook methodology, the multi-stem expansion approach, and full training artifacts.
Tell us what you're building. We'll tell you honestly whether — and how — Melodia's approach fits.