Gemma 4 Threw Out the Multimodal Encoders

Google calls Gemma 4 12B "encoder-free multimodal." Strip the marketing and a language model is still just a machine that eats a list of numbers and guesses the next one. It has no eyes and no ears. That's the whole reason multimodal models are built the way they are.

Text doesn't go into a model as text. It gets chopped into tokens, and every token becomes a vector, a few thousand numbers that stand in for "the word cat, here, in this sentence." The model reads that run of vectors and predicts the token that comes next. The trick that lets it do that is attention. Every vector gets to look at every other one and pull in whatever's relevant, which is how it ties a word to another one ten tokens back.

So an image is a problem. A screenshot is not a sequence of word-vectors, and a voice note isn't either. To get the model to answer a question about a picture, you first have to turn that picture into the one shape it knows how to read.

For about three years, the answer has been to hire a specialist.

How a text model learned to see

You take a separate model that already understands images and bolt it onto the front. Usually that's a Vision Transformer trained with CLIP or SigLIP. Show it hundreds of millions of image-caption pairs until it can turn a picture into a handful of vectors that mean something. That model is the encoder. Its vectors don't live in the language model's space, so you add a small projector, a couple of dense layers that translate them into the same embedding space the LLM's word vectors live in. Then you put those translated vectors in front of the text and let the LLM read the whole lot as one sequence.

Three boxes in a row: encoder, projector, language model. LLaVA made this the default recipe and almost everyone copied it. Audio works the same way, with a Whisper speech encoder in place of the vision one. And before you decide this is someone else's design, Gemma 3 is built exactly like it. A frozen 400M-parameter SigLIP encoder sits out front, turning each 896×896 image into 256 tokens before the language model sees a thing.
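To make the three boxes concrete, here is a shape-only sketch in plain Python. It's a toy, not anyone's real stack: `fake_encoder` just returns random vectors, the widths (8 and 12) are made up so it runs instantly, and the ReLU in `projector` stands in for whatever activation a real projector uses. The only number taken from the text is Gemma 3's 256 tokens per image.

```python
import random

rng = random.Random(0)

ENC_WIDTH = 8        # toy encoder output width (assumption; real encoders are far wider)
LLM_WIDTH = 12       # toy LLM hidden width (assumption)
IMAGE_TOKENS = 256   # Gemma 3: one 896x896 image becomes 256 tokens

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vec):
    # matrix is rows x cols, vec has length cols, result has length rows
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def fake_encoder(image):
    """Stand-in for a frozen ViT: ignores the image and returns 256 vectors."""
    return [[rng.gauss(0, 1) for _ in range(ENC_WIDTH)] for _ in range(IMAGE_TOKENS)]

# The projector: two dense layers mapping encoder space into LLM space.
W1 = rand_matrix(LLM_WIDTH, ENC_WIDTH)
W2 = rand_matrix(LLM_WIDTH, LLM_WIDTH)

def projector(vec):
    hidden = [max(0.0, h) for h in matvec(W1, vec)]  # ReLU as a simple stand-in
    return matvec(W2, hidden)

image_vectors = [projector(v) for v in fake_encoder(image=None)]
text_vectors = [[0.0] * LLM_WIDTH for _ in range(5)]  # five toy text embeddings

# Image tokens go in front of the text; the LLM reads one sequence.
sequence = image_vectors + text_vectors
print(len(sequence), len(sequence[0]))
```

The point is only the plumbing. By the time the language model sees anything, the image has already been reduced to 256 vectors that somebody else's model chose.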

The recipe caught on for good reasons. The encoder is already smart, so you're not teaching a language model to see from scratch. You can train in stages and keep the encoder frozen, which is stable and cheap. It's modular too. When a better encoder shows up you swap it in, assuming the projector survives the swap.

The bill comes in four parts. The encoder is extra weight that has nothing to do with language, anywhere from a few hundred million parameters to more than a billion. It's an extra hop, since pixels have to clear the encoder before the model can start thinking. It's a second training pipeline to feed and maintain. The fourth is subtler. The encoder decides ahead of time what the language model is allowed to perceive. Its resolution and aspect ratios, plus whatever priors it baked in during pretraining, become a ceiling. The LLM only ever sees the encoder's read on an image, never the pixels themselves.

That ceiling is what Gemma 4 set out to remove.

Where multimodal input enters the model

[Figure: the conventional VLM stack. Image → patches → SigLIP ViT (~400M) → MLP projector (~20M). Audio → mel spectrogram → speech encoder (~600M) → adapter (~15M). Text → tokenizer → embedding. All three paths feed the LLM backbone (12B params), which handles the rest.]

That's ≈ 1B+ params of separate, separately trained encoders and projectors in front of the backbone, adding weight and a latency hop. Encoder sizes are typical-stack estimates, not Gemma's exact numbers.
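The "1B+" is just the sum of the typical-stack estimates in the figure. A quick check, using those same rough sizes (they are estimates, not Gemma's numbers):

```python
# Rough component sizes in millions of parameters, from the figure's estimates.
stack_millions = {
    "SigLIP ViT": 400,
    "MLP projector": 20,
    "speech encoder": 600,
    "audio adapter": 15,
}
backbone_millions = 12_000

extra = sum(stack_millions.values())
share = extra / backbone_millions
print(f"{extra}M extra parameters, {share:.1%} on top of a 12B backbone")
```

So the bolted-on parts are real weight, but they are single-digit percent of the backbone. That matters later when the memory argument comes up.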

The bet: skip the specialist

The idea is blunt. Don't hire a vision model at all. Take the raw pixels, do the cheapest possible thing to make them vector-shaped, and hand them straight to the language model. It learns to see using the same weights it uses for everything else. Google's framing:

"No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone."

The "cheapest possible thing" turns out to be very cheap. Per the developer guide, the 27-layer vision transformer that the other mid-size Gemma 4 models carry is gone. That stack was around 550M parameters. In its place is a 35M-parameter "embedder" that cuts the image into 48×48 pixel patches and runs each one through a single matrix multiply to land it in the model's hidden dimension, the fixed width of those few-thousand-number vectors from the opening. Position comes from a simple lookup. There's a learned table for the x coordinate and another for the y, and you add the two entries to the patch vector. The patches never attend to each other inside the embedder. Each one is embedded on its own and dropped into the sequence next to the text.
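Here is a minimal sketch of that embedder as I read the description, in plain Python with toy sizes so it runs instantly. The real patch edge is 48 and the real width is in the thousands. `PATCH`, `HIDDEN`, `GRID`, the random weights and the function name are all mine, and I've left out the normalization step, so treat it as an illustration of the shape of the idea and not Google's code.

```python
import random

rng = random.Random(0)

PATCH = 4      # toy patch edge (the developer guide says 48)
CHANNELS = 3
HIDDEN = 16    # toy hidden width (assumption)
GRID = 8       # patches per side covered by the toy position tables (assumption)
IN_DIM = PATCH * PATCH * CHANNELS

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]

W = rand_matrix(HIDDEN, IN_DIM)       # the single projection
X_TABLE = rand_matrix(GRID, HIDDEN)   # learned lookup, indexed by patch column
Y_TABLE = rand_matrix(GRID, HIDDEN)   # learned lookup, indexed by patch row

def embed_image(image):
    """image[row][col] is an (r, g, b) tuple; height and width divide by PATCH."""
    height, width = len(image), len(image[0])
    tokens = []
    for py in range(height // PATCH):
        for px in range(width // PATCH):
            # Flatten one patch into a single list of numbers.
            flat = [
                channel
                for row in image[py * PATCH:(py + 1) * PATCH]
                for pixel in row[px * PATCH:(px + 1) * PATCH]
                for channel in pixel
            ]
            # One matrix multiply, then add the two position lookups.
            projected = [sum(w * v for w, v in zip(w_row, flat)) for w_row in W]
            tokens.append(
                [p + x + y for p, x, y in zip(projected, X_TABLE[px], Y_TABLE[py])]
            )
    return tokens

# An 8-row by 12-column toy image: 2 rows x 3 columns of patches.
image = [[(rng.random(), rng.random(), rng.random()) for _ in range(12)] for _ in range(8)]
tokens = embed_image(image)
print(len(tokens), len(tokens[0]))
```

Notice that no patch ever looks at another patch in here. Change one pixel and exactly one output vector changes. Any relating of patches to each other has to happen later, in the backbone.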

Audio is the bolder half, and the part with less precedent. The audio encoder gets deleted outright. It was twelve conformer layers in the smaller Gemma 4 models, and here there's nothing. Raw 16 kHz audio gets sliced into 40-millisecond frames of 640 samples each, and every frame goes through one linear projection into the same space as the text tokens. No spectrogram, no feature front-end. The model's existing position machinery handles time.
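The framing arithmetic is easy to check: 16,000 samples per second times 0.040 seconds is 640 samples, or 25 frames per second of audio. A small sketch of the slicing step follows. Whether the real model overlaps frames or how it pads the final one isn't stated, so the non-overlapping, zero-padded version here is my assumption, and `frame_audio` is a name I made up.

```python
import math

SAMPLE_RATE = 16_000
FRAME_MS = 40
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 640 samples per frame

def frame_audio(samples):
    """Split a mono waveform into non-overlapping frames, zero-padding the tail."""
    frames = []
    for start in range(0, len(samples), FRAME_LEN):
        frame = samples[start:start + FRAME_LEN]
        frame = frame + [0.0] * (FRAME_LEN - len(frame))
        frames.append(frame)
    return frames

# 2.5 seconds of a 440 Hz tone as a stand-in for a voice note.
wave = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(int(2.5 * SAMPLE_RATE))]
frames = frame_audio(wave)
print(FRAME_LEN, len(frames), 1000 // FRAME_MS)
```

Each of those 640-number frames would then go through one linear projection, the same move as a flattened image patch, just with a much smaller input.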

The whole thing is almost insultingly simple.

Raw input, one matrix multiply, done

[Figure: the two input paths. Vision: a 48 × 48 × 3 patch (6,912 numbers) goes through one matrix multiply to a vector ≈ 3,840 wide, plus a looked-up (x, y) position, then a normalize. Audio: a 40 ms frame at 16 kHz (640 samples) goes through one matrix multiply to a vector ≈ 3,840 wide, and temporal order comes from the model's existing position embeddings.]

Both land in the same ~3,840-wide space the text tokens already live in. No encoder, no spectrogram, just one projection each. The ~3,840 width is inferred from public arithmetic, not stated by Google.

(The vision projection alone, 6,912 numbers into 3,840, is about 26M parameters.)
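That parenthetical is plain multiplication, and the audio projection can be sized the same way. This assumes weight matrices with no bias terms and takes the inferred 3,840 width at face value, so the results are estimates.

```python
PATCH_INPUTS = 48 * 48 * 3   # 6,912 numbers per image patch
FRAME_INPUTS = 640           # samples per 40 ms audio frame
HIDDEN = 3840                # inferred width, not stated by Google
EMBEDDER_TOTAL = 35_000_000  # the developer guide's figure for the vision embedder

vision_proj = PATCH_INPUTS * HIDDEN
audio_proj = FRAME_INPUTS * HIDDEN
leftover = EMBEDDER_TOTAL - vision_proj

print(f"vision projection: {vision_proj:,}")
print(f"audio projection:  {audio_proj:,}")
print(f"rest of the 35M embedder: {leftover:,}")
```

The roughly 8.5M left over in the vision embedder is presumably position tables and normalization, but I can't break that down from public numbers. The audio projection comes out around 2.5M, which is tiny next to a twelve-layer conformer.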

Then all of it goes into one decoder-only transformer that reuses the structure of the larger Gemma 4 31B Dense model. That's the text vectors, the image-patch vectors, and the audio-frame vectors, in a single stream. Dense just means every weight runs on every token, with no routing. One stack of weights, and the model never gets told "this token is an image." It just sees a run of same-sized vectors. Attention, remember, lets every vector pull in whatever's relevant from every other one, so the model relates an image patch to the nearby words the same way it relates two words. Fusion isn't a separate stage you build anymore. It's attention doing its usual job on a longer, mixed sequence.
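A toy version of "fusion is just attention" looks like this. It's single-head attention with no learned query, key or value projections, which is a big simplification, and I've made it causal on the assumption that a decoder-only model masks future positions (how Gemma 4 actually masks image tokens isn't something I can confirm). The widths and counts are invented.

```python
import math
import random

rng = random.Random(0)
WIDTH = 8  # toy width; every modality shares it

def rand_vectors(count):
    return [[rng.gauss(0, 1) for _ in range(WIDTH)] for _ in range(count)]

text, patches, frames = rand_vectors(4), rand_vectors(6), rand_vectors(3)
stream = text + patches + frames   # one sequence; no modality tag survives

def attend(sequence):
    """Causal single-head attention using the vectors as their own Q, K and V."""
    scale = 1 / math.sqrt(WIDTH)
    out = []
    for i, query in enumerate(sequence):
        visible = sequence[: i + 1]   # this position and everything before it
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in visible]
        top = max(scores)             # subtract the max for a stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append(
            [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(WIDTH)]
        )
    return out

mixed = attend(stream)
print(len(mixed), len(mixed[0]))
```

Nothing in `attend` knows which positions were text, pixels or audio. An audio frame at the end mixes in image patches and words with the same arithmetic that mixes two words.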

One embedding space, one sequence

[Figure: text tokens, image patches and audio frames each become a vector of width d and merge into a single token stream into the transformer.]

The transformer never learns "this token is an image." It sees one sequence of equal-width vectors and attends across all of them at once. Fusion is not a separate stage. It is just attention.

The encoder used to hand the model a pre-chewed, meaningful picture. Now the model gets raw projected pixels and has to build that understanding itself, inside the same weights it uses for language. The work just moved inward.

"Encoder-free" is a generous name

A 35M-parameter module that turns pixels into position-aware vectors for a transformer is encoding. It's a small, shallow, attention-less encoder, but it encodes. The sharpest reply in the Hacker News thread put it in one line: that's technically encoding, just without a dedicated model like SigLIP doing it.

The accurate claim is "no heavy, separately trained encoder," not "no encoding." Dropping from ~550M dedicated vision parameters to ~35M is a real cut, about 94% off that slice of the stack. It's a smaller and more specific win than the name advertises.
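Both halves of that claim fit in a few lines. The 550M and 35M figures are the developer guide's, and 11.95B is the parameter count on the model card.

```python
vision_before = 550e6   # the removed 27-layer vision stack
vision_after = 35e6     # the embedder that replaced it
total_params = 11.95e9  # model card

cut_of_slice = 1 - vision_after / vision_before
cut_of_model = (vision_before - vision_after) / total_params

print(f"cut within the vision slice: {cut_of_slice:.1%}")
print(f"same cut as a share of the whole model: {cut_of_model:.1%}")
```

A cut in the nineties looks dramatic until you divide by the whole model, where it lands at a few percent.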

There's an asymmetry, too. Google says the backbone "takes over visual processing." For audio they only say the signal is projected into the text-token space. They'll make that stronger claim for vision and pointedly not for audio.

None of this is new

Read only Google's post and "encoder-free multimodal" sounds invented. It's mostly a clean productization.

The vision half is basically Fuyu-8B, which Adept put out in October 2023. It's a plain decoder-only transformer with no image encoder, image patches projected straight into the first layer. Same move, two and a half years earlier. Fuyu came with an asterisk worth remembering here, since its training data and procedure were never disclosed.

The wider "every modality is just tokens in one transformer" idea has a whole lineage. Meta's Chameleon ran every modality as discrete tokens through a single transformer back in May 2024, and it documented how badly early fusion wants to diverge during training. The authors needed extra normalization tricks just to keep the loss stable past the first fifth of an epoch. Emu3 went further and trained one transformer from scratch on image, text and video tokens, with no CLIP and no pretrained LLM underneath. SOLO and Mono-InternVL made the single-transformer argument using almost exactly Google's reasoning about heterogeneous stacks being a pain to scale. And EVE, the paper with "encoder-free vision-language models" in its title, is the one that spells out the catch. Training without encoders gives you "slow convergence and large performance gaps," and the way its authors closed the gap was, wonderfully, to distill from a vision encoder.

Audio has prior art too, some of it Google's own. AudioPaLM put speech and text in one decoder-only model back in 2023. Those systems tokenized audio with a learned codec first, though. Gemma 4's straight linear projection of raw waveform frames, with no codec, is the one piece I'd actually call new.

The idea is more than two years old. What's new here is the integration: one shipped 12B production model that's encoder-free across both vision and audio at once, sharing a single set of weights, plus the factorized coordinate trick for position. That part is real engineering.

The lineage also flags what tends to break. Encoder-free vision models have a documented habit of struggling on fine detail, OCR, dense text and high resolution unless you throw a lot of data and compute at them. That's the baseline Gemma 4 walks into. Pulling the encoder doesn't make that difficulty go away. It shifts onto the backbone and the training data, and you pay for it during training instead.

Why you'd actually want this

Start with the pitch that doesn't hold up. Encoder-free gets sold as a memory win, and it mostly isn't one. Quantization does most of the shrinking. Storing each weight in 4 bits instead of 16 makes the model about 4x smaller for a small quality hit, which gets a 12B model's weights down to roughly 6.7GB before runtime overhead. A 550M encoder at that precision is a rounding error next to it. Quantization is what fits Gemma 4 on a laptop. The missing encoder barely moves the needle.
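The weights-only arithmetic behind that is short. One assumption to flag loudly: I'm using about 4.5 effective bits per weight for the "4-bit" case, since 4-bit formats usually carry some scale metadata and that value happens to reproduce the 6.7GB figure. It's my back-fit, not a published number.

```python
def weight_gb(params, bits_per_weight):
    """Weights only: no KV cache, activations or runtime overhead."""
    return params * bits_per_weight / 8 / 1e9

BACKBONE = 11.95e9   # model card parameter count
ENCODER = 550e6      # the vision stack that was removed
Q4_BITS = 4.5        # assumed effective bits per weight for a 4-bit format

bf16 = weight_gb(BACKBONE, 16)
q4 = weight_gb(BACKBONE, Q4_BITS)
encoder_q4 = weight_gb(ENCODER, Q4_BITS)

print(f"backbone at BF16:  {bf16:.1f} GB")
print(f"backbone at ~4bit: {q4:.1f} GB")
print(f"encoder at ~4bit:  {encoder_q4:.2f} GB")
```

Quantization takes off something like 17GB. Deleting the encoder takes off about a third of one.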

Does it fit on the machine you already own?

[Interactive figure: memory footprint by precision against available VRAM / unified memory, shown here at int4 (Q4) with 16 GB available.]

Model	Footprint at int4	Fits in 16 GB?
Gemma 4 26B MoE	16.5 GB	no, over
12B + bolted-on encoders	8.8 GB	yes
Gemma 4 12B encoder-free	8.2 GB	yes

At int4 (Q4), the encoder-free 12B clears 16 GB with room to spare. The encoders you remove are not the dominant cost. Quantization is. But every gigabyte you do not spend on a frozen ViT is a gigabyte of KV cache for a longer context.

Rough arithmetic (params × bytes, plus a little overhead). Real footprints depend on context length, batch, and runtime.

The reasons that hold up are structural.

One set of weights means one fine-tune. The old stack has you managing a frozen encoder, alignment stages, and the eternal question of whether to unfreeze it. With a single backbone, one fine-tune updates vision, audio and text at once. That can be a full retrain or a LoRA, the cheap kind that just trains a small add-on instead of all the weights. Anyone who's fought a multi-stage vision-language training script knows that's not nothing.
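For a sense of how small that add-on is, here's the LoRA parameter count for one weight matrix. LoRA learns two thin factors and leaves the original matrix frozen. The square 3,840 × 3,840 matrix and the rank of 16 are hypothetical picks for illustration, since the real projection shapes aren't something I have.

```python
def lora_params(d_in, d_out, rank):
    # Two low-rank factors: A is rank x d_in, B is d_out x rank.
    return rank * d_in + d_out * rank

D = 3840    # inferred hidden width, used here for a hypothetical square matrix
RANK = 16   # a common small rank, chosen for illustration

full = D * D
adapter = lora_params(D, D, RANK)
print(f"full matrix: {full:,}")
print(f"LoRA add-on: {adapter:,} ({adapter / full:.2%} of the full matrix)")
```

Under those assumptions the adapter is under one percent of the matrix it modifies, and with a single backbone the same adapter touches how the model handles text, pixels and audio.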

You also lose the encoder's ceiling. A frozen SigLIP decides for good what resolutions and aspect ratios your model handles well. Feed raw patches in and perception is bounded by the backbone and the data instead of by a component someone else froze last year.

And there are fewer parts to ship. Think local agents, screen-reading agents, offline assistants, anything on a device with no network. Those are exactly the places where one stack of weights with one runtime beats an LLM plus a ViT plus an audio encoder plus three projectors, each with its own kernels to compile and quantize. Gemma 4 12B also ships with multi-token-prediction drafters for speculative decoding, which is the right instinct for latency on a laptop. Google never published a speedup number alongside the model, though, so the latency claim is unverified for now.
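Why the missing speedup number matters: speculative decoding's payoff depends almost entirely on how often the draft is accepted. Under the usual simplifying assumption that each drafted token is accepted independently with the same probability, the expected tokens produced per verification pass is (1 - a^(k+1)) / (1 - a) for acceptance rate a and draft length k. The draft length of 4 below is a hypothetical, and the sketch ignores what the drafter itself costs.

```python
def expected_tokens_per_step(accept_rate, draft_len):
    """Expected tokens per verification pass, assuming independent acceptance."""
    if accept_rate >= 1.0:
        return draft_len + 1
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

for rate in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_step(rate, draft_len=4)
    print(f"acceptance {rate:.0%}: about {tokens:.2f} tokens per pass")
```

The gap between a mediocre acceptance rate and a good one is the gap between a modest win and a large one, which is why a demo number wouldn't tell you much about your own prompts anyway.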

If you're building something that has to run in a browser tab or on a plane, a model that handles pixels and audio in the same loop is the shape you want, with or without a winning benchmark.

What hasn't been shown

Four days after launch, almost nothing about Gemma 4 has been checked independently. The good news is that everything below is checkable.

Every headline number is Google's own, measured on the native BF16 weights, with no outside reproduction yet. The model card lists MMLU Pro 77.2, GPQA Diamond 78.8, MMMU Pro 69.1, MATH-Vision 79.7, and so on down the page. MarkTechPost notes Google didn't publish full tables at launch. The one comparison they do make is to "our larger 26B MoE model" at "less than half the total memory footprint." That sibling is a mixture-of-experts, which fires only a slice of its weights per token. There's no table behind the claim, and no head-to-head against anything like Qwen3-VL.

Benchmark	Gemma 4 12B	What it measures
MMLU Pro	77.2	text reasoning
GPQA Diamond	78.8	hard-science QA
MMMU Pro	69.1	multimodal reasoning
MATH-Vision	79.7	visual math
InfoVQA	88.4	infographic QA
CoVoST	38.5 (excl. CN)	speech translation
MRCR v2 (128k)	43.4	long-context recall

All Google's own numbers, on BF16 weights, unconfirmed.

A few of these claims are softer than the table makes them look.

OCR and dense text are exactly where encoder-free models have always wobbled, and nobody has tested Gemma 4 there yet. The tell is in Google's own guidance. They tell you to use the biggest visual-token budget for small text and documents, which is a polite way of admitting the cheap path costs you on the hard cases.

The audio worries me more. The only hands-on test I could find ran the smaller variant against Whisper and watched it loop, hallucinate, drop words, and take over five minutes on a clip Whisper Small nailed in seconds. That was the 2.3B model on Japanese, so it's one anecdote and I wouldn't weight it heavily. But Google's published audio scores (CoVoST and FLEURS) both leave out Chinese, and there's no Whisper comparison anywhere. Raw-waveform projection with no codec is the least-proven choice in the whole model, and I'd want to hear it before trusting it.

The 16GB story is the shakiest. The benchmarks are BF16. The 16GB fit needs 4-bit. And Google published no quantization-quality numbers for this generation, only a precedent from Gemma 3. "16GB" also means 16GB of VRAM or unified memory, not system RAM, and an 18GB MacBook reportedly already failed to load the 12B. The 256K context window is real but not free either. A transformer keeps a running memory for every token it's already read, the KV cache, and it grows as the conversation does. None of that is in the 16GB figure. And the model scores 43.4 on a 128k recall test, so it's already missing most of a context that's well short of its own ceiling.
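To see how the KV cache scales, here is the textbook formula for a plain transformer: two tensors (keys and values) per layer, per token. Every config value below is hypothetical, since I don't have Gemma 4 12B's layer count or head shapes, and the model card's hybrid local/global attention should make the real cache smaller than this. Read it as the shape of the growth and not as a measurement.

```python
def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Plain full-attention KV cache: keys plus values, every layer, every token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

# Hypothetical config for illustration only, NOT Gemma 4 12B's real shape.
HYPOTHETICAL = dict(layers=48, kv_heads=8, head_dim=128)

for tokens in (8_192, 32_768, 131_072, 262_144):
    size = kv_cache_gb(tokens, **HYPOTHETICAL)
    print(f"{tokens:>7} tokens: {size:5.1f} GB")
```

The growth is linear in context length, so whatever the true per-token cost is, a 256K window costs 32 times what an 8K one does. That's the number the "16GB" headline leaves out.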

"Cheaper because encoder-free" has nothing behind it either. Google published no training compute, no token count, no data volume. The encoder-free literature is consistent that the work shifts onto a heavier backbone and more data, which means deleting the encoder at inference time doesn't necessarily save you anything at training time. If anything the theory points the other way.

The one independent capability test I found was single-prompt and vibes-grade. Gemma 4 won on speed and lost on coding and translation to Qwen 3.6 27B. One data point, and right now it's still more outside evidence than the entire benchmark table has.

Does it actually matter?

Yeah, mostly.

The encoder-projector-LLM stack is a wart the field has been trying to remove for two years, and the reasons are good. Extra weight, an extra hop, an extra pipeline, and a frozen cap on what the model can see. Folding the modalities into one backbone is very likely the right long-term shape, especially for the local and edge cases where a single set of weights is the entire game. Gemma 4 12B is a genuine step there, and because it's open under a permissive license, you can go kick the tires yourself instead of trusting a chart.

The marketing oversells two things. It oversells how new this is, since it's Fuyu's 2023 idea plus Chameleon-era early fusion plus Google's own AudioPaLM lineage, integrated and shipped cleanly. And it oversells how sure we are, since the quality claims are vendor numbers on a four-day-old model in a regime with known soft spots that nobody outside Google has poked at.

The encoder stack is worth removing, encoder-free is a sound bet, and Gemma 4 12B is the cleanest production version so far. It sits on research older than the announcement implies, and the numbers are still entirely Google's word.

More boring than the launch post. Also the version I'd put money on.

What I'd test next

Everything above is checkable on a laptop, which is the good part. Roughly in order of how much each would move me:

Run it against Gemma 3 on the nasty stuff: a stack of receipts, dense tables, small-font screenshots, a multi-column PDF page. Gemma 3 has the encoder and Gemma 4 doesn't, so this is where an encoder-free tax shows up if there is one. Try both the small and large visual-token budgets and see what the big one actually buys.
Then audio, and not the clean read-aloud kind. Accented speech, background noise, a language that isn't English or Chinese, and two people talking over each other to see whether it can tell who said what. Measure transcription word error rate against Whisper Small. If raw-waveform projection can't clear a model that small, the encoder-free audio path has a problem.
Run the same eval set at BF16, 8-bit, and 4-bit on identical prompts. The whole "16GB" story lives on that 4-bit gap, and Google didn't publish it.
Load it and push the context toward 32k, 64k, 128k while watching resident memory rather than the advertised number. Cross-check the 43.4 recall score while you're there, and see whether it actually uses the window it loads.
Try a single fine-tune, since the cleanest practical upside is updating everything at once. LoRA it on a small task that mixes an image and text, then compare the effort to the last encoder-based VLM you tuned.
Measure MTP acceptance on your own traffic. Speculative decoding only pays off when the draft gets accepted, so check tokens per second with and without it on your prompts, not a demo.
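For the audio comparison, word error rate is simple enough to compute yourself: it's word-level edit distance divided by the length of the reference transcript. A minimal version follows. The lowercasing and whitespace splitting are my simplifications, and a serious comparison would also normalize punctuation and numbers the same way for both models.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitute, insert, delete) over reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edits needed to turn the reference words seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped
                curr[j - 1] + 1,     # extra word inserted
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

reference = "the quick brown fox jumps"
looped = "the quick fox jumps jumps"   # one dropped word, one repeated word
print(word_error_rate(reference, looped))
```

Run the same clips through Gemma 4 and Whisper Small, score both against a human transcript, and the looping and dropped words from that hands-on report show up as a number you can compare.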

If the encoder-free model holds up on OCR and audio, I'll revise upward happily. That's the point of it being open and small enough to run.

Sources

Google's announcement and docs

Introducing Gemma 4 12B: a unified, encoder-free multimodal model. The launch post: architecture framing, motivation, and the 16GB / Apache 2.0 / "nearing 26B MoE" claims.
Gemma 4 12B: The Developer Guide. The concrete numbers: 35M vision embedder vs 27 ViT layers, 48×48 patches, factorized x/y position lookup, 16 kHz / 40 ms / 640-sample audio frames, 12 conformer layers removed, decoder reused from Gemma 4 31B Dense.
google/gemma-4-12B model card. 11.95B params, 256K context, hybrid local/global attention, and the self-reported benchmark table.

Independent technical analysis

A Visual Guide to Gemma 4 12B. Maarten Grootendorst on the projection arithmetic, the ~3,840 hidden size, the factorized position matrices, and the "encoders really are gone" read.
Google DeepMind Releases Gemma 4 12B. MarkTechPost: no feature extraction, no conformer layers, and full benchmarks not released at launch.
What is Gemma 4 12B?. Per-layer embeddings, shared KV cache, and a ~8GB 4-bit footprint once runtime overhead is counted.

The conventional multimodal stack

CLIP · SigLIP · LLaVA · Flamingo. Encoders, projectors, prefix tokens vs gated cross-attention.
Gemma 3 Technical Report. Gemma 3's frozen 400M SigLIP encoder, 896×896 to 256 image tokens.
Whisper · Qwen2-Audio. The audio encoder plus adapter pattern.
Qwen2-VL · Qwen3-VL. Modern encoder-based VLMs still riding on SigLIP-class encoders.

Encoder-free / early-fusion precedent

Fuyu-8B (Adept, Oct 2023). The direct precedent: raw image patches linearly projected into a decoder-only transformer, no encoder.
Chameleon (Meta, May 2024). Early-fusion tokens, and the training instability that came with them.
EVE. "Encoder-free is viable but hard": slow convergence, and closing the gap by distilling from a vision encoder.
Emu3 · SOLO · Mono-InternVL. Single-transformer multimodal models.
AudioPaLM (Google, 2023). Audio and text in one decoder-only model, predating Gemma 4.

Skeptical / hands-on

Hacker News discussion. "That's technically encoding," the BF16-vs-4-bit gap, and an 18GB Mac failing to load the 12B.
Gemma-4 audio vs Whisper, hands-on. Looping, hallucination, and latency on the smaller variant.
Gemma 4 12B vs Qwen 3.6 27B. Faster, but lost coding and translation quality.