> To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory.
> (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency.
(emphasis mine)
This sounds really cool. The fact that one generation "attends" to the other is really interesting. I'm curious if this would hold for other modalities. I'm thinking of coding-specific applications, where things can change once something is generated. My hunch is that coding would benefit a lot from this approach, because the "manual" way of writing code often resembles diffusion more than autoregressive generation (that is, we often edit something here, then because we did that we have to import something, then change something there, then that leads to further changes, etc.).
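For concreteness, here's roughly how I picture the "bidirectional interaction throughout the denoising trajectory" bit: a minimal MaskGIT-style sketch over one joint text+image token sequence, where both halves get iteratively unmasked together. Everything here (the model call, the unmasking schedule, the names) is my own guess, not the paper's actual code:

```python
import torch

def parallel_denoise(model, text_ids, image_ids, mask_id, steps=20):
    """Both segments start (partly) masked and are refined together: at every
    step the prediction for each text position sees the current image tokens
    and vice versa, instead of text being finalized before the image starts.
    Assumes 1-D token tensors (no batch dim) to keep the sketch small."""
    seq = torch.cat([text_ids, image_ids], dim=-1)   # one joint sequence [L]
    text_len = text_ids.shape[-1]
    for step in range(1, steps + 1):
        logits = model(seq)                          # [L, vocab], full bidirectional attention
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        masked = seq == mask_id
        # unmask the most confident still-masked positions, finishing by the last step
        k = int(masked.sum().item() * step / steps)
        if k > 0:
            idx = conf.masked_fill(~masked, -1.0).topk(k).indices
            seq = seq.scatter(-1, idx, pred.gather(-1, idx))
    return seq[:text_len], seq[text_len:]            # denoised text, denoised image
```

The appeal for coding is exactly that iterative refinement: nothing is "committed" early, so a late change in one segment can still reshape the other.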
For now coding seems to benefit a lot from <thinking> -> <coding> -> <env_feedback> -> <reflexion> -> <thinking> -> <coding>, but this seems at a glance to be shoehorned in for autoregressive generation... GPT-5 in particular seems to be better at this, with multiple "tool calls" interleaved in its thinking sessions. I wonder if this would get better with the parallel denoising thing proposed here, where both thinking and coding are done in parallel, and one can "attend" to the other. Add some feedback (linters, compilers, LSPs, tests, etc.) and this can go places. If it works.
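On the "add some feedback" part: since ParaRL applies rewards along the trajectory, the cheapest coding analogue I can think of is scoring intermediate code drafts. A trivially runnable stand-in (my own toy reward, nothing to do with the paper's actual reward) could be a bare syntax check:

```python
def draft_reward(code_draft: str) -> float:
    """Crude trajectory reward for a partially-denoised code draft:
    1.0 if it at least compiles, 0.0 otherwise. Linter output, LSP
    diagnostics, or test results could slot in the same way."""
    try:
        compile(code_draft, "<draft>", "exec")  # syntax check only, never executed
        return 1.0
    except SyntaxError:
        return 0.0

print(draft_reward("def add(a, b):\n    return a + b\n"))  # 1.0
print(draft_reward("def add(a, b) return a + b"))          # 0.0
```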
Diffusion text models aren't new, I've made them at home. Also, plenty of frontier models are good at tool calling; GPT-5 has just been trained to do it more, so it appears to do better at coding exercises with Codex/IDEs.
If you haven't tried an agentic IDE such as Cursor yet, or at least an extension such as Copilot, I would recommend checking them out and trying out Anthropic's models as well.
Do you have any examples / papers where they do the parallel thing proposed here? I've tried Google's diffusion coding model, but AFAICT they don't do parallel thinking & coding. It seems to just take a prompt and output code.
What's cool with this thinking & generation in parallel is that one can attend to the other. So you're not limited to "prompt influences code": the prompt can influence both thinking and code, code can influence thinking, and thinking can influence code.
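A toy way to see the difference (segment lengths made up): in a causal/autoregressive setup an earlier segment never attends to a later one, while joint denoising is all-to-all at every step:

```python
import torch

n_prompt, n_think, n_code = 4, 3, 3                       # made-up segment lengths
n = n_prompt + n_think + n_code

causal = torch.tril(torch.ones(n, n, dtype=torch.bool))   # autoregressive mask
parallel = torch.ones(n, n, dtype=torch.bool)             # joint denoising: all-to-all

first_think, last_code = n_prompt, n - 1
print(causal[first_think, last_code].item())    # False: thinking never sees later code
print(parallel[first_think, last_code].item())  # True: thinking conditions on code too
```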
Be aware that the project page has the wrong Arxiv link at the time of writing. This is the correct one:
https://arxiv.org/abs/2511.09611
Interesting approach and a very readable paper.
> We provide two varients of MMaDA-Parallel with different tokenizers. MMaDA-Parallel-A is trained with tokenizer Amused-VQ, and MMaDA-Parallel-M is trained with tokenizer Magvitv2.
tyfeld/MMaDA-Parallel-A: https://huggingface.co/tyfeld/MMaDA-Parallel-A/tree/main
tyfeld/MMaDA-Parallel-M: https://huggingface.co/tyfeld/MMaDA-Parallel-M/tree/main
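If anyone wants to poke at the weights, huggingface_hub can pull either repo; how you then load and run them depends on the MMaDA-Parallel codebase itself, which I haven't dug into:

```python
from huggingface_hub import snapshot_download

# Downloads the full repo snapshots to the local HF cache and returns the paths.
path_a = snapshot_download("tyfeld/MMaDA-Parallel-A")  # Amused-VQ tokenizer variant
path_m = snapshot_download("tyfeld/MMaDA-Parallel-M")  # Magvitv2 tokenizer variant
print(path_a, path_m)
```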
This looks awesome. Although from a UX perspective it might not be as good as streaming token by token for text generation use cases. However, for image gen and editing - 100%.