> wouldn’t that imply that you’re trying to converge on a physically accurate world model?
I'm not the CEO or associated with them at all, but yes, this is what most of these "world model" researchers are aiming for. As a researcher myself, I do not think this is the way to develop a world model and I'm fairly certain that this cannot be done through observations alone. I explain more in my response to the CEO[0]. This is a common issue is many ways that ML is experimenting, and you simply cannot rely on benchmarks to get you to AGI. Scaling of parameters and data only go so far. If you're seeing slowing advancements, it is likely due to over reliance on benchmarks and under reliance on what benchmarks intend to measure. But this is a much longer conversation (I think I made a long comment about it recently, I can dig up).
That felt so wrong, AND someone is cheating here. This felt really suspicious...
I got to the graffiti world and there were some stairs right next to me, so I started going up them. It felt like I was walking forward and the stairs were pushing under me until I just got stuck. So I turned to go back down, and halfway around everything morphed and I ended up back down at the ground level where I originally was. I was teleported. That's why I feel like something is cheating here. If we had mode collapse, I'm not sure how we'd be able to completely recover the entire environment. Not unless the model is building mini worlds with boundaries. It was like the out-of-bounds teleportation you get in some games, but way more fever-dream-like. That's not what we want from these systems: we don't want to just build a giant, poorly compressed videogame, we want continuous generation. If you have mode collapse and recover, it should recover to somewhere new, not where you've been. At least this is what makes me highly suspicious.
Yes, the thing that got me was I went through the channels multiple times (multiple browser sessions). The channels are the same every time (the numbers don't align to any navigation though - flip back and forth between two numbers and you'll just hit a random channel every time - don't be fooled by that). Every object is in the same position and the layout is the same.
What makes this AI-generated, as opposed to just rendering a generated 3D scene?
It may seem impressive to have no glitches (often in AI-generated works you can turn around a full rotation and what's in front of you isn't what was there originally), but here it just acts like a fully modelled 3D scene rendered at low resolution. I can't even walk outside of certain bounds, which doesn't make sense if this really is generated on the fly.
This needs a lot of skepticism, and I'm surprised you're the first to comment on the lack of actual generation here. It's a series of static scenes rendered at low fidelity with limited bounds.
OK, playing with this more, there are very subtle differences between sessions. As in, there is some hallucination here, with certain small differences.
I think what's happening is that this is AI-generated, but it is very, very overfitted to real-world 3D scenes. The AI is almost rendering exactly a real-world scene and not much more. You can't travel out of bounds, or the model stops working, since it's so overfitted to these scenes. The overfitting solves hallucinations, but it also makes it almost indistinguishable from pre-modelled 3D scenes.
I think the most likely explanation is that they trained a diffusion WM (like DIAMOND) on video rollouts recorded from within a 3D scene representation (like NeRF/GS), with some collision detection enabled.
This would explain:
1. How collisions / teleportation work and why they're so rigid (the WM is mimicking hand-implemented scene-bounds logic)
2. Why the scenes are static and, in the case of should-be-dynamic elements like water/people/candles, blurred (the WM is mimicking artifacts from the 3D representation)
3. Why they are confident that "There's no map or explicit 3D representation in the outputs. This is a diffusion model, and video in/out" https://x.com/olivercameron/status/1927852361579647398 (the final product is indeed a diffusion WM trained on videos, they just have a complicated pipeline for getting those training videos)
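(Illustrative aside: a minimal Python sketch of the pipeline hypothesized above. This is speculation, not anything Odyssey has confirmed, and the scene/policy objects and their methods are made-up stand-ins.)

    # Hypothesized data pipeline (speculative): render action-conditioned rollouts
    # from a 3D scene representation (NeRF / Gaussian splats) with scripted
    # collision checks, then train a diffusion world model on the resulting
    # (frames, actions) pairs. All objects and methods here are hypothetical.
    EPISODE_LEN = 300  # frames per rollout, arbitrary

    def collect_training_rollouts(scene, policy, num_rollouts):
        rollouts = []
        for _ in range(num_rollouts):
            state, frames, actions = scene.spawn(), [], []
            for _ in range(EPISODE_LEN):
                action = policy(state)
                candidate = state.apply(action)
                if not scene.collides(candidate):     # hand-implemented scene bounds
                    state = candidate
                frames.append(scene.render(state))    # NeRF/GS render at current pose
                actions.append(action)
            rollouts.append((frames, actions))
        return rollouts

A diffusion world model trained on such pairs would naturally inherit both the renderer's artifacts and the scripted collision behavior, which is the parent comment's point.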
Odyssey Systems is six months behind way more impressive demos. They're following in the footsteps of this work:
- Open Source Diamond WM that you can run on consumer hardware [1]
- Google's Genie 2 (way better than this) [2]
- Oasis [3]
[1] https://diamond-wm.github.io/
[2] https://deepmind.google/discover/blog/genie-2-a-large-scale-...
[3] https://oasis.decart.ai/welcome
There are a lot of papers and demos in this space. They have the same artifacts.
All of this is really great work, and I'm excited to see great labs pushing this research forward.
From our perspective, what separates our work is two things:
1. Our model is able to be experienced by anyone today, and in real-time at 30 FPS.
2. Our data domain is real-world, meaning learning life-like pixels and actions. This is, from our perspective, more complex than learning from a video game.
Is it possible that this behavior results from training on Google Maps or something similar? I tried to walk off a bridge and you get completely stuck, which is the only reason I can think of for that, other than not having first-person video views of people walking off bridges.
Hi! CEO of Odyssey here. Thanks for giving this a shot.
To clarify: this is a diffusion model trained on lots of video, that's learning realistic pixels and actions. This model takes in the prior video frame and a user action (e.g. move forward), with the model then generating a new video frame that resembles the intended action. This loop happens every ~40ms, so real-time.
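(Illustrative aside: a minimal Python sketch of that loop. The names model, get_user_action, and display are hypothetical stand-ins, not Odyssey's actual API.)

    import time

    TARGET_DT = 0.04  # ~40 ms per step, as described above

    def run_interactive_video(model, first_frame, get_user_action, display):
        """Action-conditioned autoregressive rollout: generate a frame, show it,
        then feed it back in as context for the next step."""
        frame = first_frame
        while True:
            t_start = time.time()
            action = get_user_action()                        # e.g. "move_forward"
            frame = model.generate_next_frame(frame, action)  # diffusion sampling step(s)
            display(frame)                                    # stream to the viewer
            # Autoregression: the generated frame becomes the next input.
            time.sleep(max(0.0, TARGET_DT - (time.time() - t_start)))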
The reason you're seeing similar worlds with this production model is that one of the greatest challenges of world models is maintaining coherence of video over long time periods, especially with diverse pixels (i.e. not a single game). So, to increase reliability for this research preview—meaning multiple minutes of coherent video—we post-trained this model on video from a smaller set of places with dense coverage. With this, we lose generality, but increase coherence.
We share a lot more about this in our blog post here (https://odyssey.world/introducing-interactive-video), and share outputs from a more generalized model.
> One of the biggest challenges is that world models require autoregressive modeling, predicting future state based on previous state. This means the generated outputs are fed back into the context of the model. In language, this is less of an issue due to its more bounded state space. But in world models—with a far higher-dimensional state—it can lead to instability, as the model drifts outside the support of its training distribution. This is particularly true of real-time models, which have less capacity to model complex latent dynamics.
> To improve autoregressive stability for this research preview, what we’re sharing today can be considered a narrow distribution model: it's pre-trained on video of the world, and post-trained on video from a smaller set of places with dense coverage. The tradeoff of this post-training is that we lose some generality, but gain more stable, long-running autoregressive generation.
> To broaden generalization, we’re already making fast progress on our next-generation world model. That model—shown in raw outputs below—is already demonstrating a richer range of pixels, dynamics, and actions, with noticeably stronger generalization.
Let me know any questions. Happy to go deeper!
This is super cool. I love how you delivered this as an experience, with very cool UI, background music, etc. It was a real trip. Different music, and also an ambient mode where the atmosphere is designed for the particular world, would level this up artistically. But I gotta say, I haven't been this intrigued by anything like this online in a while.
Why are you going all in on world models instead of basing everything on top of a 3D engine that could be manipulated / rendered with separate models? If a world model was truly managing to model a manifold of a 3D scene, it should be pretty easy to extract a mesh or SDF from it and drop that into an engine where you could then impose more concrete rules or sanity check the output of the model. Then you could actually model player movement inside of the 3D engine instead of trying to train the world model to accept any kind of player input you might want to do now or in the future.
Additionally, curious about what exactly the difference between the new mode of storytelling you’re describing and something like a crpg or visual novel is - is your hope that you can just bake absolutely everything into the world model instead of having to implement systems for dialogue/camera controls/rendering/everything else that’s difficult about working with a 3D engine?
Great questions!
> Why are you going all in on world models instead of basing everything on top of a 3D engine that could be manipulated / rendered with separate models?
I absolutely think there's going to be super cool startups that accelerate film and game dev as it is today, inside existing 3D engines. Those workflows could be made much faster with generative models.
That said, our belief is that model-imagined experiences are going to become a totally new form of storytelling, and that these experiences might not be free to be as weird and whacky as they could be because of heuristics or limitations in existing 3D engines. This is our focus, and why the model is video-in and video-out.
Plus, you've got the very large challenge of learning a rich, high-quality 3D representation from a very small pool of 3D data. The volume of 3D data is just so small, compared to the volumes generative models really need to begin to shine.
> Additionally, curious about what exactly the difference between the new mode of storytelling you’re describing and something like a crpg or visual novel
To be clear, we don't yet know what shape these new experiences will take. I'm hoping we can avoid an awkward initial phase where these experiences resemble traditional game mechanics too much (although we have much to learn from them), and just fast-forward to enabling totally new experiences that just aren't feasible with existing technologies and budgets. Let's see!
> is your hope that you can just bake absolutely everything into the world model instead of having to implement systems for dialogue/camera controls/rendering/everything else that’s difficult about working with a 3D engine?
Yes, exactly. The model just learns better this way (instead of breaking it down into discrete components) and I think the end experience will be weirder and more wonderful for it.
> Plus, you've got the very large challenge of learning a rich, high-quality 3D representation from a very small pool of 3D data. The volume of 3D data is just so small, compared to the volumes generative models really need to begin to shine.
Isn’t the entire aim of world models (at least, in this particular case) to learn a very high-quality 3D representation from 2D video data? My point is that if you manage to train a navigable world model for a particular location, that model has managed to fit a very high-quality 3D representation of that location. There’s lots of research on NeRFs demonstrating how you can extract these 3D scenes as meshes once a model has managed to fit them. (NeRFs are another great example of learning a high-quality 3D representation from sparse 2D data.)
> That said, our belief is that model-imagined experiences are going to become a totally new form of storytelling, and that these experiences might not be free to be as weird and whacky as they could be because of heuristics or limitations in existing 3D engines. This is our focus, and why the model is video-in and video-out.
There’s a lot of focus in the material on your site about the models learning physics by training on real world video - wouldn’t that imply that you’re trying to converge on a physically accurate world model? I imagine that would make weirdness and wackiness rather difficult
> To be clear, we don't yet know what shape these new experiences will take. I'm hoping we can avoid an awkward initial phase where these experiences resemble traditional game mechanics too much (although we have much to learn from them), and just fast-forward to enabling totally new experiences that just aren't feasible with existing technologies and budgets. Let's see!
I see! Do you have any ideas about the kinds of experiences that you would want to see or experience personally? For me it’s hard to imagine anything that substantially deviates from navigating and interacting with a 3D engine, especially given it seems like you want your world models to converge to be physically realistic. Maybe you could prompt it to warp to another scene?
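> wouldn't that imply that you're trying to converge on a physically accurate world model?
I'm not the CEO or associated with them at all, but yes, this is what most of these "world model" researchers are aiming for. As a researcher myself, I do not think this is the way to develop a world model, and I'm fairly certain that this cannot be done through observations alone. I explain more in my response to the CEO[0]. This is a common issue in many of the ways ML experiments: you simply cannot rely on benchmarks to get you to AGI. Scaling of parameters and data only goes so far. If you're seeing slowing advancements, it is likely due to over-reliance on benchmarks and under-reliance on what those benchmarks intend to measure. But this is a much longer conversation (I think I made a long comment about it recently; I can dig it up).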
[0] https://news.ycombinator.com/item?id=44147777
Does it store a model of the world (like, some memory of the 3D structure) that goes beyond the pixels that are shown?
Or is the next frame a function of just the previous frame and the user input? Like (previous frame, input) -> next frame
I'm asking because, if some world has two distinct locations that look exactly the same, will the AI distinguish them, or will they get coalesced into one location?
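(Illustrative aside: the two formulations being asked about, written as hypothetical function signatures. Neither is claimed to be what Odyssey actually does.)

    from typing import Any, Callable, Tuple

    Frame = Any
    Action = Any
    State = Any

    # (a) Memoryless: the next frame depends only on the last frame and the input.
    #     Two distinct places that happen to look identical would be coalesced.
    def next_frame_memoryless(model: Callable[[Frame, Action], Frame],
                              prev_frame: Frame, action: Action) -> Frame:
        return model(prev_frame, action)

    # (b) Stateful: a persistent latent (or a window of past frames) rides along,
    #     which can keep identical-looking locations distinct.
    def next_frame_stateful(model: Callable[[State, Frame, Action], Tuple[Frame, State]],
                            state: State, prev_frame: Frame, action: Action) -> Tuple[Frame, State]:
        return model(state, prev_frame, action)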
> one of the greatest challenges of world models is maintaining coherence of video over long time periods
To be honest most of the appeal to me of this type of thing is the fact that it gets incoherent and morph-y and rotating 360 degrees can completely change the scenery. It's a trippy dreamlike experience whereas this kind of felt like a worse version of existing stuff.
What's the business model?
Thanks for the reply and adding some additional context. I'm also a vision researcher, fwiw (I'll be at CVPR if you all are).
(Some of this will be for benefit of other HN non-researcher readers)
I'm hoping you can provide some more. Is this trained on single videos moving through these environments, where the camera is not turning? What I'm trying to understand is what is being generated vs. what is being recalled.
It may be a more contentious view, but I do not think we're remotely ready to call these systems "world models" if they are primarily performing recall. Maybe this is the bias from an education in physics (I have a degree), but world modeling is not just about creating consistent imagery, but actually being capable of recovering the underlying physics of the videospace (as opposed to reality which the videos come from). I've yet to see a demonstration of a model that comes anywhere near this or convinces me we're on the path towards this.
The key difference here is: are we building Doom, which has system requirements of 100MB of disk and 8MB of RAM with minimal computation, or are we building an extremely decompressed version that requires 4GB of disk and a powerful GPU to run only the first level, and can't even get critical game dynamics right, like shooting the right enemy (GameNGen)?
The problem is not the ability to predict future states based on previous ones, the problem is the ability to recover /causal structures/ from observation.
Critically, a p̶h̶y̶s̶i̶c̶s̶ world model is able to process a counterfactual.
Our video game is able to make predictions, even counterfactual predictions, with its engine. Of course, this isn't generated by observation and environment interaction; it is generated through directed programming and testing (where the testing includes observing and probing the environment). If the goal were just that, then our diffusion models would comparatively be a poor contender. It's the wrong metric. The coherence is a consequence of the world modeling (i.e. the game engine), but coherence can also be developed from recall. Recall alone will be unable to make a counterfactual.
Certainly we're in the research phase and need to make tons of improvements, but we can't make these improvements if we're blindly letting our models cheat the physics and are only capable of picking up "user clicks fire" correlating with "monster usually dies when user shoots". LLMs have tons of similar problems with making such shortcuts, and the physics will tell you that you are not going to be able to pick up such causal associations without some very specific signals to observe. Unfortunately, causality cannot be determined from observation alone (a well-known physics result![0]). You end up with many models that generate accurate predictions, and these become indistinguishable without careful factorization, probing, and often the careful integration of various other such models. It is this much harder and more nuanced task that is required of a world model, rather than memory.
Essentially, do we have "world models" or "cargo cult world models" (recall or something else)?
That's the context of my data question. To help us differentiate the two. Certainly the work is impressive and tbh I do believe there is quite a bit of utility in the cargo cult setting, but we should also be clear about what is being claimed and what isn't.
I'm also interested in how you're trying to address the causal modeling problem.
[0] There is much discussion on the Duhem-Quine thesis, which is a much stronger claim than I stated. There's the famous Michelson-Morley experiment, which actually did not rule out an aether, but rather only showed that it had no directionality. Or we could even use the classic Heisenberg uncertainty principle, which revolutionized quantum mechanics by showing that there are things that are unobservable, leading to Schrödinger's cat (and some weird hypotheses of multiverses). And we even have string theory, where the main gripe remains that it is indistinguishable from other TOEs because the differences in their predictions are non-observable.
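(Illustrative aside: a toy Python example of the counterfactual point made a few paragraphs up; it is mine, not the commenter's. Two made-up data-generating worlds produce the same logged gameplay, yet disagree about what happens if the player does not fire.)

    import random

    def observe_world_A():
        """Causal world: firing is what kills the monster."""
        fire = random.random() < 0.7
        dies = fire
        return fire, dies

    def observe_world_B():
        """Confounded world: a lunge makes the player fire AND springs a lethal trap."""
        lunge = random.random() < 0.7
        fire = lunge
        dies = lunge
        return fire, dies

    # Logged (fire, dies) pairs from A and B have identical joint distributions,
    # so a model fit purely to observations cannot tell the worlds apart. Under
    # the intervention do(fire=False), A predicts no deaths, while B still
    # predicts a death whenever a lunge occurs; answering that counterfactual is
    # what "recovering causal structure" buys you over recall.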
It’s essentially this paper, but applied to video recordings of a bunch of different real-world locations instead of Counter-Strike maps. Each channel is just changing the location.
https://diamond-wm.github.io/
> Not unless the model is building mini worlds with boundaries.
Right. I was never able to get very far from the starting point, and kept getting thrown back to the start. It looks like they generated a little spherical image, and they're able to extrapolate a bit from that. Try to go through a door or reach a distant building, and you don't get there.
It feels like interpolated Street View imagery. There is one scene with two people between cars in a parking lot. It is the only one I have found that has objects you would expect to change over time. When exploring the scene, those people sometimes disappear altogether and sometimes teleport around, as they would when exploring Street View panoramas. You can clearly tell when you are switching between photos taken a few seconds apart.
Same. Got to the corner of the house, turned around, and got teleported back to the starting point.
I call BS.
I mean... https://news.ycombinator.com/item?id=44121671 informed you of exactly why this happens a whole hour before you posted this comment, and the creator is chatting with people in the comments. I get that you feel personally cheated, but I really don't think anyone was deliberately trying to cheat you. In light of that, your comment (and I only say this because it's the top comment on this post) is effectively a stereotypical "who needs Dropbox" level of shallow dismissal.
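> I mean... informed you of exactly why this happens a whole hour before you posted this comment, and the creator is chatting with people in the comments.
I apologize for replying with my experience and not reading every comment before posting. This was not the top comment when I wrote mine.
> and the creator is chatting with people in the comments.
This had yet to occur when I left my comment. I looked at their comments, and it appears I'm the first person they responded to in this thread. Took me a bit to respond, but hey, I've got stuff to do too.
> is effectively a stereotypical "who needs Dropbox" level of shallow dismissal
I'll loop you into my response to the creator, which adds context to my question. This is not a "who needs Dropbox" so much as a "why are you calling Dropbox 'storing data without taking any disk space'". Sure, it doesn't take your disk space, but that's not no disk space... Things are a bit clearer now, but I've got to work with the context of what's given to me.
https://news.ycombinator.com/item?id=44147777
> and I only say this because it's the top comment on this post
I appreciate you holding people to high standards. No issues there. It's why I made my comment in the first place! Hopefully my other comment clarifies what was definitely lacking in the original.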
Note that it isn't being created from whole cloth, it is trained on videos of the places and then it is generating the frames:
"To improve autoregressive stability for this research preview, what we’re sharing today can be considered a narrow distribution model: it's pre-trained on video of the world, and post-trained on video from a smaller set of places with dense coverage. The tradeoff of this post-training is that we lose some generality, but gain more stable, long-running autoregressive generation."
https://odyssey.world/introducing-interactive-video
I recognized the Santa Cruz Beach Boardwalk channel. It was exactly as I remember.
Could probably be (semi?)automated to run on 3D models of places that don't exist. Even AI-built 3D models.
The paper they’re basing this off already does this.
https://diamond-wm.github.io/
What's the point? You already have the 3d models. If you want an interactive video just use the 3d models.
Well, that felt like entering a dream on my phone. Fuzzy virtual environments generated by "a mind" based on its memory of real environments...
I wonder if it'd break our brains more if the environment changes as the viewpoint changes, but doesn't change back (e.g. if there's a horse, you pan left, pan back right, and the horse is now a tiger).
I kept expecting that to happen, but it apparently has some mechanism to persist context outside the user’s FOV.
In a way, that almost makes it more dreamlike, in that you have what feels like high local coherence (just enough not to immediately tip you off that it’s a dream) that de-coheres over time as you move through it.
Fascinatingly strange demo.
Our minds are used to that: dreams
This is pretty much the same thing as those models that baked dust2 into a diffusion model then used the last few frames as context to continue generating - same failure modes and everything.
https://diamond-wm.github.io/
This is similar to the Minecraft version of this from a few months back [0], but it does seem to have a better time keeping a memory of what you've already seen, at least for a bit. Spinning in circles doesn't lose your position quite as easily, but I did find that exiting a room and then turning back around and re-entering leaves you with a totally different room than you exited.
[0] Minecraft with object impermanence (229 points, 146 comments) https://news.ycombinator.com/item?id=42762426
This seems like a staggeringly inefficient way to develop what is essentially a FPS engine.
Only at first glance. It can easily render things that would be very hard to implement in an FPS engine.
What AI can dream up in milliseconds could take hundreds of human hours to encode using traditional tech (meshes, shaders, ray tracing, animation, logic scripts, etc.), and it still wouldn't look as natural and smooth as AI renderings — I refer to the latest developments in video synthesis like Google's Veo 3. Imagine it as a game engine running in real time.
This thing looks like a complete joke next to the Unreal Engine.
Why do you think it is so hard, even for technical people here, to make the inductive leap on this one? Is it that close to magic? The AI is rendering pillars and also determining collision detection for them. As in, no one went in there, selected a bunch of pillars, and marked them as barriers. That means that, in the long run, I'll be able to take some video or pictures of the real world and have it become a game level.
Because that's been a thing for years already, and it works way better than this research does.
Unreal Engine 5 has been demoing these features for a while now; I heard about it in early 2020, IIRC, but techniques like Gaussian splatting predate it.
I have no experience with either of these, but I believe MegaScans and RealityCapture are two examples doing this. And the last Nanite demo touched on it, too.
I'm sorry, what's a thing? Unreal Engine 5 does those things with machine learning? Imagine someone shows me Claude generating a full React app, and I say, "well, you see, React apps have always been a thing." The thing we're talking about is AI, nothing else. That there is no other thing is the whole point of the AI hype.
What they meant is that 3D scanning real places, and translating them into 3D worlds with collision already exists, and provides much, much better results than the AI videos here. Additionally, it does not need what is likely hours of random footage wandering in the space, just a few minutes of scans.
I think an actual 3D engine with AI that can make new high quality 3D models and environments on the fly would be the pinnacle. And maybe even add new game and control mechanics on the fly.
Yeah. Why throw away a perfectly fine 3d engine
I feel like we're so close to remaking the classic Rob Schneider full motion video game "A Fork in the Tale"
https://m.youtube.com/watch?v=YXPIv7pS59o
it’s super cool. I keep thinking it kind of feels like dream logic. It looks amazing at first but I’m not sure I’d want to stay in a world like that for too long. I actually like when things have limits. When the world pushes back a bit and gives you rules to work with.
Doesn't it have rules? I couldn't move past a certain point and hitting a wall made you teleport. Maybe I was just rationalizing random events, though.
I kept getting teleported to the start when I picked the world channel that showed some sort of well-lit catacombs.
Eventually managed to leave the first room, but then got teleported somewhere else.
Sounds like something a deity would say before creating a universe such as ours.
related (and quite cool) -- minecraft generated on-the-fly which you can interact with: https://news.ycombinator.com/item?id=42014650
Hi HN, I hope you enjoy our research preview of interactive video!
We think it's a glimpse of a totally new medium of entertainment, where models imagine compelling experiences in real-time and stream them to any screen.
Once you've taken the research preview for a whirl, you can learn a lot more about our technical work behind this here (https://odyssey.world/introducing-interactive-video).
I found an interesting glitch where you could never actually reach a parked car; as you moved forward, the car also moved. It looked a lot like traffic moving through Google Street View.
Yeah, I found the same thing. Cars would disappear in front of me, then I reached the end of the world and it reset me. I'm not sure I believe this is AI rather than some crappy Street View interface.
Can you say where the underground cellar with the red painting is? It's compelling.
This is cool. I think there is a good chance that this is the future of videogames.
Do you personally feel like scaling this approach is going to be the end game for generating navigable worlds?
I.e., as opposed to first generating a 3D environment and then doing some sort of img2img on top of it?
This is amazing! I think AI will completely replace the way we currently create and consume media. A well-written story, paired with an amazing graphics-generation AI, can be both interactive and surprising every time you watch it again.
Wait till the bills arrive.
In playing with this, it was unclear to me how this differs from a pre-programmed 3D world with bitmapped walls. What is the AI adding that I wouldn't get otherwise?
Seems like it ingested google street view
Would be more interesting with people in it.
I agree! Check out outputs from our next-gen world model here (https://odyssey.world/introducing-interactive-video), featuring richer pixels and dynamics.
I do not get the "interactive" part. I expect to be able to manipulate objects, or at least move them, you know, "interact" with the "video". Right now it is a cheap walking simulator, without narration or any plot. Disappearing lamp posts when you get near them also should not be considered an interaction. Maybe you should take a slightly different approach to interactive videos: say, build a tech review video for some gadget or device, where the viewer could interrupt the host using voice and ask questions, skip to some part, have something repeated in more detail, have a concept explained, or even compare it to other devices.
Feels like the Mist (Myst?) game.
Am I the only one stuck with a black screen as the audio plays?
very cool - what was the hardest part of building this?
If I had to choose one, I'd easily say maintaining video coherence over long periods of time. The typical failure case of world models that attempt to generate diverse pixels (i.e. beyond a single video game) is that they degrade into a mush of incoherent pixels after 10-20 seconds of video.
We talk about this challenge in our blog post here (https://odyssey.world/introducing-interactive-video). There's specifics in there on how we improved coherence for this production model, and our work to improve this further with our next-gen model. I'm really proud of our work here!
> Compared to language, image, or video models, world models are still nascent—especially those that run in real-time. One of the biggest challenges is that world models require autoregressive modeling, predicting future state based on previous state. This means the generated outputs are fed back into the context of the model. In language, this is less of an issue due to its more bounded state space. But in world models—with a far higher-dimensional state—it can lead to instability, as the model drifts outside the support of its training distribution. This is particularly true of real-time models, which have less capacity to model complex latent dynamics. Improving this is an area of research we're deeply invested in.
In second place would absolutely be model optimization to hit real-time. That's a gnarly problem, where you're delicately balancing model intelligence, resolution, and frame-rate.
Thank you for this experience. Feels like you are exploring a dream.
I LOVE dreamy AI content. That stuff where everything turned into dogs for example.
As AI is maturing, we are slowly losing that in favor of boring realism and coherence.
Exploring babel’s library!
To me, this is evidence we're not in a simulation. Even with a gazillion H100's the model runs out of memory just (very roughly) simulating a 50'x50' space over just a few seconds.
This reminds me of the scapes in Diaspora (by Greg Egan).
I'm unable to navigate anywhere. I'm on a laptop with a touchscreen and a trackpad. I clicked, double clicked, scrolled, and tried everything I could think of and the views just hovered around the same spot.
Love the atmosphere.
It’s pointless to do this with real world places. Why not do it for TV shows or a photograph? You could walk around inside and explore the scenes.
Interactive ads and interactive porn are the AI killer apps we miss so much.
Now this is an Assassin’s Creed memory machine that I can get behind
going outside breaks the ai lol
For sure, but consider it a "first draft" of what this type of generative AI can do.
The resolution is extremely low. The website doesn't specify, but I'd guess it's only 160x120. Such a low resolution was necessary to render it in real time and maintain a reasonable frame rate. To try to hide the blurring a bit, they apply some filters to add scan lines and other effects to make it look like an old TV.
That said, I'd be surprised if anybody could gather the hardware to run this well at a usable resolution, let alone something like 1080p. That's literally over 100x the pixels of 160x120.
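(For scale: 1920 × 1080 = 2,073,600 pixels per frame, versus 160 × 120 = 19,200, so roughly a 108× increase in pixels that would need to be generated every ~40 ms.)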
I think this step towards a more immersive virtual reality can actually be dangerous. A lot of intellectual types might disagree, but I do think that creating such immersion is a dangerous thing, because it will reduce the value people place on the real world, and especially the natural world, making them even less likely to care if big corporations screw it up with biospheric degradation.
It seems like it has a high chance of leading to even more narcissism as well because we are reducing our dependence on others to such a degree that we will care about others less and less, which is something that has already started happening with increasingly advanced interactive technology like AI.
> I think this step towards a more immersive virtual reality can actually be dangerous
I don't think it's a step toward that; I think this is literally trained using already-existing techniques for generating more immersive virtual reality that take less compute, in order to produce a more computationally expensive and less accurate AI version.
At least, that's what every other demo of a real-time interactive AI world model has been, and they aren't trumpeting any clear new distinction.
This is why we never see any alien life. When they reach a sufficient level of technology, they realize the virtual/mental universe is much more compelling and fun than the boring rule-bound physical one.
Actually, it's that the aliens eventually get completely immersed in technology to the point where they just self-destruct due to the meaninglessness of the life they have created.
People were saying literally the exact same thing when those crappy VR headsets were all the rage. I think we're ok.
So your argument is that because it didn't work once, it won't work now? At some point a breakthrough is made, and in this case the breakthrough is a bad one.