Article originally appeared on Replicate.
This week I’m thinking about how multimedia AI models will lead to real-time interactive world generation, and how it’s the bull case for VR and the metaverse. I talked with fellow Replicant Mattt about it, and watched his talk (see Research Radar below), and I can’t get it out of my mind. (Editor’s note: Neither Mattt nor Replicate is responsible for the following conjectures)
Just this week: You can now fine-tune FLUX.1. Tavus launched their Conversational Video Interface, a “digital twin” API that looks like a person and does real-time video chat. Puppet-Master adds drag tokens to Stable Video Diffusion so you can close the door on a picture of a microwave. Sketch2Scene is a big Rube Goldberg-like project involving several models to get from a crude drawing to a fully playable game world, but it works.
Image generators based on FLUX.1 can do hands and text and all the other tells we used to rely on to spot AI images. It’s the worst this technology will ever be. And it shows a clear desire: people want the ability to generate worlds.
What does the next level up look like, once you have agents that can handle multi-step decisions?
Right now, if you’re a knowledge worker, your workflow might look like this: You direct your AI to write stuff. You see if you like its answer, copy and paste, maybe make some edits.
The other thing you do is imagine possible worlds and pick between them. That is, you plan, you have goals. You know which things to try, which text or images to copy and paste where.
Planning is also being automated, and your own choices are part of how. Every time you make a choice now, the machine records it. My code editor already does this: it has a special model trained to predict the next place your cursor will go, and what change it will make there.
Once these large models can plan actions reliably, we’ll ask them to complete long-running programs. You’ll ask your agent to do research and it will study the problem, hypothesize, figure out some test it can run, run the test, and write a report, before giving that back to you. This starts to look more like a person than a program.
We need interfaces to interact with long-running, smart, people-like things. We will want them to look people-ish and live in a world-ish place. We have that technology on the way, with huge amounts of money behind the Metaverse, Apple Vision Pro, and similar plays.
Virtual people will work in your editor, on Zoom, in AR and in VR. They will scale up and down in reality, have more or less reality fluid applied to them. They will interact, and large ones will teach small ones, and their worlds will be as real as they need to be for all these agents to interact. Maybe not “Earth real” but “video game real” at least. They’ll have their own physics, which apply to everyone, even if that includes flight or fireballs or whatever.
Infinite worlds, spawned by raw computing power. We’ll parallelize everything: experiments, researchers, entire realities. We will seek secret knowledge for better futures. Medical breakthroughs. Clean energy. New ways to love, hate, worship each other. It will be beautiful and terrible. We will explore every direction, unlocking new worlds and new ways to be human.
The metaverse is also the multiverse. The portal to all the other worlds will open, and humans will explore deeper and deeper inside. We will bring treasures out to the real world as well. But the range of possible virtual worlds is infinitely broader.
We are at the very beginning of that era now. The opening of the colossal cavern.
Do you dare to delve?
You can now fine-tune the FLUX.1 image generation model on Replicate. Upload a few images and teach a model to match your style, or character, or anything you can imagine.
Fine-tuning FLUX.1 is straightforward: upload 12-20 diverse images, choose a trigger word, and let our system do the rest. In about 30 minutes, you’ll have a custom model that can generate images featuring your unique style or your specific subject.
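If you’d rather script it than click through the web form, the Replicate Python client can kick off the same job. Here’s a minimal sketch, assuming the ostris/flux-dev-lora-trainer model with input_images and trigger_word fields; the version ID, field names, and destination below are placeholders to check against the model page:

```python
import replicate

# Minimal sketch of a scripted FLUX.1 fine-tune. The trainer name, version ID,
# and input field names are assumptions -- confirm them on the model page.
training = replicate.trainings.create(
    version="ostris/flux-dev-lora-trainer:<version-id>",    # paste the current version ID
    input={
        "input_images": open("training-images.zip", "rb"),  # your 12-20 diverse images, zipped
        "trigger_word": "MYSTYLE",                           # the word that will summon your style later
    },
    destination="your-username/flux-my-style",               # an empty model you own, to receive the weights
)
print(training.status)  # "starting" -> "processing" -> "succeeded", roughly half an hour later
```

When it finishes, you run the destination model like any other, with your trigger word in the prompt.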
Tavus launched their Conversational Video Interface, a “digital twin” API that looks like a person and does real-time video chat. With less than one second of latency, these AI avatars offer natural interactions for customer support, sales, and more.
The system combines speech recognition, vision processing, and natural language understanding to create lifelike digital replicas. Developers can easily integrate this technology into their applications, opening up new possibilities for scalable, personalized video interactions.
Built on Replicate!
Sketch2Scene is an ambitious project that transforms crude drawings into fully playable game worlds. Draw a simple overhead map, and the system will generate 3D terrain, textures, objects, and even playable character models.
The project combines multiple AI models in a complex pipeline, including isometric image generation, visual scene understanding, and procedural 3D scene generation. It’s a glimpse into the future of game development and AI-generated interactive environments.
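To make the Rube Goldberg shape of it concrete, here’s a hypothetical outline of how the stages hand off to each other. The types and functions below are stub stand-ins for the models the paper chains together, not Sketch2Scene’s actual code:

```python
# Hypothetical outline of a Sketch2Scene-style pipeline. These are stub stand-ins
# for the chained models, not the project's real code or API.
from dataclasses import dataclass, field


@dataclass
class SceneLayout:
    terrain_type: str
    objects: list[str] = field(default_factory=list)


def generate_isometric_image(sketch_path: str) -> str:
    # Real pipeline: a sketch-conditioned image model renders an isometric view of the map.
    return f"isometric render of {sketch_path}"


def understand_scene(isometric_image: str) -> SceneLayout:
    # Real pipeline: a vision model segments the render into terrain, paths, and object placements.
    return SceneLayout(terrain_type="grassland", objects=["tree", "house", "bridge"])


def build_playable_level(layout: SceneLayout) -> str:
    # Real pipeline: procedural generation turns the layout into 3D terrain, textures,
    # and placed assets inside a game engine scene.
    return f"{layout.terrain_type} level with {len(layout.objects)} placed objects"


if __name__ == "__main__":
    print(build_playable_level(understand_scene(generate_isometric_image("my_map.png"))))
```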
Puppet-Master adds drag tokens to Stable Video Diffusion, allowing fine-grained control over objects in generated videos. Sketch a drag from a point on an object to where you want it to end up, and the model animates that motion in the output video.
This technology brings us one step closer to fully interactive AI-generated content. Expect to see more work assigning tokens to concepts in video space in the future.
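For a sense of what a drag actually is as data, here’s a hypothetical sketch: each drag is a start point and an end point on the input image, interpolated into one target location per output frame and handed to the model as extra conditioning. This is the shape of the idea, not Puppet-Master’s real interface:

```python
# Hypothetical sketch of drag conditioning for an image-to-video model.
# The structure is illustrative, not Puppet-Master's actual interface.
from dataclasses import dataclass


@dataclass
class Drag:
    """Where the user grabs an object, and where it should end up."""
    start_xy: tuple[int, int]  # pixel grabbed in the input frame
    end_xy: tuple[int, int]    # pixel it should reach by the final frame


def interpolate_drag(drag: Drag, num_frames: int) -> list[tuple[float, float]]:
    """Linearly interpolate the drag into one target point per output frame.
    Each point would be encoded as a conditioning token alongside the image."""
    (x0, y0), (x1, y1) = drag.start_xy, drag.end_xy
    return [
        (x0 + (x1 - x0) * t / (num_frames - 1), y0 + (y1 - y0) * t / (num_frames - 1))
        for t in range(num_frames)
    ]


# e.g. pull the microwave door shut over 14 frames
trajectory = interpolate_drag(Drag(start_xy=(420, 310), end_xy=(355, 305)), num_frames=14)
print(trajectory[0], trajectory[-1])
```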
Our own Mattt shared his thoughts on the future of AR, VR, and AI agents in a prescient talk from 2022. He argues that the smartphone era will give way to augmented and virtual reality experiences, creating new opportunities for developers.
Mattt discusses the potential of AR/VR to revolutionize education, work, and social interactions, and he emphasizes the importance of using these technologies responsibly. Notably, he predicted Meta’s stock rebound when it was at its lowest point; it’s up 436% since then.
That’s all for this week! What do you think about the future of AI-generated worlds and embodied AI agents? Hit reply and let me know if you’re real. Please. Anyone.
If you enjoyed this newsletter, forward it to a friend who might find it interesting. If someone forwarded this to you, don’t forget to sign up!
Until next time,
— deepfates