I wrote about auto-completing language models — Transformers — and the feedback loop they create with our minds:
Robert Anton Wilson, a philosopher-entertainer of the 20th century, said “the border between the Real and the Unreal is not fixed, but just marks the last place where rival gangs of shamans fought each other to a standstill.” This is more true now than ever, because of the incredible power that intelligence augmentation will unleash.
Platforms will raise their language models with different values and those values will feed back into their decision-making apparatus. The worldviews encoded in the models will diverge, and our thoughts will be nudged by them. The tense border of Reality will erupt into a multifront war.
Perhaps this war has already begun.
I published those words on January 6th, at 10:16 my time. Two hours later, a wingnut known as Q Shaman stood at the podium of the United States Senate.
Think what you will of this encounter, it has permanently entered the narrative of this country and the world. Rival gangs of shamans fought each other to a standstill, and now we are in a new era.
I take no responsibility for this situation, of course. Lucky coincidence. Monkeys typing Hamlet, and all that. But imagine if somebody did: if they accidentally summoned Q Shaman with a stray word.
Likely it’s true! The mysterious “Q” ̶ı̶s̶ ̶J̶ı̶m̶ ̶W̶α̶t̶k̶ı̶n̶s̶ is officially anonymous, but is thought to be one person, or a small group of people, peddling misinformation to a wider ecosystem of grifters like Q Shaman. Whoever it is, they have a classic sorcerer’s apprentice problem. The conspiracy entertainment circuit can’t be turned off by just one person.
The war for reality is just beginning.
🔮 Hello, I thought this was a newsletter about interfaces? Not politics and magic and whatever?
That’s the thing: politics and magic are interfaces too. Politics is a Turing machine that runs on human compute power, written in a language called Common Law. Magic — religion, shamanism, the deliberate manipulation of consensus reality — works on a lower level, exploiting basic hot-buttons of the human organism.
Now that computers can use human language, they can program us back. We need to think about magic, about human belief, when we think about something like Twitter or smartphones. Think not just about the technical aspects of the machine, but the technical aspects of humans as well.
For instance: human beings are visual creatures. our brains are easily fooled by our eyes. This is why things like deep fakes are so scary to people: they look real enough that the average human might just believe it. And with seven billion people in the world, what humans believe is almost as important as what is true.
In that light, the news of OpenAI’s DALL-E and CLIP transformers — released on January 5 but overshadowed by the events of the next day — may be seen by future historians as more important than the Capitol riot itself.
🖇️What’s DALL-E and CLIP?
DALL-E is a GPT-3 variant that can generate images directly from text, and CLIP is the language model that tells it what to generate. DALL-E can do some pretty amazing things, and you can find some press about it. But as I mentioned last time, I think GPT-3 is too big to be practical, and I think CLIP is the more interesting development anyway.
CLIP is actually two models combined: a vision transformer and a language transformer. At risk of oversimplifying this, the vision transformer translates each image into encoding that the language transformer can read, and then the language transforming attempts to predict that encoding directly. It’s trained on 400 million image/text pairs from the internet.
CLIP is, in essence, a translator between visual language (of photographs, paintings, drawings, etc) and human language (in this case mostly English, I think). By itself it can classify images based on how close they are to its representation of a natural language sentence — for something like a natural language search engine of 2 million Unsplash images. or, combined with a sophisticated image generator, it can be used to tell the generator exactly what kind of image you want it to create.
⌚ Can I use it now? For some reason OpenAI has not given me an API key for DALL-E yet…
You can! The Big Sleep is a combination of CLIP and BigGAN, a modern generative adversarial network pre-trained to generate high-resolution images. It allows you to generate images without DALL-E, although the BigGAN generator isn’t as powerful and flexible as a GPT-3 model could be. You can use it right now in a Google Colab session from its creator @advadnoun.
I’ve been playing with it all week and it’s amazing the things it can come up with. N tural language is powerful because it has compositionality. It can express many different concepts and the ways they intersect. So the CLIP architecture is often looking at a concept it has never seen before in its entirety — this is called “zero shot transfer learning”. Take this picture, which it generated based on the phrase “A painting of a powerful neutral spirit to balance any angels or demons i might have accidentally summoned into Twitter”
It seems to have picked up on the color of the Twitter brand and its bird shaped outline, but inverted it to be a sun-bright bird in a Twitter-blue sky. With a Doge face: the ultimate arbiter of neutrality.
The Big Sleep feels like an externalized version of lucid dreaming. Or the technique that Carl Jung called active imagination, in which you picture an image in your mind and let it evolve without dissipating. It’s a way to visualize the archetypes that live in the thought space of humanity. But active imagination is difficult, and personal.
This tool puts it in your phone, makes a collective imagination accessible to your very word. it’s a scrying pool. And this new magic is like the old magic: if you know a thing’s true name, you can summon it.
We are as wizards now. We have to get good at it.
Thanks for reading,
This is the part where you like and subscribe. Or smash that reply button. And hit me on Twitter @deepfates. 🪄