I think the reason I’m obsessed with CLIP is that it’s hard evidence for unified meme theory
unified meme theory states that the “meme” defined as the transmissible unit of human thought, and the “meme” defined as a picture with some words on it that gets copied on the internet, are not different things.
a meme is picture+words because it represents a gradient in semantic space
semantic space is a much-theorized “embedding” for language concepts. Like if you think about a chair, and then you think about a chaise, and then you think about a silla… there’s a region in semantic space that activates from all of those. in your mind and mine
the image provides a field, an area in semantic space. the words provide a vector. Your attention moves to that area of semantic space, and then updates in the direction that the vector points. a gradient in semantic space.
depending on how far the words send you, and how well developed your map of that semantic space is, either it lands or it doesn’t. if it works on you, you save it to send to someone else later.
showed this one to gf, she made me send it to her so she can send it to her dad
but it has to be the right person: you have to simulate their whole mental state to understand whether this gradient will work on them. so memes follow social topologies as well.
this makes sense, though: semantic space is a product of social animals using mimetic calls
CLIP encodes visual image and text about that image into the same embedding space. You’ve probably seen this image, maybe with a caption that says “AI is too dumb to tell the difference between an apple and an iPod lol”
but this is actually amazing
This neural net learned to read. It learned to read handwriting! it’s not very good at it, but it wasn’t trained to do that. it was trained to match flashcards of images to text captions.
Is this not mostly a picture that says iPod? with a hint of apple and a dash of wood fence
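the flashcard-matching mechanism behind that apple/iPod confusion is just cosine similarity in the shared embedding space, with a softmax over the candidate captions. here’s a toy numpy sketch of that zero-shot step — the 4-d vectors and the temperature are made up for illustration; real CLIP would produce 512-d embeddings by running the image and captions through its two encoders:

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two embedding vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot(image_emb, caption_embs, temperature=100.0):
    """Rank candidate captions by similarity to the image, CLIP-style."""
    sims = np.array([cosine(image_emb, t) for t in caption_embs])
    logits = temperature * sims
    probs = np.exp(logits - logits.max())  # stable softmax
    return probs / probs.sum()

# toy 4-d stand-ins for CLIP embeddings (assumed, not real CLIP output).
# the handwritten "iPod" label dominates the image's embedding, so the
# iPod caption wins even though the photo is of an apple
labels = ["a photo of an apple", "a photo of an iPod", "a wood fence"]
image = np.array([0.2, 0.9, 0.1, 0.0])
texts = [np.array([1.0, 0.1, 0.0, 0.0]),   # apple direction
         np.array([0.1, 1.0, 0.0, 0.0]),   # iPod direction
         np.array([0.0, 0.1, 1.0, 0.0])]   # wood fence direction
probs = zero_shot(image, texts)
print(labels[int(np.argmax(probs))])  # → a photo of an iPod
```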
The fact that CLIP can recognize letters in photographs is a sign that they’re encoded in the brain the same way as other visual data. they’re just a bunch of weird squiggles, but they cause the meaning of the image to change.
in predictable ways!
there’s a high dimensional space representing all these concepts and how they interact.
Your brain evolved to think about how other monkeys think about other monkeys. CLIP is trained on a contrastive pretraining objective. we are kinda the same
You can subtract the “appleness” from an image. or you can move along an axis that holds appleness fixed while the iPodness changes. CLIP has a 512-dimensional embedding space, and its neurons have been shown to be multimodal, so there are more conceptual “axes” than that
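“subtracting the appleness” is literally a vector projection: remove the component of the embedding that points along the apple direction, leaving everything orthogonal to it untouched. a minimal sketch, with toy 3-d vectors assumed in place of CLIP’s 512-d embeddings:

```python
import numpy as np

def unit(v):
    # normalize a vector to length 1
    return v / np.linalg.norm(v)

def remove_concept(emb, concept_emb):
    # project out the concept direction: zero the component of emb
    # along concept_emb, keep the orthogonal components fixed
    d = unit(concept_emb)
    return emb - (emb @ d) * d

# toy stand-ins for CLIP text embeddings (made up for illustration)
apple = np.array([1.0, 0.0, 0.0])
ipod  = np.array([0.0, 1.0, 0.0])
image = 0.6 * apple + 0.8 * ipod   # "an apple with an iPod label"

deappled = remove_concept(image, apple)
print(round(float(deappled @ apple), 6))  # appleness removed → 0.0
print(round(float(deappled @ ipod), 6))   # iPodness preserved → 0.8
```

moving along a fixed-appleness axis is the complementary operation: add multiples of the iPod direction while leaving the apple component alone.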
This is a great thread showing how CLIP vectors encode world knowledge and how you can add and subtract them
https://x.com/haltakov/status/1367950896177569808
if you feel like you came up with these ideas independently, that’s probably true! we’re literally studying our own minds here
incidentally, this is why CLIP is perfect for meme search. I’d been wanting to make this program for a long time, but until now there wasn’t a way to search for concepts embedded in both text and visual data. now there is: memery
I thought I was going to have to combine a bunch of things, OCR the text, object recognition on the images, combine a bunch of metadata and manual tags to make a classifier or something. but with CLIP it “just works” 😘
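the reason it “just works”: embed every meme image once with the image encoder, embed the search query with the text encoder, then rank by cosine similarity. a sketch of that search loop — the filenames and tiny vectors here are pretend stand-ins, and this is my hypothetical reconstruction of the idea, not memery’s actual code:

```python
import numpy as np

def search(query_emb, image_embs, filenames, k=2):
    """Return the k memes whose embeddings best match the query.

    In a real pipeline both arguments come out of CLIP's encoders;
    here they're made-up low-dimensional vectors."""
    q = query_emb / np.linalg.norm(query_emb)
    m = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = m @ q                       # cosine similarity per meme
    order = np.argsort(-sims)[:k]      # best matches first
    return [filenames[i] for i in order]

files = ["cat.jpg", "apple_ipod.png", "stonks.jpg"]
embs = np.array([[0.9, 0.1, 0.0],
                 [0.1, 0.9, 0.2],
                 [0.0, 0.2, 0.9]])
query = np.array([0.1, 1.0, 0.1])      # e.g. embedded text "apple ipod"
print(search(query, embs, files, k=1))  # → ['apple_ipod.png']
```

because text and images land in the same space, the same index answers both “find memes like this caption” and “find memes like this image” with no OCR or tags anywhere.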
unified meme theory draws on this paper, “Embodiment vs. Memetics,” by Joanna Bryson. she says humans have a special combination of temporal imitation and second-order thinking that creates a semantic space fertile for memetic vectors
temporal imitation: like birds, we have the ability to repeat short, ordered phrases of sound. birds do their little mating calls and dances, humans do too.
most types of creature don’t do this call and response, so they don’t have imitatable units of action
second order thinking: our heritage as troop primates means we think not only about the internal state of our fellows, but about their internal models of our state! or even of third parties.
this creates a hall of mirrors effect. You can go fractally in any direction
The hall of mirrors is filled with the imitatable units of action, in all of their infinite slight variations.
it’s a holographic construct, a higher-dimensional space intersecting with a lower-dimensional reality. that’s semantic space. You can move through it with your mind
memes don’t have to be image macros. they could be visual, behavioral, sounds, motions. tiktok dances are memes. dabbing. imitatable, variational units of action
what used to be “image macros” became the most common definition of “memes” tho, because they are so easily shared. humans are visual creatures. we can parse an image instantaneously, and words almost as fast. and the early internet could share them faster than video.
Now that video is cheap and fast to share, you can have all these other types of visual/behavioral memes on tiktok (riding a skateboard + drinking cranberry juice + Fleetwood Mac * your_creative_addition_here)
but it still takes time to parse them, vs a screenshot or a meme.
this is why viral tiktoks use overlaid captions, btw. it’s a memetic grappling hook that grabs you and reels you into their region of memespace where you will then enjoy the video
have to go back to selling prepackaged memes instead of theorizing about them, i’ll leave you with this diagram i made ca. 2013
Original thread: https://x.com/deepfates/status/1397031948451647489