Article originally appeared on Replicate.
It’s been a long week for me and I have many more busy days before I can actually catch up on everything. Forgive me for sending you such a short letter. I couldn’t bear to send nothing at all.
The new Gemma 2 models were released in 9b and 27b sizes. They're heavily overtrained, fed far more tokens than compute-optimal scaling would call for, as seems to be the trend since Llama 3 at least. They're also distilled from larger Gemini models? And everyone's talking about the alternating global/local attention layers, also found in the Character.AI blog post (see below). There's a rough sketch of that attention pattern after the links.
post | paper | try on replicate
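If the local/global split is new to you, here's a minimal PyTorch sketch of the pattern, assuming a simple interleaving of sliding-window and full-attention layers. The window size, dimensions, and layer count are invented for illustration, and real models add RoPE, norms, MLP blocks, and KV caching on top:

```python
# A toy sketch of interleaved local/global attention, NOT Gemma 2's actual code.
import torch
import torch.nn.functional as F
from torch import nn

def causal_mask(seq_len, window=None, device=None):
    """Boolean mask: True = may attend. Causal, optionally limited to a local window."""
    i = torch.arange(seq_len, device=device)[:, None]
    j = torch.arange(seq_len, device=device)[None, :]
    mask = j <= i                        # attend to self and the past only
    if window is not None:
        mask &= (i - j) < window         # local layers see just the last `window` tokens
    return mask

class SelfAttention(nn.Module):
    def __init__(self, dim, n_heads, window=None):
        super().__init__()
        self.n_heads, self.window = n_heads, window
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, time, head_dim)
        q, k, v = (z.reshape(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
        mask = causal_mask(t, self.window, device=x.device)
        y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        return self.out(y.transpose(1, 2).reshape(b, t, d))

# Alternate layers: even layers use a sliding window, odd layers attend globally.
layers = nn.ModuleList(
    SelfAttention(dim=256, n_heads=8, window=128 if i % 2 == 0 else None)
    for i in range(8)
)

x = torch.randn(1, 512, 256)
for layer in layers:
    x = x + layer(x)   # residual only; norms and MLP blocks omitted for brevity
print(x.shape)         # torch.Size([1, 512, 256])
```

The appeal is that the local layers keep attention cost and KV cache size bounded by the window, while the occasional global layer still lets information flow across the whole context.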
Hugging Face has updated its previous meta-benchmark to include harder evaluations. They choose evals that are high quality, reliable, not widely leaked into training data, and that measure interesting skills. The rankings pass my sniff test so far: Qwen 72b holds a strong lead over Meta's Llama 3, which edges out Mixtral 8x22B, and so on.
Character.AI serves 20,000 inference queries per second. This is a concise yet specific guide to the optimizations they use to do that, including hybrid attention (as mentioned above) and stateful caching for the long, repetitive chat histories they have to include with every turn of the conversation. A toy sketch of the caching idea follows.
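To make the stateful-caching idea concrete, here's a toy Python sketch: cached states are kept between requests and keyed by the exact token prefix they cover, so a new turn only pays prefill for its new tokens. It ignores eviction, routing, and real tensors, and every name in it (`ToyModel`, `extend`, `PrefixCache`) is invented for illustration, not Character.AI's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class KVState:
    tokens: tuple = ()                            # the prefix this state covers
    layers: list = field(default_factory=list)    # stand-in for per-layer KV tensors

class ToyModel:
    """Pretend inference engine: `extend` plays the role of running prefill."""
    def init_state(self):
        return KVState()

    def extend(self, state, new_tokens):
        # A real engine would run attention over `new_tokens` only and append
        # their keys/values to the cached tensors; here we just record them.
        return KVState(tokens=state.tokens + tuple(new_tokens),
                       layers=state.layers + [f"kv({t})" for t in new_tokens])

class PrefixCache:
    """Keep KV states between requests, keyed by the token prefix they cover."""
    def __init__(self, model):
        self.model = model
        self.states = {}

    def prefill(self, tokens):
        tokens = tuple(tokens)
        # Find the longest cached state whose tokens are a prefix of this prompt...
        best = self.model.init_state()
        for prefix, state in self.states.items():
            if tokens[:len(prefix)] == prefix and len(prefix) > len(best.tokens):
                best = state
        # ...then only run the model over the uncached suffix.
        state = self.model.extend(best, tokens[len(best.tokens):])
        self.states[tokens] = state
        return state

cache = PrefixCache(ToyModel())
turn1 = ("system prompt", "user: hi", "bot: hello")
turn2 = turn1 + ("user: tell me a joke",)
cache.prefill(turn1)   # pays prefill for all 3 "tokens"
cache.prefill(turn2)   # reuses turn1's cached state, computes only the 1 new token
```

The point is that turn N+1's prompt is almost entirely turn N's prompt, so if the cached state survives between requests you avoid recomputing the whole conversation every turn.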
Stable Diffusion 3 has been out for a couple weeks now. Our in-house AI experimenter @fofrAI has gotten some great results, but it’s not always easy. Learn how to pick the right version, craft quality prompts, and get the right settings in our blog post.
That’s it. That’s literally all that happened this week. Am I wrong? Reply and let me know. I will make my apologies next week.
— deepfates