Thinking Machines and OpenAI rewrite voice: simultaneous translation is no longer a walkie-talkie
🔗 Learn more about me, my work and how to stay in touch: maeste.it: personal bio, projects and social links.
It was an important week for voice interaction, with two releases just a few days apart that I decided to cover together in the deep dive: OpenAI’s GPT Real Time models, released on May 7, and the Interaction Models from Thinking Machines, Mira Murati’s startup, which landed with fanfare on May 11. I walk through what really changes compared to the voice mode we are used to, because a year ago we had identified simultaneous translation as one of the jobs at risk, and today that risk has become very concrete. In the links section you’ll find side themes: agentic Gemini built into the Android operating system, two papers on small, modular models (Recursive LMs and Allen AI’s EMO), Google’s SkillOS framework for agents that learn from experience, and Garry Tan’s thesis on personal AI as an operating system. Happy reading.
My agenda
Episode 52 of Risorse Artificiali is out, marking exactly one year of podcasting without skipping a week. In this episode we try to take stock of what changed in AI over the past year while we tried to tell its story to our Italian friends.
On Wednesday a new interview went live: Domenico Gagliardi (Founder and COO of Kortix) explains why with AI no software is defensible anymore, and where value still sits (infra + data).
As you know, our GitHub repository with tools and configurations for AI coding from a Linux terminal now has its own site with single-script install: Lince.sh
We released AntiVocale (Google Play, GitHub), a tool that turns voice messages into text
Solo:
Tuesday evening I’ll be in Milan for the AI Socratic Milano event. If there’s a chance, I’ll also share an update on the current state of Lince
The video of the talk I gave with Alessio at VoxxedDay Zurich is now online
On May 30 I’ll have the honor of being one of the PyCon Italia speakers
On June 12 I’ll be in Catania as a speaker at Coderful
On June 24 I’ll be in Milan as a speaker at AIConf
Real-time voice: Thinking Machines, OpenAI and the end of turn-by-turn
Over the span of a few days, two releases meaningfully raised the bar for voice interaction with models. The first, on May 7, came from OpenAI with the GPT Real Time models, three new API models: GPT-Realtime-2, with GPT-5-class reasoning and an extended 128K-token context; GPT-Realtime-Translate, for live translation from over 70 input languages into 13; and GPT-Realtime-Whisper, for streaming transcription. Four days later, and with much fanfare, came the response from Thinking Machines, Mira Murati’s startup, which introduced the Interaction Models in research preview. They are not yet easily accessible in Europe, but the videos I’ve seen are frankly impressive. If I had to describe them in one phrase, they are ChatGPT voice mode on steroids: models that respond to voice in a truly interactive way, built from scratch with a multi-stream design for real-time responsiveness and designed to remove the classic turn-by-turn ping-pong by construction. OpenAI’s timing probably took some momentum away from the Thinking Machines launch, because part of what was demoed was already covered by OpenAI’s new API.
There’s one detail, though, that struck me in Thinking Machines’ favor: their model is relatively small, around 273 million parameters as I recall. As a reminder, both Claude Opus and GPT 5.5 are rumored (neither vendor has published the numbers) to sit around 2 trillion parameters in a Mixture of Experts configuration. If I recall that figure correctly, that is several orders of magnitude smaller. And the results are still impressive: there are videos of people speaking in an extremely natural way, as if they were talking to another person. The model interrupts the speaker, waits, picks up the thread. Anyone who has tried ChatGPT voice knows that it was, until now, already the best experience around, far better than Claude’s, but you still get the sense that the model is waiting for you to pause before concluding that you’ve finished your sentence and replying. That makes sense, because internally it works like that: it takes the context, slices it into segments and prepares the reply turn by turn.
Thinking Machines’ model, and most likely the new GPT Real Time too, works differently. They are called real time precisely because they maintain a second-by-second understanding of the context so far, continuously reprocessing it. The paper isn’t out yet and I’m curious to read it, but rumors suggest they may be using recursive language models internally, something Google has already explored in other contexts. This enables a striking degree of naturalness, including in simultaneous translation.
I watched a GPT Real Time clip last night and the effect is exactly this: a person speaks in French, the English translation starts a couple of seconds later and proceeds in parallel, just like when you listen to a professional simultaneous interpreter. It works like this: the model waits to recognize that the main verb of the sentence has been delivered, because that’s what determines the semantic direction of the discourse, and at that point it starts translating. Thinking Machines shows equivalent videos, and for developers the OpenAI cookbook already provides three ready-to-use architectures (browser, Twilio, LiveKit) for broadcast translation, customer service and multilingual meetings.
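To make the "wait for the main verb, then start" idea concrete, here is a minimal toy sketch. It is only an illustration of the policy as described above, not how GPT Real Time or the Interaction Models are actually implemented. The names `TOY_VERBS`, `is_main_verb` and `emit_chunks` are hypothetical, and the tiny verb list stands in for a real syntactic analysis.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical stand-in for a real tagger or parser: a tiny list of French verb forms.
TOY_VERBS = {"mange", "vois", "veux", "est", "parle"}


def is_main_verb(token: str) -> bool:
    """Toy check: treat any token in TOY_VERBS as the main verb."""
    return token.lower() in TOY_VERBS


def emit_chunks(
    tokens: Iterable[str],
    is_verb: Callable[[str], bool] = is_main_verb,
) -> Iterator[List[str]]:
    """Yield source chunks that could be handed to a translator.

    Tokens are buffered until the main verb of the sentence arrives;
    from then on each new token is released immediately.
    """
    buffer: List[str] = []
    verb_seen = False
    for token in tokens:
        if token in {".", "?", "!"}:
            # End of sentence: flush what is left and wait for the next verb.
            if buffer:
                yield buffer
                buffer = []
            verb_seen = False
            continue
        buffer.append(token)
        if not verb_seen and is_verb(token):
            verb_seen = True
        if verb_seen:
            yield buffer
            buffer = []
    if buffer:
        yield buffer


if __name__ == "__main__":
    for chunk in emit_chunks("Je mange une pomme .".split()):
        print(chunk)
```

In this sketch the lag before the first chunk is exactly the time it takes for the verb to show up, which is why verb-final sentences are the hard case for any simultaneous interpreter, human or not.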
By the way, a year ago, on these very pages, we said simultaneous translation was one of the jobs at risk. Well, here we are: the risk is now concrete. If automatic translation used to feel like talking over a walkie-talkie, that’s no longer the case. And it holds even with multiple languages interleaved, because once you have the system, one language is as good as another.
The links that stood out this week
Gemini lands on Android in its agentic form
Google brings Gemini to Android with multi-step actions across apps, autonomous browsing, form-filling, Rambler dictation on Gboard, and widgets generated in natural language (vibe-coding). Debut on Samsung Galaxy and Pixel this summer.
What interests me here isn’t the vibe-coded widget, which is more showcase than substance, but the fact that Google is pushing agentic capabilities directly into the mobile operating system, with real access to apps and the web. It’s yet another confirmation of the trend that has agents stepping out of the chat and moving into our devices. The open question is how well security will be handled in such open scenarios, because that’s where the whole game is played.
Reinforcing Recursive Language Models
Article on how to use reinforcement learning to fine-tune 4B-scale models as Recursive Language Models for production, matching Claude Sonnet 4.6 at much lower cost and size.
A theme close to my heart for a while now: small models, trained well for specific tasks and collaborating recursively, can match the big ones. Performance on par with Sonnet 4.6 is notable, especially if it holds outside synthetic tests. I referenced this same work in the deep dive as a plausible architecture behind Thinking Machines’ Interaction Models, because I believe the trend of small recursive models is one of the most interesting threads to follow.
SkillOS: skill curation for agents that learn from experience
Google paper on an RL framework that separates a frozen agent executor from a trainable skill curator, which manages a repository of reusable skills evolved from accumulated experience.
A parallel thread to the Dream feature in Anthropic’s Managed Agents we talked about a few weeks ago: agents that improve themselves by reflecting on their own sessions. Here Google formalizes the idea with a dedicated curator and shows that the resulting skills generalize across different models and domains. For anyone building long-running agentic systems, this is the right direction to keep an eye on.
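The split between a frozen executor and a curator that owns the skill repository can be sketched in a few lines. This is a rough illustration of the separation of roles only: every name here (`Skill`, `SkillRepository`, `frozen_executor`, `curate`) is hypothetical, retrieval is a naive keyword match, and the reinforcement learning that trains the curator in the paper is not modeled at all.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Skill:
    name: str
    instructions: str
    revisions: int = 0


@dataclass
class SkillRepository:
    skills: Dict[str, Skill] = field(default_factory=dict)

    def retrieve(self, task: str) -> List[Skill]:
        # Naive keyword match, standing in for a real retrieval step.
        return [s for s in self.skills.values() if s.name in task.lower()]


def frozen_executor(task: str, skills: List[Skill]) -> str:
    """Stand-in for the fixed agent: it reads skills but never changes itself."""
    hints = "; ".join(s.instructions for s in skills) or "no skill available"
    return f"task={task!r} hints={hints}"


def curate(repo: SkillRepository, keyword: str, lesson: str) -> None:
    """Stand-in for the curator: the only component that writes to the repository."""
    skill = repo.skills.get(keyword)
    if skill is None:
        repo.skills[keyword] = Skill(keyword, lesson)
    else:
        skill.instructions = lesson
        skill.revisions += 1


if __name__ == "__main__":
    repo = SkillRepository()
    print(frozen_executor("book flight", repo.retrieve("book flight")))
    curate(repo, "flight", "check baggage rules first")
    print(frozen_executor("book flight", repo.retrieve("book flight")))
```

The point of the separation is that the executor can be swapped for a different model while the accumulated skills stay where they are, which is what makes the generalization claim across models plausible.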
EMO: emergent modularity in Mixture of Experts
Allen AI releases EMO, a 128-expert MoE where modularity emerges naturally during pretraining by using document boundaries as weak supervision. Near full-model performance with just 12.5% of the experts active.
The number that makes my ears prick up is that 12.5%. If it really holds on real tasks and not only on benchmarks, it means you can deploy specialized subsets of a model and drastically reduce memory and compute. For local inference, which we talk about often, this would be a game changer. The thread is worth following, especially on the open-weight side.
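A quick back-of-the-envelope helps here. The sketch below assumes that the 12.5% refers to a subset of experts you could keep at deployment time, and that some share of the parameters (attention, embeddings, router) is shared and always loaded. The 90% expert share used in the example is a made-up illustrative value, not a figure from the EMO release, and the function names are hypothetical.

```python
def experts_kept(total_experts: int, kept_fraction: float) -> int:
    """Number of experts retained when keeping a given fraction of them."""
    return round(total_experts * kept_fraction)


def deployed_param_fraction(kept_fraction: float, expert_share: float) -> float:
    """Fraction of the full model's parameters that must be loaded.

    expert_share is the assumed fraction of all parameters living in expert
    layers; the remainder is shared and always loaded.
    """
    shared = 1.0 - expert_share
    return shared + expert_share * kept_fraction


if __name__ == "__main__":
    print(experts_kept(128, 0.125))
    # Illustrative assumption: 90% of the parameters sit in the experts.
    print(round(deployed_param_fraction(0.125, 0.90), 4))
```

With 128 experts, 12.5% means 16 of them. Under the illustrative 90% assumption you would load a bit more than a fifth of the full model, so the real saving depends heavily on how much of the model is shared.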
Garry Tan: personal AI as an operating system
Garry Tan (YC) introduces GBrain, an MIT-licensed open-source system that turns markdown notes into a self-organizing knowledge graph, the foundation for personal agents with autonomous cron jobs.
What I like about Tan’s thesis is the framing: personal AI isn’t a chat, it’s an operating system with a thin harness, fat skills, fat code and a fat data layer. It is exactly the mental model I’m running with Hermes Agent at home. The fact that GBrain is MIT-licensed, open source and based on markdown is a clear manifesto: stay above the API line, not below it. Worth a deeper look.


