Local or cloud is the wrong question: AI will be hybrid
🔗 Learn more about me, my work and how to stay in touch: maeste.it: personal bio, projects and social links.
This week I come back to a topic that has been close to my heart for a while: how we are optimizing AI to run locally, and where all of this is taking us. In the deep dive I try to line up the pieces, from the work on small models, quantization and minimal harnesses all the way to hardware, with Apple currently winning hands down and NVIDIA trying to respond with the RTX Spark. My thesis is that the future is not local versus cloud, but a hybrid architecture in which the two complement each other: run at home what you can, and call the cloud only when you really need to. In the links section you will find the themes that frame the picture and round it out: a tool to compare Ollama models locally, Anthropic on recursive self-improvement, Google’s Sleep paradigm for models that learn on their own, the real bottleneck of enterprise agents (permissions, not the model), dreaming in ChatGPT’s memory, Open Code Review and NVIDIA’s new Nemotron-3-Ultra, with its open weights and datasets. Enjoy the read.
My agenda
Saturday saw episode 55 of Risorse Artificiali, “Dynamic workflows: the AI that writes its own harnesses”: with Opus 4.8 every agent generates its own custom tool in JavaScript on the fly, and we discuss what this means for security and sandboxing. Listen
In the same episode: why benchmarks are not comparable (it is the harness that counts, not just the model), MiniMax M3 and the Hassabis interview with YouTube’s automatic dubbing.
Our projects Lince.sh and AntiVocale (Google Play, GitHub): by now you know them well.
On my own:
I was at PyCon Italia as a speaker: you will find all the content from my two talks, as always, on maeste.it in the section dedicated to talks. I will also put the videos there as soon as they are available.
On June 12 I will be in Catania as a speaker at Coderful
On June 24 I will be in Milan as a speaker at AIConf
Hybrid architectures: AI runs locally, you call the cloud only when needed
For a while now I have been seeing a trend that keeps getting clearer: optimizing everything to run AI locally. You can see it on several fronts at once, and put together they point in a precise direction. There is the work on inference, there are the small models, there are the quantization-aware models, that is, models designed from training onward to hold up well under reduced precision, and above all there is an enormous amount of work on quantization, including asymmetric schemes that try to squeeze every bit without losing quality. New architectures are arriving that remove the encoder or some decoding stages. Harnesses are being designed to be minimal (I am thinking of Pi) or built to run hand in hand with inference, like antirez’s DS4, all the way to systems like Unsloth that let you do inference and even fine-tuning on the same machine. The underlying idea seems to be just one: having more and more systems running at home, and not just to play around.
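To make the asymmetric quantization idea concrete, here is a minimal sketch in plain Python. It assumes the simplest possible scheme: one scale and one zero point for the whole list of values, mapped onto unsigned integers. The function names are mine, chosen for illustration, and real libraries work per block or per channel with far more care.

```python
def quantize_asymmetric(values, bits=4):
    """Map floats onto integers in [0, 2**bits - 1] using a scale and a zero point."""
    qmax = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # Guard against a constant input, where the range would be zero
    # (this only avoids a division by zero, it does not preserve accuracy).
    scale = (hi - lo) / qmax if hi > lo else 1.0
    # The zero point is the integer that represents the real value 0.0.
    zero_point = round(-lo / scale)
    quantized = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from the integer codes."""
    return [(q - zero_point) * scale for q in quantized]


if __name__ == "__main__":
    weights = [-0.2, 0.0, 0.5, 1.3]
    codes, scale, zero_point = quantize_asymmetric(weights, bits=4)
    print(codes, scale, zero_point)
    print(dequantize(codes, scale, zero_point))
```

The point of the asymmetry is that the integer range covers exactly the interval between the smallest and the largest value, instead of wasting codes on a symmetric interval around zero that the data never uses.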
In all of this hardware matters, and it matters a lot. Right now Apple is winning hands down thanks to the stability of its ARM architecture with shared RAM: Apple may be struggling on many other fronts, but on hardware for running things locally it is holding what could be a turning point. NVIDIA and Microsoft are trying to respond with a competing system, the RTX Spark, because the DGX Spark remains too specific for most of us.
Before getting to my strong opinion, I want to clear up a possible misunderstanding. All this work on local does not clash with the push by frontier labs to concentrate ever more intelligence into SOTA models. Quite the opposite: the two things complement each other, and it is precisely from their sum that the hybrid architecture I am talking about emerges. On one side the big labs will keep raising the bar of what a model can do; on the other a growing ecosystem brings part of that capability onto our machines. It is not a race between the two fronts but a division of labor that is taking shape.
And here is my strong opinion. Right now, to do local inference seriously, you need either hardware of a certain level, for example to run DeepSeek with DS4, or fairly specific use cases. That is true even though the latest small models, I am thinking of Gemma 4 12B, open the door even to RTX and Ada cards with 16 GB of RAM. In the meantime open-weight models keep growing in capability: MiniMax M3 confirms the will to carry forward, in the wake of DeepSeek V4, frontier coding, native multimodality and a one-million-token window, and on top of that at very low API prices. I, however, see a different future, made of hybrid architectures: running some operations locally, perhaps on purpose-finetuned models, and delegating to the cloud only when it is really needed. A bit like what we saw with “/advisor” in Claude Code, but flipped: the main model is the local one, and you call the cloud advisor only in the moments that matter. It is a direction similar to the one Perplexity proposes, which not by chance titles one of its pieces The data center moves to your machine.
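The flipped advisor pattern can be sketched in a few lines. This is only an illustration under my own assumptions: `local_model` and `cloud_advisor` are hypothetical callables, not real APIs, and I assume the local side can report some confidence score between 0 and 1 (how you would actually compute it is an open question).

```python
def hybrid_answer(prompt, local_model, cloud_advisor, threshold=0.7):
    """Answer locally by default and escalate to the cloud only when unsure.

    local_model(prompt) is assumed to return (text, confidence).
    cloud_advisor(prompt, draft) is assumed to return the improved text.
    Both are hypothetical stand-ins for whatever runtime you use.
    """
    draft, confidence = local_model(prompt)
    if confidence >= threshold:
        # The local model is the main model: most requests stop here.
        return draft, "local"
    # Only the hard cases pay the latency and the cost of a cloud call.
    # The local draft goes along, so the advisor can correct it
    # instead of starting from scratch.
    return cloud_advisor(prompt, draft), "cloud"


if __name__ == "__main__":
    def toy_local(prompt):
        # Pretend the local model is only confident on short prompts.
        return f"local answer to: {prompt}", 0.9 if len(prompt) < 40 else 0.3

    def toy_cloud(prompt, draft):
        return f"cloud-refined version of [{draft}]"

    print(hybrid_answer("Rename this variable", toy_local, toy_cloud))
    print(hybrid_answer("Redesign the whole persistence layer of this service", toy_local, toy_cloud))
```

The threshold is the knob that decides how hybrid the system is: set it to 0 and everything stays at home, set it above 1 and every request goes to the cloud.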
And here I get to the part that intrigues me the most, because it is still all to be written. My gut feeling is that one of the engineering optimizations we will need is the ability to load models into memory in a much faster and more dynamic way, so we can load specific versions, or ones with dedicated fine-tuning, on the fly depending on the task at hand. Today it is a challenge for which there are no clear solutions yet, and that is exactly why it is worth keeping our eyes open: it is one of those problems that, once solved well, will change the economics of everything else.
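To give an idea of the shape of the problem, here is a minimal sketch of the bookkeeping side only: a small cache that keeps a fixed number of models resident and evicts the least recently used one. Everything here is an assumption of mine: `loader` is a hypothetical function that brings a model into memory, and the sketch says nothing about the hard part, which is making that load fast.

```python
from collections import OrderedDict


class ModelCache:
    """Keep at most `capacity` models in memory, evicting the least recently used."""

    def __init__(self, loader, capacity=2):
        self._loader = loader        # hypothetical callable: name -> loaded model
        self._capacity = capacity
        self._models = OrderedDict()  # insertion order doubles as recency order

    def get(self, name):
        if name in self._models:
            # Cache hit: mark the model as the most recently used.
            self._models.move_to_end(name)
            return self._models[name]
        # Cache miss: this is the slow path that needs to become much faster.
        model = self._loader(name)
        self._models[name] = model
        if len(self._models) > self._capacity:
            # Drop the model that has gone unused for the longest time.
            self._models.popitem(last=False)
        return model

    def resident(self):
        """Names of the models currently in memory, least recently used first."""
        return list(self._models)


if __name__ == "__main__":
    cache = ModelCache(loader=lambda name: f"<weights of {name}>", capacity=2)
    for task_model in ["coder", "summarizer", "coder", "translator"]:
        cache.get(task_model)
    print(cache.resident())
```

In a real system the capacity would be expressed in bytes rather than in number of models, since a fine-tuned adapter and a full checkpoint have very different footprints.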
The links that struck me this week
Ollama Model Tester (GitHub Repo)
A small but clever tool, right in the spirit of this week’s deep dive. If you are experimenting with local inference, being able to run the same prompt across multiple models and compare the responses side by side saves you a lot of time. We are going to need more and more tools like this.
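If you want to try the same idea by hand, here is a minimal sketch. It is not how the tool itself works, which I have not inspected at that level. It assumes a local Ollama server on its default port and uses the `/api/generate` endpoint with streaming turned off. The `generate` parameter is there so you can swap in any other backend.

```python
import json
import urllib.request

# Assumption: a local Ollama server listening on its default address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def ollama_generate(model, prompt):
    """Send one non-streaming generation request to the local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def compare_models(prompt, models, generate=ollama_generate):
    """Run the same prompt on every model and collect the answers by model name."""
    return {model: generate(model, prompt) for model in models}


def side_by_side(results):
    """Render one line per model, with the names aligned in a left column."""
    width = max((len(name) for name in results), default=0)
    return "\n".join(
        # Collapse whitespace so every answer fits on a single line.
        f"{name.ljust(width)} | {' '.join(text.split())}"
        for name, text in results.items()
    )
```

Calling `print(side_by_side(compare_models("Explain quantization in one sentence", models)))` with a list of model names you have already pulled gives a rough but useful first comparison.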
When AI builds itself
Anthropic openly talking about recursive self-improvement always makes a certain impression. I take the figure of eight times more code per engineer with a grain of salt, like all internal benchmarks, but the direction is clear and it is worth reading how they tell it.
Sleep for Continual Learning
Here Google tries to give models a kind of sleep: a phase in which they consolidate short-term knowledge into the parameters, complete with a Dreaming stage via reinforcement learning to generate their own curricula. It is exactly the strand of models that improve on their own that has interested me for a long time. Keep it in mind, because further down, with Open Code Review, the same pattern comes back: AI working on the work of AI.
The AI agent bottleneck isn’t model performance, it’s permissions
This piece says something I have been repeating for months: the bottleneck of enterprise agents is not how good the model is, but permissions and governance. And it is exactly one of the problems we are trying to help solve with Lince.sh, working on sandboxing and on what an agent can or cannot do. Read it, because it frames the problem well.
OpenAI introduces “dreaming” into ChatGPT’s memory
After Anthropic’s Memory Files, which I talked about last week, OpenAI too is reworking memory, with a background system that turns past chats into a profile organized by categories. The memory topic has become one of the real battlegrounds between harnesses, and here you can clearly see where the game is heading.
Open Code Review (GitHub Repo)
And here we are at the hook I left you above with the Sleep paper. This time we are on concrete ground: a CLI that reads the git diff and produces precise line-by-line reviews, with the philosophy of combining deterministic engineering and an agent, letting each handle what it does best. It is the same division-of-labor idea from the deep dive, applied to code quality: AI that reviews and improves what AI itself produces, but with a deterministic backbone holding the line.
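To show what that division of labor can look like, here is a minimal sketch of my own, not the project's actual code. The deterministic part parses a unified git diff and works out the exact line number of every added line. The `agent` is a hypothetical callable that only has to comment on a line it is handed. The TODO rule is just a placeholder for any check that does not need a model.

```python
import re

# Matches hunk headers such as "@@ -1,3 +1,4 @@" and captures the new-file start line.
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_lines(diff_text):
    """Return (path, line_number, text) for every line added in a git diff."""
    results = []
    path, new_line, in_hunk = None, 0, False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # a new file section begins
        elif not in_hunk and line.startswith("+++ "):
            target = line[4:]
            path = target[2:] if target.startswith("b/") else target
        elif (match := HUNK_RE.match(line)):
            new_line, in_hunk = int(match.group(1)), True
        elif in_hunk:
            if line.startswith("+"):
                results.append((path, new_line, line[1:]))
                new_line += 1
            elif line.startswith(" ") or line == "":
                new_line += 1  # context lines exist in the new file too
            # Removed lines and "\ No newline" markers do not advance the counter.
    return results


def review(diff_text, agent=None):
    """Combine a deterministic rule with an optional, hypothetical agent callable."""
    comments = []
    for path, number, text in added_lines(diff_text):
        if "TODO" in text:
            # Deterministic check: no model needed, and the result is reproducible.
            comments.append(f"{path}:{number}: unresolved TODO")
        elif agent is not None:
            note = agent(path, number, text)
            if note:
                comments.append(f"{path}:{number}: {note}")
    return comments
```

The agent never has to guess where a line lives: the parser gives it the position, and the model only contributes judgment.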
NVIDIA Nemotron-3-Ultra
I close with a model that speaks straight to the deep dive: 550 billion parameters but only 55 billion active, thanks to a hybrid Mamba-Attention MoE, with a one-million-token context window. The thing that makes my ears perk up is that NVIDIA publishes checkpoints, quantized versions and even the datasets: it is exactly the kind of openness that feeds the local ecosystem I was talking about.
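A quick back-of-the-envelope calculation helps put those numbers in a local perspective. The sketch below is only arithmetic: it counts the bytes needed for the weights at a given precision and ignores the KV cache, activations and any runtime overhead. The bit widths are illustrative choices of mine, not published figures for this model.

```python
def weight_gb(params_billion, bits):
    """Approximate weight size in decimal gigabytes: parameters times bits, divided by 8."""
    return params_billion * bits / 8


if __name__ == "__main__":
    for bits in (16, 8, 4):
        total = weight_gb(550, bits)   # every expert has to be stored somewhere
        active = weight_gb(55, bits)   # weights actually touched for each token
        print(f"{bits}-bit: {total:.1f} GB total, {active:.1f} GB active per token")
```

In a MoE the active parameters mostly determine how much compute each token costs, while the total still determines how much memory you need to hold the model, which is why quantized checkpoints matter so much for running it outside a data center.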


