> LLMs cannot offer that promise by design, so it remains your job to find and fix any deviations from the abstraction you intended.
LLMs are clumsy, very leaky interns right now. But we know human experts can be leak-proof, so why can't LLMs get there too: better at coding, at understanding your intentions, at automatically reviewing for deviations, and so on?
Thought experiment: could you work well with a team of human experts just below your level? If so, you should be able to work well with future LLMs.
> 6/N In our evaluation, the model solved 5 of the 6 problems on the 2025 IMO. For each problem, three former IMO medalists independently graded the model’s submitted proof, with scores finalized after unanimous consensus. The model earned 35/42 points in total, enough for gold!
That means Google DeepMind has the first OFFICIAL IMO Gold.
> We've now been given permission to share our results and are pleased to have been part of the inaugural cohort to have our model results officially graded and certified by IMO coordinators and experts, receiving the first official gold-level performance grading for an AI system!
The goal here is not to replace transformers but to combine them with RNNs, so you get both good short-term memory (self-attention) and much improved long-term memory (the ATLAS recurrent memory).
"Empirically, our models—OmegaNet, Atlas, DeepTransformers,
and Dot—achieve consistent improvements over Transformers and recent RNN variants across diverse benchmarks."
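To make the short-term vs. long-term split concrete, here is a minimal PyTorch sketch of a hybrid block, assuming a sliding-window setup where self-attention handles the current window and a small recurrent cell carries state across windows. This is an illustrative toy, not the paper's actual Atlas/DeepTransformers architecture; the module choices and sizes are my assumptions.

```python
# Toy hybrid block: attention for short-term memory within a window,
# a recurrent cell for long-term memory carried across windows.
import torch
import torch.nn as nn

class HybridMemoryBlock(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # Short-term memory: standard self-attention over the current window.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Long-term memory: a GRU cell whose hidden state persists across windows.
        self.memory = nn.GRUCell(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, mem: torch.Tensor):
        # x: (batch, window_len, d_model); mem: (batch, d_model)
        attn_out, _ = self.attn(x, x, x)        # local context via attention
        pooled = attn_out.mean(dim=1)           # summarize the window
        mem = self.memory(pooled, mem)          # update the recurrent memory
        # Broadcast the long-term memory back onto every position in the window.
        out = self.norm(attn_out + mem.unsqueeze(1))
        return out, mem

# Usage: process a long sequence window by window, carrying `mem` forward.
block = HybridMemoryBlock()
mem = torch.zeros(2, 64)
for window in torch.randn(10, 2, 16, 64):       # 10 windows of length 16
    out, mem = block(window, mem)
```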
100% agreed with your experience: AI provides little value in one's own area of deep expertise (10+ years). It's the context length -- the AI would need comparable training or inference-time cycles to catch up.
> The new paper used many of the techniques incorrectly, says Nedergaard, who says she plans to elaborate on her critiques in her submission to Nature Neuroscience. Injecting straight into the brain, for example, requires more control animals than Franks and his colleagues used, to check for glial scarring and to verify that the amount of dye being injected actually reaches the tissue, she says. The cannula should have been clamped for 30 minutes after fluid injection to ensure there was no backflow, she adds, and the animals in the sleep groups are a model of sleep recovery following five hours of sleep deprivation, not natural sleep—a difference she calls “misleading.”
> “They are unaware of so many basic flaws in the experimental setup that they have,” she says.
> More broadly, measurements taken within the brain cannot demonstrate brain clearance, Nedergaard says. “The idea is, if you have a garbage can and you move it from your kitchen to your garage, you don’t get clean.”
> There are no glymphatic pathways, Nedergaard says, that carry fluid from the injection site deep in the brain to the frontal cortex where the optical measurements occurred. White-matter tracts likely separate the two regions, she adds. “Why would waste go that way?”
Seems to me the o3 prices are what the consumer pays, not what OpenAI pays. That would mean o3 could be more cost-efficient run in-house than paying subject-matter experts.
For every consumer there will be a period where they need both the SME and the o3 model: initial calibration, then an eventual handoff, before they actually capture those efficiencies in whichever processes they want to automate.
In other words, if you are diligent, you should validate your o3 solution with an actual expert for some time. You wouldn't just blindly trust OpenAI with your business-critical processes, would you? I would expect at least 3-6 months for large corps, and even more once you factor in change management, re-upskilling, etc.
With all those considerations I really don't see the value prop at those prices and in those situations right now. Maybe if costs decrease another ~1-3 orders of magnitude for o3-low, depending on the processes being automated.
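A back-of-envelope sketch of the calibration-period argument. Every number here (token volume, token price, SME rate, task counts, calibration length) is a made-up placeholder, not real o3 pricing or salary data; plug in your own.

```python
# All figures are hypothetical placeholders for illustration only.
TOKENS_PER_TASK = 200_000      # assumed tokens consumed per automated task
PRICE_PER_MTOKEN = 40.0        # assumed $/million tokens for a high-effort model
SME_HOURLY = 150.0             # assumed fully loaded SME cost per hour
SME_HOURS_PER_TASK = 2.0       # assumed SME hours on the same task
CALIBRATION_MONTHS = 6         # validation/handoff period argued for above
TASKS_PER_MONTH = 100

model_cost_per_task = TOKENS_PER_TASK / 1e6 * PRICE_PER_MTOKEN
sme_cost_per_task = SME_HOURLY * SME_HOURS_PER_TASK

# During calibration you pay for both; only afterwards do the savings start.
calibration_cost = CALIBRATION_MONTHS * TASKS_PER_MONTH * (
    model_cost_per_task + sme_cost_per_task
)
monthly_savings_after = TASKS_PER_MONTH * (sme_cost_per_task - model_cost_per_task)
breakeven_months = calibration_cost / monthly_savings_after

print(f"model ${model_cost_per_task:.2f}/task vs SME ${sme_cost_per_task:.2f}/task")
print(f"calibration overhead ${calibration_cost:,.0f}, "
      f"recouped after ~{breakeven_months:.1f} more months")
```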
I always get the feeling he's subconsciously inserting a "magical" step here with reference to "synthesis" -- invoking a kind of subtle dualism where human intelligence is just different and mysteriously better than hardware intelligence.
Combining programs should be straightforward for DNNs: ordering, mixing, and matching concepts by coordinates and arithmetic in a learned high-dimensional embedding space. Inference-time combination is harder, since the model is working with tokens and has to keep coherence over a growing CoT with many twists, turns, and dead ends, but with enough passes it can still do well.
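On the first point, a toy illustration of what "combining concepts by coordinates and arithmetic in embedding space" could look like. The vectors and concept names are made up; this is not a claim about how any particular model composes programs internally.

```python
# Toy concept arithmetic in a made-up embedding space.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
# Hypothetical learned embeddings for a few "program" concepts.
concepts = {name: rng.normal(size=dim) for name in ["sort", "reverse", "filter_even"]}

def combine(*names: str, weights=None) -> np.ndarray:
    # Weighted sum in embedding space as a stand-in for concept composition.
    vecs = [concepts[n] for n in names]
    w = weights or [1.0 / len(vecs)] * len(vecs)
    return sum(wi * v for wi, v in zip(w, vecs))

def nearest(query: np.ndarray) -> str:
    # Cosine similarity against the known concepts.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(concepts, key=lambda n: cos(concepts[n], query))

combo = combine("sort", "reverse")   # e.g. a "sort descending"-like mixture
print(nearest(combo))                # typically lands nearest one of its ingredients
```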
The logical next step to improvement is test-time training on the growing CoT: use reinforcement fine-tuning to compress and organize the chain-of-thought into parameter space -- if we can come up with loss functions for "no progress, a little progress, a lot of progress". Then run more inference with a better understanding of the problem, rinse and repeat.
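A rough sketch of that rinse-and-repeat loop, with toy stand-ins for the model, the chain-of-thought, and the progress signal. The progress loss itself is exactly the open problem noted above, so this only shows the control flow, not a working method.

```python
# Toy control flow: generate a CoT, score its progress, update, decode again.
import random

def generate_cot(model_temp: float, problem: str) -> str:
    # Stand-in for an inference pass that grows a chain-of-thought.
    steps = random.randint(1, 5)
    return " -> ".join(f"step{i}" for i in range(steps))

def progress_score(cot: str) -> float:
    # Stand-in for a "no / little / a lot of progress" signal in [0, 1].
    return len(cot.split(" -> ")) / 5.0

def fine_tune_step(model_temp: float, reward: float) -> float:
    # Stand-in for an RL-style update that "compresses" the trace into parameters;
    # here it just sharpens sampling as the reward grows.
    return max(0.1, model_temp - 0.1 * reward)

model_temp, best = 1.0, ("", 0.0)
for _ in range(4):                      # rinse and repeat
    cot = generate_cot(model_temp, "toy problem")
    score = progress_score(cot)
    if score > best[1]:
        best = (cot, score)
    model_temp = fine_tune_step(model_temp, score)
print(best)
```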