> LLMs cannot offer that promise by design, so it remains your job to find and fix any deviations from the abstraction you intended.
LLMs are clumsy, very leaky interns right now. But we know human experts can be leak-proof, so why can't LLMs get there too: better at coding, at understanding your intentions, at automatically reviewing for deviations, and so on?
Thought experiment: could you work well with a team of human experts just below your level? If so, you should be able to work well with future LLMs.
> 6/N In our evaluation, the model solved 5 of the 6 problems on the 2025 IMO. For each problem, three former IMO medalists independently graded the model’s submitted proof, with scores finalized after unanimous consensus. The model earned 35/42 points in total, enough for gold!
That means Google DeepMind has the first OFFICIAL IMO Gold.
> We've now been given permission to share our results and are pleased to have been part of the inaugural cohort to have our model results officially graded and certified by IMO coordinators and experts, receiving the first official gold-level performance grading for an AI system!
The goal here is not to replace transformers but to combine them with RNNs, so you get both good short-term memory (self-attention) and much improved long-term memory (the ATLAS recurrent memory).
"Empirically, our models—OmegaNet, Atlas, DeepTransformers,
and Dot—achieve consistent improvements over Transformers and recent RNN variants across diverse benchmarks."
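To make the short-term vs. long-term split concrete, here is a minimal PyTorch sketch of a hybrid block, assuming a sliding-window setup where self-attention handles the current window and a small recurrent cell carries state across windows. This is an illustrative toy, not the paper's actual Atlas/DeepTransformers architecture; the module choices and sizes are my assumptions.

```python
# Toy hybrid block: attention for short-term memory within a window,
# a recurrent cell for long-term memory carried across windows.
import torch
import torch.nn as nn

class HybridMemoryBlock(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # Short-term memory: standard self-attention over the current window.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Long-term memory: a GRU cell whose hidden state persists across windows.
        self.memory = nn.GRUCell(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, mem: torch.Tensor):
        # x: (batch, window_len, d_model); mem: (batch, d_model)
        attn_out, _ = self.attn(x, x, x)        # local context via attention
        pooled = attn_out.mean(dim=1)           # summarize the window
        mem = self.memory(pooled, mem)          # update the recurrent memory
        # Broadcast the long-term memory back onto every position in the window.
        out = self.norm(attn_out + mem.unsqueeze(1))
        return out, mem

# Usage: process a long sequence window by window, carrying `mem` forward.
block = HybridMemoryBlock()
mem = torch.zeros(2, 64)
for window in torch.randn(10, 2, 16, 64):       # 10 windows of length 16
    out, mem = block(window, mem)
```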
100% agreed with your experience: AI provides little value in one's own area of deep expertise (10+ years). It's the context length -- the AI would need comparable training or inference-time cycles to catch up.
> The new paper used many of the techniques incorrectly, says Nedergaard, who says she plans to elaborate on her critiques in her submission to Nature Neuroscience. Injecting straight into the brain, for example, requires more control animals than Franks and his colleagues used, to check for glial scarring and to verify that the amount of dye being injected actually reaches the tissue, she says. The cannula should have been clamped for 30 minutes after fluid injection to ensure there was no backflow, she adds, and the animals in the sleep groups are a model of sleep recovery following five hours of sleep deprivation, not natural sleep—a difference she calls “misleading.”
> “They are unaware of so many basic flaws in the experimental setup that they have,” she says.
> More broadly, measurements taken within the brain cannot demonstrate brain clearance, Nedergaard says. “The idea is, if you have a garbage can and you move it from your kitchen to your garage, you don’t get clean.”
> There are no glymphatic pathways, Nedergaard says, that carry fluid from the injection site deep in the brain to the frontal cortex where the optical measurements occurred. White-matter tracts likely separate the two regions, she adds. “Why would waste go that way?”
Seems to me the o3 prices are what the consumer pays, not what OpenAI pays. That would mean o3 could be more cost-efficient run in-house than paying subject-matter experts.
For every consumer there will be a period where they need both the SME and the o3 model: initial calibration, then an eventual handoff, before they actually capture those efficiencies in whichever processes they want to automate.
In other words, if you are diligent, you should validate your o3 solution with an actual expert for some time. You wouldn't just blindly trust OpenAI with your business-critical processes, would you? I would expect at least 3-6 months for large corps, and even more once you factor in change management, re-upskilling, etc.
With all those considerations I really don't see the value prop at those prices and in those situations right now. Maybe if costs decrease another ~1-3 orders of magnitude for o3-low, depending on the processes being automated.
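A back-of-envelope sketch of the calibration-period argument. Every number here (token volume, token price, SME rate, task counts, calibration length) is a made-up placeholder, not real o3 pricing or salary data; plug in your own.

```python
# All figures are hypothetical placeholders for illustration only.
TOKENS_PER_TASK = 200_000      # assumed tokens consumed per automated task
PRICE_PER_MTOKEN = 40.0        # assumed $/million tokens for a high-effort model
SME_HOURLY = 150.0             # assumed fully loaded SME cost per hour
SME_HOURS_PER_TASK = 2.0       # assumed SME hours on the same task
CALIBRATION_MONTHS = 6         # validation/handoff period argued for above
TASKS_PER_MONTH = 100

model_cost_per_task = TOKENS_PER_TASK / 1e6 * PRICE_PER_MTOKEN
sme_cost_per_task = SME_HOURLY * SME_HOURS_PER_TASK

# During calibration you pay for both; only afterwards do the savings start.
calibration_cost = CALIBRATION_MONTHS * TASKS_PER_MONTH * (
    model_cost_per_task + sme_cost_per_task
)
monthly_savings_after = TASKS_PER_MONTH * (sme_cost_per_task - model_cost_per_task)
breakeven_months = calibration_cost / monthly_savings_after

print(f"model ${model_cost_per_task:.2f}/task vs SME ${sme_cost_per_task:.2f}/task")
print(f"calibration overhead ${calibration_cost:,.0f}, "
      f"recouped after ~{breakeven_months:.1f} more months")
```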
I always get the feeling he's subconsciously inserting a "magical" step here with reference to "synthesis" -- invoking a kind of subtle dualism where human intelligence is just different and mysteriously better than hardware intelligence.
Combining programs should be straightforward for DNNs: ordering, mixing, and matching concepts by coordinates and arithmetic in a learned high-dimensional embedding space. Inference-time combination is harder, since the model is working with tokens and has to keep coherence over a growing CoT with many twists, turns, and dead ends, but with enough passes it can still do well.
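On the first point, a toy illustration of what "combining concepts by coordinates and arithmetic in embedding space" could look like. The vectors and concept names are made up; this is not a claim about how any particular model composes programs internally.

```python
# Toy concept arithmetic in a made-up embedding space.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
# Hypothetical learned embeddings for a few "program" concepts.
concepts = {name: rng.normal(size=dim) for name in ["sort", "reverse", "filter_even"]}

def combine(*names: str, weights=None) -> np.ndarray:
    # Weighted sum in embedding space as a stand-in for concept composition.
    vecs = [concepts[n] for n in names]
    w = weights or [1.0 / len(vecs)] * len(vecs)
    return sum(wi * v for wi, v in zip(w, vecs))

def nearest(query: np.ndarray) -> str:
    # Cosine similarity against the known concepts.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(concepts, key=lambda n: cos(concepts[n], query))

combo = combine("sort", "reverse")   # e.g. a "sort descending"-like mixture
print(nearest(combo))                # typically lands nearest one of its ingredients
```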
The logical next step to improvement is test-time training on the growing CoT: use reinforcement fine-tuning to compress and organize the chain-of-thought into parameter space -- if we can come up with loss functions for "no progress, a little progress, a lot of progress". Then run more inference with a better understanding of the problem, rinse and repeat.
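A rough sketch of that rinse-and-repeat loop, with toy stand-ins for the model, the chain-of-thought, and the progress signal. The progress loss itself is exactly the open problem noted above, so this only shows the control flow, not a working method.

```python
# Toy control flow: generate a CoT, score its progress, update, decode again.
import random

def generate_cot(model_temp: float, problem: str) -> str:
    # Stand-in for an inference pass that grows a chain-of-thought.
    steps = random.randint(1, 5)
    return " -> ".join(f"step{i}" for i in range(steps))

def progress_score(cot: str) -> float:
    # Stand-in for a "no / little / a lot of progress" signal in [0, 1].
    return len(cot.split(" -> ")) / 5.0

def fine_tune_step(model_temp: float, reward: float) -> float:
    # Stand-in for an RL-style update that "compresses" the trace into parameters;
    # here it just sharpens sampling as the reward grows.
    return max(0.1, model_temp - 0.1 * reward)

model_temp, best = 1.0, ("", 0.0)
for _ in range(4):                      # rinse and repeat
    cot = generate_cot(model_temp, "toy problem")
    score = progress_score(cot)
    if score > best[1]:
        best = (cot, score)
    model_temp = fine_tune_step(model_temp, score)
print(best)
```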