
When you put an LLM in reasoning mode, it essentially holds a conversation with itself. This mimics an inner monologue.

That conversation is held in text, not in any internal representation. That text is called the reasoning trace. You can then analyse that trace.


Unless things have changed drastically in the last 4 months (the last time I looked), those traces are not stored but reconstructed when asked - which is still the same problem.

They aren't necessarily "stored" but they are part of the response content. They are referred to as reasoning or thinking blocks. The big 3 model makers all have this in their APIs, typically in an encrypted form.

Reconstruction of reasoning from scratch can happen in some legacy APIs like the OpenAI chat completions API, which doesn't support passing reasoning blocks around. They specifically recommend folks use their newer Responses API to improve both accuracy and latency (reusing existing reasoning).
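
A hedged sketch of what that reuse looks like with the OpenAI Python SDK (the model name and prompts are placeholders, and exact parameter names may differ by SDK version):

    from openai import OpenAI

    client = OpenAI()

    # First turn: the model produces reasoning alongside its visible answer.
    first = client.responses.create(model="o3", input="Plan the refactor.")

    # Chaining via previous_response_id lets the server reuse the stored
    # reasoning from the first turn instead of reconstructing it.
    second = client.responses.create(
        model="o3",
        input="Now apply step one.",
        previous_response_id=first.id,
    )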


For a typical coding agent, there are intermediate tool call outputs and LLM commentary produced while it works on a task and passed to the LLM as context for follow-up requests. (Hence the term agent: it is an LLM call in a loop.) You can easily see this with e.g. Claude Code, as it keeps track of how much space is left in the context and requires "context compaction" after the context gradually fills up over the course of a session.
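
A minimal sketch of that loop, where call_llm and run_tool are hypothetical stand-ins rather than any real SDK:

    # Hypothetical agent skeleton: an LLM call in a loop. Tool results and
    # the model's own commentary are appended to the context, so later
    # calls (and later inspection) can see the full trace.
    def run_agent(task, call_llm, run_tool):
        context = [{"role": "user", "content": task}]
        while True:
            reply = call_llm(context)            # one LLM call
            context.append(reply)                # commentary stays in context
            if reply.get("tool_call") is None:   # nothing left to do
                return reply["content"]
            result = run_tool(reply["tool_call"])
            context.append({"role": "tool", "content": result})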

In this regard, the reasoning trace of an agent is trivially accessible to clients, unlike the reasoning trace of an individual LLM API call; it's a higher level of abstraction. Indeed, I implemented an agent just the other day which took advantage of this. The OP that you originally replied to was discussing an agentic coding process, not an individual LLM API call.


Well, right. I see those reasoning stages in reasoning models with Ollama, and if you ask it after the fact what its reasoning was, what it says is different from what it said at the time.

I can't speak to your specific setup, but it sounds like you're halfway there if you can access the previous traces? All anyone can ask for is "show me the traces that led up to this point"; the "why did you do this" is a notational convenience for querying that data. If your setup isn't summarizing those traces correctly, that sounds like a specific bug in the context handling or model quality, but the point is that the traces exist and are queryable in the first place, however you choose to do that.

(I am still primarily talking about agent traces, like the original OP, not internal reasoning blocks for a particular LLM call, though - which may or may not be available in context afterwards.)

In particular, asking "why" isn't a category error here, although there's only a meaningful answer if the model has access to the previous traces in its context, which is sometimes true and sometimes not.


If you want to be pedantic about it, you could phrase it as follows.

When the LLM was in reasoning mode, it often expressed statement X in the reasoning context. Given that, and the relevance of statement X to the action taken, it seems likely that the presence of statement X in the context contributed to this action. Besides, the presence of statement X in the reasoning likely means that, given the previous context, the embeddings of X are close to the context.

Hence we think that the action was taken due to statement X.

And that output could have come from an LLM introspecting its own reasoning.

I don't think that phrasing things so pedantically is worth the extra precision though. Especially not for the statement that inspecting the reasoning logs of an LLM can help give insight into why it acted a certain way.


This seems great in concept, but totally infeasible. Still, if anyone can do it, Unicode seems like a great candidate.

Does anyone have reason for more optimism?


Care to explain why you think it's infeasible? Then one could provide targeted counter-optimism ;)

I don't see what's infeasible about it. It doesn't seem too different from .po files (gettext catalogs) meshed with hooks for post-processing as you would see in e.g. Handlebars, both of which have individually found great adoption.
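
A rough Python sketch of that combination (the catalog contents and the translate helper are invented for illustration):

    from string import Template

    # A gettext-style lookup (catalog faked with a dict) followed by a
    # handlebars-like post-processing hook that fills in parameters.
    catalog = {"greeting": "Hello, $name!"}

    def translate(key, **params):
        msg = catalog.get(key, key)              # .po-style message lookup
        return Template(msg).substitute(params)  # post-processing hook

    print(translate("greeting", name="world"))   # Hello, world!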


> why you think it's infeasible?

GP based his opinion on the assumption that this spec is new and no implementations of it exist.


ICU4C and ICU4J have implementations. We also have a JS polyfill and will be working on ICU4X impl this quarter.

The Unicode Consortium already manages a ton of language specs. If there's any group of folks I'd trust to understand languages (natural or otherwise), it's them.

This is the one. Think of all the "misconceptions developers have about X" lists; I trust Unicode to have encountered (if not written) all of them. The people behind Unicode are thorough.

I mean they have hieroglyphs, some of which have plurals: https://www.unicode.org/charts/nameslist/n_13000.html


I've been using this format for almost 10 years, and I only see increasing adoption. Why would I be pessimistic?

I recall similar advice around Mylar heat blankets. Perhaps those got mixed up?

Feels quite power-law-like to me, but checking roughly, it seems to decay more quickly than a power law, though with a fatter tail than an exponential - at least in the top 1000.

The top 3, with 4,000,000 words, have about 20 times as many words as the 0.14th percentile (at rank 1000), with 200,000 words.

In between (at rank 500) you're at about 450,000 words, so it's not a true power law: a power-law fit through ranks 3 and 500 would put rank 1000 at roughly 335,000 words, well above the observed 200,000, while a constant drop of a factor of 9 per 500 ranks (i.e. exponential decay) would suggest only about 50,000 words.
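
A quick sanity check, using the rough figures above (fitting a true power law f(r) = C * r^-a through the rank-3 and rank-500 points):

    import math

    # Fit f(r) = C * r^-a through (rank 3, 4,000,000) and (rank 500, 450,000).
    a = math.log(4_000_000 / 450_000) / math.log(500 / 3)  # ~0.43
    predicted_1000 = 450_000 * (1000 / 500) ** -a          # ~335,000 words

    # Observed rank 1000 is ~200,000: below the power-law prediction
    # (faster decay) but well above the exponential one (~50,000 words).
    print(a, round(predicted_1000))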


So for the ISS at 20 °C you'd get 481 W/m^2, so you'd only need 2.3 m^2. Comparing the ISS at 20 °C to space datacenters at 70 °C, you get an improvement of 63%. Nice, but it doesn't feel game-changing.

The power radiated scales as T^4, but 70 °C is only about 17.1% warmer than 20 °C because you need to compare in kelvin.


>The power radiated scales as T^4, but 70 °C is only about 17.1% warmer than 20 °C because you need to compare in kelvin.

17% in T^4 is almost 2x: plugging 293 (in kelvin, of course) into the calculator I get 417 W/m^2 vs. the 784 W/m^2 I got earlier for 343 (kelvin for the 70 °C).
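
For reference, both numbers drop straight out of the Stefan-Boltzmann law, assuming an ideal blackbody (real radiators have emissivity below 1):

    SIGMA = 5.67e-8  # Stefan-Boltzmann constant, W / (m^2 * K^4)

    for celsius in (20, 70):
        kelvin = celsius + 273
        # ~418 W/m^2 at 20 °C and ~785 W/m^2 at 70 °C; their ratio is
        # (343/293)^4 ~= 1.88, the "almost 2x" above.
        print(celsius, round(SIGMA * kelvin ** 4))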

The ISS targets rejecting 70 kW and has something like 140 m^2 of radiators. These radiators are attached to the ISS and use a lot of plumbing to carry the cooling liquid.

Whereas here the GPUs and everything can be attached directly to the radiators and solar panels. So 70 kW - 70 GPUs - can be placed right onto a 10 m by 10 m radiator panel. In front of those GPUs sitting on that radiator: a 15 m by 20 m solar panel assembly. The whole thing is less than 1 ton, and between $10K and $100K to launch on Starship.


Why does it matter if people have to sell their shares to unlock value? Is it just the friction of small orders?

Buybacks for manipulating share prices and earnings per share are indeed silly. But they should also be trivial to compensate for by normalising on market cap instead of a single share.
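
A toy illustration of the per-share distortion (all numbers invented):

    # Total earnings are flat; a buyback that retires 10 of 100 shares
    # still lifts earnings per share, which headline comparisons pick up.
    earnings = 100.0
    for shares in (100.0, 90.0):   # before / after the buyback
        print(shares, round(earnings / shares, 2))  # EPS: 1.0 -> 1.11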


A buyback is almost the same as a dividend, with minor differences around tax and effects on derivative pricing.

And ASML has been paying out a dividend for a long time.


Just say "we are storing your keys on our servers so you won't lose them" and follow that with either "do you trust us" or even "we will share this key with law enforcement if compelled". Would be fine. Let people make these decisions.

Besides, BitLocker keys are really quite hard to lose.


I find that the LLMs are good at the 'glue code'. The "here's a rather simple CRUD-like program, please tie all of the important things together in the right way". That was always a rather important and challenging bit of work, so having LLMs take it off our hands is valuable.

But for the code where the hard part isn't making separately designed things work together but getting the actual algorithm right - that's where I find LLMs still really fail. Finding the trick that takes your approach from quadratic to N log N, or even just understanding what you mean once you've found the trick yourself: I've had little luck there with LLMs.
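
A toy example of the kind of trick meant (not from the thread): the smallest gap between any two numbers in a list.

    # Naive version compares every pair: O(n^2).
    def min_gap_naive(xs):
        return min(abs(a - b) for i, a in enumerate(xs) for b in xs[i + 1:])

    # After sorting, only adjacent elements can be closest: O(n log n).
    def min_gap_sorted(xs):
        ys = sorted(xs)
        return min(b - a for a, b in zip(ys, ys[1:]))

    assert min_gap_naive([9, 1, 5, 3]) == min_gap_sorted([9, 1, 5, 3]) == 2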

I think this is mostly great, because it's the hard stuff that I have always found fun. Properly architecting these CRUD apps, and learning which of the infinite set of ways to do it are better, was fun as a matter of craftsmanship. But that hits at a different level from implementing a cool new algorithm.


