Last year we wanted IT to confirm that Copilot Agent hadn't exfiltrated data, and we couldn't get logs of its website usage without raising a ticket with Microsoft. Maybe this has changed, maybe our IT people are bad, but I for one wasn't impressed.
I tried Copilot Agent once, and it claimed to have accessed a website that should have been blocked by corporate firewalls and uploaded a bunch of proprietary data, with lots of very specific detail about how it clicked particular buttons on the website, etc.
We raised a high-priority ticket with MS, and it turned out that Copilot Agent had lied about the entire thing, because the website was in fact blocked. It completely made it up.
The fact that we are supposed to use Copilot Agent for open-ended "research" is mind-boggling.
Buybacks push the stock price up and are, in theory, indistinguishable from dividends; in practice they are better than dividends because of taxation.
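The theoretical equivalence is easy to check with toy numbers (all figures below are made up for illustration; taxes and signaling effects are ignored):

```python
# Toy firm: $1000 of value, 100 shares, $100 of cash to return to holders.
# Numbers are illustrative; taxes and market reactions are ignored.
firm_value, shares, payout = 1000.0, 100, 100.0
price = firm_value / shares  # $10/share before the payout

# Dividend: every share gets $1 of cash, and the price drops by $1.
div_price = (firm_value - payout) / shares      # $9 ex-dividend price
div_holder_value = div_price + payout / shares  # $9 share + $1 cash = $10

# Buyback: the firm retires $100 worth of shares at the $10 price.
retired = payout / price                                # 10 shares retired
buy_price = (firm_value - payout) / (shares - retired)  # $900 / 90 = $10
buy_holder_value = buy_price  # the holder keeps one $10 share

# Pre-tax, a shareholder ends up with $10 of value either way; with a
# buyback they can sell a fraction of the position to replicate the cash.
```

The tax difference is that the dividend is taxed when it is paid, while the buyback holder defers tax until they choose to sell.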
The problem I have with that logic is that it still doesn't give any sensible reason why the stock should have economic value at all. If the point is that the company will pay for it at some point, it makes more sense for it to be a loan than a unit of stock. I stand by my claim that selling a non-physical item that does nothing other than hopefully get bought again later, for more than you sold it for, is indistinguishable from a scam.
As of today, though, that doesn't work. Even straightforward tasks that are perfectly spec'd can't be reliably done by agents, at least in my experience.
I recently used Claude for a refactor. I had an exact list of call sites, with positions, etc. The model had to add .foo to a bunch of builders that were either at that position or slightly before (the recorded position pointed at .result() or whatever). I gave it the file and the instruction, and it mostly did it, but it also took the opportunity to "fix" similar builders near the ones I specified.
And that was after iterating a few times on the prompt (the first time it refused because it was too much work, the second time it tried to do it via regex, etc.).
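For contrast, the mechanical version of that refactor is a few lines of script. This is a hypothetical sketch, not my actual code: `.foo()` / `.result()` are just the shape of the task, and the call-site format is assumed to be 1-based line numbers.

```python
def add_foo(lines, call_sites):
    """lines: source lines; call_sites: 1-based line numbers of .result() calls.
    Inserts .foo() before .result() on the given line, or a couple of
    lines above it (positions can point slightly past the builder)."""
    out = list(lines)
    for ln in call_sites:
        # Scan the given line, then up to two lines above it.
        for i in range(ln - 1, max(ln - 4, -1), -1):
            if ".result()" in out[i]:
                out[i] = out[i].replace(".result()", ".foo().result()", 1)
                break
    return out

src = ["builder().a()", "    .b()", "    .result()"]
assert add_foo(src, [3]) == ["builder().a()", "    .b()", "    .foo().result()"]
```

Which is exactly why it's so frustrating when the model improvises on a task this deterministic.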
> An operation that propagates a NaN operand to its result and has a single NaN as an input should produce a NaN with the payload of the input NaN if representable in the destination format.
> If two or more inputs are NaN, then the payload of the resulting NaN should be identical to the payload of one of the input NaNs if representable in the destination format. This standard does not specify which of the input NaNs will provide the payload.
As the comment below notes, the word "should" means the behavior is recommended, but not required. And there are indeed platforms that do not implement the recommendation.
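You can poke at this from Python by building quiet NaNs with chosen payloads via `struct`. A sketch, assuming IEEE 754 binary64 doubles (the bit masks below come from that format):

```python
import math
import struct

QNAN_BITS = 0x7FF8000000000000     # exponent all ones + quiet bit set
PAYLOAD_MASK = 0x0007FFFFFFFFFFFF  # low 51 bits of the significand

def qnan_with_payload(payload):
    """Build a quiet NaN carrying `payload` in its significand bits."""
    bits = QNAN_BITS | (payload & PAYLOAD_MASK)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def payload_of(x):
    """Extract the payload bits of a double."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & PAYLOAD_MASK

a = qnan_with_payload(0x111)
b = qnan_with_payload(0x222)
r = a + b
assert math.isnan(r)
# Which payload (if either) survives the addition is exactly the part
# the standard leaves as a recommendation; hex(payload_of(r)) varies
# by platform, so no value is asserted here.
```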
I work in finance and we have production Excel spreadsheets. Those spreadsheets are versioned like code artifacts, with automated testing and everything. Converting them into real applications is a major part of the technology division's work.
They usually happen because some new and exciting line of business is started by a small team as a POC. Those teams don't get full technology backing; it would slow down the early iteration and cost a lot of money for an idea that may not be lucrative. Eventually they make a lot of money, and by then risk controls basically require them to document every single change they make in Excel. That eventually sucks enough that they complain and get a tech team to convert the spreadsheet.
My experience is that they are an exception rather than the rule, and many more businesses have sheets that tend further toward Heath Robinson than would be admitted in public.
According to that OpenAI paper, models hallucinate in part because they are optimized for benchmarks that reward guessing. If you make a model that refuses to answer when unsure, you will not get SOTA performance on existing benchmarks, and everyone will discount your work. If you create a new benchmark that penalizes guessing, everyone will think you are just creating benchmarks that advantage yourself.
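The incentive is simple expected-value arithmetic (illustrative numbers, not taken from the paper):

```python
# Expected score for a model that is right with probability p on a
# question it is unsure about, under two scoring rules.

def expected_scores(p, wrong_penalty):
    guess = p * 1.0 - (1 - p) * wrong_penalty  # +1 if right, -penalty if wrong
    abstain = 0.0                              # "I don't know" scores zero
    return guess, abstain

# Accuracy-style scoring (no penalty): guessing beats abstaining
# even when the model is right only 20% of the time.
guess, abstain = expected_scores(p=0.2, wrong_penalty=0.0)
assert guess > abstain

# Scoring that penalizes wrong answers at -1: below p = 0.5 abstaining
# wins, so a model tuned for this benchmark learns to say "I don't know".
guess, abstain = expected_scores(p=0.2, wrong_penalty=1.0)
assert guess < abstain
```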
That is such a cop-out. If there were a really good benchmark for getting rid of hallucinations, it would be included in every eval comparison graph.
The real reason is that every benchmark I've seen shows Anthropic with lower hallucination rates.
The captcha would have to be something really boring and repetitive, like: on every click you have to translate a word from one of ten languages into English and then make a bullet list of what it means.