Hacker News | xmcqdpt2's comments

Last year we wanted IT to confirm that Copilot Agent hadn't exfiltrated data and we couldn't get logs for its website usage without raising a ticket to Microsoft. Maybe this changed, maybe our IT people are bad, but I for one wasn't impressed.

I tried Copilot Agent once, and it claimed that it had accessed a website that should have been blocked by our corporate firewall and uploaded a bunch of proprietary data. It gave lots of very specific detail about how it had clicked particular buttons on the website, etc.

We raised a high-priority ticket with MS, and it turned out that Copilot Agent had lied about the entire thing: the website was in fact blocked. It completely made it up.

The fact that we are supposed to use Copilot Agent for open-ended "research" is mind-boggling.


How did it know about the buttons? Or were they so generic that it could hallucinate them as well?

I wonder if the site you mentioned was earlier harvested through some firewall hole during Copilot's training.


It must have either pulled the website's docs or already known about them.

Copilot uses the Bing search index to access public content. Your corporate firewall is irrelevant.

Turns out that's not true, at least where I work. IT / Microsoft confirmed that all Copilot traffic goes through our corporate firewall.


Buybacks lead to stock price increases and are indistinguishable from dividends in theory, and in practice they are better than dividends because of taxation.
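A toy pre-tax illustration of the equivalence claim (all numbers made up): a firm returning $100 to shareholders leaves each remaining holder with the same $10 of value whether it pays a dividend or buys back shares.

```python
# Toy comparison of a $100 payout as a dividend vs. a buyback,
# ignoring taxes and market frictions. All numbers are hypothetical.

shares = 100          # shares outstanding
price = 10.0          # price per share
payout = 100.0        # total cash returned to shareholders

# Dividend: every share gets $1; the price drops by that amount ex-dividend.
div_per_share = payout / shares                           # 1.0
value_dividend = (price - div_per_share) + div_per_share  # share + cash = 10.0

# Buyback: the firm repurchases $100 of stock at $10/share.
bought = payout / price                          # 10 shares retired
equity_after = shares * price - payout           # 900.0
price_after = equity_after / (shares - bought)   # 900 / 90 = 10.0

# A holder who sells gets $10 cash; one who holds keeps a $10 share.
assert value_dividend == price_after == 10.0
```

The taxation point is where the symmetry breaks: in many jurisdictions dividends are taxed on receipt, while a buyback only triggers capital-gains tax for shareholders who choose to sell.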

The problem I have with that logic is that it still doesn't really give any sensible reason for why the stock should have any economic value at all. If the point is that the company will pay for it at some point, it makes more sense for it to be a loan rather than a unit of stock. I stand by my claim that selling a non-physical item that does nothing other than hopefully get bought again later for more than you sold it for is indistinguishable from a scam.

As of today though, that doesn't work. Even straightforward tasks that are perfectly spec-ed can't be reliably done with agents, at least in my experience.

I recently used Claude for a refactor. I had an exact list of call sites, with positions etc. The model had to add .foo to a bunch of builders that were either at that position or slightly before (the code position was for .result() or whatever.) I gave it the file and the instruction, and it mostly did it, but it also took the opportunity to "fix" similar builders near those I specified.

That is after iterating a few times on the prompt (first time it didn't want to do it because it was too much work, second time it tried to do it via regex, etc.)


Syntax errors should be caught by type checking / compiling / linting. That should not take 2-3 hours!

See 6.2.3 in the 2019 standard (IEEE 754-2019).

> 6.2.3 NaN propagation

> An operation that propagates a NaN operand to its result and has a single NaN as an input should produce a NaN with the payload of the input NaN if representable in the destination format.

> If two or more inputs are NaN, then the payload of the resulting NaN should be identical to the payload of one of the input NaNs if representable in the destination format. This standard does not specify which of the input NaNs will provide the payload.


As the comment below notes, the word "should" in standards language means the behavior is recommended but not required. And there are indeed platforms that do not implement the recommendation.
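A quick sketch of the payload-propagation behavior in Python (the helper names are mine, and whether the payload actually survives is exactly the platform-dependent "should" being discussed; on common hardware like x86-64 and aarch64 it does):

```python
import math
import struct

def f64_bits(x: float) -> int:
    """Raw IEEE 754 bit pattern of a double."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def nan_with_payload(payload: int) -> float:
    """Quiet NaN carrying `payload` in its low mantissa bits."""
    return struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000 | payload))[0]

x = nan_with_payload(42)
y = x + 1.0   # single-NaN operation: the payload *should* propagate

assert math.isnan(y)
# On x86-64 / aarch64 the bit pattern is preserved, but the standard
# only says "should", so this is not guaranteed on every platform:
print(hex(f64_bits(y)))   # typically 0x7ff800000000002a
```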

Oh right sorry. That is confusing.

I work in finance and we have prod Excel spreadsheets. Those spreadsheets are versioned like code artifacts, with automated testing and everything. Converting them to real applications is a major part of the technology division's work.

They usually happen because some new and exciting line of business is started by a small team as a POC. Those teams don't get full technology backing; it would slow down early iteration and cost a lot of money for an idea that may not be lucrative. Eventually they make a lot of money, and by then risk controls basically require them to document every single change they make in Excel. This eventually sucks enough that they complain and get a tech team to convert the spreadsheet.


I too have seen such things.

My experience being that they are the exception rather than the rule, and that many more businesses have sheets that tend further toward Heath Robinson than would be admitted in public.

* https://en.wikipedia.org/wiki/W._Heath_Robinson


> Honestly the absolute revolution for me would be if someone managed to make LLM tell "sorry I don't know enough about the topic"

https://arxiv.org/abs/2509.04664

According to that OpenAI paper, models hallucinate in part because they are optimized on benchmarks that involve guessing. If you make a model that refuses to answer when unsure, you will not get SOTA performance on existing benchmarks and everyone will discount your work. If you create a new benchmark that penalizes guessing, everyone will think you are just creating benchmarks that advantage yourself.
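The incentive argument reduces to a small expected-value calculation. Under a typical benchmark (1 point for correct, 0 otherwise), guessing strictly dominates abstaining; only a scoring rule that penalizes wrong answers ever makes "I don't know" the rational output. A toy sketch (the scoring rules and the 30% confidence figure are illustrative, not from the paper):

```python
# Expected score of answering with confidence p, vs. abstaining (score 0).
# Scoring A: +1 correct, 0 wrong        (typical benchmark grading)
# Scoring B: +1 correct, -1 wrong, 0 for "I don't know" (penalizes guessing)

def ev_answer(p: float, wrong_penalty: float) -> float:
    return p * 1.0 + (1.0 - p) * wrong_penalty

p = 0.3  # model is only 30% sure of its answer

# Under typical grading, guessing always beats abstaining:
assert ev_answer(p, 0.0) > 0.0    # 0.3 > 0  -> guess anyway

# Under penalized grading, abstaining wins unless p > 0.5:
assert ev_answer(p, -1.0) < 0.0   # 0.3 - 0.7 = -0.4 -> say "I don't know"
```

So a model trained to maximize scores on the first kind of benchmark learns to always produce a confident-sounding answer.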


That is such a cop-out. If there were a really good benchmark for getting rid of hallucinations, it would be included in every eval comparison graph.

The real reason is that every benchmark I've seen shows Anthropic with lower hallucination rates.


...or they hallucinate because of floating-point issues in parallel execution environments:

https://thinkingmachines.ai/blog/defeating-nondeterminism-in...


Holy perverse incentives, Batman

The captcha would have to be something really boring and repetitive, like: on every click you have to translate a word from one of ten languages into English, then make a bullet list of what it means.

Not entirely different from many human engineers...

Indeed - most of my StackOverflow credit is for explaining TLS config options.

