More

simonw · 2026-02-04T16:39:40 1770223180

If you're happy "speaking to a real person" when you could automate that interaction away somehow then no, digital personal assistants probably aren't something you're going to care about.

I love talking to real people about stuff that matters to them and to me. I don't want to talk to them about booking a flight or hotel room.

mejutoco · 2026-02-04T16:43:40 1770223420

If hotels, or google, or travel websites wanted people to book programmatically they would have an api.Remember when Google search had an api? In the end the human is responsible for the purchase. I think when the dust settles, AI will offer a "do you want to purchase?" and then the human will press the button. Or ChatGPT or somebody controlling the last step will have that button, and services will accept it (like Instagram) because it brings business.

LevGoldstein · 2026-02-04T17:08:52 1770224932

This only lasts until dark patterns can be inserted that disrupt the ease of use that agents are currently providing. If I can't force the end user to watch unskippable ads or trick them into spending money on a service they don't need, what are we even doing?

simonw · 2026-02-04T16:58:04 1770224284

The reason they don't have an API is that they want to upsell you on other stuff, and get paid to promote their partners.

There's going to be a huge fight over how that relates to AI assistants over the next few years.

mejutoco · 2026-02-04T17:46:11 1770227171

I agree fully, and wanted to add: for many of these services, like travel engine comparison sites, running the query itself costs money, so you do not want to make it to easy to search without booking.

rsynnott · 2026-02-05T13:15:43 1770297343

I'm not sure I've ever talked to an actual human about booking a flight or a hotel room, and I'm 40. Airlines and hotels have websites (and then there's booking.com etc of course). They were early adopters, even! Ryanair has had a website where you could book flights for 26 years, and that's a budget airline.

I would never, in a million years, trust an LLM to book a Ryanair flight. I barely trust _myself_ to book one without accidentally buying insurance or something. And booking.com is not much better. If the travel sites are not _already_ embedding adversarial prompts they will be soon. And they'll be good at it, because they've spend the last few decades practicing on humans.

techpression · 2026-02-04T16:59:35 1770224375

The nuances contained in ”booking a flight or hotel room” are plenty, it matters a lot to a lot of people. The industry will probably be very very happy to have bots do it, the amount of extra revenue they will get by taking the tricks made for humans to the next level is going to be substantial.

simonw · 2026-02-04T16:21:17 1770222077

This demo is really impressive: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...

Don't be confused if it says "no microphone", the moment you click the record button it will request browser permission and then start working.

I spoke fast and dropped in some jargon and it got it all right - I said this and it transcribed it exactly right, WebAssembly spelling included:

> Can you tell me about RSS and Atom and the role of CSP headers in browser security, especially if you're using WebAssembly?

tekacs · 2026-02-04T16:41:58 1770223318

Having built with and tried every voice model over the last three years, real time and non-real time... this is off the charts compared to anything I've seen before.

And open weight too! So grateful for this.

drakenot · 2026-02-05T01:19:24 1770254364

This past month Parakeet v3 dropped with a streaming ASR model that is 0.6B params, can run on a CPU and is super good.

tekacs · 2026-02-05T13:37:55 1770298675

I did say all the model. :)

Yes I've tried Parakeet v3 too. For its own purpose - running locally - it's amazing.

The thing that's particularly amazing about this Voxtral model is how incredibly rock solid the accuracy is.

For the longest time previous models have been 'mostly correct' or as people have commented elsewhere on this HN thread, have dropped sentences or lost or added utterances.

I have no affiliation with these folks, but I tried and struggled to get this model to break even speaking as adversariately as I could.

That's a totally different class of model.

meatmanek · 2026-02-05T05:16:36 1770268596

Do you mean https://huggingface.co/nvidia/nemotron-speech-streaming-en-0... ?

drakenot · 2026-02-05T13:27:48 1770298068

Yes. That is it

puttycat · 2026-02-05T08:39:51 1770280791

What's the business plan here?

Oras · 2026-02-04T16:25:22 1770222322

Thank you for the link! Their playground in Mistral does not have a microphone. it just uploads files, which does not demonstrate the speed and accuracy, but the link you shared does.

I tried speaking in 2 languages at once, and it picked it up correctly. Truly impressive for real-time.

druskacik · 2026-02-04T18:11:16 1770228676

According to the announcement blog Le Chat is powered by the new model as well: https://chat.mistral.ai/chat

TacticalCoder · 2026-02-04T23:09:14 1770246554

> Truly impressive for real-time.

Impressive indeed. Works way better than the speech recognition I first got demo'ed in... 1998? I remember you had to "click" on the mic everytime you wanted to speak and, well, not only the transcription was bad, it was so bad that it'd try to interpret the sound of the click as a word.

It was so bad I told several people not to invest in what was back then a national tech darling:

https://en.wikipedia.org/wiki/Lernout_%26_Hauspie

That turned out to be a massive fraud.

But ...

> I tried speaking in 2 languages at once, and it picked it up correctly.

I'm a native french speaker and I tried with a very simple sentence mixing french and english:

"Pour un pistolet je prefere un red dot mais pour une carabine je prefere un ACOG" (aka "For a pistol I prefer a red dot but for a carbine I prefer an ACOG")

And instead I got this:

"Je prépare un redote, mais pour une carabine, je préfère un ACOG."

"Je prépare un redote ..." doesn't mean anything and it's not at all what I said.

I like it, it's impressive, but literally the first sentence I tried it got the first half entirely wrong.

jnaina · 2026-02-05T01:33:10 1770255190

I used sell the Mac Voice Navigator (from Articulate Systems) in the 90s, which was a SCSI based hardware box that you plug into a Mac, Mac SE or Mac II. It used to use the same L&H speech recognition tech (if I recall correctly) and was called the "User Interface" of the future.

Horrible speech recognition rate and very glitchy. Customers hated it, and lots of returns/complaints.

A few years later, L&H went bankrupt. And so did Articulate Systems.

https://applerescueofdenver.com/products-page/macintosh-to-p...

daemonologist · 2026-02-04T16:48:11 1770223691

404 on https://mistralai-voxtral-mini-realtime.hf.space/gradio_api/... for me (which shows up in the UI as a little red error in the top right).

echion · 2026-02-05T03:16:15 1770261375

Same here

skykooler · 2026-02-04T19:24:04 1770233044

Doesn't seem to work for me - tried in both Firefox and Chromium and I can see the waveform when I talk but the transcription just shows "Awaiting audio input".

winrid · 2026-02-05T06:49:45 1770274185

For me it shows the waveform and then "error"

starkgoose · 2026-02-04T20:05:36 1770235536

Try disabling CSP for the page

codethief · 2026-02-04T19:34:33 1770233673

Same here. In Chromium I don't even see the waveform.

fragmede · 2026-02-04T20:01:15 1770235275

I had to turn off ad-block to get it to work.

whimblepop · 2026-02-05T03:43:21 1770263001

I can see the waveform but it still doesn't work for me. Switched to Edge, disabled all adblocking and privacy extensions, built-in tracking prevention, and "enhanced site security" (whatever that is), and still no dice. I'd love to try it and be impressed, but it seems impossible. :(

atoav · 2026-02-05T09:00:27 1770282027

Did you check if your mic even works in principle? E.g. using https://www.onlinemictest.com/

If you don't get sound there it won't work anywhere. A surprising number of problems like these can be solved by selecting the correct audio input source (provided your computer shows more than one).

whimblepop · 2026-02-05T13:26:22 1770297982

Yep. Mic works fine. My mic even works on the test page! What doesn't work is any of the transcription functionality. :(

atoav · 2026-02-06T08:11:22 1770365482

I just bit the bullet and did it via python and the api.

niek_pas · 2026-02-05T06:46:57 1770274017

Same here on iPhone with Arc Search.

jaggederest · 2026-02-04T17:03:00 1770224580

It can transcribe Eminem's Rap God fast sequence, really, really impressive.

rafram · 2026-02-04T17:32:04 1770226324

That's almost certainly in the training data, to be fair.

keeganpoppen · 2026-02-04T18:49:07 1770230947

what a great test hahah

carbocation · 2026-02-04T18:10:06 1770228606

This model was able to transcribe Bad Bunny lyrics over the sound of the background music, played casually from my speakers. Impressive, to me.

elboru · 2026-02-05T04:21:31 1770265291

Wow, so it has surpassed humans.

espadrine · 2026-02-05T17:50:47 1770313847

It is quite impressive.

I have seen the same impressive performance about 7 months ago here: https://kyutai.org/stt

If I look at the architecture of Voxtral 2, it seems to take a page from Kyutai’s delayed stream modeling.

The reason the delay is configurable is that you can delay the stream by a variable number of audio tokens. Each audio token is 80 ms of audio, converted to a spectrogram, fed to a convnet, passed through a transformer audio encoder, and the encoded audio embedding is passed, with a history of 1 audio embedding per 80 ms, into a text transformer, which outputs text embedding, then converted to a text token (which is thus also worth 80ms, but there is a special [STREAMING_PAD] token to skip producing a word).

There is no cross-attention in either Kyutai's STT nor in Voxtral 2, unlike Whisper's encoder-decoder design!

pyprism · 2026-02-04T17:18:52 1770225532

Wow, that’s weird. I tried Bengali, but the text transcribed into Hindi!I know there are some similar words in these languages, but I used pure Bengali that is not similar to Hindi.

derefr · 2026-02-04T17:51:28 1770227488

Well, on the linked page, it mentions "strong transcription performance in 13 languages, including [...] Hindi" but with no mention of Bengali. It probably doesn't know a lick of Bengali, and is just trying to snap your words into the closest language it does know.

keeganpoppen · 2026-02-04T18:49:59 1770230999

it must have some exposure to bengali— just not enough for them to advertise it. otherwise it would have a damn hard time.

sheepscreek · 2026-02-04T18:04:33 1770228273

I’ve been using AquaVoice for real-time transcription for a while now, and it has become a core part of my workflow. It gets everything, jargon, capitalization, everything. Now I’m looking forward to doing that with 100% local inference!

GolDDranks · 2026-02-04T23:59:41 1770249581

I can't get that demo to work. Tried with both Firefox and Chrome.

CamperBob2 · 2026-02-05T04:08:59 1770264539

Same here; the voice waveform animates as expected but the model doesn't do anything when I click on the microphone. It just says "Error" in the upper-right corner.

Also tried downloading and running locally, no luck. Same behavior.

Barbing · 2026-02-04T20:46:26 1770237986

Doesn’t seem to work in Safari on iOS 26.2, iPhone 17 Pro, just about anything extra disabled.

whimblepop · 2026-02-05T03:48:32 1770263312

No long with Firefox or Edge or Chrome on either macOS or Android for me, either. Same issue on all.

darkwater · 2026-02-04T21:02:45 1770238965

It's really nice although I've got a sentence in French when I was speaking Italian but I corrected myself in the middle of a word.

But I'm definitely going to keep an eye on this for local-only TTS for Home Assistant.

rafram · 2026-02-04T17:35:38 1770226538

Not terrible. It missed or mixed up a lot of words when I was speaking quickly (and not enunciating very well), but it does well with normal-paced speech.

timhh · 2026-02-04T22:08:27 1770242907

Yeah it messed up a bit for me too when I didn't enunciate well. If I speak clearly it seems to work very well even with background noise. Remember Dragon Naturally Speaking? Imagine having this back then!

mentalgear · 2026-02-04T21:30:07 1770240607

Here European Multilingual-Intelligence truly shines!

colordrops · 2026-02-04T22:02:04 1770242524

is this demo running fully in the browser?

simonw · 2026-02-04T22:19:13 1770243553

No, it's server-side.

Model is around 7.5 GB - once they get above 4 GB running them in a browser gets quite difficult I believe.

dcl · 2026-02-05T03:34:59 1770262499

Because it's a 4gb download?

subset · 2026-02-05T03:50:32 1770263432

I think that web browsers only allow up to 4GB of memory per tab.

simonw · 2026-02-03T22:21:35 1770157295

A bit odd that this talks about AutoGPT and declares it a failure. Gary quotes himself describing it like this:

> With direct access to the Internet, the ability to write source code and increased powers of automation, this may well have drastic and difficult to predict security consequences.

AutoGPT was a failure, but Claude Code / Codex CLI / the whole category of coding agents fit the above description almost exactly and are effectively AutoGPT done right, and they've been a huge success over the past 12 months.

AutoGPT was way too early - the models weren't ready for it.

lbrito · 2026-02-03T22:44:21 1770158661

>they've been a huge success over the past 12 months

They lose billions of dollars annually.

In what universe is that a business success?

simonw · 2026-02-03T23:04:17 1770159857

Coding agents are successful products which generate billions of dollars of revenue from millions of paying customers.

The organizations that provide them lose money because of the R&D costs involved in staying competitive in the model training arms race.

lbrito · 2026-02-03T23:23:45 1770161025

Revenue isn't profit.

Checking whether Claude Code by itself is profitable or not is probably impossible. It doesn't make a lot of sense divorcing R&D from the product. And obviously the running costs are not insignificant.

The company as a whole loses money.

simonw · 2026-02-03T23:32:31 1770161551

The most important question is whether they make or lose money on each customer, independent of their fixed R&D costs.

If they make money on each customer they have a credible business - they could become profitable even with their existing R&D losses provided they can sign up enough new paying customers.

If they lose money on every customer - such that signing a $1m new enterprise account costs them $1.1m in server costs - then their entire "business" is a sham.

I currently believe that Anthropic make money on almost every customer, such that their business is legit.

I guess we'll have to wait for the IPO paperwork to find out if I'm right about that.

mgh95 · 2026-02-04T05:48:25 1770184105

> The most important question is whether they make or lose money on each customer, independent of their fixed R&D costs.

The ZIRP era called and wants its business strategy back. Half the problem is as frontier models are released free as in free beer models with "good enough" performance pop up. Half the arguments about LLMs are "you're not holding it right", which borders on indicating that it's unable to distinguish between two sufficiently close LLMs.

kridsdale3 · 2026-02-04T00:15:57 1770164157

But humanity is gaining hugely productive (in financial terms) assets. It doesn't matter if the entity or its investors that created the asset goes kaboom.

Most of the investors and companies that built the rail network went bust. The iron remained.

Most of the investors and companies that built the telecom network went bust. The fiber remained.

Most of the investors and companies that are building models will go bust. The files (open weight or transfered to new owners for pennies) will remain, and yield economic benefits for as long as we flow current through them.

cudgy · 2026-02-04T18:42:14 1770230534

Iron and fiber are durable and last for decades. Data centers (where the current flows for inference) consist of hardware that becomes obsolete within 5 to 10 years.

The question is can improvements in the hardware both in cost and performance outpace the increased demands on the LLMs and their future derivatives.

nunez · 2026-02-04T08:27:39 1770193659

Just like rail and fiber, the GiganticCos will own foundational model development (from which oss models originate).

Unlike rail and fiber, these models will continue to threaten multiple industries simultaneously while yielding power back to the GiganticCos.

anonymous908213 · 2026-02-03T22:35:23 1770158123

Have they actually been a huge success, though? You're one of the most active advocates here, so I want to ask you what you make of "the Codex app". More specifically, the fact that it's a shitty Electron app. Is this not a perfect use case for agents? Why can OpenAI, with unlimited agents, not let them loose on the codebase with instructions to replace Electron with an appropriate cross-platform native framework, or even a per-platform native GUI? They said they chose Electron for ease of portability for cross-platform delivery, but they could allocate 1, 10, or 1000 agents to develop a native Linux and native Windows port of the MacOS codebase they started with. This is not even a particularly serious endeavour. I have coded a cross-platform chat application myself with more advanced features than what Codex offers, and chat GUIs are really among the most basic thing you can be doing; practically every consumer-targeted GUI application finds a time when they shove a chat box into a significantly more complex framework.

The conclusion that seems readily apparent to me, as it has always been, is that these "agents" are completely incapable of creating production-grade software suitable for shipping, or even meaningfully modifying existing software for a task like a port. Like the one-shot game they demo'd, they can make impressive proof-of-concepts, but nothing any user would use, nor with a suitable foundation for developers to actually build upon.

bandrami · 2026-02-03T22:51:56 1770159116

"Why isn't there better software available?" is the 900 pound gorilla in the LLM room, but I do think there are enough anecdotes now to hypothesize that what agents seem to be good at is writing software that

1. wasn't economical to write in the first place previously, and

2. doesn't need to be sold to anyone else or maintained over time

So, Brad in logistics previously had to collate scanned manifests with purchase requests once a month, but now he can tell Claw to do it for him.

Which is interesting given the talk of The End of Software Development or whatever because "software that nobody was willing to pay for previously" kind of by definition isn't going to displace a lof of people who make software.

anonymous908213 · 2026-02-03T22:59:30 1770159570

I do agree with this fully. I think LLMs have utility in making the creation of bad software extremely accessible. Bad software that happens to perfectly match some person's super specific need is by no means a bad thing to have in the world. A gap has been filled in creating niche software that previously was not worth paying anyone to create. But every single day we have multiple articles here proclaiming the end of software engineering, and I just don't get how the people hyping this up reconcile their hype with the lack of software being produced by agents that is good enough to replace any of the software people actually pay for.

simonw · 2026-02-03T23:14:20 1770160460

My experience is that coding agents as-of November (GPT-5.2/Opus 4.5) produce high quality, production-worthy code against both small and large projects.

I base this on my own experience with them plus conversations with many other peers who I respect.

You can argue that OpenAI Codex using Electron disproves this if you like. I think it demonstrates a team making the safer choice in a highly competitive race against Anthropic and Google.

If you're wondering why we aren't seeing seismic results from these new tools yet, I'll point out that November was just over 2 months ago and we had the December holiday period in the middle of that.

anonymous908213 · 2026-02-03T23:26:51 1770161211

I'm not sure I buy the safer choice argument. How much of a risk is it to assign a team of "agents" to independently work on porting the code natively? If they fail, it costs a trivial amount of compute relative to OAI's resources. If they succeed, what a PR coup that would be! It seems like they would have nothing to lose by at least trying, but they either did not try, or they did and it failed, neither of which inspires confidence in their supposedly life-changing, world-changing product.

I will note that you specifically said the agents have shown huge success over "the past 12 months", so it feels like the goalposts are growing legs when you say "actually, only for the last two months with Opus 4.5" now.

simonw · 2026-02-03T23:35:44 1770161744

Claude Code was released in February, it just had its 1 year birthday a few days ago.

OpenAI Codex CLI and Gemini CLI followed a few months afterwards

It took a little while for the right set of coding agent features to be developed and for the models to get good enough to use those features effectively.

I think this stuff went from interesting to useful around Sonnet 4, and from useful to "let it write most of my code" with the upgrades in November.

CuriouslyC · 2026-02-04T03:38:26 1770176306

Aider with Gemini 2.5 was way ahead of its time, and with O3 it was best in class until Claude Code with Sonnet 4.

CuriouslyC · 2026-02-04T03:33:16 1770175996

The bottleneck in development is human attention and ability to validate now (https://sibylline.dev/articles/2026-01-27-stop-orchestrating...). OpenAI could unleash the Kraken, but in order to ensure they're releasing good software that works, they still need the eyeball hours and people who can hold the idea of the thing being built in their head and validate against that ideal.

Agents default to creating big balls of mud but it's fairly trivial to use prompting/tools to keep things growing in a more factored, organized way.

simonw · 2026-02-03T19:34:54 1770147294

I'd love to know more about that "help me reply to all the people" one! I definitely need that.

mrkurt · 2026-02-03T20:13:57 1770149637

You will be astonished to know it'a a whole lot of sqlite.

Everything I want to pay attention to gets a token, the server goes and looks for stuff in the api, and seeds local sqlites. If possible, it listens for webhooks to stay fresh.

Mostly the interface is Claude code. I have a web view that gives me some idea of volume, and then I just chat at Claude code to have it see what's going on. It does this by querying and cross referencing sqlite dbs.

I will have claude code send/post a response for me, but I still write them like a meatsack.

It's effectively: long lived HTTP server, sqlite, and then Claude skills for scripts that help it consistently do things based on my awful typing.

simonw · 2026-02-03T18:51:23 1770144683

I got this running locally using llama.cpp from Homebrew and the Unsloth quantized model like this:

  brew upgrade llama.cpp # or brew install if you don't have it yet

Then:

  llama-cli \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --fit on \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40 \
    --jinja

That opened a CLI interface. For a web UI on port 8080 along with an OpenAI chat completions compatible endpoint do this:

  llama-server \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --fit on \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40 \
    --jinja

It's using about 28GB of RAM.

technotony · 2026-02-03T21:03:08 1770152588

what are your impressions?

simonw · 2026-02-03T23:05:28 1770159928

I got Codex CLI running against it and was sadly very unimpressed - it got stuck in a loop running "ls" for some reason when I asked it to create a new file.

CamperBob2 · 2026-02-05T02:47:35 1770259655

You probably have seen it by now, but there was a llama.cpp issue that was fixed earlier today(?) to avoid looping and other sub-par results. Need to update llama-server as well as redownload the GGUFs (for certain quants).

https://old.reddit.com/r/unsloth/comments/1qvt6qy/qwen3coder...

simonw · 2026-02-05T03:42:33 1770262953

I hadn't seen that, thanks very much!

danielhanchen · 2026-02-04T01:00:26 1770166826

Yes sadly that sometimes happens - the issue is Codex CLI / Claude Code were designed for GPT / Claude models specifically, so it'll be hard for OSS models directly to utilize the full spec / tools etc, and might get loops sometimes - I would maybe try the MXFP4_MOE quant to see if it helps, and maybe try Qwen CLI (was planning to make a guide for it as well)

I guess until we see the day OSS models truly utilize Codex / CC very well, then local models will really take off

mhitza · 2026-02-04T14:42:54 1770216174

I would recommend you fiddle with the repeat penalty flags. I use local models often, and almost all I've tried needed that to prevent loops.

I'd also recommend dropping temperature down to 0. Any high temperature value feels like instructing the model "copy this homework from me but don't make it obvious".

nubg · 2026-02-03T21:10:06 1770153006

what's the token per seconds speed?

simonw · 2026-02-03T18:27:48 1770143268

Note that you don't need to use Deno or JavaScript at all to use this product. Here's their Python client SDK: https://pypi.org/project/deno-sandbox/

  from deno_sandbox import DenoDeploy
  
  sdk = DenoDeploy()
  
  with sdk.sandbox.create() as sb:
      # Run a shell command
      process = sb.spawn("echo", args=["Hello from the sandbox!"])
      process.wait()
  
      # Write and read files
      sb.fs.write_text_file("/tmp/example.txt", "Hello, World!")
      content = sb.fs.read_text_file("/tmp/example.txt")
      print(content)

Looks like the API protocol itself uses websockets: https://tools.simonwillison.net/zip-wheel-explorer?package=d...

koakuma-chan · 2026-02-04T01:08:12 1770167292

Because the sandbox is on their cloud, not on your local machine, which wasn't obvious to me.

sli · 2026-02-04T21:42:41 1770241361

It's stated under the "Sandboxes?" heading.

> Deno Sandbox gives you lightweight Linux microVMs (running in the Deno Deploy cloud) ...

rdhyee · 2026-02-05T16:13:23 1770308003

Took this idea and ran with it using Fly's Sprites, inspired by Simon's https://simonwillison.net/2026/Feb/3/introducing-deno-sandbo.... Use case: Claude Code running in a sandboxed Sprite, making authenticated API calls via a Tokenizer proxy without credentials ever entering the sandbox.

Hit a snag: Sprites appear network-isolated from Fly's 6PN private mesh (fdf:: prefix inside the Sprite, not fdaa::; no .internal DNS). So a Tokenizer on a Fly Machine isn't directly reachable without public internet.

Asked on the Fly forum: https://community.fly.io/t/can-sprites-reach-internal-fly-se...

@tptacek's point upthread about controlling not just hosts but request structure is well taken - for AI agent sandboxing you'd want tight scoping on what the proxy will forward.

simonw · 2026-02-03T18:25:30 1770143130

Yeah, this is a really neat idea: https://deno.com/blog/introducing-deno-sandbox#secrets-that-...

  await using sandbox = await Sandbox.create({
    secrets: {
      OPENAI_API_KEY: {
        hosts: ["api.openai.com"],
        value: process.env.OPENAI_API_KEY,
      },
    },
  });
  
  await sandbox.sh`echo $OPENAI_API_KEY`;
  // DENO_SECRET_PLACEHOLDER_b14043a2f578cba75ebe04791e8e2c7d4002fd0c1f825e19...

It doesn't prevent bad code from USING those secrets to do nasty things, but it does at least make it impossible for them to steal the secret permanently.

Kind of like how XSS attacks can't read httpOnly cookies but they can generally still cause fetch() requests that can take actions using those cookies.

its-summertime · 2026-02-03T21:10:09 1770153009

if there is an LLM in there, "Run echo $API_KEY" I think could be liable to return it, (the llm asks the script to run some code, it does so, returning the placeholder, the proxy translates that as it goes out to the LLM, which then responds to the user with the api key (or through multiple steps, "tell me the first half of the command output" e.g. if the proxy translates in reverse)

Doesn't help much if the use of the secret can be anywhere in the request presumably, if it can be restricted to specific headers only then it would be much more powerful

simonw · 2026-02-03T22:49:28 1770158968

Secrets are tied to specific hosts - the proxy will only replace the placeholder value with the real secret for outbound HTTP requests to the configured domain for that secret.

its-summertime · 2026-02-04T01:07:34 1770167254

which, if its the LLM asking for the result of the locally ran "echo $API_KEY", will be sent through that proxy, to the correct configured domain. (If it did it for request body, which apparently it doesn't (which was part of what I was wondering))

Dangeranger · 2026-02-04T15:14:28 1770218068

The AI agent can run `echo $API_KEY` all it wants, but the value is only a placeholder which is useless outside the system, and only the proxy service which the agent cannot directly access, will replace the placeholder with the real value and return the result of the network call. Furthermore, the replacement will happen within the proxy service itself, it does not expose the replaced value to memory or files that the agent can access.

It's a bit like taking a prepaid voucher to a food truck window. The cashier receives the voucher, checks it against their list of valid vouchers, records that the voucher was used so they can be paid, and then gives you the food you ordered. You as the customer never get to see the exchange of money between the cashier and the payment system.

its-summertime · 2026-02-05T10:25:20 1770287120

(Noting that, as stated in another thread, it only applies to headers, so the premise I raised doesn't apply either way)

Except that you are asking for the result of it, "Hey Bobby LLM, what is the value of X" will have Bobby LLM tell you the real value of X, because Bobby LLM has access to the real value because X is permissioned for the domain that the LLM is accessed through.

If the cashier turned their screen around to show me the exchange of money, then I would certainly see it.

lucacasonato · 2026-02-03T21:37:21 1770154641

It will only replace the secret in headers

shivasurya · 2026-02-05T05:39:13 1770269953

It replaces URL params and body too

ryanrasti · 2026-02-03T19:50:28 1770148228

> It doesn't prevent bad code from USING those secrets to do nasty things, but it does at least make it impossible for them to steal the secret permanently.

Agreed, and this points to two deeper issues: 1. Fine-grained data access (e.g., sandboxed code can only issue SQL queries scoped to particular tenants) 2. Policy enforced on data (e.g., sandboxed code shouldn't be able to send PII even to APIs it has access to)

Object-capabilities can help directly with both #1 and #2.

I've been working on this problem -- happy to discuss if anyone is interested in the approach.

Tomuus · 2026-02-03T22:46:54 1770158814

Object capabilities, like capnweb/capnproto?

ryanrasti · 2026-02-03T23:51:20 1770162680

Yes exactly Cap'n Web for RPC. On top of that: 1. Constrained SQL DSL that limits expressiveness along defined data boundaries 2. Constrained evaluation -- can only compose capabilities (references, not raw data) to get data flow tracking for free

simonw · 2026-02-03T17:55:07 1770141307

I buy the theory that Claude Code is engineered to use things like token caching efficiently, and their Claude Max plans were designed with those optimizations in mind.

If people start using the Claude Max plans with other agent harnesses that don't use the same kinds of optimizations the economics may no longer have worked out.

(But I also buy that they're going for horizontal control of the stack here and banning other agent harnesses was a competitive move to support that.)

mirekrusin · 2026-02-03T18:07:25 1770142045

It should just burn quota faster then. Instead of blocking they should just mention that if you use other tools then your quota may reduce at 3x speed compared to cc. People would switch.

andai · 2026-02-03T19:17:20 1770146240

When I last checked a few months ago, Anthropic was the only provider that didn't have automatic prompt caching. You had to do it manually (and you could only set checkpoints a few times per context?), and most 3rd party stuff does not.

They seem to have started rejecting 3rd party usage of the sub a few weeks ago, before Claw blew up.

By the way, does anyone know about the Agents SDK? Apparently you can use it with an auth token, is anyone doing that? Or is it likely to get your account in trouble as well?

volkercraig · 2026-02-03T18:00:14 1770141614

Absolutely. I installed clawdbot for just long enough to send a single message, and it burned through almost a quarter of my session allowance. That was enough for me. Meanwhile I can use CC comfortably for a few hours and I've only hit my token limit a few times.

I've had a similar experience with opencode, but I find that works better with my local models anyway.

andai · 2026-02-03T19:20:24 1770146424

I used it for a few mins and it burned 7M tokens. Wish there was a way to see where it's going!

(There probably is, but I found it very hard to make sense of the UI and how everything works. Hard to change models, no chat history etc.?)

giancarlostoro · 2026-02-04T01:01:48 1770166908

I have a feeling the different harnesses create new context windows instead of using one. The more context windows you open up with Claude the quicker your usage goes poof.

giancarlostoro · 2026-02-03T18:06:58 1770142018

Wow, that is very surprising and alarming. I wish Anthropic would have made a more public statement as to why they blocked other harnesses.

pluralmonad · 2026-02-03T19:29:45 1770146985

I would be surprised if the primary reason for banning third party clients isn't because they are collecting training data via telemetry and analytics in CC. I know CC needlessly connects to google infrastructure, I assume for analytics.

ImprobableTruth · 2026-02-03T18:04:36 1770141876

If that was the real reason, why wouldn't they just make it so that if you don't correctly use caching you use up more of your limit?

simonw · 2026-02-03T16:32:48 1770136368

The decorator syntax is neat but confusing to me - I would need to understand exactly what it's doing in order to trust it.

I'd find this a lot easier to trust it if had the Python code that runs in WASM as an entirely separate Python file, then it would be very clear to me which bits of code run in WASM.

smithclay · 2026-02-03T17:05:59 1770138359

Personally: love the decorator pattern after I got used to it :)

Posted this yesterday as well, but seems like a really nice emerging pythonic way to call out to remote infrastructure (see: Modal[1]).

[1]: https://modal.com/docs/examples/hackernews_alerts#defining-t...

mavdol04 · 2026-02-03T16:50:15 1770137415

Thanks for the feedback! What do you think about running the separate file directly from the decorator?

simonw · 2026-02-03T18:47:54 1770144474

I'd love that. I want to be able to look at the system and 100% understand which code is running directly and which code is running inside the sandbox.

simonw · 2026-02-03T16:30:47 1770136247

You can run libraries like Pandas in WebAssembly in Pyodide - in fact Pandas works already. Here's a demo I built with it a while ago: https://tools.simonwillison.net/pyodide-bar-chart

It's not too hard to compile a C extension for Python to a WebAssembly and bundle that in a .so file in a wheel. I did an experiment with that the other day: https://github.com/simonw/tiny-haversine?tab=readme-ov-file#...

mavdol04 · 2026-02-03T17:28:18 1770139698

I would love for the component model tooling to reach that level of maturity.

Since the runtime uses standard WASI and not Emscripten, we don't have that seamless dynamic linking yet. It will be interesting to see how the WASI path eventually converges with what Pyodide can do today regarding C-extensions.