Heh, I felt the same. I'm a web dev, but I don't want an Electron app. We can do better. I used to write Electron apps because I couldn't build a proper native app. Now I can!
I've been building a native macOS/iOS app that lets me manage my agents: both fully controlling/chatting with them from the app and just monitoring your existing CLI sessions (and/or taking 'em over in the app).
It also has a Rust server backing it, so I can throw it anywhere (container, Pi, etc.) and then connect to it. Here it is, if anyone wants to see it (though I've seen at least 4 other people doing something similar): https://github.com/Robdel12/OrbitDock
I really want to use Google’s models, but they have the classic Google product problem that we all like to complain about.
I am legit scared to log in and use Gemini CLI, because last time I thought I was using my “free” account allowance via Google Workspace. I ended up spending $10 before realizing it was API billing, and the UI was so hard to figure out that I gave up. I’m sure I could spend 20-40 more minutes to sort this out, but ugh, I don’t want to.
With alllll that said... is Gemini 3.1 more agentic now? That’s usually where it failed: very smart and capable models, but hard to apply. Just me?
May be very silly of me, but I avoid using Gemini on my personal Google account. I use it at work, because my employer provides it.
I am scared some automated system may just decide I am doing something bad and terminate my account. I have been moving important things to Proton, but there's some stuff I couldn't change that would cause me a lot of annoyance. It's not trivial to set up an alternative account just for Gemini, because my Google account is basically on every device I use.
I mostly use LLMs as coding assistant, learning assistant, and general queries (e.g.: It helped me set up a server for self hosting), so nothing weird.
For what it's worth, there was an (unfortunately unsuccessful) HN submission from a guy who got his Gemini account banned, apparently without losing his whole Google account: https://news.ycombinator.com/item?id=47007906
Comforting to know that they may ban you from only some of their services, I guess?
I really regret relying so much on my Google account for so long. Untangling myself from it is really hard. Some places treat your email as a login, not simply as a way to contact you. This is doubly concerning for government websites, where setting up a new account may just not be a possibility.
At some point I suppose Gemini will be the only viable option for LLMs, so oh well.
100% agreed. I wish someone would make a test for how reliably the LLMs follow tool-use instructions, etc. The pelicans are nice but not useful for me to judge how well a model will slot into a production stack.
When I first got started with LLMs I read and analyzed benchmarks, looked at what example prompts people used, and so on. But many times a new model does best on the benchmarks, you think it'll be better, and then in real work it completely drops the ball. Since then I've stopped even reading benchmarks; I don't care an iota about them. They always seem more misleading than helpful.
Today I have my own private benchmarks, with tests I run myself and private test cases I refuse to share publicly. I've built these up over the last 12-18 months: whenever I find something my current model struggles with, it becomes a new test case in the benchmark.
Nowadays it's as easy as `just bench $provider $model`: it runs my benchmarks against the model and gives me a score that actually reflects what I use these models for, and it more or less matches my experience of actually using them. I recommend that people who use LLMs for serious work try the same approach and stop relying on public benchmarks, which (seemingly) are all gamed by now.
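For anyone wondering what such a harness could look like: here's a minimal sketch of a private benchmark runner, the kind of thing a `just bench` recipe might wrap. The names (`TestCase`, `run_bench`), the pass/fail scoring, and the stand-in model are my own assumptions, not the commenter's actual setup.

```python
# Hypothetical private-benchmark harness: run each test case against a
# model callable and return the fraction of cases that pass.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    name: str
    prompt: str
    # Checker returns True if the answer is acceptable; keeping checks as
    # code rather than string equality tolerates phrasing drift.
    check: Callable[[str], bool]

def run_bench(model: Callable[[str], str], cases: list[TestCase]) -> float:
    """Score a model on private test cases; returns a value in [0, 1]."""
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)

# Example: two toy cases and a trivial stand-in "model".
cases = [
    TestCase("arith", "What is 2+2? Answer with just the number.",
             lambda a: a.strip() == "4"),
    TestCase("json", "Reply with a JSON object with key 'ok' set to true.",
             lambda a: '"ok"' in a and "true" in a),
]
fake_model = lambda p: "4" if "2+2" in p else '{"ok": true}'
print(run_bench(fake_model, cases))  # 1.0 for this stand-in model
```

In a real setup `model` would be a thin wrapper around each provider's API, which is what makes the score comparable across `$provider $model` pairs.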
Would you be willing to give a rough outline of one or a few test cases? I am having a bit of a hard time imagining what and how you are testing. Is it like "change the signature of function X in file @Y to take parameter Z" and then comparing the result with what you expect?
> For those building with a mix of bash and custom tools, Gemini 3.1 Pro Preview comes with a separate endpoint available via the API called gemini-3.1-pro-preview-customtools. This endpoint is better at prioritizing your custom tools (for example view_file or search_code).
It sounds like there was at least a deliberate attempt to improve it.
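Out of curiosity about what "prioritizing your custom tools" means in practice, here's a sketch of a `generateContent` request body declaring a custom tool like `view_file` against the endpoint name from the quote. The payload shape follows my understanding of the public Gemini REST API; treat the field names as assumptions and check the official docs before relying on them.

```python
# Sketch (not verified against docs) of a request body for the
# custom-tools model variant quoted above.
import json

MODEL = "gemini-3.1-pro-preview-customtools"  # endpoint name from the quote

payload = {
    "contents": [
        {"role": "user", "parts": [{"text": "Show me src/main.rs"}]}
    ],
    "tools": [
        {
            "functionDeclarations": [
                {
                    "name": "view_file",
                    "description": "Read a file from the workspace.",
                    "parameters": {
                        "type": "object",
                        "properties": {"path": {"type": "string"}},
                        "required": ["path"],
                    },
                }
            ]
        }
    ],
}

# You'd POST this (with an API key) to something like:
# https://generativelanguage.googleapis.com/v1beta/models/{MODEL}:generateContent
print(json.dumps(payload)[:40])
```

The interesting question is whether this variant actually calls `view_file` instead of falling back to bash when both could do the job.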
It's absolutely amazing how hostile Google is to releasing billing options that are reasonable, controllable, or even fucking understandable.
I want to do relatively simple things like:
1. Buy shit from you
2. For a controllable amount (e.g., let me pick a limit on costs)
3. Without spending literally HOURS trying to understand 17 different fucking products, all overlapping, with myriad project configs, api keys that should work, then don't actually work, even though the billing links to the same damn api key page, and says it should work.
And frankly - you can't do any of it. No controls (at best delayed alerts). No clear access. No real product differentiation pages. No guides or onboarding pages to simplify the matter. No support. SHIT LOADS of completely incorrect and outdated docs, that link to dead pages, or say incorrect things.
I've used all 3 major providers - AWS, GCP, Azure.
AWS is no gem... it has its own byzantine processes for signing up and paying for things, and it also doesn't support any real, reasonable way to stop spend when you hit limits (an abusive practice).
But at least I can generally sign up for and consume a new service without hours and hours of debugging.
For context: Google's own Gemini 3 utterly fails to figure out how to do something as simple as "access the image doodle feature" proudly marketed here: https://gemini.google/overview/image-generation/
It can't figure out how to do it. Honestly, I still can't figure out how to do it, despite signing up for about 5 different products and trying 4 different UIs. The closest I got was their inpainting/outpainting UI on the legacy models in their image creation studio.
And none of that involved creating a billing account, since I already had one; it was required for 3 of the signups.
As far as I'm concerned, this feature is fake marketing. It doesn't exist. That's the "quality" level of GCP.
Yep, MacRumors is one of many bottom-feeders regurgitating rumors and half-truths from a fatter bottom-feeder, trading Apple hysteria for ad impressions. It's another small signal that the tech industry is now mostly eating itself.
I’m a heavy Claude Code user, and it’s pretty clear they’re starting to bend under their own vibe coding. Each Claude Code update breaks a ton of stuff, has perf issues, etc.
And then this. They want to own your dev workflow, and for some reason believe Claude Code is special enough to be closed source. The React TUI is kind of a nightmare to deal with, I bet.
I will say, very happy with the improvements made to Codex 5.3. I’ve been spending A LOT more time with codex and the entire agent toolchain is OSS.
Not sure what anthropic’s plan is, but I haven’t been a fan of their moves in the past month and a half.
I switched to Codex 5.3 too; it's cheaper anyway, and as dumb as it sounds, Scam Altman is actually the less annoying CEO compared to Amodei, which is kind of an achievement. Amodei is really looking more and more like some huckster, giving these idiotic predictions to the press.
>Of all tyrannies, a tyranny sincerely exercised for the good of its victims may be the most oppressive. It would be better to live under robber barons than under omnipotent moral busybodies. The robber baron's cruelty may sometimes sleep, his cupidity may at some point be satiated; but those who torment us for our own good will torment us without end for they do so with the approval of their own conscience. They may be more likely to go to Heaven yet at the same time likelier to make a Hell of earth. This very kindness stings with intolerable insult. To be "cured" against one's will and cured of states which we may not regard as disease is to be put on a level of those who have not yet reached the age of reason or those who never will; to be classed with infants, imbeciles, and domestic animals.
I'm not sure why I'm getting downvoted, but the VS Code integration really does stink. Often it will simply not send the API request and just say "reconnecting", and I've had the VS Code OpenAI Codex plugin freeze outright while all the other plugins, like Cline or Roo, were working perfectly fine. So the VS Code integration is almost unusable in my experience.
> But my experience is they’re not really even close to the closed paid models.
They are usually as good as a flagship model from 12-18 months ago. That may sound like a massive difference, and in some ways it is, but it's also fairly reasonable: you don't need to live on the bleeding edge.
And it's worth pointing out that Claude Code now dispatches "subagents" from Opus->Sonnet and Opus->Haiku ... all the time, depending on the problem.
Running this thing locally on my Spark with a 4-bit quant, I'm getting 30-35 tokens/sec in opencode, and it doesn't feel any "stupider" than Haiku, that's for sure. Haiku can be dumb as a post. This thing is smarter than that.
It feels somewhere around Sonnet 4 level, and I am finding it genuinely useful even at 4-bit. Though I have paid subscriptions elsewhere, so I doubt I'll actually use it much.
I could see configuring OpenCode somehow to use paid Kimi 2.5 or Gemini for the planning/analysis and compaction, and this for task execution. It seems entirely competent.
I worked for Percy for 4 years. We were “stuck” with ImageMagick for diffing (I suspect they still are). I was able to build my own differ with Claude/LLM help.
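To give a flavor of what a visual differ does at its core, here's a toy pixel-diff sketch: compare two same-sized RGB buffers and report the fraction of pixels whose channels differ beyond a tolerance. This is my own illustrative sketch, not Percy's pipeline or the commenter's actual differ (real ones add anti-aliasing detection, perceptual color distance, region clustering, etc.).

```python
# Toy visual differ: fraction of pixels that changed beyond a per-channel
# tolerance, given two equal-sized lists of (r, g, b) tuples.
def diff_ratio(a, b, tolerance=0):
    """Return the changed-pixel ratio in [0.0, 1.0]."""
    if len(a) != len(b):
        raise ValueError("images must have the same dimensions")
    changed = sum(
        1 for pa, pb in zip(a, b)
        if any(abs(ca - cb) > tolerance for ca, cb in zip(pa, pb))
    )
    return changed / len(a)

# 2x2 "images": one pixel changed out of four.
base = [(255, 255, 255)] * 4
new = [(255, 255, 255)] * 3 + [(250, 10, 10)]
print(diff_ratio(base, new))       # 0.25
print(diff_ratio(base, new, 250))  # 0.0 with a very loose tolerance
```

The tolerance knob is the interesting part in practice: too tight and you flag font anti-aliasing noise, too loose and you miss real regressions.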
I looked at the source. It seems most of the code is not included in the GitHub repo, which itself contains a bit of JS glue. The .tgz uploaded to npm has various prebuilt binaries. Can I take a look at the Rust code?
I'm not trying to imply LLMs aren't useful. I just want more info from GP so that I can evaluate their claims.
Terrible little demo as I work on it right now w/claude: https://i.imgur.com/ght1g3t.mp4
iOS app w/codex: https://i.imgur.com/YNhlu4q.mp4