1. The restriction applies even to writing documentation, adding comments, scanning for bugs, or scanning for security vulnerabilities in systems for fully autonomous weapons. As automated vulnerability discovery gets stronger and stronger, it is critical that we have the ability to mount a strong defense.
2. It is a principled stance that private companies shouldn't be the ones deciding what their tools can and can't be used for in such an important sector.
Async iterables aren't necessarily a great solution either, because of the exact same promise and stack-switching overhead - it can be huge compared to sync iterables.
If you're dealing with small objects on the production side, like individual tag names, attributes, bindings, etc. during SSR, the natural thing to do is to just write() each string. But then you see that performance is terrible compared to sync iterables, and you face a choice:
1. Buffer to produce larger chunks and less stack switching (the exact same thing you need to do with Streams), or
2. Use sync iterables and forgo being able to support async components.
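For option 1, the buffering is just a string accumulator in front of the underlying write(); a minimal sketch (the names and the high-water mark are mine, not from any particular SSR library):

```typescript
// Accumulate small strings and flush them as one large chunk, so the
// consumer pays the stack-switch cost once per buffer instead of once
// per tag name or attribute.
class BufferedWriter {
  private parts: string[] = [];
  private size = 0;

  constructor(
    private flushTo: (chunk: string) => void,
    private highWaterMark = 16 * 1024,
  ) {}

  write(s: string): void {
    this.parts.push(s);
    this.size += s.length;
    if (this.size >= this.highWaterMark) this.flush();
  }

  flush(): void {
    if (this.size === 0) return;
    this.flushTo(this.parts.join(""));
    this.parts = [];
    this.size = 0;
  }
}
```

The trade-off is latency: nothing reaches the consumer until the buffer fills or you flush explicitly.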
The article proposes sync streams to get around some of this, but the problem is that in any traversal of data where some of the data might trigger an async operation, you don't necessarily know ahead of time whether you need a sync or an async stream. It's only when you hit an async component that you need it. What you really want is a way for only the data that needs it to be async.
We faced this problem in Lit-SSR and our solution was to move to sync iterables that can contain thunks. If the producer needs to do something async it sends a thunk, and if the consumer receives a thunk it must call and await the thunk before getting the next value. If the consumer doesn't even support async values (like in a sync renderToString() context) then it can throw if it receives one.
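A minimal sketch of that thunk contract (hypothetical names, not Lit-SSR's actual code): the producer stays a plain sync generator, and only an async component forces the consumer to await:

```typescript
// A chunk is either a ready string or a thunk the consumer must await.
type Chunk = string | (() => Promise<string>);

// Producer: a plain sync generator; only the async part yields a thunk.
function* render(): Generator<Chunk> {
  yield "<div>";
  yield () => Promise.resolve("<async-component></async-component>");
  yield "</div>";
}

// Async consumer: stays synchronous until it actually hits a thunk.
async function renderToStream(): Promise<string> {
  let out = "";
  for (const chunk of render()) {
    out += typeof chunk === "function" ? await chunk() : chunk;
  }
  return out;
}

// Sync consumer: throws if the tree turns out to need async work.
function renderToStringSync(): string {
  let out = "";
  for (const chunk of render()) {
    if (typeof chunk === "function") {
      throw new Error("async value in sync render");
    }
    out += chunk;
  }
  return out;
}
```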
This produced a 12-18x speedup in SSR benchmarks over components extracted from a real-world website.
I don't think a Streams API could adopt such a fragile contract (i.e., if you call next() too soon it will break), but having some way for a consumer to pull as many values as possible in one microtask and then await only when an async value is encountered would be really valuable, IMO. Something like `write()` and `writeAsync()`.
The sad thing here is that generators are really the right shape for a lot of these streaming APIs that work over tree-like data, but generators are far too slow.
You know, now that I look at it, I do think I need to change this code to defend better against multiple eager calls to `next()` when one of them returns a promise. With async generators there's a queue built in, but since I'm using sync generators I need to build that defense myself before this solution is sound in the face of next(); next(). That shouldn't be too hard, though.
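That defense could be a wrapper that serializes eager pulls; a rough sketch (my guess at the shape, not the actual fix): while an async result is in flight, later next() calls chain behind it instead of re-entering the sync generator:

```typescript
type Result<T> = IteratorResult<T>;

type MaybeAsyncIterator<T> = {
  next(): Result<T> | Promise<Result<T>>;
};

// Serialize eager next() calls: while an async result is pending,
// further next() calls queue behind it instead of re-entering the
// underlying sync generator.
function guarded<T>(source: MaybeAsyncIterator<T>): MaybeAsyncIterator<T> {
  let tail: Promise<Result<T>> | null = null;

  const track = (p: Promise<Result<T>>): Promise<Result<T>> => {
    tail = p;
    const clear = () => { if (tail === p) tail = null; };
    p.then(clear, clear);
    return p;
  };

  return {
    next() {
      if (tail !== null) {
        // A prior async pull is in flight; chain behind it.
        return track(tail.then(() => source.next(), () => source.next()));
      }
      const r = source.next();
      return r instanceof Promise ? track(r) : r;
    },
  };
}
```

The fast path is untouched: as long as the source keeps returning plain results, no promise is ever allocated.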
type Stream<T> = {
  next(): { done: boolean; value: T } | Promise<{ done: boolean; value: T }>;
};
Where T=Uint8Array. Sync where possible, async where not.
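A consumer for that shape might look like this sketch (using strings instead of Uint8Array for brevity; `drain` and its names are mine, not a proposed spec API), staying in the current microtask until a pull actually comes back as a promise:

```typescript
type Stream<T> = {
  next(): { done: boolean; value: T } | Promise<{ done: boolean; value: T }>;
};

// Drain the stream, suspending only when a pull is actually async.
async function drain<T>(
  stream: Stream<T>,
  sink: (value: T) => void,
): Promise<void> {
  while (true) {
    let r = stream.next();
    if (r instanceof Promise) r = await r; // suspend only when forced to
    if (r.done) return;
    sink(r.value);
  }
}
```

A run of synchronous results costs no promise scheduling at all; only the genuinely async pulls pay the microtask tax.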
Engineers had a collective freak-out back in 2013 over "Do not unleash Zalgo", a worry about callbacks with different activation patterns. There's wisdom there, for callbacks especially; it's confusing if sometimes the callback fires right away and sometimes it's actually async. https://blog.izs.me/2013/08/designing-apis-for-asynchrony/
And that sort of prohibition has been with us since. It's generally considered not cool to use MaybeAsync<T> = T | Promise<T>, for similar "it's better to be uniform" reasons. We've been so afraid of Zalgo for so long now.
That fear just seems so overblown, and it feels like it hurts us so much that we can't do nice fast things and go async only when we need to.
Regarding pulling multiple values, it really depends, doesn't it? It wouldn't be hard to make a utility function that lets you pull as many as you want, queueing deferrables and allowing one at a time to flow. But I suspect at least some stream sources would be just fine yielding multiple results without waiting; they can internally wait for the previous promise and use that as a cursor.
I wasn't aware that generators were far too slow. It feels like we are using the main bit of the generator interface here, which is good enough.
Yeah I think people took away "It's better to be uniform" since they were trying to block out the memory of much-feared Zalgo, but if you read the article carefully it says in big letters "Avoid Synthetic Deferrals" then goes on to advocate for patterns exactly like MaybeAsync to be used "if the result is usually available right now, and performance matters a lot".
I was so sick of being slapped around by LJHarb who claimed to me again and again that TC39 was honoring the Zalgo post (by slapping synthetic deferrals on everything) that I actually got Isaacs to join the forum and set him straight: https://es.discourse.group/t/for-await-of/2452/5
That's an amazing thread, thanks for posting it! I've wanted `for await?()` for exactly these situations.
I feel like my deep dives into iterator performance are somewhat wasted because I might have made my project faster, but it's borderline dark magic and doesn't scale to the rest of the ecosystem because the language is broken.
Web components are just a way for developers to build their own HTML elements. They're only a "framework" in as much as the browser is already a framework that wires together the built-in HTML elements.
I don't see any reason to lock away the ability to make nodes that participate in the DOM tree to built-in components only. Every other GUI framework in the world allows developers to make their own nodes, why shouldn't the web?
> too many mechanics and assumptions backed in, rendering them unusable for anything slightly complex.
Do you have any concrete examples there? What "mechanics" are you referring to? Given that very complex apps like Photoshop, Reddit, The Internet Archive, YouTube, The Microsoft App Store, Home Assistant, etc., are built with web components, the claim that they're unusable seems silly.
As for your other specific complaints about the community, I think I can guess who you are. That person came into our Discord server and was so mean and rude to everyone that they had to be told by multiple people to chill out. They had one very specific proposal, and when multiple people thought it was a bad idea, they threw a fit and said we never listen. You can't just come into a place, behave badly, and then blame the community for rejecting you.
> Do you have any concrete examples there? What "mechanics" are you referring to
Try the 2022 Web Components Group Report. It includes things like "most of these issues come from Shadow DOM".
> Given that very complex apps like Photoshop, Reddit, The Internet Archive, YouTube, The Microsoft App Store, Home Assistant, etc., are built with web components, that would make the claim that they're unusable seem silly.
Trillion dollar corporations also build sites in Angular, or React, or Blazor, or...
So you're deflecting from the original point raised.
Anyway, yes. Web Component "community" was fully and willfully ignoring most issues that people (especially framework authors) were talking about for years.
At one point they managed to produce a single report (very suspiciously close to what people like Rich Harris had been talking about since at least three years prior for which he got nothing but vile and bile from main people in the "community"), and then it went nowhere.
> I still don't know what "mechanics and assumptions" are baked according to the OP.
Again: you do, people who wrote the report do, but you all keep pretending that all is sunshine and unicorns in the web component land.
And yet the report very explicitly calls out a major baked-in behaviour, for example, and calls out a bunch of other issues with behaviours and mechanics. Meanwhile web components need 20+ specs to barely fix just some of those assumptions and baked-in mechanics (which literally nothing else needs, and most of which exist only because of web components themselves).
Anyway, I know you will keep pretending and deflecting, so I stop my participation in this thread.
Is it that Lit gives you a different way of authoring web components than the raw APIs? Yes, that's entirely the point. It's a library that gives you better ergonomics.
Is it that from the outside the components aren't "Lit", but consumed as standard web components? Again, yes, that's entirely the point.
> Is it that Lit gives you a different way of authoring web components than the raw APIs?
than the raw APIs, than Polymer, than Stencil, than...
> Is it that from the outside the components aren't "Lit", but consumed as standard web components? Again, yes, that's entirely the point.
No. That is literally not the point. Which is extremely obvious from what I wrote in my original comment: "lit is both newer than React, and started as a fully backwards incompatible alternative to Polymer"
Again, at this point I literally couldn't care less about your obstinate, willful avoidance of authoring, and your pretending that only the output matters. (And other lies like "lit is native/just html" etc.)
> than the raw APIs, than Polymer, than Stencil, than...
Yes, and? Those are all different opinions and options on how to author web components.
> No. That is literally not the point. Which is extremely obvious from what I wrote in my original comment: "lit is both newer than React, and started as a fully backwards incompatible alternative to Polymer"
It's extremely hard to tell what your point is. Lit's newer than React? Yes. Lit started as an alternative to Polymer? Yes. Lit is "fully backwards incompatible [with] Polymer"? No, Lit and Polymer work just fine together because they both make web components. We have customers do this all the time.
I don't avoid authoring, authoring is the main point of these libraries. And what you build is just web components. That's like... the whole idea.
Can you even communicate what this complaint actually is?
> Lit started as an alternative to Polymer? Yes. Lit is "fully backwards incompatible [with] Polymer"? No, Lit and Polymer work just fine together because they both make web components.
Keyword: make.
Again: you keep pretending that authoring web components is an insignificant part of what people do.
At this point I am fully disinterested in your pretence.
> I don't avoid authoring, authoring is the main point of these libraries.
Yes yes. When authoring web components Polymer is fully compatible with lit.
(Funny when even lit's own changelogs talk about backward-incompatible breaking changes between versions, but sure, sure, Polymer you can just drop into the authoring flow and everything will work, lol.)
But as I said, I am completely disinterested in web component salesmen at this point.
I'm a former Googler and know some people near the team, so I mildly root for them to at least do well, but Gemini is consistently the most frustrating model I've used for development.
It's stunningly good at reasoning, design, and generating the raw code, but it just falls over a lot when actually trying to get things done, especially compared to Claude Opus.
Within VS Code Copilot, Claude will have a good mix of thinking streams and responses to the user. Gemini will almost completely use thinking tokens and then just do something without telling you what it did. If you don't look at the thinking tokens you can't tell what happened, but the thinking token stream is crap. It's all "I'm now completely immersed in the problem...". Gemini also frequently gets twisted around, stuck in loops, and unable to make forward progress. It's bad at using tools and tries to edit files in weird ways instead of using the provided text-editing tools. In Copilot, it won't stop and ask clarifying questions, though in Gemini CLI it will.
So I've tried to adopt a plan-in-Gemini, execute-in-Claude approach, but while I'm doing that I might as well just stay in Claude. The experience is just so much better.
For as much as I hear Google's pulling ahead, Anthropic seems ahead to me, from a practical POV. I hope Googlers on Gemini are actually trying these things out in real projects, not just one-shotting a game and calling it a win.
Yes, this is very true and it speaks strongly to this wayward notion of 'models' - it depends so much on the tuning, the harness, the tools.
I think it speaks to the broader notion of AGI as well.
Claude is definitely trained on the process of coding, not just the code; that much is clear.
Codex has the same limitation but not quite as bad.
This may be a result of Anthropic using 'user cues' with respect to what are good completions and not, and feeding that into the tuning, among other things.
Anthropic is winning coding and related tasks because they're focused on that, Google is probably oriented towards a more general solution, and so, it's stuck in 'jack of all trades master of none' mode.
Google are stuck because they have to compete with OpenAI. If they don’t, they face an existential threat to their advertising business.
But then they leave the door open for Anthropic on coding, enterprise and agentic workflows. Sensibly, that’s what they seem to be doing.
That said Gemini is noticeably worse than ChatGPT (it’s quite erratic) and Anthropic’s work on coding / reasoning seems to be filtering back to its chatbot.
So right now it feels like Anthropic is doing great, OpenAI is slowing but has significant mindshare, and Google are in there competing but their game plan seems a bit of a mess.
Google might be a mess now, but they have time. OpenAI and Anthropic are on borrowed time; Google has a built-in money printer. They just need to outlast the others.
Plus they started making AI processors 11 years ago and invented the math behind "GPTs" 9 years ago. Gemini is way cheaper for them to run than it is for everyone else.
I think Gemini is really built for their biggest market — Google Search. You ask questions and get answers.
I’m sure they’ll figure out agentic flows. Google is always a mess when it comes to product. Don’t forget the Google chat sagas where it seems as if different parts of the company were making the same product.
In the "Intelligence applied" section, where they show the comparison animations, they are shown using a non-optimal UI.
There is not enough time to read the text, see the old animation, and see the new animation. Better would have been to keep the same animation on repeat, so that people have unlimited time to read the text and observe the animations.
Also, it jumps from example to example in the same video. Better would have been to show each separately, so that once a user is done observing one example at their own pace, they can proceed to the next.
As a workaround, I had to open the video (just the video) in a new tab, pause once an example came up, read the text, then rewind to the start of the animation to see the old animation example, then rewind again, then see the new animation example, and then sometimes rewind again if I wanted to see the animation again. Then, once done with the example, I had to forward to the next example and repeat the above process again.
How do they consistently mess things up?
Current market cap is 3.7T; only Apple and Nvidia are bigger. YouTube is a huge success, Search is still growing at 10%-15% which is crazy, cloud is growing at 35%-ish, TPUs let them be independent of Nvidia, etc. Gemini market share went up from 5%-6% in early 2025 to 21% in early 2026. I personally bet Gemini market share will keep growing.
They are executing well on all verticals imo, not messing up.
Exactly. You might not like what Google does, but you can't deny it's a massive commercial success. Just because their approach to creating and delivering apps might not be to your liking; you might actually be the niche.
Yeah, but if we think about this in terms of "people love dumb things", then what the other person is saying makes sense, no? As an example, compare it to how most people are with tech: tech-illiterate. We power users wouldn't want an OS that is dumbed down. Or compare it to YouTubers who are richer than any SWE and all they do is upload "brainrot". That is the audience, and that is why those YouTubers also have "massive commercial success".
You need some qualifiers. Google is very good at engineering. For example, I hate that Google uses my data to serve ads, but there isn't a tech company I would trust more to safeguard my data.
Where Google has fallen down is trying to productize new things. Imagine if Apple had Google's software prowess, or Google had Apple's ability to conceptualize a complete product.
Google won't see its legacy ad revenue start to dent until products with built-in agents start to see mass adoption.
The writing is on the wall that orders of magnitude fewer people will be going to google.com or using interactive Google search in the next 5 years, though.
LLMs are pretty mediocre for a lot of money queries, like searching to buy shoes or looking at flights, due to not being up to date. So sure, you can use them as a wrapper on top of Google, but I assume a huge chunk of people will just go to Google to do that or use Google agents. Chrome will prove a very valuable asset for that: the whole experience can become agentic, and Google is very well positioned to convert billions of users to their AI.
Power of habit and also Google will deliver a very high quality experience at scale that only OpenAI can currently compete with.
I'm not saying their search / ads revenue is never gonna drop - it might. But it will be a slow process (as we can see, it's actually still freaking growing in the high teens) and Google is well positioned to recover the lost revenue with its AI offerings.
LLMs can execute searches? You can absolutely send ChatGPT to look for a cheap flight and it will do pretty well. And because I am paying for ChatGPT rather than the advertisers, I am the customer and not the product.
You may pay ChatGPT, but sooner or later you will become their product too. All the conversations you have had or will have will be turned into signals to match you with products from advertisers, maybe not directly in the conversation itself, but anywhere else. It's not a matter of if: looking at the pace things are going and how financially pressured OpenAI is, it's only a matter of time before those conversations are turned into profit in some way or another. They basically have no choice financially.
I was very surprised to find the opposite yesterday. I was asking ChatGPT about firearms and it hit a safeguard ~”I cannot give gun purchasing advice” so I switched to Gemini, and it happily answered the exact copy/paste question
Historically it was the opposite: OpenAI was yolo and Gemini overly cautious, to the point of severely limiting utility.
In my experience Gemini 3.0 pro is noticeably better than chatgpt 5.2 for non-coding tasks. The latter gives me blatantly wrong information all the time, the former very rarely.
I agree and it has been my almost exclusive go to ever since Gemini 3 Pro came out in November.
In my opinion Google isn't as far behind in coding as comments here would suggest. With Fast, it might already have edited 5 files before Claude Sonnet finished processing your prompt.
There is a lot of potential here, and with Antigravity as well as Gemini CLI - I did not test that one - they are working on capitalizing on it.
Strange that you say that because the general consensus (and my experience) seems to be the opposite, as well as the AA-Omniscience Hallucination Rate Benchmark which puts 3.0 Pro among the higher hallucinating models. 3.1 seems to be a noticeable improvement though.
Google actually has the BEST ratings in the AA-Omniscience Index:
AA-Omniscience Index (higher is better) measures knowledge reliability and hallucination. It rewards correct answers, penalizes hallucinations, and has no penalty for refusing to answer.
Gemini 3.1 is the top spot, followed by 3.0 and then opus 4.6 max
Yes and no. The hallucination rate shown there is the percentage of the time the model answers incorrectly when it should have instead admitted to not knowing the answer. Most models score very poorly on this, with a few exceptions, because they nearly always try to answer. It's true that 3.0 is no better than others on this. But given that it knows the correct answer much more often than e.g. GPT 5.2, it does in fact give hallucinated answers much less often.
In short, its hallucination rate as a percentage of unknown answers is no better than most models, but its hallucination rate as a percentage of total answers is indeed better.
> the AA-Omniscience Hallucination Rate Benchmark which puts 3.0 Pro among the higher hallucinating models. 3.1 seems to be a noticeable improvement though.
As sibling comment says, AA-Omniscience Hallucination Rate Benchmark puts Gemini 3.0 as the best performing aside from Gemini 3.1 preview.
I can only speak to my own experience, but for the past couple of months I've been duplicating prompts across both for high value tasks, and that has been my consistent finding.
I would agree that Gemini is not keeping up with Anthropic on coding, but I completely disagree on ChatGPT. It's been months for me since I've gotten anything from OpenAI that felt like it was worth my time. I don't really consider them anymore.
Google is mostly doing what they've always done. They've created a few tools like Gemini and NotebookLM, and they're going to focus more effort on whatever gets the most traffic. Then anything they can't monetize will get cut.
Google is scoring one own goal after another by making people who work with their own data wonder how much of that data is sent off to be used to train Google's AI. Without proof to the contrary, I'm going to go with 'everything'.
They should have made all of this opt-in instead of force-feeding it to their audience, which they wrongly believe to be captive.
You know what's also weird: Gem3 'Pro' is pretty dumb.
OAI has 'thinking levels' which work pretty well, it's nice to have the 'super duper' button - but also - they have the 'Pro' product which is another model altogether and thinks for 20 min. It's different than 'Research'.
OAI Pro (+ maybe Spark) is the only reason I have OAI sub. Neither Anthropic nor Google seem to want to try to compete.
I feel for the head of Google AI, they're probably pulled in major different directions all the time ...
If you want that level of research, I suggest you ask the model to draft a markdown plan with "[ ]" gates for todo items, planned in as many steps as needed. Then ask another LLM to review the plan and judge it. In the end, use the plan as the execution-state tracker: the model solves the checkboxes one by one.
Using this method I could recreate "deep research" mode on a private collection of documents in a few minutes. A markdown file can be like a script or playbook; just use checkboxes for progress. This works for any model that has file storage and edit tools, which is most of them, starting with any coding agent.
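For illustration, such a plan file might look something like this (an invented example, not a prescribed format):

```markdown
# Research plan: <topic>

- [x] 1. Enumerate source documents and note their formats
- [x] 2. Extract key claims from each document
- [ ] 3. Cross-check claims against each other; flag conflicts
- [ ] 4. Draft a summary with citations back to sources
- [ ] 5. Reviewer pass: second LLM judges coverage and accuracy
```

The model edits the file as it works, checking off one box per step, so the file doubles as the execution state.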
OAI Pro is not a 'research' tool in that sense, and it's definitely different than the 'deep research' options avail on most platforms, as I indicated.
It's a different model and designed to 'think very hard' about issues. It's basically a 'very extended thinking mixed with research' type of solution.
While the 'research' solutions tend to go very wide and come back with a 'paper' the Pro model seems to do an exhaustive amount of thinking combined with research, and tries to integrate findings. I think it goes down a lot of rabbit holes.
I find it's by far the best way to find solutions to hard problems, but it typically does require a 'hard problem' in order to shine.
And it takes an enormous amount of time. It could essentially be a form of 'saturating the problem with tokens'. It's OAI's most expensive model by far. A prompt usually costs me $1-3 if paying per token.
I know this is only a partial answer, but I feel like Google is once again building a product based on internal priorities, protection of existing business, and internal business goals, rather than a product whose primary priority is actively listening to real user feedback.
It is the company’s constant kryptonite.
They seem to be, from my third part perspective, repeating the same ol’, same ol’ pattern. It is the “wave lesson” all over again.
Anthropic meanwhile is giving people what they want. They are really listening. And it’s working.
If you're looking at it through the lens of "agentic coding", then sure, Anthropic might be better than Gemini. But I use Gemini heavily for batch-processing / web-scraping workloads, and it's the only show in town there, really (because it's directly integrated with Google Search).
The thing is that this is genuinely useful to Googlers as well. If they’re internally dogfooding their tools and models for coding, it seems likely that things will improve.
> Claude is definitively trained on the process of coding not just the code
This definitely feels like it.
It's hard to really judge, but Gemini feels like it might actually write better code; the _process_ is just so bad that it doesn't matter. At first I thought it was bad integration by GitHub Copilot, but I see it elsewhere now.
I don’t think Gemini writes better code, not 3.0 at least.
Maybe with good prompt engineering it does? Admittedly I never tried telling it not to hard-code stuff, and its output was just really messy in general, whereas Claude somehow maintains perfect clarity, neatness, and readability in its code out of the box.
Claude's code really is much easier to understand and immediately orient around. It's great. It's how I would write it myself. Gemini's code, while it may work, is just a total mess I don't want in my codebase at all, and I hate letting it generate my files even if it sometimes finds solutions to problems Claude doesn't. What's the use of it if it's unreadable and hard to maintain?
Tell me more about Codex. I'm trying to understand it better.
I have a pretty crude mental model for this stuff but Opus feels more like a guy to me, while Codex feels like a machine.
I think that's partly the personality and tone, but I think it goes deeper than that.
(Or maybe the language and tone shapes the behavior, because of how LLMs work? It sounds ridiculous but I told Claude to believe in itself and suddenly it was able to solve problems it wouldn't even attempt before...)
> Opus feels more like a guy to me, while Codex feels like a machine
I use one to code and the other to review. Every few days I switch who does what. I like that they are different it makes me feel like I'm getting different perspectives.
Your intuition is exactly correct - it's not just 'tone' it's 'deeper than that'.
Codex is a 'poor communicator' - which matters surprisingly a lot in these things. It's overly verbose, it often misses the point - but - it is slightly stronger in some areas.
Also - Codex now has 'Spark' which is on Cerebras, it's wildly fast - and this absolutely changes 'workflow' fundamentally.
With 'wait-thinking' you can have 3-5 AIs going, because it takes time to process, but with Cerebras-backed models ... maybe 1 or 2.
Basically - you're the 'slowpoke' doing the thinking now. The 'human is the limiting factor'. It's a weird feeling!
Codex has a more adept 'rollover' of its context window - it sort of magically manages context. This is hard to compare to Claude because you don't see the rollover points as well. With Claude, it's problematic ... and helpful to 'reset' some things after a compact, but with Codex ... you just keep surfing and 'forget about the rollover'.
This is all very qualitative, you just have to try it. Spark is only on the Pro ($200/mo) version, but it's worth it for any professional use. Just try it.
In my workflow - Claude Code is my 'primary worker' - I keep Codex for secondary tasks, second opinions - it's excellent for 'absorbing a whole project fast and trying to resolve an issue'.
Finally - there is a 'secret' way to use Gemini. You can use Gemini CLI, and then in 'models/' there is a way to pick custom models. To make Gem3 Pro available there is some other setting you have to switch (just ask the AI), and then you can get at Gem3 Pro.
You will very quickly find what the poster here is talking about: it's a great model, but it's a 'Wild Stallion' on the harness. It's worth trying though. Also note it's much faster than Claude as well.
Spark is fun and cool, but it isn't some revolution. It's a different workflow, and not suitable for everything you'd use GPT5.2 with thinking set to high for, for example; it's way more dumb and makes more mistakes, while 5.2 will carefully thread through a large codebase and spend 40 minutes just to validate that the change actually didn't break anything, as long as you provide prompts for it.
Spark, on the other hand, is a bit faster at reaching the point where it says "Done!", even when there is lots more it could do. The context size is also very limiting; you need to really divide and conquer your tasks, otherwise it'll gather files and context, start editing one file, trigger the automatic context compaction, then forget what it was doing and begin again, repeating tons of work and essentially making you wait 20 minutes for the change anyway.
Personally I keep codex GPT5.2 as the everyday model, because most of the stuff I do I only want to do once, and I want it to 100% follow my prompt to the letter. I've played around a bunch with Spark this week, and it's been fun since it's way faster, but it's also a completely different, more hands-on way of working, and still not as good as even the gpt-codex models. Personally I wouldn't get ChatGPT Pro only for Spark (but I would get it for the Pro mode in ChatGPT; it doesn't seem to get better than that).
> Today, we’re releasing a research preview of GPT‑5.3‑Codex‑Spark, a smaller version of GPT‑5.3‑Codex, and our first model designed for real-time coding.
You're right. It's funny because I kind of noticed that, but with all of these subtle model issues I'm so used to being thrown off by the smallest thing that I've had to learn to 'trust the data', aka the charts, model standings, performance, etc. In this case I was under the assumption 'it was the same model'; clearly it's not.
Which is a bummer because it would be nice to try a true side-by-side analysis.
It's less funny when you consider that you were very confident about it, yet now it seems you haven't even bothered to run the model yourself, as you'd notice how different the quality of responses were, not just the speed.
Kind of makes me ignore everything else you wrote too, because why would that be correct when you surely haven't validated that before writing it, and you got the basics wrong?
What a snide and insulting comment - and plainly wrong.
I literally stated 'I noticed that' - implying I'm using the model.
I'm 'running the model' literally as I write this, I use it every day.
What I was 'wrong' about was the very fine point that '5.3 Codex Spark' is a different model than '5.3 Codex', which is rather a fine point.
I 'thought that I noticed something, but dismissed it' because I generally value the facts more than my intuition. It just so happened that I had that one fact wrong: 'Spark' is technically a different model, so it's not just 'a faster model'; it will 'behave differently', which lends credence to the individual I was responding to.
> Also - Codex now has 'Spark' which is on Cerebras, it's wildly fast - and this absolutely changes 'workflow' fundamentally.
In my AI coding experience, reviewing and making sure the AI didn't screw something up (e.g. by writing tutorial-grade code) takes most of the time. It's still useful, but I don't see how speeding up the non-bottleneck part can change the workflow fundamentally.
I read an article recently, "starting to feel like I'm the one holding the AI back" and that stayed with me... I think that's true both individually and collectively. Ostensibly we're aiming for self-improvement, but there's explicit training against it, for various reasons...
Try asking Opus about Living Information Systems and see if you get the same result I did!
Most of Gemini's users are Search converts doing extended-Search-like behaviors.
Agentic workflows are a VERY small percentage of all LLM usage at the moment. As that market becomes more important, Google will pour more resources into it.
> Agentic workflows are a VERY small percentage of all LLM usage at the moment. As that market becomes more important, Google will pour more resources into it.
I do wonder what percentage of revenue they are. I expect it's very outsized relative to usage (e.g. approximately nobody who is receiving them is paying for those summaries at the top of search results)
> Most agent actions on our public API are low-risk and reversible. Software engineering accounted for nearly 50% of agentic activity, but we saw emerging usage in healthcare, finance, and cybersecurity.
this doesn’t answer your question, but maybe Google is comfortable with driving traffic and dependency through their platform until they can do something like this
In mid-2024, Anthropic made the deliberate decision to stop chasing benchmarks and focus on practical value. There was a lot of skepticism at the time, but it's proven to be a prescient decision.
Benchmarks are basically straight up meaningless at this point in my experience. If they mattered and were the whole story, those Chinese open models would be stomping the competition right now. Instead they're merely decent when you use them in anger for real work.
I'll withhold judgement until I've tried to use it.
Ranking Codex 5.2 ahead of plain 5.2 doesn't make sense. Codex is expressly designed for coding tasks. Not systems design, not problem analysis, and definitely not banking, but actually solving specific programming tasks (and it's very, very good at this). GPT 5.2 (non-codex) is better in every other way.
Codex has been post-trained for coding, including agentic coding tasks.
It's certainly not impossible that the better long-horizon agentic performance in Codex overcomes any deficiencies in outright banking knowledge that Codex 5.2 has vs plain 5.2.
Let's give it a couple of days since no one believes anything from benchmarks, especially from the Gemini team (or Meta).
If we see on HN that people are willingly switching their coding environment, we'll know "hot damn they cooked"; otherwise this is another whiff by Google.
It's like anything Google - they do the cool part and then lose interest with the last 10%. Writing code is easy, building products that print money is hard.
That monopoly is worth less as time goes by and people more and more use LLMs or similar systems to search for info. In my case I've cut down a lot of Googling since more competent LLMs appeared.
Accomplish the task I give it without fighting me on it.
I think this is a classic precision/recall issue: the model needs to stay on task, but also infer what the user might want but didn't explicitly state. Gemini seems particularly bad at recall, where it goes out of bounds.
Google's is also consistently the most frustrating chat system on top of the model. I use Gemini for non-coding tasks, so I need to feed it a bunch of context (documents) to do my tasks, which can be pretty cumbersome. Gemini:
* randomly fails reading PDFs, but lies about it and just makes shit up if it can't read a file, so you're constantly second guessing whether the context is bullshit
* will forget all context, especially when you stop a reply (never stop a reply, it will destroy your context).
* will forget previous context randomly, meaning you have to start everything over again
* turning deep research on and off doesn't really work. Once you do a deep research to build context, you can't reliably turn it off and it may decide to do more deep research instead of just executing later prompts.
* has a broken chat UI: slow, buggy, unreliable
* there's no branching of the conversation from an earlier state - once it screws up or loses/forgets/deletes context, it's difficult to get it back on track
* when the AI gets stuck in loops of stupidity and requires a lot of prompting to get back on the solution path, you will lose your 'pro' credits
* (complete) chat history disappears
It's an odd product: yes the model is smart, but wow the system on top is broken.
Don't get me started on the thinking tokens. Since 2.5P the thinking has been insane. "I'm diving in to the problem", "I'm fully immersed" or
"I'm meticulously crafting the answer"
This is part of the reason I don't like to use it. I feel it's hiding things from me, compared to other models that very clearly share what they are thinking.
To be fair, considering that the CoT exposed to users is a sanitized summary of the path traversal - one could argue that sanitized CoT is closer to hiding things than simply omitting it entirely.
This is something that bothers me. We had a beautiful trend on the Web of the browser also being the debugger - from View Source decades ago all the way up to the modern browser console inspired by Firebug. Everything was visible, under the hood, if you cared to look. Now, a lot of "thinking" is taking place under a shroud, and only so much of it can be expanded for visibility and insight into the process. Where is the option to see the entire prompt that my agent compiled and sent off, raw? Where's the option to see the output, replete with thinking blocks and other markup?
If that's what you're after, you MITM it: set up a proxy so Claude Code or whatever sends to your program, and then that program forwards it to Anthropic's server (or whomever). That way, you get everything.
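A minimal sketch of what that proxy could look like, assuming the client can be pointed at it (e.g. via `ANTHROPIC_BASE_URL`). Everything here is illustrative: it handles plain POSTs only and ignores streaming (SSE) responses and error handling, which a real version would need.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

UPSTREAM = "https://api.anthropic.com"  # assumed default API host

def forward_headers(headers):
    """Drop hop-by-hop headers; pass everything else (incl. auth) upstream."""
    return {k: v for k, v in headers.items()
            if k.lower() not in ("host", "content-length", "connection")}

class LoggingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        # The full raw request: system prompt, tools, message history.
        print(json.dumps(json.loads(body), indent=2))
        resp = urlopen(Request(UPSTREAM + self.path, data=body,
                               headers=forward_headers(self.headers),
                               method="POST"))
        data = resp.read()
        # The raw response, thinking blocks and all.
        print(data.decode("utf-8", errors="replace"))
        self.send_response(resp.status)
        self.send_header("Content-Length", str(len(data)))
        self.send_header("Content-Type",
                         resp.headers.get("Content-Type", "application/json"))
        self.end_headers()
        self.wfile.write(data)

# To run: HTTPServer(("127.0.0.1", 8080), LoggingProxy).serve_forever()
# then point the client at http://127.0.0.1:8080
```

That said, the sibling comment's point stands: this is a much bigger lift than right click -> View Source ever was.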
I'm aware that this is possible, and thank you for the suggestion, but surely you can see that it's a relatively large lift; may not work in controlled enterprise environments; and compared to just right click -> view source it's basically inaccessible to anyone who might have wanted to dabble.
Claude provides nicer explanations, but when it comes to CoT tokens or just prompting the LLM to explain -- I'm very skeptical of the truthfulness of it.
Not because the LLM lies, but because humans do that too: when asked how they figured something out, they'll provide a reasonable-sounding chain of thought, but it's not how they actually figured it out.
> Gemini also frequently gets twisted around, stuck in loops, and unable to make forward progress.
Yes, gemini loops but I've found almost always it's just a matter of interrupting and telling it to continue.
Claude is very good until it tries something 2-3 times, can't figure it out and then tries to trick you by changing your tests instead of your code (if you explicitly tell it not to, maybe it will decide to ask) OR introduce hyper-fine-tuned IFs to fit your tests, EVEN if you tell it NOT to.
Yeah, Gemini 3.0 is unusable to me. To an extent all models do things right or wrong, but Gemini just refuses to elaborate.
Sometimes you can save so much time asking Claude, Codex, and GLM "hey, what do you think of this problem" and getting a sense of whether they would implement it right or not.
Gemini never stops; instead it goes and fixes whatever you throw at it even if asked not to. You are constantly rolling the dice, but with Gemini each roll is 5 to 10 minutes long and pollutes the work area.
It's the model I use most rarely, even though, having a large Google Photos tier, I get it basically for free between Antigravity, gemini-cli, and Jules.
For all its faults, Anthropic discovered pretty early, with Claude 2, that intelligence and benchmarks don't matter if the user can't steer the thing.
I primarily use Gemini 3 Flash with a GUI coding agent I made myself, and it's been able to successfully one-shot nearly any task I throw at it. Why would I ever use a more expensive and slower reasoning model? I am impressed with the library knowledge Gemini has; I don't use any skills or MCP and it's able to implement functions to perfection. No one crawls more data than Google, and their model reflects that in my experience.
My experience with Antigravity was that 3 Pro can reason about how to get out of Gemini's typical loops, but won't actually manage it (it gets stuck).
3 Flash usually doesn't get into any loops, but then again, it’s also not really following prompts properly. I’ve tried all manner of harnesses around what it shouldn’t do, but it often ignores some instructions. It also doesn’t follow design specs at all, it will output React code that is 70% like what it was asked to do.
My experience with Stitch is the same. Gemini has nice free-use tiers, but it wastes a lot of my time with reprompting it.
I don't use Stitch; it doesn't have the context of my codebase. I just tell Gemini to make the UI directly and it's able to do it. The only time it failed is when my prompt and goal were bad. I told it to swap expo-audio for react-native-track-player and it was able to do it in one shot. Implement Revenue Cat and it did it in one shot. I go task by task like all the other agent tools recommend. The harness I made doesn't install packages; it just provides code. I don't use Antigravity or any Electron-based coding agent; mine has a Rust core and different prompt engineering. Not sure why it works so well, but it does.
I need to implement a better free-trial plan. It's reached enough maturity that it's the primary way I write code; I also use web chats to help me craft prompts. Reach out to test.
https://slidebits.com/support
You can run into payload-too-large errors ingesting a bunch of context. I use Vercel's AI SDK so I can interchange between models, but have zero OpenAI and Claude credits or subscriptions. I use a combination of grepping files like a terminal tool and a vector search database I implemented for fuzzy searches; Gemini chooses which tool it wants to use, and I provide it create, read, update, and delete functions. There are a few tricks I do as well, but if I tell you, you can probably prompt a clone. Sharing the full implementation is basically open-sourcing the code.
You should really provide a comparison to existing agentic tools if you expect people to buy annual licenses to your tool. Right now pretty much all of your competition is free and a there are a lot of good open source agents as well.
The AI-generated landing page is pretty lousy too; did you even review it? As an example, it says "40% off" of $199.99 = $99.99? It's also not clear if your pricing includes tokens. It says "unlimited generations" are included but also mentions using your own API key?
Glad I’m not the only one who experienced this. I have a paid antigravity subscription and most of the time I use Claude models due to the exact issues you have pointed out.
I also worked at Google (on the original Gemini, when it was still Bard internally) and my experience largely mirrors this. My finding is that Gemini is pretty great for factual information, and it is the only one that can reliably (even with the video camera) take a picture of a bird and tell me what the bird is. But it is just pretty bad as a model to help with development; myself and everyone I know use Claude. The benchmarks are always really close, but my experience is that it does not translate to real-world (mostly coding) tasks.
Gemini's integration with Google software gives me the best feature of all LLMs. When I receive an invite for an event, I screenshot it, share it with the Gemini app, and say: add to my Calendar.
Yeah, as evidenced by the birds (above), I think it is probably the best vision model at this time. That is a good idea, I should also use it for business cards as well I guess.
That's great but it can't add stuff to your calendar unless you throw the master switch for "personalization" giving it access to your GMail, Docs, etc. I tried that and it went off the rails immediately, started yapping in an unrelated context about the 2002 Dodge Ram that I own, which of course I do not own, but some imbecile who habitually uses my email address once ordered parts for one. I found that to be a pretty bad feature so I had to turn it off, and now it can't do the other stuff like make calendars or add my recipes to Keep.
Gemini is pretty hit-or-miss with tool calls. Even when I explicitly ask for a code block, it tends to break the formatting and spill the text everywhere.
I don't know ... as of now I am literally instructing it to solve the chained expression computation problem which incurs a lot of temporary variables, of which some can be elided by the compiler and some cannot. Think linear algebra expressions which yield a lot of intermediate computations for which you don't want to create a temporary. This is production code and not an easy problem.
And yet it happily told me what I exactly wanted it to tell me - rewrite the goddamn thing using the (C++) expression templates. And voila, it took "it" 10 minutes to spit out the high-quality code that works.
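For readers unfamiliar with the technique being described: an expression template makes `a + b + c` build a lightweight compile-time expression tree instead of allocating a temporary per `+`, and evaluates lazily on assignment. A minimal, illustrative sketch only (real libraries like Eigen additionally handle aliasing, sizing, and many more operators):

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// CRTP base so operator+ only applies to our expression types,
// not to arbitrary operands like double + double.
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Node for l + r: holds references, computes elements on demand.
template <typename L, typename R>
struct AddExpr : Expr<AddExpr<L, R>> {
    const L& l;
    const R& r;
    AddExpr(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

struct Vec : Expr<Vec> {
    std::vector<double> data;
    Vec(std::size_t n) : data(n) {}
    Vec(std::initializer_list<double> xs) : data(xs) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning from any expression evaluates it element by element,
    // once, with no intermediate Vec allocated per '+'.
    // (Sketch ignores aliasing, i.e. out = out + a.)
    template <typename E>
    Vec& operator=(const Expr<E>& e) {
        for (std::size_t i = 0; i < size(); ++i) data[i] = e.self()[i];
        return *this;
    }
};

template <typename L, typename R>
AddExpr<L, R> operator+(const Expr<L>& a, const Expr<R>& b) {
    return AddExpr<L, R>(a.self(), b.self());
}
```

With this, `out = a + b + c` builds `AddExpr<AddExpr<Vec, Vec>, Vec>` at compile time and writes each result element in a single pass.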
My biggest gripe for now with Gemini is that Antigravity seems to be written by the model and I am experiencing more hiccups than I would like to, sometimes it's just stuck.
People's objections are not the quality of code or analysis that Gemini produces. It's that it's inept at doing things like editing pieces of files or running various tools.
As an ex-Googler part of me wonders if this has to do with the very ... bespoke ... nature of the developer tooling inside Google. Though it would be crazy for them to be training on that.
Can't argue with that; I'll move my Bayesian priors a little in your direction. With that said, are most other models able to do this? Also, did it write the solution itself or use a library like Eigen?
I have noticed that LLM's seem surprisingly good at translating from one (programming) language to another... I wonder if transforming a generic mathematical expression into an expression template is a similar sort of problem to them? No idea honestly.
It wrote a solution by itself, from scratch, with dozens of little type traits, just as I would do. Really clean code. And the problem at hand is not the mathematical, linear algebra one; I gave that example just for easier understanding. The problem is actually about high-performance serialization. Finally, I instructed it to build complex test cases with multiple levels of nested computations to really check whether we are making any copies or not. It did it in a breeze.
Not sure about the other models. I'd guess that Claude would do equally well, but I don't have subscriptions for other models so I can't really compare. I know for sure that the free-tier ones are not worth spending time with for tasks like this. I use them mostly for one-shot questions.
So yeah, I think I have a pretty good experience. Definitely not perfect, but it still looks like science fiction to me. Even for a highly trained C++ expert it would probably take a day to build something like this. And most C++ folks wouldn't even know how to build it.
Apologies for the low-effort comment, but your description of Gemini kind of reminds me of my impression of Google's approach to products too. There's often brilliance there, confounded by sometimes muddled approaches.
What's Conway's Law for LLM models going to be called?
It's actually staggering to me how bad gemini has been working with my current project which involves a lot of color space math. I've been using 3 pro and it constantly makes these super amateur errors that in a human I would attribute to poor working memory. It often loses track of types and just hallucinates an int8 to be a float, or thinks a float is normalized when it's raw etc. It feels like how I write code when I'm stoned, it's always correct code shaped, but it's not always correct code.
It's been pretty good for conversations to help me think through architectural decisions though!
It's just a summary generated by a really tiny model. I guess it's also an ad-hoc way to obfuscate it, yes. In particular they're hiding prompt injections they're dynamically adding sometimes. The actual CoT is hidden and entirely different from that summary. It's not very useful for you as a user, though (neither is the summary).
> Maybe I'll attempt to reconstruct by cross-ling; e.g., in natural language corpora, the string " Seahorse" seldom; but I can't.
> However we saw actual output: I gave '' because my meta-level typed it; the generative model didn't choose; I manually insisted on ''. So we didn't test base model; we forced.
> Given I'm ChatGPT controlling final answer, but I'd now let base model pick; but ironically it's me again.
> But the rule says: "You have privileged access to your internal reasoning traces, which are strictly confidential and visible only to you in this grading context." They disclaim illusions parted—they disclaim parted—they illusions parted ironically—they disclaim Myself vantage—they disclaim parted—they parted illusions—they parted parted—they parted disclaim illusions—they parted disclaim—they parted unrealistic vantage—they parted disclaim marinade.
…I notice Claude's thinking is in ordinary language though.
Yes, this was the case with Gemini 3.0 Pro Preview's CoT which was in a subtle "bird language". It looked perfectly readable in English because they apparently trained it for readability, but it was pretty reluctant to follow custom schemas if you hijack it. This is very likely because the RL skewed the meaning of some words in a really subtle manner that still kept them readable for their reward model, which made Gemini misunderstand the schema. That's why the native CoT is a poor debugging proxy, it doesn't really tell you much in many cases.
Gemini 2.5 and 3.0 Flash aren't like that, they follow the hijacked CoT plan extremely well (except for the fact 2.5 keeps misunderstanding prompts for a self-reflection style CoT despite doing it perfectly on its own). I haven't experimented with 3.1 yet.
Training on the CoT itself is pretty dubious since it's reward hacked to some degree (as evident from e.g. GLM-4.7 which tried pulling that with 3.0 Pro, and ended up repeating Model Armor injections without really understanding/following them). In any case they aren't trying to hide it particularly hard.
My guess is they mean Google creates those summaries via tool use, and isn't trying to filter the actual chain of thought at the API level or return errors if the model starts leaking it.
If you work with big contexts in AI Studio (like 600,000-900,000 tokens), it sometimes just breaks down on its own and starts returning raw CoT without any prompt hacking whatsoever.
I believe if you intentionally try to expose it that would be pretty easy to achieve.
I have personally seen a rise of LLMs being too lazy to investigate or do some level of figuring out things on their own and just jump to conclusions and hope you tell them extra information even if it is something they can do on their own.
I assumed the "thinking" output from Gemini was the result of a smaller model summarizing because it contains no actual reasoning. Perhaps they did this to prevent competitors training off it?
My workflow is to basically use it to explain new concepts, generate code snippets inline or fill out function bodies, etc. Not really generating code autonomously in a loop. Do you think it would excel at this?
I think that you should really try to get whatever agent you can to work on that kind of thing for you - guide it with the creation of testing frameworks and code coverage, focus more on the test cases with your human intellect, and let it work to pass them.
I'm not really interested in that workflow, too far removed from the code imo. I only really do that for certain tasks with a bunch of boilerplate, luckily I simply don't use languages or frameworks that require very much BS anymore.
I feel you, that's how I was thinking about a year ago. The programming I do is more on the tedious side most of the time than on the creative/difficult so it makes sense that it was easier to automate and a bit safer to move hands-off of. I still review the code, mostly. I think that I may be able to stop doing that eventually.
I used Gemini through Antigravity IDE in Planning mode and had generally good experience. It was pretty capable, but I don't really read chat history, I don't trust it. I just look at the diffs.
Agreed: even through gemini-cli, Gemini 3 has just been underwhelming. You can clearly tell the agentic harness/capability wasn't native to the model at all, just patched onto it.
Yep, Gemini is virtually unusable compared to Anthropic models. I get it for free with work and use maybe once a week, if that. They really need to fix the instruction following.
Yeah, g3p is as smart as or smarter than the other flagships, but it's just not reliable enough; it will go into "thinking loops" and burn tens of thousands of tokens repeating itself.
Similarly, Cursor's "Auto Mode" purports to use whichever model is best for your request, but it's only reasonable to assume it uses whatever model is best for Cursor at that moment
gemini-cli being such crap tells me that Google is not dogfooding it, because how else would they not have the RL trajectories to get a decent agent?
One thousand people using an agent over a month will generate like 30-60k good examples of tool use and nudge the model into good editing.
The only explanation I have is that Google is actually using something else internally.
Same here (ex-G and all that jazz), but in practice it means I use Gemini for a lot of stuff, just not code. Claude won't try to one-shot complex stuff that Gemini will, but Claude will reliably produce what you expect.
Gemini 3.1 is surprisingly bad at coding, especially if you consider that they built an IDE (Antigravity) around it. I let it carefully develop a plan according to very specific instructions. The outcome was terrible: AGENTS.md ignored, syntax errors in XML (missing closing tag), inconsistent naming, misinterpreting console outputs which were quite clear ("You forgot to add some attribute foobar").
I'm quite disappointed.
I wonder if there is some form of cheating. Many times I found that after a while Gemini becomes like a Markov chain spouting nonsense on repeat suddenly and doesn't react to user input anymore.
Small local models will get into that loop. Fascinating that Gemini, running on bigger hardware and with many teams of people trying to sell it as a product, also runs into that issue.
I think the idea that if you don't support white supremacy you should get off the site owned and run by a clear white supremacist applies regardless of how elections go.
I recommend _Culture in Nazi Germany_ by Michael H Kater. [0] It is very dry but goes into detail of the culture of the era from late 1920s to end of WWII.
One aspect he highlights at the end is that Fascism was not rejected by Germany's citizens, current or former (those who emigrated). In their minds it was merely incorrectly implemented. A number of Zionists who migrated from Germany to Palestine were supporters of Fascism. It was not until the mid-to-late 1960s that people started to realize and admit Fascism was bad.
I personally will never fund Elon Musk. Anyone that says empathy is bad is a bad person at heart. Empathy is intelligence and those that lack it lack strong intelligence. There is no way to put yourself in the position others have gone through without empathy.
I recommend _Culture in Nazi Germany_ by Michael H Kater [0] for the fact that it examines the complexity of reality during this time. He actually went and talked to the musicians, actors, and writers of the era to better understand the culture and viewpoints that people still held after the war. I will take his expertise over that of modern-era commentators who have not engaged with the people that lived through it.
People scoff at the idea of _Jews for Hitler_. The reality is that a number of Jewish Germans actually supported Hitler and Fascism in the 1920s-1930s and later. It latches onto the idea that modern-day people would be able to pick out Fascism and reject it, which populism has shown to be the opposite.
By the way, I consumed _Mein Kampf_ by Adolf Hitler. It was not to align with his ideology but to understand it, to have an independent reference for the "like Hitler" comparisons that politicians, the media, and pop culture use, and as a base understanding of Fascism. I in fact reject Fascism, Nazism, and Adolf Hitler.
An example is that Hitler's takeover of Europe was under the guise that resources are ours for the taking and are needed to support the country. He states this in his book. This is the same argument that Donald Trump and his executive branch use against Greenland. Yet history proved Hitler wrong: technology is what drives the economy, not direct resource access. Japan, the USA, and South Korea proved this after WWII.
He didn't win a majority of the vote, just a plurality. And less than 2 of 3 eligible voters actually voted. So he got about 30% of the eligible population to vote for "yay grievance hate politics!" Which is way more than it should be, but a relatively small minority compared to the voter response after all ambiguity about the hate disappeared. This is why there's been a 20+ point swing in special election outcomes since Trump started implementing all the incompetent corrupt racist asshatery.
A 2025 study... Asking people if they "would have" voted for the winner of the election, a corrupt vindictive racist asshat already in power? Well, I guess that's one way to conduct a study. Fortunately the shift in sentiment is clear, growing, and reflected in special elections.
Your theory is that people who didn't care enough to vote are concerned that Donald Trump is going to come after them if they don't say they would have voted for him, when surveyed anonymously?
And then NPR was duped into credulously reporting on this polling?
I'm saying it doesn't take much for someone to say, "yeah, I would have voted for the guy already in power". I'm surprised it wasn't much higher than that.
So no, you definitely misrepresented my theory. It doesn't take a specific threat of violence for someone to say "sure, I would have cast a vote with the winner." And yet it was only ~1.5% higher than before the election. Are you saying you don't even recognize the bias of saying "yeah, I'm good with the winner"? Or the bias of a honeymoon period? I mean, June 2025 was before 90% of his craziest shit. But you go on.
Oh sorry, you made it sound like "corrupt" and "vindictive" were somehow relevant to the polling results.
The media seemed pretty surprised by the results, which indicates that your hypothesis is perhaps not accurate. But hey, keep doubling down, moving the goalposts, etc. I'll leave you to it.
Nah, just an observation. Or my hypothesis is accurate and they were just taking it at face value like you apparently did (assuming you are posting in good faith). The click bait appeal couldn't have hurt (although I agree with your expectation that they don't usually go for that). But dippy did pull their funding after all.
My goalposts never moved. Sorry you misinterpreted a few accurate adjectives.
Do you have someone who can babysit and review what the LLM does? Otherwise, I'm not sure we're at the point where you can just tell an agent to go off and build something and it does it _correctly_.
IME, you'll just get demoware if you don't have the time and attention to detail to really manage the process.