timabdulla's comments | Hacker News

I'd be curious to see screenshots or a video! I only have a Mac at my disposal, unfortunately.

This seems cool, but beware that Fly's other products are not exactly models of stability and polish.

API downtime is a semi-frequent occurrence, as are transient API errors and slowness.

I've also had a ticket open with support for weeks due to rampant billing issues. For instance, a destroyed instance still shows up in my usage report as actively accruing billed time, and at a rate faster than is even possible (something like 2 hours for every 1 actual hour that has passed.)

They've released two new products in the AI space, this and Phoenix.new, and my worry is that they are focused on new products over making what they have good and reliable.


Yeah, nobody should use this based on reliability and support alone.

> We conducted three runs per experiment and selected the run with the highest final accuracy for inclusion in the chart (though illustrative examples and anecdotes may be drawn from any of the runs).

Can you comment on the variance? It's impressive that models are able to do this consistently with 100% accuracy in the early months, but it would be less so if there was any significant degree of variance amongst the three runs (e.g. 90%, 95%, 100%.)


I mean, the fact that OpenAI, at the bleeding edge of it all, has decided to buy an IDE is a rather strong hint that the future of agents handling entire engineering tickets might be further out than many believe.

If autonomous agents were just around the corner, then why wouldn't OpenAI bet on their own Codex product obviating (most) need for an IDE and save themselves the $3 billion?


Why did OpenAI purchase Windsurf instead of prompting OpenAI's own models to create something like Windsurf?

This is the question I am still asking...


These products are not complicated at their core — you can pretty much just drop in something like Monacopilot [1] and be 80% of the way there. But the last 20% is a real slog, and it mostly comes down to handling edge cases (bracket closing...) and optimizing prompting/context so you aren't burning cash. Whatever anyone claims about "feeling the AGI," AI isn't there yet.

[1]: https://github.com/arshad-yaseen/monacopilot
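
A rough sketch of what that "80%" looks like, written directly against Monaco's inline-completion API (the kind of plumbing a library like Monacopilot wraps for you). The /api/complete endpoint here is hypothetical, standing in for whatever backend proxies your LLM:

    // Sketch only: wire Monaco's inline completions to an LLM backend.
    // Assumes a hypothetical POST /api/complete that returns { completion: string }.
    import * as monaco from 'monaco-editor';

    monaco.languages.registerInlineCompletionsProvider('typescript', {
      async provideInlineCompletions(model, position, _context, token) {
        // Use everything before the cursor as the prompt context.
        const prefix = model.getValueInRange({
          startLineNumber: 1,
          startColumn: 1,
          endLineNumber: position.lineNumber,
          endColumn: position.column,
        });

        const res = await fetch('/api/complete', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ prefix, language: 'typescript' }),
        });
        if (token.isCancellationRequested) return { items: [] };

        const { completion } = await res.json();
        return { items: [{ insertText: completion }] };
      },
      freeInlineCompletions() {
        // Nothing to release in this sketch.
      },
    });

That is the easy 80%; the hard 20% (debouncing, caching, bracket and indent handling, trimming context so you aren't burning tokens) is exactly the slog described above.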


They did. They’ve just released Codex (a CLI client).

They don’t have access to Copilot users in general; Microsoft and Google do. And perhaps they are realizing that Microsoft is hedging against them across multiple LLM providers and may no longer be feeding them juicy Copilot data, with humans in a tight loop correcting LLMs.


Controlling demand (developer workflow and mindshare) is a good position if you're trying to build scale on supply.


Maybe to avoid the Second System Effect.


This is a good point. It is already the case that unless you deeply review every Windsurf change, you will have zero understanding of your codebase. If it gets 1000x better in the next 3 years, why would anyone look at code at all?

Of course, back to reality. Today, at least in my workflow, I use / like Windsurf but it is a small part of what I am doing. For any code I want to keep I mostly write it by hand (using vim for a very bare-bones / cognitive mode experience). For me, the real flow state occurs in vim while ChatGPT and Windsurf are great for exploration.


It sounds like the OpenAI team is overburdened (I guess they aren’t AI super-users yet), so this may be their only option. It's an easy entry into a key segment, at least for now, and it locks out competitors.


So much for AI turning everyone at OpenAI into 1000x coders.


As a competitor in that key segment I don't feel locked out. I could almost jump for joy that this very weak-tea move is the most they can do with that much money. They're just quintupling down on the technology of 50 years ago. There's no threat to me at all here as a creator of from-first-principles IDE technology.


What are you working on?


It's not too hard to find out, but I'm going to make a big announcement in a few days, so my official message at the moment is "stay tuned".


It’s one of your GitHub projects?


They might just want a way to quickly collect data needed for fine-tuning the next generation of programming agents.


What were the human PhDs able to do after more than 48 hours of effort? Presumably given that these are top-level PhDs, the replication success rate would be close to 100%?


Depending on how well the exact algorithms, implementation details, and experimental design were documented, replication can easily take days, if not weeks. (Personally, I would start by filtering out papers that cannot be replicated by well-skilled researchers in a fixed amount of time and only giving the replicable ones to the agents.)


How does it perform on e.g. WebVoyager, WebArena, or OSWorld? These seem to be the oft-cited benchmarks when comparing computer-use agents.


This is the most interesting aspect to me. I had Claude generate a guide to all the gyms in Pokemon Red and instructions for how to quickly execute a playthrough [0].

It obviously knows the game through and through. Yet even with encyclopedic knowledge of the game, it's still a struggle for it to play. Imagine giving it a game of which it knows nothing at all.

[0] https://claude.site/artifacts/d127c740-b0ab-43ba-af32-3402e6...


There is no 3.6. There is 3.5 and 3.5 (New), both of which remain available.


Ah, you are right, I found it (it was under the More menu in the UI).


There was never a Sonnet 3.6. They released what is commonly known as 3.6 as "Sonnet 3.5 (New)". Then, because so many folks ended up referring to it as 3.6, they decided to call this new model 3.7, as the mental territory for 3.6 was already occupied by 3.5 (New). Not confusing in the slightest!


My feeling (totally unproven) is that in the drive to make Sonnet 3.7 more "agentic", they've lost some of its ability to actually just stick to what you asked it to do. It seems that it "wants" (I know, it's not sentient!) to be more in the driver's seat now.

Definitely can be very annoying if you do just want it to execute on a set of instructions.


I don't know, but if it wants to be in the driver's seat, maybe it can pay the subscription fee for itself, because it is barely usable for anything I paid for before.

To me, it's not a good deal to pay for something that makes up whatever it wants and completely disregards what I asked. It's like a recipe for the worst SaaS you've ever used.


This can mostly be mitigated with the right system prompt, although I've noticed that the prompt is occasionally ignored (roughly 1 in 20 times).


Yes, the Agent mode is terrible. Have you tried using the Ask mode instead?

