timabdulla's comments | Hacker News

I'd be curious to see screenshots or a video! I only have a Mac at my disposal, unfortunately.

This seems cool, but beware that Fly's other products are not exactly models of stability and polish.

API downtime is a semi-frequent occurrence, as are transient API errors and slowness.

I've also had a ticket open with support for weeks due to rampant billing issues. For instance, a destroyed instance still shows up in my usage report as actively accruing billed time, and at a rate faster than is even possible (something like 2 hours for every 1 actual hour that has passed.)

They've released two new products in the AI space, this and Phoenix.new, and my worry is that they are focused on new products over making what they have good and reliable.


Yeah, nobody should use this based on reliability and support alone.

> We conducted three runs per experiment and selected the run with the highest final accuracy for inclusion in the chart (though illustrative examples and anecdotes may be drawn from any of the runs).

Can you comment on the variance? It's impressive that models are able to do this consistently with 100% accuracy in the early months, but it would be less so if there was any significant degree of variance amongst the three runs (e.g. 90%, 95%, 100%.)


I mean, the fact that OpenAI, at the bleeding edge of it all, has decided to buy an IDE is a rather strong hint that the future of agents handling entire engineering tickets might be further out than many believe.

If autonomous agents were just around the corner, then why wouldn't OpenAI bet on their own Codex product obviating (most) need for an IDE and save themselves the $3 billion?


Why did OpenAI purchase Windsurf instead of prompting OpenAI's own models to create something like Windsurf?

This is the question I am still asking...


These products are not complicated at their core — you can pretty much just drop in something like Monacopilot [1] and be 80% of the way there. But the last 20% is a real slog, and it mostly comes down to handling edge cases (bracket closing...) and optimizing prompting/context so you aren't burning cash. Whatever anyone claims about "feeling the AGI," AI isn't there yet.

[1]: https://github.com/arshad-yaseen/monacopilot
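
A rough sketch of what that "80%" looks like, written directly against Monaco's inline-completion API (the kind of plumbing a library like Monacopilot wraps for you). The /api/complete endpoint here is hypothetical, standing in for whatever backend proxies your LLM:

    // Sketch only: wire Monaco's inline completions to an LLM backend.
    // Assumes a hypothetical POST /api/complete that returns { completion: string }.
    import * as monaco from 'monaco-editor';

    monaco.languages.registerInlineCompletionsProvider('typescript', {
      async provideInlineCompletions(model, position, _context, token) {
        // Use everything before the cursor as the prompt context.
        const prefix = model.getValueInRange({
          startLineNumber: 1,
          startColumn: 1,
          endLineNumber: position.lineNumber,
          endColumn: position.column,
        });

        const res = await fetch('/api/complete', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ prefix, language: 'typescript' }),
        });
        if (token.isCancellationRequested) return { items: [] };

        const { completion } = await res.json();
        return { items: [{ insertText: completion }] };
      },
      freeInlineCompletions() {
        // Nothing to release in this sketch.
      },
    });

That is the easy 80%; the hard 20% (debouncing, caching, bracket and indent handling, trimming context so you aren't burning tokens) is exactly the slog described above.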


They did. They’ve just released Codex (a CLI client).

They don’t have access to Copilot users in general; Microsoft and Google do. And perhaps they are realizing that Microsoft is hedging against them across multiple LLM providers and may no longer be feeding them juicy Copilot data, with humans in a tight loop correcting LLMs.


Controlling demand (developer workflow and mindshare) is a good position if you're trying to build scale on supply.


Maybe to avoid the Second System Effect.


This is a good point. It is already the case that unless you deeply review every Windsurf change, you will have zero understanding of your codebase. If it gets 1000x better in the next 3 years, why would anyone look at code at all?

Of course, back to reality. Today, at least in my workflow, I use / like Windsurf but it is a small part of what I am doing. For any code I want to keep I mostly write it by hand (using vim for a very bare-bones / cognitive mode experience). For me, the real flow state occurs in vim while ChatGPT and Windsurf are great for exploration.


It sounds like the OpenAI team is overburdened (I guess they aren’t AI super-users yet), so this may be their only option. It's an easy entry into a key segment, at least for now, and it locks out competitors.


So much for AI turning everyone at OpenAI into 1000x coders.


As a competitor in that key segment I don't feel locked out. I could almost jump for joy that this very weak-tea move is the most they can do with that much money. They're just quintupling down on the technology of 50 years ago. There's no threat to me at all here as a creator of from-first-principles IDE technology.


What are you working on?


It's not too hard to find out, but I'm going to make a big announcement in a few days, so my official message at the moment is "stay tuned".


It’s one of your GitHub projects?


They might just want a way to quickly collect data needed for fine-tuning the next generation of programming agents.


What were the human PhDs able to do after more than 48 hours of effort? Presumably given that these are top-level PhDs, the replication success rate would be close to 100%?


Depending on how well the exact algorithms, implementation details, and experimental design were documented, replication can easily take days, if not weeks. (Personally, I would start by filtering out papers that cannot be replicated by well-skilled researchers in a fixed amount of time and only giving the replicable ones to the agents.)


How does it perform on e.g. WebVoyager, WebArena, or OSWorld? These seem to be the oft-cited benchmarks when comparing computer-use agents.


This is the most interesting aspect to me. I had Claude generate a guide to all the gyms in Pokemon Red and instructions for how to quickly execute a playthrough [0].

It obviously knows the game through and through. Yet even with encyclopedic knowledge of the game, it's still a struggle for it to play. Imagine giving it a game of which it knows nothing at all.

[0] https://claude.site/artifacts/d127c740-b0ab-43ba-af32-3402e6...


There is no 3.6. There is 3.5 and 3.5 (New), both of which remain available.


Ah, you are right, I found it (it was under the More menu in the UI).


There was never a Sonnet 3.6. They released what is commonly known as 3.6 as "Sonnet 3.5 (New)". Then, because so many folks ended up referring to it as 3.6, they decided to call this new model 3.7, as the mental territory for 3.6 was already occupied by 3.5 (New). Not confusing in the slightest!


My feeling (totally unproven) is that in the drive to make Sonnet 3.7 more "agentic", they've lost some of its ability to actually just stick to what you asked it to do. It seems that it "wants" (I know, it's not sentient!) to be more in the driver's seat now.

Definitely can be very annoying if you do just want it to execute on a set of instructions.


I don't know, but if it wants to be in the driver's seat, maybe it can pay the subscription fee for itself, because it is barely usable for anything I paid for before.

To me, it's not a good deal to pay for something that makes up whatever it wants and completely disregards what I asked. It's like a recipe for the worst SaaS you've ever used.


This can mostly be mitigated with the right system prompt, although I've noticed that the prompt is occasionally ignored (roughly 1 in 20 times).


Yes, the Agent mode is terrible. Have you tried using the Ask mode instead?

