That's because many developers are used to working like this.
With AI, the correct approach is to think more like a software architect.
Learning to plan things out in your head upfront, instead of figuring them out while coding, requires a mindset shift, but it's important for working effectively with the new tools.
To some this comes naturally, for others it is very hard.
I think what GP is referring to is technical semantics and accidental complexity. You can’t plan for those.
The same kind of planning you’re describing can and does happen sans LLM, usually on the sofa or in front of a whiteboard, or by reading some research materials. No good programmer rushes to coding without a clear objective.
But the map is not the territory. A lot of questions surface during coding. LLMs will guess and the result may be correct according to the plan, but technically poor, unreliable, or downright insecure.
I don't think any complex plan should be worked out entirely in your head. But drawing diagrams, sketching components, listing pros and cons: 100%. Not jumping directly into coding might look more like jumping into writing a spec or a PoC.
Maintaining a 'mental RAM cache' is a powerful tool for understanding the system as a whole on a deep and intuitive level, even if you can only 'render' sections at a time. The bigger it is, the more you can keep track of, and the better you can foresee interactions between distant pieces.
It shouldn't be your only source of a plan as you'd likely wind up dropping something, but figuring out how to jiggle things around before getting it 'on paper' is something I've found helpful.
Following the RAM analogy, this sounds like saving files only in RAM, instead of creating the files in the file system, persisted on disk, and then caching it in RAM.
Personally, I can't think through complex things (complex logic, constraints, etc.) without writing or sketching.
I guess this topic is too abstract, so we can read different things into it.
Just some anecdata++ here but I found 5.2 to be really good at code review. So I can have something crunched by cheaper models, reviewed async by codex and then re-prompt with the findings from the review. It finds good things, doesn't flag nits (if prompted not to) and the overall flow is worth it for me. Speed loss doesn't impact this flow that much.
I might flip that, given how hard it's been for Claude to deal with longer-context tasks like a coding session with iterations vs. a single top-down diff review.
I have a `codex-review` skill with a shell script that uses the Codex CLI with a prompt. It tells Claude to use Codex as a review partner and to push back if it disagrees. They will sometimes go through 3 or 4 back-and-forth iterations before they find consensus. It's not perfect, but it does help because Claude will point out the things Codex found and give it credit.
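For anyone curious, the script doesn't have to be fancy. Here's a rough Python sketch of the idea (not my actual skill, and it assumes the Codex CLI's non-interactive `codex exec <prompt>` mode): gather the current diff, wrap it in a review prompt, run it through Codex, and hand the findings back to Claude.

```python
# Rough sketch of a "codex-review" helper: collect the working-tree diff,
# wrap it in a review prompt, and run it through the Codex CLI non-interactively.
# Assumes `codex exec "<prompt>"` is available; adjust to your own setup.
import subprocess

REVIEW_PROMPT = (
    "Act as a critical review partner. Review the diff below for bugs, risky "
    "changes, and missing tests. Skip style nits. Push back if you disagree "
    "with the approach.\n\n{diff}"
)

def codex_review() -> str:
    # current uncommitted changes; swap for another diff range as needed
    diff = subprocess.run(
        ["git", "diff", "HEAD"], capture_output=True, text=True, check=True
    ).stdout
    result = subprocess.run(
        ["codex", "exec", REVIEW_PROMPT.format(diff=diff)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout  # Claude reads this output and responds or iterates

if __name__ == "__main__":
    print(codex_review())
```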
I don’t use OpenAI too much, but I follow a similar workflow. Use Opus for design/architecture work. Move it to Sonnet for implementation and build-out. Then finally over to Gemini for review, QC and standards check. There is an absolute gain in using different models. Each has its own style and way of solving the problem, just like a human team. It’s kind of awesome and crazy and a bit scary all at once.
The way "Phases" are handled is incredible with research then planning, then execution and no context rot because behind the scenes everything is being saved in a State.md file...
I'm on Phase 41 of my own project and the reliability and almost absence of any error is amazing. Investigate and see if its a fit for you. The PAL MCP you can setup to have Gemini with its large context review what Claude codes.
5.2 Codex became my default coding model. It “feels” smarter than Opus 4.5.
I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double check the work. It's nice to have another frontier model's opinion and ask it to spot any potential issues.
All shared machine learning benchmarks are a little bit bogus, for a really “machine learning 101” reason: your test set only yields an unbiased performance metric if you agree to only use it once. But that just isn’t a realistic way to use a shared benchmark. Using them repeatedly is kind of the whole point.
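A quick way to see the effect: simulate a bunch of equally good model variants evaluated on the same fixed test set and report only the best one. The winner's score is inflated purely by selection, which is roughly what repeated benchmark use does.

```python
# Toy simulation of test-set reuse: every "variant" has the same true accuracy,
# but picking the best measured score over many evaluations biases it upward.
import random

random.seed(0)
TRUE_ACC = 0.70      # true accuracy of every variant
N_ITEMS = 500        # benchmark size
N_VARIANTS = 50      # how many times the benchmark gets reused

def measured_accuracy() -> float:
    # one evaluation run: each item passes independently with probability TRUE_ACC
    return sum(random.random() < TRUE_ACC for _ in range(N_ITEMS)) / N_ITEMS

scores = [measured_accuracy() for _ in range(N_VARIANTS)]
print(f"single use:  {scores[0]:.3f}")          # unbiased, ~0.70
print(f"best of {N_VARIANTS}: {max(scores):.3f}")   # noticeably above 0.70
```

Same test set, same underlying quality, but the reported number drifts up the more you lean on it.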
But even an imperfect yardstick is better than no yardstick at all. You’ve just got to remember to maintain a healthy level of skepticism is all.
Is an imperfect yardstick better than no yardstick? It reminds me of documentation — the only thing worse than no documentation is wrong documentation.
Yes, because there’s value in a common reference for comparison. It helps to shed light on different models’ relative strengths and weaknesses. And, just like with performance benchmarks, you can learn to spot and read past the ways that people game their results. The danger is really more in when people who are less versed in the subject matter take what is ultimately just a semi-tamed genre of sales pitch at face value.
When such benchmarks aren’t available, what you often get instead is teams creating their own benchmark datasets and then testing both their own and existing models’ performance against them. Which is even worse, because they probably still run the test multiple times (there’s simply no way to hold others accountable on this front), but on top of that they often hyperparameter-tune their own model for the dataset while reusing previously published hyperparameters for the other models. Which gives them an unfair advantage, because those hyperparameters were tuned to a different dataset and may not even have been optimizing for the same task.
It's not just over-fitting to leading benchmarks; there are also too many degrees of freedom in how a model is tested (harness, etc.). Until there's standardized documentation enabling independent replication, it's all just benchmarketing.
Codex 5.3 seems to be a lot chattier. As in, it comments in the chat about things it has done or is about to do. These don't show up as "thinking" CoT blocks but as regular outputs. Overall the experience is somewhat more like Claude's, in that you can spot the problems in the model's reasoning much earlier if you keep an eye on it as it works, and steer it away.
Another day, another hn thread of "this model changes everything" followed immediately by a reply stating "actually I have the literal opposite experience and find competitor's model is the best" repeated until it's time to start the next day's thread.
What amazes me the most is the speed at which things are advancing. Go back a year, or even the year before that, and all these incremental improvements have compounded. Things that used to require real effort to solve consistently, whether with RAG or context/prompt engineering, have become… trivial. I totally agree with your point that each step along the way doesn’t necessarily change that much. But in the aggregate it’s sort of insane how fast everything is moving.
The denial of this overall trend on here and in other internet spaces is starting to really bother me. People need to have sober conversations about the speed of this increase and what kind of effects it's going to have on the world.
Yeah, I really didn't believe in agentic coding until December; that was when it went from being slightly more useful than hand-crafting code to being extremely powerful.
And of course the benchmarks are from the school of "It's better to have a bad metric than no metric", so there really isn't any way to falsify anyone's opinions...
> All anonymous as well. Who are making these claims? script kiddies? sr devs? Altman?
You can take off your tinfoil hat. The same models can perform differently depending on the programming language, frameworks and libraries employed, and even project. Also, context does matter, and a model's output greatly varies depending on your prompt history.
It's hardly tinfoil to understand that companies riding a multi-trillion dollar funding wave would spend a few pennies astroturfing their shit on hn. Or overfit to benchmarks that people take as objective measurements.
Opus 4.5 still worked better for most of my work, which is generally "weird stuff". A lot of my programming involves concepts that are a bit brain-melting for LLMs, because multiple "99% of the time, assumption X is correct" are reversed for my project. I think Opus does better at not falling into those traps. Excited to try out 5.3
It's relatively easy for people to grok, if a bit niche. Just sometimes confuses LLMs. Humans are much better at holding space for rare exceptions to usual rules than LLMs are.
From the post I expected that the tasks were about analysing traces, but all the tasks in the repository are about adding instrumentation to code!
Some of the instructions don't give any guidance on how to do it; some specify which libraries to use.
"Use standard OTEL patterns" ... that's about as useful as saying "go write some code". There are a lot of ways to do instrumentation....
I'd be very curious HOW exactly the models fail.
Are the test sets just incredibly specific about what output they expect, so you get a lot of failures because of tiny, subtle mismatches? Or do the models just get the instrumentation categorically wrong?
Also important: do the models have access to a web search tool to read the library docs?
OTel libraries are often complicated to use... without reading the latest docs or source code this would be quite tricky.
Some models have gotten better at adding dependencies, installing them and then reading the code from the respective directory where dependencies get stored, but many don't do well with this.
All in all, I'm very skeptical that this is very useful as a benchmark as is.
I'd be much more interested in tasks like:
Here are trace/log outputs, here is the source code; find and fix the bug.
+1. Tasks like "Add OTel instrumentation" arguably belong more in a coding bench than an SRE bench. I came here expecting to see things like: here is how models perform on finding the root cause in 50 complicated microservice failure scenarios.
For AI-SRE tasks like finding the root cause of bugs and errors, I believe the key is to provide tools to the agent to query metrics, logs and traces and understand the problem. I’m working on a similar OSS framework and benchmark (work in progress, using metrics and logs; demo: https://youtube.com/playlist?list=PLKWJ03cHcPr3Od1rwL7ErHW1p...). The context is semantics and Text2SQL to query the right metrics and logs, and the benchmark is a set of skills that Claude Code or other agents can run, using these tools, to find the root cause of errors.
I'm surprised by how many people think that an SRE's job is to debug.
An SRE's job is to make the software reliable: for instance, by adding telemetry, and by understanding and improving the failure modes, the behavior under load, etc.
So a better SRE test would not be "read the logs and fix the bug", but rather "read the code and identify potential issues".
>Some of the instructions don't give any guidance on how to do it; some specify which libraries to use.
Supporting a piece of cloud software with a lot of microservices, I think this is a more general problem for humans too. The app I work with mandated some logging requirements, like which library to use, but that was it; different parts built by different teams ended up with all kinds of different behaviors.
As for the AI side, this is something where I see our limited context sizes causing issues when developing architecture across multiple products.
This is definitely not a context problem. Very simple things, like checking for running processes and killing the correct one, are something that models like Opus 4.5 can't get consistently right, instead of recognizing that it needs to systematize that sort of thing once and be done with it. Probably 50% of the time it kills the wrong thing. About 25% of the time after that, it recognizes that it didn't kill the correct thing, rewrites the ps or lsof invocation from scratch, and has the problem again. Then if I kill the process myself out of frustration, it checks to see if the process is running, sees that it's not, gets confused, and sets its new task to rewriting the ps or lsof... again. It does the same thing with tests, where it decides to just, without any doubt in its rock brain, delete the test and replace it with a print statement.
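To make "systematize it once" concrete, this is roughly the kind of one-and-done helper I mean (a sketch using psutil; matching by port is just an example), so nothing has to re-derive a ps/lsof pipeline ever again:

```python
# One-and-done helper: find the process listening on a port and kill exactly
# that one, instead of improvising ps/lsof pipelines each time. Sketch only;
# assumes psutil is installed and may need elevated privileges on some OSes.
import psutil

def kill_listener(port: int) -> bool:
    for conn in psutil.net_connections(kind="inet"):
        if conn.status == psutil.CONN_LISTEN and conn.laddr and conn.laddr.port == port and conn.pid:
            try:
                proc = psutil.Process(conn.pid)
                print(f"killing {proc.name()} (pid {conn.pid}) listening on :{port}")
                proc.terminate()
                proc.wait(timeout=5)   # confirm it actually exited
                return True
            except psutil.NoSuchProcess:
                return True            # already gone, which is the goal anyway
    return False

if __name__ == "__main__":
    kill_listener(8080)  # example port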
Context size isn't the issue. You cannot effectively leverage an infinite context if you had one anyways. The general solution is to recursively decompose the problem into smaller ones and solve them independently of each other, returning the results back up the stack. Recursion being the key here. A bunch of parallel agents on separate call stacks that don't block on their logical callees is a slop factory.
Are these the same people who say it doesn't work well? I've been experimenting with writing what I actually mean by that (with the help of an LLM, funny enough), and it seems to be giving me much better code than the typical AI soup. e.g.
- functional core, imperative shell. prefer pure helpers.
- avoid methods when a standalone function suffices
- use typed errors. avoid stringly errors.
- when writing functions, create a "spine" for orchestration
- spine rules: one dominant narrative, one concept per line, named values.
- orchestration states what happens and in what order
- implementation handles branching, retries, parsing, loops, concurrency, etc.
- apply recursively: each function stays at one abstraction level
- names describe why something exists, not how it is computed
etc.
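Concretely, here's a toy Python sketch (all names invented) of what following these rules tends to produce: the top-level function is a flat "spine" of named steps, pure helpers own the branching and validation, and failures are typed rather than stringly.

```python
# Toy illustration of the rules above: typed errors, pure helpers, and a
# top-level "spine" that reads as one narrative with one concept per line.
class InvalidOrder(Exception):
    """Typed error instead of a stringly one."""

def parse_order(raw: dict) -> tuple[str, int]:
    # pure helper: validation and branching live here, not in the spine
    if "sku" not in raw or raw.get("qty", 0) <= 0:
        raise InvalidOrder("missing sku or non-positive qty")
    return raw["sku"], raw["qty"]

def price_order(sku: str, qty: int) -> int:
    # pure helper: no I/O, trivially testable
    unit_price_cents = {"widget": 250}.get(sku, 100)
    return unit_price_cents * qty

def format_receipt(sku: str, qty: int, total_cents: int) -> str:
    return f"{qty} x {sku}: ${total_cents / 100:.2f}"

def checkout(raw_order: dict) -> str:
    # the "spine": one dominant narrative, named values, one abstraction level
    sku, qty = parse_order(raw_order)
    total_cents = price_order(sku, qty)
    receipt = format_receipt(sku, qty, total_cents)
    return receipt

print(checkout({"sku": "widget", "qty": 3}))  # "3 x widget: $7.50"
```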
This is no different from writing a style guide for your team/org. You don't just say "write clean code" and expect that you'll get something you like.
To play devil's advocate: why do we have to lay out a simple task in PAINSTAKING DETAIL for an AI model that is "PhD level" and going to take our jobs in 6-12 months?
Why am I still holding its hand like it has the intellect and experience of a new-hire intern that's coded one project in college?
I would never expect to have to lay out every detail about "how to write code" for someone I hired to code on my team at the SWE II level and above (i.e., sub-senior but beyond junior).
In fact, oftentimes backlog items are "fix bug in X where Y is happening" or "add instrumentation to X so that we can see why it's crashing at runtime".
I find that generally it does alright picking up the style of what exists on its own, so this is more important if it's writing something completely from scratch.
I think also "how to write code" is a matter of taste. e.g. in many ways I think I and a Laravel or Rails developer would each think that the other person's code is bad. e.g. as a small-ish thing, I think test-driven development sounds like a massive waste of time, but type-driven development is a huge productivity multiplier and makes the code a lot clearer. I'm sure that I have massive disagreements with e.g. the Go maintainers about what is straightforward.
Don't worry about devil's advocate, if < 100 words feels like a gargantuan amount of documentation effort ("PAINSTAKING DETAIL"), well, there are certain stereotypes about developers (not) writing comments or documentation that come to mind. Whoever coined the term "prompt engineering" may have the last laugh (before the robots take over) after all.
I hate that it's true, but things like this make outputs night-and-day for me. This is the difference e.g. of a model writing appropriate test harnesses, or pushing back on requirements, vs writing the most absolute horrible code and test/dependency injection I've ever seen in pursuit of the listed goals.
Similar to adjacent commenters, I've tried to get better at enumerating what I consider to be best practice, but I couldn't argue in good faith that instructions like these produce no noticeable improvement.
(As with all things AI, it could all be perception on my end, so YMMV. I wish there were a better way to concretely evaluate the effects of different rule sets / instructions / ... on outcomes.)
Like with robotaxis: OK, the thing is not perfect, but how does it compare to a human? I'm interviewing ops/SRE candidates at the moment, and I'm not so happy with what I see...
If you're interviewing ops people, don't expect them to know anything about OTEL. Ops is about the platforms, systems, and operations surrounding and supporting the application.
Integrating OTEL into an application stack requires explicit knowledge of the code, i.e. the developers.
While it is a "massive endeavor", it is not impossible, it essentially amounts to writing portable code. A computer is a computer, and most of the tech stack in US cloud providers is based on open source projects.
Not depending on Chinese manufacturing is borderline impossible even if you are starting from scratch. Not only will it be way more expensive, with potentially longer delays and lower capacity, but just finding a company that can and wants to do the job can be a nightmare. From what I have seen, many local manufacturers in the US and Europe are really there to fulfill government contracts that require local production.
Most hardware Kickstarter-like projects rely on Chinese manufacturing as if it were obvious. It is not "find a manufacturer", it is "go to China". Projects that instead rely on local (US/Europe) manufacturing in order to make a political statement have to go through a lot of trouble, and the result is often an overpriced product that may still have some parts made in China.
Anyone who thinks a migration at scale is just about “writing portable code” has never done a migration at scale.
A large corporation just migrating from everything hosted on VMs can take years.
And if you are responsible for an ETL implementation and working with AWS and have your files stored on S3 (every provider big and small has S3 compatible storage) and your data is hosted on Aurora Postgres, are you going to spend time creating a complicated ETL process or are you going to just schedule a cron job to run “select outfile into S3”?
And “most” of the services on AWS aren’t based on open source software, and you still have to provision your resources using IaC for your architecture. No, Terraform doesn’t give you “cloud agnosticism” any more than using Python does when you’re using AWS services.
I don't think anyone here is arguing that, just that you can make things less painful with portable code. It still won't be easy, as everybody in this chain agrees. But we don't put off things that need to be done just because they're "difficult".
If it takes a year and a half to migrate from plain old VMs to AWS as the first part of “lift, shift and modernize”, and you have to do it in “waves”, how much difference is the code going to make?
Are you going to tell your developers to spend weeks writing ETL code that could literally be done in an hour using SQL extensions to AWS?
Are you going to tell them not to use any AWS native services? What are you going to do about your infrastructure as code? Are you going to tell them to set up a VM to host a simple cron job instead of just using a Lambda + Event Bridge?
And what business value does this theoretical “cloud agnosticism” bring? It never actually materializes once you get to scale.
It took Amazon years to move off of Oracle and much of its infrastructure still doesn’t run on AWS and still uses its older infrastructure (CDO? It’s been a while and I was on the AWS side)
I have yet to hear anyone who worries about cloud agnosticism even think about the complexity that migrations bring at scale, the risk of regressions, etc.
While I personally stay the hell away from lift and shifts and I come in at the “modernization” phase, it’s because I know the complexity and drudgery of it. I worked at AWS ProServe for 3.5 years and I now work as a staff consultant at a 3rd party consulting company.
This isn’t me rah rahing about AWS. I would say the same about GCP, Azure, the choice of database you use, or any other infrastructure decision.
If it only took 18 months for all that, I'd be very impressed. I was thinking at least a year of inevitable meetings and plannings and maybe 3 years of slow execution. And I still might be optimistic there.
>And what business value does this theoretical “cloud agnosticism” bring? It never actually materializes once you get to scale.
The "business value" here is not being beholden to an increasingly hostile "ally" who owns the land these servers operate on. If you aren't worried about that, then there is no point in doing any of this.
But if things do escalate to war, there's a very obvious attack vector to cripple your company with. Even if you're only 20% into the migration, that's better than 0%.
I don’t know how long it took before they brought AWS in and they decided to do something or if they failed beforehand and I don’t know how long it was before they brought me in.
Oh, sorry. I wasn't trying to speak on your experiences specifically. It's more about the general talk around the scenario of "America is compromised, we need to decouple starting now".
I of course don't know the scale of your company or how much they even wanted to migrate. Those are all variables in this.
Yup! Still very doable, and it has been done tens if not hundreds of thousands of times before: migrations from e.g. AWS -> Azure/GCP, or, even harder, cloud -> on-prem.
How often has replacing a Chinese tech manufacturing dependency at scale been done before? About zero times.
Since it's an AI company, and not actually doing anything by hand, it wouldn't surprise me if they came up with the name "manus" because it has "anus" in it, and then designed the hand logo due to the Latin meaning of the name. [This is sarcasm, in case that was not clear.]
I've been preaching similar thoughts for the last half year.
Most popular programming languages are optimized for human convenience, not for correctness! Even most of the popular typed languages (Java/Kotlin/Go/...) have a wide surface area for misuse that is not caught at compile time.
Case in point: In my experience, LLMs produce correct code way more regularly for Rust than for Js/Ts/Python/... . Rust has a very strict type system. Both the standard library and the whole library ecosystem lean towards strict APIs that enforce correctness, prevent invalid operations, and push towards handling or at least propagating errors.
The AIs will often write code that won't compile initially, but after a few iterations with the compiler the result is often correct.
Strong typing also makes it much easier to validate the output when reviewing.
With AIs being able to do more and more of the implementation, the "feel-good" factor of languages will become much less relevant. Iteration speed is not so important when parallel AI agents do the "grunt work". I'd much rather wait 10 minutes for solid output than 2 minutes for something fragile.
We can finally move the industry away from wild-west languages like Python/JS and towards more rigorous standards.
Rust is probably the sweet spot at the moment, thanks to it being semi-popular with a reasonably active ecosystem. Sadly, I don't think the truly right language exists yet.
What we really want is a language with a very strict, comprehensive type system with dependent types, maybe linear types, structured concurrency, and a built-in formal proof system.