
It's really hard for me to take these benchmarks seriously at all, especially that first one where Sonnet 4.5 is better at software engineering than Opus 4.1.

It is emphatically not, and it never has been. I have used both models extensively and I have never encountered a single situation where Sonnet did a better job than Opus. Any coding benchmark that has Sonnet above Opus is broken, or at the very least measuring things that are totally irrelevant to my use cases.

This in particular isn't my "oh, the teachers lie to you" moment that makes you distrust everything they say, but it really hammers the point home. I'm glad there's a cost drop, but at this point my assumption is that there's also going to be a quality drop until I can prove otherwise in real world testing.


These announcements and "upgrades" are becoming increasingly pointless. No one is going to notice this. The improvements are questionable and inconsistent. They could swap it out for an older model and no one would notice.


This is the surest sign progress has plateaued, but it seems people just take the benchmarks at face value.


Now we can make a firecracker's worth of antimatter (by annihilation energy) in a mere two hundred thousand years of continuous production. Super cool stuff though, pun intended.


I don't consider work done to prevent me having complete control of my own hardware to be a positive development. In fact it's one of the worst things they could spend their time on (from a long term global optimum perspective).


Let me just hit up the bot on irc so i can get the one time ftp login to xbins and grab this.


I have Qwen3-30B-VL (an MoE model) resident in my VRAM at all times now because it's quicker to use it than Google for answering most basic questions. Stuff like remembering how to force kill a WSL instance, which I don't do all that often, is now frictionless because I can just type in the terminal (q is my utility)

    q how to force kill particular WSL
and it will respond with "wsl --terminate <distro-name>" much faster than google
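
For the curious: a q-style wrapper can be as small as a curl call against a local OpenAI-compatible endpoint. This is just a sketch of the shape, not my actual script; it assumes a llama.cpp llama-server on localhost:8080 and needs curl + jq.

    #!/bin/sh
    # hypothetical sketch of a q-style helper, not the actual utility
    # assumes a llama.cpp llama-server on localhost:8080 (OpenAI-compatible API)
    # note: naive quoting, so avoid quotes/backslashes in the question
    PROMPT="$*"
    curl -s http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d "{\"messages\": [{\"role\": \"user\", \"content\": \"$PROMPT\"}]}" \
        | jq -r '.choices[0].message.content'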

It's also quite good at tool calling. If you give it shell access it'll happily do things like "find me files over 10mb modified in the last day", where remembering the flags and command structure, if you're not doing that action regularly, previously required a Google or a peek at the manpage.
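
For that example, the command it needs to land on is roughly:

    # "find me files over 10mb modified in the last day"
    find . -type f -size +10M -mtime -1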

I also use it to transcribe todo lists and notes and put them in my todo app, as well as for text manipulation. For example, if I have a list of API keys and URLs or whatever that I need to populate into a template, I can just select the relevant part of the template in VSCode, put the relevant data in the context, and say "fill this out", and it does it faster than I would be able to do the select - copy - select - paste loop, even with my hard-won VIM knowledge.

TL;DR

It's very fast (90 tok/s) and very low latency, and that means it can perform a lot of mildly complex tasks that have an obvious solution faster than you can.

And fwiw I don't even think Sonnet 4.5 is very useful; it's a decent model, but it's very common for me to push it into a situation where it will be subtly wrong and waste a lot of my time (of course that's colored by it being slow and costing money).


Qwen3-30B-VL is going to be fucking hard to beat as a daily driver; it's so good for the base 80% of tasks I want an AI for, and holy fuck is it fast: 90 tok/s on my machine, and I pretty much keep it in VRAM permanently. I think this sort of work is important and I'm really glad it's being done, but in terms of something I want to use every day there's no way a dense model can compete unless it's smart as fuck. Even dumb models like Qwen3-30B get a lot of stuff right, and not having to wait is amazing.


Olmo author here! Qwen models are in general amazing, but 30B is v fast cuz it’s an MoE. MoEs very much on the roadmap for next Olmo.


Thanks for the hint. I just tried it on a brand new Mac laptop, and it’s very slow here. But it led me to test qwen2.5:14b and it looks like it can create an instant feedback loop.

It can even interact through fluent Esperanto, very nice.


I'm specifically talking about qwen3-30b-a3b, the MoE model (this also applies to the big one). It's very very fast and pretty good, and speed matters when you're replacing basic google searches and text manipulation.


I'm only superficially familiar with these, but curious. Your comment above mentioned the VL model. Isn't that a different model or is there an a3b with vision? Would it be better to have both if I'd like vision or does the vision model have the same abilities as the text models?



fwiw on my machine it is 1.5x faster to run inference in llama.cpp; these are the settings I use for the Qwen I just keep in VRAM permanently:

    llama-server --host 0.0.0.0 --model Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf --mmproj qwen3-VL-mmproj-F16.gguf --port 8080 --jinja --temp 0.7 --top-k 20 --top-p 0.8 -ngl 99 -c 65536 --repeat_penalty 1.0 --presence_penalty 1.5


This has been my question also: I spend a lot of time experimenting with local models and almost all of my use cases involve text data, but having image processing and understanding would be useful.

How much do I give up (in performance, running on my 32GB M2 Pro Mac) using the VL version of a model? For MoE models, hopefully not much.


All the Qwen flavors have a VL version, and it's a separate tensor stack, so it only costs a bit of VRAM if you want to keep it resident. Vision-based queries take longer to process context, but generation is still fast asf.

I think the model itself is actually "smarter" because they split the thinking and instruct models, so both modes get better in their respective model.

I use it almost exclusively to OCR handwritten todo lists into my todo app and I don't think it's missed yet; it does a great job of tool calling everything.
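
For anyone curious, the call itself is just the OpenAI-style multimodal payload against that same llama-server; here's a rough sketch (treat the exact payload shape as an approximation, and it needs curl + jq):

    # sketch: transcribe a photo of a handwritten todo list
    # assumes the llama-server with --mmproj from my other comment, on localhost:8080
    # base64 -w0 is GNU coreutils; macOS needs `base64 -i todo.jpg`
    IMG=$(base64 -w0 todo.jpg)
    curl -s http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"messages": [{"role": "user", "content": [
              {"type": "text", "text": "Transcribe this todo list, one item per line."},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$IMG"'"}}
            ]}]}' | jq -r '.choices[0].message.content'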


I'm out of the loop... so Qwen3-30B-VL is smart and Qwen3-30B is dumb... and that has to do not with the size but architecture?


Olmo author here, but I can help! The first release of Qwen 3 left a lot of performance on the table bc they had some challenges balancing thinking and non-thinking modes. The VL series has a refreshed posttrain, so they are much better!


Ahaha sorry, that was unclear. While I think the VL version is maybe a bit more performant, by "dumb" I meant any low-quant, small model you're going to run locally, vs a "smart" model, which in my book is something like Opus 4.1 or Gemma 3.

I basically class LLM queries into two categories: there's stuff I expect most models to get, and stuff I expect only the smartest models to have a shot at getting right. There's some stuff in the middle ground that a quantized model running locally might not get, but that something dumb-but-acceptable like Sonnet 4.5 or Kimi K2 might be able to handle.

I generally just stick to the two extremes and route my queries accordingly. I've been burned by Sonnet 4.5/GPT-5 too many times to trust them.


sorry i meant gemini 3


Very very sad that adaptive subdivision is touted as a Blender feature when unfortunately it's really a Cycles-only feature.

Always nice to see these updates though, Blender has really come a long long way.


It might be possible to reproduce the same effect in EEVEE using geometry nodes. I know people have done that for automatic level of detail work. That being said, IDK if subsurf as a geometry node will take a non-constant number of iterations.


I've been using XMonad since 2012, not really for any reason other than someone at HMC told me to and I just stuck with it ever since. Sometimes I wonder if there's a reason to try something else, but I already know all the keybinds and it just does everything I want.

My absolute favorite feature (I'm sure this is present elsewhere too) is the idea that I have my stuff laid out on virtual screens and I can just assign a virtual screen to a physical screen super trivially, without ever moving my hands off the keyboard. It's such a wonderful workflow.

Tiling WMs are one of those power user things where once you get used to it, the other way just seems so obviously bad. VIM and Blender are similar: an unfamiliar, annoying interface if you're used to the normie way of doing things, but once you understand the patterns and the way you can compose them it becomes so much more expressive.


> Who are we to tell you how to use your computer?

i'm having a hard time describing the feelings this makes me feel. like i've been stressed, bedraggled and worn down, and suddenly there's a moment where i can just rest

it's nice to be excited about something for once instead of the baseline expectation of a horrible adversarial experience, which is the case for most tech in 2025

it is somewhat depressing that it's this novel to expect a piece of hardware to actually exist to make my life nicer vs the default of being an abomination that tries constantly to extract money and information from me like a fucking vampire

(and i guess, not having used this yet, this also speaks to valve being one of the last companies that i have any trust in to be capable of making a business decision that makes them less money in the short run in order to deliver a better product)


Valve earned a lot of goodwill from me when I set up my docked Steam Deck as my main media player & gaming device. It required me to do a lot of little hacks. I was doing stuff the device wasn't meant to do, but it never put up roadblocks just because I was doing something I wasn't "supposed" to. Not like when I want to do simple things on my wife's MacBook.


An ongoing 'background noise' concern I've had for a while is how PC gaming seems to be centralizing around Steam. There are reasons why that happened, but it'd be real nice if the 'infrastructure' were able to decouple from their store. It feels like practically requiring Steam for PC gaming on Windows, and certainly on Linux, isn't a mile away from requiring MS Windows. Is it much freedom to pick which Seattle-based company you run software from?


I don't think there's NO reason to be concerned, but I think it's pretty different considering the decades of history of how Valve acts vs how M$FT acts. Also, many games available on Steam are DRM-free or available from other sources, and Proton itself is open source.

Valve is also not publicly traded, and they have a succession plan of some sort in the event that gaben kicks it. I can only assume whatever he's come up with is sound; he's done a great job of running the place so far.


FWIW 95% of the games I play on my Linux machine are from stores other than Steam: GOG, Zoom Platform (not related to the Zoom telething) and itch.io, all of which are DRM-free stores. The Steam games I buy are mainly from small indie devs that do not have, nor plan to have, releases outside of Steam.

To play games I use UMU Launcher, which is basically Proton minus Steam (or Wine plus DXVK, etc, depending on how you look at it). I use the "raw" UMU Launcher with its own command-line utility, though it can be used as part of Lutris for a GUI-based experience.
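
If you want to try the raw CLI, an invocation looks roughly like this (from memory, so treat the exact env var names and values as assumptions and check the umu docs):

    # hypothetical sketch of a raw umu-run invocation
    WINEPREFIX=~/Games/prefixes/mygame GAMEID=umu-default \
        PROTONPATH=GE-Proton umu-run ./installer.exe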


> There's reasons why that happened

Steam's near-monopoly was earned by simply being the best store. Other stores like Epic don't even include basic features like a shopping cart to buy multiple games at once.

I could go on and on about why Steam is so much better than any other store, but this isn't the place.

That said, I can understand being nervous. Steam is great because it's privately owned and GabeN is happy with the money he makes from it and doesn't feel the need to enshittify it in order to get more money. But eventually he will die or retire, and someone else will be given control. Supposedly, he's already vetted some people to take the job, but what's to say they weren't merely playing the part and will take it public as soon as they can?


Epic actually got a shopping cart last year. Still has terrible UX, however.


There are plenty of competing stores, they just aren't good. I require a game to be on steam because I like the store and features, but many games are also sold elsewhere.


The built-in Steam DRM is very weak. Of course that can change at any time, but at least the current catalog of Steam-DRM-only games isn't really tied down to Steam except via law/licensing.


When the alternatives are Epic, EA and Microsoft, you choose the lesser evil.


A couple weeks ago Amazon said something like "we were trying to compete with Steam and even with all our resources nobody noticed", and that made me realize something: ideally, companies with similar products and services compete on features and cost, but nowadays the big tech providers compete more on lock-in than anything else. In the market of video game retail stores, though, the competition _is_ on features and price, because Steam competes on those terms (ref gaben's famous quote "piracy is a service problem"; they're even competing and succeeding against free products).


I definitely didn't notice, I had no idea they were trying anything like that.


The Steam Deck has been my dream computer for this reason. It just works; literally all of the hardware is 100% supported on Linux. And it's also not locked down in any way: you are completely free to install anything you want. I'm just so glad at least one tech company has the resources and will to create something that is a fully polished, consumer-ready product which also isn't completely restricted.


Steam is a service that's been running for >20 years and somehow hasn't been enshittified (although, I suppose when it first appeared it was seen as enshittification). It's worth celebrating, to be honest.


I can personally vouch for a great deal of consternation among players of Valve's games at the time of Steam's launch and I have the IRC logs to prove it!

I was also personally resistant to the new thing, and to this day have "only" a five digit Steam ID rather than maybe a four or even three digit one. Haha!

Since then I can say that PC gamers have only benefited greatly from Valve's benevolent dictatorship compared to the alternatives.


Plot twist: Valve AI will syphon all your user metrics into Valve's new model. J/k, and all joking aside, I feel the same way. Feels like a love letter to gamers.


Valve being the only company in 2025 launching something that isn't a glowing AI button.

Coincidentally also the only launch in 2025 people appear genuinely excited about.


Claude, can you summarize this book for me so I can post about it on hacker news?

I skimmed a few chapters and deeply read a few paragraphs, and it's certainly thought-provoking-book-shaped. I really do get a pretty visceral feeling when reading stuff like this of "why would I bother reading it if someone couldn't be bothered to write it", though I get this was a collaboration, so perhaps there's more effort in here. Reading the whole conversation, inclusive of the user prompts, would be a lot more interesting to me.

