This looks cool and could be a much needed step towards fixing the web.
Some questions:
[Tech]
1. How deep does the modification go? If I request a tweek to the YouTube homepage, do I need to re-specify or reload the tweek to have it persist across the entire site (deeply nested pages, iframes, etc.)?
2. What is your test and eval setup? How confident are you that the model is performing the requested change without being overly aggressive and eliminating important content?
3. What is your upkeep strategy? How will you ensure that your system continues to WAI after site owners update their content in potentially adversarial ways? In my experience LLMs do a fairly poor job at website understanding when the original author is intentionally trying to mess with the model, or has overly complex CSS and JS.
4. Can I prompt changes that I want to see globally applied across all sites (or a category of sites)? For example, I may want a persistent toolbar for quick actions across all pages -- essentially becoming a generic extension builder.
[Privacy]
5. Where and how are results being cached? For example, if I apply tweeks to a banking website, what content is being scraped and sent to an LLM? When I reload a site, is content being pulled purely from a local cache on my machine?
[Business]
6. Is this (or will it be) open source? IMO a large component of empowering the user against enshittification is open source. As compute commoditizes it will likely be open source that is the best hope for protection against the overlords.
7. What is your revenue model? If your product essentially wrestles control from site owners and reduces their optionality for revenue, your arbitrage is likely to be equal or less than the sum of site owners' loss (a potentially massive amount to be sure). It's unclear to me how you'd capture this value though, if open source.
8. Interested in the cost and latency. If this essentially requires an LLM call for every website I visit, this will start to add up. Also curious if this means that my cost will scale with the efficiency of the sites I visit (i.e. do my costs scale with the size of the site's content).
> 1. How deep does the modification go? If I request a tweek to the YouTube homepage, do I need to re-specify or reload the tweek to have it persist across the entire site (deeply nested pages, iframes, etc.)?
If you're familiar with Greasemonkey, we work similarly to its @match metadata. A given script could target a specific page (https://www.youtube.com/watch?v=cD5Ei8bMmUk), all videos (https://www.youtube.com/watch*), all of YouTube (https://www.youtube.com/*), or all domains (https:///). During generation, we try to infer your intent from your request (and you can also manually override with a dropdown).
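If you haven't used Greasemonkey before, here's a rough sketch of what that scoping looks like in userscript form (the name, match pattern, and selector below are just illustrative, not our actual generated output):

    // ==UserScript==
    // @name        example-tweek
    // @version     1.0
    // @match       https://www.youtube.com/watch*
    // ==/UserScript==
    // The @match line above covers every watch page. Swapping it for
    // https://www.youtube.com/watch?v=cD5Ei8bMmUk would pin the script to one
    // video, and https://www.youtube.com/* would cover all of YouTube.

    // Illustrative body: remove the related-videos sidebar (selector is a guess).
    document.querySelectorAll('#related').forEach((el) => el.remove());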
> 2. What is your test and eval setup? How confident are you that the model is performing the requested change without being overly aggressive and eliminating important content?
Oh boy, don't get me started. We have not found a way to fully automate evals yet. We can automate checks like "is there an error?", "does it target the right selectors?", etc. But the requests are open-ended, so there are 1M "correct" answers. We have a growing set of "tough" requests, and when we are shipping a major change, we sit down, generate them all, and click through to manually check pass/fail. We built tooling around this so it is actually pretty quick, but we're definitely thinking about better automation.
This is also where more users come in. Hopefully you complain to us if it doesn't work and we get a better sense of what to improve!
> 3. What is your upkeep strategy? How will you ensure that your system continues to WAI after site owners update their content in potentially adversarial ways? In my experience LLMs do a fairly poor job at website understanding when the original author is intentionally trying to mess with the model, or has overly complex CSS and JS.
Great question. The good news is that there are things like aria labels that are pretty consistent. If the model picks the right selectors, it can be pretty robust to change. Beyond that, hopefully it is as easy as one update request ("this script doesn't work anymore, please update the selectors"). Though we can't really expect each user to do that, so we are thinking of an update system where, e.g., if you install/copy script A and the original script A is later updated, you can pull in that update. The final stage is an intelligent system where the script can heal itself (every so often it assesses the site, sees if selectors have changed, and fixes itself); that is more long-term.
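To make the selector point concrete, a rough sketch (both selectors here are made up, not real site markup): an accessibility attribute usually outlives a generated class name, so it makes the better primary hook, with the brittle class kept only as a fallback.

    // Prefer a stable accessibility hook; fall back to a brittle generated class.
    const panel =
      document.querySelector('[aria-label="Notifications"]') ||
      document.querySelector('.nav__item--x7f3');

    if (panel) {
      panel.style.display = 'none'; // survives most cosmetic redesigns
    }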
> 4. Can I prompt changes that I want to see globally applied across all sites (or a category of sites)? For example, I may want a persistent toolbar for quick actions across all pages -- essentially becoming a generic extension builder.
Yes, if the domain is https:/// it applies to all sites, so you can think of this as a meta-extension builder. E.g. I have a timer script that applies across reddit, linkedin, twitter, etc. and keeps me focused.
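Not my actual timer script, but a sketch of the shape a runs-everywhere tweek can take:

    // ==UserScript==
    // @name        focus-timer-sketch
    // @version     1.0
    // @match       *://*/*
    // ==/UserScript==

    // Inject a small fixed badge showing how long the current page has been open.
    const badge = document.createElement('div');
    badge.style.cssText =
      'position:fixed;bottom:8px;right:8px;padding:4px 8px;background:#222;' +
      'color:#fff;font:12px sans-serif;border-radius:4px;z-index:2147483647;';
    document.body.appendChild(badge);

    const openedAt = Date.now();
    setInterval(() => {
      const minutes = Math.floor((Date.now() - openedAt) / 60000);
      badge.textContent = `On this page for ${minutes} min`;
    }, 1000);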
> 5. Where and how are results being cached? For example, if I apply tweeks to a banking website, what content is being scraped and sent to an LLM? When I reload a site, is content being pulled purely from a local cache on my machine?
There is a distinction between generating a tweek and applying one. When you generate a tweek, the page is captured and sent to an LLM. There is no way around this: you can't generate a modification for a site you cannot see.
The result of a generation is a static script that applies to the page across reloads (unless you disable it). When you apply a tweek, everything is local; there is no dynamic server communication.
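As a rough sketch of that apply path (the storage key and record shape here are just illustrative, not our real schema), note there is no network call anywhere:

    // Content-script sketch: read locally stored tweeks and apply the enabled
    // ones to the current page. Storage key and record shape are hypothetical.
    chrome.storage.local.get('tweeks', ({ tweeks = [] }) => {
      for (const tweek of tweeks) {
        if (!tweek.enabled) continue;
        const style = document.createElement('style');
        style.textContent = tweek.css; // e.g. a generated CSS-only modification
        document.head.appendChild(style);
      }
    });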
Hopefully that is all helpful! I need to get to other replies, but I will try to return to finish up your business questions (those are the most boring anyway)
-- Edit: I'm back! --
> 6. Is this (or will it be) open source? IMO a large component of empowering the user against enshittification is open source. As compute commoditizes it will likely be open source that is the best hope for protection against the overlords.
It is very important to me that people trust us. I can say that we don't do X, Y, Z with your data and that using our product is safe, but trust is not freely given (nor should it be). We have a privacy policy, we have SOC 2, and in theory, you could even download the extension and dig into the code yourself.
Open-source is one way to build trust. However, I also recognize that many of these "overlords" you speak of are happy to abuse their power. Who's to say that we don't open our code, only to have e.g. OpenAI fork it for their own browser? Of course, we could use a restrictive license, but lawsuits haven't been particularly protective of copyright lately. I am interested in open-sourcing parts of our code (and there certainly is hunger for it in this post), but I am cognizant that there is a lot that goes into that decision.
> 7. What is your revenue model? If your product essentially wrestles control from site owners and reduces their optionality for revenue, your arbitrage is likely to be equal or less than the sum of site owners' loss (a potentially massive amount to be sure). It's unclear to me how you'd capture this value though, if open source.
The honest answer is TBD. I would push back on your claim that we wrestle control from site owners and reduce their optionality for revenue. While there likely will be users who say "hide this ad" (costing the site revenue), there are also users who say "move this sidebar from left to right" or "I use {x} button all the time but it is hidden three menus in, place it prominently for easy access". I'd argue the latter cases are not negative for the site owners; they could be positive sum. Maybe we even see a trend that 80% of users make this UX modification on Z site. We could go to Z site and say, "Hey, you could probably make your users happy if you made this change". Maybe they'd even pay us for that insight?
Again, the honest answer is that I'm not certain about the business model. I am a lover of positive sum games. And in the moment, I am building something that I enjoy using and hopefully also provides value to others.
> 8. Interested in the cost and latency. If this essentially requires an LLM call for every website I visit, this will start to add up. Also curious if this means that my cost will scale with the efficiency of the sites I visit (i.e. do my costs scale with the size of the site's content).
As I noted above, this does not require an LLM call for every website you visit. You are correct that that would bankrupt us very quickly! An LLM is only involved when you actively start a generation/update request. There is still a cost and it does scale with the complexity of the site/request, but it is infinitely more feasible than running on every site.
In the future, we may extend functionality so that the core script that is generated can itself dynamically call LLMs on new page loads. That would enable you to do things like "filter political content from my feed" which requires test time LLM compute to dynamically categorize on each load (can't be hard-coded in a once-generated static script). That would likely have to be done locally (e.g. Google actually packages Gemini nano into the browser) for both cost and latency reasons. We're not there yet, and there is a lot you can do with the extension today, but there are definitely opportunities to build really cool stuff, way beyond Greasemonkey.
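A very rough sketch of that direction (everything here is hypothetical; the keyword check is a stand-in for the on-device model call, which doesn't have a stable browser API yet):

    // Placeholder for an on-device model call (e.g. a browser-bundled small model).
    // A real version would prompt the local LLM; this keyword check is a stand-in.
    async function classifyLocally(text) {
      return /\b(election|senate|parliament|campaign)\b/i.test(text)
        ? 'political'
        : 'other';
    }

    // On each page load, classify feed items and hide the unwanted ones.
    async function filterFeed() {
      const posts = document.querySelectorAll('[data-testid="post"]'); // hypothetical selector
      for (const post of posts) {
        if ((await classifyLocally(post.innerText)) === 'political') {
          post.style.display = 'none';
        }
      }
    }
    filterFeed();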
Wow, you really put me to work with this comment. Appreciate all the great questions!
I love the idea and the execution. The onboarding experience is great as well. Thanks for sharing. I am curious about SOC 2: how much effort did you put in to acquire it, and what made you decide to pursue it?
> how much effort did you put in to acquire it, and what made you decide to pursue it?
We originally started looking into it when we were in the B2B space. On our end, we already took security pretty seriously so checking all the boxes was low lift.
> We could go to Z site and say, "Hey, you could probably make your users happy if you made this change". Maybe they'd even pay us for that insight?
My honest opinion:
1. No site would pay for that insight
2. Every site should pay for that insight
Part of the problem is that a lot of companies fall into one of two categories:
1. Small companies that don't have the time/energy/inclination to make changes, even if they're simple; often they're not even the ones who built the website in the first place, and they aren't going to want to pay the company that made the site to come back and tweak it based on what a small, self-selecting group of users decided to change.
2. Large companies who, even if they did care about what that small, self-selecting group of users wanted to change, have so many layers between A and Z that it's nearly impossible to get anything done without a tangible business need. No manager is going to sign off on developer and engineer time and testing because 40% of 1% of their audience moves the sidebar from one side to the other.
Also:
1. Designers are opinionated and don't want some clanker telling them what they're doing wrong, regardless of the data.
2. Your subset of users may have different goals or values; maybe the users more likely to install this extension and generate tweaks don't want to see recommended articles or video articles or 'you may like...' or whatever, but most of their users do and the change would turn out to be a bad one. Maybe it would reduce accessibility in some way that most users don't care about, etc.
If I had to pick a 'what's the value of all this', I would say that it's less about "what users want from this site" and more about "what users want from sites". For example, if you did the following:
1. Record all the prompts that people create that result in tweaks that people actually use, along with the category of site (banking, blogs, news, shopping, social media, forums); this gives you a general selection of things that people want. Promote these to other users to see how much mass appeal they have
2. Record all the prompts that people create that result in tweaks that people don't actually use; this gives you a selection of things that people think they want but it turns out they don't.
3. Summarize those changes into reports.
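(A minimal sketch of that recording/summarizing step, with a made-up record shape:)

    // Hypothetical records: one row per generated tweak, tagged with the site
    // category and whether the user actually kept using it.
    const records = [
      { category: 'news',   prompt: 'hide auto-play videos',  kept: true  },
      { category: 'news',   prompt: 'remove comment section', kept: false },
      { category: 'social', prompt: 'hide recommended feed',  kept: true  },
    ];

    // Summarize kept vs. abandoned tweaks per site category.
    const summary = {};
    for (const r of records) {
      const bucket = (summary[r.category] ??= { kept: 0, abandoned: 0 });
      r.kept ? bucket.kept++ : bucket.abandoned++;
    }
    console.log(summary);
    // -> { news: { kept: 1, abandoned: 1 }, social: { kept: 1, abandoned: 0 } }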
Now you could produce a 'web trend report' where you can say:
1. 80% of users are making changes to reduce clutter on sites
2. 40% of users are disabling or hiding auto-play videos
3. 40% of people in countries which use right-to-left languages swap sidebars from one side to another even on left-to-right-language websites
4. The top 'changed' sites in your industry are ... and the changes people make are ...
5. The top changes that people make to sites in your industry are ... and users who make those changes have a 40% lower bounce rate / 30% longer time-on-site / etc. than users who don't make those changes.
On top of that, you could build a model trained on those user prompts that companies could then pay for (somehow?) to run their sites through. It would suggest changes they could make to satisfy these apparent user needs or preferences without sacrificing their own goals for the site. E.g. users want to remove auto-playing videos because they're obnoxious, but the company is trying to promote its video content, so maybe this model could find a middle ground that presents the video in a way that's less obnoxious but still generates engagement.
That's what I think anyway, but I'm not in marketing or whatever.
Seems to me that the obvious business model here is that they will need to have their AI inject their own ads into the DOM. Overall though, this feels like a feature, not a business.
Clearly there’s a tension on this venture-capital-run website between some people using their computer-nerd skills to save money and improve their experience, and other people hustling a business that requires the world to pay them.
> Clearly there’s a tension on this venture-capital-run website
Yeah. If they have a problem with that, they can kill HN. You can't have hackers/smart people in your forum and decide what they will do. Moderation can try to guide it, but there is a limit when you're dealing with smart + polite people.
This. Amazon buyer metrics have been tanking for a while. In general if I don't care about the quality I have better and cheaper places to shop. When I know the brand I want, and want predictable quality, I order from the company directly. Price, service, quality, and delivery time are equal or better than Amazon.
A good move, even if many years late. It's a bilateral trustbuster when the same platform that allows commingling and knockoffs then begins tagging legit items as "frequently returned" in the feed.
Can we also get the ability to filter by seller entity country of origin?
Amazon also needs to offer far better tools for buyers to effectively find and attach to brands.
The TL;DR is that the value of renting vs. buying has a lot to do with being realistic and understanding real estate and investments in general.
Buying a house is often an emotionally motivated decision with many important risk factors... Were inspections comprehensive and thorough or did they overlook an issue with the foundation, wiring, plumbing, etc.? Did the buyer understand the required disclosures, and specifically understand what the seller is not obligated to disclose in their jurisdiction? (e.g., in many areas the seller is not obligated to disclose if a child sex offender lives next door). Is the neighborhood up and coming or struggling?
Getting these wrong can easily negate the potential upsides of ownership.
Renting can also be great, but as the article points out, if it mostly just results in more disposable cash, then you may be better off owning (forced savings). Rental properties often cannot be sublet and also cannot be used as collateral or passed on to family members with a step-up in basis. Rental leases also fluctuate with the market, so it's not uncommon in big cities for renters to pay close to or the same as what the homeowners next door pay.
Anecdotally I know many folks who have rented their way through, invested wisely, and done well. I also know folks who have moved around the US and always purchased, did their homework, capitalized on tax incentives, and now have a stable of rental properties that helped them become FIRE.
The problem is that our thoughts, opinions, and ultimately actions are the product of our exposure. Social media gives a small number of companies (and their algorithms) unparalleled and unchecked control over our exposure.
We should be educating children at a young age about the benefits and risks of social media. We haven't adapted the way we educate society in light of massive tech changes.
This will likely be a topic that future humans look back on and wonder why we did this to ourselves.
Is the media even “social” anymore? How much of Reddit is simply bots generating catchy takes and then generating commentary on those takes? You can easily be deceived into thinking that a vast number of people believe something, or think the way you do, or think the way you do but were swayed by some thought process.
Repeat the process long enough and with enough variation and tuning and anyone can be made to believe anything.
At the same time, "reading the news" has become less and less valuable. And social media has an overwhelming impact on the tone and content of "the news," too.
Written "news" is frequently a report about a series of X posts by various authorities, thought leaders and celebrities, embedded directly in the story.
The best engineers are all three, and can turn up or down these tendencies depending on what's required for the project, business, or personal goals. These should not be fixed in proportion over time, as they are each useful in different circumstances.
I spent time at Microsoft as well, and one of the things I noticed was folks who spent time in different disciplines (e.g. dev, test, pgm) seemed to be especially great at tailoring these qualities to their needs. If you're working on optimizing a compiler, you probably need a bit more Einstein and Mort than Elvis. If you're working on a game engine you may need a different combination.
The quantities of each (or whether these are the correct archetypes) is certainly debatable, but understanding that you need all of them in different proportions over time is important, IMHO.
1. This is one example. How many other attempts did the person try that failed to be useful, accurate, coherent? The author is an OpenAI employee IIUC, so it begs this question. Sora's demos were amazing until you tried it, and realized it took 50 attempts to get a usable clip.
2. The author noted that humans had updated their own research in April 2025 with an improved solution. For cases where we detect signs of superior behavior, we need to start publishing the thought process (reasoning steps, inference cycles, tools used, etc.). Otherwise it's impossible to know whether this used a specialty model, had access to the more recent paper, or in other ways got lucky. Without detailed proof it's becoming harder to separate legitimate findings from marketing posts (not suggesting this specific case was a pure marketing post)
3. Points 1 and 2 would help with reproducibility, which is important for scientific rigor. If we give Claude the same tools and inputs, will it perform just as well? This would help the community understand if GPT-5 is novel, or if the novelty is in how the user is prompting it
I don't mean to be cynical, but I don't think these points matter as much as you think, at least not in practice. The hardest part of a proof is working out the intermediate steps; joining them up is often trivial, even for a student. So even if it works out a few good steps or finds an effective theorem to apply, and does so only once in a hundred prompts, the time savings can be significant.
I should know, I've been using LLM thinking models to help brainstorm ideas for stickier proofs. It's been more successful at discovering esoteric entry points than I would like to admit.
> This is one example. How many other attempts did the person try that failed to be useful, accurate, coherent? The author is an OpenAI employee IIUC, so it begs this question. Sora's demos were amazing until you tried it, and realized it took 50 attempts to get a usable clip.
If you could combine this with automated theorem proving, it wouldn't matter if it was right only 1 out of 1,000 times.
I'd say a lot of people even have an incentive to not give credit to the LLMs, because there is a social stigma associated with using AI, due to its association with low-quality work.
I'm guessing the music business right now is absolutely awash with unreported and uncredited AI lyrics and backing tracks. It's an area where you can get away with it a lot easier than in the visual arts.
People are delusional. There’s a large cohort of folks on HN who still think AI is just a stochastic parrot. Depending on the topic or the thread you’ll find more of those people and get voted down if you even imply that LLMs can reason.
They are to some extent though. The bigger point is that they are not just a stochastic parrot. But examples like the modified riddles, where they just answer the original riddle, show that they have the behaviour of stochastic parrots at least some of the time.
Yeah you're right. I guess that's the spirit of my intent, to say they are more than stochastic parrots... a behavior they, like humans, exhibit sometimes
I don’t think it’s that they don’t have the incentive. I think it’s because it’s unclear whether giving credit to the LLM means that OpenAI or similar would be considered an author, in which case that could really screw up intellectual property and make using LLMs much less attractive. If the LLM wants attribution then it’s sentient, and if it’s sentient, it may be given personhood (Johnny Five scenario) and get rights, and then it would be a writer: it could influence the license, and intellectual property may belong partially to it unless it willingly became an employee of a ton of companies and organizations or contracted with them.
I've used all of the popular coding agents, including Jules. The reality to me is that they can and should be used for certain kinds of low severity and low complexity tasks (documentation, writing tests, etc.). They should not be used for the opposite end of the spectrum.
There are many perspectives on coding agents because there are many different types of engineers, with different levels of experience.
In my interactions I've found that junior engineers overestimate or overuse the capabilities of these agents, while more senior engineers are better calibrated.
The biggest challenge I see is what to do in 5 years once a generation of fresh engineers never learned how compilers, operating systems, hardware, memory, etc actually work. Innovation almost always requires deep understanding of the fundamentals, and AI may erode our interest in learning these critical bits of knowledge.
What I see as a hiring manager is senior (perhaps older) engineers commanding higher comp, while junior engineers become increasingly less in demand.
Agents are here to stay, but I'd estimate your best engineering days are still ahead.
I don't think this generalization is quite fair. I'm sure this is true for some folks and their social circles, but for those of us who engineer and know our way around a Home Depot, the capacity is a game changer. I used to have to rent or borrow trucks for my projects.
Not to mention Christmas trees, moving, helping friends out, etc.
Depending on the survey, 63 to 75% of truck owners use it for towing once a year or less (i.e. in reality never). So yes, the majority do not use it for towing.
To be fair, I'd misremembered the load-carrying figure; the figure for load carrying once a year or less is 32-35%.