Hacker News | orf's comments

How does this work with more complex authentication schemes, like AWS?

AWS has a more powerful abstraction already, where you can condition permissions such that they are only granted when the request comes from a certain VPC or IP address (i.e. a VPN exit). Malware could thus exfiltrate real credentials, but they'll be worthless.

I'm not prepared to say which abstraction is more powerful but I do think it's pretty funny to stack a non-exfiltratable credential up against AWS given how the IMDS works. IMDS was the motivation for machine-locked tokens for us.

There are two separate concerns here: who the credentials are associated with, and where the credentials are used. IMDS's original security flaw was that it only covered "who" the credentials were issued to (the VM) and not where they were used, but the aforementioned IAM conditions now ensure that they are indeed used within the same VPC. If a separate proxy is set up to inject credentials, then while this may cover the "where" concern, care must still be taken on the "who" concern, i.e. to ensure that the proxy does not fall to confused deputy attacks arising from multiple sandboxed agents attempting to use the same proxy.

There are lots of concerns, not just two, but the point of machine-bound Macaroons is to address the IMDS problem.

Number of docs isn’t the limiting factor.

I just searched for “stackoverflow” and the first result was this: https://www.perl.com/tags/stackoverflow/

The actual Stackoverflow site was ranked way down, below some weird twitter accounts.


I don't weight home pages in any way yet to bump them up, it's just raw search on keyword relevance.


Google's entire (initial) claim to fame was "PageRank", referring both to the ranking of pages and to co-founder Larry Page. It strongly prioritised a link-based relevance measure over raw keyword matching, which then-popular alternatives such as AltaVista, Yahoo, AskJeeves, Lycos, Infoseek, HotBot, etc. relied on (or the rather more notorious paid-ranking schemes in which SERP order was effectively sold). When it was first introduced, Google Web Search was absolutely worlds ahead of any competition. I remember this well, having used those earlier engines and adopted Google quite early (1998/99).

Even with PageRank, result prioritisation is highly subject to gaming. Raw keyword search is far more so (keyword stuffing and other shenanigans), more so as any given search engine becomes popular and catches the attention of publishers.

Google now applies additional ordering factors as well. And of course it has come to dominate SERPs with paid, advertised listings, which are all but impossible to distinguish from "organic" search results.

(I've not used Google Web Search as my primary tool for well over a decade, and probably only run a few searches per month. DDG is my primary, though I'll look at a few others including Kagi and Marginalia, though those rarely.)

<https://en.wikipedia.org/wiki/PageRank>

"The anatomy of a large-scale hypertextual Web search engine" (1998) <http://infolab.stanford.edu/pub/papers/google.pdf> (PDF)

Early (1990s) search engines: <https://en.wikipedia.org/wiki/Search_engine#1990s:_Birth_of_...>.


PageRank was an innovative idea in the early days of the Internet when trust was high, but yes it's absolutely gamed now and I would be surprised if Google still relies on it.

Fair play to them though, it enabled them to build a massive business.


Anchor text information is arguably a better source for relevance ranking in my experience.

I publish exports of the ones Marginalia is aware of[1] if you want to play with integrating them.

[1] https://downloads.marginalia.nu/exports/ grab 'atags-25-04-20.parquet'


Though I'd think that you'd want to weight unaffiliated sites' anchor text to a given URL much higher than an affiliated site's.
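
Roughly the kind of weighting I have in mind, as a toy Python sketch (treating "affiliated" naively as same registrable domain, with made-up weights):

    from collections import defaultdict
    from urllib.parse import urlparse

    def registrable_domain(url):
        # Naive: last two labels of the hostname ("www.example.com" -> "example.com").
        host = urlparse(url).hostname or ""
        return ".".join(host.split(".")[-2:])

    def anchor_term_scores(anchors, affiliated_weight=0.1, unaffiliated_weight=1.0):
        # anchors: iterable of (source_url, target_url, anchor_text) tuples.
        scores = defaultdict(lambda: defaultdict(float))
        for source, target, text in anchors:
            affiliated = registrable_domain(source) == registrable_domain(target)
            weight = affiliated_weight if affiliated else unaffiliated_weight
            for term in text.lower().split():
                scores[target][term] += weight
        return scores

    anchors = [
        ("https://example.com/nav", "https://example.com/widgets", "our widgets"),
        ("https://blog.other.org/review", "https://example.com/widgets", "great widgets"),
    ]
    print(dict(anchor_term_scores(anchors)["https://example.com/widgets"]))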

"Affiliation" is a tricky term itself. Content farms were popular in the aughts (though they seem to have largely subsided), firms such as Claria and Gator. There are chumboxes (Outbrain, Taboola), and of course affiliate links (e.g., to Amazon or other shopping sites). SEO manipulation is its own whole universe.

(I'm sure you know far more about this than I do, I'm mostly talking at other readers, and maybe hoping to glean some more wisdom from you ;-)


Oh yeah, there's definitely room for improvement in that general direction. Indexing anchor texts is much better than page rank, but in isolation, it's not sufficient.

I've also seen some benefit from fingerprinting the network requests the websites make, using a headless browser, to identify which ad networks they load. Very few spam sites have no ads, since there wouldn't be any economy in that.

e.g. https://marginalia-search.com/site/www.salon.com?view=traffi...
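
Conceptually the classification side is just something like this (toy Python; the ad-network list and captured URLs are stand-ins for the real data):

    from urllib.parse import urlparse

    # Hand-rolled stand-in for the real list of ad/tracking network domains.
    AD_NETWORK_SUFFIXES = ("doubleclick.net", "googlesyndication.com", "taboola.com", "outbrain.com")

    def ad_networks(request_urls):
        # request_urls: URLs captured from a headless-browser session for one site.
        found = set()
        for url in request_urls:
            host = urlparse(url).hostname or ""
            for suffix in AD_NETWORK_SUFFIXES:
                if host == suffix or host.endswith("." + suffix):
                    found.add(suffix)
        return found

    captured = [
        "https://securepubads.g.doubleclick.net/tag/js/gpt.js",
        "https://cdn.taboola.com/libtrc/example/loader.js",
        "https://example.com/style.css",
    ]
    # A site loading no ad networks at all is rarely spam; there's no economy in that.
    print(ad_networks(captured))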

The full data set of DOM samples + recorded network traffic is in an enormous SQLite file (400GB+), and I haven't worked out a way of distributing the data yet. Though it's in the back of my mind as something I'd like to solve.


Oh, that is clever!

I'd also suspect that there are networks / links which are more likely signs of low-value content than others. Off the top of my head, crypto, MLM, known scam/fraud sites, and perhaps share links to certain social networks might be negative indicators.


You can actually identify clusters of websites based on the cosine similarity of their outbound links. Pretty useful for identifying content farms spanning multiple websites.
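
The core of it is just this, if you want to play with the idea (illustrative Python, not the actual Marginalia code):

    import math
    from collections import Counter

    def outbound_link_similarity(links_a, links_b):
        # links_a / links_b: outbound link domains for two sites (repeats allowed).
        a, b = Counter(links_a), Counter(links_b)
        dot = sum(a[d] * b[d] for d in a.keys() & b.keys())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    farm_1 = ["cheap-pills.example", "casino.example", "affiliate-hub.example"]
    farm_2 = ["casino.example", "affiliate-hub.example", "cheap-pills.example"]
    normal = ["python.org", "sqlite.org", "en.wikipedia.org"]
    print(outbound_link_similarity(farm_1, farm_2))  # ~1.0: likely the same network
    print(outbound_link_similarity(farm_1, normal))  # 0.0: unrelated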

Have a lil' data explorer for this: https://explore2.marginalia.nu/

Quite a lot of dead links in the dataset, but it's still useful.


Very interesting, and it is very kind of you to share your data like that. Will review!


Google’s biggest search signal now is aggregate behavioral data reported from Chrome. That pervasive behavioral surveillance is the main reason Apple has never allowed a native Chrome app on iOS.

It’s also why it is so hard to compete with Google. You guys are talking about techniques for analyzing the corpus of the search index. Google does that and has a direct view into how millions of people interact with it.


> That pervasive behavioral surveillance is the main reason Apple has never allowed a native Chrome app on iOS

The Chrome iOS app still knows every url visited, duration, scroll depth, etc.


Yes indeed, they have an impossibly deep moat and deeper pockets. I'm certainly not trying to compete with them with my little side project, it's just for fun!


> That pervasive behavioral surveillance is the main reason Apple has never allowed a native Chrome app on iOS.

There is a native Chrome app on iOS. It gets all the same url visit data as Chrome on other platforms.

Apple blocks 3rd party renderers and JS engines on iOS to protect its App Store from competition that might deliver software and content through other channels that they don't take a cut of.


Sure, but the point is results are not relevant at all?

It’s cool though, and really fast


I'll work on that adjustment, it's fair feedback, thanks!


Unfortunately this is the bulk of search engine work. Recursive scraping is easy in comparison, even with CAPTCHA bypassing. You either limit the index to only highly relevant sites (as Marginalia does) or you must work very hard to separate the spam from the ham. And spam in one search may be ham in another.


I limit it to highly relevant curated seed sites, and don't allow public submissions. I'd rather have a small high-quality index.

You are absolutely right, it is the hardest part!


What do you mean they're not relevant? The top result you linked contained the word stackoverflow, didn't it? It's showing you exactly what you searched for. Why would you need a search engine at all if you already know the name of the thing? Just type stackoverflow.com into your address bar.

I feel like Google-style "search" has made people really dumb and unable to help themselves.


The query is just to highlight that relevance is a complex topic. Few people would consider "perl blog posts from 2016 that have the stack overflow tag" as the most relevant result for that query.


Confluence search does this, for our intranet. As a result it's barely usable.

Indexing is a nice compact CS problem; not completely simple for huge datasets like the entire internet, but well-formed. Ranking is the thing that makes a search engine valuable. Especially when faced with people trying to game it with SEO.


When was the vote on deciding if murder is good or bad?

“Society” doesn’t vote on things. Your viewpoint may differ, but a large enough majority of other people feel differently.

In other words, it’s a you problem.


Murder has a fixed cost of human lives, which is considered (by the living) to be reprehensible at every scale.

Piracy has a negligible cost on the industry, and contributes to a positive upward pressure on IP holders to compete with low-cost access. These two crimes are not the same.


Agreed, but not relevant to my comment.


Your comment is not dictated by principles. I don't care what society says; its judgement is wrong half the time.


Oh, so you believe in mob rule then, OK, I got it. And no, because there are uncensored LLMs like Mistral, so it’s a “you need to worry about yourself” problem. Stop trying to parent me; who the hell are you?


None of which is relevant to the point I was making.

Try to focus your thoughts, they are obviously pretty scattered.


What are you talking about? You said

“but a large enough majority of other people feel differently. In other words, it’s a you problem.”

Ignoring the enormous strawman you just made, how do you know what the majority opinion is on this topic? You don’t. You’re just arrogant, because what you actually did is conduct a straw poll in your own mind of people in your echo chamber and say “yeah, the majority of people think my opinion is right”.

That’s called mob rule.

Next time I’ll speak slower so you can keep up; that’s why it seems scattered, you’re having trouble connecting the dots.

“The only thing worse than an idiot is an arrogant idiot.” You’re the dumb one here; you’re just too dumb to know it.


> StageExecution(ref_id="t3", requisite_stage_ref_ids={"start"},
> …

Your UX is important in a tool like this: using “needs” instead of “requisite_stage_ref_ids” would be a pretty big improvement.
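
i.e. something like this (a hypothetical sketch, not the library’s actual class, just to show the naming):

    from dataclasses import dataclass, field

    @dataclass
    class StageExecution:
        ref_id: str
        needs: set = field(default_factory=set)  # instead of requisite_stage_ref_ids

    stage = StageExecution(ref_id="t3", needs={"start"})
    print(stage)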


You can look at the author's LinkedIn page and see that he's never worked for a startup. The only job he's had for more than 14 months was at Huawei.

I’m pretty sure this is all completely fabricated LLM slop, created as a vehicle for the “handbook” adverts littered throughout.

1. https://www.linkedin.com/in/devrimozcay


You need to apply backpressure before you hit memory limits, not after.

If you’re OOM, your application is in a pretty unrecoverable state. Recovery is theoretically possible, practically not.


If you allocate a relatively big chunk of memory for each unit of work, and at some point your allocation fails, you can just drop that unit of work. What is not practical?
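
Concretely, something like this (Python purely for illustration; the same shape works anywhere the allocation failure is actually surfaced to you):

    def process(job, chunk_size=512 * 1024 * 1024):
        try:
            buffer = bytearray(chunk_size)  # the big per-unit-of-work allocation
        except MemoryError:
            return None  # drop this unit of work, keep serving the rest
        # ... do the actual work with `buffer` ...
        return len(buffer)

    for job in ["a", "b", "c"]:
        if process(job) is None:
            print(f"dropped {job}: allocation failed")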


I think in that case overcommit will happily say the allocation worked. Unless you also zero the entire chunk of memory and then get OOM killed on the write.

I suppose you can try to reliably target the "seriously wild allocation fails" case without leaving too much memory on the table.

   0: Heuristic overcommit handling. Obvious overcommits of
      address space are refused. Used for a typical system. It
      ensures a seriously wild allocation fails while allowing
      overcommit to reduce swap usage. root is allowed to
      allocate slightly more memory in this mode. This is the
      default.

https://www.kernel.org/doc/Documentation/vm/overcommit-accou...

Running in an environment without overcommit would allow you to handle it gracefully, though that brings its own zoo of nasty footguns.

See this recent discussion on what can happen when turning off overcommit:

https://news.ycombinator.com/item?id=46300411


> See this recent discussion on what can happen when turning off overcommit:

What are you referring to specifically? Overcommit is only (presumably) useful if you are using Linux as a desktop OS.


None of that matters: what is your application going to do if it tries to allocate 3MB of data from your 2MB allocator?

This is the far more meaningful part of the original comment:

> and furthermore most code is not in a position to do anything other than crash in an OOM scenario

Given that (unlike a language such as Zig) Rust doesn’t use a variety of different allocator types within a given system, choosing to reliably panic with a reasonable message and stack trace is a very reasonable mindset to have.


Since we're talking about SQLite, by far the most memory it allocates is for the page cache.

If some allocation fails, the error bubbles up until a safe place, where some pages can be dropped from the cache, and the operation that failed can be tried again.

All this requires is that bubbling up this specific error condition doesn't allocate. Which SQLite purportedly tests.

I'll note that this is not entirely dissimilar to a system where an allocation that can't be immediately satisfied triggers a full garbage collection cycle before an OOM is raised (and where some data might be held through soft/weak pointers and dropped under pressure), just implemented in library code.
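
The shape of it, as a rough Python sketch (obviously not SQLite's actual implementation):

    class PageCache:
        def __init__(self):
            self.pages = {}

        def evict_some(self):
            # Drop a handful of cached pages to free memory.
            for key in list(self.pages)[:10]:
                del self.pages[key]

    def run_with_retry(cache, operation):
        try:
            return operation()
        except MemoryError:
            # Error bubbled up to a safe point: shed cache, then retry once.
            # (In C the error path itself must not allocate; that part can't
            # really be modelled here.)
            cache.evict_some()
            return operation()

    cache = PageCache()
    print(run_with_retry(cache, lambda: "ok"))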


Sure, and this is completely sensible to do in a library.

But that’s not the point: what can most applications do when SQLite tells them that it encountered a memory error and couldn’t complete the transaction?

Abort and report an error to the user. In a CLI this would be a panic/abort, and in a service that would usually be implemented as a panic handler (which also catches other errors) that attempts to return an error response.

In this context, who cares if it’s an OOM error or another fatal exception? The outcome is the same.

Of course that’s not universal, but it covers 99% of use cases.


The topic is whether Rust should be used to re-implement SQLite.

If SQLite fails to allocate memory for a string or blob, it bubbles up the error, frees some data, and maybe tries again.

Your app may be "hopeless" if the error bubbles up all the way to it; that's your choice. But SQLite may have already handled the error internally, retried, and given you your answer without you noticing.

Or it may at least have rolled back your transaction cleanly, instead of immediately crashing at the point of the failed allocation. And although crashing should not corrupt your database, a clean rollback is much faster to recover from, even if your app then decides to crash.

Your app, e.g. an HTTP server, might decide to drop the request, maybe close that SQLite connection, and stay alive to handle other ongoing and new requests.

SQLite wants to be programmed in a language where a failed allocation doesn't crash, and unlike most other code, SQLite is actually tested for how it behaves when malloc fails.


In C++ it will throw an exception which you can catch, and then gracefully report that the operation exceeded limits and/or perform some fallback.

Historically, a lot of C code fails to handle memory allocation failure properly, because checking malloc etc. for a null result is too much work and C code calls malloc a lot.

Bjarne Stroustrup added exceptions to C++ in part so that you could write programs that easily recover when memory allocation fails - that was the original motivation for exceptions.

In this one way, Rust is a step backwards towards C. I hope that Rust comes up with a better story around this, because in some applications it does matter.


In the world of confusing landing pages, this project is a piece of art: what the fuck does any of this mean?

> EXCITABLE. EXACTABLE. EXECUTABLE. A shared universe of destinations and memory.

> Space is a SaaS platform presented by PromptFluid. It contains a collection of tools and toys that are released on a regular cadence.

> PromptFluid Official Briefing: You are reading from the ground. Space is above. We transmit because the colony cannot afford silence.

> You can ignore the story and still use everything. But if you want the deeper current: this relay exists because the colony is fragile, and Space is the only place the tools can grow without choking the ground.

> Creating an account creates a Star. Your Star is your identity within Space and unlocks Spacewalking capabilities and certain tools that require persistent state


It means they let an LLM write the ad copy.


"You are reading from the ground. Space is above. We transmit because the colony cannot afford silence."


Interesting read, but I feel like they should have also benchmarked using COPY with Postgres. This should be far faster than a bulk insert, and it’s more in line with what they are benchmarking.

The omission feels… odd.
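
For reference, the kind of comparison I mean (a psycopg2 sketch; the DSN, table, and columns are placeholders, and each path should be timed against a fresh table):

    import io
    import psycopg2  # assuming psycopg2; their actual setup may differ

    rows = [(i, f"name-{i}") for i in range(100_000)]

    with psycopg2.connect("dbname=bench") as conn, conn.cursor() as cur:
        # Bulk INSERT, roughly what the article benchmarks.
        cur.executemany("INSERT INTO items (id, name) VALUES (%s, %s)", rows)

        # COPY: stream the same rows through the COPY protocol instead.
        buf = io.StringIO("".join(f"{i}\t{name}\n" for i, name in rows))
        cur.copy_expert("COPY items (id, name) FROM STDIN", buf)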


To be honest, I just didn't think of it. But thanks for the suggestion, we'll give it a go!


I maintain a project that publishes a SQLite file containing all package metadata, if you don’t want to use BigQuery or the API to do this kind of analysis

https://github.com/pypi-data/pypi-json-data
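
Once downloaded, queries are just plain SQLite, e.g. (table and column names here are illustrative; check the repo for the actual schema):

    import sqlite3

    conn = sqlite3.connect("pypi_metadata.db")  # hypothetical filename
    # Hypothetical table/columns; adjust to the schema actually shipped.
    for name, version in conn.execute(
        "SELECT name, version FROM packages ORDER BY name LIMIT 10"
    ):
        print(name, version)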

