Hacker News | orf's comments

How does this work with more complex authentication schemes, like AWS?

AWS has a more powerful abstraction already, where you can condition permissions such that they are only granted when the request comes from a certain VPC or IP address (i.e. a VPN exit). Malware could thus exfiltrate real credentials, but they'll be worthless.

I'm not prepared to say which abstraction is more powerful but I do think it's pretty funny to stack a non-exfiltratable credential up against AWS given how the IMDS works. IMDS was the motivation for machine-locked tokens for us.

There are two separate concerns here: who the credentials are associated with, and where the credentials are used. IMDS's original security flaw was that it only covered "who" the credentials were issued to (the VM) and not where they were used, but the aforementioned IAM conditions now ensure that they are indeed used within the same VPC. If a separate proxy is set up to inject credentials, then while this may cover the "where" concern, care must still be taken on the "who" concern, i.e. to ensure that the proxy does not fall to confused deputy attacks arising from multiple sandboxed agents attempting to use the same proxy.

There are lots of concerns, not just two, but the point of machine-bound Macaroons is to address the IMDS problem.

Number of docs isn’t the limiting factor.

I just searched for “stackoverflow” and the first result was this: https://www.perl.com/tags/stackoverflow/

The actual Stackoverflow site was ranked way down, below some weird twitter accounts.


I don't weight home pages in any way yet to bump them up, it's just raw search on keyword relevance.


Google's entire (initial) claim to fame was "PageRank", referring both to the ranking of pages and to co-founder Larry Page. It strongly prioritised a link-based relevance measure over raw keyword matching, which then-popular alternatives such as AltaVista, Yahoo, AskJeeves, Lycos, Infoseek, HotBot, etc. relied on (or the rather more notorious paid-ranking schemes in which SERP order was effectively sold). When it was first introduced, Google Web Search was absolutely worlds ahead of any competition. I remember this well, having used those earlier engines and adopted Google quite early (1998/99).

Even with PageRank, result prioritisation is highly subject to gaming. Raw keyword search is far more so (keyword stuffing and other shenanigans), more so as any given search engine becomes popular and catches the attention of publishers.

Google now applies additional ordering factors as well. And of course it has come to dominate SERPs with paid, advertised listings, which are all but impossible to distinguish from "organic" search results.

(I've not used Google Web Search as my primary tool for well over a decade, and probably only run a few searches per month. DDG is my primary, though I'll look at a few others including Kagi and Marginalia, though those rarely.)

<https://en.wikipedia.org/wiki/PageRank>

"The anatomy of a large-scale hypertextual Web search engine" (1998) <http://infolab.stanford.edu/pub/papers/google.pdf> (PDF)

Early (1990s) search engines: <https://en.wikipedia.org/wiki/Search_engine#1990s:_Birth_of_...>.


PageRank was an innovative idea in the early days of the Internet when trust was high, but yes it's absolutely gamed now and I would be surprised if Google still relies on it.

Fair play to them though, it enabled them to build a massive business.


Anchor text information is arguably a better source for relevance ranking in my experience.

I publish exports of the ones Marginalia is aware of[1] if you want to play with integrating them.

[1] https://downloads.marginalia.nu/exports/ grab 'atags-25-04-20.parquet'


Though I'd think that you'd want to weight unaffiliated sites' anchor text to a given URL much higher than an affiliated site's.
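
Roughly the kind of weighting I have in mind, as a toy Python sketch (treating "affiliated" naively as same registrable domain, with made-up weights):

    from collections import defaultdict
    from urllib.parse import urlparse

    def registrable_domain(url):
        # Naive: last two labels of the hostname ("www.example.com" -> "example.com").
        host = urlparse(url).hostname or ""
        return ".".join(host.split(".")[-2:])

    def anchor_term_scores(anchors, affiliated_weight=0.1, unaffiliated_weight=1.0):
        # anchors: iterable of (source_url, target_url, anchor_text) tuples.
        scores = defaultdict(lambda: defaultdict(float))
        for source, target, text in anchors:
            affiliated = registrable_domain(source) == registrable_domain(target)
            weight = affiliated_weight if affiliated else unaffiliated_weight
            for term in text.lower().split():
                scores[target][term] += weight
        return scores

    anchors = [
        ("https://example.com/nav", "https://example.com/widgets", "our widgets"),
        ("https://blog.other.org/review", "https://example.com/widgets", "great widgets"),
    ]
    print(dict(anchor_term_scores(anchors)["https://example.com/widgets"]))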

"Affiliation" is a tricky term itself. Content farms were popular in the aughts (though they seem to have largely subsided), firms such as Claria and Gator. There are chumboxes (Outbrain, Taboola), and of course affiliate links (e.g., to Amazon or other shopping sites). SEO manipulation is its own whole universe.

(I'm sure you know far more about this than I do, I'm mostly talking at other readers, and maybe hoping to glean some more wisdom from you ;-)


Oh yeah, there's definitely room for improvement in that general direction. Indexing anchor texts is much better than page rank, but in isolation, it's not sufficient.

I've also seen some benefit from fingerprinting the network requests the websites make, using a headless browser, to identify which ad networks they load. Very few spam sites have no ads, since there wouldn't be any economy in that.

e.g. https://marginalia-search.com/site/www.salon.com?view=traffi...
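
Conceptually the classification side is just something like this (toy Python; the ad-network list and captured URLs are stand-ins for the real data):

    from urllib.parse import urlparse

    # Hand-rolled stand-in for the real list of ad/tracking network domains.
    AD_NETWORK_SUFFIXES = ("doubleclick.net", "googlesyndication.com", "taboola.com", "outbrain.com")

    def ad_networks(request_urls):
        # request_urls: URLs captured from a headless-browser session for one site.
        found = set()
        for url in request_urls:
            host = urlparse(url).hostname or ""
            for suffix in AD_NETWORK_SUFFIXES:
                if host == suffix or host.endswith("." + suffix):
                    found.add(suffix)
        return found

    captured = [
        "https://securepubads.g.doubleclick.net/tag/js/gpt.js",
        "https://cdn.taboola.com/libtrc/example/loader.js",
        "https://example.com/style.css",
    ]
    # A site loading no ad networks at all is rarely spam; there's no economy in that.
    print(ad_networks(captured))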

The full data set of DOM samples + recorded network traffic is in an enormous SQLite file (400GB+), and I haven't worked out a way of distributing the data yet. Though it's in the back of my mind as something I'd like to solve.


Oh, that is clever!

I'd also suspect that there are networks / links which are more likely signs of low-value content than others. Off the top of my head, crypto, MLM, known scam/fraud sites, and perhaps share links to certain social networks might be negative indicators.


You can actually identify clusters of websites based on the cosine similarity of their outbound links. Pretty useful for identifying content farms spanning multiple websites.
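
The core of it is just this, if you want to play with the idea (illustrative Python, not the actual Marginalia code):

    import math
    from collections import Counter

    def outbound_link_similarity(links_a, links_b):
        # links_a / links_b: outbound link domains for two sites (repeats allowed).
        a, b = Counter(links_a), Counter(links_b)
        dot = sum(a[d] * b[d] for d in a.keys() & b.keys())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    farm_1 = ["cheap-pills.example", "casino.example", "affiliate-hub.example"]
    farm_2 = ["casino.example", "affiliate-hub.example", "cheap-pills.example"]
    normal = ["python.org", "sqlite.org", "en.wikipedia.org"]
    print(outbound_link_similarity(farm_1, farm_2))  # ~1.0: likely the same network
    print(outbound_link_similarity(farm_1, normal))  # 0.0: unrelated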

Have a lil' data explorer for this: https://explore2.marginalia.nu/

Quite a lot of dead links in the dataset, but it's still useful.


Very interesting, and it is very kind of you to share your data like that. Will review!


Google’s biggest search signal now is aggregate behavioral data reported from Chrome. That pervasive behavioral surveillance is the main reason Apple has never allowed a native Chrome app on iOS.

It’s also why it is so hard to compete with Google. You guys are talking about techniques for analyzing the corpus of the search index. Google does that and has a direct view into how millions of people interact with it.


> That pervasive behavioral surveillance is the main reason Apple has never allowed a native Chrome app on iOS

The Chrome iOS app still knows every url visited, duration, scroll depth, etc.


Yes indeed, they have an impossibly deep moat and deeper pockets. I'm certainly not trying to compete with them with my little side project, it's just for fun!


> That pervasive behavioral surveillance is the main reason Apple has never allowed a native Chrome app on iOS.

There is a native Chrome app on iOS. It gets all the same url visit data as Chrome on other platforms.

Apple blocks 3rd party renderers and JS engines on iOS to protect its App Store from competition that might deliver software and content through other channels that they don't take a cut of.


Sure, but the point is results are not relevant at all?

It’s cool though, and really fast


I'll work on that adjustment, it's fair feedback, thanks!


Unfortunately this is the bulk of search engine work. Recursive scraping is easy in comparison, even with CAPTCHA bypassing. You either limit the index to only highly relevant sites (as Marginalia does) or you must work very hard to separate the spam from the ham. And spam in one search may be ham in another.


I limit it to highly relevant curated seed sites, and don't allow public submissions. I'd rather have a small high-quality index.

You are absolutely right, it is the hardest part!


What do you mean they're not relevant? The top result you linked contained the word stackoverflow, didn't it? It's showing you exactly what you searched for. Why would you need a search engine at all if you already know the name of the thing? Just type stackoverflow.com into your address bar.

I feel like Google-style "search" has made people really dumb and unable to help themselves.


The query is just to highlight that relevance is a complex topic. Few people would consider "perl blog posts from 2016 that have the stack overflow tag" as the most relevant result for that query.


Confluence search does this, for our intranet. As a result it's barely usable.

Indexing is a nice compact CS problem; not completely simple for huge datasets like the entire internet, but well-formed. Ranking is the thing that makes a search engine valuable. Especially when faced with people trying to game it with SEO.


When was the vote on deciding if murder is good or bad?

“Society” doesn’t vote on things. Your viewpoint may differ, but a large enough majority of other people feel differently.

In other words, it’s a you problem.


Murder has a fixed cost of human lives, which is considered (by the living) to be reprehensible at every scale.

Piracy has a negligible cost on the industry, and contributes to a positive upward pressure on IP holders to compete with low-cost access. These two crimes are not the same.


Agreed, but not relevant to my comment.


Your comment is not dictated by principles. I don't care what society says; its judgement is wrong half the time.


Oh, so you believe in mob rule then, OK, I got it. And no, because there are uncensored LLMs like Mistral, so it’s a “you need to worry about yourself” problem. Stop trying to parent me; who the hell are you?


None of which is relevant to the point I was making.

Try to focus your thoughts, they are obviously pretty scattered.


What are you talking about? You said

“but a large enough majority of other people feel differently. In other words, it’s a you problem.”

Ignoring the enormous strawman you just made, how do you know what the majority opinion is on this topic? You don’t. You’re just arrogant, because what you actually did is conduct a straw poll in your own mind of people in your echo chamber and say “yeah, the majority of people think my opinion is right”.

That’s called mob rule.

Next time I’ll speak slower so you can keep up; that’s why it seems scattered, you’re having trouble connecting the dots.

“The only thing worse than an idiot is an arrogant idiot.” You’re the dumb one here; you’re just too dumb to know it.


> StageExecution(ref_id="t3", requisite_stage_ref_ids={"start"},
> …

Your UX is important in a tool like this: using “needs” instead of “requisite_stage_ref_ids” would be a pretty big improvement.
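
i.e. something like this (a hypothetical sketch, not the library’s actual class, just to show the naming):

    from dataclasses import dataclass, field

    @dataclass
    class StageExecution:
        ref_id: str
        needs: set = field(default_factory=set)  # instead of requisite_stage_ref_ids

    stage = StageExecution(ref_id="t3", needs={"start"})
    print(stage)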


You can look at the author's LinkedIn page and see that he's never worked for a startup. The only job he's had for more than 14 months was at Huawei.

I’m pretty sure this is all completely fabricated LLM slop, created as a vehicle for the “handbook” adverts littered throughout.

1. https://www.linkedin.com/in/devrimozcay


You need to apply backpressure before you hit memory limits, not after.

If you’re OOM, your application is in a pretty unrecoverable state. Recovery is theoretically possible, practically not.


If you allocate a relatively big chunk of memory for each unit of work, and at some point your allocation fails, you can just drop that unit of work. What is not practical?
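
Concretely, something like this (Python purely for illustration; the same shape works anywhere the allocation failure is actually surfaced to you):

    def process(job, chunk_size=512 * 1024 * 1024):
        try:
            buffer = bytearray(chunk_size)  # the big per-unit-of-work allocation
        except MemoryError:
            return None  # drop this unit of work, keep serving the rest
        # ... do the actual work with `buffer` ...
        return len(buffer)

    for job in ["a", "b", "c"]:
        if process(job) is None:
            print(f"dropped {job}: allocation failed")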


I think in that case overcommit will happily say the allocation worked. Unless you also zero the entire chunk of memory and then get OOM killed on the write.

I suppose you can try to reliably target the "seriously wild allocation fails" case without leaving too much memory on the table.

   0: Heuristic overcommit handling. Obvious overcommits of
      address space are refused. Used for a typical system. It
      ensures a seriously wild allocation fails while allowing
      overcommit to reduce swap usage. root is allowed to
      allocate slightly more memory in this mode. This is the
      default.

https://www.kernel.org/doc/Documentation/vm/overcommit-accou...

Running in an environment without overcommit would allow you to handle it gracefully, though that brings its own zoo of nasty footguns.

See this recent discussion on what can happen when turning off overcommit:

https://news.ycombinator.com/item?id=46300411


> See this recent discussion on what can happen when turning off overcommit:

What are you referring to specifically? Overcommit is only (presumably) useful if you are using Linux as a desktop OS.


None of that matters: what is your application going to do if it tries to allocate 3MB of data from your 2MB allocator?

This is the far more meaningful part of the original comment:

> and furthermore most code is not in a position to do anything other than crash in an OOM scenario

Given that (unlike a language such as Zig) Rust doesn’t use a variety of different allocator types within a given system, choosing to reliably panic with a reasonable message and stack trace is a very reasonable mindset to have.


Since we're talking about SQLite, by far the most memory it allocates is for the page cache.

If some allocation fails, the error bubbles up until a safe place, where some pages can be dropped from the cache, and the operation that failed can be tried again.

All this requires is that bubbling up this specific error condition doesn't allocate. Which SQLite purportedly tests.

I'll note that this is not entirely dissimilar to a system where an allocation that can't be immediately satisfied triggers a full garbage collection cycle before an OOM is raised (and where some data might be held through soft/weak pointers and dropped under pressure), just implemented in library code.
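
The shape of it, as a rough Python sketch (obviously not SQLite's actual implementation):

    class PageCache:
        def __init__(self):
            self.pages = {}

        def evict_some(self):
            # Drop a handful of cached pages to free memory.
            for key in list(self.pages)[:10]:
                del self.pages[key]

    def run_with_retry(cache, operation):
        try:
            return operation()
        except MemoryError:
            # Error bubbled up to a safe point: shed cache, then retry once.
            # (In C the error path itself must not allocate; that part can't
            # really be modelled here.)
            cache.evict_some()
            return operation()

    cache = PageCache()
    print(run_with_retry(cache, lambda: "ok"))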


Sure, and this is completely sensible to do in a library.

But that’s not the point: what can most applications do when SQLite tells them that it encountered a memory error and couldn’t complete the transaction?

Abort and report an error to the user. In a CLI this would be a panic/abort, and in a service that would usually be implemented as a panic handler (which also catches other errors) that attempts to return an error response.

In this context, who cares if it’s an OOM error or another fatal exception? The outcome is the same.

Of course that’s not universal, but it covers 99% of use cases.


The topic is whether Rust should be used to re-implement SQLite.

If SQLite fails to allocate memory for a string or blob, it bubbles up the error, frees some data, and maybe tries again.

Your app may be "hopeless" if the error bubbles up all the way to it; that's your choice. But SQLite may have already handled the error internally, retried, and given you your answer without you noticing.

Or it may at least have rolled back your transaction cleanly, instead of immediately crashing at the point of the failed allocation. And although crashing should not corrupt your database, a clean rollback is much faster to recover from, even if your app then decides to crash.

Your app, e.g. an HTTP server, might decide to drop the request, maybe close that SQLite connection, and stay alive to handle other ongoing and new requests.

SQLite wants to be programmed in a language where a failed allocation doesn't crash, and unlike most other code, SQLite is actually tested for how it behaves when malloc fails.


In C++ it will throw an exception which you can catch, and then gracefully report that the operation exceeded limits and/or perform some fallback.

Historically, a lot of C code fails to handle memory allocation failure properly, because checking malloc etc. for a null result is too much work and C code calls malloc a lot.

Bjarne Stroustrup added exceptions to C++ in part so that you could write programs that easily recover when memory allocation fails - that was the original motivation for exceptions.

In this one way, Rust is a step backwards towards C. I hope that Rust comes up with a better story around this, because in some applications it does matter.


In the world of confusing landing pages, this project is a piece of art: what the fuck does any of this mean?

> EXCITABLE. EXACTABLE. EXECUTABLE. A shared universe of destinations and memory.

> Space is a SaaS platform presented by PromptFluid. It contains a collection of tools and toys that are released on a regular cadence.

> PromptFluid Official Briefing: You are reading from the ground. Space is above. We transmit because the colony cannot afford silence.

> You can ignore the story and still use everything. But if you want the deeper current: this relay exists because the colony is fragile, and Space is the only place the tools can grow without choking the ground.

> Creating an account creates a Star. Your Star is your identity within Space and unlocks Spacewalking capabilities and certain tools that require persistent state


It means they let an LLM write the ad copy.


"You are reading from the ground. Space is above. We transmit because the colony cannot afford silence."


Interesting read, but I feel like they should have also benchmarked using COPY with Postgres. This should be far faster than a bulk insert, and it’s more in line with what they are benchmarking.

The omission feels… odd.
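
For reference, the kind of comparison I mean (a psycopg2 sketch; the DSN, table, and columns are placeholders, and each path should be timed against a fresh table):

    import io
    import psycopg2  # assuming psycopg2; their actual setup may differ

    rows = [(i, f"name-{i}") for i in range(100_000)]

    with psycopg2.connect("dbname=bench") as conn, conn.cursor() as cur:
        # Bulk INSERT, roughly what the article benchmarks.
        cur.executemany("INSERT INTO items (id, name) VALUES (%s, %s)", rows)

        # COPY: stream the same rows through the COPY protocol instead.
        buf = io.StringIO("".join(f"{i}\t{name}\n" for i, name in rows))
        cur.copy_expert("COPY items (id, name) FROM STDIN", buf)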


To be honest, I just didn't think of it. But thanks for the suggestion, we'll give it a go!


I maintain a project that publishes a SQLite file containing all package metadata, if you don’t want to use BigQuery or the API to do this kind of analysis

https://github.com/pypi-data/pypi-json-data
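
Once downloaded, queries are just plain SQLite, e.g. (table and column names here are illustrative; check the repo for the actual schema):

    import sqlite3

    conn = sqlite3.connect("pypi_metadata.db")  # hypothetical filename
    # Hypothetical table/columns; adjust to the schema actually shipped.
    for name, version in conn.execute(
        "SELECT name, version FROM packages ORDER BY name LIMIT 10"
    ):
        print(name, version)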

