I also find this to be an elegant way of doing it, and it is also how Thompson-VM-style regex engines work [0].
It's a bit harder to adapt the technique to parsers because the Thompson NFA always increments the sequence pointer by the same amount, while a parser's production usually has a variable size, making it harder to run several parsing heads in lockstep.
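A toy sketch of the lockstep idea (not code from the linked article; the instruction encoding and the hand-compiled program are made up purely for illustration): every live "thread" is just a program counter, and all of them advance together, one input character at a time.

```python
# Toy Thompson-VM matcher: the compiled regex is a list of instructions and
# every live "thread" is just a program counter, advanced in lockstep.
CHAR, SPLIT, JMP, MATCH = range(4)

def add_thread(prog, pc, threads):
    """Follow epsilon transitions (SPLIT/JMP) until landing on CHAR or MATCH.
    (This toy program has no epsilon cycles, so no visited set is needed.)"""
    op = prog[pc]
    if op[0] == JMP:
        add_thread(prog, op[1], threads)
    elif op[0] == SPLIT:
        add_thread(prog, op[1], threads)
        add_thread(prog, op[2], threads)
    else:
        threads.add(pc)

def matches(prog, text):
    current = set()
    add_thread(prog, 0, current)
    for ch in text:
        nxt = set()
        for pc in current:
            op = prog[pc]
            if op[0] == CHAR and op[1] == ch:
                add_thread(prog, pc + 1, nxt)  # every thread steps by the same amount
        current = nxt
    return any(prog[pc][0] == MATCH for pc in current)

# a(b|c)d, compiled by hand for illustration
prog = [
    (CHAR, 'a'),    # 0
    (SPLIT, 2, 4),  # 1
    (CHAR, 'b'),    # 2
    (JMP, 5),       # 3
    (CHAR, 'c'),    # 4
    (CHAR, 'd'),    # 5
    (MATCH,),       # 6
]
print(matches(prog, "abd"), matches(prog, "acd"), matches(prog, "axd"))  # True True False
```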
Honestly this benchmark feels completely dominated by the instance's NIC capacity.
They used a c5.4xlarge that has peak 10Gbps bandwidth, which at a constant 100% saturation would take in the ballpark of 9 minutes to load those 650GB from S3, making those 9 minutes your best case scenario for pulling the data (without even considering writing it back!)
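Back-of-envelope, with the numbers above:

```python
# Sanity check of the ~9 minute figure.
data_gb = 650        # dataset size, GB
nic_gbps = 10        # c5.4xlarge peak network bandwidth, Gbit/s
seconds = data_gb * 8 / nic_gbps
print(seconds / 60)  # ~8.7 minutes at sustained 100% NIC saturation
```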
Minute differences in how these query engines schedule I/O would have drastic effects on the benchmark outcomes, and I doubt the query engine itself was kept constantly fed during this workload, especially when evaluating DuckDB and Polars.
The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.
It would be amusing to run this on a regular desktop computer or even a moderately nice laptop (with a fan - give it a chance!) and see how it does. 650GB will stream in quite quickly from any decent NVMe device, and those 8-16 cores might well be considerably faster than whatever cores the cloud machines are giving you.
S3 is an amazingly engineered product, operates at truly impressive scale, is quite reasonably priced if you think of it as warm-to-very-cold storage with excellent durability properties, and has performance that barely holds a candle to any decent modern local storage device.
Absolutely. I recently reworked a bunch of tests and found my desktop outperforming our (larger, custom) GitHub Actions runner by roughly 5x. And I expect this delta to increase a lot as you lean on local I/O harder.
It really is shocking how much you're paying given how little you get. I certainly don't want to run a data center and handle all the scaling and complexity of such an endeavour. But wow, the tax you pay to have someone manage all that is staggering.
Totally true. I have a trusty old (like 2016-era) X99 setup that I use for 1.2TB of time series data hosted in a TimescaleDB/PostGIS database. I can fetch all the data I need quickly to crunch on another local machine, and max out my aging network gear to experiment with different model training scenarios. It cost me ~$500 to build the machine, and it stays off when I'm not using it.
Much easier obviously dealing with a dataset that doesn't change, but doing the same in the cloud would just be throwing money away.
> They used a c5.4xlarge that has peak 10Gbps bandwidth, which at a constant 100% saturation would take in the ballpark of 9 minutes to load those 650GB from S3, making those 9 minutes your best case scenario for pulling the data (without even considering writing it back!)
The query being tested wouldn't scan the full files; in reality, most sane engines would process much less than 650GB of data (exploiting S3 byte-range reads), i.e. just one column, a timestamp, which is also correlated with the partition keys. Nowadays what I would mostly be worried about is the distribution of file sizes, due to API calls + skew, or a query so different from the common access patterns that it can't exploit the metadata/columnar nature of the underlying Parquet (i.e. doing an effective "full scan" over all row groups and/or columns).
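For concreteness, this is the access pattern I mean; a rough sketch with pyarrow, where the bucket, partition, and column names are made up: only the requested column chunks are fetched via byte-range reads, and partition keys / row-group statistics prune the rest.

```python
import pyarrow.dataset as ds

# Hypothetical partitioned Parquet dataset on S3; only one column is materialized,
# so the reader issues byte-range GETs for those column chunks instead of
# downloading whole files.
dataset = ds.dataset("s3://my-bucket/events/", format="parquet", partitioning="hive")
table = dataset.to_table(
    columns=["event_ts"],                           # single timestamp column
    filter=ds.field("event_date") == "2024-01-01",  # pruned via partition keys / stats
)
print(table.num_rows)
```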
> The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.
Yep I think the value of the experiment is not clear.
You want to use Spark for a large dataset with multiple stages. In this case their I/O bandwidth from S3 is 1GB/s, while CPU memory bandwidth is 100-200GB/s for a multi-stage job. Spark is a way to pool memory across a cluster for such a dataset, and to use cluster-internal network bandwidth for shuffling instead of going back to storage.
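As a sketch of that multi-stage, shuffle-heavy shape in PySpark (paths and column names are hypothetical): the aggregation forces a shuffle that moves data over the cluster network rather than bouncing it off storage.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-sketch").getOrCreate()

events = spark.read.parquet("s3://my-bucket/events/")  # stage 1: scan from S3
daily = (
    events
    .groupBy(F.to_date("event_ts").alias("day"))       # shuffle boundary: rows move
    .agg(F.count("*").alias("n"))                      # across the cluster network
)
daily.write.mode("overwrite").parquet("s3://my-bucket/daily_counts/")  # write back to storage
```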
Maybe when you have S3 as your backend, the storage bandwidth bottleneck doesn't show up in perf, but it sure does show up in the bill. A crude rule of thumb: network bandwidth is 20X storage, main memory bandwidth is 20X network bandwidth, accelerator/GPU memory is 10X CPU. It's great that single-node DuckDB/Polars are that good, but this is like racing a taxiing aircraft against motorbikes.
I think I'm talking about cluster-scale network bisection bandwidth vs attached storage bandwidth. With replication/erasure coding overhead and the economics, the order of magnitude difference still prevails.
I think your point is a good one in that it is more economics than systems physics. We size clusters to have more compute/network than storage because it is the design point that maximizes overall utility.
It also raises an interesting question: if we ever get to a point where the disparity really no longer holds, that would justify a complete rethinking of many Spark-like applications that are designed to exploit this asymmetry.
And that's for one SSD. If you're running on a server rather than a laptop, aggregate storage bandwidth will almost certainly be higher than any single network link.
10Gbps only? At Google where this type of processing would automatically be distributed, machines had 400Gbps NICs, not to mention other innovations like better TCP congestion control algorithms. No wonder people are tired of distributed computing.
"At Google" is doing all the heavy lifting in your comment here, with all due respect. There is but one Google but remain millions of us who are not "At Google".
I’m merely describing the infrastructure that at least partially led to the success of distributed data processing. Also 400Gbps NIC isn’t a Google exclusive. Other clouds and on-premise DCs could buy them from Broadcom or other vendors.
This is a really good observation, and matches something I had to learn painfully over 30 years ago. At a Wall Street bank, we were trying to really push the limits with some middleware, and my mentor at the time very quietly suggested "before you test your system's performance, understand the theoretical maximum of your setup first with no work".
The gist was - find your resource limits and saturate them and see what the best possible performance could be, then measure your system, and you can express it as a percentage of optimal. Or if you can't directly test/saturate your limits at least be aware of them.
c5 is such a bad instance type; m6a would be so much better and even cheaper.
I would love to see this on an m8a.2xlarge (the 7th and 8th generations don't use SMT); it's even cheaper and has up to 15 Gbps.
Actually for this kind of workload 15Gbps is still mediocre. What you actually want is the `n` variant of the instance types, which have higher NIC capacity.
In the c6n and m6n and maybe the upper-end 5th gens you can get 100Gbps NICs, and if you look at the 8th gen instances like the c8gn family, you can even get instances with 600Gbps of bandwidth.
A Samsung 990 Pro reads at something like 50 Gbps and PCIe 4.0 x4 is quite a bit faster than that. You can get this speed with a queue depth that isn’t crazy, and you can have multiple NVMe operations in flight reading the same large Parquet file. Latency is in the tens of microseconds.
The consensus seems to be that S3 can read one object at somewhat under 1Gbps. You can probably scale that to the full speed of your NIC by reading multiple objects at once, but you may not be able to scale by reading one object in multiple overlapping ranges. Latency is in the milliseconds.
So, sure, an EC2 instance with a fast NIC and massive multi-object parallelism can have 10x higher bandwidth than an NVMe device, but the amount of parallelism and latency tolerance needed is a couple orders of magnitude higher than with NVMe. Meanwhile that NVMe device does not charge for read operations and costs a couple hundred dollars, once.
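To make the parallelism point concrete, a rough boto3 sketch (bucket, keys, and worker count are made up): approaching NIC speed means keeping dozens of requests in flight at once.

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"
keys = [f"data/part-{i:05d}.parquet" for i in range(256)]  # hypothetical object list

def fetch(key):
    # One GET per object; throughput comes from having many of these in flight.
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

with ThreadPoolExecutor(max_workers=64) as pool:
    total = sum(len(body) for body in pool.map(fetch, keys))

print(f"{total / 1e9:.1f} GB fetched")
```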
If you are so inclined, you can build an NVMEoF setup (at much much higher cost) that separates compute and storage and has excellent performance, but this is a nontrivial undertaking.
They only just released the Containerization Framework[0] and the new container[1] tool, and they are already scheduling a kneecapping of it two years down the line.
Realistically, people are still going to be deploying on x64 platforms for a long time, and given that Apple's whole shtick was to serve "professionals", it's really a shame that they're dropping the ball on developers like this. Their new containerization stuff was the best workflow improvement for me in quite a while.
Yeah, it kind of kills me that I am writing this on a Samsung Galaxy Book 3 Pro 360 running Windows 11 so that I can run Macromedia Freehand/MX (I was a beta-tester for that version) so that I can still access Altsys Virtuoso 2 files from my NeXT Cube (Virtuoso 2 ~= Macromedia Freehand 4) for a typeface design project I'm still working on (a digital revival of a hot metal typeface created by my favourite type designer/illustrator who passed in 1991, but whose widow was gracious enough to give me permission to revive).
I was _so_ hopeful when I asked the devs to revive the Nx-UI code so that FH/MX could have been a native "Cocoa" app....
> running Windows 11 so that I can run Macromedia Freehand/MX
Freehand still works on Windows 11? I’m happy for you, I never found a true replacement for it.
> a digital revival of a hot metal typeface created by my favourite type designer/illustrator who passed in 1991, but whose widow was gracious enough to give me permission to revive
Any reason you haven’t shared the name of the designer or the typeface? That story sounds interesting, I’d really welcome learning more.
Yes, fortunately. I despair of what I'm going to do when I no longer have such an option. Cenon is clunky, Inkscape's cross-platform nature keeps it from having many interface aspects which I depend on, and I'd rather give up digital drawing than use Adobe Illustrator (which despite using since v3.2 on college lab Macs and on my NeXT Cube I never found comfortable).
The designer/typeface are Warren Chappell's Trajanus, and his unreleased Eichenauer --- I read _The Living Alphabet_ (and his cousin Oscar Ogg's _The 26 Letters_) when I was very young, and met him briefly on a school field trip back when he was Artist-in-Residence at UVA and did a fair bit of research in their Rare Book Room, and even had a sample of the metal type (missing one character unfortunately).
It is currently stalled at my having scanned and drawn up one of each letter at each size I have available, while only having two letters, _N_ and _n_, in all sizes --- I probably shouldn't worry that much about the optical axis, since it was cut in metal in one master size and the other sizes were made using a pantograph, but there were _some_ adjustments which I'd like to preserve. There is a digital version of Trajanus available, but it's based on the phototype. I've been working at recreating each character using METAFONT, encompassing the optical size variation programmatically, but it's been slow going (and once I'm done, I then have to work out how to make it into outlines....)
That's why like 80%+(?) of the corporate world runs Windows client-side for their laptops/workstations. They don't want to have to rewrite their shit whenever the OS vendor pushes an update.
Granted, that's less of an issue now with most new software being written in JS to run in any browser, but old institutions like banks, insurers, industrial, automation, and retail chains still run ancient Java/C#/C++ programs they don't want to, or can't, update for various reasons, but which keep the lights on.
Which is why I find it adorable when people in this bubble think all those industries will suddenly switch to Macs.
One of my previous companies gave top of the line workstations with 4k touchscreens and i9s to literally everyone junior and below a particular grade. I'm quite sure they could've saved 1000s of dollars per laptop by going with a reasonable MacBook.
(Ironically, windows 11 + corporate bloatware made the laptops super laggy. Go figure.)
>but in general the more tech-forward the company is the less Windows there is at it.
Only if you exclusively count food delivery apps, crypto Ponzi-scheme unicorns, ad services, and SaaS start-ups as "tech-forward", because you're omitting a lot of other tech companies your daily life in the civilized world depends on, which operate mainly on Windows, like where I work now.
Is designing and building semiconductors not "technology"? Or MRI machines? Or jets? Or car engines?
It seems to talk about Rosetta 2 as a whole, which is what the containerization framework depends on to support running amd64 binaries inside Linux VMs (even though the kernel still needs to be arm)
Is there a separate part of Rosetta that is implemented for the VM stuff? I was under the impression Rosetta was some kind of XPC service that would translate executable pages for Hypervisor Framework as they were faulted in, did I just misunderstand how the thing works under the hood? Are there two Rosettas?
I cannot tell you about implementation difference but what I mean is that this only talks about Rosetta 2 for Mac apps. Rosetta for Linux is a feature of the Virtualization framework that’s documented in a completely different place. And this message says a part of Rosetta for macOS will stick around, so I would be surprised if they removed the Linux part.
On the Linux side, Rosetta is an executable that you hook up via binfmt_misc to run AMD64 binaries, much like you might use Wine for Windows binaries.
> and given that Apple's whole shtick was to serve "professionals",
When was the last time this was true? I think I gave up on the platform around the new keyboards, which clearly weren't made for typing, and the non-stop "Upgrade" and "Upgrade" notifications that you couldn't disable, just push forward into the future. Everything they've done since then seems to have been aimed at impressing the Average Joe, not at serving professionals.
That's literally sponsored content/an ad by a company who makes money managing Apple devices, of course they'll say it's "mission critical", on a website meant to promote Apple hardware.
Happen to have some less biased source saying anything similar, ideally not sponsored content?
Do not reference these kinds of docs whenever you need practical, actionable advice. They serve their purpose, but are for a completely different kind of audience.
For anyone perusing this thread, your first resource for this kind of security advice should probably be the OWASP cheat sheets, which are a living set of documents that package current practice into direct recommendations for implementers.
This is just a random list of links to standards and summary tables, some of which are wrong (urandom vs. random, for instance). The "A/L/D" scoring makes very little sense. CBC is legacy-allowable and CTR is disallowed; "verification of padding must be performed in constant time". For reasons passing understanding, "MAC-then-encrypt" is legacy-allowable. They've deprecated the internally truncated SHA2's and kept the full-width ones (the internally truncated ones are more, not less secure). They've taken the time to formally disallow "MD5 and SHA1 based KDF functions". There's a long list of allowed FFDH groups. AES-CMAC is a recommended general-purpose message authenticator.
This is a mess, and I would actively steer people away from it.
It's a bad audit checklist! If OWASP volunteers can't do a good one, they should just not do one at all. It's fine for them not to cover things that are outside their expertise.
I’d wager that something like 90% of developers who look at that page should close the tab instead of reading any of it.
If you’re building a system and need crypto… pick the canonical library for the ecosystem or language you’re working in. Don’t try to build your own collection of primitives.
Also, I gave the link to the appendix because there was a specific question about Argon2 parameters. For general developer audiences, they need to look at the standard itself which is a lot more high level about how to properly implement cryptography in software:
https://github.com/OWASP/ASVS/blob/master/5.0/en/0x20-V11-Cr...
For the most common use-cases of cryptography like authentication and secure communication there is more specific, but still high level guidance that is useful for developers as well:
Yes I fully agree. I’m a big fan of libraries like Google Tink that make you pick a use case and use the best implementation for that use case with built in crypto agility.
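For anyone who hasn't seen it, the Tink style looks roughly like this (a sketch based on tink-py's published examples; check the current docs, as the API has shifted between releases): you pick a primitive for your use case and never touch ciphers, modes, or nonces directly.

```python
import tink
from tink import aead

aead.register()
# "AEAD" is the use case; the key template picks a vetted algorithm for you.
keyset_handle = tink.new_keyset_handle(aead.aead_key_templates.AES256_GCM)
primitive = keyset_handle.primitive(aead.Aead)

ciphertext = primitive.encrypt(b"secret", b"associated-data")
print(primitive.decrypt(ciphertext, b"associated-data"))  # b'secret'
```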
Most crypto libraries are not built like that however. They just give you a big pile of primitives/algorithms to choose from. Then frameworks get built on top of that, not always taking into account best practices, and leave people that are serious about security the job of making sure the implementation is secure. This is the point where you need something like ASVS.
What language today still doesn't have a de facto simplified toolbox for wrapping crypto operations?
If you're a developer, and you start trying to perform crypto operations for your service and the library you chose is making you question which cipher, what KDF parameters, or what DH group you want, that is 100% a red flag and you should promptly stop using that crypto library.
Can you give some examples of such commonly used libraries for languages like Java / C# / C++?
In my experience there are not many libraries like Google Tink around, and they are not in widespread use at all. Most applications doing encryption manually for specific purposes still have the words AES, CBC, GCM, IV etc hardcoded in their source code.
If you review such code, it’s still useful to have resources that show industry best practices, but I agree that the gold standard is to not have these details in your own code at all.
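Concretely, the hand-rolled style described above usually looks something like this sketch with Python's cryptography package (names are illustrative): it works, but the developer now owns the choice of algorithm, key size, and nonce handling.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # algorithm and key size chosen by hand
aesgcm = AESGCM(key)
nonce = os.urandom(12)                     # nonce management is the caller's problem
ciphertext = aesgcm.encrypt(nonce, b"secret", b"associated-data")
print(aesgcm.decrypt(nonce, ciphertext, b"associated-data"))  # b'secret'
```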
Ah yes, pretending we can access infinite amounts of memory instantaneously, or at least in a finite/bounded amount of time, is the Achilles heel of the von Neumann abstract computer model, and is the point where it completely diverges from physical reality.
Acknowledging that memory access is not instantaneous immediately throws you into the realm of distributed systems though and something much closer to an actor model of computation. It's a pretty meaningful theoretical gap, more so than people realize.
I would like to see someone pick up Knuth’s torch and formulate a new order of complexity for distributed computing.
Many of the products we use, and have for probably the last fifty years, live in the space between theory and practice. We need to collect all of this and teach it. Computers have grown six, maybe more, orders of magnitude since Knuth pioneered these techniques. In any other domain of computer science the solutions often change when the order of magnitude of the problem changes, and after several it's inescapable.
I was trying to do this in 2023! The hardest part about building a search engine is not the actual searching though, it is (like others here have pointed out), building your index and crawling the (extremely adversarial) internet, especially when you're running the thing from a single server in your own home without fancy rotating IPs.
I hope this guy succeeds and becomes another reference in the community like the marginalia dude. This makes me want to give my project another go...
You know, it's possible the cure to an adversarial internet is to just have some non-profit serve as a repo for a universal clearnet index that anyone can access to build their own search engine. That way we don't have endless captchas and anubis and Cloudflare tests every time I try and look for a recipe online. Why send AI scrapers to crawl literally everything when you're getting the data for free?
I'll add it to the mile-long list of things that should exist and be online public goods.
I'm the creator of searcha.page and seek.ninja; those are the basis of my index. The biggest problem with ONLY using that is freshness. I've started my own crawling too, but for sure Common Crawl will backfill a TON of good pages. It's priceless, and I would say Common Crawl should be any search engine's starting point. I have 2 billion pages from Common Crawl! There were a lot more, but I had to scrub them out due to resources. My native crawling is much more targeted and I'd be lucky to pull 100k, but as long as my heuristics choose the right targets, those will be very high-value pulls.
The crawl seems hard, but the difference between having something and not having it is very obvious. Ordering the results is not. What should go on page 200, and do those results still count as having them?
The IP thing is interesting. I was trying to make a CSGO bot one time to scrape Steam's prices, and there are proxy services out there you can rent; I tried at least one and it was blocked by Steam. So I wonder if people buy real IPs.
you can get paid about $0.10/GB in cryptocurrency (at a few GB per month) to run one on your PC. Apparently they also just buy actual connections sometimes. It's not even unethical - it's just two groups of equally bad businesspeople trying to spend money to block the other one.
I also switched away from Tarsnap because I needed to restore my personal PDF collection of like 20GB once and my throughput was like 100Kb/s, maybe less. It has been a problem for at least a decade, with no fix in sight.
I'm carefully monitoring plakar in this space, wondering if anyone has experience with it and could share?
[0] https://swtch.com/~rsc/regexp/regexp2.html