
I also find this to be an elegant way of doing it, and it is also how Thompson VM style regex engines work [0]

It's a bit harder to adapt the technique to parsers because the Thompson NFA always increments the sequence pointer by the same amount, while a parser's production usually has a variable size, making it harder to run several parsing heads in lockstep.

[0] https://swtch.com/~rsc/regexp/regexp2.html
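
For anyone curious what "lockstep" means concretely, here is a minimal Pike-VM-style sketch in Python. The instruction encoding is made up for the example, and epsilon-cycle handling is omitted:

  # All NFA "threads" advance over the input together, one character per step,
  # so the runtime is O(len(text) * len(prog)) with no backtracking.
  CHAR, SPLIT, JMP, MATCH = range(4)

  def run(prog, text):
      def add(threads, pc):
          # Follow epsilon transitions (JMP/SPLIT) eagerly; dedupe real states.
          if pc in threads:
              return
          op = prog[pc]
          if op[0] == JMP:
              add(threads, op[1])
          elif op[0] == SPLIT:
              add(threads, op[1])
              add(threads, op[2])
          else:
              threads[pc] = None  # a dict doubles as an ordered set of live threads

      current = {}
      add(current, 0)
      for ch in text:
          nxt = {}
          for pc in current:
              op = prog[pc]
              if op[0] == CHAR and op[1] == ch:
                  add(nxt, pc + 1)  # every surviving thread steps by exactly one character
          current = nxt
      return any(prog[pc][0] == MATCH for pc in current)

  # a(b|c)d, compiled by hand for the example
  prog = [(CHAR, 'a'), (SPLIT, 2, 4), (CHAR, 'b'), (JMP, 5),
          (CHAR, 'c'), (CHAR, 'd'), (MATCH,)]
  print(run(prog, "acd"), run(prog, "abd"), run(prog, "axd"))  # True True False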


> If sqlite had a generic "strictly ascending sequence of integers" type

Is that not what WITHOUT ROWID does? My understanding is that it's precisely meant to physically cluster data in the underlying B-Tree

If that is not what you meant, could you elaborate on the "primary key tables aren't really useful here" footnote?
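
For context, this is roughly what I had in mind, as a sketch using Python's built-in sqlite3 module; the table and column names are made up:

  import sqlite3

  con = sqlite3.connect(":memory:")
  # In a WITHOUT ROWID table the rows are stored directly in the primary-key
  # B-tree, so rows with adjacent keys end up physically adjacent on disk.
  con.execute("""
      CREATE TABLE events (
          seq  INTEGER PRIMARY KEY,   -- strictly ascending sequence
          body TEXT
      ) WITHOUT ROWID
  """)
  con.executemany("INSERT INTO events VALUES (?, ?)",
                  [(i, f"event {i}") for i in range(5)])
  # A range scan over seq walks that clustered B-tree in key order.
  for row in con.execute("SELECT seq, body FROM events WHERE seq BETWEEN 1 AND 3"):
      print(row)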


Honestly this benchmark feels completely dominated by the instance's NIC capacity.

They used a c5.4xlarge that has peak 10Gbps bandwidth, which at a constant 100% saturation would take in the ballpark of 9 minutes to load those 650GB from S3, making those 9 minutes your best case scenario for pulling the data (without even considering writing it back!)
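A quick sanity check of that figure:

  data_gb  = 650         # dataset size, gigabytes
  nic_gbps = 10          # c5.4xlarge peak network bandwidth, gigabits per second
  seconds  = data_gb * 8 / nic_gbps
  print(seconds / 60)    # ~8.7 minutes at sustained 100% NIC saturation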

Minute differences in how these query engines schedule IO would have drastic effects on the benchmark outcomes, and I doubt the query engine itself was kept constantly fed during this workload, especially when evaluating DuckDB and Polars.

The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.


It would be amusing to run this on a regular desktop computer or even a moderately nice laptop (with a fan - give it a chance!) and see how it does. 650GB will stream in quite quickly from any decent NVMe device, and those 8-16 cores might well be considerably faster than whatever cores the cloud machines are giving you.

S3 is an amazingly engineered product, operates at truly impressive scale, is quite reasonably priced if you think of it as warm-to-very-cold storage with excellent durability properties, and has performance that barely holds a candle to any decent modern local storage device.
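
Rough numbers for that comparison (the drive figure is an assumption, not a benchmark):

  data_gb       = 650
  nvme_gb_per_s = 7       # sequential read on a decent PCIe 4.0 consumer NVMe drive
  nic_gb_per_s  = 1.25    # the 10 Gbps cloud NIC from the parent comment
  print(data_gb / nvme_gb_per_s)   # ~93 seconds from local flash
  print(data_gb / nic_gb_per_s)    # ~520 seconds (~9 minutes) over the NIC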


Absolutely. I recently reworked a bunch of tests and found my desktop to outcompete our (larger, custom) GitHub Actions runner by roughly 5x. And I expect this delta to increase a lot as you lean on the local I/O harder.

It really is shocking how much you're paying given how little you get. I certainly don't want to run a data center and handle all the scaling and complexity of such an endeavour. But wow, the tax you pay to have someone manage all that is staggering.


Everyone wants a data lake when what they have is a data pond.


I think you meant puddle.

cue Peppa Pig laughter sounds


Totally true. I have a trusty old (like 2016 era) X99 setup that I use for 1.2TB of time series data hosted in a timescaledb PostGIS database. I can fetch all the data I need quickly to crunch on another local machine, and max out my aging network gear to experiment with different model training scenarios. It cost me ~$500 to build the machine, and it stays off when I'm not using it.

Much easier obviously dealing with a dataset that doesn't change, but doing the same in the cloud would just be throwing money away.


> They used a c5.4xlarge that has peak 10Gbps bandwidth, which at a constant 100% saturation would take in the ballpark of 9 minutes to load those 650GB from S3, making those 9 minutes your best case scenario for pulling the data (without even considering writing it back!)

The query being tested wouldn't scan the full files; in reality, most sane engines would process much less than 650GB of data (exploiting S3 byte-range reads), i.e. just one column: a timestamp, which is also correlated with the partition keys. Nowadays what I would mostly be worried about is the distribution of file sizes, due to API calls + skew, or the query being totally different from the common access patterns and skipping the metadata/columnar nature of the underlying parquet (i.e. doing an effective "full scan" over all row groups and/or columns).
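
To illustrate, this is roughly what that looks like with e.g. DuckDB's httpfs extension; the bucket layout and column names are made up, but an aggregation over a single timestamp column only fetches that column's chunks plus the parquet footers, via byte-range GETs:

  import duckdb

  con = duckdb.connect()
  con.sql("INSTALL httpfs")
  con.sql("LOAD httpfs")     # S3 access via HTTP range requests
  # S3 credentials/region configuration omitted for brevity.

  # Hypothetical bucket/layout: only the 'ts' column chunks and file footers
  # are read, not the full 650GB of parquet data.
  con.sql("""
      SELECT date_trunc('day', ts) AS day, count(*) AS n
      FROM read_parquet('s3://my-bucket/events/*/*.parquet')
      GROUP BY day
      ORDER BY day
  """).show()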

> The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.

That's absolutely right.


Yep I think the value of the experiment is not clear.

You want to use Spark for a large dataset with multiple stages. In this case, their I/O bandwidth is 1GB/s from S3, while CPU memory bandwidth is 100-200GB/s for a multi-stage job. Spark is a way to pool memory across a cluster for such a dataset, and to use cluster-internal network bandwidth for shuffling instead of going back to storage.

Maybe when you have S3 as your backend, the storage bandwidth bottleneck doesn't show up in perf, but it sure does show up in the bill. A crude rule of thumb: network bandwidth is 20X storage, main memory bandwidth is 20X network bandwidth, accelerator/GPU memory is 10X CPU. It's great that single-node DuckDB/Polars are that good, but this is like racing a taxiing aircraft against motorbikes.


Network bandwidth is not 20x storage any more. An SSD is around 10GB/s now, so similar to 100Gb Ethernet.


I think I'm talking about cluster-scale network bisection bandwidth vs attached storage bandwidth. With replication/erasure coding overhead and the economics, the order of magnitude difference still prevails.

I think your point is a good one in that it is more economics than systems physics. We size clusters to have more compute/network than storage because it is the design point that maximizes overall utility.

I think it also raises an interesting question: if we get to a point where the disparity really no longer holds, that would justify a complete rethinking of many Spark-like applications that are designed to exploit this asymmetry.


And that's for one SSD. If you're running on a server rather than a laptop, aggregate storage bandwidth will almost certainly be higher than any single network link.


The appropriate comparison point for aggregate cluster storage bandwidth would be its bisection bandwidth.

(I do HPC, IIRC ANL Aurora is < 1PB/s DAOS and 20 PB/s bisection).


10Gbps only? At Google, where this type of processing would automatically be distributed, machines had 400Gbps NICs, not to mention other innovations like better TCP congestion control algorithms. No wonder people are tired of distributed computing.


You can get a 600Gbps interface on an Amazon EC2 instance (c8gn.48xlarge), if you’re willing to pay for it.


"At Google" is doing all the heavy lifting in your comment here, with all due respect. There is but one Google but remain millions of us who are not "At Google".


I’m merely describing the infrastructure that at least partially led to the success of distributed data processing. Also 400Gbps NIC isn’t a Google exclusive. Other clouds and on-premise DCs could buy them from Broadcom or other vendors.


The infra might have a 400Gbps NIC, but if you're buying a small compute slice on that infra, you don't get all the capability.


They do at AWS too, but OP paid for a small VM.


This is a really good observation, and matches something I had to learn painfully over 30 years ago. At a Wall Street bank, we were trying to really push the limits with some middleware, and my mentor at the time very quietly suggested "before you test your system's performance, understand the theoretical maximum of your setup first with no work".

The gist was - find your resource limits and saturate them and see what the best possible performance could be, then measure your system, and you can express it as a percentage of optimal. Or if you can't directly test/saturate your limits at least be aware of them.
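
In practice the bookkeeping is as simple as this (the numbers are placeholders):

  theoretical_gbps = 10    # ceiling established by saturating the NIC with no real work
  measured_gbps    = 6.4   # what the actual system achieved on the same box
  print(f"{measured_gbps / theoretical_gbps:.0%} of optimal")   # 64% of optimal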


I'm kind of surprised they didn't choose an EC2 instance with higher throughput. S3 can totally eke out 100s of Gbps with the right setup.

BUT the author did say this is the simple stupid naive take, in which case DuckDB and Polars really shined.


c5 is such a bad instance type; m6a would be so much better and even cheaper. I would love to see this on an m8a.2xlarge (the 7th and 8th generations don't use SMT), which is even cheaper and has up to 15 Gbps.


Actually for this kind of workload 15Gbps is still mediocre. What you actually want is the `n` variant of the instance types, which have higher NIC capacity.

In the c6in and m6in families, and maybe the upper-end 5th gens, you can get 100Gbps NICs, and if you look at the 8th gen instances like the c8gn family, you can even get instances with 600Gbps of bandwidth.


The math here is weird.

A Samsung 990 Pro reads at something like 50 Gbps and PCIe 4.0 x4 is quite a bit faster than that. You can get this speed with a queue depth that isn’t crazy, and you can have multiple NVMe operations in flight reading the same large Parquet file. Latency is in the tens of microseconds.

The consensus seems to be that S3 can read one object at somewhat under 1Gbps. You can probably scale that to the full speed of your NIC by reading multiple objects at once, but you may not be able to scale by reading one object in multiple overlapping ranges. Latency is in the milliseconds.

So, sure, an EC2 with a fast instance and massive multiple object parallelism can have 10x higher bandwidth than an NVMe device, but the amount of parallelism and latency tolerance needed is a couple orders of magnitude higher than NVMe. Meanwhile that NVMe device does not charge for read operations and costs a couple hundred dollars, once.
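
Roughly, using the ballpark figures above (none of these are benchmarks):

  nvme_gbps          = 50       # one fast consumer NVMe drive
  s3_per_object_gbps = 1        # rough single-object S3 read throughput
  nic_gbps           = 100      # a network-optimized instance NIC

  print(nvme_gbps / s3_per_object_gbps)   # ~50 parallel objects just to match one NVMe drive
  print(nic_gbps / s3_per_object_gbps)    # ~100 parallel objects to fill the NIC

  nvme_latency_ms = 0.05        # tens of microseconds
  s3_latency_ms   = 20          # tens of milliseconds (assumed midpoint)
  print(s3_latency_ms / nvme_latency_ms)  # ~400x more latency to hide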

If you are so inclined, you can build an NVMEoF setup (at much much higher cost) that separates compute and storage and has excellent performance, but this is a nontrivial undertaking.


They only just released the Containerization Framework[0] and the new container[1] tool, and they are already scheduling a kneecapping of it two years down the line.

Realistically, people are still going to be deploying on x64 platforms for a long time, and given that Apple's whole shtick was to serve "professionals", it's really a shame that they're dropping the ball on developers like this. Their new containerization stuff was the best workflow improvement for me in quite a while.

[0] https://github.com/apple/containerization

[1] https://github.com/apple/container


Apple has always been like this; there are other options when backwards compatibility is a relevant feature.


Yeah, it kind of kills me that I am writing this on a Samsung Galaxy Book 3 Pro 360 running Windows 11 so that I can run Macromedia Freehand/MX (I was a beta-tester for that version) so that I can still access Altsys Virtuoso 2 files from my NeXT Cube (Virtuoso 2 ~= Macromedia Freehand 4) for a typeface design project I'm still working on (a digital revival of a hot metal typeface created by my favourite type designer/illustrator who passed in 1991, but whose widow was gracious enough to give me permission to revive).

I was _so_ hopeful when I asked the devs to revive the Nx-UI code so that FH/MX could have been a native "Cocoa" app....


> running Windows 11 so that I can run Macromedia Freehand/MX

Freehand still works on Windows 11? I’m happy for you, I never found a true replacement for it.

> a digital revival of a hot metal typeface created by my favourite type designer/illustrator who passed in 1991, but whose widow was gracious enough to give me permission to revive

Any reason you haven’t shared the name of the designer or the typeface? That story sounds interesting, I’d really welcome learning more.


Yes, fortunately. I despair of what I'm going to do when I no longer have such an option. Cenon is clunky, Inkscape's cross-platform nature keeps it from having many interface aspects which I depend on, and I'd rather give up digital drawing than use Adobe Illustrator (which despite using since v3.2 on college lab Macs and on my NeXT Cube I never found comfortable).

The designer/typeface are Warren Chappell's Trajanus, and his unreleased Eichenauer --- I read _The Living Alphabet_ (and his cousin Oscar Ogg's _The 26 Letters_) when I was very young, and met him briefly on a school field trip back when he was Artist-in-Residence at UVA and did a fair bit of research in their Rare Book Room, and even had a sample of the metal type (missing one character unfortunately).

It is currently stalled at my having scanned and drawn up one of each letter at each size which I have available, but only having two letters, _N_ and _n_ in all sizes --- probably shouldn't worry that much about the optical axis, since it was cut in metal in one master size and the other sizes made using a pantograph, but there were _some_ adjustments which I'd like to preserve. There is a digital version of Trajanus available, but it's based on the phototype. I've been working at recreating each character using METAFONT, encompassing the optical size variation in that programmatically, but it's been slow going (and once I'm done, I then have to work out how to make it into outlines....)


That's why like 80%+(?) of the corporate world runs Windows client-side for their laptops/workstations. They don't want to have to rewrite their shit whenever the OS vendor pushes an update.

Granted, that's less of an issue now with most new software being written in JS to run in any browser, but old institutions like banks, insurers, industrial and automation companies, retail chains, etc. still run ancient Java/C#/C++ programs they don't want to, or can't, update for various reasons, but which keep the lights on.

Which is why I find it adorable when people in this bubble think all those industries will suddenly switch to Macs.


they use Windows because it's ostensibly cheap and there's momentum. I don't think any modern tech company is majority Windows.


One of my previous companies gave top of the line workstations with 4k touchscreens and i9s to literally everyone junior and below a particular grade. I'm quite sure they could've saved 1000s of dollars per laptop by going with a reasonable MacBook.

(Ironically, windows 11 + corporate bloatware made the laptops super laggy. Go figure.)


It surely is outside the US and countries with similar income levels.

https://www.accio.com/business/operating-system-market-share...


That's overall market share. Agree Windows use is high but in general the more tech-forward the company is the less Windows there is at it.


So only 13% of the world's desktop users might be employed at a tech-forward company.

Might, because the number is even lower when we differentiate between company and home use.


> more tech-forward

That may be surprising for people here, but technology is not synonymous with software.


> but in general the more tech-forward the company is the less Windows there is at it.

Only if you count food delivery apps, crypto Ponzi scheme unicorns, Ad-services and SaaS start-ups as "tech-forward" exclusively, because you're omitting a lot of other tech companies your daily life in the civilized world depends on, which operate mainly on Windows, like where I work now.

Is designing and building semiconductors not "technology"? Or MRI machines? Or jets? Or car engines?


The whole world does not consist of the tech industry.


The OP says nothing about Rosetta for Linux.


It seems to talk about Rosetta 2 as a whole, which is what the containerization framework depends on to support running amd64 binaries inside Linux VMs (even though the kernel still needs to be arm)

Is there a separate part of Rosetta that is implemented for the VM stuff? I was under the impression Rosetta was some kind of XPC service that would translate executable pages for the Hypervisor framework as they were faulted in. Did I just misunderstand how the thing works under the hood? Are there two Rosettas?


I cannot tell you about implementation difference but what I mean is that this only talks about Rosetta 2 for Mac apps. Rosetta for Linux is a feature of the Virtualization framework that’s documented in a completely different place. And this message says a part of Rosetta for macOS will stick around, so I would be surprised if they removed the Linux part.

On the Linux side, Rosetta is an executable that you hook up with binfmt_misc to run AMD64 binaries, much like you might use Wine for Windows binaries.


Rosetta Linux executable can be used without host hardware/software support; for example, you can run it on AWS's Graviton instances.

However, to get performance benefits, you still need to have hardware support, and have Rosetta installed on macOS [1].

TFA is quite vague about what is being deprecated.

[1] https://developer.apple.com/documentation/virtualization/run...


The "other" part of Rosetta is having all system frameworks being also compiled for x86_64, and being supported running in this configuration.


> and given that Apple's whole shtick was to serve "professionals",

When was the last time this was true? I think I gave up on the platform around the new keyboards, which clearly weren't made for typing, and the non-stop "Upgrade" and "Upgrade" notifications that you couldn't disable, just push forward into the future. Everything they've done since then seems to have been to impress the Average Joe, not to serve professionals.


The current MacBook Pro was basically a checklist of items professionals wanted, a step away from the consumer new-shiny-thing.


And then they introduced liquid glass, because professionals don't need an easily readable UI to work with.


> Everything they've done since them seems to have been to impress the Average Joe, not for serving professionals.

"CIOs say Apple is now mission critical for the enterprise" [1]

[1]: https://9to5mac.com/2025/10/25/cios-say-apple-is-now-mission...


That's literally sponsored content/an ad by a company that makes money managing Apple devices; of course they'll say it's "mission critical", on a website meant to promote Apple hardware.

Happen to have some less biased source saying anything similar, ideally not sponsored content?


There are a lot of projects with ARM containers on Docker Hub. It's not hard to build multi-platform containers.


Do not reference these kinds of docs whenever you need practical, actionable advice. They serve their purpose, but are for a completely different kind of audience.

For anyone perusing this thread, your first resource for this kind of security advice should probably be the OWASP cheat sheets, a living set of documents that packages current practice into direct recommendations for implementers.

Here's what it says about tuning Argon2:

https://cheatsheetseries.owasp.org/cheatsheets/Password_Stor...
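
As a concrete example, with the argon2-cffi package in Python, the first minimum configuration the cheat sheet lists (Argon2id with 19 MiB of memory, 2 iterations, and 1 degree of parallelism) looks roughly like this:

  from argon2 import PasswordHasher

  # PasswordHasher defaults to Argon2id; memory_cost is in KiB.
  ph = PasswordHasher(time_cost=2, memory_cost=19 * 1024, parallelism=1)

  digest = ph.hash("correct horse battery staple")
  ph.verify(digest, "correct horse battery staple")   # raises VerifyMismatchError on failure
  print(ph.check_needs_rehash(digest))                # False while the parameters still match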


It's been a couple years since I've looked but the track record of OWASP for cryptography advice has been pretty dismal.


Do you have a better recommendation?

I feel bad for OWASP. They're doing the lord's work, but seem to have a shoestring budget.


The OWASP ASVS appendix on Cryptography is one of the best and concise resources I know for this kind of thing: https://github.com/OWASP/ASVS/blob/master/5.0/en/0x92-Append...


This is just a random list of links to standards and summary tables, some of which are wrong (urandom vs. random, for instance). The "A/L/D" scoring makes very little sense. CBC is legacy-allowable and CTR is disallowed; "verification of padding must be performed in constant time". For reasons passing understanding, "MAC-then-encrypt" is legacy-allowable. They've deprecated the internally truncated SHA2's and kept the full-width ones (the internally truncated ones are more, not less secure). They've taken the time to formally disallow "MD5 and SHA1 based KDF functions". There's a long list of allowed FFDH groups. AES-CMAC is a recommended general-purpose message authenticator.

This is a mess, and I would actively steer people away from it.


Yes it’s an audit checklist for when you need to know specifically what to use and with which parameters.

It’s unfortunate if there are mistakes in there. The people at OWASP would be very happy to receive feedback on their GitHub I’m sure.


It's a bad audit checklist! If OWASP volunteers can't do a good one, they should just not do one at all. It's fine for them not to cover things that are outside their expertise.


Which one would you recommend instead? Referring dev teams to NIST standards or the like doesn’t work well in my experience.


There doesn't always have to be a resource. Sometimes no resource is better than a faulty one. Cryptography is one of those cases.


I’d wager that something like 90% of developers who look at that page should close the tab instead of reading any of it.

If you’re building a system and need crypto… pick the canonical library for the ecosystem or language you’re working in. Don’t try to build your own collection of primitives.


Also, I gave the link to the appendix because there was a specific question about Argon2 parameters. For general developer audiences, they need to look at the standard itself which is a lot more high level about how to properly implement cryptography in software: https://github.com/OWASP/ASVS/blob/master/5.0/en/0x20-V11-Cr...

For the most common use-cases of cryptography like authentication and secure communication there is more specific, but still high level guidance that is useful for developers as well:

- https://github.com/OWASP/ASVS/blob/master/5.0/en/0x21-V12-Se...

- https://github.com/OWASP/ASVS/blob/master/5.0/en/0x18-V9-Sel...

- https://github.com/OWASP/ASVS/blob/master/5.0/en/0x15-V6-Aut...


This standard is bad. People should avoid it. For example: 11.2.2 (cryptographic agility) is an anti-pattern in modern cryptographic engineering.


Please elaborate on why you believe that? The ability to easily rotate encryption keys is considered an anti-pattern?


Yes I fully agree. I’m a big fan of libraries like Google Tink that make you pick a use case and use the best implementation for that use case with built in crypto agility.

Most crypto libraries are not built like that, however. They just give you a big pile of primitives/algorithms to choose from. Then frameworks get built on top of that, not always following best practices, leaving people who are serious about security the job of making sure the implementation is secure. This is the point where you need something like ASVS.


What language today still doesn't have a de facto simplified toolbox for wrapping crypto operations?

If you're a developer, and you start trying to perform crypto operations for your service and the library you chose is making you question which cipher, what KDF parameters, or what DH group you want, that is 100% a red flag and you should promptly stop using that crypto library.


Can you give some examples of such commonly used libraries for languages like Java / C# / C++?

In my experience there are not many libraries like Google Tink around, and they are not in widespread use at all. Most applications doing encryption manually for specific purposes still have the words AES, CBC, GCM, IV etc hardcoded in their source code.

If you review such code, it’s still useful to have resources that show industry best practices, but I agree that the gold standard is to not have these details in your own code at all.



Documenso[0] is a pretty cool alternative that is increasingly compliant with more and more e-signature standards

https://documenso.com/


There used to be a similarly named one called CozoDB [0], which was pretty awesome, but it looks like its development has slowed down significantly.

[0] https://github.com/cozodb/cozo


Ah yes, pretending we can access infinite amounts of memory instantaneously, or in a finite/bounded amount of time, is the Achilles heel of the von Neumann abstract computer model, and is the point where it completely diverges from physical reality.

Acknowledging that memory access is not instantaneous immediately throws you into the realm of distributed systems though and something much closer to an actor model of computation. It's a pretty meaningful theoretical gap, more so than people realize.


I would like to see someone pick up Knuth’s torch and formulate a new order of complexity for distributed computing.

Many of the products we use, and for probably the last fifty years really, live in the space between theory and practice. We need to collect all of this and teach it. Computing has grown 6, maybe more, orders of magnitude since Knuth pioneered these techniques. In any other domain of computer science the solutions often change when the order of magnitude of the problem changes, and after several it's inescapable.


I was trying to do this in 2023! The hardest part about building a search engine is not the actual searching though; it is (like others here have pointed out) building your index and crawling the (extremely adversarial) internet, especially when you're running the thing from a single server in your own home without fancy rotating IPs.

I hope this guy succeeds and becomes another reference in the community like the marginalia dude. This makes me want to give my project another go...


You might want to bookmark https://openwebsearch.eu/open-webindex/

While the index is currently not open source, it should be at some point, maybe when they get out of the beta stage (?); details are as yet unclear.


You know, it's possible the cure to an adversarial internet is to just have some non-profit serve as a repo for a universal clearnet index that anyone can access to build their own search engine. That way we don't have endless captchas and anubis and Cloudflare tests every time I try and look for a recipe online. Why send AI scrapers to crawl literally everything when you're getting the data for free?

I'll add it to the mile-long list of things that should exist and be online public goods.


Is the common crawl usable for something like this?

https://commoncrawl.org


I'm the creator of searcha.page and seek.ninja; those are the basis of my index. The biggest problem with ONLY using that is freshness. I've started my own crawling too, but for sure Common Crawl will backfill a TON of good pages. It's priceless, and I would say Common Crawl should be any search engine's starting point. I have 2 billion pages from Common Crawl! There were a lot more, but I had to scrub them out due to resources. My native crawling is much more targeted and I'd be lucky to pull 100k, but as long as my heuristics choose the right targets those will be very high-value pulls.
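
For anyone who wants to poke at it, the Common Crawl index exposes a public CDX API per crawl; a minimal sketch (the crawl ID is just an example, pick a current one from index.commoncrawl.org):

  import json
  import requests

  CDX = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"   # example crawl ID

  resp = requests.get(CDX, params={"url": "example.com/*", "output": "json"}, timeout=30)
  resp.raise_for_status()

  for line in resp.text.splitlines()[:5]:
      rec = json.loads(line)
      # Each record points at a WARC file in the public Common Crawl bucket,
      # with an offset/length you can fetch with an HTTP range request.
      print(rec["url"], rec["filename"], rec["offset"], rec["length"])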


Most likely it is; the issue then becomes being able to store, and afford storage for, all the files.


Sure, and that's not easy, but it's a lot easier than having to crawl the entire public Internet yourself.


Why can't crawling be crowdsourced? It would solve IP rotation and spread the load.



Too bad it doesn't support Android. It is much more energy-efficient than anything else I can spare (for a 100% uptime contribution).


That’s how residential proxies work, in a perverse way


Common crawl sort of serves this function. I use it. It's a really good foundation.


The crawl seems hard, but the difference between having something and not having it is very obvious. Ordering the results is not. What should go on page 200, and do those results still count as having them?


The IP thing is interesting. I was trying to make this CSGO bot one time to scrape Steam's prices, and there are proxy services out there you can rent; I tried at least one and it was blocked by Steam. So I wonder if people buy real IPs.


Yeah people buy residential IPs on the black market. They are essentially infected home PCs and botnets.


Not just the black market anymore!

https://www.proxyrack.com/residential-proxies/


you can get paid about $0.10/GB in cryptocurrency (at a few GB per month) to run one on your PC. Apparently they also just buy actual connections sometimes. It's not even unethical - it's just two groups of equally bad businesspeople trying to spend money to block the other one.


I've heard a few horror stories... Since the people using residential proxies aren't necessarily always good people


I also switched away from Tarsnap because I needed to restore my personal PDF collection of like 20GB once and my throughput was like 100Kb/s, maybe less. It has been a problem for at least a decade, with no fix in sight.

I'm carefully monitoring plakar in this space, wondering if anyone has experience with it and could share?

