I also find this to be an elegant way of doing it, and it is also how Thompson-VM-style regex engines work [0].
It's a bit harder to adapt the technique to parsers because the Thompson NFA always increments the sequence pointer by the same amount, while a parser's production usually has a variable size, making it harder to run several parsing heads in lockstep.
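A toy sketch of the lockstep idea (not code from the linked article; the instruction encoding and the hand-compiled program are made up purely for illustration): every live "thread" is just a program counter, and all of them advance together, one input character at a time.

```python
# Toy Thompson-VM matcher: the compiled regex is a list of instructions and
# every live "thread" is just a program counter, advanced in lockstep.
CHAR, SPLIT, JMP, MATCH = range(4)

def add_thread(prog, pc, threads):
    """Follow epsilon transitions (SPLIT/JMP) until landing on CHAR or MATCH.
    (This toy program has no epsilon cycles, so no visited set is needed.)"""
    op = prog[pc]
    if op[0] == JMP:
        add_thread(prog, op[1], threads)
    elif op[0] == SPLIT:
        add_thread(prog, op[1], threads)
        add_thread(prog, op[2], threads)
    else:
        threads.add(pc)

def matches(prog, text):
    current = set()
    add_thread(prog, 0, current)
    for ch in text:
        nxt = set()
        for pc in current:
            op = prog[pc]
            if op[0] == CHAR and op[1] == ch:
                add_thread(prog, pc + 1, nxt)  # every thread steps by the same amount
        current = nxt
    return any(prog[pc][0] == MATCH for pc in current)

# a(b|c)d, compiled by hand for illustration
prog = [
    (CHAR, 'a'),    # 0
    (SPLIT, 2, 4),  # 1
    (CHAR, 'b'),    # 2
    (JMP, 5),       # 3
    (CHAR, 'c'),    # 4
    (CHAR, 'd'),    # 5
    (MATCH,),       # 6
]
print(matches(prog, "abd"), matches(prog, "acd"), matches(prog, "axd"))  # True True False
```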
Honestly this benchmark feels completely dominated by the instance's NIC capacity.
They used a c5.4xlarge that has peak 10Gbps bandwidth, which at a constant 100% saturation would take in the ballpark of 9 minutes to load those 650GB from S3, making those 9 minutes your best case scenario for pulling the data (without even considering writing it back!)
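Back-of-envelope, with the numbers above:

```python
# Sanity check of the ~9 minute figure.
data_gb = 650        # dataset size, GB
nic_gbps = 10        # c5.4xlarge peak network bandwidth, Gbit/s
seconds = data_gb * 8 / nic_gbps
print(seconds / 60)  # ~8.7 minutes at sustained 100% NIC saturation
```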
Minute differences in how these query engines schedule I/O would have drastic effects on the benchmark outcomes, and I doubt the query engine itself was kept constantly fed during this workload, especially when evaluating DuckDB and Polars.
The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.
It would be amusing to run this on a regular desktop computer or even a moderately nice laptop (with a fan - give it a chance!) and see how it does. 650GB will stream in quite quickly from any decent NVMe device, and those 8-16 cores might well be considerably faster than whatever cores the cloud machines are giving you.
S3 is an amazingly engineered product, operates at truly impressive scale, is quite reasonably priced if you think of it as warm-to-very-cold storage with excellent durability properties, and has performance that barely holds a candle to any decent modern local storage device.
Absolutely. I recently reworked a bunch of tests and found my desktop outperforming our (larger, custom) GitHub Actions runner by roughly 5x. And I expect this delta to increase a lot as you lean on local I/O harder.
It really is shocking how much you're paying given how little you get. I certainly don't want to run a data center and handle all the scaling and complexity of such an endeavour. But wow, the tax you pay to have someone manage all that is staggering.
Totally true. I have a trusty old (like 2016-era) X99 setup that I use for 1.2TB of time series data hosted in a TimescaleDB/PostGIS database. I can fetch all the data I need quickly to crunch on another local machine, and max out my aging network gear to experiment with different model training scenarios. It cost me ~$500 to build the machine, and it stays off when I'm not using it.
Much easier obviously dealing with a dataset that doesn't change, but doing the same in the cloud would just be throwing money away.
> They used a c5.4xlarge that has peak 10Gbps bandwidth, which at a constant 100% saturation would take in the ballpark of 9 minutes to load those 650GB from S3, making those 9 minutes your best case scenario for pulling the data (without even considering writing it back!)
The query being tested wouldn't scan the full files; in reality, most sane engines would process much less than 650GB of data (exploiting S3 byte-range reads), i.e. just one column, a timestamp, which is also correlated with the partition keys. Nowadays what I would mostly be worried about is the distribution of file sizes, due to API calls + skew, or a query so different from the common access patterns that it can't exploit the metadata/columnar nature of the underlying Parquet (i.e. doing an effective "full scan" over all row groups and/or columns).
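For concreteness, this is the access pattern I mean; a rough sketch with pyarrow, where the bucket, partition, and column names are made up: only the requested column chunks are fetched via byte-range reads, and partition keys / row-group statistics prune the rest.

```python
import pyarrow.dataset as ds

# Hypothetical partitioned Parquet dataset on S3; only one column is materialized,
# so the reader issues byte-range GETs for those column chunks instead of
# downloading whole files.
dataset = ds.dataset("s3://my-bucket/events/", format="parquet", partitioning="hive")
table = dataset.to_table(
    columns=["event_ts"],                           # single timestamp column
    filter=ds.field("event_date") == "2024-01-01",  # pruned via partition keys / stats
)
print(table.num_rows)
```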
> The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.
Yep I think the value of the experiment is not clear.
You want to use Spark for a large dataset with multiple stages. In this case their I/O bandwidth from S3 is 1GB/s, while CPU memory bandwidth is 100-200GB/s for a multi-stage job. Spark is a way to pool memory across a cluster for such a dataset, and to use cluster-internal network bandwidth for shuffling instead of going back to storage.
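As a sketch of that multi-stage, shuffle-heavy shape in PySpark (paths and column names are hypothetical): the aggregation forces a shuffle that moves data over the cluster network rather than bouncing it off storage.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-sketch").getOrCreate()

events = spark.read.parquet("s3://my-bucket/events/")  # stage 1: scan from S3
daily = (
    events
    .groupBy(F.to_date("event_ts").alias("day"))       # shuffle boundary: rows move
    .agg(F.count("*").alias("n"))                      # across the cluster network
)
daily.write.mode("overwrite").parquet("s3://my-bucket/daily_counts/")  # write back to storage
```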
Maybe when you have S3 as your backend, the storage bandwidth bottleneck doesn't show up in perf, but it sure does show up in the bill. A crude rule of thumb: network bandwidth is 20X storage, main memory bandwidth is 20X network bandwidth, accelerator/GPU memory is 10X CPU. It's great that single-node DuckDB/Polars are that good, but this is like racing a taxiing aircraft against motorbikes.
I think I'm talking about cluster-scale network bisection bandwidth vs attached storage bandwidth. With replication/erasure coding overhead and the economics, the order of magnitude difference still prevails.
I think your point is a good one in that it is more economics than systems physics. We size clusters to have more compute/network than storage because it is the design point that maximizes overall utility.
It also raises an interesting question: if we ever get to a point where the disparity really no longer holds, that would justify a complete rethinking of many Spark-like applications that are designed to exploit this asymmetry.
And that's for one SSD. If you're running on a server rather than a laptop, aggregate storage bandwidth will almost certainly be higher than any single network link.
10Gbps only? At Google where this type of processing would automatically be distributed, machines had 400Gbps NICs, not to mention other innovations like better TCP congestion control algorithms. No wonder people are tired of distributed computing.
"At Google" is doing all the heavy lifting in your comment here, with all due respect. There is but one Google but remain millions of us who are not "At Google".
I’m merely describing the infrastructure that at least partially led to the success of distributed data processing. Also 400Gbps NIC isn’t a Google exclusive. Other clouds and on-premise DCs could buy them from Broadcom or other vendors.
This is a really good observation, and matches something I had to learn painfully over 30 years ago. At a Wall Street bank, we were trying to really push the limits with some middleware, and my mentor at the time very quietly suggested "before you test your system's performance, understand the theoretical maximum of your setup first with no work".
The gist was - find your resource limits and saturate them and see what the best possible performance could be, then measure your system, and you can express it as a percentage of optimal. Or if you can't directly test/saturate your limits at least be aware of them.
c5 is such a bad instance type; m6a would be so much better and even cheaper.
I would love to see this on an m8a.2xlarge (the 7th and 8th generations don't use SMT); it's even cheaper and has up to 15 Gbps.
Actually for this kind of workload 15Gbps is still mediocre. What you actually want is the `n` variant of the instance types, which have higher NIC capacity.
In the c6n and m6n and maybe the upper-end 5th gens you can get 100Gbps NICs, and if you look at the 8th gen instances like the c8gn family, you can even get instances with 600Gbps of bandwidth.
A Samsung 990 Pro reads at something like 50 Gbps and PCIe 4.0 x4 is quite a bit faster than that. You can get this speed with a queue depth that isn’t crazy, and you can have multiple NVMe operations in flight reading the same large Parquet file. Latency is in the tens of microseconds.
The consensus seems to be that S3 can read one object at somewhat under 1Gbps. You can probably scale that to the full speed of your NIC by reading multiple objects at once, but you may not be able to scale by reading one object in multiple overlapping ranges. Latency is in the milliseconds.
So, sure, an EC2 instance with a fast NIC and massive multi-object parallelism can have 10x higher bandwidth than an NVMe device, but the amount of parallelism and latency tolerance needed is a couple orders of magnitude higher than with NVMe. Meanwhile that NVMe device does not charge for read operations and costs a couple hundred dollars, once.
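To make the parallelism point concrete, a rough boto3 sketch (bucket, keys, and worker count are made up): approaching NIC speed means keeping dozens of requests in flight at once.

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"
keys = [f"data/part-{i:05d}.parquet" for i in range(256)]  # hypothetical object list

def fetch(key):
    # One GET per object; throughput comes from having many of these in flight.
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

with ThreadPoolExecutor(max_workers=64) as pool:
    total = sum(len(body) for body in pool.map(fetch, keys))

print(f"{total / 1e9:.1f} GB fetched")
```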
If you are so inclined, you can build an NVMEoF setup (at much much higher cost) that separates compute and storage and has excellent performance, but this is a nontrivial undertaking.
They only just released the Containerization Framework[0] and the new container[1] tool, and they are already scheduling a kneecapping of it two years down the line.
Realistically, people are still going to be deploying on x64 platforms for a long time, and given that Apple's whole shtick was to serve "professionals", it's really a shame that they're dropping the ball on developers like this. Their new containerization stuff was the best workflow improvement for me in quite a while.
Yeah, it kind of kills me that I am writing this on a Samsung Galaxy Book 3 Pro 360 running Windows 11 so that I can run Macromedia Freehand/MX (I was a beta-tester for that version) so that I can still access Altsys Virtuoso 2 files from my NeXT Cube (Virtuoso 2 ~= Macromedia Freehand 4) for a typeface design project I'm still working on (a digital revival of a hot metal typeface created by my favourite type designer/illustrator who passed in 1991, but whose widow was gracious enough to give me permission to revive).
I was _so_ hopeful when I asked the devs to revive the Nx-UI code so that FH/MX could have been a native "Cocoa" app....
> running Windows 11 so that I can run Macromedia Freehand/MX
Freehand still works on Windows 11? I’m happy for you, I never found a true replacement for it.
> a digital revival of a hot metal typeface created by my favourite type designer/illustrator who passed in 1991, but whose widow was gracious enough to give me permission to revive
Any reason you haven’t shared the name of the designer or the typeface? That story sounds interesting, I’d really welcome learning more.
Yes, fortunately. I despair of what I'm going to do when I no longer have such an option. Cenon is clunky, Inkscape's cross-platform nature keeps it from having many interface aspects which I depend on, and I'd rather give up digital drawing than use Adobe Illustrator (which despite using since v3.2 on college lab Macs and on my NeXT Cube I never found comfortable).
The designer/typeface are Warren Chappell's Trajanus, and his unreleased Eichenauer --- I read _The Living Alphabet_ (and his cousin Oscar Ogg's _The 26 Letters_) when I was very young, and met him briefly on a school field trip back when he was Artist-in-Residence at UVA and did a fair bit of research in their Rare Book Room, and even had a sample of the metal type (missing one character unfortunately).
It is currently stalled at my having scanned and drawn up one of each letter at each size I have available, while only having two letters, _N_ and _n_, in all sizes --- I probably shouldn't worry that much about the optical axis, since it was cut in metal in one master size and the other sizes were made using a pantograph, but there were _some_ adjustments which I'd like to preserve. There is a digital version of Trajanus available, but it's based on the phototype. I've been working at recreating each character using METAFONT, encompassing the optical size variation programmatically, but it's been slow going (and once I'm done, I then have to work out how to make it into outlines....)
That's why like 80%+(?) of the corporate world runs Windows client-side for their laptops/workstations. They don't want to have to rewrite their shit whenever the OS vendor pushes an update.
Granted, that's less of an issue now with most new software being written in JS to run in any browser, but old institutions like banks, insurers, industrial, automation, and retail chains still run ancient Java/C#/C++ programs they don't want to, or can't, update for various reasons, but which keep the lights on.
Which is why I find it adorable when people in this bubble think all those industries will suddenly switch to Macs.
One of my previous companies gave top of the line workstations with 4k touchscreens and i9s to literally everyone junior and below a particular grade. I'm quite sure they could've saved 1000s of dollars per laptop by going with a reasonable MacBook.
(Ironically, windows 11 + corporate bloatware made the laptops super laggy. Go figure.)
>but in general the more tech-forward the company is the less Windows there is at it.
Only if you exclusively count food delivery apps, crypto Ponzi-scheme unicorns, ad services, and SaaS start-ups as "tech-forward", because you're omitting a lot of other tech companies your daily life in the civilized world depends on, which operate mainly on Windows, like where I work now.
Is designing and building semiconductors not "technology"? Or MRI machines? Or jets? Or car engines?
It seems to talk about Rosetta 2 as a whole, which is what the containerization framework depends on to support running amd64 binaries inside Linux VMs (even though the kernel still needs to be arm)
Is there a separate part of Rosetta that is implemented for the VM stuff? I was under the impression Rosetta was some kind of XPC service that would translate executable pages for Hypervisor Framework as they were faulted in, did I just misunderstand how the thing works under the hood? Are there two Rosettas?
I cannot tell you about implementation difference but what I mean is that this only talks about Rosetta 2 for Mac apps. Rosetta for Linux is a feature of the Virtualization framework that’s documented in a completely different place. And this message says a part of Rosetta for macOS will stick around, so I would be surprised if they removed the Linux part.
On the Linux side, Rosetta is an executable that you hook up via binfmt_misc to run AMD64 binaries, much like you might use Wine for Windows binaries.
> and given that Apple's whole shtick was to serve "professionals",
When was the last time this was true? I think I gave up on the platform around the new keyboards, which clearly weren't made for typing, and the non-stop "Upgrade" and "Upgrade" notifications that you couldn't disable, just push forward into the future. Everything they've done since then seems to have been aimed at impressing the Average Joe, not at serving professionals.
That's literally sponsored content/an ad by a company who makes money managing Apple devices, of course they'll say it's "mission critical", on a website meant to promote Apple hardware.
Happen to have some less biased source saying anything similar, ideally not sponsored content?
Do not reference these kinds of docs whenever you need practical, actionable advice. They serve their purpose, but are for a completely different kind of audience.
For anyone perusing this thread, your first resource for this kind of security advice should probably be the OWASP cheat sheets, which are a living set of documents that package current practice into direct recommendations for implementers.
This is just a random list of links to standards and summary tables, some of which are wrong (urandom vs. random, for instance). The "A/L/D" scoring makes very little sense. CBC is legacy-allowable and CTR is disallowed; "verification of padding must be performed in constant time". For reasons passing understanding, "MAC-then-encrypt" is legacy-allowable. They've deprecated the internally truncated SHA2's and kept the full-width ones (the internally truncated ones are more, not less secure). They've taken the time to formally disallow "MD5 and SHA1 based KDF functions". There's a long list of allowed FFDH groups. AES-CMAC is a recommended general-purpose message authenticator.
This is a mess, and I would actively steer people away from it.
It's a bad audit checklist! If OWASP volunteers can't do a good one, they should just not do one at all. It's fine for them not to cover things that are outside their expertise.
I’d wager that something like 90% of developers who look at that page should close the tab instead of reading any of it.
If you’re building a system and need crypto… pick the canonical library for the ecosystem or language you’re working in. Don’t try to build your own collection of primitives.
Also, I gave the link to the appendix because there was a specific question about Argon2 parameters. For general developer audiences, they need to look at the standard itself which is a lot more high level about how to properly implement cryptography in software:
https://github.com/OWASP/ASVS/blob/master/5.0/en/0x20-V11-Cr...
For the most common use-cases of cryptography like authentication and secure communication there is more specific, but still high level guidance that is useful for developers as well:
Yes I fully agree. I’m a big fan of libraries like Google Tink that make you pick a use case and use the best implementation for that use case with built in crypto agility.
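For anyone who hasn't seen it, the Tink style looks roughly like this (a sketch based on tink-py's published examples; check the current docs, as the API has shifted between releases): you pick a primitive for your use case and never touch ciphers, modes, or nonces directly.

```python
import tink
from tink import aead

aead.register()
# "AEAD" is the use case; the key template picks a vetted algorithm for you.
keyset_handle = tink.new_keyset_handle(aead.aead_key_templates.AES256_GCM)
primitive = keyset_handle.primitive(aead.Aead)

ciphertext = primitive.encrypt(b"secret", b"associated-data")
print(primitive.decrypt(ciphertext, b"associated-data"))  # b'secret'
```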
Most crypto libraries are not built like that however. They just give you a big pile of primitives/algorithms to choose from. Then frameworks get built on top of that, not always taking into account best practices, and leave people that are serious about security the job of making sure the implementation is secure. This is the point where you need something like ASVS.
What language today still doesn't have a de facto simplified toolbox for wrapping crypto operations?
If you're a developer, and you start trying to perform crypto operations for your service and the library you chose is making you question which cipher, what KDF parameters, or what DH group you want, that is 100% a red flag and you should promptly stop using that crypto library.
Can you give some examples of such commonly used libraries for languages like Java / C# / C++?
In my experience there are not many libraries like Google Tink around, and they are not in widespread use at all. Most applications doing encryption manually for specific purposes still have the words AES, CBC, GCM, IV etc hardcoded in their source code.
If you review such code, it’s still useful to have resources that show industry best practices, but I agree that the gold standard is to not have these details in your own code at all.
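Concretely, the hand-rolled style described above usually looks something like this sketch with Python's cryptography package (names are illustrative): it works, but the developer now owns the choice of algorithm, key size, and nonce handling.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # algorithm and key size chosen by hand
aesgcm = AESGCM(key)
nonce = os.urandom(12)                     # nonce management is the caller's problem
ciphertext = aesgcm.encrypt(nonce, b"secret", b"associated-data")
print(aesgcm.decrypt(nonce, ciphertext, b"associated-data"))  # b'secret'
```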
Ah yes, pretending we can access infinite amounts of memory instantaneously, or at least in a finite/bounded amount of time, is the Achilles heel of the von Neumann abstract computer model, and is the point where it completely diverges from physical reality.
Acknowledging that memory access is not instantaneous immediately throws you into the realm of distributed systems though and something much closer to an actor model of computation. It's a pretty meaningful theoretical gap, more so than people realize.
I would like to see someone pick up Knuth’s torch and formulate a new order of complexity for distributed computing.
Many of the products we use, and have for probably the last fifty years, live in the space between theory and practice. We need to collect all of this and teach it. Computers have grown six, maybe more, orders of magnitude since Knuth pioneered these techniques. In any other domain of computer science the solutions often change when the order of magnitude of the problem changes, and after several it's inescapable.
I was trying to do this in 2023! The hardest part about building a search engine is not the actual searching though, it is (like others here have pointed out), building your index and crawling the (extremely adversarial) internet, especially when you're running the thing from a single server in your own home without fancy rotating IPs.
I hope this guy succeeds and becomes another reference in the community like the marginalia dude. This makes me want to give my project another go...
You know, it's possible the cure to an adversarial internet is to just have some non-profit serve as a repo for a universal clearnet index that anyone can access to build their own search engine. That way we don't have endless captchas and anubis and Cloudflare tests every time I try and look for a recipe online. Why send AI scrapers to crawl literally everything when you're getting the data for free?
I'll add it to the mile-long list of things that should exist and be online public goods.
I'm the creator of searcha.page and seek.ninja; those are the basis of my index. The biggest problem with ONLY using that is freshness. I've started my own crawling too, but for sure Common Crawl will backfill a TON of good pages. It's priceless, and I would say Common Crawl should be any search engine's starting point. I have 2 billion pages from Common Crawl! There were a lot more, but I had to scrub them out due to resources. My native crawling is much more targeted and I'd be lucky to pull 100k, but as long as my heuristics choose the right targets, those will be very high-value pulls.
The crawl seems hard, but the difference between having something and not having it is very obvious. Ordering the results is not. What should go on page 200, and do those results still count as having them?
The IP thing is interesting. I was trying to make a CSGO bot one time to scrape Steam's prices, and there are proxy services out there you can rent; I tried at least one and it was blocked by Steam. So I wonder if people buy real IPs.
you can get paid about $0.10/GB in cryptocurrency (at a few GB per month) to run one on your PC. Apparently they also just buy actual connections sometimes. It's not even unethical - it's just two groups of equally bad businesspeople trying to spend money to block the other one.
I also switched away from Tarsnap because I needed to restore my personal PDF collection of like 20GB once and my throughput was like 100Kb/s, maybe less. It has been a problem for at least a decade, with no fix in sight.
I'm carefully monitoring plakar in this space, wondering if anyone has experience with it and could share?
[0] https://swtch.com/~rsc/regexp/regexp2.html