breckognize's comments | Hacker News

Shameless plug: Row Zero has real enterprise traction. AWS uses us.

It has a 2B row limit, it's connected, and it eliminates the Excel security risk because it's hosted.


Kinda related - our product, rowzero.io, is a browser-based spreadsheet with a 2 billion row limit. We initially built the client as anyone would, using a div per cell. We tried to use an off-screen div to take advantage of the browser's native scrollbars but ran into document height limits. Firefox's was 6M pixels iirc. The solution was to do rendering in canvas and draw the scrollbars ourselves.


Firefox’s limit is 17,895,697 pixels. Others have a limit less than twice as high, so given you’re aiming for a value way higher than that, it’s not a browser-specific issue, except insofar as Firefox ignores rather than clamping, so you have to detect Firefox and clamp it manually.

In Fastmail’s case (see my top-level comment), making the end of a ridiculously large mailbox inaccessible was considered acceptable. In a spreadsheet, that’s probably not so, so you need to do something different. But frankly I think you needed to use a custom scrollbar anyway, as a linear scrollbar will be useless for almost all documents for anything except returning to the top.

Rendering the content, however, to a canvas is not particularly necessary: make a 4 million pixel square area, hide its scrollbars, render the outermost million pixels of all edges at their edge, and where you’re anywhere in the middle (e.g. 1.7 billion rows in), render starting at 2 million pixels, and if the user scrolls a million pixels in any direction, recentre (potentially disrupting scrolling inertia, but that’s about it). That’s basically perfect, allowing native rendering and scrolling interaction, meaning better behaviour and lower latency.
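The bookkeeping for that recentring trick can be sketched as pure offset arithmetic (the real thing would be DOM/JS; the 4M-px area, 1M-px bands, and 2M-px centre are the values from the comment, and `on_scroll` is a hypothetical name):

```python
AREA = 4_000_000    # height of the hidden-scrollbar scroll area, px
BAND = 1_000_000    # drift allowed from centre before recentring, px
CENTER = AREA // 2  # 2M px: where the viewport sits after a recentre

def on_scroll(anchor, scroll_top, doc_height):
    """anchor = virtual document offset of the scroll area's top edge,
    so the absolute document position is always anchor + scroll_top.
    If the user has drifted more than BAND from CENTER, shift the
    anchor and snap scroll_top back to CENTER, clamping near the
    document's edges so the outermost stretches render at their true
    offsets. Returns (new_anchor, new_scroll_top); the absolute
    position is unchanged, so the user notices nothing.
    """
    pos = anchor + scroll_top
    if abs(scroll_top - CENTER) <= BAND:
        return anchor, scroll_top            # still inside the middle band
    new_anchor = pos - CENTER                # recentre around current position
    new_anchor = max(0, min(new_anchor, doc_height - AREA))  # clamp at edges
    return new_anchor, pos - new_anchor
```

Since `anchor + scroll_top` is invariant across a recentre, the only user-visible effect is the potential loss of scrolling inertia mentioned above.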


Is it completely client side? Why does it have a 2 billion row limit? Where are the limitations coming from?


Do you even need to have one scroll pixel == one screen pixel (or even one scroll pixel == one spreadsheet row)? At the point of 2 billion rows, the scrollbar really falls apart and just jumping to an approximation of the correct location in the document is all anyone can hope for.
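Right: once the row count dwarfs the scrollbar's pixel height, the only sane mapping is thumb fraction to approximate row. A minimal sketch of that mapping (hypothetical function names, not any product's actual code):

```python
def thumb_to_row(fraction, total_rows, viewport_rows):
    """Map a custom scrollbar thumb position (0.0-1.0) to the first
    visible row. With 2 billion rows one thumb pixel spans millions
    of rows, so this is necessarily approximate; fine positioning
    happens via keyboard/wheel instead.
    """
    last_top = max(0, total_rows - viewport_rows)
    return round(fraction * last_top)

def row_to_thumb(row, total_rows, viewport_rows):
    """Inverse mapping: place the thumb for a given first-visible row."""
    last_top = max(0, total_rows - viewport_rows)
    return 0.0 if last_top == 0 else row / last_top
```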


It's worth noting the author led the implementation of the file system at the bottom of S3.


And depth first search is just a stack!


Yes, but not immediately. Prefer deeper nodes and track visited nodes and you have a loop-safe depth-first traversal; if the graph happens to be a tree, you can drop the visited set and get the regular stack-based depth-first traversal; and if you settle for the first goal found, you get back a tail-call-optimised DFS.
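The progression above can be sketched in a few lines: the loop-safe version is just a stack plus a visited set, and the tree version falls out by deleting the visited set.

```python
def dfs(graph, start, goal=None):
    """Loop-safe depth-first traversal with an explicit stack.

    `graph` maps each node to a list of neighbours. The `visited`
    set is what makes this safe on graphs with cycles; on a tree
    it can be dropped to recover the plain stack-based DFS.
    """
    stack = [start]
    visited = set()
    order = []
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        if node == goal:                 # settle for the first goal
            return order
        # push neighbours; the last one pushed is explored first
        stack.extend(graph.get(node, []))
    return order
```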


To measure performance the author looked at latency, but most S3 workloads are throughput oriented. The magic of S3 is that it's cheap because it's built on spinning HDDs, which are slow and unreliable individually, but when you have millions of them, you can mask the tail and deliver multi TBs/sec of throughput.

It's misleading to look at S3 as a CDN. It's fine for that, but its real strength is backing the world's data lakes and cloud data warehouses. Those workloads have a lot of data that's often cold, but S3 can deliver massive throughput when you need it. R2 can't do that, and as far as I can tell, isn't trying to.

Source: I used to work on S3


yes, this. In case you are interested in seeing some numbers backing this claim, see here https://outerbounds.com/blog/metaflow-fast-data

Source: I used to work at Netflix, building systems that pull TBs from S3 hourly


Yeah, I'd be interested in the bandwidth as well. Can R2 saturate 10/25/50 gigabit links? Can it do so with single requests, or if not, how many parallel requests does that require?



Cloudflare's paid DDoS protection product being able to soak up insane L3/4 DDoS attacks doesn't answer the question of whether R2, the specific Cloudflare product with free egress, is able to saturate a pipe.

Cloudflare has the network to do that, but they charge money to do so with their other offerings, so why would they give that to you for free? R2 is not a CDN.


[flagged]


> can't read CDN

> Can't read R2

k


That's unrelated to the performance of (for instance) the R2 storage layer. All the bandwidth in the world won't help you if you're blocked on storage. It isn't clear whether the overall performance of R2 is capable of saturating user bandwidth, or whether it'll be blocked on something.

S3 can't saturate user bandwidth unless you make many parallel requests. I'd be (pleasantly) surprised if R2 can.
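The standard way to get closer to wire speed from an object store is to split one object into byte ranges and fetch them concurrently, each range becoming an HTTP `Range: bytes=start-end` GET. A minimal sketch, with an illustrative 8 MiB part size and hypothetical function names:

```python
from concurrent.futures import ThreadPoolExecutor

PART_SIZE = 8 * 1024 * 1024  # 8 MiB per ranged GET (illustrative choice)

def plan_ranges(object_size, part_size=PART_SIZE):
    """Split an object into inclusive (start, end) byte ranges,
    one per parallel request."""
    return [
        (start, min(start + part_size, object_size) - 1)
        for start in range(0, object_size, part_size)
    ]

def fetch_parallel(get_range, object_size, part_size=PART_SIZE, workers=16):
    """get_range(start, end) -> bytes, e.g. an HTTP GET carrying a
    Range header. Fans the requests out across a thread pool and
    reassembles the parts in order."""
    ranges = plan_ranges(object_size, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: get_range(*r), ranges)
    return b"".join(parts)
```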


I'm confused, I assumed we were talking about the network layer.

If we are talking about storage, well, SATA can't give you more than ~5Gbps so I guess the answer is no? But also no one else can do it, unless they're using super exotic HDD tech (hint: they're not, it's actually the opposite).

What a weird thing to argue about, btw, literally everybody is running a network layer on top of storage that lets you have much higher throughput. When one talks about R2/S3 throughput no one (on my circle, ofc.) would think we are referring to the speed of their HDDs, lmao. But it's nice to see this, it's always amusing to stumble upon people with a wildly different point of view on things.


We're talking about the user-visible behavior. You argued that because Cloudflare's CDN has an obscene amount of bandwidth, R2 will be able to saturate user bandwidth; that doesn't follow, hence my counterpoint that it could be bottlenecked on storage rather than network. The question at hand is what performance R2 offers, and that hasn't been answered.

There are any number of ways they could implement R2 that would allow it to run at full wire speed, but S3 doesn't run at full wire speed by default (unless you make many parallel requests) and I'd be surprised if R2 does.


n = 1 aside.

I have some large files stored in R2 and a 50Gbps interface to the world.

curl to Linode's speed test is ~200MB/sec.

curl to R2 is also ~200MB/sec.

I'm only getting 1Gbps, but given that Linode's speed is pretty much the same, I would think the bottleneck is somewhere else. By the same token, R2 gives you at least 1Gbps.


No, most people aren’t interested in subcomponent performance, just in total performance. A trivial example is that even a 4-way striped U.2 NVMe disk array exported over Ethernet can deliver a lot more data than 5 Gbps and store many TiB.


Thanks for +1 what I just said. So, apparently, it's not just me and my peers who think like this.


That comment didn't +1 what you just said. It basically said that we care about the total, usable throughput. Whether some specific components are capable of more doesn't mean anything unless/until that greater throughput is usable by us.


that's completely unrelated. the way to soak up a ddos at scale is just "have lots of peering and a fucking massive amount of ingress".

neither of these tell you how fast you can serve static data.


>that's completely unrelated

Yeah, I'm sure they use a completely different network infrastructure to serve R2 requests.


I mean, it may be true in practice that most S3 workloads are throughput oriented and unconcerned with latency.

But if you look at https://aws.amazon.com/s3/ it says things like:

"Object storage built to retrieve any amount of data from anywhere"

"any amount of data for virtually any use case"

"S3 delivers the resiliency, flexibility, latency, and throughput, to ensure storage never limits performance"

So if S3 is not intended for low-latency applications, the marketing team haven't gotten the message :)


lol I think the only reason you're being downvoted is because the common belief at HN is, "of course marketing is lying and/or doesn't know what they're talking about."

Personally I think you have a point.


I didn’t downvote, but S3 does have low-latency offerings (S3 Express One Zone), which has reasonable latency compared to EFS iirc. I’d be shocked if it was as popular as the higher-latency S3 tiers though.


I agree - we generally do the opposite of trusting marketing, but sometimes marketing is coincidentally correct.

Cloudflare wants to "protect" the world from the evils of DNS services other than themselves even knowing what geographical region people are in, so they strip all geographical information, even general, broad location, from DNS lookups. This has the effect of increasing latency for non-Cloudflare CDNs sometimes, since data will sometimes end up being served out of the wrong region.

I've wondered since I first heard about this if this is their way to enshittify CDN deliverability in general and make their latency look better in comparison.


You should check out Row Zero (https://rowzero.io). We launched on HN earlier this year. Our CSV handling is the best on the market.

You can import multi GB csvs, we auto infer your format, and land your data in a full-featured spreadsheet that supports filter, sort, ctrl-F, sharing, graphs, the full Excel formula language, native Python, and export to Postgres, Snowflake, and Databricks.


or skip the spreadsheet and go relational with DuckDB. Pretty cool to run it directly against a set of CSVs and get performant results in a language most of us already know and use regularly.


> Our CSV handling is the best on the market.

It’s ironic that you cite the one thing that being bad at hasn’t held Excel back. ;)


I suppose. But as a software developer I've never created an Excel spreadsheet that wasn't first a CSV. I do most of my own work with local data files in jq for JSON or q for CSV, then go from a CSV to an Excel spreadsheet only when it's time to communicate that data with non-programmers.

Their niche is clearly supposed to be in helping developers and data scientists make that same leap, from the tools and formats native to their data pipelines to feature-rich spreadsheets as an export/reporting/analysis format for consumption by people who otherwise don't code. CSV support (especially for huge files) is unusually important there.


I've been working on a better spreadsheet for a while now. https://rowzero.io is a spreadsheet 1000x faster than Excel/Google Sheets. It looks and feels like those products but can open multi-GB data sets, supports Python natively, and can also connect directly to Snowflake/Databricks/Redshift.


Similar to Amazon's Retail fulfillment infrastructure, the AWS supply chain infrastructure is definitely not a commodity.


We built our spreadsheet (https://rowzero.io) from the ground up to integrate natively with Python. Bolting it on like Microsoft did, or as an add in like xlwings, just feels second class. To make it first class, we had to solve three hard problems:

1. Sandboxing and dependencies. Python is extremely unsafe to share, so you need to sandbox execution. There's also the environment/package management problem (does the user you're sharing your workbook with have the same version of pandas as you?). We run workbooks in the cloud to solve both of these.

2. The type system. You need a way to natively interop between Excel's type system and Python's much richer type system. The problem with Excel is there are only two types - numbers and strings. Even dates are just numbers in Excel. Python has rich types like pandas Dataframes, lists, and dictionaries, which Excel can't represent natively. We solved this in a similar way to how Typescript evolved Javascript. We support the Excel formula language and all of its types and also added support for lists, dictionaries, structs, and dataframes.

3. Performance. Our goal was to build a spreadsheet 1000x faster than Excel. Early on we used Python as our formula language but were constantly fighting the GIL and slow interpreter performance. Instead we implemented the spreadsheet engine in Rust as a columnar engine and seamlessly marshal Python types to the spreadsheet type system and back.
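As a purely hypothetical illustration of the type-system point above (not Row Zero's actual code), marshalling between Python's rich types and the classic two-type cell model might look like:

```python
from datetime import date, datetime

def to_cell(value):
    """Flatten a Python value into the two-type cell model (number or
    string), the way Excel itself stores things. Hypothetical sketch;
    a real engine would keep richer types alongside."""
    if isinstance(value, bool):
        return 1.0 if value else 0.0         # booleans become numbers
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, (date, datetime)):
        # Excel's date serial: days since the 1899-12-30 epoch
        epoch = date(1899, 12, 30)
        d = value.date() if isinstance(value, datetime) else value
        return float((d - epoch).days)
    return str(value)                        # everything else is text

def spill(value):
    """A list or dict can't live in a single two-type cell, so spill
    it into a rectangular block of cells instead."""
    if isinstance(value, dict):
        return [[to_cell(k), to_cell(v)] for k, v in value.items()]
    if isinstance(value, list):
        return [[to_cell(v)] for v in value]
    return [[to_cell(value)]]
```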

It's the hardest systems problem our team's ever worked on. Previously we wrote the S3 file system, so it's not like this was our first rodeo. There's just a ton of details you need to get right to make it feel seamless.

You can try it free here: https://rowzero.io/new?feature=code


As the author of said second class add-in, let me just guess that your most popular feature request was adding the "Import from xlsx" functionality...which describes the whole issue: it's always Excel + something, never something instead of Excel.


My apologies, that came off harsher than I intended. I've used xlwings in previous jobs to complete Excel automation tasks, so thank you for building it. xlwings is one of the projects that motivated me to start Row Zero. My main issue with it, and other Excel add-ins, is they break the promise of an .xlsx file as a self-contained virtual machine of code and data. I can no longer just send the .xlsx file - I need the recipient to install (e.g.) Python first. This makes collaboration a nightmare.

I wanted a spreadsheet interface, which my business partners need, but with a way for power users (me) to do more complicated stuff in Python instead of VBA.

To borrow your phrasing, our thesis is that it has to be Excel-compatible spreadsheet + something, not necessarily Excel + something. It's early days for us, but we've seen a couple publicly traded companies switch off Excel to Row Zero to eliminate the security risks that come with Excel's desktop model.


No offense taken, and happy that xlwings was an inspiration for creating Row Zero! I don't really buy the security issues as the reason for switching from Excel to Row Zero, though. Yes, Excel has security issues, but so does the cloud, and at least the issues with Excel can be dealt with: disable VBA macros at the company level, run Excel on airgapped computers, etc. Promising that your cloud won't be hacked or isn't unintentionally leaking information is impossible, no matter how much auditing and certification you go through.

The relatively recent addition of xlwings Server fixes pretty much all of the issues you encountered at your previous company: users don't need a local installation of Python; the Office admin just pushes an Office.js add-in to them and they're done. No sensitive credentials etc. need to be stored on the end user's computer or in the spreadsheet either, since you can take advantage of SSO and manage user roles in Microsoft Entra ID (which companies are using already anyway).


These are exactly the issues I would have guessed you would run into when using Python in a spreadsheet. Python has really been promoted above its level of competence. It's not suitable for these things at all.

I would say Typescript is a more obvious choice, or potentially Dart. Maybe even something more obscure like Nim (though I have no experience of that).

I get that you want compatibility with Pandas, Numpy, etc. but you're going to pay for that with endless pain.


Looks very cool. Will be keeping an eye on this for local network hosted and/or desktop application version. Thanks for sharing!


We have private hosting available (in your VPC) for enterprise customers.


looks cool!

do you have a desktop app in the works?


We have some development desktop builds working. Is it something you'd pay for?


I think calling out Durability is a bit of a straw man. Most services get their durability from S3 or some other managed database service. So they're really only making the "do it on a beefy machine argument" for the stateless portion of their service.

I agree with the other points for production services with the caveat that many workloads don't need all of those. Internal workloads or batch data processing use cases often don't need 4 9's of availability and can be done more simply and cheaply on a chonky EC2 instance.

The last point is part of our thesis for https://rowzero.io. You can vertically scale data analysis workloads way further than most people expect.


> Most services get their durability from S3 or some other managed database service.

I don't think this is as true as you think it is. Sure, many do, but I'd wager it's not most.

