S3 doesn't have directories; it can be thought of as a flat, sorted list of keys.
UNIX (and every other operating system) differentiates between a file and a directory. To list the contents of a directory, you need to make an explicit call, and that call may return both files and directories.
So to list all files recursively, you need to list, sort, check whether each entry is a directory, and recurse. This isn't great.
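To make the contrast concrete, here's a rough sketch (Python; the bucket name is made up) of "walk the tree" versus "scan the flat keyspace":

    # POSIX-style: one readdir() per directory, recursing as you go.
    import os

    def walk_posix(root):
        for entry in os.scandir(root):
            if entry.is_dir(follow_symlinks=False):
                yield from walk_posix(entry.path)
            else:
                yield entry.path

    # S3-style: one paginated request over the flat, sorted keyspace.
    import boto3

    def walk_s3(bucket, prefix=""):
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                yield obj["Key"]

    # e.g. list(walk_s3("my-bucket", "photos/2024/"))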
Interesting - isn't this just a matter of indexing/caching the file names, though? Surely S3 must store the files somewhere and index them. There's a Unix command called `locate` that does something similar: it maintains a local database of file paths and lets you search by prefix.[1]
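For what it's worth, the idea is just that a sorted index turns a prefix query into a contiguous range scan. A toy sketch (the keys are made up; this isn't how S3 actually stores anything):

    # On a sorted list of keys, everything under a prefix is one contiguous slice.
    import bisect

    keys = sorted(["a/1.txt", "a/2.txt", "a/b/3.txt", "z/4.txt"])

    def prefix_scan(keys, prefix):
        i = bisect.bisect_left(keys, prefix)   # jump to the first candidate
        out = []
        while i < len(keys) and keys[i].startswith(prefix):
            out.append(keys[i])
            i += 1
        return out

    print(prefix_scan(keys, "a/"))  # ['a/1.txt', 'a/2.txt', 'a/b/3.txt']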
Anyway, I guess this is beside the point of the original commenter above. I would disagree that listing files efficiently is the most useful part of S3. The main value prop is that you can easily upload and download files from a distributed store. Most use cases involve uploading and downloading known files, not efficiently listing millions of files.
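i.e. the bread-and-butter calls look something like this (boto3; the bucket and key names are made up):

    import boto3

    s3 = boto3.client("s3")
    # Upload a known local file to a known key, then fetch it back.
    s3.upload_file("report.pdf", "my-bucket", "reports/2024/report.pdf")
    s3.download_file("my-bucket", "reports/2024/report.pdf", "/tmp/report.pdf")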
Actually, we've found it's often much worse than that. Code written against AWS S3 using the AWS SDK often doesn't work on a great many "S3-compatible" stores (including on-prem ones). Although there's documentation for S3, it's vague in many ways, and the AWS SDKs rely on actual AWS behaviour. We've had to deal with a lot of commercial and cloud vendors that subtly break things, including giant public cloud companies. In one case a giant vendor only failed at high load, so it appeared to "work" until it didn't, because its backoff response was not what the AWS SDK expected.

It's been a headache we've had to deal with for cunoFS, on top of making it work with GCP and Azure. At Supercomputing 2023, the big HPC conference, when we mentioned supporting "S3-compatible" systems, we were often told stories about applications not working with someone's supposedly "S3-compatible" store (from a mix of vendors).
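To give a flavour of where things drift, this is roughly the kind of client setup that ends up mattering when you point the AWS SDK at a third-party endpoint (the endpoint and values here are made up; it illustrates the knobs, not a fix for any particular vendor's backoff behaviour):

    import boto3
    from botocore.config import Config

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.example.internal",  # on-prem / third-party endpoint
        config=Config(
            retries={"max_attempts": 10, "mode": "adaptive"},  # SDK retry/backoff policy
            s3={"addressing_style": "path"},  # many non-AWS stores want path-style URLs
        ),
    )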
Back in 2011 when I was working on making Ceph's RadosGW more S3-compatible, it was pretty common that AWS S3 behavior differed from their documentation too. I wrote a test suite to run against AWS and Ceph, just to figure out the differences. That lives on at https://github.com/ceph/s3-tests
What I can dig up today is that back in 2011, they documented that bucket names could not look like IPv4 addresses and that the character set was limited to a-z, 0-9, '.', and '-', but they failed to prevent 192.168.5.123 or _foo.
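A rough check of just those documented rules (charset a-z, 0-9, '.', '-', and nothing IPv4-shaped) -- which, as noted, AWS itself didn't enforce at the time -- would look like:

    import re

    def bucket_name_ok(name):
        if not re.fullmatch(r"[a-z0-9.-]+", name):
            return False  # outside the documented character set
        if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", name):
            return False  # looks like an IPv4 address
        return True

    print(bucket_name_ok("my-bucket"))      # True
    print(bucket_name_ok("192.168.5.123"))  # False, but AWS accepted it
    print(bucket_name_ok("_foo"))           # False, but AWS accepted it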
I recall there were more edge cases around HTTP headers, but they don't seem to have been recorded as test cases -- it's been too long for me to remember the details; I may have simply run out of time, or real-world interop got good enough to prioritize something else.
Isn't that a limitation imposed by the POSIX APIs, though, as a direct consequence of the interface's representation of hierarchical filesystems as trees? As you've illustrated, that necessitates walking the tree. Many tools, I suppose, walk the tree via a single thread, further serializing the process. In an admittedly haphazard test, I ran `find(1)` on ext4, xfs, and zfs filesystems and saw only one thread.
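That serialization is mostly a tool choice, though -- nothing in the API stops you from fanning the walk out across threads. A rough sketch (Python, purely illustrative; error handling omitted):

    import os
    from concurrent.futures import ThreadPoolExecutor

    def walk_parallel(root, max_workers=16):
        all_files = []
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            def scan(path):
                # Each worker reads one directory and reports what it found.
                files, dirs = [], []
                for entry in os.scandir(path):
                    if entry.is_dir(follow_symlinks=False):
                        dirs.append(entry.path)
                    else:
                        files.append(entry.path)
                return files, dirs

            pending = {pool.submit(scan, root)}
            while pending:
                files, dirs = pending.pop().result()
                all_files.extend(files)
                pending.update(pool.submit(scan, d) for d in dirs)
        return all_files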
I imagine there's at least one POSIX-compatible file system out there that supports a more performant way of dumping its internal metadata via some system call or other. But then we would no longer be comparing the S3 and POSIX APIs.