S3 doesn't have directories; it can be thought of as a flat, sorted list of keys.
UNIX (and every other operating system) differentiates between a file and a directory. To list the contents of a directory, you need to make an explicit call, and that call may return both files and directories.
So to list all files recursively, you need to list, sort, check whether each entry is a directory, and recurse. This isn't great.
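To make the contrast concrete, here's a rough sketch (Python; the bucket name is made up) of "walk the tree" versus "scan the flat keyspace":

    # POSIX-style: one readdir() per directory, recursing as you go.
    import os

    def walk_posix(root):
        for entry in os.scandir(root):
            if entry.is_dir(follow_symlinks=False):
                yield from walk_posix(entry.path)
            else:
                yield entry.path

    # S3-style: one paginated request over the flat, sorted keyspace.
    import boto3

    def walk_s3(bucket, prefix=""):
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                yield obj["Key"]

    # e.g. list(walk_s3("my-bucket", "photos/2024/"))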
Interesting - isn't this just a matter of indexing/caching the file names, though? Surely S3 must store the files somewhere and index them. There's a Unix command called `locate` that does something similar: it maintains a local database of file paths and lets you search by prefix.[1]
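For what it's worth, the idea is just that a sorted index turns a prefix query into a contiguous range scan. A toy sketch (the keys are made up; this isn't how S3 actually stores anything):

    # On a sorted list of keys, everything under a prefix is one contiguous slice.
    import bisect

    keys = sorted(["a/1.txt", "a/2.txt", "a/b/3.txt", "z/4.txt"])

    def prefix_scan(keys, prefix):
        i = bisect.bisect_left(keys, prefix)   # jump to the first candidate
        out = []
        while i < len(keys) and keys[i].startswith(prefix):
            out.append(keys[i])
            i += 1
        return out

    print(prefix_scan(keys, "a/"))  # ['a/1.txt', 'a/2.txt', 'a/b/3.txt']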
Anyway, I guess this is beside the point of the original commenter above. I would disagree that listing files efficiently is the most useful part of S3. The main value prop is that you can easily upload and download files from a distributed store. Most use cases involve uploading and downloading known files, not efficiently listing millions of files.
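i.e. the bread-and-butter calls look something like this (boto3; the bucket and key names are made up):

    import boto3

    s3 = boto3.client("s3")
    # Upload a known local file to a known key, then fetch it back.
    s3.upload_file("report.pdf", "my-bucket", "reports/2024/report.pdf")
    s3.download_file("my-bucket", "reports/2024/report.pdf", "/tmp/report.pdf")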
Actually, we've found it's often much worse than that. Code written against AWS S3 using the AWS SDK often doesn't work on a great many "S3-compatible" stores (including on-prem ones). Although there's documentation for S3, it's vague in many ways, and the AWS SDKs rely on actual AWS behaviour. We've had to deal with a lot of commercial and cloud vendors that subtly break things, including giant public cloud companies. In one case a giant vendor only failed at high load, so it appeared to "work" until it didn't, because its backoff response was not what the AWS SDK expected.

It's been a headache we've had to deal with for cunoFS, on top of making it work with GCP and Azure. At Supercomputing 2023, the big HPC conference, when we mentioned supporting "S3-compatible" systems, we were often told stories about applications not working with someone's supposedly "S3-compatible" store (from a mix of vendors).
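To give a flavour of where things drift, this is roughly the kind of client setup that ends up mattering when you point the AWS SDK at a third-party endpoint (the endpoint and values here are made up; it illustrates the knobs, not a fix for any particular vendor's backoff behaviour):

    import boto3
    from botocore.config import Config

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.example.internal",  # on-prem / third-party endpoint
        config=Config(
            retries={"max_attempts": 10, "mode": "adaptive"},  # SDK retry/backoff policy
            s3={"addressing_style": "path"},  # many non-AWS stores want path-style URLs
        ),
    )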
Back in 2011 when I was working on making Ceph's RadosGW more S3-compatible, it was pretty common that AWS S3 behavior differed from their documentation too. I wrote a test suite to run against AWS and Ceph, just to figure out the differences. That lives on at https://github.com/ceph/s3-tests
What I can dig up today is that back in 2011, they documented that bucket names could not look like IPv4 addresses and that the character set was limited to a-z, 0-9, '.', and '-', but they failed to prevent 192.168.5.123 or _foo.
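A rough check of just those documented rules (charset a-z, 0-9, '.', '-', and nothing IPv4-shaped) -- which, as noted, AWS itself didn't enforce at the time -- would look like:

    import re

    def bucket_name_ok(name):
        if not re.fullmatch(r"[a-z0-9.-]+", name):
            return False  # outside the documented character set
        if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", name):
            return False  # looks like an IPv4 address
        return True

    print(bucket_name_ok("my-bucket"))      # True
    print(bucket_name_ok("192.168.5.123"))  # False, but AWS accepted it
    print(bucket_name_ok("_foo"))           # False, but AWS accepted it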
I recall there were more edge cases around HTTP headers, but they don't seem to have been recorded as test cases -- it's been too long for me to remember the details; I may have simply run out of time, or real-world interop got good enough to prioritize something else.
Isn't that a limitation imposed by the POSIX APIs, though, as a direct consequence of the interface's representation of hierarchical filesystems as trees? As you've illustrated, that necessitates walking the tree. Many tools, I suppose, walk the tree via a single thread, further serializing the process. In an admittedly haphazard test, I ran `find(1)` on ext4, xfs, and zfs filesystems and saw only one thread.
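That serialization is mostly a tool choice, though -- nothing in the API stops you from fanning the walk out across threads. A rough sketch (Python, purely illustrative; error handling omitted):

    import os
    from concurrent.futures import ThreadPoolExecutor

    def walk_parallel(root, max_workers=16):
        all_files = []
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            def scan(path):
                # Each worker reads one directory and reports what it found.
                files, dirs = [], []
                for entry in os.scandir(path):
                    if entry.is_dir(follow_symlinks=False):
                        dirs.append(entry.path)
                    else:
                        files.append(entry.path)
                return files, dirs

            pending = {pool.submit(scan, root)}
            while pending:
                files, dirs = pending.pop().result()
                all_files.extend(files)
                pending.update(pool.submit(scan, d) for d in dirs)
        return all_files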
I imagine there's at least one POSIX-compatible file system out there that supports a more performant way of dumping its internal metadata via some system call or other. But then we would no longer be comparing the S3 and POSIX APIs.