
Sort of. mmap absolutely copies the file contents into the kernel's file system cache, which is a buffer; it just lets you map that cache into your address space so you can see it. And memory reads only turn into file reads when the data isn't already in the cache.


> mmap absolutely copies the file contents into the kernel file system cache which is a buffer

Isn't this a bit misleading? mmaping a file doesn't cause the kernel to start loading the whole thing into RAM, it just sets things up for the kernel to later transparently load pages of it on demand, possibly with some prefetching.
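
To make that concrete, here's a minimal sketch (my code, not the article's) of what "on demand" means in practice: the mmap call itself is cheap, and disk reads only happen as pages are first touched.

    /* Minimal sketch, not the article's code: mmap returns immediately,
     * and data is only read from disk as pages are first touched. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Cheap: nothing is read from disk yet; pages are just set up
         * to fault in from the page cache on first access. */
        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        /* Touching one byte per page triggers the actual reads. */
        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i += 4096)
            sum += (unsigned char)data[i];
        printf("%lu\n", sum);

        munmap(data, st.st_size);
        close(fd);
        return 0;
    }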


The GP's comment on the other hand seemed to imply it was completely unbuffered.


"Completely unbuffered" is almost unattainable in practice, so I'm not sure that's a reasonable inference. About the best you can do in general is not do any buffering yourself, and usually explicitly bypass whatever buffering is going on at the next level of abstraction down. Ensuring you've cut out all buffering in the entire IO stack takes real effort.


> "Completely unbuffered" is almost unattainable in practice, so I'm not sure that's a reasonable inference.

While absolutely true, I've found that fact to be very surprising to a lot of engineers.


But I think the important part is that the file starts on disk and ends up parsed. The rate of that was NVMe-limited (per the article).


> The rate of that was NVMe-limited (per the article).

The article shows that he's getting half the throughput of parsing a CSV that's already in RAM. But: he's using RAID0 of two SSDs and only getting a little more than half the throughput of one of those SSDs. As currently written, this program might not be giving the SSDs a high enough queue depth to hit their full read throughput. I'd like to see what throughput is like with an explicit attempt to prefetch data into RAM (either with a thread manually touching all the necessary pages, or maybe with a madvise call). That could drastically reduce the number of page faults and context switches affecting the OpenMP worker threads, and yield much better CPU utilization.
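
Something along these lines, perhaps (a sketch, not the article's code; base and len would be the existing mmap of the CSV):

    /* Sketch: ask the kernel to read ahead of the parser. Both calls are
     * hints; how much readahead actually happens is up to the kernel. */
    #include <stddef.h>
    #include <sys/mman.h>

    static void prefetch_hint(void *base, size_t len)
    {
        madvise(base, len, MADV_SEQUENTIAL);  /* aggressive readahead */
        madvise(base, len, MADV_WILLNEED);    /* start paging in soon */
    }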


I thought queue depth was about how many outstanding/pending reads the drive supports. For serial access such as CSV parsing, what would you do other than some kind of readahead (somehow; see my other question), which would presumably keep the queue depth at about 1?

Put another way, when reading the CSV in serially, what would you do to increase speed that would push the queue depth above 1?


For sequential accesses, it usually doesn't make a whole lot of difference whether the drive's queue is full of lots of medium-sized requests (e.g. 128 kB) or a few giant requests (multiple MB), so long as the queue always has outstanding requests for enough data to keep the drive(s) busy. Every operating system will have its own preferred IO sizes for prefetching, and if you're lucky you can also tune the size of the prefetch window (either in terms of bytes, or in terms of number of IOs). Different drives will also have different requirements here to achieve maximum throughput; an enterprise drive that stripes data across 16 channels will probably need a bigger/deeper queue than a consumer drive with just 4 channels, if the NAND page size is the same for both.

However, optimal utilization of the drive(s) will always require a queue depth of more than one request, because you don't want the drive sitting idle after it signals completion of its only queued command, waiting for the CPU to produce a new read request. In a RAID0 setup like the author describes, you also need to ensure that you're generating enough IO to keep both drives busy, and the minimum prefetch window size that can accomplish this will usually be at least one full stripe.

As for how you accomplish the prefetching: the madvise system call sounds like a good choice, with the MADV_SEQUENTIAL or MADV_WILLNEED options. But how much prefetching that actually causes is up to the OS and the local system's settings. On my system, /sys/block/$DISK/queue/read_ahead_kb defaults to 128, which is definitely insufficient for at least some drives but might only apply to read-ahead triggered by the filesystem's heuristics rather than more explicitly requested by a madvise. So manually touching pages from a userspace thread is probably the safer way to guarantee the OS pages in data ahead of time—as long as it doesn't run so far ahead of the actual use of the data that it creates memory pressure that might get unused pages evicted.
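
A rough sketch of that toucher-thread idea (my code, not the article's; base, len and the shared consumed offset are assumptions about how the parser would report its progress):

    /* Sketch of a userspace prefetch thread that stays a bounded distance
     * ahead of the parser, so prefetched pages aren't evicted before use. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stddef.h>
    #include <unistd.h>

    #define PAGE   4096UL
    #define WINDOW (64UL * 1024 * 1024)   /* stay at most 64 MiB ahead */

    struct touch_args {
        const volatile char *base;   /* the mmap'd file */
        size_t len;
        _Atomic size_t *consumed;    /* bytes already parsed by the workers */
    };

    static void *toucher(void *p)    /* pass to pthread_create() */
    {
        struct touch_args *a = p;
        for (size_t pos = 0; pos < a->len; pos += PAGE) {
            /* Throttle: don't run more than WINDOW bytes ahead. */
            while (pos > atomic_load(a->consumed) + WINDOW)
                usleep(100);
            /* Reading one byte per page forces the kernel to page it in;
             * volatile keeps the compiler from dropping the read. */
            (void)a->base[pos];
        }
        return NULL;
    }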


WILLNEED really just reads the entire file asynchronously into the page cache, at least on Linux.


Is that still true if the size_t parameter to the madvise call is less than the entire file size? I would think that madvise hints could be issued at page granularity and not affect the entire mapping as originally allocated.


With no limit? What if the file is huge, will it evict other things in the cache?


Yes, probably. In hindsight, it was probably a mistake to use mmap. I could probably do better just reading the file myself, since I have to make a mirror buffer later for some data manipulation anyway.
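
For what it's worth, the plain-read version can be as simple as this sketch (my code, not the article's; the 8 MiB chunk size is a guess at what keeps the drives busy):

    /* Sketch: read the whole file into a growable buffer with large
     * sequential reads, instead of mmap. Returns the byte count or -1. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (8UL * 1024 * 1024)

    static ssize_t read_all(const char *path, char **out)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;

        size_t cap = CHUNK, used = 0;
        char *buf = malloc(cap);
        if (!buf) { close(fd); return -1; }

        for (;;) {
            if (used + CHUNK > cap) {
                char *tmp = realloc(buf, cap *= 2);
                if (!tmp) { free(buf); close(fd); return -1; }
                buf = tmp;
            }
            ssize_t n = read(fd, buf + used, CHUNK);
            if (n < 0)  { free(buf); close(fd); return -1; }
            if (n == 0) break;                 /* EOF */
            used += (size_t)n;
        }
        close(fd);
        *out = buf;
        return (ssize_t)used;
    }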


That makes sense, thanks!


Well, it copies into the kernel buffer as you access it, as a sort of demand paging, which isn't actually all that bad depending on what you're doing. It's dramatically different from the typical "read everything into a buffer" approach that most programs take.



