It's the page-fault-plus-copy latency, together with some secondary effects from the page tables being updated (it seems to briefly stall all cores). The actual copying of a page of RAM is almost free compared to the time spent in all the kernel code for a page fault.
If your RAM is file-backed, you end up spending a lot of time in filesystem code too - I used anonymous mappings, which really helped there, and called clone() on the VM process to keep them shared.
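That setup looks roughly like this (a minimal sketch of the anonymous-mapping-plus-clone approach, not the actual code; the region size and the child body are made up):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define GUEST_RAM_SIZE (256UL << 20)   /* 256 MiB, purely illustrative */

    static void *guest_ram;

    /* Child entry point: with CLONE_VM it runs in the parent's address
     * space, so it sees the same anonymous mapping (and its faulted pages). */
    static int child_fn(void *arg) {
        (void)arg;
        memset(guest_ram, 0xAA, 4096);    /* no new fault: page is already in */
        return 0;
    }

    int main(void) {
        /* Anonymous, shared mapping for guest RAM - no backing file,
         * so the fault path stays out of the filesystem code entirely. */
        guest_ram = mmap(NULL, GUEST_RAM_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (guest_ram == MAP_FAILED) { perror("mmap"); return 1; }

        memset(guest_ram, 0x55, 4096);    /* first touch: this is the fault */

        /* CLONE_VM keeps the mapping shared with the new process. */
        static char stack[64 * 1024] __attribute__((aligned(16)));
        pid_t pid = clone(child_fn, stack + sizeof(stack),
                          CLONE_VM | SIGCHLD, NULL);
        if (pid < 0) { perror("clone"); return 1; }
        waitpid(pid, NULL, 0);
        munmap(guest_ram, GUEST_RAM_SIZE);
        return 0;
    }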
I suspect that if you use huge pages you might see much of the impact vanish, but obviously that has other downsides.
Right, that makes sense. Once a page has been faulted in it should be fast, though. We use a shared mapping right now, and in practice the pages stay resident for the lifetime of a VM once they've been loaded, but we need to do more testing under heavier memory pressure.
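For that testing, one quick way to watch residency is mincore() - a minimal sketch (the mapping here is just a stand-in for the real guest memory):

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Fraction of pages in [addr, addr+len) currently resident in RAM. */
    static double resident_fraction(void *addr, size_t len) {
        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (len + page - 1) / page;
        unsigned char *vec = malloc(npages);
        if (!vec || mincore(addr, len, vec) != 0) { free(vec); return -1.0; }
        size_t resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;
        free(vec);
        return (double)resident / npages;
    }

    int main(void) {
        size_t len = 64UL << 20;   /* stand-in for the real guest mapping */
        char *ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (ram == MAP_FAILED) { perror("mmap"); return 1; }
        printf("before touch: %.2f resident\n", resident_fraction(ram, len));
        for (size_t off = 0; off < len; off += 4096)
            ram[off] = 1;
        printf("after touch:  %.2f resident\n", resident_fraction(ram, len));
        return 0;
    }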
I've been looking at huge pages recently; I'm going to do some more testing with transparent huge pages today and see whether it changes performance. Unfortunately we can't use reserved huge pages, because they don't work with a shared mmap on, say, an XFS filesystem.
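The THP test I have in mind looks roughly like this (a rough sketch, assuming /sys/kernel/mm/transparent_hugepage/enabled is set to madvise; sizes are arbitrary, and it uses a private anonymous mapping - the file-backed MAP_SHARED mapping on XFS is exactly where this stops being an option):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define HPAGE_SIZE (2UL << 20)     /* 2 MiB huge page */
    #define RAM_SIZE   (512UL << 20)   /* arbitrary test size */

    int main(void) {
        /* Plain anonymous private mapping: the straightforward THP case. */
        void *ram = mmap(NULL, RAM_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (ram == MAP_FAILED) { perror("mmap"); return 1; }

        /* Ask the kernel to back this range with transparent huge pages. */
        if (madvise(ram, RAM_SIZE, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");

        /* Touch in 2 MiB strides so each huge page faults once; compare
         * fault counts and latency against the 4 KiB baseline. */
        for (size_t off = 0; off < RAM_SIZE; off += HPAGE_SIZE)
            ((char *)ram)[off] = 1;
        return 0;
    }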
Another idea is to make clones use the same memory base layer as their parent: the pages are then already prefaulted, and it would deduplicate overall memory usage. Many things still to discover.
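One possible shape for that idea (purely a sketch; the path and size are invented): each clone maps the parent's base image copy-on-write, so clean pages stay shared through the page cache and only written pages get duplicated.

    #define _DEFAULT_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BASE_SIZE (512UL << 20)   /* made-up size */

    int main(void) {
        /* Hypothetical path to the parent's base memory image. */
        int fd = open("/var/lib/vms/base-layer.img", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Copy-on-write: clean pages are served from the page cache the
         * parent already populated (prefaulted, deduplicated across clones);
         * only pages this clone writes become private copies. */
        void *base = mmap(NULL, BASE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);

        /* ... hand `base` to the clone as its initial guest RAM ... */
        return 0;
    }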