I may have missed something when skimming the paper, but it sounds like Xor filters are constructed offline and can't be modified efficiently afterwards, whereas Bloom filters support efficient insertion at any time. So they don't seem to be an exact replacement.
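To make that contrast concrete, here is a toy Bloom filter in Rust (my own illustrative sketch, not code from the paper; real implementations choose hashes and sizes more carefully):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Bloom filter: supports insertion at any time after creation.
struct Bloom {
    bits: Vec<bool>,
    k: u64, // number of hash functions
}

impl Bloom {
    fn new(m: usize, k: u64) -> Self {
        Bloom { bits: vec![false; m], k }
    }

    fn index<T: Hash>(&self, item: &T, i: u64) -> usize {
        let mut h = DefaultHasher::new();
        i.hash(&mut h); // salt to derive the i-th hash function
        item.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        for i in 0..self.k {
            let j = self.index(item, i);
            self.bits[j] = true; // flipping k bits is all an insert takes
        }
    }

    fn maybe_contains<T: Hash>(&self, item: &T) -> bool {
        (0..self.k).all(|i| self.bits[self.index(item, i)])
    }
}

fn main() {
    let mut f = Bloom::new(1 << 16, 3);
    f.insert(&"first");
    f.insert(&"added much later"); // no rebuild needed
    assert!(f.maybe_contains(&"first"));
    // An xor filter, by contrast, is constructed in one pass from the
    // complete key set; adding a key afterwards means rebuilding it.
}
```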
I wonder if the reason the escape analysis fails could be that, for small enough types, the concrete value is directly inlined inside the interface value, instead of the latter being "a smart pointer" as the author said. So when the compiler needs to take a reference to the concrete value in `vs.chunkStore`, that ends up as an internal pointer inside the `vs` allocation, requiring it to be on the heap.
Either that or the escape analysis just isn't smart enough; taking a pointer to an internal component of an interface value seems like a bit of a stretch.
Leaking resources in Rust is pretty trivial: create an `Rc` cycle with interior mutability. By contrast, in languages with a GC, cycles are properly collected.
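A minimal sketch of such a leak, in entirely safe Rust:

```rust
use std::cell::RefCell;
use std::rc::Rc;

// A node that can be re-pointed after construction, thanks to RefCell
// (interior mutability), which is what makes closing the cycle possible.
struct Node {
    next: RefCell<Option<Rc<Node>>>,
}

fn main() {
    let a = Rc::new(Node { next: RefCell::new(None) });
    let b = Rc::new(Node { next: RefCell::new(Some(Rc::clone(&a))) });
    // Close the cycle: a -> b -> a.
    *a.next.borrow_mut() = Some(Rc::clone(&b));

    assert_eq!(Rc::strong_count(&a), 2);
    assert_eq!(Rc::strong_count(&b), 2);
    // When `a` and `b` go out of scope, each node is still kept alive by
    // the other's reference, so neither is freed: the memory leaks.
    // A tracing GC would collect this cycle; reference counting can't.
}
```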
And there is nothing preventing GC languages from tying their non-memory resources to the lifetime of a value, like Rust does. See Python files, which are closed when the file handle is collected (i.e. when all references to it go out of scope), and if that's not explicit enough, there's the `with` statement, which introduces a scope at the end of which the file is closed.
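For reference, the Rust side of that comparison is a `Drop` impl, which ties the resource to the value's scope (the `Connection` type below is just a stand-in for any non-memory resource):

```rust
struct Connection {
    id: u32,
}

impl Drop for Connection {
    fn drop(&mut self) {
        // release the non-memory resource, deterministically
        println!("closing connection {}", self.id);
    }
}

fn main() {
    {
        let _conn = Connection { id: 1 };
        // ... use the connection ...
    } // Drop runs here, at the end of the scope, much like leaving a
      // Python `with` block
    println!("connection is already closed");
}
```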
The author claims that mmap is exposed by glibc whereas memfd_create is not, so they reimplemented it with `syscall`. But looking at the man page, did they just forget to `#define _GNU_SOURCE`? glibc has shipped a memfd_create wrapper since 2.27.
in what way is unicode similar to html, docx, or any other file format? the only features I can think of that are even remotely similar to what you're describing are emoji modifiers.
and no, this webpage is not the result of "carefully cutting out the complicated stuff from Unicode". i'm pretty sure it's just the result of not supporting Unicode in any meaningful way.
indeed, Rust can be fast if you know what you're doing. that means not making unnecessary allocations, but in my opinion, it also means... using byte-based indices instead of insisting on char-based ones, i.e. Version 0, which OP so quickly dismissed.
using char-based indices means you have to convert back to byte-based indices, a linear-time operation, every time you want to do much of anything with them. this is a silly performance loss, since you probably got those char-based indices by iterating over the string in the first place: you're doing redundant work, which could be avoided by using char_indices() in your initial iteration and keeping those byte-based indices for later manipulation. this is why that iterator exists, really.
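for example (an illustrative string of my own, not OP's code):

```rust
fn main() {
    let s = "naïve café";

    // one pass over the string, keeping each char's byte offset as we go;
    // that offset slices in O(1) later, with no re-scan to convert back
    if let Some((start, _)) = s.char_indices().find(|&(_, c)| c == 'c') {
        assert_eq!(&s[start..], "café");
    }
}
```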
you might ask: "but then if I do +1 to get the index of the next character, it might fall in the middle of a multibyte character and the substringing will panic!" yes, you will need to use char::len_utf8 or char_indices to offset your indices (forwards or backwards: CharIndices is a DoubleEndedIterator!). but this is less work than adding one... and then recounting characters from the beginning.
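a quick sketch of both directions (again my own example):

```rust
fn main() {
    let s = "éxé"; // é is 2 bytes in UTF-8

    // "next character": advance the byte index by the current char's
    // UTF-8 length instead of by 1
    let mut i = 0;
    let first = s[i..].chars().next().unwrap();
    i += first.len_utf8(); // i is now 2, a valid char boundary
    assert_eq!(&s[i..], "xé");

    // and backwards, using CharIndices as a DoubleEndedIterator:
    let (last_start, last) = s.char_indices().next_back().unwrap();
    assert_eq!(last, 'é');
    assert_eq!(&s[..last_start], "éx");
}
```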
and importantly, +1 isn't even really appropriate to do with char-based indices either. there are many "characters" in the user-perceived sense of the word that are made up of multiple chars, and while cutting in the middle of one won't panic, it also won't give the user-friendly cutting you're expecting. just try your code with a flag emoji and see what happens: you'll split it into two weird "residual" characters.
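concretely:

```rust
fn main() {
    let flag = "🇫🇷"; // one user-perceived character, two chars
    assert_eq!(flag.chars().count(), 2); // two regional indicator symbols

    // splitting between them is a valid char boundary, so no panic...
    let first_len = flag.chars().next().unwrap().len_utf8(); // 4 bytes
    let (left, right) = flag.split_at(first_len);
    // ...but each half renders as a lone "residual" indicator,
    // 🇫 and 🇷, not anything a user would recognize as half a flag
    println!("{left} | {right}");
}
```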
if you care about that in your specific application, the solution is to iterate on an even bigger unit than chars: (extended) grapheme clusters, or EGCs. and because EGC segmentation is quite a bit more demanding than simple char-based iteration, using EGC-based numerical indices would be an even bigger waste of CPU time (the EGC is what Swift uses as its default character unit, though even Swift keeps its string indices opaque rather than numerical). in my opinion, you need to fully let go of the assumption that characters can be given consecutive numerical indices in a performant way, and once again, use byte-based indices along with the appropriate EGC-aware methods for acquiring and offsetting them.
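a sketch using the unicode-segmentation crate (one common choice for this; std itself has no EGC support):

```rust
// [dependencies] unicode-segmentation = "1"
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "a🇫🇷b";

    // grapheme_indices yields (byte_offset, grapheme) pairs: the same
    // pattern as char_indices, one level up; the flag is a single EGC here
    let egcs: Vec<(usize, &str)> = s.grapheme_indices(true).collect();
    assert_eq!(egcs, [(0, "a"), (1, "🇫🇷"), (9, "b")]);

    // keep the byte offsets for O(1) slicing later, as with chars
    let (start, flag) = egcs[1];
    assert_eq!(&s[start..start + flag.len()], "🇫🇷");
}
```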