Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> I don't know if there's a spec-problem that favors a slower-hash table

Yes. std::unordered_map requires that pointers to stored values will always be valid (except if the value is erased). This makes open adressing effectively impossible. If you need max. performance, std::unordered_map is indeed not a good choice.

> The ret[] operator searches for the string. Since it doesn't exist, it now has to malloc the key and insert the value in there.

The optimal method would be https://en.cppreference.com/w/cpp/container/unordered_map/em...

> So eight mallocs and 6 frees? > Yeah, C++ strings shoot yourself in the foot performance-wise.

Most implementations of std::string use short string optimization. Typically, you can save up to 12 chars (32-bit) or 24 chars (64-bit) inside the actual string container without any memory allocation.

Looking at the code example, I would say that there won't be any mallocs at all for std::string.

I think the performance difference just comes from the suboptimal design of std::unordered_map, where each node has to be allocated on the heap. In open addressing hash tables, as used by PHP (and I guess also by Rust), the nodes live in a single large array. The only memory allocation you see is when the array has to be resized due to a rehash - which happens very rarely!



Thanks for the SSO pointer. I'll keep that in mind in the future.

> Yes. std::unordered_map requires that pointers to stored values will always be valid (except if the value is erased).

Hmm, that requirement also prevents Cuckoo-hashing.

The general pattern here (and this is problematic in a lot of languages), is the idea of having a "shortcut" into an object in a data-structure. For example, C++ and Java's "iterator-invalidation" rules are absolutely arcane, even if we removed the pointer-requirements.

Many data-structures want to move things around (cuckoo-hashing, open-addressing, robin-hood hashing... look 3 different, high-performance strategies and we're still just talking about hash tables!!). But those moves "invalidate" the shortcuts (be it a pointer or an iterator).

Some data-structures can have those pointers just fine (Red-and-black trees), others cannot. Representing all these possibilities with a fixed, easily understood interface that generalizes among all of these data-structures / decisions seems impossible.

------

Well, I guess that's why I keep rolling my own data-structures when I need performance guarantees... standard-library just has to make decisions that may or may not match my cases.


> Well, I guess that's why I keep rolling my own data-structures when I need performance guarantees...

Yes, this is what many companies do. std containers are good enough for the general case, but are certainly situations where you need something more fancy. However, before you roll your own, first check if there's a good library you can use. The world's best hashtable has already been written by other people :-)


For std::unordered_map, I don't know why they decided to add the no-iterator-invalidation requirement. If you really need that property, you could get most of the benefits by using unordered_map<unique_ptr<T>> instead of unordered_map<T>. So maybe not the best example of "don't pay for what you don't use".


> the no-iterator-invalidation requirement

Iterators can be invalidated (e.g. after a rehash). The requirement is rather that the address of the value itself must not change (so that pointers will remain valid).

> If you really need that property, you could get most of the benefits by using unordered_map<unique_ptr<T>> instead of unordered_map<T>.

I agree. I don't know why they have decided for the current design. They even expose the bucket interface (https://en.cppreference.com/w/cpp/container/unordered_map/bu...) which should really be an implementation detail...


I mean, its a hashmap we're talking about. I'm not sure if iterators even make sense in any regard. Things inside of a hashmap have no order at all.

And determining if a cell is full or empty in a hash-map isn't exactly efficient either (You pretty much have to visit those locations and see if a tombstone or empty symbol is there).

-------

I recognize that people want to foreach() over a HashMap, but I'm just not convinced its an efficient operation, nor a good fit for the iterator interface (especially with all this discussion of invalidations and movement...)


> I'm not sure if iterators even make sense in any regard. Things inside of a hashmap have no order at all.

Iteration doesn't necessarily imply a specified order.

> but I'm just not convinced its an efficient operation

Why does iteration have to be efficient?

> nor a good fit for the iterator interface

What would you use instead?

Iterator invalidation only comes into play when you erase elements. If you just iterate over a hashtable and, say, print the elements, you wouldn't have to worry about it at all.


In both Robin Hood and Cuckoo hashing, insertions will move elements around. That means the iterator guarantees (or: that you'd visit every element) could be broken).

So a foreach(cuckoo_table) operation may not in fact visit all elements if the table has an insert somewhere in the loop.


Well, inserting elements while iterating is never a good idea :-) This will cause problems even with std::vector.

BTW, inserting elements can invalidate iterators with any hashtable if it might cause a rehash.


That's what I'm talking about though.

    foreach(datastructure){
        datastructure.what-functions-are-legal() ??
    }
Vector: Inserts are allowed as long as element fits in vector.capacity(). Erase (which "shifts" the elements down) are allowed at the end of the vector (specifically: all locations "after" the erase are invalidated. So pop_back() is always legal)

List: All operations allowed, except for reading from an erased element.

map: All operations allowed, except for reading from an erased element.

unordered_map: ??? Under debate right now.

-------

So even if you had assurances that unordered_map.insert would work (ex: Robinhood hash, you check size() and compare to capacity(), much like a vector), the movement of data means that your iterator is basically invalidated on any insert.

So unordered_map(Robin-hood): Nothing. No erases, no insertions allowed during the foreach loop at all.

-------

This comes up in practice. The minute you have 2 threads, and a 2nd thread _might_ insert while the 1st thread _might_ be in a foreach loop, you suddenly have to have assurances for what is or isn't legal from an iterator-invalidation perspective.

I guess you could just have the entire for-each loop inside of a lock, but that seems a bit inefficient?


> So unordered_map(Robin-hood): Nothing. No erases, no insertions allowed during the foreach loop at all.

Yes, but this applies to any hashtable that might rehash on inseration (= most). In practice this is not a big problem. For example, you can just put the new elements / to be deleted keys on a std::vector and only insert/erase after the for-loop.

I don't see how this makes iteration itself problematic. You can still iterate over a std::unordered_map perfectly fine. Just don't erase/insert elements while iterating.

> I guess you could just have the entire for-each loop inside of a lock, but that seems a bit inefficient?

Yes, you would have to lock the whole for-loop. It's the same with any other data structures where the iterators can get invalidated on inseration/deletion, e.g. std::vector.

Even for containers where iterators are not invalidated, you have to be careful with fain-grained locks. For example, you can only increment the iterator while holding the lock!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: