Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash Map Insert Stuck in an infinite loop #52

Closed
szhang0119 opened this issue May 5, 2022 · 53 comments
Closed

Hash Map Insert Stuck in an infinite loop #52

szhang0119 opened this issue May 5, 2022 · 53 comments

Comments

@szhang0119
Copy link

Hi, there,

We used the robin-map/v0.6.1, and we observed the robin map stuck in an infinite loop here in insert_value_impl method.

We created tsl::robin_map<uint64_t, DataBlock*>, and we inserted monotonically increasing keys like: 0xc000000000001, 0xc000000000002, etc. We do not have concurrent modifications on the robin map.

We observed this issue both on Intel and ARM machines, but ARM machines showed up more often. However, we do not have a good reproducer, and we only observed this issue in our production fleet.

Could any one help us debug a bit to see if anything pops out? Any suggestions would be appreciated!

Thanks very much!


Here are some debug information we gathered.

The robin map has 8 buckets, but the number of elements are only 5 (although all 8 keys look valid to us). However, we observed a really large m_load_threshold = 18446744069414584320, and all 8 buckets show with m_dist_from_ideal_bucket = 32767.

We suspected somehow m_load_threshold overflowed, causing the robin map would not expand the size, and hence there are no empty buckets in the map anymore.

(gdb) p *this 
$35 = {<std::hash<unsigned long>> = {<std::__hash_base<unsigned long, unsigned long>> = {<No data fields>}, <No data fields>}, <std::equal_to<unsigned long>> = {<std::binary_function<unsigned long, unsigned long, bool>> = {<No data fields>}, <No data fields>}, <tsl::rh::power_of_two_growth_policy<2>> = {m_mask = 7}, 
  static STORE_HASH = false, static USE_STORED_HASH_ON_LOOKUP = false, static DEFAULT_INIT_BUCKETS_SIZE = 0, static DEFAULT_MAX_LOAD_FACTOR = 0.5, 
  static MINIMUM_MAX_LOAD_FACTOR = 0.200000003, static MAXIMUM_MAX_LOAD_FACTOR = 0.949999988, static DEFAULT_MIN_LOAD_FACTOR = 0, static MINIMUM_MIN_LOAD_FACTOR = 0, 
  static MAXIMUM_MIN_LOAD_FACTOR = 0.150000006, static REHASH_ON_HIGH_NB_PROBES__NPROBES = 128, static REHASH_ON_HIGH_NB_PROBES__MIN_LOAD_FACTOR = 0.150000006, 
  m_buckets_data = {<std::_Vector_base<tsl::detail_robin_hash::bucket_entry<std::pair<unsigned long, MapGraph::Blocks::DataBlock*>, false>, std::allocator<tsl::detail_robin_hash::bucket_entry<std::pair<unsigned long, MapGraph::Blocks::DataBlock*>, false> > >> = {
      _M_impl = {<std::allocator<tsl::detail_robin_hash::bucket_entry<std::pair<unsigned long, MapGraph::Blocks::DataBlock*>, false> >> = {<__gnu_cxx::new_allocator<tsl::detail_robin_hash::bucket_entry<std::pair<unsigned long, MapGraph::Blocks::DataBlock*>, false> >> = {<No data fields>}, <No data fields>}, _M_start = 0x40006630f040, 
        _M_finish = 0x40006630f100, _M_end_of_storage = 0x40006630f100}}, <No data fields>}, m_buckets = 0x40006630f040, m_bucket_count = 8, m_nb_elements = 5, 
  m_load_threshold = 18446744069414584320, m_max_load_factor = 0.5, m_grow_on_next_insert = true, m_min_load_factor = 0, m_try_skrink_on_next_insert = false}

(gdb) p m_buckets[0]
$36 = {<tsl::detail_robin_hash::bucket_entry_hash<false>> = {<No data fields>}, static EMPTY_MARKER_DIST_FROM_IDEAL_BUCKET = -1, m_dist_from_ideal_bucket = 32767, 
  m_last_bucket = false, m_value = {__data = "\t\000\000\000\000\000\f\000\000\000\000\000\000\000\000", __align = {<No data fields>}}}
(gdb) p m_buckets[1]
$37 = {<tsl::detail_robin_hash::bucket_entry_hash<false>> = {<No data fields>}, static EMPTY_MARKER_DIST_FROM_IDEAL_BUCKET = -1, m_dist_from_ideal_bucket = 32767, 
  m_last_bucket = false, m_value = {__data = "\002\000\000\000\000\000\f\000\340R\304#\v@\000", __align = {<No data fields>}}}
(gdb) p m_buckets[2]
$38 = {<tsl::detail_robin_hash::bucket_entry_hash<false>> = {<No data fields>}, static EMPTY_MARKER_DIST_FROM_IDEAL_BUCKET = -1, m_dist_from_ideal_bucket = 32767, 
  m_last_bucket = false, m_value = {__data = "\003\000\000\000\000\000\f\000\320P\304#\v@\000", __align = {<No data fields>}}}
(gdb) p m_buckets[3]
$39 = {<tsl::detail_robin_hash::bucket_entry_hash<false>> = {<No data fields>}, static EMPTY_MARKER_DIST_FROM_IDEAL_BUCKET = -1, m_dist_from_ideal_bucket = 32767, 
  m_last_bucket = false, m_value = {__data = "\004\000\000\000\000\000\f\000А?\034\v@\000", __align = {<No data fields>}}}
(gdb) p m_buckets[4]
$40 = {<tsl::detail_robin_hash::bucket_entry_hash<false>> = {<No data fields>}, static EMPTY_MARKER_DIST_FROM_IDEAL_BUCKET = -1, m_dist_from_ideal_bucket = 32767, 
  m_last_bucket = false, m_value = {__data = "\005\000\000\000\000\000\f\000\000\221?\034\v@\000", __align = {<No data fields>}}}
(gdb) p m_buckets[5]
$41 = {<tsl::detail_robin_hash::bucket_entry_hash<false>> = {<No data fields>}, static EMPTY_MARKER_DIST_FROM_IDEAL_BUCKET = -1, m_dist_from_ideal_bucket = 32767, 
  m_last_bucket = false, m_value = {__data = "\006\000\000\000\000\000\f\000\300$Fe\000@\000", __align = {<No data fields>}}}
(gdb) p m_buckets[6]
$42 = {<tsl::detail_robin_hash::bucket_entry_hash<false>> = {<No data fields>}, static EMPTY_MARKER_DIST_FROM_IDEAL_BUCKET = -1, m_dist_from_ideal_bucket = 32767, 
  m_last_bucket = false, m_value = {__data = "\a\000\000\000\000\000\f\000\240&Fe\000@\000", __align = {<No data fields>}}}
(gdb) p m_buckets[7]
$43 = {<tsl::detail_robin_hash::bucket_entry_hash<false>> = {<No data fields>}, static EMPTY_MARKER_DIST_FROM_IDEAL_BUCKET = -1, m_dist_from_ideal_bucket = 32767, 
  m_last_bucket = true, m_value = {__data = "\b\000\000\000\000\000\f\000\340\033Re\000@\000", __align = {<No data fields>}}}

Here are the list of keys in the map:

(gdb) p/x *reinterpret_cast<uint64_t*>(m_buckets[0].m_value.__data)
$14 = 0xc000000000009
(gdb) p/x *reinterpret_cast<uint64_t*>(m_buckets[1].m_value.__data)
$15 = 0xc000000000002
(gdb) p/x *reinterpret_cast<uint64_t*>(m_buckets[2].m_value.__data)
$16 = 0xc000000000003
(gdb) p/x *reinterpret_cast<uint64_t*>(m_buckets[3].m_value.__data)
$17 = 0xc000000000004
(gdb) p/x *reinterpret_cast<uint64_t*>(m_buckets[4].m_value.__data)
$18 = 0xc000000000005
(gdb) p/x *reinterpret_cast<uint64_t*>(m_buckets[5].m_value.__data)
$19 = 0xc000000000006
(gdb) p/x *reinterpret_cast<uint64_t*>(m_buckets[6].m_value.__data)
$20 = 0xc000000000007
(gdb) p/x *reinterpret_cast<uint64_t*>(m_buckets[7].m_value.__data)
$21 = 0xc000000000008
@Tessil
Copy link
Owner

Tessil commented May 5, 2022

Thank you for the report.

From a first glance I have troubles to understand how m_load_threshold ends up with such a large value. The only place where the variable is set (outside of copy/move constructor/operator and swap) is in max_load_factor and the value there is bounded by m_bucket_count and the clamped between [0.1;0.95] m_max_load_factor. As the m_bucket_count has a normal value of 8 and the m_max_load_factor is 0.5, m_load_threshold should be equal to 4. How does it end up with such value? The copy/move constructor/operator and swap seem correct. Could there be a memory bug somewhere in the implementation that somehow overrides the memory where m_load_threshold is?

I'll try to investigate more when I have some time.

In the meantime could you eventually try the latest version and define the TSL_DEBUG variable with assert enabled (so NDEBUG not defined)? It'll enable the assertions in the project and we might catch something with them.

@Tessil
Copy link
Owner

Tessil commented May 17, 2022

I really can't find how m_load_threshold could overflow like that considering how the variable is assigned. Could some parts of the program that uses the library write to some wrong memory locations? Could you try to compile it with the AddressSanitizer or run it with valgrind?

I'll try to investigate more as it's a worrying bug. Please let me know if you can eventually reproduce the problem in a self-contained example. Thanks.

@markpapadakis
Copy link

I have experienced the issue multiple times in the past which effectively prevents us from using this map implementation. I thought I filed an issue about this but looks like I haven't. It would be great if this is resolved.

@Tessil
Copy link
Owner

Tessil commented May 18, 2022

Thank you, I'll try to get a more in depth look then as it's really worrying if you also had the problem.

I have tried to insert the keys mentioned above in every way possible + tried with some extra ones and I can't reproduce the problem. If any of you eventually has a reproducible self-contained example it'd be immensely useful.

@markpapadakis
Copy link

I can reproduce it, but it is practically impossible to share the code (its for out embeddable columnar store thing, that's not OSS). Certain SQL queries would cause this -- tried every sanitizer, nothing came up really (so almost certainly not a corruption). Had this issue with other programs (our application server, some other utilities etc) -- but I unfortunately can't share that codebase either. It is almost certainly not a matter of some memory corruption/buffer overrun issue though.

@Tessil
Copy link
Owner

Tessil commented May 18, 2022

Thank you very much for the useful infos. I'll look more into depth then as I currently can't understand how m_load_threshold could end up with a value of 18446744069414584320.

@szhang0119
Copy link
Author

szhang0119 commented May 18, 2022

In our tests, we also saw a different stuck stack.

In this case, m_bucket_count=0, and it stuck in a different while loop here.

Here is some gdb information:

(gdb) p *this
$2 = {<std::hash<unsigned long>> = {<std::__hash_base<unsigned long, unsigned long>> = {<No data fields>}, <No data fields>}, <std::equal_to<unsigned long>> = {<std::binary_function<unsigned long, unsigned long, bool>> = {<No data fields>}, <No data fields>}, <tsl::rh::power_of_two_growth_policy<2>> = {m_mask = 0}, static STORE_HASH = false, static USE_STORED_HASH_ON_LOOKUP = false, static DEFAULT_INIT_BUCKETS_SIZE = 0, 
  static DEFAULT_MAX_LOAD_FACTOR = 0.5, static MINIMUM_MAX_LOAD_FACTOR = 0.200000003, static MAXIMUM_MAX_LOAD_FACTOR = 0.949999988, 
  static DEFAULT_MIN_LOAD_FACTOR = 0, static MINIMUM_MIN_LOAD_FACTOR = 0, static MAXIMUM_MIN_LOAD_FACTOR = 0.150000006, 
  static REHASH_ON_HIGH_NB_PROBES__NPROBES = 128, static REHASH_ON_HIGH_NB_PROBES__MIN_LOAD_FACTOR = 0.150000006, 
  m_buckets_data = {<std::_Vector_base<tsl::detail_robin_hash::bucket_entry<std::pair<unsigned long, MapGraph::Blocks::DataBlock*>, false>, std::allocator<tsl::detail_robin_hash::bucket_entry<std::pair<unsigned long, MapGraph::Blocks::DataBlock*>, false> > >> = {
      _M_impl = {<std::allocator<tsl::detail_robin_hash::bucket_entry<std::pair<unsigned long, MapGraph::Blocks::DataBlock*>, false> >> = {<__gnu_cxx::new_allocator<tsl::detail_robin_hash::bucket_entry<std::pair<unsigned long, MapGraph::Blocks::DataBlock*>, false> >> = {<No data fields>}, <No data fields>}, _M_start = 0x0, _M_finish = 0x0, _M_end_of_storage = 0x0}}, <No data fields>}, 
  m_buckets = 0x400a4d6e2010 <tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, MapGraph::Blocks::DataBlock*>, tsl::robin_map<unsigned long, MapGraph::Blocks::DataBlock*, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, MapGraph::Blocks::DataBlock*> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::robin_map<unsigned long, MapGraph::Blocks::DataBlock*, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, MapGraph::Blocks::DataBlock*> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, MapGraph::Blocks::DataBlock*> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::static_empty_bucket_ptr()::empty_bucket>, m_bucket_count = 0, m_nb_elements = 0, 
  m_load_threshold = 0, m_max_load_factor = 0.5, m_grow_on_next_insert = false, m_min_load_factor = 0, m_try_skrink_on_next_insert = false}

Since the m_bucket_count is 0, the below information does not quite make sense. But I just want to show m_dist_from_ideal_bucket is also MAX_SHORT.

(gdb) p/x *reinterpret_cast<uint64_t*>(m_buckets[0].m_value.__data)
$4 = 0x2a000000000002
(gdb) p m_buckets[0]
$5 = {<tsl::detail_robin_hash::bucket_entry_hash<false>> = {<No data fields>}, static EMPTY_MARKER_DIST_FROM_IDEAL_BUCKET = -1, 
  m_dist_from_ideal_bucket = 32767, m_last_bucket = true, m_value = {__data = "\002\000\000\000\000\000*\000\000\000\000\000\000\000\000", 
    __align = {<No data fields>}}}

We also tried to upgrade to latest version (v1.0.1), but can still see the above stuck stacks in our tests.

@gabrieltanase42
Copy link

gabrieltanase42 commented May 18, 2022

Hi Tessil; I am from same team with Song. For us the bug occurs rarely and there is always an infinite loop. We suspected that maybe some memory corruption on our side may mess robin map memory and cause teh infinite loop and we spent already two weeks investigating this. What is specific to our case is that in a highly multithreaded scenario each thread creates, inserts and destroys teh map. We do guard with mutexes the operations on map as the datastructure itself is not thread safe.

So now I wonder : do you in your tests also have this scenario where we create map/insert elements/delete map in a loop ; and maybe from lots of threads (each thread creates, inserts and destroy its own robin map). Maybe there is an issue on the delete part which I am thinking may not be heavily tested???

At this point we started replacing the robin map with STL unordered to see if we observe any hangs anymore :(

@markpapadakis
Copy link

For what it's worth, in cases where it fails, we don't erase any KVs. It's all insertions. It occurs when inserting elements.

@markpapadakis
Copy link

Also, we experimented with various many different hash maps implementations. None of them failed so it's perhaps safe to suggest that the issue's
specific to this implementation.

@Tessil
Copy link
Owner

Tessil commented May 21, 2022

Thanks all for the information.

I was not able to reproduce the problem but I looked carefully through the implementation. I couldn't find a problem in the insertion logic of the code but I noticed that the library can invoke some undefined behaviour with C++17.

P0137R1 introduced a change in the object model in C++17 which now requires to std::launder the reinterpreted pointer from std::aligned_storage. From my understanding and the discussion I have read, the change seems to be backward incompatible and could potentially make previously well-defined code become undefined. See the following discussion for some interesting details: https://stackoverflow.com/questions/47735657/does-reinterpret-casting-stdaligned-storage-to-t-without-stdlaunder-violat and https://groups.google.com/a/isocpp.org/g/std-discussion/c/xaAwFR6qUmY

I have fixed the problem with commit 4abcc97, could you try these latests changes? It may not be the reason of the bug you encounter but it would be tremendously useful for me to at least try as I can't reproduce the problem myself.

I also added some extra assertions so running the code with TSL_DEBUG defined and assertions enabled may also help to pinpoint the problem.

Sorry for the caused inconveniences.

@gabrieltanase42
Copy link

gabrieltanase42 commented May 24, 2022

@markpapadakis Do you think you can try this patch and confirm that it fixes your problem. Now I am in the middle of replacing and benchmarking with an alternative. Also for some weird reason for us I need to let the server run with a constant load for a week or more before I notice a query stuck in infinite loop inside robin map.

@markpapadakis
Copy link

@gabrieltanase42: I just tried this commit and still fails.

@thompsonbry
Copy link

thompsonbry commented May 25, 2022

I am with the same team as Gabi. We were not using c++17 when we observed the problem. We have only been able to reproduce this in multi-day tests of the integrated software stack. We lack a clean reproducer based on a simple set of inputs and some sequence of their presentation. We do use a lock to guard access to the map from multiple threads in the code where we are observing the problem. We have only observed this for relatively small maps with numerically ordered keys which should be presented more or less in order (there might be some races involved, but there is a guarding lock).

We have also been using the robin map for several years in hash joins and have never observed an issue with that code. However, unless there is something very odd about the case where the map is behind a lock, we have to assume that we can encounter the same issue when building a hash index for a join.

@Tessil
Copy link
Owner

Tessil commented May 29, 2022

So I reviewed all the code of the library to check any eventual problem and I couldn't find anything really suspicious.

One part that could be problematic in a pre-C++11 multi-threaded program is the static initialization in https://github.com/Tessil/robin-map/blob/master/include/tsl/robin_hash.h#L1590 But it's thread-safe in C++11 and forward and this static variable is never modified once initialized. There are no other non-const shared variables.

Another slight doubt I have is the const_cast in https://github.com/Tessil/robin-map/blob/master/include/tsl/robin_hash.h#L1108 but as pos.m_bucket should always point to the non-const m_buckets_data/m_buckets or the end of it, it should be perfectly safe to cast-away the constness of it in non-const methods. One alternative would be to use the slower return iterator(m_buckets + (pos.m_bucket - m_buckets));.

I have not been able to reproduce the bug in a non-threaded program. I'll try to run some tests in a multi-threaded one as it seems it's a common premise both @szhang0119 and @markpapadakis have and maybe the static_empty_bucket_ptr has something to do with it.

@Tessil
Copy link
Owner

Tessil commented May 29, 2022

It'd be good to eventually test the map by replacing https://github.com/Tessil/robin-map/blob/master/include/tsl/robin_hash.h#L1591 with static thread_local bucket_entry empty_bucket(true);. Though if it works I would be a bit puzzled as the library only do concurrent reads from empty_bucket (and if there was a write due to a bug, it should also show weird behaviour in a non-threaded program).

@Tessil
Copy link
Owner

Tessil commented May 29, 2022

Tried to reproduce the bug in a multi-threaded test program that do insertions, erasures and maps creations but no luck unfortunately

@hunjmes
Copy link

hunjmes commented May 31, 2022

I work on the same team as Gabriel-- it's interesting to me (although I don't know what it means) that in the second bug report, from 13 days ago, the static empty_bucket has its m_dist_from_ideal_bucket set to 32767-- no one should ever be modifying it! Also, in the initial bug report, from 27 days ago, all 8 buckets have their m_dist_from_ideal_bucket set to 32767.

Since the comparison involving m_dist_from_ideal_bucket is a <= involving signed 16-bit integers, the comparison will always succeed, which explains the infinite loop. But it's not much of an explanation, since:

  • In the 2nd case, the static empty bucket should never be modified, and its m_dist_from_ideal_bucket should always be -1. Somehow it went from -1 to 32767...
  • In the 1st case, while it is true that the maximum possible value for m_dist_from_ideal_bucket is 32767, it's also the case that trying to access a bucket that has that maximum value would seem to lead to an infinite loop. So, how did the 1st case end up with four(!) "infinite-loop" buckets, without hanging on, say, the first?

I don't have answers to any of these questions, unfortunately--

@Tessil
Copy link
Owner

Tessil commented May 31, 2022

Yes, it's really strange. m_dist_from_ideal_bucket should also never reach a value of 32767 as once m_dist_from_ideal_bucket reaches DIST_FROM_IDEAL_BUCKET_LIMIT=4096 (which should only happen with a really, really bad hash), a rehash to grow the map is forced at next insertion.

How did it end up to reach a value of 32767 with only 8 buckets (and thus a max distance of 7) and how was static_empty_bucket_ptr ever modified to not be empty anymore (I double-checked the code and can't see how it would happen outside of some undefined behaviour)? I'll try to re-read the code again when I have some time as I must be missing something. I never had such behaviour when using the library.

To be sure, are all multi-threaded accesses and writes protected by a mutex? Even move, swap, copy, creation?

@markpapadakis
Copy link

IIRC it always failed in MT programs for us and it's always about maps protected with either an std::mutex or an std::shared_mutex. As I mentioned in a previous comment, no other hash map implementation I tried failed.

@Tessil
Copy link
Owner

Tessil commented Jun 1, 2022

So I tried to reproduce the bug with the following program that has multiple threads inserting, erasing, reading random values in multiple maps in parallel to try to reproduce the problem but to no avail. Also tried to use sequential values instead of random ones but still working fine. Both with clang 13.0 and g++ 11.3 with -O3.

I you could eventually try to reproduce the problem with TSL_DEBUG defined as well as assertions enabled, it may help to pinpoint the problem if one of the asserts fail.

#include <tsl/robin_map.h>

#include <cstdlib>
#include <iostream>
#include <mutex>
#include <random>
#include <shared_mutex>
#include <thread>

const std::size_t nb_iterations = 100000000000;
const std::size_t nb_maps = 4;
std::vector<std::shared_mutex> mutexes(nb_maps);
std::vector<tsl::robin_map<std::uint64_t, int>> maps(nb_maps);
std::mutex stdout_mutex;
thread_local std::random_device rd;
thread_local std::mt19937 gen(rd());

void test() {
  std::uniform_int_distribution<std::size_t> map_choice(0, 3);
  std::uniform_int_distribution<std::uint64_t> value_gen(0, 10000000);
  std::uniform_int_distribution<int> should_insert(0, 3);
  std::uniform_int_distribution<int> should_erase(0, 7);
  std::uniform_int_distribution<int> should_reset(0, 10000000);

  std::uint64_t total = 0;
  for (std::size_t i = 0; i < nb_iterations; i++) {
    const std::uint64_t insert_val = value_gen(gen);
    const std::uint64_t erase_val = value_gen(gen);
    const std::uint64_t read_val = value_gen(gen);
    const bool insert = should_insert(gen) == 0;
    const bool erase = should_erase(gen) == 0;
    const bool reset = should_reset(gen) == 0;

    if (insert) {
      const std::size_t imap = map_choice(gen);
      std::unique_lock<std::shared_mutex> lock(mutexes[imap]);
      maps[imap].insert({insert_val, 1});
    }

    if (erase) {
      const std::size_t imap = map_choice(gen);
      std::unique_lock<std::shared_mutex> lock(mutexes[imap]);
      maps[imap].erase(erase_val);
    }

    {
      const std::size_t imap = map_choice(gen);
      std::shared_lock<std::shared_mutex> lock(mutexes[imap]);
      auto it = maps[imap].find(read_val);
      if (it != maps[imap].end()) {
        total += 1;
      }
    }

    if (reset) {
      const std::size_t imap = map_choice(gen);
      std::unique_lock<std::shared_mutex> lock(mutexes[imap]);
      tsl::robin_map<std::uint64_t, int> empty;
      maps[imap].swap(empty);
      total = 0;
    }

    if (i % 10000000ull == 0) {
      std::unique_lock<std::mutex> lock(stdout_mutex);
      std::cout << i << ": " << total << std::endl;
    }
  }
}

int main(int, char**) {
  std::size_t nthreads = 16;
  std::vector<std::thread> threads(nthreads);

  for (std::size_t i = 0; i < threads.size(); i++) {
    threads[i] = std::thread(test);
  }

  for (std::size_t i = 0; i < threads.size(); i++) {
    threads[i].join();
  }
}

@gabrieltanase42
Copy link

gabrieltanase42 commented Jun 2, 2022

I also tried to reproduce with something simple but no luck. Here is my test in case it may help somebody:

╰─○ cat robin.cc
#include <cstdint>
#include <iostream>
#include <string>
#include <tsl/robin_map.h>
#include <tsl/robin_set.h>

#include <thread>
#include <mutex>
#include <unistd.h>

using namespace std;

class MyMap {
    int count;
    tsl::robin_map<int, char*> *data;
    std::mutex m;
    
public:
    MyMap(){
        data = new tsl::robin_map<int, char*>();
    }
    
    void rotate(){
        m.lock();

        auto new_data = new tsl::robin_map<int, char*>();
        
        auto mit = data->begin();
        while (mit != data->end()){
            delete mit.value();
            ++mit;
        }
        delete data;

        data = new_data;
        
        m.unlock();
        
    }

    void op() {
        m.lock();
        int sz = rand()%16384+8;
        char* bytes = (char*) malloc(sz);
        for (int i=0;i<sz;++i)
            bytes[i] = i%256;
        (*data)[count++] = bytes;
        m.unlock();

        m.lock();
        if (data->size()>10){
            auto mit = data->begin();
            delete mit.value();
            data->erase(mit.key());
        }
        m.unlock();
    }

};


void th_func(int id, MyMap* map, vector<size_t>* counts){
    cout<<"Thread :"<<id<<"\n";
    size_t opid = 0;
    do {
        map->op();
        counts->at(id)++;
    } while ( true);
}

int main() 
{
    const int P = 128;
    
    vector<std::thread> vt(P);
    vector<size_t> counts(P,0);
    vector<size_t> oldcounts(P,0);
        
    MyMap* mdata = new MyMap();
    
    for (int i=0;i<P;++i){
        vt[i] = std::thread(th_func, i, mdata, &counts);
    }

    int s = 0;
    do{
        sleep(1);
        cout<<s++<<":";
        for (int i=0;i<P;++i){
            //cout << counts[i]<<" ";
            if (counts[i] - oldcounts[i] > 100000 && s>2){
                cout<<"Error:: count difference too big on slot"<<i<<":::"<<"\n";       
            }
        }
        cout<<"\n";
        oldcounts = counts;

        if (s%10==0){
            mdata->rotate();
        }
    }while (true);
    
    for (int i=0;i<P;++i){
        vt[i].join();
    }
}

@Tessil
Copy link
Owner

Tessil commented Jun 13, 2022

So I tried to reproduce it again with both scripts, ran it for 24 hours with random values and a few hours with sequential values using 32 threads and couldn't reproduce the problem.

I re-read the code again and can't find where it could come from outside of a potential race condition or memory corruption.

Does the bug happen after moving, swapping, copying a map, serialization/deserialization or anything like that?

If anyone encounters the problem, has more information or a way to reproduce, please let me know. One way to help would be to enable assertions and define TSL_DEBUG with the latests changes that add some assertions.

For now I marked the issue as cannot reproduce as there isn't much more I can do.

@ktprime
Copy link

ktprime commented Jun 30, 2022

I can reproduce it now in my test(gcc only) and test code https://github.com/ktprime/emhash/blob/master/bench/btest.cpp

//download my git emhash and run test with gcc 7.5
root@ubuntu:/emhash/bench g++ -O3 -march=native -mtune=native -I.. -I../thirdparty -std=c++17 btest.cpp -o btest
root@ubuntu:
/emhash/bench ./btest

tsl_robin_map:

Consecutive insert: 538 ms (s=0, size=2000000)
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

Program received signal SIGABRT, Aborted.
0x00007ffff7446438 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007ffff7446438 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007ffff744803a in __GI_abort () at abort.c:89
#2 0x00007ffff7a8cdde in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007ffff7a987a6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ffff7a98811 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007ffff7a98a65 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007ffff7a8ca7e in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x0000000000409f42 in tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, unsigned int>, tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::robin_hash(unsigned long, std::hash const&, std::equal_to const&, std::allocator<std::pair<unsigned long, unsigned int> > const&, float, float) ()
#8 0x000000000040e00c in tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, unsigned int>, tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::rehash_impl(unsigned long) ()
#9 0x000000000040e40a in std::pair<tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, unsigned int>, tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::robin_iterator, bool> tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, unsigned int>, tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::insert_impl<unsigned long, std::pair<unsigned long, unsigned int> >(unsigned long const&, std::pair<unsigned long, unsigned int>&&) ()
#10 0x000000000040e6b5 in void test_insert<tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> > >(tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >&, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >&) ()
#11 0x000000000040e7f6 in void test<tsl_robin_map>(char const*) ()
#12 0x000000000040198d in main ()

@Tessil
Copy link
Owner

Tessil commented Jun 30, 2022

The error seems unrelated (not an infinite loop) and more like a lack of memory when rehashing (hence the std::bad_alloc exception). How much RAM do you have?

@ktprime
Copy link

ktprime commented Jun 30, 2022

32GB server. it's a bad hash function cause many rehash ?

@Tessil
Copy link
Owner

Tessil commented Jun 30, 2022

It could be due to the indices3 that are powers-of-two + the std::hash identity function that causes too much collisions and end-up to make the distance variable becomes too large forcing a rehash. See comment in the README:

The first two are faster and use a power of two growth policy, the last two use a prime growth policy instead and are able to cope better with a poor hash function. Use the prime version if there is a chance of repeating patterns in the lower bits of your hash (e.g. you are storing pointers with an identity hash function). See GrowthPolicy for details.

You can thus try with another GrowthPolicy of hash function to avoid the problem of using power-of-two hashes with tsl::rh::power_of_two_growth_policy.

@markpapadakis
Copy link

@Tessil I just ran again this service/program that fails consistently (as described in other comments). I switched from a CitHash variant to xxHash, and even tried a differ GP, but it still fails (effectively, rehashes indefinitely, until it runs out of memory):

        template <typename KT, typename VT>
        using agg_fast_map = tsl::robin_map<KT, VT
                , std::hash<KT>
                , std::equal_to<KT>
                , std::allocator<std::pair<KT, VT>>
                , not std::is_arithmetic_v<KT>
                , tsl::rh::mod_growth_policy<>>;

@Tessil
Copy link
Owner

Tessil commented Jun 30, 2022

Thanks.

So I checked and added a try/catch on the std::bad_alloc to check the distribution of the values in the buckets (save the bucket for each value in a file and analyze it after).

With the power-of-two growth there's a bucket that has to manage 456(!) collisions. With a prime growth, the maximum is 57 collisions for one bucket.

The problem is that the robin-hood collision resolution keeps track of the distance from the ideal bucket in a variable which has a limit in size and can't store unlimited distances. So when the distance reaches DIST_FROM_IDEAL_BUCKET_LIMIT (4096) (meaning that we have a collision chain of 4096 values!) the hash map is rehashed to a larger size. There isn't much that can be done and such behaviour is synonymous of a really bad hash function.

I'll try with CityHash and xxHash, not sure how they behave with integers.

Note though that from my understanding this problem has nothing to do with the infinite loop issue and is a kind of expected behaviour with a bad hash function (I checked the cause, and it's effectively because the dist_from_ideal_bucket reaches 4096 and not a value like 32767). Or @markpapadakis and @gabrieltanase42 do you also have such long collisions in your programs that cause rehashes?

@gabrieltanase42
Copy link

gabrieltanase42 commented Dec 28, 2022

Tessil, today we were able to create a test that can reproduce infinite loop in a deterministic way. Like every time we run a benchmark the program hangs (spinning in the loop below). We flipped again from robin map to std map and things work perfectly. In this situation however the robin map has a relatively large number of elements. I am looking now at the previous comments and I see that you mention some potential issues and that the user should try different hash functions etc. Probably we could try these but in my mind we could still hit this unfortunate situation where code spins forever in this method:

`template
const_iterator find_impl(const K& key, std::size_t hash) const {
std::size_t ibucket = bucket_for_hash(hash);
distance_type dist_from_ideal_bucket = 0;

while (dist_from_ideal_bucket <=
       m_buckets[ibucket].dist_from_ideal_bucket()) {
  if (TSL_RH_LIKELY(
          (!USE_STORED_HASH_ON_LOOKUP ||
           m_buckets[ibucket].bucket_hash_equal(hash)) &&
          compare_keys(KeySelect()(m_buckets[ibucket].value()), key))) {
    return const_iterator(m_buckets + ibucket);
  }

  ibucket = next_bucket(ibucket);
  dist_from_ideal_bucket++;
}

return cend();

}`

I asked permission with my team and if you have time one of these days we could have an interactive session where we can look at the code inside debugger.

In this case the map is from uint64_t to bool (tsl::robin_map<gui_t, bool> ); The Hash function is the default one;
When attaching the debugger I could see dist_from_ideal_bucket being negative;
Today I'll spend more time with it but I wanted to post this first to maybe reopen the ticket and see if we could meet virtually to look at the code briefly.

@gabrieltanase42
Copy link

gabrieltanase42 commented Dec 28, 2022

So after I enable asserts , the program runs for like 30 mins inside that loop and it asserts the following:

TestSuites: /home/igtanase/brazil-pkg-cache/packages/TessilRobinMap/TessilRobinMap-1.0.x.344.0/AL2_x86_64/DEV.STD.PTHREAD/build/include/tsl/robin_hash.h:251: tsl::detail_robin_hash::bucket_entry<ValueType, StoreHash>::value_type& tsl::detail_robin_hash::bucket_entry<ValueTyp
e, StoreHash>::value() [with ValueType = std::pair<long unsigned int, bool>; bool StoreHash = false; tsl::detail_robin_hash::bucket_entry<ValueType, StoreHash>::value_type = std::pair<long unsigned int, bool>]: Assertion `!empty()' failed.
[1/1] P8DataFlowTest.BFSDFGBenchmarkLiveJournal returned/aborted with exit code -6 (132077 ms)

That corresponds to the following function:

value_type& value() noexcept { tsl_rh_assert(!empty()); return *reinterpret_cast<value_type*>(std::addressof(m_value)); }

@Tessil
Copy link
Owner

Tessil commented Dec 28, 2022

The bad hash issue is different and is unrelated to the problem you mention. It is when the collision chain keeps being higher than DIST_FROM_IDEAL_BUCKET_LIMIT=8192 due to a poor hash function as I mentioned in my previous comment and would not cause an infinite loop but a growth of the map which would lead to a std::bad_alloc during insertion if the collision chain keeps happening. Changing the hash is effectively not a solution in your case and the problem is somewhere else.

It seems dist_from_ideal_bucket variable is overflowing if it becomes negative which is quite strange as the maximum value of m_buckets[ibucket].dist_from_ideal_bucket() should be DIST_FROM_IDEAL_BUCKET_LIMIT=8192 (or 4096 before) which is lower than std::numeric_limits<int16_t>::max(). The loop should thus terminate once dist_from_ideal_bucket goes over DIST_FROM_IDEAL_BUCKET_LIMIT in any case. Could you check if any bucket has a value larger than DIST_FROM_IDEAL_BUCKET_LIMIT? Or lower than -1?

I asked permission with my team and if you have time one of these days we could have an interactive session where we can look at the code inside debugger.

Sorry, I would prefer to avoid looking into closed-source code for legal reasons.

Could you try to compile the code with clang address and undefined sanitizers (and eventually thread and memory ones)? Or just run it through Valgrind for a first pass?

What kind of operations are done on the map in the benchmark? Is the map moved or copied? Is it eventually serialized/deserialized? As I see a mention of PTHREAD I suppose multi-threading is used, as the map doesn't support multi-threading, are the locks all in place?

I reopened the ticket.

@Tessil Tessil reopened this Dec 28, 2022
@gabrieltanase42
Copy link

I'll keep digging more. Here are few more details. We do have the code split in multiple .so libs. The robin_map is allocated in one .so and it is passed as argument to a function that is implemented in a a different .so.

The reason I mentioned this is because at one point I tried to replace robin with abseil hash and that effort failed because of this exact reason ( abseil/abseil-cpp#1193 abseil/abseil-cpp#834)

With regards to multithreading. In this scenario is much simpler. The map is accessed by one thread only. I am doing two operations. Find and insert (key, value).

I did run with address sanitizer flag enabled. did'n print out anything.

@Tessil
Copy link
Owner

Tessil commented Dec 28, 2022

Thanks for the info.

Do you know the stacktrace of the failing assert in value()? What operation was done at that moment?

@gabrieltanase42
Copy link

Do you think you can prepare a print_info function for me that I can invoke inside that tsl_rh_assert() ?
Or point me to an existing one if there ?
Maybe if we print the state of all critical variables we may get a clue....

@Tessil
Copy link
Owner

Tessil commented Dec 28, 2022

The problem is that some of the critical infos may be in the m_buckets array which from what I understand is quite large.

As the bug occurs with mono-thread insertions and finds only (no deletions, map copy/move/swap, serialization or other) and is deterministic, would it eventually be possible to have a compressed text file with all the inserted gui_t in order of insertion or is there anything private in them? I could then try to reproduce the bug on my side.

@gabrieltanase42
Copy link

gabrieltanase42 commented Dec 28, 2022

I just realized we are also using clear() to erase the map; Essentially is a BFS algorithm. And we are using two maps; One is a set of input vertices. During the kernel in question we traverse this input map and we compute an output map with new vertices. We need the map to make sure we won;t visit a vertex twice.

After the kernel is over, the output map becomes the new input map and the old input map is erased; then we invoke the kernel again;

stack:: the version of the code is 1.0.1 but it is prior to you adding those std::launder

* frame #0: 0x00007f757c5e7d71 libmapgraph-p8-operators.so`tsl::detail_robin_hash::bucket_entry<std::pair<unsigned long, bool>, false>::value(this=0x00007f73e04c8000) at robin_hash.h:258
    frame #1: 0x00007f757c5ec0fe libmapgraph-p8-operators.so`tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::robin_iterator<true> tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::find_impl<unsigned long>(this=0x00007f7561e3b280, key=0x00007f7561e26f88, hash=11529781552120594432) const at robin_hash.h:1171
    frame #2: 0x00007f757c5eb475 libmapgraph-p8-operators.so`tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::robin_iterator<true> tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::find<unsigned long>(this=0x00007f7561e3b280, key=0x00007f7561e26f88, hash=11529781552120594432) const at robin_hash.h:1014
    frame #3: 0x00007f757c5ea83d libmapgraph-p8-operators.so`tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::robin_iterator<false> tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::find_impl<unsigned long>(this=0x00007f7561e3b280, key=0x00007f7561e26f88, hash=11529781552120594432) at robin_hash.h:1160
    frame #4: 0x00007f757c5e997f libmapgraph-p8-operators.so`tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::robin_iterator<false> tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::find<unsigned long>(this=0x00007f7561e3b280, key=0x00007f7561e26f88) at robin_hash.h:999
    frame #5: 0x00007f757c5e88f1 libmapgraph-p8-operators.so`tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::find(this=0x00007f7561e3b280, key=0x00007f7561e26f88) at robin_map.h:488
    frame #6: 0x00007f757c5e763d libmapgraph-p8-operators.so`unsigned long MapGraph::p8::spmspvAdjacencyOnlyKernelMapInput<(MapGraph::p8::GraphDataType)1, bool, (MapGraph::Execution::SemiRingAddition)1, (MapGraph::Execution::SemiRingMultiplication)6, true, false, false>(REL_E=0x00007f7561e27810, inputWeightMap=0x00007f7561e3b140, validVertexMap=0x00007f7575148e40, cc=0x00007f7561e27708, edgep=11529781176126406656, outputMap=0x00007f7561e3b280)1>&, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>&, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>&, MapGraph::p8::Catalog::CatalogConfig&, unsigned long, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false

I also print the state of the hash map:

(lldb) p *this
(tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >::ValueSelect, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >) $8 = {
  tsl::rh::power_of_two_growth_policy<8> = (m_mask = 65535)
  m_buckets_data = size=0 {}
  m_buckets = 0x00007f73e0408000
  m_bucket_count = 65536
  m_nb_elements = 32768
  m_load_threshold = 32768
  m_min_load_factor = 0
  m_max_load_factor = 0.5
  m_grow_on_next_insert = false
  m_try_shrink_on_next_insert = false

So most likely the insert before this find reached the 32768 size (power of two) and the resizing may mess up the internals;

Here are the variables in the last find_impl:

(lldb) p dist_from_ideal_bucket
(tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >::ValueSelect, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >::distance_type) $9 = -32768
(lldb) p ibucket
(std::size_t) $10 = 32768
(lldb) p m_buckets[ibucket].dist_from_ideal_bucket()
(tsl::detail_robin_hash::bucket_entry<std::pair<unsigned long, bool>, true>::distance_type) $11 = -1

(lldb) p m_buckets[ibucket]
(tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >::ValueSelect, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >::bucket_entry) $12 = {
  m_dist_from_ideal_bucket = -1
  m_last_bucket = false
  m_value = (__data = "", __align = std::aligned_storage<8, 8>::type::(unnamed struct) @ 0x0000000006f396c8)
}

@Tessil
Copy link
Owner

Tessil commented Dec 28, 2022

Thank you for all the info.

The m_buckets_data of size 0 with a m_bucket_count of 65536 is really worrying. The m_buckets_data should be of size m_bucket_count and m_buckets should point to m_buckets_data.data() if m_bucket_count > 0 (otherwise static_empty_bucket_ptr()).

I'll take a deeper look tomorrow.

@gabrieltanase42
Copy link

gabrieltanase42 commented Dec 29, 2022

I think I got the trace of operations to get it to expose the issue. you can replay the trace with something like this:


#include <boost/tokenizer.hpp>
#include <boost/lexical_cast.hpp>

// test code

            using namespace std;
            using namespace boost;

            std::string inputPath = "log";
            ifstream in(inputPath.c_str());
            ASSERT_TRUE(in.is_open());

            tsl::robin_map<gui_t, bool> rmap;
            typedef tsl::robin_map<gui_t, bool>::value_type pair_type;

            typedef boost::tokenizer<boost::char_separator<char> > Tokenizer;
            vector< string > vec;
            string line;

            while (getline(in,line))
            {

                if (rmap.size() == 32767){
                   // this here is so you can attach a debugger ; if not need comment out this if
                   // also replace LOG(ERROR) with cout
                    LOG(ERROR)<<"Getting close::"<<rmap.size()<<"\n";
                    int i=0;
                    while (i == 0){
                        usleep(2);
                    }
                }
                boost::char_separator<char> sep{"#"};
                Tokenizer tok(line,sep);
                vec.assign(tok.begin(),tok.end());
                // vector now contains strings from one row, output to cout here
                for (auto& s:vec){
                    LOG(ERROR)<<s<<"#";
                }
                LOG(ERROR) << "\n----------------------" << endl;
                if (vec[1]=="clear") {
                    rmap.clear();
                }
                if (vec[1]=="find") {
                    uint64_t num = lexical_cast<uint64_t>(vec[2]);
                    rmap.find(num);
                }
                if (vec[1]=="insert") {
                    uint64_t num = lexical_cast<uint64_t>(vec[2]);
                    rmap.insert(pair_type(num,true));
                }
                if (vec[1]=="reassign") {
                    uint64_t num = lexical_cast<uint64_t>(vec[2]);
                    rmap[num] = true;
                }

            }

you need to use # as separator. Now the values in each line are like this:

E1228 22:53:38.103422 37107 spmspvKernels.h:276] %%%0x7fa0c523b280#insert#11529778670445002752#1#1

the name of the operation (clear, find, insert, reassign), followed by Key (uint64_t) followed by value (always 1 in this case) followed by the size of the map before the operation is performed. There is some info before the operation name but ignore that;

You can unzip : gzip -d log.gz
log.gz

Tomorrow I'll try to reproduce this outside our framework, with the latest code release;

@gabrieltanase42
Copy link

Standalone test also hangs so we have a reproducer :)
I just did a git pull in my local copy of robin map:

commit 6775231bbf11c4a78ac2d25b8ae5791da8b35422 (HEAD -> master, origin/master, origin/HEAD)
Author: Thibaut Goetghebuer-Planchon <tessil@gmx.com>
Date:   Sun Dec 18 18:18:01 2022 +0000

    Disable CMake install rule if robin_map is used as subproject (#60)
~ cat bug.cc
// compile: g++ -o exe -g -O3 -std=c++17 -I/home/igtanase/brazil-pkg-cache/packages/Boost/Boost-3.0.394338.0/AL2_x86_64/DEV.STD.PTHREAD/build/include/ -Irobin-map/include bug.cc

#include <cstdint>
#include <iostream>
#include <fstream>
#include <string>
#include <tsl/robin_map.h>
#include <tsl/robin_set.h>

#include <thread>
#include <mutex>
#include <unistd.h>


#include <boost/tokenizer.hpp>
#include <boost/lexical_cast.hpp>

using namespace std;

typedef uint64_t gui_t;

int main() 
{
    using namespace boost;

    std::string inputPath = "log";
    ifstream in(inputPath.c_str());
    assert(in.is_open());
    
    tsl::robin_map<gui_t, bool> rmap;
    typedef tsl::robin_map<gui_t, bool>::value_type pair_type;
    
    typedef boost::tokenizer<boost::char_separator<char> > Tokenizer;
    vector< string > vec;
    string line;
    
    while (getline(in,line)) {
            
        if (rmap.size() == 32767){
            // this here is so you can attach a debugger ; if not need comment out this if
            cout<<"Getting close::"<<rmap.size()<<"\n";
            //int i=0;
            //while (i == 0){
            //    usleep(2);
            // }
        }
        boost::char_separator<char> sep{"#"};
        Tokenizer tok(line,sep);
        vec.assign(tok.begin(),tok.end());
        // vector now contains strings from one row, output to cout here
        for (auto& s:vec){
            cout<<s<<"#";
        }
        cout << "\n----------------------" << endl;
        if (vec[1]=="clear") {
            rmap.clear();
        }
        if (vec[1]=="find") {
            uint64_t num = lexical_cast<uint64_t>(vec[2]);
            rmap.find(num);
        }
        if (vec[1]=="insert") {
            uint64_t num = lexical_cast<uint64_t>(vec[2]);
            rmap.insert(pair_type(num,true));
        }
        if (vec[1]=="reassign") {
            uint64_t num = lexical_cast<uint64_t>(vec[2]);
            rmap[num] = true;
        }
        
    }
}

Tessil added a commit that referenced this issue Dec 29, 2022
…LIMIT during insertion (fix issue #52)

During insertion a check was done on dist_from_ideal_bucket to be sure
it doesn't becomes bigger than DIST_FROM_IDEAL_BUCKET_LIMIT but this was
only done during the robin swap. A check should also be done beforehand
if we find an empty bucket otherwise the variable could overflow and
lead to bugs. This commit adds this check.
@Tessil
Copy link
Owner

Tessil commented Dec 29, 2022

Thank you very much for the example and your help. I was able to successfully reproduce the bug and there was effectively a problem in the code, I'm really sorry for that.

I create a PR for the fix if you could try it out. I still have to add some tests before merging it.

The problem was in the end due to a high collisions chain due to a poor hash function which the code didn't manage properly. I did check for potential overflows of dist_from_ideal_bucket during the insertion but only during the robin swap while it should also be done beforehand to make sure that we don't store a m_dist_from_ideal_bucket that overflows in the bucket_entry.

Could you check if the PR fixes your issue?

I would recommend though to change the hash function as unfortunately GCC and clang use an identity function for integer types for std::hash and this can lead to poor distribution with a power of two map size. With the current hash you may still end up with std::bad_alloc due to excessive rehash growth.

I will try to create a tsl::hash project when I have more time to replace it as default hash function of the tsl libs and have a better default hash for integer types with a fallback to std::hash for other types. Eventually a stateful one that would change between rehashes.

Thanks again for your help and patience.

@gabrieltanase42
Copy link

gabrieltanase42 commented Dec 29, 2022

Thx for your help with this. I'll definitely check your fix in a minute. I guess our luck (christmass present) is that we found a trace to reproduce it. Now I am hoping that this fix also fixes other people's issues. @markpapadakis can you also give it a try with the new branch that @Tessil mentions (fix_issue_52).

@Tessil we do use the hash map in multiple scenarios and we do have our own hash functions for different situations. Our first thoutgh was also to change the hash function when we bumped into this latest bug but that would have only masked the problem till the next crash.

Here is a link to a hashing function that will work better than std::hash

https://web.archive.org/web/20071223173210/http://www.concentric.net/~Ttwang/tech/inthash.htm

public int hash32shiftmult(int key)
{
  int c2=0x27d4eb2d; // a prime or an odd constant
  key = (key ^ 61) ^ (key >>> 16);
  key = key + (key << 3);
  key = key ^ (key >>> 4);
  key = key * c2;
  key = key ^ (key >>> 15);
  return key;
}

public long hash64shift(long key)
{
  key = (~key) + (key << 21); // key = (key << 21) - key - 1;
  key = key ^ (key >>> 24);
  key = (key + (key << 3)) + (key << 8); // key * 265
  key = key ^ (key >>> 14);
  key = (key + (key << 2)) + (key << 4); // key * 21
  key = key ^ (key >>> 28);
  key = key + (key << 31);
  return key;
}

@gabrieltanase42
Copy link

Some initial testing shows that the code is not crashing anymore. Next week I'll run more in depth tests

@Tessil
Copy link
Owner

Tessil commented Dec 30, 2022

I guess our luck (christmass present) is that we found a trace to reproduce it.

Yes, that was really useful. I did consider the dist_from_ideal_bucket reaching its size limit in case of very high collisions during the design but I didn't consider this case. Changing the hash function would effectively have masked the bug and I'm glad we found it as it was a worrying bug.

For the robin-map v2 I'm eventually considering to just saturate the distance to std::numeric_limits<distance_type>::max() and take that into account during the find where we would have to continue looking into the collision chain if the limit is reached. Something like:

distance_type next_distance(distance_type distance) const noexcept {
    // To optimize
    if (distance == std::numeric_limits<distance_type>::max()) {
        return distance;
    }

    return distance + 1;
}

template <class K>
const_iterator find_impl(const K& key, std::size_t hash) const {
    ...
    while (dist_from_ideal_bucket <= m_buckets[ibucket].dist_from_ideal_bucket()) {
        if (compare_keys(KeySelect()(m_buckets[ibucket].value()), key))) {
            return const_iterator(m_buckets + ibucket);
        }

        ibucket = next_bucket(ibucket);
        dist_from_ideal_bucket = next_distance(dist_from_ideal_bucket);
    }
    ...
}

During the insertion and erasure robin swap, we would have to pay attention to use the non-saturated real distance (which would mean potentially rehashing some values) instead of using the stored one.

It would avoid the excessive resizes in case of catastrophical collision chains but performance will probably be reduced a bit. Need to check the performance impact and if it's worth it for such a corner case (if this case occurs, it means the hash function is quite poor).

Tessil added a commit that referenced this issue Jan 3, 2023
…LIMIT during insertion (fix issue #52)

During insertion a check was done on dist_from_ideal_bucket to be sure
it doesn't becomes bigger than DIST_FROM_IDEAL_BUCKET_LIMIT but this was
only done during the robin swap. A check should also be done beforehand
if we find an empty bucket otherwise the variable could overflow and
lead to bugs. This commit adds this check.
@Tessil
Copy link
Owner

Tessil commented Jan 3, 2023

I merged the PR. Don't hesitate to let me know if you still encounter some problems even after this fix.

I will create a new release in the next few days and close the PR if everything is working right.

@Tessil
Copy link
Owner

Tessil commented Jan 5, 2023

I created the new release with the bugfix. I strongly recommend to update to this new version.

I will close the issue for now, let me know if the bug still occurs, we can then reopen it.

Note that you may still end up with std::bad_alloc due to aggressive rehashes if the hash function is very poor and create long collision chains (> 8192). Solving this would require some design change (see my #52 (comment)) which may decrease some of the performance which I'm not keen to affect for such extreme bad hash function case. I will still do some tests and eventually provide a 2.0 release if the performances are adequate.

@Tessil Tessil closed this as completed Jan 5, 2023
@yhding456
Copy link

yhding456 commented Jan 11, 2023 via email

@Tessil
Copy link
Owner

Tessil commented Jan 11, 2023

See #52 (comment) comment. To avoid overflowing the dist_from_ideal_bucket, the map is rehashed when it reaches the DIST_FROM_IDEAL_BUCKET_LIMIT limit (8192).

A multiplicative hash is not recommended for open-addressing hash as it may lead to a very high number of collisions depending on the input distribution. The KeyHasher you use will only output odd hash for any even input.

Changing the hash function is the best course of action here, as even if I implement #52 (comment) you'll have terrible performances with the current hash function. If you have a dist_from_ideal_bucket of 8192 for example (or more) for one bucket, it means that the map has to go through 8192 buckets before finding the value in this bucket.

@yhding456
Copy link

yhding456 commented Jan 11, 2023 via email

jonpryor pushed a commit to dotnet/android that referenced this issue Jun 16, 2023
Changes: https://github.com/Tessil/robin-map/releases/tag/v1.2.0

Changes: Tessil/robin-map@784245b...d37a410

  * Tessil/robin-map@d37a410: Update CMake tsl-robin-map to v1.2.1
  * Tessil/robin-map@68ff732: Keep rehashing if dist_from_ideal_bucket is > DIST_FROM_IDEAL_BUCKET_LIMIT during insertion (fix issue Tessil/robin-map#52)
  * Tessil/robin-map@6775231: Disable CMake install rule if robin_map is used as subproject (Tessil/robin-map#60)
  * Tessil/robin-map@57c9b65: Replace depecrated std::aligned_storage since C++23 by alignas (Tessil/robin-map#61)
  * Tessil/robin-map@d3131e4: Raise DIST_FROM_IDEAL_BUCKET_LIMIT to 8192
  * Tessil/robin-map@f8e0f67: Add assertion to make sure that static_empty_bucket_ptr is empty
  * Tessil/robin-map@ac1e3d8: Add some extra assertions for clarity and ease of debug
  * Tessil/robin-map@f1e7457: Clear and shrink the moved hash table in the move operator to be coherent with the move constructor
  * Tessil/robin-map@4abcc97: When using C++17, std::launder the reinterpreted pointer from std::aligned_storage to adapt to the change of object model introduced in P0137R1. Fix potential undefined behaviour.
  * Tessil/robin-map@c77f80b: Update link to Conan package
  * Tessil/robin-map@c7595ba: Apply clang-format --style=Google
  * Tessil/robin-map@37e94dc: When exceptions are disabled, only print the error message when defined(TSL_DEBUG) instead of !defined(NDEBUG)
  * Tessil/robin-map@59a3b7d: Fix test_extreme_bucket_count_value_construction test on some platforms
  * Tessil/robin-map@0c3c858: Check that bucket_count doesn't exceed max_bucket_count() after the constructor initialization

The robin-map fast map implementation is used in our p/invoke
dispatch mechanism (38aa561).

The new version is a recommended upgrade.  It doesn't appear to
contain fixes that affect us, but it's better to be safe than
sorry, right? :)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants