Infinite loop in synchronization of priority queues #802

Closed
dcorbacho opened this issue May 17, 2016 · 3 comments

@dcorbacho
Contributor

This crash generates an infinite loop in the synchronization of priority queues.

Found while reproducing #687

=INFO REPORT==== 17-May-2016::14:29:35 ===
Mirrored queue 'pri_q2_0' in vhost '/': Adding mirror on node 'rabbit@t-srv-rabbit05': <8977.27711.1>

=INFO REPORT==== 17-May-2016::14:29:35 ===
Mirrored queue 'pri_q2_0' in vhost '/': Adding mirror on node 'rabbit@t-srv-rabbit06': <8978.26069.1>

=ERROR REPORT==== 17-May-2016::14:29:35 ===
** Generic server <0.302.0> terminating
** Last message in was {init,{<0.148.0>,
                              [non_clean_shutdown,non_clean_shutdown,
                               non_clean_shutdown,non_clean_shutdown,
                               non_clean_shutdown,non_clean_shutdown,
                               non_clean_shutdown,non_clean_shutdown,
                               non_clean_shutdown,non_clean_shutdown,
                               non_clean_shutdown,non_clean_shutdown,
                               non_clean_shutdown,non_clean_shutdown,
.....
                               non_clean_shutdown,non_clean_shutdown,
                               non_clean_shutdown]}}
** When Server state == {q,{amqqueue,
                               {resource,<<"/">>,queue,<<"pri_gb2_2">>},
                               true,false,none,
                               [{<<"x-max-priority">>,signedint,256}],
                               <0.302.0>,[],[],[],
                               [{vhost,<<"/">>},
                                {name,<<"ha-all">>},
                                {pattern,<<>>},
                                {'apply-to',<<"all">>},
                                {definition,
                                    [{<<"ha-mode">>,<<"all">>},
                                     {<<"ha-sync-mode">>,<<"automatic">>}]},
                                {priority,0}],
                               [],[],live},
                           none,false,undefined,undefined,
                           {state,
                               {queue,[],[],0},
                               {active,-576460741042383,1.0}},
                           undefined,undefined,undefined,undefined,
                           {state,fine,5000,undefined},
                           {0,nil},
                           undefined,undefined,undefined,
                           {state,
                               {dict,0,16,16,8,80,48,
                                   {[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                    [],[]},
                                   {{[],[],[],[],[],[],[],[],[],[],[],[],[],
                                     [],[],[]}}},
                               delegate},
                           undefined,undefined,undefined,undefined,0,running}
** Reason for termination == 
** {{badmatch,{error,writer_exists}},
    [{rabbit_queue_index,get_journal_handle,1,
                         [{file,"src/rabbit_queue_index.erl"},{line,835}]},
     {rabbit_queue_index,load_journal,1,
                         [{file,"src/rabbit_queue_index.erl"},{line,847}]},
     {rabbit_queue_index,recover_journal,1,
                         [{file,"src/rabbit_queue_index.erl"},{line,857}]},
     {rabbit_queue_index,init_dirty,3,
                         [{file,"src/rabbit_queue_index.erl"},{line,563}]},
     {rabbit_variable_queue,init,6,
                            [{file,"src/rabbit_variable_queue.erl"},
                             {line,505}]},
     {rabbit_priority_queue,'-init/3-lc$^1/1-0-',2,
                            [{file,"src/rabbit_priority_queue.erl"},
                             {line,157}]},
     {rabbit_priority_queue,'-init/3-lc$^1/1-0-',2,
                            [{file,"src/rabbit_priority_queue.erl"},
                             {line,157}]},
     {rabbit_priority_queue,init,3,
                            [{file,"src/rabbit_priority_queue.erl"},
                             {line,157}]}]}

=ERROR REPORT==== 17-May-2016::14:29:35 ===
Restarting crashed queue 'pri_gb2_2' in vhost '/'.
@dcorbacho
Contributor Author

This bug happens with persistent queues and can be reproduced by constantly re-applying the HA policy, as that causes queues to be started and stopped. It does not happen on its own, but as a consequence of other bugs that prevent queues from cleaning up their state (the index folder and journal stay on disk).
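For reference, a minimal sketch of the kind of loop I use to reproduce it (driving rabbitmqctl from an Erlang shell; the function name is illustrative, and the policy definition matches the one in the crash report above):

```erlang
%% Re-apply the same HA policy over and over so that mirrors are
%% stopped and restarted in quick succession; os:cmd/1 simply shells
%% out to rabbitmqctl.
repro(0) -> ok;
repro(N) ->
    os:cmd("rabbitmqctl set_policy ha-all \"\" "
           "'{\"ha-mode\":\"all\",\"ha-sync-mode\":\"automatic\"}'"),
    repro(N - 1).
```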

In stable, it can be a consequence of #803. In master, it happens after getting this crash:

** {{badmatch,true},[{rabbit_queue_index,init,3,[{file,"src/rabbit_queue_index.erl"},{line,286}]},{rabbit_variable_queue,init,6,[{file,"src/rabbit_variable_queue.erl"},{line,477}]},{rabbit_priority_queue,'-init/3-lc$^0/1-1-',3,[{file,"src/rabbit_priority_queue.erl"},{line,157}]},{rabbit_priority_queue,'-init/3-lc$^0/1-1-',3,[{file,"src/rabbit_priority_queue.erl"},{line,157}]},{rabbit_priority_queue,init,3,[{file,"src/rabbit_priority_queue.erl"},{line,157}]},{rabbit_mirror_queue_slave,handle_go,1,[{file,"src/rabbit_mirror_queue_slave.erl"},{line,124}]},{rabbit_mirror_queue_slave,handle_cast,2,[{file,"src/rabbit_mirror_queue_slave.erl"},{line,245}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1049}]}]}

During initialization, this check fails (https://github.com/rabbitmq/rabbitmq-server/blob/stable/src/rabbit_queue_index.erl#L281) because the directory still exists on disk. This can be reproduced quickly with queues with large max priorities (e.g. 256), takes longer with small ones (e.g. 5), and I could not reproduce it with non-priority queues at all. The issue comes from the asynchronous stop/start of the queues: the initialization of a queue can overlap with reset_state in rabbit_queue_index while the previous instance of the queue is still terminating. A large max priority implies a correspondingly large number of variable queues (e.g. 256), so the stop command takes much longer to be processed.
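To make the failing assertion concrete, this is roughly its shape (an illustrative sketch, not a copy of the rabbit_queue_index code):

```erlang
%% On a clean start the queue index asserts that no index directory is
%% left over from a previous incarnation of the queue. If the old
%% incarnation has not finished deleting it yet, the directory is still
%% there and the assertion crashes with {badmatch,true}.
assert_no_leftover_index(Dir) ->
    false = filelib:is_dir(Dir),
    ok.
```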

At the same time, and I believe unrelated to the previous issues, I'm getting this error:

2016-06-20 08:01:26.889 [error] <0.1163.0> ** Generic server <0.1163.0> terminating
** Last message in was {'$gen_cast',{run_backing_queue,rabbit_mirror_queue_master,#Fun<rabbit_mirror_queue_master.9.50136656>}}
** When Server state == {q,{amqqueue,{resource,<<"/">>,queue,<<"@prio_p_a_21_ 10">>},true,false,none,[{<<"x-max-priority">>,signedint,124}],<0.1163.0>,[<28871.8320.0>,<28868.9145.0>],[],['rabbit@t-srv-rabbit05','rabbit@t-srv-rabbit06'],undefined,[{<28868.9154.0>,<28868.9145.0>},{<28871.8323.0>,<28871.8320.0>},{<0.8447.0>,<0.1163.0>}],[],live,15},none,false,rabbit_priority_queue,{state,rabbit_variable_queue,[{124,{vqstate,{0,{[],[]}},{0,{[],[]}},{delta,undefined,0,undefined},{0,{[],[]}},{5,{[{msg_status,4,<<13,193,241,204,131,208,54,32,69,23,31,190,108,97,144,121>>,{basic_message,{resource,<<"/">>,exchange,<<>>},[<<"@prio_p_a_21_ 10">>],{content,60,{'P_basic',undefined,undefined,undefined,2,124,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},<<24,0,2,124>>,rabbit_framing_amqp_0_9_1,[<<0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...>>]},...},...},...],...}},...}},...]},...}
** Reason for termination == ** {function_clause,[{rabbit_priority_queue,invoke,[rabbit_mirror_queue_master,#Fun<rabbit_mirror_queue_master.9.50136656>,{state,rabbit_variable_queue,[{124,{vqstate,{0,{[],[]}},{0,{[],[]}},{delta,undefined,0,undefined},{0,{[],[]}},{5,{[{msg_status,4,<<13,193,241,204,131,208,54,32,69,23,31,190,108,97,144,121>>,{basic_message,{resource,<<"/">>,exchange,<<>>},[<<"@prio_p_a_21_ 10">>],{content,60,{'P_basic',undefined,undefined,undefined,2,124,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},<<24,0,2,124>>,rabbit_framing_amqp_0_9_1,[<<0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

I will investigate later whether it is independent and, if so, create a new issue for it.

I'm also getting:

2016-06-20 08:02:46.519 [warning] <0.11268.0> (<0.11268.0>) rabbit_priority_queue delete_and_terminate {{badmatch,{error,enoent}},[{rabbit_queue_index,append_journal_to_segment,1,[{file,"src/rabbit_queue_index.erl"},{line,836}]},{rabbit_queue_index,'-flush_journal/1-fun-0-',2,[{file,"src/rabbit_queue_index.erl"},{line,819}]},{lists,foldl,3,[{file,"lists.erl"},{line,1262}]},{rabbit_queue_index,segment_fold,3,[{file,"src/rabbit_queue_index.erl"},{line,1014}]},{rabbit_queue_index,flush_journal,1,[{file,"src/rabbit_queue_index.erl"},{line,811}]},{rabbit_variable_queue,handle_pre_hibernate,1,[{file,"src/rabbit_variable_queue.erl"},{line,838}]},{rabbit_priority_queue,foreach1,3,[{file,"src/rabbit_priority_queue.erl"},{line,488}]},{rabbit_priority_queue,foreach1,2,[{file,"src/rabbit_priority_queue.erl"},{line,486}]}]}

which seems to be #545, and the next one may be another manifestation of it:

2016-06-20 08:02:45.537 [warning] <0.11866.0> (<0.11866.0>) rabbit_priority_queue delete_and_terminate {{badmatch,{error,eexist}},[{rabbit_queue_index,get_journal_handle,1,[{file,"src/rabbit_queue_index.erl"},{line,852}]},{rabbit_queue_index,publish,6,[{file,"src/rabbit_queue_index.erl"},{line,382}]},{rabbit_variable_queue,maybe_write_index_to_disk,3,[{file,"src/rabbit_variable_queue.erl"},{line,1886}]},{rabbit_variable_queue,publish1,7,[{file,"src/rabbit_variable_queue.erl"},{line,1721}]},{rabbit_variable_queue,publish,6,[{file,"src/rabbit_variable_queue.erl"},{line,570}]},{rabbit_priority_queue,pick1,3,[{file,"src/rabbit_priority_queue.erl"},{line,496}]},{rabbit_mirror_queue_slave,process_instruction,2,[{file,"src/rabbit_mirror_queue_slave.erl"},{line,847}]},{rabbit_mirror_queue_slave,handle_cast,2,[{file,"src/rabbit_mirror_queue_slave.erl"},{line,254}]}]}

@dcorbacho
Contributor Author

The branch https://github.com/rabbitmq/rabbitmq-server/commits/rabbitmq-server-802 now contains two patches.

The invoke crash is solved in 5f1ffc6 by retrieving the max priority from the internal state. rabbit_amqqueue_process, rabbit_mirror_queue_master and rabbit_mirror_queue_slave are unaware of the backing queue they are using, and thus can never provide the priority in the invoke call.
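A sketch of the idea (the #state{} shape matches rabbit_priority_queue, but the clause itself is illustrative rather than a copy of the commit):

```erlang
-record(state, {bq, bqss}).  %% bqss :: [{Priority, BQState}], highest first

%% Callers such as rabbit_mirror_queue_master cannot tell us which
%% priority to act on, so invoke/3 takes it from its own state: the
%% head of bqss holds the backing-queue state for the max priority.
invoke(Mod, Fun, State = #state{bq = BQ, bqss = [{P, BQSP} | Rest]}) ->
    State#state{bqss = [{P, BQ:invoke(Mod, Fun, BQSP)} | Rest]}.
```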

The {badmatch,true} is solved in b2fd873 by calling BQ:delete_crashed/1, which ensures indexes are deleted by the backing queue after a crash and before the new slave starts. Each backing queue thus needs to know where its indexes live and how many it has to delete.
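Roughly, the shape of that fix (the surrounding function and recovery argument are illustrative, not the committed code):

```erlang
%% Before a replacement slave initialises its backing queue, drop
%% whatever indexes the crashed incarnation left behind. Because this
%% goes through the backing-queue callback, rabbit_priority_queue can
%% delete one index per priority level while rabbit_variable_queue
%% deletes just its own.
restart_after_crash(Q, BQ, Recover) ->
    BQ:delete_crashed(Q),
    BQ:init(Q, Recover, fun (_Ref, _MsgFun) -> ok end).
```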
However, this does not yet solve the original {badmatch,{error,writer_exists}}; it now just happens under new circumstances. Previously, the queue would not start at all because the index was still present. Since the race condition remains, the reset of the old index can now happen after the new slave has started, deleting the new index:

** {{badmatch,{error,enoent}},[{rabbit_queue_index,get_journal_handle,1,[{file,"src/rabbit_queue_index.erl"},{line,840}]},{rabbit_queue_index,flush_pre_publish_cache,1,[{file,"src/rabbit_queue_index.erl"},{line,358}]},{rabbit_queue_index,flush_pre_publish_cache,2,[{file,"src/rabbit_queue_index.erl"},{line,351}]},{rabbit_variable_queue,ui,1,[{file,"src/rabbit_variable_queue.erl"},{line,2603}]},{rabbit_variable_queue,batch_publish,4,[{file,"src/rabbit_variable_queue.erl"},{line,578}]},{rabbit_priority_queue,pick1,3,[{file,"src/rabbit_priority_queue.erl"},{line,488}]},{lists,foldl,3,[{file,"lists.erl"},{line,1262}]},{rabbit_mirror_queue_sync,batch_publish,4,[{file,"src/rabbit_mirror_queue_sync.erl"},{line,405}]}]}

This needs to be solved before the patch is merged. A fix for this will solve part of #545.

@dcorbacho
Contributor Author

These bugs seem to be solved with the following patches; I am no longer able to reproduce them in master.

The rationale behind the last one:
When mirroring is stopped, the master only waits a limited amount of time for the slaves to shut down (https://github.com/rabbitmq/rabbitmq-server/blob/stable/src/rabbit_mirror_queue_master.erl#L217). If the 'DOWN' messages don't arrive in time, it carries on and removes all slaves from the amqqueue record. The slaves might still be busy synchronising or stopping the many backing queues, one per priority. If mirroring is restarted shortly afterwards, an old incarnation of a slave can still be doing things (perhaps resetting its old index or processing data) while the new slave is being started, so the operations of both incarnations on the same queue will overlap and cause havoc.
We have no way to find out about those old incarnations, short of checking every single queue supervised by a rabbit_amqqueue_sup under the main rabbit_amqqueue_sup_sup.
Thus, a new field has been added to amqqueue to store those pids. This change can only go into master (future 3.7).
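A rough sketch of what storing those pids buys us (the function name is illustrative; the real patch keeps the pids in the new amqqueue field):

```erlang
%% Before starting a new incarnation of a slave, block until every pid
%% recorded from the previous, possibly still-terminating incarnation
%% is gone, so two incarnations can never touch the same queue index
%% at once. Monitoring an already-dead pid fires 'DOWN' immediately
%% with reason noproc, so this never blocks after a clean shutdown.
wait_for_old_incarnations(Pids) ->
    lists:foreach(
      fun (Pid) ->
              MRef = erlang:monitor(process, Pid),
              receive {'DOWN', MRef, process, Pid, _Reason} -> ok end
      end, Pids).
```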
