
distributed ddl hang #7507

Closed
yuzhichang opened this issue Oct 28, 2019 · 3 comments · Fixed by #8570
Labels
bug Confirmed user-visible misbehaviour in official release comp-dddl Distributed DDL feature

Comments

@yuzhichang
Contributor

yuzhichang commented Oct 28, 2019

Describe the bug or unexpected behaviour
I have a four-node ClickHouse cluster. Today I noticed that all distributed DDL statements hang.
Whenever I run the following CREATE statement on any node, I only get a response from one node (10.126.144.142). RENAME also hangs.

CREATE TABLE test1 ON CLUSTER cdp (
    id UInt32,
    name String
) engine=ReplacingMergeTree()
PARTITION BY intDiv(id, 1000000)
ORDER BY (id);
┌─host───────────┬─port─┬─status─┬─error──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─num_hosts_remaining─┬─num_hosts_active─┐
│ 10.126.144.142 │ 9000 │     57 │ Code: 57, e.displayText() = DB::Exception: Table default.cdp_idmapping_archive_delta already exists. (version 19.15.3.6) │                   3 │                0 │
└────────────────┴──────┴────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────┴──────────────────┘
↑ Progress: 1.00 rows, 58.00 B (2.69 rows/s., 156.23 B/s.)  24%
Received exception from server (version 19.15.3):
Code: 159. DB::Exception: Received from localhost:9000. DB::Exception: Watching task /clickhouse/task_queue/ddl/query-0000003068 is executing longer than distributed_ddl_task_timeout (=600) seconds. There are 3 unfinished hosts (0 of them are currently active), they are going to execute the query in background.

Then I restarted clickhouse-server on all nodes, and the same statement worked as expected.
The cluster had been up for about one week, and this is the first time I have noticed the problem.
#5295 inspired me to restart clickhouse-server.

How to reproduce

  • Which ClickHouse server version to use
    v19.15.3.6-stable
  • Which interface to use, if matters
    CLI
@yuzhichang yuzhichang added the bug Confirmed user-visible misbehaviour in official release label Oct 28, 2019
@filimonov filimonov added the comp-dddl Distributed DDL feature label Oct 28, 2019
@yuzhichang
Contributor Author

yuzhichang commented Jan 6, 2020

The distributed DDL hang problem occurred again today on one node of a cluster. Before that, the machine had rebooted several times for unknown reasons.
The following log shows that this node refused to execute the DDL:

2020.01.06 19:33:45.721174 [ 143 ] {19d35336-2ece-4938-a110-1cb1c696c0af} <Debug> executeQuery: (from [::1]:44488) ALTER TABLE cdp_campaign_flink ON CLUSTER cdp144 DELETE WHERE 1 
2020.01.06 19:33:45.736025 [ 118 ] {} <Debug> DDLWorker: Processing tasks
2020.01.06 19:33:45.736124 [ 113 ] {} <Debug> DDLWorker: Cleaning queue
2020.01.06 19:33:45.739425 [ 143 ] {19d35336-2ece-4938-a110-1cb1c696c0af} <Debug> executeQuery: Query pipeline:
DDLQueryStatusInputStream

2020.01.06 19:33:45.739603 [ 118 ] {} <Debug> DDLWorker: Will not execute task query-0000010991: There is no a local address in host list
2020.01.06 19:33:45.739669 [ 118 ] {} <Debug> DDLWorker: Waiting a watch
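
For context, here is a minimal sketch of the check behind the "There is no a local address in host list" message, under the assumption that DDLWorker compares each host listed in the task against this machine's interface addresses. The function names are illustrative, not the exact DDLWorker internals, and the real code also resolves host names and compares ports:

// Simplified sketch (names are illustrative, not the real DDLWorker code).
#include <Poco/Net/IPAddress.h>
#include <Poco/Net/NetworkInterface.h>
#include <string>
#include <vector>

// True if `address` matches one of this machine's interface addresses.
static bool isLocalAddress(const Poco::Net::IPAddress & address)
{
    for (const auto & iface : Poco::Net::NetworkInterface::list())
        if (iface.address() == address)
            return true;
    return false;
}

// Each DDL task carries the list of hosts that must execute it. If no
// entry matches a local interface address, the node skips the task and
// just waits for the next one -- which is exactly the hang seen above.
static bool taskIsForThisNode(const std::vector<std::string> & task_hosts)
{
    for (const auto & host : task_hosts)    // assume numeric IPs here
        if (isLocalAddress(Poco::Net::IPAddress(host)))
            return true;
    return false;
}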

Later I restarted clickhouse-server. The following log shows that the server then executed the very DDL it had previously refused:

2020.01.06 20:00:28.808415 [ 51 ] {9b56728f-4f5f-4bb9-ba40-350ba4f9ec9c} <Debug> DDLWorker: Processing task query-0000010991 (ALTER TABLE default.cdp_campaign_flink ON CLUSTER cdp144 DELETE WHERE 1 )
2020.01.06 20:00:28.809527 [ 51 ] {9b56728f-4f5f-4bb9-ba40-350ba4f9ec9c} <Debug> DDLWorker: Executing query: ALTER TABLE default.cdp_campaign_flink DELETE WHERE 1 

Other distributed DDLs also work after the restart.

@yuzhichang
Contributor Author

yuzhichang commented Jan 7, 2020

I reproduced the hang problem by rebooting the machine on which a very light ClickHouse instance (loading its tables takes only ~3 seconds) runs.

2020.01.07 16:14:15.117528 [ 76 ] {} <Debug> DDLWorker: Started DDLWorker thread
2020.01.07 16:14:15.117682 [ 88 ] {} <Debug> DDLWorker: Started DDLWorker cleanup thread
2020.01.07 16:14:15.119189 [ 1 ] {} <Information> Application: Listening http://[::]:8123
2020.01.07 16:14:15.119341 [ 1 ] {} <Information> Application: Listening for connections with native protocol (tcp): [::]:9000
2020.01.07 16:14:15.119464 [ 1 ] {} <Information> Application: Listening for replica communication (interserver) http://[::]:9009
2020.01.07 16:14:15.120280 [ 1 ] {} <Information> Application: Available RAM: 94.29 GiB; physical cores: 40; logical cores: 40.
2020.01.07 16:14:15.120316 [ 1 ] {} <Information> Application: Ready for connections.

...
2020.01.07 16:14:17.124622 [ 69 ] {} <Debug> ConfigReloader: Loading config '/etc/clickhouse-server/config.xml'
2020.01.07 16:14:18.917046 [ 76 ] {} <Error> void DB::DDLWorker::runMainThread(): Code: 999, e.displayText() = Coordination::Exception: All connection tries failed while connecting to ZooKeeper. Addresses: 10.126.144.123:2181, 10.126.144.124:2181, 10.126.144.125:2181
...
2020.01.07 16:14:42.434017 [ 76 ] {} <Debug> DDLWorker: Will not execute task query-0000011072: There is no a local address in host list
2020.01.07 16:14:42.434138 [ 76 ] {} <Debug> DDLWorker: Waiting a watch

It looks like ClickHouse starts so fast that it fails to obtain its own IP address.

@yuzhichang
Contributor Author

yuzhichang commented Jan 8, 2020

I added some debug code in dbms/src/Common/isLocalAddress.cpp to dump the addresses of all interfaces.
The following is the log from when the distributed DDL hang problem occurs:

Jan  8 11:42:35 shoyclp clickhouse-server: interface address of lo: 127.0.0.1
Jan  8 11:42:35 shoyclp clickhouse-server: interface address of lo: ::1%lo
Jan  8 11:42:35 shoyclp clickhouse-server: interface address of eth0: fe80::250:56ff:fe92:149d%eth0

After systemctl restart clickhouse-server and running the DELETE statement again, I got the following log:

Jan  8 11:43:57 shoyclp clickhouse-server: interface address of lo: 127.0.0.1
Jan  8 11:43:57 shoyclp clickhouse-server: interface address of lo: ::1%lo
Jan  8 11:43:57 shoyclp clickhouse-server: interface address of eth0: 10.126.144.158
Jan  8 11:43:57 shoyclp clickhouse-server: interface address of eth0: fe80::250:56ff:fe92:149d%eth0

In summary, the interface list cached at dbms/src/Common/isLocalAddress.cpp#L15 caused the problem: if ClickHouse starts very early during boot, it can enumerate the interfaces before they have any usable IP address, and that result is cached for the lifetime of the process.
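
Below is a minimal sketch of that caching pattern and one possible remedy, assuming the Poco::Net::NetworkInterface API that isLocalAddress.cpp builds on; this is not the verbatim source:

#include <Poco/Net/IPAddress.h>
#include <Poco/Net/NetworkInterface.h>
#include <algorithm>

bool isLocalAddressCached(const Poco::Net::IPAddress & address)
{
    // BUG: evaluated once, on the first call. If eth0 has no IPv4
    // address yet (e.g. right after boot), 10.126.144.158 never
    // appears in this list, no matter how long the server runs.
    static auto interfaces = Poco::Net::NetworkInterface::list();

    return std::any_of(interfaces.begin(), interfaces.end(),
        [&](const Poco::Net::NetworkInterface & iface)
        { return iface.address() == address; });
}

// One possible remedy: re-enumerate the interfaces on every call,
// so an address that appears after startup is eventually seen.
bool isLocalAddressFresh(const Poco::Net::IPAddress & address)
{
    auto interfaces = Poco::Net::NetworkInterface::list();
    return std::any_of(interfaces.begin(), interfaces.end(),
        [&](const Poco::Net::NetworkInterface & iface)
        { return iface.address() == address; });
}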
