
distributed ddl hang #7507

Closed
yuzhichang opened this issue Oct 28, 2019 · 3 comments · Fixed by #8570
Labels
bug Confirmed user-visible misbehaviour in official release comp-dddl Distributed DDL feature

Comments

@yuzhichang
Contributor

yuzhichang commented Oct 28, 2019

Describe the bug or unexpected behaviour
I have a four-node ClickHouse cluster. Today I noticed that all distributed DDL statements hang.
Whenever I run the following CREATE statement on any node, I only get a response from one node (10.126.144.142). RENAME also hangs.

CREATE TABLE test1 ON CLUSTER cdp (
    id UInt32,
    name String
) engine=ReplacingMergeTree()
PARTITION BY intDiv(id, 1000000)
ORDER BY (id);
┌─host───────────┬─port─┬─status─┬─error──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─num_hosts_remaining─┬─num_hosts_active─┐
│ 10.126.144.142 │ 9000 │     57 │ Code: 57, e.displayText() = DB::Exception: Table default.cdp_idmapping_archive_delta already exists. (version 19.15.3.6) │                   3 │                0 │
└────────────────┴──────┴────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────┴──────────────────┘
↑ Progress: 1.00 rows, 58.00 B (2.69 rows/s., 156.23 B/s.)  24%
Received exception from server (version 19.15.3):
Code: 159. DB::Exception: Received from localhost:9000. DB::Exception: Watching task /clickhouse/task_queue/ddl/query-0000003068 is executing longer than distributed_ddl_task_timeout (=600) seconds. There are 3 unfinished hosts (0 of them are currently active), they are going to execute the query in background.

Then I restarted clickhouse-server on all nodes, and the same statement worked as expected.
The cluster had been up for about one week, and this is the first time I have noticed the problem.
#5295 inspired me to restart clickhouse-server.

How to reproduce

  • Which ClickHouse server version to use
    v19.15.3.6-stable
  • Which interface to use, if matters
    CLI
@yuzhichang yuzhichang added the bug Confirmed user-visible misbehaviour in official release label Oct 28, 2019
@filimonov filimonov added the comp-dddl Distributed DDL feature label Oct 28, 2019
@yuzhichang
Contributor Author

yuzhichang commented Jan 6, 2020

The distributed DDL hang problem occurred again today on one node of a cluster. Before that, the machine had rebooted several times for unknown reasons.
The following log shows that this node refused to execute the DDL:

2020.01.06 19:33:45.721174 [ 143 ] {19d35336-2ece-4938-a110-1cb1c696c0af} <Debug> executeQuery: (from [::1]:44488) ALTER TABLE cdp_campaign_flink ON CLUSTER cdp144 DELETE WHERE 1 
2020.01.06 19:33:45.736025 [ 118 ] {} <Debug> DDLWorker: Processing tasks
2020.01.06 19:33:45.736124 [ 113 ] {} <Debug> DDLWorker: Cleaning queue
2020.01.06 19:33:45.739425 [ 143 ] {19d35336-2ece-4938-a110-1cb1c696c0af} <Debug> executeQuery: Query pipeline:
DDLQueryStatusInputStream

2020.01.06 19:33:45.739603 [ 118 ] {} <Debug> DDLWorker: Will not execute task query-0000010991: There is no a local address in host list
2020.01.06 19:33:45.739669 [ 118 ] {} <Debug> DDLWorker: Waiting a watch
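
For context, here is a minimal sketch of the check behind the "There is no a local address in host list" message, under the assumption that DDLWorker compares each host listed in the task against this machine's interface addresses. The function names are illustrative, not the exact DDLWorker internals, and the real code also resolves host names and compares ports:

// Simplified sketch (names are illustrative, not the real DDLWorker code).
#include <Poco/Net/IPAddress.h>
#include <Poco/Net/NetworkInterface.h>
#include <string>
#include <vector>

// True if `address` matches one of this machine's interface addresses.
static bool isLocalAddress(const Poco::Net::IPAddress & address)
{
    for (const auto & iface : Poco::Net::NetworkInterface::list())
        if (iface.address() == address)
            return true;
    return false;
}

// Each DDL task carries the list of hosts that must execute it. If no
// entry matches a local interface address, the node skips the task and
// just waits for the next one -- which is exactly the hang seen above.
static bool taskIsForThisNode(const std::vector<std::string> & task_hosts)
{
    for (const auto & host : task_hosts)    // assume numeric IPs here
        if (isLocalAddress(Poco::Net::IPAddress(host)))
            return true;
    return false;
}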

Later I restarted clickhouse-server. The following log shows that the server then executed the very DDL it had previously refused:

2020.01.06 20:00:28.808415 [ 51 ] {9b56728f-4f5f-4bb9-ba40-350ba4f9ec9c} <Debug> DDLWorker: Processing task query-0000010991 (ALTER TABLE default.cdp_campaign_flink ON CLUSTER cdp144 DELETE WHERE 1 )
2020.01.06 20:00:28.809527 [ 51 ] {9b56728f-4f5f-4bb9-ba40-350ba4f9ec9c} <Debug> DDLWorker: Executing query: ALTER TABLE default.cdp_campaign_flink DELETE WHERE 1 

Other distributed DDLs also work after the restart.

@yuzhichang
Contributor Author

yuzhichang commented Jan 7, 2020

I reproduced the hang problem by rebooting the machine on which a very light ClickHouse instance (loading its tables takes only ~3 seconds) runs.

2020.01.07 16:14:15.117528 [ 76 ] {} <Debug> DDLWorker: Started DDLWorker thread
2020.01.07 16:14:15.117682 [ 88 ] {} <Debug> DDLWorker: Started DDLWorker cleanup thread
2020.01.07 16:14:15.119189 [ 1 ] {} <Information> Application: Listening http://[::]:8123
2020.01.07 16:14:15.119341 [ 1 ] {} <Information> Application: Listening for connections with native protocol (tcp): [::]:9000
2020.01.07 16:14:15.119464 [ 1 ] {} <Information> Application: Listening for replica communication (interserver) http://[::]:9009
2020.01.07 16:14:15.120280 [ 1 ] {} <Information> Application: Available RAM: 94.29 GiB; physical cores: 40; logical cores: 40.
2020.01.07 16:14:15.120316 [ 1 ] {} <Information> Application: Ready for connections.

...
2020.01.07 16:14:17.124622 [ 69 ] {} <Debug> ConfigReloader: Loading config '/etc/clickhouse-server/config.xml'
2020.01.07 16:14:18.917046 [ 76 ] {} <Error> void DB::DDLWorker::runMainThread(): Code: 999, e.displayText() = Coordination::Exception: All connection tries failed while connecting to ZooKeeper. Addresses: 10.126.144.123:2181, 10.126.144.124:2181, 10.126.144.125:2181
...
2020.01.07 16:14:42.434017 [ 76 ] {} <Debug> DDLWorker: Will not execute task query-0000011072: There is no a local address in host list
2020.01.07 16:14:42.434138 [ 76 ] {} <Debug> DDLWorker: Waiting a watch

It looks like ClickHouse starts so fast that it fails to obtain its own IP address.

@yuzhichang
Contributor Author

yuzhichang commented Jan 8, 2020

I added some debug code in dbms/src/Common/isLocalAddress.cpp to dump the addresses of all interfaces.
The following is the log from when the distributed DDL hang problem occurs:

Jan  8 11:42:35 shoyclp clickhouse-server: interface address of lo: 127.0.0.1
Jan  8 11:42:35 shoyclp clickhouse-server: interface address of lo: ::1%lo
Jan  8 11:42:35 shoyclp clickhouse-server: interface address of eth0: fe80::250:56ff:fe92:149d%eth0

After systemctl restart clickhouse-server and running the DELETE statement again, I got the following log:

Jan  8 11:43:57 shoyclp clickhouse-server: interface address of lo: 127.0.0.1
Jan  8 11:43:57 shoyclp clickhouse-server: interface address of lo: ::1%lo
Jan  8 11:43:57 shoyclp clickhouse-server: interface address of eth0: 10.126.144.158
Jan  8 11:43:57 shoyclp clickhouse-server: interface address of eth0: fe80::250:56ff:fe92:149d%eth0

In summary, the interface list cached at dbms/src/Common/isLocalAddress.cpp#L15 caused the problem: if ClickHouse starts very early during boot, it can enumerate the interfaces before they have any usable IP address, and that result is cached for the lifetime of the process.
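
Below is a minimal sketch of that caching pattern and one possible remedy, assuming the Poco::Net::NetworkInterface API that isLocalAddress.cpp builds on; this is not the verbatim source:

#include <Poco/Net/IPAddress.h>
#include <Poco/Net/NetworkInterface.h>
#include <algorithm>

bool isLocalAddressCached(const Poco::Net::IPAddress & address)
{
    // BUG: evaluated once, on the first call. If eth0 has no IPv4
    // address yet (e.g. right after boot), 10.126.144.158 never
    // appears in this list, no matter how long the server runs.
    static auto interfaces = Poco::Net::NetworkInterface::list();

    return std::any_of(interfaces.begin(), interfaces.end(),
        [&](const Poco::Net::NetworkInterface & iface)
        { return iface.address() == address; });
}

// One possible remedy: re-enumerate the interfaces on every call,
// so an address that appears after startup is eventually seen.
bool isLocalAddressFresh(const Poco::Net::IPAddress & address)
{
    auto interfaces = Poco::Net::NetworkInterface::list();
    return std::any_of(interfaces.begin(), interfaces.end(),
        [&](const Poco::Net::NetworkInterface & iface)
        { return iface.address() == address; });
}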
