
roachtest: import/tpch/nodes=8 failed [release-2.1] #32898

Closed
cockroach-teamcity opened this issue Dec 6, 2018 · 13 comments

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/1146a03cc217cb57bdddd795e2d2fe2806c64985

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1042719&tab=buildLog

The test failed on release-2.1:
	test.go:1025: test timed out (4h0m0s)
	test.go:630,cluster.go:1488,import.go:132: context canceled

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Dec 6, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Dec 6, 2018
@tbg
Member

tbg commented Dec 6, 2018

release-2.1. That's still expected to fail there, though I'm going to have to backport some subset of the changes that fix this.

@tbg tbg changed the title roachtest: import/tpch/nodes=8 failed roachtest: import/tpch/nodes=8 failed [release-2.1] Dec 6, 2018
@tbg tbg assigned tbg and unassigned andreimatei Dec 10, 2018
tbg added a commit to tbg/cockroach that referenced this issue Dec 11, 2018
…snapshots

Backports cockroachdb#32594. This didn't apply cleanly, but only because I never
cherry-picked cockroachdb#32594 and cockroachdb#32594 refers to a variable introduced within.

Fixes cockroachdb#32898.
Fixes cockroachdb#32900.
Fixes cockroachdb#32895.

/cc @cockroachdb/release

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/138369ec8091afb0ccabdc1710928b8a47f3797f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1050805&tab=buildLog

The test failed on release-2.1:
	test.go:1025: test timed out (4h0m0s)
	test.go:630,cluster.go:1486,import.go:132: context canceled

@tbg
Member

tbg commented Dec 12, 2018

Well, this sucks. On the plus side, there are no warnings about lots of snapshots, but there are a few stuck ranges, which are usually caused by pending snapshots. Going to have to dig into the logs.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/f524717e66973da0c11655c860d4b131f82409b9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1052892&tab=buildLog

The test failed on master:
	test.go:630,test.go:642: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-1052892-import-tpch-nodes-8 -n 8 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b returned:
		stderr:
		
		stdout:
		e: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20181030 --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script385190446 --project cockroach-ephemeral]
		Output: WARNING: Some requests generated warnings:
		 - The resource 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20181030' is deprecated. A suggested replacement is 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20181204'.
		
		ERROR: (gcloud.compute.instances.create) Could not fetch resource:
		 - The zone 'projects/cockroach-ephemeral/zones/us-central1-b' does not have enough resources available to fulfill the request.  Try a different zone, or try again later.
		
		: exit status 1
		Cleaning up...
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/153fcf347cd6a53517f607ec1e6c7c14193640dd

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1052961&tab=buildLog

The test failed on release-2.1:
	test.go:630,test.go:642: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-1052961-import-tpch-nodes-8 -n 8 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b returned:
		stderr:
		
		stdout:
		e: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20181030 --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script480147032 --project cockroach-ephemeral]
		Output: WARNING: Some requests generated warnings:
		 - The resource 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20181030' is deprecated. A suggested replacement is 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20181204'.
		
		ERROR: (gcloud.compute.instances.create) Could not fetch resource:
		 - The zone 'projects/cockroach-ephemeral/zones/us-central1-b' does not have enough resources available to fulfill the request.  Try a different zone, or try again later.
		
		: exit status 1
		Cleaning up...
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/0c87b11cb99ba5c677c95ded55dcba385928474e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1054703&tab=buildLog

The test failed on release-2.1:
	test.go:1023: test timed out (4h0m0s)
	test.go:628,cluster.go:1486,import.go:132: context canceled

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7efc92a4dec689efc855ecd382a6f6b6065b98ec

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1055192&tab=buildLog

The test failed on master:
	test.go:628,cluster.go:1486,import.go:132: pq: gs://cockroach-fixtures/tpch-csv/sf-100/lineitem.tbl.4: row 72831574: reading CSV record: read tcp 10.128.0.38:60150->108.177.111.128:443: read: connection reset by peer

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/334cce5d61b32b0bb4a300668522c38fb9d6b96d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1056651&tab=buildLog

The test failed on master:
	test.go:628,cluster.go:1486,import.go:132: pq: gs://cockroach-fixtures/tpch-csv/sf-100/lineitem.tbl.5: row 679114: reading CSV record: read tcp 10.128.0.49:35838->173.194.198.128:443: read: connection reset by peer

@tbg
Member

tbg commented Dec 17, 2018

This one is a real failure with stuck Raft commands. I'm not seeing the usual snapshot problems (which is good, because I backported fixes for them). In fact, it doesn't seem that any Raft snapshots were sent, or even attempted, in the last two hours of the test. And yet, there are four Raft groups with stuck commands:

cockroach debug merge-logs *.logs/*.log | grep -F 'have been waiting'
teamcity-1054703-import-tpch-nodes-8-0004> W181214 01:07:46.558791 48146 storage/replica.go:3306  [n7,s3,r497/2:/Table/53/1/10{89480…-92153…}] have been waiting 1m0s for proposing command AddSSTable [/Table/53/1/108948070/4/0,/Table/53/1/109215335/4)
teamcity-1054703-import-tpch-nodes-8-0003> W181214 01:23:02.602603 130435 storage/replica.go:3306  [n5,s7,r3152/4:/Table/53/5/8{466/59…-616/32…}] have been waiting 1m0s for proposing command AddSSTable [/Table/53/5/8608/378203236/2/0,/Table/53/5/8616/329036386/5)
teamcity-1054703-import-tpch-nodes-8-0003> W181214 01:23:02.982719 128438 storage/quota_pool.go:107  [n5,s7,r3152/4:/Table/53/5/8{466/59…-616/32…}] have been waiting 1m0.000259203s attempting to acquire 441 B of proposal quota
teamcity-1054703-import-tpch-nodes-8-0006> W181214 01:23:08.415716 51275 storage/replica.go:3306  [n4,s2,r3053/4:/Table/53/4/85{2263/…-5374/…}] have been waiting 1m0s for proposing command AddSSTable [/Table/53/4/852263/261085636/3/0,/Table/53/4/855374/381603302/7)
teamcity-1054703-import-tpch-nodes-8-0004> W181214 01:23:18.607483 82063 storage/replica.go:3306  [n7,s3,r3167/2:/Table/53/4/7{29681/…-32791/…}] have been waiting 1m0s for proposing command AddSSTable [/Table/53/4/729681/260812005/1/0,/Table/53/4/732791/146652451/3)

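The quota_pool.go warning comes from the proposal quota mechanism: a range's leaseholder hands out a fixed budget of in-flight proposal bytes and only returns quota once the live followers have caught up, so a wedged proposal (as here) makes later writers block on acquisition. A rough Go sketch of that acquire/release-and-warn pattern (hypothetical; not the actual storage/quota_pool.go implementation):

package main

import (
	"log"
	"sync"
	"time"
)

// quotaPool is a toy stand-in for the proposal quota pool: acquire blocks
// until enough bytes are available and logs while it waits, mirroring the
// "have been waiting ... attempting to acquire ... of proposal quota" line.
type quotaPool struct {
	mu    sync.Mutex
	cond  *sync.Cond
	avail int64
}

func newQuotaPool(size int64) *quotaPool {
	qp := &quotaPool{avail: size}
	qp.cond = sync.NewCond(&qp.mu)
	return qp
}

// release returns quota; in the real pool this happens once all live
// followers have caught up to the proposal that borrowed the bytes.
func (qp *quotaPool) release(n int64) {
	qp.mu.Lock()
	qp.avail += n
	qp.mu.Unlock()
	qp.cond.Broadcast()
}

// acquire blocks until n bytes are free, warning once a minute while stuck.
func (qp *quotaPool) acquire(n int64) {
	start := time.Now()
	stop := make(chan struct{})
	go func() {
		ticker := time.NewTicker(time.Minute)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case <-ticker.C:
				log.Printf("have been waiting %s attempting to acquire %d B of proposal quota",
					time.Since(start), n)
			}
		}
	}()
	qp.mu.Lock()
	for qp.avail < n {
		qp.cond.Wait()
	}
	qp.avail -= n
	qp.mu.Unlock()
	close(stop)
}

func main() {
	qp := newQuotaPool(1 << 20) // e.g. a 1 MiB budget per range
	qp.acquire(441)             // the 441 B acquisition from the log above
	// ... proposal in flight; once followers catch up:
	qp.release(441)
}
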
@tbg
Member

tbg commented Dec 17, 2018

r3152 logs this a few times:

E181214 03:08:58.330618 176227 storage/queue.go:839 [n5,replicate,s7,r3152/4:/Table/53/5/8{466/59…-616/32…}] change replicas of r3152 failed: context deadline exceeded

This suggests that something's supposed to be done with this range. There is no other chatter from the replicate queue, which indicates that it gets stuck really early.
r497 has something similar for the consistency checker.

r3053 has no activity after the "have been waiting..." message.

r3167 has a stuck request and then does a replication change (but the request seems to remain stuck?)

teamcity-1054703-import-tpch-nodes-8-0001> I181214 01:22:24.148103 82963 storage/replica_raftstorage.go:818 [n1,s1,r3167/?:{-}] applying preemptive snapshot at index 24 (id=cfc563e9, encoded size=5071, 1 rocksdb batches, 14 log entries)
teamcity-1054703-import-tpch-nodes-8-0001> I181214 01:22:24.161145 82963 storage/replica_raftstorage.go:824 [n1,s1,r3167/?:/Table/53/4/7{29681/…-32791/…}] applied preemptive snapshot in 13ms [clear=0ms batch=0ms entries=0ms commit=13ms]
teamcity-1054703-import-tpch-nodes-8-0004> I181214 01:22:24.161465 82263 storage/replica_command.go:898 [n7,replicate,s3,r3167/2:/Table/53/4/7{29681/…-32791/…}] change replicas (ADD_REPLICA (n1,s1):5): read existing descriptor r3167:/Table/53/4/7{29681/260359173/4-32791/146652451/3} [(n8,s6):1, (n7,s3):2, (n6,s8):4, next=5, gen=1]
teamcity-1054703-import-tpch-nodes-8-0004> W181214 01:23:18.607483 82063 storage/replica.go:3306 [n7,s3,r3167/2:/Table/53/4/7{29681/…-32791/…}] have been waiting 1m0s for proposing command AddSSTable [/Table/53/4/729681/260812005/1/0,/Table/53/4/732791/146652451/3)
teamcity-1054703-import-tpch-nodes-8-0004> E181214 01:23:24.146269 82263 storage/queue.go:839 [n7,replicate,s3,r3167/2:/Table/53/4/7{29681/…-32791/…}] change replicas of r3167 failed: context deadline exceeded
teamcity-1054703-import-tpch-nodes-8-0001> I181214 01:30:49.769692 90204 storage/replica_gc_queue.go:249 [n1,replicaGC,s1,r3167/?:/Table/53/4/7{29681/…-32791/…}] removing young preemptive snapshot (0 ticks)
teamcity-1054703-import-tpch-nodes-8-0001> I181214 01:30:49.769710 90204 storage/store.go:2726 [n1,replicaGC,s1,r3167/?:/Table/53/4/7{29681/…-32791/…}] removing replica r3167/0
teamcity-1054703-import-tpch-nodes-8-0001> I181214 01:30:49.782346 90204 storage/replica.go:907 [n1,replicaGC,s1,r3167/?:/Table/53/4/7{29681/…-32791/…}] removed 5 (0+5) keys in 12ms [clear=0ms commit=12ms]

The test timed out at 3:14, a lot later than many of these events.

@tbg
Member

tbg commented Dec 18, 2018

Looking at a repro. This is one of the five ranges that have stuck proposals:

[screenshot: range status page for one of the stuck ranges]

What's weird is that many ranges seem to have four replicas (the alloc simulation wants to remove one, as you'd expect). Other than that, the status seems "healthy" except for the pending command. I'm going to logspy in.
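For reference, "logspy" is the /debug/logspy endpoint on a node's HTTP port, which streams that node's live log output with optional filtering. A small Go sketch of tailing it (the localhost address, default port 8080, and the r62 grep pattern are assumptions for illustration):

package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Stream 30s of live logs mentioning r62 from the node's admin HTTP port.
	url := "http://localhost:8080/debug/logspy?count=10000&duration=30s&grep=r62"
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		log.Fatal(err)
	}
}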

@tbg
Member

tbg commented Dec 18, 2018

Ah, that was easy:

I181218 10:32:44.934749 186 storage/replica.go:4870 [n5,s7,r62/4:/Table/53/1/4{086528…-355264…}] pending commands: sent 0 back to client, reproposing 1 (at 29.14) reasonTicks
I181218 10:32:44.943848 186 storage/replica.go:3893 [n5,s7,r62/4:/Table/53/1/4{086528…-355264…}] proposing command 2ed3f8b14ad3d34b: 1 AddSstable
I181218 10:32:44.953369 186 vendor/go.etcd.io/etcd/raft/raft.go:974 [n5,s7,r62/4:/Table/53/1/4{086528…-355264…}] 4 appending new entries to log would exceed uncommitted entry size limit; dropping proposal

I thought I had backported all of these fixes. Ah, the most recent Raft vendor bump picks up only etcd-io/etcd#10167. I bet it doesn't pick up the fix for that PR, etcd-io/etcd#10199. Yeah, it doesn't.
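For context: etcd-io/etcd#10167 introduced Config.MaxUncommittedEntriesSize, which makes a raft leader drop proposals outright once its uncommitted log exceeds that many bytes, and the follow-up etcd-io/etcd#10199 admits a proposal whenever the uncommitted log is empty, so a single large command (such as these AddSSTables) can no longer be dropped indefinitely. A minimal sketch of the mechanism against go.etcd.io/etcd/raft (a toy single-node group; the tiny limit and the deliberately undrained Ready channel are illustration-only assumptions):

package main

import (
	"context"
	"fmt"
	"log"

	"go.etcd.io/etcd/raft"
)

func main() {
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         raft.NewMemoryStorage(),
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
		// The limit from etcd-io/etcd#10167: once the leader's uncommitted
		// log would exceed this many bytes, further proposals are dropped.
		MaxUncommittedEntriesSize: 1 << 10, // 1 KiB, tiny for illustration
	}
	n := raft.StartNode(cfg, []raft.Peer{{ID: 1}})
	defer n.Stop()
	ctx := context.Background()

	// Single-voter group: the node elects itself immediately.
	if err := n.Campaign(ctx); err != nil {
		log.Fatal(err)
	}

	// The first proposal is admitted even if large; with the etcd-io/etcd#10199
	// fix, an empty uncommitted log always accepts one proposal, so a lone
	// oversized AddSSTable can't be dropped forever.
	if err := n.Propose(ctx, []byte("small")); err != nil {
		log.Fatal(err)
	}

	// With uncommitted bytes already pending (we never drain Ready here, so
	// nothing is acknowledged as committed), a proposal that would blow past
	// the limit is dropped, which is the log line above.
	if err := n.Propose(ctx, make([]byte, 4<<10)); err == raft.ErrProposalDropped {
		fmt.Println("dropped: would exceed uncommitted entry size limit")
	}
}
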

@petermattis
Collaborator

Fixed by #33228.
