
roachtest: import/tpch/nodes=8 failed [release-2.1] #32898

Closed
cockroach-teamcity opened this issue Dec 6, 2018 · 13 comments

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/1146a03cc217cb57bdddd795e2d2fe2806c64985

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1042719&tab=buildLog

The test failed on release-2.1:
	test.go:1025: test timed out (4h0m0s)
	test.go:630,cluster.go:1488,import.go:132: context canceled

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Dec 6, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Dec 6, 2018
@tbg
Member

tbg commented Dec 6, 2018

release-2.1. That's still expected to fail there, though I'm going to have to backport some subset of the changes that fix this.

@tbg tbg changed the title roachtest: import/tpch/nodes=8 failed roachtest: import/tpch/nodes=8 failed [release-2.1] Dec 6, 2018
@tbg tbg assigned tbg and unassigned andreimatei Dec 10, 2018
tbg added a commit to tbg/cockroach that referenced this issue Dec 11, 2018
…snapshots

Backports cockroachdb#32594. This didn't apply cleanly, but only because I never
cherry-picked cockroachdb#32594 and cockroachdb#32594 refers to a variable introduced within.

Fixes cockroachdb#32898.
Fixes cockroachdb#32900.
Fixes cockroachdb#32895.

/cc @cockroachdb/release

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/138369ec8091afb0ccabdc1710928b8a47f3797f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1050805&tab=buildLog

The test failed on release-2.1:
	test.go:1025: test timed out (4h0m0s)
	test.go:630,cluster.go:1486,import.go:132: context canceled

@tbg
Member

tbg commented Dec 12, 2018

Well, this sucks. On the plus side, there are no warnings about lots of snapshots, but there are a few stuck ranges, which are usually caused by pending snapshots. Going to have to dig into the logs.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/f524717e66973da0c11655c860d4b131f82409b9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1052892&tab=buildLog

The test failed on master:
	test.go:630,test.go:642: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-1052892-import-tpch-nodes-8 -n 8 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b returned:
		stderr:
		
		stdout:
		e: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20181030 --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script385190446 --project cockroach-ephemeral]
		Output: WARNING: Some requests generated warnings:
		 - The resource 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20181030' is deprecated. A suggested replacement is 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20181204'.
		
		ERROR: (gcloud.compute.instances.create) Could not fetch resource:
		 - The zone 'projects/cockroach-ephemeral/zones/us-central1-b' does not have enough resources available to fulfill the request.  Try a different zone, or try again later.
		
		: exit status 1
		Cleaning up...
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/153fcf347cd6a53517f607ec1e6c7c14193640dd

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1052961&tab=buildLog

The test failed on release-2.1:
	test.go:630,test.go:642: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-1052961-import-tpch-nodes-8 -n 8 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b returned:
		stderr:
		
		stdout:
		e: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20181030 --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script480147032 --project cockroach-ephemeral]
		Output: WARNING: Some requests generated warnings:
		 - The resource 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20181030' is deprecated. A suggested replacement is 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20181204'.
		
		ERROR: (gcloud.compute.instances.create) Could not fetch resource:
		 - The zone 'projects/cockroach-ephemeral/zones/us-central1-b' does not have enough resources available to fulfill the request.  Try a different zone, or try again later.
		
		: exit status 1
		Cleaning up...
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/0c87b11cb99ba5c677c95ded55dcba385928474e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1054703&tab=buildLog

The test failed on release-2.1:
	test.go:1023: test timed out (4h0m0s)
	test.go:628,cluster.go:1486,import.go:132: context canceled

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7efc92a4dec689efc855ecd382a6f6b6065b98ec

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1055192&tab=buildLog

The test failed on master:
	test.go:628,cluster.go:1486,import.go:132: pq: gs://cockroach-fixtures/tpch-csv/sf-100/lineitem.tbl.4: row 72831574: reading CSV record: read tcp 10.128.0.38:60150->108.177.111.128:443: read: connection reset by peer

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/334cce5d61b32b0bb4a300668522c38fb9d6b96d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1056651&tab=buildLog

The test failed on master:
	test.go:628,cluster.go:1486,import.go:132: pq: gs://cockroach-fixtures/tpch-csv/sf-100/lineitem.tbl.5: row 679114: reading CSV record: read tcp 10.128.0.49:35838->173.194.198.128:443: read: connection reset by peer

@tbg
Member

tbg commented Dec 17, 2018

This one is a real failure with stuck Raft commands. I'm not seeing the usual snapshot problems (which is good, because I backported fixes for them). In fact, it doesn't seem that any Raft snapshots were sent, or even attempted, in the last two hours of the test. And yet, there are four Raft groups with stuck commands:

cockroach debug merge-logs *.logs/*.log | grep -F 'have been waiting'
teamcity-1054703-import-tpch-nodes-8-0004> W181214 01:07:46.558791 48146 storage/replica.go:3306  [n7,s3,r497/2:/Table/53/1/10{89480…-92153…}] have been waiting 1m0s for proposing command AddSSTable [/Table/53/1/108948070/4/0,/Table/53/1/109215335/4)
teamcity-1054703-import-tpch-nodes-8-0003> W181214 01:23:02.602603 130435 storage/replica.go:3306  [n5,s7,r3152/4:/Table/53/5/8{466/59…-616/32…}] have been waiting 1m0s for proposing command AddSSTable [/Table/53/5/8608/378203236/2/0,/Table/53/5/8616/329036386/5)
teamcity-1054703-import-tpch-nodes-8-0003> W181214 01:23:02.982719 128438 storage/quota_pool.go:107  [n5,s7,r3152/4:/Table/53/5/8{466/59…-616/32…}] have been waiting 1m0.000259203s attempting to acquire 441 B of proposal quota
teamcity-1054703-import-tpch-nodes-8-0006> W181214 01:23:08.415716 51275 storage/replica.go:3306  [n4,s2,r3053/4:/Table/53/4/85{2263/…-5374/…}] have been waiting 1m0s for proposing command AddSSTable [/Table/53/4/852263/261085636/3/0,/Table/53/4/855374/381603302/7)
teamcity-1054703-import-tpch-nodes-8-0004> W181214 01:23:18.607483 82063 storage/replica.go:3306  [n7,s3,r3167/2:/Table/53/4/7{29681/…-32791/…}] have been waiting 1m0s for proposing command AddSSTable [/Table/53/4/729681/260812005/1/0,/Table/53/4/732791/146652451/3)

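The quota_pool.go warning comes from the proposal quota mechanism: a range's leaseholder hands out a fixed budget of in-flight proposal bytes and only returns quota once the live followers have caught up, so a wedged proposal (as here) makes later writers block on acquisition. A rough Go sketch of that acquire/release-and-warn pattern (hypothetical; not the actual storage/quota_pool.go implementation):

package main

import (
	"log"
	"sync"
	"time"
)

// quotaPool is a toy stand-in for the proposal quota pool: acquire blocks
// until enough bytes are available and logs while it waits, mirroring the
// "have been waiting ... attempting to acquire ... of proposal quota" line.
type quotaPool struct {
	mu    sync.Mutex
	cond  *sync.Cond
	avail int64
}

func newQuotaPool(size int64) *quotaPool {
	qp := &quotaPool{avail: size}
	qp.cond = sync.NewCond(&qp.mu)
	return qp
}

// release returns quota; in the real pool this happens once all live
// followers have caught up to the proposal that borrowed the bytes.
func (qp *quotaPool) release(n int64) {
	qp.mu.Lock()
	qp.avail += n
	qp.mu.Unlock()
	qp.cond.Broadcast()
}

// acquire blocks until n bytes are free, warning once a minute while stuck.
func (qp *quotaPool) acquire(n int64) {
	start := time.Now()
	stop := make(chan struct{})
	go func() {
		ticker := time.NewTicker(time.Minute)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case <-ticker.C:
				log.Printf("have been waiting %s attempting to acquire %d B of proposal quota",
					time.Since(start), n)
			}
		}
	}()
	qp.mu.Lock()
	for qp.avail < n {
		qp.cond.Wait()
	}
	qp.avail -= n
	qp.mu.Unlock()
	close(stop)
}

func main() {
	qp := newQuotaPool(1 << 20) // e.g. a 1 MiB budget per range
	qp.acquire(441)             // the 441 B acquisition from the log above
	// ... proposal in flight; once followers catch up:
	qp.release(441)
}
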
@tbg
Member

tbg commented Dec 17, 2018

r3152 logs this a few times:

E181214 03:08:58.330618 176227 storage/queue.go:839 [n5,replicate,s7,r3152/4:/Table/53/5/8{466/59…-616/32…}] change replicas of r3152 failed: context deadline exceeded

This suggests that something's supposed to be done with this range. There is no other chatter from the replicate queue, which indicates that it gets stuck really early.
r497 has something similar for the consistency checker.

r3053 has no activity after the "have been waiting..." message.

r3167 has a stuck request and then does a replication change (but the request seems to remain stuck?)

teamcity-1054703-import-tpch-nodes-8-0001> I181214 01:22:24.148103 82963 storage/replica_raftstorage.go:818 [n1,s1,r3167/?:{-}] applying preemptive snapshot at index 24 (id=cfc563e9, encoded size=5071, 1 rocksdb batches, 14 log entries)
teamcity-1054703-import-tpch-nodes-8-0001> I181214 01:22:24.161145 82963 storage/replica_raftstorage.go:824 [n1,s1,r3167/?:/Table/53/4/7{29681/…-32791/…}] applied preemptive snapshot in 13ms [clear=0ms batch=0ms entries=0ms commit=13ms]
teamcity-1054703-import-tpch-nodes-8-0004> I181214 01:22:24.161465 82263 storage/replica_command.go:898 [n7,replicate,s3,r3167/2:/Table/53/4/7{29681/…-32791/…}] change replicas (ADD_REPLICA (n1,s1):5): read existing descriptor r3167:/Table/53/4/7{29681/260359173/4-32791/146652451/3} [(n8,s6):1, (n7,s3):2, (n6,s8):4, next=5, gen=1]
teamcity-1054703-import-tpch-nodes-8-0004> W181214 01:23:18.607483 82063 storage/replica.go:3306 [n7,s3,r3167/2:/Table/53/4/7{29681/…-32791/…}] have been waiting 1m0s for proposing command AddSSTable [/Table/53/4/729681/260812005/1/0,/Table/53/4/732791/146652451/3)
teamcity-1054703-import-tpch-nodes-8-0004> E181214 01:23:24.146269 82263 storage/queue.go:839 [n7,replicate,s3,r3167/2:/Table/53/4/7{29681/…-32791/…}] change replicas of r3167 failed: context deadline exceeded
teamcity-1054703-import-tpch-nodes-8-0001> I181214 01:30:49.769692 90204 storage/replica_gc_queue.go:249 [n1,replicaGC,s1,r3167/?:/Table/53/4/7{29681/…-32791/…}] removing young preemptive snapshot (0 ticks)
teamcity-1054703-import-tpch-nodes-8-0001> I181214 01:30:49.769710 90204 storage/store.go:2726 [n1,replicaGC,s1,r3167/?:/Table/53/4/7{29681/…-32791/…}] removing replica r3167/0
teamcity-1054703-import-tpch-nodes-8-0001> I181214 01:30:49.782346 90204 storage/replica.go:907 [n1,replicaGC,s1,r3167/?:/Table/53/4/7{29681/…-32791/…}] removed 5 (0+5) keys in 12ms [clear=0ms commit=12ms]

The test timed out at 3:14, a lot later than many of these events.

@tbg
Member

tbg commented Dec 18, 2018

Looking at a repro. This is one of the five ranges that have stuck proposals:

[screenshot: range status page for one of the stuck ranges]

What's weird is that many ranges seem to have four replicas (the alloc simulation wants to remove one, as you'd expect). Other than that, the status seems "healthy" except for the pending command. I'm going to logspy in.
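For reference, "logspy" is the /debug/logspy endpoint on a node's HTTP port, which streams that node's live log output with optional filtering. A small Go sketch of tailing it (the localhost address, default port 8080, and the r62 grep pattern are assumptions for illustration):

package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Stream 30s of live logs mentioning r62 from the node's admin HTTP port.
	url := "http://localhost:8080/debug/logspy?count=10000&duration=30s&grep=r62"
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		log.Fatal(err)
	}
}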

@tbg
Member

tbg commented Dec 18, 2018

Ah, that was easy:

I181218 10:32:44.934749 186 storage/replica.go:4870 [n5,s7,r62/4:/Table/53/1/4{086528…-355264…}] pending commands: sent 0 back to client, reproposing 1 (at 29.14) reasonTicks
I181218 10:32:44.943848 186 storage/replica.go:3893 [n5,s7,r62/4:/Table/53/1/4{086528…-355264…}] proposing command 2ed3f8b14ad3d34b: 1 AddSstable
I181218 10:32:44.953369 186 vendor/go.etcd.io/etcd/raft/raft.go:974 [n5,s7,r62/4:/Table/53/1/4{086528…-355264…}] 4 appending new entries to log would exceed uncommitted entry size limit; dropping proposal

I thought I had backported all of these fixes. Ah, the most recent Raft vendor bump picks up only etcd-io/etcd#10167. I bet it doesn't pick up the fix for that PR, etcd-io/etcd#10199. Yeah, it doesn't.
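For context: etcd-io/etcd#10167 introduced Config.MaxUncommittedEntriesSize, which makes a raft leader drop proposals outright once its uncommitted log exceeds that many bytes, and the follow-up etcd-io/etcd#10199 admits a proposal whenever the uncommitted log is empty, so a single large command (such as these AddSSTables) can no longer be dropped indefinitely. A minimal sketch of the mechanism against go.etcd.io/etcd/raft (a toy single-node group; the tiny limit and the deliberately undrained Ready channel are illustration-only assumptions):

package main

import (
	"context"
	"fmt"
	"log"

	"go.etcd.io/etcd/raft"
)

func main() {
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         raft.NewMemoryStorage(),
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
		// The limit from etcd-io/etcd#10167: once the leader's uncommitted
		// log would exceed this many bytes, further proposals are dropped.
		MaxUncommittedEntriesSize: 1 << 10, // 1 KiB, tiny for illustration
	}
	n := raft.StartNode(cfg, []raft.Peer{{ID: 1}})
	defer n.Stop()
	ctx := context.Background()

	// Single-voter group: the node elects itself immediately.
	if err := n.Campaign(ctx); err != nil {
		log.Fatal(err)
	}

	// The first proposal is admitted even if large; with the etcd-io/etcd#10199
	// fix, an empty uncommitted log always accepts one proposal, so a lone
	// oversized AddSSTable can't be dropped forever.
	if err := n.Propose(ctx, []byte("small")); err != nil {
		log.Fatal(err)
	}

	// With uncommitted bytes already pending (we never drain Ready here, so
	// nothing is acknowledged as committed), a proposal that would blow past
	// the limit is dropped, which is the log line above.
	if err := n.Propose(ctx, make([]byte, 4<<10)); err == raft.ErrProposalDropped {
		fmt.Println("dropped: would exceed uncommitted entry size limit")
	}
}
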

@petermattis
Collaborator

Fixed by #33228.
