Possible race condition with migrations #101654

tylersmalley · 2021-06-08T17:32:02Z

The backport of using the default distribution for all functional tests #94968 created many failures which were not seen during the original PR against master. These failures happen during the loading of esArchive data containing a .kibana index. During an investigation with @TinaHeiligers we determined that it appeared isolated to archives that were not utilizing an alias and directly creating a .kibana index and running migrations on it. The long-term fix here is to remove any usage of esArchiver with the saved object's indices and instead use the import API. This work has already begun, but it's going to take time and the concern here is this might be an actual issue. I think we need to understand if that is the case, or it's that we're doing something in esArchiver which is unsupported.

Minimal reproduction:

Switch to the 7.x branch (current SHA: 2e3b200)
Edit test/functional/config.js, removing the --oss Kibana server argument

diff --git a/test/functional/config.js b/test/functional/config.js
index 1048bd72dc5..b25d1da0218 100644
--- a/test/functional/config.js
+++ b/test/functional/config.js
@@ -40,7 +40,6 @@ export default async function ({ readConfigFile }) {
       ...commonConfig.get('kbnTestServer'),
       serverArgs: [
         ...commonConfig.get('kbnTestServer.serverArgs'),
-        '--oss',
         '--telemetry.optIn=false',
         '--savedObjects.maxImportPayloadBytes=10485760',
       ],

Download and extract esArchive which writes to the Kibana archive. Downloading the archive as there is a backport pending updating this mapping to use an alias. The esArchive this is taken from is the deprecations_service archive.
wget -qO- https://gist.github.com/tylersmalley/cffdf820615ef661eec666e3fb1a1b96/archive/7b260c33ab207026b3a3782741a9449358707e2c.zip | jar xvf /dev/stdin
Start the functional tests server
node scripts/functional_tests_server.js --config test/functional/config.js
Load the archive

node scripts/es_archiver.js -v --config test/functional/config.js load `pwd`/cffdf820615ef661eec666e3fb1a1b96-7b260c33ab207026b3a3782741a9449358707e2c

The archive might need loaded a few times, as the failure doesn't appear consistent.

 debg Loading config file from "/Users/tyler/elastic/kibana/test/functional/config.js"
 debg Loading config file from "/Users/tyler/elastic/kibana/test/common/config.js"
 info [/home/tyler/elastic/kibana/cffdf820615ef661eec666e3fb1a1b96-7b260c33ab207026b3a3782741a9449358707e2c]  Loading "mappings.json"
 info [/home/tyler/elastic/kibana/cffdf820615ef661eec666e3fb1a1b96-7b260c33ab207026b3a3782741a9449358707e2c]  Loading "data.json"
 info [/home/tyler/elastic/kibana/cffdf820615ef661eec666e3fb1a1b96-7b260c33ab207026b3a3782741a9449358707e2c]  Deleted existing index ".kibana_7.14.0_001"
 info [/home/tyler/elastic/kibana/cffdf820615ef661eec666e3fb1a1b96-7b260c33ab207026b3a3782741a9449358707e2c]  Deleted existing index ".kibana_task_manager_7.14.0_001"
 info [/home/tyler/elastic/kibana/cffdf820615ef661eec666e3fb1a1b96-7b260c33ab207026b3a3782741a9449358707e2c]  Created index ".kibana"
 debg [/home/tyler/elastic/kibana/cffdf820615ef661eec666e3fb1a1b96-7b260c33ab207026b3a3782741a9449358707e2c]  ".kibana" settings {"index":{"number_of_replicas":"0","number_of_shards":"1"}}
 info [deprecations_service] Indexed 1 docs into ".kibana"
 debg Migrating saved objects
ERROR [/home/tyler/elastic/kibana/cffdf820615ef661eec666e3fb1a1b96-7b260c33ab207026b3a3782741a9449358707e2c]  request failed (attempt=1/5): socket hang up
ERROR [/home/tyler/elastic/kibana/cffdf820615ef661eec666e3fb1a1b96-7b260c33ab207026b3a3782741a9449358707e2c]  request failed (attempt=2/5): connect ECONNREFUSED 127.0.0.1:5620
ERROR [/home/tyler/elastic/kibana/cffdf820615ef661eec666e3fb1a1b96-7b260c33ab207026b3a3782741a9449358707e2c]  request failed (attempt=3/5): connect ECONNREFUSED 127.0.0.1:5620

The failure case is the migration fails and crashes the server.

   │ info [o.e.c.m.MetadataDeleteIndexService] [Tylers-MacBook-Pro.local] [.kibana_7.14.0_001/M-ENkb_8TXi9l-8t1WwAAw] deleting index
   │ info [o.e.c.m.MetadataDeleteIndexService] [Tylers-MacBook-Pro.local] [.kibana_task_manager_7.14.0_001/FVnOSmrDSNaXcgNdWsWVog] deleting index
   │ info [o.e.c.m.MetadataCreateIndexService] [Tylers-MacBook-Pro.local] [.kibana] creating index, cause [api], templates [], shards [1]/[0]
   │ info [o.e.c.r.a.AllocationService] [Tylers-MacBook-Pro.local] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.kibana][0]]]).
   │ info [o.e.c.m.MetadataMappingService] [Tylers-MacBook-Pro.local] [.kibana/D5Ulw0umQk-b8IlxjRxj5g] update_mapping [_doc]
   │ proc [kibana] server    log   [11:41:17.548] [info][savedobjects-service] [.kibana_task_manager] INIT -> CREATE_NEW_TARGET. took: 5ms.
   │ proc [kibana] server    log   [11:41:17.552] [info][savedobjects-service] [.kibana] INIT -> LEGACY_SET_WRITE_BLOCK. took: 11ms.
   │ info [o.e.c.m.MetadataCreateIndexService] [Tylers-MacBook-Pro.local] [.kibana_task_manager_7.14.0_001] creating index, cause [api], templates [], shards [1]/[1]
   │ info [o.e.c.r.a.AllocationService] [Tylers-MacBook-Pro.local] updating number_of_replicas to [0] for indices [.kibana_task_manager_7.14.0_001]
   │ info [o.e.c.m.MetadataIndexStateService] [Tylers-MacBook-Pro.local] adding block write to indices [[.kibana/D5Ulw0umQk-b8IlxjRxj5g]]
   │ info [o.e.c.m.MetadataIndexStateService] [Tylers-MacBook-Pro.local] completed adding block write to indices [.kibana]
   │ proc [kibana] server    log   [11:41:18.424] [info][savedobjects-service] [.kibana] LEGACY_SET_WRITE_BLOCK -> LEGACY_CREATE_REINDEX_TARGET. took: 872ms.
   │ info [o.e.c.m.MetadataCreateIndexService] [Tylers-MacBook-Pro.local] [.kibana_pre6.5.0_001] creating index, cause [api], templates [], shards [1]/[1]
   │ info [o.e.c.r.a.AllocationService] [Tylers-MacBook-Pro.local] updating number_of_replicas to [0] for indices [.kibana_pre6.5.0_001]
   │ proc [kibana] server    log   [11:41:18.941] [info][savedobjects-service] [.kibana_task_manager] CREATE_NEW_TARGET -> MARK_VERSION_INDEX_READY. took: 1393ms.
   │ info [o.e.c.r.a.AllocationService] [Tylers-MacBook-Pro.local] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.kibana_pre6.5.0_001][0]]]).
   │ proc [kibana] server    log   [11:41:19.336] [info][savedobjects-service] [.kibana_task_manager] MARK_VERSION_INDEX_READY -> DONE. took: 395ms.
   │ proc [kibana] server    log   [11:41:19.336] [info][savedobjects-service] [.kibana_task_manager] Migration completed after 1793ms
   │ proc [kibana] server    log   [11:41:19.526] [info][savedobjects-service] [.kibana] LEGACY_CREATE_REINDEX_TARGET -> LEGACY_REINDEX. took: 1102ms.
   │ proc [kibana] server    log   [11:41:19.540] [info][savedobjects-service] [.kibana] LEGACY_REINDEX -> LEGACY_REINDEX_WAIT_FOR_TASK. took: 14ms.
   │ info [o.e.c.m.MetadataCreateIndexService] [Tylers-MacBook-Pro.local] [.tasks] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]
   │ info [o.e.c.r.a.AllocationService] [Tylers-MacBook-Pro.local] updating number_of_replicas to [0] for indices [.tasks]
   │ proc [kibana] Unhandled Promise rejection detected:
   │ proc [kibana] 
   │ proc [kibana] ResponseError: Saved object index alias [.kibana_7.14.0] not found: index_not_found_exception
   │ proc [kibana]     at onBody (/Users/tyler/elastic/kibana/node_modules/@elastic/elasticsearch/lib/Transport.js:337:23)
   │ proc [kibana]     at IncomingMessage.onEnd (/Users/tyler/elastic/kibana/node_modules/@elastic/elasticsearch/lib/Transport.js:264:11)
   │ proc [kibana]     at IncomingMessage.emit (events.js:388:22)
   │ proc [kibana]     at endReadableNT (internal/streams/readable.js:1336:12)
   │ proc [kibana]     at processTicksAndRejections (internal/process/task_queues.js:82:21) {
   │ proc [kibana]   meta: {
   │ proc [kibana]     body: { error: [Object], status: 404 },
   │ proc [kibana]     statusCode: 404,
   │ proc [kibana]     headers: {
   │ proc [kibana]       'content-type': 'application/json; charset=UTF-8',
   │ proc [kibana]       'content-length': '445'
   │ proc [kibana]     },
   │ proc [kibana]     meta: {
   │ proc [kibana]       context: null,
   │ proc [kibana]       request: [Object],
   │ proc [kibana]       name: 'elasticsearch-js',
   │ proc [kibana]       connection: [Object],
   │ proc [kibana]       attempts: 0,
   │ proc [kibana]       aborted: false
   │ proc [kibana]     }
   │ proc [kibana]   },
   │ proc [kibana]   isBoom: true,
   │ proc [kibana]   isServer: true,
   │ proc [kibana]   data: null,
   │ proc [kibana]   output: {
   │ proc [kibana]     statusCode: 500,
   │ proc [kibana]     payload: {
   │ proc [kibana]       message: 'An internal server error occurred',
   │ proc [kibana]       statusCode: 500,
   │ proc [kibana]       error: 'Internal Server Error'
   │ proc [kibana]     },
   │ proc [kibana]     headers: {}
   │ proc [kibana]   },
   │ proc [kibana]   [Symbol(SavedObjectsClientErrorCode)]: 'SavedObjectsClient/generalError'
   │ proc [kibana] }
   │ proc [kibana] 
   │ proc [kibana] Terminating process...
   │ proc [kibana]  server crashed  with status code 1

#101537 was created in hopes of circumventing this issue for now in the functional tests by using an alias to avoid the legacy migration step - however, that seems to have now created a similar failure on master, which is also flaky.

https://kibana-ci.elastic.co/job/elastic+kibana+master/14654/execution/node/794/log/

09:04:26           └-: discover sidebar
09:04:26             └-> "before all" hook in "discover sidebar"
09:04:26             └-> "before all" hook in "discover sidebar"
09:04:26               │ proc [kibana] {"log.level":"info","@timestamp":"2021-06-08T16:04:25.748Z","log":{"logger":"elastic-apm-node"},"ecs":{"version":"1.6.0"},"message":"Sending error to Elastic APM: {\"id\":\"b3fce02fd127bc3191214a6fa44d221d\"}"}
09:04:26               │ proc [kibana]   log   [16:04:25.751] [error][dataEnhanced][data_enhanced][plugins] ResponseError: index_not_found_exception: [index_not_found_exception] Reason: no such index [.kibana_8.0.0] and [require_alias] request flag is [true] and [.kibana_8.0.0] is not an alias
09:04:26               │ proc [kibana]     at onBody (/dev/shm/workspace/kibana-build-9/node_modules/@elastic/elasticsearch/lib/Transport.js:337:23)
09:04:26               │ proc [kibana]     at IncomingMessage.onEnd (/dev/shm/workspace/kibana-build-9/node_modules/@elastic/elasticsearch/lib/Transport.js:264:11)
09:04:26               │ proc [kibana]     at IncomingMessage.emit (events.js:388:22)
09:04:26               │ proc [kibana]     at endReadableNT (internal/streams/readable.js:1336:12)
09:04:26               │ proc [kibana]     at processTicksAndRejections (internal/process/task_queues.js:82:21) {

The text was updated successfully, but these errors were encountered:

elasticmachine · 2021-06-08T17:49:02Z

Pinging @elastic/kibana-core (Team:Core)

tylersmalley · 2021-06-08T23:04:56Z

Leaning towards this not so much being an issue as it is a problem with our functional tests. Turns out that some of our archives are old enough they need to be processed through the legacy migration, which is actually determined by it not being an alias. Additionally, when importing another esArchive we are not currently clearing out all aliases which seems to be an issue.

Same as elastic#101950 - these archives are causing issues with a non-oss build elastic#101654 Last remaining fix for elastic#101118 Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>

Same as #101950 - these archives are causing issues with a non-oss build #101654 Last remaining fix for #101118 Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>

…101992) Same as elastic#101950 - these archives are causing issues with a non-oss build elastic#101654 Last remaining fix for elastic#101118 Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>

…#102011) Same as #101950 - these archives are causing issues with a non-oss build #101654 Last remaining fix for #101118 Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co> Co-authored-by: Tyler Smalley <tyler.smalley@elastic.co>

joshdover · 2021-06-15T14:45:47Z

@tylersmalley Any updates here? I want to make sure that we get ahead of any races if we do indeed have any.

tylersmalley · 2021-08-30T18:50:27Z

I don't think this is still an issue.

botelastic bot added the needs-team Issues missing a team label label Jun 8, 2021

tylersmalley added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc and removed needs-team Issues missing a team label labels Jun 8, 2021

tylersmalley mentioned this issue Jun 10, 2021

[esArchives] Persist migrated Kibana archives #101950

Merged

tylersmalley mentioned this issue Jun 11, 2021

[esArchive] Persists updates for management/saved_objects/* #101992

Merged

tylersmalley closed this as completed Aug 30, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible race condition with migrations #101654

Possible race condition with migrations #101654

tylersmalley commented Jun 8, 2021 •

edited

Loading

elasticmachine commented Jun 8, 2021

tylersmalley commented Jun 8, 2021

joshdover commented Jun 15, 2021

tylersmalley commented Aug 30, 2021

Possible race condition with migrations #101654

Possible race condition with migrations #101654

Comments

tylersmalley commented Jun 8, 2021 • edited Loading

elasticmachine commented Jun 8, 2021

tylersmalley commented Jun 8, 2021

joshdover commented Jun 15, 2021

tylersmalley commented Aug 30, 2021

tylersmalley commented Jun 8, 2021 •

edited

Loading