Log instead of store CC fails by default BA-5721 #5095

kshakir · 2019-07-31T03:39:00Z

No description provided.

aednichols · 2019-07-31T20:28:57Z

docs/cromwell_features/CallCaching.md

+
+```hocon
+call-caching {
+  log-cache-hit-failures: true


Would it be reasonable to simply log these at log level INFO instead of introducing a custom boolean?

"informational messages" seem like a great fit for that log level.

The option has been removed.

aednichols · 2019-07-31T20:30:18Z

docs/cromwell_features/CallCaching.md

+
+```hocon
+call-caching {
+  add-cache-hit-failures-to-metadata: false


Are we sure we need this? The ticket was pretty specific that there exists no customer requirement around recording to metadata, and I can't see ourselves really needing this as developers either.

I like leaving this in and defaulting to true to avoid surprising anyone currently relying on it in other instances.

Ditto on no more options for cromwell admins.

cjllanwarne

Some comments, a lot of "I wasn't sure why this refactor happened"

Re "should we have a config option for logging" my gut feeling is like Adam, I'd be inclined to wait until this option is requested before adding it to the maintenance burden. OTOH since it's already wired in and it would be more work to undo that, I'm not sure that I'd particularly prioritize removing it either... 🤷‍♂

Now where did I leave my fence, I could do with a good sit down...

cjllanwarne · 2019-08-02T15:26:27Z

CHANGELOG.md

+
+By default, call cache failure messages are now sent to the logs instead of the workflow metadata. Failure message
+delivery to the logs and/or metadata may be adjusted via new configuration keys. See [the Cromwell call caching
+documentation](https://cromwell.readthedocs.io/en/stable/cromwell_features/CallCaching/) for more information.


Add a PR reference (per tech talk this morning)

cjllanwarne · 2019-08-02T15:28:27Z

docs/cromwell_features/CallCaching.md

+[Configuration](../Configuring#call-caching) and the behavior can be modified via
+[Workflow Options](../wf_options/Overview). If you are adding Workflow options, do not set
+[`read_from_cache` or `write_to_cache`](../wf_options/Overview#call-caching-options) = false, as it will impact the
+following process.


OOI: not that I mind, just curious what triggered this multi-lining.

I manually tested the links. I think someone moved the page down a directory under cromwell_features, but didn't fix the links on the page with a ../ prefix. So over on cromwell.rtfd.io a bunch of links on these lines actually link to the 404 page. Briefly looked into a link checker to add to testCheckPublish.sh but didn't see anything quick. If anyone knows of one they like we can add it to the build.

cjllanwarne · 2019-08-02T15:51:16Z

engine/src/main/resources/reference.conf

@@ -0,0 +1,19 @@
+# Optional call-caching configuration.


TOL also move the engine conf section here from core/.../reference.conf?

Reverted the whole thing. We'll fix the reference-confs-belong-with-the-code-that-references-them in future PRs.

cjllanwarne · 2019-08-02T15:52:24Z

engine/src/main/resources/reference.conf

+
+  # Whether to put cache hit failure reasons into workflow metadata.
+  # (default: false)
+  add-cache-hit-failures-to-metadata = false


I would default this to true to avoid changing the ground from underneath users.

The ticket description from product indicates that this is not a user-facing feature, so I don't think it matters.

No more config.

cjllanwarne · 2019-08-02T15:52:43Z

engine/src/main/resources/reference.conf

+  log-cache-hit-failures = true
+
+  # Whether to put cache hit failure reasons into workflow metadata.
+  # (default: false)


Since we're in reference.conf is there any value in stating that this is the default?

FYI the other reference.conf options that were already in this stanza are listed as (paraphrasing) some-key: "some-value" # (default some-value). So the "default" comment was also added for consistency, to avoid confusing others looking through the whole file.

cjllanwarne · 2019-08-02T15:55:08Z

engine/src/main/scala/cromwell/engine/workflow/WorkflowProcessingEventPublishing.scala

+                              labels: Map[String, String],
+                              serviceRegistry: ActorRef): IOChecked[Unit] = {
+    val defaultLabel = "cromwell-workflow-id" -> s"cromwell-$workflowId"
+    Monad[IOChecked].pure(labelsToMetadata(workflowId, labels + defaultLabel, serviceRegistry))


I don't understand why we want to wrap Unit returns as pure IOChecked monads. An in-person explanation might be good?

Note: this change was triggered by IntelliJ warning about a dupe block of code. That and your comment also helped me notice a bug where one of these IOChecked[Unit]s was being created but never run. 👍

Happy to chat in-person if I'm off target here. The SomeBox[Unit] vs. Unit is because the latter only allows throwing exceptions, not returning them. Could have been as simple as a Try[Unit], but in this case the previously-duplicated-and-now-consolidated block was written to use the feature-rich IOChecked[Unit].

But I don't get why this would ever throw an exception? It seems like you're assuming it won't either because later on you use .get in the production WorkflowStoreSubmitActor?

Yeh, I didn't write it. Just de-duped it. I concur the function probably won't throw an exception. I'm pretty sure one-of-the-original-dupes was originally wrapped so it would fit into this IOChecked[_] for-comprehension.
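
For anyone following along, here is a minimal sketch of the throwing-versus-returning distinction discussed in this thread. It uses only the standard library's `Try` rather than Cromwell's `IOChecked`, and `publishThrowing`, `publishReturning` and `describe` are hypothetical names for illustration, not functions from this PR.

```scala
import scala.util.{Failure, Success, Try}

object ReturnVsThrow {
  // Hypothetical stand-in for a side-effecting publish; fails on an empty label map.
  def publishThrowing(labels: Map[String, String]): Unit =
    if (labels.isEmpty) throw new IllegalArgumentException("no labels to publish")

  // Same logic, but the failure is captured as a value the caller can inspect.
  def publishReturning(labels: Map[String, String]): Try[Unit] =
    Try(publishThrowing(labels))

  // A caller can decide what to do with the failure without a try/catch.
  def describe(result: Try[Unit]): String = result match {
    case Success(_) => "published"
    case Failure(e) => s"failed: ${e.getMessage}"
  }
}
```

Note that calling `.get` on a failed `Try` rethrows the captured exception, which is the concern raised above about the `.get` in `WorkflowStoreSubmitActor`.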

cjllanwarne · 2019-08-02T16:11:53Z

...rc/main/scala/cromwell/engine/workflow/lifecycle/execution/job/EngineJobExecutionActor.scala

-    instrumentJobComplete(response)
-    pushExecutionEventsToMetadataService(jobDescriptorKey, eventList)
-    recordExecutionStepTiming(stateName.toString, currentStateDuration)
-    context stop self


Looks like you're removing a lot more than just the metadata writing?

If you look at what was removed, it was removed twice twice and copypastad back once. FTFY'ed the rest of the file while here cleaning up an IntelliJ warning.

cjllanwarne · 2019-08-02T16:15:24Z

.../cromwell/engine/workflow/lifecycle/materialization/MaterializeWorkflowDescriptorActor.scala

+
+    def errorOrCallCachingBoolean(path: String): ErrorOr[Boolean] = {
+      import common.validation.Validation._
+      validate(callCachingConfig.getBoolean(path))


I don't understand why you switched from Option to ErrorOr to indicate "found or not found" here?

See the new tests. All the configs are using ErrorOrs now. When misconfigured with something like my-config-enabled = tru, a naked getBoolean will quick-crash instead of continuing to validate other config values.
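
A minimal sketch of the difference, assuming a plain `Map[String, String]` in place of a real Typesafe `Config` and a standard-library `Either` in place of Cromwell's `ErrorOr`. The `Checked` alias and the helpers `checkedBoolean` and `both` are hypothetical and only illustrate how errors can accumulate instead of crashing on the first bad value.

```scala
import scala.util.Try

object AccumulatingValidation {
  // Left carries every problem found so far; Right carries the parsed value.
  type Checked[A] = Either[List[String], A]

  // Hypothetical lookup: parse a boolean, returning the problem instead of throwing.
  def checkedBoolean(config: Map[String, String], path: String): Checked[Boolean] =
    config.get(path) match {
      case None => Left(List(s"$path: missing"))
      case Some(raw) =>
        Try(raw.toBoolean).toEither.left.map(_ => List(s"$path: '$raw' is not a boolean"))
    }

  // Combine two checks, keeping the errors from both sides when both fail.
  def both[A, B](a: Checked[A], b: Checked[B]): Checked[(A, B)] = (a, b) match {
    case (Right(x), Right(y)) => Right((x, y))
    case (Left(e1), Left(e2)) => Left(e1 ++ e2)
    case (Left(e), _)         => Left(e)
    case (_, Left(e))         => Left(e)
  }
}
```

With `my-config-enabled = tru` and a second bad key, `both` reports both problems in one pass, whereas a naked `getBoolean` would stop at the first.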

cjllanwarne · 2019-08-02T16:19:02Z

.../cromwell/engine/workflow/lifecycle/materialization/MaterializeWorkflowDescriptorActor.scala

@@ -120,14 +126,36 @@ object MaterializeWorkflowDescriptorActor {
        }
      }

+      val errorOrMaybePrefixes = workflowOptions.getVectorOfStrings("call_cache_hit_path_prefixes")
+      val errorOrInvalidateBadCacheResults = errorOrCallCachingBoolean("invalidate-bad-cache-results")


Seems weird to treat a missing config value as a per-workflow validation error rather than applying a default?

Nothing's ever missing AFAIK, and the newly added tests validate this.

The defaults are still there in the reference.conf (the belt). I just removed the duplicated defaults that were never used by main scala (the suspenders).

cjllanwarne · 2019-08-02T16:19:32Z

.../cromwell/engine/workflow/lifecycle/materialization/MaterializeWorkflowDescriptorActor.scala

+      val errorOrInvalidateBadCacheResults = errorOrCallCachingBoolean("invalidate-bad-cache-results")
+      val errorOrLogCacheHitFailures = errorOrCallCachingBoolean("log-cache-hit-failures")
+      val errorOrAddCacheHitFailuresToMetadata = errorOrCallCachingBoolean("add-cache-hit-failures-to-metadata")
+      val errorOrCallCachingOptions = (


If we do stick to ErrorOrs, could we do the "config" part "once per instance" instead of "once per workflow"?

That makes sense. My only hesitation is the changing logic. Currently the logic only crashes-on-a-bad-config for workflows that request call caching. ~~I can take a shot at updating this and add a line to the ChangeLog and docs describing the change in behavior.~~ Edit: Instead filed https://broadworkbench.atlassian.net/browse/BA-5923

aednichols · 2019-08-16T14:07:51Z

Don't just create the IO, run it.

That's one way to avoid side effects, for sure
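
To illustrate the "created but never run" bug, here is a minimal sketch with a hand-rolled `Effect` type. It is a hypothetical stand-in for an IO-like value, not Cromwell's `IOChecked`, and `publish` and `published` are made-up names.

```scala
object CreateVsRun {
  // Hypothetical minimal effect type: holds a thunk and does nothing until run() is called.
  final class Effect[A](thunk: () => A) {
    def run(): A = thunk()
  }

  // Records what has actually been published, so the difference is observable.
  var published: List[String] = Nil

  // Building the Effect only describes the publish; nothing is recorded yet.
  def publish(label: String): Effect[Unit] =
    new Effect[Unit](() => { published = label :: published })
}
```

Calling `CreateVsRun.publish("x")` and discarding the result leaves `published` empty; the side effect only happens once `run()` is invoked on the returned value.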

cjllanwarne

Yet to be convinced on the whole "wrap it in an IO then call .get on it" thing, but otherwise I only have TOL comments

cjllanwarne · 2019-08-16T18:43:29Z

docs/cromwell_features/CallCaching.md

@@ -81,6 +87,12 @@ Cromwell would search cache hits in all of the `gs://alice_bucket`, `gs://bob_bu

 If no `call_cache_hit_path_prefixes` are specified then all matching cache hits will be considered.

+***Call cache failure logging***
+
+When searching for previous results to cache, the first three times (per call) that caching is unable to copy results


TOL: Mildly awkward sentence structure

English hard is.

cjllanwarne · 2019-08-16T18:47:54Z

docs/cromwell_features/CallCaching.md

 **Runtime Attributes**

-As well as call inputs and the command to run, call caching considers the following [runtime attributes](https://cromwell.readthedocs.io/en/develop/RuntimeAttributes/) of a given task when determining whether to call cache:
+As well as call inputs and the command to run, call caching considers the following [runtime
+attributes](../RuntimeAttributes) of a given task when determining whether to call cache:


This link didn't work for me when I tested it with the View file option

Thanks for the test. I tried them and... it didn't work for me either. 🤦‍♂ It's supposed to be '../../Page/' as apparently mkdocs adds a trailing / to the urls.

By the way, these links won't work on GitHub as these files are hosted on RTFD. Use mkdocs serve and then try out the link locally.

cjllanwarne · 2019-08-16T19:00:37Z

...rc/main/scala/cromwell/engine/workflow/lifecycle/execution/job/EngineJobExecutionActor.scala

+      s"Failed copying cache results for job $jobDescriptorKey (${reason.getClass.getSimpleName}: ${reason.getMessage})"
+    if (invalidationRequired) {
+      // Whenever invalidating a cache result, always log why the invalidation occurred
+      workflowLogger.warn(        s"$problemSummary, invalidating cache entry.")


TOL: Weird whitespace?

cjllanwarne · 2019-08-16T19:03:51Z

...rc/main/scala/cromwell/engine/workflow/lifecycle/execution/job/EngineJobExecutionActor.scala

+      data
+    } else if (data.logCacheHitFailureCount < 3) {
+      workflowLogger.info(problemSummary)
+      data.copy(logCacheHitFailureCount = data.logCacheHitFailureCount + 1)


TOL: Part of me likes the idea of:

- Always adding 1 to the count regardless
- Renaming it something like cacheHitFailureCount
- Having a single log/metric emitted once the call cache sequence has completed, along the lines of "Call cache hit process had ${data.cacheHitFailureCount} total hit failures before completing successfully/unsuccessfully"
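
A rough sketch of what that suggestion could look like, assuming a simplified `Data` case class and returning the message to log rather than calling a real logger. `Data`, `onFailure`, `summary` and `MaxLoggedFailures` are hypothetical and not the actor's actual code.

```scala
object CacheHitFailureCounting {
  // Hypothetical slice of the actor's state; only the counter matters here.
  final case class Data(cacheHitFailureCount: Int = 0)

  val MaxLoggedFailures = 3

  // Always bump the counter; return the message to log only for the first three failures.
  def onFailure(data: Data, problemSummary: String): (Data, Option[String]) = {
    val toLog = if (data.cacheHitFailureCount < MaxLoggedFailures) Some(problemSummary) else None
    (data.copy(cacheHitFailureCount = data.cacheHitFailureCount + 1), toLog)
  }

  // Single summary line emitted once the cache hit sequence is over.
  def summary(data: Data, succeeded: Boolean): String = {
    val outcome = if (succeeded) "successfully" else "unsuccessfully"
    s"Call cache hit process had ${data.cacheHitFailureCount} total hit failures before completing $outcome"
  }
}
```

Because the counter keeps increasing after logging stops, the final summary still reports the true number of failed hits.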

cjllanwarne · 2019-08-16T19:07:44Z

engine/src/main/scala/cromwell/engine/workflow/workflowstore/WorkflowStoreSubmitActor.scala

@@ -43,7 +43,7 @@ final case class WorkflowStoreSubmitActor(store: WorkflowStore, serviceRegistryA
          val wfTypeVersion = cmd.source.workflowTypeVersion.getOrElse("Unspecified version")
          log.info("{} ({}) workflow {} submitted", wfType, wfTypeVersion, workflowSubmissionResponse.id)
          val labelsMap = convertJsonToLabelsMap(cmd.source.labelsJson)
-          publishLabelsToMetadata(workflowSubmissionResponse.id, labelsMap)
+          publishLabelsToMetadata(workflowSubmissionResponse.id, labelsMap, serviceRegistryActor).toErrorOr.toTry.get


What happens if the publish to metadata fails? Would we get a workflow store crash? Do we have suitable recovery logic for that?

No recovery logic added. Maybe there should be some. This was a de-dupe based on a suggestion by IntelliJ. Someone had copypasta'd the exact same publishLabelsToMetadata into two different places.

cjllanwarne · 2019-08-16T19:09:46Z

engine/src/main/scala/cromwell/engine/workflow/WorkflowProcessingEventPublishing.scala

+                              labels: Map[String, String],
+                              serviceRegistry: ActorRef): IOChecked[Unit] = {
+    val defaultLabel = "cromwell-workflow-id" -> s"cromwell-$workflowId"
+    Monad[IOChecked].pure(labelsToMetadata(workflowId, labels + defaultLabel, serviceRegistry))


But I don't get why this would ever throw an exception? It seems like you're assuming it won't either because later on you use .get in the production WorkflowStoreSubmitActor?

kshakir · 2019-08-19T17:57:19Z

I think I got everything, hopefully without creating new issues. 🤞 Failing tests are due to GPU issues that I believe are being investigated elsewhere.

cjllanwarne · 2019-08-19T17:58:40Z

docs/cromwell_features/CallCaching.md

+***Call cache failure logging***
+
+When Cromwell fails to cache a job from a previous result the reason will be logged. To reduce the verbosity of the logs
+only the first three failure reasons will be logged per job. Cromwell will continue to try copying previous results


per job => per shard of each job?

danbills requested review from aednichols and cjllanwarne July 31, 2019 14:36

aednichols reviewed Jul 31, 2019

cjllanwarne reviewed Aug 2, 2019

gemmalam requested review from aednichols and cjllanwarne August 16, 2019 14:40

aednichols approved these changes Aug 16, 2019

cjllanwarne reviewed Aug 16, 2019

kshakir requested a review from cjllanwarne August 19, 2019 17:56

cjllanwarne approved these changes Aug 19, 2019

Log instead of store CC fails by default

7ab5582

kshakir force-pushed the ks_log_cache_fail branch from 5800f00 to 7ab5582 Compare August 22, 2019 12:44

kshakir merged commit b18b976 into develop Aug 22, 2019

kshakir deleted the ks_log_cache_fail branch August 22, 2019 18:22
