Call caching with Singularity and SGE #7480

Open

jeremylp2 opened this issue Jul 30, 2024 · 1 comment

jeremylp2 commented Jul 30, 2024

I'm having trouble getting call caching to work with Singularity and SGE, and I'm wondering if anyone has a working example config or some pointers. My config is below, minus passwords and specific paths/URLs, which I've replaced with labels enclosed in <>. I've tried switching to slower hashing strategies and finagling with the command construction, to no avail. If there's no obvious solution, is there an easy way to debug this? There are no network issues preventing connections to Docker Hub; pulling images and converting them to .sif works fine. It's only call caching that's broken.

Even when the metadata shows identical hashes for the docker image and all inputs and outputs, the result is a "Cache Miss" every time.

The call caching stanza in my metadata looks like this, for example. Am I missing something?

      "callCaching": {
        "allowResultReuse": true,
        "hashes": {
          "output count": "C4CA4238A0B923820DCC509A6F75849B",
          "runtime attribute": {
            "docker": "4B2AB7B9EA875BF5290210F27BB9654D",
            "continueOnReturnCode": "CFCD208495D565EF66E7DFF9F98764DA",
            "failOnStderr": "68934A3E9455FA72420237EB05902327"
          },
          "output expression": {
            "File output_greeting": "DFC652723D8EBD4BB25CAC21431BB6C0"
          },
          "input count": "CFCD208495D565EF66E7DFF9F98764DA",
          "backend name": "2A2AB400D355AC301859E4ABB5432138",
          "command template": "AFAC58B849BD67585A857F538B8E92F6"
        },
        "effectiveCallCachingMode": "ReadAndWriteCache",
        "hit": false,
        "result": "Cache Miss"
      },
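
In case it's useful, this is roughly how I've been pulling those hashes out while debugging. It assumes the Cromwell server is listening on localhost:8000 (adjust for your setup), the workflow and call names are placeholders, and I'm going from the REST API docs for the call-cache diff endpoint, so treat it as a sketch rather than a known-good recipe:

# Pull only the callCaching blocks out of the workflow metadata.
# Assumes Cromwell is at localhost:8000; <workflow-id> is a placeholder. jq is only for pretty-printing.
WORKFLOW_ID="<workflow-id>"
curl -s "http://localhost:8000/api/workflows/v1/${WORKFLOW_ID}/metadata?includeKey=callCaching" | jq '.calls'

# Compare the cache entries of two calls that I expected to hit each other
# (the callcaching/diff endpoint, per the REST API docs).
curl -s "http://localhost:8000/api/workflows/v1/callcaching/diff?workflowA=<workflow-id-A>&callA=<workflow.task>&workflowB=<workflow-id-B>&callB=<workflow.task>" | jq .

And here's the full config:
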
# simple sge apptainer conf (modified from the slurm one)
#
workflow-options
{
  workflow-log-dir: "cromwell-workflow-logs"
  workflow-log-temporary: false
  workflow-failure-mode: "ContinueWhilePossible"
  default
  {
    workflow-type: WDL
    workflow-type-version: "draft-2"
  }
}

database {
  # Store metadata in a file on disk that can grow much larger than RAM limits.
  metadata {
    profile = "slick.jdbc.MySQLProfile$"
    db {
      url = "jdbc:mysql:<dburl>?rewriteBatchedStatements=true"
      driver = "com.mysql.cj.jdbc.Driver"
      user = "<user>"
      password = "<pass>" 
      connectionTimeout = 5000
    }
  }
}

call-caching
{
  enabled = true
  invalidate-bad-cache-result = true
}


docker {
    hash-lookup {
        enabled = true
    }
}


backend {
  default = sge
  providers {

  
      sge {
        actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
        config {

          # Limits the number of concurrent jobs
          #concurrent-job-limit = 5

          # If an 'exit-code-timeout-seconds' value is specified:
          # - check-alive will be run at this interval for every job
          # - if a job is found to be not alive, and no RC file appears after this interval
          # - Then it will be marked as Failed.
          # Warning: If set, Cromwell will run 'check-alive' for every job at this interval

          # exit-code-timeout-seconds = 120

          runtime-attributes = """
          String time = "11:00:00"
          Int cpu = 4
          Float? memory_gb
          String sge_queue = "hammer.q"
          String? sge_project
          String? docker
          """

          submit = """
          qsub \
          -terse \
          -V \
          -b y \
          -N ${job_name} \
          -wd ${cwd} \
          -o ${out}.qsub \
          -e ${err}.qsub \
          -pe smp ${cpu} \
          ${"-l mem_free=" + memory_gb + "g"} \
          ${"-q " + sge_queue} \
          ${"-P " + sge_project} \
          /usr/bin/env bash ${script}
          """

          kill = "qdel ${job_id}"
          check-alive = "qstat -j ${job_id}"
          job-id-regex = "(\\d+)"

          submit-docker = """          
             # location for .sif files and other apptainer tmp, plus lockfile
             export APPTAINER_CACHEDIR=<path>
             export APPTAINER_PULLFOLDER=<path>
             export APPTAINER_TMPDIR=<path>
             export LOCK_FILE="$APPTAINER_CACHEDIR/lockfile"
             export IMAGE=$(echo ${docker} | tr '/:' '_').sif
             if [ -z "$APPTAINER_CACHEDIR" ]; then
                 exit 1
             fi
             CACHE_DIR=$APPTAINER_CACHEDIR
             # Make sure cache dir exists so lock file can be created by flock
             mkdir -p $CACHE_DIR
             # downloads sifs only one at a time; apptainer sif db doesn't handle concurrency well
             out=$(flock --exclusive --timeout 1800 $LOCK_FILE apptainer pull $IMAGE docker://${docker}  2>&1)
             ret=$?
             if [[ $ret == 0 ]]; then
                 echo "Successfully pulled ${docker}!"
             else
                 if [[ $(echo $out | grep "exists" ) ]]; then
                     echo "Image file already exists, ${docker}!"
                 else
                     echo "Failed to pull ${docker}" >> /dev/stderr
                     exit $ret
                 fi
             fi
             #full path to sif for qsub command
             IMAGE="$APPTAINER_PULLFOLDER/$IMAGE"
             qsub \
             -terse \
             -V \
             -b y \
             -N "${job_name}" \
             -wd "${cwd}" \
             -o "${out}.qsub" \
             -e "${err}.qsub" \
             -pe smp "${cpu}" \
             ${"-l mem_free=" + memory_gb + "g"} \
             ${"-q " + sge_queue} \
             ${"-P " + sge_project} \
             apptainer exec --cleanenv --bind "${cwd}:${docker_cwd},<path>" "$IMAGE" "${job_shell}" "${docker_script}"
          """

          default-runtime-attributes
          {
            failOnStderr: false
            continueOnReturnCode: 0
          }
        }
      }

      sge_docker  {
        actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
        config {

          runtime-attributes = """
          String time = "11:00:00"
          Int cpu = 4
          Float? memory_gb
          String sge_queue = "hammer.q"
          String? sge_project
          String? docker
          """

          submit = """
          qsub \
          -terse \
          -V \
          -b y \
          -N ${job_name} \
          -wd ${cwd} \
          -o ${out}.qsub \
          -e ${err}.qsub \
          -pe smp ${cpu} \
          ${"-l mem_free=" + memory_gb + "g"} \
          ${"-q " + sge_queue} \
          ${"-P " + sge_project} \
          /usr/bin/env bash ${script}
          """

          kill = "qdel ${job_id}"
          check-alive = "qstat -j ${job_id}"
          job-id-regex = "(\\d+)"

          submit-docker = """          
             qsub \
             -terse \
             -V \
             -b y \
             -N ${job_name} \
             -wd ${cwd} \
             -o ${out}.qsub \
             -e ${err}.qsub \
             -pe smp ${cpu} \
             ${"-l mem_free=" + memory_gb + "g"} \
             ${"-q " + sge_queue} \
             ${"-P " + sge_project} \
             docker run -v ${cwd}:${docker_cwd} -v <path> ${docker} ${job_shell} ${docker_script}
          """

          default-runtime-attributes
          {
            failOnStderr: false
            continueOnReturnCode: 0
          }
        }
      } 
    Local
    {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config
      {
        #concurrent-job-limit = 5
        run-in-background = true
        # The list of possible runtime custom attributes.
        runtime-attributes = """
        String? docker
        String? mountOption
        """

        # Submit string when there is no "docker" runtime attribute.
        submit = "/usr/bin/env bash ${script}"
       
        # if the apptainer .sif for the image is created this will automatically use it
        # otherwise it will pull from dockerhub
        # if not using on dori change the source path for /refdata
        submit-docker = """
            apptainer exec --cleanenv --bind ${cwd}:${docker_cwd},<path> \
            docker://${docker} ${job_shell} ${script}
        """

        filesystems
        {
          local
          {
            localization: [ "hard-link", "soft-link", "copy" ]

            caching {
              duplication-strategy: [ "hard-link", "soft-link", "copy" ]
              hashing-strategy: "fingerprint"
              fingerprint-size: 10485760
            }
          }
        }

        default-runtime-attributes
        {
          failOnStderr: false
          continueOnReturnCode: 0
        }
      }
    }
  }
}
aednichols (Collaborator) commented

I've never used Cromwell this way, but my understanding is that good call caching performance depends heavily on cloud object storage, because the object store can return a checksum for any file in a short, constant time.
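
One thing that might be worth checking on a shared filesystem: file hashing behaviour comes from the backend's filesystems.local.caching block, and in the config above only the Local backend sets one; the sge backend you're actually running on doesn't appear to. Purely as a sketch (I haven't verified this on an SGE backend), mirroring the stanza you already have under Local inside the sge backend's config would look something like:

# Sketch only: copied from the Local backend's stanza above.
# Placement under backend.providers.sge.config is my assumption, untested.
filesystems {
  local {
    localization: [ "hard-link", "soft-link", "copy" ]
    caching {
      duplication-strategy: [ "hard-link", "soft-link", "copy" ]
      # "fingerprint" hashes the file's size and mtime plus its first fingerprint-size bytes,
      # which is much cheaper than md5-ing whole files on network storage.
      hashing-strategy: "fingerprint"
      fingerprint-size: 10485760
    }
  }
}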
