
first draft of workflow format #171

Closed
wants to merge 12 commits

Conversation

@kba (Member) commented Aug 15, 2020

Spec defines a format for describing OCR-D workflows, independent of the workflow engine to run the workflow.

See https://github.com/kba/ocrd-workflows for examples of workflows defined in this format

@kba kba requested review from bertsky, cneud and EEngl52 August 15, 2020 14:46
kba added a commit to OCR-D/core that referenced this pull request Aug 15, 2020
@bertsky (Collaborator) left a comment


I kind of like the idea of using a subset of POSIX sh for this. But we must be careful not to be too restrictive (so the format is expressive enough to actually get used) or too flexible (so the format won't be implementable or translatable to makefiles easily).

Notably, we should consider whether this static approach is sufficient. (You cannot make the configuration of follow-up workflow steps dependent on intermediate results here.)

One feature that might be worth considering for this format as well is (automatically) versioning workflows, as explained here.

a valid `sh` identifier, everything after `=` is parsed as a string.

<!-- TODO define behavior? -->
**NOTE** The semantics of assigning `value` to `name` are implementation-dependant
@bertsky (Collaborator):

Suggested change
**NOTE** The semantics of assigning `value` to `name` are implementation-dependant
**NOTE** The semantics of assigning `value` to `name` are implementation-dependent

But what do you mean by that?

@bertsky (Collaborator):

In particular, where and how in the step definition do you use variables? You'll really have to describe variable expansion, especially its interaction with quote expansion. In practice, we'd need to be able to do things like:

bin=wolf
param_bin='{"input_feature_filter": "binarized", "output_feature_added": "binarized", "command": "scribo-cli $bin @INFILE @OUTFILE --enable-negate-output --k 0.2 --win-size 301"}'
ocrd-preprocess-image         -I OCR-D-IMG               -O OCR-D-BIN-$bin          -P level-of-operation page -p "$param_bin"
ocrd-cis-ocropy-denoise       -I OCR-D-BIN-$bin          -O OCR-D-BIN-$bin-DEN      -P level-of-operation page
ocrd-tesserocr-deskew         -I OCR-D-BIN-$bin-DEN      -O OCR-D-BIN-$bin-DEN-FLIP -P operation_level page
ocrd-cis-ocropy-deskew        -I OCR-D-BIN-$bin-DEN-FLIP -O OCR-D-BIN-$bin-DEN-DESK -P level-of-operation page
ocrd-tesserocr-segment-region -I OCR-D-BIN-$bin-DEN-DESK -O OCR-D-SEG-REG
ocrd-preprocess-image         -I OCR-D-SEG-REG           -O OCR-D-SEG-REG-BIN-$bin  -P level-of-operation region -p "$param_bin"
...
ocrd-preprocess-image         -I OCR-D-SEG-LINE          -O OCR-D-SEG-LINE-BIN-$bin -P level-of-operation line -p "$param_bin"

(We may need variable expansion in fileGrp names whenever multiple alternatives are used for different purposes, like different binarizations for segmentation and recognition. In the parameter JSON and parameter literal options, the need is obvious – for brevity and changeability.)

@kba (Member Author):

For this initial draft, I side-stepped the question of how variables come into play, because I need input (like yours, thx) on how this should work.

I really like how standard = variable assignments work in Makefiles and would like to see the same semantics with /bin/sh syntax. I.e. the example you posted would be valid but at workflow execution time, users should be able to override those assignments to use a different model, a different set of parameters etc.

Consistency checking would then also include a check that there is an assignment (either in the file or provided by the workflow engine) to replace all variables in the file.

For ease of parsing and visual effect, I'd prefer ${foo} syntax over $foo.

Also, I think this should not adhere to bourne shell variable expansion (which e.g. does not work inside single quotes and can be escaped with a backslash) but rather the more aggressive make approach to replace all $(var) calls everywhere.
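The sh quoting rules referenced here can be demonstrated in isolation (a minimal plain-`sh` illustration, not OCRD-WF syntax):

```shell
bin=wolf
echo "$bin"   # double quotes: expanded, prints: wolf
echo '$bin'   # single quotes: no expansion, prints: $bin
echo \$bin    # backslash escape: no expansion, prints: $bin
```

make, by contrast, substitutes its `$(var)` references textually before any quoting is considered – which is the "more aggressive" behavior described above.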

@bertsky (Collaborator):

I really like how standard = variable assignments work in Makefiles and would like to see the same semantics with /bin/sh syntax. I.e. the example you posted would be valid but at workflow execution time, users should be able to override those assignments to use a different model, a different set of parameters etc.

I agree with you aesthetically, but then you definitely need a completely different non-sh (non-static scoping) semantics. You essentially want the declarative model for makefiles. Which is not very intuitive for most users I must say.

Also, I think this should not adhere to bourne shell variable expansion (which e.g. does not work inside single quotes and can be escaped with a backslash) but rather the more aggressive make approach to replace all $(var) calls everywhere.

The make approach is problematic itself though – exactly because of its word expansion (and ignorance of any quotation). In sh-like syntax, you can at least have nested quotation of arbitrary depth.

Having a concise, readable and runnable declarative syntax for workflows was among my initial motivations for workflow-configuration. This could also be parsed to run independent of make. I fail to see the merit of introducing yet another syntax if not for simplicity and intuitivity (which I still believe sh-like configuration serves best).

@kba (Member Author):

You essentially want the declarative model for makefiles. Which is not very intuitive for most users I must say.

Why do you think it is unintuitive? The OCRD-WF defines variables and command line calls. Variables are hoisted at parse-time (which is different from shell script, true, but since we do not support any conditional syntax, it makes no sense to support non-global scoping).

In sh-like syntax, you can at least have nested quotation of arbitrary depth.

That's a good point. make's interpolation is more powerful IMHO (because it doesn't care about quotes) but it can lead to unintuitive behavior, I agree.

This could also be parsed to run independent of make.

But only if you stick to a completely static sequence of steps, without any fileGrp-guessing logic and the like – nothing against trying to be smart, just that Makefiles are not trivial to parse, with all their more or less arcane incantations :)

I fail to see the merit of introducing yet another syntax if not for simplicity and intuitivity (which I still believe sh-like configuration serves best).

The merit of this syntax is supposed to be interoperability. At the very least, OCRD-WF should be exportable to (less intuitive than hand-written but still complete) workflow-configuration makefiles or ocrd process calls.

@kba (Member Author):

Because variables can get their values from below the line where they are used, and multiple assignments make no sense (and don't work).

True, that's what I meant by "hoisting": variable assignments are processed before any steps, even if the assignments and steps are interwoven. This is decidedly non-sh behavior and unintuitive if you approach OCRD-WF as a shell script. But if you do approach it like a shell script, you might be tempted to assume statefulness and expect conditionals or loop constructs to be supported, which they are not. There absolutely is a benefit to allowing state/conditionals (like the group-guessing in workflow-configuration scripts) and if we want to support this, we should decide on it now. If not, I'd prefer the make assignment semantics.

I was thinking of something like make -Bsn > workflow.sh (which could usually be rewritten for ocrd process)

This pattern never occurred to me but is very clever. Can you add that to https://hackmd.io/emBLZdD5SyWietFxLLpnQg?
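The dry-run pattern above can be sketched with a toy makefile (hypothetical rule, variable, and fileGrp names – real workflow-configuration makefiles are more involved):

```shell
tmp=$(mktemp -d)
# A toy makefile with one variable and one recipe line;
# $(BIN) is a make macro, expanded before the shell ever sees the command.
printf 'BIN = wolf\nall:\n\techo ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN-$(BIN)\n' > "$tmp/Makefile"
# -B: remake everything, -s: silent, -n: dry run --
# print the expanded recipe commands instead of running them
(cd "$tmp" && make -Bsn) > "$tmp/workflow.sh"
cat "$tmp/workflow.sh"
```

The resulting workflow.sh contains the fully expanded command sequence, which could then be rewritten as an ocrd process call.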

But a truly sh-based restricted syntax would be both interoperable and simple, even with standard backslash/quote expansion.

I'm with you, I care about simplicity and interoperability more than about commenting lines with escaped backslashes :)

How do you plan on implementing this translation anyway? Some form of bash -x workflow.ocrd.sh or set -n perhaps, or are you going to write a (restricted) sh parser of your own (talking of incantations...)?

Parsing with the algorithm outlined here results in a data structure

assignments:
  foo: bar
  bar: 42
steps:
  - executable: ocrd-foo
    parameters:
      baz: foo
    input_file_grp: MAX
    output_file_grp: FOO

Then we'd have adapters in ocrd_modelfactory that generate ocrd process calls or workflow-configuration makefiles or the CSV that @VolkerHartmann uses for invoking taverna or a shell script (which would be more or less the same as the OCRD-WF input).
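As a rough sketch of one such adapter (hypothetical shell rendering; it assumes the `ocrd process` convention of task strings with the `ocrd-` prefix stripped), a single parsed step could be serialized like this:

```shell
# One parsed step, mirroring the YAML model above
executable=ocrd-foo
input_file_grp=MAX
output_file_grp=FOO
parameters='{"baz": "foo"}'
# Build an ocrd process task string; processor names drop the "ocrd-" prefix
task="${executable#ocrd-} -I $input_file_grp -O $output_file_grp -p '$parameters'"
echo "ocrd process \"$task\""
```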

@bertsky (Collaborator):

But if you do approach it like a shell script, you might be tempted to assume statefulness and expect conditionals or loop constructs to be supported, which they are not

Yes. And statefulness is something that makes your model-driven semantics hard again. Not sure myself whether users actually want statefulness. But if it's going to be the declarative, make-like syntax, then we need a better metaphor than the current shell script / command shell / ocrd+sh / .ocrd.sh formulations. The term configuration language usually does not cover variable references, but macro language is too closely related to actual programming...

There absolutely is a benefit to allowing state/conditionals (like the group-guessing in workflow-configuration scripts) and if we want to support this, we should decide on it now.

Yes, and – as pointed out above – that's probably true not just for static conditionals (i.e. file checks or variable references) but also dynamic switches (dependent on a previous result). But this would have to go beyond OCR-D CLI (and into OCR-D-Q) anyway.

If what we can offer here is too little for what actual OCR-D users need and can have with their own shell scripts or makefiles, then OCRD-WF is not going to be used much, I am afraid. Perhaps it's just too early to decide on all this.

Can you add that to https://hackmd.io/emBLZdD5SyWietFxLLpnQg?

Done.

Parsing with the algorithm outlined here results in a data structure

I see. So you are planning on implementing your own parser. I hope this is not necessary – if we could find a language suited for us, then we could just delegate implementation and documentation to that.

@bertsky (Collaborator):

Common Workflow Language does have variable references BTW. But it's quite a complex spec itself, definitely not simple.

Do we target human readers/writers here, or will they be ultimately using some GUI in the future anyway?

@kba (Member Author):

We were considering CWL as an alternative to Taverna way back at the beginning of phase 2 of OCR-D but decided against it then because Taverna offered a GUI (IIRC, @VolkerHartmann?). I still like the design a lot, but it is very different from shell script.

Do we target human readers/writers here, or will they be ultimately using some GUI in the future anyway?

I'd say that human-readability and intuitiveness are very important. A "tiny" language based on shell script fits that description and allows us, asap, to provide reusable workflows in a repository like https://github.com/kba/ocrd-workflows

If what we can offer here is too little for what actual OCR-D users need and can have with their own shell scripts or makefiles, then OCRD-WF is not going to be used much, I am afraid.

The advantage over a plain shell script would be in the tooling and validation. I agree with you that it is pointless to define a script/configuration/workflow language that is then not used. We do want implementers to use OCRD-WF, the common workflows defined should be the basis for integration into digitization workflow software like Kitodo, Visual Library or DWork and the upcoming HTTP API should make use of it (hence the "IANA considerations").

Perhaps it's just too early to decide on all this.

Maybe we can discuss this in the Open Tech Call on Thursday, so I understand better what users/developers expect from such a workflow language and adapt the proposal accordingly. I think a solution (that can be revised with new requirements later) is fairly urgent but we should not prevent future use cases with early design choices.

@bertsky (Collaborator):

Yes. And statefulness is something that makes your model-driven semantics hard again. Not sure myself whether users actually want statefulness. But if it's going to be the declarative, make-like syntax, then we need a better metaphor than the current shell script / command shell / ocrd+sh / .ocrd.sh formulations. The term configuration language usually does not cover variable references, but macro language is too closely related to actual programming...

I'd say that human-readability and intuitiveness are very important. A "tiny" language based on shell script fits that description and allows us, asap, to provide reusable workflows

Let's stay with the restricted shell script (as opposed to custom declarative syntax) idea for a moment.

Is the statefulness of sh-like variables really a problem? Yes, variables can be re-assigned (before re-use). But that does not necessarily make your semantics difficult. I don't think you actually need to keep the variable values after computing the commands they are used to formulate. You could basically just run the entire OCRD-WF through sh with sh -x workflow.ocrd.sh 2>&1 | sed -ne "s/^+ //p" to get your (machine-readable) steps – without the need for a parser for variables and quotation. Whatever human-readable combination of variable assignments was needed to arrive at this does not matter in the end. It's only the expanded result that counts. (You can then parse the verbatim commands to arrive at your model.)
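That extraction trick can be demonstrated with a stand-in workflow (using `echo` in place of an actual processor, so the sketch runs without OCR-D installed):

```shell
tmp=$(mktemp -d)
cat > "$tmp/workflow.ocrd.sh" <<'EOF'
bin=wolf
echo ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN-$bin
EOF
# sh -x traces each command on stderr with a "+ " prefix, variables already
# expanded; discard regular stdout and strip the trace prefix:
steps=$(sh -x "$tmp/workflow.ocrd.sh" 2>&1 >/dev/null | sed -ne 's/^+ //p')
echo "$steps"
```

The trace contains both the assignment and the expanded command line (`... -O OCR-D-BIN-wolf`), i.e. the machine-readable step list, without writing a parser for variables and quotation.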

Arguably, as soon as you do allow dynamic workflow switches, you will need conditionals though. Representing this is going to be more difficult, but does not necessitate declarative syntax either. In fact, it could still be done with a proper sh subset (if/then/elif/fi). You lose the free parser though.

On that broader topic again – dynamic switches, depending on result checks – today's call made me think about the following simple concept: How about we introduced a dedicated CLI for OCR-D workflow quality estimators analogous to OCR-D workflow processors? Modules could bundle their knowledge about what a good result is for a particular processor along with everything else. And specialized modules could provide QE tools by the bunch. Let's call them benchmarks for the time being. We are primarily interested in a benchmark's score, which needs to be compared to some threshold to determine if the processor's result was "okay" enough to continue processing. That threshold could be different, depending on the workflow configuration. Benchmarks could also have configurable parameters (like model files) of their own.

An important question is how to deal with page-wise vs document-wise quality. Do we want to stop if even a single page indicated failure? Or does it take a bad average? Or a prolonged series of bad pages? Or a min-worst-page / min-average kind of set-up?

Regardless, besides the binary score > threshold check we might also be interested in additional, verbose output on what was analysed and what patterns were found – either as part of logging, or as a structured report. So we also must allow creating an output fileGrp.

Now let's assume a call to a benchmark looks like this:

ocrdtest-cis-ocropy-binarization -I OCR-D-BIN-WOLF -O OCR-D-BIN-WOLF-TEST-OCRO -P method component-statistics -P threshold 0.9

(with its return value indicating success or failure, just as the return value of a processor).

Then we could write dynamic workflows like so:

if BIN=wolf
  BINARIZED=OCR-D-BIN-$BIN
  ocrd-olena-binarize -I OCR-D-IMG -O $BINARIZED -P impl $BIN
  ocrdtest-cis-ocropy-binarization -I $BINARIZED -O $BINARIZED-TEST-OCRO -P method component-statistics -P threshold 0.9
then : 
elif BIN=ocropy
  # try another algorithm
  BINARIZED=OCR-D-BIN-$BIN
  ocrd-cis-ocropy-binarize -I OCR-D-IMG -O $BINARIZED -P method $BIN -P level-of-operation page
  ocrdtest-cis-ocropy-binarization -I $BINARIZED -O $BINARIZED-TEST-OCRO -P method component-statistics -P threshold 0.9
then : 
elif BIN=wolf
  # try another parameter
  BINARIZED=OCR-D-BIN-$BIN-HIGH
  ocrd-olena-binarize -I OCR-D-IMG -O $BINARIZED -P impl $BIN -P k 0.1
  ocrdtest-cis-ocropy-binarization -I $BINARIZED -O $BINARIZED-TEST-OCRO -P method component-statistics -P threshold 0.9
then : 
else 
  # give up the workflow
  exit 1
fi
ocrd-tesserocr-deskew -I $BINARIZED -O OCR-D-DESK -P operation_level page
if CROPPED=OCR-D-CROP-TESS
  ocrd-tesserocr-crop -I OCR-D-DESK -O $CROPPED
  ocrdtest-leptonica-cropping -I $CROPPED -P threshold 0.7
then : 
elif CROPPED=OCR-D-CROP-ANY
  ocrd-anybaseocr-crop -I OCR-D-DESK -O $CROPPED
  ocrdtest-leptonica-cropping -I $CROPPED -P threshold 0.5
then : 
else 
  # omit cropping
  CROPPED=OCR-D-DESK
fi
ocrd-tesserocr-segment-region -I $CROPPED -O OCR-D-SEG-TESS 
# and so on...

(borrowing the sh -e convention that any non-zero return value causes the workflow to fail, except if it was immediate to a conditional expression)

That would be fully dynamic, but still not allow arbitrary information flow like determining model names. For the latter, some notion of functional CLI would be needed...

kba and others added 5 commits August 16, 2020 15:36
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
kba and others added 3 commits August 17, 2020 12:32
@bertsky (Collaborator) commented Aug 18, 2020

One feature that might be worth considering for this format as well is (automatically) versioning workflows, as explained here.

Also: inclusion of other files!

@kba (Member Author) commented Aug 18, 2020

Also: inclusion of other files!

Definitely. The idea behind adding variables - even if unsure about the semantics yet - was to support some form of inheritance.

### Shebang

The first line of every OCRD-WF must match the regular expression
`#!/usr/bin/env ocrd-wf-v[1-9]+[0-9]*\n`
Contributor:

Suggested change
`#!/usr/bin/env ocrd-wf-v[1-9]+[0-9]*\n`
`#!/usr/bin/env ocrd-wf-v[1-9][0-9]*\n`
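A check against the corrected pattern could be sketched as follows (hypothetical file names; the `$` anchor assumes no trailing content on the shebang line):

```shell
tmp=$(mktemp -d)
printf '#!/usr/bin/env ocrd-wf-v1\n'  > "$tmp/good.ocrd.sh"
printf '#!/usr/bin/env ocrd-wf-v01\n' > "$tmp/bad.ocrd.sh"
# Validate only the first line against the corrected shebang regex
check_shebang() { head -n1 "$1" | grep -qE '^#!/usr/bin/env ocrd-wf-v[1-9][0-9]*$'; }
check_shebang "$tmp/good.ocrd.sh"; echo "good: $?"   # good: 0 (accepted)
check_shebang "$tmp/bad.ocrd.sh";  echo "bad: $?"    # bad: 1 (leading zero rejected)
```

With the original `v[1-9]+[0-9]*` pattern, a version like `v11` would match ambiguously; the suggested `v[1-9][0-9]*` admits exactly the positive integers without leading zeros.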


### Check shebang

* assert that line[0] is `#!/usr/bin/env ocrd-wf`
Contributor:

Missing version after ocrd-wf?

alnum = alpha | "0" .. "9"
comment-prefix = "#"

shebang = "#!/usr/bin/env ocrd-wf\n"
Contributor:

Missing version?

Co-authored-by: Clemens Neudecker <952378+cneud@users.noreply.github.com>
Co-authored-by: Stefan Weil <sw@weilnetz.de>
@kba kba marked this pull request as draft February 4, 2021 10:49
@kba kba mentioned this pull request Mar 4, 2022
@kba (Member Author) commented Aug 16, 2022

Thanks everybody for the effort in reviewing this but we have since decided to go down the Nextflow route, so this PR has become obsolete.

@kba kba closed this Aug 16, 2022
@kba kba deleted the workflow-format branch September 6, 2022 15:46