
first draft of workflow format #171

Closed
wants to merge 12 commits

Conversation

@kba (Member) commented Aug 15, 2020

Spec defines a format for describing OCR-D workflows, independent of the workflow engine to run the workflow.

See https://github.com/kba/ocrd-workflows for examples of workflows defined in this format

@kba kba requested review from bertsky, cneud and EEngl52 August 15, 2020 14:46
kba added a commit to OCR-D/core that referenced this pull request Aug 15, 2020
@bertsky (Collaborator) left a comment


I kind of like the idea of using a subset of POSIX sh for this. But we must be careful not to be too restrictive (so the format is expressive enough to actually get used) or too flexible (so the format won't be implementable or translatable to makefiles easily).

Notably, we should consider whether this static approach is sufficient. (You cannot make the configuration of follow-up workflow steps dependent on intermediate results here.)

One feature that might be worth considering for this format as well is (automatically) versioning workflows, as explained here.

a valid `sh` identifier, everything after `=` is parsed as a string.

<!-- TODO define behavior? -->
**NOTE** The semantics of assigning `value` to `name` are implementation-dependant
@bertsky (Collaborator):

Suggested change
**NOTE** The semantics of assigning `value` to `name` are implementation-dependant
**NOTE** The semantics of assigning `value` to `name` are implementation-dependent

But what do you mean by that?

@bertsky (Collaborator):

In particular, where and how in the step definition do you use variables? You'll really have to describe variable expansion, especially its interaction with quote expansion. In practice, we'd need to be able to do things like:

bin=wolf
param_bin='{"input_feature_filter": "binarized", "output_feature_added": "binarized", "command": "scribo-cli $bin @INFILE @OUTFILE --enable-negate-output --k 0.2 --win-size 301"}'
ocrd-preprocess-image         -I OCR-D-IMG               -O OCR-D-BIN-$bin          -P level-of-operation page -p "$param_bin"
ocrd-cis-ocropy-denoise       -I OCR-D-BIN-$bin          -O OCR-D-BIN-$bin-DEN      -P level-of-operation page
ocrd-tesserocr-deskew         -I OCR-D-BIN-$bin-DEN      -O OCR-D-BIN-$bin-DEN-FLIP -P operation_level page
ocrd-cis-ocropy-deskew        -I OCR-D-BIN-$bin-DEN-FLIP -O OCR-D-BIN-$bin-DEN-DESK -P level-of-operation page
ocrd-tesserocr-segment-region -I OCR-D-BIN-$bin-DEN-DESK -O OCR-D-SEG-REG
ocrd-preprocess-image         -I OCR-D-SEG-REG           -O OCR-D-SEG-REG-BIN-$bin  -P level-of-operation region -p "$param_bin"
...
ocrd-preprocess-image         -I OCR-D-SEG-LINE          -O OCR-D-SEG-LINE-BIN-$bin -P level-of-operation line -p "$param_bin"

(We may need variable expansion in fileGrp names whenever multiple alternatives are used for different purposes, like different binarizations for segmentation and recognition. In the parameter JSON and parameter literal options, the need is obvious – for brevity and changeability.)

@kba (Member Author):

For this initial draft, I side-stepped the question of how variables come into play, because I need input (like yours, thx) on how this should work.

I really like how standard = variable assignments work in Makefiles and would like to see the same semantics with /bin/sh syntax. I.e. the example you posted would be valid but at workflow execution time, users should be able to override those assignments to use a different model, a different set of parameters etc.

Consistency checking would then also include a check that there is an assignment (either in the file or provided by the workflow engine) to replace all variables in the file.

For ease of parsing and visual effect, I'd prefer ${foo} syntax over $foo.

Also, I think this should not adhere to bourne shell variable expansion (which e.g. does not work inside single quotes and can be escaped with a backslash) but rather the more aggressive make approach to replace all $(var) calls everywhere.
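The sh quoting rules referenced here can be demonstrated in isolation (a minimal plain-`sh` illustration, not OCRD-WF syntax):

```shell
bin=wolf
echo "$bin"   # double quotes: expanded, prints: wolf
echo '$bin'   # single quotes: no expansion, prints: $bin
echo \$bin    # backslash escape: no expansion, prints: $bin
```

make, by contrast, substitutes its `$(var)` references textually before any quoting is considered – which is the "more aggressive" behavior described above.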

@bertsky (Collaborator):

I really like how standard = variable assignments work in Makefiles and would like to see the same semantics with /bin/sh syntax. I.e. the example you posted would be valid but at workflow execution time, users should be able to override those assignments to use a different model, a different set of parameters etc.

I agree with you aesthetically, but then you definitely need a completely different non-sh (non-static scoping) semantics. You essentially want the declarative model for makefiles. Which is not very intuitive for most users I must say.

Also, I think this should not adhere to bourne shell variable expansion (which e.g. does not work inside single quotes and can be escaped with a backslash) but rather the more aggressive make approach to replace all $(var) calls everywhere.

The make approach is problematic itself though – exactly because of its word expansion (and ignorance of any quotation). In sh-like syntax, you can at least have nested quotation of arbitrary depth.

Having a concise, readable and runnable declarative syntax for workflows was among my initial motivations for workflow-configuration. This could also be parsed to run independent of make. I fail to see the merit of introducing yet another syntax if not for simplicity and intuitivity (which I still believe sh-like configuration serves best).

@kba (Member Author):

You essentially want the declarative model for makefiles. Which is not very intuitive for most users I must say.

Why do you think it is unintuitive? The OCRD-WF defines variables and command line calls. Variables are hoisted at parse-time (which is different from shell script, true, but since we do not support any conditional syntax, it makes no sense to support non-global scoping).

In sh-like syntax, you can at least have nested quotation of arbitrary depth.

That's a good point. make's interpolation is more powerful IMHO (because it doesn't care about quotes) but it can lead to unintuitive behavior, I agree.

This could also be parsed to run independent of make.

But only if you stick to a completely static sequence of steps, without any fileGrp-guessing logic and the like – nothing against trying to be smart, just that Makefiles are not trivial to parse, with all their more or less arcane incantations :)

I fail to see the merit of introducing yet another syntax if not for simplicity and intuitivity (which I still believe sh-like configuration serves best).

The merit of this syntax is supposed to be interoperability. At the very least, OCRD-WF should be exportable to (less intuitive than hand-written but still complete) workflow-configuration makefiles or ocrd process calls.

@kba (Member Author):

Because variables can get their values from below the line where they are used, and multiple assignments make no sense (and don't work).

True, that's what I meant by "hoisting": variable assignments are processed before any steps, even if the assignments and steps are interwoven. This is decidedly non-sh behavior and unintuitive if you approach OCRD-WF as a shell script. But if you do approach it like a shell script, you might be tempted to assume statefulness and expect conditionals or loop constructs to be supported, which they are not. There absolutely is a benefit to allowing state/conditionals (like the group-guessing in workflow-configuration scripts) and if we want to support this, we should decide on it now. If not, I'd prefer the make assignment semantics.

I was thinking of something like make -Bsn > workflow.sh (which could usually be rewritten for ocrd process)

This pattern never occurred to me but is very clever. Can you add that to https://hackmd.io/emBLZdD5SyWietFxLLpnQg?
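The dry-run pattern above can be sketched with a toy makefile (hypothetical rule, variable, and fileGrp names – real workflow-configuration makefiles are more involved):

```shell
tmp=$(mktemp -d)
# A toy makefile with one variable and one recipe line;
# $(BIN) is a make macro, expanded before the shell ever sees the command.
printf 'BIN = wolf\nall:\n\techo ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN-$(BIN)\n' > "$tmp/Makefile"
# -B: remake everything, -s: silent, -n: dry run --
# print the expanded recipe commands instead of running them
(cd "$tmp" && make -Bsn) > "$tmp/workflow.sh"
cat "$tmp/workflow.sh"
```

The resulting workflow.sh contains the fully expanded command sequence, which could then be rewritten as an ocrd process call.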

But a truly sh-based restricted syntax would be both interoperable and simple, even with standard backslash/quote expansion.

I'm with you, I care about simplicity and interoperability more than about commenting lines with escaped backslashes :)

How do you plan on implementing this translation anyway? Some form of bash -x workflow.ocrd.sh or set -n perhaps, or are you going to write a (restricted) sh parser of your own (talking of incantations...)?

Parsing with the algorithm outlined here results in a data structure

assignments:
  foo: bar
  bar: 42
steps:
  - executable: ocrd-foo
    parameters:
      baz: foo
    input_file_grp: MAX
    output_file_grp: FOO

Then we'd have adapters in ocrd_modelfactory that generate ocrd process calls or workflow-configuration makefiles or the CSV that @VolkerHartmann uses for invoking taverna or a shell script (which would be more or less the same as the OCRD-WF input).
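As a rough sketch of one such adapter (hypothetical shell rendering; it assumes the `ocrd process` convention of task strings with the `ocrd-` prefix stripped), a single parsed step could be serialized like this:

```shell
# One parsed step, mirroring the YAML model above
executable=ocrd-foo
input_file_grp=MAX
output_file_grp=FOO
parameters='{"baz": "foo"}'
# Build an ocrd process task string; processor names drop the "ocrd-" prefix
task="${executable#ocrd-} -I $input_file_grp -O $output_file_grp -p '$parameters'"
echo "ocrd process \"$task\""
```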

@bertsky (Collaborator):

But if you do approach it like a shell script, you might be tempted to assume statefulness and expect conditionals or loop constructs to be supported, which they are not

Yes. And statefulness is something that makes your model-driven semantics hard again. Not sure myself whether users actually want statefulness. But if it's going to be the declarative, make-like syntax, then we need a better metaphor than the current shell script / command shell / ocrd+sh / .ocrd.sh formulations. The term configuration language usually does not cover variable references, but macro language is too closely related to actual programming...

There absolutely is a benefit to allowing state/conditionals (like the group-guessing in workflow-configuration scripts) and if we want to support this, we should decide on it now.

Yes, and – as pointed out above – that's probably true not just for static conditionals (i.e. file checks or variable references) but also dynamic switches (dependent on a previous result). But this would have to go beyond OCR-D CLI (and into OCR-D-Q) anyway.

If what we can offer here is too little for what actual OCR-D users need and can have with their own shell scripts or makefiles, then OCRD-WF is not going to be used much, I am afraid. Perhaps it's just too early to decide on all this.

Can you add that to https://hackmd.io/emBLZdD5SyWietFxLLpnQg?

Done.

Parsing with the algorithm outlined here results in a data structure

I see. So you are planning on implementing your own parser. I hope this is not necessary – if we could find a language suited for us, then we could just delegate implementation and documentation to that.

@bertsky (Collaborator):

Common Workflow Language does have variable references BTW. But it's quite a complex spec itself, definitely not simple.

Do we target human readers/writers here, or will they be ultimately using some GUI in the future anyway?

@kba (Member Author):

We were considering CWL as an alternative to Taverna way back at the beginning of phase 2 of OCR-D but decided against it then because Taverna offered a GUI (IIRC, @VolkerHartmann?). I still like the design a lot, but it is very different from shell script.

Do we target human readers/writers here, or will they be ultimately using some GUI in the future anyway?

I'd say that human-readability and intuitiveness are very important. A "tiny" language based on shell script fits that description and allows us, asap, to provide reusable workflows in a repository like https://github.com/kba/ocrd-workflows

If what we can offer here is too little for what actual OCR-D users need and can have with their own shell scripts or makefiles, then OCRD-WF is not going to be used much, I am afraid.

The advantage over a plain shell script would be in the tooling and validation. I agree with you that it is pointless to define a script/configuration/workflow language that is then not used. We do want implementers to use OCRD-WF, the common workflows defined should be the basis for integration into digitization workflow software like Kitodo, Visual Library or DWork and the upcoming HTTP API should make use of it (hence the "IANA considerations").

Perhaps it's just too early to decide on all this.

Maybe we can discuss this in the Open Tech Call on Thursday, so I understand better what users/developers expect from such a workflow language and adapt the proposal accordingly. I think a solution (that can be revised with new requirements later) is fairly urgent but we should not prevent future use cases with early design choices.

@bertsky (Collaborator):

Yes. And statefulness is something that makes your model-driven semantics hard again. Not sure myself whether users actually want statefulness. But if it's going to be the declarative, make-like syntax, then we need a better metaphor than the current shell script / command shell / ocrd+sh / .ocrd.sh formulations. The term configuration language usually does not cover variable references, but macro language is too closely related to actual programming...

I'd say that human-readability and intuitiveness are very important. A "tiny" language based on shell script fits that description and allows us, asap, to provide reusable workflows

Let's stay with the restricted shell script (as opposed to custom declarative syntax) idea for a moment.

Is the statefulness of sh-like variables really a problem? Yes, variables can be re-assigned (before re-use). But that does not necessarily make your semantics difficult. I don't think you actually need to keep the variable values after computing the commands they are used to formulate. You could basically just run the entire OCRD-WF through sh with sh -x workflow.ocrd.sh 2>&1 | sed -ne "s/^+ //p" to get your (machine-readable) steps – without the need for a parser for variables and quotation. Whatever human-readable combination of variable assignments was needed to arrive at this does not matter in the end. It's only the expanded result that counts. (You can then parse the verbatim commands to arrive at your model.)
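That extraction trick can be demonstrated with a stand-in workflow (using `echo` in place of an actual processor, so the sketch runs without OCR-D installed):

```shell
tmp=$(mktemp -d)
cat > "$tmp/workflow.ocrd.sh" <<'EOF'
bin=wolf
echo ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN-$bin
EOF
# sh -x traces each command on stderr with a "+ " prefix, variables already
# expanded; discard regular stdout and strip the trace prefix:
steps=$(sh -x "$tmp/workflow.ocrd.sh" 2>&1 >/dev/null | sed -ne 's/^+ //p')
echo "$steps"
```

The trace contains both the assignment and the expanded command line (`... -O OCR-D-BIN-wolf`), i.e. the machine-readable step list, without writing a parser for variables and quotation.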

Arguably, as soon as you do allow dynamic workflow switches, you will need conditionals though. Representing this is going to be more difficult, but does not necessitate declarative syntax either. In fact, it could still be done with a proper sh subset (if/then/elif/fi). You lose the free parser though.

On that broader topic again – dynamic switches, depending on result checks – today's call made me think about the following simple concept: How about we introduced a dedicated CLI for OCR-D workflow quality estimators analogous to OCR-D workflow processors? Modules could bundle their knowledge about what a good result is for a particular processor along with everything else. And specialized modules could provide QE tools by the bunch. Let's call them benchmarks for the time being. We are primarily interested in a benchmark's score, which needs to be compared to some threshold to determine if the processor's result was "okay" enough to continue processing. That threshold could be different, depending on the workflow configuration. Benchmarks could also have configurable parameters (like model files) of their own.

An important question is how to deal with page-wise vs document-wise quality. Do we want to stop if even a single page indicated failure? Or does it take a bad average? Or a prolonged series of bad pages? Or a min-worst-page / min-average kind of set-up?

Regardless, besides the binary score > threshold check we might also be interested in additional, verbose output on what was analysed and what patterns were found – either as part of logging, or as a structured report. So we also must allow creating an output fileGrp.

Now let's assume a call to a benchmark looks like this:

ocrdtest-cis-ocropy-binarization -I OCR-D-BIN-WOLF -O OCR-D-BIN-WOLF-TEST-OCRO -P method component-statistics -P threshold 0.9

(with its return value indicating success or failure, just as the return value of a processor).

Then we could write dynamic workflows like so:

if BIN=wolf
  BINARIZED=OCR-D-BIN-$BIN
  ocrd-olena-binarize -I OCR-D-IMG -O $BINARIZED -P impl $BIN
  ocrdtest-cis-ocropy-binarization -I $BINARIZED -O $BINARIZED-TEST-OCRO -P method component-statistics -P threshold 0.9
then : 
elif BIN=ocropy
  # try another algorithm
  BINARIZED=OCR-D-BIN-$BIN
  ocrd-cis-ocropy-binarize -I OCR-D-IMG -O $BINARIZED -P method $BIN -P level-of-operation page
  ocrdtest-cis-ocropy-binarization -I $BINARIZED -O $BINARIZED-TEST-OCRO -P method component-statistics -P threshold 0.9
then : 
elif BIN=wolf
  # try another parameter
  BINARIZED=OCR-D-BIN-$BIN-HIGH
  ocrd-olena-binarize -I OCR-D-IMG -O $BINARIZED -P impl $BIN -P k 0.1
  ocrdtest-cis-ocropy-binarization -I $BINARIZED -O $BINARIZED-TEST-OCRO -P method component-statistics -P threshold 0.9
then : 
else 
  # give up the workflow
  exit 1
fi
ocrd-tesserocr-deskew -I $BINARIZED -O OCR-D-DESK -P operation_level page
if CROPPED=OCR-D-CROP-TESS
  ocrd-tesserocr-crop -I OCR-D-DESK -O $CROPPED
  ocrdtest-leptonica-cropping -I $CROPPED -P threshold 0.7
then : 
elif CROPPED=OCR-D-CROP-ANY
  ocrd-anybaseocr-crop -I OCR-D-DESK -O $CROPPED
  ocrdtest-leptonica-cropping -I $CROPPED -P threshold 0.5
then : 
else 
  # omit cropping
  CROPPED=OCR-D-DESK
fi
ocrd-tesserocr-segment-region -I $CROPPED -O OCR-D-SEG-TESS 
# and so on...

(borrowing the sh -e convention that any non-zero return value causes the workflow to fail, except if it was immediate to a conditional expression)

That would be fully dynamic, but still not allow arbitrary information flow like determining model names. For the latter, some notion of functional CLI would be needed...

kba and others added 5 commits August 16, 2020 15:36
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
kba and others added 3 commits August 17, 2020 12:32
@bertsky (Collaborator) commented Aug 18, 2020

One feature that might be worth considering for this format as well is (automatically) versioning workflows, as explained here.

Also: inclusion of other files!

@kba (Member Author) commented Aug 18, 2020

Also: inclusion of other files!

Definitely. The idea behind adding variables - even if unsure about the semantics yet - was to support some form of inheritance.

### Shebang

The first line of every OCRD-WF must match the regular expression
`#!/usr/bin/env ocrd-wf-v[1-9]+[0-9]*\n`
Contributor:

Suggested change
`#!/usr/bin/env ocrd-wf-v[1-9]+[0-9]*\n`
`#!/usr/bin/env ocrd-wf-v[1-9][0-9]*\n`
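A check against the corrected pattern could be sketched as follows (hypothetical file names; the `$` anchor assumes no trailing content on the shebang line):

```shell
tmp=$(mktemp -d)
printf '#!/usr/bin/env ocrd-wf-v1\n'  > "$tmp/good.ocrd.sh"
printf '#!/usr/bin/env ocrd-wf-v01\n' > "$tmp/bad.ocrd.sh"
# Validate only the first line against the corrected shebang regex
check_shebang() { head -n1 "$1" | grep -qE '^#!/usr/bin/env ocrd-wf-v[1-9][0-9]*$'; }
check_shebang "$tmp/good.ocrd.sh"; echo "good: $?"   # good: 0 (accepted)
check_shebang "$tmp/bad.ocrd.sh";  echo "bad: $?"    # bad: 1 (leading zero rejected)
```

With the original `v[1-9]+[0-9]*` pattern, a version like `v11` would match ambiguously; the suggested `v[1-9][0-9]*` admits exactly the positive integers without leading zeros.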


### Check shebang

* assert that line[0] is `#!/usr/bin/env ocrd-wf`
Contributor:

Missing version after ocrd-wf?

alnum = alpha | "0" .. "9"
comment-prefix = "#"

shebang = "#!/usr/bin/env ocrd-wf\n"
Contributor:

Missing version?

Co-authored-by: Clemens Neudecker <952378+cneud@users.noreply.github.com>
Co-authored-by: Stefan Weil <sw@weilnetz.de>
@kba kba marked this pull request as draft February 4, 2021 10:49
@kba kba mentioned this pull request Mar 4, 2022
@kba (Member Author) commented Aug 16, 2022

Thanks everybody for the effort in reviewing this but we have since decided to go down the Nextflow route, so this PR has become obsolete.

@kba kba closed this Aug 16, 2022
@kba kba deleted the workflow-format branch September 6, 2022 15:46