Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bare bones add #23

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from
Draft

Bare bones add #23

wants to merge 2 commits into from

Conversation

kba
Copy link
Contributor

@kba kba commented Aug 5, 2022

These changes are not meant to be taken over as-is, they are just adaptions to make ocrd-import

  1. faster
  2. ensure the generated page ID have a form that results in predictable and make_file_id-comptaible IDs

Just wanted to make a draft PR as a discussion basis.

@kba kba mentioned this pull request Aug 8, 2022
Copy link
Owner

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am willing to help make ocrd-import more versatile, but don't fully understand your use-case (esp. the P_ change).

Comment on lines -7 to +3
function debug { ocrd log -n ocrd-import debug "$1"; }
function critical { echo critical "$1"; }
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the OCR-D logging facility is too inefficient, I suggest to use a proper mechanism to override it with an additional switch. For example, by aliasing ocrd log -n ocrd-import or echo to a common log backend.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was just a quick hack to reduce the overhead of calling ocrd log.

For example, by aliasing ocrd log -n ocrd-import or echo to a common log backend.

How do you mean, aliasing?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you mean, aliasing?

I meant

alias log="ocrd log -n ocrd-import"
...
((fastlog)) && alias log=echo
...
function debug { log debug "$1"; }
...
function critical { log critical "$1"; }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 that is a nice solution and configurable too.

@@ -144,6 +144,7 @@ for file in $(find -L . -type f -not -name mets.xml -not -name "*.log" | sort);
set -e
trap rollback ERR
page=p${zeros:0:$((4-${#num}))}$num
echo "PAGE=$page"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better use the log fn

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was just debugging by @stefanCCS, can go without replacement.

@@ -165,104 +166,107 @@ for file in $(find -L . -type f -not -name mets.xml -not -name "*.log" | sort);
# also, avoid . in IDs, because downstream it will confuse filename suffix detection
base="${base//[ :.]/_}"
if ! [[ ${base:0:1} =~ [a-zA-Z] ]]; then
base=f${base}
#base=P_${base}
a=0 # just do something to have a correct syntax for this 'if'
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-XS-compliant page names are a problem regardless of numpageid or not. I don't see why this branch should be abandoned. (But if you do, : is the sh word for no-op.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what the intention was here. @stefanCCS ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kba , @bertsky : My intention was to create the same result using ocrd-import as with using ocrd workspace add.
Here pageelements looking like this
<mets:div TYPE="page" ID="P_3871_007767869_00004">
having an IDstarting with P_ (instead of starting with f).

Comment on lines -265 to +183
done
#case "$mimetype" in
# ${MIMETYPE_PAGE})
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to have a version with next to no checks, the right approach would be to just add a switch to circumvent the no-clash check for resulting pages/files. Everything else is already fast (like the PAGE vs ALTO check for .xml) or can be deactivated (like the --no-convert switch).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was very broad commenting out. The "offending" call that slows down import is

clashes=($(ocrd workspace find -i "$base" -k local_filename -k mimetype -k pageId))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants