From ce2aa421061e56b8f69cad2bac6cc25f338a8dbe Mon Sep 17 00:00:00 2001
From: Joe Corall
Date: Thu, 9 May 2024 10:39:04 -0400
Subject: [PATCH 1/8] Update hOCR docs

---
 README.md | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index 36e3871..a681f94 100644
--- a/README.md
+++ b/README.md
@@ -118,19 +118,10 @@ viewer will be able to highlight text found via a search, and display a search i
 within the viewer.
 ### Setting up hOCR
+
 To display a text overlay, Mirador must be provided with hOCR text data - which is OCR'd text that includes position information for the extracted text relative to the image that is being displayed. Here are the steps:
-1. Go to "Administration » Structure » Media Types", select the "**File**" media type, and click "**Manage Fields**".
-2. Add a new field to the **File** media type called "**hOCR extracted Text**". Set the allowed file extensions to "xml"
![media-file-field_hocr_extracted-file-label.png](docs%2Fmedia-file-field_hocr_extracted-file-label.png) ![media-file-field_hocr_extracted-file-extensions.png](docs%2Fmedia-file-field_hocr_extracted-file-extensions.png)
-3. Go to "Administration » Configuration » System » Actions" and click "**Create New Advanced Action**" with the "**Generate Extracted Text for Media Attachment**" action type.
![action-hocr-extracted-text.png](docs%2Faction-hocr-extracted-text.png)
-![action-hocr-extracted-text-config.png](docs%2Faction-hocr-extracted-text-config.png)
- - Give the new action a name that mentions hOCR.
- - In Format field select hOCR Extracted Text with Positional Data
- - For Destination File Field Name select the field you just created (`field_hocr_extracted_text`)
- - Keep *None* for the destination text field
- - And save the action
-4. Go to " Administration » Structure » Context" and edit the **Page Derivatives** context
![context-paged-derivatives-add-reaction.png](docs%2Fcontext-paged-derivatives-add-reaction.png)
- - Click **Add Reaction** and choose "**Derive File for Existing Media**"
- - In the select box choose the action you created above and save.
+1. Install https://github.com/discoverygarden/islandora_hocr
+2.
 ### Test hOCR
 Follow these steps to confirm that hOCR is working.

From 74c5abeaf036f155281124224afec5b90bfbceef Mon Sep 17 00:00:00 2001
From: Joe Corall
Date: Thu, 9 May 2024 10:47:21 -0400
Subject: [PATCH 2/8] Update README.md

---
 README.md | 21 ++-------------------
 1 file changed, 2 insertions(+), 19 deletions(-)

diff --git a/README.md b/README.md
index a681f94..1029528 100644
--- a/README.md
+++ b/README.md
@@ -120,7 +120,7 @@ within the viewer.
 ### Setting up hOCR
 To display a text overlay, Mirador must be provided with hOCR text data - which is OCR'd text that includes position information for the extracted text relative to the image that is being displayed. Here are the steps:
-1. Install https://github.com/discoverygarden/islandora_hocr
+1. Install https://github.com/discoverygarden/islandora_hocr and https://github.com/Born-Digital-US/islandora_iiif_hocr
 2.
 ### Test hOCR
@@ -139,24 +139,7 @@ Assuming hOCR is [set up](#setting-up-hocr) and [tested](#test-hocr)...
 We will show how to set up IIIF manifests to include text overlay in Mirador for single pages, and for paged content.
-1. Go to "Administration » Structure » Views" and edit the **IIIF Manifest** view. This is included in the Islandora Starter Site.
-2. There should be two displays, one for single-page nodes, and one for paged content. They are distinguished by their Contextual filters, found under the "Advanced" tab. In both cases, they have relationships for "field_media_of: Content" (required), and "field_media_use: Taxonomy term" (not required)
![view-iiif-manifest-all-relationships.png](docs%2Fview-iiif-manifest-all-relationships.png).
However, they differ in their contextual filters:
- - The single-page contextual filter uses the current Media entity's "Media of" value, matching it with the "Content ID from the URL". The effect of this is to select all Media objects that are attached to the node identified by the current url.
![view-iiif-manifest-1page-contextual-filter.png](docs%2Fview-iiif-manifest-1page-contextual-filter.png)
- - The paged-content contextual filter uses the "Content: Member of" relationship to find Media objects that are attached to children of the current node, identified by "Content ID from URL".
![view-iiif-manifest-paged-contextual-filter.png](docs%2Fview-iiif-manifest-paged-contextual-filter.png)
-3. The two displays also differ in their path, under "Path Settings". For the single page manifest display, it would normally be `/node/[%node]/manifest` (matching what was configured on the [islandora mirador configuration page](#configuration)), whereas for the paged-content manifest display, it would normally be `/node/[%node]/book-manifest`.
-
-The rest of the settings for the two displays are identical, as follows...
-![view-iiif-manifest-shared-settings.png](docs%2Fview-iiif-manifest-shared-settings.png)
-1. In the left column, under "Fields", add "hOCR Extracted Text".
-2. In the left column, under "Format", the Style plugin "IIIF Manifest" should be selected. Click "Settings". You will see two sets of checkboxes - "Tile source field(s)" and "Structured OCR data file field". Under "Structured OCR data file field", check "Media: hOCR extracted Text".
![view-iiif-manifest-style-settings.png](docs%2Fview-iiif-manifest-style-settings.png)
-3. In the "Filter criteria" section of the form, ensure that the "field_media_use: Taxonomy Term" filter is set to filter on the OriginalFile media term (not ServiceFile).
-4. Save the view.
-
-To test...
-1. Go to the Page node you created in [test ocr](#test-hocr) and add "/manifest" to the end of the URL, or whatever you configured in the single page manifest view display.
-2. Look for a seeAlso section in the XML that should contain a reference to the hOCR field with appropriate MIME Type and Description.
-3. Repeat for the paged content node, substituting "/book-manifest" to the end of the url, or whatever you configured for the paged content manifest view display.
-
+1. Follow the instructions at [https://github.com/Born-Digital-US/islandora_iiif_hocr](https://github.com/Born-Digital-US/islandora_iiif_hocr#usage)
 ### Configuring the Mirador viewer to display for Pages and Paged Content using Contexts

From 375263e9f957903514e401d456784a3b90100093 Mon Sep 17 00:00:00 2001
From: Joe Corall
Date: Thu, 9 May 2024 10:54:13 -0400
Subject: [PATCH 3/8] Update README.md

---
 README.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 1029528..f1700de 100644
--- a/README.md
+++ b/README.md
@@ -120,8 +120,10 @@ within the viewer.
 ### Setting up hOCR
 To display a text overlay, Mirador must be provided with hOCR text data - which is OCR'd text that includes position information for the extracted text relative to the image that is being displayed. Here are the steps:
-1. Install https://github.com/discoverygarden/islandora_hocr and https://github.com/Born-Digital-US/islandora_iiif_hocr
-2.
+
+1. Install the [Solr OCR Highlighting Plugin](https://dbmdz.github.io/solr-ocrhighlighting/0.8.3/) in your solr server TODO: consider adding to isle-buildkit
+2. 
Install the Drupal modules https://github.com/discoverygarden/islandora_hocr and https://github.com/Born-Digital-US/islandora_iiif_hocr
+3. Create a derivative action so when Original File images are uploaded to your repository a `file` media entity is created with `field_media_use` equal to the `hOCR` media use term created by https://github.com/discoverygarden/islandora_hocr. TODO: consider having this action and modules ship with the starter site?
 ### Test hOCR
 Follow these steps to confirm that hOCR is working.

From b9cfc67f7b8a98bd540a08db67356a4a8af93e5f Mon Sep 17 00:00:00 2001
From: Joe Corall
Date: Thu, 9 May 2024 10:58:02 -0400
Subject: [PATCH 4/8] Update README.md

---
 README.md | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index f1700de..da296a8 100644
--- a/README.md
+++ b/README.md
@@ -123,7 +123,8 @@ To display a text overlay, Mirador must be provided with hOCR text data - which
 1. Install the [Solr OCR Highlighting Plugin](https://dbmdz.github.io/solr-ocrhighlighting/0.8.3/) in your solr server TODO: consider adding to isle-buildkit
 2. Install the Drupal modules https://github.com/discoverygarden/islandora_hocr and https://github.com/Born-Digital-US/islandora_iiif_hocr
-3. Create a derivative action so when Original File images are uploaded to your repository a `file` media entity is created with `field_media_use` equal to the `hOCR` media use term created by https://github.com/discoverygarden/islandora_hocr. TODO: consider having this action and modules ship with the starter site?
+3. Can't remember the order if you have to do this before or after the module install but there's a couple tweaks to the solr config XML files you need to make. Though `dgi/islandora_hocr` creates some of the necessary solr server components there's still some XML tweaks that need made
+4. 
Create a derivative action so when Original File images are uploaded to your repository a `file` media entity is created with `field_media_use` equal to the `hOCR` media use term created by https://github.com/discoverygarden/islandora_hocr. TODO: consider having this action and modules ship with the starter site?
 ### Test hOCR
 Follow these steps to confirm that hOCR is working.
@@ -139,9 +140,7 @@ Follow these steps to confirm that hOCR is working.
 ### Configuring the IIIF Manifest view for the Manifest additions
 Assuming hOCR is [set up](#setting-up-hocr) and [tested](#test-hocr)...
-We will show how to set up IIIF manifests to include text overlay in Mirador for single pages, and for paged content.
-
-1. Follow the instructions at [https://github.com/Born-Digital-US/islandora_iiif_hocr](https://github.com/Born-Digital-US/islandora_iiif_hocr#usage)
+Follow the instructions at [https://github.com/Born-Digital-US/islandora_iiif_hocr](https://github.com/Born-Digital-US/islandora_iiif_hocr#usage) to allow searching inside the mirador viewer on your hOCR
 ### Configuring the Mirador viewer to display for Pages and Paged Content using Contexts

From f9f607e263fb787e6df4a90ca08706f6da58c87a Mon Sep 17 00:00:00 2001
From: Joe Corall
Date: Thu, 9 May 2024 11:01:20 -0400
Subject: [PATCH 5/8] Update README.md

---
 README.md | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index da296a8..334b109 100644
--- a/README.md
+++ b/README.md
@@ -123,7 +123,14 @@ To display a text overlay, Mirador must be provided with hOCR text data - which
 1. Install the [Solr OCR Highlighting Plugin](https://dbmdz.github.io/solr-ocrhighlighting/0.8.3/) in your solr server TODO: consider adding to isle-buildkit
 2. Install the Drupal modules https://github.com/discoverygarden/islandora_hocr and https://github.com/Born-Digital-US/islandora_iiif_hocr
-3. 
Can't remember the order if you have to do this before or after the module install but there's a couple tweaks to the solr config XML files you need to make. Though `dgi/islandora_hocr` creates some of the necessary solr server components there's still some XML tweaks that need made
+3. Add
+```
+
+```
+
+to your solr server's `solrconfig.xml`
 4. Create a derivative action so when Original File images are uploaded to your repository a `file` media entity is created with `field_media_use` equal to the `hOCR` media use term created by https://github.com/discoverygarden/islandora_hocr. TODO: consider having this action and modules ship with the starter site?
 ### Test hOCR

From db50d80b775f5f040f2ac8b11c7ceda8b6bf27ff Mon Sep 17 00:00:00 2001
From: Joe Corall
Date: Thu, 9 May 2024 11:02:38 -0400
Subject: [PATCH 6/8] Update README.md

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 334b109..8b291a4 100644
--- a/README.md
+++ b/README.md
@@ -131,6 +131,7 @@ To display a text overlay, Mirador must be provided with hOCR text data - which
 ```
 to your solr server's `solrconfig.xml`
+
 4. Create a derivative action so when Original File images are uploaded to your repository a `file` media entity is created with `field_media_use` equal to the `hOCR` media use term created by https://github.com/discoverygarden/islandora_hocr. TODO: consider having this action and modules ship with the starter site?
 ### Test hOCR

From d9b88dbcee112c1e49da96a2ef9ec254751e48a6 Mon Sep 17 00:00:00 2001
From: Joe Corall
Date: Thu, 20 Jun 2024 15:38:42 -0400
Subject: [PATCH 7/8] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 8b291a4..27409d2 100644
--- a/README.md
+++ b/README.md
@@ -121,7 +121,7 @@ within the viewer.
 To display a text overlay, Mirador must be provided with hOCR text data - which is OCR'd text that includes position information for the extracted text relative to the image that is being displayed. Here are the steps:
-1. Install the [Solr OCR Highlighting Plugin](https://dbmdz.github.io/solr-ocrhighlighting/0.8.3/) in your solr server TODO: consider adding to isle-buildkit
+1. Ensure you're running isle-buildkit version 3.2.6 or above
 2. Install the Drupal modules https://github.com/discoverygarden/islandora_hocr and https://github.com/Born-Digital-US/islandora_iiif_hocr
 3. Add
 ```
@@ -132,7 +132,7 @@ To display a text overlay, Mirador must be provided with hOCR text data - which
 to your solr server's `solrconfig.xml`
-4. Create a derivative action so when Original File images are uploaded to your repository a `file` media entity is created with `field_media_use` equal to the `hOCR` media use term created by https://github.com/discoverygarden/islandora_hocr. TODO: consider having this action and modules ship with the starter site?
+4. Create a derivative action so when Original File images are uploaded to your repository a `file` media entity is created with `field_media_use` equal to the `hOCR` media use term created by https://github.com/discoverygarden/islandora_hocr
 ### Test hOCR
 Follow these steps to confirm that hOCR is working.

From bc5c65029f2226305b899a5c9e0cb7e977020d87 Mon Sep 17 00:00:00 2001
From: Joe Corall
Date: Tue, 30 Jul 2024 09:33:13 -0400
Subject: [PATCH 8/8] Remove old steps

---
 README.md | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 27409d2..10e79b2 100644
--- a/README.md
+++ b/README.md
@@ -121,18 +121,9 @@ within the viewer.
 To display a text overlay, Mirador must be provided with hOCR text data - which is OCR'd text that includes position information for the extracted text relative to the image that is being displayed. Here are the steps:
-1. 
Ensure you're running isle-buildkit version 3.2.6 or above
+1. Ensure you're running isle-buildkit version 3.2.12 or above
 2. Install the Drupal modules https://github.com/discoverygarden/islandora_hocr and https://github.com/Born-Digital-US/islandora_iiif_hocr
-3. Add
-```
-
-```
-
-to your solr server's `solrconfig.xml`
-
-4. Create a derivative action so when Original File images are uploaded to your repository a `file` media entity is created with `field_media_use` equal to the `hOCR` media use term created by https://github.com/discoverygarden/islandora_hocr
+3. Create a derivative action so when Original File images are uploaded to your repository a `file` media entity is created with `field_media_use` equal to the `hOCR` media use term created by https://github.com/discoverygarden/islandora_hocr
 ### Test hOCR
 Follow these steps to confirm that hOCR is working.
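
Note on step 1 of the final README text: the minimum isle-buildkit version moves from 3.2.6 to 3.2.12 across this series, and those two tags sort the wrong way round under plain string comparison. Below is a minimal sketch of a numeric check, assuming tags are plain dotted integers such as `3.2.12`. The names `parse_version`, `meets_minimum`, and `MINIMUM_BUILDKIT` are hypothetical and are not part of isle-buildkit or either Drupal module.

```python
# Hypothetical helper: compare a dotted isle-buildkit tag against the
# minimum named in step 1 of the README. Assumes purely numeric components.

MINIMUM_BUILDKIT = "3.2.12"  # value taken from the final patch in this series


def parse_version(tag: str) -> tuple:
    """Turn '3.2.12' (or 'v3.2.12') into a tuple of ints: (3, 2, 12)."""
    return tuple(int(part) for part in tag.strip().lstrip("v").split("."))


def meets_minimum(tag: str, minimum: str = MINIMUM_BUILDKIT) -> bool:
    """Return True when `tag` is numerically at or above `minimum`."""
    return parse_version(tag) >= parse_version(minimum)
```

With this sketch, `meets_minimum("3.2.6")` is false and `meets_minimum("3.3.0")` is true. Tags that carry a suffix (for example a release-candidate marker) would raise `ValueError` and need extra handling.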