CTDA9-287: Highlighting plugin integration #7

Merged: 20 commits, Aug 2, 2023
73 changes: 67 additions & 6 deletions README.md
@@ -10,26 +10,87 @@ Install as usual, see
[this](https://www.drupal.org/docs/extending-drupal/installing-modules) for
further information.

This module includes a migration that creates a media use term for use in common Islandora configurations. Enabling the module exposes the `islandora_hocr_media_uses` migration, which generates a media use term with the URI `https://discoverygarden.ca/use#hocr`.

```shell
# Flow might be something like:
drush en islandora_hocr
drush migrate:import islandora_hocr_media_uses
```

## Configuration


### Derivatives

An action must be created and configured to generate an hOCR derivative. The
action must also be triggered by a context in order for the derivative to be
made. Refer to the [official Islandora docs][islandora-docs] for more information.

### Solr

This module relies on the [Solr OCR Highlighting Plugin](https://dbmdz.github.io/solr-ocrhighlighting/). The particulars of its installation depend on the environment into which it is being installed.

A single environment variable specifies the path of the library on the Solr instance, so that the path can be added to the Solr configset:

- `SOLR_HOCR_PLUGIN_PATH`: A path resolvable by Solr to the directory containing the OCR Highlighting Plugin JAR.

Two config entities are included:

- the `islandora_hocr` field type, which performs tokenization
- the "Select w/ OCR Highlighting Component" `/select_ocr` request handler

### HOCR Indexing

`node` entities can now index hOCR from related media, making use of the [Solr OCR Highlighting Plugin](https://dbmdz.github.io/solr-ocrhighlighting/0.8.3/).

For example, you might add the `islandora_hocr_field:content` property to be indexed in Solr via the Search API Solr configuration, under the name `islandora_hocr_field` and with the type `Fulltext ("islandora_hocr")`.

As an aside, `islandora_hocr_field:uri` is presently a prototype. The Solr OCR Highlighting Plugin has another character filter that resolves paths into the contents of the files; however, when components communicate over the network, such access might not always be possible, particularly once access control enters the equation. As such, we presently expect the full page-level OCR document to be pushed for each page.

## Usage

Assuming indexing is configured as above, with an `islandora_hocr_field` field, you might programmatically perform a Search API query with something like:

```php
$index = \Drupal\search_api\Entity\Index::load('default_solr_index');
$query = $index->query();

// The search term(s).
$query->keys('bravo');
// Additional conditions, as desired.
$query->addCondition('type', 'islandora_object');
// Activate our highlighting behaviour.
$query->setOption('islandora_hocr_properties', [
  'islandora_hocr_field' => [],
]);

// Perform the query.
$results = $query->execute();

// Get the additionally-populated property info, so we can identify which
// fields in the highlighted results correspond to which property.
$info = $results->getQuery()->getOption('islandora_hocr_properties');
// This should be an associative array mapping language codes to Solr fields,
// which can then be found in the $highlights below.
$language_fields = $info['islandora_hocr_field']['language_fields'];

// When processing the results, highlighting info can be acquired from the
// items. The format here is the same as the format from
// https://dbmdz.github.io/solr-ocrhighlighting/0.8.3/query/#response-format
// for the given item/document.
foreach ($results as $result) {
  $highlights = $result->getExtraData('islandora_hocr_highlights');
}
```
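
To make the relationship between `$language_fields` and `$highlights` more concrete, here is a minimal sketch of collecting snippet text per language. It rests on several assumptions: that the highlight data arrives as nested PHP arrays, that each Solr field entry carries a `snippets` list whose members have a `text` key (as in the plugin's documented response format), and that the helper name `example_collect_hocr_snippets` and the sample field name `ocr_und` are purely illustrative and not part of this module.

```php
<?php

declare(strict_types=1);

/**
 * Collects snippet text per language from one item's highlight data.
 *
 * Hypothetical helper, not provided by islandora_hocr.
 *
 * @param array $highlights
 *   Assumed shape: Solr field name => ['snippets' => [...], 'numTotal' => int].
 * @param array $language_fields
 *   Map of language code => Solr field name.
 *
 * @return array
 *   Language code => list of snippet strings, highlight markup included.
 */
function example_collect_hocr_snippets(array $highlights, array $language_fields): array {
  $collected = [];
  foreach ($language_fields as $langcode => $solr_field) {
    // Fields without any match are simply absent, so fall back to an empty list.
    foreach ($highlights[$solr_field]['snippets'] ?? [] as $snippet) {
      if (isset($snippet['text'])) {
        $collected[$langcode][] = $snippet['text'];
      }
    }
  }
  return $collected;
}

// Made-up sample data standing in for the values obtained in the query above.
$language_fields = ['und' => 'ocr_und'];
$highlights = [
  'ocr_und' => [
    'snippets' => [
      ['text' => 'alpha <em>bravo</em> charlie', 'score' => 1.0],
    ],
    'numTotal' => 1,
  ],
];

$collected = example_collect_hocr_snippets($highlights, $language_fields);
print_r($collected);
```

In a real query loop, the helper would be called once per `$result` with that item's `$highlights`; check the actual structure returned in your environment before relying on these keys.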

## Troubleshooting/Issues

Having problems or solved one? Contact
[discoverygarden](http://support.discoverygarden.ca).

### Known issues

- [Solr Cloud Package](https://dbmdz.github.io/solr-ocrhighlighting/0.8.3/installation/#for-solrcloud-users-installation-as-a-solr-package) (in)compatibility: The path to the library could be omitted; however, the conditional inclusion of prefixes in the config entities is problematic.

## Maintainers/Sponsors

Current maintainers:
@@ -0,0 +1,89 @@
langcode: en
status: true
dependencies:
  module:
    - search_api_solr
    - islandora_hocr
  config:
    - search_api_solr.solr_field_type.text_und_7_0_0
id: islandora_hocr_und_7_0_0
label: 'Language Undefined hOCR Field'
minimum_solr_version: 7.0.0
custom_code: 'islandora_hocr'
field_type_language_code: und
domains: {}
field_type:
  name: islandora_hocr_und
  class: solr.TextField
  storeOffsetsWithPositions: true
  analyzers:
    -
      type: index
      charFilters:
        -
          class: solrocr.OcrCharFilterFactory
        -
          class: solr.MappingCharFilterFactory
          mapping: accents_und.txt
      tokenizer:
        class: solr.WhitespaceTokenizerFactory
      filters:
        -
          class: solr.StopFilterFactory
          ignoreCase: true
          words: stopwords_und.txt
        -
          class: solr.WordDelimiterGraphFilterFactory
          catenateNumbers: 1
          generateNumberParts: 1
          protected: protwords_und.txt
          splitOnCaseChange: 0
          generateWordParts: 1
          preserveOriginal: 1
          catenateAll: 0
          catenateWords: 1
        -
          class: solr.LengthFilterFactory
          min: 2
          max: 100
        -
          class: solr.LowerCaseFilterFactory
        -
          class: solr.RemoveDuplicatesTokenFilterFactory
    -
      type: query
      charFilters:
        -
          class: solr.MappingCharFilterFactory
          mapping: accents_und.txt
      tokenizer:
        class: solr.WhitespaceTokenizerFactory
      filters:
        -
          class: solr.SynonymGraphFilterFactory
          synonyms: synonyms_und.txt
          expand: true
          ignoreCase: true
        -
          class: solr.StopFilterFactory
          ignoreCase: true
          words: stopwords_und.txt
        -
          class: solr.WordDelimiterGraphFilterFactory
          catenateNumbers: 0
          generateNumberParts: 1
          protected: protwords_und.txt
          splitOnCaseChange: 0
          generateWordParts: 1
          preserveOriginal: 1
          catenateAll: 0
          catenateWords: 0
        -
          class: solr.LengthFilterFactory
          min: 2
          max: 100
        -
          class: solr.LowerCaseFilterFactory
        -
          class: solr.RemoveDuplicatesTokenFilterFactory
solr_configs: null
@@ -0,0 +1,64 @@
langcode: en
status: true
dependencies:
  module:
    - islandora_hocr
  config:
    - search_api_solr.solr_field_type.islandora_hocr_und_7_0_0
id: request_handler_select_islandora_hocr_7_0_0
label: Select w/ OCR Highlighting Component
minimum_solr_version: 7.0.0
environments: { }
recommended: true
request_handler:
  name: /select_ocr
  class: solr.SearchHandler
  lst:
    -
      name: defaults
      str:
        -
          name: defType
          VALUE: lucene
        -
          name: df
          VALUE: id
        -
          name: echoParams
          VALUE: explicit
        -
          name: omitHeader
          VALUE: 'true'
        -
          name: timeAllowed
          VALUE: '${solr.selectSearchHandler.timeAllowed:-1}'
        -
          name: spellcheck
          VALUE: 'false'
  arr:
    -
      name: components
      str:
        -
          VALUE: query
        -
          VALUE: facet
        -
          VALUE: mlt
        -
          VALUE: ocrHighlight
        -
          VALUE: highlight
        -
          VALUE: stats
        -
          VALUE: debug
        -
          VALUE: spellcheck
        -
          VALUE: elevator
solr_configs:
  searchComponents:
    -
      name: ocrHighlight
      class: solrocr.OcrHighlightComponent
29 changes: 29 additions & 0 deletions islandora_hocr.post_update.php
@@ -0,0 +1,29 @@
<?php

/**
 * @file
 * Post-update hooks.
 */

use Symfony\Component\Yaml\Yaml;

/**
 * Install initial request handler and field type if they do not yet exist.
 */
function islandora_hocr_post_update_install_initial_entities() {
  $ids = [
    'search_api_solr.solr_field_type.islandora_hocr_und_7_0_0',
    'search_api_solr.solr_request_handler.request_handler_select_islandora_hocr_7_0_0',
  ];

  $config_dir = \Drupal::service('extension.list.module')->getPath('islandora_hocr') . '/config/install';

  foreach ($ids as $id) {
    $data = Yaml::parseFile("{$config_dir}/{$id}.yml");
    $config = \Drupal::configFactory()->getEditable($id);
    // Only create the config entity when it is not already present.
    if ($config->isNew()) {
      $config->initWithData($data)->save(TRUE);
    }
  }
}
7 changes: 7 additions & 0 deletions islandora_hocr.services.yml
@@ -0,0 +1,7 @@
---
services:
  islandora_hocr.highlighting_solr_config:
    class: Drupal\islandora_hocr\EventSubscriber\HighlightingSolrConfigEventSubscriber
    factory: [Drupal\islandora_hocr\EventSubscriber\HighlightingSolrConfigEventSubscriber, create]
    tags:
      - name: event_subscriber