Enrich/Improve JSON-LD metadata for chapters and lesson #481

Open
6 tasks
bencomp opened this issue Jun 14, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@bencomp
Contributor

bencomp commented Jun 14, 2023

After discovering TeSS (found through Taxila, which uses the TeSS platform) and their recommendation for using Bioschemas, I wanted to suggest to include such metadata in Carpentries lessons. Then I discovered that metadata is already included in each page! 🙌

I do have a few ideas about potentially improving the (richness of the) metadata:

  • use LearningResource as @type; TrainingMaterial is not defined in schema.org
  • remove the duplicated url and identifier, as the page URL is also used in the @id property
  • do not use build date as datePublished – schema.org defines this as "Date of first broadcast/publication." (It is debatable whether we are doing it 'wrong', so this is not a big deal to me.)
  • differentiate between lesson metadata and episode/chapter metadata: each chapter could stand on its own, but we could also link them as parts of the lesson with isPartOf; keywords that apply to chapters overwriting lesson keywords; etc.
  • use the lesson's status as defined in the configuration for creativeWorkStatus (or a mapping like stable -> active)
  • set the description for each chapter (already marked as to-do in the initialise_metadata function in utils-metadata.R)
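
A minimal R sketch of the creativeWorkStatus idea from the list above. The function name `map_life_cycle()` and the lookup table are hypothetical and not part of {sandpaper}; only the `stable -> active` pair comes from the suggestion itself, and the other status labels and their targets are placeholder assumptions.

```r
# Hypothetical mapping from a lesson's configured status to a
# creativeWorkStatus value. Only "stable" -> "active" is taken from the
# suggestion above; the remaining entries are illustrative placeholders.
status_map <- c(
  "pre-alpha" = "draft",
  "alpha"     = "draft",
  "beta"      = "draft",
  "stable"    = "active"
)

map_life_cycle <- function(life_cycle) {
  # Expects a single character string.
  status <- unname(status_map[life_cycle])
  # Fall back to the configured value itself when no mapping is known.
  if (is.na(status)) life_cycle else status
}

map_life_cycle("stable")
```

Falling back to the configured value keeps unknown statuses visible in the metadata instead of silently dropping them; whether that is the desired behaviour is an open design question.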

There is much more we could add, like maintainers and maybe even objectives (using teaches, I think). (Lesson objectives are also mentioned in #375.)
I can help with at least some of these items, so no rush at all.

@froggleston
Contributor

This looks very cool - as a metadata wonk myself, I really like the idea of enriching the lesson metadata!

zkamvar added the enhancement label Jun 26, 2023
@zkamvar
Contributor

zkamvar commented Jun 26, 2023

Thank you for opening this @bencomp, I apologise that I have been under the weather and handling many urgent tasks lately and knew that I wanted to spend some time with this issue, which is why my reply is late. That being said, I echo @froggleston in that I'm excited about the prospect of improving the metadata in the lessons.

I absolutely agree that the metadata could be improved by a lot and there are so many more things that we could include. I will note that @tobyhodges was also instrumental in communicating with TeSS about what was needed for the metadata.

Point-by-point response

WRT your specific points:

> use LearningResource as @type; TrainingMaterial is not defined in schema.org

Funny enough, that was the exact opposite of the recommendation from TeSS: #236, so I'm at a loss on this one ¯\_(ツ)_/¯

> remove the duplicated url and identifier, as the page URL is also used in the @id property

This makes sense. I've been uncomfortable with this duplication. Apparently the URL is needed for bioschemas if it's known?

> do not use build date as datePublished – schema.org defines this as "Date of first broadcast/publication." (It is debatable whether we are doing it 'wrong', so this is not a big deal to me.)

I think this makes sense. At the moment dateModified and datePublished are exactly the same, which is a shame:

this_metadata$set(c("date", "modified"), format(Sys.Date(), "%F"))
this_metadata$set(c("date", "published"), format(Sys.Date(), "%F"))

I believe datePublished should only be used for lesson publications and that brings up the thorny problem of how to handle lesson publications and versioning 🤔.
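
One way to keep the two dates apart can be sketched in R as follows. This assumes that a previously recorded publication date is available from somewhere (for example the configuration); the helper `set_dates()` and the plain-list `meta` are hypothetical stand-ins, not the actual `this_metadata` object, and the sketch leaves the versioning question open.

```r
# Hypothetical helper: `meta` is a plain list standing in for the metadata
# store; `today` is an argument so the behaviour is easy to check.
set_dates <- function(meta, today = Sys.Date()) {
  stamp <- format(today, "%F")
  if (is.null(meta$date)) meta$date <- list()
  # dateModified always reflects the current build.
  meta$date$modified <- stamp
  # datePublished is only filled in when no earlier value exists.
  if (is.null(meta$date$published)) {
    meta$date$published <- stamp
  }
  meta
}

first <- set_dates(list(), as.Date("2023-06-14"))
later <- set_dates(first, as.Date("2023-06-26"))
str(later$date)
```

In this sketch the second call leaves `published` at the first date and only moves `modified`.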

> differentiate between lesson metadata and episode/chapter metadata: each chapter could stand on its own, but we could also link them as parts of the lesson with isPartOf; keywords that apply to chapters overwriting lesson keywords; etc.

I would be very interested to figure out how to achieve this!

> use the lesson's status as defined in the configuration for creativeWorkStatus (or a mapping like stable -> active)

💯

> set the description for each chapter (already marked as to-do in the initialise_metadata function in utils-metadata.R)

Absolutely. I think solving #239 will help with this.

Sources of Metadata

The metadata as we have it now comes from the config.yaml file and from the dates on which files were created/updated (though this could be better implemented). There are still many opportunities to include richer and (importantly) more meaningful metadata that gives people a better idea of what the lesson/episode is about and who contributed to it.

One of the challenges here is to implement this in a way that minimizes work for the lesson authors/contributors. The config.yaml file was implemented in such a way that it consists of simple key-value pairs or list entries, but no nested lists or sections of block text. This is a conscious choice because nesting and block text in YAML is a huge PITA to write correctly unless you work with it quite regularly (don't even suggest TOML as an alternative. Aside from distrusting anything that calls itself "obvious", it has a different nesting problem that's not easily reasoned about).

An alternative would be to have people write metadata in JSON format, but while many consider JSON to be "human readable", I have seen evidence otherwise. I have seen the panic in the eyes of someone who is new to coding looking at a JSON file with lists and a small bit of nesting.

Thus, #238 and #239 come into play. Since we have the ability to parse Markdown via {pegboard}, this would be a way to extract paragraphs and lists with which to populate metadata.

The last thing to consider is how to provide the authorship metadata. The repositories have all done this differently and it's a bit of a sticky problem (and I've highlighted some of my concerns in #238 (comment)).

@bencomp
Contributor Author

bencomp commented Jun 27, 2023

> … I apologise that I have been under the weather and handling many urgent tasks lately and knew that I wanted to spend some time with this issue, which is why my reply is late.…

No worries and please do not apologise for being under the weather. Life happens.

> Point-by-point response

> WRT your specific points:

> > use LearningResource as @type; TrainingMaterial is not defined in schema.org

> Funny enough, that was the exact opposite of the recommendation from TeSS: #236, so I'm at a loss on this one ¯\_(ツ)_/¯

That was probably a mistake. I looked at their TrainingMaterial profile 1.0-RELEASE, which uses LearningResource as the example @type.

> > remove the duplicated url and identifier, as the page URL is also used in the @id property

> This makes sense. I've been uncomfortable with this duplication. Apparently the URL is needed for bioschemas if it's known?

You are correct here. The examples in the profile suggest to me that @id is used as an identifier for an abstract thing or perhaps an entry in a catalogue, whereas the url is the link to the actual material, possibly on another domain. I would be okay with leaving the url. identifier could be used for the DOI if it is known, e.g., from the config.yaml.
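
A minimal R sketch of that split, under stated assumptions: `page_identity()` is a hypothetical helper, the `doi` argument is assumed to be read from config.yaml when a lesson records one, and the DOI in the usage line is a made-up placeholder.

```r
# Hypothetical helper: keep @id and url, and only emit `identifier`
# when a DOI is known.
page_identity <- function(page_url, doi = NULL) {
  out <- list(`@id` = page_url, url = page_url)
  if (!is.null(doi)) {
    # Express the DOI as a resolvable URL.
    out$identifier <- paste0("https://doi.org/", doi)
  }
  out
}

# Placeholder URL and DOI, for illustration only.
str(page_identity("https://example.org/lesson/index.html", doi = "10.1234/example"))
```

Without a DOI the `identifier` key is simply absent, so the page URL is no longer repeated a third time.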

> > do not use build date as datePublished – schema.org defines this as "Date of first broadcast/publication." (It is debatable whether we are doing it 'wrong', so this is not a big deal to me.)

> I think this makes sense. At the moment dateModified and datePublished are exactly the same, which is a shame:

> this_metadata$set(c("date", "modified"), format(Sys.Date(), "%F"))
> this_metadata$set(c("date", "published"), format(Sys.Date(), "%F"))

> I believe datePublished should only be used for lesson publications and that brings up the thorny problem of how to handle lesson publications and versioning 🤔.

I'm happy to leave that for another issue. I think the code you reference may not run on every build, because I saw different dates for my lesson. Navigating the {sandpaper} codebase still isn't natural to me, so I may be wrong.

> > differentiate between lesson metadata and episode/chapter metadata: each chapter could stand on its own, but we could also link them as parts of the lesson with isPartOf; keywords that apply to chapters overwriting lesson keywords; etc.

> I would be very interested to figure out how to achieve this!

I suppose keywords could go within an episode's YAML header, where the code can find them.
If a lesson has a canonical URL, like the index file, each episode could see that its URL is different and add the link to the index file. But implementing this in R is the interesting challenge 😄
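
The two ideas above could be sketched in R roughly like this. Everything here is an assumption for illustration: `episode_metadata()` is a hypothetical function, the keyword vectors are assumed to have been read from config.yaml and the episode's YAML header already, and `LearningResource` is used as the parent `@type` only because it is the type discussed in this thread.

```r
# Hypothetical helper combining keyword overriding and the isPartOf link.
episode_metadata <- function(page_url, lesson_url, lesson_keywords,
                             episode_keywords = NULL) {
  meta <- list(`@id` = page_url, url = page_url)
  # Episode keywords, when present, replace the lesson-level keywords.
  meta$keywords <- if (length(episode_keywords) > 0) {
    episode_keywords
  } else {
    lesson_keywords
  }
  # Only pages other than the lesson's canonical URL point back to the lesson.
  if (!identical(page_url, lesson_url)) {
    meta$isPartOf <- list(`@type` = "LearningResource", `@id` = lesson_url)
  }
  meta
}

lesson <- "https://example.org/lesson/index.html"
str(episode_metadata("https://example.org/lesson/01-intro.html", lesson,
                     lesson_keywords = c("data", "R"),
                     episode_keywords = "introduction"))
```

Comparing URLs keeps the check independent of how pages are named, but it does assume the lesson's canonical URL is known at build time.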

> > use the lesson's status as defined in the configuration for creativeWorkStatus (or a mapping like stable -> active)

> 💯

> > set the description for each chapter (already marked as to-do in the initialise_metadata function in utils-metadata.R)

> Absolutely. I think solving #239 will help with this.

Maybe a concatenation of chapter/episode questions would be a simpler (temporary) alternative?
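
A minimal sketch of that concatenation in R, assuming the episode's questions have already been extracted into a character vector (how they are extracted is out of scope here). The function name `describe_from_questions()` is hypothetical.

```r
# Hypothetical helper: build a temporary description from episode questions.
describe_from_questions <- function(questions) {
  # Trim surrounding whitespace and drop empty entries before joining.
  questions <- trimws(questions)
  questions <- questions[nzchar(questions)]
  paste(questions, collapse = " ")
}

describe_from_questions(c("What is JSON-LD? ", "", "How do search engines use it?"))
# [1] "What is JSON-LD? How do search engines use it?"
```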

> Sources of Metadata

I totally agree with your goal to keep metadata entry/management simple. (I think YAML is fine, especially with clear comments to explain values.)

Authorship (and other contribution roles) is indeed tough to keep track of, at least automatically. I would say we could add a static "The Carpentries and lesson contributors" to the JSON-LD, but that doesn't feel satisfactory either.

3 participants