Template tags in glosses in English dictionary (2024-06-01 extraction) #663

yamplum · 2024-06-05T15:51:03Z

Hi there. First of all, thank you very much for providing this project.

I've been trying to understand the structure of the data, in particular the "glosses" lists. My assumption is that the list contains multiple glosses when there is a nested list in the original data, i.e.

main
- sub 1
- sub 2

turns into { senses: [{ glosses: [main] }, { glosses: [main, sub 1] }, { glosses: [main, sub 2] }] }.
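To make that assumption concrete, here is a minimal sketch of the mapping I have in mind. The function name `flatten_nested_glosses` is hypothetical and this is not taken from the extractor's code; it only illustrates the shape I expected.

```python
def flatten_nested_glosses(main, subs):
    """Return the senses list I assumed a nested gloss list turns into.

    Hypothetical helper for illustration only, not wiktextract code.
    """
    # The parent gloss gets its own sense.
    senses = [{"glosses": [main]}]
    # Each sub-gloss becomes a sense that repeats the parent gloss first.
    for sub in subs:
        senses.append({"glosses": [main, sub]})
    return senses


print(flatten_nested_glosses("main", ["sub 1", "sub 2"]))
```

Under this assumption, a sense with more than one gloss would always correspond to a nested list in the original entry.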

I ran some queries for entries that violate that assumption and found a bunch of glosses with template tags of the form :Template:SAFESUBST:#invoke:ordinal in them, for example (rendered on the website):

{
    "word": "iceberg",
    "pos": "noun",
    "senses": [
        {
            "glosses": [
                "The seaward end of a glacier. [:Template:SAFESUBST:#invoke:ordinal–:Template:SAFESUBST:#invoke:ordinal c.]",
                "The seaward end of a glacier."
            ]
        },
        {
            "glosses": [
                "A huge mass of ocean-floating ice which has broken off a glacier or ice shelf [from :Template:SAFESUBST:#invoke:ordinal c.]",
                "A huge mass of ocean-floating ice which has broken off a glacier or ice shelf"
            ]
        },
        {
            "glosses": [
                "An aloof person. [from :Template:SAFESUBST:#invoke:ordinal c.]",
                "An aloof person."
            ]
        },
        {
            "glosses": [
                "An impending disastrous event whose adverse effects are only beginning to show, in reference to one-tenth of the volume of an iceberg being visible above water."
            ]
        }
    ]
}

There were also some cases of entries containing multiple glosses with and without notes in parentheses, leading commas, and trailing colons.
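For reference, a rough sketch of the kind of query that finds the template residue. It assumes the extraction is read as JSON Lines (one entry object per line) and that the marker string `:Template:SAFESUBST:` is enough to spot the problem; the function name is made up for this example.

```python
import json

# Marker copied from the broken glosses shown above.
MARKER = ":Template:SAFESUBST:"


def glosses_with_template_residue(lines):
    """Yield (word, gloss) pairs whose gloss contains the marker.

    `lines` is any iterable of JSON strings, one entry per line
    (an assumption about the file layout).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        for sense in entry.get("senses", []):
            for gloss in sense.get("glosses", []):
                if MARKER in gloss:
                    yield entry.get("word"), gloss
```

An open file handle can be passed directly as `lines`, since iterating over a text file yields one line at a time.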

I'm not sure how much of this is intentional but at the very least the template part seems quite wrong. Is there any more info available on the structure of the dataset beyond what's in the README? I'd appreciate any pointers.

The complete list of entries: violating.json


xxyzz · 2024-06-06T01:09:50Z

There is a "JSON data structure browser" page for each edition, for example https://kaikki.org/dictionary/errors/mapping/index.html for the en edition and https://kaikki.org/zhwiktionary/errors/mapping/index.html for the zh edition. Non-en editions also have JSON schema files: https://tatuylonen.github.io/wiktextract/

The [:Template:SAFESUBST* text appears because the ordinal template, which is used by the century template, is not expanded properly.

Some text between brackets is extracted to the tags or topics fields, but our code can't handle all the cases and might break some gloss texts. The original gloss texts are in the raw_glosses field.
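If you want the unprocessed text, something like the following sketch might work. It assumes raw_glosses is not necessarily present on every sense, so it falls back to glosses; the helper name `original_glosses` is made up for this example.

```python
def original_glosses(sense):
    """Return the unprocessed gloss texts for one sense dict.

    Prefers "raw_glosses" when the field exists and is non-empty,
    otherwise falls back to "glosses" (assumed fallback, not
    documented behavior).
    """
    return sense.get("raw_glosses") or sense.get("glosses", [])
```

For a sense that carries both fields this returns the raw_glosses list; for a sense with only glosses it returns that list unchanged.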

xxyzz · 2024-06-06T01:56:57Z

The [:Template:SAFESUBST* text should be fixed by the linked pull request.

xxyzz mentioned this issue Jun 6, 2024

Remove upper case substitution modifiers tatuylonen/wikitextprocessor#287

Merged
