Template tags in glosses in English dictionary (2024-06-01 extraction) #663

Open
yamplum opened this issue Jun 5, 2024 · 2 comments

yamplum commented Jun 5, 2024

Hi there. First of all, thank you very much for providing this project.

I've been trying to understand the structure of the data, in particular the "glosses" lists. My assumption is that the list contains multiple glosses when there is a nested list in the original data, i.e.

  • main
    • sub 1
    • sub 2

turns into { senses: [{ glosses: [main] }, { glosses: [main, sub 1] }, { glosses: [main, sub 2] }] }.
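The assumption above can be written down as a short Python sketch. This is only an illustration of the mapping described in this comment, not the extractor's actual code; `flatten_senses` and the `(text, children)` tuple shape are made up for the example.

```python
def flatten_senses(items, parents=()):
    """Flatten a nested definition list into senses.

    `items` is a list of (text, children) tuples (a hypothetical input
    shape). Each item yields one sense whose "glosses" list holds the
    texts of its ancestors followed by its own text.
    """
    senses = []
    for text, children in items:
        path = [*parents, text]
        senses.append({"glosses": path})
        # Sub-items inherit the full path of their parent.
        senses.extend(flatten_senses(children, path))
    return senses


nested = [("main", [("sub 1", []), ("sub 2", [])])]
result = flatten_senses(nested)
```

Under this reading, a sense should only have more than one gloss when it comes from a nested item, and every gloss before the last one should also appear as the last gloss of some other sense.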

I ran some queries for entries that violate that assumption and found a bunch of glosses with template tags of the form :Template:SAFESUBST:#invoke:ordinal in them, for example (rendered on the website):

{
    "word": "iceberg",
    "pos": "noun",
    "senses": [
        {
            "glosses": [
                "The seaward end of a glacier. [:Template:SAFESUBST:#invoke:ordinal–:Template:SAFESUBST:#invoke:ordinal c.]",
                "The seaward end of a glacier."
            ]
        },
        {
            "glosses": [
                "A huge mass of ocean-floating ice which has broken off a glacier or ice shelf [from :Template:SAFESUBST:#invoke:ordinal c.]",
                "A huge mass of ocean-floating ice which has broken off a glacier or ice shelf"
            ]
        },
        {
            "glosses": [
                "An aloof person. [from :Template:SAFESUBST:#invoke:ordinal c.]",
                "An aloof person."
            ]
        },
        {
            "glosses": [
                "An impending disastrous event whose adverse effects are only beginning to show, in reference to one-tenth of the volume of an iceberg being visible above water."
            ]
        }
    ]
}
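A query along these lines can be sketched in Python. It assumes the extract is read as one JSON object per line, and the `:Template:` marker and the function name `entries_with_template_residue` are choices made for this example, not part of the project.

```python
import json

# Marker assumed to identify unexpanded template text in a gloss.
MARKER = ":Template:"


def entries_with_template_residue(lines):
    """Return (word, pos) for each entry with a gloss containing MARKER.

    `lines` is any iterable of JSON strings, one entry per line.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        for sense in entry.get("senses", []):
            if any(MARKER in gloss for gloss in sense.get("glosses", [])):
                hits.append((entry.get("word"), entry.get("pos")))
                break  # one hit per entry is enough
    return hits


# Small in-memory sample; the second entry is invented for the example.
sample = [
    json.dumps({
        "word": "iceberg",
        "pos": "noun",
        "senses": [{"glosses": [
            "An aloof person. [from :Template:SAFESUBST:#invoke:ordinal c.]",
            "An aloof person.",
        ]}],
    }),
    json.dumps({
        "word": "example",
        "pos": "noun",
        "senses": [{"glosses": ["A clean gloss."]}],
    }),
]
hits = entries_with_template_residue(sample)
```

To run it against a downloaded extract, pass an open file object instead of `sample`, since iterating a text file yields its lines.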

There were also some cases of entries containing multiple glosses with and without notes in parentheses, leading commas, and trailing colons.

I'm not sure how much of this is intentional but at the very least the template part seems quite wrong. Is there any more info available on the structure of the dataset beyond what's in the README? I'd appreciate any pointers.

The complete list of entries: violating.json

Collaborator

xxyzz commented Jun 6, 2024

There is a "JSON data structure browser" page for each edition, for example https://kaikki.org/dictionary/errors/mapping/index.html for the en edition and https://kaikki.org/zhwiktionary/errors/mapping/index.html for the zh edition. Non-en editions also have JSON schema files: https://tatuylonen.github.io/wiktextract/

The [:Template:SAFESUBST* text appears because the ordinal template, which is used by the century template, is not expanded properly.

Some text between brackets is extracted into the tags or topics fields, but our code can't handle every case and might break some gloss texts. The original gloss texts are in the raw_glosses field.
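For anyone who needs the unprocessed text, a minimal Python sketch of preferring raw_glosses could look like this. It assumes raw_glosses may be missing from a sense, in which case it falls back to glosses; the helper name `original_glosses` and the sample sense are invented for illustration.

```python
def original_glosses(sense):
    """Return the unprocessed gloss texts for a sense.

    Prefers "raw_glosses" and falls back to "glosses" when the raw
    field is absent or empty (an assumption about the data).
    """
    return sense.get("raw_glosses") or sense.get("glosses", [])


# Invented sample senses, not taken from the real extract.
with_raw = {
    "glosses": ["A sample definition."],
    "raw_glosses": ["(sample label) A sample definition."],
}
without_raw = {"glosses": ["Another sample definition."]}

raw_text = original_glosses(with_raw)
fallback_text = original_glosses(without_raw)
```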

Collaborator

xxyzz commented Jun 6, 2024

The [:Template:SAFESUBST* text should be fixed by the linked pull request.
