Template tags in glosses in English dictionary (2024-06-01 extraction) #663
Comments
There is a "JSON data structure browser" page for each edition, for example https://kaikki.org/dictionary/errors/mapping/index.html for the en edition and https://kaikki.org/zhwiktionary/errors/mapping/index.html for the zh edition. Non-en editions also have JSON schema files: https://tatuylonen.github.io/wiktextract/ The Some text between brackets are extracted to |
The |
Hi there. First of all, thank you very much for providing this project.
I've been trying to understand the structure of the data, in particular the "glosses" lists. My assumption is that the list contains multiple glosses when there is a nested list in the original data, i.e. a top-level item "main" with nested items "sub 1" and "sub 2" turns into
{ senses: [{ glosses: [main] }, { glosses: [main, sub 1] }, { glosses: [main, sub 2] }] }.
I ran some queries for entries that violate that assumption and found a bunch of glosses with template tags of the form
:Template:SAFESUBST:#invoke:ordinal
in them, for example (rendered on the website):
There were also some cases of entries containing multiple glosses with and without notes in parentheses, leading commas, and trailing colons.
I'm not sure how much of this is intentional but at the very least the template part seems quite wrong. Is there any more info available on the structure of the dataset beyond what's in the README? I'd appreciate any pointers.
The complete list of entries: violating.json
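For reference, here is a minimal sketch of the kind of query described above. It is not the exact script used to produce the list. It assumes the extraction is a JSON Lines file in which each line is one entry with a "word" field and a "senses" list whose items may carry a "glosses" list. The function names and the `TEMPLATE_MARKER` constant are made up for illustration, and the "orphan" check only encodes my assumption about nested lists, which may not match what the extractor intends.

```python
import json
from typing import Iterable, Iterator

# Assumption: leaked template calls show up as this literal substring.
TEMPLATE_MARKER = "Template:"


def glosses_with_template_residue(entry: dict) -> list[str]:
    """Return every gloss string in the entry that contains the marker."""
    return [
        gloss
        for sense in entry.get("senses", [])
        for gloss in sense.get("glosses", [])
        if TEMPLATE_MARKER in gloss
    ]


def orphan_subglosses(entry: dict) -> list[list[str]]:
    """Return multi-gloss senses whose parent prefix is not itself a sense.

    Under the assumption above, [main, sub 1] should only appear when a
    sense with glosses [main] exists in the same entry.
    """
    all_glosses = [tuple(s.get("glosses", [])) for s in entry.get("senses", [])]
    seen = set(all_glosses)
    return [list(g) for g in all_glosses if len(g) > 1 and g[:-1] not in seen]


def scan(lines: Iterable[str]) -> Iterator[tuple[str, list[str], list[list[str]]]]:
    """Yield (word, leaked glosses, orphan senses) for each suspicious entry."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        entry = json.loads(line)
        leaked = glosses_with_template_residue(entry)
        orphans = orphan_subglosses(entry)
        if leaked or orphans:
            yield entry.get("word", ""), leaked, orphans
```

To run it over a downloaded dump, pass an open file handle to `scan`, since iterating a text file yields one line at a time; the file path is whatever the extraction was saved as locally.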