[Wikibase] Working with JSON dump

Chen edited this page Jan 13, 2018 · 4 revisions

In some cases, you may want to import and analyze Wikibase entities from serialized JSON, for example when processing the JSON dump generated by dumpJson.php (extensions/Wikibase/repo/maintenance/dumpJson.php), which is an array of serialized entities. You can do so with the SerializableEntity class.

Library references

The structure of JSON dump

Assuming you are at the root of your MediaWiki installation, you can dump your Wikibase entities with the following command:

php ./extensions/Wikibase/repo/maintenance/dumpJson.php --no-cache --output=entities.json

entities.json now contains one large JSON array of the dumped Wikibase entities, with the minified JSON of each entity on its own line. Each array element has the following structure:

{
    "type": "item",
    "sitelinks": [
// ...
    ],
    "descriptions": {
        "en": {
            "language": "en",
            "value": "totality of space and all matter and radiation in it,  including planets, galaxies, light, and us; may include their properties such as energy; may include time/spacetime"
        },
        "zh": {
            "language": "zh",
            "value": "一切空间、时间、物质和能量构成的总体"
        },
// ...
    },
    "id": "Q2",
    "claims": {
        "P2": [{
            "type": "statement",
            "references": [],
            "mainsnak": {
                "snaktype": "value",
                "property": "P2",
                "datavalue": {
                    "value": "Q1",
                    "type": "string"
                },
                "datatype": "external-id"
            },
            "qualifiers": [],
            "id": "Q2$997A7A7B-8737-49B6-9386-BD934CE9E2A7",
            "rank": "normal"
        }],
        "P3": [{
            "type": "statement",
            "references": [{
                "hash": "0e556569b6638a2a8a6ee29edef2644b2fc29c15",
                "snaks-order": ["P2"],
                "snaks": {
                    "P2": [{
                        "snaktype": "value",
                        "property": "P2",
                        "datavalue": {
                            "value": "Q1$8983b0ea-4a9c-0902-c0db-785db33f767c",
                            "type": "string"
                        },
                        "datatype": "external-id"
                    }]
                }
            }],
            "mainsnak": {
                "snaktype": "somevalue",
                "property": "P3",
                "datatype": "wikibase-item"
            },
            "qualifiers": [],
            "id": "Q2$47BA934E-9A36-42C5-8767-C4D8D6A3F333",
            "rank": "normal"
        }],
// ...
    },
    "aliases": {
        "en": [{
            "language": "en",
            "value": "Our Universe"
        }, {
            "language": "en",
            "value": "The Universe"
        }, {
            "language": "en",
            "value": "Universe (Ours)"
        }, {
            "language": "en",
            "value": "The Cosmos"
        }, {
            "language": "en",
            "value": "cosmos"
        }]
    },
    "labels": {
        "en": {
            "language": "en",
            "value": "Universe"
        },
        "zh": {
            "language": "zh",
            "value": "宇宙"
        },
// ...
    }
}
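To get a feel for this structure before involving SerializableEntity, the sketch below reads a trimmed-down entity with plain Json.NET (Newtonsoft.Json). The class and helper names (EntityJsonPeek, GetLabel, CountStatements) are made up for illustration and are not part of any library. The check for a JSON object under "claims" is a defensive assumption, since the sample above shows empty collections written as [].

```csharp
using System;
using Newtonsoft.Json.Linq;

public static class EntityJsonPeek
{
    // Returns the label text in the given language, or null if it is missing.
    public static string GetLabel(JObject entity, string language)
        => (string)entity["labels"]?[language]?["value"];

    // Counts the statements listed under every property in "claims".
    public static int CountStatements(JObject entity)
    {
        var count = 0;
        // Only walk "claims" when it is a JSON object keyed by property ID.
        if (entity["claims"] is JObject claims)
        {
            foreach (var property in claims.Properties())
            {
                if (property.Value is JArray statements)
                    count += statements.Count;
            }
        }
        return count;
    }

    public static void Main()
    {
        // A trimmed-down version of the Q2 entity shown above.
        const string json = @"{
            ""type"": ""item"",
            ""id"": ""Q2"",
            ""labels"": { ""en"": { ""language"": ""en"", ""value"": ""Universe"" } },
            ""claims"": {
                ""P3"": [{
                    ""type"": ""statement"",
                    ""mainsnak"": { ""snaktype"": ""somevalue"", ""property"": ""P3"", ""datatype"": ""wikibase-item"" },
                    ""rank"": ""normal""
                }]
            }
        }";
        var entity = JObject.Parse(json);
        // Prints: Q2: Universe, 1 statement(s)
        Console.WriteLine($"{(string)entity["id"]}: {GetLabel(entity, "en")}, {CountStatements(entity)} statement(s)");
    }
}
```

Working on raw JObject instances like this is fine for quick inspection. The SerializableEntity class described below gives you typed access to the same data.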

Load a single entity from JSON

You can use SerializableEntity.Parse(string) to create a SerializableEntity instance from JSON contained in a string, or one of the SerializableEntity.Load overloads to create an instance from a TextReader, a JsonReader, or a JObject.

You can also convert an Entity into a SerializableEntity with the SerializableEntity.Load(IEntity) overload.
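A minimal sketch of both approaches follows. It assumes that SerializableEntity and EntityType live in the WikiClientLibrary.Wikibase namespace and that a trimmed-down entity (without claims or sitelinks) is accepted by the parser. The Describe helper and the sample JSON are made up for illustration.

```csharp
using System;
using System.IO;
using WikiClientLibrary.Wikibase;   // assumed namespace of SerializableEntity

public static class LoadSingleEntityExample
{
    // Hypothetical helper: builds a one-line summary of an entity.
    public static string Describe(SerializableEntity entity)
    {
        var kind = entity.Type == EntityType.Item ? "item" : "non-item";
        return $"{entity.Id} ({kind}): {entity.Labels["en"]}";
    }

    public static void Main()
    {
        const string json = @"{
            ""type"": ""item"",
            ""id"": ""Q2"",
            ""labels"": { ""en"": { ""language"": ""en"", ""value"": ""Universe"" } },
            ""descriptions"": {},
            ""aliases"": {},
            ""claims"": {}
        }";

        // From a string.
        var fromString = SerializableEntity.Parse(json);
        Console.WriteLine(Describe(fromString));

        // From a TextReader; a StreamReader over a file works the same way.
        using (var reader = new StringReader(json))
        {
            var fromReader = SerializableEntity.Load(reader);
            Console.WriteLine(Describe(fromReader));
        }
    }
}
```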

Persist a single entity to JSON

You can use SerializableEntity.ToJsonString or SerializableEntity.ToJObject to persist the entity as Wikibase-compatible JSON.

You can also use SerializableEntity.WriteTo to write the JSON serialization to a TextWriter or a JsonWriter.
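The sketch below shows one way these members might be combined. It assumes the WikiClientLibrary.Wikibase namespace, a parameterless ToJsonString(), and a WriteTo overload that accepts a TextWriter. The PersistEntityExample class and its method names are hypothetical.

```csharp
using System.IO;
using WikiClientLibrary.Wikibase;   // assumed namespace of SerializableEntity

public static class PersistEntityExample
{
    // Parses the JSON and serializes it again, e.g. to normalize formatting.
    public static string RoundTrip(string json)
    {
        var entity = SerializableEntity.Parse(json);
        return entity.ToJsonString();
    }

    // Writes the entity to a file through a TextWriter, so the whole
    // JSON string does not have to be built in memory first.
    public static void Save(SerializableEntity entity, string path)
    {
        using (var writer = new StreamWriter(path))
        {
            entity.WriteTo(writer);
        }
    }
}
```

ToJsonString is convenient for small entities and logging. WriteTo is likely the better fit when the output goes straight to a file or network stream.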

Enumerate through JSON dump of an array of entities

To work with a huge JSON dump exported by dumpJson.php, use one of the SerializableEntity.LoadAll overloads, which load the array of entities from a file name, a TextReader, or a JsonReader. The method returns IEnumerable<SerializableEntity>, so when you consume it in a foreach loop, only the entity currently being processed is held in memory. This makes it possible to process a large number of entities in a forward-only manner.
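As a small starting point, the following sketch counts the items and properties in a dump using the file-name overload. It assumes the WikiClientLibrary.Wikibase namespace. The DumpStatsExample class, the Count helper, and the default entities.json path are chosen only for illustration.

```csharp
using System;
using WikiClientLibrary.Wikibase;   // assumed namespace of SerializableEntity

public static class DumpStatsExample
{
    // Streams through the dump once and tallies entities by type.
    public static (int Items, int Properties) Count(string dumpPath)
    {
        int items = 0, properties = 0;
        // Entities are produced one at a time as the loop advances.
        foreach (var entity in SerializableEntity.LoadAll(dumpPath))
        {
            if (entity.Type == EntityType.Item)
                items++;
            else if (entity.Type == EntityType.Property)
                properties++;
        }
        return (items, properties);
    }

    public static void Main(string[] args)
    {
        var path = args.Length > 0 ? args[0] : "entities.json";
        var (items, properties) = Count(path);
        Console.WriteLine($"{items} items, {properties} properties");
    }
}
```

Avoid calling ToList() or similar on the result for a large dump, since that would materialize every entity at once and defeat the streaming behavior.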

The following code example is taken from DataModulesExporter.cs in crystal-pool/WikibaseClientLite, which converts the input JSON dump file into a set of Lua modules:

// Counters for the progress log at the end of the loop body.
int items = 0, properties = 0;
foreach (var entity in SerializableEntity.LoadAll(itemsDumpReader))
{
    if (entity.Type == EntityType.Item)
        items++;
    else if (entity.Type == EntityType.Property)
        properties++;

    // Preprocess
    entity.Labels = FilterMonolingualTexts(entity.Labels, languages);
    entity.Descriptions = FilterMonolingualTexts(entity.Descriptions, languages);
    entity.Aliases = FilterMonolingualTexts(entity.Aliases, languages);

    // Persist
    using (var module = moduleFactory.GetModule(entity.Id))
    {
        using (var writer = module.Writer)
        {
            WriteProlog(writer, $"Entity: {entity.Id} ({entity.Labels["en"]})");
            using (var luawriter = new JsonLuaWriter(writer) {CloseOutput = false})
            {
                entity.WriteTo(luawriter);
            }

            WriteEpilog(writer);
        }

        await module.SubmitAsync($"Export entity {entity.Id}.");
    }

    if ((items + properties) % 500 == 0)
    {
        Logger.Information("Exported LUA modules for {Items} items and {Properties} properties.", items, properties);
    }
}