[MediaWiki] Generators

Chen edited this page Nov 9, 2017 · 32 revisions

The MediaWiki API uses the notions of lists and generators to represent a (possibly long) sequence of items, e.g. allpages, categorymembers, and revisions. Usually, each item in a MediaWiki "list" contains the page ID, title, and namespace ID of a page (i.e. WikiPageStub in WCL). For some lists, such as recentchanges, the items also carry extra information, such as the timestamp of the change and the name of the user who made it. For lists like users, the items are users rather than basic page information.

A MediaWiki "generator", on the other hand, uses a list as its backend and generates a sequence of page information (i.e. WikiPage in WCL), optionally including the page content. Generators are convenient if you are interested in the page information, but they lose the extra information that the corresponding list can provide. Note that not every MediaWiki list can feed a generator: users and abusefilters, for example, do not produce a sequence of pages, so using them as generators would be pointless.

For long sequences, the MediaWiki API splits the results into multiple parts, and the client should use query continuations to ask for the next part. In WCL, continuations are encapsulated in IAsyncEnumerable<T>, which requests more results from the server when necessary. You just need to keep enumerating the returned IAsyncEnumerable<T>.
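To make the idea of query continuation more concrete, here is a minimal, self-contained sketch of the kind of loop that a client has to run under the hood. It does not use WCL or the real MediaWiki API: `Batch`, `QueryAsync`, `TakeAsync`, and the in-memory title list are all hypothetical names made up for illustration, and the continuation token is simply an index into that list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for one API response: a batch of items plus an
// opaque continuation token (null when the sequence is exhausted).
class Batch
{
    public List<string> Items;
    public string Continue;
}

static class ContinuationSketch
{
    // Hypothetical in-memory "server" holding the whole sequence.
    static readonly string[] AllTitles = { "A", "B", "C", "D", "E", "F", "G" };

    // Counts the simulated requests, to show the effect of the batch size.
    public static int RequestCount;

    // Simulates one request: returns at most `limit` items, starting at the
    // position encoded in `continuation` (null means "from the beginning").
    static Task<Batch> QueryAsync(string continuation, int limit)
    {
        RequestCount++;
        int start = continuation == null ? 0 : int.Parse(continuation);
        var items = AllTitles.Skip(start).Take(limit).ToList();
        int next = start + items.Count;
        return Task.FromResult(new Batch
        {
            Items = items,
            Continue = next < AllTitles.Length ? next.ToString() : null
        });
    }

    // Keeps requesting batches until `count` items have been collected
    // or the server reports that there is nothing left.
    public static async Task<List<string>> TakeAsync(int count, int paginationSize)
    {
        var result = new List<string>();
        string continuation = null;
        do
        {
            var batch = await QueryAsync(continuation, paginationSize);
            foreach (var item in batch.Items)
            {
                if (result.Count == count) return result;
                result.Add(item);
            }
            continuation = batch.Continue;
        } while (continuation != null && result.Count < count);
        return result;
    }
}
```

With the seven hypothetical titles above, `TakeAsync(5, 2)` issues three simulated requests, while `TakeAsync(5, 5)` needs only one. This is the same trade-off that the pagination size of a real generator controls.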

In WCL, lists and generators are represented by classes derived from WikiList<TItem> and WikiPageGenerator<TItem, TPage>, respectively. Because the latter derives from the former, every WikiPageGenerator-derived class can be used either as a list or as a generator, depending on your needs.

WCL ships some ready-made generators in the WikiClientLibrary.Generators namespace. You can also implement your own generator classes if necessary; take a look at the library code for reference.

Library references

How to work with IAsyncEnumerable<T>

IAsyncEnumerable<T> and IAsyncEnumerator<T> are introduced in the Ix.Async package as the asynchronous counterparts of IEnumerable<T> and IEnumerator<T>. With the Ix.Async package, you can consume these asynchronous enumerators in much the same way as you work with ordinary enumerators.

  • You can use all the LINQ extension methods on IAsyncEnumerable<T>.
  • You can use the Rx.NET package to convert IAsyncEnumerable<T> to IObservable<T>, if necessary.
  • For now, you can consume the items in IAsyncEnumerable<T> sequentially using the expanded for-each pattern (see the ShowAllTemplatesAsync method below for an example). Later, if an asynchronous for-each is introduced into C# (hopefully in C# 8), you might be able to use it on IAsyncEnumerable<T> directly.

Some caveats when consuming the IAsyncEnumerable<T> returned by the generator classes in Wiki Client Library:

  • Choose a proper PaginationSize. It decides how many items (at most) are fetched from the server in one MediaWiki API request. For example, if you are working with the top 50 items from RecentChangesGenerator, you might choose 50 rather than the default 10 as the PaginationSize value, so that they are all fetched in a single request.
  • The maximum allowed PaginationSize is usually 500 for normal users, and 5000 for users with the api-highlimits right (typically bots and sysops).
    • If you are using the PageQueryOptions.FetchContent flag with EnumPagesAsync, this limit is lowered to 1/10, i.e. 50 for normal users, and 500 for users with the api-highlimits right.
    • If you are using the PageQueryOptions.FetchExtract flag with EnumPagesAsync, this limit is lowered to 10 for normal users, and 20 for users with the api-highlimits right.
    • For the sake of network stability, it is advised that you use 50 for typical in-batch WikiPage processing. Pywikibot also uses this value for pagination in its site.preload method.
  • Do not forget to chain the returned IAsyncEnumerable<T> with Take(n) if you are only interested in the top n items in the sequence.
  • In most cases, do not attempt to reverse (AsyncEnumerable.Reverse) a sequence returned by WCL unless you know what you are doing, because reversing has to enumerate the whole sequence first.
  • If you are working with a large number of pages, it is recommended that you convert the returned IAsyncEnumerable<T> to something like IObservable<T> or ISourceBlock<T>, or use the expanded for-each pattern.
  • A common idiom for fetching a small number of results from the generator is as follows.
static async Task ShowRecentChangesAsync()
{
    var generator = new RecentChangesGenerator(myWikiSite)
    {
        // Choose wisely.
        PaginationSize = 50,
        // Configure the generator, e.g. setting filter/sorting criteria
        NamespaceIds = new[] {BuiltInNamespaces.Main, BuiltInNamespaces.File},
        AnonymousFilter = PropertyFilterOption.WithProperty
    };
    // Gets the latest 50 changes made by anonymous users
    // in the main (article) and File namespaces.
    var items = await generator.EnumItemsAsync().Take(50).ToList();
    foreach (var i in items)
    {
        Console.WriteLine(i.Title);
        // Show revision comments.
        Console.Write("\t");
        Console.WriteLine(i.Comment);
    }

    // When fetching extracts for the pages, it is safe to request
    // no more than 10 pages at a time.
    generator.PaginationSize = 10;
    // Gets the latest 50 pages in the main (article) and File namespaces
    // that were changed by anonymous users.
    var pages = await generator.EnumPagesAsync(PageQueryOptions.FetchExtract).Take(50).ToList();
    foreach (var i in pages)
    {
        Console.WriteLine(i.Title);
        // Show the extract of each changed page.
        Console.Write("\t");
        Console.WriteLine(i.Extract);
    }
}
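
The pagination limits listed in the caveats above can be summarised as a small lookup. The helper below is a hypothetical sketch, not part of WCL: the name `MaxPaginationSize` and its boolean parameters are made up for illustration, and it assumes that when content and extracts are requested together, the stricter of the two limits applies. The actual limits are ultimately decided by the wiki server.

```csharp
using System;

static class PaginationLimits
{
    // Hypothetical helper (not a WCL API): returns the largest pagination
    // size suggested by the limits described in this page.
    public static int MaxPaginationSize(bool hasApiHighLimits, bool fetchContent, bool fetchExtract)
    {
        // Base limit: 500 for normal users, 5000 with api-highlimits.
        int limit = hasApiHighLimits ? 5000 : 500;
        // Fetching page content lowers the limit to one tenth.
        if (fetchContent)
            limit = Math.Min(limit, hasApiHighLimits ? 500 : 50);
        // Fetching extracts lowers the limit further.
        // Assumption: with both flags set, the stricter limit wins.
        if (fetchExtract)
            limit = Math.Min(limit, hasApiHighLimits ? 20 : 10);
        return limit;
    }
}
```

For example, a normal user who requests extracts should stay at or below `MaxPaginationSize(false, false, true)`, which matches the value of 10 used in ShowRecentChangesAsync above.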

How to consume IWikiList-implementation classes

static async Task SearchAsync()
{
    Console.Write("Enter your search keyword: ");
    var generator = new SearchGenerator(myWikiSite, Console.ReadLine())
    {
        PaginationSize = 22
    };
    // We are only interested in the top 20 items.
    foreach (var item in await generator.EnumItemsAsync().Take(20).ToList())
    {
        Console.WriteLine(item);
        Console.WriteLine("\t{0}", item.Snippet);
    }
}

Most of the WikiPageGenerator-derived classes (including AllPagesGenerator) implement IWikiList<WikiPageStub>, i.e. .EnumItemsAsync() returns a sequence of WikiPageStub. If you are only interested in the titles of the pages, consider using .EnumItemsAsync() instead of .EnumPagesAsync().

Still, some classes implement IWikiList<T> where T is something other than WikiPageStub, including

  • class RecentChangesGenerator : WikiPageGenerator<RecentChangeItem, WikiPage>, IWikiList<RecentChangeItem>, IWikiPageGenerator<WikiPage>
  • class SearchGenerator : WikiPageGenerator<SearchResultItem, WikiPage>, IWikiList<SearchResultItem>, IWikiPageGenerator<WikiPage>
  • class GeoSearchGenerator : WikiPageGenerator<GeoSearchResultItem, WikiPage>, IWikiList<GeoSearchResultItem>, IWikiPageGenerator<WikiPage>
  • class RevisionsGenerator : WikiPagePropertyGenerator<Revision, WikiPage>, IWikiList<Revision>, IWikiPageGenerator<WikiPage>

How to consume IWikiPageGenerator-implementation classes

static async Task ShowAllTemplatesAsync()
{
    var generator = new AllPagesGenerator(myWikiSite)
    {
        StartTitle = "A",
        NamespaceId = BuiltInNamespaces.Template,
        PaginationSize = 50
    };
    // You can specify EnumPagesAsync(PageQueryOptions.FetchContent),
    // if you are interested in the content of each page
    using (var enumerator = generator.EnumPagesAsync().GetEnumerator())
    {
        int index = 0;
        // Before the advent of "async for" (might be introduced in C# 8),
        // to handle the items in sequence one by one, we need to use
        // the expanded for-each pattern.
        while (await enumerator.MoveNext(CancellationToken.None))
        {
            var page = enumerator.Current;
            Console.WriteLine("{0}: {1}", index, page);
            index++;
            // Prompt user to continue listing, every 50 pages.
            if (index % 50 == 0)
            {
                Console.WriteLine("Esc to exit, any other key for next page.");
                if (Console.ReadKey().Key == ConsoleKey.Escape)
                    break;
            }
        }
    }
}