Support multiprocessing on macOS and Windows #88

Merged: 10 commits, Aug 23, 2023

Commits on Aug 18, 2023

  1. Remove page_handler argument from process_dump()

    This callback function was added to select pages in some namespaces;
    this should be done with the `namespace_ids` argument, which queries
    the SQLite database.
    
    Also fix the "subprocess is still running" resource warning.
    xxyzz committed Aug 18, 2023
    8b11d08
  2. Remove process() and reprocess() methods from Wtp class

    The code no longer relies on the Linux fork method to copy variables.
    The multiprocessing code is moved to wiktextract; unpicklable objects
    like the SQLite connection and the lupa runtime are handled there.
    xxyzz committed Aug 18, 2023
    dba6397
  3. Delete tests/test_long.py

    This file mostly tests `process_dump()`; move those tests to
    `tests/test_dumppaser.py`. Extracting each page is still tested in
    wiktextract.
    xxyzz committed Aug 18, 2023
    05f2ace
  4. 4edfb17
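
Commits 1 and 2 above describe two related changes: selecting pages by namespace in SQL rather than through a Python callback, and keeping unpicklable objects such as the SQLite connection out of what is sent to worker processes. The following is a minimal sketch of that pattern, not the code from this PR. The `pages` table layout and the names `init_worker`, `page_length`, `titles_in_namespaces` and `process_all` are hypothetical, chosen only for illustration.

```python
import multiprocessing
import sqlite3

# Per-process state. Each worker opens its own connection, because a
# sqlite3.Connection cannot be pickled and sent to a child process.
_conn = None


def init_worker(db_path):
    """Pool initializer: runs once in every worker process."""
    global _conn
    _conn = sqlite3.connect(db_path)


def page_length(title):
    """Hypothetical per-page task: read the page body and return its size."""
    row = _conn.execute(
        "SELECT body FROM pages WHERE title = ?", (title,)
    ).fetchone()
    return title, len(row[0])


def titles_in_namespaces(db_path, namespace_ids):
    """Select pages by namespace in SQL instead of filtering in a callback."""
    placeholders = ",".join("?" * len(namespace_ids))
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT title FROM pages "
            f"WHERE namespace_id IN ({placeholders}) ORDER BY title",
            list(namespace_ids),
        ).fetchall()
    finally:
        conn.close()
    return [r[0] for r in rows]


def process_all(db_path, namespace_ids, num_workers=2):
    """Only picklable values (a path and titles) cross the process boundary."""
    titles = titles_in_namespaces(db_path, namespace_ids)
    # "spawn" is the default start method on macOS and Windows.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(
        num_workers, initializer=init_worker, initargs=(db_path,)
    ) as pool:
        return dict(pool.imap_unordered(page_length, titles))
```

Under the `spawn` start method, the call to `process_all()` has to sit behind an `if __name__ == "__main__":` guard in the launching script, because each worker re-imports the main module instead of inheriting the parent's memory.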

Commits on Aug 21, 2023

  1. Use lru_cache on Wtp.get_page()

    `get_page()` is rather slow (about 1s per call); caching the requests
    improves performance significantly. This reduces the processing time
    of the Chinese Wiktionary from 40 minutes to 10 minutes.
    xxyzz committed Aug 21, 2023
    4b3d963
  2. ff48e09
  3. 88988e5
  4. Set Wtp.get_page() cache size to 1000

    The English Wiktionary has 40327 template and 54524 module pages
    (excluding redirects). I think the cache is mostly useful for caching
    shared template and module pages used by the pages processed in a
    worker process. So the cache size depends on the number of pages, the
    number of CPU cores and the shared templates. If 1000 is not enough,
    we can increase it to 10000.
    xxyzz committed Aug 21, 2023
    53b35e1
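
The caching described in the Aug 21 commits can be illustrated with `functools.lru_cache` and a bounded `maxsize`. This is a rough sketch rather than the actual `Wtp` implementation: `PageStore` and its `slow_lookups` counter are hypothetical stand-ins, and the dictionary lookup replaces the slow database read.

```python
import functools


class PageStore:
    """Hypothetical stand-in for an object with a slow get_page()."""

    def __init__(self, pages):
        self._pages = pages
        self.slow_lookups = 0  # counts calls that miss the cache

    @functools.lru_cache(maxsize=1000)
    def get_page(self, title):
        # Stand-in for the slow lookup; runs only on a cache miss.
        self.slow_lookups += 1
        return self._pages.get(title)


store = PageStore({"Template:en-noun": "template body"})
for _ in range(5):
    store.get_page("Template:en-noun")

info = PageStore.get_page.cache_info()
print(store.slow_lookups, info.hits, info.misses)  # prints "1 4 1"
```

`cache_info()` reports hits, misses and the current size, which is one way to judge whether 1000 entries is enough before raising the limit. Two caveats apply to this sketch: the cache is keyed on `(self, title)` and so keeps the instance alive, and with multiprocessing each worker process holds its own separate cache.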

Commits on Aug 22, 2023

  1. 8feccd3
  2. a4a0bfc