Support multiprocessing on macOS and Windows #88
Commits on Aug 18, 2023
- Remove `page_handler` argument from `process_dump()` (8b11d08)
This callback function was added to select pages in certain namespaces; that should instead be done by passing the `namespace_ids` argument when querying the SQLite database. Also fixes the "subprocess is still running" resource warning.
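As a rough illustration of selecting pages by namespace in the query rather than in a per-page callback, here is a minimal sketch. The `pages` table, its columns, and the `iter_pages()` helper are hypothetical and are not taken from the project's actual schema or API; only the `namespace_ids` name comes from the commit message.

```python
import sqlite3
from typing import Iterable, Iterator, Tuple


def iter_pages(
    conn: sqlite3.Connection, namespace_ids: Iterable[int]
) -> Iterator[Tuple[str, int]]:
    """Yield (title, namespace_id) for pages in the given namespaces.

    Hypothetical helper: the table and column names are assumptions.
    """
    ids = list(namespace_ids)
    if not ids:
        return
    # One "?" placeholder per id keeps the query parameterized.
    placeholders = ",".join("?" * len(ids))
    query = (
        "SELECT title, namespace_id FROM pages "
        f"WHERE namespace_id IN ({placeholders}) ORDER BY title"
    )
    yield from conn.execute(query, ids)


# Tiny in-memory database standing in for the real page store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT, namespace_id INTEGER)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [("apple", 0), ("Template:en-noun", 10), ("Module:headword", 828)],
)

# 10 and 828 are the MediaWiki namespace ids for Template and Module.
print(list(iter_pages(conn, [10, 828])))
```

Filtering in SQL means pages outside the requested namespaces are never loaded, which is what makes a separate filtering callback unnecessary.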
- Remove `process()` and `reprocess()` methods from the `Wtp` class (dba6397)
The code no longer relies on the Linux fork start method to copy variables. The multiprocessing code has moved to wiktextract, which now handles unpicklable objects such as the SQLite connection and the Lupa runtime.
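Without fork, objects sent to worker processes have to be pickled, and an open `sqlite3.Connection` cannot be. One common way to deal with that, sketched below, is to drop the connection from the pickled state and reopen it lazily in the worker. The `WorkerContext` class is hypothetical and is not how wiktextract necessarily does it.

```python
import pickle
import sqlite3
from typing import Optional


class WorkerContext:
    """Hypothetical context object that can be sent to a spawned worker."""

    def __init__(self, db_path: str) -> None:
        self.db_path = db_path
        self._conn: Optional[sqlite3.Connection] = None  # opened lazily

    @property
    def conn(self) -> sqlite3.Connection:
        # Each process opens its own connection on first use.
        if self._conn is None:
            self._conn = sqlite3.connect(self.db_path)
        return self._conn

    def __getstate__(self) -> dict:
        # Copy the attributes and leave the connection out of the
        # pickled state, because sqlite3.Connection cannot be pickled.
        state = self.__dict__.copy()
        state["_conn"] = None
        return state


if __name__ == "__main__":
    ctx = WorkerContext(":memory:")
    ctx.conn.execute("SELECT 1")  # the parent now holds an open connection

    # Roughly what multiprocessing does when passing ctx to a spawned worker.
    clone = pickle.loads(pickle.dumps(ctx))
    print(clone.conn.execute("SELECT 1").fetchone())
```

With a real database file the clone would reconnect to the same data; with `:memory:` it gets a fresh empty database, which is enough for this demonstration.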
- 05f2ace
This file mostly tests `process_dump()`, so move it to `tests/test_dumppaser.py`. Extracting each page is still tested in wiktextract.
- 4edfb17
Commits on Aug 21, 2023
- Use `lru_cache` on `Wtp.get_page()` (4b3d963)
`get_page()` is fairly slow (about 1 s per call), so caching its results improves performance significantly. This reduces the processing time for the Chinese Wiktionary from 40 minutes to 10 minutes.
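To make the effect of the cache concrete, here is a minimal sketch of `functools.lru_cache` on a page lookup method. `PageStore`, its table layout, and the `db_reads` counter are hypothetical stand-ins, not the real `Wtp` class.

```python
import sqlite3
from functools import lru_cache
from typing import Optional


class PageStore:
    """Hypothetical stand-in for a class that reads pages from SQLite."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE pages (title TEXT PRIMARY KEY, body TEXT)")
        self.conn.execute("INSERT INTO pages VALUES ('Template:foo', 'foo body')")
        self.db_reads = 0  # counts lookups that actually reach the database

    @lru_cache(maxsize=128)  # 128 is the functools default
    def get_page(self, title: str) -> Optional[str]:
        self.db_reads += 1
        row = self.conn.execute(
            "SELECT body FROM pages WHERE title = ?", (title,)
        ).fetchone()
        return row[0] if row else None


store = PageStore()
store.get_page("Template:foo")  # miss: reads the database
store.get_page("Template:foo")  # hit: served from the cache
print(store.db_reads)
print(PageStore.get_page.cache_info())
```

Note that `lru_cache` on a method includes `self` in the cache key and keeps the instance alive for as long as the entry is cached, which is usually acceptable for one long-lived object per process.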
- ff48e09
- Replace deprecated `pkg_resources` with `importlib.resources` (88988e5)
Using `pkg_resources` is not recommended: https://setuptools.pypa.io/en/latest/pkg_resources.html
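For reference, a minimal sketch of reading a packaged file with `importlib.resources.files()` (Python 3.9+). The `read_package_text()` helper is hypothetical, and the standard library `json` package is used only so the example runs anywhere; the project's own resource names are not shown in the commit message.

```python
from importlib.resources import files


def read_package_text(package: str, resource: str) -> str:
    """Read a text resource bundled inside an importable package."""
    return files(package).joinpath(resource).read_text(encoding="utf-8")


# Roughly equivalent to the older
# pkg_resources.resource_string(package, resource).decode("utf-8")
text = read_package_text("json", "__init__.py")
print(len(text) > 0)
```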
- Set `Wtp.get_page()` cache size to 1000 (53b35e1)
The English Wiktionary has 40327 template pages and 54524 module pages (excluding redirects). I think the cache is mostly useful for shared template and module pages used by the pages processed in a worker process, so the right cache size depends on the number of pages, the number of CPU cores, and how many templates are shared. If 1000 is not enough, we can increase it to 10000.
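One way to judge whether a given cache size is enough is to look at the hit ratio reported by `cache_info()`. The sketch below uses a synthetic workload and hypothetical helpers (`make_cached_loader()`, `hit_ratio()`); the numbers say nothing about real Wiktionary data.

```python
from functools import lru_cache
from typing import Callable, List


def make_cached_loader(maxsize: int) -> Callable[[str], str]:
    """Return a cached loader; the body stands in for a slow page lookup."""

    @lru_cache(maxsize=maxsize)
    def load(title: str) -> str:
        return f"body of {title}"

    return load


def hit_ratio(maxsize: int, requests: List[str]) -> float:
    """Replay the requests through a fresh cache and return hits / total."""
    load = make_cached_loader(maxsize)
    for title in requests:
        load(title)
    info = load.cache_info()
    return info.hits / (info.hits + info.misses)


# Synthetic workload: three shared templates requested round-robin.
requests = ["Template:a", "Template:b", "Template:c"] * 100

print(hit_ratio(2, requests))  # cache smaller than the working set
print(hit_ratio(3, requests))  # cache fits the working set
```

When the cache is even slightly smaller than the set of pages that keep recurring, an LRU policy can evict each entry just before it is needed again, so the hit ratio drops sharply rather than gradually. That is the situation in which raising the size would pay off.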
Commits on Aug 22, 2023
- 8feccd3
- a4a0bfc