Unnecessary dependency on FuzzyTM pulls in many libraries #3423
Labels
bug: Issue described a bug
difficulty easy: Easy issue, required small fix
impact HIGH: Show-stopper for affected users
reach HIGH: Affects most or all Gensim users
Problem description
I'm trying to upgrade to the new Gensim 4.3.0 release. My colleague @juhoinkinen noticed in NatLibFi/Annif#660 that Gensim 4.3.0 pulls in more dependencies than the previous release 4.2.0, including pandas. I suspect that at least the FuzzyTM dependency (which in turn pulls in pandas) is actually unused and thus unnecessary.
Steps/code/corpus to reproduce
Installing Gensim 4.2.0 into an empty venv installs only four packages.
Installing Gensim 4.3.0 into an empty venv installs 18 packages.
The size of the venv has grown from 249MB to 318MB, an increase of 69MB.
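The package counts above can be cross-checked from inside each venv with the standard library alone. The following is a minimal sketch, not part of the original report: the helper names (`requirement_name`, `installed_requires`, `transitive_dependencies`) are hypothetical, and it assumes that any requirement whose environment marker mentions `extra` is optional. Other markers are not evaluated, so the result may over-count on some platforms.

```python
import re
from importlib import metadata


def requirement_name(requirement: str) -> str:
    """Return the normalized distribution name at the start of a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    # Normalize the name (lowercase, runs of -, _ and . collapsed to a single -).
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def installed_requires(name: str) -> list[str]:
    """Return the non-optional requirements declared by an installed distribution."""
    try:
        requires = metadata.requires(name) or []
    except metadata.PackageNotFoundError:
        return []
    kept = []
    for req in requires:
        _, _, marker = req.partition(";")
        # Assumption: a marker that mentions "extra" guards an optional dependency.
        if "extra" not in marker:
            kept.append(req)
    return kept


def transitive_dependencies(root: str, get_requires=installed_requires) -> set[str]:
    """Collect every distribution reachable from `root` through declared requirements."""
    root_name = requirement_name(root)
    seen: set[str] = set()
    stack = [root_name]
    while stack:
        current = stack.pop()
        for req in get_requires(current):
            dep = requirement_name(req)
            if dep != root_name and dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen
```

Running `sorted(transitive_dependencies("gensim"))` in the 4.2.0 venv and again in the 4.3.0 venv should show which distributions were added between the two releases.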
Here is what pipdeptree shows: FuzzyTM appears to be the main reason why so many libraries are pulled in.
It appears that the FuzzyTM dependency was added in PR #3398 (Flsamodel) by @ERijck. The first commits in this PR depended on the library, but a subsequent commit 9fec00b reworked the code so that it doesn't need to import FuzzyTM at all. However, the dependency in setup.py was never removed; it's still there: https://github.com/RaRe-Technologies/gensim/blob/f35faae7a7b0c3c8586fb61208560522e37e0e7e/setup.py#L347
I think the FuzzyTM dependency could be safely dropped, as the library is not actually imported. It would reduce the number of libraries Gensim pulls in and thus reduce the size of installations, including Docker images where minimal size is often required.
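The claim that the library is not imported can be checked mechanically. Below is a rough sketch, assuming a local checkout of the Gensim source tree; the function names `imports_module` and `files_importing` are made up for illustration. It only detects static `import` statements, so a dynamic import (for example via `importlib.import_module`) would go unnoticed.

```python
import ast
from pathlib import Path


def imports_module(source: str, module: str) -> bool:
    """Return True if `source` statically imports `module` or one of its submodules."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports are ignored.
            names = [node.module]
        else:
            continue
        if any(n == module or n.startswith(module + ".") for n in names):
            return True
    return False


def files_importing(root: str, module: str) -> list[Path]:
    """List the .py files under `root` that statically import `module`."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            if imports_module(path.read_text(encoding="utf-8"), module):
                hits.append(path)
        except (SyntaxError, UnicodeDecodeError, ValueError):
            # Skip files that cannot be decoded or parsed as Python.
            continue
    return hits
```

Calling `files_importing("gensim", "FuzzyTM")` from the repository root would be expected to return an empty list if the dependency is indeed unused.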
Versions
I'm using Ubuntu Linux 22.04.
Linux-5.15.0-56-generic-x86_64-with-glibc2.35
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
Bits 64
NumPy 1.24.1
SciPy 1.10.0
gensim 4.3.0
FAST_VERSION 0