Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyOpenSci REVIEW - minor updates #43

Merged
merged 10 commits into from
Jan 25, 2023
Merged

Conversation

Robaina
Copy link
Owner

@Robaina Robaina commented Jan 23, 2023

This PR covers the following comments from reviewer (@Batalex) in PyOpenSci:

pynteny/init.py

  • import * are usually frowned upon in Python, but IMO this is ok in this case

I agree, but, anyway I modified init to explicitly import all the classes, we don't want to be frowned upon.

pynteny/api.py

  • importlib.metadata is only available in the standard library from 3.8 and onwards (3.7 is technically allowed in 0.0.5). This is partially fixed in the main branch. For good measure, I would increase the minimum Python version in the pyproject.toml file as well, even if poetry is only used as a build backend.

Thanks, already taken care of. Now python >= 3.8.

  • The author could use a simple type hint in the Command class to avoid a linter warning in the update method:
class Command():
    """Parent class for Pynteny command
    """
    _args: CommandArgs
    

Great, thanks for catching this!

  • Empty __init__ methods are not necessary, the smaller the code base the better!

Right! Modified.

  • There are really nice uses of the package’s metadata in _repr_html_ and cite

:)

  • I am pleased to see that arguments with None default value are correctly typed as optional

:)

  • There is no need to wrap Path(__file__).parent in another Path() in the getTestDataPath method since it returns a Path object as well. This comment applies to utils.py as well.

Right, thanks for catching this.

pynteny/cli.py

  • args is not used in the Pynteny.app method, so there is no need to parse it.

Thanks. It's there so one can run "pynteny app --help" from the command line. I know this is not necessary. I left it there for consistency with the other subcommands. Perhaps in future versions "pyteny app" could have proper arguments, e.g. select port.

  • I believe getHelpStr could be simpler by writing to an io.StringIO instead of a temp file.

Thanks. I'm leaving it as it is for now but will consider this suggestion in future settings.

  • f-strings without placeholder are considered code smells

Right, removed, not sure why they were there in the first place.

  • Overall, this is a really nice example of a cli with subcommands, congrats!

Thank you!

  • I saw a few unused required and pynteny assignment

Right, subcomand.download does not have required arguments. I removed it. As of the ìnstance pynteny in cli.main, is there to avoid printing of the class descriptor to the command line.

pynteny/filter.py

  • typo L40 specified

Ok

  • As far as I can tell, the hmm_order_dict is not needed in SyntenyPatternFilters init method.

Right, removed.

  • The lambda expr + map is quite difficult to read. A simpler alternative could be:
strand_map = dict(neg=-1, pos=1) # could be set as a module-wide constant
self.strand_order_pattern = [strand_map.get(strand, 0) for strand in parsed_structure["strands"]]

While I appreciate the suggestion, I don't feel like it would improve the readability of the code in this case. With your suggestion, the reader would have to jump to the beginning of the file to understand that "-1" means "neg" and "1", "pos". I personally think that the lambda expression is closer to how one would describe the functionality in plain English.

  • contains_hmm_pattern, contains_distance_pattern, and contains_strand_pattern could use Python’s idioms and return boolean values instead of integers, this is not an issue with their use in pandas rolling method. Boolean values can be aggregated numerically in pandas. This would simplify those methods. In contains_strand_pattern, there is a True equality comparison.

Thanks for the suggestion. I have removed the equality to True as it is not necessary.

Alright, changed.

  • The author could use the size method instead of a filter with a lambda in getAllHMMhits, but this is a matter of preference.

Thanks for the suggestion. I'll leave the filter for now. I feel it is somewhat more idiomatic.

  • There again, preferences, but the loop in filterHitsBySyntenyStructure could be made on a pandas’ groupby object. This is not a prominent feature, but one may do the following:
for group_key, group_df in df.groupby(key):
    pass

Awesome! Will leave it as it is (lack of time) but will definitely consider it in future occurrences.

  • the form len(contig_hits.hmm.unique()) could be simplified to contig_hits.hmm.nunique()

Haha did not know this method existed. Convenient. Thanks

  • Using iterrows is quite inefficient if one only need to loop over the index. The author may use for i in matched_rows.index

Right, changed, thanks!

  • Idiomatic Python does not usually use getter methods such as getSyntenyHits, a public attribute would work just fine.

Thanks for the suggestion, but if I understand correctly, then the value of the attribute could be modified outside the method, and I want to prevent that. Since getter methods are not pythonic, I'll rename it "hits" and decorate it with @Property.

  • I have not profiled the code to see if there is a performance bottleneck there, but I would use itertuples instead of iterrows in addHMMmetaInfoToHits. itertuples is much faster if you need to iterate over data frames one row at a time

Thanks for the suggestion. Definitely will try it out. It's a pitty that the code gets uglier this way. Don't consider myself a python expert, but it seems to me that the python community is having some internal conflicts between efficiency and code readability (pythonicity).

  • I suggest using isinstance rather than type equality in filterFASTAbySyntenyStructure. isinstance is a more robust (and marginally faster) solution because it allows nominal subtyping: any subtype of str or list would pass the test conditions. Using isinstance is therefore considered more pythonic than type equality.

Thanks, changed!

pynteny/hmm.py

  • Nice use of properties
  • Since the author is using pathlib, they might use / or Path.joinpath instead of os.path.join

Makes sense, changed!

  • The author may use the usecols argument for pd.read_csv

Thanks, changed.

  • Why is thetry-except block needed in getHMMnamesByGeneSymbol? (+ bare except is considered a code smell). The code would work without this. The same goes for getHMMgeneID, with a little twist to return None, and for getMetaInfoForHMM

Thanks, changed.

pynteny/parser.py

  • There are too many statements in the try-except block in LabelParser.parse. I do not know what could be failing (a split, an int cast?), and the exception caught is too broad. Tip: I suggest extracting meta.split("_") in a variable since it appears in four lines in a row

Alright. I removed try-except and added a logger.error to check label format.

  • Typing tuples is quite unintuitive: tuple[str] means “a tuple of exactly one str.” This means that splitStrandFromLocus is incorrectly typed

I see. Thanks for catching this. I assumed that tuple styling would be similar to list. But I see now that's not the case. I'll change it to tuple[str, ...] since this appears to be the equivalent of list[str].

pynteny/preprocessing.py

  • In FASTA class, we could replace the getter/setter with a property, which is more “pythonic”. Moreover, we could also make _input_file_str a read-only property.

Created a property and removed _input_file_str, it's just fine to call .as_posixwhen required. Thanks.

  • number_prodigal_record_fields could be a global constant

I could make it a global constant. However, defining the constant at the beginning of the module implies that the reader would have to go up to see its value and then down again to the function. Since this constant is only used by one class method, I think that making it global is not necessary.

That was the original idea, but now Pynteny targets python >= 3.8. That's a pity because parenthesized context managers look cool.

pynteny/subcommands.py

  • There is a piece of code duplicated four times. The following could be extracted in its own function.
if args.logfile is None:
    args.logfile = Path(os.devnull)
elif not Path(args.logfile.parent).exists():
    Path(args.logfile.parent).mkdir(parents=True)
logging.basicConfig(format='%(asctime)s | %(levelname)s: %(message)s',
                    handlers=[
                        logging.FileHandler(args.logfile),
                        logging.StreamHandler(sys.stdout)
                        ],
                    level=logging.NOTSET)
logger = logging.getLogger(__name__)

In addition, this would limit the complexity of synteny_search

That makes sense. I have converted this block of code into a function.

@Robaina Robaina added enhancement New feature or request refactor refactor code labels Jan 23, 2023
@Robaina Robaina self-assigned this Jan 23, 2023
@Robaina Robaina linked an issue Jan 23, 2023 that may be closed by this pull request
@Robaina Robaina closed this Jan 25, 2023
@Robaina Robaina reopened this Jan 25, 2023
@codecov
Copy link

codecov bot commented Jan 25, 2023

Codecov Report

Base: 84.58% // Head: 85.19% // Increases project coverage by +0.61% 🎉

Coverage data is based on head (a4ed45d) compared to base (ffddb24).
Patch coverage: 96.29% of modified lines in pull request are covered.

❗ Current head a4ed45d differs from pull request most recent head 8af7667. Consider uploading reports for the commit 8af7667 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #43      +/-   ##
==========================================
+ Coverage   84.58%   85.19%   +0.61%     
==========================================
  Files          11       11              
  Lines         837      831       -6     
==========================================
  Hits          708      708              
+ Misses        129      123       -6     
Impacted Files Coverage Δ
pynteny/api.py 75.67% <ø> (-0.65%) ⬇️
pynteny/parser.py 88.79% <86.66%> (+0.09%) ⬆️
pynteny/__init__.py 100.00% <100.00%> (ø)
pynteny/filter.py 77.45% <100.00%> (+0.22%) ⬆️
pynteny/hmm.py 80.64% <100.00%> (+4.40%) ⬆️
pynteny/preprocessing.py 84.54% <100.00%> (ø)
tests/test_database_build.py 100.00% <100.00%> (ø)
tests/test_hmm.py 97.22% <100.00%> (ø)
tests/test_integration_search.py 100.00% <100.00%> (ø)
tests/test_parser.py 96.66% <100.00%> (ø)
... and 1 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@Robaina Robaina added code review and removed enhancement New feature or request labels Jan 25, 2023
@Robaina Robaina changed the title minor updates PyOpenSci REVIEW - minor updates Jan 25, 2023
@Robaina Robaina closed this Jan 25, 2023
@Robaina Robaina reopened this Jan 25, 2023
@Robaina Robaina marked this pull request as ready for review January 25, 2023 17:39
@Robaina Robaina merged commit 7eba519 into main Jan 25, 2023
@Robaina Robaina deleted the 42-pyopensci-review-minor-changes branch January 25, 2023 17:40
@Batalex
Copy link
Contributor

Batalex commented Jan 26, 2023

Wow, you have been busy!
It's too bad the PRs are already merged since any additional changes will result in other PRs.

Just a quick note concerning itertuples:

Thanks for the suggestion. Definitely will try it out. It's a pity that the code gets uglier this way

Each row is a namedtuple, not a tuple. This means you can access the elements by their name, not just by their position.
I will add a comment in your changes so you may see

Comment on lines +401 to +402
i = getattr(row, "Index")
hmm_group = getattr(row, "hmm")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be changed to

row.Index
row.hmm

Not so ugly anymore, right? 😸

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see! that pleases me greatly 😄 I really did not want to use getattr.

@Robaina
Copy link
Owner Author

Robaina commented Jan 26, 2023

Hi @Batalex ,

sorry, I got carried away with the merging yesterday. Actually, this is the first time I partake in a code review like this. What would you say is the best course of action to discuss the changes? i.e., add you as a collaborator and request
review of PRs, open a discussion?

Thanks again for all your suggestions. They have been very useful!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PyOpenSci REVIEW - Minor changes
2 participants