PyOpenSci REVIEW - minor updates #43

Robaina · 2023-01-23T17:13:00Z

This PR covers the following comments from reviewer (@Batalex) in PyOpenSci:

pynteny/init.py

import * are usually frowned upon in Python, but IMO this is ok in this case

I agree, but, anyway I modified init to explicitly import all the classes, we don't want to be frowned upon.

pynteny/api.py

importlib.metadata is only available in the standard library from 3.8 and onwards (3.7 is technically allowed in 0.0.5). This is partially fixed in the main branch. For good measure, I would increase the minimum Python version in the pyproject.toml file as well, even if poetry is only used as a build backend.

Thanks, already taken care of. Now python >= 3.8.

The author could use a simple type hint in the Command class to avoid a linter warning in the update method:
class Command():
    """Parent class for Pynteny command
    """
    _args: CommandArgs
    

Great, thanks for catching this!

Empty __init__ methods are not necessary, the smaller the code base the better!

Right! Modified.

There are really nice uses of the package’s metadata in _repr_html_ and cite

:)

I am pleased to see that arguments with None default value are correctly typed as optional

:)

There is no need to wrap Path(__file__).parent in another Path() in the getTestDataPath method since it returns a Path object as well. This comment applies to utils.py as well.

Right, thanks for catching this.

pynteny/cli.py

args is not used in the Pynteny.app method, so there is no need to parse it.

Thanks. It's there so one can run "pynteny app --help" from the command line. I know this is not necessary. I left it there for consistency with the other subcommands. Perhaps in future versions "pyteny app" could have proper arguments, e.g. select port.

I believe getHelpStr could be simpler by writing to an io.StringIO instead of a temp file.

Thanks. I'm leaving it as it is for now but will consider this suggestion in future settings.

f-strings without placeholder are considered code smells

Right, removed, not sure why they were there in the first place.

Overall, this is a really nice example of a cli with subcommands, congrats!

Thank you!

I saw a few unused required and pynteny assignment

Right, subcomand.download does not have required arguments. I removed it. As of the ìnstance pynteny in cli.main, is there to avoid printing of the class descriptor to the command line.

pynteny/filter.py

typo L40 specified

Ok

As far as I can tell, the hmm_order_dict is not needed in SyntenyPatternFilters init method.

Right, removed.

The lambda expr + map is quite difficult to read. A simpler alternative could be:

strand_map = dict(neg=-1, pos=1) # could be set as a module-wide constant
self.strand_order_pattern = [strand_map.get(strand, 0) for strand in parsed_structure["strands"]]

While I appreciate the suggestion, I don't feel like it would improve the readability of the code in this case. With your suggestion, the reader would have to jump to the beginning of the file to understand that "-1" means "neg" and "1", "pos". I personally think that the lambda expression is closer to how one would describe the functionality in plain English.

contains_hmm_pattern, contains_distance_pattern, and contains_strand_pattern could use Python’s idioms and return boolean values instead of integers, this is not an issue with their use in pandas rolling method. Boolean values can be aggregated numerically in pandas. This would simplify those methods. In contains_strand_pattern, there is a True equality comparison.

Thanks for the suggestion. I have removed the equality to True as it is not necessary.

Using inplace=True in pandas operation is not recommended and could be deprecated: API/DEPR: Deprecate inplace parameter pandas-dev/pandas#16529.

Alright, changed.

The author could use the size method instead of a filter with a lambda in getAllHMMhits, but this is a matter of preference.

Thanks for the suggestion. I'll leave the filter for now. I feel it is somewhat more idiomatic.

There again, preferences, but the loop in filterHitsBySyntenyStructure could be made on a pandas’ groupby object. This is not a prominent feature, but one may do the following:
for group_key, group_df in df.groupby(key):
    pass

Awesome! Will leave it as it is (lack of time) but will definitely consider it in future occurrences.

the form len(contig_hits.hmm.unique()) could be simplified to contig_hits.hmm.nunique()

Haha did not know this method existed. Convenient. Thanks

Using iterrows is quite inefficient if one only need to loop over the index. The author may use for i in matched_rows.index

Right, changed, thanks!

Idiomatic Python does not usually use getter methods such as getSyntenyHits, a public attribute would work just fine.

Thanks for the suggestion, but if I understand correctly, then the value of the attribute could be modified outside the method, and I want to prevent that. Since getter methods are not pythonic, I'll rename it "hits" and decorate it with @Property.

I have not profiled the code to see if there is a performance bottleneck there, but I would use itertuples instead of iterrows in addHMMmetaInfoToHits. itertuples is much faster if you need to iterate over data frames one row at a time

Thanks for the suggestion. Definitely will try it out. It's a pitty that the code gets uglier this way. Don't consider myself a python expert, but it seems to me that the python community is having some internal conflicts between efficiency and code readability (pythonicity).

I suggest using isinstance rather than type equality in filterFASTAbySyntenyStructure. isinstance is a more robust (and marginally faster) solution because it allows nominal subtyping: any subtype of str or list would pass the test conditions. Using isinstance is therefore considered more pythonic than type equality.

Thanks, changed!

pynteny/hmm.py

Nice use of properties

Since the author is using pathlib, they might use / or Path.joinpath instead of os.path.join

Makes sense, changed!

The author may use the usecols argument for pd.read_csv

Thanks, changed.

Why is thetry-except block needed in getHMMnamesByGeneSymbol? (+ bare except is considered a code smell). The code would work without this. The same goes for getHMMgeneID, with a little twist to return None, and for getMetaInfoForHMM

Thanks, changed.

pynteny/parser.py

There are too many statements in the try-except block in LabelParser.parse. I do not know what could be failing (a split, an int cast?), and the exception caught is too broad. Tip: I suggest extracting meta.split("_") in a variable since it appears in four lines in a row

Alright. I removed try-except and added a logger.error to check label format.

Typing tuples is quite unintuitive: tuple[str] means “a tuple of exactly one str.” This means that splitStrandFromLocus is incorrectly typed

I see. Thanks for catching this. I assumed that tuple styling would be similar to list. But I see now that's not the case. I'll change it to tuple[str, ...] since this appears to be the equivalent of list[str].

pynteny/preprocessing.py

In FASTA class, we could replace the getter/setter with a property, which is more “pythonic”. Moreover, we could also make _input_file_str a read-only property.

Created a property and removed _input_file_str, it's just fine to call .as_posixwhen required. Thanks.

number_prodigal_record_fields could be a global constant

I could make it a global constant. However, defining the constant at the beginning of the module implies that the reader would have to go up to see its value and then down again to the function. Since this constant is only used by one class method, I think that making it global is not necessary.

If pynteny targets python ≥3.10 as the main branch suggests, then it may use https://docs.python.org/3/whatsnew/3.10.html#parenthesized-context-managers in GeneAnnotator

That was the original idea, but now Pynteny targets python >= 3.8. That's a pity because parenthesized context managers look cool.

pynteny/subcommands.py

There is a piece of code duplicated four times. The following could be extracted in its own function.

if args.logfile is None:
    args.logfile = Path(os.devnull)
elif not Path(args.logfile.parent).exists():
    Path(args.logfile.parent).mkdir(parents=True)
logging.basicConfig(format='%(asctime)s | %(levelname)s: %(message)s',
                    handlers=[
                        logging.FileHandler(args.logfile),
                        logging.StreamHandler(sys.stdout)
                        ],
                    level=logging.NOTSET)
logger = logging.getLogger(__name__)

In addition, this would limit the complexity of synteny_search

That makes sense. I have converted this block of code into a function.

codecov · 2023-01-25T09:18:50Z

Codecov Report

Base: 84.58% // Head: 85.19% // Increases project coverage by +0.61% 🎉

Coverage data is based on head (a4ed45d) compared to base (ffddb24).
Patch coverage: 96.29% of modified lines in pull request are covered.

❗ Current head a4ed45d differs from pull request most recent head 8af7667. Consider uploading reports for the commit 8af7667 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #43      +/-   ##
==========================================
+ Coverage   84.58%   85.19%   +0.61%     
==========================================
  Files          11       11              
  Lines         837      831       -6     
==========================================
  Hits          708      708              
+ Misses        129      123       -6

Impacted Files	Coverage Δ
pynteny/api.py	`75.67% <ø> (-0.65%)`	⬇️
pynteny/parser.py	`88.79% <86.66%> (+0.09%)`	⬆️
pynteny/__init__.py	`100.00% <100.00%> (ø)`
pynteny/filter.py	`77.45% <100.00%> (+0.22%)`	⬆️
pynteny/hmm.py	`80.64% <100.00%> (+4.40%)`	⬆️
pynteny/preprocessing.py	`84.54% <100.00%> (ø)`
tests/test_database_build.py	`100.00% <100.00%> (ø)`
tests/test_hmm.py	`97.22% <100.00%> (ø)`
tests/test_integration_search.py	`100.00% <100.00%> (ø)`
tests/test_parser.py	`96.66% <100.00%> (ø)`
... and 1 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Batalex · 2023-01-26T09:07:17Z

Wow, you have been busy!
It's too bad the PRs are already merged since any additional changes will result in other PRs.

Just a quick note concerning itertuples:

Thanks for the suggestion. Definitely will try it out. It's a pity that the code gets uglier this way

Each row is a namedtuple, not a tuple. This means you can access the elements by their name, not just by their position.
I will add a comment in your changes so you may see

Batalex · 2023-01-26T09:08:24Z

pynteny/filter.py

+            i = getattr(row, "Index")
+            hmm_group = getattr(row, "hmm")


This can be changed to

row.Index row.hmm

Not so ugly anymore, right? 😸

I see! that pleases me greatly 😄 I really did not want to use getattr.

Robaina · 2023-01-26T11:23:23Z

Hi @Batalex ,

sorry, I got carried away with the merging yesterday. Actually, this is the first time I partake in a code review like this. What would you say is the best course of action to discuss the changes? i.e., add you as a collaborator and request
review of PRs, open a discussion?

Thanks again for all your suggestions. They have been very useful!

minor updates

88aee79

Robaina added enhancement New feature or request refactor refactor code labels Jan 23, 2023

Robaina self-assigned this Jan 23, 2023

Robaina linked an issue Jan 23, 2023 that may be closed by this pull request

PyOpenSci REVIEW - Minor changes #42

Closed

Robaina added 3 commits January 24, 2023 12:22

some updates

bffcc29

parser review

406c07e

getter as property

a4ed45d

Robaina closed this Jan 25, 2023

Robaina reopened this Jan 25, 2023

Robaina added code review and removed enhancement New feature or request labels Jan 25, 2023

Robaina changed the title ~~minor updates~~ PyOpenSci REVIEW - minor updates Jan 25, 2023

Robaina added 6 commits January 25, 2023 16:24

extrac nested in LabelledFASTA

79b0849

extract nested functions in LabelledFasta

8427d87

update

fa06c38

use only seqkit to remove duplicates

88145a3

update

ffb7aa0

update logger init

8af7667

Robaina closed this Jan 25, 2023

Robaina reopened this Jan 25, 2023

Robaina marked this pull request as ready for review January 25, 2023 17:39

Robaina merged commit 7eba519 into main Jan 25, 2023

Robaina deleted the 42-pyopensci-review-minor-changes branch January 25, 2023 17:40

Batalex reviewed Jan 26, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PyOpenSci REVIEW - minor updates #43

PyOpenSci REVIEW - minor updates #43

Robaina commented Jan 23, 2023 •

edited

Loading

codecov bot commented Jan 25, 2023 •

edited

Loading

Batalex commented Jan 26, 2023

Batalex Jan 26, 2023

Robaina Jan 26, 2023

Robaina commented Jan 26, 2023

PyOpenSci REVIEW - minor updates #43

PyOpenSci REVIEW - minor updates #43

Conversation

Robaina commented Jan 23, 2023 • edited Loading

pynteny/init.py

pynteny/api.py

pynteny/cli.py

pynteny/filter.py

pynteny/hmm.py

pynteny/parser.py

pynteny/preprocessing.py

pynteny/subcommands.py

codecov bot commented Jan 25, 2023 • edited Loading

Codecov Report

Batalex commented Jan 26, 2023

Batalex Jan 26, 2023

Choose a reason for hiding this comment

Robaina Jan 26, 2023

Choose a reason for hiding this comment

Robaina commented Jan 26, 2023

Robaina commented Jan 23, 2023 •

edited

Loading

codecov bot commented Jan 25, 2023 •

edited

Loading