Comparing and contrasting the Python and R books #3

Open
DamienIrving opened this issue Jun 29, 2021 · 8 comments

Comments

@DamienIrving

DamienIrving commented Jun 29, 2021

For the Python and R RSE books we essentially had to document the logical progression of steps involved in data processing / package development in a generic, teachable manner. One of the main things we need to do in the talk (I think) is succinctly describe, compare and contrast that logical progression in Python and R.

I've had a go at doing the first part (succinctly describe) below - my naive attempt at the R description is missing a bunch of steps and I don't know anything about many of the tools mentioned, so if an R person could provide a more complete description that would be great.

Once we've got a succinct description for Python and R, I'd love to hear people's thoughts on the most noteworthy similarities and differences between the Python and R approaches.

Task

Word count analysis/package to confirm Zipf's Law.

Python

  • Adopt a directory structure consistent with Python packaging.
  • Conduct basic file management at the command line.
  • Prototype code with the Jupyter notebook or an IDE.
  • Write code in a modular, reusable manner using functions (that can be stored in and imported from modules).
  • Incorporate working code into a series of command line scripts/programs.
  • Version the scripts using Git at the command line.
  • Collaboratively develop the scripts using GitHub.
  • Combine the scripts into an automated data processing workflow using Make.
  • Configure the scripts with command line arguments and YAML files.
  • Test the code using assertions and unit tests (pytest).
  • Automatically run the tests using Travis CI (future update: GitHub Actions).
  • Handle program reporting and errors with logging and exceptions.
  • Capture provenance by archiving the scripts, conda environment and Makefile/s on Zenodo.
  • Build and distribute the package using pip and PyPI.
  • Document the package using docstrings, sphinx and ReadTheDocs.
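
To make the function and testing steps in the list above concrete, here is a minimal sketch in Python. The function names `count_words` and `rank_times_frequency` are hypothetical (they are not taken from the book) and the sample text is made up. The check relies on the usual statement of Zipf's Law, that a word's frequency is roughly proportional to 1/rank, so rank times frequency should be roughly constant.

```python
from collections import Counter
import re


def count_words(text):
    """Return a Counter of the lower-cased words in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def rank_times_frequency(counts):
    """Return rank * frequency for each word, most common first.

    Under Zipf's Law these products should be roughly constant.
    """
    ranked = counts.most_common()
    return [rank * freq for rank, (_, freq) in enumerate(ranked, start=1)]


def test_count_words():
    """A pytest-style unit test built on plain assertions."""
    counts = count_words("The cat and the hat. The end!")
    assert counts["the"] == 3
    assert counts["cat"] == 1


if __name__ == "__main__":
    sample = "a a a a a a b b b c c d"  # made-up text, not real data
    print(rank_times_frequency(count_words(sample)))
```

Saved in a module, functions like these could be imported by a command line script and the test picked up by pytest.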

R

  • Adopt a directory structure consistent with R packaging.
  • Conduct basic file management with fs.
  • Prototype code in RStudio.
  • Write code in a modular, reusable manner using functions.
  • Version the R package using Git via RStudio integration or the RStudio terminal.
  • Collaboratively develop the package using GitHub.
  • Test the code using expectations and unit tests (testthat).
  • Automatically run the tests using GitHub Actions.
  • Build and distribute the package using devtools and R-hub.
  • Document the package using vignettes, roxygen2 and pkgdown.
@gvwilson gvwilson self-assigned this Jun 29, 2021
@DamienIrving
Author

cc'ing @k8hertweck, @cwickham, @joelostblom and/or @lwjohnst86 to look over and improve the dot points summarising the R approach to a word count analysis/package to confirm Zipf's Law (see above for my naive first attempt).

@DamienIrving
Author

DamienIrving commented Jul 1, 2021

An alternative way to look at this information...

| Task | Python | R |
| --- | --- | --- |
| Adopt a directory structure consistent with ... | Python packaging | R packaging |
| Conduct basic file management ... | at the command line | with fs |
| Prototype code with ... | Jupyter notebook or an IDE | RStudio |
| Write code in a modular, reusable manner using ... | functions (that can be stored in .py module files) | functions (stored in .R files) |
| Version code using ... | Git at the command line | Git via RStudio integration and gert/usethis functions |
| Collaboratively develop code using ... | GitHub | GitHub and usethis PR helpers |
| Test code using ... | assertions and unit tests (pytest) | expectations and unit tests (testthat) |
| Automatically run tests using ... | Travis CI (future: GitHub Actions) | GitHub Actions |
| Handle program reporting and errors with ... | logging and exceptions | logging and stop function for errors |
| Automate data processing workflows using ... | command line scripts and Make | scripts and targets package |
| Configure workflows using ... | function arguments, command line arguments and YAML files | function arguments |
| Capture workflow provenance by archiving ... | scripts, conda environment and Makefile/s on Zenodo | Zenodo, GitHub tags/releases |
| Generate reproducible documents with ... | Pweave, jupyter book | rmarkdown |
| Build and distribute a package using ... | pip and PyPI | devtools and R-hub, CRAN |
| Document a package and make a website using ... | docstrings, sphinx and ReadTheDocs | vignettes, roxygen2 and pkgdown |
| Create and distribute a data package by ... | datapackage library to create, store at Figshare/Zenodo/wherever | storing and exporting the data object in an R package |

Italics indicate things that aren't covered in detail in our books but are included in the table for completeness.
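
As an illustration of the Python column entries for "logging and exceptions" and "command line arguments", here is a minimal sketch using only the standard library. The logger name `wordcount`, the argument names and the helper functions are all hypothetical; this is not the script from the book.

```python
import argparse
import logging

logger = logging.getLogger("wordcount")  # hypothetical logger name


def read_text(path):
    """Read a text file, logging progress and re-raising on failure."""
    logger.debug("Reading %s", path)
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError as error:
        # Report the problem, then let the exception propagate to the caller.
        logger.error("Could not read %s: %s", path, error)
        raise


def parse_args(argv=None):
    """Parse command line arguments (argv=None means use sys.argv)."""
    parser = argparse.ArgumentParser(description="Count words in a file.")
    parser.add_argument("infile", help="path to the input text file")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="report debug-level messages")
    return parser.parse_args(argv)


def main(argv=None):
    """Configure logging from the arguments, then print a word count."""
    args = parse_args(argv)
    level = logging.DEBUG if args.verbose else logging.WARNING
    logging.basicConfig(level=level)
    text = read_text(args.infile)
    print(len(text.split()))
```

In a real script, `main()` would typically be called under an `if __name__ == "__main__":` guard so that the file can be both imported and run from the command line.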

@k8hertweck
Contributor

This is so useful! Thanks @DamienIrving !

I edited the table above for the Git in R section, as that's a section I've worked on and could update off the top of my head. The outline of the R book is available here for future reference.

The ??? in the R section have me thinking about how the overall purposes of the books are slightly different. I think this is because there are already excellent resources for R package development, but the gap in the R community is specifically package development for data analysis. This means the background knowledge we expect folks to have when starting the book is slightly different, and as a result, the order of topics and specific emphasis differ fairly substantially.

I'm thinking the table showing equivalency of tasks and which tools are used for each is really useful. For topics we're not covering in R, though, it makes sense to talk about the differences in philosophy of creating packages in each language. For example, the use of reproducible reports in R vs. workflow provenance in Python.

Other folks may have more ideas to add here!

@DamienIrving
Author

Thanks, @k8hertweck.

I should have mentioned - don't worry about the order of the tasks in the table. I'm not trying to make the order match the books.

You make a good point about the fact that some of the tasks in the table aren't a focus for the R book. While most of the content of the table should obviously be things that we cover in the books, for completeness I don't think it's a problem if some of the things aren't in the book. I've edited the table so that things that aren't in the book are indicated in italics (e.g. we don't spend time in the Python book discussing the features of the Jupyter notebook or IDEs for prototyping code, but for completeness I mention that in the table).

Are there widely used/accepted tools that R people use for data processing workflow management and coordination (i.e. execution order, logging, configuration) that could be listed in italics even if that topic isn't covered in the book?

I'm also happy for more tasks to be added to the table (e.g. there might be some topics covered in the R book but not Python). Perhaps "automate report generation using..." is a task we should include?

@lwjohnst86

Sorry for the late addition to this, June has been super busy for me. @DamienIrving really really nice table, super useful! I've made some edits to it to fill in some of the spaces.

@DamienIrving
Author

DamienIrving commented Jul 1, 2021

Thanks, all. The table is looking fairly complete, so it's time to consider what it says about the similarities and differences between how R people and Python people do research software engineering (in a generic, best practice sense, as defined by us). Here are some initial thoughts:

When you break things down into the core tasks associated with data processing and software package creation (i.e. research software engineering), it becomes clear that both Python and R can do the job (i.e. in both cases the tools exist to do the tasks). Having said that, the subtle differences between those tools and the tasks/tools we chose not to include in our books speak to some interesting differences:

  • The typical R experience is more self-contained (e.g. accessing the command line via fs, RStudio Git integration, usethis PR helpers, incorporating data packages into actual R packages)
  • Reproducible documents are a big thing in R but not so much in Python
  • Data packages are a big thing in R but not so much in Python
  • Sophisticated workflow management (e.g. task scheduling, logging, overlay configuration) is more of a thing in Python
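
For readers unfamiliar with the term, "overlay configuration" in the last point means layering settings so that more specific sources override more general ones. Here is a minimal sketch in Python using the standard library's `collections.ChainMap`. The setting names and values are hypothetical, and the file settings are written as a plain dictionary where a real workflow might load them from a YAML file.

```python
from collections import ChainMap

# Hypothetical defaults for a word count script; none come from the book.
DEFAULTS = {"min_length": 1, "plot_style": "default", "verbose": False}


def build_config(file_settings=None, cli_settings=None):
    """Overlay command line settings on file settings on defaults."""
    # Drop command line options that were not supplied (left as None),
    # so they do not hide values from the lower layers.
    cli = {k: v for k, v in (cli_settings or {}).items() if v is not None}
    # ChainMap looks keys up in the first mapping that contains them.
    return ChainMap(cli, file_settings or {}, DEFAULTS)


config = build_config(
    file_settings={"plot_style": "ggplot", "min_length": 3},
    cli_settings={"min_length": 5, "verbose": None},
)
print(dict(config))
```

Here `min_length` comes from the command line layer, `plot_style` from the file layer and `verbose` from the defaults.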

Side question: Are the Python and R experiences converging? If you do all your data processing in the Jupyter notebook (which I'm sure is true for a growing number of Python users) then things become more self-contained. Things like make and command line arguments for configuration aren't used and logging becomes less of a need, since you're seeing the output from each command as you execute it. Jupyter book could also make reproducible documents much more popular with Python people.

(I'm certainly not advocating for a move to Jupyter for more than code prototyping - it actually concerns me greatly and is one of the reasons why I think our Python book is important.)

@mbonsma

mbonsma commented Jul 6, 2021

I love the table and I agree with your summary of differences @DamienIrving.

As someone who uses Jupyter for almost all analysis IRL, I would agree that the Python and R experiences are converging. The R experience seems intentional and well-supported, while in Python the convergence seems like a coincidence and/or a response to the R ecosystem.

@joelostblom

joelostblom commented Jul 7, 2021

Sorry this became long...

Agree with what others have said, super useful comparison table, thanks @DamienIrving! The only major point that I don't agree with is that reproducible documents are not a big thing in Python. I believe both Jupyter notebooks and R Markdown/Notebooks encourage creation of reproducible reports with commentary and outputs in one place, and that they both have widespread uptake in their communities, to the point that they have changed how most people conduct programmatic data analysis.

In terms of the emphasis on reproducibility in the notebook interface, I think the main reproducibility advantage for R Notebooks is that they have to be run in order when knitting to another format such as html, whereas Jupyter notebooks can be exported with out-of-order execution. Since R Notebooks don't store output, the most common way to share the results of your analysis is to make sure it runs from top to bottom, which encourages this behavior to a higher degree than in Jupyter notebooks (especially since ipynb is rendered on GitHub). If you are not knitting, however, but using the R Notebook interface in RStudio to view output, it is still fully possible to run your cells in whatever order you want, save the notebook Rmd file, and then have it not work when you open it in a new session and try running it from top to bottom (please correct me if there is some preventative mechanism for this that I have missed).

Although Jupyter notebooks can be exported out of order, they recently added a visual indicator when cells are edited after they are run, and there are packages that keep track of the execution order of cells as well as the definition order of variables (which I think would be one of the best solutions if it could be integrated into the core notebook interfaces in both languages in a non-intrusive way). A simpler mechanism that I think would be great is a brief warning or visual indicator in notebooks that encourages users to run all cells in order before quitting a session.
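
As a rough sketch of what such a check could look like, the Python below inspects the `execution_count` that each code cell records in an .ipynb file (nbformat 4 JSON). The function names are hypothetical, and this is not how any of the packages mentioned above work. It only detects counts that are not strictly increasing, so it would miss cells that were edited after running or cells that were run and then deleted.

```python
import json


def executed_in_order(notebook):
    """Return True if the executed code cells ran from top to bottom.

    `notebook` is the parsed JSON of an .ipynb file, in which each code
    cell has an `execution_count` (None if the cell was never run).
    """
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
    executed = [count for count in counts if count is not None]
    # Each executed cell must have a higher count than the one before it.
    return all(a < b for a, b in zip(executed, executed[1:]))


def check_file(path):
    """Load an .ipynb file from disk and check its execution order."""
    with open(path, encoding="utf-8") as handle:
        return executed_in_order(json.load(handle))
```

A check like this could run before committing a notebook, or as part of an automated test suite.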

I agree that the languages are converging on many solutions, and I believe it would be possible to write a Python book that has more of the same angle as the R book or vice versa. So while I think it is important to contrast the differences between the book strategies and motivate our choices, I think reproducible documents could be part of a project in either language depending on the nature of the project, while makefiles, packages, or other similar mechanisms are always beneficial to include once the project reaches a certain size.

In the table, I believe Jupyter notebooks are a more suitable equivalent to R Markdown/R Notebooks, and that JupyterBook is more like bookdown/distill in R, which I see as a tool for documentation and multi-page reports rather than a single reproducible analysis document (but I agree that these are more reproducible since the only way they work is if everything runs from top to bottom). Also, I think Pweave is not maintained anymore; using Jupytext with notebooks is a much more popular alternative for most of that functionality.

@gvwilson gvwilson removed their assignment Mar 25, 2022