
Refactoring #37

Open · 5 of 14 tasks
gmaze opened this issue Jul 9, 2020 · 4 comments
Labels
enhancement New feature or request priority-low Low priority for next release refactoring improve the design, structure, and/or implementation

Comments

@gmaze (Member) commented Jul 9, 2020

Let's talk about refactoring ideas

Configuration:

This is the most important feature for properly controlling what is being done, so it requires a nice and flexible UI (a sketch of such an object follows the task list below)

  • Better distinguish Analysis parameters (eg: use_pv) from Local configuration settings (eg: path and folder/file names)
  • Getter, setter for configuration items
  • Loader, saver from/to files (using txt files compatible with the Matlab software)
  • Usable in context ("with") or for sessions
  • Consistency/validity checks (e.g. Setting warning in configurations to prevent entry of swapped values  #28 )
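As a sketch of how these items could combine, here is a minimal, hypothetical configuration object; the class name, keys, and txt layout are illustrative, not the actual pyowc API:

```python
# Hypothetical sketch only: names, keys and the txt layout are illustrative.
import warnings

class OwcConfig:
    # Better distinguish analysis parameters from local settings:
    ANALYSIS_KEYS = {"use_pv"}                       # e.g. mapping options
    LOCAL_KEYS = {"float_source_dir", "config_dir"}  # paths, folder/file names

    def __init__(self, **params):
        self._params = {}
        for key, value in params.items():
            self[key] = value

    # Getter / setter for configuration items, with validity checks (#28)
    def __getitem__(self, key):
        return self._params[key]

    def __setitem__(self, key, value):
        if key not in self.ANALYSIS_KEYS | self.LOCAL_KEYS:
            warnings.warn(f"Unknown configuration item: {key!r}")
        self._params[key] = value

    # Loader / saver, keeping a plain 'KEY=VALUE' txt layout so files can
    # stay compatible with the Matlab software
    @classmethod
    def load(cls, path):
        params = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("%"):
                    key, _, value = line.partition("=")
                    params[key.strip()] = value.strip()
        return cls(**params)

    def save(self, path):
        with open(path, "w") as f:
            for key, value in sorted(self._params.items()):
                f.write(f"{key}={value}\n")

    # Usable in a "with" context, so settings can be scoped to a session
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False
```

Loading a file and scoping it to a session would then read `with OwcConfig.load("owc_conf.txt") as cfg: ...`.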

Data Mapping:

This is the most time-consuming step, so it requires optimisation to improve performance

  • Reduce the size of the update_salinity_mapping function (~500 lines!) by identifying recurrent patterns and delegating inner-loop work to dedicated functions
  • Consider using xarray to make managing groups of variables much easier. Together with more readable code, this should help with recurrent tasks like modifying batches of variables.
  • update_salinity_mapping has 2 main loop levels: over profiles and over vertical levels. If inner-loop work is delegated to functions and the data structure is clarified (with dictionaries or, even better, an xarray.Dataset), parallelising these loops will be much easier and a game changer in terms of performance (see the sketch after this list).
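A sketch of that pattern, with hypothetical names (map_one_profile stands in for the delegated inner-loop work):

```python
# Hypothetical sketch: once per-profile work is a self-contained function,
# the profile loop parallelises with the standard library alone.
from concurrent.futures import ProcessPoolExecutor

def map_one_profile(profile, config, reference_data):
    """Mapping for a single profile; the vertical-level loop and its
    helper functions would live here (placeholder body)."""
    return {"profile_id": profile["id"], "mapped_salinity": None}

def update_salinity_mapping_parallel(profiles, config, reference_data):
    with ProcessPoolExecutor() as pool:
        return list(pool.map(
            map_one_profile,
            profiles,
            [config] * len(profiles),
            [reference_data] * len(profiles),
        ))
```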

Data fetching:

This is a key component of the software, fetching float data and, more importantly, reference data.

  • Refactor all functions dedicated to data fetching into a single module (already started in [WIP] Refactor software: step 1 #26 but surely not finished)
  • Consider using a cache filesystem to avoid repeating previous fetches (a minimal sketch follows this list)
  • When this becomes possible, consider fetching data from a web API: always up to date!
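A minimal sketch of such a cache, assuming a hypothetical fetch_wmo_box callable and a pickle-on-disk layout (neither is pyowc's actual API):

```python
# Hypothetical sketch of an on-disk cache for reference-data fetches.
import pickle
from pathlib import Path

CACHE_DIR = Path("~/.pyowc_cache").expanduser()

def fetch_wmo_box_cached(wmo_box_id, fetch_wmo_box):
    """Return reference data for a WMO box, fetching it at most once."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file = CACHE_DIR / f"wmo_{wmo_box_id}.pkl"
    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)
    data = fetch_wmo_box(wmo_box_id)  # the real fetch (local file or web API)
    with cache_file.open("wb") as f:
        pickle.dump(data, f)
    return data
```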

Code design:

  • Refactor code so that functions are really self-contained: e.g. in update_salinity_mapping, wrapping longitude values between -180/180 for get_topo_grid is done outside of get_topo_grid, adding 4 lines and 1 variable to the code. Checking longitude values is this function's own responsibility and must not be done outside (see the sketch after this list).
  • Start documentation, beginning with sphinx autodoc. Being able to navigate the code will make refactoring easier.
  • This is linked to documentation: give clear explanations of configuration parameters and where they are used in the code. This will also help users set up the software properly.
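For the get_topo_grid example, the wrapping could simply move inside the function; the signature and body below are illustrative:

```python
# Sketch: longitude wrapping handled inside get_topo_grid itself, so no
# caller ever needs to pre-process coordinates.
import numpy as np

def get_topo_grid(lon, lat):
    lon = np.asarray(lon, dtype=float)
    lon = ((lon + 180.0) % 360.0) - 180.0  # wrap into [-180, 180) internally
    # ... look up topography on the wrapped grid ...
    return lon, np.asarray(lat, dtype=float)
```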
@gmaze gmaze added enhancement New feature or request refactoring improve the design, structure, and/or implementation labels Jul 9, 2020
@edsmall-bodc (Collaborator)

Configuration

Better distinguish Analysis parameters

Agreed - I would suggest having two routines

  1. Set up all paths to/from directories. This could simply be a dictionary (?)
  2. Combine the parameter settings of set_calseries and the parameters from load into one configuration routine. Because there is some logic involved in setting up parameters (i.e. setting theta boundaries), this would probably have to be a class with setters/getters for the parameters (rough sketch below).
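A rough sketch of that split, with hypothetical names and paths:

```python
# Hypothetical sketch of the proposed split: paths as a plain dictionary,
# analysis parameters as a class whose setters carry the logic.
def default_paths(base="."):
    """Routine 1: all directory paths in one dictionary."""
    return {
        "float_source": f"{base}/data/float_source",
        "float_mapped": f"{base}/data/float_mapped",
        "constants": f"{base}/data/constants",
    }

class AnalysisParameters:
    """Routine 2: set_calseries settings merged with the loaded parameters."""

    def __init__(self, use_pv=False):
        self.use_pv = use_pv
        self._theta_bounds = None

    @property
    def theta_bounds(self):
        return self._theta_bounds

    @theta_bounds.setter
    def theta_bounds(self, bounds):
        low, high = bounds
        if low >= high:  # the "logic involved" lives in the setter
            raise ValueError("theta bounds must satisfy low < high")
        self._theta_bounds = (low, high)
```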

Getters, Setters, Loaders, Savers

Agreed, see above !

Validation

Agreed here too, though there is still probably more discussion to be had around parameter validation to ensure we aren't accidentally boxing our DMQC operators into corners. At least a set of warnings should be considered, as we have previously discussed. Perhaps we could even give operators a "normal range" that values should fall within.

Data Mapping

Reduce the size of salinity mapping

Again, completely agree. There is a ton of legwork done here that we could move out into separate functions to make the routine more accessible and friendly to engineers. I have already refactored some code compared to the Matlab version, but there is more to be done.

Xarray and Parallelisation

The only concern I have here is that there are some shared resources that are appended to at the end of the loop, such as constructing the selected historical data array. I don't know how xarray would handle a race condition like this, but it would be interesting to find out!
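One way to sidestep the race condition entirely, sketched with hypothetical names: let each parallel task return its own slice, and do the single shared append serially afterwards.

```python
# Hypothetical sketch: workers never touch shared state; the only write to
# the shared array happens once, serially, after the parallel region.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def select_historical_data(profile):
    """Per-profile selection (placeholder body); returns a 2-D slice."""
    return np.atleast_2d(profile["selected"])

def build_selected_history(profiles):
    with ProcessPoolExecutor() as pool:
        slices = list(pool.map(select_historical_data, profiles))
    return np.concatenate(slices, axis=0)
```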

Data Fetching

I'll comment on this section as a whole. The only place real data fetching takes place is in the update_salinity_mapping section. There are two main parts.

  1. One section fetches latitudes, longitudes, and dates of ALL the historical data within the 25 selected WMO boxes.
  2. We then do some checking to see which historical data points we should keep. We then fetch ALL the data again, including the ocean characteristics this time, and then remove the data we don't want.

Clearly this could be done FAR better by either

  1. Fetching ALL the data at the beginning, and then filtering out what we don't need when we know.
  2. Only fetching the data we need once we have clarified which historical data would be best to keep for the analysis, rather than fetching all of it again and then filtering out what we do not need (sketched below).
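Option 2 could look roughly like this; all three callables are hypothetical placeholders for the real fetch and selection steps:

```python
# Hypothetical sketch of option 2: fetch cheap coordinates first, decide
# what to keep, then fetch full records for the kept rows only.
def fetch_selected_region(wmo_boxes, keep_mask_fn, fetch_coords, fetch_records):
    coords = fetch_coords(wmo_boxes)       # latitudes/longitudes/dates only
    keep = keep_mask_fn(coords)            # spatial and temporal selection
    return fetch_records(wmo_boxes, keep)  # ocean characteristics, kept rows only
```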

A lot of the profiles will use largely the same historical data (since each sequential profile is near the previous profile), so we could do something with that. However, that might make parallelisation more difficult (?)

Very much on board for a web API fetcher. Both for float data and bathymetry.

Code design

Refactor

There are a number of places where data manipulation happens outside a function that REALLY should happen inside. We can make a list of all the places this occurs and slowly work on refactoring these places.

Documentation

I have sat on this issue for quite a long time, frustratingly. I have waited to see what the BODC software team do, but it seems that their documentation is a little all over the place - different depending on the project and the person running it. There was a time frame where they thought they might move everything to ReadTheDocs, but ended up being unable to do so. I'm happy using anything, especially if it makes the code more approachable and it's convenient. It would also be nice to have somewhere to point people to who are new to the code.

We'd probably need a lot of input from our DMQC experts here. There is an instruction book that would probably be worth having a look over. I'm sure @kamwal can help us with that.

@matdon17 (Collaborator)

Comments on refactoring ideas, with a particular focus on what can be achieved in the near term.

Configuration

Better distinguish Analysis parameters

I am in favour of separating analysis and local configuration parameters. From a discussion in BODC today, local configuration parameters could be extended to include a preferred plot output format for saving, and no doubt other features (we have talked about introducing perceptually neutral colour schemes, for instance). As we have a working piece of software, I don't think this area is a major priority right now, but equally it is not a major undertaking to implement some incremental improvements.

Getters, Setters, Loaders, Savers

As above.

Validation

Agree with Ed around not boxing DMQC operators in - I think this might be an area for future development as the code is adopted in routine use and we can develop a better shared understanding of what should and should not be done. Implementing anything now would risk non-adoption of the code if it inadvertently imposes limitations on operators.

Data Mapping

Reduce the size of salinity mapping

Whilst I agree in principle, at this point I think refactoring this element of the code brings too much risk to delivering a complete and reliable set of code which can be brought into routine use.

Xarray and Parallelisation

I think this has one of the most important benefits in terms of performance and providing improved capabilities to users. I am not clear on how achievable this is in the short term, but from past conversations it seems this could be achieved in stages and be de-risked in the process.

Data Fetching

Improvements to data fetching would be good, although I am hesitant about always fetching the latest data - sometimes you may want to be working with a static set from iteration to iteration of a DMQC run. At this stage, as we are limited by what is publicly available for reference data, I consider this to be a lower priority and not a current focus. Enabling use of Jupyter at least demonstrates the potential though, and I think it is something to take back to ADMT as future growth.

Code design

Refactor

Whilst I don't disagree with what has been highlighted, I suspect the more critical element at this juncture is to assess options for improving the performance of the code. One such example is pre-allocation of variables that currently grow with each cycle and might be the cause of performance that is, in some cases, slower than Matlab's. Improved performance was one of the goals of this conversion.
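For illustration, the difference between the two patterns (sizes are arbitrary):

```python
# Growing an array each cycle copies everything on every iteration
# (quadratic in copies); pre-allocating and writing in place does not.
import numpy as np

n_profiles, n_levels = 200, 400

# Current pattern in places: grow with each cycle
grown = np.empty((0, n_levels))
for _ in range(n_profiles):
    row = np.zeros((1, n_levels))          # placeholder per-cycle result
    grown = np.concatenate([grown, row])   # reallocates every time

# Proposed: pre-allocate once, write in place
pre = np.full((n_profiles, n_levels), np.nan)
for i in range(n_profiles):
    pre[i, :] = 0.0
```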

Documentation

Agreed on implementing something, and quickly. I have no great preferences - so long as it has minimum of maintenance and maximum of accessibility. I think it would be good for 'developer' documentation and 'user' documentation to sit in the same place.

@matdon17 (Collaborator)

To summarise my previous comment, I think the current focus should be on:

  1. Refactor code to address obvious performance issues, like repeatedly expanded arrays that become increasingly problematic with larger datasets - e.g. through pre-allocation of the offending arrays;
  2. Implement solution for documentation of code for developers and users;
  3. Separate local settings from analysis settings;
  4. Make some progress on parallelisation, accepting we may not be able to achieve a complete end-state in the near term.

@kamwal (Contributor) commented Apr 28, 2021

Improvements have recently been made in the following areas:

Fixing the CI pipelines

  • this required updating the unit test code to match recent changes in the main code, and updating the example Matlab mat_.mat file used in the unit tests.

Data Mapping

  • the update_salinity_mapping code has been refactored to improve its readability, reducing the length of this function from 500 to 340 lines,

  • in terms of the usage of xarray: this has been looked at using the interp_climatology method within the wrangling.py file. xarray was expected to give a speedup; however, currently it would not be greater than the numpy speedup.
    A major benefit would come from using an xarray Dataset, which allows linked variables to be worked with at the same time (float dataset or grid dataset). These datasets can be passed around or have methods built for them.
    Another benefit the Argo community might want to move towards is storing the reference data in NetCDF; this would allow for lazy loading of larger datasets and could be useful when processing large collections of profiles in historically data-rich areas. xarray is kept as a potential option for the next level of improvements.

  • the areas consuming the largest share of evaluation time have been investigated; they are: covar_xyt_pv, interp_climatology, array_from_header (extent of usage, as it's a scipy function), get_region_data, numpy.array (extent of usage, as it's a numpy function), and update_salinity_mapping. These were identified using a newly added basic profiling script (a minimal sketch follows). These functions have been refactored, bringing the run time of the pyowc code down to 67% of the previous version (using CTD reference data) and 73% (using Argo reference data).
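Such a profiling script can be as small as the following sketch; run_analysis is a hypothetical stand-in for the actual pyowc entry point:

```python
# Hypothetical sketch of a basic profiling run using the standard library.
import cProfile
import pstats

def run_analysis():
    """Placeholder for e.g. a call into update_salinity_mapping."""
    pass

cProfile.run("run_analysis()", "pyowc.prof")
stats = pstats.Stats("pyowc.prof")
stats.sort_stats("cumulative").print_stats(10)  # show the top 10 hotspots
```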

Documentation

  • sorted out issues with installing the toolboxes using pip and conda
  • set up building the documentation using sphinx
  • updated the README.md
  • updated/fixed the Pangeo Binder to the most recent version of the master branch

@kamwal kamwal added the priority-low Low priority for next release label Nov 7, 2022
@gmaze gmaze pinned this issue Dec 8, 2022
@gmaze gmaze added this to the 1st release on Pypi milestone Dec 8, 2022