
Releases: Mu-Sigma/analysis-pipelines

R package version v1.0.0 - Submitted to CRAN

03 Jan 07:51

R package version - v1.0.0

Key additions

  • Updated documentation and resolved CRAN check issues
  • Updated vignettes with minor corrections
  • Made CRAN submission corrections - Updated description and added examples

v0.0.0.9004

14 Dec 07:37
Pre-release

Key features added in this release

Python functions in analysis pipelines

Python functions can now be defined, and Python files sourced, through the reticulate package. Functions that have a reference in the R environment can be registered to an AnalysisPipeline object, which means that an interoperable pipeline comprising R, Spark and Python functions can be created.

A vignette on how to use Python functions has been added. The Interoperable pipelines vignette has also been updated to showcase pipelines with all 3 engines.
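
A minimal sketch of how this might look is shown below. The file name, the Python function name, the AnalysisPipeline(input = ...) constructor call and the registerFunction() arguments are illustrative assumptions, not the package's documented API:

    library(analysisPipelines)
    library(reticulate)

    # Source a Python file; its functions become available in the R session.
    # "summarize.py" and summarize_df() are hypothetical names for illustration.
    reticulate::source_python("summarize.py")

    # Register the Python-backed function so it can be used in pipelines
    # (argument names below are assumptions)
    registerFunction(functionName = "summarize_df", engine = "python")

    # Compose an interoperable pipeline mixing engines
    pipelineObj <- AnalysisPipeline(input = iris)   # constructor signature assumed
    pipelineObj %>>% summarize_df(storeOutput = TRUE) -> pipelineObj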

Improved formula parsing & execution

Some logical inconsistencies in formula semantics have been resolved. For data functions, the data argument can now be one of three things:

  • Not passed, in which case it acts as a pronoun: the function operates on the input data that the pipeline object was instantiated with
  • A data frame explicitly passed
  • A formula - which denotes an output of a previous function

Additionally, if the data argument of a data function receives an R data frame, Spark DataFrame, or Pandas DataFrame, type conversion is performed automatically according to the engine of that data function.
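
The three cases might look like the following sketch. filterRows() is a hypothetical registered data function, and the AnalysisPipeline(input = ...) constructor signature is assumed for illustration:

    pipelineObj <- AnalysisPipeline(input = iris)

    # 1. `data` not passed: the function operates on the pipeline's input data
    pipelineObj %>>% filterRows(condition = "Sepal.Length > 5") -> p1

    # 2. A data frame passed explicitly
    pipelineObj %>>% filterRows(data = mtcars, condition = "mpg > 20") -> p2

    # 3. A formula denoting the output of a previous function (here, function ID 1)
    p1 %>>% filterRows(data = ~f1, condition = "Species == 'setosa'") -> p3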

v0.0.0.9003

14 Dec 07:37
Pre-release

Key features added in this release

Meta-pipelines

Meta-pipelines can now be defined. The meta-pipeline construct allows users to export a pipeline created for a particular use case into a general analysis flow that can be reused with a different dataset and a different set of parameters. In a pipeline, the data can change (while retaining the same schema) but the set of function parameters stays the same; in a meta-pipeline, only the analysis flow, function dependencies and so on are retained, and the parameters of each function can be set afresh for a new use case.

The objective of a meta-pipeline is to define and execute reusable analysis flows. They can be used to:

  • Document best practices for a particular problem
  • Templatize analyses for particular situations

Meta-pipelines are created by exporting from existing pipelines, and new pipelines can be instantiated from a meta-pipeline through an easy-to-use method for setting new parameter values.
This means that our foundation bricks can be represented as meta-pipelines.
A vignette on meta-pipelines has been added to the package to explain their usage.
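
A hedged sketch of this workflow is shown below. The function names exportAsMetaPipeline(), getPipelinePrototype() and createPipelineInstance(), and the structure of the prototype object, are assumptions used only for illustration:

    pipelineObj <- AnalysisPipeline(input = iris)   # constructor signature assumed
    pipelineObj %>>% univarCatDistPlots(uniCol = "Species", storeOutput = TRUE) -> pipelineObj

    # Export to a meta-pipeline: only the flow and function dependencies are kept
    metaPipeline <- exportAsMetaPipeline(pipelineObj)

    # Inspect the parameters that can be set afresh, change them as needed,
    # and instantiate a new pipeline for a different dataset
    proto <- getPipelinePrototype(metaPipeline)
    newPipeline <- createPipelineInstance(metaPipeline, proto)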

Function registration and execution improvements

  • Some issues with respect to function registration have been fixed.
  • Parsing for formula semantics has been improved

v0.0.0.9002

14 Dec 07:37
Pre-release
  • Fixed a major issue with function registration and improved the mechanism - auto-prompts for parameters are now shown, and defaults are retained.

v0.0.0.9001: Minor patch fixing some import issues and corrections in documentation

14 Dec 07:38
  • Fixed certain import issues
  • Corrections in the documentation

v0.0.0.9000: First major beta release of the package.

14 Dec 07:38

An overview of the package

  • Compose reusable, interoperable pipelines in a flexible manner
  • Leverage available utility functions for performing different analytical operations
  • Put these pipelines into production in order to execute them repeatedly
  • Generate analysis reports by executing these pipelines

Supported engines

As of this version, the package supports functions executed on R, or on Spark through the SparkR interface, for batch pipelines. It also supports Spark Structured Streaming pipelines for streaming analyses. Python will be supported in subsequent releases.

Available vignettes
This package contains 5 vignettes:

  • Analysis pipelines - Core functionality and working with R data frames and functions - This is the main vignette describing the package's core functionality, and explaining this through batch pipelines in just R
  • Analysis pipelines for working with Spark DataFrames for one-time/ batch analyses - This vignette describes creating batch pipelines that execute solely in a Spark environment
  • Interoperable analysis pipelines - This vignette describes creating and executing batch pipelines which are composed of functions executing across supported engines
  • Streaming Analysis Pipelines for working with Apache Spark Structured Streaming - This vignette describes setting up streaming pipelines on Apache Spark Structured Streaming
  • Using pipelines inside Shiny widgets or apps - A brief vignette which illustrates an example of using a pipeline inside a shiny widget with reactive elements and changing data

Key features

User-defined functions

You can register your own data or non-data functions. This adds the user-defined function to the registry. The registry is maintained by the package and once registered, functions can be used across pipeline objects.
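
For example, a registration might look like the sketch below. The registerFunction() argument names and the AnalysisPipeline(input = ...) constructor signature are assumptions for illustration:

    # A simple user-defined data function
    getTopRows <- function(data, n) {
      head(data, n)
    }

    # Register it; once registered, it is available to every pipeline object
    registerFunction(functionName = "getTopRows", heading = "Top rows of the dataset")

    pipelineObj <- AnalysisPipeline(input = iris)
    pipelineObj %>>% getTopRows(n = 10, storeOutput = TRUE) -> pipelineObj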

Complex pipelines and formula semantics
In addition to simple linear pipelines, more complex pipelines can also be defined. There are cases where the outputs of previous functions in the pipeline need to be passed as inputs to arbitrary parameters of subsequent functions.

The package defines certain formula semantics to accomplish this. To illustrate how this works, we take the example of two simple user-defined functions: one that simply returns the color for a plot, and another that returns the column on which the plot should be drawn.

Preceding outputs can be passed to subsequent functions by specifying a formula of the form ~fid against the argument to which the output is to be passed, where id is the ID of the function in the pipeline. For example, to pass the output of the function with ID '1' as an argument to a parameter of a subsequent function, the formula '~f1' is supplied for that argument.
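
For context, the two helper functions used in the pipeline below could be defined and registered along these lines (a hedged sketch; the registerFunction() argument names are assumptions):

    getColor <- function(color) {
      return(color)
    }

    getColumnName <- function(columnName) {
      return(columnName)
    }

    # Register both as non-data functions (argument names assumed for illustration)
    registerFunction("getColor", isDataFunction = FALSE, firstArgClass = "character")
    registerFunction("getColumnName", isDataFunction = FALSE, firstArgClass = "character")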

    obj %>>% getColor(color = "blue") %>>%
        getColumnName(columnName = "Occupancy") %>>%
        univarCatDistPlots(uniCol = "building_type", priColor = ~f1,
                           optionalPlots = 0, storeOutput = TRUE) %>>%
        outlierPlot(method = "iqr", columnName = ~f2, cutoffValue = 0.01,
                    priColor = ~f1, optionalPlots = 0) -> complexPipeline

    complexPipeline %>>% getPipeline
    complexPipeline %>>% generateOutput -> op
    op %>>% getOutputById("4")

Interoperable pipelines

Interoperable pipelines containing functions operating on different engines such as R, Spark and Python can be configured and executed through the analysisPipelines package. Currently, the package supports interoperable pipelines containing R and Spark batch functions.

Pipeline visualization

Pipelines can be visualized as directed graphs, providing information about the engines being used, function dependencies and so on.
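
For example, the complex pipeline built above might be visualized as follows. visualizePipeline() is an assumed name used for illustration, not confirmed by these notes:

    # Renders the pipeline as a directed graph showing engines and dependencies
    visualizePipeline(complexPipeline)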

Report generation

Outputs generated from pipelines can easily be exported to formatted reports that showcase the results, the generating pipeline, and a peek at the data.
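
A hedged sketch of what this could look like; generateReport() and its path argument are assumptions for illustration:

    # Execute the pipeline, then export a formatted report of its stored outputs
    complexPipeline %>>% generateOutput -> op
    generateReport(op, path = ".")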

Efficient execution

The 'analysisPipelines' package internally converts the pipeline defined by the user into a directed graph which captures the dependencies of each function in the pipeline on data, other arguments, and the outputs of other functions.

  • Topological sort and ordering - When output needs to be generated, the pipeline is first prepped by performing a topological sort of the directed graph and identifying sets (batches) of independent functions, together with the sequence in which those batches should execute (see the sketch after this list for the batching idea). A later release of the package will allow for parallel execution of the independent functions within a batch
  • Memory management & garbage collection - Memory is managed efficiently by storing only those outputs the user has explicitly asked for, and holding intermediate outputs required by subsequent functions only until they have been consumed. Garbage collection is performed after the execution of each batch in order to manage memory effectively.
  • Type conversions - For interoperable pipelines executing across multiple engines such as R, Spark and Python, conversions between the engines' data types are minimized by identifying the optimal set of type conversions before execution starts
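
The batching idea can be illustrated with a small, self-contained sketch. This is not the package's internal code; it only shows how functions can be grouped into sequential batches of mutually independent functions, given a dependency list (assuming no cycles):

    # Each element maps a function ID to the IDs it depends on
    deps <- list(f1 = character(0), f2 = character(0),
                 f3 = c("f1", "f2"), f4 = "f3")

    batches <- list()
    done <- character(0)
    remaining <- names(deps)
    while (length(remaining) > 0) {
      # A function is ready when all of its dependencies have already run
      ready <- remaining[vapply(remaining,
                                function(f) all(deps[[f]] %in% done),
                                logical(1))]
      batches[[length(batches) + 1]] <- ready   # one batch of independent functions
      done <- c(done, ready)
      remaining <- setdiff(remaining, ready)
    }
    # batches: ("f1", "f2"), then "f3", then "f4"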

Logging & Execution times

The package provides logging capabilities for the execution of pipelines. By default, logs are written to the console, but the user can instead specify an output file to which the logs should be written through the setLoggerDetails function.

Logs capture errors, as well as provide information on the steps being performed, execution times and so on.
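
A hedged sketch of redirecting logs to a file; the argument names (target, targetFile) and whether setLoggerDetails() acts on a pipeline object or globally are assumptions here:

    # Write execution logs to a file instead of the console
    pipelineObj <- setLoggerDetails(pipelineObj, target = "file",
                                    targetFile = "pipeline_execution.log")
    pipelineObj %>>% generateOutput -> pipelineObj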

Custom exception-handling

By default, when a function is registered, a generic exception-handling function that captures the R error message in case of an error is registered against it in the registry. The user can define a custom exception-handling function and provide it at registration time.
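
A minimal sketch of what this could look like; the exceptionFunction argument name, and whether it accepts the function or its name, are assumptions for illustration:

    # A custom exception handler that records the error message before stopping
    myExceptionHandler <- function(error) {
      message("Pipeline step failed: ", conditionMessage(error))
      stop(error)
    }

    # Supply the handler when registering the function
    registerFunction(functionName = "getTopRows",
                     exceptionFunction = "myExceptionHandler")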

Upcoming features

  • Support for interoperable pipelines with Python
  • More utilities from the foundation bricks contributed by the Foundation Bricks team!
  • More execution and core functionality improvements