
Evidently Hacktoberfest 2022

Thanks for your interest in contributing to Evidently!

This page describes how you can contribute during Hacktoberfest (and beyond!).

If you are new to Evidently

[Screenshot: Evidently reports]

Evidently is an open-source Python library for data scientists and ML engineers. It helps evaluate, test, and monitor the performance of ML models from validation to production.

Evidently evaluates different aspects of data and ML model performance: from data integrity to ML model quality. You can get the results as interactive dashboards in a Jupyter notebook or export them as JSON or a Python dictionary.

If you have not used Evidently before, you can go through the Getting Started tutorial. It will take you about 10 minutes to understand the basic functionality.

How to contribute

There are different ways you can contribute to Evidently. You can read our Contribution Guide.

We welcome all improvements or fixes, even the tiny ones, and non-code contributions. Do you see a typo in the documentation? Don’t be shy, and send us a pull request. No contribution is too small!

In addition, during Hacktoberfest, we invite you to make a specific type of contribution: help us add new statistical tests and metrics to detect data drift.

[Image: adding a new drift metric]

Here is what it means:

  • Evidently helps users detect data drift (to check if the distributions of the input features remain similar) and prediction drift (to detect when model outputs change).
  • To do this, you typically need to run a statistical test (like Kolmogorov–Smirnov) or calculate a statistical distance using a metric like Wasserstein distance; see the short sketch after this list. Evidently already has implementations of several tests and metrics inside the library.
  • We invite you to add more metrics and tests as available drift detection methods.
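
For intuition, here is a minimal sketch of both approaches using SciPy directly (this is an illustration, not Evidently code):

import numpy as np
from scipy import stats

reference = np.random.normal(0, 1, 1000)
current = np.random.normal(0.5, 1, 1000)

# Statistical test: a small p-value suggests the samples come from different distributions
p_value = stats.ks_2samp(reference, current).pvalue

# Distance metric: larger values mean the distributions are further apart
distance = stats.wasserstein_distance(reference, current)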

If you want to know more about approaches to data drift detection, here is a blog post.

Why is this useful?

Right now, users can choose from the statistical tests and metrics already implemented in the library, or define their own custom tests. Some users rely on custom tests as they have their own preferences or want to use a test they are familiar with. Adding more drift methods to the “library of statistical tests” will give users more options to choose from. This will reduce the need for custom implementations.

Which drift detection methods are already there?

You can see the list of implemented methods here in the code.

Which drift detection methods should I contribute?

We added several ideas to the issues. They are labeled hacktoberfest or good first issue.

You are welcome to propose your ideas, too. Is there a popular metric we overlooked? Is there something you are using in your work to detect drift? Open an issue to let us know that you want to add a different metric and have started working on it!

If you pick an existing issue, we encourage you to post that you started working on it. However, we will not formally "reserve" or "assign" issues and will review pull requests on a first-come, first-served basis.

Instructions to add a new data drift method

For general instructions (e.g., how to clone the repository), head to the Contribution Guide.

Once you have chosen the drift method you want to implement, take the following steps.

Step 1: Add the new module

Add the new module for drift calculation. It should be located in the following folder: https://github.com/evidentlyai/evidently/tree/main/src/evidently/calculations/stattests

You need one file for each method.
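
For example, a hypothetical Anderson-Darling method (used as the running example in the steps below) could live in a new file like anderson_stattest.py, next to the existing ks_stattest.py.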

Step 2: Create the StatTest object

In your module, you should create the StatTest object.

It requires the following:

  • name - this is how users will call the new method in the code; make it short and clear
  • display_name - this name will appear in the visual report; make sure it is complete and looks nice on a dashboard
  • func - the name of the function that performs the calculations
  • allowed_feature_types - list here the feature types your new stattest is suitable for. It can be num (numerical) and/or cat (categorical).

The last part is important, as not all statistical tests and metrics are suitable for both numerical and categorical features. Specify it correctly: this way, if the user tries to apply the test to an unsuitable feature type, Evidently will return an error.
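
For illustration, here is what this could look like for the hypothetical Anderson-Darling method. This is a sketch; mirror an existing module such as ks_stattest.py for the exact import path and usage.

from evidently.calculations.stattests.registry import StatTest

# hypothetical Anderson-Darling drift method (illustrative sketch)
anderson_stat_test = StatTest(
    name="anderson",  # the short name users will pass in the code
    display_name="Anderson-Darling test (p_value)",  # shown in the visual report
    func=_anderson_darling,  # the function implemented in Step 3
    allowed_feature_types=["num"],  # suitable for numerical features only
)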

Step 3: Implement the function

Implement the func that performs the calculations.

It should take the following inputs:

  • Reference pd.Series - a dataset that is the baseline for comparison
  • Current pd.Series - a dataset that is compared to the first one
  • feature_type: str - feature type
  • threshold: float - the drift detection threshold (for example, drift is detected when a p-value falls below it, or when a distance metric exceeds it)

It should return:

  • score: float - the calculated drift score (e.g., a p-value or a distance metric value)
  • drift_detected: bool - the drift detection result (detected / not detected)

Don’t forget about the docstrings! We use the Google style.
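
Continuing the hypothetical Anderson-Darling example, the function could look like this (a sketch built on scipy.stats.anderson_ksamp; mirror the existing modules for the exact argument names):

from typing import Tuple

import pandas as pd
from scipy import stats


def _anderson_darling(
    reference_data: pd.Series,
    current_data: pd.Series,
    feature_type: str,
    threshold: float,
) -> Tuple[float, bool]:
    """Run the two-sample Anderson-Darling test.

    Args:
        reference_data: the dataset that is the baseline for comparison.
        current_data: the dataset that is compared to the baseline.
        feature_type: the feature type ("num" or "cat").
        threshold: p-values below this threshold mean data drift.

    Returns:
        p_value: the approximate p-value of the test.
        drift_detected: whether drift was detected.
    """
    p_value = stats.anderson_ksamp(
        [reference_data.values, current_data.values]
    ).significance_level
    return p_value, p_value < threshold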

Finally, add this line to register the new data drift method (shown here for the Kolmogorov–Smirnov example; pass your own StatTest object instead):

register_stattest(ks_stat_test)

Now, you have created the new method. To import it, add the function import to the init file: https://github.com/evidentlyai/evidently/tree/main/src/evidently/calculations/stattests/__init__.py

You can take one of the tests available in the library as an example: https://github.com/evidentlyai/evidently/blob/main/src/evidently/calculations/stattests/ks_stattest.py

Step 4: Add software tests

After you’ve implemented your module, the work is not done yet! You need to check that everything works as expected on known corner cases and will continue to work in the future after new changes are added to the library.

Let’s implement the software tests! We use the pytest framework.

You need to add checks like:

  • How will my test work with empty values?
  • What if the current data contains just one value?

You can see the existing software tests here: https://github.com/evidentlyai/evidently/blob/main/tests/stattests/test_stattests.py

If you have any questions about implementing the tests, reach out!
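
As a sketch, software tests for the hypothetical Anderson-Darling method from the earlier steps might look like this (assuming the StatTest object is exported from the stattests package):

import pandas as pd

# hypothetical import: the StatTest object created in Step 2
from evidently.calculations.stattests import anderson_stat_test


def test_no_drift_on_identical_data():
    reference = pd.Series(range(100))
    current = pd.Series(range(100))
    _, drift_detected = anderson_stat_test.func(reference, current, "num", 0.05)
    assert not drift_detected


def test_drift_on_shifted_data():
    reference = pd.Series(range(100))
    current = pd.Series(range(1000, 1100))
    _, drift_detected = anderson_stat_test.func(reference, current, "num", 0.05)
    assert drift_detected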

Step 5: Apply your new drift detection method

Let's check how your new drift detection method works in practice. Evidently has several interfaces that rely on the drift detection methods. We suggest creating a test suite.

You need to prepare the datasets to compare and:

  • Create an Evidently test suite (consult the tests user guide if needed)
  • Choose the drift-related tests that you want to include, for example, data or target drift.
  • Create the DataDriftOptions object, specify when you want to use your new drift detection method (e.g., apply it only to numerical features), and pass it to TestSuite as a parameter of one or several drift-related tests.

Here is a usage example:

from evidently import ColumnMapping
from evidently.options import DataDriftOptions
from evidently.test_suite import TestSuite
from evidently.tests import TestFeatureValueDrift

# use the new method (registered under its "name") for all features
stat_test_option = DataDriftOptions(all_features_stattest='YOUR_TEST')

suite = TestSuite(tests=[
    TestFeatureValueDrift(column_name='education-num', options=stat_test_option),
])

# ref and curr are the reference and current datasets as pandas DataFrames
suite.run(reference_data=ref, current_data=curr,
          column_mapping=ColumnMapping(target='target', prediction='preds'))
suite

You can set whether your new stattest applies to the input features and/or the model output in DataDriftOptions.
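
For example, you could apply the new method only to numerical features. This is a sketch: the field names below and the 'chisquare' method name are assumptions to double-check against the options reference.

# use the new method for numerical features and an existing one for categorical features
options = DataDriftOptions(
    num_features_stattest='YOUR_TEST',
    cat_features_stattest='chisquare',
)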

Here is an end-to-end example of how to use DataDriftOptions: https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_specify_stattest_for_a_testsuite.ipynb

Step 6: Update documentation

Almost there!

Now, it’s time to tell the users that we have a new drift detection option! This is the page in the documentation that lists available drift methods. Add yours here: https://github.com/evidentlyai/evidently/blob/main/docs/book/customization/options-for-statistical-tests.md

Step 7: Create an example (optional)

You can create an example Jupyter notebook that shows how to call the Evidently data drift test suite with the newly added drift detection method set as an option.

Here is an example notebook where you can add your new method: https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_specify_stattest_for_a_testsuite.ipynb

Step 8: Send the pull request

Send us a PR using the Contribution Guide. If you feel like it, you can split the work into four separate PRs:

  • Implementation of the new drift detection method
  • Implementation of the related software tests
  • Documentation update
  • Example update

We monitor all contributions and will try to review yours within a few days.

Note that we will not merge contributions that do not include software tests for the implemented drift detection methods (but we are happy to review the method implementation before you write the tests).

Don’t forget to sign up for Hacktoberfest!

Hacktoberfest is an independent event that happens every year. If you register and are among the first 40,000 participants to complete the requirement of 4 accepted pull requests, you can get a prize. Read more here.

Accepted contributions to Evidently will count toward your Hacktoberfest PRs.

Need help?

Join the Evidently Discord community: https://discord.com/invite/xZjKRaNp8b and ask questions in the #evidently-hacktoberfest channel!

We will also have a Community Call on October 13. Sign up here to join: https://lu.ma/mvxmbhj6

We might host other events. Leave your email to receive updates: https://www.evidentlyai.com/hacktoberfest-2022

Want to share what you did?

If you want to share your contributions with the community, feel free to post on Twitter or other social media with the hashtags #DSHacktoberfest and #EvidentlyHacktoberfest.