
Releases: sb-ai-lab/RePlay

v0.17.0

07 Jun 07:34

RePlay 0.17.0 Release notes

  • Highlights
  • Backwards Incompatible Changes
  • Deprecations
  • New Features
  • Improvements
  • Bug fixes

Highlights

We are excited to announce the release of RePlay 0.17.0!
The new version fixes serious bugs related to the performance of LabelEncoder and saving checkpoints in transformers. In addition, methods have been added to save splitters and SequentialTokenizer without using pickle.

Backwards Incompatible Changes

Change SequentialDataset behavior

When training transformers on big data, a slowdown was detected that increased the epoch time from 5 minutes to 1 hour. The cause was that, by default, the model trainer saves a checkpoint every 50 steps of the epoch, and each checkpoint implicitly included not only the model but also the entire training dataset. The behavior was corrected by changing SequentialDataset and the callbacks that use it. As a result, SequentialDataset objects created with older versions can no longer be used; no other interface changes were required.

Deprecations

A deprecation warning has been added for saving splitters and SequenceTokenizer with pickle. This functionality will be removed in a future version.
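For illustration, a pickle-free round trip might look like the sketch below. The save()/load() method names and the choice of RatioSplitter are assumptions; consult the documentation of your RePlay version for the exact interface.

```python
# A minimal sketch of pickle-free persistence, assuming save()/load() methods
# on splitters; the exact interface may differ in your RePlay version.
from replay.splitters import RatioSplitter

splitter = RatioSplitter(test_size=0.2)

# The old pickle-based approach now emits a deprecation warning:
#   import pickle
#   with open("splitter.pkl", "wb") as f:
#       pickle.dump(splitter, f)

# Assumed pickle-free alternative:
splitter.save("./artifacts/splitter")
restored = RatioSplitter.load("./artifacts/splitter")
```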

New Features

A new strategy in the LabelEncoder

The drop strategy has been added. It drops tokens from the dataset that were not present at the training stage. If all rows end up being removed, a warning is raised.
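A minimal sketch of the new strategy, assuming it is selected through the handle_unknown argument of LabelEncodingRule (the parameter name may differ in your version):

```python
import pandas as pd
from replay.preprocessing import LabelEncoder, LabelEncodingRule

train = pd.DataFrame({"item_id": ["a", "b", "c"]})
test = pd.DataFrame({"item_id": ["b", "c", "d"]})  # "d" was not seen during fit

# Assumption: the "drop" strategy is chosen via handle_unknown.
encoder = LabelEncoder([LabelEncodingRule("item_id", handle_unknown="drop")])
encoder.fit(train)

# Rows whose tokens were absent at the training stage are dropped;
# if every row is dropped, a warning is emitted.
encoded_test = encoder.transform(test)  # keeps only "b" and "c"
```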

New Linters

We keep up with current practice in code quality control, so the set of linters used to check the code has been updated: Pylint and PyCodestyle have been removed, and Ruff, Black, and toml-sort have been added.

Improvements

PyArrow dependency

The PyArrow dependency has been relaxed: RePlay now works with any version greater than 12.0.1.

Bug fixes

Performance fixes at the partial_fit stage in LabelEncoder

The slowdown occurred when using Pandas DataFrames: the partial_fit stage had quadratic running time. The bug has been fixed, and the running time now grows linearly with the size of the dataset.
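For context, a typical incremental-encoding loop that exercises this stage might look like the sketch below; the file name and column are illustrative.

```python
import pandas as pd
from replay.preprocessing import LabelEncoder, LabelEncodingRule

encoder = LabelEncoder([LabelEncodingRule("item_id")])

# Feeding the encoder chunk by chunk previously scaled quadratically with the
# amount of already-seen data on Pandas; after the fix it scales linearly.
for chunk in pd.read_csv("interactions.csv", chunksize=100_000):
    encoder.partial_fit(chunk)

encoded = encoder.transform(pd.read_csv("interactions.csv"))
```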

Timestamp tokenization when using SasRec

Fixed an error that occurred when training the SasRec transformer with the ti_modification=True parameter.

Loading a checkpoint with a modified embedding in the transformers

The error occurred when a model whose embedding dimensions had been changed was loaded on a different device. The example of working with embeddings in transformers has been updated.
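A hedged sketch of the scenario this fix covers. Bert4Rec and set_item_embeddings_by_size are taken from these release notes; the import path, checkpoint name, and device are illustrative assumptions.

```python
import torch
from replay.models.nn.sequential import Bert4Rec  # import path may differ

# A checkpoint written after the item-embedding table was enlarged, e.g. with
# set_item_embeddings_by_size / set_item_embeddings_by_tensor, on a GPU machine.
ckpt_path = "bert4rec_resized_embeddings.ckpt"

# Restoring such a checkpoint on a different device (here, CPU) previously
# raised an error; after the fix it loads as expected.
model = Bert4Rec.load_from_checkpoint(ckpt_path, map_location=torch.device("cpu"))
```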

v0.16.0

20 Mar 10:01
  • Introduced support for dataframes from the polars package. It is available in the following modules: data (Dataset, SequenceTokenizer, SequentialDataset) for working with transformers, metrics, preprocessing, and splitters. The new format can speed up calculations several-fold compared to Pandas and PySpark dataframes. See the examples for usage details.
  • Removed dependencies on seaborn and matplotlib. Removed functions replay.utils.distributions.plot_item_dist and replay.utils.distributions.plot_user_dist.
  • Added functions to get and set embeddings in transformers - get_all_embeddings, set_item_embeddings_by_size, set_item_embeddings_by_tensor, append_item_embeddings. You can see more details about their use in the examples.
  • Added a QueryEmbeddingsPredictionCallback to get query embeddings at the inference stage in transformers. You can see more details about usage in the examples.
  • Added support for numerical features in SequenceTokenizer and TorchSequentialDataset, making it possible to use numerical features inside transformers.
  • Automatic padding is now supported at the inference stage of transformer-based models in single-user mode.
  • Added a new KL-UCB model based on https://arxiv.org/pdf/1102.2490.pdf.
  • Added a callback that calculates cardinality in TensorSchema. It is no longer necessary to pass the cardinality parameter; the value is calculated automatically.
  • Added the core_count parameter to replay.utils.session_handler.get_spark_session. If it is not specified, the REPLAY_SPARK_CORE_COUNT and REPLAY_SPARK_MEMORY environment variables are taken into account; if they are not set either, the value defaults to -1. See the sketch after this list.
  • Corrected the behavior of the item_count parameter in ValidationMetricsCallback. If you are not going to calculate the Coverage metric, then you do not need to pass this parameter.
  • The calculation of the Coverage metric on Pandas and PySpark has been aligned.
  • Removed conversion from PySpark to Pandas in some models. Added the allow_collect_to_master parameter, False by default.
  • 100% test coverage has been achieved.
  • Fixed handling of undetectable types during fit in LabelEncoder. The problem occurred when using multiple tuples with null values.
  • Changes in the experimental part:
    • Python 3.10 is supported
    • Interface updates due to the d3rlpy version update
    • Added DecisionTransformer
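
As mentioned in the core_count item above, a small sketch of the new parameter's behaviour (the values are illustrative):

```python
from replay.utils.session_handler import get_spark_session

# Explicit core count:
spark = get_spark_session(core_count=4)

# With no core_count given, the REPLAY_SPARK_CORE_COUNT and REPLAY_SPARK_MEMORY
# environment variables are consulted; if they are unset, the value defaults to -1.
spark = get_spark_session()
```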

v0.15.0

30 Nov 13:43
  • The naming of the Bert4Rec and SasRec interfaces was aligned with each other
  • Minor naming changes in sasrec_example

v0.14.0

24 Nov 16:50
  • Introduced support for various hardware configurations including CPU, GPU, Multi-GPU and Clusters (based on PySpark)
  • Part of the library was moved to the experimental submodule for further stabilization and productization
  • Preprocessing, splitters, and metrics now support Pandas
  • Introduced 2 SOTA models: BERT4Rec and SASRec transformers with online and offline inference

Let's start a new chapter of RePlay! 🚀🚀🚀