[split_dataset] migrating from tf.keras to keras_core #505

Merged: 16 commits merged into keras-team:main on Jul 28, 2023

Conversation

asingh9530 (Contributor)

Hi Team,

This PR makes the following changes:

  • Moves the Keras data-utils code for split_dataset from here, making sure all existing behavior stays intact.
  • Rewrites the following functions, since all of them accept a tf.data.Dataset and we need to support torch and JAX (a brief usage sketch follows this list):
    • split_dataset
    • _convert_dataset_to_list
    • _get_data_iterator_from_dataset
    • _get_next_sample
    • _get_type_spec
    • _rescale_dataset_split_sizes
    • _restore_dataset_from_list
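
For context, here is a minimal usage sketch of the migrated utility, assuming it ends up exposed as `keras_core.utils.split_dataset` with the same signature as `tf.keras.utils.split_dataset` (illustrative only, not part of the diff):

```python
import numpy as np

# Assumed import path after this PR; mirrors tf.keras.utils.split_dataset.
from keras_core.utils import split_dataset

features = np.random.random(size=(1000, 4))
labels = np.random.randint(0, 2, size=(1000, 1))

# left_size=0.8 puts 80% of the samples in the left split; the rest go right.
left_ds, right_ds = split_dataset((features, labels), left_size=0.8)

# Both returned objects are tf.data.Dataset instances.
print(left_ds.cardinality().numpy())   # 800
print(right_ds.cardinality().numpy())  # 200
```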

google-cla bot commented Jul 16, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

fchollet (Member) left a comment

Thanks for the PR! Please also bring in the unit tests and add tests for torch datasets.

QQ: rather than special-casing torch datasets, could we just support any object that implements `__len__` and `__getitem__`?
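
(A hypothetical sketch of that duck-typing idea, not code from the PR: treat anything exposing `__len__` and `__getitem__` as an indexable dataset instead of special-casing `torch.utils.data.Dataset`.)

```python
def _is_indexable_dataset(obj):
    # Hypothetical helper: duck-type rather than checking for torch classes.
    return hasattr(obj, "__len__") and hasattr(obj, "__getitem__")


def _iterate_indexable_dataset(obj):
    # Works for lists, numpy arrays, and torch map-style datasets alike.
    for i in range(len(obj)):
        yield obj[i]
```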

@@ -1,3 +1,11 @@
import tensorflow as tf
import torch
from torch.utils.data import Dataset as torchDataset
fchollet (Member)

Only import torch, access Dataset from there

asingh9530 (Contributor Author) commented Jul 16, 2023

@fchollet sure, but I'd argue that using this specific import is more error-proof when importing.

asingh9530 (Contributor Author)

Added changes for this.


Args:
dataset: A `tf.data.Dataset` object, or a list/tuple of arrays with the
same length.
same length.
fchollet (Member)

Use 4 space indent.

asingh9530 (Contributor Author)

@fchollet yes, will apply black in a later stage.

asingh9530 (Contributor Author)

Done

@@ -30,20 +36,440 @@ def split_dataset(
Example:

>>> data = np.random.random(size=(1000, 4))
>>> left_ds, right_ds = split_dataset(data, left_size=0.8)
>>> left_ds, right_ds = tf.keras.utils.split_dataset(data, left_size=0.8)
fchollet (Member)

No TF references

asingh9530 (Contributor Author)

@fchollet will add a separate commit for the docstring.

asingh9530 (Contributor Author)

Done

start_time,
):
if dataset_type_spec in [tuple, list]:
# The try-except here is for NumPy 1.24 compatibility, see:
fchollet (Member)

Is this actually needed?

asingh9530 (Contributor Author)

@fchollet yes, since we allow the data spec to be a tuple this will be required; not sure about list, will test it out.

asingh9530 (Contributor Author)

> Thanks for the PR! Please also bring in the unit tests and add tests for torch datasets.
>
> QQ: rather than special-casing torch datasets, could we just support any object that implements `__len__` and `__getitem__`?

@fchollet hmm 🤔 need to think about it; will add further details to this comment.

asingh9530 changed the title from "added keras utils -> keras-core utils" to "[split_dataset] migrating from tf.keras to keras_core" on Jul 17, 2023
@@ -1,3 +1,11 @@
import tensorflow as tf
fchollet (Member)

Another thing -- we should not import tf or torch at the top of the file since that would make them required dependencies. They should be imported when needed (e.g. only import torch when you need to process a torch dataset).
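
(A hedged sketch of the deferred-import pattern being asked for; the helper name is hypothetical and not taken from the PR.)

```python
def _torch_dataset_to_list(dataset):
    # Deferred import: torch is only required when a torch dataset is actually
    # passed, so it never becomes a hard dependency of the package.
    from torch.utils.data import Dataset as TorchDataset

    if not isinstance(dataset, TorchDataset):
        raise ValueError(
            f"Expected a `torch.utils.data.Dataset`, received: {type(dataset)}"
        )
    return [dataset[i] for i in range(len(dataset))]
```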

asingh9530 (Contributor Author)

umm ok, but since they are being used in multiple functions, importing them where needed would mean having multiple import statements.

asingh9530 (Contributor Author)

Done for torch; also importing tf from module_utils now.

@@ -1,3 +1,8 @@
import torch
fchollet (Member)

This should not be imported here. Torch is not a dependency of the package.

asingh9530 (Contributor Author)

Done

If integer, it signifies the number of samples to pack
in the left dataset. If `None`, it defaults to the complement
to `right_size`.
the fraction of the data to pack in the left dataset. If integer, it
fchollet (Member)

Use 4 space indent

asingh9530 (Contributor Author)

Done

fchollet (Member)

The indent is still 2 spaces.

You shouldn't actually need to modify the docstring at all, mind you. The original docstring was fine.


class DatasetUtilsTest(test_case.TestCase):
def test_split_dataset_list(self):
n_sample, n_cols, n_pred, left_size, right_size = 100, 2, 1, 0.2, 0.8
fchollet (Member)

Make sure to run sh shell/format.sh and keep lines under 80 chars

asingh9530 (Contributor Author)

Done

fchollet (Member) left a comment

Thanks for the update!

If integer, it signifies the number of samples to pack
in the left dataset. If `None`, it defaults to the complement
to `right_size`.
the fraction of the data to pack in the left dataset. If integer, it
fchollet (Member)

The indent is still 2 spaces.

You shouldn't actually need to modify the docstring at all, mind you. The original docstring was fine.

right_size=right_size,
shuffle=shuffle,
seed=seed,
from torch.utils.data import Dataset as torchDataset
fchollet (Member)

  • Do not import torch unless the dataset passed is a torch dataset.
  • Just import torch then access torch.utils.data.Dataset.

To check whether it's a torch dataset, you can do something similar to this: https://github.com/keras-team/keras-core/blob/main/keras_core/trainers/epoch_iterator.py#L228
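
(One hedged way to perform that check without importing torch at module level, similar in spirit to the linked epoch_iterator code; the exact implementation may differ.)

```python
def _is_torch_dataset(dataset):
    # Inspect the class hierarchy by name so torch never has to be imported
    # just to answer the question.
    if hasattr(dataset, "__class__"):
        for parent in dataset.__class__.__mro__:
            if parent.__name__ == "Dataset" and str(parent.__module__).startswith(
                "torch.utils.data"
            ):
                return True
    return False
```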

asingh9530 (Contributor Author) commented Jul 23, 2023

  • @fchollet a nested import like torch.utils.data.Dataset is not working, as the module never gets added to globals and hence throws an import error; this is the same issue I mentioned here, and it is still reproducible.
  • Sure, will change the logic for how we detect a torch dataset.

right_split = right_split.prefetch(tf.data.AUTOTUNE)
return left_split, right_split

elif dataset_type_spec == torchDataset:
fchollet (Member)

On second thought, I don't think we should support torch datasets at all here, because the API is becoming completely inconsistent:

  • pass numpy arrays, get back tf.data.Dataset
  • pass tf.data.Dataset, get back tf.data.Dataset
  • pass torch dataset, get back torch dataset

Let's just stick to always returning a tf.data.Dataset IMO.

asingh9530 (Contributor Author) commented Jul 23, 2023

@fchollet but if that's the case, how can this function be used in a torch workflow? It would need to return a dataset that is compatible with torch.

asingh9530 (Contributor Author) commented Jul 23, 2023

@fchollet and since returning a torch dataset is only constrained to the torch backend and does not impact the others, it should be safe; even going forward with JAX we would need to add support for that too. What do you think? 🧐 Also, if API consistency is the issue, one possible solution is to only return numpy arrays and leave it to the user to convert as needed; that way we would support every framework, but I'm not sure how well that would work.

fchollet (Member)

tf.data.Datasets are supported by Keras models with all backends, so it's fine IMO. It's also the only way to stay backwards compatible with tf.keras, which is very important.
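
(An illustrative end-to-end sketch of this point, assuming the keras_core API of the time: a `tf.data.Dataset` produced by `split_dataset` feeds `fit()` under any backend, including torch.)

```python
import os

os.environ["KERAS_BACKEND"] = "torch"  # could equally be "jax" or "tensorflow"

import numpy as np
import keras_core as keras

x = np.random.random((1000, 4)).astype("float32")
y = np.random.randint(0, 2, size=(1000, 1)).astype("float32")

# split_dataset returns two unbatched tf.data.Dataset objects.
train_ds, val_ds = keras.utils.split_dataset((x, y), left_size=0.8)

model = keras.Sequential([keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(train_ds.batch(32), validation_data=val_ds.batch(32), epochs=1)
```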

asingh9530 (Contributor Author)

@fchollet aah got it, then let's only use tf.data.Datasets.

asingh9530 (Contributor Author)

@fchollet please review; the following changes have been made:

  • torch.utils.data.Dataset is only imported once.
  • the return type is limited to tf.data.Datasets.
  • fixed indentation.

@fchollet I need clarification from you: I have moved the logic for detecting a torch tensor and a torch dataset into is_torch_tensor and is_torch_dataset. Since is_torch_dataloader is under trainers, should I move these functions there as well, or keep them here?
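
(For reference, a hedged sketch of what an `is_torch_tensor` helper in the same class-name-inspection style could look like; the actual implementation in the PR may differ.)

```python
def is_torch_tensor(value):
    # Same MRO-inspection trick as the dataset check above, so torch is never
    # imported just to identify a tensor.
    if hasattr(value, "__class__"):
        for parent in value.__class__.__mro__:
            if parent.__name__ == "Tensor" and str(parent.__module__).endswith("torch"):
                return True
    return False
```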

asingh9530 (Contributor Author)

> The indent is still 2 spaces.
>
> You shouldn't actually need to modify the docstring at all, mind you. The original docstring was fine.

Apologies, I missed it; it is resolved now.

asingh9530 (Contributor Author)

@fchollet could you approve the test workflow? All the changes have been made from my end.

@asingh9530 asingh9530 requested a review from fchollet July 27, 2023 12:47
fchollet (Member) left a comment

Thanks for the updates! There are various docstring issues left but I'll take it from here.

@fchollet fchollet merged commit 8e96f94 into keras-team:main Jul 28, 2023
6 checks passed