【paddle.distributed.fleet】add data_generator in distributed.fleet.dataset #27345

Merged
merged 21 commits into from
Sep 28, 2020

@yaoxuefeng6 yaoxuefeng6 commented Sep 16, 2020

PR types

Others

PR changes

Others

Describe

Add data_generator to paddle.distributed.fleet.dataset so that the dataset classes can be illustrated properly.
Based on the review comments on PR #27133 (review):
1. Mark the new dataset API docs as static-graph only.
2. Mark the dataset API in fluid as deprecated.
3. Update the example code to use the 2.0 APIs.

The MyDataGenerator class is based on the MultiSlotDataGenerator class.
my_data_generator.py:

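import paddle
import paddle.distributed.fleet as fleet


class MyDataGenerator(fleet.MultiSlotDataGenerator):
    def generate_sample(self, line):
        # Called for each input line; returns a function that yields samples.
        def data_iter():
            for i in range(10000):
                # Each sample is a sequence of (slot_name, values) pairs.
                yield ("words", [1, 2, 3, 4]), ("label", [0])

        return data_iter


if __name__ == "__main__":
    # Read raw lines from stdin and write the generated samples to stdout.
    d = MyDataGenerator()
    d.run_from_stdin()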
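
For orientation, here is a minimal pure-Python sketch of the line-oriented text that a multi-slot generator like the one above is expected to write to stdout: for every slot, the number of values followed by the values themselves. This follows the format described in the MultiSlotDataGenerator documentation as far as I understand it. `serialize_sample` is a hypothetical helper written for illustration only; it is not Paddle's implementation and does not import paddle.

```python
from typing import Iterable, List, Sequence, Tuple

# One slot is a (name, values) pair, e.g. ("words", [1, 2, 3, 4]).
Slot = Tuple[str, Sequence[int]]


def serialize_sample(sample: Iterable[Slot]) -> str:
    """Turn one sample into a single text line: "<count> <v1> ... <count> <v1> ...".

    Hypothetical stand-in for what a multi-slot generator prints per sample.
    Slot names are not written out; only counts and values are.
    """
    tokens: List[str] = []
    for name, values in sample:
        # Assumption: an empty slot is treated as an error.
        if len(values) == 0:
            raise ValueError(f"slot {name!r} has no values")
        tokens.append(str(len(values)))
        tokens.extend(str(v) for v in values)
    return " ".join(tokens) + "\n"


if __name__ == "__main__":
    # The sample yielded by MyDataGenerator.generate_sample above.
    print(serialize_sample([("words", [1, 2, 3, 4]), ("label", [0])]), end="")
```

Since slot names do not appear in the line, the reader of this text has to know the slot order in advance.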

Demo: using the data generator with paddle.distributed.InMemoryDataset / QueueDataset

import paddle
paddle.enable_static()

slots = ["slot1", "slot2", "slot3", "slot4"]
slots_vars = []
for slot in slots:
    var = paddle.static.data(name=slot, shape=[1], dtype="int64", lod_level=1)
    slots_vars.append(var)

# Create the dataset instance directly with paddle.distributed.InMemoryDataset.
dataset = paddle.distributed.InMemoryDataset()
# Call init() once to configure the single-node settings.
# pipe_command runs my_data_generator.py to preprocess the input.
dataset.init(
    batch_size=32,
    thread_num=3,
    pipe_command="python my_data_generator.py",
    use_var=slots_vars)
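
As far as I understand the multi-slot text format, the dataset splits each line produced by pipe_command back into slots in the order given by use_var, so the generator has to emit one entry per variable, in that order. Note that my_data_generator.py yields two slots ("words", "label") while this demo declares four variables; in a real run the two would need to agree. The sketch below is a hypothetical pure-Python parser (`parse_multislot_line` is not a Paddle function) that shows the idea and what a mismatch would look like.

```python
from typing import Dict, List, Sequence


def parse_multislot_line(line: str, slot_names: Sequence[str]) -> Dict[str, List[int]]:
    """Split a "<count> <v1> ... <count> <v1> ..." line into named slots.

    Hypothetical helper for illustration; slot_names plays the role of use_var.
    """
    tokens = line.split()
    pos = 0
    sample: Dict[str, List[int]] = {}
    for name in slot_names:
        if pos >= len(tokens):
            raise ValueError(f"line ended before slot {name!r}")
        count = int(tokens[pos])
        values = [int(t) for t in tokens[pos + 1:pos + 1 + count]]
        if len(values) != count:
            raise ValueError(f"slot {name!r} expected {count} values, got {len(values)}")
        sample[name] = values
        pos += 1 + count
    if pos != len(tokens):
        raise ValueError("line has more slots than expected")
    return sample


if __name__ == "__main__":
    line = "4 1 2 3 4 1 0\n"
    # Two declared slots match the two slots in the line.
    print(parse_multislot_line(line, ["words", "label"]))
    # Four declared slots do not match and are rejected by this sketch.
    try:
        parse_multislot_line(line, ["slot1", "slot2", "slot3", "slot4"])
    except ValueError as err:
        print("mismatch:", err)
```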

@paddle-bot-old
Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@@ -21,7 +21,7 @@
 import random

 import paddle
-import paddle.fluid.incubate.data_generator as data_generator
+from paddle.distributed.fleet.dataset import data_generator as data_generator

Member

please move data_generator to paddle.distributed.fleet

@jzhang533 jzhang533 self-requested a review September 22, 2020 09:31
"""
DataGenerator is a general Base class for user to inherit
A user who wants to define his/her own python processing logic
with paddle.fluid.dataset should inherit this class.
Contributor

This still says paddle.fluid.dataset.

@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Sep 23, 2020
@PaddlePaddle PaddlePaddle unlocked this conversation Sep 23, 2020
@yaoxuefeng6 yaoxuefeng6 changed the title add data_generator in distributed.fleet.dataset 【paddle.fleet】add data_generator in distributed.fleet.dataset Sep 23, 2020
@yaoxuefeng6 yaoxuefeng6 changed the title 【paddle.fleet】add data_generator in distributed.fleet.dataset 【paddle.distributed.fleet】add data_generator in distributed.fleet.dataset Sep 24, 2020
Member

@guru4elephant guru4elephant left a comment

LGTM

guru4elephant previously approved these changes Sep 24, 2020
Member

@guru4elephant guru4elephant left a comment

LGTM

Contributor

@luotao1 luotao1 left a comment

2020-09-25 11:48:34 ****************
2020-09-25 11:48:34 0. You must have one RD (XiaoguangHu01,Xreki,luotao1) approval for python/paddle/distributed/__init, which manages the underlying code for fluid.

@guru4elephant should be added as an approver for this file.

Contributor

@jzhang533 jzhang533 left a comment

lgtm

@yaoxuefeng6 yaoxuefeng6 merged commit 7801405 into PaddlePaddle:develop Sep 28, 2020
4 participants