Simplify Quick Start #112

wangkuiyi · 2016-09-23T23:41:00Z

Fixes #111

Motivation

What prompted this PR is that running preprocess.sh took me more than 12 hours on a VM on my MacBook Pro. I checked with Yi Yang, who can run it in a few minutes on his powerful CPU & GPU desktop. But a Quick Start should be quick and easy to start, so that potential users can realistically feel the convenience that Paddle brings. Hence this PR.

Comparison

Most of the time is spent because the current approach downloads the full Amazon Reviews dataset, which is ~500MB gzipped and ~1.5GB unzipped. Processing the whole dataset also takes a long time. So the primary goal of this PR is to download only part of the dataset. Compared with the existing approach:

- this PR replaces data/get_data.sh, preprocess.py and preprocess.sh, which add up to ~300 lines of code, with a single ~100-line Python script, process_data.py;
- after a short discussion with @emailweixu, we decided to replace the Moses word segmenter with space-delimited word segmentation, so there is no need to download the Moses segmenter;
- process_data.py can read directly from the HTTP server that hosts the data, or from a local copy of the data. In either case, it reads only until the required number of instances has been scanned, which saves it from reading the whole dataset;
- the new script doesn't use shuf, which exists on Linux but not on Mac OS X, so it works on both Linux and Mac OS X.
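
To illustrate the "read until enough instances are scanned" idea, here is a minimal Python 3 sketch. It is not the code in this PR: the function names `open_reviews` and `first_reviews` are hypothetical, and it assumes the gzipped file holds one JSON object per line.

```python
import gzip
import itertools
import json
import urllib.request


def open_reviews(source):
    """Open a gzipped review file from an http(s) URL or a local path."""
    if source.startswith(("http://", "https://")):
        # The HTTP response is file-like, and GzipFile decompresses it
        # incrementally, so we only fetch roughly as much as we read.
        return gzip.GzipFile(fileobj=urllib.request.urlopen(source))
    return gzip.open(source, "rb")


def first_reviews(source, limit):
    """Yield at most `limit` parsed JSON objects, then stop reading."""
    with open_reviews(source) as stream:
        # islice stops pulling lines once `limit` of them were consumed.
        for line in itertools.islice(stream, limit):
            yield json.loads(line)
```

Because both branches return a line-iterable binary stream, the rest of the script would not need to know whether the data came from the network or from disk.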

Usage

If we get this PR merged, the initialization steps described in the Quick Start guide would change from

cd demo/quick_start
./data/get_data.sh
./preprocess.sh

into

cd demo/quick_start
python ./process_data.py

Details

The ./process_data.py command above reads JSON objects directly from the default URL http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz until it can generate {train,test,pred}.txt, which together contain 100 instances, the default dataset size.
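
As a rough sketch of how the collected instances could be shuffled and divided into the three files without relying on shuf: the function name `split_instances`, the test fraction, and the number of prediction instances below are all assumptions made for illustration, not values taken from this PR.

```python
import random


def split_instances(instances, test_fraction=0.1, pred_count=1, seed=0):
    """Shuffle instances in-process and split them into train/test/pred."""
    shuffled = list(instances)
    # random.shuffle works the same on Linux and Mac OS X, unlike `shuf`.
    random.Random(seed).shuffle(shuffled)
    pred = shuffled[:pred_count]
    rest = shuffled[pred_count:]
    n_test = int(len(rest) * test_fraction)
    return {"train": rest[n_test:], "test": rest[:n_test], "pred": pred}
```

Every instance lands in exactly one of the three parts, so the parts always add up to the requested total.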

If we are going to generate a bigger dataset, say 1000 instances in total, we can run

python ./process_data.py -n 1000

Or, if we already downloaded the reviews_Electronics_5.json.gz file, we can run

python ./process_data.py ~/Download/reviews_Electronics_5.json.gz

An additional command line parameter, -t, caps the size of the dictionary. If we want to generate a 1000-instance dataset while limiting the dictionary size to 1999, we can run

python ./process_data.py -n 1000 -t 1999 ~/Download/reviews_Electronics_5.json.gz
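
The command line described above could be wired up roughly as follows. This is a hedged sketch, not the script in this PR: only the -n and -t flags, the optional input file, the default URL, and the default of 100 instances come from the description; the names `parse_args`, `build_dictionary`, `num_instances`, and `dict_size`, and the "no cap" default for -t, are assumptions.

```python
import argparse
from collections import Counter

DEFAULT_URL = ("http://snap.stanford.edu/data/amazon/productGraph/"
               "categoryFiles/reviews_Electronics_5.json.gz")


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Prepare Quick Start data.")
    # Optional positional argument: a local .json.gz file or a URL.
    parser.add_argument("source", nargs="?", default=DEFAULT_URL)
    parser.add_argument("-n", dest="num_instances", type=int, default=100,
                        help="total number of instances to generate")
    # Assumption: without -t the dictionary is not capped.
    parser.add_argument("-t", dest="dict_size", type=int, default=None,
                        help="maximum number of words in the dictionary")
    return parser.parse_args(argv)


def build_dictionary(texts, max_size=None):
    """Return the most frequent space-delimited words, at most max_size."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())  # space-delimited segmentation
    return [word for word, _ in counts.most_common(max_size)]
```

Splitting on whitespace keeps punctuation attached to words (for example "good." and "good" count as different entries), which is the trade-off of dropping the Moses segmenter.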

qingqing01 · 2016-09-24T04:43:39Z

demo/quick_start/process_data.py

+                written = written + 1
+            elif rate < 3.0:
+                o.write('0\t%s\n' % text)
+                written = written + 1


1. In the original preprocessing, we want the ratio of positive samples to negative samples to be 1:1.

2. There are duplicated samples in reviews_Electronics_5.json.gz. They need to be removed so that the train set and the test set are distinct.

3. The Moses tools are used to tokenize words and punctuation. If we don't care about punctuation, it is OK to go without Moses.

In fact, there is data already preprocessed by other people: http://riejohnson.com/cnn_download.html#sup-paper
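
The first two points (a 1:1 class ratio and no duplicates) could be handled while streaming, roughly as sketched below. This is only an illustration: `label_of` and `balanced_unique` are hypothetical names, and treating ratings above 3.0 as positive is an assumption (the diff above only shows that ratings below 3.0 are written with label 0).

```python
def label_of(rating):
    """Map a review rating to a label; neutral ratings return None."""
    if rating > 3.0:      # assumption: the positive branch is not shown above
        return 1
    if rating < 3.0:      # matches the `elif rate < 3.0` branch in the diff
        return 0
    return None


def balanced_unique(samples, per_class):
    """Keep up to `per_class` distinct texts for each of the labels 0 and 1.

    `samples` is an iterable of (label, text) pairs; scanning stops as
    soon as both classes are full, so a stream is only read as far as needed.
    """
    seen = set()
    kept = {0: [], 1: []}
    for label, text in samples:
        if text in seen:
            continue  # drop duplicated reviews
        seen.add(text)
        if len(kept[label]) < per_class:
            kept[label].append(text)
        if all(len(texts) == per_class for texts in kept.values()):
            break
    return kept
```

For example, with `per_class=2` the scan keeps the first two distinct positive and the first two distinct negative texts and ignores everything after both classes are full.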

wangkuiyi · 2016-09-26

Thanks for the comments!

Following the link you provided, I found this preprocessed dataset: http://jmcauley.ucsd.edu/data/amazon/ . I am checking whether it meets requirements 1 to 3 from your comment above. If I can train a model on this data and the model passes testing, I will come back here and update this PR.


Yi Wang added 8 commits September 23, 2016 14:31

Rewrite preprocess.py into yi.py, a temporary file.

59fb5d5

Polish yi.py

4542450

Add outputDir to yi.py

d488a3d

Make yi.py able to generate dict.txt

6c0bdf6

Make sure outputs N instances

327b8b9

Generate pred.{txt,list} and all other .list files

a62851e

Make yi.py creates the output dir

39b1d25

Add process_data.py and remove no-longer necessary files.

067cf19

emailweixu assigned qingqing01 Sep 24, 2016

qingqing01 requested changes Sep 24, 2016


reyoung changed the base branch from master to develop October 26, 2016 10:25

qingqing01 closed this Dec 12, 2016

