Preparing the Data

Download and validate the data.

wget http://boston.lti.cs.cmu.edu/appendices/arXiv2019-DeepCT-Zhuyun-Dai/weighted_documents/corrected_org_collection_berttoken.zip -O data/raw/original-u.zip

wget http://boston.lti.cs.cmu.edu/appendices/arXiv2019-DeepCT-Zhuyun-Dai/weighted_documents/corrected_sample_100_keepall_jsonl.zip -O data/raw/deepct-u.zip

wget http://boston.lti.cs.cmu.edu/appendices/arXiv2019-DeepCT-Zhuyun-Dai/weighted_documents/sample_100_jsonl.zip -O data/raw/deepct-p.zip

md5sum data/raw/original-u.zip
dd98a257270feb0ed34ae69606be7c98  data/raw/original-u.zip

md5sum data/raw/deepct-u.zip
261e7e98e7c0162bbe9eb6fd232a02f9  data/raw/deepct-u.zip

md5sum data/raw/deepct-p.zip
6d44db3b576f3ab4ed09990be490defd  data/raw/deepct-p.zip
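Instead of comparing the digests by eye, the check can be scripted. The following is a minimal sketch, assuming GNU `md5sum` (which accepts a checksum list on standard input via `-c -`); the helper name `check_md5` is hypothetical and not part of this repository. The digests and paths are the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper: verify that file $2 has the MD5 digest $1.
# md5sum -c expects lines of the form "<digest>  <path>" (two spaces).
check_md5() {
    printf '%s  %s\n' "$1" "$2" | md5sum -c -
}

# Check every archive; remember whether any of them failed.
status=0
check_md5 dd98a257270feb0ed34ae69606be7c98 data/raw/original-u.zip || status=1
check_md5 261e7e98e7c0162bbe9eb6fd232a02f9 data/raw/deepct-u.zip   || status=1
check_md5 6d44db3b576f3ab4ed09990be490defd data/raw/deepct-p.zip   || status=1

if [ "$status" -eq 0 ]; then
    echo "all archives verified"
else
    echo "at least one archive is missing or corrupt; download it again"
fi
```

`md5sum -c` prints `OK` or `FAILED` per file and returns a non-zero status on any mismatch, so the same helper could be reused for the unpacked `.json` files in the later steps.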

Unpack and prepare the data.

cd data/raw/

unzip original-u.zip
cat org_collection_berttoken/1.json org_collection_berttoken/2.json > original-u/original-u.json

unzip deepct-u.zip
cat sample_100_keepall_jsonl_new/1.json sample_100_keepall_jsonl_new/2.json > deepct-u/deepct-u.json

unzip deepct-p.zip
cat sample_100_jsonl/docs00.json sample_100_jsonl/docs01.json sample_100_jsonl/docs02.json sample_100_jsonl/docs03.json sample_100_jsonl/docs04.json sample_100_jsonl/docs05.json sample_100_jsonl/docs06.json sample_100_jsonl/docs07.json sample_100_jsonl/docs08.json > deepct-p/deepct-p.json

md5sum original-u/original-u.json
2a9d315da238f947d2eb4db8bbbf058e  original-u/original-u.json

md5sum deepct-u/deepct-u.json
16ad6f86d99dd0b822950a6731513213  deepct-u/deepct-u.json

md5sum deepct-p/deepct-p.json
0ad6965434d72ead8256bfdec25fa65e  deepct-p/deepct-p.json
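The long `cat` command for `deepct-p` above can be written more compactly with a shell glob, which expands in sorted order and therefore should produce the same byte sequence as listing `docs00.json` through `docs08.json` explicitly. This is a sketch only: the helper name `concat_parts` is hypothetical, and creating the target directory with `mkdir -p` is an assumption in case it does not exist yet.

```shell
#!/bin/sh
# Hypothetical helper: concatenate the given input files, in argument
# order, into the output file named by the first argument.
concat_parts() {
    out=$1
    shift
    # Assumption: the target directory may not exist yet, so create it.
    mkdir -p "$(dirname "$out")"
    cat "$@" > "$out"
}

# Only run when the unpacked archive is present in the current directory.
if [ -d sample_100_jsonl ]; then
    # docs0[0-8].json expands to docs00.json ... docs08.json, sorted.
    concat_parts deepct-p/deepct-p.json sample_100_jsonl/docs0[0-8].json
fi
```

Running `md5sum deepct-p/deepct-p.json` afterwards and comparing against the digest listed above confirms that the glob produced the same file as the explicit command.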

Now we can generate the original-pruned index from the original-unpruned and deepct-pruned collections. This takes around 10 to 15 minutes on a modern server.

python3 ../../tools/prune-original.py deepct-p/deepct-p.json original-u/original-u.json original-p/original-p.json

md5sum original-p/original-p.json
b856c62ebc10fae5c9a1b5eea8f0c27e  original-p/original-p.json

cd ../../