add first working version

mxmlnkn · Feb 20, 2019 · 50e8dbb
1 parent 228382a
commit 50e8dbb
Showing 2 changed files with 403 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -1,2 +1,74 @@
-# ratarmount
-Random Access Read-Only Tar Mount
+# Random Access Read-Only Tar Mount (Ratarmount)
+
+Ratarmount combines the random access indexing idea from [tarindexer](https://github.com/devsnd/tarindexer) with a FUSE mount via [fusepy](https://github.com/fusepy/fusepy), giving easy read-only access just like [archivemount](https://github.com/cybernoid/archivemount/). It also mounts TARs nested inside TARs recursively into folders of the same name, which is useful for the ImageNet data set.
+
+# Requirements
+
+ - Python3
+ - fusepy
+
+E.g. on Debian-like systems these can be installed with:
+
+```bash
+sudo apt-get update
+sudo apt-get install python3
+pip3 install --user fusepy
+```
+
+# Usage
+
+ python3 ratarmount.py <path to tar> [<mount path>]
+
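+Index files are loaded from the first of the following locations where one already exists. Otherwise, a new index is created at the first location where that is possible. The locations are tried in this order: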
+
+ - `<path to tar>.index.pickle`
+ - `~/.tarmount/<path to tar: '/' -> '_'>.index.pickle`
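The lookup order above can be sketched as follows. This is a minimal illustration, not the code from `ratarmount.py`: the function name `candidate_index_paths`, the `fallback_dir` parameter, and the use of the absolute TAR path for the flattened name are assumptions made for this example.

```python
import os


def candidate_index_paths(tar_path, fallback_dir="~/.tarmount"):
    """Return the index file locations to try, in order, for a TAR path.

    Hypothetical helper: the name and the use of the absolute path are
    assumptions for illustration only.
    """
    tar_path = os.path.abspath(tar_path)
    # 1. Next to the archive itself: <path to tar>.index.pickle
    beside_tar = tar_path + ".index.pickle"
    # 2. In the fallback folder, with every '/' in the path replaced by '_'
    flattened = tar_path.replace("/", "_") + ".index.pickle"
    in_fallback = os.path.join(os.path.expanduser(fallback_dir), flattened)
    return [beside_tar, in_fallback]
```

A caller would presumably load the first path that exists and otherwise write the new index to the first path it can create.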
+
+# The Problem
+
+You downloaded a large TAR file from the internet, for example the [1.31TB](http://academictorrents.com/details/564a77c1e1119da199ff32622a1609431b9f1c47) [ImageNet](http://image-net.org/) data set, and you now want to use it but lack the space, the time, or a file system fast enough to extract all 14.2 million image files.
+
+## Partial Solutions
+
+### Archivemount
+
+[Archivemount](https://github.com/cybernoid/archivemount/) does not seem to support random access in version 0.8.7, and mounting also seems to have performance issues:
+
+ - After mounting the 6.5GB ImageNet Large-Scale Visual Recognition Challenge 2012 validation data set, reading a single file with `time cat mounted/ILSVRC2012_val_00049975.JPEG | wc -c` takes 250ms with archivemount and 2ms with ratarmount.
+ - An attempt to mount the 150GB [ILSVRC object localization data set](https://www.kaggle.com/c/imagenet-object-localization-challenge), which contains 2 million images, with archivemount was abandoned after 2 hours. Ratarmount takes 45min to create the index and <10s to load an existing index and mount the archive. In contrast, archivemount takes the same amount of time for every subsequent mount.
+
+### Tarindexer
+
+[Tarindexer](https://github.com/devsnd/tarindexer) is a command line tool written in Python which can create index files and then use an index file to quickly extract single files from the TAR. However, it also has some caveats which ratarmount tries to solve:
+
+ - It only works with single files, meaning it would be necessary to loop over the extract-call. But this would require loading the possibly quite large tar index file into memory each time. For example for ImageNet, the resulting index file is hundreds of MB large. Also, extracting directories will be a hassle.
+ - It's difficult to integrate tarindexer into other production environments. Ratarmount instead uses FUSE to mount the TAR as a folder readable by any other programs requiring access to the contained data.
+ - Can't handle TARs recursively. In order to extract files inside a TAR which itself is inside a TAR, the packed TAR first needs to be extracted.
+
+### TAR Browser
+
+I didn't find out about [TAR Browser](https://github.com/tomorrow-nf/tar-as-filesystem/) before I finished the ratarmount script. That's also one of its cons:
+
+ - Hard to find. I don't seem to be the only one who has trouble finding it, as it has zero stars on GitHub after 4 years compared to 29 stars for tarindexer after roughly the same amount of time.
+ - Hassle to set up. It needs compilation, and I gave up when I was instructed to set up a MySQL database for it to use. Also, the setup instructions are not on its GitHub page but [here](https://web.wpi.edu/Pubs/E-project/Available/E-project-030615-133259/unrestricted/TARBrowserFinal.pdf).
+ - Doesn't seem to support recursive TAR mounting. I didn't test it because of the MySQL dependency, but the code does not seem to have logic for recursive mounting.
+
+Pros:
+ - supports bz2- and xz-compressed TAR archives
+
+## The Solution
+
+Ratarmount creates an index file with file names, ownership, permission flags, and offset information to be stored at the TAR file's location or inside `~/.ratarmount/` and then offers a FUSE mount integration for easy access to the files.
+
+The test for the ImageNet data set is promising:
+
+ - TAR size: 1.31TB
+ - Contains TARs: yes
+ - Files in TAR: ~26 000
+ - Files in TAR (including recursively in contained TARs): 14.2 million
+ - Index creation (first mounting): 4 hours
+ - Index size: 1GB
+ - Index loading (subsequent mounting): 40s
+ - Reading a 40kB file: 100ms (first time) and 4ms (subsequent times)
+
+The read time for a small file simply verifies that random access via file seek works. The difference between the first read and subsequent reads is not caused by ratarmount but by operating system and file system caches.
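The index-plus-seek idea described above can be sketched with Python's standard `tarfile` module. This is an illustrative sketch only, not the implementation in `ratarmount.py`: the helper names `build_index` and `read_file` are hypothetical, the index here stores only offsets and sizes (no ownership or permission flags), and only uncompressed TARs are handled.

```python
import tarfile


def build_index(tar_path):
    """Map each regular file's name to (data offset, size) inside the TAR.

    Hypothetical helper for illustration; a full index would also keep
    ownership and permission flags.
    """
    index = {}
    # Mode "r:" opens only uncompressed archives, where byte offsets
    # into the file on disk are meaningful.
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                # offset_data is where the member's contents start.
                index[member.name] = (member.offset_data, member.size)
    return index


def read_file(tar_path, index, name):
    """Read one member by seeking straight to its data, without scanning."""
    offset, size = index[name]
    with open(tar_path, "rb") as archive:
        archive.seek(offset)
        return archive.read(size)
```

Once the index exists, `read_file` touches only the requested byte range, which is why the read time for a small file does not depend on the size of the archive.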