Tiered datasets, a limit, or none? #4

danielamitay · 2012-11-14T21:20:23Z

The dataset from my initial push was actually filtered down (by way of # of reviews, and availability) from an original collection of >20,000 URL schemes.

Previously, this dataset needed to be retrieved from the server, and each additional scheme check takes ~1ms, so I did this filtering for the sake of bandwidth and speed. Seeing as the current dataset adds only ~180kB to a compiled app, perhaps a limit on the number of URL schemes is unnecessary?

As an extreme example, were we to collect 100,000 URL schemes, not only would the compressed file size jump to >1MB, but the detection process itself would take 7X longer to complete.
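A back-of-the-envelope check of the time cost, assuming the ~1ms per check quoted above holds and that checks run strictly one after another (both are simplifications, and `estimated_scan_seconds` is a hypothetical helper, not part of the library):

```python
def estimated_scan_seconds(scheme_count, ms_per_check=1.0):
    """Estimate total detection time, assuming sequential checks of equal cost."""
    return scheme_count * ms_per_check / 1000.0


# Dataset sizes mentioned in this thread.
for count in (20_000, 100_000):
    print(f"{count:>7} schemes -> ~{estimated_scan_seconds(count):.0f} s")
```

Under this linear model the scan time grows in direct proportion to the number of schemes, so any cap on the dataset is really a cap on how long a full scan may take.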

Thoughts?

Tags regarding dataset: @HBehrens @steipete

HBehrens · 2012-11-14T22:36:09Z

Two quick ideas regarding the size of the dataset:

Some apps implement several schemes (currently 12,297 schemes and 10,736 apps), so it might be feasible to leave out some schemes (saves space and time).
Since app ids are integers and URIs use a limited set of valid ASCII characters, there could be a more efficient packaging strategy.
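Both ideas can be sketched roughly as follows. This assumes the dataset is a mapping from scheme name to a list of integer app ids, which is a guess at the format, and the byte layout (length-prefixed ASCII names followed by varint-encoded ids) is only one possible packing, not anything the project defines. All function names and ids here are hypothetical.

```python
def prune_schemes(scheme_apps):
    """Drop schemes that only identify apps already covered by a kept scheme."""
    covered, kept = set(), {}
    # Prefer schemes that point at the fewest apps, since they are the most specific.
    for scheme in sorted(scheme_apps, key=lambda s: (len(scheme_apps[s]), s)):
        apps = set(scheme_apps[scheme])
        if not apps <= covered:
            kept[scheme] = sorted(apps)
            covered |= apps
    return kept


def encode_varint(n):
    """Encode a non-negative integer in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def pack(scheme_apps):
    """Serialize as: name length (1 byte), ASCII name, id count, ids (varints)."""
    out = bytearray()
    for scheme in sorted(scheme_apps):
        name = scheme.encode("ascii")
        out.append(len(name))  # assumes scheme names are shorter than 256 bytes
        out += name
        out += encode_varint(len(scheme_apps[scheme]))
        for app_id in scheme_apps[scheme]:
            out += encode_varint(app_id)
    return bytes(out)


def unpack(blob):
    """Inverse of pack()."""
    result, pos = {}, 0

    def read_varint():
        nonlocal pos
        value = shift = 0
        while True:
            byte = blob[pos]
            pos += 1
            value |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                return value

    while pos < len(blob):
        length = blob[pos]
        pos += 1
        scheme = blob[pos:pos + length].decode("ascii")
        pos += length
        count = read_varint()
        result[scheme] = [read_varint() for _ in range(count)]
    return result


# Made-up schemes and ids, for illustration only.
data = {"example": [1001], "example2": [1001], "other": [1002]}
pruned = prune_schemes(data)
assert unpack(pack(pruned)) == pruned
```

Note that pruning trades away the ability to tell app versions apart by their scheme sets, so it only makes sense when plain presence detection is the goal.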

I am not completely sure if scanning for the existence of any app is everybody's use case. It could be interesting to enrich the dataset by some attributes (e.g. genre) so one can decide which subset of the dataset to scan for. I could even think of an online tool to express rather complex scenarios like ("today's top 100 games in italy plus these 10 manually chosen apps") to produce a fixed set of app ids one could then download as binary data into the app to configure the next scanning operation.
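A minimal sketch of how such a subset could be computed offline, assuming each app record carries `id`, `genre`, and `rank` attributes (none of which exist in the current dataset; they are hypothetical, as are the function names):

```python
def select_app_ids(apps, genre=None, top_n=None, extra_ids=()):
    """Pick app ids by genre and rank, then add manually chosen ids."""
    chosen = [a for a in apps if genre is None or a["genre"] == genre]
    chosen.sort(key=lambda a: a["rank"])
    if top_n is not None:
        chosen = chosen[:top_n]
    return sorted({a["id"] for a in chosen} | set(extra_ids))


def schemes_for(scheme_apps, app_ids):
    """Reduce a scheme -> app ids mapping to the schemes needed for app_ids."""
    wanted = set(app_ids)
    return {
        scheme: [i for i in ids if i in wanted]
        for scheme, ids in scheme_apps.items()
        if wanted & set(ids)
    }


# Made-up records: "top 1 game plus one manually chosen app".
apps = [
    {"id": 1, "genre": "Games", "rank": 2},
    {"id": 2, "genre": "Games", "rank": 1},
    {"id": 3, "genre": "Travel", "rank": 1},
]
ids = select_app_ids(apps, genre="Games", top_n=1, extra_ids=[3])
print(ids)
print(schemes_for({"a": [1], "b": [2, 3], "c": [3]}, ids))
```

The resulting reduced mapping is what an app would bundle or download to configure its next scan.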

Anyway, it would be great if you could extend the dataset with your original schemes :) (where the heck did you get 20k URL schemes from?!)

danielamitay · 2012-11-14T22:47:21Z

I'll run through my full dataset and push it later.

Because I wanted my dataset to be accurate and without the possibility of 3rd party manipulation, I grabbed data directly from app IPAs. Basically, by downloading and analyzing a lot of apps (top apps). Old post on the subject: http://danielamitay.com/blog/2011/5/9/detailed-iphone-app-ipa-statistics

This of course means that my dataset is skewed in favor of free apps, but fortunately, free apps are downloaded vastly more often, and my dataset represents the most popular apps.

Again, as the above blog post mentions (however outdated), not every app implements URL schemes. As such, my 20k URL schemes represent ~60k apps.

HBehrens · 2012-11-14T22:52:27Z

I implemented two scripts earlier today to:

1. collect data from IPAs (incl. information about the version, since schemes change over time, so they can be used to determine the app's version)
2. merge and canonicalize your mapping format and the format from 1.
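The scripts themselves are not shown in this thread, so the following is only a sketch of what the merge step might look like. It assumes the existing mapping is scheme -> list of app ids and that the IPA scan emits records with `app_id`, `version`, and `schemes` fields; both formats and all names are assumptions. Scheme names are lowercased because URI schemes are case-insensitive.

```python
def canonical_scheme(raw):
    """Normalize a scheme name: trim, lowercase, drop any trailing '://'."""
    return raw.strip().lower().rstrip(":/")


def merge(mapping, ipa_records):
    """Combine both sources into one scheme -> sorted unique app ids mapping."""
    merged = {}
    for scheme, ids in mapping.items():
        merged.setdefault(canonical_scheme(scheme), set()).update(ids)
    for record in ipa_records:
        for scheme in record["schemes"]:
            merged.setdefault(canonical_scheme(scheme), set()).add(record["app_id"])
    return {scheme: sorted(ids) for scheme, ids in sorted(merged.items())}


# Made-up input data.
existing = {"Example://": [2]}
scanned = [{"app_id": 1, "version": "1.0", "schemes": ["example", "other"]}]
print(merge(existing, scanned))
```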

If these are helpful for you I can share them. Maybe we can convince other developers to run the scripts on their machines to get more data about paid apps as well.

danielamitay · 2012-11-14T23:10:23Z

My original implementation involved dynamically producing a list of URL schemes on the server (filter schemes via the device model and system version), testing those schemes, and then sending back that data to the server, which would match specific URL scheme combinations. So I originally did check for specific versions of apps (for example, Yelp adds a new URL scheme every few updates).
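The combination matching could look roughly like this. It assumes a per-app history of which schemes each release registers, ordered oldest to newest; the data, the scheme names, and `infer_version` are made up for illustration and are not the original server code.

```python
def infer_version(detected, version_schemes):
    """Return the newest version whose schemes were all detected, else None.

    version_schemes: list of (version, schemes) pairs, oldest first.
    """
    detected = set(detected)
    best = None
    for version, schemes in version_schemes:
        if set(schemes) <= detected:
            best = version
    return best


# Hypothetical app that adds a scheme in its 2.0 release.
history = [("1.0", ["example"]), ("2.0", ["example", "example-v2"])]
print(infer_version({"example"}, history))
print(infer_version({"example", "example-v2"}, history))
```

The result is best read as "at least this version", since releases that add no new scheme are indistinguishable from their predecessor.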

However, I didn't feel that the increased data requirement or performance requirement was worth the additional insight.

When I get the chance, I'll probably set up a Development/ collection of scripts like those you mentioned.

HBehrens · 2012-11-15T09:55:29Z

Another option to shrink the size would be to split the dataset into schemeApps-iPad.json and schemeApps-iPhone.json.

danielamitay · 2012-11-15T16:42:40Z

It wouldn't actually reduce the size, however. Even if your app is iPhone-only, it should still check for iPad-only schemes when it runs on an iPad, so you would still need to include schemeApps-iPad.json.

However, it would indeed reduce the detection time on iPhones by a non-trivial amount, so that should definitely be implemented.

Note to self: schemeApps~iPhone.json & schemeApps~iPad.json
Retrieve one or both based upon [[UIDevice currentDevice] userInterfaceIdiom];
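The selection rule from this comment, sketched in Python for brevity. On a device the idiom would come from `UIDevice`; here it is passed in as a plain string, and `dataset_files` is a hypothetical helper.

```python
IPHONE_FILE = "schemeApps~iPhone.json"
IPAD_FILE = "schemeApps~iPad.json"


def dataset_files(idiom):
    """Return the dataset files to scan for a device idiom ('phone' or 'pad')."""
    if idiom == "pad":
        # iPads can also run iPhone apps, so both files are needed.
        return [IPHONE_FILE, IPAD_FILE]
    # iPhones cannot run iPad-only apps, so their schemes can be skipped.
    return [IPHONE_FILE]


print(dataset_files("phone"))
print(dataset_files("pad"))
```

Both files still ship with the app, so this saves scan time on iPhones rather than bundle size.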

HBehrens · 2012-11-15T16:50:22Z

When combining with a phase at build-time you could omit the values that are not needed.

danielamitay · 2012-11-15T16:57:19Z

I'm confused? My point was that none of the values can be omitted. Even if your app was only built/compiled for iPhone, when it runs on an iPad, it should still also use the schemeApps~iPad.json, and conversely, even when your app is specifically built/compiled for iPad, it should still also use the schemeApps~iPhone.json.

HBehrens · 2012-11-15T17:05:32Z

Again, we are thinking of different sets of use cases. For an "app scanner" you are right, and it's perfectly fine to detect apps that run in compatibility mode. In that case, you can only skip the detection for iPad apps if you are physically running on a non-iPad device.

excelltech · 2013-02-19T01:12:51Z

It might be nice to have a schemeApps.json for each app category, such as schemeApps-Games.json, schemeApps-Entertainment.json, etc.

Other than that, the more schemes the better, in my opinion. Since the checking is done in the background and you get incremental feedback, the additional wait time doesn't seem like a big deal to me.

Also, I'd like to scan my apps and contribute to the collection. @danielamitay or @HBehrens: would you mind sharing the scripts you wrote? Are there any criteria for apps to make the cut into the list?

Thanks,

Ed
