Tiered datasets, a limit, or none? #4

Open
danielamitay opened this issue Nov 14, 2012 · 10 comments

@danielamitay
Owner

The dataset from my initial push was actually filtered down (by number of reviews and by availability) from an original collection of >20,000 URL schemes.

Previously, this dataset had to be retrieved from the server, and each additional scheme check takes ~1ms, so I did this filtering for the sake of bandwidth and speed. Seeing as the current dataset adds only ~180kB to a compiled app, perhaps a limit on the number of URL schemes is unnecessary?

As an extreme example, were we to collect 100,000 URL schemes, not only would the compressed file size jump to >1MB, but the detection process itself would take 7X longer to complete.
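A rough way to reason about the time side of this trade-off, as a minimal sketch: it assumes the ~1ms-per-check figure quoted above holds and that detection time grows linearly with the number of schemes. The function name is hypothetical and the numbers are estimates, not measurements.

```python
def estimated_scan_seconds(scheme_count, ms_per_check=1.0):
    """Estimate total detection time.

    Assumption: every scheme check costs the same (~1 ms, per the comment
    above), so total time is simply linear in the number of schemes.
    """
    return scheme_count * ms_per_check / 1000.0


# Dataset sizes mentioned in the comment above.
for count in (20_000, 100_000):
    print(f"{count:>7} schemes -> ~{estimated_scan_seconds(count):.1f} s")
```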

Thoughts?

Tags regarding dataset: @HBehrens @steipete

@HBehrens
Contributor

Two quick ideas regarding the size of the dataset:

  1. Some apps implement several schemes (currently 12297 schemes and 10736 apps), so it might be feasible to leave out some schemes (saves space + time).
  2. Since app ids are integers and URIs use a limited set of valid ASCII characters, there could be a more efficient packaging strategy.
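The second idea could be sketched as a simple length-prefixed binary format. This is only an illustration, not the project's format: the record layout, the function names, and the sample schemes and ids are all hypothetical, and it assumes scheme names are at most 255 ASCII bytes and app ids fit in an unsigned 32-bit integer.

```python
import struct


def pack_schemes(scheme_to_app_ids):
    """Pack {scheme: [app_id, ...]} into bytes.

    Hypothetical record layout: 1-byte scheme length, ASCII scheme bytes,
    1-byte id count, then each id as a little-endian uint32.
    """
    out = bytearray()
    for scheme in sorted(scheme_to_app_ids):
        ids = scheme_to_app_ids[scheme]
        raw = scheme.encode("ascii")
        out += struct.pack("<B", len(raw)) + raw
        out += struct.pack("<B", len(ids))
        out += struct.pack(f"<{len(ids)}I", *ids)
    return bytes(out)


def unpack_schemes(blob):
    """Inverse of pack_schemes: rebuild the {scheme: [app_id, ...]} mapping."""
    result = {}
    pos = 0
    while pos < len(blob):
        name_len = blob[pos]
        pos += 1
        scheme = blob[pos:pos + name_len].decode("ascii")
        pos += name_len
        count = blob[pos]
        pos += 1
        result[scheme] = list(struct.unpack_from(f"<{count}I", blob, pos))
        pos += 4 * count
    return result


# Made-up sample data; real scheme names and app ids would come from the dataset.
sample = {"exampleapp": [111111111], "examplegame": [222222222, 333333333]}
packed = pack_schemes(sample)
print(len(packed), "bytes packed")
```

Whether this beats compressed JSON in practice would need measuring against the real dataset; ids that do not fit in 32 bits would require a wider field such as `Q`.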

I am not completely sure if scanning for the existence of any app is everybody's use case. It could be interesting to enrich the dataset by some attributes (e.g. genre) so one can decide which subset of the dataset to scan for. I could even think of an online tool to express rather complex scenarios like ("today's top 100 games in italy plus these 10 manually chosen apps") to produce a fixed set of app ids one could then download as binary data into the app to configure the next scanning operation.
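The "top N in a genre plus manually chosen apps" scenario could be sketched like this. The attribute names (`id`, `genre`, `rank`) and the function name are assumptions for illustration; the current dataset does not carry such attributes.

```python
def build_scan_set(apps, genre=None, top_n=None, extra_ids=()):
    """Return a sorted list of app ids to scan for.

    apps: list of dicts with hypothetical keys 'id', 'genre' and 'rank'
    (lower rank = more popular). extra_ids are always included.
    """
    selected = [a for a in apps if genre is None or a["genre"] == genre]
    selected.sort(key=lambda a: a["rank"])
    if top_n is not None:
        selected = selected[:top_n]
    ids = {a["id"] for a in selected}
    ids.update(extra_ids)
    return sorted(ids)


# Made-up catalogue entries.
catalogue = [
    {"id": 1, "genre": "Games", "rank": 3},
    {"id": 2, "genre": "Games", "rank": 1},
    {"id": 3, "genre": "Games", "rank": 2},
    {"id": 4, "genre": "Utilities", "rank": 1},
]
print(build_scan_set(catalogue, genre="Games", top_n=2, extra_ids=[99]))
```

The resulting id list is what an online tool of the kind described above might hand back for download.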

Anyway, it would be great if you could extend the dataset with your original schemes :) (where the heck did you get 20k URL schemes from?!)

@danielamitay
Owner Author

I'll run through my full dataset and push it later.

Because I wanted my dataset to be accurate and without the possibility of 3rd party manipulation, I grabbed data directly from app IPAs. Basically, by downloading and analyzing a lot of apps (top apps). Old post on the subject: http://danielamitay.com/blog/2011/5/9/detailed-iphone-app-ipa-statistics

This of course means that my dataset is skewed in favor of free apps, but fortunately, free apps are downloaded vastly more often, and my dataset represents the most popular apps.

Again, as the above blog post mentions (however outdated), not every app implements URL schemes. As such, my 20k URL schemes represent ~60k apps.

@HBehrens
Contributor

I implemented two scripts earlier today to

  1. collect data from IPAs (incl. information about the version since schemes change over time, so they can be used to determine the app's version)
  2. merge and canonicalize your mapping format and the format from 1.

If these are helpful for you I can share them. Maybe we can convince other developers to run the scripts on their machines to get more data about paid apps as well.
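For readers who want to try the collection step themselves, here is a minimal sketch of reading URL schemes out of an IPA. It is not the script mentioned above. It relies on an IPA being a zip archive with the app's `Info.plist` at `Payload/<Name>.app/Info.plist`, and on schemes being declared under `CFBundleURLTypes` / `CFBundleURLSchemes`; the function name and the shape of the returned dict are hypothetical.

```python
import plistlib
import re
import zipfile

# The main Info.plist sits directly inside the .app bundle under Payload/.
INFO_PLIST = re.compile(r"^Payload/[^/]+\.app/Info\.plist$")


def extract_url_schemes(ipa_path):
    """Return bundle id, version and declared URL schemes, or None if no Info.plist is found."""
    with zipfile.ZipFile(ipa_path) as ipa:
        for name in ipa.namelist():
            if INFO_PLIST.match(name):
                # plistlib.loads handles both XML and binary plists.
                info = plistlib.loads(ipa.read(name))
                break
        else:
            return None

    schemes = []
    for url_type in info.get("CFBundleURLTypes", []):
        schemes.extend(url_type.get("CFBundleURLSchemes", []))

    return {
        "bundle_id": info.get("CFBundleIdentifier"),
        # Recorded because schemes change between versions.
        "version": info.get("CFBundleVersion"),
        "schemes": sorted(set(schemes)),
    }
```

Running this over a folder of IPAs and merging the results by bundle id would give a per-version scheme history, which is the kind of data item 1 describes.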

@danielamitay
Owner Author

My original implementation involved dynamically producing a list of URL schemes on the server (filter schemes via the device model and system version), testing those schemes, and then sending back that data to the server, which would match specific URL scheme combinations. So I originally did check for specific versions of apps (for example, Yelp adds a new URL scheme every few updates).

However, I didn't feel that the increased data requirement or performance requirement was worth the additional insight.

When I get the chance, I'll probably set up a Development/ collection of scripts like those you mentioned.

@HBehrens
Contributor

Another option to shrink the size would be to split the dataset into schemeApps-iPad.json and schemeApps-iPhone.json.

@danielamitay
Owner Author

It wouldn't actually reduce the size, however. Even if your app is iPhone-only, when it is run on an iPad, you should still be checking for iPad-only schemes--so you should still include schemeApps-iPad.json

However, it would indeed reduce the detection time on iPhones by a non-trivial amount, so that should definitely be implemented.

Note to self: schemeApps~iPhone.json & schemeApps~iPad.json
Retrieve one or both based upon [[UIDevice currentDevice] userInterfaceIdiom];
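The selection logic in that note could be sketched as follows. This is a language-neutral illustration in Python rather than the library's Objective-C: the `idiom` strings are hypothetical stand-ins for the device type, and the function name is made up.

```python
def scheme_files_for_device(idiom):
    """Pick which scheme lists to scan on the current device.

    idiom: "phone" or "pad" (hypothetical stand-ins for the device type).
    """
    if idiom == "pad":
        # An iPad also runs iPhone apps, so both lists are needed.
        return ["schemeApps~iPhone.json", "schemeApps~iPad.json"]
    if idiom == "phone":
        # iPad-only apps cannot be installed here, so skip that list.
        return ["schemeApps~iPhone.json"]
    raise ValueError(f"unknown idiom: {idiom!r}")


print(scheme_files_for_device("phone"))
print(scheme_files_for_device("pad"))
```

Both files still ship in the bundle, so this saves detection time on iPhones but not app size.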

@HBehrens
Contributor

Combined with a build-time phase, you could omit the values that are not needed.

@danielamitay
Owner Author

I'm confused? My point was that none of the values can be omitted. Even if your app was only built/compiled for iPhone, when it runs on an iPad, it should still also use the schemeApps~iPad.json, and conversely, even when your app is specifically built/compiled for iPad, it should still also use the schemeApps~iPhone.json.

@HBehrens
Contributor

Again, we are thinking of different sets of use cases. For an "app scanner" you are right, and it's perfectly fine to detect apps that run in compatibility mode. In that case, you can only skip the detection for iPad apps if you are physically running on a non-iPad device.

@excelltech

It might be nice to have a schemeApps.json for each app category, such as schemeApps-Games.json, schemeApps-Entertainment.json, etc.

Other than that, the more schemes the better, in my opinion. As the checking is done in the background and you get incremental feedback, the additional wait time doesn't seem like a huge deal to me.

Also, I'd like to scan my apps and contribute to the collection. @danielamitay or @HBehrens: would you mind sharing the scripts you wrote? Are there any criteria for apps to make the cut into the list?

Thanks,

Ed
