Tiered datasets, a limit, or none? #4
Two quick ideas regarding the size of the dataset:
I am not completely sure that scanning for the existence of every app is everybody's use case. It could be interesting to enrich the dataset with some attributes (e.g. genre) so one can decide which subset of the dataset to scan for. I could even imagine an online tool for expressing rather complex scenarios ("today's top 100 games in Italy plus these 10 manually chosen apps") that produces a fixed set of app IDs, which one could then download as binary data into the app to configure the next scanning operation.

Anyway, it would be great if you could extend the dataset with your original schemes :) (Where the heck did you get 20k URL schemes from?!)
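The genre-based-subset idea could be sketched roughly like this. Note that the entry format, field names, and app IDs below are invented for illustration; the actual schemeApps.json layout may differ:

```python
# Hypothetical enriched dataset: each entry pairs a URL scheme with
# app metadata (appId, genre). Field names are assumptions.
dataset = [
    {"scheme": "fb", "appId": 284882215, "genre": "Social Networking"},
    {"scheme": "twitter", "appId": 333903271, "genre": "Social Networking"},
    {"scheme": "angrybirds", "appId": 343200656, "genre": "Games"},
]

def subset(entries, genres=None, app_ids=None):
    """Select schemes to scan for: e.g. one genre plus a manual app list."""
    keep = []
    for e in entries:
        if (genres and e["genre"] in genres) or (app_ids and e["appId"] in app_ids):
            keep.append(e["scheme"])
    return keep

# "Games plus these manually chosen apps" scenario:
print(subset(dataset, genres={"Games"}, app_ids={284882215}))  # ['fb', 'angrybirds']
```

An online tool could run a query like this server-side and emit just the resulting scheme list for the app to download.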
I'll run through my full dataset and push it later. Because I wanted my dataset to be accurate and free of any possibility of third-party manipulation, I grabbed data directly from app IPAs — basically, by downloading and analyzing a lot of (top) apps. Old post on the subject: http://danielamitay.com/blog/2011/5/9/detailed-iphone-app-ipa-statistics

This of course means that my dataset is skewed in favor of free apps, but fortunately, free apps are downloaded vastly more often, so my dataset represents the most popular apps. Again, as the above blog post mentions (however outdated), not every app implements URL schemes. As such, my 20k URL schemes represent ~60k apps.
I implemented two scripts earlier today for this. If they are helpful for you, I can share them. Maybe we can convince other developers to run the scripts on their machines to get more data about paid apps as well.
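A script of this kind mostly amounts to reading the CFBundleURLTypes array out of each app's Info.plist (found inside the IPA's Payload/*.app directory). A minimal Python sketch — the example plist content below is invented:

```python
import plistlib

def url_schemes(info_plist_bytes):
    """Collect all CFBundleURLSchemes entries from an Info.plist."""
    info = plistlib.loads(info_plist_bytes)
    schemes = []
    for url_type in info.get("CFBundleURLTypes", []):
        schemes.extend(url_type.get("CFBundleURLSchemes", []))
    return schemes

# Minimal example plist, mirroring the structure Apple documents for URL types.
example = plistlib.dumps({
    "CFBundleIdentifier": "com.example.app",
    "CFBundleURLTypes": [
        {"CFBundleURLSchemes": ["exampleapp", "exampleapp-beta"]},
    ],
})
print(url_schemes(example))  # ['exampleapp', 'exampleapp-beta']
```

In practice you would unzip each IPA first and point this at the extracted Info.plist; binary plists work too, since plistlib handles both formats.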
My original implementation involved dynamically producing a list of URL schemes on the server (filtering schemes by device model and system version), testing those schemes, and then sending that data back to the server, which would match specific URL scheme combinations. So I originally did check for specific versions of apps (for example, Yelp adds a new URL scheme every few updates). However, I didn't feel that the increased data or performance requirements were worth the additional insight. When I get the chance, I'll probably set up a
Another option to shrink the size would be to separate iPhone-only and iPad-only schemes.
It wouldn't actually reduce the size, however. Even if your app is iPhone-only, when it is run on an iPad you should still be checking for iPad-only schemes, so you should still include them.

However, it would indeed reduce the detection time on iPhones by a non-trivial amount, so that should definitely be implemented. Note to self:
When combined with a build-time phase, you could omit the values that are not needed.
I'm confused? My point was that none of the values can be omitted. Even if your app was only built/compiled for iPhone, when it runs on an iPad it should still also use the iPad-only schemes.
Again, we are thinking of different sets of use cases. For an "app scanner" you are right, and it's perfectly fine to detect apps that run in compatibility mode. In that case, you can only skip the detection of iPad-only apps if you are physically running on a non-iPad device.
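One way to reconcile both views: ship the full dataset but partition it by device family, then decide at runtime which slice to check. On an iPad, everything is checked (iPhone-only apps can run in compatibility mode); on an iPhone, iPad-only schemes are skipped. A rough sketch — the field names are assumptions, not the actual dataset format:

```python
# Hypothetical dataset partitioned by device family.
dataset = [
    {"scheme": "phoneonlyapp", "devices": {"iphone"}},
    {"scheme": "padonlyapp", "devices": {"ipad"}},
    {"scheme": "universalapp", "devices": {"iphone", "ipad"}},
]

def schemes_for(device):
    """Schemes worth checking on the given device family."""
    if device == "ipad":
        # iPads run iPhone-only apps in compatibility mode,
        # so every scheme stays in play.
        return [e["scheme"] for e in dataset]
    # iPhones can never run iPad-only apps, so skip those checks.
    return [e["scheme"] for e in dataset if device in e["devices"]]

print(schemes_for("iphone"))  # ['phoneonlyapp', 'universalapp']
```

The dataset size on disk stays the same, but the iPhone scan gets shorter, matching the compromise discussed above.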
It might be nice to have a schemeApps.json for each app category, such as schemeApps-Games.json, schemeApps-Entertainment.json, etc. Other than that, the more schemes the better, in my opinion. Since the checking is done in the background and you get incremental feedback, the additional wait time doesn't seem like a huge deal to me.

Also, I'd like to scan my apps and contribute to the collection. @danielamitay or @HBehrens: would you mind sharing the scripts you wrote? Is there any criterion for apps to make the cut into the list?

Thanks, Ed
The dataset from my initial push was actually filtered down (by number of reviews and availability) from an original collection of >20,000 URL schemes.
Previously, this dataset needed to be retrieved from the server, and each additional scheme check takes ~1ms, so I did this filtering for the sake of bandwidth and speed. Seeing as the current dataset adds only ~180kB to a compiled app, perhaps a limit on the number of URL schemes is unnecessary?
As an extreme example, were we to collect 100,000 URL schemes, not only would the compressed file size jump to >1MB, but the detection process itself would take 7X longer to complete.
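As a rough sanity check on these numbers, using the ~1 ms-per-check figure quoted above (the ~14k size of the current filtered dataset is an assumption inferred from the 7X claim, not a stated figure):

```python
MS_PER_CHECK = 1.0  # ~1 ms per URL scheme check, per the estimate above

def scan_seconds(num_schemes):
    """Total foreground scan time for a given number of schemes."""
    return num_schemes * MS_PER_CHECK / 1000.0

current = 14_000    # assumption: ~100k / 7 from the "7X longer" claim
proposed = 100_000  # the extreme example above

print(scan_seconds(current))                     # ~14 seconds today
print(scan_seconds(proposed))                    # ~100 seconds at 100k schemes
print(scan_seconds(proposed) / scan_seconds(current))  # ≈ 7.1x
```

Since the cost grows linearly with the scheme count, a tiered dataset (or a cap) trades detection coverage directly against scan time.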
Thoughts?
Tagging @HBehrens @steipete regarding the dataset.