Store sparse count data JSON in visualization #58

Closed
fedarko opened this issue Feb 25, 2019 · 4 comments

fedarko (Collaborator) commented Feb 25, 2019

This isn't a problem yet, but as we start trying larger datasets here this will become an issue.

Action items:

  • Disable Altair's "MaxRows check" -- this is more of a temporary measure than anything, but it'll stop Altair from giving up immediately as soon as a dataset contains more than 5k features or 5k samples. (See the sketch after this list.)

  • Store datasets in their own JSON files in the data directory. Our use of column mapping in the sample plot means the cost of keeping data in memory isn't that severe, but it's still probably a good idea to separate datasets from the Vega-Lite JSON specs generated by Altair. This should also remove our need to disable the MaxRows check. (If there are more than 5k features, would the column-mapping data trigger a MaxRowsError? I should look into this, and consider storing that data in its own JSON file as well, although that sounds a bit excessive.)

    • NOTE that this will probably (?) require using cross-origin requests again, which would essentially nullify the benefits of "Remove need for cross-origin requests" #72 (i.e. users would again have to use view.qiime2.org or python3 -m http.server to view a visualization). It should be doable to make this an option: generating a "serverless" visualization should stay the default, and if you want a visualization that uses CORS but can potentially tolerate more data, you could pass a --use-external-data flag or something to the Python script.

    • This is probably pretty far off, but it's worth starting to think about now.

  • Reduce the number of samples shown to the user (idea c/o @mortonjt). IMO we should approach this carefully, giving as much control to the user as possible (e.g. making the user very aware of how many samples have been excluded).
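
Rough sketch of the first two items above, using Altair's data-transformer API (exact option names depend on the Altair version in use):

import altair as alt

# Temporary measure: allow more than 5,000 rows of inline data in a spec.
alt.data_transformers.enable('default', max_rows=None)

# Alternative: write each dataset to its own JSON file and have the spec
# reference it by URL instead of embedding the data. This also sidesteps the
# MaxRows check, but the visualization then has to be served over HTTP
# (e.g. view.qiime2.org or python3 -m http.server).
alt.data_transformers.enable('json')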

@fedarko fedarko added the optimization Making code faster or cleaner label Feb 25, 2019
@fedarko fedarko self-assigned this Feb 25, 2019
fedarko (Collaborator, Author) commented Apr 30, 2019

It'd also be worth using a sparse DataFrame for the table instead of a dense one, as is currently done. (Note that this would need to either 1) convert the BIOM table to a sparse DataFrame ourselves, since before biocore/biom-format#808 the default biom.Table.to_dataframe() produced a dense DataFrame, or 2) require an up-to-date version of biom.) Although I'm still having trouble generating a sparse DataFrame in a reasonable amount of time; maybe a scipy CSR matrix would be a better option, if we can make it work in the code. Update: Done as of beaf464
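
Rough sketch of the scipy route (hypothetical file name; not the actual loading code):

import biom

table = biom.load_table("feature-table.biom")  # hypothetical path
counts_csr = table.matrix_data.tocsr()  # biom keeps counts as a scipy sparse matrix internally
# Newer biom versions also support table.to_dataframe(dense=False), which
# returns a sparse pandas DataFrame instead of a dense one.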

fedarko added a commit that referenced this issue May 2, 2019
To allow more than 5k rows (either features or samples) in a spec.

There *are* some limits we should probably enforce (see #58 for
details), but for the time being I think this is a decent temporary
solution.
fedarko added a commit that referenced this issue May 13, 2019
Might be worth eventually making this JSON load separately, or in
a web worker or something. But that's something for #58.
fedarko (Collaborator, Author) commented May 17, 2019

Another thing: as mentioned in 041aaaf, it'd be cool to load the count data in a web worker or something, if possible, to minimize strain on the browser. That's the main scaling problem IMO: the count data is basically just the contents of a BIOM table.

One silly option: could we just... not include samples that have 0 counts for a given feature? (Or vice versa; it doesn't really matter, it just depends on how the counts JSON is laid out.) As long as we have a list of all sample IDs, we can infer in the JS code which samples have a count of 0 for a given feature (or vice versa). This would basically let us "cheat" and store a sparse representation of the BIOM table, and since BIOM tables for microbiome studies are usually super sparse, this could actually let us do silly things like load a huge portion of the EMP.

Another option would be to use https://github.com/molbiodiv/biojs-io-biom

fedarko (Collaborator, Author) commented May 21, 2019

Another good idea: improving DataFrame iteration. I know iterrows() is used in process_input(), and there are probably at least a few other places where something faster (e.g. vectorized operations) could be done.
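
Toy illustration of the kind of change I mean (made-up data, not the actual process_input() logic):

import pandas as pd

# Hypothetical feature table: columns are features, index holds sample IDs.
table = pd.DataFrame({"F1": [0, 2, 0], "F2": [1, 0, 4]}, index=["S1", "S2", "S3"])

# Slow: Python-level iteration over rows with iterrows().
feature_totals = {}
for sample_id, row in table.iterrows():
    for feature_id, ct in row.items():
        feature_totals[feature_id] = feature_totals.get(feature_id, 0) + ct

# Fast: equivalent vectorized operation handled internally by pandas.
feature_totals_vectorized = table.sum(axis=0)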

@fedarko fedarko mentioned this issue Jun 3, 2019
@fedarko fedarko added this to the v0.2.0 milestone Jun 28, 2019
@fedarko fedarko changed the title from "Optimizing rankratioviz' performance" to "Store sparse count data JSON in visualization" Jul 2, 2019
fedarko added a commit that referenced this issue Jul 2, 2019
Turns out that we need to delay matching until after filtering
out unextreme features in order to avoid computing the intersection
of two super-huge indices (EMP has on the order of 200k observations
and trying to match those up is going to be horrible).

And it seems like _df_utils.remove_empty_samples() was super slow
even on SparseDataFrames. To make things easier -- and because I
know from experience that it worked, and let Qurro handle huge
EMP-scale amounts of data -- I reorganized things so that we
first filter out ranks and empty samples, THEN do matching.

This works, but it breaks a lot of the unit tests (and some of
the integration tests that rely on specific error messages that
are now changed). Need to fix these then double check that everything
works properly.

This is a prereq for #58. I have an alg for that sketched out, also.
fedarko (Collaborator, Author) commented Jul 2, 2019

Draft algorithm for how this would work:

# To be added to the end of gen_sample_plot().
# Here `table` is assumed to be a DataFrame whose columns are feature IDs and
# whose index holds sample IDs, so to_dict() maps feature ID -> {sample ID: count}.
counts_dict = table.to_dict()
sparse_counts_dict = {}
for feature_id, sample_counts in counts_dict.items():
    # Keep only this feature's nonzero counts.
    fdict = {}
    for sample_id, ct in sample_counts.items():
        if ct != 0:
            fdict[sample_id] = ct
    # Skip features that are 0 in every sample.
    if fdict:
        sparse_counts_dict[feature_id] = fdict

Now, sparse_counts_dict only contains data for nonzero counts. So for sparse datasets, we'd need a lot less space for abundance data (i.e. much lower memory overhead).
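
Toy example (made-up data) of what this produces:

import pandas as pd

# Same assumed orientation as above: columns are features, index holds sample IDs.
table = pd.DataFrame(
    {"F1": [0, 2, 0], "F2": [0, 0, 0], "F3": [1, 0, 4]},
    index=["S1", "S2", "S3"],
)
# Running the draft algorithm on this table gives:
# {"F1": {"S2": 2}, "F3": {"S1": 1, "S3": 4}}
# Zero counts are omitted, and F2 (all zeros) is dropped entirely.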

In the JS, we'll need to somehow figure out a list of sample IDs (probs can just embed it in the sample plot JSON analogously to how feature lists are embedded in the rank plot JSON) independent of the count JSON.

Also, note how we might not even add fdict into sparse_counts_dict (this would happen if all samples have 0 for a given feature, which is ostensibly possible in Qurro at present). That would mean the feature is effectively useless in the Qurro visualization. I think we can get around this by only passing nonzero features to the visualization -- see #171 for how we'd address this.
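
Quick sketch of that filtering (same assumed orientation as in the draft above, features as columns):

# Keep only features (columns) with at least one nonzero count, so every
# feature passed to the visualization ends up in sparse_counts_dict.
table = table.loc[:, (table != 0).any(axis=0)]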

fedarko added a commit that referenced this issue Jul 3, 2019
This is gonna be kind of inconvenient to apply to the feature
ranks, also. Think I'm going to go back and try an alternative approach
to all of this stuff (#172, #171, #58) on another branch.
fedarko added a commit that referenced this issue Jul 3, 2019
Turns out this is a ton more efficient. Added bonus of now relying
on pandas' implementation of this instead of ours.

Turns out transposing huge dataframes is a pretty significant
endeavor, so calling .T on the feature table for like the EMP dataset
was taking a super long time. Fortunately, we can finesse our way
around this by instead transposing the sample metadata and then
aligning on the columns.
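
Roughly, the trick looks like this (hypothetical variable names, not the exact Qurro code):

# feature_table: features x samples (huge); sample_metadata: samples x fields (small).
# Transposing the metadata is cheap, and align(join="inner", axis=1) restricts
# both frames to the sample IDs they share, without ever transposing feature_table.
metadata_t, feature_table = sample_metadata.T.align(feature_table, join="inner", axis=1)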

I'm glad that we reached a solution for this that preserved
all of the matching-up-front niceness re: testing. Solid stuff.

Uh, next up are #171 and then #58? But we can def merge this back
into master now.
fedarko added a commit that referenced this issue Jul 6, 2019
Closes #171.

Turns out the reason q2_moving_pictures and sleep_apnea's main.js
outputs were different is that removing empty features changed
the order of features in their rank plot JSONs around a bit (the
actual data was exactly the same since these datasets didn't have any
empty features, though). I think it was just the use of .filter().

In any case, that's ok with me. We're done here!

Now we can move on to #58.
fedarko added a commit that referenced this issue Jul 7, 2019
Need to actually integrate this with Qurro, but close!
@fedarko fedarko closed this as completed in bc0f97d Jul 7, 2019
fedarko added a commit that referenced this issue Jul 7, 2019
fedarko added a commit that referenced this issue Jul 7, 2019
relates to #58, #171.

now it's pretty quick to run the EMP with -x 2000 into Qurro. going
to do some more basic benchmarking with this.
This issue was closed.