Show "removed" samples from the sample plot / Give some indication of why #92

Closed
fedarko opened this issue Apr 27, 2019 · 5 comments · Fixed by #166
Assignee: fedarko
Labels: enhancement (New feature or request), important (Things that are critical for getting Qurro in a working/useful state)

Comments

fedarko commented Apr 27, 2019

The solution with #91 provides us with a way to ensure that, e.g., only samples with a numerical metadata value are present in the scatterplot. However, it'd be a good idea to also inform the users of how many samples were excluded due to invalid-ish values.

A really cool example of this is this Vega example, where nulls are represented as gray points along the axes. (Also see a simpler version of the example for Vega-Lite.) However, we don't necessarily need fancy solutions like these for our purposes; I'd be fine having just a little text box or something next to the sample plot that contains the number of "dropped" samples. This could be as simple as the "type": "text" mark on the Vega example above—just count the number of invalid values and say something like {} samples with missing values.
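For concreteness, here's a minimal sketch of that "just show a count" idea as a Vega (v5-ish) spec fragment, written as a JS object literal. The dataset and field names (qurro_sample_data, xField) are placeholders for illustration, not Qurro's actual spec:

    // Hypothetical Vega spec fragment: derive a one-row dataset that counts the
    // samples whose x-axis field isn't a usable number, then display that count
    // with a "type": "text" mark. All names here are illustrative only.
    const droppedCountFragment = {
        data: [
            {
                name: "dropped_samples",
                source: "qurro_sample_data", // placeholder name for the main sample dataset
                transform: [
                    // keep only samples with a missing / non-numeric x-axis value
                    { type: "filter", expr: "!isValid(toNumber(datum.xField))" },
                    // an aggregate with no fields yields a single row with a "count" field
                    { type: "aggregate" }
                ]
            }
        ],
        marks: [
            {
                type: "text",
                from: { data: "dropped_samples" },
                encode: {
                    update: {
                        x: { value: 5 },
                        y: { value: 15 },
                        text: { signal: "datum.count + ' samples with missing values'" }
                    }
                }
            }
        ]
    };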

In a sense, this functionality would work double duty—it'd inform the user both about which samples don't have the features contained in the log ratio and about which samples don't have valid metadata for the current x-axis.

@fedarko fedarko added the enhancement New feature or request label Apr 27, 2019
@fedarko fedarko self-assigned this Apr 27, 2019
@fedarko fedarko changed the title Show "removed" samples from the sample plot Show "removed" samples from the sample plot / Give some indication of why May 8, 2019
fedarko commented May 8, 2019

Also, certain color fields can result in samples being dropped. IMO this is a very important consideration.

Right now all of the samples that are passed to the visualization are included in the "export data" output (EXCEPT for samples with an invalid log ratio, i.e. due to missing certain features in the log ratio). We should either make this consistent—and include samples with an invalid log ratio alongside samples with invalid metadata fields—OR we should only include "currently drawn" samples, I guess? In any case, this should be super explicit.

@fedarko fedarko added the important Things that are critical for getting Qurro in a working/useful state label May 8, 2019
fedarko commented May 28, 2019

It'd be cool to show either a tally or a percentage of samples dropped (latter idea c/o Lisa).

fedarko added a commit that referenced this issue May 30, 2019
This is a solid start.

We'd want to also log the number of samples dropped due to having
a non-numeric value on a quantitative scale (either for color or
x-axis). I *think* we could do that by running some similar code
in updateSamplePlotField()/updateSamplePlotScale(), and maybe adding
some logic to ensure that each sample is just counted once (i.e. both
null and non-numeric-metadata-on-a-quant-scale samples would be
dropped), but I'm not sure. Should be doable tho :)
fedarko added a commit that referenced this issue May 30, 2019
these are also logged w/ the NaN-drop ct.

In the future, we should also keep track of how many samples were
dropped in the Python side of things, so the user knows *exactly*
what samples have gone where.

Ideally we should be as explicit as possible with this
fedarko commented Jun 1, 2019

Here's a plan of how the code for displaying this would work:

  • There should be four divs below the sample plot (i.e. just below the boxplot toggle).
    • The first one of these should be updated every time the feature selection(s) change, to let the user know how many samples are not shown in the plot due to having a NaN balance (i.e. 0 in either/both of the numerator/denominator).
      • There's already code in Qurro that computes the number of dropped samples, so this is super feasible.
    • The second one of these should be updated every time the x-axis scale type changes, or every time the x-axis field changes. This should go through all samples to figure out how many samples will not be shown in the plot due to not being treatable as a number (I think we can just use vega.toNumber), and just let the user know how many samples are affected by this.
    • The third one is basically the same as the x-axis one, but for the color field stuff. We should be able to abstract out almost all of the code for this. (And in the future, if we add more encodings, it should be simple to extend this.)

We should store lists of sample IDs that have been excluded by any of these three metrics, so that we can say in the fourth div how many samples are currently shown in the sample plot at all (similar to Emperor's top-left-corner box). We can get the total number of dropped samples by computing the union of the three "excluded samples" lists, which shouldn't take that much time.

The number of shown samples is just # samples - # dropped samples, then.

Also, if any of the divs saying why a sample has been dropped correspond to 0 samples being dropped (e.g. all of the x-axis fields of all samples are quantitative), then hide those divs I guess.
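A rough sketch of the bookkeeping this plan implies, in plain JS. The div IDs, the droppedSampleIDs structure, and the function names are placeholders for illustration, not existing Qurro code:

    // Hypothetical bookkeeping for the "four divs" plan: one array of excluded
    // sample IDs per reason; the union gives the total number of dropped samples.
    const droppedSampleIDs = {
        balance: [], // NaN log ratio (0 in the numerator and/or denominator)
        xAxis: [],   // non-numeric value on a quantitative x-axis scale
        color: []    // non-numeric value on a quantitative color scale
    };

    function updateDroppedReasonDiv(divID, droppedIDs, reasonText) {
        const div = document.getElementById(divID);
        if (droppedIDs.length === 0) {
            // hide divs corresponding to 0 dropped samples
            div.classList.add("invisible");
        } else {
            div.classList.remove("invisible");
            div.textContent = droppedIDs.length + " sample(s) dropped " + reasonText;
        }
    }

    function updateMainSampleShownDiv(totalSampleCount) {
        // Union of the three "excluded samples" lists, so a sample dropped for
        // more than one reason is only counted once.
        const allDropped = new Set([
            ...droppedSampleIDs.balance,
            ...droppedSampleIDs.xAxis,
            ...droppedSampleIDs.color
        ]);
        const shownCount = totalSampleCount - allDropped.size;
        document.getElementById("mainSamplesShownDiv").textContent =
            shownCount + " / " + totalSampleCount + " samples shown";
    }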

fedarko added a commit that referenced this issue Jun 3, 2019
Progress towards #92. I guess this is ~ 1/4 of the work there.
fedarko added a commit that referenced this issue Jun 3, 2019
Should be pretty simple to use this in other contexts
fedarko added a commit that referenced this issue Jun 4, 2019
fedarko added a commit that referenced this issue Jun 4, 2019
fedarko commented Jun 5, 2019

I've spent some time looking into whether it'd be possible to get Vega/Vega-Lite to do some of this work for us, perhaps with the "valid" / "missing" aggregate operations. But I haven't been able to find much documentation on how those work.

However, I think we can do this by patching the underlying Vega spec. Advantages of this:

  1. Minimizes the amount of code I have to write
  2. Since we're making Vega do the work for us, less chance of things getting outdated when the Vega* versions used are updated
  3. Lets us embed the number of missing samples into the plot directly

Disadvantages:

  1. This might cause a lack of granularity? -> actually nah I think we can just define multiple data sources that filter the samples in different ways so we can include multiple counts, like I'd planned
  2. Might have a slightly heavier memory footprint due to potentially storing multiple copies of the sample data (if we define multiple datasets), but idk
  3. Inherently less flexible than doing it ourselves in JS/HTML (e.g. just displaying a percentage in a Vega text mark is going to take me at least 30 minutes to figure out)
  4. We'll need to keep patching the Vega spec whenever we call vegaEmbed, which is doable but could get really annoying
  5. I have no idea how this will work with boxplots

Basically, the way we could do this is similar to how the nullXY data source is created and used in the Vega example here. We can define a Vega dataset that takes its data from the main dataset, but instead of filtering out data points that are null / NaN on the x-axis, y-axis, or (color?) fields, we filter to just the data points that are null or NaN on those fields.
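Roughly, the kind of derived dataset meant here, sketched as a JS object (the dataset and field names are placeholders; the real spec is whatever our Vega-Lite output compiles to):

    // Hypothetical derived Vega dataset: instead of dropping the null/NaN points,
    // keep ONLY the points that are null or NaN on the x-axis or color fields, so
    // they can be drawn separately (e.g. as gray points along the axes) or counted.
    const invalidSamplesDataset = {
        name: "invalid_samples",
        source: "qurro_sample_data", // placeholder name for the main sample dataset
        transform: [
            {
                type: "filter",
                // the inverse of the usual "keep only valid points" filter
                expr: "!isValid(toNumber(datum.xField)) || !isValid(toNumber(datum.colorField))"
            }
        ]
    };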

This will necessitate some work in order to ensure that—

  • every time a scale type is changed to categorical, the NaN checks are removed (I think?)
  • every time a scale type is changed to quantitative, the NaN checks are added back in
  • every time a scale field is changed, the fields being checked are updated
    Although I think we can just add some code (probs another function) into makeSamplePlot() that handles all of this adequately. The code footprint of this solution probably wouldn't be thaaaat heavy.
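That "another function" could look something like the following sketch, which just rebuilds the filter expression whenever makeSamplePlot() is called with new encodings (assumed signature and behavior, for illustration only):

    // Hypothetical helper: rebuild the "invalid samples" filter expression based on
    // the current encodings, so the NaN checks only apply to quantitative scales and
    // always point at the currently selected fields.
    function makeInvalidSampleFilterExpr(xField, xScaleType, colorField, colorScaleType) {
        const clauses = [];
        if (xScaleType === "quantitative") {
            clauses.push("!isValid(toNumber(datum['" + xField + "']))");
        }
        if (colorScaleType === "quantitative") {
            clauses.push("!isValid(toNumber(datum['" + colorField + "']))");
        }
        // If neither encoding is quantitative, no samples are dropped for this reason.
        return clauses.length > 0 ? clauses.join(" || ") : "false";
    }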

Here's a Vega spec that shows something like this in action on the sleep apnea dataset.

I'm not super sure about whether I prefer this or the manual (i.e. using custom HTML and JS) solution. Will think about it.

If we go with the Vega solution, we should make sure to attribute the aforementioned scatter plot null values example in the code.

fedarko added a commit that referenced this issue Jun 6, 2019
This is tentative -- will add tests soon
fedarko commented Jun 6, 2019

Yeah, I think it's going to be easiest to do this manually (without messing around with the generated Vega spec).

Added getValidSamples(), which should be a solid way to figure out omitted samples for a given field. TODOs are:

  • unit-test getValidSamples()
  • integrate getValidSamples() with makeSamplePlot() (updating the x-axis and color field divs).
  • Use all dropped-sample information to compute a total # of samples shown (calling updateMainSampleShownDiv() accordingly).
  • end-to-end tests that this works in practice (once #85, "Explicitly wait for Vega plots to be created before enabling various interactive things", is addressed).
  • account for weird crap like "Infinity" or "-Infinity" in quantitative scales, or "" in nominal scales. Either filter these out in makeSamplePlot() (saving a log of their filtering, so they can be added back in when the scale is changed) or just accept the weirdness and update the documentation (this esp. might be more feasible for the infinity problem).
    • I think handling #101 ("Properly handle missing values in sample/feature metadata") before this is the best course of action, so that all metadata vals are parsed as strings and ""s are converted to Nones in Python (i.e. null in JS). Then we can add tests that verify that this works in Python, and that the sample-dropping stats in JS are also handled properly.
  • Add tests for the aforementioned weird stuff. This will probably tie in with #101. We want to make sure all of our decisions in RRVDisplay.getInvalidSampleIDs() match up perfectly with Vega-Lite, and I'm not confident that's the case yet. Need to test that all of these assumptions hold.
    • One alarming thing is that nulls are apparently only filtered by Vega-Lite when the scale is quantitative -- they're fair game in nominal scales. Need to account for this one way or another; it should be doable to add a filter transform to the sample plot that removes nulls for the x and color encoding fields, and to update it whenever the current encoding is changed.
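For reference, a rough sketch of the sort of per-field check getValidSamples() is meant to do (assumed behavior for illustration, not the actual implementation; it uses vega.toNumber as suggested above, which returns null for null/empty input and NaN for non-numeric strings):

    // Hypothetical version of the check: given sample metadata, a field, and the
    // current scale type, return the IDs of samples whose value for that field
    // should actually be drawn.
    function getValidSamples(sampleID2Metadata, field, scaleType) {
        const validSampleIDs = [];
        for (const [sampleID, metadata] of Object.entries(sampleID2Metadata)) {
            const value = metadata[field];
            if (scaleType === "quantitative") {
                // vega.toNumber(null) and vega.toNumber("") are null; vega.toNumber("abc")
                // is NaN. Either way the sample is dropped from a quantitative scale.
                // (Note: "Infinity" / "-Infinity" would still sneak through here; see the
                // TODO above.)
                const num = vega.toNumber(value);
                if (num !== null && !Number.isNaN(num)) {
                    validSampleIDs.push(sampleID);
                }
            } else {
                // Per the note above, Vega-Lite leaves nulls alone on nominal scales, so
                // if we want them gone we have to treat them as invalid here (and add a
                // matching filter transform to the plot).
                if (value !== null && value !== undefined) {
                    validSampleIDs.push(sampleID);
                }
            }
        }
        return validSampleIDs;
    }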

fedarko added a commit that referenced this issue Jun 6, 2019
"sty" --> "various small stylistic tweaks, e.g. in the
updateSampleDroppedDiv() generated messages"
fedarko added a commit that referenced this issue Jun 6, 2019
progress on #92 (obvs need more thorough testing of corner cases)
and on #85 (or at least on making clear the amount of work left to
be done, which is nontrivial).
fedarko added a commit that referenced this issue Jun 7, 2019
fedarko added a commit that referenced this issue Jun 7, 2019
fedarko added a commit that referenced this issue Jun 7, 2019
looks like we're approaching a solution to #92
fedarko added a commit that referenced this issue Jun 7, 2019
The newly added tests focus on weird string stuff in nominal
scale types.

A TODO is making sure that there's parity between what
getValidSamples() accepts and what Vega* accepts. From doing some
playing around in the vega editor, I think that the categorical
stuff we accept (incl. stuff like " ", "undefined", ...) is also
accepted by Vega, but there are some things getValidSamples() doesn't
accept that Vega does. Including:

"Infinity" and "-Infinity" on quantitative scales (this throws off
everything)

"" on nominal scales (this should be the one sample metadata value
that we don't display, if anything, due to Q2 metadata standards)

We might be able to handle this sorta stuff by messing around in
makeSamplePlot() to look at the scale type and filter values
accordingly. MIIGHT be worth making another issue for that, i'm not
sure.

So this is getting closer, at least!
fedarko added a commit that referenced this issue Jun 8, 2019
i.e. lists of dropped sample IDs instead of just the count of
dropped sample IDs. Will make finishing up #92 easier
fedarko added a commit that referenced this issue Jun 8, 2019
Same basic logic, just reversed. This will make implementing #92
simpler.

....why did i do it this way at the start
fedarko added a commit that referenced this issue Jun 10, 2019
fedarko added a commit that referenced this issue Jun 10, 2019
fedarko added a commit that referenced this issue Jun 11, 2019
This closes #116 (the new metadata tests ensure this is solved).

This also brings us like 90% of the way to being done with #101.
What we should also do is filter out just-whitespace data, I guess,
to be consistent with QIIME 2.

Now, we should add tests that the Vega* filtering reflects things
properly (NaNs/nulls/Infinities are being filtered). Should be
doable to explicitly set this in Altair.

Once we've verified that the filtering matches the behavior, we'll
be done with #92 and we can merge this branch back in to master.

Also I just spent like five hours making pandas play nicely with NaNs
and now my head hurts

This commit changed a lot of stuff. The bulk of it was making the tests
adapt to the new object-values stuff, as well as testing the new
functionality.

Would be worth testing that differentials' indices are parsed properly,
I guess.
fedarko added a commit that referenced this issue Jun 12, 2019
This sums up the programming work for #92, mostly. What I want to
do now is add tests, etc. to make sure
RRVDisplay.updateSamplePlotFilters() is working properly before I
merge this back in. Also thinking about maybe working on #66 while I'm
at it.

Probably a good time to take care of #85 et al. also?
fedarko added a commit that referenced this issue Jun 16, 2019
This is super exciting. Should be doable to make a new JS test that
loads fancy JSONs with various weird metadata vals and asserts that the
DOM is being updated correctly, although I guess it isn't *super*
high priority since that side of things is pretty decently unit tested
already.
fedarko added a commit that referenced this issue Jun 17, 2019
Might as well, I guess!

TODO--beef up test
fedarko added a commit that referenced this issue Jun 17, 2019
previously only the main div was getting filled in -- now all
of them have some text added and the "invisible" class added, so we
know that the test is actually checking something.
fedarko added a commit that referenced this issue Jun 17, 2019
In the immediate future, we can add more tests that actually check
that the enforced Filters match up with the getInvalidSampleIDs()
decisions. That should be good enough to take care of #92, which will
let us merge this back in.
fedarko added a commit that referenced this issue Jun 17, 2019
Next up: fulfilling #163 so that we can move the resulting JSONs to
a JS test file, and then ensure that stuff works as expected.

NOTE that I also modified the integration test for q2-moving-pictures
to actually register itself as a q2 integration test.
fedarko added a commit that referenced this issue Jun 18, 2019
Related changes along the way:

-Added tests of the newly created function (try_to_replace_line_json)

-Added a json_prefix argument to replace_plot_json_definitions()
 and try_to_replace_line_json() that should let us set up a #92
 integration test super soon. From there, all we'd need to do is
 actually write the JS side of the test, and... then we'd be good!
fedarko added a commit that referenced this issue Jun 19, 2019
This'll let us set up an integration test for #92 and thereby let us
actually FINALLY finish things up here
fedarko added a commit that referenced this issue Jun 19, 2019
This should at least ensure that all the weird stuff in the SST
sample metadata file makes it to JS ok. The one test I have added
is essentially a copy of
testing_utilities.validate_sample_stats_test_sample_plot_json(), but
from the JS side of things.

This is already pretty good, but we can do even better. IMO, the main
thing we need to do is to test that filters created for this test's
RRVDisplay object match the decisions of getInvalidSampleIDs().

Once that's done I guess we're ready to finally merge this stuff back
in?
fedarko added a commit that referenced this issue Jun 19, 2019
TODO -- add more (see commented out stuff). for some reason the
state of rrv seems to persist between tests... I'll try to figure out
a solution tomorrow.
fedarko added a commit that referenced this issue Jun 20, 2019
-Now, get_jsons() has a json_prefix argument analogous to the one for
 replace_js_json_definitions(). This lets us "pick our battles" --
 if we can't find any "var SST...JSON = {"-ish definitions in a
 file, we won't bother trying to replace any of them.
    -This was causing problems earlier because the code was
     constantly replacing the #92 JS integration test file because it
     kept finding Nones for all of the JSONs in there (since it wasn't
     passing the SST prefix to get_jsons()). So that was a pain.
    -also, code should be generally faster now. It's still kind of
     inefficient (because really the line numbers get_jsons() gets to
     should be shared with replace_...() to reduce the number of times
     a file is gone over), but it *works* so that's really all I care
     about for this particular side of things.

-Also, the t[a : b] syntax apparently makes flake8 angry. So I
 disabled E203 in the Makefile's invocation of flake8.
 (I also disabled W503, which kept popping up when I disabled E203.
 From some cursory googling, it looks like both of these are safe
 to disable when using flake8 and black on the same codebase.)
fedarko added a commit that referenced this issue Jun 20, 2019
Turns out the bug with "rrv" (actually with the JSON) state was
because I need to parse/stringify them before passing them. whoops.
(That's the sort of thing that #162 should address in the future.)

So I imagine these filtering tests as being sort of like a table
for a boolean operator:

C/C
C/Q
Q/C
Q/Q

So far we've got C/C (both x-axis and color categorical) and, now, Q/C
(x-axis quantitative and color categorical) tested. Just need to
add the other two soon.

After that we need to verify that getInvalidSampleIDs() matches up
with these filters. Then... we're good!
fedarko added a commit that referenced this issue Jun 20, 2019
After this, I just want to add comparisons to the results of
getInvalidSampleIDs(). Then... we're done!
fedarko added a commit that referenced this issue Jun 20, 2019
A way of ensuring that our decisions match up with the Vega*
filters we create. Close to #92 being taken care of -- just gotta
do this for the other three filter tests.
fedarko added a commit that referenced this issue Jun 20, 2019
fedarko added a commit that referenced this issue Jun 21, 2019
(We still filter out samples that are empty after -x removes their
only associated features.)

I figure making this explicit is sort of needed to really have #92
addressed.

Just want to add a couple of tests for this...
fedarko added a commit that referenced this issue Jun 22, 2019
Still gotta actually clean up Qurro code somewhat.
fedarko added a commit to fedarko/qurro that referenced this issue Jul 15, 2019
JS is weird, but we're handling things properly. biocore#92

(This was more that I forgot how this part of the code worked and started
freaking out over it being broken until I realized it wasn't broken lol)
fedarko added a commit to fedarko/qurro that referenced this issue Jul 15, 2019
This issue was closed.