Show "removed" samples from the sample plot / Give some indication of why #92

Closed
fedarko opened this issue Apr 27, 2019 · 5 comments · Fixed by #166
Assignee: fedarko
Labels: enhancement (New feature or request), important (Things that are critical for getting Qurro in a working/useful state)

Comments

fedarko commented Apr 27, 2019

The solution with #91 provides us with a way to ensure that, e.g., only samples with a numerical metadata value are present in the scatterplot. However, it'd be a good idea to also inform the users of how many samples were excluded due to invalid-ish values.

A really cool example of this is this Vega example, where nulls are represented as gray points along the axes. (Also see a simpler version of the example for Vega-Lite.) However, we don't necessarily need fancy solutions like these for our purposes; I'd be fine having just a little text box or something next to the sample plot that contains the number of "dropped" samples. This could be as simple as the "type": "text" mark on the Vega example above—just count the number of invalid values and say something like {} samples with missing values.
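For concreteness, here's a minimal sketch of that "just show a count" idea as a Vega (v5-ish) spec fragment, written as a JS object literal. The dataset and field names (qurro_sample_data, xField) are placeholders for illustration, not Qurro's actual spec:

    // Hypothetical Vega spec fragment: derive a one-row dataset that counts the
    // samples whose x-axis field isn't a usable number, then display that count
    // with a "type": "text" mark. All names here are illustrative only.
    const droppedCountFragment = {
        data: [
            {
                name: "dropped_samples",
                source: "qurro_sample_data", // placeholder name for the main sample dataset
                transform: [
                    // keep only samples with a missing / non-numeric x-axis value
                    { type: "filter", expr: "!isValid(toNumber(datum.xField))" },
                    // an aggregate with no fields yields a single row with a "count" field
                    { type: "aggregate" }
                ]
            }
        ],
        marks: [
            {
                type: "text",
                from: { data: "dropped_samples" },
                encode: {
                    update: {
                        x: { value: 5 },
                        y: { value: 15 },
                        text: { signal: "datum.count + ' samples with missing values'" }
                    }
                }
            }
        ]
    };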

In a sense, this functionality would work double duty—it'd inform the user both about which samples don't have the features contained in the log ratio and about which samples don't have valid metadata for the current x-axis.

@fedarko fedarko added the enhancement New feature or request label Apr 27, 2019
@fedarko fedarko self-assigned this Apr 27, 2019
@fedarko fedarko changed the title Show "removed" samples from the sample plot Show "removed" samples from the sample plot / Give some indication of why May 8, 2019
fedarko commented May 8, 2019

Also, certain color fields can result in samples being dropped. IMO this is a very important consideration.

Right now all of the samples that are passed to the visualization are included in the "export data" output (EXCEPT for samples with an invalid log ratio, i.e. due to missing certain features in the log ratio). We should either make this consistent—and include samples with an invalid log ratio alongside samples with invalid metadata fields—OR we should only include "currently drawn" samples, I guess? In any case, this should be super explicit.

@fedarko fedarko added the important Things that are critical for getting Qurro in a working/useful state label May 8, 2019
fedarko commented May 28, 2019

It'd be cool to show either a tally or a percentage of samples dropped (latter idea c/o Lisa).

fedarko added a commit that referenced this issue May 30, 2019
This is a solid start.

We'd want to also log the number of samples dropped due to having
a non-numeric value on a quantitative scale (either for color or
x-axis). I *think* we could do that by running some similar code
in updateSamplePlotField()/updateSamplePlotScale(), and maybe adding
some logic to ensure that each sample is just counted once (i.e. both
null and non-numeric-metadata-on-a-quant-scale samples would be
dropped), but I'm not sure. Should be doable tho :)
fedarko added a commit that referenced this issue May 30, 2019
these are also logged w/ the NaN-drop ct.

In the future, we should also keep track of how many samples were
dropped in the Python side of things, so the user knows *exactly*
what samples have gone where.

Ideally we should be as explicit as possible with this
fedarko commented Jun 1, 2019

Here's a plan of how the code for displaying this would work:

  • There should be four divs below the sample plot (i.e. just below the boxplot toggle).
    • The first one of these should be updated every time the feature selection(s) change, to let the user know how many samples are not shown in the plot due to having a NaN balance (i.e. 0 in either/both of the numerator/denominator).
      • There's already code in Qurro that computes the number of dropped samples, so this is super feasible.
    • The second one of these should be updated every time the x-axis scale type changes, or every time the x-axis field changes. This should go through all samples to figure out how many samples will not be shown in the plot due to not being treatable as a number (I think we can just use vega.toNumber), and just let the user know how many samples are affected by this.
    • The third one is basically the same as the x-axis one, but for the color field stuff. We should be able to abstract out almost all of the code for this. (And in the future, if we add more encodings, it should be simple to extend this.)

We should store lists of sample IDs that have been excluded by any of these three metrics, so that we can say in the fourth div how many samples are currently shown in the sample plot at all (similar to Emperor's top-left-corner box). We can get the total number of dropped samples by computing the union of the three "excluded samples" lists, which shouldn't take that much time.

The number of shown samples is just # samples - # dropped samples, then.

Also, if any of the divs saying why a sample has been dropped correspond to 0 samples being dropped (e.g. all of the x-axis fields of all samples are quantitative), then hide those divs I guess.
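A rough sketch of the bookkeeping this plan implies, in plain JS. The div IDs, the droppedSampleIDs structure, and the function names are placeholders for illustration, not existing Qurro code:

    // Hypothetical bookkeeping for the "four divs" plan: one array of excluded
    // sample IDs per reason; the union gives the total number of dropped samples.
    const droppedSampleIDs = {
        balance: [], // NaN log ratio (0 in the numerator and/or denominator)
        xAxis: [],   // non-numeric value on a quantitative x-axis scale
        color: []    // non-numeric value on a quantitative color scale
    };

    function updateDroppedReasonDiv(divID, droppedIDs, reasonText) {
        const div = document.getElementById(divID);
        if (droppedIDs.length === 0) {
            // hide divs corresponding to 0 dropped samples
            div.classList.add("invisible");
        } else {
            div.classList.remove("invisible");
            div.textContent = droppedIDs.length + " sample(s) dropped " + reasonText;
        }
    }

    function updateMainSampleShownDiv(totalSampleCount) {
        // Union of the three "excluded samples" lists, so a sample dropped for
        // more than one reason is only counted once.
        const allDropped = new Set([
            ...droppedSampleIDs.balance,
            ...droppedSampleIDs.xAxis,
            ...droppedSampleIDs.color
        ]);
        const shownCount = totalSampleCount - allDropped.size;
        document.getElementById("mainSamplesShownDiv").textContent =
            shownCount + " / " + totalSampleCount + " samples shown";
    }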

fedarko added a commit that referenced this issue Jun 3, 2019
Progress towards #92. I guess this is ~ 1/4 of the work there.
fedarko added a commit that referenced this issue Jun 3, 2019
Should be pretty simple to use this in other contexts
fedarko added a commit that referenced this issue Jun 4, 2019
fedarko added a commit that referenced this issue Jun 4, 2019
fedarko commented Jun 5, 2019

I've spent some time looking into whether it'd be possible to get Vega/Vega-Lite to do some of this work for us, perhaps with the "valid" / "missing" aggregate operations. But I haven't been able to find much documentation on how those work.

However, I think we can do this by patching the underlying Vega spec. Advantages of this:

  1. Minimizes the amount of code I have to write
  2. Since we're making Vega do the work for us, less chance of things getting outdated when the Vega* versions used are updated
  3. Lets us embed the number of missing samples into the plot directly

Disadvantages:

  1. This might cause a lack of granularity? -> actually nah I think we can just define multiple data sources that filter the samples in different ways so we can include multiple counts, like I'd planned
  2. Might have a slightly heavier memory footprint due to potentially storing multiple copies of the sample data (if we define multiple datasets), but idk
  3. Inherently less flexible than doing it ourselves in JS/HTML (e.g. just displaying a percentage in a Vega text mark is going to take me at least 30 minutes to figure out)
  4. We'll need to keep patching the Vega spec whenever we call vegaEmbed, which is doable but could get really annoying
  5. I have no idea how this will work with boxplots

Basically, the way we could do this is similar to how the nullXY data source is created and used in the Vega example here. We can define a Vega dataset that takes its data from the main dataset, but instead of filtering out data points that are null / NaN on the x-axis, y-axis, or (color?) fields, we filter to just the data points that are null or NaN on those fields.
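Roughly, the kind of derived dataset meant here, sketched as a JS object (the dataset and field names are placeholders; the real spec is whatever our Vega-Lite output compiles to):

    // Hypothetical derived Vega dataset: instead of dropping the null/NaN points,
    // keep ONLY the points that are null or NaN on the x-axis or color fields, so
    // they can be drawn separately (e.g. as gray points along the axes) or counted.
    const invalidSamplesDataset = {
        name: "invalid_samples",
        source: "qurro_sample_data", // placeholder name for the main sample dataset
        transform: [
            {
                type: "filter",
                // the inverse of the usual "keep only valid points" filter
                expr: "!isValid(toNumber(datum.xField)) || !isValid(toNumber(datum.colorField))"
            }
        ]
    };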

This will necessitate some work in order to ensure that—

  • every time a scale type is changed to categorical, the NaN checks are removed (I think?)
  • every time a scale type is changed to quantitative, the NaN checks are added back in
  • every time a scale field is changed, the fields being checked are updated
    Although I think we can just add some code (probs another function) into makeSamplePlot() that handles all of this adequately. The code footprint of this solution probably wouldn't be thaaaat heavy.
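That "another function" could look something like the following sketch, which just rebuilds the filter expression whenever makeSamplePlot() is called with new encodings (assumed signature and behavior, for illustration only):

    // Hypothetical helper: rebuild the "invalid samples" filter expression based on
    // the current encodings, so the NaN checks only apply to quantitative scales and
    // always point at the currently selected fields.
    function makeInvalidSampleFilterExpr(xField, xScaleType, colorField, colorScaleType) {
        const clauses = [];
        if (xScaleType === "quantitative") {
            clauses.push("!isValid(toNumber(datum['" + xField + "']))");
        }
        if (colorScaleType === "quantitative") {
            clauses.push("!isValid(toNumber(datum['" + colorField + "']))");
        }
        // If neither encoding is quantitative, no samples are dropped for this reason.
        return clauses.length > 0 ? clauses.join(" || ") : "false";
    }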

Here's a Vega spec that shows something like this in action on the sleep apnea dataset.

I'm not super sure about whether I prefer this or the manual (i.e. using custom HTML and JS) solution. Will think about it.

If we go with the Vega solution, we should make sure to attribute the aforementioned scatter plot null values example in the code.

fedarko added a commit that referenced this issue Jun 6, 2019
This is tentative -- will add tests soon
fedarko commented Jun 6, 2019

Yeah, I think it's going to be easiest to do this manually (without messing around with the generated Vega spec).

Added getValidSamples(), which should be a solid way to figure out omitted samples for a given field. TODOs are:

  • unit-test getValidSamples()
  • integrate getValidSamples() with makeSamplePlot() (updating the x-axis and color field divs).
  • Use all dropped-sample information to compute a total # of samples shown (calling updateMainSampleShownDiv() accordingly).
  • end-to-end tests that this works in practice (once #85, "Explicitly wait for Vega plots to be created before enabling various interactive things", is addressed).
  • account for weird crap like "Infinity" or "-Infinity" in quantitative scales, or "" in nominal scales. Either filter these out in makeSamplePlot() (saving a log of their filtering, so they can be added back in when the scale is changed) or just accept the weirdness and update the documentation (this esp. might be more feasible for the infinity problem).
    • I think handling #101 ("Properly handle missing values in sample/feature metadata") before this is the best course of action, so that all metadata vals are parsed as strings and ""s are converted to Nones in Python (i.e. null in JS). Then we can add tests that verify that this works in Python, and that the sample-dropping stats in JS are also handled properly.
  • Add tests for the aforementioned weird stuff. This will probably tie in with #101. We want to make sure all of our decisions in RRVDisplay.getInvalidSampleIDs() match up perfectly with Vega-Lite, and I'm not confident that's the case yet. Need to test that all of these assumptions hold.
    • One alarming thing is that nulls are apparently only filtered by Vega-Lite when the scale is quantitative -- they're fair game in nominal scales. Need to account for this one way or another; it should be doable to add a filter transform to the sample plot that removes nulls for the x and color encoding fields, and to update it whenever the current encoding is changed.
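For reference, a rough sketch of the sort of per-field check getValidSamples() is meant to do (assumed behavior for illustration, not the actual implementation; it uses vega.toNumber as suggested above, which returns null for null/empty input and NaN for non-numeric strings):

    // Hypothetical version of the check: given sample metadata, a field, and the
    // current scale type, return the IDs of samples whose value for that field
    // should actually be drawn.
    function getValidSamples(sampleID2Metadata, field, scaleType) {
        const validSampleIDs = [];
        for (const [sampleID, metadata] of Object.entries(sampleID2Metadata)) {
            const value = metadata[field];
            if (scaleType === "quantitative") {
                // vega.toNumber(null) and vega.toNumber("") are null; vega.toNumber("abc")
                // is NaN. Either way the sample is dropped from a quantitative scale.
                // (Note: "Infinity" / "-Infinity" would still sneak through here; see the
                // TODO above.)
                const num = vega.toNumber(value);
                if (num !== null && !Number.isNaN(num)) {
                    validSampleIDs.push(sampleID);
                }
            } else {
                // Per the note above, Vega-Lite leaves nulls alone on nominal scales, so
                // if we want them gone we have to treat them as invalid here (and add a
                // matching filter transform to the plot).
                if (value !== null && value !== undefined) {
                    validSampleIDs.push(sampleID);
                }
            }
        }
        return validSampleIDs;
    }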

fedarko added a commit that referenced this issue Jun 6, 2019
"sty" --> "various small stylistic tweaks, e.g. in the
updateSampleDroppedDiv() generated messages"
fedarko added a commit that referenced this issue Jun 6, 2019
progress on #92 (obvs need more thorough testing of corner cases)
and on #85 (or at least on making clear the amount of work left to
be done, which is nontrivial).
fedarko added a commit that referenced this issue Jun 7, 2019
fedarko added a commit that referenced this issue Jun 7, 2019
fedarko added a commit that referenced this issue Jun 7, 2019
looks like we're approaching a solution to #92
fedarko added a commit that referenced this issue Jun 7, 2019
The newly added tests focus on weird string stuff in nominal
scale types.

A TODO is making sure that there's parity between what
getValidSamples() accepts and what Vega* accepts. From doing some
playing around in the vega editor, I think that the categorical
stuff we accept (incl. stuff like " ", "undefined", ...) is also
accepted by Vega, but there are some things getValidSamples() doesn't
accept that Vega does. Including:

"Infinity" and "-Infinity" on quantitative scales (this throws off
everything)

"" on nominal scales (this should be the one sample metadata value
that we don't display, if anything, due to Q2 metadata standards)

We might be able to handle this sorta stuff by messing around in
makeSamplePlot() to look at the scale type and filter values
accordingly. MIIGHT be worth making another issue for that, i'm not
sure.

So this is getting closer, at least!
fedarko added a commit that referenced this issue Jun 8, 2019
i.e. lists of dropped sample IDs instead of just the count of
dropped sample IDs. Will make finishing up #92 easier
fedarko added a commit that referenced this issue Jun 8, 2019
Same basic logic, just reversed. This will make implementing #92
simpler.

....why did i do it this way at the start
fedarko added a commit that referenced this issue Jun 10, 2019
fedarko added a commit that referenced this issue Jun 10, 2019
fedarko added a commit that referenced this issue Jun 11, 2019
This closes #116 (the new metadata tests ensure this is solved).

This also brings us like 90% of the way to being done with #101.
What we should also do is filter out just-whitespace data, I guess,
to be consistent with QIIME 2.

Now, we should add tests that the Vega* filtering reflects things
properly (NaNs/nulls/Infinities are being filtered). Should be
doable to explicitly set this in Altair.

Once we've verified that the filtering matches the behavior, we'll
be done with #92 and we can merge this branch back in to master.

Also I just spent like five hours making pandas play nicely with NaNs
and now my head hurts

This commit changed a lot of stuff. The bulk of it was making the tests
adapt to the new object-values stuff, as well as testing the new
functionality.

Would be worth testing that differentials' indices are parsed properly,
I guess.
fedarko added a commit that referenced this issue Jun 12, 2019
This sums up the programming work for #92, mostly. What I want to
do now is add tests, etc. to make sure
RRVDisplay.updateSamplePlotFilters() is working properly before I
merge this back in. Also thinking about maybe working on #66 while I'm
at it.

Probably a good time to take care of #85 et al. also?
fedarko added a commit that referenced this issue Jun 16, 2019
This is super exciting. Should be doable to make a new JS test that
loads fancy JSONs with various weird metadata vals and asserts that the
DOM is being updated correctly, although I guess it isn't *super*
high priority since that side of things is pretty decently unit tested
already.
fedarko added a commit that referenced this issue Jun 17, 2019
Might as well, I guess!

TODO--beef up test
fedarko added a commit that referenced this issue Jun 17, 2019
previously only the main div was getting filled in -- now all
of them have some text added and the "invisible" class added, so we
know that the test is actually checking something.
fedarko added a commit that referenced this issue Jun 17, 2019
In the immediate future, we can add more tests that actually check
that the enforced Filters match up with the getInvalidSampleIDs()
decisions. That should be good enough to take care of #92, which will
let us merge this back in.
fedarko added a commit that referenced this issue Jun 17, 2019
Next up: fulfilling #163 so that we can move the resulting JSONs to
a JS test file, and then ensure that stuff works as expected.

NOTE that I also modified the integration test for q2-moving-pictures
to actually register itself as a q2 integration test.
fedarko added a commit that referenced this issue Jun 18, 2019
Related changes along the way:

-Added tests of the newly created function (try_to_replace_line_json)

-Added a json_prefix argument to replace_plot_json_definitions()
 and try_to_replace_line_json() that should let us set up a #92
 integration test super soon. From there, all we'd need to do is
 actually write the JS side of the test, and... then we'd be good!
fedarko added a commit that referenced this issue Jun 19, 2019
This'll let us set up an integration test for #92 and thereby let us
actually FINALLY finish things up here
fedarko added a commit that referenced this issue Jun 19, 2019
This should at least ensure that all the weird stuff in the SST
sample metadata file makes it to JS ok. The one test I have added
is essentially a copy of
testing_utilities.validate_sample_stats_test_sample_plot_json(), but
from the JS side of things.

This is already pretty good, but we can do even better. IMO, the main
thing we need to do is to test that filters created for this test's
RRVDisplay object match the decisions of getInvalidSampleIDs().

Once that's done I guess we're ready to finally merge this stuff back
in?
fedarko added a commit that referenced this issue Jun 19, 2019
TODO -- add more (see commented out stuff). for some reason the
state of rrv seems to persist between tests... I'll try to figure out
a solution tomorrow.
fedarko added a commit that referenced this issue Jun 20, 2019
-Now, get_jsons() has a json_prefix argument analogous to the one for
 replace_js_json_definitions(). This lets us "pick our battles" --
 if we can't find any "var SST...JSON = {"-ish definitions in a
 file, we won't bother trying to replace any of them.
    -This was causing problems earlier because the code was
     constantly replacing the #92 JS integration test file because it
     kept finding Nones for all of the JSONs in there (since it wasn't
     passing the SST prefix to get_jsons()). So that was a pain.
    -also, code should be generally faster now. It's still kind of
     inefficient (because really the line numbers get_jsons() gets to
     should be shared with replace_...() to reduce the number of times
     a file is gone over), but it *works* so that's really all I care
     about for this particular side of things.

-Also, the t[a : b] syntax apparently makes flake8 angry. So I
 disabled E203 in the Makefile's invocation of flake8.
 (I also disabled W503, which kept popping up when I disabled E203.
 From some cursory googling, it looks like both of these are safe
 to disable when using flake8 and black on the same codebase.)
fedarko added a commit that referenced this issue Jun 20, 2019
Turns out the bug with "rrv" (actually with the JSON) state was
because I need to parse/stringify them before passing them. whoops.
(That's the sort of thing that #162 should address in the future.)

So I imagine these filtering tests as being sort of like a table
for a boolean operator:

C/C
C/Q
Q/C
Q/Q

So far we've got C/C (both x-axis and color categorical) and, now, Q/C
(x-axis quantitative and color categorical) tested. Just need to
add the other two soon.

After that we need to verify that getInvalidSampleIDs() matches up
with these filters. Then... we're good!
fedarko added a commit that referenced this issue Jun 20, 2019
After this, I just want to add comparisons to the results of
getInvalidSampleIDs(). Then... we're done!
fedarko added a commit that referenced this issue Jun 20, 2019
A way of ensuring that our decisions match up with the Vega*
filters we create. Close to #92 being taken care of -- just gotta
do this for the other three filter tests.
fedarko added a commit that referenced this issue Jun 20, 2019
fedarko added a commit that referenced this issue Jun 21, 2019
(We still filter out samples that are empty after -x removes their
only associated features.)

I figure making this explicit is sort of needed to really have #92
addressed.

Just want to add a couple of tests for this...
fedarko added a commit that referenced this issue Jun 22, 2019
Still gotta actually clean up Qurro code somewhat.
fedarko added a commit to fedarko/qurro that referenced this issue Jul 15, 2019
JS is weird, but we're handling things properly. biocore#92

(This was more that I forgot how this part of the code worked and started
freaking out over it being broken until I realized it wasn't broken lol)
fedarko added a commit to fedarko/qurro that referenced this issue Jul 15, 2019
This issue was closed.