Misleading counts #31

Open
geotheory opened this issue Jan 26, 2022 · 4 comments

geotheory commented Jan 26, 2022

The data doesn't add up as far as I can see:

library(tidyverse)
library(ggupset)

# One row per title; lower-case and de-duplicate each movie's genre list,
# then keep movies that have at least one of drama, comedy or romance
d = tidy_movies |> filter(!duplicated(title)) |>
  select(title, Genres) |>
  mutate(Genres = map(Genres, tolower) |> map(unique),
         str = map_chr(Genres, paste, collapse=',')) |>
  filter(str_detect(str, '(drama)|(comedy)|(romance)'))

# Bar per genre combination, labelled with its count
d |> ggplot(aes(x = Genres)) + geom_bar() +
  geom_text(stat = 'count', aes(label = after_stat(count)), nudge_y = 50) +
  scale_x_upset(sets = c("drama", "comedy", "romance"))

[image: the resulting UpSet plot]

d |> filter(str_detect(str, 'drama'), str_detect(str, 'comedy')) |> nrow()
#> [1] 265
d |> filter(str_detect(str, 'drama'), str_detect(str, 'comedy')) |> count(str)
#> # A tibble: 5 × 2
#>   str                            n
#>   <chr>                      <int>
#> 1 action,comedy,drama            9
#> 2 comedy,drama                 180
#> 3 comedy,drama,romance          68
#> 4 comedy,drama,romance,short     2
#> 5 comedy,drama,short             6

The plot shows drama + comedy as 195, whereas the count for exactly comedy,drama is 180. It seems you are lumping in the categories that were not manually selected for the plot with the sets argument (195 = 180 + 9 + 6, i.e. the action and short combinations that lack romance). But if you do this then the package is being inconsistent, because when the sets argument is omitted the categories are fully exclusive. In fact the real drama + comedy intersection, counting every movie that has both genres, is 265.
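The three figures in play (180, 195 and 265) can be reproduced from the count table above. This is only a sketch in base R using the five rows copied from that output; the names `combos`, `has`, `exact`, `excl_plotted` and `inclusive` are made up for illustration, and it does not show what ggupset does internally, only that the plotted 195 is consistent with ignoring genres outside `sets`.

```r
# Combinations containing both drama and comedy, copied from count(str) above
combos <- c(
  "action,comedy,drama"        = 9,
  "comedy,drama"               = 180,
  "comedy,drama,romance"       = 68,
  "comedy,drama,romance,short" = 2,
  "comedy,drama,short"         = 6
)
genres <- strsplit(names(combos), ",", fixed = TRUE)

# TRUE for each combination that contains genre g
has <- function(g) vapply(genres, function(x) g %in% x, logical(1))

# Exactly comedy + drama and nothing else
exact <- sum(combos[lengths(genres) == 2])

# Comedy + drama without romance, ignoring genres outside the plotted sets
excl_plotted <- sum(combos[!has("romance")])

# Every movie that has both comedy and drama, whatever else it has
inclusive <- sum(combos)

c(exact = exact, excl_plotted = excl_plotted, inclusive = inclusive)
```

The middle figure is exclusive only relative to the three plotted sets, which is why it sits between the other two.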

@const-ae
Owner

Hey, thanks for the report. That is indeed concerning. I will try to take a look later today and see what is going wrong there.

@geotheory
Author

Do you have appetite for adding a mode option to switch between exclusive and non-exclusive aggregation? I love Upset plots but my main criticism is that they can be extremely misleading. If the aim is to visualise the size of an intersection between two sets (i.e. a "better Venn"), they mislead when the true intersection is split across multiple bars, some of which may be pushed off-screen by a cap on the number of bars. I think making this explicit in the function's options would raise awareness of the problem as well as offer a solution.
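To make the request concrete, here is a rough base R sketch of what such a mode switch could compute. `count_intersection()` and its `mode` argument are hypothetical and not part of ggupset; "exclusive" is taken to mean exclusive relative to the plotted `sets`, which is an assumption about the desired semantics.

```r
# Hypothetical helper (NOT a ggupset function): count observations for one
# query combination, either exclusively or inclusively.
#   memberships: list of character vectors, one per observation
#   query:       the combination of interest, e.g. c("drama", "comedy")
#   sets:        the sets shown in the plot (assumed to contain query)
count_intersection <- function(memberships, query, sets = query,
                               mode = c("exclusive", "inclusive")) {
  mode <- match.arg(mode)
  hits <- vapply(memberships, function(x) {
    if (mode == "inclusive") {
      # member of every queried set, other memberships ignored
      all(query %in% x)
    } else {
      # among the plotted sets, member of exactly the queried ones
      setequal(intersect(x, sets), query)
    }
  }, logical(1))
  sum(hits)
}

# Toy data, invented for illustration
movies <- list(
  c("comedy", "drama"),
  c("comedy", "drama", "romance"),
  c("action", "comedy", "drama"),
  c("drama"),
  c("comedy", "short")
)
sets <- c("drama", "comedy", "romance")

n_exclusive <- count_intersection(movies, c("drama", "comedy"), sets, "exclusive")
n_inclusive <- count_intersection(movies, c("drama", "comedy"), sets, "inclusive")
```

On the toy data the exclusive count leaves out the comedy,drama,romance movie while the inclusive count keeps it, which is the discrepancy described above.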

@z3tt

z3tt commented Jan 3, 2024

Is this fixed by now? I never noticed anything wrong, but from now on I will be much more careful. It would be good to know whether one can simply rely on the calculations made by the package.

@geotheory
Author

geotheory commented Jan 4, 2024

@z3tt - No, it's still as it was. Even if/when this is resolved, I feel this viz method is pretty problematic to use without clearly caveating the exclusionary nature of its summarisations (i.e. its XY figure excludes observations that are XYZ). If your interest is in the true XY figure (including observations of XYZ), then you need an alternative workflow, maybe like this (but note it only summarises pairwise genre intersections, i.e. it omits single-genre movies):

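# All unordered genre pairs for one movie; movies with fewer than two genres yield no rows
expand_genres = function(x){
  if(length(x) < 2) return(tibble(x = character(0), y = character(0)))
  expand_grid(x = x, y = x) |> filter(y > x)  # keep each pair once, alphabetically ordered
}

# Number of movies containing each pair, regardless of any further genres
purrr::map_df(d$Genres, expand_genres) |> count(x, y, sort = TRUE)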
#> # A tibble: 20 × 3
#>    x           y               n
#>    <chr>       <chr>       <int>
#>  1 comedy      short         303
#>  2 comedy      drama         265
#>  3 drama       romance       243
#>  4 comedy      romance       206
#>  5 animation   comedy        167
#>  6 animation   short         159
#>  7 action      drama         148
#>  8 drama       short          75
#>  9 action      comedy         59
#> 10 action      romance        24
#> 11 comedy      documentary    13
#> 12 documentary drama          10
#> 13 romance     short          10
#> 14 action      short           6
#> 15 animation   romance         5
#> 16 documentary short           3
#> 17 animation   documentary     2
#> 18 animation   drama           2
#> 19 action      animation       1
#> 20 documentary romance         1
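If the per-genre totals are wanted alongside the pairs, one possible variation is a co-occurrence matrix. This is a base R sketch on invented toy data, not a drop-in replacement for the workflow above: off-diagonal cells hold the inclusive pair counts, and the diagonal holds the number of movies in each genre.

```r
# Toy data, invented for illustration
movies <- list(
  c("comedy", "drama"),
  c("comedy", "drama", "romance"),
  c("drama"),
  c("comedy", "short")
)
genres <- sort(unique(unlist(movies)))

# Incidence matrix: one row per movie, one column per genre, 1 = movie has genre
m <- do.call(rbind, lapply(movies, function(x) as.integer(genres %in% x)))
colnames(m) <- genres

# t(m) %*% m: cell [a, b] counts the movies that have both genre a and genre b;
# the diagonal [a, a] counts all movies that have genre a
co <- crossprod(m)

co["comedy", "drama"]  # inclusive comedy + drama count
co["drama", "drama"]   # all movies tagged drama
```

With real data, `movies` would be the `d$Genres` list column from the first post.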
