This issue was moved to a discussion.

[SIP] Propose visualizations based on data #12724

Closed
wernerdaehn opened this issue Jan 25, 2021 · 8 comments
Labels
enhancement:request Enhancement request submitted by anyone from the community explore:control Related to the controls panel of Explore explore:dataset Related to the dataset of Explore explore:design Related to the Explore UI/UX sip Superset Improvement Proposal

Comments

@wernerdaehn

[SIP] Propose visualizations based on data

Motivation

I have worked for Business Objects and SAP and have been in the business intelligence market for more than 20 years. One thing that is still unsatisfying is how charting options are chosen.
Over time, the number of available charts and their variants will keep growing, and selecting from a long list is cumbersome. Also, not everybody knows every visualization option for every case.
But given that Superset has a semantic layer, the visualizations can be preselected.

Example: 2 Attributes & 2 Measures? Very likely a Pie Chart will not be the proper visualization.

There is an entire academic theory about different axis types (nominal scale, ordinal scale, interval scale, ratio scale), for example. If you are interested, we can work on the details.

Proposed Change

  1. Collect more metadata about attributes: Number of distinct values, what axis type it can be used for,...
  2. Define the aggregation type of a measure and if it is semi-additive
  3. For each charting option and variant, specify a rank for how useful it is, based on the number of attributes, number of measures, axis type of the attribute, and measure type.
  4. Order the charting options based on an overall rank
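
A minimal sketch of how steps 3 and 4 might look, assuming a hypothetical `ChartRule` structure; the chart names, count limits, and rank numbers are made up for illustration and are not part of Superset:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChartRule:
    """Hypothetical description of what a chart type can display."""
    name: str
    min_attributes: int
    max_attributes: int
    min_measures: int
    max_measures: int
    base_rank: int  # illustrative usefulness score; higher is better


# Assumed rules, for illustration only.
RULES = [
    ChartRule("pie", 1, 1, 1, 1, 50),
    ChartRule("bar", 1, 2, 1, 3, 80),
    ChartRule("scatter", 0, 1, 2, 3, 70),
    ChartRule("table", 0, 99, 0, 99, 10),
]


def rank_charts(n_attributes, n_measures, rules=RULES):
    """Return the names of fitting charts, best rank first."""
    scored = []
    for rule in rules:
        fits = (
            rule.min_attributes <= n_attributes <= rule.max_attributes
            and rule.min_measures <= n_measures <= rule.max_measures
        )
        if fits:
            scored.append((rule.base_rank, rule.name))
    # Sort by descending rank, then by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


print(rank_charts(2, 2))
```

With these assumed rules, the "2 Attributes & 2 Measures" case from the motivation no longer offers a pie chart.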

Please let me know if you are interested, and I will spend some time working out the details.

@junlincc
Member

junlincc commented Jan 25, 2021

Thanks for suggesting! @wernerdaehn

  1. Collect more metadata about attributes: Number of distinct values, what axis type it can be used for,...

It is aligned with our long-term product roadmap. In fact, when we implemented the new time picker in Superset, we thought about allowing users to query the earliest (min) and latest (max) time available in the timestamp dimension. We couldn't get to it by v1.0 because of potential performance issues and our time constraints. Collecting more dataset metadata is something we want to do once we get to refactoring the major control fields like metrics, filters, etc.

  2. Define the aggregation type of a measure and if it is semi-additive

Something we will consider. It will probably require 'thickening' our semantic layer in Superset, which would steepen the learning curve of Superset.

3 & 4.

Both are features available in Tableau. I agree they provide a nice user experience and enable non-technical users to create visualizations intuitively. We would love to get to both someday.

Screen.Recording.2021-01-25.at.1.26.05.AM.mov

@junlincc junlincc added enhancement:request Enhancement request submitted by anyone from the community explore:control Related to the controls panel of Explore explore:dataset Related to the dataset of Explore viz:explore:ux labels Jan 25, 2021
@junlincc
Member

@wernerdaehn if you would like to contribute any of the above items to Superset in any way, we would love to work with you!

@wernerdaehn
Author

@junlincc Thanks for the feedback. Just for the record, what Tableau does is just the very beginning!
See here for how wide the topic can get: https://datavizproject.com/

@wernerdaehn
Author

Any suggestions for what I can do for you in that regard? Otherwise I will try to come up with something to discuss, but I would love to get your guidance.

@ktmud
Member

ktmud commented Jan 25, 2021

Thanks for bringing up this topic! This definitely is an interesting area of work and has a lot of potential for Superset.

What you described is often called automated chart specification, or automated Exploratory Data Analysis (EDA), which is also a big topic among data visualization academics: https://github.com/mstaniak/autoEDA-resources

It would be tremendously valuable if we could somehow integrate the latest research findings into open source/commercial BI software.

This SIP is a good starting point, and it seems to have identified a couple of items we can already do. I’d recommend continuing to research this topic and starting to dig into the Superset codebase/architecture to form a more concrete action plan. We should at least be able to answer:

  1. What is possible and what is not?
  2. What is the MVP?
  3. Which APIs do we need to change or add?
  4. What other areas of work do we need to tackle before working on this? E.g., SIP-34 column stats looks like a must.

Some other useful links:

@rusackas
Member

I just wanted to chime in and say that I love this idea, and it's something that my team is starting to investigate more seriously. @wernerdaehn would you be interested in joining discussions (synchronously or otherwise) around this and being a part of implementing the solution? If not, I think we may need more clarification on the approaches to implementation and any risks/dependencies involved, as @ktmud was suggesting. In other words, I think this is a great idea for a SIP, but we need more details to be able to put it to a vote and carry it out effectively.

@wernerdaehn
Author

wernerdaehn commented Apr 22, 2021

@rusackas By all means, Evan! More than happy to contribute.

As a preliminary start, here is my thinking:

In statistics there are four types of measurement scales, ordered by capabilities:

  • Nominal: Only useful calculation is around counting. Example: Color.
  • Ordinal: additionally has an order. Example: User satisfaction 1-10. It is clear that 1 is better than 2, but the difference between 1 and 3 does not have the same meaning as the difference between 8 and 10.
  • Interval: additionally has a meaningful distance between two values. Example: Today it is 5°C warmer than yesterday.
  • Ratio: additionally has a true zero point, so relative comparisons make sense. Example: Revenue was 10% higher.
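
One way to encode "ordered by capabilities" is an ordered enum, where each scale supports every operation of the scales below it. This is only a sketch; the `Scale` enum, the operation names, and `supports` are hypothetical names chosen for illustration:

```python
from enum import IntEnum


class Scale(IntEnum):
    """The four scale types, ordered from least to most capable."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Assumed mapping: the lowest scale at which each operation is meaningful.
MIN_SCALE_FOR = {
    "count": Scale.NOMINAL,        # e.g. number of items per color
    "sort": Scale.ORDINAL,         # e.g. satisfaction 1-10
    "difference": Scale.INTERVAL,  # e.g. 5°C warmer than yesterday
    "ratio": Scale.RATIO,          # e.g. revenue 10% higher
}


def supports(scale, operation):
    """Return True if the operation is meaningful on the given scale."""
    return scale >= MIN_SCALE_FOR[operation]


print(supports(Scale.ORDINAL, "sort"))        # True
print(supports(Scale.ORDINAL, "difference"))  # False
```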

If somebody wants to visualize a nominal value and a ratio value, e.g. revenue per color, a bar chart is one of the few that makes sense. For two ratio values, e.g. revenue per customer age, a scatter plot is well suited.

The next type of decision is the number of axes.

  • If there is a single nominal axis, e.g. gender, the pie chart might be interesting to show the number of customers per gender.
  • If I want to visualize the revenue compared to the previous year revenue per country and time, I need a chart type that can show a ratio scale, a list of regions and the development over time. A geomap colored as a heatmap and a time animation would do the trick.

The type of axis can further be refined:

  • time: year, month, day, timestamp, week, weekday
  • geo
  • hierarchy

One side effect of these types is how to render missing values. A country without revenue should still be present (geomap) or not (bar chart). A month without revenue should still be shown; you do not want to see just 11 months.
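
To illustrate the two behaviors, here is a small sketch with hypothetical helper names. Filling an absent month with 0 is an assumption; a real implementation might prefer a null marker instead:

```python
def fill_missing_months(revenue_by_month, year, fill_value=0.0):
    """Time axis: always return all 12 months, filling gaps (assumed default 0)."""
    filled = {}
    for month in range(1, 13):
        key = f"{year}-{month:02d}"
        filled[key] = revenue_by_month.get(key, fill_value)
    return filled


def drop_missing_categories(revenue_by_category):
    """Plain categorical axis (e.g. bar chart): omit categories without a value."""
    return {k: v for k, v in revenue_by_category.items() if v is not None}


months = fill_missing_months({"2021-01": 10.0, "2021-03": 5.0}, 2021)
print(len(months))  # 12
print(drop_missing_categories({"DE": 3.0, "FR": None}))
```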

The number of distinct values of nominal and ordinal scales is an important decision point as well. A pie chart with 5000 categories might not be the best-suited chart type. The revenue per country over time from above could be shown as a line chart with one line per country. That is excellent for comparisons between countries, unless you have 100 countries and hence 100 lines.
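
A cardinality check along these lines could be sketched as follows. The thresholds are invented for illustration; sensible limits would need to be worked out per chart type:

```python
# Assumed upper bounds on distinct category values per chart type.
MAX_CATEGORIES = {"pie": 8, "line": 20, "bar": 150}


def charts_within_cardinality(distinct_values, limits=MAX_CATEGORIES):
    """Return the chart types whose category limit is not exceeded."""
    return sorted(
        name for name, limit in limits.items() if distinct_values <= limit
    )


print(charts_within_cardinality(5))
print(charts_within_cardinality(100))
print(charts_within_cardinality(5000))
```

The distinct-value count itself would have to come from column statistics, which is why collecting that metadata is listed as a prerequisite.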

The final decision type is the purpose of the visualization:

  • Comparison
  • Relationship
  • Proportion
  • Percent of the whole
  • Location
  • Distribution...

The nice thing is that we can start small and grow the solution. Initially we just categorize each column of the result set into a scale type, and each chart carries the information about which scale types it allows on which axis. That by itself would reduce the list of charts to offer by a lot. From there we can keep growing with the available metadata on the data and the chart info.
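
A minimal sketch of that first step, assuming each chart declares the scale types it accepts per axis. The `CHART_AXES` table and its contents are hypothetical and only cover the examples mentioned above:

```python
# Assumed chart metadata: one set of allowed scale types per axis, in order.
CHART_AXES = {
    "pie": [{"nominal", "ordinal"}, {"ratio"}],
    "bar": [{"nominal", "ordinal"}, {"interval", "ratio"}],
    "scatter": [{"interval", "ratio"}, {"interval", "ratio"}],
}


def compatible_charts(column_scales, chart_axes=CHART_AXES):
    """Return charts whose axes accept the given column scale types.

    column_scales lists the scale type of each result-set column,
    in the order the columns would be mapped to the chart's axes.
    """
    result = []
    for chart, axes in chart_axes.items():
        if len(axes) != len(column_scales):
            continue
        if all(scale in allowed for scale, allowed in zip(column_scales, axes)):
            result.append(chart)
    return result


# Revenue per color: nominal + ratio
print(compatible_charts(["nominal", "ratio"]))
# Revenue per customer age: ratio + ratio
print(compatible_charts(["ratio", "ratio"]))
```

This sketch treats the column order as the axis order; a fuller version would try the possible column-to-axis assignments.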


@stale stale bot added the inactive Inactive for >= 30 days label Jun 26, 2021
@apache apache locked and limited conversation to collaborators Feb 2, 2022
@geido geido converted this issue into discussion #18430 Feb 2, 2022
@stale stale bot removed the inactive Inactive for >= 30 days label Feb 2, 2022
@geido geido added explore:design Related to the Explore UI/UX and removed viz:explore:ux labels Feb 9, 2022
@rusackas rusackas added the sip Superset Improvement Proposal label Jun 7, 2023

Projects
Status: Denied / Closed / Discarded

5 participants