fix: Explore long URL problem #18181

Merged
merged 12 commits into from
Jan 28, 2022

Conversation

michael-s-molina
Member

@michael-s-molina michael-s-molina commented Jan 26, 2022

SUMMARY

The objective of this PR is to fix the long URL problem in Explore. We had a function called getExploreLongUrl in src/explore/exploreUtils that was responsible for generating a URL with the contents of form_data. That was problematic because form_data can contain a lot of data, which resulted in URL length-related errors. To fix this problem, #18151 created an endpoint called /explore/form_data where we can store this information on the server side. This PR uses this endpoint to store form_data information and only keeps the generated key in the URL.
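As a rough illustration of the idea (not the code in this PR), the sketch below stores form_data under a generated key and puts only that key in the URL. The in-memory dictionary, the helper names, and the `/explore/` base path are assumptions made for the example; the real endpoint is backed by Flask-Caching.

```python
import json
import secrets
from typing import Dict, Optional
from urllib.parse import urlencode

# Hypothetical in-memory stand-in for the server-side store behind
# /explore/form_data (assumption: the real store is a cache backend).
_FORM_DATA_STORE: Dict[str, str] = {}


def store_form_data(form_data: dict) -> str:
    """Persist form_data server-side and return a short random key."""
    key = secrets.token_urlsafe(16)
    _FORM_DATA_STORE[key] = json.dumps(form_data)
    return key


def explore_url(form_data: dict, base: str = "/explore/") -> str:
    """Build a URL that carries only the key, never the full payload."""
    key = store_form_data(form_data)
    return f"{base}?{urlencode({'form_data_key': key})}"


def load_form_data(key: str) -> Optional[dict]:
    """Resolve a key back to form_data; None if the key is unknown."""
    raw = _FORM_DATA_STORE.get(key)
    return json.loads(raw) if raw is not None else None
```

The URL length stays constant no matter how large form_data grows, which is the property the long-URL fix relies on.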

Another objective of the PR is to maintain backward compatibility with previously generated short URLs. We also want to generate new URLs every time the page loads to avoid content override by different users using the same shared URL. At the same time, after loading the page we want to preserve the generated URL and update its content as the user interacts with the page.

The /explore endpoint is using the old API and contains a lot of complex logic related to the features in Explore, so it is NOT an objective of this PR to make big refactors that could introduce regressions. I tried to preserve the original endpoint with the added feature. We should tackle this work in follow-up PRs, probably during the migration to the new API.

There's another function in src/explore/exploreUtils called getExploreUrl that transfers a pruned version of form_data using the URL, and it's being used by the legacy charts and filter boxes. This will be tackled in another PR as it involves code that may soon be deprecated.

We also have an endpoint called /explore_json that has the potential of being replaced by the new /explore/form_data endpoint but requires significant refactoring. This change is not in the scope of this PR.

AFTER VIDEO

Screen.Recording.2022-01-26.at.4.11.53.PM.mov

TESTING INSTRUCTIONS

  • Play with Explore controls and check if your state is preserved while refreshing or opening the URL in another tab
  • Check if the actions that generate a URL are using the new form_data_key parameter
  • Check if you can navigate from a dashboard to Explore without losing context

ADDITIONAL INFORMATION

  • Has associated issue: closes Unable to use view chart in explore feature on 1.4.0 #18198
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@codecov

codecov bot commented Jan 26, 2022

Codecov Report

Merging #18181 (4ab76ab) into master (fa11a97) will increase coverage by 0.26%.
The diff coverage is 66.12%.

❗ Current head 4ab76ab differs from pull request most recent head 189bff2. Consider uploading reports for the commit 189bff2 to get more accurate results

@@            Coverage Diff             @@
##           master   #18181      +/-   ##
==========================================
+ Coverage   66.04%   66.31%   +0.26%     
==========================================
  Files        1591     1592       +1     
  Lines       62409    62440      +31     
  Branches     6283     6289       +6     
==========================================
+ Hits        41221    41405     +184     
+ Misses      19567    19382     -185     
- Partials     1621     1653      +32     
Flag Coverage Δ
hive 52.16% <27.27%> (-0.01%) ⬇️
javascript 51.38% <66.66%> (+0.52%) ⬆️
mysql 81.27% <63.63%> (-0.01%) ⬇️
postgres 81.32% <63.63%> (-0.01%) ⬇️
presto 52.01% <27.27%> (-0.01%) ⬇️
python 81.76% <63.63%> (-0.01%) ⬇️
sqlite 81.02% <63.63%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...end/src/dashboard/components/SliceHeader/index.tsx 89.74% <ø> (-0.26%) ⬇️
...dashboard/components/SliceHeaderControls/index.tsx 66.66% <ø> (ø)
...nd/src/explore/components/ExploreActionButtons.tsx 57.14% <0.00%> (-2.32%) ⬇️
...rc/explore/components/ExploreChartHeader/index.jsx 50.68% <ø> (+4.10%) ⬆️
...ntend/src/explore/components/ExploreChartPanel.jsx 72.00% <ø> (+57.33%) ⬆️
...uperset-frontend/src/explore/exploreUtils/index.js 63.90% <ø> (-3.00%) ⬇️
.../src/dashboard/components/gridComponents/Chart.jsx 58.88% <14.28%> (-4.21%) ⬇️
superset/views/core.py 77.59% <60.00%> (-0.15%) ⬇️
...dashboard/components/menu/ShareMenuItems/index.tsx 65.38% <72.72%> (-8.53%) ⬇️
...rset-frontend/src/explore/exploreUtils/formData.ts 81.81% <81.81%> (ø)
... and 38 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fa11a97...189bff2. Read the comment docs.

@jinghua-qa
Member

/testenv up

@github-actions
Contributor

@jinghua-qa Ephemeral environment spinning up at http://52.32.107.160:8080. Credentials are admin/admin. Please allow several minutes for bootstrapping and startup.

Member

@jinghua-qa jinghua-qa left a comment

checked PR in ephemeral env with the testing instruction, LGTM!

Member

@villebro villebro left a comment

This is an amazing improvement! Mostly minor comments so approving, but the error message should probably be made either more generic or more accurate.

@michael-s-molina michael-s-molina merged commit 4b61c76 into apache:master Jan 28, 2022
@github-actions
Contributor

Ephemeral environment shutdown and build artifacts deleted.

@ktmud
Member

ktmud commented Feb 2, 2022

@michael-s-molina Thanks for working on this!

I noticed that when I open an Explore page, the URL will update with a new form data key every time I reload the page even when the form data isn't updated. Based on the PR summary, it seems this is by design, but I'm concerned this may not be scalable in high-traffic Superset deployments:

  1. If the cache is configured as FileSystemCache:
    1. In case CACHE_THRESHOLD is not set or is 0, this could quickly fill up disk space or degrade performance as the number of files in the cache directory increases.
    2. In case CACHE_THRESHOLD is not zero, this could exhaust the allowed number of cache keys too fast, making it more likely that a link you shared to someone a while ago becomes invalid.
  2. It may be less of a concern if we configure the cache in Redis, but it's still not ideal (the cache keys still exhaust quicker than necessary if we configure Redis as LRU cache, or require too much storage if we do not).

If the concern was that people opening the same key may accidentally override each other's explore state, how about we create a new key only when the cached data's owner is different than current user? The new key could be based on current key + visiting user id so it's always deterministic even when a user opens other people's shared URL multiple times.
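A minimal sketch of the deterministic-key idea described above, assuming the cached entry records its owner's user id. The function name and the use of SHA-256 are illustrative choices, not part of the PR.

```python
import hashlib


def derive_user_key(shared_key: str, owner_id: int, visitor_id: int) -> str:
    """Return the key a visitor should read from and write to.

    The owner keeps using the shared key. Any other visitor gets a key
    derived from (shared key, visitor id), so reopening the same shared
    URL always maps to the same key instead of creating a new one.
    """
    if visitor_id == owner_id:
        return shared_key
    digest = hashlib.sha256(f"{shared_key}:{visitor_id}".encode()).hexdigest()
    return digest[:32]
```

With this scheme a page reload never adds a cache entry: at most one extra key exists per (shared URL, visitor) pair.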

What do you think?

@nytai
Member

nytai commented Feb 16, 2022

+1 on @john-bodley's concerns. I've been suffering some of the effects of the long URL problem and I'm happy to see it being worked on, but I agree this design raises some concerns. Particularly troubling is that the file system cache seems to be the default, so users upgrading superset would likely discover this issue only after deploying. Certain deployment setups involve running a single pod/host for dev/test/staging envs and multiple pods for production, so having the file system cache as the default would not present any issues in a single pod/host environment but would be a huge issue in a multi-pod/multi-host environment. The form data request could get routed to a pod/host that does not have the data in its file system cache.
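The routing problem can be reproduced with a toy simulation. Everything here is hypothetical: two "pods", each with its own local cache standing in for a per-host file system cache, behind a round-robin router.

```python
from itertools import cycle
from typing import Dict, Optional


class Pod:
    """A worker whose cache is local to it, like a per-host cache directory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.local_cache: Dict[str, str] = {}


pods = [Pod("pod-a"), Pod("pod-b")]
router = cycle(pods)  # naive round-robin load balancer


def post_form_data(key: str, value: str) -> str:
    """Store the value on whichever pod receives the request."""
    pod = next(router)
    pod.local_cache[key] = value
    return pod.name


def get_form_data(key: str) -> Optional[str]:
    """Read from whichever pod receives the request; may miss."""
    pod = next(router)
    return pod.local_cache.get(key)
```

A write handled by `pod-a` followed by a read routed to `pod-b` returns `None`, even though the key was stored a moment earlier. A shared backend removes the problem because every pod reads the same store.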

Additionally, this introduces a hard requirement on a cache service, where one didn't exist before (superset works fine with no cache configs). Storing in the metadata db seems like a better choice given that dependency has always existed.

@michael-s-molina
Member Author

Thanks for your comments @john-bodley @nytai @ktmud. Let me start with this:

  1. I realize that the ship has somewhat sailed, but per SIP-0 it mentions,

Any of the following should be considered a major change:

  • Any major new feature, subsystem, or piece of functionality
  • Any change that impacts the public interfaces of the project

What are the "public interfaces" of the project? All of the following are public interfaces that people build around:

  • Visualization types
  • Form data for saved dashboards and charts
  • ...

which would indicate that this feature likely should have been presented as a SIP to allow for discussion regarding design/implementation and sign off from the community.

You're totally right about this. We did a fairly extensive architectural discussion internally at Preset where we created a document with all the architectural options, discussed all the possibilities, and presented the recommended approach. We should have published the result of this work as a SIP for community discussion and sign-off. I apologize for that.

  1. Using a (semi-)persistent store for tackling the problem of long URLs seems somewhat akin to using a hammer to crack a nut: even though it works, there are likely simpler methods which don't require persistence of state. For example, was the idea of compressing the JSON form data considered? This would likely remedy the problem in a far simpler way whilst easily maintaining backwards compatibility etc.

Compressing JSON form data was the first option considered to resolve the problem. It was discarded because of the size of the dashboard and Explore states. Even with sophisticated compression algorithms, it was really easy to create a dashboard with many filter values or a chart with many configurations that exceeded the browser URL limit.
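To make the size argument concrete, here is a small sketch that compresses a synthetic form_data payload with zlib and checks it against a URL budget. The 2000-character limit and the payload shape are assumptions for illustration; real limits differ across browsers, proxies, and servers.

```python
import base64
import hashlib
import json
import zlib

# Assumption: a conservative URL length budget used only for this example.
ASSUMED_URL_LIMIT = 2000


def compressed_url_length(form_data: dict) -> int:
    """Length of form_data after JSON + zlib + URL-safe base64 encoding."""
    raw = json.dumps(form_data, separators=(",", ":")).encode()
    return len(base64.urlsafe_b64encode(zlib.compress(raw, 9)))


# Synthetic filter with 500 distinct, hard-to-compress values.
filter_values = [
    hashlib.sha256(str(i).encode()).hexdigest()[:16] for i in range(500)
]
large_form_data = {
    "viz_type": "table",
    "adhoc_filters": [{"col": "id", "op": "IN", "val": filter_values}],
}
```

A small payload fits easily, but the filter-heavy one stays well above the assumed budget even after compression, because distinct filter values carry little redundancy for the compressor to exploit.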

  1. Flask caching (regardless of the backing store) is primarily used as a cache—ideally with TTL functionality to prevent overflow. Using a cache for (semi-)persistent storage seems somewhat of an anti-pattern and likely violates the implicit understanding of how Superset's cache is used, i.e., in theory you should be able to obliterate the entire cache without impacting the functionality of the application (performance will clearly be degraded). If we plan on using a store to remedy the problem we likely should lean on Superset's metadata database—augmenting the shortened URL functionality.

During the evaluation of the architectural options, we considered a database table, Flask-Caching, and even DynamoDB or MongoDB to store this type of information. We listed pros and cons for each option and ultimately decided to use Flask-Caching for the following reasons:

  • It supports expiration automatically. This is important because the nature of the state that is stored is different and expires at different times. One example would be generating a link for a dashboard state vs transferring state between a dashboard and Explore.
  • It is already a dependency of Superset
  • We could use Redis (with persistence enabled) and have all the benefits of memory-based storage. This is the default configuration at Preset.
  • It gives the flexibility to use the most appropriate technology for each deployment. For deployments that don't use Redis, we could fall back to the file system or even provide a database backend to store the data in our metadata database.
  • Storing the information in a database table would also require implementing expiration management which is automatically provided by Flask-Caching and we would probably still need to cache the information because of performance reasons.

  1. It seems like there are aspects of this change which are still in flux, i.e., questions regarding backwards compatibility, state changes, etc. and thus I would consider this experimental in nature and likely should have been behind a feature flag (and defaulted to disabled) whilst the design/implementation was hardened.

We didn't consider this an experimental feature; in fact, we considered it a fix to a long-standing problem that was affecting users. We just made two improvements after this PR: one to generate the keys more efficiently and another to better deal with expired keys.

Particularly troubling is that the file system cache seems to be the default, so users going to upgrade superset would likely discover this issue after deploying.

That was considered during the work and we added the following in UPDATING.MD:

  • 17536: introduced a key-value endpoint to store dashboard filter state. This endpoint is backed by Flask-Caching and the default configuration assumes that the values will be stored in the file system. If you are already using another cache backend like Redis or Memcached, you'll probably want to change this setting in superset_config.py. The key is FILTER_STATE_CACHE_CONFIG and the available settings can be found in Flask-Caching docs.
  • 17882: introduced a key-value endpoint to store Explore form data. This endpoint is backed by Flask-Caching and the default configuration assumes that the values will be stored in the file system. If you are already using another cache backend like Redis or Memcached, you'll probably want to change this setting in superset_config.py. The key is EXPLORE_FORM_DATA_CACHE_CONFIG and the available settings can be found in Flask-Caching docs.
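For reference, an override along these lines in superset_config.py would point both stores at Redis. The Redis URL, key prefixes, and timeout values below are placeholders chosen for the example, not recommended settings; the option names are standard Flask-Caching configuration keys.

```python
# superset_config.py (sketch; URL, prefixes and timeouts are placeholders)
ONE_WEEK_SECONDS = 60 * 60 * 24 * 7
NINETY_DAYS_SECONDS = 60 * 60 * 24 * 90

EXPLORE_FORM_DATA_CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_DEFAULT_TIMEOUT": ONE_WEEK_SECONDS,
    "CACHE_KEY_PREFIX": "superset_explore_form_data_",
    "CACHE_REDIS_URL": "redis://localhost:6379/1",
}

FILTER_STATE_CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_DEFAULT_TIMEOUT": NINETY_DAYS_SECONDS,
    "CACHE_KEY_PREFIX": "superset_filter_state_",
    "CACHE_REDIS_URL": "redis://localhost:6379/1",
}
```

Using distinct key prefixes keeps the two kinds of state separable even when they share one Redis database.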

Certain deployment setups involve running a single pod/host for dev/test/staging envs and multiple pods for production, so having the file system cache be the default would not present any issues in a single pod/host environment but would be a huge issue in multi-pod/multi-host environment. The form data request could get routed to a pod/host that does not have the data in file system cache.

When defining the default backend for Flask-Caching, we considered that the default configuration should account for single pod/host because this is the most common scenario. We also didn't want to introduce more configuration burden for the default deployment, which is why we opted for the file system. Multi-pod/multi-host environments should change the default configuration as stated in UPDATING.MD.

I’m sorry about this having caused problems in your deployments. I agree this should have been surfaced as a SIP to gather all relevant context from the early discussions.

@villebro
Member

I agree with @michael-s-molina - we really should have surfaced this as a SIP, and this was clearly a fault in the process, and I assume full responsibility for my part in this oversight. My sincerest apologies for this. I will do my best to make sure similar lapses in process don't happen in the future.

I think the points brought up above were a good summary of why we felt this change was needed and why this approach was taken, but obviously this discussion should have taken place prior to PRs in the open. Like @nytai mentioned, the fallout caused by this will be difficult to undo at this point, but any potential improvements that come to mind are warmly welcomed and would be good to address before the 1.5 and 2.0 cuts. One safeguard that comes to mind is adding an optional allowlist to config.py for acceptable cache backends. By setting ALLOWED_CACHE_BACKENDS = ["RedisCache"], it would be possible to ensure that cache misconfigurations like these would be caught during testing.
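A sketch of what such an allowlist check might look like. ALLOWED_CACHE_BACKENDS is only a proposal in this thread, and the `validate_cache_backends` helper is hypothetical.

```python
from typing import Any, Dict, List

# Proposed setting from the comment above; not an existing Superset option.
ALLOWED_CACHE_BACKENDS: List[str] = ["RedisCache"]


def validate_cache_backends(
    cache_configs: Dict[str, Dict[str, Any]], allowed: List[str]
) -> None:
    """Raise if any named cache config uses a backend outside the allowlist.

    An empty allowlist means "no restriction", so the check is opt-in.
    """
    if not allowed:
        return
    for name, config in cache_configs.items():
        backend = config.get("CACHE_TYPE", "NullCache")
        if backend not in allowed:
            raise ValueError(
                f"{name} uses cache backend {backend!r}; "
                f"allowed backends are {allowed}"
            )
```

Running this once at startup (or in a test suite) would turn a silent misconfiguration into an immediate, descriptive failure.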

On a related note, it appears the long URL problem has gotten worse when moving from 1.3 to 1.4: #18198 (I have been unable to identify what is causing this).

@zhaoyongjie
Member

zhaoyongjie commented Feb 16, 2022 via email

@michael-s-molina
Member Author

I'm considering whether we should do a redesign if possible. I think it would be not hard to make the key-value system either store in the database, or store in the cache system.

We don't need to redesign the solution to store the data in the database. All it takes is to implement a custom database backend for Flask-Caching and set it as default.

@ktmud
Member

ktmud commented Feb 16, 2022

Originally I had some reservations about choosing Flask-Caching as the backend for key-value storage as well, but out of practicality it does seem to be the easiest way forward. However, I'd still like to respond to the arguments for using it, just to be prudent:

It supports expiration automatically. This is important because the nature of the state that is stored is different and expires at different times. One example would be generating a link for a dashboard state vs transferring state between a dashboard and Explore.

Sharable URLs aren't supposed to expire. To minimize disruptions on the user side (a URL expiring at an unpredictable time), we'd have to configure an LRU cache with a relatively long expiration time (in the range of 6 months or 1 year). For URLs that do not need to be shared but are simply used to transfer state between pages, the cache is indeed short-lived and can therefore just be implemented with Superset's regular cache storage (although we may need to change the default cache storage from NullCache to SimpleCache, otherwise the functionality will not work), or using local or server-side session storage.

It is already a dependency of Superset

SQL database is also already a dependency.

We could use Redis (with persistence enabled) and have all the benefits of memory-based storage. This is the default configuration at Preset.

Even with persistence enabled, Redis is not suitable for places sensitive to data loss. You may still lose data depending on how frequently snapshots are created, and disaster recovery can be slow because, when a Redis server restarts, it needs to load the full snapshot into memory.

It gives the flexibility to use the most appropriate technology for each deployment. For deployments that don't use Redis, we could fall back to the file system or even provide a database backend to store the data in our metadata database. Storing the information in a database table would also require implementing expiration management which is automatically provided by Flask-Caching and we would probably still need to cache the information because of performance reasons.

If we are to provide a database backend, then why not just make it the default storage? The only additional work compared to implementing a custom Flask-Caching backend is setting and updating the expiration date on each visit. I doubt performance will be an issue with adequate indexing. Most of the time you are querying and writing one record after all.


Again, I agree Flask-Caching is the practical choice and could work well if configured properly. However, I do believe we should at least remove FileSystemCache as the default storage. It should not be a recommended storage even for single-machine deployments. Plus UPDATING.md is very easy to miss---people will not read it unless it's a major version bump or they see problems. We should probably set it to NullCache and throw an exception at startup. We can still use FileSystemCache (or SimpleCache) when FLASK_ENV==development so someone trying Superset out can still get it up and running without changing configs manually:

EXPLORE_FORM_DATA_CACHE_CONFIG: CacheConfig = (
    {"CACHE_TYPE": "SimpleCache"} if DEBUG
    else {"CACHE_TYPE": "NullCache"}
)

@nytai
Member

nytai commented Feb 16, 2022

I think redis is a natural choice in the context of preset where there is already a very heavy reliance on redis for critical business logic. I'm not convinced all superset deployments would lead to the same conclusion. A lot of superset community members are running the latest docker tag (whether intending to run bleeding edge or not) and have already started to experience some of the effects of this change, eg https://apache-superset.slack.com/archives/C0170U650CQ/p1644941586344789.

While adding a default cache config that writes to a db table seems like a more ideal option, I understand the reluctance to go that route, especially since it will be overridden in preset with a redis cache anyway. I think we should at the very least add some more comments around what type of setup is recommended for single/multi instance deployments. Additionally, issuing a warning in logs (with some way of ack/disabling it) if the default filecache is being used seems like it could go a long way in preventing some of the upgrade pains that users are likely to experience in the upcoming release, or are already experiencing due to running latest.
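One way such a warning could be wired up is sketched below. The function name and the `acknowledged` flag are hypothetical; the flag stands in for whatever ack/disable mechanism would be chosen.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def warn_if_file_cache(
    name: str, config: Dict[str, Any], acknowledged: bool = False
) -> bool:
    """Log a warning when a cache config uses the file system backend.

    Returns True if a warning was emitted. Passing acknowledged=True
    silences it for operators who chose this backend deliberately.
    """
    uses_file_cache = config.get("CACHE_TYPE") == "FileSystemCache"
    if uses_file_cache and not acknowledged:
        logger.warning(
            "%s uses FileSystemCache, which is not shared between hosts; "
            "multi-instance deployments should configure a shared backend.",
            name,
        )
        return True
    return False
```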

@villebro regarding this problem getting worse in recent version, I think it's due to the native filters data structure having a larger footprint than filterbox. However, the biggest culprit that I have found so far is the dashboard color scheme override, this ends up really blowing up the json payload. I have been telling users to exercise caution with that feature and expect "view chart in explore" to break if a dashboard color scheme is applied.

@villebro
Member

villebro commented Feb 17, 2022

@ktmud I like the idea of defaulting to SimpleCache for debug mode and NullCache for other modes. I suggest we make that change asap; any objections if I open a PR for that?

With regards to persisting "critical" key-value pairs such as filter/form data state, it's good to keep in mind that a shared link that hasn't been accessed within a set timeframe, say 3 months from now which is the default for the dashboard filter state, most likely won't be accessed 6 months from now, or ever (adjusting this according to own tolerances is very simple). Also, as chart/dashboard state has a tendency of evolving over time, I'm not sure being able to rehydrate state that is years old is usually very useful. Keep in mind, though, that links that are regularly accessed will not expire (see REFRESH_TIMEOUT_ON_RETRIEVAL in the codebase; this can also be disabled if needed).
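The refresh-on-retrieval behaviour can be illustrated with a toy store. This is a simplified model, not Superset's implementation; the class name and the injectable clock are assumptions made so the example is easy to test.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ExpiringStore:
    """Key-value store whose entries expire after `timeout` seconds.

    With refresh_on_retrieval=True, each successful read pushes the
    expiry forward, so regularly accessed keys never expire.
    """

    def __init__(
        self,
        timeout: float,
        refresh_on_retrieval: bool = True,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout
        self._refresh = refresh_on_retrieval
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = (value, self._clock() + self._timeout)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop it and report a miss
            return None
        if self._refresh:
            self._data[key] = (value, self._clock() + self._timeout)
        return value
```

With refresh enabled, a link opened every few weeks stays alive indefinitely, while one that nobody opens for a full timeout period disappears.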

However, a potential solution that ensures critical state is persisted beyond the cache timeout and is resilient to the cache being accidentally flushed could be as follows:

  • Dashboard/Explore state continues to be persisted in the cache as is currently done.
  • We rename the current "Copy dashboard/chart URL" and related buttons to something like "Share permanent link" and persist this state in the metadata database, not the cache. When this link is accessed, the data is fetched from the metadata database, and after that the URL is updated to again reference a cached key, so that the state is again maintained in the cache. And if the user then wants to reshare the updated state, they click on the button again. Etc etc.

This approach should be very easy to implement and would give us the best of both worlds: temporary state that's needed during regular use would still be stored in the cache, and users could keep sharing URLs directly from the address bar that expire automatically (also from the cache). And if they really want to make sure the link never expires, they could just click a button to create a permanent link that is guaranteed to be persisted forever in the metadata database. To complement this feature, we could also consider adding a config flag that would make it possible to disable the persistent link sharing functionality for orgs that prefer to only offer automatically expiring links.
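The two-tier flow described above could look roughly like this. Both stores are plain dictionaries here, and the class and method names are hypothetical; in a real deployment the cache would be Flask-Caching and the durable store a metadata database table.

```python
import json
import secrets
from typing import Dict, Tuple


class StateStores:
    """Volatile cache for working state, durable store for permalinks."""

    def __init__(self) -> None:
        self.cache: Dict[str, str] = {}        # may expire or be flushed
        self.metadata_db: Dict[str, str] = {}  # persisted until deleted

    def save_to_cache(self, state: dict) -> str:
        """Store working state in the cache and return its key."""
        key = secrets.token_urlsafe(16)
        self.cache[key] = json.dumps(state)
        return key

    def create_permalink(self, state: dict) -> str:
        """Snapshot the state durably when the user clicks the share button."""
        key = secrets.token_urlsafe(16)
        self.metadata_db[key] = json.dumps(state)
        return key

    def open_permalink(self, permalink_key: str) -> Tuple[dict, str]:
        """Load the snapshot, then continue under a fresh cache key."""
        state = json.loads(self.metadata_db[permalink_key])
        return state, self.save_to_cache(state)
```

Because opening a permalink copies the snapshot into a new cache entry, later edits never modify the shared link, and flushing the cache does not break it.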

@ktmud
Member

ktmud commented Feb 17, 2022

Thanks for the thoughtful proposal, @villebro! A metastore just for permalinks sounds good to me.

@nytai
Member

nytai commented Feb 22, 2022

That’s a good proposal as it preserves some of the previous functionality where those actions generated short links in superset

@villebro
Member

Thanks for the feedback everyone; we're currently finalizing a proposal which should address all issues mentioned here.

@etr2460
Member

etr2460 commented Feb 28, 2022

Thanks @villebro I'm glad we've aligned on a solution here. Do you (or others at Preset) have an ETA on implementing this solution?

@villebro
Member

villebro commented Mar 1, 2022

@etr2460 we're expecting to have the main issues addressed in a few PRs that will be opened this week. A short summary of the changes:

  • Changing default cache type to SimpleCache for Debug mode, NullCache for others
  • Adding permanent link endpoints for Dashboard and Explore that will persist the data in the metadata database and replace the current ones, while maintaining backwards compatibility with current short links
  • Update docs to reflect minimum requirements of running a separate cache when running a multi-worker deployment

@villebro
Member

villebro commented Mar 1, 2022

@etr2460 @ktmud @nytai Here's the first PR to address the default cache type: #18976 . I'm still working on fixing the integration test configs, but the functional changes should be ready for review. I'm going to start working on the permalink feature next.

@graceguo-supercat

graceguo-supercat commented Mar 2, 2022

@villebro Thanks for the extra iterations. I have one question about implementation:

Adding permanent link endpoints for Dashboard and Explore that will persist the data in the metadata database and replace the current ones, while maintaining backwards compatibility with current short links

After this change, when a user clicks on the shortened URL link from the Dashboard and Explore views, do you plan to use the old /r/12345 short URL function, or generate a new hashed key?

I kind of prefer to use the old shortened URL /r/12345 for the permanent short link, since this will be consistent with the old behavior. The hashed-key URL, on the other hand, will be created and used only by Explore view navigation. I assume most hashed Explore view URLs will be abandoned, and they are meant to be reusable only in the relatively short term (while the cache is available).
So I prefer to make these 2 use cases clearly different. What do you think?

@villebro
Member

villebro commented Mar 2, 2022

After this change, when a user clicks on the shortened URL link from the Dashboard and Explore views, do you plan to use the old /r/12345 short URL function, or generate a new hashed key?

I kind of prefer to use the old shortened URL /r/12345 for the permanent short link, since this will be consistent with the old behavior. The hashed-key URL, on the other hand, will be created and used only by Explore view navigation. I assume most hashed Explore view URLs will be abandoned, and they are meant to be reusable only in the relatively short term (while the cache is available). So I prefer to make these 2 use cases clearly different. What do you think?

@graceguo-supercat I would prefer to steer away from serial ids due to the security implications of them (trivial to guess/iterate through). However, I have an idea of how we can get the best of both worlds, so I'll make this optional in my proposal (=it will be possible to configure Superset to use either one).
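The difference between the two identifier styles can be shown in a few lines. Both helpers are illustrative only and are not the key generators used in Superset.

```python
import secrets
from itertools import count

_serial = count(1)


def serial_id() -> str:
    """Sequential id, like /r/12345: the next value is trivially guessable."""
    return str(next(_serial))


def random_key(nbytes: int = 16) -> str:
    """Random URL-safe key: knowing one key reveals nothing about others."""
    return secrets.token_urlsafe(nbytes)
```

Anyone holding `/r/12345` can try `/r/12344` and `/r/12346`, whereas a 16-byte random key gives no such foothold. That property matters when the link itself is the only thing gating access to the stored state.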

@villebro
Member

villebro commented Mar 4, 2022

FYI now that #18976 is merged (change default cache + improve docs regarding minimum requirements for single/multi worker installations), I'm finally working on the permalink feature. But it will probably be a couple of days before it's ready for review.

@john-bodley
Member

@michael-s-molina and @villebro is there a status update regarding the work outlined above here?

@villebro
Member

villebro commented Mar 9, 2022

@john-bodley a PR is being opened today that addresses all remaining issues.

@villebro
Member

@john-bodley the PR is now ready for review (apologies for the size): #19078

@villebro
Member

@ktmud @etr2460 @nytai here's the PR that adds a new custom cache that leverages the recently introduced key_value table in the metadata database, and thereby reintroduces support for multi-pod deployments without a dedicated cache: #19232.

Labels
🏷️ bot, need:qa-review, size/XL, 🚢 1.5.0
Development

Successfully merging this pull request may close these issues.

Unable to use view chart in explore feature on 1.4.0
10 participants