Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add configuration for database connection collation #22564

Merged
merged 10 commits into from
Sep 13, 2024
Merged

Conversation

mneudert
Copy link
Member

@mneudert mneudert commented Sep 6, 2024

Description:

The MariaDB 11.5 release introduced a change in the default Unicode collation.

Due to the way Matomo connects to the database (i.e. sending SET NAMES without a COLLATE parameter), this can lead to a change in the collation used in some queries.

If the database was created with a previous MariaDB version, like 11.3, and the charset setting was configured as utf8mb4, the effective collation after SET NAMES should be utf8mb4_general_ci. This collation will then also be used to create the archive tables like archive_blob_2024_08.

With MariaDB 11.5, the collation will, and this may depend on the individual server configuration, change to utf8mb4_uca1400_ai_ci. And this collection will then be used to create a new archive table.

The problems arise when queries are using variables, for example during archiving:

SELECT name
FROM archive_blob_2024_08
WHERE idarchive IN (1,2,3) AND (name = 'Actions_sitesearch' OR (name LIKE 'Actions_sitesearch%' AND (SUBSTRING(name, 19, 7) = '_chunk_' OR (SUBSTRING(name, 20, 1) >= '0' AND SUBSTRING(name, 20, 1) <= '9'))))
ORDER BY CAST(IF((@idsubtable := SUBSTRING(SUBSTRING(name, IF(SUBSTRING(name, 19, 7) = '_chunk_', 26, 20)), 1, IF((@secondunderscore := LOCATE('_', SUBSTRING(name, IF(SUBSTRING(name, 19, 7) = '_chunk_', 26, 20))) - 1) < 0, LENGTH(name), @secondunderscore))) = '', -1, @idsubtable ) AS SIGNED) ASC;

If the table archive_blob_2024_08 was created using utf8mb4_general_ci, and the connection collation is set to utf8mb4_uca1400_ai_ci, the variable assignments in the WHERE clause will create a forbidden collation mix. And this breaks archiving.

While one way to work around that issue is to reconfigure the character_set_collations server configuration from utf8mb4=utf8mb4_uca1400_ai_ci to utf8mb4=utf8mb4_general_ci, this may not be possible in many environments, like shared hosting.

This PR introduces a new, optional, database setting connection_collation. If it is set alongside a database charset, this value will be passed to a SET NAMES ... COLLATE ... statement, setting the connection collation back to the value required for an uninterrupted service.

Config update

During installation and the next update, the config will be checked if a collation has been set.

If that is not the case, an automatic update is tried, based on the comparison of the collation of the user table and the one returned from SELECT @@collation_connection. If both are the same the configuration will be updated to this value.

If not, the update will check the most recent archive table (by name). If the collation of that table matches the users table, it will be chosen for the config update.

Otherwise no update will take place, so we don't accidentally break any setups. For this case the diagnostics have been updated to show the used connection collation and suggest updating the configuration manually to a suitable value.

Note: It can happen that, after an upgrade, a mix of database tables has been created. For example 2024_08 and 2024_09 could have been created with different collations. In this case the one of the tables has to be manually altered (or deleted and recreated by invalidation and rearchiving), so all tables have the same collation.

Fixes #22536
Refs DEV-18459

Review

@mneudert mneudert added Needs Review PRs that need a code review Stability For issues that make Matomo more stable and reliable to run for sys admins. labels Sep 6, 2024
@mneudert mneudert added this to the 5.2.0 milestone Sep 6, 2024
@mneudert mneudert requested a review from a team September 6, 2024 14:12
michalkleiner
michalkleiner previously approved these changes Sep 9, 2024
Copy link
Contributor

@michalkleiner michalkleiner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice and tidy

@sgiehl
Copy link
Member

sgiehl commented Sep 9, 2024

@mneudert I'm wondering if we also should introduce a way to ensure that all tables are created with the same collation (like we do for TiDb). At the moment a tables collation is also based on the default collation of the database, not sure if that could cause problems as well at some point, if the default collation changes 🤔

@mneudert
Copy link
Member Author

mneudert commented Sep 9, 2024

Good point @sgiehl.

I just checked how MariaDB behaves, and found that the current proposed fix is not enough :(

As we create table using DEFAULT CHARSET (without a COLLATE value), until now I would have expected the database default to be used. And that is a value that should not be changed by any database upgrade. But, as I found out, the creation in our case is still using the "unexpected" utf8mb4_uca1400_ai_ci collation for any new table.

So we indeed need to allow setting the collation in all create statements.

Should the config setting be renamed to the more generic collation, and be added to getTableCreateOptions?

@sgiehl
Copy link
Member

sgiehl commented Sep 9, 2024

Should the config setting be renamed to the more generic collation, and be added to getTableCreateOptions?

Guess that makes most sense. And during installation we should detect the current default collation in order to set it to the config. Not sure if we should also do that during the update, to ensure every installation has a collation set in the end.

@michalkleiner michalkleiner dismissed their stale review September 9, 2024 21:40

Further changes required by other reviewers

@michalkleiner
Copy link
Contributor

The changes related to tables creation look good. Are there still any changes pending related to the installation and automatic config setup?

@mneudert mneudert force-pushed the dev-18459 branch 3 times, most recently from 51b3e8a to a919e26 Compare September 10, 2024 16:11
@DaAwesomeP
Copy link

Would appreciate a patch release for this soon! My archive tasks are consistently failing for now.

@mneudert mneudert requested a review from a team September 12, 2024 11:26
core/Updates/5.2.0-b2.php Outdated Show resolved Hide resolved
@mneudert mneudert merged commit 3f541c2 into 5.x-dev Sep 13, 2024
20 of 25 checks passed
@mneudert mneudert deleted the dev-18459 branch September 13, 2024 11:04
@DaAwesomeP DaAwesomeP mentioned this pull request Sep 13, 2024
4 tasks
sgiehl pushed a commit that referenced this pull request Sep 13, 2024
* Define default collation for TiDB tables (#22271)

* Add configuration for database connection collation (#22564)

* Backport test changes from #22356
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Review PRs that need a code review Stability For issues that make Matomo more stable and reliable to run for sys admins.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Illegal mix of collations
4 participants