
Revisit usage of database connection pooling #1935

Closed

nscuro opened this issue Sep 7, 2022 · 3 comments
Labels: enhancement (New feature or request)
Milestone: 4.7

nscuro (Member) commented Sep 7, 2022

Current Behavior:

While experimenting with persistence metrics exposition, I noticed that there are more database connections (and even more connection pools) than I expected:

hikaricp_connections_idle{pool="HikariPool-3",} 9.0
hikaricp_connections_idle{pool="HikariPool-4",} 11.0
hikaricp_connections_max{pool="HikariPool-3",} 20.0
hikaricp_connections_max{pool="HikariPool-4",} 20.0
hikaricp_connections_min{pool="HikariPool-3",} 10.0
hikaricp_connections_min{pool="HikariPool-4",} 10.0
hikaricp_connections{pool="HikariPool-3",} 10.0
hikaricp_connections{pool="HikariPool-4",} 11.0

This was also reflected in the database itself, where I could see 21 active JDBC connections originating from DT.

According to the DataNucleus documentation:

Datastore connections are obtained from up to 2 connection factories. The primary connection factory is used for persistence operations, and optionally for value generation operations. The secondary connection factory is used for schema generation, and optionally for value generation operations.

If not specified otherwise, the configured connection pool is thus created twice: once for DN's primary and once for its secondary connection factory. That means whatever pool size we configure via Alpine is effectively doubled, which may explain some of the connection issues that were reported to us in the past. This behavior is unexpected from the user's perspective.
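For illustration, the pool configuration reaches DataNucleus through persistence properties along these lines (property names as documented by DataNucleus; the values and the exact wiring through Alpine are simplified here):

```java
import java.util.HashMap;
import java.util.Map;

class PoolingPropertiesSketch {
    static Map<String, Object> jdoProperties() {
        final Map<String, Object> props = new HashMap<>();
        // Pooling is configured once, but applies to *each* connection factory:
        // with both the primary and the secondary factory pooled, a max size of
        // 20 can mean up to 2 x 20 = 40 connections against the database.
        props.put("datanucleus.connectionPoolingType", "HikariCP");
        props.put("datanucleus.connectionPool.maxPoolSize", "20");
        return props;
    }
}
```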

Similarly, notice how the metrics above say HikariPool-3 and HikariPool-4. This is because two other instances were created temporarily by the upgrade framework:

// In the upgrade framework (UpgradeInitializer): builds a second, temporary PMF, and with it two more Hikari pools
final JDOPersistenceManagerFactory pmf = (JDOPersistenceManagerFactory) JDOHelper.getPersistenceManagerFactory(JdoProperties.get(), "Alpine");

This temporarily spins up two connection pools with (per default) 10 idle connections each. Because upgrades are executed in a serial fashion, a connection pool may be a little overkill. A single connection should probably suffice in this case.

Proposed Behavior:

There are not many big OSS projects using DataNucleus, but Apache Hive is one of them. I looked into how they handle the connection pool situation, and they settled on using a smaller, fixed-size connection pool for DN's secondary connection factory: https://github.com/hsnusonic/hive/blob/714c260e4a7c6b147c897718a33e693699267792/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/PersistenceManagerProvider.java#L256-L259

We should test whether we can do something similar to limit the number of connections we hoard. This will need to be tested in high-load situations, to ensure that it doesn't slow down the system.
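Roughly what that could look like for us, if we build the Hikari pools ourselves and hand them to DataNucleus via its `datanucleus.ConnectionFactory` / `datanucleus.ConnectionFactory2` properties (class names, pool sizes, and JDBC settings below are placeholders):

```java
import java.util.HashMap;
import java.util.Map;

import javax.jdo.JDOHelper;
import javax.jdo.PersistenceManagerFactory;

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

class DualPoolSketch {

    private static HikariDataSource createPool(final String name, final int maxSize, final int minIdle) {
        final HikariConfig config = new HikariConfig();
        config.setPoolName(name);
        config.setJdbcUrl("jdbc:postgresql://localhost:5432/dtrack"); // placeholder
        config.setUsername("dtrack");
        config.setPassword("dtrack");
        config.setMaximumPoolSize(maxSize);
        config.setMinimumIdle(minIdle);
        return new HikariDataSource(config);
    }

    static PersistenceManagerFactory createPmf() {
        // Full-size pool for the primary (transactional) connection factory ...
        final HikariDataSource primaryPool = createPool("primary", 20, 10);
        // ... and a small, fixed-size pool for the secondary factory, similar to
        // what Hive does for its metastore.
        final HikariDataSource secondaryPool = createPool("secondary", 2, 2);

        final Map<String, Object> props = new HashMap<>();
        props.put("javax.jdo.PersistenceManagerFactoryClass",
                "org.datanucleus.api.jdo.JDOPersistenceManagerFactory");
        props.put("datanucleus.ConnectionFactory", primaryPool);
        props.put("datanucleus.ConnectionFactory2", secondaryPool);
        return JDOHelper.getPersistenceManagerFactory(props);
    }
}
```

The important bit is that the secondary factory gets its own, much smaller pool instead of a second copy of the primary one.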

Additionally, the upgrade framework should be adjusted to not use a connection pool.
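For the upgrade path, something along these lines should do (connection details are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class UpgradeConnectionSketch {
    static void runUpgrades() throws SQLException {
        // Upgrade steps run one after another, so a single, unpooled JDBC
        // connection is sufficient -- no temporary PMF, no Hikari pool.
        try (final Connection connection = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/dtrack", "dtrack", "dtrack")) {
            connection.setAutoCommit(false);
            // ... execute the individual upgrade steps on this one connection ...
            connection.commit();
        }
    }
}
```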

@nscuro nscuro added the enhancement New feature or request label Sep 7, 2022
@nscuro nscuro added this to the 4.7 milestone Sep 7, 2022
nscuro (Member, Author) commented Nov 28, 2022

Dabbled with this further, dumping my progress so far:

The secondary connection factory is not only used for schema and value generation, but also for non-transactional operations:

By default the connection used for transactional and non-transactional operations will be different, potentially from a different connection factory. (Source)

This behavior can be turned off via `datanucleus.connection.singleConnectionPerExecutionContext=true`; however, that property also prevents connections from being released after use. Holding on to connections for longer than needed causes serious issues in DT, as many threads perform persistence operations in parallel. I tested it, and it caused the UI to become unresponsive when a big BOM is uploaded. Not cool.
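For reference, this is the property I tested; `props` stands for the map of JDO persistence properties handed to the PMF:

```java
// Binds all work of an ExecutionContext to one connection, but also keeps that
// connection allocated until the ExecutionContext is closed -- the behavior that
// made the UI unresponsive during large BOM uploads.
props.put("datanucleus.connection.singleConnectionPerExecutionContext", "true");
```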

Most queries performed by DT are not in the context of a transaction, and thus qualify as non-transactional. If this wasn't the case, we could've made the secondary connection factory non-pooled (I asked the DataNucleus maintainer about it).

However, in my testing, the secondary connection factory used for non-transactional operations always had significantly more active connections than the primary connection factory used for transactional operations.

Here's an example showing active connections after I uploaded 250 BOMs in bulk, using two fixed-size connection pools:

[Screenshot: active connections per pool after bulk-uploading 250 BOMs]

Keep in mind that "transactional" is supposed to be the "primary" pool. 🤪
Perhaps this means we need to make both pools configurable, not just one.

nscuro added a commit to nscuro/dependency-track that referenced this issue Dec 4, 2022
By creating a temporary PMF based on the same properties as the `PersistenceFactoryInitializer`, the `UpgradeInitializer` would create *two* connection pools of `10` (per default) connections each, for the duration of its execution.

Because upgrades are executed in serial, connection pooling is not beneficial. Creating temporary connection pools is a wasteful operation and should be avoided.

Partially addresses DependencyTrack#1935

Signed-off-by: nscuro <nscuro@protonmail.com>
nscuro (Member, Author) commented Dec 11, 2022

Closing this one, as there is nothing more we can do here.

Ideally, even read-only operations should make use of transactions. I wouldn't expect any performance impact from that, as the majority of RDBMSes use transactions behind the scenes anyway. Once no non-transactional operations remain, we can disable the secondary connection pool.
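Roughly what that would look like, using DT's `Project` model for illustration:

```java
import java.util.ArrayList;
import java.util.List;

import javax.jdo.PersistenceManager;
import javax.jdo.PersistenceManagerFactory;
import javax.jdo.Query;
import javax.jdo.Transaction;

import org.dependencytrack.model.Project;

class TransactionalReadSketch {
    static List<Project> fetchProjects(final PersistenceManagerFactory pmf) {
        final PersistenceManager pm = pmf.getPersistenceManager();
        final Transaction trx = pm.currentTransaction();
        try {
            // Even for a pure read, begin/commit a transaction so the query
            // goes through the primary (transactional) connection factory.
            trx.begin();
            final Query<Project> query = pm.newQuery(Project.class);
            final List<Project> projects = new ArrayList<>(pm.detachCopyAll(query.executeList()));
            trx.commit();
            return projects;
        } finally {
            if (trx.isActive()) {
                trx.rollback();
            }
            pm.close();
        }
    }
}
```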

But as of now, the secondary pool is even more actively used than the primary one. Exposing usage metrics (see also #2245) will allow users to tune the pools according to their needs. With #2238, it is possible to (optionally) configure both pools separately. I will add documentation about this on the Database Support page.

@nscuro nscuro closed this as completed Dec 11, 2022
mulder999 pushed a commit to mulder999/dependency-track that referenced this issue Dec 23, 2022
github-actions bot commented:

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 11, 2023