
Revisit usage of database connection pooling #1935

Closed

nscuro opened this issue Sep 7, 2022 · 3 comments
Labels: enhancement (New feature or request)
Milestone: 4.7

nscuro (Member) commented Sep 7, 2022

Current Behavior:

While experimenting with persistence metrics exposition, I noticed that there are more database connections (and even more connection pools) than I expected:

hikaricp_connections_idle{pool="HikariPool-3",} 9.0
hikaricp_connections_idle{pool="HikariPool-4",} 11.0
hikaricp_connections_max{pool="HikariPool-3",} 20.0
hikaricp_connections_max{pool="HikariPool-4",} 20.0
hikaricp_connections_min{pool="HikariPool-3",} 10.0
hikaricp_connections_min{pool="HikariPool-4",} 10.0
hikaricp_connections{pool="HikariPool-3",} 10.0
hikaricp_connections{pool="HikariPool-4",} 11.0

This was also reflected in the database itself, where I could see 21 active JDBC connections originating from DT.

According to the DataNucleus documentation:

Datastore connections are obtained from up to 2 connection factories. The primary connection factory is used for persistence operations, and optionally for value generation operations. The secondary connection factory is used for schema generation, and optionally for value generation operations.

If not specified otherwise, the configured connection pool is thus created twice: once for DN's primary and once for its secondary connection factory. That means whatever pool size we configure via Alpine is effectively doubled, which may explain some of the connection issues that were reported to us in the past. This behavior is unexpected from the user's perspective.
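For illustration, the pool configuration reaches DataNucleus through persistence properties along these lines (property names as documented by DataNucleus; the values and the exact wiring through Alpine are simplified here):

```java
import java.util.HashMap;
import java.util.Map;

class PoolingPropertiesSketch {
    static Map<String, Object> jdoProperties() {
        final Map<String, Object> props = new HashMap<>();
        // Pooling is configured once, but applies to *each* connection factory:
        // with both the primary and the secondary factory pooled, a max size of
        // 20 can mean up to 2 x 20 = 40 connections against the database.
        props.put("datanucleus.connectionPoolingType", "HikariCP");
        props.put("datanucleus.connectionPool.maxPoolSize", "20");
        return props;
    }
}
```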

Similarly, notice how the metrics above say HikariPool-3 and HikariPool-4. This is because two other instances were created temporarily by the upgrade framework:

// In the upgrade framework (UpgradeInitializer): builds a second, temporary PMF, and with it two more Hikari pools
final JDOPersistenceManagerFactory pmf = (JDOPersistenceManagerFactory) JDOHelper.getPersistenceManagerFactory(JdoProperties.get(), "Alpine");

This temporarily spins up two connection pools with (per default) 10 idle connections each. Because upgrades are executed in a serial fashion, a connection pool may be a little overkill. A single connection should probably suffice in this case.

Proposed Behavior:

There are not many big OSS projects using DataNucleus, but Apache Hive is one of them. I looked into how they handle the connection pool situation, and they settled on using a smaller, fixed-size connection pool for DN's secondary connection factory: https://github.com/hsnusonic/hive/blob/714c260e4a7c6b147c897718a33e693699267792/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/PersistenceManagerProvider.java#L256-L259

We should test whether we can do something similar to limit the number of connections we hoard. This will need to be tested in high-load situations, to ensure that it doesn't slow down the system.
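Roughly what that could look like for us, if we build the Hikari pools ourselves and hand them to DataNucleus via its `datanucleus.ConnectionFactory` / `datanucleus.ConnectionFactory2` properties (class names, pool sizes, and JDBC settings below are placeholders):

```java
import java.util.HashMap;
import java.util.Map;

import javax.jdo.JDOHelper;
import javax.jdo.PersistenceManagerFactory;

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

class DualPoolSketch {

    private static HikariDataSource createPool(final String name, final int maxSize, final int minIdle) {
        final HikariConfig config = new HikariConfig();
        config.setPoolName(name);
        config.setJdbcUrl("jdbc:postgresql://localhost:5432/dtrack"); // placeholder
        config.setUsername("dtrack");
        config.setPassword("dtrack");
        config.setMaximumPoolSize(maxSize);
        config.setMinimumIdle(minIdle);
        return new HikariDataSource(config);
    }

    static PersistenceManagerFactory createPmf() {
        // Full-size pool for the primary (transactional) connection factory ...
        final HikariDataSource primaryPool = createPool("primary", 20, 10);
        // ... and a small, fixed-size pool for the secondary factory, similar to
        // what Hive does for its metastore.
        final HikariDataSource secondaryPool = createPool("secondary", 2, 2);

        final Map<String, Object> props = new HashMap<>();
        props.put("javax.jdo.PersistenceManagerFactoryClass",
                "org.datanucleus.api.jdo.JDOPersistenceManagerFactory");
        props.put("datanucleus.ConnectionFactory", primaryPool);
        props.put("datanucleus.ConnectionFactory2", secondaryPool);
        return JDOHelper.getPersistenceManagerFactory(props);
    }
}
```

The important bit is that the secondary factory gets its own, much smaller pool instead of a second copy of the primary one.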

Additionally, the upgrade framework should be adjusted to not use a connection pool.
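For the upgrade path, something along these lines should do (connection details are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class UpgradeConnectionSketch {
    static void runUpgrades() throws SQLException {
        // Upgrade steps run one after another, so a single, unpooled JDBC
        // connection is sufficient -- no temporary PMF, no Hikari pool.
        try (final Connection connection = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/dtrack", "dtrack", "dtrack")) {
            connection.setAutoCommit(false);
            // ... execute the individual upgrade steps on this one connection ...
            connection.commit();
        }
    }
}
```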

@nscuro nscuro added the enhancement New feature or request label Sep 7, 2022
@nscuro nscuro added this to the 4.7 milestone Sep 7, 2022
nscuro (Member, Author) commented Nov 28, 2022

Dabbled with this further, dumping my progress so far:

The secondary connection factory is not only used for schema and value generation, but also for non-transactional operations:

By default the connection used for transactional and non-transactional operations will be different, potentially from a different connection factory. (Source)

This behavior can be turned off via `datanucleus.connection.singleConnectionPerExecutionContext=true`; however, that property also prevents connections from being released after use. Holding on to connections for longer than needed causes serious issues in DT, as many threads perform persistence operations in parallel. I tested it, and it caused the UI to become unresponsive when a big BOM is uploaded. Not cool.
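For reference, this is the property I tested; `props` stands for the map of JDO persistence properties handed to the PMF:

```java
// Binds all work of an ExecutionContext to one connection, but also keeps that
// connection allocated until the ExecutionContext is closed -- the behavior that
// made the UI unresponsive during large BOM uploads.
props.put("datanucleus.connection.singleConnectionPerExecutionContext", "true");
```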

Most queries performed by DT are not in the context of a transaction, and thus qualify as non-transactional. If this wasn't the case, we could've made the secondary connection factory non-pooled (I asked the DataNucleus maintainer about it).

However, in my testing, the secondary connection factory used for non-transactional operations always had significantly more active connections than the primary connection factory used for transactional operations.

Here's an example showing active connections after I uploaded 250 BOMs in bulk, using two fixed-size connection pools:

[Screenshot: active connections per pool after bulk-uploading 250 BOMs]

Keep in mind that "transactional" is supposed to be the "primary" pool. 🤪
Perhaps this means we need to make both pools configurable, not just one.

nscuro added a commit to nscuro/dependency-track that referenced this issue Dec 4, 2022
By creating a temporary PMF based on the same properties as the `PersistenceFactoryInitializer`, the `UpgradeInitializer` would create *two* connection pools of `10` (per default) connections each, for the duration of its execution.

Because upgrades are executed in serial, connection pooling is not beneficial. Creating temporary connection pools is a wasteful operation and should be avoided.

Partially addresses DependencyTrack#1935

Signed-off-by: nscuro <nscuro@protonmail.com>
nscuro (Member, Author) commented Dec 11, 2022

Closing this one, as there is nothing more we can do here.

Ideally, even read-only operations should make use of transactions. I wouldn't expect any performance impact from that, as the majority of RDBMSes use transactions behind the scenes anyway. Once no non-transactional operations remain, we can disable the secondary connection pool.
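Roughly what that would look like, using DT's `Project` model for illustration:

```java
import java.util.ArrayList;
import java.util.List;

import javax.jdo.PersistenceManager;
import javax.jdo.PersistenceManagerFactory;
import javax.jdo.Query;
import javax.jdo.Transaction;

import org.dependencytrack.model.Project;

class TransactionalReadSketch {
    static List<Project> fetchProjects(final PersistenceManagerFactory pmf) {
        final PersistenceManager pm = pmf.getPersistenceManager();
        final Transaction trx = pm.currentTransaction();
        try {
            // Even for a pure read, begin/commit a transaction so the query
            // goes through the primary (transactional) connection factory.
            trx.begin();
            final Query<Project> query = pm.newQuery(Project.class);
            final List<Project> projects = new ArrayList<>(pm.detachCopyAll(query.executeList()));
            trx.commit();
            return projects;
        } finally {
            if (trx.isActive()) {
                trx.rollback();
            }
            pm.close();
        }
    }
}
```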

But as of now, the secondary pool is even more actively used than the primary one. Exposing usage metrics (see also #2245) will allow users to tune the pools according to their needs. With #2238, it is possible to (optionally) configure both pools separately. I will add documentation about this on the Database Support page.

@nscuro nscuro closed this as completed Dec 11, 2022
mulder999 pushed a commit to mulder999/dependency-track that referenced this issue Dec 23, 2022
github-actions bot commented:

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 11, 2023