Skip to content

DO NOT MERGE: diagnostic package to debug knex issues#2441

Closed
iainsproat wants to merge 6 commits into
mainfrom
iain/fix-knex-metrics
Closed

DO NOT MERGE: diagnostic package to debug knex issues#2441
iainsproat wants to merge 6 commits into
mainfrom
iain/fix-knex-metrics

Conversation

@iainsproat
Copy link
Copy Markdown
Contributor

@iainsproat iainsproat commented Jun 26, 2024

⚠️ Do not merge!!!

Description & motivation

The prometheus metrics from knex are showing incorrect values. This PR provides a package and docker compose stack to diagnose and debug the issues.

Prerequisites

  • Docker
  • Docker Compose

Instructions

  1. cd packages/knex-metrics-debugger
  2. yarn test
  3. Open Grafana at http://127.0.0.1:3000/
  4. Username is admin, password is grafana.
  5. Open Prometheus at http://localhost:9090/
  6. View raw metrics at http://localhost:9094/metrics
  7. Connect directly to the database at localhost:5432, database speckle, user speckle, password speckle

Changes:

To-do before merge:

Screenshots:

Validation of changes:

Checklist:

  • My pull request follows the guidelines in the Contributing guide?
  • My pull request does not duplicate any other open Pull Requests for the same update/change?
  • My commits are related to the pull request and do not amend unrelated code or documentation.
  • My code follows a similar style to existing code.
  • I have added appropriate tests.
  • I have updated or added relevant documentation.

References

@specklesystems specklesystems deleted a comment from gitguardian Bot Jun 26, 2024
@iainsproat
Copy link
Copy Markdown
Contributor Author

iainsproat commented Jun 26, 2024

Client is fixed to maximum of 25 connections.
Creating 50 connections simultaneously:

  const allQueries = []
  for (let i = 0; i < 50; i++) {
    allQueries.push(
      knex.raw('SELECT clock_timestamp(), pg_sleep(5), clock_timestamp();')
    )
  }
  await Promise.all(allQueries)

The resulting Grafana chart shows the following:

Screenshot 2024-06-26 at 16 07 11

Interpretation

Initially it shows zero free connections; this is incorrect.
When the workload starts, the used connections rises immediately to 25; this is the maximum allowed per the knex configuration.
The pending connections also rises to 25, which is expected as we wish to process 50 tasks and only have capacity for 25 at a time.
All used and pending connections are eventually processed, and they both drop back to zero.
Only at this point does the number of free connections show the correct number of available connections.

@iainsproat
Copy link
Copy Markdown
Contributor Author

knex uses https://github.com/vincit/tarn.js/ under the covers to manage connection pools.
numFree reports on the number of created resources which are currently free. A created resource represents one connection to postgres.
However the free created resources can be 'reaped' after certain conditions are met (Conditions may include that the created resource are free & unused for too long).
Resources are created on demand.
The combination of the two behaviours means that there are times when the number of free connections shown is zero, despite no (or less than the max) connections being in use.

However, resources can only be created up until the pre-defined maximum number of connections. We can use this value, and the number of in-use connections, to calculate the remaining capacity of the connection pool. This would provide a more representable value over the number of free connections.

@iainsproat
Copy link
Copy Markdown
Contributor Author

The phenomenon where we can see 'free' connections being 'reaped' a little while after a period of activity, and then connections are created when required again.
Screenshot 2024-06-26 at 17 51 19

@iainsproat
Copy link
Copy Markdown
Contributor Author

iainsproat commented Jun 26, 2024

New metric, in yellow, showing 'remaining capacity'. This provides a more accurate representation of the actual number of connections that can be created within a pod, and takes into account the pending demand from pending creates and acquisitions. It does not get 'reaped' like the 'available' connections metric (in green):
Screenshot 2024-06-26 at 18 10 32

  metricRemainingCapacity = new prometheusClient.Gauge({
    name: 'speckle_server_knex_remaining_capacity',
    help: 'Remaining capacity of the DB connection pool',
    collect() {
      const max = getMaximumNumberOfConnections()
      const demand =
        knex.client.pool.numUsed() +
        knex.client.pool.numPendingCreates() +
        knex.client.pool.numPendingAcquires()

      //the higher value of zero or the difference between the max and the demand
      const remainingCapacity = max <= demand ? 0 : max - demand
      this.set(remainingCapacity)
    }
  })

@gitguardian
Copy link
Copy Markdown

gitguardian Bot commented Jun 26, 2024

️✅ There are no secrets present in this pull request anymore.

If these secrets were true positive and are still valid, we highly recommend you to revoke them.
Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.
Find here more information about risks.


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@iainsproat
Copy link
Copy Markdown
Contributor Author

Refer to #2443

@iainsproat iainsproat closed this Jun 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant