Skip to content

#21025: Optimize rendering of config context data#22294

Open
jeremystretch wants to merge 2 commits into
featurefrom
21025-pre-render-config-context-data
Open

#21025: Optimize rendering of config context data#22294
jeremystretch wants to merge 2 commits into
featurefrom
21025-pre-render-config-context-data

Conversation

@jeremystretch

@jeremystretch jeremystretch commented May 26, 2026

Copy link
Copy Markdown
Member

Closes: #21025

  • Add fields on Device and VirtualMachine models (via ConfigContextModel):
    • _config_context_data JSON field to stored cached context data
    • _config_context_generation integer field to serialize recalculations (for race condition avoidance)
  • On ConfigContextModel, split get_config_context() into a new render_config_context() method and default to returning cached config context data (if present)
  • Add a get_affected_objects() method to ConfigContext, which returns the devices/VMs covered by its scope
  • Introduce signal handlers to automatically invalidate cached config context data for the relevant objects upon applicable changes to ConfigContexts or related objects
  • Introduce RenderConfigContextJob, a background job which caches config context data for all devices/VMs with null _config_context_data
  • Introduce the rebuild_config_context_cache management command (with --force option)
  • Remove the conditional *WithConfigContextSerializer REST API serializers for devices/VMs
  • Remove calls to annotate_config_context_data() from UI & GraphQL querysets for devices/VMs
  • Remove ConfigContextQuerySetMixin
  • Drop support for suppressing config_context data by passing ?exclude=config_context with REST API requests

@jeremystretch jeremystretch force-pushed the 21025-pre-render-config-context-data branch 4 times, most recently from c273bc8 to 53fff7c Compare June 1, 2026 18:49
@jeremystretch jeremystretch force-pushed the 21025-pre-render-config-context-data branch from 53fff7c to db44a3b Compare June 3, 2026 16:18
@jeremystretch jeremystretch changed the base branch from main to feature June 3, 2026 16:18
@jeremystretch jeremystretch force-pushed the 21025-pre-render-config-context-data branch 2 times, most recently from 34c7ae9 to 08a1b01 Compare June 3, 2026 19:17
@jeremystretch jeremystretch force-pushed the 21025-pre-render-config-context-data branch 4 times, most recently from 2b90545 to 3ea5839 Compare June 4, 2026 13:42
@jeremystretch

Copy link
Copy Markdown
Member Author

@claude /review

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Claude finished @jeremystretch's task in 6m 17s —— View job


PR Review: #21025 Optimize rendering of config context data

  • Gather context
  • Read key changed files
  • Analyze code quality, correctness, and conventions
  • Post review findings

Overview

This PR introduces a pre-rendered caching layer for config context data on Device and VirtualMachine objects — a significant performance improvement that eliminates the expensive on-demand annotation for every API/GraphQL request. The design is sound: a JSON cache column (_config_context_data), a generation counter for compare-and-set protection against concurrent invalidations, a comprehensive signal-based invalidation graph, and a background job for background repopulation.

The implementation is thorough. The ConfigContextScopeParityTest test is particularly well done — verifying that the forward matcher (get_for_object) and the inverse matcher (get_affected_objects) agree across every scope dimension is the right way to prove correctness of the inversion. The coverage of the compare-and-set race condition test (test_job_skips_stale_write_on_concurrent_invalidation) is impressive.


Issues Found

1. Top-level cross-app imports in extras/cache.py

netbox/extras/cache.py, lines 13–15:

from dcim.models import Device
from extras.jobs import RenderConfigContextJob
from virtualization.models import VirtualMachine

These are module-level imports from cross-app models. This is inconsistent with the pattern used in extras/signals.py, where Device and VirtualMachine are always imported lazily inside functions (e.g., from dcim.models import Device inside invalidate_on_device_vm_tag_change). The top-level imports in cache.py work at runtime (since cache.py is only loaded at ready() time when both apps are fully initialized), but they create a static import-time dependency that could cause issues in tests, management commands, or if the import order ever changes. Consider moving them to be lazy imports inside each function that needs them, matching the existing convention. Fix this →

2. TenantGroup in mptt_triggers vs direct-equality matching

netbox/extras/signals.py (in _connect_upstream_handlers) and netbox/extras/cache.py (mptt_attrs):

TenantGroup is listed in mptt_triggers, meaning reparenting a TenantGroup triggers _make_mptt_reparent_handler, which calls get_descendants(include_self=True) and invalidates all devices in that subtree. However, get_for_object in querysets.py uses direct equality for tenant_groupsQ(tenant_groups=tenant_group) — not ancestor traversal. This means reparenting a TenantGroup has no effect on which config contexts match a device (the device's tenant.group FK doesn't change), so the reparent handler would trigger unnecessary (though harmless) invalidations.

Similarly, in invalidate_for_scope_delta, tenant_groups is in direct_attrs (correct), but TenantGroup also appears in mptt_triggers for reparent handling (inconsistent). This over-invalidates but doesn't produce stale caches.

Worth confirming this is intentional, or simplify by removing TenantGroup from mptt_triggers if its matching is strictly direct.

3. get_affected_objects() docstring says it returns a tuple, but returns a generator-like tuple

netbox/extras/models/configs.py, lines ~218–230:

The docstring says "Return a (device_qs, vm_qs) tuple" which is accurate. invalidate_config_context_for_configcontext iterates it with for queryset in configcontext.get_affected_objects(). This works since the tuple yields device_qs and vm_qs in order, but invalidate_config_context_for_objects derives the model_label from queryset.model._meta.label_lower — so the order matters. This is implicit and fragile. Consider returning a dict or named tuple, or having get_affected_objects() return an iterable of (model_label, queryset) pairs explicitly.

4. ?exclude=config_context is silently dropped — API breaking change not highlighted in PR body

The PR body mentions this: "Drop support for suppressing config_context data by passing ?exclude=config_context", and the docs add a note. However, config_context is now always returned in Device and VM API responses. For deployments with many devices and slow background job queues, the first response after an upgrade could be slower (falling back to on-demand render) and larger (always including the field). The upgrade warning in the docs is good; consider whether a deprecation notice was given in any prior version.

5. No test for Cluster.type/group change → VM cache invalidation

The _connect_upstream_handlers direct_triggers registers a handler for Cluster changes (type_id, group_id, site_id). But there's no test in test_configcontext_cache.py that exercises a Cluster.type or Cluster.group change and verifies the associated VM's cache is invalidated. Given that cluster-type and cluster-group are first-class ConfigContext scope dimensions, this is a gap worth covering.

6. No test for Tenant.group change → Device/VM cache invalidation

Similarly, ('tenancy', 'Tenant', ('group_id',), 'tenant_id', 'tenant_id') is in direct_triggers but the test suite doesn't have a test for "change tenant.group → device/VM cache is invalidated".

7. RENDER_CONFIG_CONTEXT_MAX_PASSES = 100 — consider adding a warning for unusual pass counts

netbox/extras/jobs.py, line 22:

The multi-pass loop is correct and well-motivated. However, in normal operation, a single pass should suffice. More than 2–3 passes would indicate either a very busy system or an unexpected churn loop. Consider logging a warning after (say) 3 passes so operators have visibility into abnormal behavior, rather than silently burning through all 100 passes.

8. Upgrade time: rebuild_config_context_cache in upgrade.sh is synchronous

upgrade.sh, lines 104–109:

The upgrade script now blocks until all devices and VMs have their config context rendered. For large deployments this could add meaningful upgrade time. The docs mention this, but the command has no progress indicator (only a final count). The iterator(chunk_size=500) in the management command is good for memory, but a per-chunk progress log (e.g., self.stdout.write(f' ... rendered {n} so far')) would improve operator experience for large datasets.


Minor Notes

  • ConfigContextModel.serialize_object override (configs.py): correctly excludes _config_context_data and _config_context_generation from changelog snapshots. The list(exclude or []) pattern avoids mutable-default pitfalls. ✓
  • CAS race condition protection in both RenderConfigContextJob and rebuild_config_context_cache is consistent — same pattern, same rationale. The duplication is acceptable given the context difference (background job vs. management command).
  • Warning comment in querysets.py about the dual-maintenance requirement between get_for_object and get_affected_objects is important and appropriately placed. The parity test provides a strong safety net for this.
  • weak=False on all dynamically connected signal handlers — correct; without this, the closures could be garbage collected.
  • transaction.on_commit deferral of job enqueue — correct pattern to prevent the background worker from reading uncommitted state.
  • annotate_config_context_data() still available on the queryset and still used by the job/management command for the on-demand render path. The query-count test confirms zero queries in warm-cache reads. ✓

Summary

This is a well-engineered optimization. The core design (nullable JSON cache + generation counter + signal-based invalidation + background sweep) is solid, the parity test provides strong correctness guarantees, and the test coverage is thorough. The main issues are:

  1. Cross-app top-level imports in cache.py (should be lazy, minor style issue)
  2. TenantGroup in mptt_triggers causing harmless over-invalidation (confirm it's intentional)
  3. Missing invalidation tests for Cluster.type/group and Tenant.group upstream changes
  4. No per-pass warning in RenderConfigContextJob for pathological churn scenarios

@jeremystretch jeremystretch force-pushed the 21025-pre-render-config-context-data branch from 3ea5839 to d43893f Compare June 4, 2026 16:05
@jeremystretch

Copy link
Copy Markdown
Member Author

@claude /review Please review again, considering the following responses to the above concerns:

  1. Not an issue. (Bizarre that this was even flagged.)
  2. Fixed
  3. Acceptable for now. We'll likely explore improving config context "plugability" in the future.
  4. Known; will flag this in the release notes.
  5. Added tests
  6. Added tests
  7. Fixed
  8. Acceptable for now. We might adopt a more robust approach after testing at scale.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Claude finished @jeremystretch's task in 1m 24s —— View job


PR Review (Re-review): #21025 Optimize rendering of config context data

  • Gather context
  • Read updated files to verify fixes
  • Analyze updated code
  • Post updated review findings

Re-review Summary

I've verified each of the eight responses against the current state of the code. All items are accounted for and the PR is in good shape.


Item-by-item verification

1. Cross-app top-level imports in extras/cache.py — acknowledged

Fair enough; cache.py is only imported during AppConfig.ready() at which point all apps are fully initialized, so the import order concern doesn't apply in practice. Accepted.

2. TenantGroup in mptt_triggers — confirmed fixed ✓

mptt_triggers in signals.py (lines 307–314) now contains only Region, SiteGroup, DeviceRole, Platform, and Location. TenantGroup was moved out — it now appears only in _connect_upstream_delete_handlers (as tenant_groups) and in invalidate_for_scope_delta's direct_attrs dict (line 101 of cache.py). This is the correct placement: direct-equality matching, not ancestor traversal.

3. get_affected_objects() ordering — acknowledged

Accepted as-is for now.

4. ?exclude=config_context breaking change — acknowledged

Release notes will cover it.

5. Missing test for Cluster.type/group change → VM cache invalidation — confirmed added ✓

test_direct_upstream_cluster_type_change_invalidates_vm (line 464) and test_direct_upstream_cluster_group_change_invalidates_vm (line 478) are present and correctly exercise the path: create cluster with type/group, attach a VM, warm the cache, change the cluster attribute, assert the VM cache is NULLed.

6. Missing test for Tenant.group change → Device/VM cache invalidation — confirmed added ✓

test_direct_upstream_tenant_group_change_invalidates_device (line 493) and test_direct_upstream_tenant_group_change_invalidates_vm (line 509) cover both sides.

7. No warning for exhausted render passes — confirmed fixed ✓

RenderConfigContextJob.run() (lines 68–76 of jobs.py) now has an else clause on the for loop that fires a self.logger.warning(...) when all 100 passes complete without converging. The message is clear and actionable.

8. rebuild_config_context_cache blocking in upgrade.sh — acknowledged

Acceptable for now.


Conclusion

All previously raised concerns are either resolved, accepted, or acknowledged with a plan. The implementation is solid: the CAS generation counter, the enqueue_once coalescing strategy, the scope-parity test, and the comprehensive upstream invalidation graph are all well-executed. This is ready to merge from a code review perspective.

@jeremystretch jeremystretch marked this pull request as ready for review June 4, 2026 16:10
@jeremystretch jeremystretch requested review from a team and arthanson and removed request for a team June 4, 2026 16:10

@arthanson arthanson left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good, just one small item I'm not sure is a bug with returning the cached item that gets mutated, it is a change from previous and probably should return a copy to avoid inadvertent side-effects.

"""
from extras.models.tags import TaggedItem

def _mptt_descendants(m2m):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will need to double-check this with the Ltree stuff once merged. at least rename it

"""
cached = getattr(self, '_config_context_data', None)
if cached is not None:
return cached

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

callers are mutating this (extras/views.py:1288) I think me may want to return a new copy here instead of the direct cached object?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pre-render and cache config context data

2 participants