Skip to content

Features/cohort generation with ibis and full benchmarking (on databricks)#39

Open
azimov wants to merge 56 commits into
developfrom
features/cohort-definition-set
Open

Features/cohort generation with ibis and full benchmarking (on databricks)#39
azimov wants to merge 56 commits into
developfrom
features/cohort-definition-set

Conversation

@azimov

@azimov azimov commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

This branch is a result of heavily benchmarking and trying to improve the ibis execution layer with real cohorts from the Phenotype library (submitted to OHDSI symposium as an abstract).

The benchmarking code is messy and bloats this repository so I moved it here

This branch also makes ibis a requirement of the package which I think is merited. The side effect of this is that python 3.9 will be removed from support (which is on the roadmap anyway)

A lot of changes were made to create cohort definition sets but here is a summary of the changes made:

Codesets (ibis/codesets.py)

Replaced in-memory concept ID caching (CachedConceptSetResolver) with a single database-resident codeset table per cohort. Concept sets are expanded via SQL joins through concept_ancestor/concept tables, not Python memory. ibis.memtable() removed entirely; utilities like _literal_select() build small expressions without temporary tables.

Compile steps (ibis/compile_steps.py)

Concept filtering switched from isin(concept_ids) (Python tuples) to semi-joins/anti-joins against the codeset table. _resolve_concept_ids() replaced by _filter_by_concept_table(). Added source concept filtering (e.g., condition_source_concept_id).

Context (ibis/context.py)

concept_ids_for_codeset() -> tuple[int, ...] replaced by concept_set_table(codeset_id) -> Table — filters the shared codeset table by id and returns an ibis expression. ExecutionContext accepts codeset_table directly instead of a resolver class.

Cohort engine (engine/cohort.py)

Intermediate results materialized to backend staging tables at each stage boundary (primary → qualified → included → ended). New _materialize()/_drop_staging_tables() functions. _union_all() uses divide-and-conquer recursion to avoid expression tree explosion. build_cohort_table() accepts cohort_id, materialize, cohort_table, session_prefix.

Custom eras (engine/custom_era.py — new)

Drug exposure era computation: _compute_exposure_end_date() resolves exposure end from supply details; _compute_eras() collapses overlapping exposures via window functions within a configurable gap. Previously blocked as UnsupportedFeatureError; now supported.

API (api.py)

build_cohort() gained cohort_id, materialize, codeset_table, cohort_table, session_prefix. Removed use_persistent_cache.

Normalize layer

FilteredConceptCriterion now carries source_codeset_id and source_concept_column for source concept filtering. All domain criteria populate these during normalization. Custom eras no longer raise UnsupportedFeatureError.

Lower layer (lower/common.py)

Emits FilterByCodeset steps for source concept predicates, matching T-SQL ConceptSetExpressionQueryBuilder behavior.

Other engine files (8 files)

Minor interface adjustments: passing source_codeset_id through the pipeline, accepting codeset_table/concept_set_table instead of concept_ids, materialization plumbing.

Cohort definition set (cohort_definition_set/ — new, 4 files)

  • _core.py: CohortDefinition, CohortGenerationResult, CohortDefinitionSet (indexed container).
  • _generate.py: generate_cohort_set() / async_generate_cohort_set() — batch generation with incremental checksum skipping, per-cohort codeset management, thread-safe backend locking.
  • _checksum_store.py: Persistent generation history — v1/v2 compatible checksum table, window functions for latest per-cohort status.
  • __init__.py: Public API — CohortDefinitionSet, generate_cohort_set, async_generate_cohort_set, summarise_generation_results.

azimov and others added 30 commits May 4, 2026 14:20
…ved overall performance. Uncovered custom eras bug
…to preserve all events

When DrugExposure(first=True) and QualifiedLimit=First, every person
has exactly 1 event with event_id=1 (assigned by
_assign_primary_event_ids).  The CustomEra window previously grouped
by event_id alone, collapsing all rows into 1 partition and dropping
N-1 rows with _rn==0.

Grouping by (person_id, event_id) gives each row its own partition,
preserving all events.

Adds regression tests via build_cohort and generate_cohort_set.
…to preserve all events

When DrugExposure(first=True) and QualifiedLimit=First, every person
has exactly 1 event with event_id=1 (assigned by
_assign_primary_event_ids).  The CustomEra window previously grouped
by event_id alone, collapsing all rows into 1 partition and dropping
N-1 rows with _rn==0.

Grouping by (person_id, event_id) gives each row its own partition,
preserving all events.

Adds regression test via build_cohort.
…y use

Adds materialize: bool = True parameter to build_cohort and
build_cohort_table. When False, skips the staging-table creation
added for large-cohort SQL compilation performance. Used by the
phenotype regression tests so they remain compile-only.

Also:
- Add union-scaling regression tests (1-100 criteria)
- Fix phenotype test to use materialize=False (was hanging)
- Fix SIM108 ternary in compare_cohort_outputs.py
ccrce/execution/engine/custom_era.py:
- Issue 2 ✓ — _padded_end includes gap_days + offset; era end uses max(padded_end) - gap_days matching Circe BE
- Issue 3 ✓ — Filters both drug_concept_id and drug_source_concept_id (with column existence guard)
- Issue 5 ✓ — compute_drug_eras accepts cohort_person_ids; apply_custom_era_strategy semi-joins drug_exposure to cohort persons
…ion-set

# Conflicts:
#	tests/execution/test_custom_era.py
…for any exception to carry on to other cohorts
@codecov

codecov Bot commented Jun 17, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.49154% with 118 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.38%. Comparing base (c2e7067) to head (d99b9c7).
⚠️ Report is 4 commits behind head on develop.

Files with missing lines Patch % Lines
circe/execution/ibis/codesets.py 81.03% 33 Missing ⚠️
circe/cohort_definition_set/_checksum_store.py 59.72% 29 Missing ⚠️
circe/execution/ibis/compile_steps.py 69.23% 16 Missing ⚠️
circe/execution/engine/cohort.py 80.55% 7 Missing ⚠️
circe/execution/engine/custom_era.py 91.07% 5 Missing ⚠️
circe/cohort_definition_set/_generate.py 95.69% 4 Missing ⚠️
circe/execution/engine/censoring.py 33.33% 4 Missing ⚠️
circe/execution/ibis/operations.py 84.00% 4 Missing ⚠️
circe/execution/engine/group_demographics.py 87.50% 3 Missing ⚠️
circe/cohortdefinition/builders/measurement.py 50.00% 2 Missing ⚠️
... and 9 more
Additional details and impacted files
@@             Coverage Diff             @@
##           develop      #39      +/-   ##
===========================================
+ Coverage    85.79%   86.38%   +0.58%     
===========================================
  Files          169      173       +4     
  Lines        12514    12903     +389     
===========================================
+ Hits         10737    11146     +409     
+ Misses        1777     1757      -20     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant