Features/cohort generation with ibis and full benchmarking (on databricks) by azimov · Pull Request #39 · OHDSI/Circepy

azimov · 2026-06-17T16:34:45Z

This branch is a result of heavily benchmarking and trying to improve the ibis execution layer with real cohorts from the Phenotype library (submitted to OHDSI symposium as an abstract).

The benchmarking code is messy and bloats this repository so I moved it here

This branch also makes ibis a requirement of the package which I think is merited. The side effect of this is that python 3.9 will be removed from support (which is on the roadmap anyway)

A lot of changes were made to create cohort definition sets but here is a summary of the changes made:

Codesets (`ibis/codesets.py`)

Replaced in-memory concept ID caching (CachedConceptSetResolver) with a single database-resident codeset table per cohort. Concept sets are expanded via SQL joins through concept_ancestor/concept tables, not Python memory. ibis.memtable() removed entirely; utilities like _literal_select() build small expressions without temporary tables.

Compile steps (`ibis/compile_steps.py`)

Concept filtering switched from isin(concept_ids) (Python tuples) to semi-joins/anti-joins against the codeset table. _resolve_concept_ids() replaced by _filter_by_concept_table(). Added source concept filtering (e.g., condition_source_concept_id).

Context (`ibis/context.py`)

concept_ids_for_codeset() -> tuple[int, ...] replaced by concept_set_table(codeset_id) -> Table — filters the shared codeset table by id and returns an ibis expression. ExecutionContext accepts codeset_table directly instead of a resolver class.

Cohort engine (`engine/cohort.py`)

Intermediate results materialized to backend staging tables at each stage boundary (primary → qualified → included → ended). New _materialize()/_drop_staging_tables() functions. _union_all() uses divide-and-conquer recursion to avoid expression tree explosion. build_cohort_table() accepts cohort_id, materialize, cohort_table, session_prefix.

Custom eras (`engine/custom_era.py` — new)

Drug exposure era computation: _compute_exposure_end_date() resolves exposure end from supply details; _compute_eras() collapses overlapping exposures via window functions within a configurable gap. Previously blocked as UnsupportedFeatureError; now supported.

API (`api.py`)

build_cohort() gained cohort_id, materialize, codeset_table, cohort_table, session_prefix. Removed use_persistent_cache.

Normalize layer

FilteredConceptCriterion now carries source_codeset_id and source_concept_column for source concept filtering. All domain criteria populate these during normalization. Custom eras no longer raise UnsupportedFeatureError.

Lower layer (`lower/common.py`)

Emits FilterByCodeset steps for source concept predicates, matching T-SQL ConceptSetExpressionQueryBuilder behavior.

Other engine files (8 files)

Minor interface adjustments: passing source_codeset_id through the pipeline, accepting codeset_table/concept_set_table instead of concept_ids, materialization plumbing.

Cohort definition set (`cohort_definition_set/` — new, 4 files)

_core.py: CohortDefinition, CohortGenerationResult, CohortDefinitionSet (indexed container).
_generate.py: generate_cohort_set() / async_generate_cohort_set() — batch generation with incremental checksum skipping, per-cohort codeset management, thread-safe backend locking.
_checksum_store.py: Persistent generation history — v1/v2 compatible checksum table, window functions for latest per-cohort status.
__init__.py: Public API — CohortDefinitionSet, generate_cohort_set, async_generate_cohort_set, summarise_generation_results.

…ved overall performance. Uncovered custom eras bug

…to preserve all events When DrugExposure(first=True) and QualifiedLimit=First, every person has exactly 1 event with event_id=1 (assigned by _assign_primary_event_ids). The CustomEra window previously grouped by event_id alone, collapsing all rows into 1 partition and dropping N-1 rows with _rn==0. Grouping by (person_id, event_id) gives each row its own partition, preserving all events. Adds regression tests via build_cohort and generate_cohort_set.

…to preserve all events When DrugExposure(first=True) and QualifiedLimit=First, every person has exactly 1 event with event_id=1 (assigned by _assign_primary_event_ids). The CustomEra window previously grouped by event_id alone, collapsing all rows into 1 partition and dropping N-1 rows with _rn==0. Grouping by (person_id, event_id) gives each row its own partition, preserving all events. Adds regression test via build_cohort.

…y use Adds materialize: bool = True parameter to build_cohort and build_cohort_table. When False, skips the staging-table creation added for large-cohort SQL compilation performance. Used by the phenotype regression tests so they remain compile-only. Also: - Add union-scaling regression tests (1-100 criteria) - Fix phenotype test to use materialize=False (was hanging) - Fix SIM108 ternary in compare_cohort_outputs.py

ccrce/execution/engine/custom_era.py: - Issue 2 ✓ — _padded_end includes gap_days + offset; era end uses max(padded_end) - gap_days matching Circe BE - Issue 3 ✓ — Filters both drug_concept_id and drug_source_concept_id (with column existence guard) - Issue 5 ✓ — compute_drug_eras accepts cohort_person_ids; apply_custom_era_strategy semi-joins drug_exposure to cohort persons

…ion-set # Conflicts: # tests/execution/test_custom_era.py

…for any exception to carry on to other cohorts

…rary

codecov · 2026-06-17T16:37:09Z

Codecov Report

❌ Patch coverage is 90.49154% with 118 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.38%. Comparing base (c2e7067) to head (d99b9c7).
⚠️ Report is 4 commits behind head on develop.

Files with missing lines	Patch %	Lines
circe/execution/ibis/codesets.py	81.03%	33 Missing ⚠️
circe/cohort_definition_set/_checksum_store.py	59.72%	29 Missing ⚠️
circe/execution/ibis/compile_steps.py	69.23%	16 Missing ⚠️
circe/execution/engine/cohort.py	80.55%	7 Missing ⚠️
circe/execution/engine/custom_era.py	91.07%	5 Missing ⚠️
circe/cohort_definition_set/_generate.py	95.69%	4 Missing ⚠️
circe/execution/engine/censoring.py	33.33%	4 Missing ⚠️
circe/execution/ibis/operations.py	84.00%	4 Missing ⚠️
circe/execution/engine/group_demographics.py	87.50%	3 Missing ⚠️
circe/cohortdefinition/builders/measurement.py	50.00%	2 Missing ⚠️
... and 9 more

Additional details and impacted files

@@             Coverage Diff             @@
##           develop      #39      +/-   ##
===========================================
+ Coverage    85.79%   86.38%   +0.58%     
===========================================
  Files          169      173       +4     
  Lines        12514    12903     +389     
===========================================
+ Hits         10737    11146     +409     
+ Misses        1777     1757      -20

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…quired python version

azimov and others added 30 commits May 4, 2026 14:20

Added cohort definition sets and cohort generation features

15487b1

Merge branch 'develop' into features/cohort-definition-set

073bcaa

started benchmarking scripts

cabaa19

Removed and re-worked slop benchmark code

02fc184

implementation of custom end era logic in ibis layer

4b89967

more tests around custom era logic for parity with java

09748f2

merge fatures/ibis-custom-eras' into features/cohort-definition-set

471d75b

Fixed perfromance issues wwith ibis compilation parse trees and impro…

5c99060

…ved overall performance. Uncovered custom eras bug

updated benchmarks to run well on databricks

4a4da42

Added files needed by python from benchmark script

4d402f9

Attempt to fix test warnings

ba78e1b

remove chat slop

9860604

Merge branch 'features/ibis-custom-eras' into features/cohort-definit…

1dfe42b

…ion-set # Conflicts: # tests/execution/test_custom_era.py

simplification of checksum process

3f5b21f

Error log

f151953

added fixes to checksums using raw insert/upsert instead of memtables

95fcc6b

Error log

3141f10

Async cohort generation and compilation with improved error handling …

fc48d8a

…for any exception to carry on to other cohorts

updated concept set resolution to remove in memory python nonsense

f27ddb0

concept set optimizations

a837d7a

concept set fixes

6a825dd

concept set fixes

072e5a9

concept set cohort table names

e99f27f

concept set cache table creation up front

d00c78e

concept set cache table creation up front

817727d

concept set cache table creation up front

8dcd3d2

azimov and others added 22 commits May 18, 2026 13:45

concept set cache table creation

ba4825b

Removed ill-faited concept set cache - need to refine design better

e7cf487

Refactor table usage for concept sets

bdcacbf

Error log

15802d6

Removal of inefficient python memory usage and memtables

d60b35b

more fixes from concept set changes

440e6e9

More crazy codeset join fixes

1c3a3d7

Removal of heavy recursion for large unions

2604f9d

Replaced concept set temp tables per cohort with per execution

2078f00

Fixed bad recursion for excluded concept sets

2af353c

Modification to inefficient joins in correlated criteria

739eaac

Optimization to codeset resolution and Additional criteria

90a374f

Fix concept casting issue on databricks

8f194df

Fixed issue with use of source concepts

3ff5a30

Tests and fixes for cohorts that are not identical from phenotype lib…

8b132c8

…rary

Added benchmark report markdown file

55f75b6

Benchmark script changes for local run on databricks

ed2cec3

git ignore

4412902

git ignore

f0dca61

git ignore

e92f452

git ignore

77da42f

removed benchmarking code (moved to stand alone repository)

f830cb5

azimov added 4 commits June 17, 2026 09:42

Fixing tests and removing python 3.9 support

023b28d

Actual fix - make ibis a core dependency of the package and update re…

1b2b1e3

…quired python version

Ruff fixes

b7a433a

Ruff fixes

d99b9c7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Features/cohort generation with ibis and full benchmarking (on databricks)#39

Features/cohort generation with ibis and full benchmarking (on databricks)#39
azimov wants to merge 56 commits into
developfrom
features/cohort-definition-set

azimov commented Jun 17, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

azimov commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codesets (ibis/codesets.py)

Compile steps (ibis/compile_steps.py)

Context (ibis/context.py)

Cohort engine (engine/cohort.py)

Custom eras (engine/custom_era.py — new)

API (api.py)

Normalize layer

Lower layer (lower/common.py)

Other engine files (8 files)

Cohort definition set (cohort_definition_set/ — new, 4 files)

Uh oh!

codecov Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

azimov commented Jun 17, 2026 •

edited

Loading

Codesets (`ibis/codesets.py`)

Compile steps (`ibis/compile_steps.py`)

Context (`ibis/context.py`)

Cohort engine (`engine/cohort.py`)

Custom eras (`engine/custom_era.py` — new)

API (`api.py`)

Lower layer (`lower/common.py`)

Cohort definition set (`cohort_definition_set/` — new, 4 files)

codecov Bot commented Jun 17, 2026 •

edited

Loading