Features/cohort generation with ibis and full benchmarking (on databricks)#39
Open
azimov wants to merge 56 commits into
…ved overall performance. Uncovered custom eras bug
…to preserve all events
When DrugExposure(first=True) and QualifiedLimit=First, every person has exactly 1 event with event_id=1 (assigned by _assign_primary_event_ids). The CustomEra window previously partitioned by event_id alone, which collapsed all N rows into 1 partition, so the _rn==0 filter dropped N-1 of them. Partitioning by (person_id, event_id) gives each row its own partition, preserving all events. Adds regression tests via build_cohort and generate_cohort_set.
…to preserve all events
When DrugExposure(first=True) and QualifiedLimit=First, every person has exactly 1 event with event_id=1 (assigned by _assign_primary_event_ids). The CustomEra window previously partitioned by event_id alone, which collapsed all N rows into 1 partition, so the _rn==0 filter dropped N-1 of them. Partitioning by (person_id, event_id) gives each row its own partition, preserving all events. Adds regression test via build_cohort.
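The partitioning bug described in the two commits above can be reproduced outside the engine. The following is a minimal sketch, not the PR's ibis code: it uses plain SQL on an in-memory SQLite database, and the table name `events`, the helper `first_per_partition`, and the toy rows are assumptions made for illustration. It assumes a 0-based row number filtered with `_rn = 0`, as the commit message describes.

```python
import sqlite3

# Hypothetical toy data: three persons, each with a single event whose event_id is 1.
rows = [(101, 1), (102, 1), (103, 1)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (person_id INTEGER, event_id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)


def first_per_partition(partition_cols: str):
    """Keep only the first row (0-based _rn = 0) of each window partition."""
    sql = f"""
        SELECT person_id, event_id
        FROM (
            SELECT person_id,
                   event_id,
                   ROW_NUMBER() OVER (
                       PARTITION BY {partition_cols}
                       ORDER BY person_id
                   ) - 1 AS _rn
            FROM events
        )
        WHERE _rn = 0
        ORDER BY person_id
    """
    return conn.execute(sql).fetchall()


# Partitioning by event_id alone puts all three rows into one partition.
buggy = first_per_partition("event_id")
# Partitioning by (person_id, event_id) gives each row its own partition.
fixed = first_per_partition("person_id, event_id")
```

With the toy data, `buggy` keeps one row and `fixed` keeps all three, which is the N-1 row loss the regression tests guard against.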
…y use
Adds materialize: bool = True parameter to build_cohort and build_cohort_table. When False, skips the staging-table creation added for large-cohort SQL compilation performance. Used by the phenotype regression tests so they remain compile-only.
Also:
- Add union-scaling regression tests (1-100 criteria)
- Fix phenotype test to use materialize=False (was hanging)
- Fix SIM108 ternary in compare_cohort_outputs.py
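The union-scaling tests cover 1 to 100 criteria, and the PR summary says `_union_all()` uses divide-and-conquer recursion to avoid expression tree explosion. A rough sketch of that idea follows. It is not the PR's implementation: nested tuples stand in for ibis expressions, and `union_all_balanced`, `pair`, and `depth` are hypothetical names used only here.

```python
from functools import reduce
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def union_all_balanced(parts: Sequence[T], union: Callable[[T, T], T]) -> T:
    """Combine parts pairwise by recursive halving, so nesting grows with log2(n)."""
    if not parts:
        raise ValueError("need at least one part to union")
    if len(parts) == 1:
        return parts[0]
    mid = len(parts) // 2
    left = union_all_balanced(parts[:mid], union)
    right = union_all_balanced(parts[mid:], union)
    return union(left, right)


def pair(a, b):
    """Stand-in for a binary union: records the operands as a nested tuple."""
    return (a, b)


def depth(node) -> int:
    """Nesting depth of the stand-in expression tree (leaves are strings)."""
    if isinstance(node, tuple):
        return 1 + max(depth(node[0]), depth(node[1]))
    return 0


leaves = [f"criterion_{i}" for i in range(100)]
chained = reduce(pair, leaves)                   # left-deep chain of 99 unions
balanced = union_all_balanced(leaves, pair)      # balanced tree of 99 unions

chained_depth = depth(chained)
balanced_depth = depth(balanced)
```

Both trees contain the same 99 binary unions; only the nesting differs (99 levels for the chain, 7 for the balanced tree with 100 leaves).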
ccrce/execution/engine/custom_era.py:
- Issue 2 ✓: _padded_end includes gap_days + offset; era end uses max(padded_end) - gap_days, matching Circe BE
- Issue 3 ✓: Filters both drug_concept_id and drug_source_concept_id (with column existence guard)
- Issue 5 ✓: compute_drug_eras accepts cohort_person_ids; apply_custom_era_strategy semi-joins drug_exposure to cohort persons
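The padded-end rule in Issue 2 can be illustrated for a single person in plain Python. This is a sketch under stated assumptions, not the engine's window-function code: `collapse_eras` is a hypothetical helper, the offset term is left out, and an exposure is assumed to join the current era when it starts on or before the era's padded end.

```python
from datetime import date, timedelta


def collapse_eras(exposures, gap_days):
    """Collapse (start, end) exposures for one person into eras.

    Each end date is padded by gap_days while merging; the era end is
    max(padded_end) - gap_days, so the padding never leaks into the result.
    """
    pad = timedelta(days=gap_days)
    eras = []
    era_start = None
    era_padded_end = None
    for start, end in sorted(exposures):
        padded_end = end + pad
        if era_start is None:
            era_start, era_padded_end = start, padded_end
        elif start <= era_padded_end:
            # Within the gap: extend the current era.
            era_padded_end = max(era_padded_end, padded_end)
        else:
            # Gap exceeded: close the current era and start a new one.
            eras.append((era_start, era_padded_end - pad))
            era_start, era_padded_end = start, padded_end
    if era_start is not None:
        eras.append((era_start, era_padded_end - pad))
    return eras


eras = collapse_eras(
    [
        (date(2023, 1, 1), date(2023, 1, 10)),
        (date(2023, 1, 20), date(2023, 1, 31)),   # starts within 30 days of Jan 10
        (date(2023, 3, 15), date(2023, 3, 20)),   # more than 30 days after Jan 31
    ],
    gap_days=30,
)
```

The first two exposures merge into one era ending on the last real exposure end (2023-01-31), and the third forms its own era.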
…ion-set
# Conflicts:
# tests/execution/test_custom_era.py
…for any exception to carry on to other cohorts
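A minimal sketch of the "carry on to other cohorts" behaviour, combined with the incremental checksum skipping that the PR summary attributes to `generate_cohort_set()`. Everything here is hypothetical (`definition_checksum`, `generate_all`, the status strings, the dict-based checksum store); it only illustrates the control flow, not the package's API.

```python
import hashlib
import json


def definition_checksum(definition: dict) -> str:
    """Stable checksum of a cohort definition (key order does not matter)."""
    payload = json.dumps(definition, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def generate_all(definitions: dict, build, previous_checksums: dict) -> dict:
    """Build each cohort; skip unchanged ones and keep going after failures."""
    results = {}
    for cohort_id, definition in definitions.items():
        checksum = definition_checksum(definition)
        if previous_checksums.get(cohort_id) == checksum:
            results[cohort_id] = "SKIPPED"
            continue
        try:
            build(cohort_id, definition)
        except Exception as exc:  # any failure is recorded, then the loop continues
            results[cohort_id] = f"FAILED: {exc}"
            continue
        previous_checksums[cohort_id] = checksum
        results[cohort_id] = "COMPLETE"
    return results


def fake_build(cohort_id, definition):
    """Stand-in builder that fails for cohort 2 only."""
    if cohort_id == 2:
        raise RuntimeError("boom")


definitions = {1: {"name": "a"}, 2: {"name": "b"}, 3: {"name": "c"}}
store = {}
first_run = generate_all(definitions, fake_build, store)
second_run = generate_all(definitions, fake_build, store)
```

On the second run, cohorts 1 and 3 are skipped because their checksums are unchanged, while cohort 2 is retried because no successful checksum was stored for it.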
Codecov Report
❌ Patch coverage is …
@@ Coverage Diff @@
## develop #39 +/- ##
===========================================
+ Coverage 85.79% 86.38% +0.58%
===========================================
Files 169 173 +4
Lines 12514 12903 +389
===========================================
+ Hits 10737 11146 +409
+ Misses 1777 1757 -20
This branch is the result of extensive benchmarking of the ibis execution layer, and of work to improve it, using real cohorts from the Phenotype library (submitted to the OHDSI symposium as an abstract).
The benchmarking code is messy and bloats this repository, so I moved it here
This branch also makes ibis a required dependency of the package, which I think is merited. As a side effect, Python 3.9 support will be dropped (which is on the roadmap anyway).
Many of the changes support creating cohort definition sets. Here is a summary of the changes:
Codesets (ibis/codesets.py)
Replaced in-memory concept ID caching (CachedConceptSetResolver) with a single database-resident codeset table per cohort. Concept sets are expanded via SQL joins through concept_ancestor/concept tables, not Python memory. ibis.memtable() removed entirely; utilities like _literal_select() build small expressions without temporary tables.

Compile steps (ibis/compile_steps.py)
Concept filtering switched from isin(concept_ids) (Python tuples) to semi-joins/anti-joins against the codeset table. _resolve_concept_ids() replaced by _filter_by_concept_table(). Added source concept filtering (e.g., condition_source_concept_id).

Context (ibis/context.py)
concept_ids_for_codeset() -> tuple[int, ...] replaced by concept_set_table(codeset_id) -> Table, which filters the shared codeset table by id and returns an ibis expression. ExecutionContext accepts codeset_table directly instead of a resolver class.

Cohort engine (engine/cohort.py)
Intermediate results materialized to backend staging tables at each stage boundary (primary → qualified → included → ended). New _materialize()/_drop_staging_tables() functions. _union_all() uses divide-and-conquer recursion to avoid expression tree explosion. build_cohort_table() accepts cohort_id, materialize, cohort_table, session_prefix.

Custom eras (engine/custom_era.py, new)
Drug exposure era computation: _compute_exposure_end_date() resolves exposure end from supply details; _compute_eras() collapses overlapping exposures via window functions within a configurable gap. Previously blocked as UnsupportedFeatureError; now supported.

API (api.py)
build_cohort() gained cohort_id, materialize, codeset_table, cohort_table, session_prefix. Removed use_persistent_cache.

Normalize layer
FilteredConceptCriterion now carries source_codeset_id and source_concept_column for source concept filtering. All domain criteria populate these during normalization. Custom eras no longer raise UnsupportedFeatureError.

Lower layer (lower/common.py)
Emits FilterByCodeset steps for source concept predicates, matching T-SQL ConceptSetExpressionQueryBuilder behavior.

Other engine files (8 files)
Minor interface adjustments: passing source_codeset_id through the pipeline, accepting codeset_table/concept_set_table instead of concept_ids, materialization plumbing.

Cohort definition set (cohort_definition_set/, new, 4 files)
- _core.py: CohortDefinition, CohortGenerationResult, CohortDefinitionSet (indexed container).
- _generate.py: generate_cohort_set()/async_generate_cohort_set(), batch generation with incremental checksum skipping, per-cohort codeset management, thread-safe backend locking.
- _checksum_store.py: Persistent generation history, with a v1/v2 compatible checksum table and window functions for latest per-cohort status.
- __init__.py: Public API (CohortDefinitionSet, generate_cohort_set, async_generate_cohort_set, summarise_generation_results).
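To make the codeset change concrete, here is a small sketch of the pattern described under Codesets and Compile steps: expand a concept set inside the database through concept_ancestor, then filter a domain table with a semi-join (or anti-join) against the codeset table instead of passing a Python tuple to isin(). It uses plain SQL on in-memory SQLite rather than ibis, and the `codesets` table layout, the concept ids, and the toy rows are assumptions for illustration; the OMOP column names are the standard ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE concept_ancestor (ancestor_concept_id INTEGER, descendant_concept_id INTEGER);
    CREATE TABLE condition_occurrence (person_id INTEGER, condition_concept_id INTEGER);
    CREATE TABLE codesets (codeset_id INTEGER, concept_id INTEGER);

    -- Toy vocabulary: concept 10 has descendants 11 and 12 (self-rows included).
    INSERT INTO concept_ancestor VALUES (10, 10), (10, 11), (10, 12), (20, 20);
    INSERT INTO condition_occurrence VALUES (1, 11), (2, 20), (3, 12), (4, 99);
    """
)

# Expand codeset 0 (concept 10 plus descendants) entirely in SQL; no ids reach Python.
conn.execute(
    """
    INSERT INTO codesets (codeset_id, concept_id)
    SELECT DISTINCT 0, ca.descendant_concept_id
    FROM concept_ancestor ca
    WHERE ca.ancestor_concept_id IN (10)
    """
)

# Semi-join: keep rows whose concept is in codeset 0.
semi = conn.execute(
    """
    SELECT co.person_id
    FROM condition_occurrence co
    WHERE EXISTS (
        SELECT 1 FROM codesets cs
        WHERE cs.codeset_id = 0 AND cs.concept_id = co.condition_concept_id
    )
    ORDER BY co.person_id
    """
).fetchall()

# Anti-join: keep rows whose concept is NOT in codeset 0.
anti = conn.execute(
    """
    SELECT co.person_id
    FROM condition_occurrence co
    WHERE NOT EXISTS (
        SELECT 1 FROM codesets cs
        WHERE cs.codeset_id = 0 AND cs.concept_id = co.condition_concept_id
    )
    ORDER BY co.person_id
    """
).fetchall()
```

Because the concept ids stay in the database, the compiled query size does not grow with the number of concepts in the set, which is presumably the motivation for replacing the tuple-based isin() filter.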