Add num_subjects: advanced-search operator by bendichter · Pull Request #2827 · dandi/dandi-archive

bendichter · 2026-05-13T23:57:47Z

Draft. Stacked on #2822 — review will be cleaner once the contributor PR merges. The diff against master will collapse to this PR's commits only at that point.

Adds num_subjects:N — match dandisets with at least N subjects, per Version.metadata.assetsSummary.numberOfSubjects (populated by dandischema's aggregate_assets_summary()).

num_subjects:10                       # dandisets with ≥10 subjects
num_subjects:1 species:mouse          # AND'd: ≥1 subject AND has a mouse asset
NUM_SUBJECTS:50                       # case-insensitive operator key
num_subjects:abc                      # 400 — non-negative integer required

Why ≥N (no upper bound)

The intent users actually have is "studies with at least N subjects". Upper bounds are rare enough that the extra syntax cost (>=, <=, .., validation, dropdown UI) wasn't worth it. If the need shows up, the implementation can grow GitHub-style comparisons without breaking the bare-integer form.

Why no `num_sessions:`

Considered, deliberately omitted:

dandischema.AssetsSummary does not aggregate sessions. Its numeric fields are numberOfBytes, numberOfFiles, numberOfSubjects, numberOfSamples, numberOfCells — no numberOfSessions.
BareAsset has no sessionId field either, so we can't derive a count from per-asset metadata at query time.
A useful num_sessions: would need upstream work in dandischema (add numberOfSessions to AssetsSummary) plus an aggregator that infers sessions from BIDS ses- paths and/or NWB session metadata. That's its own project.

Implementation

COUNT_OPS dict in dandiapi/api/services/search/operators.py maps num_subjects → $.assetsSummary.numberOfSubjects. Adding any of num_files, num_bytes, num_samples, num_cells is a one-line entry — same dispatch.
_apply_count_filter (filters.py) builds jsonb_path_exists(metadata, '$path ? (@ >= $val)', '{"val": N}'::jsonb). The integer is bound via the jsonpath vars parameter (not inlined into SQL or jsonpath text) — injection-safe.
Versions missing the field don't match: the jsonpath ? predicate drops missing/null/non-numeric values naturally. A freshly-created dandiset with no assets summarized yet won't satisfy num_subjects:0.
Match semantics mirror the contributor pattern: a dandiset matches if at least one of its versions satisfies the predicate.

Files

dandiapi/api/services/search/operators.py — COUNT_OPS table; folded into OPERATOR_KEYS
dandiapi/api/services/search/filters.py — _apply_count_filter() + dispatch entry
dandiapi/api/views/serializers.py — OpenAPI help text
web/src/components/DandisetSearchField.vue — popover entry
dandiapi/api/tests/test_dandiset.py — consolidated integration test (≥ threshold, missing-field exclusion, case-insensitive key, composition with created_after:) + invalid-value 400 test

Test plan

pytest dandiapi/api/tests/test_dandiset.py dandiapi/api/tests/test_search_parser.py -k 'advanced_search or search_parser' — 49 pass
ruff check dandiapi/api/services/search/ dandiapi/api/tests/test_dandiset.py — clean
vue-tsc --noEmit + eslint — clean
Manual: curl 'http://localhost:8000/api/dandisets/?search=num_subjects:abc' → 400 with helpful message; num_subjects:0 → 0 results (none of the seeded dandisets have assetsSummary.numberOfSubjects populated yet, which is the correct answer)

Filters dandisets to those owned by a given user. The value is matched case-insensitively against User.username OR User.email. The special form `owner:me` resolves to the requesting user (consistent with the existing ?user=me query parameter) and returns 400 if the request is anonymous. Implementation reuses the existing `get_owned_dandisets()` permission helper. We pass `with_superuser=False` so `owner:admin` returns only what admin explicitly owns — guardian's default would otherwise inflate to the entire archive for any superuser. Unknown users return zero results (not an error): a search for a nonexistent owner is a valid 0-hit query. Tests cover username/email lookup, case-insensitivity, unknown user, `owner:me` for an authenticated user, anonymous `owner:me` → 400, the superuser non-inflation guarantee, and combination with other operators. OpenAPI help text and the frontend operator popover updated.

Real users encounter the dandiset list with owners shown by display name (e.g. "Super User"), not by username. Searching that string was returning 0 because the lookup only matched username/email. Now matches case-insensitively against username, email, first_name, last_name, OR "first_name last_name" — so owner:"Super User" works the same as owner:ben.dichter@gmail.com. Multiple users may match (e.g. shared last name); we union dandisets owned by any of them via a direct DandisetUserObjectPermission query. Updated OpenAPI help text and the frontend popover example to `owner:"Jane Doe"` so users discover the new shape.

@yarikoptic

Round-2 review feedback on dandi#2821: - @yarikoptic flagged that owner:me silently shadows a real user named "Me". Fix: distinguish quoted vs unquoted at the parser level. Unquoted owner:me → magic alias for the requesting user. Quoted owner:"me" → literal lookup (matches a user whose first/last name is "Me"). Same pattern lets owner:"Me Someoneyou" reach the literal full-name match while keeping the convenient owner:me shortcut. Implementation: ParsedSearch.operators is now a list of `Operator` dataclasses (key, value, quoted) instead of bare tuples. Filters consume the new shape and the owner filter switches on the quoted flag. - Replaced personal email (ben.dichter@gmail.com) in the full-name test fixture with a generic example user. - Consolidated 10 small owner-tests into 3 denser ones that share setup per @yarikoptic's "make each test matter more" feedback. Coverage is unchanged (every documented lookup path is asserted; cross-key AND with another operator; multi-user union via shared last name; unknown user → 0; superuser non-inflation; owner:me magic; owner:"me" literal-escape; anonymous owner:me → 400). DB setup runs ~3x instead of ~10x. Updated OpenAPI help text and the search popover to mention the owner:me alias and the quoted-escape.

The unquoted owner:me → current-user shortcut required threading a `quoted` flag through the parser and a `request_user` arg through the filter dispatch — non-trivial machinery to support one alias. Per dandi#2822 review discussion, removing it from this PR keeps the owner operator focused on literal lookup-by-value (username / email / first / last / "first last") and avoids the design debate about the right escape mechanism for "I literally want a user named Me." The alias can come back in a focused follow-up PR if/when there's appetite for it. Concrete drops: - owner:me magic + 400-on-anonymous in `_apply_owner_filter` - `Operator.quoted` field on the parser dataclass - `quoted` and `request_user` parameters on `_apply_owner_filter` - `get_owned_dandisets` import (no longer used here) - `test_advanced_search_owner_me_magic_and_literal_escape` test - The two `owner-me-quoted` / `owner-me-unquoted` parser test cases - "owner:me" mentions in OpenAPI help text and the popover entry

…okup 29 new operators total: catch-all `contributor:` plus one per dandi-schema RoleType (`author`, `data_curator`, `funder`, `contact_person`, etc.). Independent-operator semantics — `author:Doe funder:NIH` returns dandisets where SOME contributor has Doe-as-Author AND SOME contributor (possibly different) has NIH-as-Funder. Each role-specific operator constrains a single contributor[] element to have BOTH the name match AND the role. Implementation: - A single `_CONTRIBUTOR_ROLE_OPS` dict drives both the parser allowlist and the filter dispatch; adding a future role is one new entry. - `_contributor_jsonpath()` builds a Postgres jsonb_path_exists predicate that ORs across `name`, `email`, AND `identifier` (so ORCID for Persons and ROR URL for Organizations both work, including bare-ID substring forms like `01cwqze88` matching the full ROR URL). - All contributor operators in a single query AND on the same Version's metadata so a draft + published version with disjoint contributor lists never combine into a spurious match. Why 29 separate operators rather than a `contributor: + role:` pair: independent operators compose cleanly (cross-key AND falls out naturally; no ambiguity about which role applies to which contributor when there are multiple). Same precedent as Gmail's `from:`/`to:`/`cc:`. The 28 role names come straight from `dandischema.RoleType`. Test: one consolidated test covers catch-all + role-specific lookup, case-insensitivity, identifier (ORCID + ROR + bare-ID substring), role-substring matching `dcite:`-prefixed stored values, role + ORCID composition (positive and negative), and independent cross-role AND. Plus a separate test for the typo → 400-with-suggestion path. Anonymous test fixtures use generic Doe placeholders, no real names. OpenAPI help text and the search popover updated.

The previous commit treated `affiliation:` as a role-name match (looking for `dcite:Affiliation` in `contributor[].roleName`), but real DANDI data never uses that role; affiliations live in a separate nested field `contributor[].affiliation[]`. The operator silently returned 0 hits despite plenty of (e.g.) Stanford-affiliated contributors. Fix: route `affiliation:` through a dedicated jsonpath that scans `$.contributor[*].affiliation[*]` and matches against the affiliation's `name` OR `identifier` (case-insensitive substring). So: affiliation:Stanford → matches Stanford University affiliation:"University College London" → quoted multi-word affiliation:00f54p054 → matches via ROR ID substring Composes with role/contributor operators on the same Version, same as the other contributor-style operators (independent-operator AND). Also refactored `_apply_contributor_filters` to accept a list of (where, params) pairs rather than (value, role) — cleaner since both the role-based and affiliation operators now share the same dispatch.

Per review: `other:` would be a thin surface for "uncategorized contributors" — not a useful filter — and `ethics_approval:` isn't a contributor-style role users would search by. Removing them tightens the operator vocabulary to the 25 substantive RoleType values + the contributor catch-all + affiliation.

Two structural improvements + one product trim, in response to the review on dandi#2822: 1. New `dandiapi/api/services/search/operators.py` (pure Python, no Django) holds every operator-vocabulary constant: DATE_OPS, ASSET_OPS, OWNER_OPS, AFFILIATION_OPS, CONTRIBUTOR_ROLE_OPS, FILE_TYPE_ALIASES, ASSET_NAME_PATH_OPS, AFFILIATION_JSONPATH. OPERATOR_KEYS is now the union of those tables — single source of truth, no more duplication between parser.py (allowlist) and filters.py (dispatch). Adding a new operator is one entry; the parser automatically knows about it. 2. Trim the role-restricting shortcuts from 25 to 9. After review discussion: most RoleType values aren't operators users actually reach for (`conceptualization:`, `methodology:`, `validation:`, `visualization:`, etc.). Kept the ones that map to common search intents: contributor (catch-all), author, contact_person, data_collector, data_curator, data_manager, maintainer, project_lead, funder, sponsor The catch-all `contributor:` still matches anyone in any role; only the role-restricting shortcuts are pruned. `project_lead:` is intentionally shorter than the schema name `ProjectLeader`. 3. Shrank the verbose docstrings on private filter helpers (the rationale stays in commit messages, not as documentation rot on internal API). 4. Added test_contributor_role_ops_match_actual_dandischema_roletype as a drift guard: every non-catch-all CONTRIBUTOR_ROLE_OPS value must be a real RoleType.name. Renames or removals on the schema side trip the test, forcing an explicit decision instead of silently changing public search syntax. OpenAPI help text and the search popover updated to reflect the trimmed list (`project_lead`, `data_collector`, `data_manager`, `sponsor` now shown; the misleading "many more" tail removed).

@example

- Variable renames: ds_baker_curator → ds_doe_curator, ds_baker_author_only → ds_doe_author_only (the test data was already Doe; only the variable names still carried the old name). - One stale query string `AUTHOR:baker` updated to `AUTHOR:doe`. - One fixture email field `'jane.doe.com'` (broken: no @) restored to `'jane.doe@example.com'` — leftover from the earlier perl rename that stripped @example out.

Per dandi#2822 review discussion: the old semantics required all asset operators to be satisfied by a SINGLE asset, which meant `species:mouse species:rat` only matched dandisets with a multi-species recording (rare). The natural user reading is "the dandiset has mouse data AND has rat data" — those can be on different assets, and that's the common case for comparative-species dandisets. Implementation: each asset operator now builds an independent AssetSearch subquery and the dandiset queryset is filtered with `id__in=...` per operator. Django generates one subquery per operator and AND's them at the dandiset level. Cross-key likewise: `species:mouse approach:electrophysiological` now matches any dandiset that has SOME mouse asset AND SOME ephys asset, not just dandisets with a mouse-ephys asset. Tests updated: - `test_advanced_search_repeated_same_key_operator_combines_with_and` is now `..._combines_at_dandiset_level`, with a new fixture that has two separate assets (one mouse, one rat) to actually exercise the cross-asset case the old semantic excluded. - `test_advanced_search_repeated_asset_operators_intersect` is now `test_advanced_search_asset_operators_combine_at_dandiset_level`, with a similar two-assets-split fixture that demonstrates the new inclusive behavior. Contributor / affiliation semantics unchanged — those still AND on the same Version's metadata (since contributors live per-version, not per-asset). Within that single version, predicates can match different contributor[] entries.

Postgres jsonpath quirk: `like_regex` requires its pattern to be a STRING LITERAL inside the jsonpath text — not a `$variable`. The contributor + affiliation builders I wrote tried to use the `vars` argument of `jsonb_path_exists` for the regex pattern, which Postgres rejects with `syntax error at or near "$val" of jsonpath input`. (The asset operators avoid this by concatenating `to_jsonb(?::text)::text` into the jsonpath at SQL execution time — the regex pattern ends up as a properly-quoted JSON string literal in the path. The user value is still bound as a parameter, never inlined into the SQL.) Refactor: applied the same SQL-time concatenation trick to the contributor + affiliation builders. Three new helpers — `_contributor_where`, `_affiliation_where`, and a shared `_LIKE_REGEX_PATTERN` constant — replace the old `_contributor_role_jsonpath` + `_build_jsonpath_where` pair that relied on the broken `vars` mechanism. Removed the unused `AFFILIATION_JSONPATH` constant from operators.py and dropped the `json` import from filters.py since we no longer marshal `vars` objects. Net behavior unchanged; the failing CI tests should pass now.

CI surfaced an assertion that AUTHOR:doe should match the same set as author:doe. The old _TOKEN_RE / _BARE_OP_RE only accepted lowercase operator keys, so uppercase tokens fell through to free text and returned 0 results. Accept either case in the regex and lowercase the captured key before validation/dispatch. Matches user expectations (GitHub's search operators are case-insensitive on the key side too).

Co-authored-by: Isaac To <candleindark@users.noreply.github.com>

@candleindark

Per @candleindark's review: a contributor can be an Organization as well as a Person, and the affiliation jsonpath (which traverses `contributor[*].affiliation[*]`) should walk past Organizations (which have no `affiliation` field of their own) without exploding. Added Organization contributors to both `ds_stanford` and `ds_ucl`: NIH as a Funder on ds_stanford and Wellcome Trust as a Funder on ds_ucl. The new assertions confirm: - `affiliation:Stanford` (and the other affiliation queries) keep working with mixed Person/Organization contributors. - The Organization's own `identifier` is NOT matched by `affiliation:` (it's not an affiliation; the test pins this). - Cross-key with `funder:NIH affiliation:Stanford` works — different contributor elements on the same Version. Also: used `National Institutes of Health (NIH)` for the org name so the `funder:NIH` substring test actually matches (the abbreviation isn't part of the spelled-out form alone). Realistic — DANDI contributors often use this parenthetical form.

`num_subjects:N` matches dandisets whose `assetsSummary.numberOfSubjects` is at least N. The threshold reads as "studies with ≥N subjects", which is the search intent users actually have (an upper bound is rarely useful and isn't worth the syntax cost). A `num_sessions:` operator was considered but is intentionally NOT included: dandischema's `AssetsSummary` does not aggregate sessions, and assets carry no `sessionId` field from which to derive a count. Adding it would require upstream schema work. Implementation: - `COUNT_OPS` dispatch table in `operators.py` maps op → jsonpath into `Version.metadata`; adding any of `num_files`, `num_bytes`, `num_samples`, `num_cells` is a one-line entry. - `_apply_count_filter` builds a `jsonb_path_exists(metadata, ...)` predicate. The integer is bound via the jsonpath `vars` parameter (not inlined into SQL or jsonpath text), so the value is injection-safe. Versions whose metadata lacks the count field never match — the jsonpath `?` filter drops missing/null/non-numeric values naturally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

bendichter · 2026-06-01T15:21:30Z

Some considerations here:

GitHub uses comparison operators, e.g. "comments:>10" example

Also in GitHub, you can express a range with "comments:10..100"

I opted not to use that here, and just have all queries be ≥. So for example: subjects:2 means that a dandiset as 2 or more subjects. The justification is I think a query that uses < would be pretty uncommon (why would they want a dandiset that has fewer than 50 subjects?). It could be useful for analytics of the available data, but I don't think it would be used often for finding a dataset. I could be wrong and I could be convinced to change this. It's just a more complex parser.

bendichter and others added 19 commits May 13, 2026 15:52

Apply ruff format to test_dandiset.py

74fcc5b

Use project_leader (full schema name) as the operator name

463c366

Apply ruff format

e4d03d3

Update dandiapi/api/services/search/filters.py

a994d90

Co-authored-by: Isaac To <candleindark@users.noreply.github.com>

ruff format

9b8b9ad

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add num_subjects: advanced-search operator#2827

Add num_subjects: advanced-search operator#2827
bendichter wants to merge 19 commits into
dandi:masterfrom
bendichter:advanced-search-counts

bendichter commented May 13, 2026

Uh oh!

bendichter commented Jun 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bendichter commented May 13, 2026

Why ≥N (no upper bound)

Why no num_sessions:

Implementation

Files

Test plan

Uh oh!

bendichter commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Why no `num_sessions:`

bendichter commented Jun 1, 2026 •

edited

Loading