diff --git a/CHANGES.txt b/CHANGES.txt
index 3fd711eb..7d8cecff 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -226,3 +226,5 @@ v<3.4.0>, <05/10/2026> -- ADEngine and MCP audit-cycle fixes from a UCI Ionosphe
 v<3.5.0>, <05/11/2026> -- Sustainable model persistence. New `pyod.utils.persistence` module with three additive helpers (`save`, `load`, `compat_load`); no breaking change to existing `joblib.dump` / `joblib.load` workflows. `save(clf, path, metadata=None)` writes a versioned envelope (`_pyod_persistence_version`, `pyod_version`, `sklearn_version`, `numpy_version`, `scipy_version`, `joblib_version`, `python_version`, `saved_at`, `model_class`, optional user metadata, model). `load(path, strict=False, return_metadata=False)` reads the envelope, compares the recorded dependency versions against the running environment, and emits a `UserWarning` on drift in any of sklearn, joblib, numpy, or scipy. Python-version drift is severity `info` and is diagnostic only: non-strict `load` does not warn and `strict=True` does not raise on `python_version`-only drift on the normal envelope path; after a compat repair, strict mode still refuses to return the repaired model, but the error follows the no-drift compat-repair branch and does not name `python_version`. `strict=True` escalates every `warn`-severity drift to `ValueError`, rejects raw legacy artifacts that have no envelope, and refuses to return a model that required a compatibility repair. `return_metadata=True` returns `(model, envelope_without_model_field)`. `compat_load(path, mmap_mode=None)` mirrors `joblib.load` with the BUILD-opcode dispatch entry patched on a subclass of `joblib.numpy_pickle.NumpyUnpickler`; when sklearn's `Tree.__setstate__` would raise `ValueError: node array from the pickle has an incompatible dtype`, the saved Tree-node state is realigned to the running sklearn dtype first. 
Realignment is allowlist-driven: `_TREE_NODE_FIELD_DEFAULTS` (currently `{"missing_go_to_left": 0}`, the pre-1.3 sklearn default) zero-fills documented missing fields, `_TREE_NODE_FIELD_RENAMES` (empty in v3.5.0) maps known renames with rename targets resolved BEFORE the missing-field-default check so a future rename does not also need a default entry, and any other dtype difference (unknown new field, kind change, signedness change, itemsize change, shape change) raises `ValueError` with a re-fit recommendation. Same-name byte-order-only differences realign safely. Current dtype is discovered dynamically from `sklearn.tree._tree.NODE_DTYPE` (no hardcoded layout). A single `UserWarning` recommending re-fit fires when at least one Tree was realigned; non-tree artifacts (ECOD, COPOD, HBOS, LOF, ...) pass through silently. `load()` falls through to `compat_load()` automatically when `joblib.load` raises the documented dtype prefix; the original exception is preserved via `raise ... from`, and a non-prefix `ValueError` from `joblib.load` propagates without invoking `compat_load`. Dependency floor: `requirements.txt` and `docs/requirements.txt` now pin `joblib>=1.5` because `compat_load` reuses `joblib.numpy_pickle._validate_fileobject_and_memmap` and the joblib 1.5 `NumpyUnpickler(filename, file_handle, ensure_native_byte_order, mmap_mode=...)` constructor; older joblib lacks both, and the import is guarded with a clear `ImportError` recommending an upgrade. Closes issue #519. 
Tests: 31 new in `test_persistence.py` covering Tree-dtype realignment (synthetic aged pickles produced by an `_OldDtypeTree` pickle-time shim), the committed binary fixture under `pyod/test/fixtures/iforest_sklearn_1_2_x.joblib` (a real sklearn 1.2.2 IsolationForest, regenerable via `regen_iforest_sklearn_1_2.py`), envelope round-trip, version-drift warnings including the `info`-only `python_version` silent case, strict-mode rejection paths, schema-version validation including a future-version reject, the strict-after-compat no-drift case, exception chaining, a synthetic rename test that proves `_TREE_NODE_FIELD_RENAMES` works without a paired `_TREE_NODE_FIELD_DEFAULTS` entry, and a monkey-patched `joblib.load` test that pins the exact-prefix fall-through gate (non-prefix `ValueError` propagates unchanged; prefix `ValueError` invokes `compat_load` exactly once). CI: new `persistence-nightly` job in `testing-cron.yml` installs pre-release `sklearn` / `numpy` / `scipy` / `joblib` (scientific-python nightly index) and runs only `test_persistence.py`; failure surfaces upstream dtype evolution before downstream users hit it and is not a release blocker. Docs: `docs/model_persistence.rst` rewritten with quick-start, trust-boundary, why-versioning, legacy-load decision tree, cross-sklearn-version compatibility section, troubleshooting table keyed on error text, strict-mode notes, and envelope-metadata-reading guidance. `docs/pyod.utils.rst` cross-references the new module. `examples/save_load_model_example.py` now leads with `persistence.save` / `persistence.load` and notes raw `joblib` as a secondary alternative. Deferred: a true header-only `inspect_artifact(path)` and `pyod inspect ` CLI require a `.pyod` zip container layout (metadata sidecar separate from the model payload) and remain Phase 3 work; deep-learning state-dict persistence stays scoped to its own future design. No breaking API changes. 
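The drift policy described for `load` in the v3.5.0 entry can be sketched with the standard library alone. This is a hedged illustration under stated assumptions, not PyOD's implementation: `check_drift` and the `SEVERITY` table are hypothetical names, and only the envelope keys and the warn/info split are taken from the entry above.

```python
import warnings

# Hypothetical severity table for this sketch. Per the changelog entry, drift in
# sklearn, joblib, numpy, or scipy warns; python_version drift is info-only.
SEVERITY = {
    "sklearn_version": "warn",
    "joblib_version": "warn",
    "numpy_version": "warn",
    "scipy_version": "warn",
    "python_version": "info",
}


def check_drift(envelope, running, strict=False):
    """Return the envelope keys whose recorded version differs from `running`.

    Warn-severity drift emits a UserWarning, or raises ValueError when
    strict=True. Info-severity drift is reported in the return value only.
    """
    drifted = [
        key for key in SEVERITY
        if key in envelope and envelope[key] != running.get(key)
    ]
    warn_keys = [key for key in drifted if SEVERITY[key] == "warn"]
    if warn_keys:
        message = "dependency drift since save: " + ", ".join(
            f"{key} {envelope[key]} -> {running.get(key)}" for key in warn_keys
        )
        if strict:
            raise ValueError(message)
        warnings.warn(message, UserWarning)
    return drifted
```

In this sketch a `python_version`-only difference never warns or raises, even with `strict=True`, which mirrors the "diagnostic only" wording of the entry.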
v<3.5.1>, <05/13/2026> -- External-contributor PR review pass (jbbqqf + tuanaiseo bundles) plus NSF funding acknowledgment. Bug fixes: LUNAR no longer shares its `MinMaxScaler` across instances because the constructor default was a mutable shared object; `LUNAR.__init__` now defaults `scaler=None`, `_resolve_scaler()` materializes a fresh `MinMaxScaler` per fit (or deep-copies a user-supplied instance, or disables scaling on `scaler=False`), and the fitted scaler lives on `self.scaler_` so `sklearn.base.clone()` round-trips (closes #502). DIF stops double-normalizing during fit: the inner `self.decision_function(X)` call that set `decision_scores_` was receiving an already-min-max-scaled `X`, and `decision_function` re-scales internally; the fix preserves the raw `X` and passes it to `decision_function`, so `decision_scores_` now matches `decision_function(X_train)` (closes #546). SOS perplexity inner loop replaces `np.sum(...)` with the ndarray `.sum()` method (closes #635); numerical equivalence test asserts bit-exact match. SUOD defers the optional `suod` import to the `SUOD()` constructor with an actionable `ImportError` instead of the old print-then-crash pattern at module top (closes #640). LOF docstring corrects the `novelty` default from `False` to `True` (matches the actual `__init__` default, which is required for PyOD's fit-then-predict contract because scikit-learn's `LocalOutlierFactor` only exposes `predict`/`decision_function` on unseen data in novelty mode); a regression test pins both the `inspect.signature` default and the docstring substring (closes #638). 
GAAL torch-optional handling: `pyod/models/gaal_base.py` (closes #660 via tuanaiseo), then a follow-up extends the same guarded-import + actionable `ImportError` pattern to `pyod/models/mo_gaal.py`, `pyod/models/so_gaal.py`, and `pyod/models/so_gaal_new.py` so user-visible imports `from pyod.models.mo_gaal import MO_GAAL` and `from pyod.models.so_gaal import SO_GAAL` no longer print-then-crash when torch is absent; all four GAAL files now raise the unified message pointing at `pip install pyod[torch]` or `pip install torch`. `pyod/models/__init__.py` adds an inline comment explaining why detector imports are deliberately omitted at the package level (several detectors require optional extras). Funding: README.rst gains an Acknowledgments section and docs/about.rst gains a Funding section, both citing NSF Award No. 2346158, "NSF POSE: Phase II: OpenAD: An Integrated Open-Source Ecosystem for Anomaly Detection," using the NSF PAPPG recipient-obligation form with the standard disclaimer; lead and sub-awardee organizations are listed separately from PI/co-PI names to avoid stale per-person affiliation claims. Tests: 6 new across `test_lof.py` (1), `test_dif.py` (1), `test_sos.py` (1), `test_lunar.py` (2), and `test_suod.py` (1). No breaking API changes.
 v<3.5.2>, <05/18/2026> -- Reproducibility and kwargs-forwarding bug fixes surfaced by the PyOD 3 paper (KDD 2027 ADS Cycle 1) §5 evidence work. Bug fixes: (1) Closes #685 (`ABOD`/`KNN`/`LUNAR`/`SOD` accepted arbitrary `**kwargs` and forwarded them unfiltered to `sklearn.neighbors.NearestNeighbors`, which crashed at fit time -- or, for KNN, at `__init__` time -- on any kwarg outside `NearestNeighbors`'s signature, including the sklearn-convention `random_state`, a `verbose` flag, or a typo like `n_neighbours`). The four detectors were introduced in commit b8f6c81 (fix for #654) with the over-forwarded `**kwargs`. 
The fix removes `**kwargs` from each `__init__` and stops forwarding `**self.kwargs` / `**kwargs` to `NearestNeighbors`; the six named forwarding parameters added in b8f6c81 (`algorithm`, `leaf_size`, `metric`, `p`, `metric_params`, `n_jobs`) still cover the use case #654 asked for. Unknown kwargs on ABOD / KNN / SOD now raise a clean `TypeError: .__init__() got an unexpected keyword argument '...'` at construction time that points at the user's call site (the sklearn stack frame from the late-fit crash is gone); regression tests assert that the error message names the detector class and does NOT contain `NearestNeighbors` so a future regression that re-introduces the old shape is caught. LUNAR is the one #685 detector that is actually stochastic (it calls `train_test_split`, uses `np.random` in `generate_negative_samples`, and initializes plus trains a torch network), so it does not reject `random_state`; instead, `LUNAR.__init__` now declares an explicit `random_state=None` parameter that accepts either `int` or `numpy.random.RandomState` (sklearn convention; both forms go through `sklearn.utils.check_random_state`) and threads through (a) `torch.manual_seed` (and `torch.cuda.manual_seed_all` when CUDA is available) before the network is built, deriving a single int seed by drawing once from `check_random_state(random_state)`, (b) the numpy `RandomState` returned by the same `check_random_state` used as the `random_state` argument to `sklearn.model_selection.train_test_split`, and (c) the same `random_state` argument added to `generate_negative_samples`. After the fix, two `LUNAR(random_state=42)` instances fit on the same X produce identical `labels_` and `decision_scores_` (within 1e-6). Soft API removal: the accidental arbitrary-`**kwargs` surface added in b8f6c81 is gone. Code that relied on it -- for example `ABOD(some_unknown_kwarg=value)` -- now fails fast at the constructor call instead of at the `NearestNeighbors` constructor inside fit. 
The six named forwarding parameters still work; this is the only meaningful behavior change. (2) Closes #686 (`ADEngine.investigate` was non-deterministic on byte-identical input because no public API pinned `random_state`). The fix adds `random_state: int | None = None` to `ADEngine.__init__`; the engine stores the seed and passes it through `ADEngine.build_detector` -> `build_detector_from_plan(plan, kb, random_state=...)`. The factory then injects `random_state` into `plan['params']` only for detector classes whose `__init__` declares an explicit `random_state` parameter (verified via `inspect.signature`); detectors that do not declare it -- ABOD, KNN, SOD, and other deterministic classes -- are instantiated unchanged, so the v3.5.1 call shape for those classes is preserved bit-for-bit. A caller-supplied `plan['params']['random_state']` wins over the engine default to preserve explicit caller intent. The factory does `dict(plan.get('params', {}))` before injecting so the caller's plan is not mutated. `build_from_preset(...)` was likewise updated to forward the engine seed: `EmbeddingOD` presets `for_text` / `for_image` (called via `build_detector_from_plan` when `plan.get('preset')` is set) now receive `random_state` as a kwarg, `EmbeddingOD.__init__` accepts and stores it, and `resolve_detector(detector, contamination, random_state=...)` injects the seed into the inner shortcut detector (`'LUNAR'`, `'KNN'`, ...) when that detector class declares `random_state`. `EmbeddingOD._preprocess_fit` also passes `random_state=self.random_state` to the optional ``PCA(n_components=self.reduce_dim, ...)`` dimensionality-reducer so a preset plan with `reduce_dim` set is fully deterministic (PCA can otherwise pick a randomized SVD solver under `svd_solver='auto'` on high-dimensional embeddings, which would have left a stochastic preprocessing step before the seeded detector). 
The external encoder's own inference (sentence-transformers, DINOv2) is treated as deterministic given fixed weights and is NOT seeded by `EmbeddingOD.random_state`; the docstring documents this boundary. With this, `ADEngine(random_state=42).build_detector({'detector_name': 'EmbeddingOD', 'preset': 'for_text', 'params': {'quality': 'balanced'}})` now produces an `EmbeddingOD(detector='LUNAR', random_state=42)` and the inner `LUNAR` is seeded -- closing the round-2-flagged gap where `EmbeddingOD.for_text()` defaults to LUNAR and silently dropped the engine seed. With `ADEngine(random_state=42)`, repeated `investigate(X)` calls on the same X now produce byte-identical `state.consensus['labels']` and identical `state.analysis['consensus_analysis']['anomaly_ratio']`, and the engine seed propagates end-to-end through `detect()`, `investigate()` -> `run()`, post-recovery reruns, and the `EmbeddingOD` text / image preset path because every path instantiates through `self.build_detector()`. The previously-broken LUNAR direct-plan case is also covered: `ADEngine(random_state=42).run_detection(X, {'detector_name': 'LUNAR', 'params': {...}})` is now bit-stable across reruns. Backward compatibility: `ADEngine()` without a seed retains v3.5.1 behavior (no determinism guarantee). (3) Closes #469 (LODA results are not reproducible because `LODA.__init__` did not accept `random_state` and the inner `np.random.randn` + `np.random.permutation` calls fell back to numpy's module-level state). The fix adds `random_state: int | None = None` to `LODA.__init__`, threads it through `sklearn.utils.check_random_state`, and replaces the two `np.random.*` call sites with `rng.randn(...)` and `rng.permutation(...)` so two `LODA(random_state=42)` fits on the same X produce bit-identical `decision_scores_`. 
Because LODA now declares `random_state` in its signature, `ADEngine(random_state=42)` propagates the engine seed to LODA plans through the same `_accepts_random_state` factory path used for IForest / LUNAR. Tests: 31 new across `test_ad_engine.py::TestRandomStateDeterminism` (4 -- determinism + cross-seed + default + LUNAR-plan determinism), `test_ad_engine.py::TestRandomStateFactory` (11 -- IForest seed injection, plan-level override wins, KNN/ABOD/SOD not given a seed, plan dict not mutated, no-seed default unchanged, plus 3 preset-path tests for `EmbeddingOD.for_text` seed propagation, plan-level wins, and no-seed default, plus a monkeypatch test asserting `EmbeddingOD._preprocess_fit` constructs `PCA(random_state=...)` with the engine seed), `test_abod.py::TestABODKwargsRejection` (3 -- tightened to assert `ABOD` in the error message and `NearestNeighbors` not in it), `test_knn.py::TestKNNKwargsRejection` (3 -- same tightening for `KNN`), `test_lunar.py::TestLUNARKwargsAndRandomState` (4 -- unknown kwarg rejection with tightened message check + default construction + same-seed determinism + `RandomState` object input accepted), `test_sod.py::TestSODKwargsRejection` (3 -- same tightening for `SOD`), `test_loda.py::TestLODARandomState` (3 -- same-seed determinism + cross-seed differ + no-seed unchanged). Related progress on #599 (sklearn-style `random_state` across pyod): `ADEngine.__init__`, `LUNAR.__init__`, `LODA.__init__`, and `EmbeddingOD.__init__` now accept `random_state`; ABOD / KNN / SOD reject unknown kwargs cleanly at construction. Other detectors with internal stochasticity (e.g., deep-learning models that depend on torch state, `IForest` which already had `random_state`) are not in scope for v3.5.2 and remain follow-up work tracked under #599.
+v<3.5.3>, <05/19/2026> -- KB-tools API for agent-driven and LLM-API-driven routing. 
Surface 1 (agent tools): `ADEngine.get_kb_for_routing(profile, top_k=3, constraints=None)` returns a structured KB snapshot (every shipped detector with strengths, weaknesses, best_for, avoid_when, complexity, benchmark_rank, modality_match) filtered by `constraints.exclude_detectors` and `constraints.data_type_strict` (default True), sorted by benchmark rank for the profile modality. `ADEngine.make_plan(detector_choices, justifications=None, params=None)` validates the caller-chosen ordered detector list against the KB (case-sensitive; unknown / non-shipped names raise `ValueError`), overlays per-detector params with engine contamination resolution, and returns a closed-schema `DetectionPlan` consumable by `build_detector` / `run`. The pair lets agent runtimes (Claude Code, Codex CLI, MCP tool clients) reason over the KB directly and commit a routing decision without going through hand-coded rules. Surface 2 (programmatic API): `ADEngine.plan_detection(profile, llm_client=callable, top_k=3)` accepts a user-supplied `(prompt: str) -> str` callable wrapping any LLM SDK (Anthropic, OpenAI, vLLM, self-hosted). When `llm_client` is set, the engine builds the routing prompt internally via `pyod.utils._llm.build_routing_prompt`, invokes the callable, parses the response via `pyod.utils._llm.parse_routing_response`, and returns the same `DetectionPlan` shape. On LLM call failure or response parse failure, falls back to rule-driven routing with a `RuntimeWarning`; set `PYOD3_LLM_STRICT=1` to re-raise instead. `LLMCallable` is a Protocol -- PyOD ships no provider-specific adapter classes; users wrap their own SDK. The parser tolerates surrounding prose and markdown fences, skips unknown detector names with a logged warning, dedupes, and truncates to `top_k`; raises `RoutingParseError` if no JSON array is extractable or no valid detector survives KB validation. 
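The parser behavior listed above (tolerate surrounding prose, skip unknown names with a logged warning, dedupe, truncate to `top_k`) can be sketched as follows. This is an illustration only, not the code in `pyod.utils._llm`: the function name `parse_routing_sketch`, the `"detector"` key for object entries, and the `ValueError` base class of the stand-in `RoutingParseError` are all assumptions.

```python
import json
import logging
import re

logger = logging.getLogger(__name__)


class RoutingParseError(ValueError):
    """Stand-in for the error named above; the ValueError base is an assumption."""


def parse_routing_sketch(response, known_detectors, top_k=3):
    """Extract an ordered, deduplicated list of known detector names."""
    top_k = max(1, top_k)  # values below 1 are clamped to 1
    # Greedy match from the first "[" to the last "]" so that surrounding
    # prose and markdown fences are ignored.
    match = re.search(r"\[.*\]", response, flags=re.DOTALL)
    if match is None:
        raise RoutingParseError("no JSON array found in the response")
    try:
        entries = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        raise RoutingParseError("JSON array could not be decoded") from exc

    chosen = []
    for entry in entries:
        # Bare strings and objects are both accepted; the "detector" key
        # is a hypothetical schema for this sketch.
        if isinstance(entry, dict):
            name = entry.get("detector")
        else:
            name = entry
        if not isinstance(name, str) or name not in known_detectors:
            logger.warning("skipping unknown detector entry: %r", entry)
            continue
        if name not in chosen:
            chosen.append(name)
    if not chosen:
        raise RoutingParseError("no valid detector survived validation")
    return chosen[:top_k]
```

The membership check is case-sensitive, in line with the KB validation described for `make_plan`. The greedy bracket match is a simplification: a response with stray brackets in its prose would fail to decode and raise `RoutingParseError`.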
`top_k` generalization: `ADEngine.plan_detection(..., top_k=3)` exposes the previously hard-coded `valid[1:3]` alternatives slice as a parameter. Default 3 preserves v3.5.2 behavior; values < 1 are clamped to 1. Tests: 44 new in `test_kb_router_surface1.py` covering schema, filters, ordering, KB validation, top_k clamping, stub LLM client canned plan, top_k truncation of LLM response, malformed response fallback, `PYOD3_LLM_STRICT=1` re-raise, prose tolerance, markdown-fence tolerance, dedupe, and bare-string entries. All 205 existing ADEngine tests continue to pass. Backward compatibility: every v3.5.2 caller pattern (`plan_detection(profile)`, `plan_detection(profile, priority=...)`, `plan_detection(profile, constraints=...)`) produces identical output. The new `top_k=3` and `llm_client=None` parameters are keyword-only with backward-compatible defaults. Out of scope: `routing_rules.json` rule authoring (rules remain the offline fallback); LLM-decided `top_k` (caller decides); built-in CLI adapter classes for Codex / Claude Code (users wrap subscriptions themselves); async `llm_client`. No breaking API changes. Round 1 reviewer fixes (Codex via /implement-review auto): (a) High: `_plan_via_llm` now enforces the constrained KB context after parsing -- if the LLM returns a detector excluded by `constraints.exclude_detectors` or filtered by `data_type_strict`, the engine raises `RoutingParseError` and falls back to rule routing with a `RuntimeWarning`. Previously the LLM path validated only against the global KB and could bypass hard `exclude_detectors` constraints. (b) Medium: `get_kb_for_routing` now consults modality-specific benchmark-rank keys instead of `{modality}.title() + '_overall'` only -- `time_series` uses `TSB_AD_overall` / `TSB_AD_overall_iforest`, `graph` uses `BOND_deep` / `BOND_overall`, `text` uses `NLP_ADBench_overall`, `image` uses `MVTec_overall`, all with `ADBench_overall` as the universal fallback. 
Previously non-tabular modalities effectively sorted alphabetically because the legacy key form did not match the KB's actual rank fields. (c) Medium: new per-call kwarg `plan_detection(..., llm_strict: bool | None = None)`. Precedence: explicit `True` re-raises on LLM/parse failure; explicit `False` falls back with `RuntimeWarning`; `None` defers to `PYOD3_LLM_STRICT` env var. The env-only switch was process-global and incorrect for concurrent callers in the same process. Six additional regression tests cover the constraint bypass, modality rank-key ordering for time_series and graph, and the three-way llm_strict precedence (True/False/None). Round 2 reviewer fixes (Codex via /implement-review auto): (d) Med: `plan_detection`'s new `top_k`, `llm_client`, and `llm_strict` parameters are now actually keyword-only via a `*` separator before them in the signature, matching the release notes claim. (e) Med: `get_kb_for_routing` now stamps each returned detector entry with `resolved_rank` and `resolved_rank_key` fields carrying the modality-specific benchmark rank it used for sorting; `build_routing_prompt` reads those fields so the LLM-facing prompt now shows e.g. `rank=10 (TSB_AD_overall)` for time-series detectors instead of the empty `rank=` it previously rendered (because the prompt had hard-coded the legacy `{modality}.title() + '_overall'` key). Three additional regression tests cover (a) the keyword-only signature contract, (b) prompt rank annotation under time-series, and (c) the text-modality fallback path when the KB has no rank data.
+v<3.5.4>, <06/03/2026> -- Claims-honesty and framing-consistency remediation of the v3 agentic layer from an internal audit (no detector behavior change). 
Determinism: `ADEngine.random_state` docstring upgraded from the vague "deterministic-up-to-numpy-module-state" hedge to the audited guarantee (a run-to-run audit of the shipped shallow detectors found every one either honors the seed or is deterministic by construction; deep detectors additionally depend on framework seeding). Counts: every public surface now reports 60 buildable detectors instead of 60+/61/50+; `scripts/regen_skill.py` and `pyod/cli.py` exclude `status == "planned"` so the non-buildable `LLMAD` no longer inflates the od-expert skill's counts/lists or `pyod info` (now `60 total (43 tabular, 7 time-series, 8 graph, 2 text, 2 image, 1 multimodal)`); `LLMAD` stays in the raw KB as a roadmap entry. Expert-level: `docs/index.rst`, `od_expert/SKILL.md`, `docs/skill_maintenance.rst`, and `docs/examples/agentic.rst` reword "expert-level/expert-quality results" to a complete-workflow/accessibility claim. Trust verdict: `docs/examples/adengine.rst` demotes the quality verdict to descriptive diagnostics with a "heuristic, not a guarantee" note and corrects the stale "Jaccard" stability description to the cutoff-gap formula; `_quality_metrics.compute_quality` docstring documents `separation` as circular (computed from the run's own predicted labels, near-always high, and not independent of the majority-vote consensus labels); the od-expert skill's Trigger 4 is reframed to cutoff-instability on `stability` only, and its result-interpretation, per-modality confidence lines, and examples route confidence through low `agreement` plus label-free caveats instead of `separation`/`overall`/`verdict`. Consensus: skill guidance softened from "never report from a single detector" to "prefer consensus for robustness; about as accurate as the best single pick." 
Framing consistency: ADEngine is described as a "lifecycle orchestration" engine rather than "intelligent orchestration" across README, docs, the API reference, and the module docstring, matching the finding that the layer's value is the drivable, reproducible workflow rather than selection intelligence. Tests: 2 new count-locking regression tests (`test_cli.py::test_pyod_info_excludes_planned_detectors`, `test_skill_kb_consistency.py::test_skill_count_prose_matches_kb`) compute expected buildable counts from the KB and fail on regression. Reviewed via /implement-review (Codex, 4 rounds): R1 raised 3 High + 2 Medium + 1 Low, R2 verified 5/6 and flagged trust-gate residue, R3 cleared it, R4 confirmed commit-ready. No breaking API changes.
diff --git a/README.rst b/README.rst
index 94ff2d49..c6bc5d37 100644
--- a/README.rst
+++ b/README.rst
@@ -62,7 +62,7 @@ PyOD 3 is the most comprehensive Python library for anomaly detection. Four pill
 =========================== ========================================================================================
 Pillar                      What it means
 =========================== ========================================================================================
-Multi-Modal                 60+ detectors across **tabular, time series, graph, text, and image** data, one API
+Multi-Modal                 60 detectors across **tabular, time series, graph, text, and image** data, one API
 Full Lifecycle              From raw data to explained anomalies and next-step guidance in a single call
 Agentic                     ``od-expert`` turns natural-language requests into ADEngine workflows; MCP exposes structured tools for other agents
 Most Used                   38+ million downloads; benchmark-backed routing (ADBench, TSB-AD, BOND, NLP-ADBench)
@@ -122,7 +122,7 @@ Layer Name When to use
 3         Agentic Investigation You want an AI agent to drive OD through natural conversation          `Layer 3 walkthrough `__
 ========= ===================== ====================================================================== 
=======================================

-Layers 2 and 3 are powered by ``ADEngine``, PyOD's intelligent orchestration core. The full multi-turn Layer 3 investigation flow is available through the ``od-expert`` skill for Claude Code and Codex. The MCP server (``python -m pyod.mcp_server``) exposes ten stateless tools for MCP-compatible LLMs, spanning knowledge queries (``list_detectors``, ``explain_detector``, ``compare_detectors``, ``get_benchmarks``), planning (``profile_data``, ``plan_detection``, ``build_detector``), and detection (``run_detection``, ``analyze_results``, ``explain_findings``); stateful ``investigate`` / ``iterate`` MCP tools are deferred.
+Layers 2 and 3 are powered by ``ADEngine``, PyOD's lifecycle orchestration core. The full multi-turn Layer 3 investigation flow is available through the ``od-expert`` skill for Claude Code and Codex. The MCP server (``python -m pyod.mcp_server``) exposes ten stateless tools for MCP-compatible LLMs, spanning knowledge queries (``list_detectors``, ``explain_detector``, ``compare_detectors``, ``get_benchmarks``), planning (``profile_data``, ``plan_detection``, ``build_detector``), and detection (``run_detection``, ``analyze_results``, ``explain_findings``); stateful ``investigate`` / ``iterate`` MCP tools are deferred.

 .. image:: https://raw.githubusercontent.com/yzhao062/pyod/development/docs/figs/agentic-demo.png
    :alt: PyOD 3 agentic investigation demo on cardiotocography dataset
@@ -142,7 +142,7 @@ About PyOD

 PyOD, established in 2017, is the longest-running and most widely used Python library for anomaly detection. With `38+ million downloads `__, it serves both academic research (featured in `Analytics Vidhya `__, `KDnuggets `__, and `Towards Data Science `__) and commercial products.

-V3 extends the library with ``ADEngine`` (intelligent orchestration) and the ``od-expert`` skill (agentic workflow), while keeping the classic ``fit``/``predict`` API fully backward-compatible. 
V3 is built on SUOD [#Zhao2021SUOD]_ for fast parallel training and numba JIT for per-model speedups.
+V3 extends the library with ``ADEngine`` (lifecycle orchestration) and the ``od-expert`` skill (agentic workflow), while keeping the classic ``fit``/``predict`` API fully backward-compatible. V3 is built on SUOD [#Zhao2021SUOD]_ for fast parallel training and numba JIT for per-model speedups.

 **Impact & Recognition**:

@@ -253,7 +253,7 @@ Additional Topics
 Implemented Algorithms
 ^^^^^^^^^^^^^^^^^^^^^^

-PyOD is organized into two functional groups: **(i) Detection Algorithms**, with dedicated subsections for tabular, time series, and graph data (EmbeddingOD inside the tabular table adds multi-modal support for text and image via foundation model encoders); and **(ii) Utility Functions** for data generation, evaluation, and intelligent orchestration.
+PyOD is organized into two functional groups: **(i) Detection Algorithms**, with dedicated subsections for tabular, time series, and graph data (EmbeddingOD inside the tabular table adds multi-modal support for text and image via foundation model encoders); and **(ii) Utility Functions** for data generation, evaluation, and lifecycle orchestration.

 **(i-a) Tabular & Multi-Modal Detection Algorithms** :

diff --git a/docs/examples/adengine.rst b/docs/examples/adengine.rst
index 3c51f42c..297c0674 100644
--- a/docs/examples/adengine.rst
+++ b/docs/examples/adengine.rst
@@ -1,7 +1,7 @@
-Layer 2: ADEngine Intelligent Orchestration
+Layer 2: ADEngine Lifecycle Orchestration
 ============================================

-ADEngine is PyOD's intelligent anomaly detection engine. It profiles your data, selects benchmark-backed detectors from PyOD's 60+ catalog, runs multiple detectors in parallel, computes consensus scores, and assesses result quality, all in one call.
+ADEngine is PyOD's anomaly detection lifecycle engine. 
It profiles your data, selects benchmark-backed detectors from PyOD's 60-detector catalog, runs multiple detectors in parallel, computes consensus scores, and reports descriptive diagnostics, all in one call.

 Use Layer 2 when you are not sure which detector to pick.

@@ -61,11 +61,13 @@ ADEngine runs the top-3 detectors from PyOD's knowledge base and computes a cons
 Quality Assessment
 ------------------

-ADEngine quantifies how trustworthy the results are through three metrics:
+ADEngine reports three descriptive diagnostics of a run. They summarize the
+score distribution and cross-detector behavior. They are not a label-free
+guarantee that the results are correct (see the note below):

-* **Separation** -- ratio of anomaly scores to inlier scores ([0, 1])
-* **Agreement** -- mean pairwise Spearman correlation between detectors ([0, 1])
-* **Stability** -- Jaccard index of top-k sets under +/- 20% contamination ([0, 1])
+* **Separation** -- relative mean score gap between the run's flagged set and the rest ([0, 1]). It is computed from the run's own predicted labels, so it is descriptive only; it does not show that the cutoff or the vote is correct.
+* **Agreement** -- mean pairwise Spearman correlation between detectors ([0, 1]). The most useful of the three: low agreement flags inputs with no shared structure (near-noise), where the detectors rank points differently.
+* **Stability** -- standardized score gap at the rank-k cutoff ([0, 1]). Low values mean many tied scores near the threshold, so the flagged set is sensitive to the contamination value.

 .. code-block:: python

@@ -78,6 +80,15 @@ ADEngine quantifies how trustworthy the results are through three metrics:

 Verdicts are ``'high'`` (>=0.7), ``'medium'`` (>=0.4), or ``'low'`` (<0.4).

+.. note::
+
+   The verdict is a heuristic summary of the score distribution and
+   cross-detector behavior, not a guarantee that the results are correct. 
Use
+   it as a rough signal, not as a basis for trusting results without labels:
+   low ``agreement`` is the most reliable component and flags near-noise
+   inputs, while ``separation`` is descriptive only. To judge correctness,
+   validate against held-out labels or a domain review.
+
 ----

 Session API (Step by Step)
diff --git a/docs/examples/agentic.rst b/docs/examples/agentic.rst
index 4d892438..5511ba88 100644
--- a/docs/examples/agentic.rst
+++ b/docs/examples/agentic.rst
@@ -1,7 +1,7 @@
 Layer 3: Agentic Investigation
 ===============================

-PyOD 3's ``od-expert`` skill lets any AI agent drive a full anomaly detection investigation through natural conversation. The agent handles benchmark-backed detector selection, multi-detector consensus, quality assessment, adaptive escalation, and iteration on user feedback, all without requiring the user to be an OD expert.
+PyOD 3's ``od-expert`` skill lets any AI agent drive a full anomaly detection investigation through natural conversation. The agent handles benchmark-backed detector selection, multi-detector consensus, quality diagnostics, adaptive escalation, and iteration on user feedback, all without requiring the user to be an OD expert.

 .. figure:: ../figs/agentic-demo.png
    :alt: PyOD 3 agentic investigation demo on a diabetes screening dataset
@@ -43,9 +43,9 @@ When a user asks about anomalies in their data, PyOD's ``od-expert`` skill auto-
 1. **Walks the master decision tree** -- timestamps, graph structure, text/image, or tabular? Load the matching ``references/.md``.
 2. **Walks the top-10 pitfall checklist** -- is any pitfall active for this data? Example: feature scale ratio > 100 triggers Pitfall 1 (unscaled features for distance-based detectors) and the agent recommends a pre-scaling step or flags it in the report.
 3. **Walks the 11 escalation triggers** -- does anything about the request call for a pause? 
Example: "medical screening" fires Trigger 8 (high-stakes domain) and the agent commits to dual-detector validation and a confidence caveat. -4. **Selects detectors** -- calls ``engine.plan(state)`` to pick the top-3 from PyOD's 61-detector catalog based on benchmark evidence (ADBench, TSB-AD, BOND). Each plan entry in ``state.plans`` has ``detector_name``, ``confidence``, ``reason``, ``evidence``. +4. **Selects detectors** -- calls ``engine.plan(state)`` to pick the top-3 from PyOD's 60-detector catalog based on benchmark evidence (ADBench, TSB-AD, BOND). The benchmark ranks seed the plan; the agent may override them from its own judgment or the user's constraints. Each plan entry in ``state.plans`` has ``detector_name``, ``confidence``, ``reason``, ``evidence``. 5. **Runs in parallel** -- executes all selected detectors and builds a rank-normalized consensus in ``state.consensus``. -6. **Re-walks a subset of triggers post-run** -- detector disagreement (T3), weak quality (T4), suspiciously clean results (T10). If any fire, the agent hedges the report or iterates. +6. **Re-walks a subset of triggers post-run** -- detector disagreement (T3), cutoff instability (T4), suspiciously clean results (T10). If any fire, the agent hedges the report or iterates. 7. **Generates a report** -- Markdown or JSON, always including a "what I assumed and why" block that lists the contamination rate, the detectors used, the best detector, and any caveats the trigger/pitfall walk surfaced. The agent's decisions at each of these steps are visible in the interactive demo's dark "od-expert" panels. @@ -122,9 +122,10 @@ Why this dataset? It exercises the skill's machinery: the feature scale ratio is low-dim small datasets. Scale mismatch noted for the final report. - Results: 62 flagged (8.1%), separation 0.96, - agreement 0.59, quality HIGH (0.79). Top case: - patient #13. KNN strongest individually. 
+   Results: 62 flagged (8.1%), agreement 0.59
+   (label-free; separation and the quality verdict
+   are descriptive only). Top case: patient #13.
+   KNN strongest individually.

 Behind the scenes:

@@ -281,4 +282,4 @@ With PyOD 3 and the v3.2.0 ``od-expert`` skill:
 6. Re-checks quality-related triggers post-run and hedges the report accordingly.
 7. Always reports the assumptions and caveats, including the scale mismatch, contamination, and any triggered escalations.

-The agent becomes an OD expert through the library, not despite it.
+The agent follows an OD expert's workflow through the library, not despite it.
diff --git a/docs/examples/index.rst b/docs/examples/index.rst
index 27c84f85..5a97e3ef 100644
--- a/docs/examples/index.rst
+++ b/docs/examples/index.rst
@@ -54,7 +54,7 @@ following ``state.next_action`` at each step. See :doc:`agentic` for the full wa
 Examples by Data Type
 ---------------------

-* :doc:`tabular`: 50+ detectors for tabular data (ECOD, IForest, KNN, LOF, ...)
+* :doc:`tabular`: 43 detectors for tabular data (ECOD, IForest, KNN, LOF, ...)
 * :doc:`timeseries`: 5 shipped + 2 experimental time series detectors (KShape, MatrixProfile, SpectralResidual, ...)
 * :doc:`graph`: 8 graph detectors (DOMINANT, CoLA, CONAD, ...)
 * :doc:`embedding`: Text and image detection via foundation model embeddings
diff --git a/docs/examples/tabular.rst b/docs/examples/tabular.rst
index 8e24b510..8ed7e770 100644
--- a/docs/examples/tabular.rst
+++ b/docs/examples/tabular.rst
@@ -1,7 +1,7 @@
 Layer 1: Tabular Anomaly Detection
 ====================================

-PyOD has 50+ tabular detectors covering probabilistic, linear, proximity, ensemble, and deep learning approaches. All use the same ``fit``/``predict``/``decision_function`` API.
+PyOD has 43 tabular detectors covering probabilistic, linear, proximity, ensemble, and deep learning approaches. All use the same ``fit``/``predict``/``decision_function`` API.

 .. code-block:: python

diff --git a/docs/index.rst b/docs/index.rst
index ddc764b7..250f4f89 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -64,16 +64,16 @@ Welcome to PyOD 3 documentation!

 .. note::

-   **New in V3.** Any AI agent can now run expert-level anomaly detection on your data. Just ask.
+   **New in V3.** Any AI agent can now run a complete anomaly detection workflow on your data. Just ask.

 PyOD 3 is the most comprehensive Python library for anomaly detection. Four pillars:

 =========================== ========================================================================================
 Pillar                      What it means
 =========================== ========================================================================================
-Multi-Modal                 60+ detectors across **tabular, time series, graph, text, and image** data, one API
+Multi-Modal                 60 detectors across **tabular, time series, graph, text, and image** data, one API
 Full Lifecycle              From raw data to explained anomalies and next-step guidance in a single call
-Agentic                     Ask in plain English, and AI agents run expert-level detection without OD expertise
+Agentic                     Ask in plain English, and AI agents run the full detection workflow without OD expertise
 Most Used                   `38+ million downloads `_; benchmark-backed routing (ADBench, TSB-AD, BOND, NLP-ADBench)
 =========================== ========================================================================================

@@ -131,7 +131,7 @@ Layer Name When to use
 3         Agentic Investigation You want an AI agent to drive OD through natural conversation          :doc:`examples/agentic`
 ========= ===================== ====================================================================== ============================

-Layers 2 and 3 are powered by :class:`~pyod.utils.ad_engine.ADEngine`, PyOD's intelligent orchestration core. Layer 3 adds the ``od-expert`` skill that auto-activates in Claude Code, Codex, and MCP-compatible agents.
+Layers 2 and 3 are powered by :class:`~pyod.utils.ad_engine.ADEngine`, PyOD's lifecycle orchestration core. Layer 3 adds the ``od-expert`` skill that auto-activates in Claude Code, Codex, and MCP-compatible agents.

 .. figure:: figs/agentic-demo.png
    :alt: PyOD 3 agentic investigation demo on cardiotocography dataset
@@ -157,7 +157,7 @@ About PyOD

 PyOD, established in 2017, is the longest-running and most widely used Python library for `anomaly detection `_. With `38+ million downloads `_, it serves both academic research and commercial products worldwide.

-V3 extends the library with :class:`~pyod.utils.ad_engine.ADEngine` (intelligent orchestration) and the ``od-expert`` skill (agentic workflow), while keeping the classic ``fit``/``predict`` API fully backward-compatible. V3 is built on SUOD :cite:`a-zhao2021suod` for fast parallel training and numba JIT for per-model speedups.
+V3 extends the library with :class:`~pyod.utils.ad_engine.ADEngine` (lifecycle orchestration) and the ``od-expert`` skill (agentic workflow), while keeping the classic ``fit``/``predict`` API fully backward-compatible. V3 is built on SUOD :cite:`a-zhao2021suod` for fast parallel training and numba JIT for per-model speedups.

 **Citing PyOD**:

@@ -208,7 +208,7 @@ Benchmarks
 Implemented Algorithms
 ======================

-PyOD is organized into two functional groups: **(i) Detection Algorithms**, with dedicated subsections for tabular, time series, and graph data (EmbeddingOD inside the tabular table adds multi-modal support for text and image via foundation model encoders); and **(ii) Utility Functions** for data generation, evaluation, and intelligent orchestration.
+PyOD is organized into two functional groups: **(i) Detection Algorithms**, with dedicated subsections for tabular, time series, and graph data (EmbeddingOD inside the tabular table adds multi-modal support for text and image via foundation model encoders); and **(ii) Utility Functions** for data generation, evaluation, and lifecycle orchestration.

 **(i-a) Tabular & Multi-Modal Detection Algorithms** :

@@ -390,7 +390,7 @@ Encoding :func:`~pyod.utils.encoders.resolve_encoder` Resolve an
 Encoding            SentenceTransformerEncoder                      Encode text via sentence-transformers models (see :doc:`pyod.utils `)
 Encoding            OpenAIEncoder                                   Encode text via OpenAI Embeddings API (see :doc:`pyod.utils `)
 Encoding            HuggingFaceEncoder                              Encode text or images via HuggingFace transformers (see :doc:`pyod.utils `)
-Intelligence        :class:`~pyod.utils.ad_engine.ADEngine`         Intelligent anomaly detection lifecycle engine: profiling, planning, execution, analysis, and reporting
+Orchestration       :class:`~pyod.utils.ad_engine.ADEngine`         Anomaly detection lifecycle engine: profiling, planning, execution, analysis, and reporting
 =================== =============================================== =====================================================================================================================================================


diff --git a/docs/install.rst b/docs/install.rst
index a63d1d35..696bb42b 100644
--- a/docs/install.rst
+++ b/docs/install.rst
@@ -114,7 +114,7 @@ Example output:
 .. code-block:: text

    PyOD version: 3.1.0
-   Detectors (ADEngine): 61 total (44 tabular, 7 time-series, 8 graph, 3 text, 2 image, 1 multimodal)
+   Detectors (ADEngine): 60 total (43 tabular, 7 time-series, 8 graph, 2 text, 2 image, 1 multimodal)
    Classic API: OK
    ADEngine (Layer 2): OK
    MCP extra: OK (run: pyod mcp serve)
diff --git a/docs/pyod.ad_engine.rst b/docs/pyod.ad_engine.rst
index 060f1aa1..768db791 100644
--- a/docs/pyod.ad_engine.rst
+++ b/docs/pyod.ad_engine.rst
@@ -1,7 +1,7 @@
 ADEngine
 ========

-:class:`pyod.utils.ad_engine.ADEngine` is PyOD's intelligent anomaly detection engine. It provides three layers of capability:
+:class:`pyod.utils.ad_engine.ADEngine` is PyOD's anomaly detection lifecycle engine. It provides three layers of capability:

 * **Knowledge queries** -- list detectors, explain detectors, get benchmarks
 * **Detection lifecycle** -- profile, plan, run, analyze, explain, iterate, report
diff --git a/docs/skill_maintenance.rst b/docs/skill_maintenance.rst
index 504dabd2..5c01aec9 100644
--- a/docs/skill_maintenance.rst
+++ b/docs/skill_maintenance.rst
@@ -6,7 +6,7 @@ PyOD ships agent skills (currently ``od-expert``) as packaged Markdown that Clau
 What makes a skill "real"
 -------------------------

-A "real" skill encodes domain expertise so a non-expert user gets expert-quality results without driving every decision. The four criteria:
+A "real" skill encodes domain expertise so a non-expert user can run a complete, auditable workflow without driving every decision. The four criteria:

 1. **Drives the agent autonomously through a complete workflow.** From data to profile to detector selection to run to analyze to iterate to report, the agent makes informed decisions on the user's behalf and only pauses when uncertain (adaptive escalation).
 2. **Encodes domain knowledge a non-expert lacks.** Decision rules, pitfalls, result interpretation patterns, and worked examples, all distilled from real literature and practice.
diff --git a/examples/agentic_demo.html b/examples/agentic_demo.html
index 7ec80c3c..150621f2 100644
--- a/examples/agentic_demo.html
+++ b/examples/agentic_demo.html
@@ -574,7 +574,7 @@

Any AI Agent Becomes an OD Expert

Post-run triggers (T3, T4, T10)
✓ T3 agreement = 0.59 > 0.4 floor.
- ✓ T4 separation = 0.96, stability = 0.81, overall = 0.79 (verdict: high).
+ ✓ T4 stability = 0.81 (cutoff stable); separation = 0.96 is descriptive only; overall quality is a diagnostic summary, not correctness evidence.
✓ T10 not over-tight.
state.next_action['action'] = 'report_to_user'.
@@ -705,7 +705,7 @@

Any AI Agent Becomes an OD Expert

Detectors: KNN, IForest, LOF (3 / 3 converged)
Best detector: KNN (highest correlation with consensus)
Final flagged (consensus): 62 patients (8.1%)
- Quality: HIGH (0.79)
+ Diagnostics: internally consistent; validate clinically
@@ -731,7 +731,7 @@

Any AI Agent Becomes an OD Expert