Analyze Study (Time to Event)

This notebook provides a comprehensive post-hoc analysis of a completed Octopus time-to-event (survival) study. It covers:

  1. Study Details -- load, validate, and summarize the study configuration and workflow structure.
  2. Development Performance -- metric scores across outer splits for all prediction tasks and result types. Use this to guide model selection and hyperparameter decisions.
  3. Selected Features -- number of selected features per task and outer split, plus feature selection frequency.
  4. Test Set Evaluation -- for a selected task: test-set metrics (per-split, mean, and merged) and feature importances (permutation and SHAP).

Prerequisite: Run the basic time-to-event workflow first to generate study data:

python examples/basic_timetoevent.py

Imports

from octopus.poststudy import OctoTestEvaluator, load_study_information
from octopus.poststudy.analysis.notebook import (
    display_feature_groups_table,
    display_performance_tables,
    display_study_overview,
)
from octopus.poststudy.analysis.plots import (
    dev_performance_plot,
    feature_count_plot,
    feature_frequency_plot,
    fi_plot,
    performance_plot,
)
from octopus.poststudy.analysis.tables import (
    fi_ensemble_table,
    get_performance,
    get_selected_features,
    workflow_graph,
)
from octopus.poststudy.study_io import find_latest_study

Input

The analysis requires the path to a saved study directory. Octopus saves studies in directories named <prefix>-YYYYMMDD_HHMMSS. The helper find_latest_study scans the studies root folder and returns the most recent directory matching the given name prefix. Set study_directory directly if you want to analyze a specific study path.

study_name_prefix = "basic_timetoevent"
study_directory = find_latest_study("../studies", study_name_prefix)
print(f"Using study: {study_directory}")
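Because the timestamp suffix sorts lexicographically, a helper like this needs only a sorted directory listing. The sketch below is a hypothetical re-implementation for illustration; the real find_latest_study in octopus.poststudy.study_io may differ in details:

```python
# Hypothetical sketch of find_latest_study -- illustrates the
# <prefix>-YYYYMMDD_HHMMSS naming convention, not the actual implementation.
from pathlib import Path


def find_latest_study_sketch(studies_root: str, prefix: str) -> Path:
    """Return the most recent study directory matching <prefix>-YYYYMMDD_HHMMSS."""
    candidates = sorted(
        p
        for p in Path(studies_root).iterdir()
        if p.is_dir() and p.name.startswith(prefix + "-")
    )
    if not candidates:
        raise FileNotFoundError(f"No study matching '{prefix}' in {studies_root}")
    # The timestamp suffix sorts lexicographically, so the last entry is newest.
    return candidates[-1]
```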

Study Details

load_study_information validates the study directory (checks outersplit and task directories) and returns a StudyInfo object used by all downstream analysis functions. display_study_overview prints a concise summary. workflow_graph renders the task dependency tree.

study_info = load_study_information(study_directory)
display_study_overview(study_info)
print(workflow_graph(study_info))
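As a rough illustration of what this validation step involves, the sketch below checks a study directory for outer-split and task subdirectories. The directory names ("outersplit_*", "task_*") are assumptions for the example, not the actual Octopus on-disk layout:

```python
# Illustrative only: a minimal validation pass in the spirit of
# load_study_information. The glob patterns are assumed, not Octopus's layout.
from pathlib import Path


def validate_study_dir_sketch(study_dir: str) -> dict:
    """Check for outer-split and task subdirectories; return a small summary."""
    root = Path(study_dir)
    splits = sorted(p.name for p in root.glob("outersplit_*") if p.is_dir())
    if not splits:
        raise ValueError(f"{study_dir} contains no outersplit directories")
    # Tasks are discovered inside the first outer split.
    tasks = sorted(p.name for p in (root / splits[0]).glob("task_*") if p.is_dir())
    return {"n_outer_splits": len(splits), "tasks": tasks}
```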

Development Performance

What it shows: Metric scores per outer split for all prediction tasks in the workflow. Feature selection tasks (mrmr, roc, boruta) are skipped automatically. If a task has multiple result types (e.g. best and ensemble_selection), each is shown separately. By default the dev partition is shown -- use this partition for model selection and hyperparameter decisions.

What to look for:

  - Consistent bar heights across splits → stable model performance
  - Large variation between splits → model sensitive to the train/test partition (common with small datasets)
  - For time-to-event, the key metric is CI (Concordance Index; higher is better, 1.0 = perfect). CI above 0.7 is generally considered good discrimination
  - Use the metric dropdown to switch between available metrics (CI, CI_UNO)

perf = get_performance(study_info)
fig = dev_performance_plot(perf)
fig.show()
display_performance_tables(perf)

Selected Features

What it shows: Number of selected features per outer split and task, plus how frequently each feature was selected across splits. Covers all tasks and result types discovered on disk.

What to look for:

  - Feature count plot: consistent count across splits → stable feature selection. A drop between workflow tasks confirms that feature selection steps are reducing dimensionality as intended.
  - Feature frequency plot: features selected in all splits are robust and likely genuinely informative. Features selected in only one or two splits may be noise or split-specific artefacts.

feature_table, frequency_table = get_selected_features(study_info)
fig = feature_count_plot(feature_table)
fig.show()
fig = feature_frequency_plot(frequency_table)
fig.show()
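Selection frequency is simply the fraction of outer splits in which a feature appears. The sketch below shows how such a table can be derived from per-split feature lists (get_selected_features does this internally; the feature names are toy data):

```python
# Sketch: derive selection frequency across outer splits from per-split
# feature lists. Toy data only -- not the octopus implementation.
from collections import Counter


def feature_frequency_sketch(selected_per_split: list) -> dict:
    """Return the fraction of outer splits in which each feature was selected."""
    counts = Counter(f for split in selected_per_split for f in set(split))
    n_splits = len(selected_per_split)
    # Most frequently selected features first.
    return {f: c / n_splits for f, c in counts.most_common()}
```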

Test Set Evaluation

The sections above use the dev partition -- these results should guide model selection, hyperparameter tuning, and feature selection decisions. Looking at test scores during these steps introduces data leakage and inflates reported performance.

The sections below evaluate a single selected task on the held-out test partition. Only look at test results after all modelling decisions have been made. If test results lead you to change the model or features, the test set is no longer independent and the reported scores lose their validity.

OctoTestEvaluator loads the fitted models and the stored train/test splits for the selected task. All subsequent calls (performance, feature importance) use this evaluator.

Test Performance

What it shows: Test-set metrics per outer split (each model scored only on its own held-out test fold), plus a Mean row (average of per-split scores) and a Merged row (all test predictions pooled across splits and scored as one set). All time-to-event metrics are computed automatically.

What to look for:

  - The Merged row is the standard "pooled out-of-fold" evaluation -- the most robust summary
  - CI (Concordance Index): the probability that, for a random pair of subjects, the model correctly ranks who experiences the event first. CI = 0.5 is random, CI > 0.7 is reasonable, CI > 0.8 is strong
  - Large spread across splits → model performance depends on the data partition

test_evaluator = OctoTestEvaluator(
    study_info=study_info, task_id=0, result_type="best"
)
test_perf = test_evaluator.performance()
performance_plot(test_perf).show()
test_perf.round(3)
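To make the pairwise-ranking definition of CI concrete, here is a toy implementation of Harrell's concordance index (illustrative only; octopus computes its metrics internally). A pair is comparable when the subject with the shorter observed time actually experienced the event; it is concordant when the model assigns that subject the higher risk score:

```python
# Toy Harrell's concordance index: fraction of comparable pairs that the
# risk scores rank correctly (ties in risk count as half-concordant).
def concordance_index_sketch(times, events, risk_scores):
    concordant = comparable = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable: subject i failed earlier and was not censored.
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable
```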

Feature Importance -- Permutation

What it shows: Permutation importance on the held-out test data. Each feature is randomly shuffled and the resulting performance drop is measured. A large drop means the feature is important. group_permutation also computes importance at the feature-group level, revealing whether correlated features contribute collectively.

Each outer-split model is evaluated independently on its own test fold, then results are aggregated into an ensemble summary.

What to look for:

  - Features sorted by importance -- top features contribute most to predictions
  - n_repeats=3 is used here for speed; use 10+ for real analyses to get stable estimates and reliable p-values

fi_table_perm = test_evaluator.calculate_fi(
    fi_type="group_permutation", n_repeats=3
)
fi_plot(fi_table_perm).show()
_ = display_feature_groups_table(test_evaluator)
fi_ensemble_table(fi_table_perm).head(10)
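The shuffle-and-rescore mechanic behind permutation importance can be shown in a few lines. This is a self-contained sketch with a toy linear "model", not the octopus implementation:

```python
# Sketch of the permutation-importance mechanic: shuffle one feature column,
# re-score, and record the score drop. Toy model, not the octopus API.
import numpy as np


def permutation_importance_sketch(score_fn, X, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    baseline = score_fn(X)
    drops = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, col])  # break the feature/target link
            drops[col] += (baseline - score_fn(X_perm)) / n_repeats
    return drops  # larger drop -> more important feature


# Toy score: negative squared error of a "model" that only uses column 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0]
score = lambda M: -np.mean((y - 2.0 * M[:, 0]) ** 2)
importances = permutation_importance_sketch(score, X)
```

Only column 0 carries signal here, so shuffling it produces a large score drop while the unused columns drop by exactly zero.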

Feature Importance -- SHAP

What it shows: SHAP (SHapley Additive exPlanations) values computed on the held-out test data. SHAP attributes each prediction to individual features using game-theoretic Shapley values. More fine-grained than permutation importance but slower.

What to look for:

  - Features sorted by mean |SHAP value| -- top features have the largest average impact on predictions
  - Available shap_type options: kernel (default, model-agnostic), permutation, exact (slowest, most accurate)
  - Compare with the permutation results -- features that rank highly in both methods are reliably important

fi_table_shap = test_evaluator.calculate_fi(
    fi_type="shap", shap_type="kernel"
)
fi_plot(fi_table_shap).show()
fi_ensemble_table(fi_table_shap).head(10)
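The game-theoretic idea behind SHAP can be demonstrated exactly on a tiny model: average each feature's marginal contribution over every order in which features could be added. This is what shap_type="exact" generalises (kernel SHAP approximates the same quantity); the value function below is toy data:

```python
# Exact Shapley values by enumerating all feature orderings. Feasible only
# for a handful of features -- shown to make the attribution idea concrete.
from itertools import permutations


def shapley_values_sketch(value_fn, n_features):
    """value_fn maps a frozenset of feature indices to a model value."""
    phi = [0.0] * n_features
    orders = list(permutations(range(n_features)))
    for order in orders:
        coalition = set()
        for f in order:
            before = value_fn(frozenset(coalition))
            coalition.add(f)
            # Marginal contribution of f in this ordering, averaged over orders.
            phi[f] += (value_fn(frozenset(coalition)) - before) / len(orders)
    return phi


# Toy value function: v({}) = 0, v({0}) = 3, v({1}) = 1, v({0, 1}) = 4.
v = {frozenset(): 0.0, frozenset({0}): 3.0, frozenset({1}): 1.0, frozenset({0, 1}): 4.0}
phi = shapley_values_sketch(v.__getitem__, 2)
```

Note that the attributions sum to the full-model value v({0, 1}) -- the "additive" property that makes SHAP values interpretable per prediction.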