Analyze Study (Regression)
This notebook provides a comprehensive post-hoc analysis of a completed Octopus regression study. It covers:
- Study Details -- load, validate, and summarize the study configuration and workflow structure.
- Development Performance -- metric scores across outer splits for all prediction tasks and result types. Use this to guide model selection and hyperparameter decisions.
- Selected Features -- number of selected features per task and outer split, plus feature selection frequency.
- Test Set Evaluation -- for a selected task: test-set metrics (per-split, mean, and merged), prediction vs ground truth scatter, residual analysis, and feature importances (permutation and SHAP).
Prerequisite: run the basic regression workflow first to generate the study data.
Imports
from octopus.poststudy import OctoTestEvaluator, load_study_information
from octopus.poststudy.analysis.notebook import (
display_feature_groups_table,
display_performance_tables,
display_study_overview,
)
from octopus.poststudy.analysis.plots import (
dev_performance_plot,
feature_count_plot,
feature_frequency_plot,
fi_plot,
performance_plot,
prediction_plot,
residual_plot,
)
from octopus.poststudy.analysis.tables import (
fi_ensemble_table,
get_performance,
get_selected_features,
workflow_graph,
)
from octopus.poststudy.study_io import find_latest_study
Input
The analysis requires the path to a saved study directory. Octopus saves studies in directories named <prefix>-YYYYMMDD_HHMMSS. The helper find_latest_study scans the studies root folder and returns the most recent directory matching the given name prefix. Override study_directory if you want to point at a specific study instead.
study_name_prefix = "basic_regression"
study_directory = find_latest_study("../studies", study_name_prefix)
print(f"Using study: {study_directory}")
Study Details
load_study_information validates the study directory (checks outersplit and task directories) and returns a StudyInfo object used by all downstream analysis functions. display_study_overview prints a concise summary. workflow_graph renders the task dependency tree.
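A minimal sketch of the corresponding cells, assuming load_study_information takes the study directory path and that the display and graph helpers accept the returned StudyInfo object (exact signatures may differ in your Octopus version):
# Load and validate the study; study_info is reused by all later cells.
study_info = load_study_information(study_directory)
# Concise textual summary of the study configuration.
display_study_overview(study_info)
# Render the task dependency tree of the workflow.
workflow_graph(study_info)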
Development Performance
What it shows: Metric scores per outer split for all prediction tasks in the workflow. Feature selection tasks (mrmr, roc, boruta) are skipped automatically. If a task has multiple result types (e.g. best and ensemble_selection), each is shown separately. By default the dev partition is shown -- use this partition for model selection and hyperparameter decisions.
What to look for:
- Consistent bar heights across splits → stable model performance
- Large variation between splits → model sensitive to the train/test partition (common with small datasets)
- For regression, key metrics are R2 (higher is better, 1.0 = perfect), MAE and RMSE (lower is better)
- R2 below 0.5 suggests the model explains less than half the variance -- consider more features or different models
- Use the metric dropdown to switch between available metrics
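A hedged sketch of the corresponding cells, assuming get_performance extracts the per-split scores from the StudyInfo and that the table and plot helpers consume that result directly (call patterns are illustrative, not confirmed):
# Per-split metric scores for all prediction tasks (dev partition by default).
dev_performance = get_performance(study_info)
# Tabular view of the scores per task and result type.
display_performance_tables(dev_performance)
# Interactive bar plot with a metric dropdown.
dev_performance_plot(dev_performance).show()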
Selected Features
What it shows: Number of selected features per outer split and task, plus how frequently each feature was selected across splits. Covers all tasks and result types discovered on disk.
What to look for:
- Feature count plot: Consistent count across splits → stable feature selection. A drop between workflow tasks confirms that feature selection steps are reducing dimensionality as intended.
- Feature frequency plot: Features selected in all splits are robust and likely genuinely informative. Features selected in only one or two splits may be noise or split-specific artefacts.
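A hedged sketch using the helpers imported above; assuming get_selected_features reads the selected features from the StudyInfo and the two plot functions take that result (exact arguments may differ):
# Selected features per task and outer split, as discovered on disk.
selected_features = get_selected_features(study_info)
# Number of selected features per outer split and task.
feature_count_plot(selected_features).show()
# How often each feature was selected across the outer splits.
feature_frequency_plot(selected_features).show()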
Test Set Evaluation
The sections above use the dev partition -- these results should guide model selection, hyperparameter tuning, and feature selection decisions. Looking at test scores during these steps introduces data leakage and inflates reported performance.
The sections below evaluate a single selected task on the held-out test partition. Only look at test results after all modelling decisions have been made. If test results lead you to change the model or features, the test set is no longer independent and the reported scores lose their validity.
OctoTestEvaluator loads the fitted models and the stored train/test splits for the selected task. All subsequent calls (performance, feature importance) use this evaluator.
Test Performance
What it shows: Test-set metrics per outer split (each model scored only on its own held-out test fold), plus a Mean row (average of per-split scores) and a Merged row (all test predictions pooled across splits and scored as one set). All regression metrics are computed automatically.
What to look for:
- The Merged row is the standard "pooled out-of-fold" evaluation -- the most robust summary
- R2 values: above 0.7 is generally good, above 0.9 excellent (domain-dependent). Negative R2 means the model is worse than predicting the mean
- MAE gives the average prediction error in the same units as the target -- easier to interpret than RMSE for stakeholders
- RMSE penalises large errors more heavily than MAE -- a big gap between MAE and RMSE suggests outlier predictions
- Large spread across splits → model performance depends on the data partition
test_evaluator = OctoTestEvaluator(
study_info=study_info, task_id=0, result_type="best"
)
test_perf = test_evaluator.performance()
performance_plot(test_perf).show()
Prediction vs Ground Truth
What it shows: Per-split scatter plots of ground truth (x-axis) vs model prediction (y-axis). The dashed diagonal line represents perfect predictions -- points on the line mean the model predicted exactly the true value.
What to look for:
- Points clustered tightly around the diagonal → accurate model
- Systematic deviation above or below the diagonal → the model is biased (over- or under-predicting)
- Fan-shaped spread (wider at higher values) → heteroscedastic errors, the model is less reliable for extreme values
- Outlier points far from the diagonal → individual samples the model struggles with
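A hedged one-liner using the prediction_plot helper imported above; whether it takes the evaluator directly or a predictions table is an assumption, mirroring the performance_plot pattern below:
# Hypothetical call pattern; prediction_plot may instead expect a
# predictions table rather than the evaluator itself.
prediction_plot(test_evaluator).show()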
Residual Plot
What it shows: Per-split scatter plots of predicted value (x-axis) vs residual (prediction minus ground truth, y-axis). The dashed horizontal line at zero marks perfect predictions.
What to look for:
- Random scatter around zero → errors are unstructured, the model captures the underlying pattern well
- Funnel shape (spread increases with predicted value) → heteroscedasticity, the model's uncertainty grows for larger predictions
- Curved pattern → the model is missing a nonlinear relationship in the data
- Clusters of large residuals → subgroups of samples the model handles poorly
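A hedged sketch for the residual_plot helper imported above; the exact input it expects is an assumption:
# Hypothetical call pattern; residual_plot may instead expect a
# predictions table rather than the evaluator itself.
residual_plot(test_evaluator).show()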
Feature Importance -- Permutation
What it shows: Permutation importance on the held-out test data. Each feature is randomly shuffled and the resulting performance drop is measured. A large drop means the feature is important. group_permutation also computes importance at the feature-group level, revealing whether correlated features contribute collectively.
Each outer-split model is evaluated independently on its own test fold, then results are aggregated into an ensemble summary.
What to look for:
- Features sorted by importance -- top features contribute most to predictions
- n_repeats=3 is used here for speed; use 10+ for real analyses to get stable estimates and reliable p-values
fi_table_perm = test_evaluator.calculate_fi(
fi_type="group_permutation", n_repeats=3
)
fi_plot(fi_table_perm).show()
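The fi_ensemble_table helper imported above can complement the plot with a tabular view of the aggregated importances; passing it the same permutation result is an assumption about its interface:
# Aggregated (ensemble) feature importances as a table; the input is assumed
# to be the same result object passed to fi_plot.
fi_ensemble_table(fi_table_perm)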
Feature Importance -- SHAP
What it shows: SHAP (SHapley Additive exPlanations) values computed on the held-out test data. SHAP attributes each prediction to individual features using game-theoretic Shapley values. More fine-grained than permutation importance but slower.
What to look for:
- Features sorted by mean |SHAP value| -- top features have the largest average impact on predictions
- Available shap_type options: kernel (default, model-agnostic), permutation, exact (slowest, most accurate)
- Compare with permutation results -- features that rank highly in both methods are reliably important
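A hedged sketch of the corresponding cell; the exact fi_type string for SHAP (here assumed to be "shap") and the shap_type default may differ in your Octopus version:
# SHAP importances on the held-out test data; fi_type="shap" is an assumption,
# shap_type selects the explainer described above.
fi_table_shap = test_evaluator.calculate_fi(
    fi_type="shap", shap_type="kernel"
)
fi_plot(fi_table_shap).show()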