Predict on New Data with OctoPredictor
Octopus trains models using nested cross-validation. Test data for each outer split is stored within the study and used by OctoTestEvaluator for post-hoc evaluation (see the analysis notebook).
OctoPredictor serves a different purpose: scoring new, external data that was never part of the study. The data must come from a different source, a different time period, or a genuinely held-out population. If any sample in the prediction data was seen by any model during fit, cross-validation, or hyperparameter tuning, the resulting scores are invalid.
This notebook demonstrates the full deployment workflow.
| Aspect | OctoPredictor |
|---|---|
| Purpose | Score genuinely new data that was never part of training |
| Data source | External: new batch, new CSV, new API response |
| Prediction | Ensemble average across all outer-split models |
| Deployment | Can be saved/loaded standalone |
| Use case | Production inference, scoring new samples |
1. Prepare Data
Before training, we set aside a portion of data to simulate new samples arriving later. In a real deployment this step does not exist -- you train on all available data and new data arrives naturally. Here we hold out 20% to have something to predict on.
The held-out data is saved to a CSV file and reloaded later as if it came from an external source. The study never sees this data during training.
from sklearn.model_selection import train_test_split
from octopus.example_data import load_breast_cancer_data
df, features, _ = load_breast_cancer_data()
# Hold out 20% BEFORE any octopus code runs
train_df, external_df = train_test_split(
df, test_size=0.2, random_state=42, stratify=df["target"]
)
# Save external data to CSV (simulates a separate data source)
external_csv = "new_patient_batch.csv"
external_df.to_csv(external_csv, index=False)
print(f"Training data: {len(train_df)} samples")
print(f"External data: {len(external_df)} samples (saved to {external_csv})")
2. Train a Study
Train a classification study on the training data. In practice this step happens once -- you train the model, validate it (see the analysis notebook), and then deploy it for ongoing scoring.
from octopus.modules import Tako
from octopus.study import OctoClassification
from octopus.types import ModelName
study = OctoClassification(
study_name="predict_demo",
studies_directory="./studies",
target_metric="ACCBAL",
feature_cols=features[:10], # subset for faster demo
target_col="target",
sample_id_col="index",
stratification_col="target",
workflow=[
Tako(
description="baseline",
task_id=0,
models=[ModelName.ExtraTreesClassifier],
n_trials=25,
),
],
)
study.fit(data=train_df)
print(f"Study trained at: {study.output_path}")
3. Load Study and Inspect Workflow
Before creating a predictor, load the study metadata with load_study_information(). This returns a StudyInfo object used by all downstream functions.
An octopus study can contain multiple workflow tasks (e.g. feature selection followed by model training, or multiple model configurations). OctoPredictor operates at the task level -- you need to select which task to use for predictions. Use workflow_graph() to visualise the task structure and identify the task ID you want.
from octopus.poststudy import load_study_information
from octopus.poststudy.analysis.tables import workflow_graph
study_info = load_study_information(str(study.output_path))
print(workflow_graph(study_info))
4. Load New Data
The data you pass to OctoPredictor must be genuinely new. No sample may have been used during training, cross-validation, or hyperparameter tuning. If it was, predictions and scores are scientifically invalid.
Here we reload the CSV saved in step 1. In your own workflow, replace this with your actual data source (e.g. new_df = pd.read_csv("data_from_lab_system.csv")).
import pandas as pd
# In practice: new_df = pd.read_csv("data_from_lab_system.csv")
new_df = pd.read_csv(external_csv)
print(f"Loaded {len(new_df)} new samples")
new_df.head()
5. Create an OctoPredictor
OctoPredictor takes a StudyInfo and a task_id. It loads the fitted models for that task from all outer splits. Since we already loaded study_info above, we pass it directly together with the task ID identified from the workflow graph.
from octopus.poststudy import OctoPredictor
predictor = OctoPredictor(study_info=study_info, task_id=0)
print(f"Task ID: {predictor.task_id}")
print(f"Result type: {predictor.result_type}")
print(f"ML type: {predictor.study_info.ml_type}")
print(f"Target metric: {predictor.study_info.target_metric}")
print(f"Outer splits: {predictor.study_info.n_outersplits}")
print(f"Feature cols: {len(predictor.feature_cols)}")
6. Predict
Each outer-split model predicts independently on the new data. predict() returns a wide-format DataFrame: one row per sample, with row_id, per-split columns (split_0, split_1, ...), and ensemble.
What it shows: One row per input sample. The ensemble column is the final prediction. For classification, it contains class labels derived from the argmax of averaged probabilities across all outer-split models. For regression, it is the arithmetic mean of per-split predictions. The split_N columns show each individual model's prediction.
What to look for:
- The ensemble column is what you use in production -- it is the model's answer for each sample
- Compare per-split columns to gauge prediction uncertainty: if all splits agree on a sample, the model is confident; if they disagree, the prediction is less certain
- Compute per-row standard deviation across split columns (df[split_cols].std(axis=1)) to flag uncertain samples for manual review, as shown in the cell below
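The following cell is a minimal sketch of this step: it scores the external samples with predict() and, following the bullet above, computes the per-row spread across the split columns as a rough uncertainty flag (this assumes the split columns hold numeric class labels, as in this 0/1 demo).
predictions = predictor.predict(new_df)
# Spread across the outer-split columns flags samples where the models disagree
split_cols = [c for c in predictions.columns if c.startswith("split_")]
uncertainty = predictions[split_cols].std(axis=1)
print("Most uncertain samples (largest spread across splits):")
print(uncertainty.sort_values(ascending=False).head())
predictions.head(10)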
7. Predict Probabilities
For classification tasks, predict_proba() returns a wide-format DataFrame with ensemble-averaged class probabilities plus per-split detail columns. Each model produces its own probability estimates, and the ensemble average is often better calibrated than any single model's probabilities.
What it shows: One row per sample. The first columns are named after the class labels (e.g. 0, 1) and contain the ensemble-averaged probabilities. Additional <class>_split_N columns show each outer-split model's individual probability estimates.
What to look for:
- Ensemble probabilities near 0.0 or 1.0 → high-confidence predictions; near 0.5 → the model is uncertain about the class
- Compare per-split probability columns for the same sample -- large spread indicates the models disagree, which may warrant closer inspection
- Use these probabilities for threshold tuning: instead of the default 0.5 cutoff, choose a threshold that balances precision and recall for your application (e.g. a lower threshold for high-recall screening); a sketch follows the cell below
proba = predictor.predict_proba(new_df)
print(f"Class labels: {list(predictor.classes_)}")
proba.head(10)
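As a rough sketch of that last point -- assuming this binary demo, where the ensemble probability columns are keyed by class label and the positive class is the second entry of predictor.classes_ -- a custom cutoff could be applied like this:
# Hypothetical high-recall screening cutoff -- tune for your own precision/recall trade-off
screening_threshold = 0.3
positive_class = predictor.classes_[1]  # assumes a binary task with probability columns keyed by class label
flagged = proba[positive_class] >= screening_threshold
print(f"Flagged {int(flagged.sum())} of {len(flagged)} samples at threshold {screening_threshold}")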
8. Evaluate Performance (Optional)
If the new data includes ground-truth labels (e.g. from a follow-up study, retrospective labelling, or quality control), you can evaluate how well the model performs on the external data. All metrics compatible with the ML type are computed automatically.
This is optional -- in many deployment scenarios, labels are not available at prediction time.
What it shows: Metric scores per outer split, plus a Mean row (average of per-split scores) and an Ensemble row (scoring the ensemble-averaged predictions against ground truth). Each outer-split model is scored independently on the full external dataset, so every bar uses the same samples but a different model.
What to look for:
- The Ensemble row reflects actual deployment performance -- this is the score you would get in production where predictions are ensemble-averaged before evaluation
- Ensemble scores are often better than Mean scores because averaging predictions before scoring is generally better than averaging scores (especially for nonlinear metrics like AUCROC)
- Scores comparable to the study's dev/test performance → the model generalises well to new data
- A significant drop compared to the study → possible distribution shift between the training population and the new data (e.g. different time period, different lab, different patient demographics)
- Large variation across outer splits → the individual models learned different patterns; the ensemble mitigates this
from octopus.poststudy.analysis.plots import performance_plot
perf = predictor.performance(new_df)
performance_plot(perf).show()
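To read the exact numbers behind the plot -- for example to compare the Mean and Ensemble rows directly -- you can also display the object returned by performance() (assumed here to be tabular, since it feeds the plot above):
perf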
9. Feature Importance (Optional)
calculate_fi() computes feature importance on the provided data. For permutation FI, the predictor loads the per-split training data from the study directory to build the replacement pool. In deployment mode (after save/load), the user-provided data is used as the pool instead.
n_repeats controls how many times each feature is shuffled. Higher values give more stable estimates. Here we use n_repeats=10 for demonstration; increase for production analyses.
What it shows: Permutation importance of each feature on the new data. Each feature is randomly shuffled and the resulting metric drop is measured. A large drop means the model relies heavily on that feature for its predictions.
What to look for:
- Features sorted by importance -- the top features drive model predictions most
- Features with importance near zero or negative contribute little and could potentially be removed in future model iterations
- Compare with the FI results from the study's analysis notebook: if the same features dominate, the model's behaviour is consistent across populations; if rankings differ substantially, the new data may have different feature-target relationships
- Use n_repeats=10 or higher for stable estimates; low repeat counts produce noisy rankings
from octopus.poststudy.analysis.tables import fi_ensemble_table
fi = predictor.calculate_fi(new_df, fi_type="permutation", n_repeats=10)
fi_ensemble_table(fi).head(10)
10. Save and Load for Deployment
Save the predictor to a standalone directory. This bundles models and metadata -- no study directory needed afterward. Ship this artifact to a different machine, a Docker container, or a scoring service.
import pandas as pd
# Save the fitted predictor as a standalone artifact
# (the save method name is assumed here -- check the OctoPredictor API in your octopus version)
save_dir = "./predictor_artifact"
predictor.save(save_dir)
# Reload and verify that the round-trip reproduces the original predictions
loaded = OctoPredictor.load(save_dir)
loaded_preds = loaded.predict(new_df)
pd.testing.assert_frame_equal(predictions, loaded_preds)
print("Predictions match after save/load round-trip.")