Regression

Octopus supports regression tasks for predicting continuous numeric outcomes. The setup is very similar to classification — you use OctoRegression instead of OctoClassification, pick a regression metric, and the rest of the pipeline (nested CV, hyperparameter optimization, ensembling) works the same way.

Overview

Regression predicts a continuous value (e.g., price, temperature, disease severity score) rather than a class label. Octopus handles the full pipeline: data splitting, model training with Optuna-based hyperparameter optimization, and evaluation — all wrapped in nested cross-validation.

Key differences from classification:

Use OctoRegression instead of OctoClassification.
Metrics measure prediction error (MAE, RMSE) or explained variance (R²) rather than class discrimination (AUCROC, F1).
There is no positive_class or ml_type parameter.
Stratification is optional — it can still be useful if you have a categorical column that should be balanced across splits (e.g., site or batch).

Data Format

Your dataset should be a pandas DataFrame with:

Feature columns — numeric, boolean, or categorical. No text, datetime, or object columns.
Target column — a continuous numeric value.
Sample ID column — a column that uniquely identifies each row.

Requirements and constraints:

Column	Type	Missing values allowed	Notes
Feature columns	int, float, bool, categorical	Yes (imputed automatically)	Single-value features are removed automatically. Bool is converted to int.
Target column	int or float	No	Continuous numeric value
Sample ID column	any	No	Used for group-aware splitting. Rows with the same ID are kept together.
Stratification column	int or bool	No	Optional. Cannot be the same as `sample_id_col`.

Octopus also auto-converts null-like strings ("None", "null", "nan", "NA", "") to NaN in feature and target columns. The reserved column names datasplit_group and row_id cannot appear in your data.

import pandas as pd

df = pd.DataFrame({
    "sample_id": [1, 2, 3, 4, 5],
    "temperature": [22.1, 18.5, 25.3, 19.8, 23.7],
    "humidity": [0.65, 0.80, 0.55, 0.70, 0.60],
    "pressure": [1013, 1008, 1020, 1015, 1011],
    "yield": [85.2, 72.1, 91.4, 78.3, 88.6],  # continuous target
})

Basic Usage

from octopus.study import OctoRegression
from octopus.modules import Octo
from octopus.types import ModelName

study = OctoRegression(
    study_name="my_regression",
    target_metric="MAE",
    feature_cols=["temperature", "humidity", "pressure"],
    target_col="yield",
    sample_id_col="sample_id",
    workflow=[
        Octo(
            task_id=0,
            depends_on=None,
            description="regression_step",
            models=[ModelName.ExtraTreesRegressor],
            n_trials=100,
            n_inner_splits=5,
            ensemble_selection=True,
        )
    ],
)

study.fit(data=df)

Key parameters:

Parameter	Description	Default
`target_metric`	Metric to optimize	`"RMSE"`
`sample_id_col`	Column identifying unique subjects (prevents correlated observation leakage)	`None`
`n_outer_splits`	Number of outer cross-validation splits	`5`
`single_outer_split`	Run only one split for quick testing (e.g., `0`)	`None`
`n_cpus`	Number of CPUs (`0` = all, `-1` = all but one)	`0`

Tip

If your dataset contains multiple rows per subject (e.g. longitudinal measurements, repeated experiments), set sample_id_col to the column identifying subjects. Octopus will ensure all rows from the same subject stay in the same split, preventing information leakage.

Choosing a Metric

Metric	Description	Direction	When to use
`MAE`	Mean Absolute Error	Minimize	Default choice; easy to interpret in the target's units
`RMSE`	Root Mean Squared Error	Minimize	Penalizes large errors more than MAE
`MSE`	Mean Squared Error	Minimize	Same as RMSE but without the square root
`R2`	R² (coefficient of determination)	Maximize	Measures proportion of variance explained; 1.0 = perfect

Tip

MAE is robust and interpretable — an MAE of 5.0 means the model is off by 5 units on average. R2 is useful when you want a normalized score between 0 and 1, but can be misleading on small datasets or when the target has low variance.

Available Models

Octopus offers a broad range of regression models, from simple linear methods to gradient boosting and neural networks:

Model	Type	Default	Notes
`ExtraTreesRegressor`	Tree ensemble	Yes	Fast, good baseline
`RandomForestRegressor`	Tree ensemble	Yes	Robust general-purpose model
`XGBRegressor`	Gradient boosting	Yes	Strong on tabular data
`CatBoostRegressor`	Gradient boosting	Yes	Native categorical support
`HistGradientBoostingRegressor`	Gradient boosting	Yes	Native categoricals, handles missing values
`ElasticNetRegressor`	Linear (regularized)	Yes	Combines L1 and L2 regularization
`GradientBoostingRegressor`	Gradient boosting	No	Sklearn implementation
`RidgeRegressor`	Linear (L2)	No	Simple regularized linear model
`ARDRegressor`	Bayesian linear	No	Automatic relevance determination
`SvrRegressor`	Support Vector	No	Kernel-based; best for small datasets
`GaussianProcessRegressor`	Kernel	No	Probabilistic predictions; small datasets only
`TabularNNRegressor`	Neural network	No	Embedding-based NN for mixed feature types

Models marked as "Default" are included automatically when you don't specify a models list in the Octo configuration.

Note

Linear models (ElasticNetRegressor, RidgeRegressor, ARDRegressor, SvrRegressor, LogisticRegressionClassifier) apply a StandardScaler to features automatically. Tree-based models do not require scaling.

Feature Importance

Feature importance methods work the same way as in classification:

permutation — Permutation importance: shuffles each feature and measures the performance drop. Works with any model.
shap — SHAP values: game-theoretic attributions. More detailed but slower.
constant — Baseline constant importance (for reference).

from octopus.types import FIComputeMethod

Octo(
    ...,
    fi_methods=[FIComputeMethod.PERMUTATION],
)

Custom Hyperparameters

You can override the default hyperparameter search space for any model using the hyperparameters parameter. This is useful when you have domain knowledge about reasonable parameter ranges:

from octopus.models.hyperparameter import IntHyperparameter, FloatHyperparameter

Octo(
    ...,
    models=[ModelName.RandomForestRegressor],
    hyperparameters={
        ModelName.RandomForestRegressor: [
            IntHyperparameter(name="max_depth", low=2, high=32),
            FloatHyperparameter(name="min_samples_split", low=0.01, high=0.5),
        ]
    },
)

See Use Own Hyperparameters for a full example.

Example Workflows

For runnable end-to-end examples, see:

Basic Regression: The simplest way to run a regression study using the diabetes dataset.
Use Own Hyperparameters: Shows how to define your own hyperparameter search ranges instead of using the defaults.
Multi-Step Regression: Chains Octo and MRMR into a multi-step pipeline that selects features before final model training.