
Workflow & Modules

Overview

Real-world datasets often contain many columns, but only a subset of them actually helps a machine-learning model make accurate predictions. Finding that subset -- feature selection -- is a core goal of Octopus.

A workflow is an ordered list of tasks that are executed one after another. Each task wraps a module, and each module either selects features, trains models, or both. By chaining tasks together you build a pipeline that progressively narrows the feature set: start with cheap, fast filters to discard obvious noise, then hand the reduced set to more expensive methods for further refinement.

Module types

Octopus ships two kinds of modules:

Type               Purpose                                                                 Examples
Feature Selection  Reduce the number of features                                           ROC, MRMR, Boruta
Machine Learning   Train models, optimize hyperparameters, and optionally select features  Octo, AutoGluon

Both types return a list of selected features that the next task in the workflow can consume.

How tasks are connected

Every task has a task_id (starting at 0) and an optional depends_on parameter pointing to the task_id of a prior task.

  • The first task (depends_on=None) receives all columns listed in feature_cols.
  • A dependent task (depends_on=N) receives only the features selected by task N, plus any scores, predictions, and feature-importance tables that task N produced.
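The hand-off rule above can be sketched in plain Python. This is only an illustration, not Octopus code: the Task dataclass, its keep field, and the helper functions are hypothetical names, and the "selection" is a toy rule that keeps the first few incoming features.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """Hypothetical stand-in for a workflow task (not an Octopus class)."""
    task_id: int
    depends_on: Optional[int]
    keep: int  # how many incoming features this toy task selects


def input_features(task, feature_cols, selected):
    """Return the features a task receives.

    `selected` maps task_id -> features chosen by an already finished task.
    """
    if task.depends_on is None:
        # First task: every column listed in feature_cols.
        return list(feature_cols)
    # Dependent task: only what the referenced task selected.
    return list(selected[task.depends_on])


def run(tasks, feature_cols):
    """Run the toy tasks in order and record what each one selects."""
    selected = {}
    for task in tasks:
        incoming = input_features(task, feature_cols, selected)
        # Toy selection rule (assumption): keep the first `keep` features.
        selected[task.task_id] = incoming[: task.keep]
    return selected


feature_cols = [f"f{i}" for i in range(30)]
tasks = [Task(0, None, 20), Task(1, 0, 15), Task(2, 1, 15)]
selected = run(tasks, feature_cols)
print({task_id: len(cols) for task_id, cols in selected.items()})
```

A real task would also pass along the scores, predictions, and feature-importance tables mentioned above; the sketch tracks only the feature lists.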

Example workflow

A typical three-step pipeline looks like this:

Task 0 (Octo)          all 30 features
        |
        v               selected_features (e.g. 20)
Task 1 (MRMR)          receives 20 features from Task 0
        |
        v               selected_features (e.g. 15)
Task 2 (Octo)          receives 15 features from Task 1

In Python this translates to:

from octopus.study import OctoClassification
from octopus.modules import Mrmr, Octo
from octopus.types import ModelName

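study = OctoClassification(
    # Other study arguments are elided here.
    ...,
    workflow=[
        # Task 0: runs first and receives every column in feature_cols.
        Octo(
            task_id=0,
            depends_on=None,
            description="step1_octo_full",
            models=[ModelName.ExtraTreesClassifier],
            n_trials=200,
            n_inner_splits=5,
            max_features=30,
        ),
        # Task 1: receives only the features selected by task 0.
        Mrmr(
            task_id=1,
            depends_on=0,
            description="step2_mrmr",
            n_features=15,
        ),
        # Task 2: receives only the features selected by task 1.
        Octo(
            task_id=2,
            depends_on=1,
            description="step3_octo_reduced",
            models=[ModelName.ExtraTreesClassifier],
            n_trials=200,
            n_inner_splits=5,
            ensemble_selection=True,
        ),
    ],
)

# df is the input dataset, loaded elsewhere.
study.fit(data=df)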

Tip

Ordering matters: a task with depends_on=None must appear before any task that depends on it, and task_id values must form a contiguous sequence starting at 0.
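To make these constraints concrete, here is a minimal sketch of a pre-flight check. The function validate_workflow is a hypothetical helper, not part of Octopus, and it works on plain (task_id, depends_on) pairs. It also assumes a slightly stricter reading of the tip: every referenced task, not only those with depends_on=None, must be listed earlier.

```python
def validate_workflow(tasks):
    """Check ordering rules for a list of (task_id, depends_on) pairs.

    Hypothetical helper for illustration; raises ValueError on a violation.
    """
    ids = [task_id for task_id, _ in tasks]
    # task_id values must be exactly 0, 1, ..., len(tasks) - 1.
    if sorted(ids) != list(range(len(tasks))):
        raise ValueError("task_id values must be contiguous and start at 0")

    seen = set()
    for task_id, depends_on in tasks:
        # Assumption: a dependency must already have appeared in the list.
        if depends_on is not None and depends_on not in seen:
            raise ValueError(
                f"task {task_id} depends on task {depends_on}, "
                "which is not listed before it"
            )
        seen.add(task_id)


# The three-step example workflow passes the check.
validate_workflow([(0, None), (1, 0), (2, 1)])
```

Running a check like this before fitting surfaces a mis-ordered workflow immediately instead of partway through a long run.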


Feature Selection Modules

The table below lists all feature-selection modules, roughly ordered from cheapest to most expensive:

Module  Wraps                      Description
ROC     scipy, networkx (custom)   Removes correlated features using graph-based grouping
MRMR    Custom implementation      Maximum Relevance Minimum Redundancy filter
Boruta  Custom (based on BorutaPy) Shadow-feature statistical test
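As an illustration of what "graph-based grouping" of correlated features can look like, the sketch below uses only the standard library. It is not the ROC module's implementation: the 0.9 threshold, the use of Pearson correlation, and the rule of keeping the alphabetically first feature per group are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_groups(columns, threshold=0.9):
    """Group features whose absolute correlation exceeds the threshold.

    Features are graph nodes; an edge links two features with
    |r| > threshold. Each connected component is one group.
    """
    neighbours = {name: set() for name in columns}
    for a, b in combinations(columns, 2):
        if abs(pearson(columns[a], columns[b])) > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    groups, seen = [], set()
    for start in columns:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack, component = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.append(node)
            stack.extend(neighbours[node] - seen)
        groups.append(sorted(component))
    return groups


def drop_correlated(columns, threshold=0.9):
    """Keep one representative per group (assumption: alphabetically first)."""
    return [group[0] for group in correlated_groups(columns, threshold)]


data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],  # roughly 2 * a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],  # only weakly related to a and b
}
kept = drop_correlated(data)
print(kept)
```

Here a and b land in one group because they are almost perfectly correlated, so only one of them is kept alongside c. Grouping by connected components means features can share a group through a chain of pairwise correlations even when the two ends of the chain are only weakly correlated.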

Machine Learning Modules

Module     Description
Octo       Core ML module with HPO, ensembling, and feature importance
AutoGluon  AutoGluon TabularPredictor wrapper