Workflow & Modules
Overview
Real-world datasets often contain many columns, but only a subset of them actually helps a machine-learning model make accurate predictions. Finding that subset, known as feature selection, is a core goal of Octopus.
A workflow is an ordered list of tasks that are executed one after another. Each task wraps a module, and each module either selects features, trains models, or both. By chaining tasks together you build a pipeline that progressively narrows the feature set: start with cheap, fast filters to discard obvious noise, then hand the reduced set to more expensive methods for further refinement.
Module types
Octopus ships two kinds of modules:
| Type | Purpose | Examples |
|---|---|---|
| Feature Selection | Reduce the number of features | ROC, MRMR, Boruta |
| Machine Learning | Train models, optimize hyperparameters, and optionally select features | Octo, AutoGluon |
Both types return a list of selected features that the next task in the workflow can consume.
How tasks are connected
Every task has a task_id (starting at 0) and an optional depends_on parameter pointing to
the task_id of a prior task.
- The first task (depends_on=None) receives all columns listed in feature_cols.
- A dependent task (depends_on=N) receives only the features selected by task N, plus any scores, predictions, and feature-importance tables that task N produced.
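The hand-off rule can be illustrated with a small, self-contained sketch. This is not Octopus code: `ToyTask`, `run_workflow`, and the slicing "selectors" are hypothetical names invented for illustration, and the sketch passes along only feature lists, not the scores, predictions, or feature-importance tables mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyTask:
    """Hypothetical stand-in for a workflow task (not an Octopus class)."""

    task_id: int
    depends_on: Optional[int]
    select: Callable[[list[str]], list[str]]  # input features -> selected features


def run_workflow(tasks: list[ToyTask], feature_cols: list[str]) -> dict[int, list[str]]:
    """Run tasks in list order and return the features selected by each task."""
    selected: dict[int, list[str]] = {}
    for task in tasks:
        if task.depends_on is None:
            # A root task sees every column in feature_cols.
            inputs = feature_cols
        else:
            # A dependent task sees only what its parent selected.
            inputs = selected[task.depends_on]
        selected[task.task_id] = task.select(inputs)
    return selected


feature_cols = [f"f{i}" for i in range(30)]
tasks = [
    ToyTask(0, None, lambda cols: cols[:20]),  # keeps 20 of 30
    ToyTask(1, 0, lambda cols: cols[:15]),     # keeps 15 of 20
    ToyTask(2, 1, lambda cols: cols[:10]),     # keeps 10 of 15
]
result = run_workflow(tasks, feature_cols)
print({task_id: len(cols) for task_id, cols in result.items()})
# prints {0: 20, 1: 15, 2: 10}
```

Each selector here simply truncates its input; a real module would rank or test features before deciding which ones to keep.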
Example workflow
A typical three-step pipeline looks like this:
Task 0 (Octo) all 30 features
|
v selected_features (e.g. 20)
Task 1 (MRMR) receives 20 features from Task 0
|
v selected_features (e.g. 15)
Task 2 (Octo) receives 15 features from Task 1
In Python this translates to:
from octopus.study import OctoClassification
from octopus.modules import Mrmr, Octo
from octopus.types import ModelName

study = OctoClassification(
    ...,
    workflow=[
        # Task 0: Octo on the full feature set
        Octo(
            task_id=0,
            depends_on=None,
            description="step1_octo_full",
            models=[ModelName.ExtraTreesClassifier],
            n_trials=200,
            n_inner_splits=5,
            max_features=30,
        ),
        # Task 1: MRMR on the features selected by task 0
        Mrmr(
            task_id=1,
            depends_on=0,
            description="step2_mrmr",
            n_features=15,
        ),
        # Task 2: Octo on the features selected by task 1
        Octo(
            task_id=2,
            depends_on=1,
            description="step3_octo_reduced",
            models=[ModelName.ExtraTreesClassifier],
            n_trials=200,
            n_inner_splits=5,
            ensemble_selection=True,
        ),
    ],
)

study.fit(data=df)
Tip
Ordering matters: tasks with depends_on=None must appear before tasks that reference
them, and task_id values must form a contiguous sequence starting at 0.
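The two rules in this tip can be checked before a study is started. The sketch below is a hypothetical helper, not part of Octopus: `validate_workflow` works on plain `(task_id, depends_on)` pairs, and it assumes the stricter reading that every dependency (not only a root task) must appear earlier in the list than the task that references it.

```python
from typing import Optional


def validate_workflow(tasks: list[tuple[int, Optional[int]]]) -> None:
    """Check (task_id, depends_on) pairs against the ordering rules.

    Hypothetical helper; raises ValueError on the first violated rule.
    """
    ids = [task_id for task_id, _ in tasks]
    # Rule 1: task_id values form the contiguous sequence 0, 1, ..., n-1.
    if sorted(ids) != list(range(len(tasks))):
        raise ValueError("task_id values must be contiguous and start at 0")

    # Rule 2 (assumed reading): a dependency must already have appeared.
    seen: set[int] = set()
    for task_id, depends_on in tasks:
        if depends_on is not None and depends_on not in seen:
            raise ValueError(
                f"task {task_id} depends on task {depends_on}, "
                "which does not appear earlier in the workflow"
            )
        seen.add(task_id)


# The three-step example workflow passes both checks.
validate_workflow([(0, None), (1, 0), (2, 1)])
```

A workflow such as `[(0, None), (2, 0)]` fails the first check because task_id 1 is missing.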
Feature Selection Modules
The table below lists all feature-selection modules, roughly ordered from cheapest to most expensive:
| Module | Wraps | Description |
|---|---|---|
| ROC | scipy, networkx (custom) | Removes correlated features using graph-based grouping |
| MRMR | Custom implementation | Maximum Relevance Minimum Redundancy filter |
| Boruta | Custom (based on BorutaPy) | Shadow-feature statistical test |
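To give a feel for what "graph-based grouping" of correlated features can mean, here is a minimal standard-library sketch. It is not the ROC module's implementation: the function names, the Pearson correlation measure, the 0.9 threshold, and the choice to keep the first feature of each group are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns: dict[str, list[float]], threshold: float = 0.9) -> list[str]:
    """Keep one representative feature per group of correlated features."""
    # Build an undirected graph: an edge joins two features whose
    # absolute correlation reaches the threshold.
    neighbours: dict[str, set[str]] = {name: set() for name in columns}
    for a, b in combinations(columns, 2):
        if abs(pearson(columns[a], columns[b])) >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Walk each connected component and keep its first feature
    # (in column order) as the representative.
    kept: list[str] = []
    visited: set[str] = set()
    for name in columns:
        if name in visited:
            continue
        kept.append(name)
        stack = [name]
        while stack:
            node = stack.pop()
            if node not in visited:
                visited.add(node)
                stack.extend(neighbours[node] - visited)
    return kept


data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * a, so it groups with "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly correlated with "a"
}
print(drop_correlated(data))
# prints ['a', 'c']
```

Grouping by connected component means features can end up in the same group through a chain of pairwise correlations, even if two of them are not strongly correlated with each other directly.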
Machine Learning Modules
| Module | Description |
|---|---|
| Octo | Core ML module with HPO, ensembling, and feature importance |
| AutoGluon | AutoGluon TabularPredictor wrapper |