Feature Importance in Octopus Modules

All modules in Octopus inherit from the Module base class, which provides standardized methods for extracting feature importances from fitted models. This guide explains how to use the feature importance functionality across different module types.

Overview

The Module base class provides a unified interface for feature importance extraction through the get_feature_importances() method. This method supports three different calculation strategies:

Internal - Uses built-in feature importances from tree-based models
Permutation - Calculates permutation importance (works with any model)
Coefficients - Uses coefficient magnitudes from linear models

Basic Usage

After fitting a module, you can extract feature importances using:

# Fit a module
module.fit(
    data_traindev=train_data,
    data_test=test_data,
    feature_cols=feature_cols,
    study=study,
    outersplit_id=0,
    output_dir=output_dir,
)

# Get feature importances (default: internal method)
importance_df = module.get_feature_importances()
print(importance_df)

Output format:

       feature  importance
0    feature_1    0.450123
1    feature_0    0.320456
2    feature_3    0.120789
3    feature_2    0.108632

The returned DataFrame contains two columns: - feature: Feature name - importance: Importance score (higher = more important)

Features are sorted by importance in descending order.

Methods

1. Internal Importance (Tree-based Models)

Best for: Random Forest, Gradient Boosting, XGBoost, LightGBM, CatBoost

importance_df = module.get_feature_importances(method="internal")

This method extracts the built-in feature_importances_ attribute from tree-based models. It's fast and directly reflects how the model uses features during training.

Advantages: - Very fast (no additional computation) - Directly reflects model's internal feature usage - Most interpretable for tree-based models

Limitations: - Only works with models that have feature_importances_ attribute - Not available for linear models or other model types

Example:

from octopus.modules import Octo
from octopus.types import ModelName

# Configure Octo with RandomForest
octo = Octo(
    task_id=0,
    models=[ModelName.RandomForestClassifier],
    n_trials=50,
)

# Fit the module
octo.fit(...)

# Get internal feature importances
fi_df = octo.get_feature_importances(method="internal")

2. Permutation Importance (Any Model)

Best for: Any fitted model, especially when internal importances aren't available

importance_df = module.get_feature_importances(
    method="permutation",
    data=validation_data,
    target=validation_target,
)

Permutation importance measures how much the model's performance decreases when a feature's values are randomly shuffled. It provides model-agnostic feature importance scores.

Parameters: - data: DataFrame with feature columns (required) - target: Target values as Series (required)

Advantages: - Works with any model type - Model-agnostic (comparable across different models) - Reflects actual predictive importance

Limitations: - Slower (requires multiple predictions) - Results can vary slightly between runs - May be correlated for highly correlated features

Example:

from octopus.modules import Mrmr

# Configure and fit MRMR module
mrmr = Mrmr(task_id=1, depends_on=0, n_features=50)
mrmr.fit(...)

# Get permutation importance on test set
fi_df = mrmr.get_feature_importances(
    method="permutation",
    data=test_data,
    target=test_data["target"],
)

3. Coefficient Importance (Linear Models)

Best for: Logistic Regression, Linear Regression, Ridge, Lasso, ElasticNet

importance_df = module.get_feature_importances(method="coefficients")

This method extracts and ranks features by the absolute magnitude of their coefficients in linear models.

Advantages: - Fast (uses existing coefficients) - Directly interpretable for linear models - Handles multi-class models (averages across classes)

Limitations: - Only works with models that have coef_ attribute - Assumes features are on comparable scales - Not suitable for non-linear models

Example:

from octopus.modules import Octo
from octopus.types import ModelName

# Configure Octo with LogisticRegression
octo = Octo(
    task_id=0,
    models=[ModelName.LogisticRegressionClassifier],
    n_trials=30,
)

# Fit the module
octo.fit(...)

# Get coefficient-based importances
fi_df = octo.get_feature_importances(method="coefficients")

Error Handling

The feature importance methods include comprehensive error checking:

Unfitted Model Error

module = Octo(task_id=0)
# Forgot to call fit()!

try:
    importance = module.get_feature_importances()
except ValueError as e:
    print(e)  # "Octo must be fitted before getting feature importances"

Incompatible Method Error

# Using LogisticRegression (no feature_importances_ attribute)
octo.fit(...)

try:
    importance = octo.get_feature_importances(method="internal")
except ValueError as e:
    print(e)  # "Model LogisticRegression does not have feature_importances_..."

Missing Parameters Error

try:
    importance = module.get_feature_importances(method="permutation")
    # Forgot to provide data and target!
except ValueError as e:
    print(e)  # "Permutation importance requires data and target parameters"

Module-Specific Examples

Octo (Optimization Module)

from octopus.modules import Octo
from octopus.types import ModelName

octo = Octo(
    task_id=0,
    models=[ModelName.RandomForestClassifier, ModelName.XGBClassifier],
    n_trials=100,
)

octo.fit(
    data_traindev=train_data,
    data_test=test_data,
    feature_cols=feature_cols,
    study=study,
    outersplit_id=0,
    output_dir=output_dir,
)

# Get internal importances (works if best model is tree-based)
fi_internal = octo.get_feature_importances(method="internal")

# Get permutation importances (works for any best model)
fi_permutation = octo.get_feature_importances(
    method="permutation",
    data=test_data,
    target=test_data["target"],
)

ROC (Feature Selection Module)

from octopus.modules import Roc

roc = Roc(
    task_id=0,
    threshold=0.8,
    correlation_type=CorrelationType.SPEARMAN,
)

selected_features, results = roc.fit(...)

# ROC doesn't have a predictive model (model_ = None)
# Use permutation importance if you have a downstream model
# or implement custom feature ranking

Boruta (Feature Selection Module)

from octopus.modules import Boruta
from octopus.types import ModelName

boruta = Boruta(
    task_id=0,
    model=ModelName.RandomForestClassifier,
    perc=100,
)

boruta.fit(...)

# Boruta uses RandomForest internally
fi_df = boruta.get_feature_importances(method="internal")

Best Practices

1. Choose the Right Method

Tree-based models: Use method="internal" for speed and interpretability
Linear models: Use method="coefficients" for direct coefficient interpretation
Any model: Use method="permutation" for model-agnostic importance
Feature selection modules: Check if they have a model_ attribute first

2. Validate Importances

Always validate feature importances make sense:

fi_df = module.get_feature_importances(method="internal")

# Check that importances sum to ~1.0 for tree models
total_importance = fi_df["importance"].sum()
print(f"Total importance: {total_importance}")

# Identify top features
top_features = fi_df.head(10)
print(f"Top 10 features:\n{top_features}")

3. Compare Across Methods

For tree-based models, compare internal and permutation importance:

fi_internal = module.get_feature_importances(method="internal")
fi_permutation = module.get_feature_importances(
    method="permutation",
    data=test_data,
    target=test_data["target"],
)

# Merge and compare
import pandas as pd
comparison = pd.merge(
    fi_internal,
    fi_permutation,
    on="feature",
    suffixes=("_internal", "_permutation"),
)
print(comparison)

4. Save Importances for Later Analysis

# Get feature importances
fi_df = module.get_feature_importances(method="internal")

# Save to disk
output_path = module.path_results / "feature_importances.parquet"
fi_df.to_parquet(output_path)

# Save to CSV for easy inspection
fi_df.to_csv(module.path_results / "feature_importances.csv", index=False)

Integration with Workflow

Feature importances are automatically calculated and saved during workflow execution through ModuleResults. However, you can also access them directly:

from octopus.manager.workflow_runner import WorkflowTaskRunner
from octopus.modules import Octo, Rfe
from octopus.types import ModelName

# Define workflow
workflow = [
    Octo(task_id=0, models=[ModelName.RandomForestClassifier]),
    Rfe(task_id=1, depends_on=0, n_features_to_select=20),
]

# Run workflow
runner = WorkflowTaskRunner(
    study=study,
    workflow=workflow,
    cpus_per_outersplit=4,
    log_dir=log_dir,
)

runner.run(outersplit_id=0, data_train=train_data, data_test=test_data)

# After workflow completes, you can load modules and get importances
from octopus.modules import Octo
from upath import UPath

octo_dir = study.output_path / "outersplit0" / "task0" / "module"
loaded_octo = Octo.load(octo_dir)

# Get importances from loaded module
fi_df = loaded_octo.get_feature_importances(method="internal")

Advanced Usage

Handling GridSearchCV Models

The feature importance methods automatically unwrap GridSearchCV objects:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from octopus.types import ModelName

# Octo internally uses GridSearchCV
octo = Octo(task_id=0, models=[ModelName.RandomForestClassifier])
octo.fit(...)

# This automatically extracts best_estimator_ from GridSearchCV
fi_df = octo.get_feature_importances(method="internal")

Custom Feature Importance

For modules with custom importance calculation needs, override the methods:

from octopus.modules import Task
import pandas as pd

class CustomModule(Task):
    def _get_internal_importance(self) -> pd.DataFrame:
        """Custom importance calculation."""
        # Your custom logic here
        custom_scores = self._calculate_custom_scores()

        df = pd.DataFrame({
            "feature": self.selected_features_,
            "importance": custom_scores,
        })
        return df.sort_values("importance", ascending=False).reset_index(drop=True)

Summary

The feature importance functionality in Octopus modules provides:

Unified interface across all modules via get_feature_importances()
Multiple methods (internal, permutation, coefficients) for different model types
Automatic error handling with clear, actionable error messages
Standardized output format (DataFrame with feature/importance columns)
Easy integration with workflows and results saving

This makes it easy to understand which features are most important for your models and make informed decisions about feature selection and model interpretation.