Feature Selection Pipeline

class moosefs.feature_selection_pipeline.FeatureSelectionPipeline(data: DataFrame | None = None, *, X: DataFrame | None = None, y: Series | None = None, fs_methods: list, merging_strategy: Any, num_repeats: int, num_features_to_select: int | None, metrics: list = ['logloss', 'f1_score', 'accuracy'], task: str = 'classification', min_group_size: int = 2, fill: bool = False, random_state: int | None = None, n_jobs: int | None = None, stability_mode: str = 'fold_stability')[source]

Bases: object

End-to-end pipeline for ensemble feature selection.

Orchestrates feature scoring, merging, metric evaluation, and Pareto-based selection across repeated runs and selector ensembles.

__init__(data: DataFrame | None = None, *, X: DataFrame | None = None, y: Series | None = None, fs_methods: list, merging_strategy: Any, num_repeats: int, num_features_to_select: int | None, metrics: list = ['logloss', 'f1_score', 'accuracy'], task: str = 'classification', min_group_size: int = 2, fill: bool = False, random_state: int | None = None, n_jobs: int | None = None, stability_mode: str = 'fold_stability') None[source]

Initialize the pipeline.

Parameters:
  • data – Combined DataFrame where the last column is treated as the target.

  • X – Feature DataFrame (use together with y instead of data).

  • y – Target Series aligned with X.

  • fs_methods – Feature selectors (identifiers or instances).

  • merging_strategy – Merging strategy (identifier or instance).

  • num_repeats – Number of cross-validation folds to run.

  • num_features_to_select – Desired number of features to select.

  • metrics – Metric functions (identifiers or instances).

  • task – ‘classification’ or ‘regression’.

  • min_group_size – Minimum number of methods in each ensemble.

  • fill – If True, enforce that exactly num_features_to_select features remain after merging.

  • random_state – Seed for reproducibility.

  • n_jobs – Number of parallel jobs; when -1 or None, num_repeats jobs are used.

  • stability_mode – Stability metric configuration:
      – “selector_agreement”: stability within the ensemble (do the selectors agree?)
      – “fold_stability”: stability across CV folds (are the same features selected consistently?)
      – “all”: include both stability metrics in Pareto optimization.
    Default: “fold_stability” (most important for robust features).

Raises:

ValueError – If task is invalid or required parameters are missing.

Note

  • Exactly one of data or the pair (X, y) must be provided.

  • Bootstrap is ONLY used by FrequencyBootstrapMerger (merger-specific). Pipeline-level bootstrap has been removed to avoid redundancy with CV.

static _set_seed(seed)[source]

Seed numpy/python RNGs for reproducibility.

Note: This sets the global random state. In parallel execution, each worker process has its own random state, so this is safe. PYTHONHASHSEED must be set before Python starts, so we don’t set it here.

static _validate_X_y(*, data=None, X=None, y=None)[source]

Normalize user inputs into a feature DataFrame and target Series.

_per_repeat_seed(idx)[source]

Derive a per-repeat seed from the top-level seed.
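
The exact derivation is an implementation detail not shown here; one plausible sketch, assuming a numpy SeedSequence-based scheme (an assumption, not confirmed by this documentation):

```python
import numpy as np

def per_repeat_seed(base_seed: int, idx: int) -> int:
    # Spawn idx+1 independent child sequences from the top-level seed and
    # take the last one; spawning is deterministic, so the same base seed
    # always yields the same per-repeat seed.
    child = np.random.SeedSequence(base_seed).spawn(idx + 1)[idx]
    return int(child.generate_state(1)[0])
```

Any scheme works as long as each repeat receives a distinct, reproducible seed.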

_effective_n_jobs()[source]

Return parallel job count capped by number of repeats.
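
A minimal sketch of the capping logic (the helper name and the -1/None handling are assumptions based on the n_jobs parameter description):

```python
def effective_n_jobs(n_jobs, num_repeats):
    # None or -1 means "one job per repeat"; otherwise cap at num_repeats,
    # since there is never more fold-level work than there are repeats.
    if n_jobs is None or n_jobs == -1:
        return num_repeats
    return max(1, min(n_jobs, num_repeats))
```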

_merging_requires_bootstrap(merger)[source]

Return True when the given merger asks for bootstrap statistics.

NOTE: Bootstrap is ONLY used by FrequencyBootstrapMerger, which has its own num_bootstrap parameter. Pipeline-level bootstrap has been removed to avoid redundancy with cross-validation and reduce complexity.

_should_collect_bootstrap()[source]

Return True if bootstrap stats should be gathered.

Bootstrap is only collected when a merger explicitly requests it (e.g., FrequencyBootstrapMerger with needs_bootstrap_merging=True).

_generate_selector_ensembles(min_group_size)[source]

Generate all selector-name combinations with minimum size.

Parameters:

min_group_size – Minimum ensemble size.

Returns:

List of tuples of selector names.
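
The enumeration is equivalent to taking every combination of at least min_group_size selectors; a sketch using itertools (not necessarily the actual implementation):

```python
from itertools import combinations

def generate_selector_ensembles(selector_names, min_group_size=2):
    # Every combination of selector names whose size is >= min_group_size:
    # pairs, triples, ... up to the full set of selectors.
    return [
        combo
        for size in range(min_group_size, len(selector_names) + 1)
        for combo in combinations(selector_names, size)
    ]
```

For three selectors and min_group_size=2 this yields the three pairs plus the full triple.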

run(verbose=True)[source]

Execute the pipeline and return best merged features.

Returns:

(selected_features, best_ensemble_name)
  • selected_features: List of selected feature names

  • best_ensemble_name: Name of the best ensemble (tuple of selector names, or (merger_name, selectors) if multiple mergers are used)

Return type:

tuple

_build_ensemble_index()[source]

Enumerate all selector ensemble × merger combinations.

_reset_run_tracking()[source]

Clear per-run state containers.

_execute_folds(cv_splits, verbose, pbar=None)[source]

Run each CV fold, possibly in parallel, updating pbar as each fold completes.

_collect_fold_results(parallel_results, result_dicts)[source]

Merge per-fold outputs into unified mappings.

_select_best_ensemble(result_dicts)[source]

Select best ensemble using single-stage Pareto with consistency metric.

For each ensemble, computes:
  • Mean of each performance metric across folds
  • Consistency score (inverse of the average std across performance metrics)
  • Stability metrics (already computed per ensemble)

Returns:

Tuple of (best_ensemble_name, best_fold_idx). Note: best_fold_idx is None when refit_on_full_data=True.
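
The consistency score described above can be sketched as follows; the exact scaling used by the pipeline is not documented, so this is one plausible "inverse of average std" formulation:

```python
import numpy as np

def consistency_score(metric_matrix):
    # metric_matrix: shape (n_folds, n_metrics).  A low average standard
    # deviation across folds means the ensemble performs consistently.
    avg_std = float(np.mean(np.std(metric_matrix, axis=0)))
    return 1.0 / (1.0 + avg_std)  # 1.0 when perfectly consistent
```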

_refit_on_full_data(ensemble_name)[source]

Run selectors and merger on full data for the chosen ensemble.

_cv_splits()[source]

Yield train/test indices for K-fold CV (stratified when classification).
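
A plain (unstratified) version of the fold generation might look like this sketch; the real method additionally stratifies by class for classification tasks:

```python
import numpy as np

def cv_splits(n_samples, num_repeats, seed=None):
    # Shuffle indices once, cut them into num_repeats folds, and yield each
    # fold as the test set with the remaining indices as the train set.
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    for fold in np.array_split(order, num_repeats):
        yield np.setdiff1d(order, fold), np.sort(fold)
```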

_pipeline_run_for_fold(fold_idx, train_idx, test_idx, verbose)[source]

Execute one CV fold and return partial results tuple.

_compute_bootstrap_stats(train_data, idx, feature_names)[source]

Collect selection counts across bootstrap resamples for each selector.

_compute_subset(train_data, idx)[source]

Compute selected Feature objects per method for this repeat.

_compute_merging(fs_subsets_local, idx, verbose=True, bootstrap_stats=None, feature_names=None)[source]

Merge per-ensemble features and return mapping for this repeat.

_merge_ensemble_features(fs_subsets_local, idx, selectors, merger, *, bootstrap_stats=None, feature_names=None)[source]

Merge features for a specific ensemble of selectors.

NOTE: Bootstrap is ONLY used by FrequencyBootstrapMerger. This merger explicitly requests bootstrap stats via needs_bootstrap_merging=True, and handles all bootstrap aggregation internally.

_compute_performance_metrics(X_train, y_train, X_test, y_test, fold_cache)[source]

Compute performance metrics using configured metric methods.

Parameters:
  • X_train – Training features

  • y_train – Training targets

  • X_test – Test features

  • y_test – Test targets

  • fold_cache – Dict for caching model training across ensembles in this fold

Returns:

List of metric values

_compute_metrics(fs_subsets_local, merged_features_local, train_data, test_data, idx, fold_cache)[source]

Compute performance and stability metrics for each ensemble.

Parameters:
  • fs_subsets_local – Feature subsets per selector

  • merged_features_local – Merged features per ensemble

  • train_data – Training split for this fold

  • test_data – Test split for this fold

  • idx – Fold index

  • fold_cache – Shared cache for model training across ensembles

Returns:

List of result dicts (one per metric)

static _calculate_means(result_dicts, ensemble_names)[source]

Calculate mean metrics per ensemble across repeats.

_inject_cross_fold_stability(result_dicts)[source]

Compute stability of merged features across folds for each ensemble.

Fold stability is a single value per ensemble (computed across all folds), but we replicate it for each fold index so it can be used in Pareto selection.
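
One common definition of such a cross-fold score is the mean pairwise Jaccard similarity of the per-fold feature sets; a sketch under that assumption (the pipeline's actual stability metric may differ):

```python
def fold_stability(per_fold_features):
    # per_fold_features: iterable of feature-name collections, one per fold.
    sets = [set(f) for f in per_fold_features]
    pairs = [
        len(a & b) / len(a | b)
        for i, a in enumerate(sets)
        for b in sets[i + 1:]
    ]
    # 1.0 means identical selections in every fold; 0.0 means disjoint.
    return sum(pairs) / len(pairs) if pairs else 1.0
```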

static _compute_pareto(values, names)[source]

Return the name of the winner using Pareto analysis.

Parameters:
  • values – List of metric vectors (one per group).

  • names – Corresponding group names.

Returns:

Name of the best group according to Pareto dominance.

Raises:

ValueError – If all groups have failed (all -inf values).
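
A sketch of Pareto selection over higher-is-better metric vectors; the real tie-breaking rule among non-dominated groups is not documented, so summing the metrics here is an assumption:

```python
def pareto_winner(values, names):
    # A vector is dominated if some other vector is >= in every metric
    # and strictly > in at least one.
    def dominated(v):
        return any(
            all(o >= s for o, s in zip(other, v))
            and any(o > s for o, s in zip(other, v))
            for other in values if other != v
        )
    front = [(v, n) for v, n in zip(values, names) if not dominated(v)]
    # Tie-break among the Pareto front by metric sum (assumption).
    return max(front, key=lambda vn: sum(vn[0]))[1]
```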

_extract_repeat_metrics(ensemble, *result_dicts)[source]

Return a row per repeat for the given ensemble.

_load_class(input, instantiate=False)[source]

Resolve identifiers to classes/instances and optionally instantiate.

Parameters:
  • input – Identifier or instance of a selector/merger/metric.

  • instantiate – If True, instantiate using extracted parameters.

Returns:

Class or instance.

Raises:

ValueError – If input is invalid.

_num_metrics_total()[source]

Count performance metrics plus configured stability signals.