High-dimensional Variable Screening

Statistical methodologies for identifying important features in ultra-high dimensional data (\(p \gg n\)).

Overview

Researchers increasingly face ultra-high-dimensional problems, in which the number of features \(p\) may grow exponentially with the number of observations \(n\). Examples arise in genomics (gene expression data), finance (high-frequency trading), and image processing. In such settings, standard statistical methods often fail because of the “curse of dimensionality,” computational infeasibility, or noise accumulation.

Variable screening is a crucial first step in the analysis pipeline. Its goal is to rapidly reduce the dimensionality from ultra-high to a moderate scale (typically below \(n\)) while retaining all truly important variables with high probability, a property known as the sure screening property.
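
To make the idea concrete, here is a minimal sketch of the simplest screener: rank features by absolute marginal correlation with the response and keep the top \(d\). Sure screening then means that the kept set contains every truly important variable with probability tending to one. The function name, the synthetic data, and the screening size \(d = \lfloor n / \log n \rfloor\) (a common default in the screening literature) are assumptions chosen for illustration; this is not the implementation used in any of the papers below.

```python
import numpy as np

def marginal_correlation_screen(X, y, d):
    """Keep the d features with the largest absolute marginal correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    x_norm = np.linalg.norm(Xc, axis=0)
    x_norm[x_norm == 0] = np.inf  # constant columns get utility 0
    utility = np.abs(Xc.T @ yc) / (x_norm * np.linalg.norm(yc))
    keep = np.argsort(-utility)[:d]
    return keep, utility

# Hypothetical synthetic data: p >> n, only the first three features matter.
rng = np.random.default_rng(0)
n, p = 200, 2000
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] - 3.0 * X[:, 1] + 3.0 * X[:, 2] + rng.standard_normal(n)

d = int(n / np.log(n))  # screening size below n (an assumed default)
keep, utility = marginal_correlation_screen(X, y, d)
print("kept", len(keep), "features; true signals kept:",
      sorted(set(keep.tolist()) & {0, 1, 2}))
```

The screen costs one matrix-vector product, which is why it scales to very large \(p\); a more careful method (for example, penalized regression) can then be run on the kept features only.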

Variable Screening

Our research focuses on developing robust, flexible, and efficient screening frameworks that can handle:

  • Complex relationships: non-linear and interaction effects, beyond linear correlations.
  • Model uncertainty: ensemble methods that improve stability and reduce false positives.
  • Diverse data types: classification, regression, and survival analysis settings.

🎯 Nonparametric Independence Screening in Sparse Ultra-High-Dimensional Additive Models

Jianqing Fan, Yang Feng, & Rui Song, Journal of the American Statistical Association (JASA), 2011 📄 Paper

Summary

This work extends independence screening to nonparametric additive models. It relaxes the linearity assumption, so that variables with strong non-linear relationships to the response can be detected. The proposed method, nonparametric independence screening (NIS), uses B-spline basis expansions to capture these non-linear signals.

Highlights

  • Extends the Sure Screening Property to nonparametric additive models.
  • Uses B-spline approximations to capture flexible functional forms.
  • Establishes theoretical guarantees for screening consistency.
  • Substantially reduces the false negative rate relative to linear screening methods when non-linear effects are present.
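
The marginal spline fit behind this idea can be sketched as follows: regress the centered response on a B-spline basis of each feature separately, and use the mean squared fitted value as that feature's utility. This is an illustrative sketch only; the function names, the quantile placement of interior knots, the number of knots, and the synthetic data are assumptions, and the basis is built with a plain Cox-de Boor recursion rather than any particular package.

```python
import numpy as np

def bspline_basis(x, n_interior=3, degree=3):
    """Clamped B-spline basis evaluated at x via the Cox-de Boor recursion."""
    lo, hi = x.min(), x.max()
    # Interior knots at empirical quantiles (an assumption of this sketch)
    interior = np.quantile(x, np.linspace(0, 1, n_interior + 2)[1:-1])
    t = np.concatenate([np.full(degree + 1, lo), interior, np.full(degree + 1, hi)])
    m = len(t) - 1
    # Degree 0: indicators of the half-open knot intervals
    B = np.zeros((len(x), m))
    for i in range(m):
        B[:, i] = (t[i] <= x) & (x < t[i + 1])
    B[x == hi, n_interior + degree] = 1.0  # close the last non-empty interval
    # Raise the degree one step at a time
    for k in range(1, degree + 1):
        B_next = np.zeros((len(x), m - k))
        for i in range(m - k):
            left = t[i + k] - t[i]
            right = t[i + k + 1] - t[i + 1]
            if left > 0:
                B_next[:, i] += (x - t[i]) / left * B[:, i]
            if right > 0:
                B_next[:, i] += (t[i + k + 1] - x) / right * B[:, i + 1]
        B = B_next
    return B

def spline_utilities(X, y, n_interior=3, degree=3):
    """Mean squared fitted value of a marginal spline regression, per feature."""
    yc = y - y.mean()
    util = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        B = bspline_basis(X[:, j], n_interior, degree)
        coef, *_ = np.linalg.lstsq(B, yc, rcond=None)
        util[j] = np.mean((B @ coef) ** 2)
    return util

# Hypothetical data: feature 0 acts through x^2, which a linear screen tends to miss.
rng = np.random.default_rng(0)
n, p = 400, 300
X = rng.uniform(-1.0, 1.0, size=(n, p))
y = 4.0 * X[:, 0] ** 2 + 2.0 * np.sin(np.pi * X[:, 1]) + 0.5 * rng.standard_normal(n)

util = spline_utilities(X, y)
top2 = np.argsort(-util)[:2]
print("top two features by spline utility:", sorted(top2.tolist()))
```

Because \(x^2\) is uncorrelated with \(x\) when \(x\) is symmetric about zero, a correlation-based screen has little power against feature 0 here, while the spline utility picks it up.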

📦 SIS: An R Package for Sure Independence Screening in Ultrahigh Dimensional Statistical Models

Diego Franco Saldana & Yang Feng, Journal of Statistical Software (JSS), 2018 📄 Paper | 💻 R Package: SIS

Summary

This work presents the SIS R package, which implements a wide variety of sure independence screening methods. It unifies the standard techniques, sure independence screening (SIS) and iterative SIS (ISIS), with their variants for different model families, making high-dimensional analysis accessible to practitioners.

Highlights

  • Provides a unified interface for linear, generalized linear, and Cox proportional hazards models.
  • Implements iterative screening (ISIS) to handle variables that are marginally weak but jointly important.
  • Optimized for computational efficiency in handling massive datasets.
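
The motivation for iterative screening can be illustrated with a simplified, residual-based sketch in Python. It is not the SIS package's algorithm or API: ordinary least squares stands in for the penalized refit that a full ISIS procedure would use, and the function names and synthetic design are assumptions. In the design below, feature 3 is constructed to have zero population correlation with the response even though it enters the model, so a single marginal pass tends to miss it.

```python
import numpy as np

def abs_corr(X, r):
    """Absolute marginal correlation of each column of X with r."""
    Xc = X - X.mean(axis=0)
    rc = r - r.mean()
    return np.abs(Xc.T @ rc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(rc))

def iterative_screen(X, y, k_per_step, n_steps):
    """Residual-based iterative screening (simplified: OLS refit, no penalty)."""
    selected = []
    resid = y - y.mean()
    for _ in range(n_steps):
        utility = abs_corr(X, resid)
        if selected:
            utility[selected] = -np.inf  # never re-select a feature
        new = np.argsort(-utility)[:k_per_step]
        selected.extend(new.tolist())
        # Refit on everything selected so far and screen the new residuals next
        A = np.column_stack([np.ones(len(y)), X[:, selected]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
    return selected

# Hypothetical design: cov(X3, y) = 3 * 0.9 - 2.7 * 1 = 0 in the population.
rng = np.random.default_rng(1)
n, p = 300, 1000
X = rng.standard_normal((n, p))
S = X[:, 0] + X[:, 1] + X[:, 2]
X[:, 3] = 0.3 * S + np.sqrt(0.73) * X[:, 3]
y = 3.0 * S - 2.7 * X[:, 3] + rng.standard_normal(n)

selected = iterative_screen(X, y, k_per_step=5, n_steps=2)
print("selected features:", sorted(selected))
```

Once features 0 to 2 are in the model, feature 3 is strongly correlated with the residuals, so the second pass recovers it.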

🎯 Nonparametric Independence Screening via Favored Smoothing Bandwidth

Yang Feng, Yichao Wu, & Leonard Stefanski, Journal of Statistical Planning and Inference (JSPI), 2018 📄 Paper

Summary

This paper proposes a novel bandwidth selection strategy for nonparametric screening. Standard bandwidth selectors (like cross-validation) are computationally expensive and optimized for estimation, not screening. This work introduces a favored smoothing bandwidth that maximizes the separation between important and unimportant variables, enhancing screening power.

Highlights

  • Identifies that optimal bandwidth for screening differs from optimal bandwidth for estimation.
  • Proposes a “favored bandwidth” that maximizes the signal-to-noise ratio for screening.
  • Improves the ranking of important variables, leading to better screening results.
  • Offers a computationally efficient alternative to full cross-validation in the screening step.

🚀 RaSE: A Variable Screening Framework via Random Subspace Ensembles

Ye Tian & Yang Feng, Journal of the American Statistical Association (JASA), 2021 📄 Paper | 💻 R Package: RaSEn

Summary

This paper introduces RaSE, a general variable screening framework built on random subspace ensembles. Instead of screening all variables in a single pass, RaSE aggregates screening results from many randomly selected feature subspaces. This aggregation improves screening quality, especially when signals are weak or sparse.

Highlights

  • Proposes a generic ensemble framework applicable to any base screening method.
  • Proves that RaSE leads to a “better” screening property than using a single screener.
  • Covers a wide range of problems including linear/logistic regression and classification.
  • Demonstrates superior finite-sample performance in identifying marginal and interaction effects.
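
A minimal sketch of the random-subspace idea follows. In each of B1 rounds it draws B2 random feature subspaces, keeps the one with the best criterion value, and scores each feature by how often it appears in the winning subspaces. The choice of BIC under a linear model as the criterion, uniform subspace sampling, the function names, and all parameter values are assumptions made for illustration; this is not the RaSEn package's API, and the paper's iterative variant is not shown.

```python
import numpy as np

def bic_linear(X_sub, y):
    """BIC of an OLS fit with intercept (the assumed subspace criterion)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X_sub])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    return n * np.log(rss / n) + A.shape[1] * np.log(n)

def subspace_ensemble_scores(X, y, B1=100, B2=50, D=5, seed=0):
    """Fraction of rounds in which each feature is in the best random subspace."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    counts = np.zeros(p)
    for _ in range(B1):
        best_subspace, best_bic = None, np.inf
        for _ in range(B2):
            size = rng.integers(1, D + 1)  # subspace size uniform on {1, ..., D}
            subspace = rng.choice(p, size=size, replace=False)
            bic = bic_linear(X[:, subspace], y)
            if bic < best_bic:
                best_subspace, best_bic = subspace, bic
        counts[best_subspace] += 1
    return counts / B1

# Hypothetical data: only features 0 and 1 carry signal.
rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.standard_normal(n)

scores = subspace_ensemble_scores(X, y)
top2 = np.argsort(-scores)[:2]
print("top two features by selection frequency:", sorted(top2.tolist()))
```

Features are then ranked by score, and those above a threshold are kept. With much larger \(p\), uniform sampling rarely hits the signal features, which is one motivation for updating the sampling distribution across iterations.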

🧬 Omics Feature Selection with the Extended SIS R Package

Arce Domingo-Relloso, Yang Feng, et al., American Journal of Epidemiology (AJE), 2024 📄 Paper | 💻 Code

Summary

This application-focused work demonstrates the extended SIS package in epigenetic epidemiology. The study identifies a multi-marker signature for body mass index (BMI) from high-dimensional DNA methylation data in the Strong Heart Study, showing the practical utility of screening methods for complex biological data.

Highlights

  • Applies advanced screening methods to real-world high-dimensional omics data.
  • Identifies novel epigenetic markers associated with BMI.
  • Bridges the gap between statistical methodology and biomedical discovery.

For more details, software, and related work, please visit our Publications page.