Neyman-Pearson Classification
Statistical learning frameworks for asymmetric error control, prioritizing guarantees on the type I error over overall accuracy.
Overview
In many real-world classification problems, different types of errors have vastly different consequences. For example, in disease diagnosis, failing to detect a disease (false negative) is often much more severe than a false alarm (false positive). Standard classification methods, which minimize the overall classification error, may not be suitable for such asymmetric cost scenarios.
Neyman-Pearson (NP) classification addresses this by placing a hard constraint on the prioritized error (conventionally the type I error, kept below a user-specified level \(\alpha\)) and minimizing the other error (type II) subject to that constraint. This paradigm is rooted in the classical Neyman-Pearson lemma from hypothesis testing.
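Formally, in the standard statement of the paradigm, write \(R_0(\phi) = \mathbb{P}(\phi(X) \neq 0 \mid Y = 0)\) for the type I error of a classifier \(\phi\) and \(R_1(\phi) = \mathbb{P}(\phi(X) \neq 1 \mid Y = 1)\) for its type II error (labeling class 0 as the prioritized class is a convention). An NP classifier then solves

\[
\min_{\phi}\; R_1(\phi) \quad \text{subject to} \quad R_0(\phi) \le \alpha .
\]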
Our research group has developed a comprehensive suite of theoretical frameworks and algorithms for NP classification, covering:
- High-dimensional settings where the number of features exceeds the sample size.
- Non-parametric methods that do not assume specific data distributions.
- Multi-class problems via cost-sensitive learning.
- Sample size calculations to guarantee error control with high probability.
Neyman-Pearson Multi-class Classification via Cost-sensitive Learning
Ye Tian & Yang Feng, Journal of the American Statistical Association (JASA), 2024. Paper | R Package: npcs
Summary: Extends the NP paradigm to multi-class settings. It establishes a theoretical connection between NP classification and cost-sensitive learning, proposing an algorithm that transforms the constrained optimization problem into a series of cost-sensitive problems. This allows for controlling multiple error rates simultaneously or prioritizing specific error types among multiple classes.
Highlights
- Establishes a theoretical equivalence between NP classification and cost-sensitive learning.
- Proposes a unified framework for multi-class NP problems with flexible error control requirements.
- Develops efficient algorithms that leverage existing cost-sensitive learning solvers.
- Demonstrates superior performance in controlling target error rates compared to naive baselines.
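To make the cost-sensitive connection concrete, here is a minimal sketch in Python/scikit-learn, assuming a toy three-class Gaussian dataset; it illustrates the reduction idea only and is not the npcs package's algorithm (the data, the cost grid, and all names below are illustrative):

```python
# Minimal sketch of the cost-sensitive reduction (not the npcs algorithm):
# sweep the misclassification cost of the prioritized class and keep the
# best classifier whose validation error on that class stays below alpha.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 3000, 5
y = rng.integers(0, 3, size=n)                  # three classes; class 0 prioritized
X = rng.normal(size=(n, d)) + y[:, None]        # class-dependent mean shift

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

alpha, best = 0.05, None
for w0 in np.geomspace(1, 100, 20):             # candidate costs for class 0
    clf = LogisticRegression(max_iter=1000,
                             class_weight={0: w0, 1: 1.0, 2: 1.0}).fit(X_tr, y_tr)
    pred = clf.predict(X_val)
    err0 = np.mean(pred[y_val == 0] != 0)       # estimated P(misclassified | Y = 0)
    overall = np.mean(pred != y_val)
    if err0 <= alpha and (best is None or overall < best[1]):
        best = (w0, overall, err0)

print("cost, overall error, class-0 error:", best)
```

The published algorithm handles general error-control requirements and comes with theoretical guarantees; this grid search only conveys the shape of the reduction.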
Neyman-Pearson Classification: Parametrics and Sample Size Requirement
Xin Tong, Lucy Xia, Jiacheng Wang, & Yang Feng, Journal of Machine Learning Research (JMLR), 2020. Paper
Summary: This work investigates the parametric setting of NP classification (e.g., LDA and QDA variants) and addresses a critical practical question: how much data do we need? It provides finite-sample analysis and sample size formulas ensuring that the type I error is controlled with high probability, bridging the gap between asymptotic theory and practical application.
Highlights
- Derives explicit sample size formulas for NP-LDA and NP-QDA to guarantee error control.
- Provides finite-sample high-probability bounds for the type I error.
- Analyzes the impact of dimensionality and signal strength on the required sample size.
- Offers practical guidelines for study design in asymmetric error cost scenarios.
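The flavor of the question can be seen in a small simulation (a hedged illustration of ours, not the paper's analysis; it assumes standard normal class-0 scores): a threshold set naively at the empirical \((1-\alpha)\)-quantile of held-out class-0 scores lets the true type I error exceed \(\alpha\) roughly half the time, no matter how large the sample, which is exactly the gap that finite-sample corrections and sample size formulas close.

```python
# Hedged illustration (not the paper's method): how often does a naive
# plug-in quantile threshold fail to keep the true type I error below alpha?
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
alpha, reps = 0.05, 2000

for n0 in (50, 200, 1000):
    violations = 0
    for _ in range(reps):
        scores0 = rng.normal(size=n0)               # held-out class-0 scores, N(0, 1)
        t = np.quantile(scores0, 1 - alpha)         # naive empirical quantile threshold
        true_type1 = 0.5 * (1 - erf(t / sqrt(2)))   # exact P(N(0,1) > t)
        violations += true_type1 > alpha
    print(f"n0={n0}: estimated P(type I error > alpha) = {violations / reps:.2f}")
```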
Neyman-Pearson Classification Algorithms and NP Receiver Operating Characteristics
Xin Tong, Yang Feng, & Jingyi Jessica Li, Science Advances, 2018. Paper | R Package: nproc | Python Package: nproc
Summary: Introduces the concept of NP-ROC curves to visualize and evaluate classifiers under the NP paradigm. Unlike standard ROC curves, NP-ROC focuses on the region where the high-priority error is controlled. The paper also proposes two novel algorithms, NP-umbrella and NP-ssp, which adapt state-of-the-art scoring functions (such as logistic regression, SVM, and random forest) to the NP setting without changing how the underlying models are trained.
Highlights
- Introduces the NP-ROC band as a new visualization tool for asymmetric error evaluation.
- Proposes the NP-umbrella algorithm to construct NP classifiers from any scoring function.
- Develops the NP-ssp method for semi-supervised settings to leverage unlabeled data.
- Implemented in the user-friendly nproc R package.
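The order-statistic thresholding at the heart of the umbrella approach can be sketched in a few lines (a simplified, self-contained rendering; consult the paper and the nproc package for the full algorithm, and note the function name below is ours): thresholding held-out class-0 scores at their \(k\)-th order statistic bounds the violation probability by a binomial tail, and the smallest admissible \(k\) minimizes the type II error.

```python
# Simplified sketch of the order-statistic idea behind the NP umbrella
# algorithm (see the paper and the nproc package for the full method).
import numpy as np
from scipy.stats import binom

def np_threshold(scores0, alpha=0.05, delta=0.05):
    """Threshold held-out class-0 scores so that P(type I error > alpha) <= delta.

    Thresholding at the k-th order statistic gives the violation bound
    P(type I error > alpha) <= P(Binomial(n, 1 - alpha) >= k); we take the
    smallest k meeting the tolerance, which minimizes the type II error.
    """
    s = np.sort(np.asarray(scores0))
    n = len(s)
    ks = np.arange(1, n + 1)
    viol = binom.sf(ks - 1, n, 1 - alpha)   # P(Bin(n, 1 - alpha) >= k)
    ok = np.flatnonzero(viol <= delta)
    if ok.size == 0:
        # Feasible only when (1 - alpha)**n <= delta, i.e.
        # n >= log(delta) / log(1 - alpha)  (about 59 when alpha = delta = 0.05).
        raise ValueError("class-0 sample too small for this (alpha, delta)")
    return s[ks[ok[0]] - 1]

rng = np.random.default_rng(2)
print("threshold:", np_threshold(rng.normal(size=200)))
```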
A Survey on Neyman-Pearson Classification and Suggestions for Future Research
Xin Tong, Yang Feng, & Anqi Zhao, Wiley Interdisciplinary Reviews: Computational Statistics, 2016. Paper
Summary: A comprehensive review of the history and development of Neyman-Pearson classification. It categorizes existing methods, discusses the theoretical foundations, and outlines open problems and future directions for the field, serving as an essential primer for researchers interested in asymmetric error control.
Highlights
- Provides a systematic taxonomy of existing Neyman-Pearson classification methods.
- Clarifies the relationship between NP classification and other asymmetric learning paradigms.
- Identifies key theoretical challenges and open problems in the field.
- Serves as a foundational reference for new researchers entering the area.
Neyman-Pearson Classification under High-Dimensional Settings
Anqi Zhao, Yang Feng, Lie Wang, & Xin Tong, Journal of Machine Learning Research (JMLR), 2016. Paper
Summary: The first work to bring NP classification into the high-dimensional era. It proposes a penalized empirical risk minimization framework that integrates variable selection with error control. The method, NP-sLDA, is shown to possess the “NP-oracle property,” meaning it can select the correct features and achieve optimal error rates simultaneously.
Highlights
- Proposes the first penalized framework for high-dimensional NP classification.
- Establishes the “NP-oracle property” for simultaneous variable selection and error control.
- Develops an efficient algorithm for solving the constrained high-dimensional optimization problem.
- Demonstrates the method's effectiveness in gene expression data analysis.
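As a rough sketch of the two ingredients involved, the toy below pairs an \(\ell_1\)-penalized logistic scorer with the order-statistic NP threshold from the umbrella sketch above; it is an assumption-laden stand-in, not the paper's NP-sLDA procedure:

```python
# Toy illustration of the two ingredients (sparse scoring + NP thresholding);
# this is NOT the paper's NP-sLDA procedure, only a stand-in sketch.
import numpy as np
from scipy.stats import binom
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, d, s = 600, 1000, 5                      # d >> n, with s truly active features
beta = np.zeros(d); beta[:s] = 2.0
X = rng.normal(size=(n, d))
y = (X @ beta + rng.normal(size=n) > 0).astype(int)

# Fit a sparse scorer on one half; calibrate the NP threshold on the other.
X_fit, y_fit, X_cal, y_cal = X[:300], y[:300], X[300:], y[300:]
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X_fit, y_fit)
print("selected features:", np.flatnonzero(clf.coef_))

# Order-statistic threshold on held-out class-0 scores (as in the umbrella sketch).
s0 = np.sort(clf.decision_function(X_cal[y_cal == 0]))
n0 = len(s0)
ks = np.arange(1, n0 + 1)
k = ks[np.flatnonzero(binom.sf(ks - 1, n0, 1 - 0.05) <= 0.05)[0]]
print("NP threshold:", s0[k - 1])
```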
For more details, software, and related work, please visit our Publications page.