Neyman-Pearson Classification
Statistical learning frameworks for asymmetric error control, prioritizing guarantees on the type I error over overall accuracy.
Overview
In many real-world classification problems, different types of errors have vastly different consequences. For example, in disease diagnosis, failing to detect a disease (false negative) is often much more severe than a false alarm (false positive). Standard classification methods, which minimize the overall classification error, may not be suitable for such asymmetric cost scenarios.
Neyman-Pearson (NP) classification addresses this by placing a hard constraint on the prioritized error (conventionally the type I error, kept below a user-specified level \(\alpha\)) and minimizing the other error (type II) subject to that constraint. This paradigm is rooted in the classical Neyman-Pearson lemma from hypothesis testing.
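Formally, in the standard statement of the paradigm, write \(R_0(\phi) = \mathbb{P}(\phi(X) \neq 0 \mid Y = 0)\) for the type I error of a classifier \(\phi\) and \(R_1(\phi) = \mathbb{P}(\phi(X) \neq 1 \mid Y = 1)\) for its type II error (labeling class 0 as the prioritized class is a convention). An NP classifier then solves

\[
\min_{\phi}\; R_1(\phi) \quad \text{subject to} \quad R_0(\phi) \le \alpha .
\]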
Our research group has developed a comprehensive suite of theoretical frameworks and algorithms for NP classification, covering:
- High-dimensional settings where the number of features exceeds the sample size.
- Non-parametric methods that do not assume specific data distributions.
- Multi-class problems via cost-sensitive learning.
- Sample size calculations to guarantee error control with high probability.
Neyman-Pearson Multi-class Classification via Cost-sensitive Learning
Ye Tian & Yang Feng, Journal of the American Statistical Association (JASA), 2024. Paper | R Package: npcs
Summary: Extends the NP paradigm to multi-class settings. It establishes a theoretical connection between NP classification and cost-sensitive learning, proposing an algorithm that transforms the constrained optimization problem into a series of cost-sensitive problems. This allows for controlling multiple error rates simultaneously or prioritizing specific error types among multiple classes.
Highlights
- Establishes a theoretical equivalence between NP classification and cost-sensitive learning.
- Proposes a unified framework for multi-class NP problems with flexible error control requirements.
- Develops efficient algorithms that leverage existing cost-sensitive learning solvers.
- Demonstrates superior performance in controlling target error rates compared to naive baselines.
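To make the cost-sensitive connection concrete, here is a minimal sketch in Python/scikit-learn, assuming a toy three-class Gaussian dataset; it illustrates the reduction idea only and is not the npcs package's algorithm (the data, the cost grid, and all names below are illustrative):

```python
# Minimal sketch of the cost-sensitive reduction (not the npcs algorithm):
# sweep the misclassification cost of the prioritized class and keep the
# best classifier whose validation error on that class stays below alpha.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 3000, 5
y = rng.integers(0, 3, size=n)                  # three classes; class 0 prioritized
X = rng.normal(size=(n, d)) + y[:, None]        # class-dependent mean shift

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

alpha, best = 0.05, None
for w0 in np.geomspace(1, 100, 20):             # candidate costs for class 0
    clf = LogisticRegression(max_iter=1000,
                             class_weight={0: w0, 1: 1.0, 2: 1.0}).fit(X_tr, y_tr)
    pred = clf.predict(X_val)
    err0 = np.mean(pred[y_val == 0] != 0)       # estimated P(misclassified | Y = 0)
    overall = np.mean(pred != y_val)
    if err0 <= alpha and (best is None or overall < best[1]):
        best = (w0, overall, err0)

print("cost, overall error, class-0 error:", best)
```

The published algorithm handles general error-control requirements and comes with theoretical guarantees; this grid search only conveys the shape of the reduction.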
Neyman-Pearson Classification: Parametrics and Sample Size Requirement
Xin Tong, Lucy Xia, Jiacheng Wang, & Yang Feng, Journal of Machine Learning Research (JMLR), 2020. Paper
Summary: This work investigates the parametric setting of NP classification (e.g., LDA and QDA variants) and addresses a critical practical question: how much data do we need? It provides finite-sample analysis and sample size formulas ensuring that the type I error is controlled with high probability, bridging the gap between asymptotic theory and practical application.
Highlights
- Derives explicit sample size formulas for NP-LDA and NP-QDA to guarantee error control.
- Provides finite-sample high-probability bounds for the type I error.
- Analyzes the impact of dimensionality and signal strength on the required sample size.
- Offers practical guidelines for study design in asymmetric error cost scenarios.
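The flavor of the question can be seen in a small simulation (a hedged illustration of ours, not the paper's analysis; it assumes standard normal class-0 scores): a threshold set naively at the empirical \((1-\alpha)\)-quantile of held-out class-0 scores lets the true type I error exceed \(\alpha\) roughly half the time, no matter how large the sample, which is exactly the gap that finite-sample corrections and sample size formulas close.

```python
# Hedged illustration (not the paper's method): how often does a naive
# plug-in quantile threshold fail to keep the true type I error below alpha?
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
alpha, reps = 0.05, 2000

for n0 in (50, 200, 1000):
    violations = 0
    for _ in range(reps):
        scores0 = rng.normal(size=n0)               # held-out class-0 scores, N(0, 1)
        t = np.quantile(scores0, 1 - alpha)         # naive empirical quantile threshold
        true_type1 = 0.5 * (1 - erf(t / sqrt(2)))   # exact P(N(0,1) > t)
        violations += true_type1 > alpha
    print(f"n0={n0}: estimated P(type I error > alpha) = {violations / reps:.2f}")
```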
Neyman-Pearson Classification Algorithms and NP Receiver Operating Characteristics
Xin Tong, Yang Feng, & Jingyi Jessica Li, Science Advances, 2018. Paper | R Package: nproc | Python Package: nproc
Summary: Introduces the concept of NP-ROC curves to visualize and evaluate classifiers under the NP paradigm. Unlike standard ROC curves, NP-ROC focuses on the region where the high-priority error is controlled. The paper also proposes two novel algorithms, NP-umbrella and NP-ssp, which adapt state-of-the-art scoring functions (such as logistic regression, SVM, and random forest) to the NP setting without changing how the underlying models are trained.
Highlights
- Introduces the NP-ROC band as a new visualization tool for asymmetric error evaluation.
- Proposes the NP-umbrella algorithm to construct NP classifiers from any scoring function.
- Develops the NP-ssp method for semi-supervised settings to leverage unlabeled data.
- Implemented in the user-friendly nproc R package.
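The order-statistic thresholding at the heart of the umbrella approach can be sketched in a few lines (a simplified, self-contained rendering; consult the paper and the nproc package for the full algorithm, and note the function name below is ours): thresholding held-out class-0 scores at their \(k\)-th order statistic bounds the violation probability by a binomial tail, and the smallest admissible \(k\) minimizes the type II error.

```python
# Simplified sketch of the order-statistic idea behind the NP umbrella
# algorithm (see the paper and the nproc package for the full method).
import numpy as np
from scipy.stats import binom

def np_threshold(scores0, alpha=0.05, delta=0.05):
    """Threshold held-out class-0 scores so that P(type I error > alpha) <= delta.

    Thresholding at the k-th order statistic gives the violation bound
    P(type I error > alpha) <= P(Binomial(n, 1 - alpha) >= k); we take the
    smallest k meeting the tolerance, which minimizes the type II error.
    """
    s = np.sort(np.asarray(scores0))
    n = len(s)
    ks = np.arange(1, n + 1)
    viol = binom.sf(ks - 1, n, 1 - alpha)   # P(Bin(n, 1 - alpha) >= k)
    ok = np.flatnonzero(viol <= delta)
    if ok.size == 0:
        # Feasible only when (1 - alpha)**n <= delta, i.e.
        # n >= log(delta) / log(1 - alpha)  (about 59 when alpha = delta = 0.05).
        raise ValueError("class-0 sample too small for this (alpha, delta)")
    return s[ks[ok[0]] - 1]

rng = np.random.default_rng(2)
print("threshold:", np_threshold(rng.normal(size=200)))
```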
A Survey on Neyman-Pearson Classification and Suggestions for Future Research
Xin Tong, Yang Feng, & Anqi Zhao, Wiley Interdisciplinary Reviews: Computational Statistics, 2016. Paper
Summary: A comprehensive review of the history and development of Neyman-Pearson classification. It categorizes existing methods, discusses the theoretical foundations, and outlines open problems and future directions for the field, serving as an essential primer for researchers interested in asymmetric error control.
Highlights
- Provides a systematic taxonomy of existing Neyman-Pearson classification methods.
- Clarifies the relationship between NP classification and other asymmetric learning paradigms.
- Identifies key theoretical challenges and open problems in the field.
- Serves as a foundational reference for new researchers entering the area.
Neyman-Pearson Classification under High-Dimensional Settings
Anqi Zhao, Yang Feng, Lie Wang, & Xin Tong, Journal of Machine Learning Research (JMLR), 2016. Paper
Summary: The first work to bring NP classification into the high-dimensional era. It proposes a penalized empirical risk minimization framework that integrates variable selection with error control. The method, NP-sLDA, is shown to possess the “NP-oracle property,” meaning it can select the correct features and achieve optimal error rates simultaneously.
Highlights
- Proposes the first penalized framework for high-dimensional NP classification.
- Establishes the “NP-oracle property” for simultaneous variable selection and error control.
- Develops an efficient algorithm for solving the constrained high-dimensional optimization problem.
- Demonstrates the method's effectiveness in gene expression data analysis.
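As a rough sketch of the two ingredients involved, the toy below pairs an \(\ell_1\)-penalized logistic scorer with the order-statistic NP threshold from the umbrella sketch above; it is an assumption-laden stand-in, not the paper's NP-sLDA procedure:

```python
# Toy illustration of the two ingredients (sparse scoring + NP thresholding);
# this is NOT the paper's NP-sLDA procedure, only a stand-in sketch.
import numpy as np
from scipy.stats import binom
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, d, s = 600, 1000, 5                      # d >> n, with s truly active features
beta = np.zeros(d); beta[:s] = 2.0
X = rng.normal(size=(n, d))
y = (X @ beta + rng.normal(size=n) > 0).astype(int)

# Fit a sparse scorer on one half; calibrate the NP threshold on the other.
X_fit, y_fit, X_cal, y_cal = X[:300], y[:300], X[300:], y[300:]
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X_fit, y_fit)
print("selected features:", np.flatnonzero(clf.coef_))

# Order-statistic threshold on held-out class-0 scores (as in the umbrella sketch).
s0 = np.sort(clf.decision_function(X_cal[y_cal == 0]))
n0 = len(s0)
ks = np.arange(1, n0 + 1)
k = ks[np.flatnonzero(binom.sf(ks - 1, n0, 1 - 0.05) <= 0.05)[0]]
print("NP threshold:", s0[k - 1])
```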
For more details, software, and related work, please visit our Publications page.