Optimal marker gene selection for cell type discrimination in single cell analyses.

Dumitrascu, Bianca 
Villar, Soledad 
Mixon, Dustin G 
Engelhardt, Barbara E (ORCID: https://orcid.org/0000-0002-6139-7334)

Single-cell technologies characterize complex cell populations across multiple data modalities at unprecedented scale and resolution. Multi-omic data for single cell gene expression, in situ hybridization, or single cell chromatin states are increasingly available across diverse tissue types. When isolating specific cell types from a sample of dissociated cells or performing in situ sequencing in collections of heterogeneous cells, one challenging task is to select a small set of informative markers that robustly enable the identification and discrimination of specific cell types or cell states as precisely as possible. Given single cell RNA-seq data and a set of cellular labels to discriminate, scGeneFit selects gene markers that jointly optimize cell label recovery using label-aware compressive classification methods. This results in a substantially more robust and less redundant set of markers than existing methods, most of which identify markers that separate each cell label from the rest. When applied to a data set with a hierarchy of cell types as labels, the markers found by our method improve the recovery of the cell type hierarchy with fewer markers than existing methods, using a computationally efficient and principled optimization.
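
The contrast between joint marker selection and one-vs-rest marker lists can be illustrated with a toy sketch. It uses a simple greedy pair-covering heuristic, which is a stand-in chosen for brevity and is not scGeneFit's label-aware compressive classification optimization. The function name greedy_markers, the margin parameter, and the toy expression matrix are all hypothetical and are not taken from the paper or its software.

```python
from itertools import combinations


def greedy_markers(expr, labels, num_markers, margin=1.0):
    """Greedily pick genes that separate differently labeled cell pairs.

    expr: list of per-cell expression vectors (cells x genes).
    labels: one label per cell.
    A gene "separates" a pair of cells when their expression values
    on that gene differ by at least `margin`.
    NOTE: illustrative heuristic only, not the scGeneFit algorithm.
    """
    n_genes = len(expr[0])
    # Only pairs of cells with different labels need to be told apart.
    pairs = {(i, k) for i, k in combinations(range(len(expr)), 2)
             if labels[i] != labels[k]}
    selected = []
    while pairs and len(selected) < num_markers:
        # Score each unused gene by how many unresolved pairs it separates.
        best_gene, best_hits = None, set()
        for g in range(n_genes):
            if g in selected:
                continue
            hits = {(i, k) for i, k in pairs
                    if abs(expr[i][g] - expr[k][g]) >= margin}
            if len(hits) > len(best_hits):
                best_gene, best_hits = g, hits
        if best_gene is None:  # no remaining gene separates anything new
            break
        selected.append(best_gene)
        pairs -= best_hits
    return selected


# Hypothetical toy data: 6 cells, 4 genes, 3 cell types.
# Genes 0 and 1 are both high only in type A (redundant with each other),
# gene 2 is high only in type B, and gene 3 is uninformative.
expr = [
    [5, 4, 0, 0], [6, 5, 0, 1],   # type A
    [0, 0, 3, 0], [0, 1, 4, 0],   # type B
    [0, 0, 0, 0], [1, 0, 0, 1],   # type C
]
labels = ["A", "A", "B", "B", "C", "C"]

markers = greedy_markers(expr, labels, num_markers=2, margin=2.0)
print(markers)  # [0, 2]
```

In this toy case, ranking genes by how well they separate type A from the rest would favor genes 0 and 1, which carry the same information. Scoring genes against the pairs that are still unresolved skips the redundant gene 1 and picks gene 2 instead, so two markers distinguish all three types.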

Algorithms, Cluster Analysis, Gene Expression, Gene Expression Profiling, Genetic Markers, Humans, RNA-Seq, Sequence Analysis, RNA, Single-Cell Analysis, Transcriptome
Journal Title
Nat Commun
Springer Science and Business Media LLC
U.S. Department of Health & Human Services | NIH | National Heart, Lung, and Blood Institute (NHLBI) (R01 HL133218)
National Science Foundation (NSF) (IIS-1750729)