Gene Signatures - How to score & interpret them

Gene signatures are commonly used in routine single-cell analysis. Many scoring methods exist, but they are not all created equal. In this tutorial we follow a recent benchmarking paper @badia-i-mompel2022 and its best-practice guidelines for scoring gene signatures!

With this tutorial we hope to familiarize you with the concept of gene signatures, how they are scored in single-cell datasets, and how to interpret the resulting scores!

Before we start, here are some key concepts that will help us frame the vignette!

  • What is a gene signature?

    A "gene signature" can be described as a single gene or a group of genes in a cell with a unique expression pattern that results from either a changed biological process or an altered pathogenic medical condition @mallik2018.

  • What is a cell type signature?

    A cell type signature is a gene signature representing a group of genes underlying the biological processes characteristic of a cell type.

  • How do we score them in our dataset?

    Scoring a gene signature means obtaining, for each cell in our dataset, a value that represents how active that gene program is in the cell. There are many ways to score gene signatures, as shown in the decoupleR paper @badia-i-mompel2022. However, they do not all perform equally well, so it is important to select a robust method. The method suggested by their benchmarking analysis is a Univariate Linear Model (ULM), fitted once per cell and per signature, in which the gene expression values are the response variable and the gene weights in the signature are the explanatory variable (don't worry, we'll go through this in more detail in a second). The t-value of the fitted slope is the activity score of that gene signature in that cell.

  • How do we interpret that score?

    Using the t-value from a Univariate Linear Model as the scoring metric lets us read two things from a single statistic: the direction of activity (the sign of the score, + or -) and its significance (the magnitude of the score).

  • Can we interrogate the scores obtained?

    Yes! In fact, it is very important to look past the score a cell obtains and into which genes are driving it. For a signature containing 50 genes, it may be that just a few of them account for most of the score. If we stopped at the score, we could be misled into thinking that all of the genes in the signature are important when it is actually only a fraction of them. Moreover, heterogeneous gene expression can lead to two cells or populations having similar scores but vastly different gene programs underlying them.
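To make the ULM idea above concrete, here is a minimal NumPy sketch for a single cell and a single signature. It is an illustration under stated assumptions, not decoupleR's own implementation: the function names `ulm_score` and `gene_contributions`, the toy expression values, and the 0/1 weights are all hypothetical. The contribution breakdown is just one simple way of asking which genes drive a score.

```python
import numpy as np


def ulm_score(expr, weights):
    """t-value of the slope when regressing expression on signature weights.

    expr    : expression of every gene in ONE cell (response variable)
    weights : signature weight of every gene, 0 for genes outside the
              signature (explanatory variable)
    Note: a perfect fit gives a zero standard error; not handled here.
    """
    x = np.asarray(weights, dtype=float)
    y = np.asarray(expr, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    sxx = np.sum(xc ** 2)
    slope = np.sum(xc * yc) / sxx
    resid = yc - slope * xc                  # residuals of the fitted line
    dof = x.size - 2                         # intercept + slope estimated
    se = np.sqrt(np.sum(resid ** 2) / dof / sxx)
    return slope / se


def gene_contributions(expr, weights):
    """Per-gene terms that sum to the fitted slope (illustrative only)."""
    x = np.asarray(weights, dtype=float)
    y = np.asarray(expr, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return xc * yc / np.sum(xc ** 2)


# Hypothetical toy data: 8 genes, the first 3 form the signature.
genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
weights = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)
cell_a = np.array([5.0, 4.0, 6.0, 0.0, 1.0, 0.0, 2.0, 1.0])  # signature high
cell_b = np.array([0.0, 1.0, 0.0, 4.0, 5.0, 3.0, 2.0, 1.0])  # signature low

score_a = ulm_score(cell_a, weights)
score_b = ulm_score(cell_b, weights)
print(f"cell_a score: {score_a:.2f}")
print(f"cell_b score: {score_b:.2f}")

# Which genes push the score of cell_a up or down?
for gene, c in zip(genes, gene_contributions(cell_a, weights)):
    print(f"{gene}: {c:+.3f}")
```

In this sketch the sign of the score tells us whether the signature genes are expressed above or below the rest of the genes in that cell, and the magnitude grows as that difference becomes clearer relative to the residual noise. Listing the per-gene terms next to the score is a quick check on whether a few genes dominate it.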

To follow along with code examples and further explanations, visit the gene-signatures-1 material on our teaching resources website. You can find the raw code [R] [Python] as well as the compiled HTML versions [R vignette] [Python vignette].
