Transcription factor activity scores, why do we need them and how to score them

Sep 15

scRNA-seq returns, for each cell, molecular counts that represent the expression of each gene. However, transcript abundances at the individual gene level can be hard to interpret, and the high sparsity of the data is a further confounding factor. This sparsity acutely affects genes with low mRNA abundance (Figure 1) @mereu2020. Transcription factors (TFs) are key players in regulating present and future cell states: they bind to regulatory regions in the DNA and drive gene expression programs @baskar2022. Because their effects on the cell are so powerful, they are tightly regulated and are often found at low abundance. Being able to quantify the activity of TFs in a cell can therefore provide very valuable information when characterizing the biological processes underlying a cell type or state. However, because of their low expression, TFs suffer severely from dropout events, and their mRNA abundance cannot be accurately quantified from UMI counts alone. To address this issue, methods have been developed that quantify TF activity by leveraging the expression of the genes each TF regulates.

Before we start, here are some key concepts that will help us frame the vignette!

What is a transcription factor?
Transcription factors are broadly understood as proteins that bind to regulatory regions of the DNA and act as key regulators of gene-expression programs @baskar2022.
What information do we need to compute the activity of a TF?
The activity of a TF is scored based on the expression of the genes it regulates. Therefore, we need a database that records which genes are regulated by each transcription factor and the nature of that relationship, since a TF can activate the expression of some genes and repress the expression of others. Such a collection of TF-target interactions can be considered a gene regulatory network (GRN). Many databases contain this information, and in this vignette we aim to point to the current state-of-the-art ones.
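To make the idea of a GRN concrete, here is a minimal sketch of how such a resource can be represented: one row per TF-target interaction, with a signed weight for activation or repression. The TF and gene names (`TF_A`, `gene1`, ...) and the weights are invented for illustration and do not come from any real database; the helper `regulons` is likewise hypothetical.

```python
from collections import defaultdict

# Toy gene regulatory network in long format: (TF, target gene, weight).
# A positive weight means the TF activates the target, a negative weight
# means it represses it. All edges below are invented for illustration.
toy_grn = [
    ("TF_A", "gene1", 1.0),
    ("TF_A", "gene2", 1.0),
    ("TF_A", "gene3", -1.0),
    ("TF_B", "gene3", 1.0),
    ("TF_B", "gene4", -1.0),
]


def regulons(edges):
    """Group edges by TF, returning {tf: {target: weight}}."""
    by_tf = defaultdict(dict)
    for tf, target, weight in edges:
        by_tf[tf][target] = weight
    return dict(by_tf)


toy_regulons = regulons(toy_grn)
print(toy_regulons["TF_A"])
```

Note that the same gene (`gene3` here) can be repressed by one TF and activated by another, which is why the sign is stored per interaction rather than per gene.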
Do transcription factors act the same in all cell types?
No! This is crucial to keep in mind when interpreting TF activities. Take Blimp1 (PRDM1) as an example: it is a well-characterized TF in B and T cells and has been shown to have very different functions in each. In B cells, Blimp1 drives plasmablast formation and antibody secretion, whereas in T cells it regulates functional differentiation, including cytokine gene expression. Studies have identified both unique and conserved functions of Blimp1 across immune cell subsets, such as the unique direct activation of Igh gene transcription in B cells and a conserved antagonism with BCL6 in B cells, T cells, and myeloid cells @nadeau2022. This is important to consider because, ideally, we would use cell-type-specific GRNs. These can be obtained from multiome datasets, but most of the time we don't have this information. Reference GRNs are available for these cases, and they are what this vignette covers.
How do we score them in our dataset?
There are many ways to score the activity of TFs, as shown in the decoupler paper @badia-i-mompel2022. However, they do not all perform equally well, and it is important to select a robust method. The method suggested by their benchmarking analysis is a Univariate Linear Model (ULM), in which the gene expression values are the response variable and the regulator weights in the gene signature are the explanatory variable (don't worry, we'll go through this in more detail in a second). The t-value obtained from the fitted model is the activity score of that gene signature in that cell.
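The idea behind the ULM score can be sketched in a few lines of plain Python: for one cell and one TF, regress the expression of every gene on that TF's weights (0 for genes it does not regulate) and keep the t-value of the slope. This is a minimal illustration under those assumptions, not the decoupler implementation; the function name `ulm_t_value` and the toy numbers are made up for this example.

```python
import math


def ulm_t_value(expression, weights):
    """t-value of the slope in: expression ~ intercept + slope * weights."""
    n = len(expression)
    if n != len(weights) or n < 3:
        raise ValueError("need two equal-length vectors with at least 3 genes")
    mean_x = sum(weights) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in weights)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weights, expression))
    if sxx == 0:
        raise ValueError("weights must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares of the fitted line.
    rss = sum((y - (intercept + slope * x)) ** 2
              for x, y in zip(weights, expression))
    if rss == 0:
        raise ValueError("perfect fit: the t-value is undefined")
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return slope / standard_error


# Toy cell with six genes. The hypothetical TF activates the first two
# genes, represses the third, and does not regulate the last three.
weights = [1.0, 1.0, -1.0, 0.0, 0.0, 0.0]
expression = [3.0, 2.0, 0.0, 1.0, 1.0, 2.0]

score = ulm_t_value(expression, weights)
print(round(score, 3))
```

In this toy cell the activated targets are highly expressed and the repressed target is silent, so the score is positive. Flipping the sign of every weight would give the same magnitude with a negative sign.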
How do we interpret the activity obtained?
Scoring gene signatures with Univariate Linear Models and using the resulting t-value as the scoring metric lets us read two things from a single statistic: the direction of the activity (the sign, + or -) and its significance (the magnitude of the score).
Can we further interrogate the activity scores obtained?
Yes! In fact, it is very important to look past the score obtained by a cell and into which genes are driving that activity. When a TF regulates many genes downstream, it could be that just a few of them contribute to its activity in our dataset. Therefore, if we stopped at the activity score, we could be misled into thinking that all of the genes downstream of the TF are important when, usually, only a fraction of them are. Moreover, heterogeneous gene expression can lead to two cells or populations having similar scores for one TF but vastly different gene programs underlying them.
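One simple way to look behind a score is to rank a TF's targets by how much each one could be contributing in a given cell. The sketch below uses weight times expression as a rough per-gene contribution; this is a heuristic chosen for illustration, not a method prescribed by the vignette, and the regulon, cells, and function name are all hypothetical.

```python
def target_contributions(expression_by_gene, regulon):
    """Rank a TF's targets by weight * expression in one cell (a heuristic)."""
    contributions = {
        gene: weight * expression_by_gene.get(gene, 0.0)
        for gene, weight in regulon.items()
    }
    # Largest absolute contribution first.
    return sorted(contributions.items(),
                  key=lambda item: abs(item[1]),
                  reverse=True)


# Hypothetical regulon and two toy cells whose overall signal for this TF
# is similar in size but comes from different target genes.
regulon_tf_a = {"gene1": 1.0, "gene2": 1.0, "gene3": -1.0, "gene4": 1.0}
cell_1 = {"gene1": 5.0, "gene2": 0.0, "gene3": 0.0, "gene4": 0.5}
cell_2 = {"gene1": 0.0, "gene2": 4.5, "gene3": 0.0, "gene4": 1.0}

ranked_1 = target_contributions(cell_1, regulon_tf_a)
ranked_2 = target_contributions(cell_2, regulon_tf_a)
print(ranked_1[0], ranked_2[0])
```

Here both cells carry the same total weighted expression across the regulon, yet the top-ranked target differs (`gene1` in the first cell, `gene2` in the second), which is the situation the paragraph above warns about.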

To follow along with code examples and further explanations, visit the material on our teaching resources website: 💊Transcription factor activity scores, why do we need them and how to score them. You can find the raw code [R code] [Python code] as well as the compiled HTML versions [R vignette] [Python vignette].

Marc Elosua Bayes
