RNA-seq Data Analysis

1 The Data

This exercise uses RNA-seq data from FU et al. 2015, Nature Cell Biology. The data compares gene expression in luminal cells of pregnant versus lactating mice. You can download the data from this Zenodo repository and you can have a look at this Galaxy tutorial for some ideas on how to get started.

The dataset consists of three files (no file extension, but tab separated):

heatmap_genes A list of interesting genes to analyze (used in Figure 6b of the paper)
limma-voom_luminalpregnant-luminallactate: Differential expression results, including logFC, AveExpr, t-statistic, and p-value for all genes
limma-voom_normalised_counts: Normalized counts for genes across different samples

To simplify the analysis you can consider the following:

Read the files using read_tsv()
Clean column names with janitor::clean_names() to make them easier to work with
Combining relevant data frames using left_join() (limma-voom_luminalpregnant-luminallactate and limma-voom_normalised_counts share a common column)
Use select to remove columns you don’t need for analysis to get a better overview
Filter only significant genes (this tutorial defines them as p_value < 0.01 & abs(logFC) > 0.58)

It is a bit tricky to read in the data correctly, so if you need a hint, you can unfold the code below

Expand to see how to read the data

# Read all the files and clean the column names
heatmap_genes <- read_tsv("R/byod/solutions_for_my_datasets/data/heatmap_genes")
DE_results <- read_tsv("R/byod/solutions_for_my_datasets/data/limma-voom_luminalpregnant-luminallactate")
DE_results <- janitor::clean_names(DE_results)
normalized_counts <- read_tsv("R/byod/solutions_for_my_datasets/data/limma-voom_normalised_counts")
normalized_counts <- janitor::clean_names(normalized_counts)

# combine tables
genes <- left_join(DE_results, normalized_counts,
  by = c("entrezid", "symbol", "genename")
) |>
  select(-entrezid, genename)

2 Questions

What are the top 20 most significantly differentially expressed genes?

Tip: Filter the genes based on p-value and log fold change, then order by p-value.

Create a heatmap of the top 20 most significant genes. How do the expression patterns differ between pregnant and lactating samples?

Checkout the pheatmap package for creating heatmaps.

Create a volcano plot to visualize the differential expression results. How many genes are significantly up-regulated or down-regulated?

First create a simple volcano plot
Then you can try to label the 10 most significant genes in the volcano plot
Check my R script and/or this tutorial on how to create a volcano plot with ggplot

Perform a Principal Component Analysis (PCA) on the top 20 significant genes. What can you conclude about the separation of pregnant and lactating samples?

Tip: Use prcomp() for PCA and factoextra::fviz_pca_ind() for visualization.

3 Useful Functions

To complete these tasks, you may find the following R functions and packages helpful:

Data manipulation:
- janitor::clean_names(): Clean column names
- left_join(): Combine data frames
Statistical analysis:
- scale(): Scale numeric variables
- prcomp(): Perform Principal Component Analysis
Visualization:
- pheatmap::pheatmap(): Create heatmaps (needs a matrix, so you need to reshape your data first into a matrix)
- ggrepel::geom_text_repel(): Add non-overlapping labels to plots
- factoextra::fviz_pca_ind(): Visualize PCA results
Other useful functions:
- as.matrix(): Convert data frames to matrices
- drop_na(): Drop rows with missing values (maybe needed for PCA because it doesn’t handle missing values)

4 Example analysis

Error: object 'FALS' not found