Reference genome and annotation in RNA-seq

Reference genome and gene annotation choices strongly affect RNA-seq mapping, counting, and interpretation.

RNA-seq reads are usually aligned or quantified against a reference genome, transcriptome, or annotation database. If the selected reference does not match the organism, genome build, or annotation version of the dataset, expression estimates can be inaccurate.

Genome build

Human hg19 (GRCh37) and hg38 (GRCh38) are different assemblies: the same gene can sit at different coordinates in each, and each has its own annotation resources. The same is true of mouse mm10 (GRCm38) and mm39 (GRCm39). A dataset aligned to one build should not be combined with annotations from another build.
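One inexpensive sanity check is to compare the sequence names used in the annotation with those in the genome index. The sketch below is a minimal illustration, not a THRAISE feature: the function names and toy inputs are hypothetical, and it assumes a tab-separated FASTA index (.fai) and a GTF file already read into strings. It only catches naming mismatches such as chr1 versus 1. Matching names do not prove that two files come from the same build.

```python
def contigs_from_fai(fai_text):
    """Return the sequence names (column 1) from FASTA index (.fai) text."""
    return {line.split("\t")[0] for line in fai_text.splitlines() if line.strip()}


def contigs_from_gtf(gtf_text):
    """Return the sequence names (column 1) from GTF text, skipping comments."""
    return {
        line.split("\t")[0]
        for line in gtf_text.splitlines()
        if line.strip() and not line.startswith("#")
    }


def missing_contigs(fai_text, gtf_text):
    """Names used by the annotation that the genome index does not contain."""
    return sorted(contigs_from_gtf(gtf_text) - contigs_from_fai(fai_text))


# Toy inputs invented for illustration; lengths and IDs are not real.
fai = "chr1\t1000\t6\t60\t61\nchr2\t900\t1030\t60\t61\n"
gtf = (
    "#!toy annotation\n"
    '1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "GENE_A";\n'
    'chr2\tsrc\tgene\t300\t400\t.\t-\t.\tgene_id "GENE_B";\n'
)

print(missing_contigs(fai, gtf))
```

Any name reported here means that features on that sequence cannot be matched to alignments, which is worth resolving before counting.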

Gene annotation

Gene models define where exons, transcripts, and genes are located. Annotation versions can differ in gene names, transcript IDs, and noncoding RNA definitions, so the same reads can be assigned to different features depending on the version. This changes count matrices and downstream interpretation.
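To see how much two annotation versions differ, the gene IDs in each can be compared directly. The following is a rough sketch that assumes standard GTF input, where the ninth column holds attributes such as gene_id "..."; the function names and the toy records are hypothetical.

```python
import re

# GTF attributes look like: gene_id "X"; transcript_id "Y";
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gene_ids(gtf_text):
    """Collect the gene_id values found in the attribute column of GTF text."""
    ids = set()
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = GENE_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def compare_annotations(old_gtf, new_gtf):
    """Report gene IDs unique to each version and those shared by both."""
    old_ids, new_ids = gene_ids(old_gtf), gene_ids(new_gtf)
    return {
        "only_old": sorted(old_ids - new_ids),
        "only_new": sorted(new_ids - old_ids),
        "shared": sorted(old_ids & new_ids),
    }


# Toy records invented for illustration.
old = (
    'chr1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "GENE_A";\n'
    'chr1\tsrc\tgene\t300\t400\t.\t+\t.\tgene_id "GENE_B";\n'
)
new = (
    'chr1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "GENE_A";\n'
    'chr1\tsrc\tgene\t500\t600\t.\t-\t.\tgene_id "GENE_C";\n'
)

print(compare_annotations(old, new))
```

Genes listed under only_old or only_new will be missing from one of the two count matrices, so results built on different versions are not directly comparable.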

Organism check

Before running analysis, users should confirm whether the dataset is human, mouse, or another organism. Public accession records and associated publications usually provide organism and library information.
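The organism check can be made explicit before any alignment starts. The lookup below is only an illustrative sketch: the table covers just the builds named in this guide, the function name is hypothetical, and it assumes the organism is recorded by scientific name, as in the organism field of public accession records.

```python
# Builds mentioned in this guide, keyed by scientific name (illustrative only).
KNOWN_BUILDS = {
    "Homo sapiens": {"hg19", "hg38"},
    "Mus musculus": {"mm10", "mm39"},
}


def build_matches_organism(organism, build):
    """Return True if the build is listed for the organism.

    Raises KeyError for organisms that are not in the table, so that an
    unknown organism is never silently treated as a mismatch or a match.
    """
    builds = KNOWN_BUILDS.get(organism.strip())
    if builds is None:
        raise KeyError(f"No builds listed for organism: {organism!r}")
    return build in builds


print(build_matches_organism("Homo sapiens", "hg38"))
print(build_matches_organism("Mus musculus", "hg38"))
```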

Consequences of wrong reference choice

Low mapping or assignment rates
Incorrect gene counts
Missing or outdated gene symbols
Misleading pathway analysis
Inconsistent results across tools
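The first symptom in this list is also the easiest to detect automatically. As a hedged sketch, the fraction of reads assigned to features can be computed and compared against a cutoff. The function names are hypothetical, and the default cutoff of 0.5 is an arbitrary placeholder, not a recommended value; an appropriate threshold depends on the library type and protocol.

```python
def assignment_rate(assigned_reads, total_reads):
    """Fraction of reads that were mapped or assigned to a feature."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return assigned_reads / total_reads


def looks_suspicious(assigned_reads, total_reads, min_rate=0.5):
    """Flag a sample whose rate is below min_rate (placeholder cutoff)."""
    return assignment_rate(assigned_reads, total_reads) < min_rate


print(looks_suspicious(200, 1000))
print(looks_suspicious(900, 1000))
```

A flagged sample is not necessarily the result of a wrong reference, but the reference and annotation are reasonable things to check first.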

THRAISE users should select the reference genome that matches the dataset organism and original study whenever possible.

Practical recommendation

Check the original publication, GEO/SRA metadata, organism field, and methods section. Use a consistent reference and annotation throughout alignment, counting, differential expression, and visualization.
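One way to keep the reference consistent is to record it once and compare every step against that record. The sketch below assumes each step can report the organism, build, and annotation it used; the Reference class, the step names, and the annotation labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    organism: str
    build: str
    annotation: str  # free-text version label, e.g. "example_v1"


def inconsistent_steps(steps):
    """Return names of steps whose reference differs from the first step's.

    `steps` maps a step name to the Reference it used, in pipeline order.
    """
    items = list(steps.items())
    if not items:
        return []
    _, expected = items[0]
    return [name for name, ref in items[1:] if ref != expected]


# Toy pipeline record invented for illustration.
hg38 = Reference("Homo sapiens", "hg38", "example_v1")
steps = {
    "alignment": hg38,
    "counting": hg38,
    "differential_expression": hg38,
    "visualization": Reference("Homo sapiens", "hg19", "example_v1"),
}

print(inconsistent_steps(steps))
```

Here the visualization step is reported because it uses hg19 while the earlier steps use hg38.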

This guide is provided for research and educational purposes. RNA-seq results should be interpreted with appropriate experimental design, quality control, statistical review, and biological validation.
