Statistics and reproducibility

No statistical methods were used to predetermine sample sizes, but our sample sizes are similar to those reported in previous publications21,22,24. We performed a two-cohort design to estimate reproducibility after our experiments and focused our analytical approach on the identification of patterns that are detectable within our given sample number. For testing the generalizability of our findings, we considered external datasets in mice and other organisms.

In an initial pilot analysis, we used a Lilliefors test to assess assumptions on normality. For comparing different ages, we used a two-sided Mann–Whitney U test to account for non-normality and visually double-checked that compared groups would have a similar skewness. For quantifying the significance of genome-wide correlations (and thus bypassing the need for gene-specific P values), we used a t-distribution test as the latter appears to be an accurate approximation when working with thousands of data points66.

Mann–Whitney U test, Spearman correlations and Fisher’s exact test were computed through scipy.stats (version 1.2.1)67. Bootstrapped estimates of the 95% confidence intervals of the medians were obtained through Seaborn68. For comparison purposes, we also performed differential gene expression analyses using DESeq2 (ref. 69), which provide significance values that follow a set of frequent assumptions on gene expression distributions70. Significance of the difference between two correlations was tested with Daniel Soper’s Free Statistics Calculators 4.0 (, which implemented a corresponding test developed by Fisher71.

For adjusted P values, we followed a Benjamini–Hochberg correction72.

No randomization of samples was performed by us as we had ordered mice of different ages and allocated mice of different ages to different groups for analysis.

Investigators were not blinded to group allocation during data collection and outcome assessment and further data analysis. Blinding during data collection was not possible as old mice look different from younger mice; blinding was not relevant during data analysis, as the latter used a machine learning strategy to find the properties informing on age-dependent change.

No data were excluded from the analyses, except for additional control analysis, which tested the robustness of the conclusion against different exclusion criteria (Supplementary Fig. 10c–f).

After completion of the manuscript, however, we noted that the original experiment contained muscle tissues for which no sequence data were obtained, and that the original preparation of sequence data included sorted alveolar macrophages, alveolar type 2 cells and monocyte-derived dendritic cells. Preceding the analysis started in this paper, sequence data of the latter had not been carried forward toward analysis as quality-control (QC) metrics appeared different and indicative of lower quality than the other experimental preparations. Retrospectively, analyzing these three cell populations toward a length-associated transcriptome imbalance after the (otherwise) completion of this paper, we find, consistent with our comprehensive reanalysis of cell types through public single-cell transcriptomic data23,24, a length-associated transcriptome with a relative fold reduction of long transcripts in alveolar macrophages and alveolar type 2 cells (Supplementary Fig. 17).

The experiments were not and could not have been randomized.

Estimation of confidence intervals for bar plots

Bar plots represent empirically observed fractional data; for example, phenomenon present in 14 of 20 tissues would be 70%. Confidence intervals were estimated by bootstrapping 1,000 times. Each bootstrap corresponds to drawing with replacement. In the above example, this would mean doing 20 independent randomizations where for each randomization there was a 70% chance that the phenomenon would be present (and 30% chance that not).

Animal keeping

All mouse procedures were approved by the Institutional Animal Care and Use Committee at Northwestern University. All strains including wild-type mice were bred and housed at a specific-pathogen-free barrier facility at the Center for Comparative Medicine at Northwestern University. Male C57BL/6J mice were provided by the NIA, one of the National Institutes of Health (NIH), and were housed at Northwestern University Feinberg School of Medicine Center for Comparative Medicine for 4 weeks before euthanasia. Our rationale was to focus on a standardized murine model that is commonly used across different laboratories so that our findings could be investigated or continued by others.

Mice were euthanized by pentobarbital sodium overdose. Immediately the chest cavity was opened, and animals were perfused via the left ventricle with 10 ml of HBSS (Ca/Mn+). The following tissues were collected: lung, heart, liver, kidney, adrenal gland, white (perigonadal) and brown adipose tissue, skin, muscle satellite cells, frontal cortex, cerebellum, esophagus, stomach, and small and large intestines. Gut epithelial cells were isolated after flushing large intestine with EDTA/EGTA solution. Lungs were subjected to enzymatic digestion to release stromal and immune cells and sorted by FACS as described elsewhere73. All tissues and cells were immediately frozen on dry ice and stored at −80 °C for processing. Muscle satellite cells were prepared as described in work by Runyan et al.74.

RNA isolation and RNA sequencing

RNA was isolated using an RNeasy DNA/RNA kit after homogenization and lysis in guanidine thiocyanate buffer supplemented with β-mercaptoethanol. RNA concentration and quality were assessed using an Agilent TapeStation. RNA-seq libraries were prepared using an NEB Next RNA Ultra kit with a polyA enrichment module using an Agilent Bravo NGS Automated fluidics handling platform as described elsewhere73. Libraries were multiplexed and sequenced on an Illumina NextSeq 500 platform using 75 cycles of high-output flow cells and a dual indexing strategy. Our rationale was to use a protocol that had been standardized and applied by our sequencing facility.

While targeting 6 mice per age and organ, we ultimately only obtained sequenced samples for an average of 5.76 mice per age and organ (because of errors in sample isolation and/or liquid handling).


Sequencing reads were analyzed using an implementation of Ceto ( in Nextflow75. Briefly, BCL files were demultiplexed and converted to fastq files using bcl2fastq (version, with default parameters. The raw reads were trimmed using trimmomatic76 (version 0.36), with the following parameters: trailing = 30 and minlen = 20. Trimmed reads were aligned to the mouse reference genome (GRCm38.p3) with annotations from Ensembl v78 using tophat (version 2.1.0)77, with the following parameters: no novel junctions, read-mismatches = 2, read-edit-distance = 2 and max-multihits = 5. Aligned reads were counted using Htseq-count from htseq78, with the following parameters: intersection-nonempty, reverse strand, feature-type = exons, and id-attribute = gene_id. Our rationale was to use a bioinformatic setup that had been standardized and applied by our facilities.

Differential expression of bulk RNA sequencing

For measurements derived from multiple individuals, differential gene expression analysis was performed with DESeq2 (ref. 69), version 1.20 (mouse) and 1.22 (human) using an α value of 0.05 for the adjusted P-value cutoff. We subsequently only kept genes that mapped unambiguously between Ensembl gene identifiers and NCBI (Entrez) gene identifiers13.

To estimate the differential gene expression between individuals, we directly computed the log2 ratio of raw counts for transcripts detected in both individuals.

Characteristics of genes

For transcription factors, we mapped the Gene Transcription Regulatory Database (v18_06)14 to ±200 nucleotides to transcriptional start sites supplied by BioMart for the human reference genome build GRCh38.p12 and the mouse reference genome build GrCm38.p6. For miRNAs, we used miRDB (v5.0)15. For mature transcripts, length parameters and GC content were identified from GenBank and mapped to genes as described elsewhere using the median across different transcripts13. Number of exons, and their minimal, median and maximal lengths, were extracted from BioMart. For genes and chromosomes, characteristics were extracted as described elsewhere13. Our rationale was to consider a broad set of information that might inform on the formation or turnover of transcripts.


Gradient-boosting regression models were built in scikit-learn (version 0.20.3)16. Of the transcripts, 90% were included as the training set and 10% were used as a test set. The 10% of the test set transcripts that had been withheld during training were used to evaluate the performance of the models. Our rationale for the gradient boosting was to account for possible non-linearities. We only considered protein-coding genes with at least one research publication and an official gene symbol, and which unambiguously mapped in a 1:1 relation between NCBI (Entrez) gene identifiers and Ensembl gene identifiers.

Kernel-density visualizations

Kernel-density visualizations were created with Seaborn68 using default parameters.

Comparison to two cohorts of mice

To quantitatively evaluate the performance of our machine learning approach, we first estimated the maximal performance that should be achievable by our experimental data. The latter will depend on biological, experimental and technical variability, the true number of genes that change expression during aging and their true magnitudes of change, and the sensitivity of RNA-seq to detect transcript molecules and their change. We built upon the two-cohort design of our experimental survey and compared transcriptional fold changes between the two cohorts. Specifically, the six mice of each age and tissue pair had drawn from two cohorts with three mice per age each that were euthanized on different days (or two and three mice per age if we only obtained samples from five mice, or two mice per age if we only obtained samples from four mice). As anticipated2, Spearman correlations for the relative fold changes between measurements obtained by two cohorts of mice appeared small (interquartile range (0, 0.250)) and, in some cases, even slightly negative (Supplementary Fig. 2b). Of direct relevance to our efforts to evaluate the performance of our machine learning approach, the Spearman correlations between observed relative fold changes and predicted relative fold changes (Extended Data Fig. 1a), however, resembled or exceeded those observed between both cohorts (Supplementary Fig. 2b).

Comparison to differential gene expression analysis

We found good agreement between the prediction accuracy of our gradient-boosting regression models and the number of expressed genes detected to be differentially expressed (Extended Data Fig. 2). We also found that gradient-boosting regression reached statistical significance (P < 0.01) for tissue–age pairs where the transcript of no single gene was statistically significantly differentially expressed at a Benjamini–Hochberg-adjusted P-value cutoff of 0.05 (Extended Data Fig. 2c), suggesting that differential gene expression analyses might yield false-negative findings.

Alternate bioinformatics

To ensure robustness of results beyond individual bioinformatic pipelines, we reanalyzed in-house bulk RNA-seq datasets using the publicly available nf-core/RNA-seq pipeline version 1.4.2 implemented in Nextflow 19.10.0 using Singularity 3.2.1–1 with the minimal command: nextflow run nf-core/rnaseq -r 1.4.2 –singleEnd -profile singularity –unStranded –fc_group_features_type ‘gene_id’75,79,80. Briefly, lane-level reads were trimmed using trimGalore! 0.6.4 and aligned to the GRCm38 genome using STAR 2.6.1d81. Gene-level assignment was then performed using featureCounts (1.6.4)82. Included in the nf-core/RNA-seq QC output is a matrix of Pearson correlations of log2(CPM) values generated using edgeR (3.26.5)83. Lanes with Pearson R < 0.7 compared to all other lanes constituting a given sample were excluded from further analysis. Extant lanes were then merged by sample with SAMtools 1.6 using the minimal command ‘samtools merge –r’84. Merged BAM files were then reassigned using Rsubread 1.32.4 in R 3.5.1 using the minimal command ‘featureCounts(files, annot.inbuilt = ‘mm10’, minFragLength = 25)’ and merged into separate datasets by tissue in DESeq2 (1.22.2)69,82. Using a combined factor of age, and influenza dose (plaque-forming units), differential expression analysis (DEA) was performed with the formula ‘~combined’. A local estimate of gene dispersion best fit observed dispersions in all cases. DEA was therefore performed using the minimal command ‘DESeq(dataset, fitType = ‘local’, parallel = T)’. DEA tables were output for all permutations of age and influenza dose for a given tissue and analyzed as above.

Length correlations

To avoid assumptions on linearity, we used the Spearman correlation between transcript length and relative fold change of transcripts in older individuals over younger ones. Significance was obtained through the scipy.stats (version 1.2.1) implementation of the Spearman correlation67.

For human GTEx data, we restricted our analysis to tissues present in samples from young and middle-aged and old donors.

Quantification through difference in median transcript length

We alternatively quantified the length-associated transcriptome imbalance through the difference in median transcript length among those genes that, in a differential gene expression analysis, showed a significant relative fold increase and the transcript length of those genes that showed significant relative fold decrease.

Technical robustness checks

As technical artifacts could affect transcripts according to length40,42,43, we further tested the robustness of the recovered length-associated transcriptome imbalance.

First, we repeated our initial analysis (Fig. 1c) and filtered the datasets for those organs where the relative fold changes correlated across both cohorts exceed a Spearman correlation of 0.2 and found that the correlation between transcript lengths and relative fold changes persisted (Supplementary Fig. 6a). Further, and based on counting the tissues exceeding a Spearman correlation of 0.2, we noted, as expected21, that the age-dependent changes were most reproducible when comparing 24-month-old organs against 4-month-old organs (Supplementary Fig. 6a). Second, we did not observe changes in RNA integrity with age (Supplementary Fig. 6b). Third, we observed the same correlations between transcript length and relative fold change if we removed the requirement for the annotation of genes to be supported by several lines of evidence such as gene symbols or literature published in MEDLINE, lowering the likelihood that our findings could be an artifact of the stringency of applied gene annotations (Supplementary Fig. 6c).

For our fourth robustness check, we excluded samples where the sequencing data yielded lower-quality metrics. Reassuringly, this yielded practically identical measurements of the length correlations—with two conditions (12-month-old lung, and 18-month-old large intestine) even turning a positive correlation (favoring long genes) toward the more negative correlation (disfavoring long genes) as seen for the majority of conditions (Supplementary Fig. 6d). Solely, the condition with the strongest imbalance (18-month-old large intestine), now showed a weaker imbalance (Spearman correlation with length, from −0.67 to −0.24; Supplementary Fig. 6d). As a fifth robustness check, and after our initial discovery of the age-dependent length correlation, we asked one team member, who was not involved in the design or execution of the initial bioinformatic preprocessing, to prepare an independent bioinformatic pipeline, QC and sample filtering, and differential gene expression analysis. To mitigate the potential risk associated with custom pipelines, the team member used nf-core/RNA-seq, which provides community-curated bioinformatics pipelines80. Again, we observed practically identical measurements of the length correlations—with 18-month-old hearts again turning a positive correlation toward the more representative negative correlation (Supplementary Fig. 6e). Sixth, we excluded, after the bioinformatic processing and differential gene expression analysis, genes with a known annotation relating to the inflammatory response (genes that tend to be short) and neuronal genes (genes that tend to be long). Again, we observed practically identical measurements of the length correlations (Supplementary Fig. 10f).

Seventh, after a non-parametric LOWESS regression between transcript length and relative fold changes42,85, we no longer observed any length-associated transcriptome imbalance when considering the correlation between transcript lengths and residual fold changes after LOWESS regression. This suggests that the length-associated transcriptome imbalance is a transcriptome-wide phenomenon that could be accounted for by transcript length alone (Supplementary Fig. 10g).

Eighth, reanalyzing published transcript degradation rates43, we found that longer transcripts were slightly more stable when compared to other transcripts across the genome (Spearman, −0.03; Supplementary Fig. 10h).

Ninth, we tested if a length-associated transcriptome imbalance persists if, instead of transcript length, we considered other, correlated (Extended Data Fig. 3a) readouts of length. We therefore measured the correlation between the observed relative fold changes and the median gene length (Supplementary Fig. 7a,b) and the median length of the coding sequence (Supplementary Fig. 7c,d). Indeed, we continued to detect a length-associated transcriptome imbalance (Supplementary Fig. 7).

Biological variation

We determined whether our sample size of six mice would be sufficient to conclude whether the length-associated transcriptome imbalance measured between mice of distinct chronological ages would exceed the length-associated transcriptome imbalance seen among mice of the same chronological age. We performed retrospective subsampling of all mice of a given chronological age and separated them into two equally sized groups (or three versus two mice if one sample had not been processed; Supplementary Table 1). For each possible permutation of separating mice into two different groups, we measured the correlation between transcript lengths and the relative fold changes of transcripts between those two groups. Comparing these permutations against the length-associated imbalance that we observed across distinct chronological ages, we inferred that—given the number of 17 organs and target sample size of 6 mice—we can presently only conclude for the comparison between 4-month-old and 24-month-old mice that the majority of organs demonstrate an imbalance with age that exceeds interindividual variability (Mann–Whitney U < 0.001; Fig. 1c and Supplementary Fig. 8a,b).

Next, we performed all pairwise comparisons between all mice of a given age relative to all 4-month-old mice. Reminiscent of our preceding analysis, we observed the imbalance was most pronounced when comparing 24-month-old mice against 4-month-old mice (Supplementary Fig. 8b). For all but two organs (lung and skin; Supplementary Fig. 8c), we found a relative fold decrease of long transcripts for more than half of all pairwise comparisons. Notably, the occurrence of a general trend of the length correlation by age does not indicate that all possible combinations of mice show a fitting reduction of long transcripts by age. Across all combinations, and organs, we found this fraction to be 67% (interquartile range (53%, 81%)), indicating that the length-dependent imbalance against long transcripts is not fully penetrant (Extended Data Fig. 8c).


NanoString analysis was performed by NUSeq Core using the metabolism panel (v1).

Samples were hybridized overnight at 65 °C for 16 h, according to NanoString’s recommended protocol. As all samples were DV300 >90% (that is, 90% of the RNA species should be of 300 nucleotides or longer), 75 ng input RNA was used for each hybridization reaction. Samples were then immediately loaded into the NanoString nCounter cartridges and processed. Differential expression was performed using nSolver version 4.0.70 (NanoString) between the 4-month group and the 24-month group using default parameters.

Reanalysis of previous studies

We considered genes reported to show a relative fold decrease or relative fold increase in earlier studies. For mice and rats, we used protein-coding genes with at least one research publication and an official gene symbol, and the median transcript lengths derived from GenBank. For killifish, we used genes and gene lengths as reported by Reichwald et al.19.

For studies reporting transcriptome measurements (Schaum et al.22 and Shavlakadze et al.21), we used a significance threshold of P < 0.01 for the correlation between transcript length and relative fold changes. For studies that only provided lists of differentially expressed genes and their relative fold increase or relative fold decrease (Benayoun et al.20 and Reichwald et al.19), we applied a two-sided Mann–Whitney U test, to determine whether the median transcript length of transcripts with a relative fold increase was different at the P < 0.01 level from the median length of transcripts with a relative fold decrease. In case that multiple ages were tested separately by individual studies, we selected the ages closest to 4 and 24 months of age for young and old, respectively.

Single-cell transcriptome imbalance

As cell types, we considered the cell_type and cell_ontology_class columns within the respective meta-data tables contained in the h5ad files of Kimmel et al.23 and Tabula Muris Senis24. We only considered protein-coding genes that were detected in at least one cell of a given cell type in an individual organ in a given study. We determined the transcriptome imbalance for each cell type by correlating transcript length against the log2-fold ratio formed by the summed raw counts of the older animals divided by the summed raw counts of the younger animals.

Exclusion of genes with inflammation and neuronal function

To determine robustness beyond short stress-induced genes and long neuronal genes, we removed them from our analysis after bioinformatic processing. We excluded genes if the lowercase spelling of the Gene Ontology term contained ‘immune’, ‘stress’, ‘inflamm’ or ‘infect’, ‘brain’, ‘neuro’, ‘nerv’, ‘cerebral’, ‘cortex’ or ‘memory’.

Mapping of rhesus macaque genes

For the analysis of Murray et al.86, we mapped genes to human transcript length through gene symbols shared with humans.

Analysis of Flynn et al.

Contrasting other anti-interventions, we reanalyzed the raw data of Flynn et al.87 as, despite the statement in a corresponding figure legend, their study did not include the corresponding supplementary table with differential gene expression results (Supplementary Table 3).

Functional enrichments

We considered the genes with the 5% shortest and 5% longest median transcript length. We used the annotation of pro-longevity and anti-longevity genes from HAGR29,88,89. If genes were simultaneously annotated as pro-longevity and anti-longevity genes (21 of 665 in human, 19 of 665 in mice), we kept both annotations. After intersecting with protein-coding genes with a reported transcript length, this yielded 417 anti-longevity and 267 pro-longevity genes in humans, and 307 anti-longevity and 200 pro-longevity genes in mice. Our rationale was to adhere to the most comprehensive curation of individual longevity genes.

For differential enrichment, we considered genes enriched among the genes with transcripts of one length extreme (5% shortest and 5% longest) at a Benjamini–Hochberg P value < 0.05 and depleted among the genes with the other length extreme. Unless indicated otherwise, we restricted the background gene lists for the enrichment to those genes carrying at least one annotation.

False-positive rate of opposing enrichment and depletion

To understand whether an opposing enrichment and depletion of Gene Ontology terms is common, we asked explicitly whether the number of categories that were opposingly enriched among short and long transcripts (Supplementary Tables 7–10) was higher than would be expected by chance when comparing two random samples drawn from the length distributions observed in the mouse and human genomes. We performed 100 randomizations for mouse and 100 randomizations for human genes. For mice, no single randomization identified any annotation. For humans, 1 of 100 randomizations identified a single annotation (of 14,223 possible annotations for mouse genes, and 15,371 possible annotations for human genes), while the remaining 99 randomizations identified no annotations. These values were significantly lower than the number of annotations that we observed when using the shortest and longest transcripts of mouse and humans, that is, 138 and 140, respectively (Supplementary Tables 7–10).

Alternative enrichment analysis for longevity genes

As an additional analysis, which we expected to have lower statistical power due to the increased number of uninformative genes, we repeated our enrichment analysis while considering all protein-coding genes, rather than solely those occurring in the database used for annotation (Extended Data Fig. 9g,h). We observed that the directionality in these trends with aging always persisted in human and persisted in three of four cases in mice—with the exception being anti-longevity genes among the longest transcripts. Notably, for humans, one case—the depletion of anti-longevity genes from the longest transcripts did not reach statistical significance (two-sided Fisher’s exact P = 0.31). Further, in mice, the enrichment of anti-longevity genes among the shortest transcripts did not reach statistical significance (two-sided Fisher’s exact P = 0.31).

Annotation network construction

To organize annotations according to their similarity in the shared genes rather than the human-imposed hierarchical organization, we represented the annotations found to be enriched as nodes, drew edges between two nodes if at least one gene carried both annotations and simply the network as follows: Starting with node with the fewest attached genes, we kept the edge from that node to the node with the largest intersection set of attached genes. In case of a tie, that is, in case there were several nodes with intersection sets of attached genes of the same size, we kept the edge to the node with the fewest number of attached genes. In case a tie remained, we kept the edge to the annotation node with the fewest genes attached but now including genes that were not included in the enrichment analysis. We repeated this procedure for the other nodes in order of increasing number of genes attached.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


Leave a Comment