I’ve been looking for a way to compare RNAfold [1] dG results for two RNA sequences, where the two sequences differ in length. My initial thought was to simply divide the computed dG’s by sequence length (i.e., normalize by sequence length) and then compare the results. The analysis presented below shows why this won’t work.

Given an RNA sequence of the form CCCCC…GGGGG…, in which a run of C’s is followed by an equal number of G’s, we use RNAfold to fold the RNA into a hairpin structure such as the one pictured below:

We follow this procedure for varying lengths of the hairpin:
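
A minimal sketch of how such a length sweep could be scripted in Python. This is not the code from the wiki; the helper names (`hairpin_sequence`, `parse_rnafold_energy`, `fold_energy`) are hypothetical, and it assumes the `RNAfold` executable is on the `PATH`, reads one sequence from standard input, and prints the sequence followed by a line containing the structure and the energy in parentheses.

```python
import re
import shutil
import subprocess


def hairpin_sequence(n_pairs: int) -> str:
    """Return n_pairs C's followed by n_pairs G's (total length 2 * n_pairs)."""
    if n_pairs < 1:
        raise ValueError("n_pairs must be at least 1")
    return "C" * n_pairs + "G" * n_pairs


def parse_rnafold_energy(output: str) -> float:
    """Extract the dG (kcal/mol) from RNAfold's text output.

    Assumes the last non-empty line ends with the energy in parentheses,
    after the dot-bracket structure.
    """
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty RNAfold output")
    match = re.search(r"\(\s*(-?\d+(?:\.\d+)?)\s*\)\s*$", lines[-1])
    if match is None:
        raise ValueError(f"no energy found in line: {lines[-1]!r}")
    return float(match.group(1))


def fold_energy(sequence: str) -> float:
    """Fold one sequence with RNAfold and return its dG.

    --noPS suppresses the PostScript structure drawing.
    """
    result = subprocess.run(
        ["RNAfold", "--noPS"],
        input=sequence + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_rnafold_energy(result.stdout)


if __name__ == "__main__" and shutil.which("RNAfold"):
    # Sweep a few hairpin sizes; the range here is arbitrary.
    for n_pairs in range(5, 55, 5):
        seq = hairpin_sequence(n_pairs)
        print(len(seq), fold_energy(seq))
```

The sweep only runs when `RNAfold` is actually installed; the sequence builder and the output parser can be used on their own.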

and observe that the computed dG varies linearly with hairpin length:
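
One way to check the linearity and read off the slope and intercept is an ordinary least-squares fit. The sketch below is an assumption about how that could be done in plain Python, not the original analysis code; `fit_line` is a hypothetical helper, and `lengths` and `energies` stand for the measured sequence lengths and dG values.

```python
def fit_line(lengths, energies):
    """Least-squares fit of energies = slope * lengths + intercept.

    Returns (slope, intercept).
    """
    n = len(lengths)
    if n < 2 or n != len(energies):
        raise ValueError("need at least two (length, energy) pairs")

    mean_x = sum(lengths) / n
    mean_y = sum(energies) / n

    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    if sxx == 0:
        raise ValueError("all lengths are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, energies))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

The intercept is the quantity that matters for normalization: dividing dG by length would only be length-independent if the fitted intercept were zero.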

However, despite the linearity demonstrated above, we cannot simply normalize these dG results by dividing them by sequence length, as I had hoped:

The problem is that the linear relation shown above does not pass through the origin; it has a nonzero y-intercept:

If dividing dG by sequence length were a valid normalization, the above graph would show a horizontal line. Instead, dG per nucleotide approaches a constant value only as the sequences become very long.
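
To spell the argument out, suppose the linear relation is written with slope $a$ and intercept $b$ (symbols introduced here for illustration), with $n$ the sequence length:

```latex
\Delta G(n) = a\,n + b
\quad\Longrightarrow\quad
\frac{\Delta G(n)}{n} = a + \frac{b}{n},
\qquad
\lim_{n \to \infty} \frac{\Delta G(n)}{n} = a .
```

The normalized value is independent of $n$ only when $b = 0$. With a nonzero intercept, the $b/n$ term makes short sequences deviate most from the limiting value $a$, and the deviation shrinks as $n$ grows.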

In a later post I’ll analyze whether a folding dG result for a sequence of a given length can be normalized by dividing the dG by the maximum possible dG for that sequence length, similar to the BLAST score normalization approach described here.

Python and R code used in this analysis is posted on the Badass Data Science wiki here.

[1] *Fast Folding and Comparison of RNA Secondary Structures (The Vienna RNA Package)*. Ivo L. Hofacker, Walter Fontana, Peter F. Stadler, L. Sebastian Bonhoeffer, Manfred Tacker, Peter Schuster.