I recently read that two separate BLAST alignments to the same reference sequence can be compared to each other by normalizing their bit scores by the bit score obtained when the reference sequence is BLASTed against itself [1].

In this procedure, the user first aligns the reference sequence to itself to find the maximum possible bit score, and then divides each query alignment’s bit score by that maximum. The resulting ratio estimates the fraction of the reference that each query aligns to, so the two queries can be compared on the same scale.
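The procedure above can be sketched in a few lines of Python. This is a minimal sketch, not the code from the wiki: it assumes BLAST was run with tabular output (`-outfmt 6`, where the bit score is the twelfth column) and that the reference-against-itself hit appears in the same file. The function names, the sequence IDs, and the rows in `example` are made up for illustration and are not results from this analysis.

```python
def best_bit_scores(tabular_lines):
    """Return the highest bit score per query ID from BLAST tabular (-outfmt 6) lines."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, bit_score = fields[0], float(fields[11])
        best[query_id] = max(best.get(query_id, 0.0), bit_score)
    return best


def bit_score_ratios(best, reference_id):
    """Divide each query's best bit score by the reference's self-alignment bit score."""
    max_score = best[reference_id]
    return {
        query_id: score / max_score
        for query_id, score in best.items()
        if query_id != reference_id
    }


# Hypothetical rows for illustration only (not real BLAST output).
example = [
    "ref\tref\t100.0\t1000\t0\t0\t1\t1000\t1\t1000\t0.0\t1800",
    "query_a\tref\t100.0\t500\t0\t0\t1\t500\t1\t500\t0.0\t900",
    "query_b\tref\t100.0\t250\t0\t0\t1\t250\t1\t250\t1e-120\t450",
]

ratios = bit_score_ratios(best_bit_scores(example), "ref")
print(ratios)
```

Keeping only the best hit per query is a simplification; a query that aligns to the reference in several separate pieces would need its hits combined in some way before normalizing.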

To convince myself that the method is valid, I aligned subsequences of known length against the whole sequence and calculated the percentage aligned by dividing each alignment’s bit score by the maximum possible bit score. The following plot shows the linear relationship between bit score and known percentage aligned, which supports the normalization approach for exact subsequences:

Note that I used exact subsequences only. No insertions or deletions were considered.
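One way to see why the relationship comes out linear for exact subsequences: in the Karlin-Altschul framework the bit score is S' = (λS − ln K) / ln 2, and for an exact match with no gaps the raw score S is just the match reward times the aligned length. The sketch below works through that arithmetic. The values of `LAMBDA`, `K`, and `MATCH_REWARD` are placeholders I picked for illustration; the real values depend on the scoring scheme BLAST was run with, and this is not the code used to generate the plot.

```python
import math

# Assumed, illustrative Karlin-Altschul parameters. Real values depend on
# the scoring scheme and are reported by BLAST in its output footer.
LAMBDA = 1.374
K = 0.711
MATCH_REWARD = 1


def exact_match_bit_score(length, lam=LAMBDA, k=K, reward=MATCH_REWARD):
    """Bit score of an ungapped, mismatch-free alignment of the given length."""
    raw_score = reward * length
    return (lam * raw_score - math.log(k)) / math.log(2)


reference_length = 1000
max_bits = exact_match_bit_score(reference_length)

for fraction in (0.1, 0.25, 0.5, 0.75, 1.0):
    sub_length = int(reference_length * fraction)
    ratio = exact_match_bit_score(sub_length) / max_bits
    print(f"known fraction {fraction:.2f} -> bit score ratio {ratio:.4f}")
```

Under these assumptions the ratio is a straight line in the subsequence length, with a small constant offset from the ln K term that shrinks as the reference gets longer. Alignments with mismatches or gaps would not follow this simple formula.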

BLAST results used in this analysis are posted below. Python code to generate these results is posted on the Badass Data Science wiki.

Example sequences used in this analysis (also available on the Badass Data Science wiki):

## References

[1] Rasko, D.A., Myers, G.S., Ravel, J. Visualization of comparative genomic analyses by BLAST score ratio. BMC Bioinformatics, 2005, 6:2. doi:10.1186/1471-2105-6-2.

## 2 thoughts on “comparing BLAST results by bit score ratio”

## badassdatascience

(April 17, 2012 - 3:22 pm) The same linearity applies to Smith-Waterman alignments. Code demonstrating this is available at the link given above.
