Before we can score off-target risk for genomic segments that have a small number of mismatches to a CRISPR, we must first find those segments.

Searching for every possible combination of mismatches is computationally expensive, so we apply a heuristic: we search only for mismatches at the positions most relevant to CRISPR efficiency.

To find these positions, we built a random forest regression model on a set of 2077 CRISPRs with known outcomes [1], where the features are the base content at each position relative to the PAM. Aggregating the importance of the categorical base features at each position by mean, we observe:
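The modelling step described above can be sketched roughly as follows. This is a minimal illustration, not the original analysis: it assumes scikit-learn's `RandomForestRegressor`, a 20-base guide, one-hot encoding of the base at each position, and per-position aggregation by averaging the importances of the four base features. The helper names (`one_hot`, `position_importance`) are hypothetical, and the data is synthetic stand-in data rather than the 2077 CRISPRs from [1].

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

BASES = "ACGT"
GUIDE_LEN = 20  # assumed guide length; not stated in the text


def one_hot(seq):
    """Encode a guide as len(seq) * 4 binary features, grouped by position."""
    row = np.zeros((len(seq), len(BASES)))
    for pos, base in enumerate(seq):
        row[pos, BASES.index(base)] = 1.0
    return row.ravel()


def position_importance(guides, scores, n_estimators=50, seed=0):
    """Fit a random forest and average feature importance over each position's four bases."""
    X = np.array([one_hot(g) for g in guides])
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
    model.fit(X, scores)
    return model.feature_importances_.reshape(-1, len(BASES)).mean(axis=1)


# Synthetic stand-in data (NOT the real dataset): the outcome depends only on
# the last two positions, plus a little noise.
rng = np.random.default_rng(0)
guides = ["".join(rng.choice(list(BASES), GUIDE_LEN)) for _ in range(300)]
scores = np.array(
    [float(g[-1] == "G") + 0.5 * float(g[-2] == "C") for g in guides]
) + rng.normal(0, 0.1, len(guides))

importance = position_importance(guides, scores)
# Indices of the nine highest-ranked positions, most important first.
top_positions = np.argsort(importance)[::-1][:9]
```

Here positions are simply indexed left to right along the guide string; mapping them to coordinates relative to the PAM depends on how the guides are oriented in the real data.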

We select the top nine positions, an arbitrary cutoff, as the ones to search for off-target mismatches.
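To make the saving concrete, the sketch below counts and enumerates substitution variants when mismatches are allowed anywhere in a guide versus only at nine chosen positions. The guide length of 20, the limit of three mismatches, the example guide, and the particular nine positions are all assumptions made for illustration; the function names are hypothetical.

```python
from itertools import combinations, product
from math import comb

BASES = "ACGT"


def count_variants(n_positions, max_mismatches):
    """Count sequences with 1..max_mismatches substitutions among n_positions."""
    return sum(comb(n_positions, k) * 3**k for k in range(1, max_mismatches + 1))


def mismatch_variants(guide, positions, max_mismatches):
    """Yield each variant of guide with 1..max_mismatches substitutions at the given positions."""
    for k in range(1, max_mismatches + 1):
        for chosen in combinations(positions, k):
            # For each chosen position, the three bases that differ from the guide.
            alternatives = [[b for b in BASES if b != guide[p]] for p in chosen]
            for subs in product(*alternatives):
                variant = list(guide)
                for p, b in zip(chosen, subs):
                    variant[p] = b
                yield "".join(variant)


guide = "ACGTACGTACGTACGTACGT"   # hypothetical 20-base guide
top_nine = list(range(11, 20))   # hypothetical choice of nine positions

print(count_variants(20, 3))     # all positions: 32550
print(count_variants(9, 3))      # nine positions only: 2619
variants = list(mismatch_variants(guide, top_nine, 3))
```

Under these assumptions, restricting the search to nine positions shrinks the candidate set by more than a factor of twelve.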

Part Two will explore a strategy for finding these mismatches by alignment.

# References

- [1] Doench JG, Fusi N, Sullender M, Hegde M, Vaimberg EW, Donovan KF, Smith I, Tothova Z, Wilen C, Orchard R, Virgin HW, Listgarten J, Root DE. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat Biotechnol. 2016 Feb;34(2):184-191. doi: 10.1038/nbt.3437.