Before we can score genomic segments that differ from a CRISPR by a small number of mismatches for their off-target risk, we must first find those segments.

Searching for every possible combination of mismatches is computationally expensive, so we apply the following heuristic: we search for mismatches only at the positions most relevant to CRISPR efficiency.
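To give a rough sense of the savings, the sketch below counts and enumerates substitution variants of a guide sequence. It is an illustration, not the search used here: the 20-base guide length, the limit of three mismatches, the example sequence, and the use of the first nine indices as the restricted positions are all assumptions, and `count_variants` and `mismatch_variants` are hypothetical helper names.

```python
from itertools import combinations, product
from math import comb

BASES = "ACGT"


def count_variants(n_positions, max_mismatches):
    """Count sequences that differ from a guide at 1..max_mismatches of n_positions sites.

    Each mismatched site can hold any of the 3 bases other than the original.
    """
    return sum(comb(n_positions, k) * 3**k for k in range(1, max_mismatches + 1))


def mismatch_variants(guide, positions, max_mismatches):
    """Yield every variant of `guide` with 1..max_mismatches substitutions,
    with substitutions confined to the given 0-based `positions`."""
    for k in range(1, max_mismatches + 1):
        for sites in combinations(positions, k):
            # For each chosen site, list the three bases that differ from the guide.
            alternatives = [[b for b in BASES if b != guide[i]] for i in sites]
            for substitution in product(*alternatives):
                variant = list(guide)
                for i, base in zip(sites, substitution):
                    variant[i] = base
                yield "".join(variant)


# Assumed values: a 20-base guide and at most 3 mismatches.
print(count_variants(20, 3))  # all positions: 32550
print(count_variants(9, 3))   # nine positions only: 2619

# Placeholder guide and placeholder positions (the first nine indices).
example_guide = "ACGTACGTACGTACGTACGT"
restricted = list(mismatch_variants(example_guide, range(9), 3))
print(len(restricted))  # 2619
```

Under these assumptions, restricting mismatches to nine positions cuts the number of candidate sequences by more than a factor of ten.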

To find these positions, we built a random forest regression model on a set of 2077 CRISPRs with known outcomes [1], where the features are the base content at each position relative to the PAM. Aggregating the categorical bases for each position by mean, we observe:

We select the top nine positions, an arbitrary cutoff, as the ones to check for off-target mismatches.
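One possible reading of the aggregation and selection steps is sketched below. It assumes the model yields one score per (position, base) feature, for example a feature importance from a one-hot encoding, and that the per-position score is the mean over that position's bases. The function name `top_positions` and the toy scores are hypothetical placeholders, not values from the model.

```python
from collections import defaultdict
from statistics import mean


def top_positions(feature_scores, k=9):
    """Rank positions by the mean score of their per-base features.

    feature_scores: dict mapping (position, base) -> score.
    Returns the k highest-scoring positions, best first.
    """
    scores_by_position = defaultdict(list)
    for (position, _base), score in feature_scores.items():
        scores_by_position[position].append(score)

    # Collapse the categorical base features for each position into one mean value.
    position_score = {p: mean(s) for p, s in scores_by_position.items()}

    ranked = sorted(position_score, key=position_score.get, reverse=True)
    return ranked[:k]


# Toy scores for three positions and two bases (made up for illustration).
toy_scores = {
    (1, "A"): 0.1, (1, "C"): 0.3,
    (2, "A"): 0.5, (2, "C"): 0.5,
    (3, "A"): 0.0, (3, "C"): 0.1,
}
print(top_positions(toy_scores, k=2))  # [2, 1]
```

With real scores, calling `top_positions(scores)` with the default `k=9` would return the nine positions used for the restricted mismatch search.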

Part Two will explore a strategy for finding these mismatches by alignment.

