rapidly extracting a subsequence from chromosome sequence data in Java

The Challenge

We have a text file containing the nucleotides of a chromosome, say human chromosome 11, and need to be able to quickly extract a subsequence from the chromosome text given a nucleotide position and number of subsequent nucleotides to include. The problem is that chromosome files are huge, e.g. 135 megabytes for chromosome 11, so we don’t want to use typical string processing tools. An example of the text we are extracting nucleotides from is:

chromosome_snapshot

Note that this file contains no line breaks or header, so that the byte position of a nucleotide corresponds to its position in the genome.

The Solution

Our solution is to use random file access to jump to the desired start nucleotide position in the file and read forward from that position for the required number of bases. Java code that implements this strategy is:

import java.io.*;

public class ReadNucleotidePositions {

    public static void main(String[] args) {

	// user settings
	Integer start = 68081400 - 1;   // base zero
	Integer numberOfCharactersToRead = 200;
	String chromosomeFile = "chromosome_11.txt";

	String output = "";

	try {
	    // open a file for reading
	    RandomAccessFile raf = new RandomAccessFile(chromosomeFile, "r");

	    // seek the start position
	    raf.seek(start);

	    // repeat read operation for the number of times specified
	    for (Integer i=0; i<numberOfCharactersToRead; i++) {

		// this has to be type "int", not type "Integer" for the cast to work
		int someCharInteger = raf.read();

		// cast as character and append to output string
		output += (char) someCharInteger;
	    }
	}
	catch (IOException ex) {
	    ex.printStackTrace();
	}

	System.out.println();
	System.out.println(output);
	System.out.println();
    }
}

Here we start at position 68081400 and subtract one to make the coordinate system base zero. We specify that we want to read the nucleotide at this position followed by the next 199 nucleotides in chromosome 11.

We open a “RandomAccessFile” and use the “seek” method to move the file pointer to the position we intend to start reading from. Then we loop for the number of nucleotides we are reading, using the “read” method at each iteration to extract the character at that position in the chromosome text. The value returned from the “read” method is of type “int”, which we must cast to type “char” before adding it to our String object containing the extracted sequence.

run_results

Finally, we check the extracted sequence against a reference (in this case the UCSC Genome Browser) to ensure our sequence extraction is accurate:

UCSC_genome_browser

Leave a Reply

Your email address will not be published.