I’m constantly asked to explain what I do for a living. Here is an attempt to do so in laypersons’ terms. I’ll assume my readers are non-scientists and non-engineers, but that they’ve taken a high school biology class.
“Bioinformatics” is the application of mathematics and computer science to biological data, particularly molecular biology data. By “molecular” I mean DNA, RNA, and intracellular functions. This is in contrast to other applications of math to biology such as epidemiology or clinical trials. I don’t do those things but they sound fun. Bioinformaticians are scientists who work closely with laboratory scientists. We generally sit at a computer all day since our work is about analyzing data rather than generating it in a lab.
In this series of posts I’ll survey some of the work I do. I work for a large corporation, so this writing reflects my application of bioinformatics toward the selling of solutions and products. In contrast, many bioinformaticians work in academia or government supporting “pure” (whatever) research.
I’ll start with the most fundamental problem in modern science: dealing with data. Future posts in this series will cover predictive modeling, required skills, and other topics.
Data, Data Everywhere
All sciences are experiencing a flood of data right now; far more has been generated in the last decade than in the entire prior history of science. This is due to modern instrumentation such as telescope arrays, satellite remote sensing, low cost sensors, and, of course, fast DNA/RNA sequencing machines. Traditional scientists have not had the skills to cope with this data deluge, so a new breed of professionals emerged to fill the gap. The term “informatics” was appended to the fields’ names to describe the new professions: “astroinformatics”, “geoinformatics”, “cheminformatics”, and in my case “bioinformatics”.
It often starts with databases. For example, a cheminformatician might build and query a database of molecular structures and their properties. This is how I got my start in bioinformatics. When I entered the field I did not know a thing about molecular biology, but my employer needed my database development skills because we were storing information about genes and RNA molecules that impact those genes. Beyond building the databases, I had to write software that communicated with them to derive meaningful information from the data being stored.
Far into my bioinformatics career I’m still building databases and software that interacts with them. The screenshot below shows a subset of a database I developed recently to connect genes to diseases. Diseases related to cleft lip are shown in blue, and genes related to those diseases are shown in purple. You can see that some genes relate to more than one disease. Phenotypes (the physical manifestation of a section of DNA) are shown in yellow, in this case those related to cleft lip. With this database, researchers can input a disease they are studying and retrieve a list of genes to examine:
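For readers curious what a disease-to-gene lookup looks like in code, here is a minimal sketch in Python using its built-in SQLite database. It is only an illustration under stated assumptions: the table layout, the function name genes_for_disease, and the gene and disease names are all hypothetical placeholders, not the contents or design of any real database.

```python
import sqlite3

# Hypothetical schema: every table, gene, and disease name below is a
# made-up placeholder used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (id INTEGER PRIMARY KEY, symbol TEXT UNIQUE);
    CREATE TABLE disease (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    -- Link table: one gene can relate to many diseases and vice versa.
    CREATE TABLE gene_disease (
        gene_id INTEGER REFERENCES gene(id),
        disease_id INTEGER REFERENCES disease(id),
        PRIMARY KEY (gene_id, disease_id)
    );
""")
conn.executemany(
    "INSERT INTO gene (id, symbol) VALUES (?, ?)",
    [(1, "GENE_A"), (2, "GENE_B"), (3, "GENE_C")],
)
conn.executemany(
    "INSERT INTO disease (id, name) VALUES (?, ?)",
    [(1, "disease X"), (2, "disease Y")],
)
# GENE_B is linked to both diseases.
conn.executemany(
    "INSERT INTO gene_disease (gene_id, disease_id) VALUES (?, ?)",
    [(1, 1), (2, 1), (2, 2), (3, 2)],
)


def genes_for_disease(conn, disease_name):
    """Return the gene symbols linked to a disease, in alphabetical order."""
    rows = conn.execute(
        """
        SELECT g.symbol
        FROM gene AS g
        JOIN gene_disease AS gd ON gd.gene_id = g.id
        JOIN disease AS d ON d.id = gd.disease_id
        WHERE d.name = ?
        ORDER BY g.symbol
        """,
        (disease_name,),
    ).fetchall()
    return [row[0] for row in rows]


print(genes_for_disease(conn, "disease X"))
print(genes_for_disease(conn, "disease Y"))
```

The separate link table is what lets one gene appear under more than one disease without duplicating the gene's own record.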
Sometimes the data that needs to be stored is not the raw result of an experiment, but the output of a computational process on that data that is too time-consuming to repeat every time the information is asked for. For example, it is very time-consuming to map one organism’s genome against another’s to see which segments of DNA are the same and which differ. Therefore the computation is usually conducted once and the results are stored in a public database for easy retrieval. To illustrate this, the screenshot below shows a cat’s cannabinoid receptor gene DNA sequence (this receptor is responsible for processing THC) mapped to the same receptor gene DNA sequences for humans, mice, rats, cows, chickens, and horses. You can see that the sequences are mostly the same, suggesting almost identical genetic function, but that there are slight differences between the species:
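The "compute once, store the result" idea can be sketched in a few lines of Python. This is a simplified illustration, not how real genome mapping works: it assumes the sequences have already been lined up position by position (the hard, slow part), and the short sequences and species labels are invented for the example.

```python
def compare_aligned(seq_a, seq_b):
    """Compare two already-aligned sequences of equal length.

    Returns (percent identity, list of 0-based positions that differ).
    """
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and the same length")
    diffs = [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
    identity = 100.0 * (len(seq_a) - len(diffs)) / len(seq_a)
    return identity, diffs


# Made-up toy sequences; real gene sequences are far longer.
reference = "ATGGCCTTAG"
aligned = {
    "species_1": "ATGGCCTTAG",
    "species_2": "ATGACCTTGG",
}

# Run the comparison once and keep the results, so later lookups are instant.
stored_results = {
    name: compare_aligned(reference, seq) for name, seq in aligned.items()
}

for name, (identity, diffs) in stored_results.items():
    print(name, identity, diffs)
```

Anyone who later asks how species_2 differs from the reference reads the stored answer instead of repeating the comparison.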
Both screenshots above illustrate another bioinformatics data challenge: how to communicate data and the information derived from it. We often create web tools that enable users to query our databases and retrieve graphical insight. Usually we partner with software engineers to do this, as web development itself is not a bioinformatics skill.
Modern DNA sequencing machines enable the sequencing of a whole genome in under a day for only a few thousand dollars. Therefore many individuals are being sequenced (sometimes repeatedly, in the case of cancer research where tumor cell mutations are being examined). The resulting volume of data is a blessing and a curse. It is a blessing because with many samples we can make defensible statistical correlations between DNA variations and complex diseases.
It is a curse because the challenge of storing and processing all that data remains, even in the days of cloud computing. Bioinformaticians are therefore called upon to decide which data to keep and which to delete. Alternatively, we might delete the raw data and keep only the analysis of it once that is complete. But the analysis is itself an issue: the data is so large that processing it can take days per genome. Therefore bioinformaticians must apply sophisticated techniques from computer science to optimize their procedures.
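One of the simplest such techniques is to read the data as a stream, one line at a time, and keep only a small summary rather than loading everything into memory. The Python sketch below illustrates this under some assumptions: the input is taken to be a FASTA-style text file (header lines start with ">"), the function names are hypothetical, and the tiny "file" is invented so the example runs on its own.

```python
import io
from collections import Counter


def summarize_bases(lines):
    """Count each base while reading one line at a time.

    Only the running counts are held in memory, never the whole file.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA-style header lines
        counts.update(line.upper())
    return counts


def gc_fraction(counts):
    """Fraction of A/C/G/T bases that are G or C (0.0 if there are none)."""
    total = sum(counts[base] for base in "ACGT")
    return (counts["G"] + counts["C"]) / total if total else 0.0


# A made-up stand-in for a large sequence file on disk.
fake_file = io.StringIO(">toy_sequence\nATGC\nGGCC\nATAT\n")
counts = summarize_bases(fake_file)

print(dict(counts))
print(gc_fraction(counts))
```

The summary is a handful of numbers no matter how large the input is, which is the sense in which the analysis can be kept after the raw data is deleted.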
Taking a cue from mainstream engineering, we often automate repetitive data processing tasks to free our time for qualitative interpretation of results.
A sudden high volume of novel data usually precedes advances in scientific theory. The best example I can think of is how advances in astronomy led to the establishment of the Sun-centered solar system. Similarly, geological and fossil data led to the idea of continental drift. In the case of molecular biology, I think we are poised for a similar advancement in theory riding on the tail of the data explosion.