how I make a living: what is bioinformatics? (part #1)

I’m constantly asked to explain what I do for a living. Here is an attempt to do so in laypersons’ terms. I’ll assume my readers are non-scientists and non-engineers, but that they’ve taken a high school biology class. “Bioinformatics” is the application of mathematics and computer science to biological data, particularly molecular biology data. By […]

DIY Twitter analytics (part 3: hashtag network)

I’ve been mathematically analyzing my Twitter feed to determine how best to position my tweets for maximum impact, and have been documenting the work on this blog. While I’ve not come to any brilliant conclusions yet, I’ve made progress. My first post on the subject described clustering my followers by their hashtag use to see […]

DIY Twitter analytics (part 2: correlations)

I’ve been working with the Twitter API to develop my own Twitter analytics tool chain, and have been documenting the results on this blog. My last post on the subject described clustering my followers by their hashtag use to see whose tweets are most like mine. My goal for this project is to figure out best […]

DIY Twitter analytics (part 1: clustering related users)

I’ve started working with the Twitter API to develop my own Twitter analytics tool chain. My goals are to figure out who the influencers in my subjects are, figure out how best to position my tweets, etc. I could certainly pay for this service, but then I wouldn’t learn any new technical skills in the […]

graph database for heterogeneous biological data

To assist with a project I’m working on, I recently implemented a substantial portion of DisGeNET as a graph database. Furthermore, I added MeSH, OMIM, Entrez, and GO into the database to facilitate linking of data between these sources. Here I briefly describe these data sources, describe graph databases, and then show how use of […]

gene annotation database with MongoDB

After reading Datanami’s recent post “9 Must-Have Skills to Land Top Big Data Jobs in 2015” [1], I decided to round out my NoSQL knowledge by learning MongoDB. I have previously reported NoSQL work with Neo4j on this blog, where I discussed building a gene annotation graph database [2]. Here I build a similar gene […]

Apache Spark and stock price causality

The Challenge: I wanted to compute Granger causality (described below) for each pair of stocks listed on the New York Stock Exchange. Moreover, I wanted to analyze between one and thirty lags for each pair’s comparison. Needless to say, this requires massive computing power. I used Amazon EC2 as the computing platform, but needed a […]

data natives

We hear a lot of marketing yammer about “digital natives”, that is, folks fluent in social media and in particular marketing using social media. Writers who use this term often juxtapose such digital natives against “analog natives”, i.e., individuals who matured or were educated before online social media became such a significant part of our […]

rapidly extracting a subsequence from chromosome sequence data in Java

The Challenge: We have a text file containing the nucleotides of a chromosome, say human chromosome 11, and need to be able to quickly extract a subsequence from the chromosome text given a nucleotide position and number of subsequent nucleotides to include. The problem is that chromosome files are huge, e.g. 135 megabytes for chromosome […]

graph database for gene annotation

Lately I’ve been experimenting with graph databases using Neo4j and the Cypher query language. To get a feel for these tools, I created the following gene annotation network. The Cypher commands I used are discussed in this post, followed by a demonstration of querying the database. Creating the Graph Database: We are creating the following […]