I recently performed analysis for a media watchdog organization regarding Big Tech’s political influence. Consequently, I’ve received some (American) right-wing press cut. Not too happy that only one side of the aisle is paying attention to what, in my opinion, is a centrist concern, but I’ll take whatever attention (to the issue) that I can…

# Tag: python

## church to bar ratio, by U.S. county (3rd edition)

Church to bar ratio by county from U.S. Census Bureau data: The brighter the color, the higher the church to bar ratio. Counties missing data necessary for the computation are shown in black. Method From the 2013 County Business Patterns data published at http://www.census.gov/econ/cbp/download/, I extracted the number of establishments in each county that have…

## DIY Twitter analytics (part 3: hashtag network)

I’ve been mathematically analyzing my Twitter feed to determine how best to position my tweets for maximum impact, and have been documenting the work on this blog. While I’ve not come to any brilliant conclusions yet, I’ve made progress. My first post on the subject described clustering my followers by their hashtag use to see…

## DIY Twitter analytics (part 1: clustering related users)

I’ve started working with the Twitter API to develop my own Twitter analytics tool chain. My goals are to figure out who the influencers in my subjects are, figure out how best to position my tweets, etc. I could certainly pay for this service, but then I wouldn’t learn any new technical skills in the…

## graph database for heterogeneous biological data

To assist with a project I’m working on, I recently implemented a substantial portion of DisGeNET as a graph database. Furthermore, I added MeSH, OMIM, Entrez, and GO into the database to facilitate linking of data between these sources. Here I briefly describe these data sources, describe graph databases, and then show how use of…

## HRC Corporate Equality Index correlates with Fortune’s 50 most admired companies

The Human Rights Campaign, one of America’s largest civil rights groups, scores companies in its yearly Corporate Equality Index (CEI) according to their treatment of lesbian, gay, bisexual, and transgender employees [1]. The companies automatically evaluated are the Fortune 1000 and American Lawyer’s top 200. Additionally, any sufficiently large private sector organization can request inclusion…

## fast genomic coordinate comparison using PostgreSQL’s geometric operators

PostgreSQL provides operators for comparing geometric data types, for example for computing whether two boxes overlap or whether one box contains another. Such operators are quick compared to similar calculations implemented using normal comparison operators, which I’ll demonstrate below. Here I show use of such geometric data types and operators for determining whether one segment…

## gene annotation database with MongoDB

After reading Datanami’s recent post “9 Must-Have Skills to Land Top Big Data Jobs in 2015” [1], I decided to round out my NoSQL knowledge by learning MongoDB. I have previously reported NoSQL work with Neo4j on this blog, where I discussed building a gene annotation graph database [2]. Here I build a similar gene…

## clustering stocks by price correlation (part 2)

In my last post, “clustering stocks by price correlation (part 1)”, I performed hierarchical clustering of NYSE stocks by correlation in weekly closing price. I expected the stocks to cluster by industry, and found that they did not. I proposed several explanations for this observation, including that perhaps I chose a poor distance metric for…

## clustering stocks by price correlation (part 1)

I’ve been building my knowledge of clustering techniques to apply to genetic circuit engineering, and decided to try the same tools for stock price analysis. In this post I describe building a hierarchical cluster of stocks by pairwise correlation in weekly price, to see how well the stocks cluster by industry, and compare the derived…

## simulating RNA-seq read counts

The Challenge I want to explore the statistics of RNA sequencing (RNA-seq) on next-generation sequencing (NGS) platforms in greater detail, so I thought I’d start by simulating read counts to experiment with. This post details how I constructed a simulated set of read counts, examines its concordance with the expected negative binomial distribution of the…

## net change of zero between closing and opening stock prices

I decided to investigate the variation between trading days’ closing prices and the following trading days’ opening prices for stocks listed on the New York Stock Exchange. I started with data in the following format for all trading days between January 2nd 2000 and October 30th 2014: I then calculated the percent change between one…

## EC2 spot instance price change: no correlation with day of week

My plans for world domination involve heavy use of Amazon EC2 instances, but I have to be frugal about it so I’m running spot instances to save cash. Therefore a means of forecasting spot instance prices would be helpful. Thus far I’ve had little success using mainstream forecasting tools such as ARIMA and exponential smoothing….

## Apache Spark and stock price causality

The Challenge I wanted to compute Granger causality (described below) for each pair of stocks listed in the New York Stock Exchange. Moreover, I wanted to analyze between one and thirty lags for each pair’s comparison. Needless to say, this requires massive computing power. I used Amazon EC2 as the computing platform, but needed a…

## Kaplan-Meier estimator in Python

The following Python class computes and draws Kaplan-Meier product limit estimators for given data. An example of how to use the class follows the code. Code Usage Example

## building a web-enabled temperature logger

Not wanting to miss out on the “Internet of Things”, I decided to learn some of its foundational technology, namely microprocessor programming. Actually, I used a Raspberry Pi in this project instead of a classic microprocessor, but the idea is the same. Here I describe building a web-enabled temperature logger, complete with a web application…

## 100th post to badass data science

This marks the 100th post to badass data science. I’ve written about everything from Lady Gaga to computational fluid dynamics, usually with a science or data related spin. I thought I’d look at my posts analytically rather than simply reminisce. First, here is a tag cloud for the first 99 posts: From this tag cloud,…

## pyDome updates: tangential and spoke angles

In a previous post, I introduced pyDome, a Python program for calculating geodesic dome vertices, chords, and faces. I have since added two hub angle computations to the program, and report on that progress here. Face angle calculations still need to be implemented. Angles Between Chords and the Hub Tangent Plane The angle between a…

## maximized entropy of a finite distribution

I received the following tweet yesterday from @ProbFact and decided to check it out in more detail: Two-Dimensional Case I generated the following test to investigate the claim: Create four-category discrete distributions where two of the categories have 0.25 probability each, and the third category’s probability varies between 0.1 and 0.4. The fourth category’s…
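The test described in this excerpt can be sketched in a few lines of Python. This is only an illustration, not the code from the post: the tweet itself is not reproduced here, so the sketch assumes the claim under test is the standard result that a uniform distribution maximizes Shannon entropy over a finite set of categories, and it assumes the fourth category simply takes whatever probability remains. The function names and the 0.01 sweep step are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability categories contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def four_category_distribution(p3):
    """Two categories fixed at 0.25, the third set to p3.

    Assumption: the fourth category receives the remaining probability mass.
    """
    return [0.25, 0.25, p3, 1.0 - 0.5 - p3]


# Sweep the third category's probability from 0.10 to 0.40 in steps of 0.01
# (the step size is an arbitrary choice for this sketch).
sweep = [round(0.10 + 0.01 * i, 2) for i in range(31)]
entropies = {p3: shannon_entropy(four_category_distribution(p3)) for p3 in sweep}

# The probability of the third category at which entropy peaks.
best_p3 = max(entropies, key=entropies.get)
print(best_p3, entropies[best_p3])
```

Under these assumptions the peak falls where all four categories are equally likely, which is what the uniform-distribution result predicts.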

## examining mRNA complexity by annotation region using MapReduce

I became interested in how annotated mRNA regions (e.g., 5′ UTR, coding, and 3′ UTR) vary in information content, speculating that coding regions (CDS) of transcripts will be generally more complex than other regions due to their role in specifying protein recipes. Measuring sequence complexity using Shannon entropy validated this hypothesis, at least with regard…

## dynamically generated matplotlib images via django

I finally figured out how to serve a dynamically generated matplotlib image through Django. Here is the necessary views.py and urls.py code for an example case: views.py urls.py Result

## industrial diversity correlates with population

It seems logical that U.S. counties having greater populations would support more diverse industry than counties having lesser population. Perhaps this has been proven already, but I recently stumbled upon my own verification of the idea: The above plot shows industry diversity (expressed in the form of Shannon entropy, discussed below) as a function of…

## SymPy: a computer algebra system for Python

In a previous post, I examined Maxima, a free computer algebra system (CAS). Yesterday I discovered SymPy, a Python library that adds CAS functionality to the Python language, and decided to give it the same test drive I gave Maxima. I report the results here, and then provide a brief summary of why using CAS…

## pyDome: a geodesic dome designer

I am pleased to announce the release of pyDome, a geodesic dome designer written in Python. The software is freely available on GitHub at https://github.com/badassdatascience/pyDome. User modification of the code is encouraged. In a previous post (http://badassdatascience.com/2012/04/15/geodesic-dome-design-part-1/), I described the procedure for calculating a geodesic dome’s vertices and chords for a class one dome. pyDome…

## simulating the Monty Hall puzzle

The Monty Hall puzzle (named after the original host of Let’s Make a Deal) offers contestants three identical doors. One of these doors leads to the contestant winning a new car, while the other two lead to rooms containing goats. The contestant selects a door, and the host immediately opens up one of the two…
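A minimal simulation of the puzzle might look like the following. It is a sketch rather than the code from the post, and it assumes the standard rules: the host always opens a goat door that the contestant did not pick, and the contestant then either stays or switches to the one remaining closed door. The names `play` and `win_rate`, the trial count, and the seed are all hypothetical.

```python
import random


def play(switch, rng):
    """Play one round; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Assumed rule: the host opens a goat door other than the contestant's pick.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither the original pick nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=0):
    """Fraction of simulated rounds won under a fixed stay/switch strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


print("stay:  ", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

With enough trials the two rates should settle near 1/3 for staying and 2/3 for switching, the values the usual probabilistic argument gives.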

## Cartesian products in Python

Given Python lists A and B, the Cartesian product of the two lists is the set of all ordered pairs (a, b) such that a is an element of A and b is an element of B [1]. In mathematical notation, this is denoted: We can write a more general definition of the Cartesian product for…
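As a quick illustration of the definition (not necessarily the approach taken in the post), the standard library's `itertools.product` yields the ordered pairs directly, and a nested list comprehension gives the same result. The example lists are made up.

```python
from itertools import product

A = [1, 2]
B = ["x", "y"]

# All ordered pairs (a, b) with a drawn from A and b drawn from B.
pairs = list(product(A, B))

# The same pairs, in the same order, written as a nested comprehension.
pairs_by_comprehension = [(a, b) for a in A for b in B]

# product() accepts any number of iterables, which covers the general
# n-fold Cartesian product.
triples = list(product(A, B, [True, False]))

print(pairs)
print(len(triples))
```

`product` is lazy, so for large inputs it is usually better to iterate over it than to materialize the whole list as done here.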

## church to bar ratio, by U.S. county

Church to bar ratio by county from U.S. Census Bureau data: The brighter the color, the higher the church to bar ratio. Counties missing data necessary for the computation are shown in black. Method From the 2011 County Business Patterns data published at http://www.census.gov/econ/cbp/download/index.htm, I extracted the number of establishments in each county that have…

## SWIG, C++, Python, and Monte-Carlo simulation

In the previous post, I introduced MCS-libre, my C++ library for Monte-Carlo simulation. Here I show how to access it from Python using the Simplified Wrapper and Interface Generator (SWIG), while in the process demonstrating how to use SWIG with C++ classes. First we download and decompress the MCS-libre library code: Next, we create a SWIG…

## writing a software pipeline manager (part 1)

I find myself regularly chaining programs together into software pipelines, and decided that having a pipeline management tool would be helpful. So I wrote one and posted the code here. To illustrate how the tool works, consider the following software pipeline: Here we see that the steps “get weather forecast” and “get stock quote” depend…

## industrial diversity vs percent change in unemployment rate

This analysis may exceed the bounds of my statistics knowledge, but I will deliver it anyway in the name of “process” blogging. I welcome experienced critique of the method! Result A modest positive correlation exists between a county’s industrial diversity and its percent change in unemployment rate over the period 2007-2010. Method Several months ago…

## myers-briggs personality interaction map

After my last consulting gig for Yoyodyne Propulsion Systems (YPS), they invited me back to troubleshoot their R&D team’s group dynamics [1][2]. To get started, I administered a web-based Myers-Briggs Type Indicator (MBTI) assessment to each member of YPS’s R&D team to discern their personality types [3]. I then plotted the personality similarities between the individuals…

## RNAfold and sequence length

I’ve been looking for a way to compare RNAfold [1] dG results for two RNA sequences, where the two sequences differ in length. My initial thought was to simply divide the computed dG’s by sequence length (i.e., normalize by sequence length) and then compare the results. The analysis presented below shows why this won’t work….

## comparing BLAST results by bit score ratio

I recently read that two separate BLAST alignments to the same reference sequence can be compared to each other by normalizing the alignments by the maximum bit score of the reference sequence BLASTed against itself [1]. In this procedure, the user first aligns the reference sequence to itself to find the maximum possible bit score,…
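The normalization described here amounts to a single division. The sketch below assumes the bit scores have already been parsed out of the BLAST output; the function name and the example scores are hypothetical and do not come from any real alignment.

```python
def bit_score_ratio(alignment_bit_score, self_bit_score):
    """Normalize an alignment's bit score by the reference's self-alignment score."""
    if self_bit_score <= 0:
        raise ValueError("self-alignment bit score must be positive")
    return alignment_bit_score / self_bit_score


# Hypothetical bit scores for two alignments against the same reference.
reference_self_score = 200.0
ratio_a = bit_score_ratio(150.0, reference_self_score)
ratio_b = bit_score_ratio(90.0, reference_self_score)

# Both ratios share a denominator, so they can be compared directly.
print(ratio_a, ratio_b)
```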

## lunar ephemeris calculations with PyEphem

While analyzing data for my recent post demonstrating that the lunar cycle does not correlate with crime incidents, I needed to compute daily lunar ephemeris data to match with daily crime incident counts. To accomplish this I turned to PyEphem, a Python package that computes, among other things, lunar position and phase for any given date. I first…

## data scientist goes coolhunting…

Intuitive coolhunting scales poorly. Here’s some math to help fix that problem: Axioms of cool Five axioms enable us to mathematically model cool: No one is intrinsically cool, individuals simply channel it. Ability to temporarily hold coolness varies by individual. Coolness naturally flows into some individuals more readily than others. Rate of coolness flow into…

## church to bar ratio in the lower 48, by county

Calculated the church to bar ratio by county from US Census Bureau data: The color partitions were derived from the log-transformed ratio distribution to facilitate visual clarity. Method From the 2009 County Business Patterns data published at http://www.census.gov/econ/cbp/download/index.htm, extracted the number of establishments in each county that have NAICS codes 813110 (places of worship including…