monte-carlo simulation in C++ with MCS-libre

Monte-Carlo simulation is a sometimes elegant (and sometimes crude) method for simulating complex systems. Parameters that affect the system are selected from random distributions and the system response to these values is then calculated. Repeating this process many times produces often useful information about the system. The method is especially useful for examining non-linear systems that are difficult to write equations for.

Monte-Carlo simulation proves useful in the following areas:

  • Discrete-Event Simulation
  • Systems Science
  • Bayesian Inference

This list is by no means exhaustive. Please feel free to comment if your favorite application of Monte-Carlo simulation is missing.

MCS-libre is a C++ library to facilitate Monte-Carlo simulation that I wrote about ten years ago. I’m writing about its use now so that I can reference it when I demonstrate SWIG in a future post. The library is freely available at http://sourceforge.net/projects/mcs-libre. Usage information, with a full example, is given below:

Usage

Sampling

A class called mcsLibre facilites sampling during simulation. Begin by including the necessary header file:

#include "mcsLibre.h"

Then declare an object of type mcsLibre:

mcsLibre samplesource;

This initializes MCS-libre using the default 100 random number seeds. Users can provide an alternate set of seeds if they wish.

One can then sample from normal and uniform distributions across multiple random number streams.

double i = samplesource.normal(mean, stdev, stream);
double j = samplesource.uniform(min, max, stream);

The stream argument used above is an integer that specifies which random number stream to generate values from. It is imperative that statistically independent variables sample from different streams!

Collecting Simulation Statistics

A class called mcsLibreCollectStats assists with collection of statistical information during simulation. Begin by including the necessary header file:

#include “mcsLibreCollectStats.h”

Then declare an object of type mcsLibreCollectStats as follows:

mcsLibreCollectStats stats_var1;

To add random samples of a variable to the collection object, use the add_value function:

stats_var1.add_value(30.3);

Retrieving Statistical Information

The mcsLibreCollectStats class provides functions for extracting statistical information out of its objects:

double m1 = stats_var1.get_mean_value(); // mean
double s1 = stats_var1.get_stdev_value(); // standard deviation
double mn1 = stats_var1.get_min_value(); // minimum value
double mx1 = stats_var1.get_max_value(); // maximum value
double n1 = stats_var1.get_number_samples(); // number of samples
double sum1 = stats_var1.get_sum_of_samples(); // sum of samples

Example

The rest of this post demonstrates the use of MCS-libre with an example simulation.

Simulating Reynold’s Number

Reynold’s number is a non-dimensional value used to describe flow around an object in a fluid dynamics system. It is the ratio of inertial forces to viscous forces, and is described by the equation

reynolds_number

where V is the velocity of the object relative to the fluid, x is the distance the fluid travels across the object, and nu is the kinematic viscosity of the fluid. In this example we sample values for V, x, and nu from different distributions and then calculate the Reynold’s number from the sampled values. We do this thousands of times, storing each result, to gain information about how the Reynold’s number value responds to the sampled variable space.

Full Implementation of Example

/* Demonstration of MCS-libre toolkit for computing propagation of error in the Reynolds number. */

#include <iostream>

// include mcsLibre simulation library
#include "mcsLibre.h"

// include mcsLibre statistical probe class
#include "mcsLibreCollectStats.h"

main()
{
  std::cout << "Performing Monte-Carlo simulation..." << std::endl;

  // create object for simulation
  mcsLibre reynolds;

  // create probe for data collection
  mcsLibreCollectStats reynolds_stats;

  // define parameter types
  double Re, Re_mean, Re_stdev;    // Reynolds number of flow
  double nu, nu_mean, nu_stdev;    // kinematic viscosity of fluid
  double v, v_mean, v_stdev;       // velocity of flow
  double x, x_mean, x_stdev;       // distance considered

  // set mean and standard deviations from measurements
  nu_mean = 0.0000157; nu_stdev = 0.000002;  //  m^2 / s
  v_mean = 2; v_stdev = 0.1;                 //  m / s
  x_mean = 12; x_stdev = 1;                  //  m

  // simulation loop
  int samples = 15000;      // number of simulations
  for (int loopvar = 0; loopvar < samples; loopvar++)
  {
    // simulate nu from stream 1
    nu = reynolds.normal(nu_mean, nu_stdev, 1);

    // simulate v from stream 2
    v = reynolds.normal(v_mean, v_stdev, 2);

    // simulate x from stream 3
    x = reynolds.normal(x_mean, x_stdev, 3);

    // calculate a sample Reynolds number
    Re = (v * x) / nu;

    // send sample value to data collection probe
    reynolds_stats.add_value(Re);
  }

  // retrieve mean value from simulation
  Re_mean = reynolds_stats.get_mean_value();

  // retrieve standard deviation value from simulation
  Re_stdev = reynolds_stats.get_stdev_value();

  std::cout << std::endl << "Simulation of Error Propagation " << "in Reynold's Number" << std::endl;
  std::cout << "Re = (v * x) / nu" << std::endl << std::endl;
  std::cout << "Let v = normal(" << v_mean << ", " << v_stdev << "), x = normal(" << x_mean << ", "  << x_stdev << ")," << std::endl;
  std::cout << "and nu = normal(" << nu_mean << ", " << nu_stdev << ")." << std::endl << std::endl;
  std::cout << "Simulating " << samples << " Reynold's numbers yields:" << std::endl << std::endl;
  std::cout << "Sample Mean = " << Re_mean << "   Sample Standard Deviation = " << Re_stdev << std::endl;
  std::cout << "Sample Max = " << reynolds_stats.get_max_value();
  std::cout << "    Sample Min = " << reynolds_stats.get_min_value() << std::endl;
  std::cout << "Sum of Samples = " << reynolds_stats.get_sum_of_values() << std::endl;
  std::cout << "Number of Samples = " << reynolds_stats.get_number_samples() << std::endl;
  std::cout << std::endl;
}

Example Results

results

Posted in data science, engineering, science, statistics, system dynamics | Tagged , , , , , , | Leave a comment

gene fusion variants mapped by shared PubMed IDs

Introduction

A gene fusion occurs when parts of two genes’ RNA combine to form one hybrid mRNA molecule before translation into protein. A common set of fusions found in lung cancer are the multiple combinations of genes EML4 and ALK [1], hereafter denoted EML4-ALK. The COSMIC database [2] lists 29 distinct fusions of EML4-ALK, which all vary by where the breakpoint between the EML4 RNA sequence and the ALK RNA sequence occurs, as shown in the following screenshot:

COSMIC_inferred_breakpoints

Each of the fusions in the above screenshot corresponds to one or more PubMed IDs, which indicate papers that provide evidence for the fusion. We can map the PubMed IDs in common between the fusions as a chord chart (below).

Results

Mapping COSMIC’s EML4-ALK gene fusions by shared PubMed IDs yields:

circos_plot_pink

Here we see that fusion COSF474 shares many PubMed IDs with fusion COSF412, but few with fusion COSF493.

Method

The raw data fed to the chord chart looks like:

data

I used D3 (Data-Driven Documents) [3], a JavaScript library for producing data-heavy graphics, to create the image. Particularly, I modified the chord diagram shown at [4] to accommodate this data.

References

  1. Soda M, Choi YL, Enomoto M, et al. (August 2007). “Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer”. Nature 448 (7153): 561–6. doi:10.1038/nature05945. PMID 17625570
  2. Forbes et al. “COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer”. Nucl. Acids Res. (2011) 39 (suppl 1): D945-D950. doi: 10.1093/nar/gkq929
  3. http://d3js.org/
  4. http://bl.ocks.org/mbostock/1308257
Posted in data science, engineering, science | Tagged , , , , , , , , , , | Leave a comment

freshwater and Canadian sovereignty

Canada holds seven percent of the world’s renewable freshwater [1]. While it is unclear how much this percentage or volume will be impacted by climate change, let’s assume for the sake of argument that Canada will retain its significant freshwater resources for the next century.

Contrast that with the neighboring United States, where freshwater is becoming increasingly scarce. Climate change is expected to exacerbate the shortage. Desalination remains too expensive a cure.

We can then envision the construction of multiple aqueducts to carry freshwater from Canada to thirsty parts of the United States. We can reason that the Canadians don’t need their water surplus because there are so few Canadians compared to Americans. We can invoke Free Trade as the legal framework for the exchange.

Under this scenario, as America becomes dependent on Canadian freshwater, the water sources and aqueduct infrastructure become strategic assets to the United States. This would be bad news for Canada.

Canadian sovereignty in the face of its mighty southern neighbor has been at risk throughout the country’s entire existence (militarily at first, and now culturally and economically). If Canadian freshwater were to become vital to American daily life, the United States would likely meddle in Canada’s business to ensure the water never stopped flowing south. This would erode Canada’s shaky sovereignty.

References

1.  http://www.ec.gc.ca/eau-water/ accessed 11 April 2013

 

Posted in Uncategorized | Tagged , , , , , , , , , , | Leave a comment

comparing mRNA half-life survival curves

In my last post, I illustrated how the Kaplan-Meier estimator can be used to estimate the survival curve of mRNA half-lives. In this post I will expand on that analysis and show how to compare two mRNA half-life Kaplan-Meier curves, each corresponding to a measured gene outcome, to see if mRNA half-life differs between outcomes. (Sorry, I can’t reveal what the “outcomes” are!).

Example source data is shown in the following table.  The “half_life” variable indicates the measured half-life of each gene’s mRNA [1], while “event_observed” indicates whether the event (half-life reached) was observed during the measurement experiment. We can see that one of the cases in the example source data below is right censored, i.e., the half-life of the mRNA was not reached by the time the experiment ended. The “group” variable indicates which outcome each gene’s mRNA corresponds to:

sample_data

Plotting the Kaplan-Meier estimate curves for each outcome yields:

compare_survival_curves

We see that the median half-lives do indeed look different on the graph, and in R’s estimate of the median values:

fit

But is the difference statistically significant? We can answer that question with the Mantel-Cox log rank test, shown on the last line of the Cox proportional hazards regression screenshot below:

coxph

Here the p-value is 1.308e-09, so we reject the null that the curves are equal. We therefore conclude that mRNAs associated with outcome B generally have a longer half-life.

References

1.  Lioudmila V. Sharova, Alexei A. Sharov, Timur Nedorezov, Yulan Piao, Nabeebi Shaik, and Minoru S.H. Ko. Database for mRNA Half-Life of 19 977 Genes Obtained by DNA Microarray Analysis of Pluripotent and Differentiating Mouse Embryonic Stem Cells. DNA Res (2009) 16(1): 45-58 first published online November 11, 2008 doi:10.1093/dnares/dsn030

Posted in data science, science, statistics | Tagged , , , , , , , , , , , , , | Leave a comment

mRNA half-life survival curve estimation

In a recent post, I demonstrated the use of the Kaplan-Meier estimator for estimating survival curves of fictional characters undergoing treatment in a fictional drug trial. Here I illustrate the Kaplan-Meier estimator on real data, data that is unique from normal survival analysis data in that the event under consideration is neither time until death nor time until product failure.

Sharova et. al. [1] published the measured half-lives of roughly 20,000 mouse genes. If we treat the observation of an mRNA reaching half-life during their experiment as the event, we can estimate a survival curve for mRNA half-lives using the Kaplan-Meier estimator. Since the experiment ran for only 24 hours, those few mRNAs with longer half-lives than 24 hours show up as right-censored data. An example of the source data is:

 

sharova_example

 

Plotting this data using R’s Kaplan-Meier estimator yields:

overall_half_life

Looking at this plot of the estimated survival curve, we can see that the median mRNA half-life is about seven hours. R’s Kaplan-Meier estimator estimates that the median is 7.08 hours:

fit2

References

1.  Lioudmila V. Sharova, Alexei A. Sharov, Timur Nedorezov, Yulan Piao, Nabeebi Shaik, and Minoru S.H. Ko. Database for mRNA Half-Life of 19 977 Genes Obtained by DNA Microarray Analysis of Pluripotent and Differentiating Mouse Embryonic Stem Cells. DNA Res (2009) 16(1): 45-58 first published online November 11, 2008 doi:10.1093/dnares/dsn030

Posted in data science, science, statistics | Tagged , , , , , , , , , , , , , | 1 Comment

stalling an airplane mid-flight

Thanks to my brother, I recently had the opportunity to fly a small Cessna aircraft under supervision of a flight instructor. The instructor took off and landed, but gave me the controls during flight. During this time we went through a few instructive maneuvers, including stalling the plane mid-flight. Here I explain how stalling an airplane works and how to get out of a stall once it has occurred.

Consider the following sketch showing the forces acting on an airfoil during steady-state flight:

four_forces_sans_equation

If the weight of the airplane exceeds the lift being generated by the airfoil, the plane will start falling.

Lift is generated by a pressure difference between the bottom and top of the airfoil:

presure_gradient

v_high_and_low

Bernoulli’s principle tells us that when air flows more quickly over the top of the airfoil than the bottom, the pressure on the bottom of the airfoil exceeds the pressure at the top of the airfoil. The thrust of the airplane (from the propeller) forces the air across the airfoil. Because the shape of the airfoil mildly obstructs flow over the top of the airfoil versus the bottom, the air on top is forced to move faster. The resulting pressure differential generates lift.

Stalling an airplane means configuring it mid-flight in such a way that lift fails. This is done by first lowering the propeller speed, thereby reducing airflow across the airfoil, and then angling the plane upward so that the remaining airflow across the airfoil fails to produce the required pressure difference for lift:

no_pressure_diff

Once this configuration occurs, the forces of weight and drag in the vertical direction exceed the force of thrust in the vertical direction, and the airplane starts falling:

stall_position

To get out of the stall, one increases the propeller speed to force airflow across the airfoil; and then pushes the yoke forward to bring the elevators down, which reduces the angle between the airfoil and the earth’s surface, thereby placing the airfoil back into a configuration that allows it to generate lift.

Posted in engineering, science | Tagged , , | Leave a comment

the Kaplan-Meier estimator

In my last post, I wrote about censored data. This post continues the survival analysis theme by focusing on estimation of survival curves.

In survival and reliability analysis, it is useful to determine the survival curve for the population under study. This is the curve defined by the probability that a random variable indicating the study event (e.g., patient death, product failure) exceeds a given time, for all time values in the study. However, we generally don’t know the survival curve and have to rely on an estimation of it. The Kaplan-Meier estimator provides such an estimate. To demonstrate, suppose we have the following data from a clinical trial:

uncensored_case_data

We can construct the following Kaplan-Meier estimate of the survival curve from this data. (I’ll show how to do this in a later post):

uncensored_case_data_EXAMPLE_CURVE

From this graph, we can estimate that the median time until patient death is eight days.

The Kaplan-Meier estimator works in the presence of right censored data. Suppose a different clinical study has right censored data points at days five, seven, and twenty:

censored_case_data

With right censoring taken into account, the Kaplan-Meier estimate looks like:

censored_case_data_EXAMPLE_CURVE

Here the censored data points are shown with tick marks.

We can derive useful information by comparing curves. Consider the following clinical study comparing a placebo with drug A:

two_cases_study

Plotting the Kaplan-Meier estimate for each treatment group yields:

two_cases_study_CURVES

Here we see that patients getting the placebo treatment generally live longer than patients receiving drug A. The placebo group’s median time until death is 17 days, compared to an eight day median time until death for patients receiving drug A.

Posted in engineering, science, statistics | Tagged , , , , , | 1 Comment

“right censored” data

In clinical trials and reliability studies, researchers often measure the time until an event occurs for each patient or object in the study. That event may be patient death in the case of clinical trials for a new cancer drug, or bridge failure in the case of a reliability study of bridges.

Sometimes, however, the event being measured cannot be observed for each subject in the study. For example, a patient may outlive the study period for a study measuring time to patient death, or may quit the study before it ends. In these cases we know only that their measured time until the study event occurs is greater than some value, but we don’t know by how much.

We call such data points “censored” data points.

Consider the following plot showing time to death in a 20-day clinical trial of a new drug:

censored_data_plot

Here we see that Bart outlived the study period, and therefore has a censored data point. Lisa quit the study after seven days, so her time before death is also a censored data point.

To be more technical, these data points are called “right censored”, to indicate that the censoring takes place on the right side of the graph (meaning the true data point exceeds a certain value, but it is unknown by how much). To contrast, “left censored” data occurs when researchers do not know when a subject of theirs enters the study; when the data point lies below a certain value, but it is unknown by how much.

Censored data must be treated carefully in analyses. In future posts I’ll illustrate some of the ways this is done.

Posted in data science, engineering, science, statistics | Tagged , , , , | 1 Comment

Excel mangles NCBI gene symbols

Using Microsoft’s Excel for bioinformatics work sucks, but sometimes a spreadsheet is the best format for communicating results to other scientists.

The program’s default behavior mangles some NCBI gene symbols when you import them from a text file. Here is how to deal with it. Suppose you have the following list of gene symbols,

gene_list

and you import them into Excel. The middle three gene symbols get treated as dates:

wrong

This definitely ruins your day.

The trick to avoiding this problem is to go to the third step of the Text Import Wizard and set the Column data format to “Text”:

step_03_with_arrow

Then the gene symbols import correctly:

final

Posted in science | Tagged , , , , , | Leave a comment

the future orientation index

I’ve recently discovered Google Trends and have been looking for an opportunity to use it. Today I found such opportunity in a paper [1] published last April that computes countries’ “future orientation index” from Google Trends data and correlates it with national per-capita GDP. The authors report correlation for 2010; my experiment with Google Trends reported here tests their idea for 2011.

The paper’s authors propose that the future orientation index serves as a measurement of how much a country’s population looks forward into the future, and propose that stronger economies correlate with more future-oriented thinking. They define the future orientation index as the number of Google searches for year x+1 originating in that country during year x divided by the number of searches for year x-1 originating in that country during year x. The Google Trends query to extract such data has to include year x+1 and year x-1 at the same time so that the scaling is correct, since Google Trends reports relative search numbers rather than absolute counts. The image below shows graphical output for x = 2011. I downloaded a CSV of the output for this analysis.

google_trends_output

The paper then demonstrates that a positive correlation exists between national per-capita GDP and the future orientation index. Here I depart from their study by dividing the 141 countries returned by Google Trends into G-20 and non-G-20 countries, and then plotting their future orientation indices in box plots:

future_outlook_index_boxplot

The result shows that the economically strong G-20 countries generally have a higher future orientation index, which jives with the findings of the paper (Kruskal-Wallis rank sum test for equality:  p=0.0004361; t-test that the G-20 group has greater future orientation index than the non-G-20 group:  p=0.0003101).

References

  1. Preis, T., Moat, H.S., Stanley, H.E., & Bishop, S. R. Quantifying the Advantage of Looking Forward. Sci. Rep. 2, 350; DOI:10.1038/srep00350 (2012).
Posted in big data, data science, econometrics | Tagged , , , | Leave a comment