The Kepler scientific workflow system enables scientists and engineers to specify their software pipelines as visual graphs of dependent steps. Each node in a pipeline runs a specific task, and the programming language the task is written in does not matter, since Kepler only manages the inputs and outputs of each step. Here I describe a test of the software on such a pipeline, comment on its utility, and close by considering its scalability to the cloud. The source code and model file for the demonstrated pipeline are attached.
For my test case, I built a pipeline that forecasts Amazon EC2 spot instance prices. The key computation of the pipeline takes three days of historical spot-price data for an EC2 configuration and produces a seven-day forecast of future spot prices for that configuration. An example forecast for an m2.2xlarge SUSE Linux instance running in us-east-1c looks like:
Here the purple and light-purple regions correspond to the 80% and 95% confidence intervals of the forecast.
The Kepler model that produced this result looks like:
In the top left sits the “SDF Director” (Synchronous Dataflow), which tells Kepler to run each step sequentially once its dependencies have completed. (Other directors implement other models of computation, such as time-domain simulation.)
The green nodes (called “Actors”) implement the pipeline steps. The first three are “External Execution” actors, meaning they run shell commands that read their input from STDIN and write their output to STDOUT. An example configuration, showing the “Resample Time Series” external execution actor, is:
This calls a Python script that resamples the raw data into hourly prices, producing one time series per type/location/AMI combination of EC2 instance. The list of these combinations is then split apart by the “StringSplit” actor, so that each time series is sent in turn to an “Iterator” actor that operates on it.
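The actual script is attached below; purely as an illustration, a minimal Python sketch of this resampling step might look like the following. The input format, column names, and output naming here are my assumptions, not details taken from the pipeline.

```python
#!/usr/bin/env python
# Hedged sketch of the "Resample Time Series" step; the CSV layout and
# column names below are assumptions, not the script attached to this post.
import sys
import pandas as pd

# Assumed input on STDIN: timestamp, instance type, availability zone,
# AMI/product description, and spot price per row.
prices = pd.read_csv(
    sys.stdin,
    names=["timestamp", "type", "zone", "ami", "price"],
    parse_dates=["timestamp"],
)

# Emit one hourly time series per type/location/AMI combination.
for (itype, zone, ami), group in prices.groupby(["type", "zone", "ami"]):
    hourly = (
        group.set_index("timestamp")["price"]
        .resample("1H")
        .mean()
        .ffill()  # carry the last observed price through quiet hours
    )
    path = "%s_%s_%s.csv" % (itype, zone, ami)
    hourly.to_csv(path, header=False)
    # Whatever goes to STDOUT becomes the list that StringSplit breaks apart.
    sys.stdout.write(path + "\n")
```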
Opening the Iterator actor reveals the sub-workflow that is applied to each list element:
This sub-workflow uses the “R” actor, which runs the forecasting R code directly:
Note that the filename of the time series to forecast arrives on an input port, and that the output filename for the image is chosen by the R actor itself.
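The actual forecasting code is R and ships with the attached model file; as a rough Python equivalent of what this step does, here is a sketch using statsmodels, where the ARIMA model choice, column names, and output naming are my assumptions:

```python
# Hedged Python equivalent of the R forecasting step; the pipeline itself
# runs R code inside Kepler's R actor. Model choice and naming are guesses.
import sys
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA

infile = sys.argv[1]  # the time-series filename arrives on an input port
series = (
    pd.read_csv(infile, names=["timestamp", "price"],
                parse_dates=["timestamp"], index_col="timestamp")["price"]
    .asfreq("1H")
    .ffill()
)

# Fit a simple model to ~3 days of hourly prices, forecast 7 days ahead.
fit = ARIMA(series, order=(1, 1, 1)).fit()
forecast = fit.get_forecast(steps=7 * 24)
ci80 = forecast.conf_int(alpha=0.20)  # 80% confidence interval
ci95 = forecast.conf_int(alpha=0.05)  # 95% confidence interval

# Plot the history, the point forecast, and the shaded 80%/95% bands.
fig, ax = plt.subplots()
series.plot(ax=ax)
forecast.predicted_mean.plot(ax=ax)
ax.fill_between(ci95.index, ci95.iloc[:, 0], ci95.iloc[:, 1], alpha=0.2)
ax.fill_between(ci80.index, ci80.iloc[:, 0], ci80.iloc[:, 1], alpha=0.4)
fig.savefig(infile + ".png")
```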
Back in the main workflow (outside the Iterator sub-workflow), another R actor renames the images according to the original names of the time-series files:
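In Python terms, the renaming amounts to something like the sketch below; the “Rplot*.png” pattern is my guess at the default names the R actor emits, not a detail from the pipeline.

```python
# Hedged sketch of the renaming step (the pipeline does this with a second
# R actor); pairing files by sort order is an assumption for illustration.
import glob
import os

series_files = sorted(glob.glob("*.csv"))   # original time-series files
images = sorted(glob.glob("Rplot*.png"))    # images as named by the R actor

for ts_file, image in zip(series_files, images):
    os.rename(image, ts_file.replace(".csv", ".png"))
```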
Finally, the last actor, a “Generic File Copier”, sends the images to an external server using FTP.
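Outside Kepler, the equivalent transfer takes a few lines with Python’s ftplib; the host, credentials, and directory below are placeholders, not values from the pipeline.

```python
# Hedged sketch of the final transfer; the pipeline uses Kepler's
# "Generic File Copier" actor, so everything below is a placeholder.
import glob
from ftplib import FTP

ftp = FTP("ftp.example.com")
ftp.login("user", "password")
ftp.cwd("/forecasts")
for image in glob.glob("*.png"):
    with open(image, "rb") as fh:
        ftp.storbinary("STOR " + image, fh)
ftp.quit()
```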
I found debugging this pipeline very difficult; the error messages proved hard to decipher. I believe this is the cost of Kepler’s flexibility: if Kepler’s actors were more constrained, their error messages could likely be made clearer. Furthermore, I found little online help for many of Kepler’s tools, whether in the official documentation or on message boards.
On the other hand, once I got the pipeline working, it performed well. The visual display of the pipeline serves as intuitive documentation of how it works. Now that I have the pipeline, I can run it as a batch process from a cron job so that my EC2 spot-price forecasts remain current.
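A hypothetical crontab entry for this, assuming Kepler’s headless command-line runner and placeholder paths, might look like:

```
# Rerun the workflow nightly at 02:00 (paths are placeholders).
0 2 * * * /opt/kepler/kepler.sh -runwf -nogui /home/me/spot-forecast.xml
```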
Kepler in the Cloud
Kepler contains actors for interacting with EC2, and Kepler-based Amazon Machine Images are publicly available [1]. This enables the provisioning, starting, and stopping of EC2 instances from within Kepler, some of which may come with Kepler pre-loaded. In this manner a Kepler pipeline running locally might initiate another Kepler pipeline on an EC2 instance, or initiate a suite of Kepler pipelines across a cluster of EC2 instances. If I run into this use case I’ll report on it.
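As an illustration of the kind of provisioning call such actors perform, here is a boto3 sketch; this is not Kepler’s internal code, and the AMI ID and instance type are placeholders:

```python
# Hedged boto3 sketch of provisioning an EC2 instance; Kepler's EC2 actors
# expose this capability visually, and the values below are placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")
instances = ec2.create_instances(
    ImageId="ami-12345678",   # e.g. a publicly available Kepler AMI
    InstanceType="m2.2xlarge",
    MinCount=1,
    MaxCount=1,
)
instance = instances[0]
instance.wait_until_running()
instance.reload()             # refresh attributes such as the DNS name
print(instance.public_dns_name)
```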
Source Code and Model File
The source code and Kepler model file used by this pipeline are attached below.
1. Jianwu Wang and Ilkay Altintas. “Early Cloud Experiences with the Kepler Scientific Workflow System.” Elsevier, 2012.