
Chronos is our replacement for cron. It is a distributed, fault-tolerant scheduler that runs on top of Mesos. It is a Mesos framework and supports custom Mesos executors as well as the default command executor, so by default Chronos executes sh scripts (Bash on most systems). Chronos can be used to interact with systems such as Hadoop (including EMR), even if the Mesos slaves on which execution happens do not have Hadoop installed. Included wrapper scripts allow transferring files to a remote machine, executing them there in the background, and using asynchronous callbacks to notify Chronos of job completion or failure.

Chronos has a number of advantages over regular cron. It lets you schedule jobs using ISO 8601 repeating interval notation, which allows more flexible scheduling. Chronos also supports jobs that are triggered by the completion of other jobs, with arbitrarily long dependency chains.
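To make the interval notation concrete, here is a minimal Python sketch of how a repeating interval such as `R3/2013-03-15T00:00:00Z/PT1H` (repeat three times, starting at the given instant, one hour apart) expands into run times. This is not Chronos code: the function names `parse_repeating_interval` and `run_times` are hypothetical, only fixed-length duration fields (weeks, days, hours, minutes, seconds) are handled, and the sketch assumes the repetition count is the total number of runs.

```python
import re
from datetime import datetime, timedelta

# Matches "R<n>/<start>/<duration>"; an empty <n> means "repeat forever".
_INTERVAL = re.compile(r"^R(?P<count>\d*)/(?P<start>[^/]+)/(?P<period>P.+)$")

# Assumption: only fixed-length duration fields are supported in this sketch
# (no months or years, whose length depends on the calendar).
_DURATION = re.compile(
    r"^P(?:(?P<weeks>\d+)W)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_repeating_interval(text):
    """Return (repetitions or None, start datetime, period timedelta)."""
    match = _INTERVAL.match(text)
    if match is None:
        raise ValueError(f"not a repeating interval: {text!r}")

    duration = _DURATION.match(match["period"])
    if duration is None:
        raise ValueError(f"unsupported duration: {match['period']!r}")
    fields = {name: int(value) for name, value in duration.groupdict().items() if value}
    if not fields:
        raise ValueError(f"empty duration: {match['period']!r}")

    # datetime.fromisoformat() in older Python versions rejects a trailing "Z".
    start = datetime.fromisoformat(match["start"].replace("Z", "+00:00"))
    count = int(match["count"]) if match["count"] else None
    return count, start, timedelta(**fields)


def run_times(text, limit=5):
    """List up to `limit` scheduled run times for a repeating interval."""
    count, start, period = parse_repeating_interval(text)
    total = limit if count is None else min(count, limit)
    return [start + i * period for i in range(total)]


if __name__ == "__main__":
    for moment in run_times("R3/2013-03-15T00:00:00Z/PT1H"):
        print(moment.isoformat())
```

Unlike a cron expression, the interval carries an explicit start time and repetition count, so a schedule like "every hour, three times, starting on 15 March" fits in a single string.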

Chronos is available on GitHub.

The Backstory

At Airbnb, we rely heavily on data analysis to build great products. Our data pipeline consists of many technologies, such as Hadoop, MySQL, Amazon Redshift and S3. Our software engineers and analysts use a mix of Cascading, Cascalog, Hive and Pig to interface with Hadoop. We have scripts that export tables from a vast number of databases into S3, and we use various ETL (extract, transform and load) processes to turn blobs of bytes into meaningful information. Many of these transformations consist of multiple steps, and some tables are composed from a myriad of data sources and joins.

We’re not in a private datacenter, and we aren’t running our own Hadoop cluster. Instead, we use a managed Hadoop product from Amazon called Elastic MapReduce (EMR). High variance in network latency, virtualization overhead and unpredictable I/O performance are ongoing challenges in a cloud environment. There are many sources of errors. For example, calls to web services are subject to timeouts.

In a complex processing pipeline, every step increases the chance of failure. Until December last year, we relied on a single instance running cron to kick off our hourly, daily and weekly ETL jobs. Cron is a really great tool, but we wanted a system that allowed retries, was lightweight, and provided an easy-to-use interface that gives analysts quick insight into which jobs failed and which succeeded.

We also wanted a system that was highly available and could manage any workload, not just Hadoop jobs. Other requirements were that the system could still run Bash scripts and fan out the workload to many machines. Because we export many tables, we didn’t want to execute everything on one box, but we still wanted central management. At the same time, we began looking at Mesos for our data infrastructure. We therefore decided to build a new lightweight, fault-tolerant scheduling tool, which we named Chronos, that would run on top of Mesos and use Mesos’ primitives for storing state and distributing work. Mesos also allowed us to dynamically add new workers to our pool without having to change the configuration of the existing cluster.

Chronos UI

Chronos comes with a UI which can be used to add, delete, list, modify and run jobs. It can also show a graph of job dependencies. These screenshots should give you a good idea of what Chronos can do.

[Screenshot: sample Chronos UI]

Check it out on GitHub

Over the past few weeks, we have open-sourced Chronos. You can check it out on our GitHub page: http://airbnb.github.com/chronos

Here’s the video from our Tech Talk on Chronos: https://www.youtube.com/watch?v=FLqURrtS8IA

Want to work with us? We're hiring!