
Bidding for Builds

This post covers how to use EC2 spot instances for continuous integration (CI), and why you would want to.

Really, a whole post on that?

For a CI system to be usable, it must fulfill specific needs.

  • Keeping builds (largely!) reproducible
  • Providing access control
  • Delivering logs
  • Allowing builds to be triggered via multiple means
  • Accurately allocating compute resources
  • Allowing artifact archival
  • Allowing arbitrary command execution

If you think about these needs for a bit, a well-developed CI system begins to look like a simplistic execution platform.

Such an execution platform was required for an internal Two Six Labs project. Retrofitting CI was a good way to meet that need, and this post covers the details.

The starting point – Gitlab CI

Gitlab CI addresses all of these needs in one tightly integrated package. Using it allows us to centralize user permissions across both source control and our “simplistic execution platform”. Gitlab CI also provides the ability to trigger jobs via webhooks and to pass environment variables to jobs as arguments. When a job is triggered, its environment variables are saved, checking the “builds must be reproducible” box.
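
For illustration, a job can be kicked off with nothing more than an HTTP POST against the pipeline trigger API. The sketch below is not taken from our setup; the host, project ID, trigger token, and variable name are all placeholders.

```python
# Hypothetical example: trigger a pipeline and hand a variable to its jobs.
# The host, project ID, trigger token, and variable name are placeholders.
import requests

response = requests.post(
    "https://gitlab.example.com/api/v4/projects/42/trigger/pipeline",
    data={
        "token": "TRIGGER_TOKEN",
        "ref": "master",
        # Exposed to the triggered jobs as the environment variable DATASET_VERSION.
        "variables[DATASET_VERSION]": "v1.2.3",
    },
)
response.raise_for_status()
print(response.json()["id"])  # ID of the pipeline that was created
```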

Gitlab CI also supports, to varying degrees, a handful of “executors”. I say “executors” because the one we use, Docker Machine, is more a poorly supported provisioner than an executor.

Docker Machine

[Diagram: Docker Machine workers]
One of Docker Machine’s many features is handling most of the details involved in spinning up and tearing down EC2 spot instances, such as setting a maximum bid price.

Provided the commands you need to run can be executed within a Docker container, Docker Machine can serve as a decent provisioner, providing compute only as you need it.
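
For illustration, this is roughly what asking Docker Machine for a spot worker looks like when invoked by hand (Gitlab Runner normally issues the equivalent call itself via its MachineOptions); the machine name, instance type, region, and bid below are placeholders.

```python
# Illustrative only: ask Docker Machine for a spot worker by hand.
# In our setup Gitlab Runner issues the equivalent call via its MachineOptions.
import sh

docker_machine = sh.Command("docker-machine")
docker_machine(
    "create",
    "--driver", "amazonec2",
    "--amazonec2-request-spot-instance",      # request a spot instance...
    "--amazonec2-spot-price", "0.08",         # ...with this maximum bid, in USD/hour
    "--amazonec2-instance-type", "m5.large",  # placeholder instance type
    "--amazonec2-region", "us-east-1",        # placeholder region
    "ci-worker-example",                      # placeholder machine name
)
```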

Making Gitlab work well with Docker Machine

Spot instances are, as you may know, fantastically low-cost. They are low-cost because their availability, both before and after provisioning, is not guaranteed. Spot instance pricing changes with regional and zone-level demand, and provisioned spot instances can be terminated after a two-minute warning.

Gitlab CI does not address spot instance termination. If a spot instance running a job is terminated, Gitlab CI eventually marks that job as failed. This is problematic because, regardless of what you’re using it for, knowing whether a task has failed or completed is too useful and too basic a feature to lose. Our workaround for this issue is the script check_termination.py.

Handling Terminations

check_termination.py is wrapped in a user-data.sh file and passed to the instance as Docker Machine provisions it. user-data.sh configures a cron job to run check_termination.py every thirty seconds, allowing it to cancel all jobs if the instance is marked for termination. There’s a bit more to this script, so I’m going to go through it function-by-function.

Imports

The interesting thing here is the docker package. I doubt there is a higher-quality Docker library in any language.

Main

This script checks if any jobs are in need of termination and performs the termination if so.

Check Termination

This function determines if the instance is to be terminated and if that termination needs to be addressed.
One thing to note is that the IP being checked (the instance metadata address) is present on all instances on AWS and AWS clones.
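
As a minimal sketch (the real script does more than this), the termination check boils down to polling the instance metadata service:

```python
# A sketch of the termination check: poll the instance metadata service,
# which answers at 169.254.169.254 on AWS and on AWS clones.
import requests

def marked_for_termination():
    try:
        response = requests.get(
            "http://169.254.169.254/latest/meta-data/spot/termination-time",
            timeout=2,
        )
    except requests.RequestException:
        return False
    # The endpoint returns 404 until a termination is scheduled; once it is,
    # the body contains the time at which the instance will be reclaimed.
    return response.status_code == 200
```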

Wall All

This function is mainly for debugging, using wall to broadcast messages to all ttys.

Terminate Jobs

This function has several roles. It first acquires the job ID of the Gitlab CI job on the runner, then cancels that job so it is not marked as having failed. Lastly, it retries the job, allowing it to complete without user intervention.
It will also run the script /exit_cleanly.sh if it exists, which is useful if your jobs are stateful in a way CI doesn’t quite support.
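
The sketch below is an assumption about how this might be wired together: it reads CI_JOB_ID and CI_PROJECT_ID from the environment of the job containers the local Docker daemon is running, then calls the Gitlab jobs API to cancel and retry them. The Gitlab host and the GITLAB_API_TOKEN environment variable are placeholders.

```python
# An assumed wiring of the steps described above: find the job containers the
# local Docker daemon is running, read their CI_JOB_ID/CI_PROJECT_ID, then
# cancel and retry each job through the Gitlab API.
import os
import subprocess

import docker
import requests

GITLAB_URL = "https://gitlab.example.com"          # placeholder host
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_API_TOKEN"]}  # placeholder token

def terminate_jobs():
    client = docker.from_env()
    for container in client.containers.list():
        env = dict(
            item.split("=", 1)
            for item in (container.attrs["Config"].get("Env") or [])
            if "=" in item
        )
        job_id, project_id = env.get("CI_JOB_ID"), env.get("CI_PROJECT_ID")
        if not job_id or not project_id:
            continue
        base = f"{GITLAB_URL}/api/v4/projects/{project_id}/jobs/{job_id}"
        requests.post(f"{base}/cancel", headers=HEADERS)  # don't let it "fail"
        requests.post(f"{base}/retry", headers=HEADERS)   # requeue it elsewhere
    if os.path.exists("/exit_cleanly.sh"):
        subprocess.run(["/bin/bash", "/exit_cleanly.sh"])  # stateful-job hook
```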

User Data

This user-data.sh file contains check_termination.py. Docker Machine has AWS execute this script on instances once they are provisioned.

Zombies?!?!

A problematic bug I had to work around was Docker Machine abandoning provisioned instances, seemingly when it is rate-limited by AWS. The percentage of machines abandoned increases with the number of machines provisioned at once. Fortunately, when this bug occurs, the instance in question is never tagged. As we only use Docker Machine to provision instances for CI, this allowed us to find and terminate instances meeting that criterion. The script we use is spot_sniper.py.

Spot Sniper

This script terminates abandoned spot instances. While it is straightforward, it serves a vital purpose: preventing runaway AWS bills. Abandoned instances are not counted against the configured resource limits, allowing them to accumulate.

Imports

Some standard imports.

Main

This script looks for and terminates abandoned spot instances.
There is a bug somewhere between Docker Machine and Gitlab Runner that causes instances to be abandoned.
Instances abandoned by this bug are identifiable by their lack of a Name tag combined with their membership in the docker-machine security group.
This script also terminates instances abandoned by min_bid_price.py as the instances that script provisions are configured to look abandoned.
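
A minimal sketch of that zombie test, assuming a single region and the default docker-machine security group name:

```python
# A sketch of the zombie test: running instances in the docker-machine
# security group that were never given a Name tag. Region is a placeholder.
import boto3

def find_zombies(region="us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_instances(
        Filters=[
            {"Name": "instance.group-name", "Values": ["docker-machine"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    zombies = []
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            tag_keys = {tag["Key"] for tag in instance.get("Tags", [])}
            if "Name" not in tag_keys:
                zombies.append(instance["InstanceId"])
    return zombies
```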

Kill With Fire

This function terminates abandoned instances. Its main purpose is to allow zombie instances to be killed with fire, instead of simply being terminated.

Crontab

This cron job runs every 3 minutes. Its period needs to be tuned to minimize waste without causing excessive rate-limiting.

Optimizing Instances Used

As spot instance prices vary across regions, zones within those regions, instance types, and time, costs can be minimized by checking across those axes. We wrote the script min_bid_price.py to do this.

Min Bid Price

While min_bid_price.py was initially intended to be a cron-run script that selects the cheapest combination of region, zone, and instance type, we also needed it to determine instance availability. We found that requesting a few spot instances, waiting a few seconds, and checking whether those instances were available was an effective way to do this.
The following breakdown details the what and why of each component of min_bid_price.py.

Imports

There are a couple of interesting imports in this script:

The sh package wraps binaries on $PATH, allowing them to be used in as Pythonic a way as possible without a dedicated library.

total_ordering is a decorator that, provided equality and one comparison operator are defined, generates the remaining comparison operators.

Instance Profile Class

instance_profile obtains, stores and simplifies the sorting of pricing info from AWS.
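
A rough sketch of what such a class can look like (the real class also fetches the pricing info itself); with total_ordering, only __eq__ and __lt__ need to be written for a list of profiles to be sortable by price.

```python
# A rough sketch of an instance_profile-style class. With total_ordering,
# defining __eq__ and __lt__ is enough to sort a list of profiles by price.
from functools import total_ordering

@total_ordering
class InstanceProfile:
    def __init__(self, region, zone, instance_type, price):
        self.region = region
        self.zone = zone
        self.instance_type = instance_type
        self.price = float(price)

    def __eq__(self, other):
        return self.price == other.price

    def __lt__(self, other):
        return self.price < other.price
```

With that in place, sorted(profiles)[0] is the cheapest candidate.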

Main

There’s a fair amount going on here, so the walkthrough of this function is broken into a few pieces:

This code block specifies instances, regions and zones to be considered for use:

Here I specify the AMI to use in each region, as AMIs are not available across regions:

The following code block specifies the criteria an instance_profile must meet to be usable. It specifies that an instance_profile must enable the provisioning of 3 instances, via 3 separate spot instance requests, in under 10 seconds when the maximum bid price is $0.08 per hour:

The following snippet ensures the system configuration can safely be updated before firing off AWS requests to find a suitable instance_profile and applying the new configuration.
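
Putting those pieces together, the overall flow looks roughly like the sketch below. The helper names mirror the sections that follow, but these signatures and constants are simplifications of mine, not the script’s exact interface.

```python
# A simplified sketch of the flow; the helper names mirror the sections below,
# but these signatures and constants are illustrative, not the script's own.
INSTANCE_TYPES = ["m5.large", "c5.large"]             # placeholder candidates
REGION_AMIS = {"us-east-1": "ami-0123456789abcdef0"}  # placeholder region -> AMI
MAX_BID = 0.08        # USD per hour
TEST_INSTANCES = 3    # spot requests per availability test
TIMEOUT_SECONDS = 10  # how long those requests may take to be fulfilled

def main():
    if not safe_to_configure():      # don't reconfigure under running jobs
        return
    for profile in get_price_list(INSTANCE_TYPES, list(REGION_AMIS)):
        if spot_test(profile):       # cheap *and* actually available?
            update_config(profile)   # point Gitlab Runner at this profile
            break
```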

Update Config

The following block of code uses sed, via sh, to edit Gitlab Runner’s /etc/gitlab-runner/config.toml configuration file.
If you anticipate needing to use multiple instance types, use the toml package instead of sh and sed here.
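
A hedged sketch of that edit, assuming the instance type appears in a MachineOptions entry of the form amazonec2-instance-type=...; the real script also swaps region, zone, and AMI in the same way.

```python
# A hedged sketch of the sed-based edit; the key name and file layout are
# assumptions about how MachineOptions is written in config.toml.
import sh

def update_config(profile, path="/etc/gitlab-runner/config.toml"):
    sh.sed(
        "-i",
        's/amazonec2-instance-type=[^"]*/amazonec2-instance-type={}/'.format(
            profile.instance_type
        ),
        path,
    )
```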

Get Price List

The following function creates a boto3 client for each region being considered, and uses those clients to create a price-sorted list of instance_profile objects.
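
A sketch of that function using describe_spot_price_history and the InstanceProfile sketch from earlier; the lookback window and product description are assumptions.

```python
# A sketch of building a price-sorted list of profiles from the spot price
# history API. Lookback window and product description are assumptions.
from datetime import datetime, timedelta

import boto3

def get_price_list(instance_types, regions):
    profiles = []
    for region in regions:
        client = boto3.client("ec2", region_name=region)
        history = client.describe_spot_price_history(
            InstanceTypes=instance_types,
            ProductDescriptions=["Linux/UNIX"],
            StartTime=datetime.utcnow() - timedelta(minutes=10),
        )
        for entry in history["SpotPriceHistory"]:
            profiles.append(
                InstanceProfile(  # from the Instance Profile Class sketch
                    region,
                    entry["AvailabilityZone"],
                    entry["InstanceType"],
                    entry["SpotPrice"],
                )
            )
    return sorted(profiles)  # cheapest first, courtesy of total_ordering
```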

Safe to Configure

This function determines whether it is safe to update the system configuration. It does so by ensuring that both of the following hold (a sketch follows the list):

  • No CI jobs are running, and
  • No non-zombie Docker Machine instances are running
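
How these two checks are performed is not spelled out above, so the sketch below is an assumption: it treats an empty docker-machine machine list as “no CI jobs” (each machine runs a single job, since MaxBuilds is 1), and reuses the zombie criterion from spot_sniper.py to look for live, named workers. The region argument is a placeholder.

```python
# An assumed implementation of both checks: an empty docker-machine list means
# no jobs, and any running, *named* instance in the docker-machine security
# group is a live worker rather than a zombie.
import boto3
import sh

def safe_to_configure(region="us-east-1"):
    docker_machine = sh.Command("docker-machine")
    if str(docker_machine("ls", "-q")).strip():
        return False
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_instances(
        Filters=[
            {"Name": "instance.group-name", "Values": ["docker-machine"]},
            {"Name": "instance-state-name", "Values": ["running"]},
            {"Name": "tag-key", "Values": ["Name"]},
        ]
    )
    return not any(r["Instances"] for r in response["Reservations"])
```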

Spot Test

This function tests instance_profile objects to determine their usability, by checking whether instances can be provisioned quickly enough for the instance_profile in question.
If you’re wondering why instance_profile itself is absent, the function operates on the components of an instance_profile.

Spot Up

This function requests the specified number of spot instances and returns a list of the IDs of those requests.
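
A sketch of that request, trimmed to the fields needed for illustration; a real launch specification would also carry key pairs, security groups, and so on.

```python
# A sketch of requesting spot instances and collecting the request IDs.
import boto3

def spot_up(profile, ami, count, max_bid):
    client = boto3.client("ec2", region_name=profile.region)
    response = client.request_spot_instances(
        InstanceCount=count,
        SpotPrice=str(max_bid),
        LaunchSpecification={
            "ImageId": ami,
            "InstanceType": profile.instance_type,
            "Placement": {"AvailabilityZone": profile.zone},
        },
    )
    return [
        request["SpotInstanceRequestId"]
        for request in response["SpotInstanceRequests"]
    ]
```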

Spot Stop

This function cancels outstanding spot instance requests. It can fail when the system is being rate-limited by AWS. Requests that are not cancelled will be fulfilled and then cleaned up by spot_sniper.py. This failure is permitted because it results in stderr being emailed via cron, letting us know not to slam the system with jobs for a couple of minutes.

Spot Down

This function terminates provisioned spot instances. It can fail when the system is being rate limited by AWS, in which case spot_sniper.py will clean up the provisioned instances during its next pass.
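
Sketches of both cleanup calls; as noted above, either can be throttled by AWS, in which case spot_sniper.py eventually mops up whatever was left behind.

```python
# Sketches of the two cleanup calls.
import boto3

def spot_stop(region, request_ids):
    client = boto3.client("ec2", region_name=region)
    client.cancel_spot_instance_requests(SpotInstanceRequestIds=request_ids)

def spot_down(region, instance_ids):
    client = boto3.client("ec2", region_name=region)
    client.terminate_instances(InstanceIds=instance_ids)
```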

Check Instance Type in AZ

This function checks the status of the spot instance requests every five seconds until either the specified number of retries has been made, all requests are fulfilled, or one request will not be fulfilled.

Spot Request Status

This function returns a list of the statuses of spot requests made.

Request Status Check

This function reduces a list of spot request statuses to a boolean once their success can be determined, returning None if it cannot yet be determined.
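
A combined sketch of the three functions above; which status codes count as “will never be fulfilled” is an assumption here.

```python
# A combined sketch of the polling loop and the status reduction it relies on.
import time

import boto3

def spot_request_status(region, request_ids):
    client = boto3.client("ec2", region_name=region)
    response = client.describe_spot_instance_requests(
        SpotInstanceRequestIds=request_ids
    )
    return [request["Status"]["Code"] for request in response["SpotInstanceRequests"]]

def request_status_check(statuses):
    dead = {"capacity-not-available", "price-too-low", "canceled-before-fulfillment"}
    if any(code in dead for code in statuses):
        return False              # at least one request will never be fulfilled
    if all(code == "fulfilled" for code in statuses):
        return True               # every request was fulfilled
    return None                   # still pending; keep polling

def check_instance_type_in_az(region, request_ids, retries=2):
    for _ in range(retries):
        time.sleep(5)
        result = request_status_check(spot_request_status(region, request_ids))
        if result is not None:
            return result
    return False                  # ran out of retries: too slow to be usable
```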

Crontab

This crontab runs min_bid_price.py every 10 minutes.
The period of this cron needs to be tuned for your use case.
If the period is too long, the system configuration is less likely to have been updated recently when users are active.
If it is too short, the cost of determining instance availability increases, as the test instances are billed for a full first minute.

Config, config, config…

AMI

As bandwidth costs on AWS can add up and Spot Instance usage is billed by the second (after the first minute), we pre-load a handful of images we use often into the AMI used by Docker Machine to provision instances.

Gitlab Runner Config

This is the config.toml we use for Gitlab Runner. Key points to note are the volume mounts it configures, and the max builds limitation. The volume mounts are configured to allow CI jobs to use volume mounts of their own. MaxBuilds being set to 1 prevents port conflicts from occurring and ensures that all jobs are run in a clean environment.
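
Rather than paste our file, here is a sketch of the settings just described, rendered with the toml package mentioned earlier. The URL, token, image, and counts are placeholders, and only the keys relevant to this discussion are shown.

```python
# Not our actual file: a sketch of the runner settings described above.
# URL, token, image, and counts are placeholders.
import toml

runner_config = {
    "concurrent": 80,
    "runners": [{
        "name": "spot-runner",
        "url": "https://gitlab.example.com/",
        "token": "RUNNER_TOKEN",
        "executor": "docker+machine",
        "docker": {
            "image": "docker:stable",
            # Mount the host Docker socket so jobs can declare volume mounts of their own.
            "volumes": ["/var/run/docker.sock:/var/run/docker.sock", "/cache"],
        },
        "machine": {
            "MaxBuilds": 1,  # one job per machine: clean environment, no port conflicts
            "IdleCount": 0,  # no idle machines; provision only when jobs arrive
            "MachineDriver": "amazonec2",
            "MachineName": "ci-worker-%s",
            "MachineOptions": [
                "amazonec2-request-spot-instance=true",
                "amazonec2-spot-price=0.08",
            ],
        },
    }],
}

print(toml.dumps(runner_config))
```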

Docker Daemon Config

The following is the contents of /etc/docker/daemon.json on all CI machines. It configures the Docker daemon to use Google’s mirror of Dockerhub when Dockerhub is down or having reliability issues. It also limits the size of Docker logs (a source of many filled disks).
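
A sketch of what such a daemon.json can look like; the log-size limits are placeholder values, while mirror.gcr.io is Google’s public mirror of Dockerhub.

```python
# A sketch of the daemon.json settings described above; log limits are placeholders.
import json

daemon_config = {
    "registry-mirrors": ["https://mirror.gcr.io"],
    "log-driver": "json-file",
    "log-opts": {"max-size": "50m", "max-file": "3"},  # cap per-container log size
}

with open("/etc/docker/daemon.json", "w") as handle:
    json.dump(daemon_config, handle, indent=2)
```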

Metrics

The following table contains some metrics on the cost of our configuration over the past six months:

| Instance Type         | Instance Count | Total Job Hours | Cost (USD) | Cost Relative to On Demand, Always On, 4 Instances | Turnaround Time Relative to On Demand, Always On, 4 Instances |
|-----------------------|----------------|-----------------|------------|-----------------------------------------------------|----------------------------------------------------------------|
| On demand, Always On  | 4              | 1938.74         | 3423.84    | 100%                                                | 100%                                                           |
| On demand, As Needed  | 80 Max         | 1938.74         | 378.87     | 11.07%                                              | 0.05%                                                          |
| Spot                  | 80 Max         | 1938.74         | 80.69      | 2.36%                                               | 0.05%                                                          |

The following histogram shows the durations of jobs run since we started using CI:

[Histogram: job durations]

The following plot shows the maximum number of jobs we’ve had running at once, over time:

[Plot: maximum concurrent jobs over time]

One thing these metrics do not capture is the impact that checking instance_profile availability has on job durations.

Before we ran this check, job startup times would often reach 6 minutes, and jobs would occasionally end up stuck in a “Pending” state due to a lack of available compute.

Job startup times now rarely exceed 1 minute, and jobs no longer get stuck “Pending”.

Changes since this was started

Docker Machine Gitlab MR

This MR, which was included in Gitlab 11.1, raised the number of CI jobs we could have running at once to at least 80. Given the performance claims and the number of jobs we could run at once before the MR was merged, I would guess we could now run somewhere between 200 and 250 jobs at once before being rate-limited by AWS.
https://gitlab.com/gitlab-org/gitlab-runner/merge_requests/909

Meltano

While working on this project, Gitlab announced their Meltano project. Although Meltano’s goal might not be enabling the use of CI to process versioned data, that will almost certainly be a component of it. As the purpose of our CI configuration was to allow us to use CI to process versioned data, I expect its performance and capabilities to improve as bugs related to Gitlab Runner and Docker Machine are addressed for Meltano.
https://about.gitlab.com/2018/08/01/hey-data-teams-we-are-working-on-a-tool-just-for-you/

Spot Pricing Change

AWS recently changed how they calculate the price of spot instances, smoothing price fluctuations.
While this change reduces the benefit of searching for the optimal instance_profile, doing so still allows us to use the cheapest instances meeting our compute, startup-time, and capacity requirements.
https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pricing/

So…

That leaves us with more than we need to get our jobs run in a timely manner.

This paragraph was initially going to be:

There are a few yet-to-be-a-problem cases these scripts have not addressed, such as ignoring sold-out instance-region-zone combinations, automatically restarting jobs cancelled due to spot price increases, and periodically generating (and using) new pre-loaded AMIs.

but, as our needs grew, all those problems had to be addressed.

See https://github.com/twosixlabs/gitlab-ci-scripts for a more copy-paste friendly version of the scripts on this page.
