# Self-hosted (On-premise or Cloud) Runners
By default, GitHub Actions, GitLab CI/CD, and Bitbucket Pipelines workflows are executed on "native" runners hosted by GitHub, GitLab, and Bitbucket, respectively. However, there are many good reasons to use your own runners: to take advantage of GPUs, to orchestrate your team's shared computing resources, or to train in the cloud.
## Allocating Cloud Compute Resources with CML
When a workflow requires computational resources (such as GPUs), CML can automatically allocate cloud instances using `cml runner`. You can spin up instances on AWS, Azure, GCP, or Kubernetes (see below). Alternatively, you can connect any other compute provider or on-premise (local) machine.

For example, the following workflow deploys a `p2.xlarge` instance on AWS EC2 and trains a model on the instance. After the job runs, the instance automatically shuts down.
You might notice that this workflow is quite similar to the basic use case. The only additions are `cml runner` and a few environment variables for passing your cloud compute credentials to the workflow.

Note that `cml runner` will also automatically restart your jobs (whether interrupted by the GitHub Actions 35-day workflow timeout or an AWS EC2 spot instance interruption).
**GitHub Actions**

```yaml
name: CML
on: [push]
jobs:
  launch-runner:
    runs-on: ubuntu-latest
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v3
      - name: Deploy runner on EC2
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cml runner launch \
            --cloud=aws \
            --cloud-region=us-west \
            --cloud-type=p2.xlarge \
            --labels=cml-gpu
  train-and-report:
    needs: launch-runner
    runs-on: [self-hosted, cml-gpu]
    timeout-minutes: 50400 # 35 days
    container:
      image: docker://iterativeai/cml:0-dvc2-base1-gpu
      options: --gpus all
    steps:
      - uses: actions/checkout@v3
      - name: Train model
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        run: |
          pip install -r requirements.txt
          python train.py # generate plot.png
          # Create CML report
          cat metrics.txt >> report.md
          echo '![](./plot.png)' >> report.md
          cml comment create report.md
```
**GitLab CI/CD**

```yaml
launch-runner:
  image: iterativeai/cml:0-dvc2-base1
  script:
    - |
      cml runner launch \
        --cloud=aws \
        --cloud-region=us-west \
        --cloud-type=p2.xlarge \
        --cloud-spot \
        --labels=cml-gpu
train-and-report:
  needs: [launch-runner]
  tags: [cml-gpu]
  image: iterativeai/cml:0-dvc2-base1-gpu
  script:
    - pip install -r requirements.txt
    - python train.py # generate plot.png
    # Create CML report
    - cat metrics.txt >> report.md
    - echo '![](./plot.png)' >> report.md
    - cml comment create report.md
```
**Bitbucket Pipelines**

```yaml
pipelines:
  default:
    - step:
        image: iterativeai/cml:0-dvc2-base1
        script:
          - |
            cml runner launch \
              --cloud=aws \
              --cloud-region=us-west \
              --cloud-type=m5.2xlarge \
              --cloud-spot \
              --labels=cml.runner
    - step:
        runs-on: [self.hosted, cml.runner]
        image: iterativeai/cml:0-dvc2-base1
        # GPU not yet supported, see https://github.com/iterative/cml/issues/1015
        script:
          - pip install -r requirements.txt
          - python train.py # generate plot.png
          # Create CML report
          - cat metrics.txt >> report.md
          - echo '![](./plot.png)' >> report.md
          - cml comment create report.md
```
In the workflows above, the `launch-runner` job launches an EC2 `p2.xlarge` instance (`m5.2xlarge` in the Bitbucket example) in the `us-west` region. The `train-and-report` job then runs on the newly-launched instance. See Environment Variables below for details on the secrets required.
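With GitHub, for instance, these secrets can be added in the repository settings, or from the command line with the GitHub CLI (a quick sketch, assuming `gh` is installed and authenticated; the secret names match the workflow above):

```bash
# Set the secrets referenced by the workflow; gh prompts for each value
gh secret set PERSONAL_ACCESS_TOKEN
gh secret set AWS_ACCESS_KEY_ID
gh secret set AWS_SECRET_ACCESS_KEY
```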
## Docker Images
The CML Docker images (`docker://iterativeai/cml` or `docker://ghcr.io/iterative/cml`) come loaded with Python, CUDA, `git`, `node`, and other essentials for full-stack data science. Different versions of these essentials are available from different `iterativeai/cml` image tags. The tag convention is `{CML_VER}-dvc{DVC_VER}-base{BASE_VER}{-gpu}`:
| `{BASE_VER}` | Software included (`-gpu`) |
| --- | --- |
| 0 | Ubuntu 18.04, Python 2.7 (CUDA 10.1, CuDNN 7) |
| 1 | Ubuntu 20.04, Python 3.8 (CUDA 11.0.3, CuDNN 8) |
For example, `docker://iterativeai/cml:0-dvc2-base1-gpu` or `docker://ghcr.io/iterative/cml:0-dvc2-base1`.
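To verify what a given tag ships with before using it in CI, you can run the image locally (a quick sketch):

```bash
# Check the CML and Python versions bundled in a tag
docker run --rm iterativeai/cml:0-dvc2-base1 cml --version
docker run --rm iterativeai/cml:0-dvc2-base1 python --version
```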
Using your own custom Docker images: to use commands such as `cml comment create`, make sure to install CML in your Docker image.
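Since CML is distributed as an npm package (`@dvcorg/cml`), a custom image might install it along the following lines (a minimal sketch; the base image and Node.js version are illustrative choices, not requirements):

```dockerfile
FROM ubuntu:20.04
# git is needed by CML; curl is used to fetch the Node.js setup script
RUN apt-get update && apt-get install -y curl git
# CML is an npm package, so install Node.js first
RUN curl -fsSL https://deb.nodesource.com/setup_16.x | bash - \
    && apt-get install -y nodejs
# Install the CML CLI globally so `cml comment create` and friends are on PATH
RUN npm install -g @dvcorg/cml
```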
## Options
The `cml runner` command supports many options (see the command reference). Notable options are listed below; a combined example follows the list.
- `--labels=<...>`: One or more (comma-delimited) labels (e.g. `cml,gpu`).
- `--idle-timeout=<seconds>`: Seconds to wait for jobs before terminating.
- `--single`: Terminate the runner after one workflow run.
- `--reuse`: Don't launch a new runner if an existing one has the same name or overlapping labels.
- `--cloud={aws,azure,gcp,kubernetes}`: Cloud compute provider to host the runner.
- `--cloud-type={m,l,xl,m+k80,m+v100,...}`: Instance type. Also accepts native types such as `t2.micro`.
- `--cloud-gpu={nogpu,k80,v100,tesla}`: GPU type.
- `--cloud-hdd-size=<...>`: Disk storage in GB.
- `--cloud-spot`: Request a preemptible spot instance.
- `--cloud-spot-price=<...>`: Maximum spot instance bidding price in USD.
- `--cloud-region={us-west,us-east,eu-west,eu-north,...}`: Region where the instance is deployed. Also accepts a native AWS/Azure region or GCP zone.
- `--cloud-permission-set=<...>`: AWS instance profile or GCP instance service account.
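For example, the following command (a sketch combining only the options above) launches a reusable spot GPU runner that terminates after ten idle minutes:

```bash
cml runner launch \
  --cloud=aws \
  --cloud-region=us-west \
  --cloud-type=m+v100 \
  --cloud-spot \
  --idle-timeout=600 \
  --reuse \
  --labels=cml-gpu
```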