GitHub Actions, GitLab CI/CD, and Bitbucket Pipelines workflows are executed on "native" runners (hosted by GitHub/GitLab/Bitbucket respectively) by default. However, there are many great reasons to use your own runners: to take advantage of GPUs, orchestrate your team's shared computing resources, or train in the cloud.
When a workflow requires computational resources (such as GPUs), CML can
automatically allocate cloud instances using cml runner
. You can spin up
instances on AWS, Azure, GCP, or Kubernetes
(see below). Alternatively, you can
connect
any other compute provider or on-premise (local) machine.
For example, the following workflow deploys a p2.xlarge
instance on AWS EC2
and trains a model on the instance. After the job runs, the instance
automatically shuts down.
You might notice that this workflow is quite similar to the
basic use case. The only addition is cml runner
and a few
environment variables for passing your cloud compute credentials to the
workflow.
Note that `cml runner` will also automatically restart your jobs (whether due to
the GitHub Actions 35-day workflow timeout or an AWS EC2 spot instance
interruption).
On GitHub Actions:

```yaml
name: CML
on: [push]
jobs:
  deploy-runner:
    runs-on: ubuntu-latest
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.sha }}
      - name: Deploy runner on EC2
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cml runner \
            --cloud=aws \
            --cloud-region=us-west \
            --cloud-type=p2.xlarge \
            --labels=cml-gpu
  train-model:
    needs: deploy-runner
    runs-on: [self-hosted, cml-gpu]
    timeout-minutes: 50400 # 35 days
    container:
      image: docker://iterativeai/cml:0-dvc2-base1-gpu
      options: --gpus all
    steps:
      - uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.sha }}
      - name: Train model
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        run: |
          pip install -r requirements.txt
          python train.py
          # Create CML report
          cat metrics.txt >> report.md
          cml publish plot.png --md --title="Confusion Matrix" >> report.md
          cml send-comment report.md
```
The equivalent GitLab CI/CD pipeline:

```yaml
deploy-runner:
  image: iterativeai/cml:0-dvc2-base1
  script:
    - |
      cml runner \
        --cloud=aws \
        --cloud-region=us-west \
        --cloud-type=p2.xlarge \
        --cloud-spot \
        --labels=cml-gpu
train-model:
  needs: [deploy-runner]
  tags:
    - cml-gpu
  image: iterativeai/cml:0-dvc2-base1-gpu
  script:
    - pip install -r requirements.txt
    - python train.py
    # Create CML report
    - cat metrics.txt >> report.md
    - cml publish plot.png --md --title="Confusion Matrix" >> report.md
    - cml send-comment report.md
```
And the equivalent Bitbucket Pipelines configuration:

```yaml
pipelines:
  default:
    - step:
        image: iterativeai/cml:0-dvc2-base1
        script:
          - |
            cml runner \
              --cloud=aws \
              --cloud-region=us-west \
              --cloud-type=m5.2xlarge \
              --cloud-spot \
              --labels=cml
    - step:
        runs-on: [self.hosted, cml]
        image: iterativeai/cml:0-dvc2-base1
        # GPU not yet supported, see https://github.com/iterative/cml/issues/1015
        script:
          - pip install -r requirements.txt
          - python train.py
          # Create CML report
          - cat metrics.txt >> report.md
          - cml publish plot.png --md --title="Confusion Matrix" >> report.md
          - cml send-comment report.md
```
In the GitHub and GitLab workflows above, the deploy-runner job launches an EC2
p2.xlarge instance in the us-west region (the Bitbucket example requests an
m5.2xlarge). The train-model job then runs on the newly launched instance. See
the environment variables section below for details on the required secrets.
🎉 Note that jobs can use any Docker container! To use commands such as
cml send-comment
from a job, the only requirement is to
have CML installed.
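If you bring your own image instead, installing CML inside it is usually a one-liner (a minimal sketch; it assumes Node.js and npm are available in the container):

```bash
# Install the CML command line tools globally; @dvcorg/cml is CML's npm package.
npm install --global @dvcorg/cml
```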
The CML Docker images (`docker://iterativeai/cml` or
`docker://ghcr.io/iterative/cml`) come loaded with Python, CUDA, `git`, `node`
and other essentials for full-stack data science. Different versions of these
essentials are available from different `iterativeai/cml` image tags. The tag
convention is `{CML_VER}-dvc{DVC_VER}-base{BASE_VER}{-gpu}`:
| {BASE_VER} | Software included (`-gpu`) |
| --- | --- |
| 0 | Ubuntu 18.04, Python 2.7 (CUDA 10.1, CuDNN 7) |
| 1 | Ubuntu 20.04, Python 3.8 (CUDA 11.0.3, CuDNN 8) |
For example, `docker://iterativeai/cml:0-dvc2-base1-gpu`, or
`docker://ghcr.io/iterative/cml:0-dvc2-base1`.
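To try one of these images locally before using it in CI (assuming Docker is installed; the tag is taken from the table above):

```bash
# Pull the GPU variant of the CML image by its tag.
docker pull iterativeai/cml:0-dvc2-base1-gpu
```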
The cml runner
command supports many options (see the
command reference). Notable options are:
- `--labels=<...>`: One or more (comma-delimited) labels (e.g. `cml,gpu`).
- `--idle-timeout=<seconds>`: Seconds to wait for jobs before terminating.
- `--single`: Terminate runner after one workflow run.
- `--reuse`: Don't launch a new runner if an existing one has the same name or
  overlapping labels.
- `--cloud={aws,azure,gcp,kubernetes}`: Cloud compute provider to host the
  runner.
- `--cloud-type={m,l,xl,m+k80,m+v100,...}`: Instance type. Also accepts native
  types such as `t2.micro`.
- `--cloud-gpu={nogpu,k80,v100,tesla}`: GPU type.
- `--cloud-hdd-size=<...>`: Disk storage in GB.
- `--cloud-spot`: Request a preemptible spot instance.
- `--cloud-spot-price=<...>`: Maximum spot instance USD bidding price.
- `--cloud-region={us-west,us-east,eu-west,eu-north,...}`: Region where the
  instance is deployed. Also accepts native AWS/Azure regions or GCP zones.
- `--cloud-permission-set=<...>`: AWS instance profile or GCP instance service
  account.

☝️ Tip! Check out the full `cml runner` command reference.
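For example, the options above can be combined to request a reusable, preemptible GPU spot instance that shuts down after ten idle minutes (an illustrative combination, not a required configuration):

```bash
# Request a reusable spot GPU runner that terminates after 600 idle seconds.
cml runner \
  --cloud=aws \
  --cloud-region=us-west \
  --cloud-type=p2.xlarge \
  --cloud-spot \
  --cloud-hdd-size=64 \
  --idle-timeout=600 \
  --labels=cml,gpu \
  --reuse
```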
Sensitive values like cloud and repository credentials can be provided through environment variables with the aid of GitHub secrets, GitLab masked variables (or external secrets for added security), or Bitbucket secured user-defined variables.
You will need to create a personal access token (PAT) with enough permissions
to register self-hosted runners. In the example workflows above, this token is
provided to CML through the REPO_TOKEN environment variable.
🛈 If using the --cloud
option, you will also need to provide access
credentials for your cloud compute resources as secrets. In the above example,
AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
(with privileges to create &
destroy EC2 instances) are required.
This token serves as a repository access credential, and is especially required
for cml runner
to function.
On GitHub, use either:

- a personal access token with the `repo` scope, or
- a GitHub App token (see below).

Ideally, you should not use personal access tokens from your own account, as
they grant access to all your repositories. Instead, it's highly recommended to
create a separate bot account that only has access to the repositories where
you plan to deploy runners. Bot accounts are the same as normal user accounts;
the only difference is the intended use case.
For instance, to use a personal access token:

1. Generate a new personal access token with the `repo` scope.
2. In your repository settings, add it as a secret named PERSONAL_ACCESS_TOKEN.

Step 2 can also be used for adding other secrets such as cloud access credentials.
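If you prefer the command line, step 2 can also be done with the GitHub CLI (an optional alternative; it assumes `gh` is installed and authenticated against your repository):

```bash
# Create or update the repository secret; gh prompts for the token value.
gh secret set PERSONAL_ACCESS_TOKEN
```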
Alternatively, a GitHub App ID (CML_GITHUB_APP_ID
) and private key
(CML_GITHUB_APP_PEM
) can be used to generate a token on-the-fly, as shown in
the example below:
```yaml
steps:
  - uses: navikt/github-app-token-generator@v1
    id: get-token
    with:
      private-key: ${{ secrets.CML_GITHUB_APP_PEM }}
      app-id: ${{ secrets.CML_GITHUB_APP_ID }}
  - uses: actions/checkout@v3
    with:
      ref: ${{ github.event.pull_request.head.sha }}
      token: ${{ steps.get-token.outputs.token }}
  - name: Train model
    env:
      REPO_TOKEN: ${{ steps.get-token.outputs.token }}
    run: |
      ...
      cml send-comment report.md
```
Note that the Apps require the following write permissions:

- Administration (for `cml runner`)
- Checks (for `cml send-github-check`)
- Pull requests (for `cml {pr,send-comment}`)
- Organization self-hosted runners (for `cml runner` on organization repositories)

On GitLab, use either:
- a personal access token with the `api`, `read_repository` and
  `write_repository` scopes, or
- a project access token with the same scopes.

For instance, to use a personal access token:
1. Navigate to User Settings → Access Tokens and generate a new token (e.g.
   named REPO_TOKEN) with the `api`, `read_repository` and `write_repository`
   scopes.
2. In your GitLab project, navigate to Settings → CI/CD → Variables → Add
   Variable and add a masked variable named REPO_TOKEN with the token as its
   value.

Step 2 can also be used for adding other masked variables such as cloud access credentials.
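Step 2 can likewise be scripted against the GitLab REST API (a sketch; `<project-id>`, `$GITLAB_TOKEN` and `$REPO_TOKEN_VALUE` are placeholders for your project ID, an API token, and the token created in step 1):

```bash
# Create a masked CI/CD variable named REPO_TOKEN.
curl --request POST \
  --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  --form "key=REPO_TOKEN" \
  --form "value=$REPO_TOKEN_VALUE" \
  --form "masked=true" \
  "https://gitlab.com/api/v4/projects/<project-id>/variables"
```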
Bitbucket Cloud does not use access tokens. Instead, create a REPO_TOKEN
variable with a Base64 encoded username and password.
Use either:

- an app password with `Read` permission for Account and `Write` permission for
  Pull requests, Pipelines, and Runners, or
- your regular account password.

In either case, the steps to create a `REPO_TOKEN` are:
1. Encode your username and password with `echo -n $USERNAME:$PASSWORD | base64`.
   The `-n` ensures the output does not contain the trailing newline that `echo`
   adds by default.
2. Add the encoded value as a repository variable named REPO_TOKEN, and mark it
   as Secured to hide credentials in all Bitbucket logs.

Step 2 can also be used for adding other secured variables such as cloud access credentials.
Note that you will also need to provide access credentials for your compute
resources. In the above example, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
are required to deploy EC2 instances.

The credentials needed for each supported compute provider are listed below.
AWS:

- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_SESSION_TOKEN` (optional)

See the AWS credentials docs for obtaining these keys.
☝️ Note The same credentials can also be used for configuring cloud storage.
Azure:

- `AZURE_CLIENT_ID`
- `AZURE_CLIENT_SECRET`
- `AZURE_SUBSCRIPTION_ID`
- `AZURE_TENANT_ID`
GCP, either one of:

- `GOOGLE_APPLICATION_CREDENTIALS_DATA`: the contents of a service account JSON
  file, or
- `GOOGLE_APPLICATION_CREDENTIALS`: the path to the JSON file.

The former is more convenient for CI/CD scenarios, where secrets are (usually) provisioned through environment variables instead of files.
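For example, when launching a GCP-hosted runner from a terminal, the first variable can be populated straight from the downloaded key file (a sketch; `service-account.json` is a placeholder, and the repository token is assumed to be available as `REPO_TOKEN`, as in the workflows above):

```bash
# Pass the service account key contents through the environment...
export GOOGLE_APPLICATION_CREDENTIALS_DATA="$(cat service-account.json)"
# ...and request a GCP-hosted runner.
cml runner \
  --cloud=gcp \
  --cloud-region=us-west \
  --cloud-type=m \
  --labels=cml
```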
Kubernetes:

- `KUBERNETES_CONFIGURATION`: the contents of a `kubeconfig` file.

The `cml runner` command can also be used to manually set up a local machine,
on-premise GPU cluster, or any other cloud compute resource as a self-hosted
runner. Simply install CML and then run:
```bash
$ cml runner \
    --repo="$REPOSITORY_URL" \
    --token="$PERSONAL_ACCESS_TOKEN" \
    --labels="local,runner" \
    --idle-timeout=180
```
The machine will listen for jobs from your repository and execute them locally.
⚠️ Warning: anyone with access to your repository (everybody for public ones) may be able to execute arbitrary code on your machine. Refer to the corresponding GitHub and GitLab documentation for additional guidance.
If cml runner
fails with a Terraform error message, setting the environment
variable TF_LOG_PROVIDER=DEBUG
may yield more information.
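For instance, when reproducing a cloud deployment failure from a terminal (a sketch; the placeholders follow the local runner example above):

```bash
# Enable verbose Terraform provider logging for this run only.
TF_LOG_PROVIDER=DEBUG cml runner \
  --repo="$REPOSITORY_URL" \
  --token="$PERSONAL_ACCESS_TOKEN" \
  --cloud=aws \
  --cloud-region=us-west \
  --cloud-type=p2.xlarge \
  --labels=cml-gpu
```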
In very rare cases, you may need to clean up CML cloud resources manually; an
example of such a problem is an EC2 instance running out of storage space. The
following resources may need to be manually cleaned up in the case of a
failure:

- the compute instance (named cml-{random-id})
- the associated SSH key pair (also named cml-{random-id})
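On AWS, for example, leftover resources can be located and removed with the AWS CLI (a sketch; it assumes the CLI is configured and that the leftovers follow the `cml-{random-id}` naming above, with `<instance-id>` and `<key-name>` as placeholders):

```bash
# Find instances whose Name tag starts with "cml-".
aws ec2 describe-instances --filters "Name=tag:Name,Values=cml-*"
# Terminate a leftover instance and delete its key pair.
aws ec2 terminate-instances --instance-ids <instance-id>
aws ec2 delete-key-pair --key-name <key-name>
```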
If you encounter these edge cases, create a GitHub Issue with as much detail as possible. If possible, link your workflow in the issue or provide an example of your workflow's YAML.
Additionally, try to capture and include logs from the instance. For easy local
access and debugging on the `cml runner` instance, check our example on using
the `--cloud-startup-script` option.
Then you can run the following:
```bash
$ ssh ubuntu@instance_public_ip
$ sudo journalctl -n all -u cml.service --no-pager > cml.log
$ sudo dmesg --ctime > system.log
```
☝️ Note Please give your cml.log a visual scan: entries like IP addresses and
Git repository names may be present and considered sensitive in some cases.
You can then copy those logs to your local machine with:
```bash
$ scp ubuntu@instance_public_ip:~/cml.log .
$ scp ubuntu@instance_public_ip:~/system.log .
```
If the SSH command hangs, there is a chance the instance is severely broken; in that case, reboot it from the web console and try the commands again.