by iterative.ai

CML with DVC

In many ML projects, data isn't stored in a Git repository and needs to be downloaded from external sources. DVC is a common way to bring data to your CML runner. DVC also lets you run pipelines and plot changes in metrics for inclusion in CML reports.

[Figure: example of a long CML report generated with DVC]

The .github/workflows/cml.yaml file to create this report is:

name: CML & DVC
on: [push]
jobs:
  run:
    runs-on: ubuntu-latest
    container: docker://ghcr.io/iterative/cml:0-dvc2-base1
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - name: Train model
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          pip install -r requirements.txt  # Install dependencies
          dvc pull data --run-cache        # Pull data & run-cache from S3
          dvc repro                        # Reproduce pipeline
      - name: Create CML report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          echo "## Metrics" >> report.md
          dvc metrics diff master --show-md >> report.md

          # Publish confusion matrix diff
          echo "## Plots" >> report.md
          echo "### Class confusions" >> report.md
          dvc plots diff \
            --target classes.csv \
            --template confusion \
            -x actual \
            -y predicted \
            --show-vega master > vega.json
          vl2png vega.json -s 1.5 | cml publish --md >> report.md

          # Publish regularization function diff
          echo "### Effects of regularization" >> report.md
          dvc plots diff \
            --target estimators.csv \
            -x Regularization \
            --show-vega master > vega.json
          vl2png vega.json -s 1.5 | cml publish --md >> report.md

          cml send-comment report.md

See the example repository for more, or check out the use cases for machine learning.

Cloud Storage Provider Credentials

Many cloud storage providers are supported. Authentication credentials can be provided via environment variables. Here are examples for some of the most frequently used providers:

  • Amazon S3:
      • AWS_ACCESS_KEY_ID
      • AWS_SECRET_ACCESS_KEY
      • AWS_SESSION_TOKEN (optional)
  • Azure Blob Storage:
      • AZURE_STORAGE_CONNECTION_STRING
      • AZURE_STORAGE_CONTAINER_NAME
  • Aliyun OSS:
      • OSS_BUCKET
      • OSS_ACCESS_KEY_ID
      • OSS_ACCESS_KEY_SECRET
      • OSS_ENDPOINT
  • Google Cloud Storage:
      • GOOGLE_APPLICATION_CREDENTIALS: the path to a service account JSON file
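
As an illustration of passing these variables to a workflow step, the sketch below mirrors the AWS example above for an Azure remote. It assumes the DVC remote is already configured to point at Azure, that DVC is available on the runner, and that repository secrets with these names exist; the secret names are a convention chosen here, not a requirement.

```yaml
steps:
  - uses: actions/checkout@v2
  - name: Pull data from Azure
    env:
      # Assumed secret names; define them in the repository settings
      AZURE_STORAGE_CONNECTION_STRING: ${{ secrets.AZURE_STORAGE_CONNECTION_STRING }}
      AZURE_STORAGE_CONTAINER_NAME: ${{ secrets.AZURE_STORAGE_CONTAINER_NAME }}
    run: |
      dvc pull   # Reads the credentials from the environment
```

Storing the values as secrets keeps them out of the repository and out of the workflow logs.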

GitHub Actions: setup-dvc

The iterative/setup-dvc action installs DVC (similar to what setup-cml does for CML).

This action works on Ubuntu, macOS, and Windows runners. On Windows, Python 3 must be set up first.

steps:
  - uses: actions/checkout@v2
  - uses: iterative/setup-dvc@v1

On Windows, with Python 3 set up first:

runs-on: windows-latest
steps:
  - uses: actions/checkout@v2
  - uses: actions/setup-python@v2
    with:
      python-version: '3.x'
  - uses: iterative/setup-dvc@v1
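
The two setup actions can also be combined in a job that does not use the CML container. The following is a minimal sketch under several assumptions: the default branch is named master, metrics are tracked by DVC, the remote is on S3 with the secrets shown earlier, and the job name `report` is arbitrary.

```yaml
jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0   # Fetch full history so master can be compared
      - uses: iterative/setup-dvc@v1
      - uses: iterative/setup-cml@v1
      - name: Compare metrics and comment
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull                                        # Fetch DVC-tracked files from the remote
          dvc metrics diff master --show-md >> report.md  # Metrics table in Markdown
          cml send-comment report.md                      # Post the report as a comment
```

This variant omits the plot steps, since vl2png may not be available outside the container image.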

A specific DVC version can be installed using the version argument (defaults to the latest release).

- uses: iterative/setup-dvc@v1
  with:
    version: '1.0.1'
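
To confirm which version ended up on the runner, a follow-up step can print it. This is only a sketch; the step name is arbitrary.

```yaml
steps:
  - uses: actions/checkout@v2
  - uses: iterative/setup-dvc@v1
    with:
      version: '1.0.1'
  - name: Check DVC version
    run: dvc --version   # Prints the installed DVC version to the job log
```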