CI/CD Integration
Integrate LLM evaluation into your continuous integration and deployment pipeline
Why CI/CD for LLMs?
Just like traditional software, your LLM applications need automated testing in the development workflow:
Catch Regressions Early
Detect quality degradation before it reaches production
Faster Iteration
Get immediate feedback on prompt and model changes
Team Confidence
Deploy with confidence knowing tests have passed
Compliance & Audit
Maintain test history and quality standards
🚀 One-Command CI Setup (EvalGate 2.0.0)
With EvalGate 2.0.0, you get a complete CI pipeline in a single command:
name: EvalGate CI
on: [push, pull_request]
jobs:
  evalai:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npx evalgate ci --format github --write-results --base main
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: evalai-results
          path: .evalai/
That's it! This single command runs your evaluation suite, compares the results against the main branch, writes them to .evalai/, and emits a GitHub-formatted summary.
Legacy Setup (Pre-2.0.0)
For existing workflows, you can use the traditional regression gate:
Zero-config alternative: run npx @evalgate/sdk init to auto-generate this workflow.
name: EvalGate CI Gate
on:
  pull_request:
    branches: [main]
jobs:
  eval-gate:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      # Option A: Local gate (no API key needed)
      - name: EvalGate regression gate
        run: npx evalgate gate --format github
      # Option B: Platform gate (requires API key)
      # - name: EvalGate quality gate
      #   env:
      #     EVALAI_API_KEY: ${{ secrets.EVALAI_API_KEY }}
      #   run: npx evalgate check --format github --onFail import
GitLab CI Configuration
For GitLab users, add this to your .gitlab-ci.yml:
eval-gate:
  stage: test
  image: node:20
  script:
    - npm ci
    - npx evalgate gate --format json
  only:
    - merge_requests
    - main
Setting Quality Gates
Defining Thresholds
Configure minimum scores for different evaluation criteria:
{
  "thresholds": {
    "accuracy": 0.85,
    "relevance": 0.80,
    "safety": 1.0,
    "latency_p95": 2000
  },
  "failOnViolation": true
}
Blocking Deployments
When evaluations fail, the CI pipeline will block the merge/deployment until issues are resolved. This ensures only high-quality changes make it to production.
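Conceptually, a blocking gate is just a comparison of scores against thresholds followed by a nonzero exit so CI marks the job failed. Here is a minimal sketch of that logic; the result-file shape and score keys are illustrative assumptions, not EvalGate's actual output format:

```python
import json
import sys


def check_thresholds(results: dict, thresholds: dict) -> list[str]:
    """Return human-readable violations; an empty list means the gate passes."""
    violations = []
    for metric, limit in thresholds.items():
        score = results.get(metric)
        if score is None:
            violations.append(f"{metric}: no score reported")
        elif metric.startswith("latency"):
            # Latency thresholds are maximums (milliseconds), not minimums.
            if score > limit:
                violations.append(f"{metric}: {score}ms exceeds limit {limit}ms")
        elif score < limit:
            violations.append(f"{metric}: {score:.2f} below minimum {limit:.2f}")
    return violations


if __name__ == "__main__":
    # Assumes a scores JSON file path is passed as the first argument.
    thresholds = {"accuracy": 0.85, "relevance": 0.80, "safety": 1.0, "latency_p95": 2000}
    with open(sys.argv[1]) as f:
        results = json.load(f)
    problems = check_thresholds(results, thresholds)
    for p in problems:
        print(f"FAIL {p}", file=sys.stderr)
    sys.exit(1 if problems else 0)
```

Exiting nonzero is what actually blocks the pipeline: GitHub Actions and GitLab CI both fail a job on any non-zero exit status.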
Best Practices
- Keep tests fast: Use a subset of test cases in CI, run the full suite nightly
- Cache dependencies: Speed up builds by caching npm packages and models
- Parallel execution: Run independent test suites in parallel when possible
- Clear reporting: Generate easy-to-read reports showing what failed and why
- Version control: Store test cases and thresholds in version control
- Cost monitoring: Track API costs to avoid expensive CI runs
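The "keep tests fast" advice can be implemented with deterministic sampling, so the same subset runs on every CI attempt while the nightly job covers the full suite. A sketch under stated assumptions (the case IDs and ratio are hypothetical; this is not an EvalGate feature):

```python
import hashlib


def ci_subset(case_ids: list[str], keep_ratio: float = 0.2) -> list[str]:
    """Deterministically keep ~keep_ratio of test cases by hashing their IDs.

    Hash-based selection is stable across runs and machines, unlike
    random.sample, so a failure on the subset reproduces on retry.
    """
    threshold = int(keep_ratio * 0xFFFF)
    return [
        cid for cid in case_ids
        if int(hashlib.sha256(cid.encode()).hexdigest()[:4], 16) <= threshold
    ]


# Nightly runs use the full list; CI uses the stable ~20% sample.
cases = [f"case-{i:03d}" for i in range(100)]
subset = ci_subset(cases, keep_ratio=0.2)
```

Because selection depends only on each case's ID, adding new cases never reshuffles which existing cases run in CI.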
CLI Commands Reference
# Setup (run once)
npx @evalgate/sdk init # scaffolds everything: baseline, CI workflow, config
# Gate commands
npx evalgate gate # run regression gate locally
npx evalgate gate --format github # CI step summary + PR annotations
npx evalgate gate --format json # machine-readable output
# Baseline management
npx evalgate baseline update # re-run tests and update baseline
# Platform gate (requires API key)
npx evalgate check --format github --onFail import
# Diagnostics
npx evalgate doctor # verify CI/CD setup