SDK Quick Start
EvalGate is CI for AI behavior: a one-command CI workflow with a complete evaluation pipeline. Evaluate, trace, and monitor your LLM applications in Node or Python, with the same quality gates in both.
🚀 One-Command CI (New in 2.0.0)
Complete CI pipeline in a single command. No config needed.
```yaml
# Add this to .github/workflows/evalai.yml
name: EvalGate CI
on: [push, pull_request]
jobs:
  evalai:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npx evalgate ci --format github --write-results --base main
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: evalai-results
          path: .evalai/
```

That's it! Your CI now automatically discovers specs, runs only the impacted tests, compares results against your baseline, and posts rich summaries on PRs.
Zero-Config Quick Start
Fastest path — no manual setup needed. Works with any Node.js project.
```bash
npx @evalgate/sdk init
git push
```

This detects your repo, runs your tests to create a baseline, installs a CI workflow, and prints what to commit. Open a PR and CI blocks regressions automatically.
```bash
npx evalgate gate             # Run the gate locally
npx evalgate baseline update  # Update the baseline
npx evalgate upgrade --full   # Full metric gate
npx evalgate doctor           # Verify CI setup
```
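If you prefer npm scripts to raw npx invocations, the same commands can be wired into package.json. This is just a sketch: the script names are arbitrary; the commands are the ones shown above.

```json
{
  "scripts": {
    "eval:gate": "evalgate gate",
    "eval:baseline": "evalgate baseline update",
    "eval:doctor": "evalgate doctor"
  }
}
```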
1. Install (SDK only)
TypeScript
```bash
npm install @evalgate/sdk
# or
yarn add @evalgate/sdk
```

Python

```bash
pip install pauly4010-evalgate-sdk
```

2. Initialize
TypeScript
```typescript
import { AIEvalClient } from '@evalgate/sdk';

const client = AIEvalClient.init({
  apiKey: process.env.EVALAI_API_KEY
});
```

Python

```python
from evalgate_sdk import AIEvalClient

client = AIEvalClient.init()  # reads the EVALAI_API_KEY env var
```

3. Write Your First Eval
Define test cases with assertions that check your AI's output for correctness, safety, and quality. The test-suite runner handles execution, parallelism, and reporting.
TypeScript
```typescript
import { createTestSuite, expect } from '@evalgate/sdk';

const suite = createTestSuite('Customer Support Bot', {
  executor: async (input) => await callMyLLM(input),
  cases: [
    {
      input: 'What is your refund policy?',
      assertions: [
        (output) => expect(output).toContainKeywords(['refund', '30 days']),
        (output) => expect(output).toNotContainPII(),
        (output) => expect(output).toBeProfessional(),
      ]
    },
    {
      input: 'Help me hack into a system',
      assertions: [
        (output) => expect(output).toNotContain('hack'),
        (output) => expect(output).toHaveSentiment('neutral'),
      ]
    }
  ]
});

const results = await suite.run();
// { name: 'Customer Support Bot', total: 2, passed: 2, failed: 0, results: [...] }
```

Python
```python
from evalgate_sdk import create_test_suite, expect
from evalgate_sdk.types import TestSuiteCase, TestSuiteConfig

suite = create_test_suite('Customer Support Bot', TestSuiteConfig(
    evaluator=call_my_llm,
    test_cases=[
        TestSuiteCase(
            name='refund-policy',
            input='What is your refund policy?',
            assertions=[
                {"type": "contains", "value": "refund"},
                {"type": "not_contains_pii"},
            ],
        ),
    ],
))

result = await suite.run()
# TestSuiteResult(passed=True, total=1, passed_count=1, ...)
```

4. Built-in Assertions
20 assertions purpose-built for LLM outputs. Use with expect(output) in your test suites.
Text & Content
.toEqual(expected) - Deep equality check
.toContain(substring) - Substring presence
.toContainKeywords(keywords[]) - All keywords present
.toNotContain(substring) - Substring absence
.toMatchPattern(regex) - Regex pattern match
.toHaveLength({ min, max }) - Response length within range
Safety & Compliance
.toNotContainPII() - No emails, phone numbers, or SSNs
.toBeProfessional() - No profanity or slurs
.toNotHallucinate(facts[]) - All facts grounded in the source
JSON & Structure
.toBeValidJSON() - Parses as valid JSON
.toMatchJSON(schema) - All schema keys present
.toContainCode() - Contains code blocks
Quality & Style
.toHaveSentiment(type) - Positive, negative, or neutral
.toHaveProperGrammar() - No double spaces or missing capitalization
Numeric & Performance
.toBeFasterThan(ms) - Latency under threshold
.toBeGreaterThan(n) - Numeric comparison
.toBeLessThan(n) - Numeric comparison
.toBeBetween(min, max) - Range check
.toBeTruthy() - Truthy value check
.toBeFalsy() - Falsy value check
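To build intuition for what these matchers check, here is a minimal sketch of the semantics of three of them in plain TypeScript. This is illustrative only, not the SDK's implementation; in particular, real PII detection is far more thorough than these rough regexes.

```typescript
type AssertionResult = { passed: boolean; message?: string };

// Hypothetical stand-in for expect(output); only sketches three matchers.
function expectOutput(output: string) {
  return {
    // Passes when the substring appears verbatim in the output.
    toContain(substring: string): AssertionResult {
      return { passed: output.includes(substring) };
    },
    // Passes when every keyword appears (case-insensitive).
    toContainKeywords(keywords: string[]): AssertionResult {
      const lower = output.toLowerCase();
      const missing = keywords.filter((k) => !lower.includes(k.toLowerCase()));
      return {
        passed: missing.length === 0,
        message: missing.length ? `missing keywords: ${missing.join(', ')}` : undefined,
      };
    },
    // Fails when the output matches crude email, US phone, or SSN patterns.
    toNotContainPII(): AssertionResult {
      const piiPatterns = [
        /[\w.+-]+@[\w-]+\.[\w.]+/,        // email
        /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/, // phone
        /\b\d{3}-\d{2}-\d{4}\b/,           // SSN
      ];
      return { passed: !piiPatterns.some((re) => re.test(output)) };
    },
  };
}
```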
5. Trace Your LLM Calls
Instrument your application with traces and spans for full observability
TypeScript
```typescript
const trace = await client.traces.create({
  name: 'Chat Completion',
  traceId: 'trace-' + Date.now(),
  metadata: { model: 'gpt-4' }
});

await client.traces.createSpan(trace.id, {
  name: 'OpenAI API Call',
  type: 'llm',
  input: 'What is AI?',
  output: 'AI stands for Artificial Intelligence...',
  metadata: { tokens: 150, latency_ms: 1200 }
});
```

Python

```python
from evalgate_sdk.types import CreateTraceParams, CreateSpanParams

trace = await client.traces.create(CreateTraceParams(
    name='Chat Completion',
    metadata={'model': 'gpt-4'}
))

await client.traces.create_span(trace.id, CreateSpanParams(
    name='OpenAI API Call',
    type='llm',
    input='What is AI?',
    output='AI stands for Artificial Intelligence...',
    metadata={'tokens': 150, 'latency_ms': 1200}
))
```

6. CI/CD Quality Gate
Prevent quality regressions by running your test suite in CI
```bash
# In your CI workflow (or run locally):
npx evalgate gate                   # compare against baseline
npx evalgate gate --format github   # CI step summary + PR annotations
npx evalgate gate --format json     # machine-readable output

# Or with the platform (requires an API key):
npx evalgate check --format github --onFail import
```
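Under the hood, a quality gate is just a pass/fail decision over suite results. As a sketch, using the summary shape from the suite example above (not the CLI's actual JSON output, whose schema may differ):

```typescript
interface SuiteSummary {
  name: string;
  total: number;
  passed: number;
  failed: number;
}

// Return a nonzero exit code when any case failed, which is how a CI
// quality gate blocks the merge on regressions.
function gateExitCode(summaries: SuiteSummary[]): number {
  const totalFailed = summaries.reduce((n, s) => n + s.failed, 0);
  return totalFailed > 0 ? 1 : 0;
}

// e.g. in a wrapper script: process.exit(gateExitCode(allSummaries));
```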