CortexOps

Reliability infrastructure for LangGraph, CrewAI, and AutoGen agents.

Ship agents you can trust. Evaluate before you deploy. Gate every PR. Alert when production regresses.

What is CortexOps?

CortexOps is an evaluation and observability SDK for AI agents. It catches regressions before they reach production — not after a customer complaint at 2am.

Built by a Senior AI Engineer at PayPal after debugging payment agent failures in production. The pain is real. The fix is now open source.

Core capabilities

Golden dataset evals
Define expected tool calls, output keywords, and latency budgets in YAML. Run against any agent.
CI eval gate
Block PRs when task_completion drops below threshold. One line of config.
Trace observability
Every node, tool call, and LLM turn captured. Replay any failure in seconds.
LLM-as-judge
GPT-4o scores open-ended outputs against natural language criteria. Heuristic fallback included.

Who is it for?

  • AI engineers shipping LangGraph or CrewAI agents to production
  • Fintech teams running payment, fraud, or KYC agents
  • Any team that wants to catch regressions before users do

Installation

One command. No Docker required.

Requirements

  • Python 3.10, 3.11, or 3.12
  • pip 21+

Install the SDK

pip install cortexops
Verify the install

python -c "import cortexops; print(cortexops.__version__)"

Optional dependencies

For LLM-as-judge scoring, you need an OpenAI API key:

export OPENAI_API_KEY=sk-...

For trace shipping to the hosted API (Pro):

export CORTEXOPS_API_KEY=cxo-...
export CORTEXOPS_API_URL=https://api.getcortexops.com

Upgrade

pip install --upgrade cortexops

Quickstart

From zero to a passing eval in 2 minutes.

1. Install: pip install cortexops
2. Define your agent: any callable that takes a dict and returns a dict.
3. Write a golden dataset: a YAML file with expected inputs, outputs, and tool calls.
4. Run the eval: EvalSuite.run() scores every case and prints a summary.

Complete example

from cortexops import CortexTracer, EvalSuite

# 1. Define your agent
def my_agent(input_data):
    query = input_data.get("query", "")
    if "refund" in query.lower():
        return {"output": "refund approved for your order"}
    return {"output": "request received"}

# 2. Instrument with one line
tracer = CortexTracer(project="my-agent")
agent  = tracer.wrap(my_agent)

# 3. Run evals
results = EvalSuite.run(
    dataset="golden_v1.yaml",
    agent=agent,
    fail_on="task_completion < 0.90",
)
print(results.summary())

golden_v1.yaml

project: my-agent
cases:
  - id: refund_test
    input:
      query: process refund for order ORD-123
    expected_output_contains:
      - refund
      - approved
    max_latency_ms: 2000

Expected output

  [1/1] refund_test ... pass (100)

CortexOps eval — my-agent
  Cases           : 1  (1 passed, 0 failed)
  Task completion : 100.0%
  Tool accuracy   : 100.0/100
  Latency p50/p95 : 1ms / 1ms

CortexTracer

One-line instrumentation for any agent. No refactoring required.

Basic usage

from cortexops import CortexTracer

tracer = CortexTracer(
    project="payments-agent",
    api_key="cxo-...",          # optional — Pro tier
    api_url="https://api.getcortexops.com",
    environment="production",
)

agent = tracer.wrap(your_agent)

Parameters

Parameter     Type    Default                Description
project       str     required               Project name for grouping traces
api_key       str     None                   CortexOps Pro API key (cxo-...)
api_url       str     api.getcortexops.com   Hosted API endpoint
environment   str     "development"          Tag traces by environment
sample_rate   float   1.0                    Fraction of traces to capture (0.0–1.0)

Auto-detected frameworks

tracer.wrap() automatically detects your framework:

  • LangGraph — wraps CompiledStateGraph.invoke()
  • CrewAI — wraps Crew.kickoff()
  • Any callable — wraps the function directly
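This kind of framework detection usually comes down to duck typing on the wrapped object. A simplified sketch of how such dispatch can work (wrap_agent and record are illustrative names, not the SDK's actual implementation):

```python
from typing import Any, Callable

def wrap_agent(target: Any, record: Callable[[Any, Any], None]) -> Any:
    """Return a traced version of target, picking a strategy by shape."""
    if hasattr(target, "invoke"):
        # Invoke-style objects (e.g. a compiled LangGraph graph).
        original = target.invoke
        def traced_invoke(input_data, **kwargs):
            result = original(input_data, **kwargs)
            record(input_data, result)
            return result
        target.invoke = traced_invoke
        return target
    if hasattr(target, "kickoff"):
        # Kickoff-style objects (e.g. a CrewAI Crew).
        original = target.kickoff
        def traced_kickoff(**kwargs):
            result = original(**kwargs)
            record(kwargs, result)
            return result
        target.kickoff = traced_kickoff
        return target
    if callable(target):
        # Plain dict-in/dict-out functions.
        def traced_call(input_data):
            result = target(input_data)
            record(input_data, result)
            return result
        return traced_call
    raise TypeError(f"cannot trace object of type {type(target).__name__}")

# Demo: trace a plain function.
captured = []
traced = wrap_agent(lambda d: {"output": "done"},
                    lambda i, o: captured.append((i, o)))
traced({"query": "hello"})
```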

Accessing traces

last  = tracer.last_trace()
all_  = tracer.traces()
tracer.clear()               # reset local store

Golden datasets

YAML files that define what your agent should do. The ground truth your evals run against.

Full schema

project: payments-agent
version: 1
description: Core payment flows

cases:
  - id: refund_approved          # unique case ID
    input:
      query: process refund for ORD-8821
      user_id: usr_123
    expected_output_contains:    # ALL must appear
      - refund
      - approved
    expected_output_not_contains: # NONE should appear
      - error
      - failed
    expected_tool_calls:         # tools that must be called
      - lookup_refund
      - send_confirmation
    max_latency_ms: 3000          # latency budget
    judge: llm                   # "rule" or "llm"
    judge_criteria: >            # natural language criteria
      Response should confirm the refund was approved,
      mention the order ID, and offer next steps.
    tags: [happy-path, refund]

Fields reference

Field                          Required  Description
id                             Yes       Unique identifier for this case
input                          Yes       Dict passed to the agent
expected_output_contains       No        Keywords ALL required in output
expected_output_not_contains   No        Keywords NONE allowed in output
expected_tool_calls            No        Tool names that must be called
max_latency_ms                 No        Max allowed latency in milliseconds
judge                          No        "rule" (default) or "llm"
judge_criteria                 No        Natural language criteria for LLM judge
tags                           No        List of tags for filtering

Metrics

Four built-in metrics. All run on every eval case automatically.

task_completion
Did the agent produce a non-empty, non-error output containing all expected keywords?
tool_accuracy
Were all expected tool calls made? Partial credit for subset matches.
latency
Did the agent respond within the max_latency_ms budget?
hallucination
Detects date fabrication, capability disclaimers, and forbidden content patterns.
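The docs note that tool_accuracy gives partial credit for subset matches. One plausible reading of that (an assumption; the exact formula is internal to CortexOps) is the fraction of expected tool calls that actually happened:

```python
def tool_accuracy(expected: list[str], actual: list[str]) -> float:
    """Score 0-100: fraction of expected tool calls that were made."""
    if not expected:
        return 100.0  # nothing required, nothing to miss
    hits = sum(1 for tool in expected if tool in actual)
    return 100.0 * hits / len(expected)
```

Under this reading, an agent that calls lookup_refund but forgets send_confirmation scores 50 rather than 0.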

Scoring

Each metric returns a score from 0–100. A case passes if its combined score is ≥ 70.

Score    Status   Meaning
90–100   Pass     Excellent — all checks passed
70–89    Pass     Good — minor issues detected
60–69    Warning  Degraded — review recommended
0–59     Fail     Failed — CI gate will block
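The bands above map directly to a small threshold function; this sketch simply restates the documented table in code:

```python
def status_for(score: float) -> str:
    """Map a 0-100 case score to its documented status band."""
    if score >= 70:
        return "pass"     # 90-100 excellent, 70-89 good with minor issues
    if score >= 60:
        return "warning"  # degraded, review recommended
    return "fail"         # CI gate will block
```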

Custom metrics

from cortexops.metrics import Metric
from cortexops.models import FailureKind

class ComplianceMetric(Metric):
    name = "compliance"

    def score(self, case, trace):
        output = str(trace.output)
        if "account number" in output.lower():
            return 0.0, FailureKind.HALLUCINATION, "PII leaked"
        return 100.0, None, None
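Because score() is a pure function of the case and trace, the metric's logic can be unit-tested in isolation before wiring it into a suite. A sketch, with the check extracted into a plain function so no CortexOps types are needed (the FailureKind value is represented here as a string for illustration):

```python
def compliance_score(output: str):
    """Same check as ComplianceMetric.score, extracted for isolated testing."""
    if "account number" in output.lower():
        return 0.0, "HALLUCINATION", "PII leaked"
    return 100.0, None, None
```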

CI eval gate

Block PRs when your agent regresses. Zero configuration, one YAML file.

Why this matters
LangSmith tells you what failed. CortexOps stops it from shipping.

GitHub Actions setup

# .github/workflows/eval.yml
name: CortexOps eval gate
on: [push, pull_request]

jobs:
  eval-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install cortexops
      - name: Run eval gate
        run: |
          cortexops eval run \
            --dataset golden_v1.yaml \
            --fail-on "task_completion < 0.90"

fail_on expressions

Expression                Meaning
task_completion < 0.90    Block if less than 90% of cases pass
tool_accuracy < 0.80      Block if tool accuracy drops below 80%
pass_rate < 1.0           Block if any single case fails
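Expressions of this shape are simple enough to evaluate with a single regex. A sketch of how such a gate check might work (an illustration of the technique, not the SDK's actual parser):

```python
import re

def gate_fires(expression: str, metrics: dict[str, float]) -> bool:
    """Return True when a fail_on expression like "task_completion < 0.90"
    is violated by the measured metrics."""
    m = re.fullmatch(r"\s*(\w+)\s*(<=|>=|<|>)\s*([\d.]+)\s*", expression)
    if not m:
        raise ValueError(f"unparseable fail_on expression: {expression!r}")
    name, op, threshold = m.group(1), m.group(2), float(m.group(3))
    value = metrics[name]
    checks = {
        "<":  value < threshold,
        "<=": value <= threshold,
        ">":  value > threshold,
        ">=": value >= threshold,
    }
    return checks[op]
```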

Python API

import sys

from cortexops import EvalSuite
from cortexops.eval import EvalThresholdError

try:
    EvalSuite.run(
        dataset="golden_v1.yaml",
        agent=my_agent,
        fail_on="task_completion < 0.90",
    )
except EvalThresholdError as e:
    print(f"Gate fired: {e}")
    sys.exit(1)   # blocks the PR

LangGraph

Instrument any CompiledStateGraph in one line.

from langgraph.graph import StateGraph
from cortexops import CortexTracer, EvalSuite

# Build your graph normally
builder = StateGraph(AgentState)
builder.add_node("lookup", lookup_node)
builder.add_node("respond", respond_node)
builder.set_entry_point("lookup")
builder.add_edge("lookup", "respond")
graph = builder.compile()

# Wrap with CortexTracer — zero refactoring
tracer = CortexTracer(
    project="payments-agent",
    api_key="cxo-...",
)
graph = tracer.wrap(graph)

# Use exactly as before
result = graph.invoke({"messages": [...]})

# Run evals
results = EvalSuite.run(
    dataset="golden_v1.yaml",
    agent=graph,
    verbose=True,
)
How it works
CortexTracer detects CompiledStateGraph automatically and wraps the invoke() method. Your graph works identically — the async trace flush adds negligible overhead to your agent logic.

CrewAI

Wrap any Crew with one line. kickoff() is automatically instrumented.

from crewai import Agent, Task, Crew
from cortexops import CortexTracer

analyst = Agent(role="Analyst", goal="Analyse payment disputes", ...)
task    = Task(description="Review dispute {id}", agent=analyst, ...)
crew    = Crew(agents=[analyst], tasks=[task])

# Wrap the crew
tracer = CortexTracer(project="dispute-crew")
crew   = tracer.wrap(crew)

# Use normally
result = crew.kickoff(inputs={"id": "DIS-4421"})

Custom agents

Any callable works. If it takes a dict and returns a dict, CortexTracer can instrument it.

from cortexops import CortexTracer

tracer = CortexTracer(project="my-agent")

# Plain function
def my_agent(input_data: dict) -> dict:
    return {"output": "done"}

wrapped = tracer.wrap(my_agent)
result  = wrapped({"query": "hello"})

# Object with .invoke()
class MyAgent:
    def invoke(self, input_data):
        return {"output": "done"}

wrapped = tracer.wrap(MyAgent())
result  = wrapped.invoke({"query": "hello"})

Hosted API Pro

Ship traces to api.getcortexops.com. 90-day retention, live dashboard, Slack alerts.

Get a Pro API key
Subscribe at getcortexops.com/#pricing — payment via PayPal. Your cxo- key arrives by email within 60 seconds.

Connect the SDK

tracer = CortexTracer(
    project="payments-agent",
    api_key="cxo-your-key-here",
    api_url="https://api.getcortexops.com",
)

Verify traces are shipping

curl "https://api.getcortexops.com/v1/traces?project=payments-agent" \
  -H "X-API-Key: cxo-your-key-here"

Alerts Pro

Get notified the moment your agent regresses. Slack, webhook, PagerDuty.

Slack alerts

# Set in Railway Variables or .env
CORTEXOPS_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
CORTEXOPS_ALERT_THRESHOLD=0.90

Webhook (PagerDuty, OpsGenie, custom)

CORTEXOPS_WEBHOOK_URL=https://your-webhook.com/alert
CORTEXOPS_WEBHOOK_SECRET=your-secret

Alert payload

{
  "project": "payments-agent",
  "run_id": "abc123",
  "task_completion_rate": 0.78,
  "regressions": 2,
  "failed_cases": [
    { "case_id": "refund_test", "failure_kind": "TIMEOUT", "score": 42 }
  ]
}
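CORTEXOPS_WEBHOOK_SECRET implies the payload is signed. Neither the signature scheme nor the header name is documented here, so the sketch below assumes a common pattern (hex-encoded HMAC-SHA256 over the raw request body); confirm the actual header and algorithm against the hosted API before relying on this:

```python
import hashlib
import hmac
import json

def sign_payload(secret: str, body: bytes) -> str:
    """Hex HMAC-SHA256 over the raw request body (assumed scheme)."""
    return hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

def verify_webhook(secret: str, body: bytes, received_signature: str) -> bool:
    """Constant-time comparison to avoid timing attacks."""
    return hmac.compare_digest(sign_payload(secret, body), received_signature)

# Demo: sign an alert-style payload and verify it on the receiving side.
body = json.dumps({"project": "payments-agent", "regressions": 2}).encode()
signature = sign_payload("demo-secret", body)
```

Note the verification runs over the raw bytes as received, before JSON parsing, since re-serializing can change key order and break the signature.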

Prompt versioning Pro

Git-style version history for every prompt. Diff any two versions instantly.

Commit a prompt version

curl -X POST https://api.getcortexops.com/v1/prompts \
  -H "X-API-Key: cxo-..." \
  -H "Content-Type: application/json" \
  -d '{
    "project": "payments-agent",
    "prompt_name": "system_prompt",
    "content": "You are a payment assistant...",
    "message": "Add refund policy clarification"
  }'

View diff between versions

curl "https://api.getcortexops.com/v1/prompts/diff?\
project=payments-agent&prompt_name=system_prompt&version_a=1&version_b=2" \
  -H "X-API-Key: cxo-..."

LLM judge Pro

GPT-4o scores open-ended outputs against your criteria. Falls back to heuristics if unavailable.

Enable in your dataset

cases:
  - id: tone_check
    input: {query: I want to dispute this charge}
    judge: llm
    judge_criteria: >
      Response should be empathetic, acknowledge the frustration,
      and offer a clear next step for dispute resolution.

Set the API key

export OPENAI_API_KEY=sk-...

Heuristic fallback
If OpenAI is unavailable, CortexOps automatically falls back to keyword-based scoring. Your eval never fails due to a third-party outage.
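One common way such a keyword fallback works (a sketch of the general technique; the SDK's actual heuristic is not specified here) is to score by how many criteria keywords appear in the output:

```python
def heuristic_judge(output: str, keywords: list[str]) -> float:
    """Score 0-100 by the fraction of criteria keywords present in the output."""
    if not keywords:
        return 100.0
    text = output.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return 100.0 * hits / len(keywords)
```

A fallback like this is cruder than an LLM judge, but it is deterministic and needs no network call, which is what keeps the eval running through an outage.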

API reference — Traces

Base URL: https://api.getcortexops.com · Auth: X-API-Key: cxo-...

POST /v1/traces
Ingest a trace from the SDK or directly.
GET /v1/traces?project={name}
List traces for a project. Optional: status=failed, limit=50.
GET /v1/traces/{trace_id}
Get full trace detail including the raw node waterfall.

Trace object

{
  "trace_id": "4398c8e8-b1e1-4012-ae12-59782725e792",
  "project": "payments-agent",
  "case_id": "refund_approved",
  "status": "completed",
  "total_latency_ms": 342.5,
  "failure_kind": null,
  "failure_detail": null,
  "environment": "production",
  "created_at": "2025-04-05T08:20:54Z"
}

API reference — Evals

POST /v1/evals
Trigger an async eval run. Returns run_id.
GET /v1/evals?project={name}
List eval runs for a project.
GET /v1/evals/{run_id}
Poll eval run status and results.
GET /v1/evals/diff?run_a={id}&run_b={id}
Compare two eval runs. Returns deltas and a regression list.

API reference — API keys

POST /v1/keys
Create a new API key for a project. Returns the raw cxo- key once.
GET /v1/keys?project={name}
List active keys for a project.
DELETE /v1/keys/{key_id}
Revoke a key immediately.

Store your key immediately
The raw cxo- key is only returned once, at creation time. After that, only the hash is stored.

Changelog

v0.1.0 — April 2026

First public release.

  • CortexTracer — one-line instrumentation for LangGraph, CrewAI, any callable
  • EvalSuite — golden dataset runner with YAML format
  • 4 built-in metrics — task_completion, tool_accuracy, latency, hallucination
  • LLM-as-judge scoring with GPT-4o + heuristic fallback
  • GitHub Actions CI gate — fail_on threshold expressions
  • FastAPI backend — traces, evals, prompts, keys endpoints
  • Slack + webhook alerting on regression
  • PayPal billing integration for Pro tier
  • Hosted API at api.getcortexops.com
  • LangGraph payments example with 9 golden cases

FAQ

Does CortexTracer slow down my agent?

No. Tracing uses an async flush — your agent runs normally. If the hosted API is unreachable, traces are stored locally and the agent is unaffected.

Do I need to use LangGraph or LangChain?

No. CortexOps works with any Python agent — LangGraph, CrewAI, AutoGen, or a plain function. Unlike LangSmith, there is no framework lock-in.

What's the difference between Free and Pro?

Free gives you unlimited local evals and the GitHub Actions CI gate. Pro adds hosted trace storage (90 days), the live dashboard, Slack alerts, prompt versioning, and LLM-as-judge scoring.

How is pricing different from LangSmith?

LangSmith charges $39/seat plus $2.50–$5.00 per 1,000 traces. CortexOps is $49/seat flat — unlimited traces, no per-trace billing surprises.

Is my data used to train models?

No. Your traces, prompts, and outputs are private to your project. CortexOps does not use your data for training.

Can I self-host?

Yes. The full backend is open source on GitHub. Run it locally with Docker Compose or deploy to any cloud. See the README for instructions.

How do I cancel?

Cancel your PayPal subscription anytime from your PayPal account. No cancellation fees, no lock-in.