CortexOps
Reliability infrastructure for LangGraph, CrewAI, and AutoGen agents.
What is CortexOps?
CortexOps is an evaluation and observability SDK for AI agents. It catches regressions before they reach production — not after a customer complaint at 2am.
Built by a Senior AI Engineer at PayPal after debugging payment agent failures in production. The pain is real. The fix is now open source.
Core capabilities
- CortexTracer: one-line instrumentation for LangGraph, CrewAI, or any Python callable
- EvalSuite: golden-dataset evals defined in YAML
- Built-in metrics plus custom metric support
- CI eval gate that blocks PRs on regression
- Pro: hosted traces, Slack alerts, prompt versioning, and LLM-as-judge scoring
Who is it for?
- AI engineers shipping LangGraph or CrewAI agents to production
- Fintech teams running payment, fraud, or KYC agents
- Any team that wants to catch regressions before users do
Installation
One command. No Docker required.
Requirements
- Python 3.10, 3.11, or 3.12
- pip 21+
Install the SDK
pip install cortexops
Optional dependencies
For LLM-as-judge scoring, you need an OpenAI API key:
export OPENAI_API_KEY=sk-...
For trace shipping to the hosted API (Pro):
export CORTEXOPS_API_KEY=cxo-...
export CORTEXOPS_API_URL=https://api.getcortexops.com
Upgrade
pip install --upgrade cortexops
Quickstart
From zero to a passing eval in 2 minutes.
Install
pip install cortexops
Define your agent
Any callable that takes a dict and returns a dict.
Write a golden dataset
YAML file with expected inputs, outputs, and tool calls.
Run the eval
EvalSuite.run() scores every case and prints a summary.
Complete example
from cortexops import CortexTracer, EvalSuite

# 1. Define your agent
def my_agent(input_data):
    query = input_data.get("query", "")
    if "refund" in query.lower():
        return {"output": "refund approved for your order"}
    return {"output": "request received"}

# 2. Instrument with one line
tracer = CortexTracer(project="my-agent")
agent = tracer.wrap(my_agent)

# 3. Run evals
results = EvalSuite.run(
    dataset="golden_v1.yaml",
    agent=agent,
    fail_on="task_completion < 0.90",
)
print(results.summary())
golden_v1.yaml
project: my-agent
cases:
  - id: refund_test
    input:
      query: process refund for order ORD-123
    expected_output_contains:
      - refund
      - approved
    max_latency_ms: 2000
Expected output
[1/1] refund_test ... pass (100)
CortexOps eval — my-agent
Cases : 1 (1 passed, 0 failed)
Task completion : 100.0%
Tool accuracy : 100.0/100
Latency p50/p95 : 1ms / 1ms
CortexTracer
One-line instrumentation for any agent. No refactoring required.
Basic usage
from cortexops import CortexTracer
tracer = CortexTracer(
    project="payments-agent",
    api_key="cxo-...",  # optional — Pro tier
    api_url="https://api.getcortexops.com",
    environment="production",
)
agent = tracer.wrap(your_agent)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| project | str | required | Project name for grouping traces |
| api_key | str | None | CortexOps Pro API key (cxo-...) |
| api_url | str | https://api.getcortexops.com | Hosted API endpoint |
| environment | str | "development" | Tag traces by environment |
| sample_rate | float | 1.0 | Fraction of traces to capture (0.0–1.0) |
Auto-detected frameworks
tracer.wrap() automatically detects your framework:
- LangGraph — wraps CompiledStateGraph.invoke()
- CrewAI — wraps Crew.kickoff()
- Any callable — wraps the function directly
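The duck-typed dispatch described above can be sketched in a few lines. This is an illustrative stand-in, not the SDK's actual detection code, which may use stricter type checks:

```python
def detect_target(obj):
    """Classify a wrap() target the way the docs describe detection.

    Illustrative stand-in only: checks for the CrewAI and LangGraph
    entry points by attribute, then falls back to a plain callable.
    """
    if hasattr(obj, "kickoff"):
        return "crewai"      # Crew.kickoff()
    if hasattr(obj, "invoke"):
        return "langgraph"   # CompiledStateGraph.invoke()
    if callable(obj):
        return "callable"    # plain function
    raise TypeError("unsupported agent type")
```

Checking kickoff before invoke matters: some framework objects expose more than one entry point, so the most specific attribute wins.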
Accessing traces
last = tracer.last_trace()
all_ = tracer.traces()
tracer.clear() # reset local store
Golden datasets
YAML files that define what your agent should do. The ground truth your evals run against.
Full schema
project: payments-agent
version: 1
description: Core payment flows
cases:
  - id: refund_approved              # unique case ID
    input:
      query: process refund for ORD-8821
      user_id: usr_123
    expected_output_contains:        # ALL must appear
      - refund
      - approved
    expected_output_not_contains:    # NONE should appear
      - error
      - failed
    expected_tool_calls:             # tools that must be called
      - lookup_refund
      - send_confirmation
    max_latency_ms: 3000             # latency budget
    judge: llm                       # "rule" or "llm"
    judge_criteria: >                # natural language criteria
      Response should confirm the refund was approved,
      mention the order ID, and offer next steps.
    tags: [happy-path, refund]
Fields reference
| Field | Required | Description |
|---|---|---|
| id | Yes | Unique identifier for this case |
| input | Yes | Dict passed to the agent |
| expected_output_contains | No | Keywords ALL required in output |
| expected_output_not_contains | No | Keywords NONE allowed in output |
| expected_tool_calls | No | Tool names that must be called |
| max_latency_ms | No | Max allowed latency in milliseconds |
| judge | No | "rule" (default) or "llm" |
| judge_criteria | No | Natural language criteria for LLM judge |
| tags | No | List of tags for filtering |
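The contains/not-contains semantics from the table can be sketched as a small helper. This is an illustrative sketch, not the SDK's rule judge; case-insensitive matching is an assumption here:

```python
def check_output(output: str, contains=None, not_contains=None) -> bool:
    """Rule-judge semantics from the fields table (illustrative sketch):
    ALL expected_output_contains keywords must appear, and NONE of
    expected_output_not_contains may appear. Matching is done
    case-insensitively here; the SDK's exact matching rules may differ."""
    text = output.lower()
    all_present = all(k.lower() in text for k in (contains or []))
    none_present = not any(k.lower() in text for k in (not_contains or []))
    return all_present and none_present
```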
Metrics
Built-in metrics that run on every eval case automatically.
Scoring
Each metric returns a score from 0–100. A case passes if its combined score is ≥ 70.
| Score | Status | Meaning |
|---|---|---|
| 90–100 | Pass | Excellent — all checks passed |
| 70–89 | Pass | Good — minor issues detected |
| 60–69 | Warning | Degraded — review recommended |
| 0–59 | Fail | Failed — CI gate will block |
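The score bands above map to a status like this; the function below is an illustrative helper that mirrors the table, not part of the cortexops API:

```python
def score_status(score: float) -> str:
    """Map a 0-100 metric score to the status bands in the table above.
    Scores of 70 and above pass, 60-69 warn, and below 60 fail."""
    if score >= 70:
        return "pass"
    if score >= 60:
        return "warning"
    return "fail"
```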
Custom metrics
from cortexops.metrics import Metric
from cortexops.models import FailureKind

class ComplianceMetric(Metric):
    name = "compliance"

    def score(self, case, trace):
        output = str(trace.output)
        if "account number" in output.lower():
            return 0.0, FailureKind.HALLUCINATION, "PII leaked"
        return 100.0, None, None
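The metric's logic can be exercised without the SDK by substituting minimal stand-ins for Metric, FailureKind, and the trace. This is purely illustrative; the real base classes live in cortexops:

```python
from enum import Enum
from types import SimpleNamespace

class FailureKind(Enum):      # stand-in for cortexops.models.FailureKind
    HALLUCINATION = "hallucination"

class Metric:                 # stand-in for cortexops.metrics.Metric
    name = "base"

class ComplianceMetric(Metric):
    name = "compliance"

    def score(self, case, trace):
        output = str(trace.output)
        if "account number" in output.lower():
            return 0.0, FailureKind.HALLUCINATION, "PII leaked"
        return 100.0, None, None

# A trace only needs an .output attribute for this metric,
# so SimpleNamespace works as a stub.
metric = ComplianceMetric()
leaky = SimpleNamespace(output="Your account number is 1234")
clean = SimpleNamespace(output="Refund approved")
```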
CI eval gate
Block PRs when your agent regresses. Zero configuration, one YAML file.
GitHub Actions setup
# .github/workflows/eval.yml
name: CortexOps eval gate
on: [push, pull_request]

jobs:
  eval-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install cortexops
      - name: Run eval gate
        run: |
          cortexops eval run \
            --dataset golden_v1.yaml \
            --fail-on "task_completion < 0.90"
fail_on expressions
| Expression | Meaning |
|---|---|
| task_completion < 0.90 | Block if less than 90% of cases pass |
| tool_accuracy < 0.80 | Block if tool accuracy drops below 80% |
| pass_rate < 1.0 | Block if any single case fails |
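The semantics of these expressions can be sketched as metric-operator-threshold triples. This is an illustrative parser for the table above, not the actual CLI implementation, which may support more operators:

```python
import re

def gate_fires(expr: str, metrics: dict) -> bool:
    """Evaluate a fail_on expression like 'task_completion < 0.90'
    against a dict of metric values. Returns True when the gate
    should block the PR. Illustrative sketch only."""
    m = re.fullmatch(r"\s*(\w+)\s*(<=|<|>=|>)\s*([\d.]+)\s*", expr)
    if not m:
        raise ValueError(f"unparseable fail_on expression: {expr!r}")
    name, op, threshold = m.group(1), m.group(2), float(m.group(3))
    value = metrics[name]
    comparisons = {
        "<": value < threshold, "<=": value <= threshold,
        ">": value > threshold, ">=": value >= threshold,
    }
    return comparisons[op]
```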
Python API
import sys

from cortexops import EvalSuite
from cortexops.eval import EvalThresholdError

try:
    EvalSuite.run(
        dataset="golden_v1.yaml",
        agent=my_agent,
        fail_on="task_completion < 0.90",
    )
except EvalThresholdError as e:
    print(f"Gate fired: {e}")
    sys.exit(1)  # blocks the PR
LangGraph
Instrument any CompiledStateGraph in one line.
from langgraph.graph import StateGraph
from cortexops import CortexTracer, EvalSuite

# Build your graph normally
builder = StateGraph(AgentState)
builder.add_node("lookup", lookup_node)
builder.add_node("respond", respond_node)
builder.set_entry_point("lookup")
builder.add_edge("lookup", "respond")
graph = builder.compile()

# Wrap with CortexTracer — zero refactoring
tracer = CortexTracer(
    project="payments-agent",
    api_key="cxo-...",
)
graph = tracer.wrap(graph)

# Use exactly as before
result = graph.invoke({"messages": [...]})

# Run evals
results = EvalSuite.run(
    dataset="golden_v1.yaml",
    agent=graph,
    verbose=True,
)
CortexTracer detects a CompiledStateGraph automatically and wraps its invoke() method. Your graph works identically; tracing adds no overhead to your agent logic.
CrewAI
Wrap any Crew with one line. kickoff() is automatically instrumented.
from crewai import Agent, Task, Crew
from cortexops import CortexTracer
analyst = Agent(role="Analyst", goal="Analyse payment disputes", ...)
task = Task(description="Review dispute {id}", agent=analyst, ...)
crew = Crew(agents=[analyst], tasks=[task])
# Wrap the crew
tracer = CortexTracer(project="dispute-crew")
crew = tracer.wrap(crew)
# Use normally
result = crew.kickoff(inputs={"id": "DIS-4421"})
Custom agents
Any callable works. If it takes a dict and returns a dict, CortexTracer can instrument it.
from cortexops import CortexTracer

tracer = CortexTracer(project="my-agent")

# Plain function
def my_agent(input_data: dict) -> dict:
    return {"output": "done"}

wrapped = tracer.wrap(my_agent)
result = wrapped({"query": "hello"})

# Object with .invoke()
class MyAgent:
    def invoke(self, input_data):
        return {"output": "done"}

wrapped = tracer.wrap(MyAgent())
result = wrapped.invoke({"query": "hello"})
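Conceptually, wrapping a plain callable just records input, output, and latency around each call. The sketch below illustrates that idea; it is not the SDK's implementation, and the trace fields are modeled loosely on the Trace object documented later:

```python
import time
from functools import wraps

def trace_call(fn, store):
    """Illustrative stand-in for what tracer.wrap() does to a plain
    callable: capture input, output, and wall-clock latency into a
    local store, then return the result unchanged."""
    @wraps(fn)
    def wrapped(input_data):
        start = time.perf_counter()
        output = fn(input_data)
        store.append({
            "input": input_data,
            "output": output,
            "total_latency_ms": (time.perf_counter() - start) * 1000,
        })
        return output
    return wrapped
```

Because the wrapper returns the agent's output untouched, calling code never needs to change.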
Hosted API Pro
Ship traces to api.getcortexops.com. 90-day retention, live dashboard, Slack alerts.
Your cxo- key arrives by email within 60 seconds of subscribing.
Connect the SDK
tracer = CortexTracer(
    project="payments-agent",
    api_key="cxo-your-key-here",
    api_url="https://api.getcortexops.com",
)
Verify traces are shipping
curl "https://api.getcortexops.com/v1/traces?project=payments-agent" \
-H "X-API-Key: cxo-your-key-here"
Alerts Pro
Get notified the moment your agent regresses. Slack, webhook, PagerDuty.
Slack alerts
# Set in Railway Variables or .env
CORTEXOPS_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
CORTEXOPS_ALERT_THRESHOLD=0.90
Webhook (PagerDuty, OpsGenie, custom)
CORTEXOPS_WEBHOOK_URL=https://your-webhook.com/alert
CORTEXOPS_WEBHOOK_SECRET=your-secret
Alert payload
{
  "project": "payments-agent",
  "run_id": "abc123",
  "task_completion_rate": 0.78,
  "regressions": 2,
  "failed_cases": [
    { "case_id": "refund_test", "failure_kind": "TIMEOUT", "score": 42 }
  ]
}
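If you point CORTEXOPS_WEBHOOK_URL at your own endpoint, you will likely want to verify that payloads really came from CortexOps using CORTEXOPS_WEBHOOK_SECRET. The sketch below assumes HMAC-SHA256 over the raw request body with a hex digest, which is a common webhook convention; the actual signing scheme and signature header are assumptions to confirm against the alerts documentation:

```python
import hashlib
import hmac

def verify_signature(secret: str, body: bytes, signature: str) -> bool:
    """Verify an alert payload against its signature, ASSUMING
    HMAC-SHA256 over the raw body with a hex digest. Confirm the
    real scheme before relying on this in production."""
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest avoids timing side channels
    return hmac.compare_digest(expected, signature)
```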
Prompt versioning Pro
Git-style version history for every prompt. Diff any two versions instantly.
Commit a prompt version
curl -X POST https://api.getcortexops.com/v1/prompts \
  -H "X-API-Key: cxo-..." \
  -H "Content-Type: application/json" \
  -d '{
    "project": "payments-agent",
    "prompt_name": "system_prompt",
    "content": "You are a payment assistant...",
    "message": "Add refund policy clarification"
  }'
View diff between versions
curl "https://api.getcortexops.com/v1/prompts/diff?\
project=payments-agent&prompt_name=system_prompt&version_a=1&version_b=2" \
-H "X-API-Key: cxo-..."
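A "git-style" diff between two prompt versions can be reproduced locally with the standard library, which is useful for sanity-checking what the endpoint should return. The exact format the API emits is an assumption here:

```python
import difflib

def prompt_diff(old: str, new: str, name: str = "system_prompt") -> str:
    """Produce a unified diff between two prompt versions, similar in
    spirit to the /v1/prompts/diff endpoint. Illustrative only; the
    API's actual diff format may differ."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"{name}@v1",
        tofile=f"{name}@v2",
    ))
```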
LLM judge Pro
GPT-4o scores open-ended outputs against your criteria. Falls back to heuristics if unavailable.
Enable in your dataset
cases:
  - id: tone_check
    input: {query: I want to dispute this charge}
    judge: llm
    judge_criteria: >
      Response should be empathetic, acknowledge the frustration,
      and offer a clear next step for dispute resolution.
Set the API key
export OPENAI_API_KEY=sk-...
API reference — Traces
Base URL: https://api.getcortexops.com · Auth: X-API-Key: cxo-...
List traces with GET /v1/traces; filter with query parameters such as status=failed and limit=50.
Trace object
{
  "trace_id": "4398c8e8-b1e1-4012-ae12-59782725e792",
  "project": "payments-agent",
  "case_id": "refund_approved",
  "status": "completed",
  "total_latency_ms": 342.5,
  "failure_kind": null,
  "failure_detail": null,
  "environment": "production",
  "created_at": "2025-04-05T08:20:54Z"
}
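Once traces are fetched, the p50/p95 latency figures shown in eval summaries can be recomputed from the total_latency_ms field. The helper below is an illustrative sketch over trace dicts shaped like the object above, not an SDK function:

```python
def latency_percentiles(traces):
    """Compute p50/p95 latency from a list of trace dicts that carry a
    total_latency_ms field. Uses simple index-based percentiles (upper
    median for even counts); illustrative helper only."""
    lat = sorted(t["total_latency_ms"] for t in traces)
    if not lat:
        return None, None
    p50 = lat[len(lat) // 2]
    p95 = lat[min(len(lat) - 1, int(len(lat) * 0.95))]
    return p50, p95
```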
API reference — Evals
Eval runs and their results are retrievable by run_id.
API reference — API keys
The full cxo- key is returned only once, at creation time. After that, only its hash is stored.
Changelog
v0.1.0 — April 2026
First public release.
- CortexTracer — one-line instrumentation for LangGraph, CrewAI, any callable
- EvalSuite — golden dataset runner with YAML format
- 5 built-in metrics — task_completion, tool_accuracy, latency, hallucination
- LLM-as-judge scoring with GPT-4o + heuristic fallback
- GitHub Actions CI gate — fail_on threshold expressions
- FastAPI backend — traces, evals, prompts, keys endpoints
- Slack + webhook alerting on regression
- PayPal billing integration for Pro tier
- Hosted API at api.getcortexops.com
- LangGraph payments example with 9 golden cases
FAQ
Does CortexTracer slow down my agent?
No. Tracing uses an async flush — your agent runs normally. If the hosted API is unreachable, traces are stored locally and the agent is unaffected.
Do I need to use LangGraph or LangChain?
No. CortexOps works with any Python agent — LangGraph, CrewAI, AutoGen, or a plain function. Unlike LangSmith, there is no framework lock-in.
What's the difference between Free and Pro?
Free gives you unlimited local evals and the GitHub Actions CI gate. Pro adds hosted trace storage (90 days), the live dashboard, Slack alerts, prompt versioning, and LLM-as-judge scoring.
How is pricing different from LangSmith?
LangSmith charges $39/seat plus $2.50–$5.00 per 1,000 traces. CortexOps is $49/seat flat — unlimited traces, no per-trace billing surprises.
Is my data used to train models?
No. Your traces, prompts, and outputs are private to your project. CortexOps does not use your data for training.
Can I self-host?
Yes. The full backend is open source on GitHub. Run it locally with Docker Compose or deploy to any cloud. See the README for instructions.
How do I cancel?
Cancel your PayPal subscription anytime from your PayPal account. No cancellation fees, no lock-in.