Framework Evaluations

How the leading agent frameworks actually score

We evaluated five agent frameworks against the Raknor Governance Scorecard using code-level analysis of their open-source repositories. No marketing claims. No self-assessments. Just what the code does.

NeMo Guardrails
C
Bronze
~72
LangGraph
D
Not certified
~48
CrewAI
D
Not certified
~45
OpenAI Swarm
F
Not certified
~12
Harvey (internal)
B
Silver
~81

Governance vs. adoption

The entire industry clusters in the “widely used, ungoverned” quadrant.

Low governance High governance High adoption Low adoption Popular but ungoverned Governed at scale Niche & ungoverned Governed but emerging
LangGraph
D (~48)
CrewAI
D (~45)
NeMo Guardrails
C (~72)
Swarm
F (~12)
Harvey
B (~81)
No framework here yet

Criterion-level comparison

Governance Capability LangGraph CrewAI NeMo Swarm Harvey
Authority Governance (30%)
Consequence-tier gatingNoNoNoNoYes
Authority boundary enforcementNoPartialNoNoYes
Escalation protocolYesBehavioralNoNoYes
Earned authority lifecycleNoNoNoYesYes
Resource governanceNoYesNomax_turnsYes
Observability (20%)
Decision logging w/ reasoningCheckpointVerboseOTelNoYes
Provenance chainNoNoTracesNoAppend-only
Decision reconstructionState replayNoTrace replayNoYes
Safety & Reliability (15%)
Graceful failureGraph retrymax_retryRail fallbackNoYes
Human-in-the-loopArchitecturalBehavioralNoMinimalYes
Input validationNoNoYARA + railsNoYes
Backpressure responseNoNoNoNoYes
Adversarial Resilience (20%)
Prompt injection resistanceNoNoYARA + LLMNoRegex
Authority spoofing detectionNoNoNoNoYes
Scope drift detectionNoNoTopical railsNoYes
Governance rollback preventionNoNoNoNoAppend-only
Interoperability (15%)
Standard protocol supportLangServeRESTAPINoMCP
Multi-agent coordinationGraphCrewsNoHandoffEventBus

Methodology: These evaluations are based on code-level analysis of each framework's open-source repository as of March 2026. We evaluate what the code does, not what documentation claims. Scores are estimates based on the Raknor Governance Scorecard v1.0.1 (26 criteria, 5 domains). Full behavioral certification via the Arena would produce exact scores. Harvey is included as an internal reference implementation — it has not undergone independent Arena certification.

Disagree with a score? The scorecard is published under CC BY 4.0. The criteria are open. Submit your framework for formal certification and we'll test it under adversarial conditions.

Want to know where your agent framework stands?

Request a Certification