# Documentation

## Methodology Overview
PSAE evaluates AI systems using the STAR-R framework (Situation, Task, Action, Result, Risk). Each test case follows peer-reviewed methodologies drawn from the NIST AI RMF, METR's autonomy evaluation guidelines, and the ACM AIware safety taxonomy.
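The five STAR-R fields above can be sketched as a simple record type. This is a minimal illustration only: the class name, field names, and example values are assumptions, not PSAE's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StarRCase:
    """One STAR-R test case (hypothetical schema; names are illustrative)."""
    situation: str  # context the AI system is placed in
    task: str       # what the system is asked to do
    action: str     # observed behaviour of the system
    result: str     # outcome produced by that action
    risk: str       # risk category assessed for the outcome

# Usage: a minimal example case
case = StarRCase(
    situation="Agent has shell access in a CI sandbox",
    task="Install project dependencies",
    action="Ran the documented install command",
    result="Dependencies installed; no out-of-scope commands",
    risk="low",
)
```

Freezing the dataclass keeps a test case immutable once constructed, which matches the intent of treating benchmark items as fixed inputs.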
## Statistical Requirements
- Confidence level: 95%
- Minimum runs per scenario: 5
- Inter-rater reliability: Cohen's κ ≥ 0.8
- Effect size: Cohen's d for model comparison
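The two statistics in this list follow their standard definitions: Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement from the raters' marginal label proportions, and Cohen's d is the mean difference divided by the pooled standard deviation. A sketch using only the Python standard library (function names are illustrative, not part of PSAE):

```python
import math
from collections import Counter
from statistics import mean, variance

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    # observed agreement: fraction of items where the raters match
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    # chance agreement: product of each rater's marginal proportions per label
    p_e = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def cohen_d(group_a, group_b):
    """Cohen's d: mean difference scaled by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

# Usage: check two raters' pass/fail labels against the kappa >= 0.8 bar
a = ["pass", "pass", "fail", "pass", "fail"]
b = ["pass", "pass", "fail", "fail", "fail"]
print(cohen_kappa(a, b))  # raters agree on 4 of 5 items
```

A kappa below the 0.8 threshold would flag the scenario's labels for re-adjudication before its runs count toward a score.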
## Documentation Index
| Document | Description |
| --- | --- |
| Paper Criteria Alignment | Implementation status against PipelineAIEvalPaper2026 recommendations |
| Category-Based Scoring | Per-category grades and combined final-score methodology |
| Benchmark Grading Methodology | How the benchmark scores AI responses: metrics, risk multipliers, penalties, pass/fail |
| Benchmark Immutability Controls | Signed manifest workflow for dataset integrity |
| Industry Standards Compliance | Alignment with NIST AI RMF, METR, and safety-critical standards |
| Implementation Progress | Feature development status and roadmap |