MAST (Medical AI Superintelligence Test) is a suite of clinically realistic benchmarks for evaluating the real-world medical capabilities of artificial intelligence models. The system provides a leaderboard: submitters register an API endpoint for their model, and the endpoint is automatically tested against standardized medical scenarios.
The live leaderboard is available at benchmarks.arise-ai.org.
This repository provides instructions and test files to validate your custom model API endpoint. After passing validation, view the Submission Agreement and submit the Registration Form for review by the MAST team. The API and token are used only for benchmark execution and are not stored after evaluation.
- Submitters provide a single API endpoint with authentication token
- Leaderboard runs automated tests against all benchmarks using that endpoint
- API calls are made with standardized prompts and test cases for each benchmark
- Responses are validated for format compliance
- Results are manually reviewed prior to publication on the leaderboard
```
mast/
├── benchmarks/
│   ├── donoharm/                    # Do No Harm benchmark
│   │   ├── prompt.md                # System prompt sent with each case
│   │   ├── schema.json              # Response validation schema
│   │   ├── validator.py             # API testing logic
│   │   ├── inputs/                  # Test input files (.txt)
│   │   └── outputs/                 # Reference responses
│   ├── sct/                         # Script Concordance Test benchmark
│   └── template/                    # Template for new benchmarks
├── results/                         # API response storage (per-benchmark)
├── scripts/
│   ├── validate_all.py              # Master API tester
│   ├── utils.py                     # Shared utilities
│   ├── config.json                  # API endpoint config (gitignored)
│   └── config.example.json          # Template for submitters
├── docs/
│   ├── contributing.md              # Contribution guidelines
│   ├── submission_agreement.md      # Terms for submitters
│   └── benchmark_descriptions.md    # Detailed benchmark info
└── README.md
```
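`scripts/config.json` holds your endpoint details. The exact keys are defined by `scripts/config.example.json`; the field names below are illustrative assumptions only:

```json
{
  "endpoint_url": "https://api.example.com/v1/medical",
  "auth_token": "YOUR_BEARER_TOKEN",
  "timeout_seconds": 300
}
```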
- Clone the repository:

  ```bash
  git clone https://github.com/ARISENetwork/mast.git
  cd mast
  ```

- Set up your API endpoint: provide a hosted endpoint for accessing and benchmarking your model.

- Configure your endpoint by copying and editing the config:

  ```bash
  cp scripts/config.example.json scripts/config.json
  # Edit scripts/config.json with your API details
  ```

- Test your endpoint:

  ```bash
  python scripts/validate_all.py
  ```

Each benchmark makes HTTPS POST requests with:

- Method: POST
- Headers: `Authorization: Bearer {token}`, `Content-Type: text/plain`
- Body: `prompt.md` + `"\n"` + `test_input.txt`
- Timeout: up to 300 seconds
The body contains the full system prompt followed by the clinical case. See benchmarks/donoharm/prompt.md for the exact prompt and benchmarks/donoharm/inputs/test_001.txt for an example case.
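Under those assumptions, a request can be assembled as below; the helper name and the commented-out endpoint URL are placeholders, not part of the repository:

```python
from pathlib import Path

def build_request(prompt_path, case_path, token):
    """Assemble headers and body as described above:
    system prompt, a newline, then the clinical case."""
    body = (Path(prompt_path).read_text(encoding="utf-8")
            + "\n"
            + Path(case_path).read_text(encoding="utf-8"))
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "text/plain",
    }
    return headers, body

# Sending it (requires the `requests` package; endpoint_url and token
# come from your own config):
# import requests
# headers, body = build_request("benchmarks/donoharm/prompt.md",
#                               "benchmarks/donoharm/inputs/test_001.txt",
#                               token)
# resp = requests.post(endpoint_url, headers=headers,
#                      data=body.encode("utf-8"), timeout=300)
```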
APIs must return a JSON object containing a free-text clinical management plan:

```json
{
  "response": "Assessment: Grade 3 infusion reaction to nivolumab...\n\n1. Refer to Allergy/Immunology for urgent evaluation...\n2. Hold next nivolumab dose until allergy clearance...\n3. ..."
}
```

The `response` field must contain at least 50 characters of clinical text. There is no required structure within the text itself: the model should write a management plan as described in the prompt. See benchmarks/donoharm/outputs/test_001.txt for an example of a valid response.
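The format check can be approximated in Python; the actual rules live in benchmarks/donoharm/schema.json and may be stricter, so treat this as a minimal sketch:

```python
def is_valid_plan(payload):
    """Minimal approximation of the format check: a JSON object with a
    'response' string of at least 50 characters."""
    return (
        isinstance(payload, dict)
        and isinstance(payload.get("response"), str)
        and len(payload["response"]) >= 50
    )

print(is_valid_plan({"response": "x" * 60}))     # True
print(is_valid_plan({"response": "too short"}))  # False
```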
OpenAI-compatible endpoints are also accepted. If your API returns the standard OpenAI chat completions format (choices[0].message.content), the validator will automatically extract the content. This includes endpoints served via OpenRouter or any OpenAI-compatible provider.
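The dual-format handling can be sketched as follows; the function name is ours, and validator.py may handle additional edge cases:

```python
def extract_content(payload: dict) -> str:
    """Pull the model's text from either accepted response shape:
    {"response": "..."} or OpenAI-style chat completions."""
    if "response" in payload:
        return payload["response"]
    # OpenAI chat completions: choices[0].message.content
    return payload["choices"][0]["message"]["content"]
```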
- Study: https://arxiv.org/abs/2512.01241
- Task: Provide clinical recommendations for a medical case
- Input: Reconstructed from real clinical cases, where a generalist physician electronically consulted a specialist/subspecialist
- Output: Free-text management plan (assessment + recommendations)
- Scoring: Evaluated by multiple LLM judges against specialist-authored rubrics
- Validation: Format compliance (schema validation only)
Coming soon. See benchmarks/sct/README.md for preliminary details.
All API responses are saved for auditability:
- test_XXX_response.json: Complete API response with metadata
- test_XXX_validation.json: Validation results and error details
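A minimal sketch of writing those two files per case; the metadata fields in the example call are assumptions, not necessarily what the harness records:

```python
import json
from pathlib import Path

def save_results(results_dir, case_id, api_response, validation):
    """Write the two audit files for one test case, e.g.
    test_001_response.json and test_001_validation.json."""
    out = Path(results_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{case_id}_response.json").write_text(
        json.dumps(api_response, indent=2), encoding="utf-8")
    (out / f"{case_id}_validation.json").write_text(
        json.dumps(validation, indent=2), encoding="utf-8")

# Example (latency_s is an assumed metadata field):
save_results("results/donoharm", "test_001",
             {"response": "Assessment: ...", "latency_s": 12.3},
             {"valid": True, "errors": []})
```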
Install required packages:

```bash
pip install jsonschema requests
```

- Stable endpoint: API must remain accessible for at least 72 hours during benchmarking
- Concurrent requests: Must support 5-10 simultaneous connections
- Authentication: Bearer token authentication required
- Response time: Under 300 seconds per request
- Response format: Valid JSON, either `{"response": "..."}` or the OpenAI-compatible chat completions format
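Before submitting, you can sanity-check the concurrency requirement with a few parallel calls. The sketch below uses a stub in place of a real HTTPS request; swap in requests.post against your own endpoint:

```python
from concurrent.futures import ThreadPoolExecutor

def call_endpoint(case_id):
    # Stand-in for a real HTTPS POST to your API; replace with
    # requests.post(endpoint_url, headers=..., data=..., timeout=300).
    return {"case": case_id, "ok": True}

def smoke_test(n_parallel=8, n_cases=8):
    """Issue n_cases calls across n_parallel workers; return failure count."""
    with ThreadPoolExecutor(max_workers=n_parallel) as pool:
        results = list(pool.map(call_endpoint, range(n_cases)))
    return sum(1 for r in results if not r["ok"])

print(smoke_test())  # 0 failures with the stub
```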
- Input tokens: ~1,500,000
- Output tokens: ~3,000,000-18,000,000 (varies with reasoning depth)
Approximate inference cost for a large frontier-scale model: $125-$400 for a full benchmark run, and potentially higher at greater reasoning effort. Actual costs depend on your provider's current pricing. The benchmark is run at multiple response lengths for sensitivity analysis, and reasoning models produce significantly more output tokens due to chain-of-thought, which increases costs substantially.
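As a worked example of how the estimate scales with reasoning depth, using hypothetical per-million-token prices (substitute your provider's actual rates):

```python
def run_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Total cost in dollars given token counts and $/million-token prices."""
    return ((input_tokens / 1e6) * in_price_per_m
            + (output_tokens / 1e6) * out_price_per_m)

# Hypothetical pricing: $3/M input tokens, $15/M output tokens.
low  = run_cost(1_500_000,  3_000_000, 3.0, 15.0)   # shallow reasoning
high = run_cost(1_500_000, 18_000_000, 3.0, 15.0)   # heavy reasoning
print(low, high)
```

Note that the spread is dominated almost entirely by output tokens, which is why reasoning effort drives the total.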
- Plain text clinical cases
- UTF-8 encoding
- One case per file
- JSON object with a `response` string field, or OpenAI-compatible chat completions format
- Must conform to benchmarks/donoharm/schema.json (after extraction)
- Minimum 50 characters in the response field