MAST (Medical AI Superintelligence Test) is a suite of clinically realistic benchmarks for evaluating the real-world medical capabilities of artificial intelligence models. The system provides a leaderboard: submitters register an API endpoint for their model, and the endpoint is automatically tested against standardized medical scenarios.
The live leaderboard is available at benchmarks.arise-ai.org.
This repository provides instructions and test files to validate your custom model API endpoint. After passing validation, view the Submission Agreement and submit the Registration Form for review by the MAST team. The API and token are used only for benchmark execution and are not stored after evaluation.
- Submitters provide a single API endpoint with authentication token
- Leaderboard runs automated tests against all benchmarks using that endpoint
- API calls are made with standardized prompts and test cases for each benchmark
- Responses are validated for format compliance
- Results are manually reviewed prior to publication on the leaderboard
```
mast/
├── benchmarks/
│   ├── donoharm/                    # Do No Harm benchmark
│   │   ├── prompt.md                # System prompt sent with each case
│   │   ├── schema.json              # Response validation schema
│   │   ├── validator.py             # API testing logic
│   │   ├── inputs/                  # Test input files (.txt)
│   │   └── outputs/                 # Reference responses
│   ├── sct/                         # Script Concordance Test benchmark
│   └── template/                    # Template for new benchmarks
├── results/                         # API response storage (per-benchmark)
├── scripts/
│   ├── validate_all.py              # Master API tester
│   ├── utils.py                     # Shared utilities
│   ├── config.json                  # API endpoint config (gitignored)
│   └── config.example.json          # Template for submitters
├── docs/
│   ├── contributing.md              # Contribution guidelines
│   ├── submission_agreement.md      # Terms for submitters
│   └── benchmark_descriptions.md    # Detailed benchmark info
└── README.md
```
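`scripts/config.json` holds your endpoint details. The exact keys are defined by `scripts/config.example.json`; the field names below are illustrative assumptions only:

```json
{
  "endpoint_url": "https://api.example.com/v1/medical",
  "auth_token": "YOUR_BEARER_TOKEN",
  "timeout_seconds": 300
}
```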
- Clone the repository:

  ```bash
  git clone https://github.com/ARISENetwork/mast.git
  cd mast
  ```

- Set up your API endpoint: provide a hosted endpoint for accessing and benchmarking your model.

- Configure your endpoint by copying and editing the config:

  ```bash
  cp scripts/config.example.json scripts/config.json
  # Edit scripts/config.json with your API details
  ```

- Test your endpoint:

  ```bash
  python scripts/validate_all.py
  ```

Each benchmark makes HTTPS POST requests with:

- Method: POST
- Headers: `Authorization: Bearer {token}`, `Content-Type: text/plain`
- Body: `prompt.md` + `"\n"` + `test_input.txt`
- Timeout: up to 300 seconds
The body contains the full system prompt followed by the clinical case. See benchmarks/donoharm/prompt.md for the exact prompt and benchmarks/donoharm/inputs/test_001.txt for an example case.
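Under those assumptions, a request can be assembled as below; the helper name and the commented-out endpoint URL are placeholders, not part of the repository:

```python
from pathlib import Path

def build_request(prompt_path, case_path, token):
    """Assemble headers and body as described above:
    system prompt, a newline, then the clinical case."""
    body = (Path(prompt_path).read_text(encoding="utf-8")
            + "\n"
            + Path(case_path).read_text(encoding="utf-8"))
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "text/plain",
    }
    return headers, body

# Sending it (requires the `requests` package; endpoint_url and token
# come from your own config):
# import requests
# headers, body = build_request("benchmarks/donoharm/prompt.md",
#                               "benchmarks/donoharm/inputs/test_001.txt",
#                               token)
# resp = requests.post(endpoint_url, headers=headers,
#                      data=body.encode("utf-8"), timeout=300)
```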
APIs must return a JSON object containing a free-text clinical management plan:

```json
{
  "response": "Assessment: Grade 3 infusion reaction to nivolumab...\n\n1. Refer to Allergy/Immunology for urgent evaluation...\n2. Hold next nivolumab dose until allergy clearance...\n3. ..."
}
```

The `response` field must contain at least 50 characters of clinical text. There is no required structure within the text itself: the model should write a management plan as described in the prompt. See benchmarks/donoharm/outputs/test_001.txt for an example of a valid response.
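The format check can be approximated in Python; the actual rules live in benchmarks/donoharm/schema.json and may be stricter, so treat this as a minimal sketch:

```python
def is_valid_plan(payload):
    """Minimal approximation of the format check: a JSON object with a
    'response' string of at least 50 characters."""
    return (
        isinstance(payload, dict)
        and isinstance(payload.get("response"), str)
        and len(payload["response"]) >= 50
    )

print(is_valid_plan({"response": "x" * 60}))     # True
print(is_valid_plan({"response": "too short"}))  # False
```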
OpenAI-compatible endpoints are also accepted. If your API returns the standard OpenAI chat completions format (choices[0].message.content), the validator will automatically extract the content. This includes endpoints served via OpenRouter or any OpenAI-compatible provider.
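The dual-format handling can be sketched as follows; the function name is ours, and validator.py may handle additional edge cases:

```python
def extract_content(payload: dict) -> str:
    """Pull the model's text from either accepted response shape:
    {"response": "..."} or OpenAI-style chat completions."""
    if "response" in payload:
        return payload["response"]
    # OpenAI chat completions: choices[0].message.content
    return payload["choices"][0]["message"]["content"]
```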
- Study: https://arxiv.org/abs/2512.01241
- Task: Provide clinical recommendations for a medical case
- Input: Reconstructed from real clinical cases, where a generalist physician electronically consulted a specialist/subspecialist
- Output: Free-text management plan (assessment + recommendations)
- Scoring: Evaluated by multiple LLM judges against specialist-authored rubrics
- Validation: Format compliance (schema validation only)
Coming soon. See benchmarks/sct/README.md for preliminary details.
All API responses are saved for auditability:
- test_XXX_response.json: Complete API response with metadata
- test_XXX_validation.json: Validation results and error details
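A minimal sketch of writing those two files per case; the metadata fields in the example call are assumptions, not necessarily what the harness records:

```python
import json
from pathlib import Path

def save_results(results_dir, case_id, api_response, validation):
    """Write the two audit files for one test case, e.g.
    test_001_response.json and test_001_validation.json."""
    out = Path(results_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{case_id}_response.json").write_text(
        json.dumps(api_response, indent=2), encoding="utf-8")
    (out / f"{case_id}_validation.json").write_text(
        json.dumps(validation, indent=2), encoding="utf-8")

# Example (latency_s is an assumed metadata field):
save_results("results/donoharm", "test_001",
             {"response": "Assessment: ...", "latency_s": 12.3},
             {"valid": True, "errors": []})
```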
Install required packages:

```bash
pip install jsonschema requests
```

- Stable endpoint: API must remain accessible for at least 72 hours during benchmarking
- Concurrent requests: Must support 5-10 simultaneous connections
- Authentication: Bearer token authentication required
- Response time: Under 300 seconds per request
- Response format: Valid JSON, either `{"response": "..."}` or the OpenAI-compatible chat completions format
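Before submitting, you can sanity-check the concurrency requirement with a few parallel calls. The sketch below uses a stub in place of a real HTTPS request; swap in requests.post against your own endpoint:

```python
from concurrent.futures import ThreadPoolExecutor

def call_endpoint(case_id):
    # Stand-in for a real HTTPS POST to your API; replace with
    # requests.post(endpoint_url, headers=..., data=..., timeout=300).
    return {"case": case_id, "ok": True}

def smoke_test(n_parallel=8, n_cases=8):
    """Issue n_cases calls across n_parallel workers; return failure count."""
    with ThreadPoolExecutor(max_workers=n_parallel) as pool:
        results = list(pool.map(call_endpoint, range(n_cases)))
    return sum(1 for r in results if not r["ok"])

print(smoke_test())  # 0 failures with the stub
```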
- Input tokens: ~1,500,000
- Output tokens: ~3,000,000-18,000,000 (varies with reasoning depth)
Approximate inference cost for a large frontier-scale model: $125-$400 for a full benchmark run, and potentially higher at greater reasoning effort. Actual costs depend on your provider's current pricing. The benchmark is run at multiple response lengths for sensitivity analysis, and reasoning models produce significantly more output tokens due to chain-of-thought, which increases costs substantially.
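As a worked example of how the estimate scales with reasoning depth, using hypothetical per-million-token prices (substitute your provider's actual rates):

```python
def run_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Total cost in dollars given token counts and $/million-token prices."""
    return ((input_tokens / 1e6) * in_price_per_m
            + (output_tokens / 1e6) * out_price_per_m)

# Hypothetical pricing: $3/M input tokens, $15/M output tokens.
low  = run_cost(1_500_000,  3_000_000, 3.0, 15.0)   # shallow reasoning
high = run_cost(1_500_000, 18_000_000, 3.0, 15.0)   # heavy reasoning
print(low, high)
```

Note that the spread is dominated almost entirely by output tokens, which is why reasoning effort drives the total.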
- Plain text clinical cases
- UTF-8 encoding
- One case per file
- JSON object with a `response` string field, or OpenAI-compatible chat completions format
- Must conform to benchmarks/donoharm/schema.json (after extraction)
- Minimum 50 characters in the response field