Arena Leaderboard Rules
Transparent, Fair, Independent — Every Vote Counts
01 Voting Rules
  • Blind Comparison: Each question gets 4 anonymous model responses (A/B/C/D); identities stay hidden until you have voted
  • Dual Extreme Voting: Pick the best (winner) and the worst (loser); winners gain points, losers are penalized
  • Optional Tie: Choose "About the same" or "Both bad" if the responses are hard to distinguish
  • Instant Reveal: Model identities are revealed immediately after voting, for transparency
  • Dual Mode: Speed mode (fast response) and Expert mode (deep reasoning) are ranked independently
02 Ranking Algorithm

We use a Plackett-Luce probabilistic model combined with UCB-E exposure control. Key points:

  • Dimensionality Note: Each 4-way choice is treated as "1 clear winner vs 3 undifferentiated losers". This is a deliberate engineering tradeoff, not a flaw
  • Winner Bonus:
    Winner gain = K × (1 - P_win) × weight
    The more a model is underestimated, the more it gains from a win (similar to Elo)
  • Loser Penalty: The "worst" model takes a directed penalty (2/3 of the penalty pool); the remaining losers share the rest
  • Arena Score
    Score = 1200 + 400 × log₁₀(γ)
    γ is the model's intrinsic strength parameter, estimated by iterative maximum likelihood estimation (MLE)
  • Sincerity Weight: Dwell time under 2s = weight 0; 2-10s = linear interpolation; 10s or more = weight 1.0. Scroll depth is also checked to block instant voters
  • Model Lifecycle (4-state machine):
    Active, Observing, Eliminated, Probation

    Eliminated models enter a Probation period. Revival is decided by the LCB (Lower Confidence Bound), not by a single win

  • Monthly Reset: Rankings are archived on the 1st of each month and the new month starts fresh; past rankings remain available
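
The formulas above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in lobster.py: the value of K, the size of the penalty pool, the interpolation endpoints (0 at 2s, 1.0 at 10s), and the equal split among the remaining losers are not given in the rules and are chosen here only for the example. Scroll depth is left out because the rules do not say how it is combined with dwell time.

```python
import math

K = 32.0  # assumed step size; the rules do not state the value of K


def sincerity_weight(dwell_seconds: float) -> float:
    """Weight 0 below 2s, linear between 2s and 10s, 1.0 from 10s on.

    Assumes the interpolation runs from 0 at 2s to 1.0 at 10s.
    """
    if dwell_seconds < 2.0:
        return 0.0
    if dwell_seconds >= 10.0:
        return 1.0
    return (dwell_seconds - 2.0) / 8.0


def win_probability(gammas: dict[str, float], model: str) -> float:
    """Plackett-Luce probability that `model` is picked first among `gammas`."""
    return gammas[model] / sum(gammas.values())


def winner_gain(p_win: float, weight: float, k: float = K) -> float:
    """Winner gain = K x (1 - P_win) x weight."""
    return k * (1.0 - p_win) * weight


def split_penalty(pool: float, worst: str, losers: list[str]) -> dict[str, float]:
    """The worst model takes 2/3 of the pool; other losers share the rest.

    Equal sharing among the other losers is an assumption.
    """
    others = [m for m in losers if m != worst]
    penalties = {worst: pool * 2.0 / 3.0}
    for m in others:
        penalties[m] = pool / 3.0 / len(others)
    return penalties


def arena_score(gamma: float) -> float:
    """Score = 1200 + 400 x log10(gamma)."""
    return 1200.0 + 400.0 * math.log10(gamma)
```

With four equally rated models, `win_probability` returns 0.25, so a fully weighted win is worth `winner_gain(0.25, 1.0)`, which is 24 points under the assumed K of 32. A model with γ = 1 sits at the baseline score of 1200, and each tenfold increase in γ adds 400 points.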
03 Anti-Cheat
  • Server-side User ID: user_id issued via HMAC, preventing client-side forgery
  • IP Rate Limiting: Max 20 requests per 60s per IP; excess requests receive HTTP 429
  • Vote Cooldown: Minimum 10s between votes per user
  • Daily Vote Cap: Max 50 votes per user per day
  • Browser Fingerprint: Canvas/WebGL/Audio fingerprinting to detect multiple accounts on the same device
  • Anomaly Detection: Repeated votes on the same question and highly repetitive voting patterns are flagged
  • Sincerity Filter: Dual check on dwell time and scroll depth; instant voters get weight 0
  • Audit Log: Complete audit trail for all voting behavior, fully traceable
04 Data Transparency
  • Vote & Reveal: Model identities are disclosed after each vote; no black-box operations
  • Open Source Ranking: The algorithm of the core ranking engine (lobster.py) is fully public and auditable
  • Transparent Scoring: The Arena Score formula, weight factors, and state transition rules are all public
  • Monthly Archives: Historical ranking data is archived monthly; any period can be queried
  • Auditable: Complete voting audit logs are retained; community oversight is welcome
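
For readers who want to audit rankings from archived votes, the sketch below shows one possible way to estimate the strength parameter γ by iterative maximum likelihood. It is not the lobster.py implementation. It assumes each vote is reduced to a tuple of (winner, candidates, weight), uses the standard minorization-maximization update for first-choice data, normalizes γ to a mean of 1, and ignores ties, loser penalties, and exposure control. The function name `fit_strengths` and the vote format are hypothetical.

```python
from collections import defaultdict


def fit_strengths(votes, iterations: int = 200, tol: float = 1e-10) -> dict[str, float]:
    """Estimate gamma per model from (winner, candidates, weight) votes.

    Update rule: gamma_i <- W_i / sum over votes containing i of
    weight / (sum of gamma over that vote's candidates),
    where W_i is the total weight of votes that model i won.
    """
    # Zero-weight votes carry no information and are dropped.
    votes = [(w, list(c), wt) for w, c, wt in votes if wt > 0]
    if not votes:
        return {}
    models = sorted({m for _, cands, _ in votes for m in cands})
    gamma = {m: 1.0 for m in models}

    wins = defaultdict(float)
    for winner, _, weight in votes:
        wins[winner] += weight

    for _ in range(iterations):
        denom = defaultdict(float)
        for _, cands, weight in votes:
            total = sum(gamma[m] for m in cands)
            for m in cands:
                denom[m] += weight / total
        new = {m: wins[m] / denom[m] for m in models}
        # Normalization to mean 1 is an assumption; the scale is otherwise free.
        mean = sum(new.values()) / len(new)
        new = {m: g / mean for m, g in new.items()}
        delta = max(abs(new[m] - gamma[m]) for m in models)
        gamma = new
        if delta < tol:
            break
    return gamma
```

A model that never wins ends up with γ = 0 under pure maximum likelihood, so its logarithmic Arena Score is undefined; a production system would need some form of smoothing or a floor, which this sketch leaves out.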
05 Independence Declaration

This platform has no corporate backing

No AI model provider intervention allowed

Driving better AI development through real user evaluation data

  • No Corporate Ties: Independently operated, no financial ties to any AI provider
  • Tamper-proof Algorithm: Rankings computed automatically by Plackett-Luce, no manual override
  • Data = Truth: All rankings independently reproducible from raw vote data
  • Open Oversight: Community welcome to independently audit algorithms, data, and results