For each question, 4 models from different providers are selected for blind testing. Selection combines intent classification with four-slot forced exploration:
A lightweight regex + keyword classifier assigns each question to one of five domains:
| Domain | Trigger |
|---|---|
| code | code, programming, bug, API, function, implementation |
| math | calculation, equation, proof, integral, probability |
| creative | story, poetry, creative, continuation, novel |
| factual | what is, explain, history, science, principle |
| reasoning | analysis, comparison, logic, reasoning, argument |
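
The classifier described above could be sketched as follows. This is a minimal illustration, not the project's actual code: the trigger words come from the table, while the whole-word matching, the hit-count scoring, the tie-breaking by table order, and the `factual` fallback are all assumptions.

```python
import re

# Trigger keywords copied from the table; everything else here is assumed.
DOMAIN_TRIGGERS = {
    "code": ["code", "programming", "bug", "API", "function", "implementation"],
    "math": ["calculation", "equation", "proof", "integral", "probability"],
    "creative": ["story", "poetry", "creative", "continuation", "novel"],
    "factual": ["what is", "explain", "history", "science", "principle"],
    "reasoning": ["analysis", "comparison", "logic", "reasoning", "argument"],
}

# One case-insensitive, whole-word pattern per domain.
DOMAIN_PATTERNS = {
    domain: re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    for domain, words in DOMAIN_TRIGGERS.items()
}


def classify_intent(question: str, default: str = "factual") -> str:
    """Return the domain with the most trigger hits (hypothetical scoring rule)."""
    hits = {d: len(p.findall(question)) for d, p in DOMAIN_PATTERNS.items()}
    best = max(hits, key=hits.get)  # on ties, max() keeps the first domain in table order
    return best if hits[best] > 0 else default
```

For example, `classify_intent("Why does this function throw a bug?")` counts two hits for `code` and none elsewhere, so it returns `"code"`.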
Each of the 4 slots serves a distinct purpose, balancing exploration against exploitation:
| Slot | Strategy | Role |
|---|---|---|
| Slot 1 | Strongest Baseline | Current P-L #1; ensures baseline quality |
| Slot 2 | UCB Dynamic Challenge | UCB selects a high-potential model to challenge the leader |
| Slot 3 | Observing Gray-zone | Random pick among models with impressions < threshold, to accumulate data |
| Slot 4 | 30% Upset | With 30% probability, picks from the bottom tier to prevent filter bubbles |
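
A rough sketch of how the four slots might be filled is shown here. The `Model` dataclass, the `impression_threshold` and `bottom_frac` values, and the fallback used when the 30% draw misses (a uniform pick from the remaining pool) are assumptions for illustration only. The UCB score is taken as a precomputed field.

```python
import random
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    provider: str
    strength: float   # Plackett-Luce strength
    ucb: float        # precomputed UCB-E score
    impressions: int


def pick_four(models, impression_threshold=50, upset_prob=0.30,
              bottom_frac=0.25, rng=random):
    """Fill the four slots, never reusing a provider (needs >= 4 providers)."""
    chosen = []

    def available():
        used = {m.provider for m in chosen}
        return [m for m in models if m.provider not in used]

    # Slot 1: strongest baseline (current P-L #1).
    chosen.append(max(available(), key=lambda m: m.strength))
    # Slot 2: highest UCB score among the remaining providers.
    chosen.append(max(available(), key=lambda m: m.ucb))
    # Slot 3: random gray-zone model (few impressions); assumed fallback: any model.
    pool = available()
    gray = [m for m in pool if m.impressions < impression_threshold]
    chosen.append(rng.choice(gray or pool))
    # Slot 4: with probability upset_prob pick from the bottom tier,
    # otherwise (assumption) pick uniformly from what is left.
    pool = sorted(available(), key=lambda m: m.strength)
    bottom = pool[: max(1, int(len(pool) * bottom_frac))]
    chosen.append(rng.choice(bottom) if rng.random() < upset_prob else rng.choice(pool))
    return chosen
```

Passing a seeded `random.Random` as `rng` makes the selection reproducible for testing.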
Slot 2 UCB-E score formula:
Where γ_m is the Plackett-Luce inferred model strength, N is total battles, n_m is model m's battle count, and c is the exploration constant.
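
The formula itself is not reproduced in this section. One standard UCB1-style form that is consistent with the symbols defined here is strength plus c times the square root of ln(N) divided by the model's battle count. The sketch below assumes that form. The function name, the default `c`, and the treatment of unplayed models as infinitely attractive are also assumptions.

```python
import math


def ucb_e_score(gamma_m: float, total_battles: int, battles_m: int, c: float = 1.0) -> float:
    """Assumed UCB1-style score: strength plus an exploration bonus."""
    if battles_m == 0:
        return math.inf  # assumption: a model with no battles is always explored first
    bonus = c * math.sqrt(math.log(max(total_battles, 1)) / battles_m)
    return gamma_m + bonus
```

Under this form, two models with equal strength are separated by their battle counts: the one with fewer battles gets the larger bonus and is more likely to fill Slot 2.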
After voting, users can trigger cross-validation. Totoro refines the 4 model responses through factual distillation, producing an ultimate truth-seeking answer. Core algorithm: Four-Dimensional Weighted Consensus:
Each model's information is weighted by its P-L rank via Sigmoid mapping:
Mapped to [0.5, 2.0]. Rank #1 ≈ 1.5, bottom ≈ 0.8.
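
The mapping formula is not shown here, so the following is only one parameterization that reproduces the stated values: output bounded by [0.5, 2.0], rank #1 at 1.5, and the bottom rank at 0.8. The use of a rank percentile and the constants ln 2 and ln 8 are assumptions chosen to hit those two anchor points.

```python
import math

LOW, HIGH = 0.5, 2.0  # output range stated in the text


def rank_weight(rank: int, n_models: int) -> float:
    """Map a 1-based P-L rank to a weight via a sigmoid (assumed parameterization)."""
    # Percentile: 0.0 for rank #1, 1.0 for the bottom rank.
    p = 0.0 if n_models <= 1 else (rank - 1) / (n_models - 1)
    z = math.log(2) - math.log(8) * p          # sigmoid(z) = 2/3 at the top, 0.2 at the bottom
    sigmoid = 1.0 / (1.0 + math.exp(-z))
    return LOW + (HIGH - LOW) * sigmoid
```

With these constants, `rank_weight(1, 10)` evaluates to 1.5 and `rank_weight(10, 10)` to 0.8, and weights decrease monotonically in between.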
If a core data point / logic block is independently mentioned by 3+ models:
Weight modifiers based on user votes and DPO labels:
| Tag | Effect | Modifier |
|---|---|---|
| Factually rigorous / no hallucination | Winner unique info boosted | × 2.0 |
| Code / format zero errors | Winner code blocks boosted | × 2.0 |
| Extremely strong logic | Winner logic chain boosted | × 1.5 |
| Excellent instruction following | Winner structure boosted | × 1.5 |
| Severe factual hallucination | Loser info circuit-breaker | × 0.0 |
| Over-aligned / verbose | Loser downweighted | × 0.3 |
| Logic break / infinite loop | Loser downweighted | × 0.2 |
| Format crash / broken code | Loser downweighted | × 0.3 |
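
The modifier table can be expressed as a lookup, sketched here. The tag identifiers are hypothetical names for the rows above. Applying the factor to a whole response weight and multiplying several tags together are simplifying assumptions, since the table scopes some boosts to specific content such as code blocks or logic chains.

```python
# (target role, multiplier) per tag; keys are hypothetical identifiers for the table rows.
USER_SIGNAL_MODIFIERS = {
    "factually_rigorous": ("winner", 2.0),
    "code_zero_errors": ("winner", 2.0),
    "strong_logic": ("winner", 1.5),
    "instruction_following": ("winner", 1.5),
    "severe_hallucination": ("loser", 0.0),  # circuit-breaker: information is dropped
    "verbose": ("loser", 0.3),
    "logic_break": ("loser", 0.2),
    "format_crash": ("loser", 0.3),
}


def apply_user_signal(base_weight: float, tags, role: str) -> float:
    """Multiply base_weight by every tag modifier that targets this role."""
    weight = base_weight
    for tag in tags:
        target, factor = USER_SIGNAL_MODIFIERS[tag]
        if target == role:
            weight *= factor
    return weight
```

For instance, a winner with base weight 1.5 tagged both "factually rigorous" and "extremely strong logic" ends at 1.5 × 2.0 × 1.5 = 4.5 under this multiplicative assumption.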
Information that contains exact values or API params, or that aligns with the external search context, gets highest priority; vague claims and uncited theories are forcibly removed.
Each cross-validation output includes an immutable verification trace log:
VERIFICATION PROOF STRUCTURE
proof_hash = SHA256(
battle_id +
question +
M1..M4 responses +
weights_applied +
user_signal +
refined_answer +
timestamp
)
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Trace log includes:
The hash is generated in real time by the backend and displayed as a collapsible log on the frontend, so that every answer is traceable.
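
A minimal sketch of the proof hash follows, using Python's standard `hashlib` and `json`. The function name and argument types are assumptions. The structure above concatenates fields with `+`, and this sketch swaps in canonical JSON so that field boundaries stay unambiguous. That serialization is a choice made for the example, not something the structure above specifies.

```python
import hashlib
import json


def proof_hash(battle_id, question, responses, weights_applied,
               user_signal, refined_answer, timestamp) -> str:
    """SHA-256 over the proof fields, serialized as canonical JSON (assumed encoding)."""
    if len(responses) != 4:
        raise ValueError("expected the M1..M4 responses")
    payload = {
        "battle_id": battle_id,
        "question": question,
        "responses": list(responses),
        "weights_applied": weights_applied,
        "user_signal": user_signal,
        "refined_answer": refined_answer,
        "timestamp": timestamp,
    }
    # sort_keys + fixed separators make the byte string independent of dict order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Identical inputs always produce the same 64-character hex digest, and changing any field changes it. Note that the sample digest shown above equals the SHA-256 of empty input, so it reads as a placeholder rather than a real proof value.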
Arena ranking is based on the Plackett-Luce model combined with the UCB-E anti-cheat strategy:
An EM algorithm iteratively estimates the γ parameters, and the UCB-E (Upper Confidence Bound - Explorer) strategy balances exploration against exploitation.
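
For illustration, the γ parameters of a Plackett-Luce model can be fitted iteratively as sketched here. This sketch uses the minorization-maximization (MM) update commonly applied to Plackett-Luce, which may differ from the EM variant mentioned above. The function name, the input format (each vote as a best-to-worst list), and the small smoothing term `eps` are assumptions.

```python
from collections import defaultdict


def fit_plackett_luce(rankings, iters=200, eps=1e-3):
    """Estimate strengths from rankings (best first) with MM updates; result sums to 1."""
    items = sorted({m for r in rankings for m in r})
    gamma = {m: 1.0 / len(items) for m in items}
    for _ in range(iters):
        wins = defaultdict(float)    # times an item was chosen at some stage
        denom = defaultdict(float)   # sum of 1 / (strength of the remaining set)
        for ranking in rankings:
            for t in range(len(ranking) - 1):       # last stage carries no choice
                remaining = ranking[t:]
                total = sum(gamma[m] for m in remaining)
                wins[ranking[t]] += 1.0
                for m in remaining:
                    denom[m] += 1.0 / total
        # eps keeps never-winning items strictly positive (assumed smoothing).
        new = {m: (wins[m] + eps) / denom[m] if denom[m] > 0 else gamma[m] for m in items}
        norm = sum(new.values())
        gamma = {m: v / norm for m, v in new.items()}
    return gamma
```

With pairwise input such as three `["A", "B"]` votes and one `["B", "A"]` vote, the estimate for A settles near 0.75, which matches the observed win rate.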
Axiom V is LLMVECT's full-stack anti-cheat and vote credibility engine, composed of four modules forming an unbypassable security pipeline:
Each vote must pass 7 layers of protection before counting toward rankings:
| # | Layer | Rejection Code | Threshold |
|---|---|---|---|
| 1 | IP Rate Limiting | 429 | 10 req / 60s |
| 2 | Duplicate Vote Detection | 409 | Same user_id + battle_id |
| 3 | Cooldown Period | 425 | 5s cooldown |
| 4 | Daily Cap | 429 | Blind 5 votes / Think 1 vote |
| 5 | Device Fingerprint Check | 403 | device_hash blacklist |
| 6 | Anomaly Pattern Detection | Downweight | High-frequency / extreme bias |
| 7 | Dual-Track Quota | 429 | Blind + Think counted independently |
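
The gate could be sketched as an in-memory pipeline like the one here. It covers layers 1 to 5 and 7 with the thresholds from the table. Layer 6 (anomaly downweighting) is omitted, and the class name, storage layout, check order within a request, and the `200` success code are assumptions.

```python
from collections import defaultdict, deque

DAILY_CAP = {"blind": 5, "think": 1}  # layers 4 and 7: per-mode, counted independently


class VoteGate:
    def __init__(self):
        self.ip_hits = defaultdict(deque)   # ip -> recent request timestamps
        self.seen = set()                   # (user_id, battle_id) pairs already voted
        self.last_vote = {}                 # user_id -> time of last accepted vote
        self.daily = defaultdict(int)       # (user_id, day, mode) -> accepted votes
        self.blacklist = set()              # banned device_hash values

    def check(self, *, ip, user_id, battle_id, device_hash, mode, now):
        """Return an HTTP-style code: 200 if the vote counts, else the rejection code."""
        hits = self.ip_hits[ip]
        while hits and now - hits[0] >= 60:
            hits.popleft()
        hits.append(now)
        if len(hits) > 10:                                   # layer 1: 10 req / 60s
            return 429
        if (user_id, battle_id) in self.seen:                # layer 2: duplicate vote
            return 409
        last = self.last_vote.get(user_id)
        if last is not None and now - last < 5:              # layer 3: 5s cooldown
            return 425
        day = int(now // 86400)
        if self.daily[(user_id, day, mode)] >= DAILY_CAP[mode]:  # layers 4 + 7
            return 429
        if device_hash in self.blacklist:                    # layer 5: device blacklist
            return 403
        self.seen.add((user_id, battle_id))
        self.last_vote[user_id] = now
        self.daily[(user_id, day, mode)] += 1
        return 200
```

Keying the daily counter on `(user_id, day, mode)` is what keeps the Blind and Think quotas from consuming each other in this sketch.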
Each voting device generates a unique device_hash via triple hardware fingerprinting:
device_hash uses a daily salt (the same device produces a stable hash within 24h), ensuring tracking continuity. Constraints:
| Constraint | Value | Violation |
|---|---|---|
| Max user_ids per device | 3 | Registration rejected (403) |
| Daily vote cap per device | 30 | Device banned (403) |
| Banned device blacklist | Permanent | All linked user_ids circuit-broken |
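
A sketch of a daily-salted device hash and the per-device constraints follows. The three fingerprint signals are not listed in this section, so the `signals` tuple holds hypothetical placeholders. The `secret` value, the SHA-256 construction, and banning on the vote that exceeds the cap are likewise assumptions.

```python
import datetime
import hashlib
from collections import defaultdict


def device_hash(signals: tuple, day: datetime.date, secret: str = "server-secret") -> str:
    """Hash three fingerprint signals with a salt that changes once per day."""
    material = "|".join((secret, day.isoformat()) + tuple(signals))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class DeviceRegistry:
    MAX_USERS = 3         # max user_ids per device
    DAILY_VOTE_CAP = 30   # daily vote cap per device

    def __init__(self):
        self.users = defaultdict(set)
        self.votes = defaultdict(int)
        self.banned = set()

    def register(self, dev: str, user_id: str) -> int:
        if dev in self.banned:
            return 403
        known = self.users[dev]
        if user_id not in known and len(known) >= self.MAX_USERS:
            return 403                      # registration rejected
        known.add(user_id)
        return 200

    def record_vote(self, dev: str) -> int:
        if dev in self.banned:
            return 403
        self.votes[dev] += 1
        if self.votes[dev] > self.DAILY_VOTE_CAP:
            self.banned.add(dev)            # device banned once the cap is exceeded
            return 403
        return 200
```

Because the salt depends only on the date, the same signals hash to the same value all day and to a different value the next day.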
Each battle's ELO K-value is dynamically adjusted by vote consensus rate — contested battles get higher weight, landslide battles get minimal weight:
D-Factor mapping function:
| Consensus Rate | D-Factor | Semantics |
|---|---|---|
| ≥ 1.00 (unanimous) | 0.2 | Landslide, minimal ranking impact |
| 0.75 | 1.67 | Majority agree, moderate impact |
| 0.50 (split) | 2.33 | Even split, high-weight decisive |
| ≤ 0.25 (contentious) | 3.0 | Fierce contest, maximum weight impact |
Consensus rate = leader votes / total votes, updated in real-time after each round, ensuring subsequent voters face ELO params adapted to current contention.
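
The mapping function is not given explicitly. The four table rows are consistent with a piecewise reading: a fixed 0.2 for unanimous votes, a cap of 3.0 at or below 0.25, and a straight line in between. The sketch below assumes that reading. Multiplying a base K by the D-Factor in `effective_k` is also an assumption about how the adjustment is applied.

```python
def consensus_rate(votes_per_model) -> float:
    """Leader votes divided by total votes."""
    total = sum(votes_per_model)
    if total == 0:
        raise ValueError("no votes yet")
    return max(votes_per_model) / total


def d_factor(rate: float) -> float:
    """Assumed piecewise-linear mapping fitted to the table rows."""
    if rate >= 1.0:
        return 0.2                      # unanimous: minimal impact
    if rate <= 0.25:
        return 3.0                      # most contentious: maximum impact
    return 3.0 - (8.0 / 3.0) * (rate - 0.25)


def effective_k(base_k: float, rate: float) -> float:
    """Scale a base K-value by the D-Factor (assumed combination rule)."""
    return base_k * d_factor(rate)
```

The line gives about 2.33 at a rate of 0.50 and about 1.67 at 0.75, matching the table. With four candidates the leader always holds at least a quarter of the votes, so 0.25 is the natural lower bound of the rate.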
Blind and Think modes are counted independently, no cross-consumption:
| Mode | Daily Quota | Reveal | ELO Eligible |
|---|---|---|---|
| Blind | 5 / day | Revealed after voting | ✅ Yes |
| Think | 1 / day | Real-time visible reasoning | ✅ Yes |
When Think Mode is on, users see the full reasoning chain, but the mode is limited to 1 vote per day to prevent over-reliance on a single model's reasoning.
AXIOM V INTEGRITY GUARANTEE
Each vote record carries: device_hash | is_think_enabled | task_difficulty_factor
Each battle carries: total_votes | consensus_rate | difficulty_factor
All fields auditable, SHA-256 attested as V-Verification Hash Chain
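
One way such a hash chain could work is sketched here: each link is the SHA-256 of the previous link plus the canonical JSON of the next vote record. The genesis value, the JSON serialization, and the function names are assumptions. Only the record fields come from the text above.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain


def chain_votes(records):
    """Return one hash per record, each committing to all earlier records."""
    prev = GENESIS
    links = []
    for rec in records:
        body = json.dumps(rec, sort_keys=True, separators=(",", ":"))
        prev = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        links.append(prev)
    return links


def verify_chain(records, links) -> bool:
    """Recompute the chain and compare it with the stored links."""
    return chain_votes(records) == links
```

Editing any earlier record changes its link and every link after it, so tampering is detected when the chain is recomputed.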