6 sports · calibrated ML · conformal prediction
Ensemble ML with calibration guarantees, conformal uncertainty quantification, and real-time drift monitoring across six major sports markets.
The edge
Sports betting markets are structurally inefficient. Retail platforms aggregate noisy consensus signals into lines that reflect sentiment, not probability. The gap between implied probability and true probability is where alpha lives.
Most prediction services chase accuracy. We chase calibration. A model that says 62% and hits 62% of the time is worth more than one that says 90% and hits 70%. Calibrated probabilities enable proportional sizing, expected value optimization, and principled risk management — the same framework institutional desks use for any other asset class.
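To make that gap concrete, here is a minimal sketch (ours, not the production code; function names and the -110 example are illustrative) of converting an American moneyline into implied probability and differencing it against a calibrated estimate:

```python
# Minimal sketch: implied probability from an American moneyline, and the
# edge of a calibrated model estimate over it. Names are illustrative.

def implied_prob(american_odds: int) -> float:
    """Implied win probability from an American moneyline, vig included."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

def edge(model_prob: float, american_odds: int) -> float:
    """Calibrated model probability minus the market's implied probability."""
    return model_prob - implied_prob(american_odds)

# A calibrated 62% estimate against a -110 line (implied ~52.4%):
print(f"{edge(0.62, -110):+.3f}")  # ≈ +0.096, i.e. ~9.6 points of edge
```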
Our pipeline treats every prediction as a probability estimate with quantified uncertainty — not a pick. Conformal prediction intervals provide coverage guarantees. Drift detection flags regime changes before they erode edge. The system knows when it doesn't know.
Market coverage
Prediction coverage by sport across the calendar year. Overlapping seasons ensure continuous signal generation.
Model architecture
Multiple gradient-boosted learners combined via a proprietary stacking architecture. Production models span 6 sports and multiple prediction types.
Multiple calibration methods compete on held-out data; the best-performing calibrator is selected automatically. Safety checks prevent calibration from degrading raw model quality.
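A hedged sketch of how such a bake-off can work, using two standard scikit-learn calibrators and folding in the degradation check described under Validation rigor (the candidate set and all names are assumptions, not the service's components):

```python
# Illustrative calibrator competition: fit standard calibrators, score each
# by Brier score, keep the winner only if it beats the raw model.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

def select_calibrator(p_raw: np.ndarray, y: np.ndarray):
    """Pick the calibrator with the best Brier score on held-out data.

    Simplified: in practice, fit on one calibration split and score on
    another so the winner isn't chosen on its own training data.
    """
    iso = IsotonicRegression(out_of_bounds="clip").fit(p_raw, y)
    platt = LogisticRegression().fit(p_raw.reshape(-1, 1), y)
    candidates = {
        "isotonic": (iso, iso.predict(p_raw)),
        "platt": (platt, platt.predict_proba(p_raw.reshape(-1, 1))[:, 1]),
    }
    best_name, (best_model, best_p) = min(
        candidates.items(), key=lambda kv: brier_score_loss(y, kv[1][1])
    )
    # Safety check: reject calibration entirely if it degrades the raw model.
    if brier_score_loss(y, best_p) >= brier_score_loss(y, p_raw):
        return "raw", None
    return best_name, best_model
```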
Conformal prediction provides coverage-guaranteed prediction sets at a configurable confidence level. When the model cannot confidently distinguish between outcomes, it says so, and the system acts accordingly.
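The standard split-conformal construction gives a flavor of how such sets are built; this is the textbook recipe, not necessarily the service's exact variant (the alpha level and score function are assumptions):

```python
# Split-conformal prediction sets for a binary market (e.g. home/away win).
# A size-2 set means "can't confidently separate the outcomes": skip.
import numpy as np

def conformal_qhat(cal_probs, cal_labels, alpha=0.10):
    """Finite-sample quantile of nonconformity scores (1 - p(true class)).

    cal_probs: (n, 2) class probabilities on a held-out calibration set;
    cal_labels: integer true classes.
    """
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def prediction_set(probs, qhat):
    """Every class whose nonconformity 1 - p(class) falls within qhat."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]

# Usage: a set like [1] is a confident call; [0, 1] is an automatic skip.
```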
Statistical drift tests run on every inference batch. High-severity drift suppresses predictions until the model is retrained on current-regime data.
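For illustration, a per-feature two-sample Kolmogorov-Smirnov screen is one common way to implement such batch checks (the service does not disclose its tests; the threshold is an assumption):

```python
# Hedged sketch: compare each feature's live batch against its training
# distribution with a two-sample KS test; flag significant shifts.
from scipy.stats import ks_2samp

def batch_drift(train_features: dict, live_features: dict, p_threshold=0.01):
    """Return (name, statistic) for features whose live distribution shifted."""
    drifted = []
    for name in train_features:
        stat, p_value = ks_2samp(train_features[name], live_features[name])
        if p_value < p_threshold:
            drifted.append((name, stat))
    return drifted  # a high-severity result here would suppress predictions
```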
Rolling-window expected calibration error (ECE) tracks whether the calibration surface has shifted since training. When drift exceeds the threshold, the system flags the model for retraining before edge erodes.
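A minimal version of such a rolling ECE monitor, with illustrative bin count and window size (not the service's settings):

```python
# Rolling-window ECE sketch: standard binned calibration error over the
# most recent scored predictions.
import numpy as np

def ece(probs, outcomes, n_bins=10):
    """Binned ECE: sample-weighted |mean confidence - mean outcome| per bin."""
    probs, outcomes = np.asarray(probs), np.asarray(outcomes)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return total

def rolling_ece(probs, outcomes, window=500):
    """ECE over the most recent `window` scored predictions."""
    return ece(probs[-window:], outcomes[-window:])

# if rolling_ece(...) exceeds the threshold, flag the model for retraining
```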
End-to-end data flow from ingestion to execution. Each stage is independently monitored.
Validation rigor
Every methodology decision is designed to prevent the most common failure mode in quantitative modeling: overfitting to historical data that doesn't generalize.
Temporal CV with purging and embargo periods between train and test boundaries. Eliminates lookahead bias that inflates backtest results. Season-aware splits for multi-season datasets.
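A minimal sketch of what such a splitter can look like, with illustrative fold and embargo parameters (the actual splitter is not disclosed):

```python
# Illustrative purged temporal split: train strictly before each test
# window, then drop (embargo) the samples bordering the boundary so
# overlapping information cannot leak across it.
import numpy as np

def purged_time_splits(n_samples, n_splits=5, embargo=50):
    """Yield (train_idx, test_idx) walk-forward folds with an embargo gap."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        test_start = k * fold
        test_idx = np.arange(test_start, min(test_start + fold, n_samples))
        train_end = max(test_start - embargo, 0)   # purge the boundary
        yield np.arange(0, train_end), test_idx
```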
The stacking architecture is designed to prevent information leakage at every layer — base models, meta-learner, and calibration are trained on strictly separated data.
Automated checks ensure calibration never degrades raw model quality. If post-calibration metrics are worse, calibration is rejected entirely.
Coverage guarantees are provable, not claimed from backtest results. Conformal prediction constructs prediction sets that achieve the target coverage rate by design, with set size as a direct uncertainty signal.
Bayesian hyperparameter optimization with early pruning. More sample-efficient than grid or random search. Cross-validation during optimization prevents overfitting to a single train/test split.
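A sketch of what this stage could look like with Optuna as a stand-in optimizer (the source does not name its tooling; the search space and learner are placeholders):

```python
# Hedged sketch: Bayesian search with a pruner, scored by cross-validation.
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def objective(trial, X, y):
    model = GradientBoostingClassifier(
        learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        max_depth=trial.suggest_int("max_depth", 2, 8),
        n_estimators=trial.suggest_int("n_estimators", 100, 800),
    )
    # CV inside the objective avoids overfitting a single train/test split.
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

study = optuna.create_study(
    direction="maximize", pruner=optuna.pruners.MedianPruner()
)
# Pruning kicks in when the objective reports intermediate scores via
# trial.report(); omitted here for brevity.
# study.optimize(lambda t: objective(t, X, y), n_trials=100)
```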
Fraction of predictions receiving set size 1 (confident) vs set size 2 (uncertain skip) by model confidence. At low confidence, conformal correctly flags nearly all predictions as uncertain. At 70%+, the model confirms with single-class prediction sets. 17,894 total predictions.
Risk intelligence
Knowing when not to bet is the real edge. Every prediction passes through multiple independent risk filters before it reaches the execution layer.
When the model cannot confidently distinguish between outcomes, the system skips automatically. No override, no manual judgment.
Edge requirements scale dynamically with market price. Extreme favorites and longshots face higher thresholds to compensate for asymmetric risk.
Kelly-inspired sizing proportional to model conviction. Position size scales with edge: high conviction gets more, marginal signals get less.
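A fractional-Kelly sketch of what sizing proportional to conviction can look like; the scaling fraction and bankroll cap are illustrative assumptions, not the service's parameters:

```python
# Fractional Kelly: stake proportional to edge over odds, scaled down by a
# conservative fraction and capped per bet.

def kelly_fraction(p_win: float, decimal_odds: float,
                   scale: float = 0.25, cap: float = 0.02) -> float:
    """Fraction of bankroll to stake; 0 when there is no positive edge."""
    b = decimal_odds - 1.0                       # net odds per unit staked
    full = (p_win * b - (1.0 - p_win)) / b       # full-Kelly fraction
    return max(0.0, min(full * scale, cap))

# -110 American ≈ 1.909 decimal; a calibrated 62% estimate:
print(kelly_fraction(0.62, 1.909))               # ≈ 0.02 (hits the 2% cap)
```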
When feature or calibration drift exceeds severity thresholds, predictions are suppressed until the model is retrained. The system does not bet on stale data.
A proprietary signal quality score filters low-quality predictions before they reach the execution layer. Multiple independent dimensions are evaluated — the specifics are not disclosed.
Adaptive pipeline
Most quantitative systems retrain weekly or monthly. Ours retrains in minutes and deploys in seconds.
Full model retraining — including optimization, validation, calibration, and drift baseline computation — completes in minutes, not hours. No manual intervention. When the market shifts, the models shift with it.
Automated feature engineering discovers and evaluates signal candidates across multiple domains. Features are scored and selected dynamically — not hand-tuned by a human staring at spreadsheets.
Retrained models deploy to production with zero downtime. The previous model serves predictions until the new one is validated and promoted.
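One common pattern for this kind of promotion is an atomic reference swap behind a registry; a minimal sketch with illustrative names (not the production mechanism):

```python
# Zero-downtime promotion sketch: the live model keeps serving while a
# retrained candidate validates; the swap is one atomic reference change.
import threading

class ModelRegistry:
    def __init__(self, model):
        self._live = model
        self._lock = threading.Lock()

    def predict(self, features):
        return self._live.predict(features)    # always the promoted model

    def promote(self, candidate, validate) -> bool:
        """Swap in `candidate` only after it passes validation."""
        if not validate(candidate):
            return False                        # keep serving the old model
        with self._lock:
            self._live = candidate              # atomic reference swap
        return True
```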
The pipeline monitors for feature and calibration drift in real time. When severity exceeds threshold, retraining triggers automatically. The system adapts to regime changes continuously.
Rolling Expected Calibration Error over time. When ECE approaches threshold, automated retraining fires. Green dot marks a retrain event.
Performance
ROC-AUC from purged cross-validation during training. All other metrics computed on live production predictions scored against real outcomes — not backtests.
Every metric on this page is derived from real production data and independently verifiable. We maintain full prediction logs with timestamps, model versions, and scored outcomes. Ask us to prove any number.
| Metric | Aggregate |
|---|---|
| ROC-AUC (area under the receiver operating characteristic curve, CV) | 0.713 |
| Win Rate (accuracy on recommended predictions, production) | 69.4% |
| Brier Score (mean squared error of probability estimates; lower is better) | 0.207 |
| ECE (expected calibration error; lower is better) | 0.022 |
| Avg Confidence (mean model probability on scored predictions) | 70.5% |
| High-Conf WR (win rate on predictions with ≥70% model confidence) | 74.2% |
| Total Scored (predictions evaluated; wins + losses only) | 5,509 |
- Aggregate Win Rate: 69.4%, vs the 52.4% break-even rate at standard -110 vig (laying 110 to win 100 requires winning 110/210 ≈ 52.4% of bets). +17pp above the house edge across 5,509 predictions. NFL leads at 78.6%.
- Expected Calibration Error: 0.022, below typical ML models. Near-perfect calibration. NCAAF achieves 0.009: the model says 70%, it hits 70%.
- Sharpe Ratio: in the range of top quantitative funds. Risk-adjusted return ratio; Renaissance Medallion targets ~6. Flat-bet methodology inflates Sharpe vs leveraged strategies, but the signal quality is real.
- NCAAF ROC-AUC: above published academic models. Discriminative power well above the peer-reviewed sports prediction literature. NCAAB follows at 0.791.
Predicted probability vs observed outcome frequency. Points near the diagonal indicate well-calibrated estimates. Bubble size proportional to sample count.
Hypothetical cumulative P&L assuming a 1-unit flat bet at standard -110 vig across 5,509 scored predictions. Not investment advice.
Full prediction log available for audit — every bet timestamped and scored against final outcomes.
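For reproducibility, the flat-bet curve reduces to simple arithmetic: each win at -110 returns +100/110 units on a 1-unit stake, each loss costs 1 unit. A minimal sketch (the outcomes array stands in for the published log):

```python
# Flat-bet P&L sketch: 1 unit risked per bet at -110.
import numpy as np

def flat_bet_pnl(outcomes, win_return=100 / 110):
    """Cumulative units won/lost; outcomes is 1 for a win, 0 for a loss."""
    outcomes = np.asarray(outcomes, dtype=float)
    per_bet = np.where(outcomes == 1, win_return, -1.0)
    return per_bet.cumsum()

# A 69.4% win rate nets ≈ 0.694 * 0.909 - 0.306 ≈ +0.32 units per bet.
```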
Distribution of predicted edge with win rate overlay. Higher edge correlates with higher win rate and ROI.
ROC-AUC sourced from purged temporal cross-validation. Production metrics computed from 5,509 live predictions across 13 months (Mar 2025 — Mar 2026). Small-sample sports (NFL n=126, NCAAF n=376) carry wider confidence intervals.
Signal output
Every prediction is a structured signal — not a pick. Calibrated probability, quantified uncertainty, edge magnitude, position sizing, and a binary proceed/skip decision. The output is designed for systematic execution, not gut-feel betting.
Illustrative example. Full prediction log available for partner-level audit — every bet timestamped and scored against final outcomes.
The quant desk for sports markets.
Access is capped. Too many users acting on identical signals erode the edge for everyone, so we limit membership to protect prediction value.