P300 TELEBASEBALL
ABOUT TELEBASEBALL
DATA INFRASTRUCTURE
HISTORICAL DATA
Retrosheet data spanning 1960 to the present: more than 60 years of play-by-play, box score, and game-state records.
FEATURES
Advanced sabermetrics (wRC+, wRAA, BABIP, FIP, xFIP), rolling averages (7/14/30/60 game windows), park factors, weather normalization, rest days, pitcher-batter matchups.
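As an illustration of two of the metrics listed above, here is a minimal sketch using the standard public definitions of BABIP and FIP. The function names and the default FIP constant of 3.10 are assumptions made for the example; the constant is recomputed per league and season in practice, and nothing here is taken from the site's own code.

```python
def babip(h, hr, ab, k, sf):
    """Batting average on balls in play: (H - HR) / (AB - K - HR + SF)."""
    denom = ab - k - hr + sf
    return (h - hr) / denom if denom > 0 else float("nan")


def fip(hr, bb, hbp, k, ip, fip_constant=3.10):
    """Fielding Independent Pitching.

    (13*HR + 3*(BB + HBP) - 2*K) / IP + constant.
    fip_constant is an assumed placeholder; it varies by league and season.
    """
    if ip <= 0:
        return float("nan")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + fip_constant


# Example season lines (made-up numbers, for illustration only).
hitter_babip = babip(h=150, hr=20, ab=550, k=100, sf=5)
pitcher_fip = fip(hr=20, bb=50, hbp=5, k=200, ip=180.0)
```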
INGESTION
Daily team and player performance metrics, automated validation, and temporal feature engineering with a one-game lag to prevent lookahead bias.
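The one-game lag can be sketched as follows. This is a plain-Python illustration, not the production code: the function name lagged_rolling_mean and the sample data are hypothetical. The key point is that the feature for game i is built only from games strictly before i.

```python
def lagged_rolling_mean(values, window):
    """For each game i, average up to `window` games strictly before i.

    Game i's own value is never included, so the feature is known
    before the game is played. The first game has no history (None).
    """
    out = []
    for i in range(len(values)):
        prior = values[max(0, i - window):i]  # slice stops before index i
        out.append(sum(prior) / len(prior) if prior else None)
    return out


# Runs scored by one team over five consecutive games (made-up data).
runs = [4, 2, 7, 3, 5]
features = lagged_rolling_mean(runs, window=3)
```

Here the feature attached to the fifth game averages games two through four only; the five runs scored in the fifth game itself never leak into its own input.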
MODEL ARCHITECTURE
MONEYLINE MODEL
LightGBM gradient boosting. Bayesian hyperparameter optimization. Temporal cross-validation. Outputs calibrated win probabilities. Primary drivers: pitcher quality, recent form, situational context.
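One common way to implement temporal cross-validation is an expanding-window split, sketched below in plain Python. The function name, the fold sizing, and the min_train parameter are assumptions for illustration; the actual fold layout used for the moneyline model is not described here.

```python
def expanding_window_splits(n_samples, n_splits, min_train):
    """Yield (train_indices, test_indices) pairs in chronological order.

    Rows are assumed to be sorted by game date. Every training index
    precedes every test index, so no fold trains on the future.
    """
    fold_size = (n_samples - min_train) // n_splits
    if fold_size < 1:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        train_end = min_train + k * fold_size
        # The last fold absorbs any remainder rows.
        test_end = train_end + fold_size if k < n_splits - 1 else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))


splits = list(expanding_window_splits(n_samples=10, n_splits=3, min_train=4))
```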
TOTALS MODEL
NGBoost probabilistic regression. Generates full probability distributions over run totals. Uncertainty quantification is critical for over/under pricing. Features: park factors, weather, umpire tendencies, pitcher quality, offensive form.
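To show why the full distribution matters, the sketch below converts a predicted distribution into an over probability. It assumes the predictive distribution is Normal with mean mu and standard deviation sigma; that is an assumption for the example, not a statement about which distribution the totals model uses, and run totals are in fact discrete. The function name prob_over is hypothetical.

```python
import math


def prob_over(line, mu, sigma):
    """P(total > line) under an assumed Normal(mu, sigma) prediction."""
    z = (line - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard Normal CDF
    return 1.0 - cdf


# Same predicted mean, different uncertainty (made-up numbers).
confident = prob_over(line=8.5, mu=9.5, sigma=3.0)
uncertain = prob_over(line=8.5, mu=9.5, sigma=6.0)
```

With the same predicted mean, the wider distribution gives an over probability closer to 50%, so a point estimate alone would overstate the edge.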
EDGE IDENTIFICATION
MARKET ANALYSIS
Odds aggregated from multiple sportsbooks. Consensus closing lines constructed. Model-implied probability compared against market-implied probability.
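The comparison can be sketched as follows for American moneyline odds. The proportional vig removal and the simple average across books are assumptions chosen for illustration; other de-vig and consensus methods exist, and all function names and quotes here are hypothetical.

```python
def implied_prob(american_odds):
    """Raw implied probability from American odds (still includes the vig)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100.0)
    return 100.0 / (american_odds + 100.0)


def no_vig_probs(home_odds, away_odds):
    """Rescale both sides proportionally so they sum to 1."""
    h, a = implied_prob(home_odds), implied_prob(away_odds)
    total = h + a
    return h / total, a / total


def consensus_home_prob(book_lines):
    """Average the no-vig home probability over (home, away) quotes."""
    probs = [no_vig_probs(h, a)[0] for h, a in book_lines]
    return sum(probs) / len(probs)


# Made-up quotes from three books for one game.
lines = [(-150, 130), (-145, 125), (-155, 135)]
market_prob = consensus_home_prob(lines)
model_prob = 0.62               # hypothetical model output
edge = model_prob - market_prob  # positive means the model likes the home side
```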
BACKTESTING
Historical validation against 2010-present data. Walk-forward analysis for threshold optimization. Edge requirements calibrated to Sharpe ratios and maximum drawdowns.
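The two risk measures named above can be computed from a sequence of per-bet returns, as in this minimal sketch. The definitions are the generic ones (an unannualized Sharpe ratio with a zero risk-free rate, and drawdown on a compounded bankroll); the exact conventions used in the backtest are not specified here, so treat these as assumptions.

```python
import statistics


def sharpe_ratio(returns):
    """Mean return divided by its sample standard deviation.

    Not annualized; the risk-free rate is assumed to be zero.
    """
    if len(returns) < 2:
        return float("nan")
    sd = statistics.stdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else float("nan")


def max_drawdown(returns, start=1.0):
    """Largest peak-to-trough fractional decline of a compounded bankroll."""
    bankroll = peak = start
    worst = 0.0
    for r in returns:
        bankroll *= 1.0 + r
        peak = max(peak, bankroll)
        worst = max(worst, (peak - bankroll) / peak)
    return worst


# Fractional bankroll returns per betting day (made-up numbers).
daily = [0.1, -0.1, 0.3]
sr = sharpe_ratio(daily)
mdd = max_drawdown([0.1, -0.5, 0.2])
```

In a walk-forward setup, a candidate edge threshold would be chosen on one period and these measures evaluated on the following, unseen period.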
BET SIZING
Kelly Criterion based on estimated edge magnitude. Bankroll management constraints enforced. Minimum edge thresholds prevent over-betting.
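A minimal sketch of Kelly sizing with the kinds of constraints described above. The Kelly formula itself is standard: f = (b*p - q) / b, where b is the net decimal odds, p the win probability, and q = 1 - p. The specific values for min_edge, kelly_multiplier, and max_fraction are hypothetical placeholders, not the site's settings.

```python
def kelly_fraction(p, decimal_odds):
    """Full-Kelly fraction of bankroll: f = (b*p - q) / b, with b = odds - 1."""
    b = decimal_odds - 1.0
    return (b * p - (1.0 - p)) / b


def stake_fraction(p, decimal_odds, market_prob,
                   min_edge=0.02, kelly_multiplier=0.5, max_fraction=0.05):
    """Kelly stake with assumed guardrails (all defaults are placeholders).

    - no bet unless model edge over the market reaches min_edge
    - fractional Kelly via kelly_multiplier
    - stake capped at max_fraction of bankroll, never negative
    """
    if p - market_prob < min_edge:
        return 0.0
    f = kelly_fraction(p, decimal_odds) * kelly_multiplier
    return min(max(f, 0.0), max_fraction)


full = kelly_fraction(0.55, 2.0)         # even-money bet, 55% win chance
sized = stake_fraction(0.55, 2.0, 0.50)  # passes the edge threshold
skipped = stake_fraction(0.51, 2.0, 0.50)  # edge too small, no bet
```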
PRODUCTION PIPELINE
ETL PROCESS
Automated data ingestion, validation, and storage. Previous day's performance metrics processed. Schedule parsing extracts lineups and starting pitchers.
FEATURE ENGINEERING
Rolling averages recalculated with temporal shifts. Fresh odds data scraped and normalized. Feature vectors constructed for model inference.
PUBLICATION
Predictions published ~60 minutes before first pitch. Full audit trail maintained for reproducibility. Performance attribution tracked end-to-end.
QUALITY ASSURANCE
• Temporal data integrity: rolling averages shifted back one game
• Cross-validation uses time-series splits, not random splits
• Performance monitoring via out-of-sample metrics (log loss, Brier score, AUC-ROC)
• Drift detection alerts trigger model retraining when performance degrades
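Two of the monitoring metrics above, plus a simple drift check, can be sketched as follows. Log loss and the Brier score use their standard definitions; the drift_alert rule and its tolerance are hypothetical, since the actual alerting criteria are not described here.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def drift_alert(recent_loss, baseline_loss, tolerance=0.02):
    """Hypothetical rule: flag retraining when recent loss exceeds baseline."""
    return recent_loss > baseline_loss + tolerance


# Out-of-sample predictions and results (made-up numbers).
preds = [0.5, 0.5]
results = [1, 0]
bs = brier_score(preds, results)
ll = log_loss(preds, results)
```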
The information used to train the models was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "www.retrosheet.org".