Complete Technical Guide: From Data Collection to a Working Model
This article provides a step-by-step walkthrough of the architecture and implementation of an analytical system for predicting football match outcomes. The system uses Claude API from Anthropic as its "brain" - for data interpretation, feature engineering, and generating final predictions. The key innovation is combining three probability layers: bookmaker odds (Bet365), Polymarket prediction market data (blockchain-based crowd intelligence), and a custom ML model. The entire pipeline is written in Python using pandas, scikit-learn, XGBoost, and matplotlib.
System Architecture
*The system consists of several layers, each serving a specific role:*
Required Dependencies
Installation:
*The Polymarket Gamma API does not require a dedicated SDK - all requests are made via `requests` to public REST endpoints without authentication.*
Data Collection and Preparation
*The primary data source is football-data.co.uk, which provides CSV files with match results and statistics for all major European leagues. The data includes goals, shots, corners, fouls, cards, and bookmaker odds.*
*Data Loading*
*Cleaning and Transformation*
Feature Engineering with Claude
This is the key stage where we create features that enable the model to "understand" the match context. Here, Claude serves as an intelligent assistant - helping generate feature ideas and evaluate contextual factors.
*Statistical Features (Rolling Averages)*
*Claude for Contextual Feature Generation*
This is where things get interesting: we use Claude to analyze context that is unavailable in numerical data.
*Adding Bookmaker Odds as Features*
Bookmaker odds are one of the strongest predictors because they already contain aggregated market expertise.
*Advanced Feature Engineering: ELO, xG Proxy, and Fatigue*
Rolling averages over 5 matches are just the starting point. The literature shows that pi-ratings, ELO ratings, and xG significantly improve accuracy. Razali et al. (2022) demonstrated this on 216k matches: CatBoost + pi-ratings = 55.82% accuracy, the best Soccer Prediction Challenge result.
*ELO Ratings with Margin of Victory*
ELO is a ranking system adopted by FIFA since 2018. Its key property: it accounts for opponent strength, not just W/D/L.
*xG Proxy from Basic Statistics*
True xG requires StatsBomb/Opta data (paid access). But we can build an xG proxy - an approximation of expected goals from available statistics:
*Fatigue Factor and Fixture Congestion*
Draper et al. (2024) showed that fatigue affects results. A simple proxy: the number of rest days between matches.
*xG Proxy from Basic Statistics*
True xG requires StatsBomb/Opta data (paid access). But we can build an xG proxy - an approximation of expected goals from available statistics:
*Fatigue Factor and Fixture Congestion*
Draper et al. (2024) showed that fatigue affects results. A simple proxy: the number of rest days between matches.
*Head-to-Head History*
Polymarket Integration: Prediction Market as a Signal Source
*Why Polymarket Is Not Just Another Bookmaker*
Polymarket is a decentralized prediction market on the Polygon blockchain, where contract prices are formed by real money from traders (USDC). Key differences from bookmaker odds:
When Polymarket and the bookmaker diverge in their estimates - that's a potential edge. The divergence indicates that one source knows something the other doesn't (injuries, inside information, recent form).
*Connecting to the Polymarket Gamma API*
The Gamma API is fully open - no API key or authentication required. This allows free access to probabilities for any market.
*Fetching Historical Prices (for Backtesting)*
Training the model requires historical Polymarket probabilities - not just current prices.
*Combining Three Probability Layers*
This is the core of the system - merging three independent probability sources into a unified feature set.
*Visualizing Divergences: Bookmaker vs Polymarket*
*Claude Analyzes Divergences*
Building the ML Model
*Preparing Data for Training*
*Training Multiple Models*
*Ensemble: Combining Models*
Claude API Integration for Interpretation
One of Claude's key strengths is the ability to transform dry numbers into clear analytical conclusions.
Generating Detailed Predictions
*Batch Matchday Analysis*
Visualizing Results
*Model Comparison*
*Confusion Matrix*
*Feature Importance*
*Predicted Probability Distributions*
Backtesting and Model Evaluation
*Walk-Forward Backtest*
This is the only correct way to test a predictive model on sports data - simulating real-time trading over time.
*Probability Calibration*
Advanced Architecture: Hybrid System
*Hybrid: ML + Claude + Polymarket*
The most powerful architecture is a triple hybrid: the ML model provides quantitative probabilities, Polymarket delivers crowd intelligence, and Claude synthesizes everything into a final conclusion accounting for divergences.





