Building a Football Match Prediction System with Claude AI cover

Building a Football Match Prediction System with Claude AI

zostaff avatar

zostaff · @zostaff · Apr 14

View original post

Complete Technical Guide: From Data Collection to a Working Model

This article provides a step-by-step walkthrough of the architecture and implementation of an analytical system for predicting football match outcomes. The system uses Claude API from Anthropic as its "brain" - for data interpretation, feature engineering, and generating final predictions. The key innovation is combining three probability layers: bookmaker odds (Bet365), Polymarket prediction market data (blockchain-based crowd intelligence), and a custom ML model. The entire pipeline is written in Python using pandas, scikit-learn, XGBoost, and matplotlib.

System Architecture

*The system consists of several layers, each serving a specific role:*

Required Dependencies

Installation:

*The Polymarket Gamma API does not require a dedicated SDK - all requests are made via `requests` to public REST endpoints without authentication.*

Data Collection and Preparation

*The primary data source is football-data.co.uk, which provides CSV files with match results and statistics for all major European leagues. The data includes goals, shots, corners, fouls, cards, and bookmaker odds.*

*Data Loading*

*Cleaning and Transformation*

Feature Engineering with Claude

This is the key stage where we create features that enable the model to "understand" the match context. Here, Claude serves as an intelligent assistant - helping generate feature ideas and evaluate contextual factors.

*Statistical Features (Rolling Averages)*

*Claude for Contextual Feature Generation*

This is where things get interesting: we use Claude to analyze context that is unavailable in numerical data.

*Adding Bookmaker Odds as Features*

Bookmaker odds are one of the strongest predictors because they already contain aggregated market expertise.

*Advanced Feature Engineering: ELO, xG Proxy, and Fatigue*

Rolling averages over 5 matches are just the starting point. The literature shows that pi-ratings, ELO ratings, and xG significantly improve accuracy. Razali et al. (2022) demonstrated this on 216k matches: CatBoost + pi-ratings = 55.82% accuracy, the best Soccer Prediction Challenge result.

*ELO Ratings with Margin of Victory*

ELO is a ranking system adopted by FIFA since 2018. Its key property: it accounts for opponent strength, not just W/D/L.

*xG Proxy from Basic Statistics*

True xG requires StatsBomb/Opta data (paid access). But we can build an xG proxy - an approximation of expected goals from available statistics:

*Fatigue Factor and Fixture Congestion*

Draper et al. (2024) showed that fatigue affects results. A simple proxy: the number of rest days between matches.

*xG Proxy from Basic Statistics*

True xG requires StatsBomb/Opta data (paid access). But we can build an xG proxy - an approximation of expected goals from available statistics:

*Fatigue Factor and Fixture Congestion*

Draper et al. (2024) showed that fatigue affects results. A simple proxy: the number of rest days between matches.

*Head-to-Head History*

Polymarket Integration: Prediction Market as a Signal Source

*Why Polymarket Is Not Just Another Bookmaker*

Polymarket is a decentralized prediction market on the Polygon blockchain, where contract prices are formed by real money from traders (USDC). Key differences from bookmaker odds:

When Polymarket and the bookmaker diverge in their estimates - that's a potential edge. The divergence indicates that one source knows something the other doesn't (injuries, inside information, recent form).

*Connecting to the Polymarket Gamma API*

The Gamma API is fully open - no API key or authentication required. This allows free access to probabilities for any market.

*Fetching Historical Prices (for Backtesting)*

Training the model requires historical Polymarket probabilities - not just current prices.

*Combining Three Probability Layers*

This is the core of the system - merging three independent probability sources into a unified feature set.

*Visualizing Divergences: Bookmaker vs Polymarket*

*Claude Analyzes Divergences*

Building the ML Model

*Preparing Data for Training*

*Training Multiple Models*

*Ensemble: Combining Models*

Claude API Integration for Interpretation

One of Claude's key strengths is the ability to transform dry numbers into clear analytical conclusions.

Generating Detailed Predictions

*Batch Matchday Analysis*

Visualizing Results

*Model Comparison*

*Confusion Matrix*

*Feature Importance*

*Predicted Probability Distributions*

Backtesting and Model Evaluation

*Walk-Forward Backtest*

This is the only correct way to test a predictive model on sports data - simulating real-time trading over time.

*Probability Calibration*

Advanced Architecture: Hybrid System

*Hybrid: ML + Claude + Polymarket*

The most powerful architecture is a triple hybrid: the ML model provides quantitative probabilities, Polymarket delivers crowd intelligence, and Claude synthesizes everything into a final conclusion accounting for divergences.

Recent discoveries