Checkmate

Devin Mantz
3 min read · Apr 1, 2021

I was first introduced to chess when I was around 8 or 9 years old and have been playing it ever since. Chess is one of the most complex games out there, with so many possible games that the number is impossible to even calculate! With all of that being said, it can be quite difficult to predict the outcome of a chess game. But I figured I'd give it my best shot and see how I could do.

The Data

To do this I used a dataset found on kaggle.com containing data from 6.2 million chess games played on Lichess (here's the link: https://www.kaggle.com/arevel/chess-games). This dataset contained everything I would need to make a predictive model, but far more data than I needed, so I only used 35,000 games for my models. I used the Result column as my target since it contained the result of the game. Then I removed the WhiteRatingDiff and BlackRatingDiff columns to avoid data leakage, since those values were determined by the result. I also needed to change the AN column. That column contained all of the moves in the game, and in many cases who the winner was, so I kept only the first two moves from each player and discarded the rest; that way I could get some information from the column without data leakage. Lastly, I removed the players' names, as these should have no effect on the game. After all of this, I was ready to start making models.
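Here's a rough sketch of that cleanup step with pandas, using a tiny made-up frame in place of the real Kaggle export. The Result, AN, WhiteRatingDiff, and BlackRatingDiff column names come from the dataset described above; the player-name columns ("White"/"Black") and the move-trimming helper are my own assumptions about how this could look, not the exact code used here.

```python
import pandas as pd

# Tiny synthetic stand-in for the 6.2M-game Kaggle export.
df = pd.DataFrame({
    "White": ["alice", "bob"],          # assumed player-name columns
    "Black": ["carol", "dave"],
    "Result": ["1-0", "0-1"],           # target column
    "WhiteRatingDiff": [8, -7],         # leakage: set after the result
    "BlackRatingDiff": [-8, 7],
    "AN": ["1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 1-0",
           "1. d4 d5 2. c4 e6 3. Nc3 Nf6 0-1"],
})

def first_two_moves(an: str) -> str:
    """Keep only the first two full moves (four plies) of the notation."""
    tokens = [t for t in an.split() if not t.endswith(".")]  # drop "1.", "2.", ...
    return " ".join(tokens[:4])

df["AN"] = df["AN"].apply(first_two_moves)

# Drop the leakage columns and the player names.
df = df.drop(columns=["WhiteRatingDiff", "BlackRatingDiff", "White", "Black"])
print(df)
```

The key point is that everything determined *after* the result (rating diffs, the final moves and result string inside AN) is stripped out before modeling.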

Making Models

I decided to give two different models a try to see which would perform better. The two I decided on were a LogisticRegression model and a RandomForestClassifier model. Before I could get to building these, I first needed to find a baseline accuracy to compare my models against. Using the data, I found that 49.30% of games ended with white winning, 47.14% ended with black winning, and 3.56% ended in draws. So I used 49.30% as my baseline.
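That baseline is just the majority-class rate: always predict "white wins" and you're right as often as white wins. A minimal sketch, with a synthetic result series built to match the percentages above (the "1-0"/"0-1"/"1/2-1/2" strings follow Lichess PGN convention and are an assumption about the Result column's encoding):

```python
import pandas as pd

# Synthetic results matching the class balance reported above:
# 49.30% white wins, 47.14% black wins, 3.56% draws.
results = pd.Series(["1-0"] * 4930 + ["0-1"] * 4714 + ["1/2-1/2"] * 356)

# Majority-class baseline: frequency of the most common outcome.
baseline = results.value_counts(normalize=True).max()
print(f"baseline accuracy: {baseline:.2%}")  # → 49.30%
```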

The first model I made was the LogisticRegression model. I did a grid search to find the best hyperparameters for solver and max_iter and found that the best option for solver was saga and the best value for max_iter was 125. Using these hyperparameters, I got a training accuracy of 63.5% and a validation accuracy of 62.6%.
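A grid search like that could be sketched with scikit-learn's GridSearchCV. The random features below stand in for the real encoded chess data, and the exact parameter grid is an assumption; only the solver and max_iter names, the winning values (saga, 125), and the model class come from the write-up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Random stand-in features; the real pipeline would encode the opening
# moves and ratings into numeric columns first.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 3, size=300)  # three classes: white win, black win, draw

grid = GridSearchCV(
    LogisticRegression(),
    param_grid={"solver": ["lbfgs", "saga"],   # assumed candidate grid
                "max_iter": [100, 125, 150]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)  # the post reports saga / 125 on the real data
```

On this random data the winning combination is meaningless; the pattern is what matters.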

Next, I made the RandomForestClassifier model. As with the LogisticRegression model, I used a grid search to find the best values for max_depth and n_estimators and found that the best value for max_depth was 9 and the best value for n_estimators was 200. Using these hyperparameters, I got a training accuracy of 69.35% and a validation accuracy of 61.86%.
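The random-forest search follows the same shape, this time with a train/validation split so we can compare the two accuracies the way the post does. Again the data is a random stand-in and the candidate grid is assumed; only max_depth, n_estimators, and the reported best values (9, 200) come from the write-up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 3, size=300)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [5, 9, 13],       # assumed candidate grid
                "n_estimators": [100, 200]},
    cv=3,
)
grid.fit(X_train, y_train)

train_acc = accuracy_score(y_train, grid.predict(X_train))
val_acc = accuracy_score(y_val, grid.predict(X_val))
print(f"train: {train_acc:.2%}  val: {val_acc:.2%}")
```

The gap between training and validation accuracy (69.35% vs. 61.86% on the real data) is the usual sign of a forest overfitting more than the linear model did.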

Since I still had over 5 million games my model had not seen, I figured it wouldn't hurt to test my LogisticRegression model a little more. Using 20,000 more games to test, I found that the model was 62.59% accurate, nearly the same as the validation accuracy.

Conclusion

62% accuracy may not sound fantastic, but this model did beat the baseline accuracy by 13 percentage points. So yes, 62% may not be perfect, but it's better than nothing. Chess, like almost any other game or sport out there, is very hard to predict. Even people who have studied chess for decades struggle to predict the outcome of games, so I think this model did a pretty good job. Plus, if we were ever able to fully predict the outcome of a game, it wouldn't be fun to play or watch anymore, and that goes for more than just chess.
