This deep transformer architecture reads strings of Portable Game Notation (PGN), a standard plain-text representation of chess games designed to be human-readable, and outputs strong predicted moves alongside English commentary on what the model is trying to achieve.
Dataset Used: 3.5 Million Chess Games dataset and the KingBase Chess Dataset
To access the cleaned chess dataset, download it from here.
The pre_processing.py script extracts the sequence of moves played in each game and strips all unnecessary characters and words from the game records in the 3.5-million-chess-games dataset.
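A minimal sketch of such a cleaning step is shown below. The function name and the exact set of regular expressions are illustrative assumptions, not the repository's actual code; pre_processing.py may strip a different set of characters.

```python
import re

def clean_pgn_moves(game_text):
    """Strip annotations, move numbers, and result markers from a PGN
    move section, leaving only the move sequence (illustrative only)."""
    text = re.sub(r"\{[^}]*\}", " ", game_text)        # drop comments {...}
    text = re.sub(r"\$\d+", " ", text)                 # drop NAGs like $1
    text = re.sub(r"\d+\.(\.\.)?", " ", text)          # drop move numbers "1." / "1..."
    text = re.sub(r"(1-0|0-1|1/2-1/2|\*)", " ", text)  # drop result markers
    return " ".join(text.split())                      # collapse whitespace

print(clean_pgn_moves("1. e4 {Best by test} e5 2. Nf3 Nc6 1-0"))
# → e4 e5 Nf3 Nc6
```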
This is what the output of pre_processing on the 3.5 Million Chess Games dataset looks like:
Note
Each game occupies a single line.
This is what the output of pre_processing on the KingBase dataset looks like:
Note
A single game spans multiple lines.
Causal self-attention ensures that the output at a given position in a sequence depends only on the known outputs at previous positions, not on future positions. In simpler terms, the prediction for each next word depends only on the preceding words. To achieve this in GPT-like LLMs, for each token processed we mask out the future tokens, i.e. those that come after the current token in the input text.
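As a rough sketch, this masking is commonly implemented in PyTorch by filling the upper-triangular part of the attention-score matrix with negative infinity before the softmax, so future positions receive zero weight. The shapes and variable names below are illustrative, not taken from the repository:

```python
import torch

seq_len = 4
scores = torch.zeros(seq_len, seq_len)  # stand-in for raw attention scores
# True above the diagonal = future positions, which must be hidden
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
weights = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1)
# Row i now puts zero weight on every token j > i:
# token 0 attends only to itself; token 3 attends to all four tokens.
print(weights)
```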
To understand about the model architecture and how to build GPT from scratch, check out my other repository here.
Below is a picture explaining how the GPT model is used for text prediction; it also illustrates the Block class defined in our code.
These classes inherit from the torch.utils.data.Dataset class, making them compatible with PyTorch's DataLoader.
Each class represents a different dataset for specific stages of finetuning or pretraining.
The __init__ method initializes the dataset with relevant parameters and processes the input data.
The __len__ method returns the length of the dataset.
The __getitem__ method retrieves an item from the dataset at a given index.
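A minimal sketch of this interface, assuming a character-level encoding (the class and attribute names here are hypothetical, not the repository's actual definitions):

```python
import torch
from torch.utils.data import Dataset

class CharGameDataset(Dataset):  # hypothetical name for illustration
    def __init__(self, text, block_size):
        # Build a character-to-index vocabulary and encode the text
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.block_size = block_size
        self.data = [self.stoi[ch] for ch in text]

    def __len__(self):
        # Number of starting positions that leave room for a full block
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        chunk = self.data[idx: idx + self.block_size + 1]
        x = torch.tensor(chunk[:-1], dtype=torch.long)  # input tokens
        y = torch.tensor(chunk[1:], dtype=torch.long)   # targets, shifted by one
        return x, y

ds = CharGameDataset("e4 e5 Nf3 Nc6", block_size=4)
x, y = ds[0]
```

Because the class implements `__len__` and `__getitem__`, it can be passed directly to `torch.utils.data.DataLoader` for batching and shuffling.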
This class acts as a directory or factory for creating different datasets based on the version specified.
The __init__ method stores information about the dataset, version, configuration arguments, and pretraining vocabulary.
The __call__ method dynamically selects the appropriate dataset class based on the specified version and returns an instance of that class.
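The dispatch logic might look like the following sketch. The version keys, class names, and constructor signatures are assumptions for illustration, not the repository's actual code:

```python
class PretrainDataset:  # stand-in for the real class
    def __init__(self, data, config_args, vocab):
        self.data = data

class FinetuneDataset:  # stand-in for the real class
    def __init__(self, data, config_args, vocab):
        self.data = data

class DatasetDirectory:
    """Hypothetical factory mirroring the description above."""
    def __init__(self, data, version, config_args, pretrain_vocab):
        self.data = data
        self.version = version
        self.config_args = config_args
        self.pretrain_vocab = pretrain_vocab

    def __call__(self):
        # Select the dataset class registered for the requested version
        registry = {0: PretrainDataset, 1: FinetuneDataset}
        dataset_cls = registry[self.version]
        return dataset_cls(self.data, self.config_args, self.pretrain_vocab)

ds = DatasetDirectory("e4 e5", version=0, config_args={}, pretrain_vocab=None)()
```

Keeping the version-to-class mapping in one place means new finetuning stages can be added by registering a class rather than editing call sites.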
This class represents the dataset used for the initial pretraining stage.
It prepares the data by encoding it into tensors of integers based on a character-to-index mapping.
The __getitem__ method returns input-output pairs, where each input and output is a subsequence of the original text shifted by one position.
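Concretely, the one-position shift means each target token is simply the next input token. A small illustration with a made-up character vocabulary (the move string and mapping are examples only):

```python
# Hypothetical character-to-index mapping over a tiny move string
text = "e4 e5"
stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}
encoded = [stoi[ch] for ch in text]

x = encoded[:-1]  # input: all but the last character
y = encoded[1:]   # target: all but the first character (shifted by one)
# For every position i, y[i] is the token the model must predict after x[i].
```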
This dictionary maps version numbers to the corresponding finetuning dataset classes.
In the main section, a sample of game data is loaded from a file.
An instance of the PretrainDataset class is created with the sample data.
The __main__ block demonstrates how to use the code by creating and printing an instance of the PretrainDataset class.