Playing chess with GPT: a guardrailed example

Language models are fundamentally stochastic next-token-predictors: autoregressive models that generate text based on existing text one token at a time. Unsurprisingly, they are not great chess players.

And yet... they are not terrible chess players either. Despite sensationalist media reports to the contrary, an advanced model like GPT-5 can play chess reasonably well, at the level of an unsophisticated amateur: it knows the rules and can even offer a coherent narrative, but it will make foolish mistakes, even sacrificing its queen or walking into an unavoidable checkmate.

So why the terrible reputation? Why the sensationalist coverage? The reason is simple: the ability to play chess is not the same as the ability to reconstruct the state of a chessboard from a long list of prior moves. Yet when we naively "play chess" with a language model, offering moves and expecting moves in return, at every conversational turn we expect the model to reconstruct the entire state of the board from scratch, from an ever-longer conversation containing an ever-increasing number of moves and countermoves. It should come as no surprise that the model ultimately fails miserably.

Still, it is quite possible to get a language model to play well, but doing so requires external scaffolding: software that keeps track of the chessboard state, prompts the model for the next move, rejects invalid moves, and updates the board.

Building a scaffolded environment for GPT chess

How do we keep track of a chessboard? Fortunately, there is no need to reinvent the wheel here. There exists a notation, FEN (Forsyth-Edwards Notation), that is designed to capture not just the present state of the chessboard, but also any relevant information about the state of the game, including castling rights and en passant availability.
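As a concrete illustration, a FEN string consists of six space-separated fields. A minimal sketch, shown here on the well-known starting position and using only Python's standard library:

```python
# FEN for the standard starting position (a widely documented constant).
START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"

def parse_fen(fen: str) -> dict:
    """Split a FEN string into its six named components."""
    placement, side, castling, en_passant, halfmove, fullmove = fen.split()
    return {
        "placement": placement,            # piece placement, rank 8 down to rank 1
        "side_to_move": side,              # "w" or "b"
        "castling": castling,              # e.g. "KQkq", or "-" if none available
        "en_passant": en_passant,          # target square such as "e3", or "-"
        "halfmove_clock": int(halfmove),   # half-moves since last capture/pawn move
        "fullmove_number": int(fullmove),  # increments after Black's move
    }
```

In practice a chess library would maintain and update this string; the point here is only that a single short string suffices to restore the full game state at every turn.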

The basic idea behind our GPT chess implementation is to keep track of the board (of the game, really) using a continuously updated FEN string. At every turn, the language model is provided with this string, a list of the most recent moves, and nothing else. The system prompt simply asks the model to reason and offer the best move it can.

The model is then requested to respond using a specific format, which may include verbal reasoning but must conclude with a valid move. The model's move is checked by a strict validity checker. If the move is invalid, the model is re-prompted, with any invalid moves tried so far listed as such. If the model fails to respond with a valid move after three such attempts, control is returned to the user.
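The re-prompting loop described above can be sketched as follows. This is only an illustration, not the actual implementation: `query_model` and `is_valid_move` are hypothetical stand-ins for the model API call and the strict validity checker (in practice, a chess-rules library), and the prompt wording is invented for the example.

```python
MAX_ATTEMPTS = 3  # give up after three invalid moves

def build_prompt(fen, recent_moves, rejected):
    """Assemble the per-turn prompt: board state, recent moves, rejections."""
    lines = [f"Position (FEN): {fen}",
             f"Recent moves: {' '.join(recent_moves) or '(game start)'}"]
    if rejected:
        lines.append("These moves were invalid, do not repeat them: "
                     + ", ".join(rejected))
    lines.append("Reason briefly, then give your move on the final line.")
    return "\n".join(lines)

def get_model_move(fen, recent_moves, query_model, is_valid_move):
    """Ask the model for a move; re-prompt on invalid ones, up to a limit.

    Returns the validated move, or None to hand control back to the user.
    """
    rejected = []
    for _ in range(MAX_ATTEMPTS):
        reply = query_model(build_prompt(fen, recent_moves, rejected))
        move = reply.strip().splitlines()[-1]  # reasoning first, move last
        if is_valid_move(fen, move):
            return move
        rejected.append(move)  # listed as invalid on the next attempt
    return None  # give up: return control to the user
```

Note that the loop never mutates the board itself: the scaffold only updates the FEN string after a move has passed validation, which is what keeps the conversation inside the guardrails.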

Putting aside the specifics of chess, what this implementation demonstrates is how proper prompting and properly scaffolded processing of the model's response can ensure that the conversation stays strictly within the intended guardrails. This is a prerequisite in any application where the model is used in the role of an "agent", entrusted with specific tasks.

Incidentally, this chess application also reveals something about the strengths and limitations of large language models in "reasoning" tasks. The typical chess game begins with "openings" that are well documented in the chess literature, literature with which the model is familiar through its training. Therefore, the model confidently responds with moves that are well known and widely studied by chess masters. Later in the game, however, in the middlegame, the language model falters. It still offers erudite analysis, but its moves are often mediocre or worse. This is true even for the latest, "frontier class" reasoning models. The reason for this has been best summarized by GPT itself: "Language models have rhetorical competence but no tactical competence." In other words, language models are great at discovering even distant associations between elements of their input or between their input and their training corpus. However, they lack the ability to explore and analyze a combinatorially expanding series of potential decisions, model the likely outcomes, prune the "decision tree" and find optimal solutions. They can talk about the world; they cannot envision, simulate, or model the world internally.
