Student of Games: A unified learning algorithm for both perfect and imperfect information games

Games have a long history as benchmarks for progress in artificial intelligence. Approaches using search and learning produced strong performance across many perfect information games, and approaches using game-theoretic reasoning and learning demonstrated strong performance for specific imperfect information poker variants. We introduce Student of Games, a general-purpose algorithm that unifies previous approaches, combining guided search, self-play learning, and game-theoretic reasoning. Student of Games achieves strong empirical performance in large perfect and imperfect information games—an important step toward truly general algorithms for arbitrary environments. We prove that Student of Games is sound, converging to perfect play as available computation and approximation capacity increases. Student of Games reaches strong performance in chess and Go, beats the strongest openly available agent in heads-up no-limit Texas hold’em poker, and defeats the state-of-the-art agent in Scotland Yard, an imperfect information game that illustrates the value of guided search, learning, and game-theoretic reasoning.


Student of Games Algorithm Details
Network Architecture and Optimization
Table S1 lists the neural network architectures and input features used for each game. For chess and Go we use exactly the same architecture and inputs as AlphaZero (6). In poker and Scotland Yard we process concatenated belief and public state features with an MLP with ReLU activations.
The counterfactual value head is optimized with the Huber loss (77), while the policy for each information state i is optimized with a KL-divergence loss against its training target; each head is weighted by the corresponding weight w_v and w_p. During training we smoothly decay the learning rate by a factor of d every T_decay steps, so the learning rate α at training step t is α_t = α_0 · d^(t / T_decay). When using the policy head's prediction as the prior in the PUCT formula, the logits are passed through a softmax with temperature T_prior. This can decrease the weight of the prior in some games and encourage more exploration in the search phase.
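As a concrete illustration of the two mechanisms above, the sketch below implements exponential learning-rate decay and a temperature-scaled softmax over policy logits. This is a minimal sketch, not the training code: the names alpha_init, decay_factor, t_decay and t_prior are stand-ins for the hyperparameters α, d, T_decay and T_prior in the text.

import numpy as np

def learning_rate(step, alpha_init, decay_factor, t_decay):
    # Smoothly decay the initial rate by `decay_factor` every `t_decay` steps.
    return alpha_init * decay_factor ** (step / t_decay)

def prior_from_logits(logits, t_prior):
    # Temperature-scaled softmax turning policy-head logits into a PUCT prior;
    # a temperature above 1 flattens the prior and encourages exploration.
    scaled = np.asarray(logits, dtype=np.float64) / t_prior
    scaled -= scaled.max()  # for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

print(learning_rate(step=10_000, alpha_init=3e-4, decay_factor=0.5, t_decay=100_000))
print(prior_from_logits([2.0, 1.0, 0.1], t_prior=2.0))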

Pseudocode
Here we provide pseudocode for the most important parts of the SOG algorithm. Algorithm 1 specifies GT-CFR, the core of SOG's sound game-theoretic search introduced in Search via Growing-Tree CFR, which scales to large games. Algorithm 2 shows how GT-CFR is used during the self-play that generates training examples for the neural network, as covered in Training Process. Hyperparameters used in self-play are given in Table S2.
When SOG plays against an opponent, the search tree is also rebuilt for the opponent's actions (as discussed in Modified Continual Re-solving). This way, SOG reasons about the opponent's behavior, since it directly influences the belief distribution at the current state where SOG is to act.
Note that, unlike AlphaZero, SOG currently starts its search procedure from scratch. That is, the previous computation only provides invariants for the next re-solving step. AlphaZero instead warm-starts the MCTS process by initializing values and visit counts from the previous search. For SOG, this would also require warm-starting CFR. While this is possible (78), there is no warm-starting in the current implementation of SOG.
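For readers who prefer code, the following minimal sketch shows the overall shape of Algorithm 1: GT-CFR interleaves regret-update passes over the current lookahead tree with expansion simulations that grow it, querying the counterfactual value function at frontier nodes. The helpers run_cfr_update, expand_tree and value_function are hypothetical stand-ins, and the bookkeeping of ranges, average policies and logged queries is heavily simplified.

def gt_cfr(tree, beliefs, s, c, run_cfr_update, expand_tree, value_function):
    # Sketch only: alternate CFR regret updates with tree expansion, with roughly
    # c expansion simulations per CFR update and s expansion simulations in total.
    nn_queries = []  # value-function queries are logged for later training
    for _ in range(max(1, s // c)):
        run_cfr_update(tree, beliefs, value_function, nn_queries)
        for _ in range(c):
            expand_tree(tree, beliefs, value_function, nn_queries)
    return tree.average_values(), tree.average_policy(), nn_queries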

Implementation
SOG is implemented as a distributed system with decoupled actor and trainer jobs. Each actor runs several games in parallel, and neural network evaluations are batched for better accelerator utilization. The networks were implemented in TensorFlow.
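A minimal sketch of the batching idea, assuming a hypothetical network_fn that maps a stacked feature batch to (values, policies); the actual distributed actor/trainer code is not shown here.

import numpy as np

def batched_evaluate(pending_states, network_fn, batch_size=256):
    # Collect feature vectors queued by many parallel games and evaluate them in
    # fixed-size batches so the accelerator stays busy.
    results = []
    for start in range(0, len(pending_states), batch_size):
        batch = np.stack(pending_states[start:start + batch_size])
        values, policies = network_fn(batch)
        results.extend(zip(values, policies))
    return results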

Poker Betting Abstraction
There are up to 20,000 possible actions in no-limit Texas hold'em. To make the problem easier, AI agents are typically allowed to use only a small subset of these (8, 37-39). This process of selecting a set of allowed actions for a given poker state is called betting abstraction. Even with a betting abstraction, agents are able to maintain strong performance in the full game (8, 37, 38). Moreover, local best response evaluation (75) suggests that there is no easy exploit of such a simplification as long as the agent is able to see the full opponent actions (8).
Algorithm 1: Growing-Tree CFR. Note that GT-CFR logs all neural-net queries it makes, since they may be used later in training.

procedure GT-CFR(L_0, β, s, c)
    ▷ L_0 — an initial tree built as described in Modified Continual Re-solving.
    ▷ β — a public belief state under which the new nodes will be added.
    ▷ s, c — total number of expansion simulations and number of simulations per CFR update.
    ...
    ▷ Store the average policy and counterfactual values in the tree.
    ▷ Return counterfactual values and the average policy from CFR, and all NN calls.
    return v, p, nn_queries
end procedure

procedure GROW(L, π)
    ▷ Create recursive queries.
    queries ← pick on average q recursive neural-net queries from nn_queries
    queries_to_solve.extend(queries)
end procedure

We use a betting abstraction in Student of Games to speed up training and simplify the learning task. Our agent's action set was limited to just three actions: fold (give up), check/call (match the current wager), and bet/raise (add chips to the pot). To improve generalization we used a stochastic bet size, similarly to ReBeL (37). The single allowed bet/raise size is drawn uniformly at random at the start of each poker hand from the interval [0.5, 1.0] × pot size.
This amount is anecdotally similar to sizes used by human players and performed well in our experiments. The same random selection was used in both training and evaluation.
As in (37), we also randomly varied the stack size (the number of chips available to the players) at the start of each round during training. This number stays fixed during evaluation.
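The randomisation described above is simple to implement; the sketch below shows one way to do it. The stack-size range is an illustrative assumption, not the value used in training.

import random

def sample_hand_configuration(min_stack=100, max_stack=200, training=True):
    # One bet/raise size per hand, drawn uniformly from [0.5, 1.0] * pot;
    # the starting stack is randomised during training and fixed at evaluation.
    bet_fraction = random.uniform(0.5, 1.0)
    stack = random.randint(min_stack, max_stack) if training else max_stack
    return bet_fraction, stack

def raise_amount(pot_size, bet_fraction):
    # Chip amount added by the single abstracted bet/raise action.
    return bet_fraction * pot_size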

Description of Leduc poker
Leduc is a simplified poker game with two rounds and a six-card deck in two suits. Each player initially antes a single chip to play and receives a single private card, and there are three actions: fold, call, and raise. There is a fixed bet amount of 2 chips in the first round and 4 chips in the second round, and a limit of two raises per round. After the first round, a single public card is revealed. A pair is the best hand; otherwise hands are ordered by their high card (suit is irrelevant). A player's reward is their gain or loss in chips after the game.

Reinforcement Learning and Search in Imperfect Information Games
In this section, we provide experimental results showing that common RL and widely used search algorithms can produce highly exploitable strategies, even in small imperfect information games where exploitability can be computed exactly. In particular, we show how exploitable Information Set Monte Carlo Tree Search is in Leduc poker, as well as three standard RL algorithms (DQN, A2C, and tabular Q-learning) in both Kuhn poker and Leduc poker, using OpenSpiel (79). Results are presented in milli big blinds per hand (mbb/h), which corresponds to one thousandth of a chip in both games.
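For reference, exploitability in these small games can be computed directly with OpenSpiel. The snippet below is a minimal sketch that evaluates a uniform random policy in Leduc poker; the conversion to mbb/h assumes exploitability is returned in chips per hand, with 1 chip = 1000 mbb as in the text.

import pyspiel
from open_spiel.python import policy as policy_lib
from open_spiel.python.algorithms import exploitability

game = pyspiel.load_game("leduc_poker")
uniform = policy_lib.UniformRandomPolicy(game)
expl_chips = exploitability.exploitability(game, uniform)
print(f"Uniform random policy exploitability: {1000 * expl_chips:.1f} mbb/h")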

Information Set Monte Carlo Tree Search
Information Set Monte Carlo Tree Search (IS-MCTS) is a search method that, at the start of each simulation, first samples a world state consistent with the player's information state and uses it for the simulation (40). Reward and visit-count statistics are aggregated over information states, so players base their decisions only on their information states rather than on private information inaccessible to them.
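The determinisation loop at the heart of IS-MCTS can be sketched as follows. The helpers sample_world_state and simulate are hypothetical stand-ins for sampling a world state consistent with the information state and running one MCTS simulation from it; the point here is only the aggregation of statistics by information state.

from collections import defaultdict

def is_mcts(root_info_state, num_simulations, sample_world_state, simulate):
    # Statistics are keyed by (information state, action), never by the sampled
    # world state, so the final decision uses only information the player has.
    visits = defaultdict(int)
    total_return = defaultdict(float)
    for _ in range(num_simulations):
        world = sample_world_state(root_info_state)   # determinisation step
        trajectory, value = simulate(world, visits, total_return)
        for info_key, action in trajectory:           # back up along the trajectory
            visits[(info_key, action)] += 1
            total_return[(info_key, action)] += value
    root = root_info_state.key()
    return max((a for (k, a) in visits if k == root),
               key=lambda a: visits[(root, a)])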
Table S5 shows the exploitability of a policy obtained by running separate, independent IS-MCTS searches from each information state in the game, over various parameter values. The lowest exploitability of IS-MCTS we found in this sweep was 465 mbb/h.

Standard RL algorithms in Imperfect Information Games
Since imperfect information games generally require stochastic policies to achieve an optimal strategy, one might wonder how exploitable standard RL algorithms are in this class of games. To test this, we trained three standard RL agents: DQN, policy gradient (A2C), and tabular Q-learning. We used MLP neural networks for the DQN and A2C agents. Table S6 shows the hyperparameters we swept over to train these RL agents.
In Kuhn poker, the best-performing A2C agent converges to an exploitability of 52 mbb/h, and the tabular Q-learning and DQN agents converge to around 250 mbb/h. Similarly, in Leduc poker, the best-performing A2C agent converges to an exploitability of 78 mbb/h, while the tabular Q-learning and DQN agents converge to about 1300 mbb/h and 900 mbb/h, respectively. Fig. S4 shows the exploitability of the RL agents in Kuhn poker and Leduc poker.

Proofs of Theorems
There are three substantive differences between the SOG algorithm and DeepStack. First, SOG uses a growing search tree, rather than a fixed limited-lookahead tree. Second, the SOG search tree may depend on the observed chance events. Finally, SOG uses a continuous self-play training loop operating throughout the entire game, rather than the stratified bottom-up training process used by DeepStack. We address each of these differences in turn below, after considering how to describe an approximate value function for search in imperfect information games.

Value Functions for Subgames
Like DeepStack, the SOG algorithm uses a value function, so the quality of its play depends on the quality of the value function. We will describe a value function in terms of its distance to a strategy with low regret. We start with some value and regret definitions that are better suited to subgames. Consider some policy profile π, which is a tuple containing a strategy for each player, and a public tree subgame S rooted at public state s_pub with player ranges B_i[s_i ∈ S_i(s_pub)] := P_i(s_i | π). First, note that the counterfactual value v can be rewritten so that it depends only on B and π restricted to S, with no further dependence on π: letting s_i be a Player i information state in S_i(s_pub) and q the opponent of Player i, the counterfactual value at s_i is determined by B_q and the policies within S. Several quantities can then be written in terms of the best-response value at information state s_i, where the modified profile is constructed by replacing action probabilities in π with those in π′. The value function is a substitute for an entire subgame policy profile, so the regret we are interested in is Player i's full counterfactual regret (31) at s_i, which considers all possible strategies within subgame S. With these definitions in hand, we can now consider the quality of a value function f in terms of a regret bound ε and value error ξ. Recall that f maps ranges B and public state s_pub to approximate counterfactual values ṽ(s_i) for each player i.
First, we consider versions of the regret bound and value error that are parameterised by a strategy π. There is some associated bound ε(π) on the sum of regrets across all information states at any subgame, valid for both players. There is also some bound ξ_f(π) on the distance between f(s_pub, B) and the best-response values to π.
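One plausible way to write these two bounds formally is shown below; the restriction to information states at the root of the subgame and the use of a maximum over information states for the value error are assumptions about the exact form, not the paper's definitions.

    ε(π) ≥ Σ_{s_i ∈ S_i(s_pub)} R^full(s_i)   for every subgame S and both players,
    ξ_f(π) ≥ max_{s_i ∈ S_i(s_pub)} | f(s_pub, B)(s_i) − BV^π(s_i) |.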
We then say that f has (ε, ξ) quality bounds if there exists some strategy π such that ε(π) ≤ ε and ξ_f(π) ≤ ξ. As desired, if both ε and ξ are low, then f(s_pub, B) is a good approximation of the best-response values to a low-regret strategy for a subgame rooted at s_pub with initial beliefs B. The DeepStack algorithm (8) used a similar error metric for value functions, but only considered zero-regret strategies. We introduce a more complicated error measure because the space of values corresponding to low-regret strategies may be much larger than the space of values corresponding to zero-regret strategies. For example, consider the public subgame of a matching pennies game after the first player acts with the policy 0.501 heads, 0.499 tails. There are two first-player information states, from playing either heads or tails, with an empty first-player strategy, as there are no further first-player actions. Let us assume a value function f returns the values [0, 0] for these two information states. How good is f, if we restrict our attention to this one subgame?
The unique zero-regret strategy for the second player is to play tails 100% of the time, resulting in first-player counterfactual values of -1 for playing heads and 1 for playing tails. The error metric based on zero-regret strategies therefore measures the distance between f's output [0, 0] and [-1, 1], so the DeepStack metric states that f has an error of 2. However, [0, 0] seems like a very reasonable choice: these are exactly the first-player counterfactual values when the second player plays 0.5 heads, 0.5 tails, a strategy which has a regret of only 0.002 in this subgame.
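To make the 0.002 figure concrete, here is the arithmetic for this subgame under the usual matching-pennies payoffs (player 1 wins when the pennies match, an assumption about the sign convention):

    second player plays tails:   0.501·(+1) + 0.499·(-1) = +0.002   (best response)
    second player plays heads:   0.501·(-1) + 0.499·(+1) = -0.002
    second player plays uniform: 0, so the regret of uniform is 0.002 - 0 = 0.002.
    Against the uniform strategy, player 1's counterfactual values are
    0.5·(+1) + 0.5·(-1) = 0 at both information states, matching [0, 0].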
Rather than saying f is a poor-quality value function with an error of 2 in a game with utilities in [-1, 1], we can now say f is an excellent (0.002, 0) value function which exactly describes a low-regret strategy.
The new quality metric also addresses an issue the old DeepStack metric had with discontinuities in the underlying zero-regret value functions, which means that the space of functions with a low DeepStack error may not be well suited for learning from data. Continuing with the previous example, if we shift B slightly to 0.499 heads and 0.501 tails for the first player, the unique zero-regret strategy in the subgame flips to playing tails 0% of the time, while the uniform random strategy is still a low-regret strategy for this subgame. A function can therefore only have a low error under the DeepStack metric if it accurately predicts the values everywhere around the discontinuity at 0.5 heads, 0.5 tails, whereas the new metric can avoid this discontinuity by picking an ε > 0. More generally, for any constant c > 0, the set of values consistent with strategies of regret at most c is a potentially more attractive learning target than the discontinuous function defined by exact Nash equilibrium values, and it matches a learning procedure based on approximately solving example subgames.

Like DeepStack, SOG has two steps which involve solving subgames of the original game.
One step is the re-solving step used to play through a game, where we solve a modified subgame based on constraints on opponent values and beliefs about our possible private information, in order to obtain our policy and new opponent values. The other step appears only in the training loop, where we solve a subgame with fixed beliefs for both players in order to obtain values for both players. While the (sub)games for these two cases are slightly different, they are both well-formed games and we can find an approximate Nash equilibrium for either using GT-CFR.
When running GT-CFR, even though a policy is explicitly defined only at information states in the lookahead tree L, at each iteration t there is implicitly some complete policy profile π^t. For any information state s in L which is not a leaf, π^t(s) is explicitly defined by the regret-matching policy. For all other s, either a leaf of L or outside the lookahead tree, π^t(s) is defined by the ε-regret subgame policy profile π^{*,S} associated with the value function's (ε, ξ) quality bounds. Note that this π^t exists only as a concept useful for the theoretical analysis: GT-CFR does not have access to the probabilities outside of its lookahead tree, only to a noisy estimate of the associated counterfactual values provided by the value function.
Lemma 1. Let p and q be vectors in [0, 1]^n, and let v and w be vectors in R^n.

Lemma 2. Let p and q be vectors in [0, 1]^n, and let v and w be vectors in R^n.

In GT-CFR, the depth-limited public tree used for search may change at each iteration. Let L_t be the public tree at time t. For any given tree L, let N(L) be the interior of the tree: all non-leaf, non-terminal public states. The interior of the tree is where regret matching is used to generate a policy, with regrets stored for all information states in interior public states. Let F(L) be the frontier of L, containing the non-terminal leaves, and let Z(L) be the terminal public states. GT-CFR uses the value function at all public states in the frontier, receiving noisy estimates ṽ(s) of the true counterfactual values v(s). We distinguish between the true regrets R^T_s computed from the entire policy and the regrets R̃^T_s computed using the estimated values ṽ(s). Given a sequence of trees across T iterations, let U be the maximum difference in counterfactual value between any two strategies at any information state, and let A be the maximum number of actions at any information state.
Lemma 3. After running GT-CFR for T iterations starting at some initial public state s_0, using a value function with quality (ε, ξ), the regret of the resulting strategies is bounded by a standard CFR regret term plus error terms that depend on ε, ξ, and the sizes of the lookahead trees.

Proof. Starting with the definition of regret, and noting that regrets are independently maximised in a perfect-recall game, we can rearrange terms so that the counterfactual values of an information state s_i are rewritten in terms of the counterfactual values of the leaves and terminals of the tree.
Examining part of the first term inside the sum, we can independently maximise the counterfactual values at each information state s_i. As above, this is equivalent to maximising at the public state s_pub.
Given that we individually maximised over each minuend, we satisfy the requirements of Lemma 1. We can then use the value function quality bounds.
Up to this point, we have used the true counterfactual values for the current policy profile. At leaves, however, GT-CFR only has access to the value function's noisy estimates of the true values, so we apply Lemma 2. Placing the result back into Equation 1 and collecting the ε and ξ terms, we can rearrange the sums to consider the regret contribution of each public state. As before, we can use Lemma 1 to separate out regrets at the interior states in N := F(N(∪_{t=1}^T L_t)), which always depend only on leaves and terminals. Let L′_t be L_t minus all public states in N and any successor states.
Note that the states which were separated out are now effectively terminals in smaller trees.
We can repeat this process until regrets for all public states have been separated out.
Finally, we apply the bounds on regret matching (32). Note that the form of Lemma 3 implies that regret might not be sub-linear if public states are repeatedly added to and removed from the lookahead tree. If we only add states and never remove them, however, we get a standard CFR regret bound plus error terms for the value function.
Theorem 3. Assume the conditions of Lemma 3 hold, and that public states are never removed from the lookahead tree. Then the bound of Lemma 3 reduces to a standard CFR regret bound plus the value-function error terms.

Proof. This follows from Lemma 3, noting that the interior of L_t grows monotonically over time.

Self-play Values as Re-solving Constraints
By using a value network in solving, we lose the ability to compute our opponent's counterfactual best-response values to our average strategy (80). It is easy to track the opponent's average self-play values across iterations of a CFR variant, but using these values as re-solving constraints does not trivially lead to a bound on the exploitability of the re-solved strategy. We show here that average CFR self-play values lead to reasonable, controllable error bounds in the context of continual re-solving. We will use (x)^+ to mean max{x, 0}. For simplicity, we will also assume that the subgame being re-solved is in the GT-CFR lookahead tree for all iterations.
Theorem 4. Assume we have some average strategy π generated by T iterations of a GT-CFR solver using a value function with quality (ε, ξ), with final lookahead tree L^T from which public states were never removed, and a final average regret R^T_i for the player of interest. Further assume that we have re-solved some public subgame S rooted at public state s_pub, using the average counterfactual values v(s_o) := (1/T) Σ_{t=1}^{T} v^{π^t}(s_o) as the opt-out values in the re-solving gadget. Let π^S be the strategy generated from the re-solving game, with player and opponent average regrets R̃^S_i and R̃^S_o, respectively. Then the increase in exploitability from using π^S within S is bounded in terms of these regrets and the value-function quality.

Proof. The general outline of the proof has two parts, both asking the question "how much can the opponent's best-response value increase?" As in Lemma 4 of (8), we can break the error in the re-solving opt-out values into separate underestimation and overestimation terms. The first part of the proof is a bound that takes into account the re-solving solution quality, and how much the average values underestimate the best response to the average strategy. This underestimation is bounded by the opponent's regret at the subgame, which requires the solving algorithm to have low regret everywhere in the game: low regret for the opponent overall does not directly imply that the opponent has low regret in portions of the game that they do not play. The second part of the proof places a bound on the overestimation, using the player's regret rather than the opponent's regret.
We start by noting that, from the opponent player o's point of view, we can replace an information state s_o with a terminal that has utility BV^π(s_o); the opponent's best-response utility in this modified game equals BV^π_o. We can extend this to the entire subgame S, replacing each s_o with a terminal giving the opponent the best-response value, and use this notation to rewrite the opponent's best-response value against the combined strategy (π outside S, π^S within S). Next, note that BV^{π^S}(s_o), the opponent's counterfactual best response to the re-solved subgame strategy π^S at any s_o at the root of S, is no greater than max{BV^{π^S}(s_o), v(s_o)}, the value of s_o within the re-solving game before the gadget, where the opponent has the option to opt out for a fixed value v(s_o). That is, adding an extra opponent action which terminates the game never decreases the opponent's best-response utility; this again extends to the entire subgame S. From Lemma 1 of (8), the game value of a re-solving game with opt-out values v(s_o) is bounded in terms of an underestimation error on the opt-out values. Given the re-solving regrets, there are some per-information-state values ε_{s_o} that bound this error. Looking at the underestimation term U^S_{v,π}, we note that this minimum is no greater than the case where the slack is the average full counterfactual regret R̄_{s_o} of strategy π at s_o. Restricting our attention to the portion of the lookahead tree L^T rooted at s_pub and its descendants, Theorem 3 gives us a bound on U^S_{v,π}, and we can update Equation 3 accordingly. Looking at just the difference in opponent counterfactual best-response values, we can again get an upper bound by giving the opponent the choice, at all information sets at the root of subgame S, of playing a best response against the unmodified strategy π to obtain value BV^π(S), or opting out to obtain value v.
The difference of the first two terms is the regret in the opt-out game described above, where we have lifted each iteration strategy π^t into this game by never selecting the opt-out choice. Consider the immediate counterfactual regret R̃^T(s_o) in this augmented game for any information state s_o; it can be written in terms of the original immediate counterfactual regret R^T(s_o) and the opt-out value. Because the positive immediate regret in the opt-out game is the same as the positive regret in the original game, we can use the Theorem 3 bound, which is composed of immediate regrets. Putting this together with Equations 4 and 5 gives the claimed bound.

Theorem 5. Assume we have played a game using continual re-solving, with one initial solve and D re-solving steps. Each solving or re-solving step finds an approximate Nash equilibrium through T iterations of GT-CFR using a value function with quality (ε, ξ); public states are never removed from the lookahead tree; the maximum interior size Σ_{s_pub ∈ N(L^T)} |S_i(s_pub)| over all lookahead trees is bounded by N; the sum of frontier sizes across all lookahead trees is bounded by F; the maximum number of actions at any information state is A; and the maximum difference in values between any two strategies is U. The exploitability of the final strategy is then bounded by (5D + 2) times a term composed of the GT-CFR regret bound and the value-function error terms.

Proof. The exploitability EXP_0 of the player's initial strategy from the original solve is bounded by the sum of the regrets for both players; Theorem 3 provides the regret bounds for GT-CFR.

Each subsequent re-solve operates on the strategy of the previous step, using the average values as the opt-out values. That is, the first re-solve updates the strategy from the initial solve, the second re-solve updates the subgame strategy from the first re-solve, and so on. Theorem 4 provides a bound on how much the exploitability increases after each re-solving step, with Theorem 3 providing the necessary regret bounds.
Figure S3: Initial situation on the glasses map for Scotland Yard. Mr. X starts at station 6 while the two detectives start at stations 1 and 11. All of them have 5 taxi cards (all edges in this map are of the same type) and the game is played for 5 rounds.
Continual re-solving puts GT-CFR together with re-solving of the previously solved subgame. A bound on the final solution quality follows directly from applications of Theorems 3 and 4.

Unrolling for D re-solving steps leads to the final bound.

Figure S1: An example of a Factored-Observation Stochastic Game (FOSG). This figure gives a visual view of the notation from Background and Terminology. In this example the game starts in w_init, the complete state of the environment containing private information for both players. After playing action a_i the state moves to w′, where there are two possible actions. Each action emits private and public observations. In this example, actions a_j and a_l emit the same private observation o^1_priv(1) for player 1, so player 1 cannot distinguish which action happened. On the other hand, player 2 receives different observations o^1_priv(2) and o′^1_priv(2) for each of the actions, and therefore has more information about the state of the environment than player 1. The sequence of public observations shared by both players is denoted s_pub. The sequences of actions and factored observations meet in the final 'Histories and information states' view. The two possible action sequences are represented by histories h_0 and h_1, where h_0 = (a_i, a_l) and h_1 = (a_i, a_j). Since both actions a_l and a_j result in the same observation for player 1, player 1 cannot tell which of the histories happened, and their information state s̄_0 contains them both. This is not the case for player 2, who can separate the histories, and each of player 2's information states s_0 and s_1 contains just one history.

Figure S2: An example of a public tree. The public tree provides a different view of the FOSG. In this example actions a_0 and a_1 emit the same public observation and therefore lead to the same public tree node s′_pub. On the other hand, action a_2 can lead to multiple possible states: for instance, when a detective in Scotland Yard moves to a location, the game can either 1) end, because Mr. X was there and has been caught, or 2) continue, because he was at a different station.

Figure S4: Comparing the performance of DQN, A2C, tabular Q-learning, and a uniform random policy in (A) Kuhn poker and (B) Leduc poker.
Algorithm 2: Self-play with GT-CFR. A coin flip with probability p_no_resign decides whether the game must be played to the end without resigning. While the world state w is not terminal and fewer than moves_max moves have been played, chance events are sampled when chance is to act; otherwise the agent acts according to the exploratory mixture (1 − ε) · π_controller + ε · π_uniform while moves_played < moves_greedy_after, and greedily thereafter. Each training example ⟨h, (v, p)⟩ is sent to the trainer by appending it to the replay buffer.
GT-CFR is the search component used to solve the problems that the SOG algorithm sets up. At every non-terminal public leaf state s_pub of the lookahead tree, GT-CFR uses estimated counterfactual values ṽ generated from a value function f(s_pub, B), with player ranges B induced by Bayes' rule at s_pub for the current policy profile π.
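As a sketch of what "induced by Bayes' rule" means here, and consistent with the range definition in Value Functions for Subgames, player i's range entry for an information state s_i reachable at s_pub is player i's own reach probability under π (any normalisation at s_pub is an assumption):

    B_i[s_i] = P_i(s_i | π) = Π_{(s_i', a) on the path to s_i} π_i(a | s_i').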

Table S1: Neural network architectures and input features used for each game.

Table S2: Hyperparameters for each game.

Table S3: Full Go results (non-recursive queries). The Elo of GnuGo with a single thread and 100 ms of thinking time was set to 0. AlphaZero(s=16k, t=800k) refers to 16,000 search simulations after 800,000 training steps.

Table S4: Full Go results (recursive queries). The Elo of GnuGo with a single thread and 100 ms of thinking time was set to 0. AlphaZero(s=16k, t=800k) refers to 16,000 search simulations after 800,000 training steps.