Adaptive tuning of human learning and choice variability to unexpected uncertainty

Human value–based decisions are notably variable under uncertainty. This variability is known to arise from two distinct sources: variable choices aimed at exploring available options and imprecise learning of option values due to limited cognitive resources. However, whether these two sources of decision variability are tuned to their specific costs and benefits remains unclear. To address this question, we compared the effects of expected and unexpected uncertainty on decision-making in the same reinforcement learning task. Across two large behavioral datasets, we found that humans choose more variably between options but simultaneously learn their values less imprecisely in response to unexpected uncertainty. Using simulations of learning agents, we demonstrate that these opposite adjustments reflect adaptive tuning of exploration and learning precision to the structure of uncertainty. Together, these findings indicate that humans regulate not only how much they explore uncertain options but also how precisely they learn the values of these options.


INTRODUCTION
Human decisions exhibit a pervasive variability under uncertainty (1). In the context of value-based decisions, the source of this variability has classically been assigned to exploration (2)(3)(4)(5), i.e., purposeful bias and variance in choice policy aimed at reducing uncertainty about the values of choice options. Besides this well-described source of decision variability, recent work has shown that the computations used to learn option values from obtained rewards suffer from imprecisions because of limited cognitive resources (6)(7)(8)(9). Learning imprecisions result in decision variability, which can be mistaken for exploration, but these two sources of decision variability are dissimilar in nature. Exploration drives decision variability through the probabilistic selection of options that do not maximize the expected value, whereas learning imprecisions reflect random noise in the reinforcement learning process that updates option values. In practice, these differences allow decomposing human decision variability into two separate sources through detailed computational modeling of human behavior (6).
Both exploration and imprecise computations entail substantial reward costs. First, by selecting an option that does not maximize the expected value, exploration temporarily foregoes the exploitation of the best available option (10). Second, relying on imprecise computations means that the option with the highest subjective value is less likely to be the objectively best option available. Despite these similar reward costs, the two sources of decision variability have very different cognitive benefits. Exploration reduces uncertainty about option values, whereas imprecise learning reduces demands in terms of cognitive and neural resources (11)(12)(13). It is well known that humans arbitrate the "explore-exploit" trade-off under uncertainty in terms of its costs and benefits (3,4). However, these findings have been obtained using reinforcement learning models, which assign all decision variability to exploration. Whether humans simultaneously regulate learning imprecisions in terms of their specific costs and benefits remains unknown.
The costs and benefits of exploration and learning imprecisions depend on the dominant form of uncertainty in the environment: expected versus unexpected uncertainty (14,15). Expected uncertainty refers to random stochasticity of rewards associated with a choice option around a constant mean value. By contrast, unexpected uncertainty refers to changes in the mean value of rewards associated with a choice option. Under expected uncertainty (i.e., reward stochasticity), individual rewards become less informative about their mean value as learning progresses, and agents can therefore tolerate low learning rates and little exploration. By contrast, under unexpected uncertainty (i.e., reward volatility), individual rewards are highly informative about changes in their mean value, and agents should therefore maintain high learning rates and frequent exploration of unobserved choice alternatives. There is ample experimental evidence that humans adjust their learning rates and exploration at short time scales depending on the dominant form of uncertainty in their environment (16)(17)(18)(19). However, whether this regulation of learning rates and exploration is accompanied by a modulation of learning imprecisions remains unknown.
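The different implications of expected and unexpected uncertainty for learning can be illustrated with a one-option Kalman filter. The following sketch (in Python, with illustrative variances that are not the task's generative values) tracks the trial-by-trial Kalman gain, which plays the role of an effective learning rate:

```python
def kalman_gains(q, r, p0=100.0, n=20):
    """Trial-by-trial Kalman gain (effective learning rate) for one option.

    q:  drift variance of the hidden reward mean (unexpected uncertainty),
    r:  observation variance of rewards (expected uncertainty),
    p0: initial uncertainty about the mean.
    All values are illustrative, not the task's generative parameters.
    """
    p, gains = p0, []
    for _ in range(n):
        p += q               # uncertainty grows between trials under volatility
        k = p / (p + r)      # Kalman gain applied to the reward prediction error
        p *= (1.0 - k)       # posterior uncertainty shrinks after observing a reward
        gains.append(k)
    return gains

stable = kalman_gains(q=0.0, r=25.0)    # expected uncertainty only
volatile = kalman_gains(q=9.0, r=25.0)  # added unexpected uncertainty
```

Under expected uncertainty alone, the gain decays toward zero as rewards accumulate; adding unexpected uncertainty keeps it at a high asymptote, matching the intuition described above.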
Here, we developed an experimental framework in which the effects of expected and unexpected uncertainty on decision variability can be compared in the context of the same reinforcement learning task (Fig. 1, A and B; see Materials and Methods). We decomposed the decision variability of two large samples of human participants (a "discovery" dataset and a "replication" dataset; see Materials and Methods) into exploration and learning imprecisions by fitting a noisy reinforcement learning model to their behavior. We found that participants choose more variably between options but learn their values more precisely in response to unexpected uncertainty. By studying individual differences in these opposite adjustments, we show that humans regulate the variability of their decisions based on not one but two cost-benefit trade-offs.

RESULTS

Two-armed bandit task design and performance
Adult participants (n = 200 per dataset; see Materials and Methods) played a two-armed bandit task in three conditions (Fig. 1B). The "reference" (Ref) condition consists of short rounds of trials using two choice options associated with reward distributions of fixed means and variances. The "stochastic" (S+) condition differs from the Ref condition in that reward distributions have larger variances, i.e., increased expected uncertainty. The "volatile" (V+) condition differs from the Ref condition in that it consists of long rounds of trials with reward distributions of changing means, i.e., unexpected uncertainty. Participants were instructed about the structure of uncertainty and the relative difficulty of each condition (see Materials and Methods). We describe below the analysis of the first "discovery" dataset of adult participants (n = 154 after exclusion; see Materials and Methods) and point toward the replication of observed effects in the second "replication" dataset (n = 142 after exclusion; see Materials and Methods).

Fig. 1. (A) Once a choice is made, the reward associated with the option on that trial is displayed. (B) Reward schedule of an option during a block. Colored shapes indicate the reward given for an option. The set of shape options is shown for each block. Darker colored shapes indicate the option chosen on a trial. Vertical lines demarcate the beginning and the end of a block in the reference (Ref) and stochastic (S+) conditions and a reversal in the volatile (V+) condition. Top: Darker red horizontal lines show the generative mean for the option corresponding to the correct shape in the Ref and S+ conditions. Bottom: Darker lines indicate the generative reward mean for the best option drifting throughout the course of a block in the V+ condition. Shaded areas around lines represent the values of the probability density function from which a reward (shape) was drawn. (C) Left: Average accuracy within each condition. Colored dots represent individual participants' mean accuracy. White dots indicate the median accuracy. Error bars represent the first and third quartiles. Shaded areas indicate the first and third quartiles of the optimal model's accuracy. Right: Accuracy over time within each condition. The vertical line represents the start of a new block or a reversal. Solid lines indicate the mean accuracy across participants. Dotted lines indicate the mean accuracy of the optimal model. Shaded areas correspond to the SEM. (D) Left: Average switch rate within each condition. Colored dots represent individual participants' overall switch rate. White dots indicate the median switch rate. Error bars represent the first and third quartiles. Shaded areas indicate the first and third quartiles of the optimal model's switch rate. Right: Proportion of switches over time within each condition. The vertical line represents the start of a new block or a reversal. Solid lines indicate the mean switch rate across participants. Dotted lines indicate the mean switch rate of the optimal model. Shaded areas correspond to the SEM. ***P < 0.001.

Participants' accuracy and switch rates in each condition are shown in Fig. 1 (C and D). Comparing the two conditions with increased uncertainty showed that participants switched more often between choice options in the V+ condition (V+ minus S+: z = 5.3, P < 0.001). These differences between conditions were replicated in the second dataset (fig. S1).

Computational model specification
We first sought to compare human decisions to those of an optimal learning agent in the same task conditions. For this purpose, we derived a Kalman filter whose parameters were set to the generative values of each task condition (see Materials and Methods). This optimal learning agent selected on each trial of each round the option associated with the highest estimated value (reward mean). Simulating the behavior of the optimal learning agent confirmed that human decisions were substantially less accurate and more variable than those of the optimal learning agent in all three conditions (Fig. 1, C and D). These differences between human and optimal decisions were replicated in the second dataset (fig. S1).
To capture the suboptimal variability of human decisions, we altered the optimal learning agent with four free parameters (Fig. 2, A and B; see Materials and Methods). First, a learning rate α controls how much estimated option values are updated following each obtained reward. Unlike reinforcement learning models, the learning rate α of a Kalman filter reflects the dominant form of uncertainty assumed by the learning agent, and the effective magnitude of updates varies from trial to trial as a function of uncertainty regarding the current value of the chosen option (see Supplementary Text). Second, a decay rate δ controls the rate at which the value of the unchosen option regresses toward its previous value, reflecting a decay of unchosen option values in working memory. Third, a learning noise term triggers imprecise updates of estimated option values, controlled by a scaling factor ζ. As in recent work (6), we hypothesized that learning imprecisions follow Weber's law and scale with the magnitude of associated reward prediction errors. Fourth, a choice temperature τ generates exploration through a "softmax" choice policy (3,4). Before analyzing fits of this suboptimal learning agent to human decisions, we performed model and parameter recovery analyses (20,21) to validate that our fitting procedure was capable of identifying the source(s) of decision variability from choice behavior (fig. S2; see Materials and Methods). We also verified that the main findings are robust to the use of a reinforcement learning model instead of a Kalman filter to fit human decisions (see Supplementary Text).
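The four parameters can be illustrated in a minimal sketch of one choice and one value update (Python; the function names, the softmax form shown, and the decay of unchosen values toward the 50-point midpoint are our illustrative assumptions, not the authors' implementation):

```python
import math
import random

def softmax_choice(values, tau, rng):
    """Sample an option index from a softmax over estimated values (temperature tau)."""
    z = [v / tau for v in values]
    m = max(z)
    weights = [math.exp(v - m) for v in z]
    threshold = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if threshold <= acc:
            return i
    return len(weights) - 1

def noisy_update(value, reward, alpha, zeta, rng):
    """Imprecise value update: noise SD scales with the update magnitude (Weber's law)."""
    update = alpha * (reward - value)
    noise = rng.gauss(0.0, zeta * abs(update))
    return value + update + noise

def decay_unchosen(value, delta, midpoint=50.0):
    """Decay of the unchosen option's value (toward the 50-point midpoint; an assumption)."""
    return value + delta * (midpoint - value)
```

For example, with alpha = 0.5 and no noise (zeta = 0), a reward of 70 moves a value of 50 to exactly 60; with zeta > 0, the same update is perturbed with a standard deviation proportional to its size.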

Model parameter fits in S+ and V+ conditions
We fitted the parameters of the suboptimal learning agent to each participant's behavior in each condition using approximate Bayesian inference (see Materials and Methods). We performed Bayesian model selection to compare the suboptimal learning agent including the two sources of decision variability (learning imprecisions and exploration) with variants including a single source (fig. S3). We found that both sources of decision variability are needed to fit participants' behavior (exceedance probability P > 0.999 in each condition). In line with recent work, we validated that learning imprecisions scale with the magnitude of updates following each outcome (comparison between flat and update-scaled imprecisions; exceedance probability P > 0.999 in each condition). We also found evidence for a significant memory decay of unchosen option values in all three conditions (comparison between δ > 0 and δ = 0; exceedance probability P > 0.999 in each condition).
As expected, participants had higher learning rates in the V+ condition than in the other two conditions (V+ against Ref: z = 10.2, P < 0.001; V+ against S+: z = 9.8, P < 0.001; Fig. 3A). Regarding decision variability, participants had lower learning noise in the V+ condition (V+ against Ref: z = −2.8, P = 0.006; V+ against S+: z = −4.4, P < 0.001), as well as higher choice temperature (V+ against Ref: z = 4.4, P < 0.001; V+ against S+: z = 4.4, P < 0.001). We found no statistically significant difference between the Ref and S+ conditions in terms of any model parameter (learning rate: P = 0.437; decay rate: P = 0.479; learning noise: P = 0.100; choice temperature: P = 0.995). Unlike its optimal counterpart, this suboptimal learning agent was able to capture participants' behavior in terms of accuracy and switch rate, as well as the trajectories of these quantities over the course of each round (Fig. 3, B and C). These effects were replicated in the second dataset (fig. S1).
Learning noise and choice temperature showed large individual differences in each condition. We took advantage of these differences to study how each parameter affects decision variability. We correlated each parameter with decision accuracy and switch rate across participants (Fig. 4A). Participants' accuracy showed significant negative correlations with learning imprecisions and exploration (rank correlation, learning noise: ρ = −0.18, P < 0.001; choice temperature: ρ = −0.38, P < 0.001). By contrast, participants' switch rates showed a negative relation with learning noise (ρ = −0.14, P = 0.002) but a strong positive relation with choice temperature (ρ = 0.71, P < 0.001). This means that, unlike exploration, learning imprecision did not make participants switch more between the two choice options. Median splits of participants' choice behavior as a function of learning noise illustrate this dissociation (Fig. 4).

Model parameter covariations across participants
We studied how the large individual differences in decision variability are shared between model parameters and across conditions. For this purpose, we computed the correlation matrix of model parameters across participants (12 parameters = 4 parameters per condition; see Materials and Methods). This matrix revealed significant covariations between model parameters and conditions (Fig. 5A). First, learning rate, learning noise, and choice temperature all showed significant within-parameter correlations between the Ref and the S+ or V+ conditions (Fig. 5C). Furthermore, learning rate correlated negatively with learning noise (rank correlation, ρ = −0.36, P < 0.001; Fig. 5D) but correlated positively with choice temperature (ρ = 0.43, P < 0.001). We verified that these between-parameter correlations remained significant within each condition and performed additional recovery analyses to validate that neither reflects spurious correlations arising from the fitting procedure (Fig. 5B; see Materials and Methods). These correlations were also replicated in the second dataset (figs. S4 and S5). The negative association between learning noise and learning rate and the positive association between choice temperature and learning rate across participants mirror their within-participant adjustments across task conditions: the decrease in learning noise and the increase in choice temperature are both associated with the higher learning rate in the V+ condition. However, are these simultaneous adjustments of decision variability distinct or tied to each other? To answer this question, we first measured the partial correlation between learning noise and choice temperature once the relations of these two quantities with learning rate are partialled out. This analysis indicated no direct relation between the two quantities (partial rank correlation, ρ = 0.03, P = 0.584).
We then measured the relation between adjustments of each source of variability and adjustments of learning rate between conditions. We found that the decrease in learning noise in the V+ condition (V+ minus Ref) correlates with the increase in learning rate (ρ = −0.22, P = 0.007) but not with the increase in choice temperature (ρ = −0.03, P = 0.739). Together, these results indicate that the adjustments of learning noise and choice temperature to unexpected uncertainty are distinct, largely independent phenomena.
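The partial rank correlation used above can be sketched as follows (a minimal Python implementation, not the authors' code): the ranks of learning noise and choice temperature are residualized on the ranks of learning rate before being correlated.

```python
import numpy as np

def rank_transform(x):
    """Rank values (1..n); ties receive their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):
        mask = x == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def partial_spearman(a, b, c):
    """Rank correlation between a and b after partialling out c from both."""
    ra, rb, rc = rank_transform(a), rank_transform(b), rank_transform(c)
    X = np.column_stack([np.ones_like(rc), rc])  # intercept + controlling variable
    def residual(y):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return y - X @ beta
    return float(np.corrcoef(residual(ra), residual(rb))[0, 1])
```

Two variables that share only their dependence on the controlling variable yield a partial correlation near zero, which is the pattern reported here for learning noise and choice temperature.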

Principal components analysis of model parameters
To provide additional support for independent adjustments of learning imprecision and exploration to volatility, we performed a principal components analysis (PCA) of model parameters (see Materials and Methods). By construction, principal components (PCs) are orthogonal to each other and reflect uncorrelated sources of variability in model parameters. We therefore asked whether learning noise and choice temperature project onto distinct PCs. This analysis revealed two PCs (PC1 and PC2) that capture covariance in model parameters better than chance (Fig. 6A). Learning noise and choice temperature projected onto distinct PCs, one of which showed a significant relation to accuracy and no significant relation to switch rate (see Supplementary Text). These different effects were replicated in the second dataset (fig. S6). The covariance structure of model parameters extracted using PCA confirms that the simultaneous adjustments of learning noise and choice temperature to volatility are largely independent of each other.
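A PCA-against-chance analysis of this kind can be sketched as follows (illustrative Python; the permutation scheme shown, column-wise shuffling of z-scored parameters, is our assumption about how "better than chance" might be assessed, not the authors' exact procedure):

```python
import numpy as np

def pca_vs_chance(X, n_perm=200, seed=0):
    """Eigenvalues of the parameter covariance structure vs. a shuffled null.

    X: participants-by-parameters matrix (the paper's matrix has 12 parameters,
    4 per condition). Column-wise shuffling destroys between-parameter
    covariance while preserving each parameter's marginal distribution.
    Returns (observed eigenvalues, p-value of each PC against the null).
    """
    rng = np.random.default_rng(seed)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)          # z-score each parameter
    observed = np.linalg.eigvalsh(np.cov(Z.T))[::-1]  # descending eigenvalues
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        Zp = np.column_stack([rng.permutation(Z[:, j]) for j in range(Z.shape[1])])
        null = np.linalg.eigvalsh(np.cov(Zp.T))[::-1]
        exceed += (null >= observed)
    return observed, exceed / n_perm
```

A PC whose observed eigenvalue exceeds the shuffled null captures genuine covariance between parameters rather than sampling noise.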

Adaptive regulation of learning and choice variability to uncertainty
Participants choose more variably but learn less imprecisely in the V+ condition as compared to the other two conditions. However, do these opposite adjustments of decision variability reflect the changing costs of learning imprecisions and exploration across conditions? To address this question, we performed theoretical simulations of noisy reinforcement learning agents to measure the reward costs of learning imprecisions and exploration in each condition.
We computed the marginal reward costs associated with each source of decision variability through simulations of the suboptimal learning agent. We selectively varied the associated model parameter (either the learning noise ζ or the choice temperature τ) while holding all other model parameters constant and set to their best-fitting values (see Materials and Methods). We measured the marginal reward cost C_x as the fractional loss of reward excess (i.e., the difference between obtained and foregone rewards) compared to a learning agent without the corresponding source of variability x (x = ζ or τ) tested in the same condition (Ref, S+, or V+). This "relative" definition of reward cost C was chosen to measure how much each source of variability affects the reward that could have been obtained in each condition (from C_x = 0 for a cost-free source of variability to C_x = 100% for a purely random agent).
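As a minimal illustration of this definition (not the paper's simulation code; the stationary bandit, reward means, and parameter values below are placeholders), the marginal cost of learning noise can be estimated as the fractional loss of reward excess relative to a noise-free agent:

```python
import math
import random

def average_excess(zeta, tau, n_rounds=2000, n_trials=40, seed=0):
    """Average reward excess (obtained minus foregone reward) of a noisy softmax
    learner in a stationary two-armed bandit. All settings (reward means 60/40,
    reward SD 10, learning rate 0.3) are illustrative placeholders."""
    rng = random.Random(seed)
    means, alpha, total = (60.0, 40.0), 0.3, 0.0
    for _ in range(n_rounds):
        q = [50.0, 50.0]
        for _ in range(n_trials):
            p1 = 1.0 / (1.0 + math.exp((q[0] - q[1]) / tau))  # softmax choice
            c = 1 if rng.random() < p1 else 0
            rewards = [rng.gauss(m, 10.0) for m in means]
            total += rewards[c] - rewards[1 - c]              # obtained minus foregone
            update = alpha * (rewards[c] - q[c])
            q[c] += update + rng.gauss(0.0, zeta * abs(update))  # Weber-scaled noise
    return total / (n_rounds * n_trials)

# Marginal cost of learning noise, per the definition above:
base = average_excess(zeta=0.0, tau=2.0)
C_zeta = 1.0 - average_excess(zeta=2.0, tau=2.0) / base
```

A cost-free source of variability yields C = 0, and a purely random agent (zero reward excess) yields C = 1, matching the bounds stated in the text.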
Marginal reward costs C_ζ and C_τ increased monotonically with learning noise and choice temperature in all conditions (Fig. 7A). However, the same amount of learning noise ζ was associated with larger reward costs C_ζ in the V+ condition than in the other two conditions. By contrast, the same choice temperature τ was associated with smaller reward costs C_τ in the V+ condition. This means that learning noise is effectively more costly in the V+ condition, whereas choice temperature is effectively less costly in the V+ condition.
On the basis of these theoretical simulations, we estimated the marginal reward costs incurred by learning noise (C_ζ) and choice temperature (C_τ) for each participant and each condition (n = 296 across the two datasets; Fig. 7B). We found that C_ζ did not differ between the S+ and V+ conditions (S+: 11.8% [10.5, 13.9]; V+: 12.5% [10.9, 14.3], median [bootstrapped 95% confidence interval]; z = 1.0, P = 0.307). Similar to C_ζ, C_τ did not differ between these two conditions (S+: median 4.1%). In other words, the decrease in learning noise and increase in choice temperature in the V+ condition compensate for the larger reward costs of learning noise and the smaller reward costs of choice temperature in this condition. Expressing reward costs in terms of the "joint" reward costs of the two sources of variability C_ζ,τ (compared to a learning agent without any variability) also results in similar costs across conditions (fig. S7C). Conversely, the amounts of learning noise ζ and choice temperature τ associated with fixed marginal reward costs across conditions show a decrease in ζ and an increase in τ in the V+ condition, as in participants (Fig. 7B). In additional theoretical analyses (see Supplementary Text), we considered alternative definitions of reward costs and the possibility that participants optimize learning imprecisions and exploration in terms of a quantitative comparison between their costs (in terms of reward loss) and benefits (in terms of reduced cognitive resources for learning imprecisions and of lower uncertainty for exploration). Together, these theoretical considerations suggest that the opposite adjustments of learning imprecisions and exploration to unexpected uncertainty follow the opposite changes in their reward costs between conditions.

DISCUSSION
Making variable decisions under uncertainty has not only clear benefits but also significant costs (9,22). Previous research has provided compelling evidence that part of this variability reflects active and adaptive exploration aimed at reducing uncertainty about choice options (3)(4)(5). However, seeking information about uncertain options often means foregoing a rewarding option (10). Furthermore, recent work has shown that a substantial fraction of decision variability arises not only from exploration but also from learning imprecisions (6). Here, we studied how humans adapt these two sources of decision variability to different forms of uncertainty. We obtained converging evidence for an independent tuning of learning imprecisions and exploration to their specific costs and benefits. We discuss below the implications of these findings for existing theories of the explore-exploit trade-off that ignore learning imprecisions (23).
In agreement with existing theories (4,10,14,22), we found that humans adjust the explore-exploit trade-off depending on the dominant form of uncertainty in their environment. Participants made more variable choices in the V+ condition where option values change over the course of a single game. Such "random" exploration is adaptive in this context to monitor possible changes in the values of recently unchosen options (3). Similar adaptations of the explore-exploit trade-off have been described across other task conditions. Humans make little to no exploratory choices when the outcomes of unchosen options are known (6), engage in "directed" exploration in conditions with imbalanced uncertainty across options (5,22,24), and use structured exploration in environments with spatially correlated option values (25).
Here, by measuring imprecisions in the reinforcement learning process that updates option values, we found that humans increase learning precision in the V+ condition. This within-participant adjustment of learning precision is resource-efficient: While outcomes from S+ options are weakly informative about their values, outcomes from V+ options are informative about changes in their values. Participants therefore require not only higher learning rates but also more precise updates. In the noisy Kalman filter that we fitted to participants' behavior, the learning rate α reflects the perceived volatility of choice options. We leveraged the large individual differences in learning rate in the V+ condition to validate a second prediction of our resource-efficient account: that individuals with high learning rates (i.e., who perceive options that are more volatile) should (on average) learn option values more precisely than individuals with low learning rates. This pervasive relation between learning rate and learning precision reveals a second trade-off that shapes human learning and decision-making under uncertainty.
Adjustments of exploration and learning imprecisions are not only distinct in terms of their respective costs and benefits but they also correspond to fundamentally different types of adjustments in the decision process. Optimizing the explore-exploit trade-off consists in tuning the choice policy between available options (4), downstream from the reinforcement learning process which estimates option values based on obtained rewards. By contrast, optimizing the cost-benefit trade-off associated with imprecise computations consists in tampering with the estimation of option values themselves, upstream from the choice policy (6,9). The observation of opposite adjustments of exploration and learning imprecisions in response to unexpected uncertainty provides empirical evidence that humans can simultaneously and independently regulate how they choose between options and how precisely they learn from choice outcomes.
The tuning of learning imprecisions to uncertainty cannot be described in terms of "efficient coding" theories (7)(26)(27)(28). These influential theories explain how noise-free computations can be tuned at long time scales to minimize the impact of external noise on performance. Instead, we show that humans rely on imprecise computations to learn option values and that they regulate this internal noise (6,9) at short time scales as a function of task demands. This finding is in agreement with the idea that humans optimize the use of their limited cognitive resources in a flexible, context-dependent fashion as a function of task demands (29,30). The fact that humans regulate learning imprecisions confirms theoretical considerations that describe precise computations as extremely costly in terms of neural resources (13). This finding is also consistent with the observation that reinforcement learning in multidimensional environments relies on selective attention mechanisms that focus learning resources on a subset of learnable option features (31)(32)(33).
The fact that humans adapt their exploration and learning imprecisions across conditions does not imply that participants are aware of and in control of these adaptations. In our computational model, learning imprecisions corrupt value representations in an implicit fashion, i.e., trial-to-trial learning errors are not explicitly accounted for by increasing the uncertainty about option values. It is well known that humans exhibit partial blindness to internal sources of error (34,35). In additional control analyses (see Supplementary Text), we considered variants of our learning agent where the uncertainty triggered by learning imprecisions alters learning rates. The fact that these variants provide a poorer fit to human decisions suggests that learning imprecisions may be regulated in an implicit, nonintentional fashion. By contrast, exploration is thought to reflect a source of decision variability that can be intentionally regulated, particularly in conditions where sources of uncertainty are known (as is the case in our study). In addition, in our model, exploration corresponds to explicit "non-greedy" decisions that do not maximize the expected value. Nevertheless, further work will be required to determine the extent to which exploration and learning imprecisions are regulated intentionally as a function of uncertainty.
We simulated the marginal reward costs of exploration and learning imprecisions in each condition, expressed in terms of relative loss compared to a learning agent without this source of variability. We found that unexpected uncertainty is associated with lower costs of exploration but higher costs of learning imprecisions than expected uncertainty. These opposite effects of unexpected uncertainty on the costs of exploration and learning imprecisions provide an adaptive account of their opposite adjustments. In agreement with this view, the marginal reward costs of participants' exploration and learning imprecisions were found to be constant across conditions. However, it remains unclear whether participants actively maintain fixed reward costs across conditions, or whether they optimize other cost-benefit trade-offs associated with these two sources of behavioral variability (see Supplementary Text).
If humans regulate (intentionally or not) the variability of their decisions in terms of two distinct cost-benefit trade-offs, then their neurophysiological substrates should be dissociable. Trial-to-trial decision variability has been associated with brain activity in frontopolar (3,36) and anterior cingulate (37,38) cortices and has been related to dopaminergic and noradrenergic pathways (39)(40)(41)(42). These multiple effects have been linked to adjustments of the explore-exploit trade-off, which is the sole source of decision variability in state-of-the-art models relying on perfectly precise computations. Accounting for learning imprecisions has revealed that the variability of value updates correlates with blood oxygen level dependent (BOLD) activity in the anterior cingulate cortex and with pupil dilation (6). Pupil dilation correlates with the activity of the locus coeruleus (43,44), a brainstem nucleus with norepinephrine-containing neurons that has bidirectional projections with the anterior cingulate cortex in the primate brain (45). Adjustments of exploration and learning imprecisions could be achieved either by dissociable brain regions and neuromodulatory pathways or by different signals in the same brain region (46). In particular, anterior cingulate activity has been related not only to exploration but also to the cost-benefit trade-off associated with cognitive control (12). Future research should study how exploration and learning imprecisions are simultaneously regulated in the human brain (47).
The delineation of opposite adjustments of exploration and learning imprecisions in response to volatility suggests that models lacking either source may lead to incorrect interpretations. Our findings show that humans adjust not only how much they explore uncertain choice options, by arbitrating the explore-exploit trade-off, but also how precisely they learn their values through reinforcement learning. These adjustments may generalize to different tasks and computational models of behavior. Decisions in perceptual categorization tasks suffer from a similar source of imprecisions in the underlying inference problem (48). These categorization tasks can be embedded in V+ conditions (49) where statistical inference models of behavior include a parameter that controls learning rate and another parameter that controls inferential imprecisions (50,51). In this context that differs widely from the two-armed bandit used in this study, the same negative relation between learning rate and inferential imprecisions may emerge. Together, these results indicate that the precision of cognitive computations acts as an important and flexible constraint on human cognition (9). Existing theories of cognitive control should therefore be revised to account for imprecise computations and their associated costs and benefits.

MATERIALS AND METHODS

Experimental design
Two samples of n = 200 English-speaking adults were recruited on Prolific (prolific.co) for inclusion in the two datasets collected for this study (a discovery dataset and a replication dataset; total n = 400). We fixed this sample size of n = 200 per dataset a priori and calculated that it is sufficiently powered to detect a minimal between-subject correlation of 0.20 with 80% power at a significance threshold α = 0.05. We excluded from analyses any participant whose accuracy did not significantly exceed chance-level performance at a one-tailed threshold α = 0.05 in any of the three conditions (binomial test against 50% accuracy). This exclusion procedure left n = 154 participants (mean age = 29.6 ± 10.0 years; 53 females) in the first dataset and n = 142 participants (mean age = 25.4 ± 6.4 years; 72 females) in the second dataset. All participants provided informed consent regarding their participation in the study, which followed guidelines approved by the ethical review committee of the Institut National de la Santé et de la Recherche Médicale (IRB #00003888).
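The stated power calculation can be reproduced with the standard Fisher z approximation for correlation tests (a sketch; the authors' exact procedure is not specified):

```python
import math

def n_for_correlation(r, z_alpha=1.959964, z_beta=0.841621):
    """Approximate sample size to detect a correlation r with 80% power at a
    two-tailed alpha of 0.05, via the Fisher z transformation (the default
    normal quantiles encode those two settings)."""
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

required = n_for_correlation(0.20)  # → 194, below the fixed n = 200 per dataset
```

The approximation returns 194 participants for a correlation of 0.20, consistent with the claim that n = 200 per dataset is sufficiently powered.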
Upon gaining access to the task, participants were prompted to set the task to full-screen before being able to continue. Next, participants were introduced to the mechanics of their responses (e.g., keys for choosing options) and the presentation of the task as a slot machine game. They were instructed on the dynamics of the Ref, S+, and V+ conditions and the difference between these conditions via written prompts on the screen. Participants always played the Ref condition first, and the order of the other two conditions was counterbalanced across participants. Before every condition, participants played a "practice" block to familiarize them with the condition. At the end of each practice block, participants were presented with a debriefing of their performance in the condition, with a visual illustration of the trajectory of their decisions and the rewards they obtained. The task was self-paced but had to be completed within 60 min to avoid disengagement. Participants could take breaks between blocks.

Task generation
The Ref and S+ conditions corresponded to two-armed bandits with fixed reward distributions, with rewards drawn between 1 and 99 points. The V+ condition corresponded to a restless two-armed bandit task in which the means of the reward distributions associated with each option follow a random walk over time. In all conditions, the mean rewards r_o associated with the two options o ∈ {1, 2} were symmetrical (i.e., r_2 = 100 − r_1). Reversals were defined as time points where the mean rewards associated with the two options cross each other.
Task difficulty was matched between the S+ and V+ conditions based on the simulated performance of an optimal learning agent. For each participant, we first generated 2000 random walks of mean rewards following a beta distribution with an initial mean of 50 points and a drifting SD of 5 points. All random walks that traveled below 10 points or above 90 points, or had steps larger than 10 points, were discarded. Excursions below or above 50 points lasting 8, 12, 16, 20, and 24 steps were chosen to populate the V+ condition. The average reward of each excursion was used as the static mean reward of a round of trials for the Ref condition. We set the variance of sampled rewards such that 75% of rewards sampled from the better option in the Ref condition were higher than 50 points (and 25% lower than 50 points). We then computed the effective sampling variance (ṽ_s) and the effective drifting variance (ṽ_d) of rewards and simulated a standard Kalman filter with greedy choices in the V+ condition to obtain its accuracy (fraction of choices toward the option associated with the largest mean reward). Finally, to generate the S+ condition, we incrementally regressed the static mean reward from the Ref condition toward 50 until the choice accuracy of a Kalman filter matched that of the V+ condition (fig. S8). Participants played two instances of each condition (two blocks of 80 trials for each condition).
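The random-walk generation and rejection steps can be sketched as follows. This is a simplified illustration (names are ours): a Gaussian drift of SD 5 stands in for the beta-distributed walk described above, while the rejection criteria match those in the text.

```python
import random

def gen_walk(n_steps=24, start=50.0, drift_sd=5.0):
    # Random walk of mean rewards; Gaussian drift stands in for the
    # beta-distributed walk used in the study
    walk = [start]
    for _ in range(n_steps - 1):
        walk.append(walk[-1] + random.gauss(0.0, drift_sd))
    return walk

def valid(walk, lo=10.0, hi=90.0, max_step=10.0):
    # Rejection criteria: stay within [10, 90] points,
    # with no single step larger than 10 points
    in_range = all(lo <= x <= hi for x in walk)
    small_steps = all(abs(b - a) <= max_step for a, b in zip(walk, walk[1:]))
    return in_range and small_steps

# Rejection sampling: keep generating until enough valid walks are found
walks = []
while len(walks) < 100:
    w = gen_walk()
    if valid(w):
        walks.append(w)
```

In practice, accepted walks would then be screened for excursions below or above 50 points of the required lengths before being assigned to the V+ condition.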
Options were depicted as black shapes in the Ref and S+ conditions, and as colored discs in the V+ condition, to help participants distinguish between stable (Ref and S+) and volatile (V+) conditions. To make sure that participants treated each new round of the Ref and V+ conditions as independent from the previous ones, we presented novel shapes for each new round. After each choice, the chosen option briefly appeared at the center of the screen, followed by the number of points obtained from the chosen option (Fig. 1A). Participants earned £3.30 upon successful completion of the task. Participants gained an extra £1.00 as a bonus if they performed significantly better than chance across the entire experiment (α = 0.05, corresponding to a choice accuracy above 53.75%). Participants who answered a battery of mental health questionnaires administered after the task (not analyzed for this study) earned another £3.00 upon completion.
The visuals and dynamics of the task (i.e., the frontend code) were implemented using jsPsych (version 6.3.0) (52). The backend server was handled using Node.js. Participants' responses to the task and questionnaires were stored in a MySQL database.

Computational modeling
Following each outcome, the posterior value of the chosen option x_t^c is updated according to a standard Kalman filter corrupted by additive random noise ε_t drawn from a normal distribution of zero mean and SD η_t:

x_t^c = x_{t−1}^c + k_t (r_{t−1} − x_{t−1}^c) + ε_t

where k_t is the Kalman gain, which depends on the posterior variance of the chosen option v_{t−1}^c and the sampling variance v_s of rewards:

k_t = (v_{t−1}^c + v_d) / (v_{t−1}^c + v_d + v_s)

where v_d is the drifting variance assumed by the Kalman filter. The sampling variance v_s was set to its true effective value (0.0163 for rewards rescaled between 0 and 1) and used as the "scaling parameter" in the model. We fitted v_d in terms of its associated asymptotic Kalman gain α for an option that would be chosen on n → ∞ trials. Following recent findings (6), the SD of the random noise corrupting the update of the posterior value of the chosen option scales as a constant fraction ζ of the update: η_t = ζ | k_t (r_{t−1} − x_{t−1}^c) |. Posterior variances of the two options were initialized to the true variance of mean reward values across all conditions (0.0214 for rewards rescaled between 0 and 1). The posterior value of the unchosen option x_t^u decays toward the baseline value (50 points, i.e., 0.5 for values rescaled between 0 and 1) at a rate controlled by an exponential decay factor δ:

x_t^u = x_{t−1}^u + δ (0.5 − x_{t−1}^u)

The probability p_t^1 of choosing option 1 based on the posterior values of the two options follows the standard softmax policy with choice temperature τ:

p_t^1 = 1 / (1 + exp(−(x_{t−1}^1 − x_{t−1}^2) / τ))

where τ → 0 corresponds to a purely greedy argmax policy. We used sequential Monte Carlo sampling methods to estimate the conditional likelihoods of each participant's responses given the set of parameter values. Using Bayesian adaptive direct search with 10 random starting points for each parameter α, ζ, δ, and τ (53), we first obtained point estimates of best-fitting parameter values.
We then used these estimates as starting points for the estimation of their joint posterior distribution using Variational Bayes Monte Carlo (54, 55). Both model-fitting algorithms accounted for the noise in log-likelihood estimates arising from the random learning noise in the Kalman filter updates.
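A minimal sketch of a single trial of this model, with our own variable names, the initial values quoted above (v_s = 0.0163, prior variance 0.0214), and the standard Kalman gain implied by the description:

```python
import math
import random

def kalman_update(x_c, v_c, r, v_s, v_d, zeta):
    # One noisy Kalman filter update of the chosen option's posterior value
    v_pred = v_c + v_d                 # prediction step: variance drifts by v_d
    k = v_pred / (v_pred + v_s)        # Kalman gain
    update = k * (r - x_c)
    noise_sd = zeta * abs(update)      # learning noise scales with the update (Weber fraction zeta)
    x_new = x_c + update + random.gauss(0.0, noise_sd)
    v_new = (1.0 - k) * v_pred         # posterior variance after observing r
    return x_new, v_new

def decay_unchosen(x_u, delta):
    # The unchosen option's value decays toward the 0.5 baseline
    return x_u + delta * (0.5 - x_u)

def p_choose_1(x1, x2, tau):
    # Softmax policy over two options; tau -> 0 approaches greedy argmax
    return 1.0 / (1.0 + math.exp(-(x1 - x2) / tau))
```

For example, with zero learning noise (ζ = 0) and no drift (v_d = 0), a first outcome of r = 1 pulls the chosen value from 0.5 toward 1 with gain 0.0214 / (0.0214 + 0.0163) ≈ 0.57 and shrinks its posterior variance accordingly.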

Computational model validation
To ensure that the inclusion of learning noise (controlled by its Weber fraction ζ) and soft choices (controlled by their temperature τ) were both necessary to fit participants' choices, we performed random-effects Bayesian model selection based on the standard Dirichlet parameterization described in the literature (56). In practice, as in recent work (6), we compared the model including both sources of decision variability with two model variants: one without learning noise where ζ = 0, and one with a purely greedy choice policy where τ = 0. The full model outperformed the other two reduced models (model prevalence of 70% for the full model; exceedance probability P > 0.999).
Critically, we performed standard model recovery to validate our model simulation and fitting procedure (20, 21). In particular, by simulating the three variants of our Kalman filter model using individual best-fitting parameters, we established that the candidate models were distinguishable from each other (fig. S2B). Our fitting procedure was unbiased in that we were able to successfully recover all three model variants: we simulated decisions from each model and then fitted all three models to this simulated behavior, for which the ground-truth model structure is known.

Statistical analyses
We used Wilcoxon signed-rank tests for comparisons of model-free behavioral variables across conditions (choice accuracy and switch rate) and for comparisons of best-fitting model parameters across conditions. Repeated-measures analysis of variance (ANOVA) was performed to compare choice accuracy curves and switch rate curves across conditions. Spearman (rank-based) correlations were used unless noted otherwise. Split-plot tests were used to assess the effects of median splits with respect to each model parameter on choice accuracy and switch rate curves. We performed a PCA on standardized (zero mean and unit variance) model parameter values such that, by construction, (i) the variance explained by a PC in a given condition captures shared variance between model parameters and (ii) different PCs are orthogonal to each other and therefore reflect uncorrelated sources of variability in model parameters.
SEs of the Spearman rank correlation coefficients were obtained through bootstrapping: we randomly resampled the original dataset (with identical sample size, with replacement) 1000 times and computed the correlation on each resampled dataset. The same procedure was used to obtain the SEs of PC coefficients and of the associated fractions of variance explained. Statistical significance of the fraction of variance explained was assessed by a one-sided comparison of the variance explained by bootstrapped PCs to that explained by bootstrapped PCs computed on shuffled data. A similar procedure was used to compare the fractions of variance of each PC explained by each model parameter.
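The bootstrap procedure for the SE of a Spearman correlation can be sketched as follows. This is a stdlib-only illustration with our own function names (in practice, a library routine such as scipy.stats.spearmanr would be used for the correlation itself); ties introduced by resampling are handled by midranking.

```python
import random

def ranks(xs):
    # Average (mid) ranks, with ties sharing their mean rank
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based midrank for the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Spearman rho = Pearson correlation of the rank-transformed data
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def bootstrap_se(x, y, n_boot=1000, seed=0):
    # SD of the correlation across datasets resampled with replacement
    rng = random.Random(seed)
    n = len(x)
    rhos = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        rhos.append(spearman([x[i] for i in idx], [y[i] for i in idx]))
    m = sum(rhos) / n_boot
    return (sum((r - m) ** 2 for r in rhos) / (n_boot - 1)) ** 0.5
```

The same resample-and-recompute scheme extends directly to PC coefficients and fractions of variance explained by replacing the correlation with the corresponding PCA statistic.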