Every animal is represented by a blue circle. Correlation was measured by Spearman s rank correlation coefficient (ρ).

Supplementary Figure 1 Correlations between tone and context freezing by animal in each of the four groups in experiment 1. Every animal is represented by a blue circle. Correlation was measured by Spearman s rank correlation coefficient (ρ).

Supplementary Figure 2 Comparisons of context memory between groups are stable over time and invariant to the salience of the conditioning context.

(a) Minute-by-minute analysis of freezing during the context test for the groups in experiment 1 (and Pairings Last, included for illustration but not in the statistical analysis). A repeated measures ANOVA showed no Time*Contingency*Spacing interaction (n = 16,17,18,20,21,F 1,288 = 1.96, P = 0.17). A comparison restricted to the massed condition (between CTL II and Pairings First) also showed no Time*Contingency interaction (F 1, 144 ) = 0.74, P =0.40). (b) Comparison of CTL II and Pairings First groups with conditioning and context test performed in a more salient context (lit by a visible light and with citrus odor). Reduction in Tone memory matched previous result (ratio between CTLII and Pairings First 0.63 vs. 0.66 originally), whereas Context memory was similar between the groups, as before. Error bars indicate s.e.m.

Supplementary Figure 3 Defecation gives similar results to freezing for experiments described in Figure 2. (a) Contingency degradation in the Intermixed condition with APV infusion in dorsal hippocampus (DH) prior to conditioning, as measured by defecation during tone test. (n = 9,9, Mann-Whitney U test, U = 17.5, P = 0.04). (b) Impaired contextual aversive memory, as measured by defecation, following APV infusion in DH prior to conditioning (n = 9,7, Mann-Whitney U test, U = 9.5, P = 0.018). Error bars indicate s.e.m.

Supplementary Figure 4 Effect of repeated USs depends on contingencies. (a) Behavioral data (left) and model simulation (right) for conditioning with 15, and 21 CS-US pairings (n=9, 7). Adding further shocks paired with the same tone CS (21 pairings in total) did not reduce tone memory strength. (b) Behavioral data (left) and model simulation (right) for the cover stimulus effect. Signaling shocks with a second CS (in this case a flashing light), instead of giving unsignaled shocks attenuates contingency degradation (comparison between Pairings First and Cover Stimulus groups, n = 18, 12, unpaired sample t-test, t 28 = 2.42, * P = 0.022) Error bars indicate s.e.m.

Supplementary Figure 5 Location of the optical fiber tips for optogenetic experiments.

Supplementary Figure 6 Location of the electrode tips for electrophysiological experiments.

Supplementary Figure 7 Effects of contingency on amygdala LFP potentiation. (a) Example traces before, and after conditioning for a representative animal each in the Control II group (left) and Pairings First group (right). Red arrows indicate the peak depolarization. (b) Averaged peak depolarizations in the CTL II and Pairings First groups before (Habituation), and after (LTM) conditioning. There was a marginally significant interaction between time and contingency (n=10, 8, repeated measures ANOVA, F = 1,16 4.30, P=0.055), and a simple effects analysis showed significant potentiation of the LFP response in the CTL II, but not the Pairings First condition (F = 18.0, P = 0.001 and F 1,16 1,16 = 1.022, P = 0.33), further indicating that conditioning differentially effects synaptic processing depending on contingency. Error bars indicate s.e.m.

I II mixed First Supplementary Figure 8 Comparison of SLM to behavioral results for experiments described in Figure 1. Direct comparison of behavioral data (top panel) and SLM (bottom panel) for experiment 1. Error bars indicate s.e.m.

Supplementary Figure 9 The graph structures used for SLM, extended to include a second discrete variable.

Supplementary Figure 10 SLM s predictions for further conditioning phenomena. (a) Cover stimulus effect: Replacing unsignaled USs by USs signaled by a second discrete cue (e.g. a light) reverses the effects of contingency degradation. (b) Overshadowing: Conditioning to a single cue (Tone) is reduced if it is trained in compound with a second cue (Light). (c) Recovery from overshadowing: Unreinforced presentations of the overshadowing second cue (Light) restores the level of responding to the first cue. (d) Blocking: Initial conditioning to a Light reduces subsequent conditioning to the Tone when the Tone is conditioned in compound with the Light.

Supplementary Figure 11 Graphical illustration of the fits of some of the different models compared. (a) Behavioral Data. (b) Bayesian model that learned both structure and parameters (SPLM). (c) Bayesian model that learned parameters using Graph 6 (from Fig. 5a) and the best Beta priors for edge parameters. (d) Van Hamme and Wasserman s extension of the Rescorla-Wagner model.

Model MSE s.e. of MSE for hippoampus the MSE APV injection (% freezing squared) Parameters Structure Learning (SLM) 13.44 0.0054 34.03 2 Parameter Learning (PLM) 20.80 0.024 245.64 8 SLM no Background 17.28 0.057 51.31 3 PLM no Background 34.69 0.023 201.87 6 SLM Linear 21.66 0.086 192.29 2 SOCR with Background 30.89-440.66 7 HW-RW with Background 47.66-234.61 11 SOCR 32.45-292.53 6 HW-RW 67.06-848.88 8 Supplementary Table 1: Comparison of model fits. Mean squared error (MSE) for the best fit of each model, followed by the standard error of the MSE, and the error of the model in predicting the results of hippocampal interventions, with each value representing percentage freezing squared. The final column lists the number of free parameters (excluding the two scaling parameters). Model BIC Adjusted R 2 Adjusted R 2 Context Tone SLM 88.82 0.87 0.79 SLM no Background 99.47 0.74 0.73 SLM linear 102.66 0.68 0.93 SPLM 110.54 0.79 0.71 PLM 121.69 0.53 0.65 SOCR 127.85 0.63 0.78 HW-RW 155.64-0.32 0.17 Supplementary Table 2: Bayesian Information Criterion (BIC) and adjusted R-squared values for the different models. For BIC the number of data points was 29, and the number of parameters was the number of model parameters plus the two scaling parameters. R-squared values were adjusted by the number of model parameters plus 1 scaling parameter.

Training protocol 2*U 3*U 6*U 10*U 15*U 2*U +1*E 3*U+1*E n= 7 7 7 5 10 6 7 Training protocol 2*P 3*P 6*P 10*P 15*P 21*P n= 6 25 8 8 9 7 including with CS test only - 3 - - - Training protocol 3*P+3*U 3*P+6*U 3*P+9*U 3*P+12*U Intermixed 12*U+3*P n= 12 10 12 18 17 16 including with CS test only 3 2 3 - - - Supplementary Table 3: Group sizes for the different conditioning protocols. Experiment 1 Tone Context F statistic p value F statistic p value Interaction F 1,71 = 0.4 0.75 F 1,71 = 0.5 0.68 Main effect for test order F 1,71 = 0.53 0.47 F 1,71 = 0.16 0.60 All experiments with varied test order Tone Context F statistic p value F statistic p value Interaction F 1,116 = 0.35 0.93 F 1,116 = 0.76 0.62 Main effect for test order F 1,116 = 0.24 0.62 F 1,116 = 2.83 0.096 Supplementary Table 4: The order of the CS and Context tests didn t significantly affect freezing scores either for the groups in Experiment 1, or across all the different conditioning protocols where the order of the testing was varied. Testing order was varied for all protocols except when only CS-US pairings were given. For those, the context test was always given first to get the best possible measure of the overshadowing effect. Analysis using Two-way ANOVA, with the levels of the first factor being the different conditioning protocols.

Group Latency HAB Latency LTM Amplitude HAB Amplitude LTM as % from HAB (ms) (µv) (µv) Control II 14.06 ± 1.23 14.44 ± 1.41 7.57 ± 1.40 224.34 ± 37.94 Pairings First 13.83 ± 1.09 13.86 ± 1.05 9.35 ± 1.83 122.46. ± 13.25 Supplementary Table 5: There was no statistically significant effect of group or conditioning on A-LFP latencies (p values for main effects and interaction > 0.7), or between average amplitudes during habituation between groups (p > 0.4). However, the increase (as percentage baseline) in A-LFP amplitude amplitude following conditioning was significantly higher in the 100% contingency group Control II (p = 0.028).

Experiment 1-no manipulation Tone Context F statistic p value F statistic p value Interaction F 1,75 = 1.63 0.21 F 1,75 = 6.44 0.013 Main effect for contingency F 1,75 = 18.02 0.00006 F 1,75 = 3.26 0.074 Main effect for spacing of CS-US pairings F 1,75 = 1.35 0.25 F 1,75 = 8.22 0.005 Simple main effect for contingency with pairings spaced F 1,75 = 15.00 0.0002 F 1,75 = 14.37 0.0003 Simple main effect for contingency with pairings massed F 1,75 = 4.48 0.038 F 1,75 = 0.00008 0.82 Simple main effect for spacing with 100% contingency F 1,75 = 3.35 0.071 F 1,75 = 10.65 0.001 Simple main effect for spacing with 20% contingency F 1,75 = 0.006 0.94 F 1,75 = 0.0004 0.63 Experiment 2-APV infusions (Fig. 2b,c) Tone Context F statistic p value F statistic p value Interaction F 1,32 = 0.07 0.79 F 1,32 = 0.38 0.54 Main effect for contingency F 1,32 = 11.98 0.0015 F 1,32 = 0.55 0.46 Main effect for drug F 1,32 = 0.59 0.44 F 1,32 = 9.47 0.0043 Experiment 2-APV infusions (Fig. 2d,e) Tone Context t statistic p value t statistic p value t 16 = 2.14 0.048 t 14 = 2.31 0.037 Experiment 3-optogenetic inactivation Tone Context F statistic p value F statistic p value Interaction F 1,29 = 0.28 0.6 F 1,29 = 4.3903 0.045 Main effect for laser treatment F 1,29 = 11.41 0.0021 F 1,29 = 0.0297 0.86 Main effect for spacing of CS-US pairings F 1,29 = 0.092 0.76 F 1,29 = 1.79 0.19 Simple main effect for laser with pairings spaced F 1,29 = 4.4537 0.044 F 1,29 = 2.03 0.17 Simple main effect for laser with pairings massed F 1,29 = 7.0165 0.013 F 1,29 = 2.37 0.13 Experiment 4 LFP potentiation Tone t statistic p value t statistic p value t 11.1 = 2.54 0.028 t 11.0 = 3.07 0.011 Supplementary Table 6: F and p values from Two-way ANOVA comparisons for Experiments 1-3 and t statistics for Experiments 2 and 4.

Experiment 1-no manipulation Tone Context % freezing CTL I 52.22 ± 4.69 3.22 ± 0.91 CTLII 40.79 ± 4.14 24.53 ± 5.50 Intermixed 26.33 ± 4.41 29.73 ± 6.06 Pairings First 26.86 ± 5.34 26.14 ± 5.94 Experiment 2-APV infusions Tone Context % freezing Vehicle CTLII 40.09 ± 8.88 12.12 ± 5.87 Vehicle Pairings First 18.2 ± 5.55 8.04 ± 2.31 APV CTL II 47.17 ± 8.64 1.00 ± 0.422 APV Pairings First 21.66 ± 5.56 0.63 ± 0.29 Vehicle Intermixed 19.92 ± 9.57 (Defecation 5.423 ± 1.23) APV CTL I 57.33 ± 9.62 (Defecation 6.88 ± 0.73) APV Intermixed 26.87 ± 10.46 0.62 ± 0.22 (Defecation 3.89 ± 1.10 2.11 ± 0.70) Experiment 3-optogenetic inactivation Tone Context % freezing Pairings First/Laser Offset 16.13 ± 3.64 45.46 ± 9.38 Pairings Firs/Laser Overlap 40.23 ± 8.04 26.34 ± 8.05 Intermixed/LaserOffset 17.51 ± 5.06 16.51 ± 6.99 Intermixed/Laser Overlap 35.1 ± 6.10 32.73 ± 8.61 Experiment 4-Electrophysiology Tone % freezing Control II 35.45 ± 7.894 Pairings Firs 9.87 ± 2.66 Cover Stimulus 48.04 ± 7.15 Supplementary Table 7: Mean ± standard error freezing (and defecation) scores.

Structure learning model Structure & parameter learning model Parameter learning model Extended Comparator Hypothesis Extended Rescorla- Wagner model α = 0.809 a 1 = 28.620 a 1 = 5.968 s 1 = 0.865 T a 1 = 0.980 ρ = 0.646 b 1 = 0.239 b 1 = 0.766 s 2 = 0.862 T a 2 = 0.000 a 2 = 5.407 a 2 = 1.058 s 3 = 0.679 T b 1 = 0.468 b 2 = 0.173 b 2 = 0.372 k 1 = 0.05 T b 2 = 0.005 a 3 = 28.034 a 3 = 0.214 k 2 = 0.551 Ca 1 = 0.972 b 3 = 2.416 b 3 = 0.140 k 3 = 1.00 Ca 2 = 0.158 a 4 = 23.422 a 4 = 0.018 Cb 1 = 0.857 b 4 = 3.108 b 4 = 3.267 Cb 2 = 0.004 ρ = 0.782 Supplementary Table 8: Best fit parameters for Bayesian and Associative models.

Supplementary Modeling Stationarity Our Bayesian models assume that trials come from a stationary process and therefore ignore trial order effects. This is a good match for our data where we only see weak ordering effects, and is a suitable strategy for learning the structure of the environment, since the presence or absence of a predictive relationship between two variables would tend to be a stable property over time. On the other hand the exact strength of a relationship might change over time, and learning the parameters that represent this strength by assuming that the underlying state of the world is dynamic could give rise to strong trial ordering effects (52). An interplay between structure and parameter learning at different time points could yield an optimal strategy for exploring the environment, and give rise to weaker or stronger trial order effects, depending on the particulars of the learning environment. Temporal representation The most straight-forward temporal discretization of the experiments assigns one time bin to each trial, such that the time length for each temporal unit is 2 mins. This corresponds to counting each US as 1 event where the CS is either present or absent. The approximately two minutes each animal spent outside the conditioning chamber, but before being returned to the home cage was also counted as 1 event. when only the Background, but not the other stimuli were present. Since the 30s tone CS was only present for a fourth of the two minute duration of tone-shock pairing trials, the remaining fractional time intervals, as well as the one minute half trial after the last US, were added up, and the integer value of this sum counted towards the number of trials with context present and CS and US absent (i.e. added to the count of the [1 1 0 0] vectors).including these fractional counts improved all Bayesian model fits slightly, but didn t affect the final order of the fits. Since USs always arrived at the end of the CS in our experiments, we didn t consider potential effects of ambiguity arising by the timing of the US during the CS, which is a characteristic of the totally random control procedure, and has been suggested to slow conditioning to a CS (53) Alternative formulations of the Bayesian models An alternative for the Bayesian models without a Background variable can be implemented with the three variables X 2, X 3, and X 4, s.t. P (X 4 = 1 P a(x 4 )) = 1 X i P a(x 4 ) (1 ω i,4 ) x i (1)

or if X 4 has no parents, or when all its parents are absent, then P (X 4 = 1) = ω px4 (2) where ω px4 has a prior Beta distribution. This formulation gave a higher MSE then SLM and SPLM, but a better fit than the other models. Removing the Background variable from PLM turns it into a model that is similar to simple cue competition models, and substantially decreases the model s ability to fit the data (Supplementary Table 1). A further alternative is to replace the noisy-or generating function with a thresholded linear function (see SLM linear in Supplementary Table 1), such that P (X 4 = 1 P a(x 4 )) = max(1, i X i P a(x 4 ) ω i,4 X i ) (3) Extending SLM to model other behavioral phenomena We tested if SLM was compatible with previously documented conditioning phenomena involving ambiguous cue-outcome associations, by extending the model to include a further variable representing an additional environmental stimulus (such as a light). We extended the number of graph structures to include the relevant configurations (Supplementary Fig. 9), but otherwise left the model unchanged, including the hyperparameter P (G 1 ), and the scaling factors previously fitted to our behavioral data. SLM made accurate predictions, qualitatively matching previous data for the cover stimulus effect, overshadowing, recovery from overshadowing and blocking (Supplementary Fig. 10). The original formulation of SLM can also explain partial reinforcement and latent inhibition effects for CS-US pairings, as well as for context-us pairings. Associative models Representing experiments To allow maximum flexibility for the associative models, we used two temporal discretization parameters, one for the CS duration, t CS and one for when the CS was not present, t C. We restricted the relationship between these two parameters such that the discretized representation remained faithful to the original temporal structure. In particular we imposed 3 t C t CS 3 (t C + 2) These discretization parameters were not included in the parameter count in the model comparison, Models were implemented as described in references (7) and (8), and parameters fitted

by minimizing MSE with an interior-point algorithm using MATLAB s fmincon function. Associative strengths given by the model were converted to freezing scores in the same way as for the Bayesian models (multiplication by a scalar through simple linear regression, separately for the Context and the CS). Van Hamme and Wasserman s extension of the Rescorla-Wagner model required fitting 4 learning rate parameters each for the context and the tone (8 in total) that were constrained to lie in the interval [0, 1] or [ 1, 0], as specified by the model. λ was taken to be one, since the multiplicative scaling when converting to freezing scores ensured that the value of λ didn t influence the model s fit. We fit 6 parameters for the SOCR model: 3 learning rate parameters (s 1, s 2 and s 3 ), the extinction parameter k 1, and the comparator parameters k 2 and k 3. The learning rates and comparator parameters were constrained to lie in [0, 1], while k 1 was required to lie in [0.05, 1] to ensure a realistic model that can account for extinction/partial reinforcement effects. We ran the optimization process at least 10 times from different random starting points, separately for each permitted combination of the integer-valued temporal discretization parameters with C t 30. We found that minimums for each such combination were consistent across most of the the runs, with the optimization terminating at one of the few different observed values. We also extended these models by adding a Background cue, to enable factorial model comparison and a better understanding of the importance of such a variable. For HW-RW this meant adding further 4 learning rate parameters, and for SOCR a single extra learning rate parameter. Modeling neural interventions To model hippocampal inactivations, the learning rates for the context were set to zero. Amygdala inactivation during the US were modeled by excluding trials with inactivation from the trial counts (such that they counted neither towards the reinforced, nor the unreinforced trials). Alternative formulations In the extended RW model, we tried replacing the linear sum in the prediction error terms and by the or function and λ (V Context + V CS ) 0 (V Context + V CS ) λ (V context + V CS V Context V CS ) 0 (V Context + V CS V Context V CS )

to see if the linear vs or formulation were important in the differences we found between models. However, this change didn t significantly improve the fit of the model. References 52. P. Dayan, S. Kakade, P. R. Montague, Learning and selective attention. Nat. Neurosci. 3 Suppl, 1218-1223 (2000). 53. P. D. Balsam, S. Fairhurst, C. R. Gallistel, Pavlovian contingencies and temporal information. J. Exp. Psychol. Anim. Behav. Process. 32, 284-294 (2006).