Martingale posteriors in modern AI

Trial Lecture, University of Oslo

Simen Eide

2025-11-28

Agenda

Target audience: master's students with some knowledge of Bayesian statistics

  • Motivation: LLM in-context learning
  • Traditional Bayesian review
  • Predictive Bayesian introduction
  • Predictive resampling algorithm
  • Some theory
  • Applications
  • Conclusion

LLM In-Context Learning

  • Assume we ask an LLM to solve the following problem
Where should this person go on holiday based on some information of that person?
x: Simen just defended his PhD in Machine Learning and enjoys paragliding
y:

LLM In-Context Learning

  • Assume we ask an LLM to solve the following problem

  • Collect some few shot examples

Where should this person go on holiday based on some information of that person?
x: Anders is a physicist and likes to discuss philosophy
y: destination=Rome

x: Kamilla enjoys skiing and works at the local university
y: destination=Alps
x: Simen just defended his PhD in Machine Learning and enjoys paragliding
y:

LLM In-Context Learning

  • Assume we ask an LLM to solve the following problem
  • Collect some few shot examples
  • Let the LLM generate additional examples
  • Repeat and record answer

Where should this person go on holiday based on some information of that person?
x: Anders is a physicist and likes to discuss philosophy
y: destination=Rome

x: Kamilla enjoys skiing and works at the local university
y: destination=Alps
x: Maria is retired and spends her time gardening and traveling
y: destination=Madeira

x: Sven just quit his job and is now playing in a band with his friends.
y: destination=Nashville

x: Elias is in military conscription and is considering studying engineering afterward
y: destination=Berlin

x: Ingrid is a medical resident finishing her fourth year of specialty training
y: destination=The Well

x: Thomas just became a partner at a consulting firm and enjoys sailing
y: destination=Maldives
x: Simen just defended his PhD in Machine Learning and enjoys paragliding
y:

Valid? Posterior predictive?

Traditional Bayesian approach

Traditional Bayesian approach

  • We have collected \(y_{1:n}\) data points from an unknown distribution \(F_0\).
  • Assume \(F_0\) has a true density \(y_i \sim f_{\theta_0}\) given by a true parameter \(\theta_0\).
  • Goal: Find probable values of \(\theta_0\) given the data \(y_{1:n}\).
  • Prior \(\pi(\theta)\)
  • Sampling density \(f_{\theta}(y)\)
  • Likelihood \(p(y_{1:n} | \theta) = \prod_{i=1}^n f_\theta(y_i)\)

Posterior

\[ P(\theta | y_{1:n}) = \frac {\pi(\theta) p(y_{1:n} | \theta )} {\int \pi(\theta) p(y_{1:n} | \theta ) d\theta} \tag{1}\]

Posterior predictive

We can compute the distribution of a new data point \(y\) given the observed data \(y_{1:n}\) by using the posterior predictive

\[ P(y | y_{1:n}) = \int f_{\theta}(y) \pi(\theta | y_{1:n}) d\theta \tag{2}\]

Prior beliefs on Neural Network parameters?!

  • Neural networks are black box models
  • We have no intuition here(!)

Standard answer:

We don't care, we just want to use it for variability

Today: Start at a different point

  • Instead, define the predictive distribution:

\[P(y_{n+1} | y_{1:n})\]

  • More natural to define the predictive distribution instead of prior+likelihood?

Predictive Bayes Motivation

Compute statistics from an infinite population

Holmes and Walker (2023)

  • Think of Bayesian uncertainty as originating from missing data:

The assumption behind most, if not all, statistical problems is that there is an amount of data, \(x_{comp}\), which if observed, would yield the problem solved.

For example:

  • Consider i.i.d. observations from an infinite population.
  • We have collected \(y_\text{obs} := y_{1:n}\)
  • The missing data are then \(y_\text{mis} := y_{n+1:{\infty}}\)
  • If we had access to the full data \(y_\text{comp} := \{y_\text{obs}, y_\text{mis} \}\), we could compute any statistic of interest with near-zero uncertainty.

Simulate the missing data

Holmes and Walker (2023)

The Bayesian posterior can be written as

\[ \begin{aligned} \pi(\theta | y_\text{obs}) =& \int \pi(\theta, y_\text{mis} | y_\text{obs}) dy_\text{mis} \\ =& \int \pi(\theta | y_\text{comp}) P(y_\text{mis} | y_\text{obs}) dy_\text{mis} \end{aligned} \]

  • We can make \(y_\text{comp}\) arbitrarily large
  • Replace the conditional posterior with a point estimate \[\pi(\theta | y_\text{comp}) = \delta_{\hat{\theta}(y_\text{comp})}(\theta)\]
  • Integrate over the missing data (see the sketch below)
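
Putting these steps together (own notation; \(B\) is the number of simulated completions), the resulting Monte Carlo approximation is

\[ \pi(\theta | y_\text{obs}) \approx \frac{1}{B} \sum_{b=1}^{B} \delta_{\hat{\theta}(y^{(b)}_\text{comp})}(\theta), \qquad y^{(b)}_\text{mis} \sim P(y_\text{mis} | y_\text{obs}), \quad y^{(b)}_\text{comp} = \{y_\text{obs}, y^{(b)}_\text{mis}\} \]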


Just need to define \(P(y_\text{mis} | y_\text{obs})\)

Predictive resampling

Fong, Holmes, and Walker (2022)

A way to sample the missing data given a one-step-ahead predictive distribution \(P(y_{i+1} | y_{1:i})\)

\(y_\text{comp} = y_{1:\infty} \approx y_{1:N}\) for some large \(N\).

\[ P(y_{n+1:N} | y_{1:n}) = \prod_{i=n}^{N-1} P(y_{i+1} | y_{1:i}) \]

Step 1: Simulate \(y_{n+1:N}\) (approximating \(y_{n+1:\infty}\)) with \(N - n\) one-step-ahead predictions \(P(y_{i+1} | y_{1:i})\)

Step 2: Compute the quantity of interest \(\theta(y_{1:N})\) on the full dataset

Repeat B times to get a posterior distribution of the quantity of interest

import numpy as np

# p_i(history): draws one sample from the predictive P(y_{i+1} | y_{1:i})
# theta(y):     the statistic of interest computed on a complete dataset
theta_samples = np.zeros(B)
for b in range(B):
    y = np.zeros(N)
    y[:n] = y_obs                  # start each path from the observed data
    for i in range(n, N):          # fill in the "missing" points y_{n+1:N}
        y[i] = p_i(y[:i])          # one-step-ahead predictive draw
    theta_samples[b] = theta(y)    # statistic on the completed dataset

NB: Need to define a valid \(P(y_{i+1} | y_{1:i})\)

Example 1 (problem)

Adapted example from Fong, Holmes, and Walker (2022)

Assume we have the model \[ P(\theta):= \pi(\theta) = N(\theta | 0,1) \\ P(y|\theta):=f_\theta(y) = N(y | \theta, 1) \]

Conjugate prior gives closed form posterior

\[ P(\theta | y_{1:n}) = N(\theta | \bar{\theta}_n, \bar{\sigma}_n^2 ) \] where \[ \bar{\theta}_n := \frac{\sum_{i=1}^n y_i}{n+1}, \qquad \bar{\sigma}_n^2 := \frac{1}{n+1} \]

and posterior predictive \[ P(y | y_{1:n}) = N(y | \bar{\theta}_n, \bar{\sigma}_n^2 + 1)\]

Set the true parameter \(\theta =2.0\), and then collect \(n=50\) values:
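
A minimal sketch of this data-collection step (assuming NumPy; the seed is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
theta_true, n = 2.0, 50
y_obs = rng.normal(theta_true, 1.0, size=n)   # n = 50 draws from N(theta_true, 1)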

Example (run predictive resampling)

  • Let the predictive distribution be the true posterior predictive: \[ P(y_{i+1} | y_{1:i}) = N(y | \bar{\theta}_i, \bar{\sigma}_i^2 + 1)\]
  • Step 1: For \(i=n,...,N-1\):  Sample \(y_{i+1} | y_{1:i}\) from the posterior predictive
  • Step 2: Compute the point estimate of \(\theta\) given the full data \(y_{1:N}\): \[ \hat{\theta}(y_{1:N}) = \frac{\sum_{i=1}^N y_i}{N+1} \]
  • Repeat B times to get posterior samples of \(\theta\) (a code sketch follows below)
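
A self-contained sketch of these steps for the conjugate normal example (own code; N and B are arbitrary choices, and the data are regenerated as above):

import numpy as np

rng = np.random.default_rng(1)
y_obs = rng.normal(2.0, 1.0, size=50)     # observed data, as generated above
n, N, B = len(y_obs), 5_000, 1_000        # truncation point N, number of resampling paths B

theta_samples = np.zeros(B)
for b in range(B):
    y = np.zeros(N)
    y[:n] = y_obs
    s = y_obs.sum()                       # running sum of y_{1:i}
    for i in range(n, N):                 # Step 1: one-step-ahead predictive draws
        theta_bar = s / (i + 1)           # posterior mean given y_{1:i}
        sigma2_bar = 1.0 / (i + 1)        # posterior variance given y_{1:i}
        y[i] = rng.normal(theta_bar, np.sqrt(sigma2_bar + 1.0))
        s += y[i]
    theta_samples[b] = s / (N + 1)        # Step 2: theta_hat(y_{1:N})

# theta_samples approximates the posterior of theta; for this conjugate model it
# should lie close to the closed-form N(theta_bar_n, 1/(n+1)) posterior.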

Why is this useful?

Recap

  • We can obtain a posterior distribution for a statistic \(\theta(y_{1:N})\) whenever we have a valid \(P(y_{i+1} | y_{1:i})\).
  • No need to define a prior or a likelihood

Specification

  • Priors on BNN parameters lack a clear interpretation
  • Maybe it's more intuitive to define a predictive distribution?
  • \(P(y_{i+1} | y_{1:i})\) feels similar to many black box models we have today

Computation

  • Predictive resampling instead of MCMC
  • Predictive resampling can be parallelized (see the sketch below)
    • 10-100x speedups
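
One possible way to parallelize the B resampling paths (own sketch, reusing the conjugate normal example; any process or thread pool would do):

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def one_path(seed, y_obs, N):
    # One independent predictive-resampling path for the N(theta, 1) example.
    rng = np.random.default_rng(seed)
    n = len(y_obs)
    y = np.zeros(N)
    y[:n] = y_obs
    s = y_obs.sum()
    for i in range(n, N):
        y[i] = rng.normal(s / (i + 1), np.sqrt(1.0 / (i + 1) + 1.0))
        s += y[i]
    return s / (N + 1)                        # theta_hat(y_{1:N})

if __name__ == "__main__":
    y_obs = np.random.default_rng(0).normal(2.0, 1.0, size=50)
    B, N = 1_000, 5_000
    with ProcessPoolExecutor() as pool:       # the B paths are fully independent
        theta_samples = list(pool.map(one_path, range(B), [y_obs] * B, [N] * B))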

Theory on the predictive distributions

What predictive distributions does this work for?

Fortini and Petrone (2025)

  • We have a predictive \(P(y_{i+1} | y_{1:i})\) that generates a sequence \((Y_i)_{i > n}\)
  • When does \((Y_i)\) admit an underlying data distribution \(F_0\)? I.e., when does there exist an \(F_0\) such that

\[ (Y_i)_{i\ge1} \mid F_0 \overset{\text{i.i.d.}}{\sim} F_0 \; ? \]


In this section we will:

  • Introduce a requirement on the sequence \((Y_i)_{i\ge1}\) that makes it possible to define a prior
  • Then relax this requirement
  • End up with a requirement that \(P(y_{i+1} | y_{1:i})\) is a martingale.

de Finetti’s Theorem

Definition 1 \((Y_i)_{i\ge1}\) is exchangeable if

\[ (Y_{\sigma(1)}, Y_{\sigma(2)}, \ldots) \sim (Y_1, Y_2, \ldots) \]

for every finite permutation \(\sigma\) of \(\mathbb{N}\) .

Theorem 1 (de Finetti’s Theorem)  


There exists an \(F_0\) such that \((Y_i) \mid F_0 \overset{\text{i.i.d.}}{\sim} F_0\)
if and only if
\((Y_i)\) is exchangeable.


If we can construct a \(P(y_{i+1} | y_{1:i})\) such that the resulting sequence \((Y_i)\) is exchangeable, then we recover the familiar traditional Bayesian framework.

In practice, exchangeability is hard to work with directly

Relaxing exchangeability

Theorem 2 (Kallenberg 1988)
“Exchangeability = Stationarity + conditionally identically distributed”


So let us remove stationarity…



Definition 2 (Conditionally identically distributed (c.i.d.)) A sequence \((Y_n)_{n\ge 1}\) is c.i.d. if it satisfies

\[ P(Y_{n+k} | y_1, ..., y_{n}) = P(Y_{n+1} | y_1, ..., y_{n}) \] for all \(k\ge1\).

All future observations share the same conditional distribution given the past

Martingale

Definition 3 (Conditionally identically distributed (c.i.d.)) A sequence \((Y_n)_{n\ge 1}\) is c.i.d. if it satisfies

\[ P(Y_{n+k} | y_1, ..., y_{n}) = P(Y_{n+1} | y_1, ..., y_{n}) \] for all \(k\ge1\).

All future observations share the same conditional distribution given the past

This is equivalent to saying:

Definition 4 (Martingale of the predictive distribution) \[ E[P(Y_{i+2} \in A | y_{1:{i+1}}) | y_{1:i}] = P(Y_{i+1} \in A | y_{1:i}) \] for all \(A\) and all \(i\).
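
A quick sanity check (own derivation): any posterior predictive from an exchangeable Bayesian model satisfies this, by the tower property:

\[ E\big[P(Y_{i+2} \in A | Y_{1:i+1}) \,\big|\, y_{1:i}\big] = E\big[E[\mathbf{1}_A(Y_{i+2}) | Y_{1:i+1}] \,\big|\, y_{1:i}\big] = P(Y_{i+2} \in A | y_{1:i}) = P(Y_{i+1} \in A | y_{1:i}), \]

where the last equality holds because \(Y_{i+1}\) and \(Y_{i+2}\) have the same distribution given \(y_{1:i}\); this is exactly the c.i.d. property.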

Still have many desirable properties

  • The predictive distributions of \((Y_i)_{i\ge 1}\) converge to a distribution \(F_0\)
  • \((Y_i)_{i\ge 1}\) are identically distributed
  • \((Y_i)_{i\ge 1}\) are asymptotically exchangeable
  • The empirical distribution of \((Y_i)_{i\ge 1}\) converges to \(F_0\)

Applications

Example 1: Empirical predictive

  • Remember example:
    • \(Y_i \sim N(\theta, 1)\)
    • \(\theta=2\)
  • If, instead of the true posterior predictive, we use the empirical predictive of the data seen so far (initialised with the collected \(y_{1:n}\)), as sketched below:

  • It works, but it is a “worse model”
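
A sketch of one such empirical predictive (my reading of the example: resample uniformly from all points seen so far, a Bayesian-bootstrap-style predictive, with the sample mean as the statistic):

import numpy as np

rng = np.random.default_rng(2)
y_obs = rng.normal(2.0, 1.0, size=50)     # same data-generating setup as before
n, N, B = len(y_obs), 5_000, 1_000

theta_samples = np.zeros(B)
for b in range(B):
    y = np.zeros(N)
    y[:n] = y_obs
    for i in range(n, N):
        y[i] = y[rng.integers(i)]         # P(y_{i+1} | y_{1:i}) = empirical distribution of y_{1:i}
    theta_samples[b] = y.mean()           # statistic of interest: the mean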

Are LLMs martingales?

  • Check whether \(P(y_{i+1}|y_{1:i})\) is a martingale
  • Note: \(y_{i+1}\) is not the next token!
Where should this person go on holiday based on some information of that person?
x: Anders is a physicist and likes to discuss philosophy
y: destination=Rome

x: Kamilla enjoys skiing and works at the local university
y: destination=Alps
x: Maria is retired and spends her time gardening and traveling
y: destination=Madeira

x: Sven just quit his job and is now playing in a band with his friends.
y: destination=Nashville

x: Elias is in military conscription and is considering studying engineering afterward
y: destination=Berlin

x: Ingrid is a medical resident finishing her fourth year of specialty training
y: destination=The Well

x: Thomas just became a partner at a consulting firm and enjoys sailing
y: destination=Maldives
x: Simen just defended his PhD in Machine Learning and enjoys paragliding
y:

Are LLMs martingales?

Falck, Wang, and Holmes (2024)

  • Falck, Wang, and Holmes (2024) study the martingale property of LLM in-context learning empirically
  • Tests three LLMs:
    • gpt-3.5,
    • llama-2-7b
    • mistral-7b
  • Tests three datasets:
    • Bernoulli,
    • Gaussian,
    • Synthetic natural language dataset
  • Result: No.

  • The expected probability drifts as \(i\) increases (a rough sketch of such a check is given below)
  • Suggests tools or fine-tuning to make LLMs behave more like martingales
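
Roughly, such a check could be set up as below (own sketch; prob_next and sample_next are hypothetical helpers standing in for LLM calls, not the paper's code):

def martingale_gap(context, A, prob_next, sample_next, n_mc=100):
    # context:     few-shot examples encoding y_{1:i}
    # prob_next:   prob_next(context, A) -> model probability that the next y lies in A
    # sample_next: sample_next(context)  -> one y drawn from the model's predictive
    # Returns a Monte Carlo estimate of
    #   E[P(Y_{i+2} in A | y_{1:i+1}) | y_{1:i}] - P(Y_{i+1} in A | y_{1:i}),
    # which should be close to 0 if the predictive is a martingale.
    lhs = 0.0
    for _ in range(n_mc):
        y_new = sample_next(context)              # y_{i+1} ~ P(. | y_{1:i})
        lhs += prob_next(context + [y_new], A)    # P(Y_{i+2} in A | y_{1:i+1})
    return lhs / n_mc - prob_next(context, A)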

Tabular Prior Fitted Networks (TabPFN)

Hollmann et al. (2023), Nagler and Rügamer (2025), Ng et al. (2025)

  • TabPFN: pretrained transformer on tabular data (Hollmann et al. (2023))
  • Observe the same non-martingale property in TabPFN (Nagler and Rügamer (2025))


Two different approaches:


Nagler and Rügamer (2025)

Only use TabPFN for the first step \(P(y_{n+1} | x_{n+1}, z_{1:n})\), then a Dirichlet process mixture for the remaining steps.

Ng et al. (2025)

“Ignores” the theoretical requirements, checks convergence empirically, and argues it is “good enough”.

Conclusion

Conclusion

  • Introduced a new, alternative Bayesian framework
  • Specify predictive distributions instead of priors and likelihoods
  • Hints at being able to insert our favourite black-box auto-regressive model
  • Highly parallelizable
  • Limitation: it is difficult/unclear how to specify valid predictive distributions

Future (/current) work

  • Show that different models (predictive distributions) are martingales
  • Relaxation of the martingale posterior (e.g. Battiston and Cappello (2025))
  • Empirical robustness (e.g. Ng et al. (2025))

Half-baked (LLM) ideas

  • “LLMs are not row-invariant due to positional embeddings” (Ng et al. 2025)
  • Can we design new, “local” positional-embedding architectures that are row-invariant?
  • Update parametric models inside predictive resampling. LLM posteriors?


References

Battiston, Marco, and Lorenzo Cappello. 2025. “Bayesian Predictive Inference Beyond Martingales.” arXiv. https://doi.org/10.48550/arXiv.2507.21874.
Falck, Fabian, Ziyu Wang, and Chris Holmes. 2024. “Are Large Language Models Bayesian? A Martingale Perspective on in-Context Learning.”
Fong, Edwin, Chris Holmes, and Stephen G Walker. 2022. “Martingale Posterior Distributions.” Journal of the Royal Statistical Society Series B: Statistical Methodology 85 (5): 1357–91. https://doi.org/10.1093/jrsssb/qkad005.
Fortini, Sandra, and Sonia Petrone. 2025. “Exchangeability, Prediction and Predictive Modeling in Bayesian Statistics.” Statistical Science 40 (1): 40–67. https://doi.org/10.1214/24-STS965.
Hollmann, Noah, Samuel Müller, Katharina Eggensperger, and Frank Hutter. 2023. “TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second.” arXiv. https://arxiv.org/abs/2207.01848.
Holmes, Chris C., and Stephen G. Walker. 2023. “Statistical Inference with Exchangeability and Martingales.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 381 (2247): 20220143. https://doi.org/10.1098/rsta.2022.0143.
Nagler, Thomas, and David Rügamer. 2025. “Uncertainty Quantification for Prior-Data Fitted Networks Using Martingale Posteriors.” arXiv. https://doi.org/10.48550/arXiv.2505.11325.
Ng, Kenyon, Edwin Fong, David T. Frazier, Jeremias Knoblauch, and Susan Wei. 2025. “TabMGP: Martingale Posterior with TabPFN.” arXiv. https://doi.org/10.48550/arXiv.2510.25154.

BONUS

Parametric models

Holmes and Walker (2023)

  • Consider a data model \(f(y | \theta_0)\)
  • Let \(\hat{\theta}_n = \theta(y_{1:n})\) be an unbiased maximum likelihood estimate (MLE) of \(\theta_0\) based on \(y_{1:n}\)
  • Then we can sample a new point and estimate a new parameter

\[ y_{n+1} \sim f(y | \hat{\theta}_n) \\ \hat{\theta}_{n+1} = \theta(y_{1:n+1}) \]

  • Idea: if the uncertainty in the MLE decreases with more data, then we can update it iteratively

  • \[ \hat{\theta}_{m+1} = \hat{\theta}_m + \epsilon_m \frac{\partial \log f(y_{m+1}|\theta)}{\partial \theta}\Big|_{\theta=\hat{\theta}_m} \] where \(\epsilon_m\) functions as our learning-rate schedule.

  • “Gradient descent style” updates

  • Gives us the “frequentist posterior” (a sketch follows below)
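
A hypothetical sketch for the running \(N(\theta, 1)\) example, where the score at \(y\) is \(y - \theta\) and \(\epsilon_m = 1/(m+1)\) makes the update an exact running-mean MLE:

import numpy as np

rng = np.random.default_rng(3)
y_obs = rng.normal(2.0, 1.0, size=50)        # observed data from f(y | theta_0 = 2)
n, N, B = len(y_obs), 5_000, 1_000

theta_samples = np.zeros(B)
for b in range(B):
    theta = y_obs.mean()                     # MLE on the observed data
    for m in range(n, N):
        y_new = rng.normal(theta, 1.0)       # y_{m+1} ~ f(y | theta_m)
        eps = 1.0 / (m + 1)                  # learning-rate schedule
        theta += eps * (y_new - theta)       # score of N(theta, 1) at y_new is (y_new - theta)
    theta_samples[b] = theta                 # one draw from the "frequentist posterior"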