
Trial Lecture, University of Oslo
2025-11-28
Target audience: master students with some knowledge about Bayesian statistics
Where should this person go on holiday based on some information about that person?
x: Simen just defended his PhD in Machine Learning and enjoys paragliding
y:
Assume we ask an LLM to solve the following problem
Collect some few-shot examples
Where should this person go on holiday based on some information about that person?
x: Anders is a physicist and likes to discuss philosophy
y: destination=Rome
x: Kamilla enjoys skiing and works at the local university
y: destination=Alps
x: Simen just defended his PhD in Machine Learning and enjoys paragliding
y:

Where should this person go on holiday based on some information about that person?
x: Anders is a physicist and likes to discuss philosophy
y: destination=Rome
x: Kamilla enjoys skiing and works at the local university
y: destination=Alps
x: Maria is retired and spends her time gardening and traveling
y: destination=Madeira
x: Sven just quit his job and is now playing in a band with his friends.
y: destination=Nashville
x: Elias is in military conscription and is considering studying engineering afterward
y: destination=Berlin
x: Ingrid is a medical resident finishing her fourth year of specialty training
y: destination=The Well
x: Thomas just became a partner at a consulting firm and enjoys sailing
y: destination=Maldives
x: Simen just defended his PhD in Machine Learning and enjoys paragliding
y:
Is this valid? Is it a posterior predictive?
\[ P(\theta | y_{1:n}) = \frac {\pi(\theta) p(y_{1:n} | \theta )} {\int \pi(\theta) p(y_{1:n} | \theta ) d\theta} \tag{1}\]
We can compute the distribution of a new data point \(y\) given the observed data \(y_{1:n}\) by using the posterior predictive
\[ P(y | y_{1:n}) = \int f_{\theta}(y) \pi(\theta | y_{1:n}) d\theta \tag{2}\]
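If we can draw samples from the posterior \(\pi(\theta | y_{1:n})\), the integral in (2) can be approximated by averaging the likelihood over those draws. Below is a minimal sketch in Python (my own illustration, not code from the lecture), using the conjugate Normal model that appears later with true \(\theta = 2.0\) and \(n = 50\):

```python
# Minimal sketch: Monte Carlo approximation of the posterior predictive (2),
#   P(y | y_1:n) ≈ (1/S) * sum_s f_{theta_s}(y),  theta_s ~ posterior,
# illustrated with the conjugate Normal model used later in the lecture.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y_obs = rng.normal(2.0, 1.0, size=50)            # observed data y_1:n
n = len(y_obs)

# Closed-form posterior for the N(0,1) prior and N(theta,1) likelihood
post_mean, post_var = y_obs.sum() / (n + 1), 1.0 / (n + 1)
theta_samples = rng.normal(post_mean, np.sqrt(post_var), size=5000)

def posterior_predictive_density(y, theta_samples):
    """Monte Carlo estimate of p(y | y_1:n) = E_posterior[f_theta(y)]."""
    return stats.norm.pdf(y, loc=theta_samples, scale=1.0).mean()

# Compare with the exact posterior predictive N(post_mean, post_var + 1)
y_new = 2.5
print(posterior_predictive_density(y_new, theta_samples))
print(stats.norm.pdf(y_new, loc=post_mean, scale=np.sqrt(post_var + 1.0)))
```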
We don't care about \(\theta\) itself; we only use the model to express variability through the predictive
\[P(y_{n+1} | y_{1:n})\]
Holmes and Walker (2023)
The assumption behind most, if not all, statistical problems is that there is an amount of data, \(y_\text{comp}\), which, if observed, would settle the problem.
For example:
Holmes and Walker (2023)
The Bayesian posterior can be written as
\[ \begin{aligned} \pi(\theta | y_\text{obs}) =& \int \pi(\theta, y_\text{mis} | y_\text{obs}) dy_\text{mis} \\ =& \int \pi(\theta | y_\text{comp}) P(y_\text{mis} | y_\text{obs}) dy_\text{mis} \end{aligned} \]
Just need to define \(P(y_\text{mis} | y_\text{obs})\) …
Fong, Holmes, and Walker (2022)
A way to sample the missing data given a one-step-ahead predictive distribution \(P(y_{i+1} | y_{1:i})\)
\(y_\text{comp} = y_{1:\infty} \approx y_{1:N}\) for some large \(N\).
\[ P(y_{n+1:N} | y_{1:n}) = \prod_{i=n}^{N-1} P(y_{i+1} | y_{1:i}) \]
Step 1: Simulate \(y_{n+1:N}\) using \(N-n\) one-step-ahead predictions from \(P(y_{i+1} | y_{1:i})\)
Step 2: Compute the quantity of interest \(\theta(y_{1:N})\) on the full dataset
Repeat B times to get a posterior distribution of the quantity of interest
NB: Need to define a valid \(P(y_{i+1} | y_{1:i})\)
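A minimal sketch of the recipe above (my own illustration; the helper names `predictive_resample`, `sample_next`, and `theta_fn` are hypothetical, not from Fong, Holmes, and Walker (2022)):

```python
# Sketch of predictive resampling: forward-simulate y_{n+1:N} from the
# one-step-ahead predictive, then evaluate the functional theta(y_{1:N}).
import numpy as np

def predictive_resample(y_obs, sample_next, theta_fn, N=2000, B=500, seed=0):
    """Return B approximate posterior draws of theta(y_{1:N})."""
    rng = np.random.default_rng(seed)
    draws = np.empty(B)
    for b in range(B):                        # repeat B times
        y = list(y_obs)
        while len(y) < N:                     # Step 1: simulate y_{n+1:N}
            y.append(sample_next(y, rng))     # draw from P(y_{i+1} | y_{1:i})
        draws[b] = theta_fn(np.asarray(y))    # Step 2: theta on the full dataset
    return draws
```

The Normal example that follows uses exactly this recipe, with the conjugate posterior predictive as the one-step-ahead rule and the sample mean as the functional.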
Adapted example from Fong, Holmes, and Walker (2022)
Assume we have the model \[ \begin{aligned} P(\theta) &:= \pi(\theta) = N(\theta | 0, 1) \\ P(y|\theta) &:= f_\theta(y) = N(y | \theta, 1) \end{aligned} \]
The conjugate prior gives a closed-form posterior
\[ P(\theta | y_{1:n}) = N(\theta | \bar{\theta}_n, \bar{\sigma}_n^2 ) \] where \[ \bar{\theta}_n := \frac{\sum_{i=1}^n y_i}{n+1}, \qquad \bar{\sigma}_n^2 := \frac{1}{n+1} \]
and posterior predictive \[ P(y | y_{1:n}) = N(y | \bar{\theta}_n, \bar{\sigma}_n^2 + 1)\]
Set the true parameter \(\theta =2.0\), and then collect \(n=50\) values:
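A minimal sketch of this simulation (my own code; the choices \(B = 500\) and \(N = 2000\) are illustrative, and the figure that followed on the slide is not reproduced): the one-step-ahead rule is the conjugate posterior predictive \(N(\bar{\theta}_i, \bar{\sigma}_i^2 + 1)\) and the functional is the sample mean, so the martingale posterior draws should track the closed-form posterior \(N(\bar{\theta}_n, \bar{\sigma}_n^2)\).

```python
# Sketch: martingale posterior for the mean via predictive resampling,
# compared with the closed-form conjugate posterior.
import numpy as np

rng = np.random.default_rng(1)
y_obs = rng.normal(2.0, 1.0, size=50)           # true theta = 2.0, n = 50
n = len(y_obs)

B, N = 500, 2000
martingale_draws = np.empty(B)
for b in range(B):
    s, i = y_obs.sum(), n                       # running sum and length of y_{1:i}
    while i < N:
        # one-step-ahead draw from the posterior predictive N(s/(i+1), 1/(i+1) + 1)
        y_next = rng.normal(s / (i + 1), np.sqrt(1.0 / (i + 1) + 1.0))
        s, i = s + y_next, i + 1
    martingale_draws[b] = s / N                 # theta(y_{1:N}) = sample mean

print("martingale posterior:", martingale_draws.mean(), martingale_draws.std())
print("conjugate posterior: ", y_obs.sum() / (n + 1), np.sqrt(1.0 / (n + 1)))
```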
Fortini and Petrone (2025)
\[ (Y_n)_{n\ge1} | F_0 \overset{\text{iid}}{\sim} F_0 \] exists.
In this section we will:
Definition 1 \((Y_i)_{i\ge1}\) is exchangeable if
\[ (Y_{\sigma(1)}, Y_{\sigma(2)}, \ldots) \sim (Y_1, Y_2, \ldots) \]
for every finite permutation \(\sigma\) of \(\mathbb{N}\).
Theorem 1 (de Finetti’s Theorem)
There exists an \(F_0\) such that \((Y_i) | F_0 \overset{\text{iid}}{\sim} F_0\)
if and only if
\((Y_i)\) is exchangeable.
If we can construct a \(P(y_{i+1} | y_{1:i})\) such that the resulting sequence \((Y_i)_{i\ge n}\) is exchangeable, then we recover the familiar traditional Bayesian framework
In practice, exchangeability is hard to work with directly
Theorem 2 (Kallenberg 1988)
“Exchangeability = Stationarity + conditionally identically distributed”
So let us remove stationarity…
Definition 2 (Conditionally identically distributed (c.i.d.)) A sequence \((Y_n)_{n\ge 1}\) is c.i.d. if it satisfies
\[ P(Y_{n+k} | y_1, \ldots, y_{n}) = P(Y_{n+1} | y_1, \ldots, y_{n}) \] for all \(k\ge1\).
All future observations share the same conditional distribution given the past
This is equivalent to saying:
Definition 3 (Martingale of the predictive distribution) \[ E[P(Y_{i+2} \in A | y_{1:i+1}) | y_{1:i}] = P(Y_{i+1} \in A | y_{1:i}) \] for all \(A\) and all \(i\).
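As a quick numerical sanity check (my own sketch, not part of the cited material): the conjugate Normal posterior predictive from the earlier example satisfies this martingale condition, which we can verify by Monte Carlo for an arbitrary set, here \(A = [1, 3]\).

```python
# Sketch: Monte Carlo check of the martingale condition
#   E[ P(Y_{i+2} in A | y_{1:i+1}) | y_{1:i} ] = P(Y_{i+1} in A | y_{1:i})
# for the conjugate Normal posterior predictive N(sum(y)/(i+1), 1/(i+1) + 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.normal(2.0, 1.0, size=20)               # some observed history y_{1:i}
i, s = len(y), y.sum()
A = (1.0, 3.0)                                  # the set A = [1, 3]

def predictive_prob(s, i, A):
    """P(Y_{i+1} in A | y_{1:i}) under N(s/(i+1), 1/(i+1) + 1), s = sum(y_{1:i})."""
    loc, scale = s / (i + 1), np.sqrt(1.0 / (i + 1) + 1.0)
    return stats.norm.cdf(A[1], loc, scale) - stats.norm.cdf(A[0], loc, scale)

lhs_draws = []
for _ in range(20000):
    # draw Y_{i+1} | y_{1:i}, then evaluate P(Y_{i+2} in A | y_{1:i+1})
    y_next = rng.normal(s / (i + 1), np.sqrt(1.0 / (i + 1) + 1.0))
    lhs_draws.append(predictive_prob(s + y_next, i + 1, A))

print("E[P(Y_{i+2} in A | y_{1:i+1}) | y_{1:i}] ≈", np.mean(lhs_draws))
print("P(Y_{i+1} in A | y_{1:i})               =", predictive_prob(s, i, A))
```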



Where should this person go on holiday based on some information about that person?
x: Anders is a physicist and likes to discuss philosophy
y: destination=Rome
x: Kamilla enjoys skiing and works at the local university
y: destination=Alps
x: Maria is retired and spends her time gardening and traveling
y: destination=Madeira
x: Sven just quit his job and is now playing in a band with his friends.
y: destination=Nashville
x: Elias is in military conscription and is considering studying engineering afterward
y: destination=Berlin
x: Ingrid is a medical resident finishing her fourth year of specialty training
y: destination=The Well
x: Thomas just became a partner at a consulting firm and enjoys sailing
y: destination=Maldives
x: Simen just defended his PhD in Machine Learning and enjoys paragliding
y:
Falck, Wang, and Holmes (2024)
Models considered: gpt-3.5, llama-2-7b, mistral-7b
Holmes and Walker (2023)
\[ \begin{aligned} y_{n+1} &\sim f(y | \hat{\theta}_n) \\ \hat{\theta}_{n+1} &= \theta(y_{1:n+1}) \end{aligned} \]
Idea: if the uncertainty in the MLE decreases as more data arrive, we can update the estimate iteratively
\[ \theta_{m+1} = \theta_m + \epsilon_m \left. \frac{\partial \log f(y_{m+1}|\theta)}{\partial \theta} \right|_{\theta=\theta_m} \] where \(\epsilon_m\) functions as our learning-rate schedule.
“Gradient descent style” updates
Gives us the “frequentist posterior”
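A minimal sketch of this recursion for the unit-variance Normal model (my own illustration, not code from Holmes and Walker (2023)): the score is \(\partial_\theta \log f(y|\theta) = y - \theta\), and the choice \(\epsilon_m = 1/(m+1)\) turns the update into the recursive sample mean, i.e. the running MLE.

```python
# Sketch of the "gradient descent style" predictive resampling for the
# unit-variance Normal model: score = y - theta, and eps_m = 1/(m+1) makes
# the update the recursive sample mean (the MLE).
import numpy as np

rng = np.random.default_rng(3)
y_obs = rng.normal(2.0, 1.0, size=50)
n = len(y_obs)

B, N = 500, 2000
freq_draws = np.empty(B)
for b in range(B):
    theta, m = y_obs.mean(), n                  # start from the MLE on y_{1:n}
    while m < N:
        y_next = rng.normal(theta, 1.0)         # y_{m+1} ~ f(y | theta_m)
        eps = 1.0 / (m + 1)                     # learning-rate schedule
        theta += eps * (y_next - theta)         # gradient of log f(y_{m+1} | theta)
        m += 1
    freq_draws[b] = theta                       # one "frequentist posterior" draw

print('"frequentist posterior":', freq_draws.mean(), freq_draws.std())
print("sampling sd of the MLE: ", 1.0 / np.sqrt(n))
```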