New York Data Science & AI Conference
Posit PBC
topepo/2025_NYR
How do I add a calibration model on top of an XGBoost classifier for the `mtcars` data using tidymodels?
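As context for the question above, one plausible answer looks roughly like the following. This is only a sketch, assuming the probably package's post-hoc calibration functions and using `am` as a binary outcome; it is not the talk's reference solution.

```r
library(tidymodels)
library(probably)

# mtcars has no factor outcome, so treat transmission type as the class
cars <- mtcars |> dplyr::mutate(am = factor(am))

xgb_wflow <- workflow(am ~ ., boost_tree(mode = "classification", engine = "xgboost"))

set.seed(1)
folds <- vfold_cv(cars, v = 5)
res <- fit_resamples(xgb_wflow, folds,
                     control = control_resamples(save_pred = TRUE))

# Estimate a logistic calibration model on the out-of-sample probabilities,
# then apply it to predictions
cal <- cal_estimate_logistic(collect_predictions(res), truth = am)
calibrated <- cal_apply(collect_predictions(res), cal)
```

Different LLMs (and different prompts) produce many variations on this theme, which is exactly what the eval measures.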
If I get results for this using different LLMs or prompts, how can I know which setup works best?
Do the results improve over time?
How stochastic are the results?
See “I was wrong about tidymodels and LLMs” by Simon (yesterday) for an interesting read.
inspect is “a framework for large language model evaluations created by the UK AI Security Institute.”
vitals is an R package that treats LLM evaluations similarly to unit testing.
We’ll use example data from vitals to illustrate some of the statistical issues.
For our data, there are three ordinal levels: incorrect, partially correct, and correct.
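In R, the three levels map naturally onto an ordered factor (the scores below are made-up placeholders):

```r
# Hypothetical grader output; the ordering of the levels matters
scores <- factor(
  c("correct", "incorrect", "partially correct", "correct"),
  levels  = c("incorrect", "partially correct", "correct"),
  ordered = TRUE
)
levels(scores)
#> [1] "incorrect"         "partially correct" "correct"
```

The `ordered = TRUE` attribute is what tells downstream modeling functions to treat the outcome as ordinal rather than nominal.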
Some notation:
This talk focuses on ordinal outcomes. Binary or numeric outcomes have very close analogies (if not simpler models).
“Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations” was released in 2024 and does a good job of describing the spirit of how to analyze these results.
It shows equations for computing large-sample standard errors for proportions (which are assumed to be normal).
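That style of interval is the familiar normal-approximation CI for a proportion. With hypothetical numbers (60 correct answers out of 100 questions):

```r
# Normal-approximation 95% CI for a proportion
p_hat <- 60 / 100
se    <- sqrt(p_hat * (1 - p_hat) / 100)
c(lower = p_hat - 1.96 * se, upper = p_hat + 1.96 * se)
#> roughly 0.504 to 0.696
```

This works, but it handles one proportion at a time and ignores the structure of the experiment.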
The manuscript is not statistically wrong, but it is myopic and not especially good statistical methodology.
Generalized linear models have existed for over 50 years and solve nearly all the variations of this problem.
This experimental design falls squarely into the realm of analysis of variance (ANOVA).
“Modern” statistical tools for basic or more advanced models have evolved into frameworks that can seamlessly handle different outcome types and designs.
To model the score probabilities, the cumulative logit model estimates \(C-1\) cumulative logits of the form
\[ \log\left(\frac{Pr[Y_{i} \ge c]}{1 - Pr[Y_{i} \ge c]}\right) = (\theta_c - \alpha_{k})- (\beta_2x_{i2} + \ldots + \beta_{p}x_{ip}) \]
Estimate parameters using maximum likelihood. Straightforward-ish.
Enables inference via p-values (booo) and confidence intervals (better).
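As a sketch of the fit, `MASS::polr()` (or `ordinal::clm()`) handles this model directly. The data frame and column names below are hypothetical: one row per question-by-LLM result, with an ordered `score` and an `llm` factor whose reference level is the baseline (here, GPT 4.1).

```r
library(MASS)

# Fit the cumulative logit (proportional odds) model via maximum likelihood
fit <- polr(score ~ llm, data = eval_scores, Hess = TRUE)

summary(fit)       # coefficients on the log-odds scale
exp(coef(fit))     # odds ratios versus the baseline LLM
exp(confint(fit))  # profile-likelihood 95% intervals for the odds ratios
```

Exponentiating the coefficients and interval endpoints is what produces odds-ratio columns like those in the table below.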
Parameter estimates are the differences from the GPT 4.1 results:
| LLM | Parameter Estimate | Lower 95% | Upper 95% | p-Value | Odds Ratio | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | 0.0114 | −0.623 | 0.646 | 0.976 | 1.01 | 0.536 | 1.91 |
| Claude 4 Sonnet | 0.548 | −0.0878 | 1.18 | 0.156 | 1.73 | 0.916 | 3.26 |
IMO inference is torturous:
If we were to repeat this experiment a large number of times, the true parameter value of the difference between GPT 4.1 and Claude 4 Sonnet would fall between -0.088 and 1.183 95% of the time.
The Bayesian approach requires priors for the parameters and estimation via MCMC.
Inference is rational via probability statements.
The code is not as simple, but brms has a high-level interface.
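A sketch of the brms version of the same model, again with a hypothetical `eval_scores` data frame and a placeholder prior (the talk's actual priors may differ):

```r
library(brms)

# Bayesian cumulative logit model; estimation is via MCMC
bayes_fit <- brm(
  score ~ llm,
  data   = eval_scores,
  family = cumulative(link = "logit"),
  prior  = prior(normal(0, 2), class = "b"),
  seed   = 1
)
summary(bayes_fit)
```

The formula interface mirrors the frequentist fit; only the `family`, priors, and estimation machinery change.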
| LLM | Parameter Mean | 5% Percentile | 95% Percentile | Odds Ratio Mean | 5% Percentile | 95% Percentile |
|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | 0.00997 | −0.628 | 0.645 | 1.09 | 0.534 | 1.91 |
| Claude 4 Sonnet | 0.552 | −0.0926 | 1.20 | 1.88 | 0.912 | 3.32 |
Inference is incredibly simple:
There is a 92.1% probability that Claude 4 Sonnet is better than GPT 4.1.
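That probability is just the fraction of posterior draws in which the Claude coefficient (versus the GPT 4.1 baseline) is positive. A sketch, assuming a fitted brms cumulative-logit model `bayes_fit`; the object and coefficient names are placeholders that depend on your factor levels:

```r
# Posterior probability that Claude 4 Sonnet outscores the baseline
draws <- as_draws_df(bayes_fit)
mean(draws$b_llmClaude4Sonnet > 0)
```

No repeated-sampling contortions are required; the statement is a direct probability about the parameter.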
It might be a good idea to consult a statistician for advice on conducting an inferential analysis of your experimental design.
Existing statistical tools perform very well for any type of evaluation outcome and can accommodate nearly every experimental design.
R has very good existing tools for executing the analyses.