Measuring LLM Effectiveness

New York Data Science & AI Conference

Max Kuhn and Simon Couch

Posit PBC

On GitHub: topepo/2025_NYR

How do we compare LLMs, prompts, etc?

How do I add a calibration model on top of an XGBoost classifier for the mtcars data using tidymodels?


If I get results for this using different LLMs or prompts, how can I know

  • Which setup works best?

  • Do the results improve over time?

  • How stochastic are the results?

See “I was wrong about tidymodels and LLMs” by Simon (yesterday) for an interesting read.

An example endpoint

inspect and vitals

inspect is “a framework for large language model evaluations created by the UK AI Security Institute.”


vitals is an R package that treats LLM evaluations similarly to unit testing.


We’ll use example data from vitals to illustrate some of the statistical issues.

  • 25 R-related tasks/questions
  • Three LLMs: GPT 4.1, Gemini 2.5 Pro, and Claude 4 Sonnet
  • Each question is run three times for each LLM

Terminology/notation

For our data, there are three ordinal levels: incorrect, partially correct, and correct.

Some notation:

  • \(C\) outcome values (\(C = 3\) here)
  • \(p\) LLMs (\(p=3\) for our data)
  • \(q\) questions (\(q = 25\))
  • \(m\) epochs (a.k.a. replicates, \(m = 3\))


This talk focuses on ordinal outcomes. Binary or numeric outcomes have very close analogies (if not simpler models).
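In R, an ordinal outcome like this is usually stored as an ordered factor so that models treat the levels as ranked rather than as unordered categories. A minimal sketch (the level labels are assumptions based on the scoring described above):

```r
# Store the three-level score as an ordered factor so that models
# treat "incorrect" < "partially correct" < "correct" as ranked levels
score <- factor(
  c("correct", "incorrect", "partially correct"),
  levels = c("incorrect", "partially correct", "correct"),
  ordered = TRUE
)

is.ordered(score)     # TRUE
score[1] > score[2]   # TRUE: "correct" outranks "incorrect"
```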

The data

Anthropic manuscript

“Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations” was released in 2024 and does a good job of describing the spirit of how to analyze these results.


It shows equations for computing large-sample standard errors for proportions (which are assumed to be normal).


The manuscript is not statistically wrong, but it is myopic and does not represent especially good statistical methodology.
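For reference, the large-sample standard error in question is the familiar normal approximation for a proportion. A sketch with assumed numbers (60% accuracy over 25 questions × 3 epochs):

```r
# Normal-approximation standard error of a proportion: sqrt(p * (1 - p) / n)
p_hat <- 0.6                            # hypothetical observed accuracy
n     <- 75                             # 25 questions x 3 epochs
se    <- sqrt(p_hat * (1 - p_hat) / n)

# Approximate 95% interval (~0.489 to 0.711)
c(p_hat - 1.96 * se, p_hat + 1.96 * se)
```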

We need statistical inference

Unsexy, existing statistical models to the rescue

Generalized linear models have existed for more than 50 years and solve nearly all the variations of this problem.


This experimental design falls squarely into the realm of analysis of variance (ANOVA).


“Modern” statistical tools for basic or more advanced models have evolved into frameworks that can seamlessly handle different outcome types and experimental designs.

The proportional odds model

To model the score probabilities, the cumulative logit model estimates the \(C-1\) cumulative logits

\[ \log\left(\frac{Pr[Y_{i} \ge c]}{1 - Pr[Y_{i} \ge c]}\right) = (\theta_c - \alpha_{k})- (\beta_2x_{i2} + \ldots + \beta_{p}x_{ip}) \]

  • \(c = 1\ldots C-1\) outcome thresholds (incorrect | partially correct, …)
  • \(i=1\ldots n\) results
  • \(j=2\ldots p\) LLMs
  • \(k=1\ldots q\) questions
  • \(\theta_c\): differences in outcome levels
  • \(\alpha_{k}\): per-question random intercept (accounts for within-question correlation)
  • \(\beta_{j}\): contrasts between LLMs
    • \(\beta_{2}\): GPT - Gemini
    • \(\beta_{3}\): GPT - Claude
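Because the model works on the log-odds scale, a coefficient converts to an odds ratio with `exp()`. A sketch using the Claude contrast reported in the results:

```r
# A log-odds contrast maps to an odds ratio via exp()
beta_3 <- 0.548     # GPT vs. Claude contrast (log-odds scale)
exp(beta_3)         # odds ratio, ~1.73

# plogis() maps a logit to a probability
plogis(beta_3)      # ~0.63
```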

Hierarchical (mixed) model

Estimate parameters using maximum likelihood. Straightforward-ish.


Enables inference via p-values (booo) and confidence intervals (better).


# `id` is the question number; (1|id) is the per-question random intercept
llm_mod  <- ordinal::clmm(score ~ LLM + (1|id), data = res, Hess = TRUE)

null_mod <- ordinal::clmm(score ~ 1   + (1|id), data = res, Hess = TRUE)

# likelihood ratio test: does the choice of LLM matter at all?
anova(llm_mod, null_mod)

# coefficient estimates with confidence intervals
tidy(llm_mod, conf.int = TRUE)
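The intervals that `tidy()` reports are (approximately) Wald intervals: estimate ± 1.96 standard errors on the log-odds scale, exponentiated for the odds-ratio scale. A sketch with an assumed standard error chosen to be consistent with the intervals in the results:

```r
# Wald 95% interval: estimate +/- 1.96 * SE on the log-odds scale
est <- 0.548                     # Claude 4 Sonnet contrast
se  <- 0.324                     # assumed standard error for illustration
ci  <- est + c(-1.96, 1.96) * se

ci        # ~ (-0.087, 1.183)
exp(ci)   # odds-ratio scale, ~ (0.917, 3.265)
```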

Results

Parameter estimates are the difference from the GPT 4.1 results:

LLM               Parameters                                   Odds Ratios
                  Estimate   Lower 95%   Upper 95%   p-Value   Estimate   Lower 95%   Upper 95%
Gemini 2.5 Pro    0.0114     −0.623      0.646       0.976     1.01       0.536       1.91
Claude 4 Sonnet   0.548      −0.0878     1.18        0.156     1.73       0.916       3.26

IMO inference is torturous:

If we were to repeat this experiment a large number of times and construct an interval each time, 95% of those intervals would contain the true difference between GPT 4.1 and Claude 4 Sonnet; here, that interval is -0.088 to 1.183.

Difficulty Estimates a.k.a. BLUPs

Bayesian hierarchical model

Requires priors for parameters and estimation via MCMC.


Inference is rational via probability statements.


Code is not as simple, but brms has a high-level interface:

library(brms)

fit <- brm(
    score ~ LLM + (1 | id),
    data = are_eval,
    family = cumulative(link = "logit", threshold = "flexible"),
    prior = c(set_prior(prior = "student_t(1, 0, 1)", class = "Intercept")),
    # sampling options such as `chains`, `iter`, etc.
)

Bayesian Results


LLM               Parameters                              Odds Ratios
                  Mean      5% Percentile  95% Percentile  Mean    5% Percentile  95% Percentile
Gemini 2.5 Pro    0.00997   −0.628         0.645           1.09    0.534          1.91
Claude 4 Sonnet   0.552     −0.0926        1.20            1.88    0.912          3.32


Inference is incredibly simple:

There is a 92.1% probability that Claude 4 Sonnet is better than GPT 4.1.
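That probability is simply the fraction of posterior draws in which the Claude contrast is positive. A sketch with simulated draws standing in for the actual brms posterior (the mean and spread below are assumptions roughly matching the table above):

```r
# Pr(Claude better than GPT) = share of posterior draws with a positive contrast
set.seed(1)
draws <- rnorm(4000, mean = 0.552, sd = 0.39)  # stand-in for brms posterior draws

mean(draws > 0)   # roughly 0.92
```

With a real fit, the draws would come from the fitted model object rather than `rnorm()`.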

Bayesian Difficulty Estimates

Summary

It might be a good idea to consult a statistician for advice on conducting an inferential analysis of your experimental design.


Existing statistical tools perform very well for any type of evaluation outcome and can accommodate nearly every experimental design.


R has very good existing tools for executing the analyses.

Thanks for listening!