R/Pharma
Posit PBC
2025-11-05
topepo/2025-r-pharma

TabPFN is an unconventional deep learning (DL) framework for regression, classification, and density estimation. It stands for “TABular Prior-data Fitted Network”.
Two main references: Müller et al. (2022) and Hollmann et al. (2025).
In Müller et al. (2022), the authors showed that Bayesian inference can be approximated with a very complex DL model.
The prior for their model is on data sets. They developed a system to generate synthetic data based on a causal graph that mimics common mechanisms that generate real-world data.
In other words, their prior (\(Pr[D]\) where \(D\) is a data set) generates different types of data sets.
For example, elements of the prior could simulate different parts of the data-generating process, such as the distributions of the predictors or how noise enters the system.
A sample from the prior generates a data set and its predictive task.
Their Prior-data Fitted Network is a DL model that emulates Bayes’ Rule:
\[ \underbrace{p(y|x, D)}_{\text{Approximate Posterior Distribution}} = \underbrace{\int_T p(y|x, T) \, p(T|D) \, dT}_{\text{TabPFN DL Model}} \]
where \(T\) is the task, \(D\) is the random data set variable, \(x\) is a new input, and \(y\) is its corresponding dependent variable.
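For completeness, Müller et al. (2022) fit the network itself (with weights \(\theta\)) once, by minimizing the expected negative log-likelihood over draws from the prior:

\[ \mathcal{L}(\theta) = \mathbb{E}_{(D, x, y) \sim Pr[D]}\left[ -\log q_\theta(y | x, D) \right] \]

so that the network’s output distribution \(q_\theta(y|x, D)\) approximates the posterior predictive distribution above. This prior-data fitting step is what gives the method its name.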
TabPFN V2 uses 130 million realizations of \([D, T]\).
The resulting model requires no additional estimation on our data \((y, x)\) to produce predictions.
A key part of their model is the use of attention mechanisms to model relationships across predictors (e.g., interactions) and across rows of data.
Image from Hollmann et al. (2025)
The attention mechanisms function as a type of similarity search, generating weights within the network.
The parts of the DL model that are most relevant to our training set are upweighted, resulting in a kind of local prediction.
In essence, the training set “primes the pump” so that the new instances use the most pertinent parts of the DL model.
Version 2 of the model was trained on 130 million data sets.
Its intended use is for “small” data sets (as defined from the DL perspective). The authors suggest that it performs well for training sets with up to 10,000 samples, 500 features, and 10 classes.
I’ve found that it ranks in the top five models for every data set that I’ve applied it to.
Given the size/complexity of the DL model, the computations are very heavy.
Unfortunately, without a (CUDA) GPU, prediction is exorbitantly slow.
If you are coming from the DL community, this is to be expected.
However, from a “tabular” background, it can seem extreme to need expensive hardware to predict a few samples.
That said, a GPU provides roughly 100-fold speedups.
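As an aside (not from the talk), the Python library lets you choose the compute device when the model object is created. A minimal sketch via reticulate, assuming the Python `tabpfn` and `torch` packages are installed:

```r
library(reticulate)

torch  <- import("torch")
tabpfn <- import("tabpfn")

# Use the GPU when CUDA is available; otherwise fall back to the (much slower) CPU.
dev <- if (torch$cuda$is_available()) "cuda" else "cpu"

cls <- tabpfn$TabPFNClassifier(device = dev)
```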
The training set has 4,832 samples.
The holdout set has 1,371 samples.
Important features that we didn’t have time to discuss:
The inventors have a Python library at github.com/PriorLabs/TabPFN, and I’ve made an initial port to R (via reticulate) at github.com/topepo/TabPFN.
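A minimal usage sketch (the R port’s interface may differ; this calls the Python library directly via reticulate, whose estimators follow the scikit-learn fit/predict convention):

```r
library(reticulate)

tabpfn <- import("tabpfn")

# A small, illustrative classification task using a random split of iris.
set.seed(1)
in_train <- sample(nrow(iris), 100)
train <- iris[ in_train, ]
test  <- iris[-in_train, ]

cls <- tabpfn$TabPFNClassifier()

# fit() stores the training set; no per-data-set gradient updates occur.
cls$fit(as.matrix(train[, 1:4]), as.character(train$Species))

# Prediction conditions the pretrained network on the stored training set.
probs <- cls$predict_proba(as.matrix(test[, 1:4]))
preds <- cls$predict(as.matrix(test[, 1:4]))
```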
Thanks for the invitation to speak today!
Thanks to others at Posit who helped: Simon Couch, Daniel Falbel, and Tomasz Kalinowski.