TabPFN: A Deep-Learning Solution for Tabular Data

R/Pharma

Max Kuhn

Posit PBC

2025-11-05

GitHub: topepo/2025-r-pharma

TabPFN

TabPFN is an unconventional deep learning (DL) framework for regression, classification, and density estimation. It stands for “TABular Prior-data Fitted Network”.


Two main references:

Data Priors

In Müller et al. (2022), the authors showed that Bayesian inference can be approximated with a very complex DL model.



The prior for their model is on data sets. They developed a system to generate synthetic data based on a causal graph that mimics common mechanisms that generate real-world data.


In other words, their prior (\(Pr[D]\) where \(D\) is a data set) generates different types of data sets.

Possible Elements of Data Priors

For example, elements of the prior could simulate

  • Distributional effects: skewness, correlated multivariate data, serial correlation, imbalanced frequency distributions, outliers, etc.
  • Missing data mechanisms.
  • Discretization.
  • Latent/Hidden variables: PCA, PLS, etc.
  • Functional relationships between predictors and outcomes (the “task” \(T\)).
  • Random error.

A sample from the prior generates a data set and its predictive task.
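As a toy illustration of what "sampling a data set from a prior" means (this is a simplified sketch, not the authors' actual prior; all names are mine), a random linear causal graph plus noise can generate predictors and an outcome:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_dataset(n_rows=100, n_cols=4):
    """Toy draw from a 'data prior': a random linear causal graph
    plus noise generates predictors, and a random sparse linear
    function of the predictors defines the task."""
    # Lower-triangular weights define a causal DAG over the columns:
    # each column depends only on earlier columns plus noise.
    W = np.tril(rng.normal(size=(n_cols, n_cols)), k=-1)
    X = np.zeros((n_rows, n_cols))
    for j in range(n_cols):
        X[:, j] = X @ W[j] + rng.normal(size=n_rows)
    # The task: the outcome is a noisy function of a random subset of columns.
    beta = rng.normal(size=n_cols) * rng.binomial(1, 0.5, size=n_cols)
    y = X @ beta + 0.1 * rng.normal(size=n_rows)
    return X, y

X, y = sample_dataset()
print(X.shape, y.shape)  # (100, 4) (100,)
```

Each call yields a different data set and a different predictive task, which is the \([D, T]\) pair described above.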

Deep Learning Model

Their Prior-data Fitted Network is a DL model that emulates Bayes' Rule:

\[ \underbrace{p(y|x, D)}_{\text{Approximate Posterior Distribution}} = \underbrace{\int_T p(y|x, T)\, p(T|D)\, dT}_{\text{TabPFN DL Model}} \]

where \(T\) is the task, \(D\) is the random data set variable, \(x\) is a new input, and \(y\) is its corresponding dependent variable.

TabPFN V2 uses 130 million realizations of \([D, T]\).

The resulting model requires no additional estimation on our data \((y, x)\) to produce predictions.
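Conceptually, the right-hand side of the equation is a posterior-weighted average over tasks. A tiny Monte Carlo version of that integral (with a one-parameter "task" and names of my own choosing, purely for intuition) looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed data D: y = 2 * x + noise
x_obs = rng.normal(size=30)
y_obs = 2.0 * x_obs + 0.5 * rng.normal(size=30)

# Prior over tasks T: here a "task" is just a slope value
slopes = rng.normal(0, 3, size=5000)

# Unnormalized posterior p(T | D) via the Gaussian likelihood of D
resid = y_obs[None, :] - slopes[:, None] * x_obs[None, :]
log_lik = -0.5 * np.sum(resid**2, axis=1) / 0.5**2
w = np.exp(log_lik - log_lik.max())
w /= w.sum()

# Posterior predictive mean at a new x: sum over T of E[y|x,T] p(T|D)
x_new = 1.5
y_pred = np.sum(w * slopes * x_new)
print(y_pred)  # the true mean is 2 * 1.5 = 3.0
```

TabPFN's contribution is that the network learns to perform this kind of averaging in a single forward pass, with data sets and tasks far richer than a single slope.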

Attention

A key part of their model is the attention mechanisms that capture relationships across predictors (e.g., interactions) and across the rows of the data.



Attention

The attention mechanisms function as a type of similarity search, generating weights for the network’s parameters.


The parts of the DL model that are most relevant to our training set are upweighted, resulting in a kind of local prediction.


In essence, the training set “primes the pump” so that the new instances use the most pertinent parts of the DL model.
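A minimal sketch of the similarity-search idea, assuming single-head scaled dot-product attention over the rows of a data set (numpy, names mine; TabPFN's actual architecture is far more elaborate):

```python
import numpy as np

def row_attention(queries, keys, values):
    """Scaled dot-product attention: each query row receives a
    similarity-weighted average of the value rows."""
    d = keys.shape[1]
    scores = queries @ keys.T / np.sqrt(d)          # row-to-row similarity
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over the rows
    return weights @ values, weights

rng = np.random.default_rng(2)
train_x = rng.normal(size=(8, 3))   # "context" rows (the training set)
train_y = rng.normal(size=(8, 1))
new_x = rng.normal(size=(2, 3))     # rows we want predictions for

pred, w = row_attention(new_x, train_x, train_y)
print(pred.shape)     # (2, 1)
print(w.sum(axis=1))  # each new row's weights sum to 1
```

The weights are largest for the training rows most similar to each new row, which is the "local prediction" behavior described above.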

Predictive Performance

Version 2 of the model was trained on 130 million data sets.


Its intended use is for “small” data sets (as defined from the DL perspective). The authors suggest that it performs well for training sets with up to 10,000 samples, 500 features, and 10 classes.


I’ve found that it ranks in the top five for every data set that I’ve applied it to.

GPU Required

Given the size and complexity of the DL model, the computations are demanding.


Unfortunately, without a (CUDA) GPU, the time to prediction is prohibitively slow.

  • If you are coming from the DL community, this is to be expected.

  • However, from a “tabular” background, it can seem extreme to need expensive hardware to predict a few samples.


That said, a GPU provides roughly 100-fold speedups.

Feature Selection Stability


Stability by “Training Set” Size



Training set is 4,832 samples.


The holdout is 1,371 samples.


Other Aspects of the Model

Important features that we didn’t have time to discuss:

  • (Actual) Inference
  • Ensembling
  • Variable Importance
  • Explainability via Shapley values

Software

The inventors have a Python library at github.com/PriorLabs/TabPFN and I’ve made an initial port to R (via reticulate) at github.com/topepo/TabPFN:


Python:

import tabpfn

# X_train, y_train, and X_test are assumed to be existing data splits

# Initialize a classifier
clf = tabpfn.TabPFNClassifier()
clf.fit(X_train, y_train)

# Predict probabilities
clf.predict_proba(X_test)

R:

library(TabPFN)

# Fit a classifier; `dat` is an existing data frame with a `class` column
cls_mod <- tab_pfn(class ~ ., data = dat)

# Predict probabilities for new data
predict(cls_mod, X_test)

Thanks

Thanks for the invitation to speak today!


Thanks to others at Posit that helped: Simon Couch, Daniel Falbel, and Tomasz Kalinowski.