Terminology

We define a resample as the result of a two-way split of a data set. For example, when bootstrapping, one part of the resample is a sample drawn with replacement from the original data. The other part of the split contains the instances that were not selected for the bootstrap sample. Cross-validation is another type of resampling.
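To make the two-way split concrete, here is a minimal base R sketch of a single bootstrap resample. It is an illustration only, not how the package builds its objects; the names in_id, out_id, boot_sample, and held_out are made up for this example.

```r
# Base R sketch of one bootstrap resample (not the package's internal code).
set.seed(1)
n <- nrow(mtcars)

# Part 1: row indices drawn with replacement, same size as the original data.
in_id <- sample.int(n, size = n, replace = TRUE)

# Part 2: rows that were never drawn into the bootstrap sample.
out_id <- setdiff(seq_len(n), in_id)

boot_sample <- mtcars[in_id, ]
held_out <- mtcars[out_id, ]

# The bootstrap sample always has n rows; the size of the held-out part varies.
c(boot = nrow(boot_sample), held_out = nrow(held_out))
```

Because rows can be drawn more than once, the bootstrap sample contains duplicates, which is why some rows end up in the second part of the split.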

rset Objects Contain Many Resamples

The main class in the package (rset) is for a set or collection of resamples. In 10-fold cross-validation, the set would consist of the 10 different resamples of the original data.
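As a sketch of the 10-fold case, assuming the package's vfold_cv function with v = 10, the returned set should have one row per resample. The object name folds is chosen here for illustration.

```r
library(rsample)
set.seed(8584)

# Ten resamples of mtcars, one per fold.
folds <- vfold_cv(mtcars, v = 10)
nrow(folds)

# Each element of the splits column is a single resample.
class(folds$splits[[1]])
```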

As in modelr, the resamples are stored in a data-frame-like tibble object. As a simple example, here is a small set of bootstraps of the mtcars data:

library(rsample)
set.seed(8584)
bt_resamples <- bootstraps(mtcars, times = 3)
bt_resamples
#> # Bootstrap sampling 
#> # A tibble: 3 x 2
#>         splits         id
#>         <list>      <chr>
#> 1 <S3: rsplit> Bootstrap1
#> 2 <S3: rsplit> Bootstrap2
#> 3 <S3: rsplit> Bootstrap3

Individual Resamples are rsplit Objects

Each resample is stored in the splits column as an object of class rsplit.

In this package we use the following terminology for the two partitions that comprise a resample:

  • The analysis data are those that we selected in the resample. For a bootstrap, this is the sample with replacement. For 10-fold cross-validation, this is 90% of the data. These data are often used to fit a model or, in traditional bootstrapping, to calculate a statistic.
  • The assessment data are usually the section of the original data not covered by the analysis set. Again, in 10-fold CV, this is the 10% held out. These data are often used to evaluate the performance of a model that was fit to the analysis data.

(Aside: While some might use the terms “training” and “testing” for these data sets, we avoid them since those labels often conflict with the data that result from an initial partition of the data, which is typically done before resampling. The training/test split can be conducted using the initial_split function in this package.)
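A minimal sketch of that workflow, assuming the initial_split, training, and testing functions from the package: the initial partition comes first, and resampling is then applied to the training portion only. The 3/4 proportion and the object names are illustrative choices.

```r
library(rsample)
set.seed(8584)

# Initial training/test partition, done once before any resampling.
car_split <- initial_split(mtcars, prop = 3/4)
car_train <- training(car_split)
car_test <- testing(car_split)

# Resamples (analysis/assessment pairs) are built from the training data only.
train_boots <- bootstraps(car_train, times = 3)

c(train = nrow(car_train), test = nrow(car_test))
```

The test set stays untouched by the resamples, which is why separate names for the analysis and assessment sets are useful.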

Let’s look at one of the rsplit objects:

first_resample <- bt_resamples$splits[[1]]
first_resample
#> <32/12/32>

This indicates that there were 32 data points in the analysis set, 12 in the assessment set, and 32 in the original data. These results can also be determined using the dim function on an rsplit object.
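For instance, a short sketch of calling dim on the same split. The result is expected to hold the analysis and assessment sizes along with the dimensions of the original data; the exact element names may differ between package versions, so none are assumed here.

```r
library(rsample)
set.seed(8584)
bt_resamples <- bootstraps(mtcars, times = 3)
first_resample <- bt_resamples$splits[[1]]

# Sizes of the two partitions and of the original data.
split_dims <- dim(first_resample)
split_dims
```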

To obtain either of these data sets from an rsplit, the as.data.frame function can be used. By default, the analysis set is returned, but the data argument can be used to return the assessment data instead:

head(as.data.frame(first_resample))
#>               mpg cyl disp  hp drat   wt qsec vs am gear carb
#> Merc 240D    24.4   4  147  62 3.69 3.19 20.0  1  0    4    2
#> Camaro Z28   13.3   8  350 245 3.73 3.84 15.4  0  0    3    4
#> Valiant      18.1   6  225 105 2.76 3.46 20.2  1  0    3    1
#> Merc 230     22.8   4  141  95 3.92 3.15 22.9  1  0    4    2
#> Merc 450SLC  15.2   8  276 180 3.07 3.78 18.0  0  0    3    3
#> Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6
as.data.frame(first_resample, data = "assessment")
#>                      mpg cyl  disp  hp drat   wt qsec vs am gear carb
#> Datsun 710          22.8   4 108.0  93 3.85 2.32 18.6  1  1    4    1
#> Duster 360          14.3   8 360.0 245 3.21 3.57 15.8  0  0    3    4
#> Merc 450SL          17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.25 18.0  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.42 17.8  0  0    3    4
#> Fiat 128            32.4   4  78.7  66 4.08 2.20 19.5  1  1    4    1
#> Toyota Corona       21.5   4 120.1  97 3.70 2.46 20.0  1  0    3    1
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.52 16.9  0  0    3    2
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.85 17.1  0  0    3    2
#> Fiat X1-9           27.3   4  79.0  66 4.08 1.94 18.9  1  1    4    1
#> Porsche 914-2       26.0   4 120.3  91 4.43 2.14 16.7  0  1    5    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.17 14.5  0  1    5    4

Alternatively, you can use the shortcuts analysis(first_resample) and assessment(first_resample).
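As a sketch of how the two partitions are typically used together, the example below fits a model to the analysis set and scores it on the assessment set. The linear model mpg ~ wt and the RMSE metric are arbitrary choices made for illustration.

```r
library(rsample)
set.seed(8584)
bt_resamples <- bootstraps(mtcars, times = 3)
first_resample <- bt_resamples$splits[[1]]

# Fit on the analysis set (the bootstrap sample).
fit <- lm(mpg ~ wt, data = analysis(first_resample))

# Evaluate on the assessment set (rows left out of the bootstrap sample).
held_out <- assessment(first_resample)
pred <- predict(fit, newdata = held_out)
rmse <- sqrt(mean((held_out$mpg - pred)^2))
rmse
```

Repeating these steps over every element of bt_resamples$splits would give one performance estimate per resample.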