Generates data from K multivariate normal populations, where each population (class) has an intraclass covariance matrix.

This function generates K multivariate normal data sets, where each class is generated with a constant mean vector and an intraclass covariance matrix. The data are returned as a single matrix x along with a vector of class labels y that indicates class membership.

generate_intraclass(n, p, rho, mu, sigma2 = rep(1, K))

Arguments

n	vector of the sample sizes of each class. The length of `n` determines the number of classes `K`.
p	the number of features (variables) in the data
rho	vector of the values of the off-diagonal elements for each intraclass covariance matrix. Its length must equal the length of `n` (i.e., `K`).
mu	vector containing the constant mean for each class. Its length must equal the length of `n` (i.e., `K`).
sigma2	vector of variances for each class. Its length must equal the length of `n` (i.e., `K`). Default is 1 for each class.

Value

named list with elements:

x: matrix of observations with sum(n) rows and p columns, with the rows ordered by class
y: factor of class labels (1 to K) that indicates class membership for each observation (row) in x.
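
The layout of y can be illustrated with a short sketch. This is an assumption about how such labels might be built, not the package's actual code; it reproduces the label pattern shown in the first example below for n = 3:5.

```r
# Illustrative only: one label per observation, repeated n[k] times for class k.
n <- 3:5
y <- factor(rep(seq_along(n), times = n))

length(y)   # equals sum(n), the number of rows of x
nlevels(y)  # equals K = length(n)
y
```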

Details

For simplicity, we assume that a class mean vector is constant for each feature. That is, we assume that the mean vector of the $k$th class is $c_k * j_p$, where $j_p$ is a $p \times 1$ vector of ones and $c_k$ is a real scalar.

The intraclass covariance matrix for the $k$th class is defined as: $$\sigma_k^2 * (\rho_k * J_p + (1 - \rho_k) * I_p),$$ where $J_p$ is the $p \times p$ matrix of ones and $I_p$ is the $p \times p$ identity matrix.
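
As a minimal sketch, the matrix above can be built directly in base R. The helper name `intraclass_cov` is hypothetical and is not part of the documented API.

```r
# Hypothetical helper (not part of the documented API): builds the p x p
# intraclass covariance matrix sigma2 * (rho * J_p + (1 - rho) * I_p).
intraclass_cov <- function(p, rho, sigma2 = 1) {
  J <- matrix(1, nrow = p, ncol = p)  # p x p matrix of ones
  I <- diag(p)                        # p x p identity matrix
  sigma2 * (rho * J + (1 - rho) * I)
}

# Diagonal entries equal sigma2 = 2; off-diagonal entries equal sigma2 * rho = 1.
Sigma <- intraclass_cov(p = 3, rho = 0.5, sigma2 = 2)
Sigma
```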

By default, with $\sigma_k^2 = 1$, the diagonal elements of the intraclass covariance matrix are all 1, while the off-diagonal elements are all $\rho_k$.
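
To see this structure in simulated data, the sketch below draws one class with unit variance using a Cholesky factor and compares the sample covariance with the target matrix. It is an illustration in base R under assumed values of p, rho, and the class mean; the sampling method used by generate_intraclass itself is not stated here and may differ.

```r
set.seed(42)
p   <- 4      # number of features (assumed for illustration)
rho <- 0.6    # common off-diagonal value
c_k <- 3      # constant class mean
n_k <- 5000   # large sample so the estimates are stable

Sigma <- rho * matrix(1, p, p) + (1 - rho) * diag(p)  # sigma2 = 1
R <- chol(Sigma)                            # upper triangular, t(R) %*% R = Sigma
z <- matrix(rnorm(n_k * p), nrow = n_k, ncol = p)  # independent N(0, 1) draws
x <- z %*% R + c_k                          # rows follow N(c_k * j_p, Sigma)

round(cov(x), 2)       # close to 1 on the diagonal and rho elsewhere
round(colMeans(x), 2)  # close to c_k in every column
```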

Each value of rho must lie strictly between $1 / (1 - p)$ and 1 (both endpoints excluded) to ensure that the covariance matrix is positive definite.
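
These bounds follow from the eigenvalues of the intraclass covariance matrix, assuming $\sigma_k^2 > 0$ and $p \ge 2$. The matrix $J_p = j_p j_p^T$ has eigenvalue $p$ (with eigenvector $j_p$) and eigenvalue 0 with multiplicity $p - 1$, so the eigenvalues of $\sigma_k^2 * (\rho_k * J_p + (1 - \rho_k) * I_p)$ are $$\lambda_1 = \sigma_k^2 (1 + (p - 1) \rho_k), \qquad \lambda_2 = \dots = \lambda_p = \sigma_k^2 (1 - \rho_k).$$ All eigenvalues are positive exactly when $\rho_k < 1$ and $\rho_k > -1 / (p - 1) = 1 / (1 - p)$.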

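The number of classes K is not an argument; it is determined as the length of n. The default sigma2 = rep(1, K) relies on R's lazy evaluation of default arguments, so K is available by the time the default is evaluated.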
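
The lazy-evaluation behaviour can be demonstrated with a small stand-alone function. The name `count_classes` is hypothetical; it only mimics the style of the signature above and is not the package's implementation.

```r
# Hypothetical function, used only to illustrate lazy default arguments.
count_classes <- function(n, sigma2 = rep(1, K)) {
  K <- length(n)                # K is defined before sigma2 is first used
  list(K = K, sigma2 = sigma2)  # the default rep(1, K) is evaluated here
}

res <- count_classes(n = 3:5)
res$K       # 3
res$sigma2  # 1 1 1
```

Because default arguments are evaluated inside the function's own environment and only when first needed, the reference to K in the signature is valid even though K is not an argument.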

Examples

# Generates data from K = 3 classes.
data <- generate_intraclass(n = 3:5, p = 5, rho = seq(.1, .9, length = 3),
                            mu = c(0, 3, -2))
data$x
#>             [,1]        [,2]       [,3]       [,4]        [,5]
#>  [1,] -0.4952236 -0.09231841  0.5820481  0.4069353 -0.08789382
#>  [2,] -0.7586901  1.12339313  0.9231776 -0.6332903  0.70545237
#>  [3,]  0.4882661  1.05286150  0.2032278 -1.0212341  0.82259500
#>  [4,]  2.8157877  3.70386131  4.5143962  2.3278051  4.43550577
#>  [5,]  3.8168171  2.48614267  2.6823111  2.3750202  2.45667692
#>  [6,]  3.2560923  2.21057101  2.7116475  2.7444760  2.77794385
#>  [7,]  3.3997265  4.40037276  2.9522846  2.7093591  3.83165394
#>  [8,] -1.5656832 -1.61143639 -1.4991122 -2.1324906 -1.34975544
#>  [9,] -2.4122656 -2.25497541 -2.7450749 -3.1382796 -2.70904766
#> [10,] -1.6468736 -0.65720421 -1.4908530 -1.4283055 -1.02396009
#> [11,]  0.3898303  0.40650182  1.0265115  0.2461894  0.79612886
#> [12,] -3.3029130 -3.84452904 -4.0874635 -3.1776978 -3.69913633
data$y
#>  [1] 1 1 1 2 2 2 2 3 3 3 3 3
#> Levels: 1 2 3

# Generates data from K = 4 classes. Notice that we specify a variance for each class.
data <- generate_intraclass(n = 3:6, p = 4, rho = seq(0, .9, length = 4),
                            mu = c(0, 3, -2, 6), sigma2 = 1:4)
data$x
#>              [,1]        [,2]        [,3]        [,4]
#>  [1,]  0.04197238  1.77704370  0.34718268 -0.28709745
#>  [2,] -1.92457665 -0.80064911 -0.57561797  0.68420573
#>  [3,]  0.48345735  0.01396341  0.18024373 -0.28812325
#>  [4,]  3.52967610  3.81535991  4.02536219  5.65638898
#>  [5,]  1.58635485  4.64713473  3.69666330  1.07366165
#>  [6,]  2.02532406  1.60848209  1.24918974  0.53347893
#>  [7,]  1.78059811  1.85010263  1.98240432 -0.07351498
#>  [8,] -0.27770691  0.05623810  0.10366170  0.08295519
#>  [9,] -1.97605528 -0.58720208 -3.15580783 -3.53937707
#> [10,] -1.59893020 -1.48611451 -2.78678090 -1.92891310
#> [11,] -0.51323544 -0.45148914  0.09281614  0.15507192
#> [12,] -5.02802916 -0.47448121 -1.86787795 -2.99220228
#> [13,]  6.25672435  7.53408762  7.16918086  7.64351868
#> [14,]  5.88276961  5.76690173  7.33851662  4.63324150
#> [15,]  7.67176675  7.66790807  7.70198620  7.77753004
#> [16,]  3.92055285  4.83299294  2.54314638  3.58588276
#> [17,]  6.80054322  6.97912386  7.13297706  7.28025097
#> [18,]  5.32693327  6.28585491  6.61251656  7.11985206
data$y
#>  [1] 1 1 1 2 2 2 2 3 3 3 3 3 4 4 4 4 4 4
#> Levels: 1 2 3 4