This function generates K multivariate normal data sets, one per class, where each class has a constant mean vector and an intraclass covariance matrix. The data are returned as a single matrix x along with a vector of class labels y that indicates class membership.

generate_intraclass(n, p, rho, mu, sigma2 = rep(1, K))

Arguments

n

vector of the sample sizes of each class. The length of n determines the number of classes K.

p

the number of features (variables) in the data

rho

vector of the off-diagonal values for each intraclass covariance matrix, one value per class. Its length must equal the length of n.

mu

vector containing the constant mean for each class. Its length must equal the length of n (i.e., K).

sigma2

vector of variances, one per class. Its length must equal the length of n. Default is 1 for each class.

Value

named list with elements:

  • x: matrix of observations with sum(n) rows and p columns

  • y: vector of class labels that indicates class membership for each observation (row) in x.

Details

For simplicity, we assume that a class mean vector is constant for each feature. That is, we assume that the mean vector of the \(k\)th class is \(c_k * j_p\), where \(j_p\) is a \(p \times 1\) vector of ones and \(c_k\) is a real scalar.

The intraclass covariance matrix for the \(k\)th class is defined as: $$\sigma_k^2 * (\rho_k * J_p + (1 - \rho_k) * I_p),$$ where \(J_p\) is the \(p \times p\) matrix of ones and \(I_p\) is the \(p \times p\) identity matrix.
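The mean vector and covariance matrix defined above can be sketched in base R. The helper names `intraclass_cov` and `sample_class` are hypothetical and are not part of the package. The sketch draws one class through a Cholesky factor, which is an assumption made for illustration and may differ from how generate_intraclass samples internally.

```r
# Hypothetical helper: sigma2 * (rho * J_p + (1 - rho) * I_p)
intraclass_cov <- function(p, rho, sigma2 = 1) {
  sigma2 * (rho * matrix(1, p, p) + (1 - rho) * diag(p))
}

# Hypothetical helper: draw n observations from N(mean_const * j_p, Sigma)
sample_class <- function(n, p, rho, mean_const, sigma2 = 1) {
  Sigma <- intraclass_cov(p, rho, sigma2)
  R <- chol(Sigma)                     # upper triangular, t(R) %*% R equals Sigma
  Z <- matrix(rnorm(n * p), nrow = n)  # rows are independent N(0, I_p) draws
  Z %*% R + mean_const                 # shift every feature by the same constant
}

set.seed(1)
Sigma <- intraclass_cov(p = 4, rho = 0.5, sigma2 = 2)
x <- sample_class(n = 5, p = 4, rho = 0.5, mean_const = 3, sigma2 = 2)
dim(x)
```

Here every diagonal entry of `Sigma` is `sigma2` and every off-diagonal entry is `sigma2 * rho`, matching the formula term by term.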

By default, with \(\sigma_k^2 = 1\), the diagonal elements of the \(k\)th intraclass covariance matrix are all 1, while the off-diagonal elements are all \(\rho_k\).

Each value of rho must lie strictly between \(1 / (1 - p)\) and 1 to ensure that the corresponding covariance matrix is positive definite.
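These bounds follow from the eigenvalues of the intraclass covariance matrix: \(\sigma_k^2 (1 + (p - 1) \rho_k)\) occurs once and \(\sigma_k^2 (1 - \rho_k)\) occurs \(p - 1\) times, and both are positive only inside the interval. A small base R check is sketched below; the helper name `intraclass_eigen` is hypothetical and not part of the package.

```r
# Hypothetical helper: eigenvalues of sigma2 * (rho * J_p + (1 - rho) * I_p)
intraclass_eigen <- function(p, rho, sigma2 = 1) {
  Sigma <- sigma2 * (rho * matrix(1, p, p) + (1 - rho) * diag(p))
  eigen(Sigma, symmetric = TRUE, only.values = TRUE)$values
}

p <- 5
lower <- 1 / (1 - p)  # lower bound for rho when p = 5

ev_inside  <- intraclass_eigen(p, rho = -0.2)  # rho above the lower bound
ev_outside <- intraclass_eigen(p, rho = -0.3)  # rho below the lower bound

min(ev_inside) > 0   # positive definite: every eigenvalue is positive
min(ev_outside) > 0  # not positive definite: smallest eigenvalue is negative
```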

The number of classes K is determined with lazy evaluation as the length of n.
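The default `sigma2 = rep(1, K)` in the usage works because R evaluates default arguments lazily, inside the function, at first use. The sketch below illustrates the mechanism with a hypothetical function `default_variances`; it mimics the signature only and is not the package's implementation.

```r
# Hypothetical function: shows how a default can refer to K before K exists
default_variances <- function(n, sigma2 = rep(1, K)) {
  K <- length(n)  # K is created in the function body
  sigma2          # the default rep(1, K) is evaluated only here, after K exists
}

default_var <- default_variances(n = 3:5)                # default: one variance per class
custom_var  <- default_variances(n = 3:6, sigma2 = 1:4)  # user-supplied variances
```

If `sigma2` were forced before `K` is assigned, the default would fail because `K` would not yet be defined in the function's environment.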

Examples

# Generates data from K = 3 classes.
data <- generate_intraclass(n = 3:5, p = 5, rho = seq(.1, .9, length = 3),
                            mu = c(0, 3, -2))
data$x
#>             [,1]        [,2]       [,3]       [,4]        [,5]
#>  [1,] -0.4952236 -0.09231841  0.5820481  0.4069353 -0.08789382
#>  [2,] -0.7586901  1.12339313  0.9231776 -0.6332903  0.70545237
#>  [3,]  0.4882661  1.05286150  0.2032278 -1.0212341  0.82259500
#>  [4,]  2.8157877  3.70386131  4.5143962  2.3278051  4.43550577
#>  [5,]  3.8168171  2.48614267  2.6823111  2.3750202  2.45667692
#>  [6,]  3.2560923  2.21057101  2.7116475  2.7444760  2.77794385
#>  [7,]  3.3997265  4.40037276  2.9522846  2.7093591  3.83165394
#>  [8,] -1.5656832 -1.61143639 -1.4991122 -2.1324906 -1.34975544
#>  [9,] -2.4122656 -2.25497541 -2.7450749 -3.1382796 -2.70904766
#> [10,] -1.6468736 -0.65720421 -1.4908530 -1.4283055 -1.02396009
#> [11,]  0.3898303  0.40650182  1.0265115  0.2461894  0.79612886
#> [12,] -3.3029130 -3.84452904 -4.0874635 -3.1776978 -3.69913633
data$y
#>  [1] 1 1 1 2 2 2 2 3 3 3 3 3
#> Levels: 1 2 3
# Generates data from K = 4 classes.
# Notice that we specify a variance for each class.
data <- generate_intraclass(n = 3:6, p = 4, rho = seq(0, .9, length = 4),
                            mu = c(0, 3, -2, 6), sigma2 = 1:4)
data$x
#>              [,1]        [,2]        [,3]        [,4]
#>  [1,]  0.04197238  1.77704370  0.34718268 -0.28709745
#>  [2,] -1.92457665 -0.80064911 -0.57561797  0.68420573
#>  [3,]  0.48345735  0.01396341  0.18024373 -0.28812325
#>  [4,]  3.52967610  3.81535991  4.02536219  5.65638898
#>  [5,]  1.58635485  4.64713473  3.69666330  1.07366165
#>  [6,]  2.02532406  1.60848209  1.24918974  0.53347893
#>  [7,]  1.78059811  1.85010263  1.98240432 -0.07351498
#>  [8,] -0.27770691  0.05623810  0.10366170  0.08295519
#>  [9,] -1.97605528 -0.58720208 -3.15580783 -3.53937707
#> [10,] -1.59893020 -1.48611451 -2.78678090 -1.92891310
#> [11,] -0.51323544 -0.45148914  0.09281614  0.15507192
#> [12,] -5.02802916 -0.47448121 -1.86787795 -2.99220228
#> [13,]  6.25672435  7.53408762  7.16918086  7.64351868
#> [14,]  5.88276961  5.76690173  7.33851662  4.63324150
#> [15,]  7.67176675  7.66790807  7.70198620  7.77753004
#> [16,]  3.92055285  4.83299294  2.54314638  3.58588276
#> [17,]  6.80054322  6.97912386  7.13297706  7.28025097
#> [18,]  5.32693327  6.28585491  6.61251656  7.11985206
data$y
#>  [1] 1 1 1 2 2 2 2 3 3 3 3 3 4 4 4 4 4 4
#> Levels: 1 2 3 4