Helpers for specifying nodes in simulations

Mix two variables together. The output will have the specified R-squared with signal and the specified variance (var, which defaults to one).

Evaluate an expression separately for each case

Usage

categorical(n = 5, ..., exact = TRUE)

cat2value(variable, ...)

bernoulli(n = 0, logodds = NULL, prob = 0.5, labels = NULL)

mix_with(signal, noise = NULL, R2 = 0.5, var = 1, exact = FALSE)

each(ex)

block_by(block_var, levels = c("treatment", "control"), show_block = FALSE)

random_levels(n, k = NULL, replace = FALSE)

Arguments

n: The symbol standing for the number of rows in the data frame to be generated by datasim_run(). Just use n as a symbol; don't assign it a value. (That will be done by datasim_run().)
exact: if TRUE, make R-squared or the target variance exactly as specified.
variable: a categorical variable
logodds: Numerical vector used to generate Bernoulli trials. Values can be any real numbers.
prob: An alternative to logodds. Values must be in [0,1].
labels: Character vector: names for categorical levels, also used to replace 0 and 1 in bernoulli()
signal: The part of the mixture that will be correlated with the output.
noise: The rest of the mixture. This will be uncorrelated with the output only if you specify it as pure noise.
R2: The target R-squared.
var: The target variance.
ex: an expression potentially involving other variables.
block_var: Which variable to use for blocking
levels: Character vector giving names to the blocking levels
show_block: Logical. If TRUE, put the block number in the output.
k: Number of distinct levels
replace: if TRUE, use resampling on the set of k levels
...: assignments of values to the names in variable
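
The relationship between the logodds and prob arguments can be illustrated in base R. This is only a sketch of the standard log-odds transformation, not the code used by bernoulli(); the variable names and the mapping of labels (first label for 0, second for 1) are assumptions made for illustration.

```r
# Base-R sketch (not the package's code): Bernoulli draws from log odds.
set.seed(101)
logodds <- c(-2, 0, 2)           # any real numbers are allowed
prob <- plogis(logodds)          # inverse logit: 1 / (1 + exp(-logodds))
draws <- rbinom(length(prob), size = 1, prob = prob)  # one 0/1 trial per case
labels <- c("no", "yes")         # hypothetical labels replacing 0 and 1
labelled <- labels[draws + 1]    # assumed mapping: 0 -> "no", 1 -> "yes"
```

A log odds of 0 corresponds to a probability of 0.5, which is why the two arguments are interchangeable ways of specifying the same trials.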

Value

A numerical or categorical vector which will be assembled into a data frame by datasim_run()

Details

datasim_make() constructs a simulation which can then be run with datasim_run(). Each argument to datasim_make() specifies one node of the simulation using an assignment-like syntax such as y <- 3*x + 2 + rnorm(n). The datasim helpers documented here are for use on the right-hand side of a node specification. They simplify potentially complex operations such as blocking, creating random categorical variables, and translating categorical values to numerical ones.
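
To give a sense of why blocking counts as a potentially complex operation, here is a base-R sketch of one common form of blocked random assignment: cases are sorted by the blocking variable, grouped into consecutive blocks, and the levels are shuffled within each block. This is an assumption about what blocking involves, not the implementation of block_by(); w, level_names and the other names are hypothetical.

```r
# Base-R sketch of blocked random assignment (illustrative assumption,
# not the package's implementation).
set.seed(102)
w <- rnorm(8)                                  # hypothetical blocking variable
level_names <- c("treatment", "control")
k <- length(level_names)

ord <- order(w)                                # sort cases by the blocking variable
block <- integer(length(w))
block[ord] <- ceiling(seq_along(w) / k)        # consecutive sorted cases share a block

assignment <- character(length(w))
for (b in unique(block)) {
  idx <- which(block == b)
  # shuffle the levels within the block; truncate if the last block is short
  assignment[idx] <- sample(level_names)[seq_along(idx)]
}

table(block, assignment)                       # each block holds one case per level
```

Because every block contains cases with similar values of w, the treatment and control groups end up balanced on w by construction.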

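The target R-squared and variance are achieved exactly only if exact = TRUE. Otherwise they hold only approximately in any finite sample, with the approximation improving as the sample size goes to infinity.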
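
The difference between the approximate and exact cases can be sketched in base R. The code below is one way to hit a target R-squared and variance exactly in a sample (standardize the signal, replace the noise with its standardized residual on the signal, then weight by sqrt(R2) and sqrt(1 - R2)). It is an illustration under those assumptions and may differ from what mix_with() does internally; target_var and the other names are hypothetical.

```r
# Base-R sketch (not the package's code): approximate vs. exact mixing.
set.seed(103)
n <- 50
signal <- rnorm(n)
noise  <- rnorm(n)
R2 <- 0.75
target_var <- 1

# Approximate: mix the raw inputs. The sample R-squared and variance
# match the targets only in the large-n limit.
approx <- sqrt(R2) * signal + sqrt(1 - R2) * noise

# Exact: standardize the signal and make the noise exactly uncorrelated
# with it in this sample before mixing.
s <- as.numeric(scale(signal))
e <- as.numeric(scale(resid(lm(noise ~ signal))))
exact <- sqrt(target_var) * (sqrt(R2) * s + sqrt(1 - R2) * e)

c(approx = cor(approx, signal)^2, exact = cor(exact, signal)^2)
c(approx = var(approx), exact = var(exact))
```

In the exact version the sample covariance between s and e is zero, so the variance of the mixture is R2 + (1 - R2) times target_var and the squared correlation with signal equals R2 up to floating-point error.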

Examples

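Demo <- datasim_make(
  g <- categorical(n, a=2, b=1, c=0.5),                # categorical node with levels a, b, c
  x <- cat2value(g, a=-1.7, b=0.1, c=1.2),             # translate each level of g to a number
  y <- bernoulli(logodds = x, labels=c("no", "yes")),  # Bernoulli trials with log odds x
  z <- random_levels(n, k=4),                          # random assignment among k = 4 levels
  w <- mix_with(x, noise=rnorm(n), R2=0.75, var=1),    # mix x with noise: target R2 and variance
  treatment <- block_by(w),                            # block on w using the default level names
  dice <- each(rnorm(1, sd = abs(w)))                  # evaluate rnorm() separately for each case
)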
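
The dice node shows why per-case evaluation matters: rnorm(1, sd = abs(w)) asks for a single draw, so evaluated once it returns one value no matter how long w is. The base-R sketch below illustrates the idea of evaluating the expression once per case. It is not how each() is implemented, and the vector w here is a hypothetical stand-in for the simulated node.

```r
# Base-R sketch (not the package's code): one evaluation vs. one per case.
set.seed(104)
w <- c(0.1, 1, 10)               # hypothetical values of the w node

# Evaluated once: a single draw comes back, regardless of length(w)
single <- rnorm(1, sd = abs(w))

# Evaluated separately for each case: one draw per element of w,
# each with its own standard deviation
per_case <- vapply(w, function(w_i) rnorm(1, sd = abs(w_i)), numeric(1))

length(single)                   # 1
length(per_case)                 # 3
```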