Yang Wu
November 4, 2020

The following packages will be used in this post:

library(paramtest)
library(lattice)
library(knitr)
library(kableExtra)
library(plyr)
library(dplyr)
library(ggplot2)
library(MASS)
library(plotly)
The central limit theorem, crudely speaking, states that if we draw sufficiently large random samples from a distribution with expected value (mean) $\mu$ and finite standard deviation $\sigma$, the distribution of sample means will be approximately normal, regardless of the shape of the parent distribution. Two characteristics follow:

- The mean of the distribution of sample means equals the population mean: $\mu_{\bar{x}} = \mu$.
- The standard deviation of the distribution of sample means, also known as the standard error of the mean, equals the population standard deviation divided by the square root of the sample size: $\sigma_{\bar{x}} = \sigma / \sqrt{n}$.
Here, we examine a Poisson distribution. Suppose 220 geese arrive at a site per day on average, so the expected number of arrivals per hour is $\lambda = 220/24 \approx 9.17$. We draw 10,000 random samples at each of five sample sizes (5, 15, 30, 100, and 200) and record the mean of each sample:
# Lambda is the number of geese that arrive per hour
lambda <- 220 / 24
# Number of random samples to be drawn
numsim <- 10000
# Initialize variables
mean5 <- rep(0, numsim)
mean15 <- rep(0, numsim)
mean30 <- rep(0, numsim)
mean100 <- rep(0, numsim)
mean200 <- rep(0, numsim)
# Loop for simulating
for (i in 1:numsim) {
  # sample means
  mean5[i] <- mean(rpois(5, lambda))
  mean15[i] <- mean(rpois(15, lambda))
  mean30[i] <- mean(rpois(30, lambda))
  mean100[i] <- mean(rpois(100, lambda))
  mean200[i] <- mean(rpois(200, lambda))
}
# Create five data frames for these sampling distributions with varying sample sizes
# Create a variable "sample_size" used as an identifier
n_5 <- data.frame(sample_size = as.factor(rep(5, 10000)), sample_means = mean5)
n_15 <- data.frame(sample_size = as.factor(rep(15, 10000)), sample_means = mean15)
n_30 <- data.frame(sample_size = as.factor(rep(30, 10000)), sample_means = mean30)
n_100 <- data.frame(sample_size = as.factor(rep(100, 10000)), sample_means = mean100)
n_200 <- data.frame(sample_size = as.factor(rep(200, 10000)), sample_means = mean200)
# Combine into a single data frame
# The function rbind() combines data frames by rows
sampling_distributions <- rbind(n_5, n_15, n_30, n_100, n_200)
The first characteristic of the CLT states that the mean of each sampling distribution of sample means should equal the population mean, $\lambda \approx 9.17$, regardless of sample size.
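A minimal sketch of one way to compute these means from the simulated vectors (the object name simulated_means is illustrative):

# Mean of each simulated sampling distribution
simulated_means <- c(mean(mean5), mean(mean15), mean(mean30), mean(mean100), mean(mean200))
names(simulated_means) <- c("5", "15", "30", "100", "200")
simulated_means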
5 15 30 100 200
9.16998 9.15055 9.16974 9.16441 9.17096
As can be seen, the means of the sampling distributions are all very close to the population mean, $\lambda \approx 9.17$, regardless of sample size. We can also visualize the five sampling distributions with a density plot:
# Using ggplot()
ggplot(
data = sampling_distributions,
mapping = aes(x = sample_means)
) +
geom_density(aes(fill = sample_size, color = sample_size), alpha = 0.2) +
geom_vline(xintercept = lambda, linetype = "dashed") +
ggtitle("Density Plot of Sampling Distributions") +
xlab("Distributions of Sample Means") +
ylab("Density") +
theme(
panel.background = element_rect(fill = "azure2"),
panel.grid = element_blank()
)
The dashed line in the figure above indicates the population mean, $\lambda \approx 9.17$. All five sampling distributions are centered on it, and they become visibly narrower as the sample size grows. This reflects the second characteristic of the CLT: the standard error of the mean, $\sigma_{\bar{x}} = \sigma/\sqrt{n}$, decreases as the sample size $n$ increases.
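Similarly, a sketch of how the simulated standard errors can be computed (the object name simulated_se is illustrative):

# Standard deviation of each simulated sampling distribution
simulated_se <- c(sd(mean5), sd(mean15), sd(mean30), sd(mean100), sd(mean200))
names(simulated_se) <- c("5", "15", "30", "100", "200")
simulated_se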
5 15 30 100 200
1.35926 0.77799 0.55530 0.30251 0.21138
Now we have numerical confirmation of this characteristic as well. Additionally, we can compare these simulated values to the analytical solutions. The standard deviation of a Poisson distribution is simply the square root of its mean, $\sigma = \sqrt{\lambda}$, so the standard error of the mean for a sample of size $n$ is $\sqrt{\lambda}/\sqrt{n}$:
# sd of the distribution of a random sample n=5
sd_5 <- sqrt(lambda) / sqrt(5)
# sd of the distribution of a random sample n=15
sd_15 <- sqrt(lambda) / sqrt(15)
# sd of the distribution of a random sample n=30
sd_30 <- sqrt(lambda) / sqrt(30)
# sd of the distribution of a random sample n=100
sd_100 <- sqrt(lambda) / sqrt(100)
# sd of the distribution of a random sample n=200
sd_200 <- sqrt(lambda) / sqrt(200)
# store in one vector
se <- c(sd_5, sd_15, sd_30, sd_100, sd_200)
names(se) <- c("5", "15", "30", "100", "200")
se
5 15 30 100 200
1.3540064 0.7817360 0.5527708 0.3027650 0.2140872
Here, we see that our simulated values match the analytical solutions quite closely.
When two random variables are each normally distributed, their joint behavior is described by their means, their variances, and the correlation between them. A random variable $X_{1}$ is normally distributed with mean $\mu_{1}$ and variance $\sigma_{1}^{2}$. A second random variable $X_{2}$ is normally distributed with mean $\mu_{2}$ and variance $\sigma_{2}^{2}$, and the strength of the linear association between the two is measured by the correlation coefficient $\rho$, which corresponds to a covariance of $\text{cov}(X_{1}, X_{2}) = \rho\,\sigma_{1}\sigma_{2}$. Then, we say that the pair $(X_{1}, X_{2})$ follows a bivariate normal distribution.
# Define a function for generating random multivariate normals
rmultnorm <- function(n, mu, vmat, tol = 1e-07) {
  p <- ncol(vmat)
  if (length(mu) != p) {
    stop("alright, alright, alright, mu vector is the wrong length")
  }
  if (max(abs(vmat - t(vmat))) > tol) {
    stop("vmat not symmetric")
  }
  # use the singular value decomposition of the covariance matrix to obtain a matrix square root
  vs <- svd(vmat)
  vsqrt <- t(vs$v %*% (t(vs$u) * sqrt(vs$d)))
  # transform iid standard normals by the square root, then shift each column by its mean
  ans <- matrix(rnorm(n * p), nrow = n) %*% vsqrt
  ans <- sweep(ans, 2, mu, "+")
  dimnames(ans) <- list(NULL, dimnames(vmat)[[2]])
  return(ans)
}
We now have a function for generating random multivariate normals. Suppose we would like to simulate observations from a bivariate normal distribution in which each variable has mean 100 and variance 14. If we also wish to specify a moderately negative association, following Cohen's (1988) conventions for interpreting the strength of a correlation coefficient (a moderate correlation of $\rho = -0.3$), we need to find the corresponding covariance: $\text{cov}(X_{1}, X_{2}) = \rho\,\sigma_{1}\sigma_{2} = -0.3 \times \sqrt{14} \times \sqrt{14} = -4.2$.
We modify the inputs of the function defined above:
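A minimal sketch of the call under the assumptions above (the sample size of 1,000 draws is illustrative):

# Mean vector: both variables have mean 100
mu <- c(100, 100)
# Covariance matrix: variances of 14, covariance of -4.2 (rho = -0.3)
vmat <- matrix(c(14, -4.2, -4.2, 14), nrow = 2)
# Draw 1,000 observations from the bivariate normal distribution
moderate_negative <- rmultnorm(n = 1000, mu = mu, vmat = vmat)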
Here is a scatter plot on the xy-plane:
# Using ggplot()
ggplot(
data = data.frame(moderate_negative),
mapping = aes(
x = moderate_negative[, 1],
y = moderate_negative[, 2]
)
) +
geom_point(color = "orange") +
xlim(c(86, 115)) +
ylim(c(88, 112)) +
ggtitle(
paste(
"Scatter Plot of Bivariate Normal Random Variables
with Moderate Negative Association (rho = ",
round(cor(moderate_negative[, 1], moderate_negative[, 2]), digits = 2),
")"
)
) +
xlab("X1") +
ylab("X2") +
theme(
panel.background = element_rect(fill = "azure2"),
panel.grid = element_blank()
)
Here’s a 3D plot of the joint probability density function of the two random variables:
# Bandwidth
# Using a rule-of-thumb for choosing the bandwidth of a Gaussian kernel density estimator
bw_vector <- c(bandwidth.nrd(moderate_negative[, 1]), bandwidth.nrd(moderate_negative[, 2]))
bw_moderate_negative <- mean(bw_vector)
# Using the kde2d() function from the MASS package
# kde2d() is used for 2d kernel density estimation
joint_pdf <- kde2d(x = moderate_negative[, 1], y = moderate_negative[, 2], h = bw_moderate_negative)
# Using plot_ly() from the plotly graphics library
# the pipe operator %>% will forward the result of an expression into the next function call
joint_pdf2d <- plot_ly(x = joint_pdf$x, y = joint_pdf$y, z = joint_pdf$z)
joint_pdf2d <- joint_pdf2d %>% layout(title = "Bivariate Density")
joint_pdf2d <- joint_pdf2d %>% add_surface()
joint_pdf2d
As can be seen, when $X_{1}$ takes on larger values, $X_{2}$ tends to take on smaller values: the joint density surface is stretched along a negative diagonal, reflecting the moderate negative association between the two variables.
To simulate observations from a bivariate normal distribution with a strong negative association, say $\rho = -0.5$ under Cohen's convention, we change the covariance to $-0.5 \times 14 = -7$ and repeat the exercise:
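A sketch of the corresponding call, again assuming 1,000 illustrative draws:

# Covariance matrix: variances of 14, covariance of -7 (rho = -0.5)
vmat_strong <- matrix(c(14, -7, -7, 14), nrow = 2)
# Draw 1,000 observations from the bivariate normal distribution
strong_negative <- rmultnorm(n = 1000, mu = c(100, 100), vmat = vmat_strong)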
# Using ggplot()
ggplot(
data = data.frame(strong_negative),
mapping = aes(
x = strong_negative[, 1],
y = strong_negative[, 2]
)
) +
geom_point(color = "orange") +
xlim(c(89, 112)) +
ylim(c(86, 111)) +
ggtitle(
paste(
"Scatter Plot of Bivariate Normal Random Variables
with Strong Negative Association (rho = ",
round(cor(strong_negative[, 1], strong_negative[, 2]), digits = 2),
")"
)
) +
xlab("X1") +
ylab("X2") +
theme(
panel.background = element_rect(fill = "azure2"),
panel.grid = element_blank()
)
# Bandwidth
bw_vector <- c(bandwidth.nrd(strong_negative[, 1]), bandwidth.nrd(strong_negative[, 2]))
bw_strong_negative <- mean(bw_vector)
# 2d kernel density estimation
joint_pdf <- kde2d(x = strong_negative[, 1], y = strong_negative[, 2], h = bw_strong_negative)
# Using plot_ly()
joint_pdf2d <- plot_ly(x = joint_pdf$x, y = joint_pdf$y, z = joint_pdf$z)
joint_pdf2d <- joint_pdf2d %>% layout(title = "Bivariate Density")
joint_pdf2d <- joint_pdf2d %>% add_surface()
joint_pdf2d
There are a variety of perspectives on the definition of power. Simply put, power is the probability of avoiding a Type II error, as Neil Weiss puts it in Introductory Statistics. In the section below, we explore the concept of power by examining a two-way ANOVA with interaction. Note that, for simplicity, I use arbitrarily chosen parameter values rather than real empirical data. In the context of a two-way ANOVA with interaction, we can interpret power as the probability that a test of significance will detect an effect that is truly present.
# Define a function for a two-way anova model with interaction
two_way_anova_interaction_regression <- function(sim_num = NULL, n, b0, b1, b2, b3,
                                                 x1_mean = 0, x1_sd = 1, err_mean = 0, err_sd = 1) {
  # sim_num absorbs the iteration index that ldply() passes as the first argument;
  # without it, that index would silently be matched to x1_mean
  # x1 draws n values from a normal distribution with a mean of 0 & sd of 1
  # x2 draws integers between 0 and 1, n times (i.e., n values of either 0 or 1)
  x1 <- rnorm(n, mean = x1_mean, sd = x1_sd)
  x2 <- sample(0:1, n, replace = TRUE)
  # y is a linear combination of x1 & x2 multiplied by coefficients/effect sizes
  # the last term is the error term, i.e., the unexplained portion of y
  y <- b0 + (b1 * x1) + (b2 * x2) + (b3 * x1 * x2) + rnorm(n, mean = err_mean, sd = err_sd)
  # regression model
  anova_model <- lm(y ~ x1 * x2)
  # store model outputs
  model_summary <- summary(anova_model)
  coeffs <- model_summary$coefficients[, 1]
  p_values <- model_summary$coefficients[, 4]
  r_sqr <- model_summary$r.squared
  # output
  results <- c(coeffs, p_values, r_sqr)
  names(results) <- c(
    "$\\beta_{0}$", "$\\beta_{1}$", "$\\beta_{2}$",
    "$\\beta_{3}$", "$\\beta_{0}$_pvalue",
    "$\\beta_{1}$_pvalue", "$\\beta_{2}$_pvalue",
    "$\\beta_{3}$_pvalue", "$r^2$"
  )
  return(results)
}
Let’s try using this function:
# Using arbitrarily picked values
anova_model <- two_way_anova_interaction_regression(n = 100, b0 = 0, b1 = 0.2, b2 = 0.4, b3 = 0.5)
# Generate table using kable () function from the knitr package
kable(anova_model, caption = "Two-way ANOVA Regression Model with Interaction") %>%
kable_styling(position = "center")
|  | x |
|---|---|
| $\beta_{0}$ | 0.0310109 |
| $\beta_{1}$ | 0.2611402 |
| $\beta_{2}$ | 0.5979107 |
| $\beta_{3}$ | 0.3640547 |
| $\beta_{0}$_pvalue | 0.8446025 |
| $\beta_{1}$_pvalue | 0.1368724 |
| $\beta_{2}$_pvalue | 0.0042647 |
| $\beta_{3}$_pvalue | 0.1080374 |
| $r^2$ | 0.2291234 |
Now, we run 1,000 simulations of the model, recording the coefficients and their associated p-values for each iteration. The output is a data frame with one row per simulation:
# Number of simulations
num_sims <- 1000
# Using a function called ldply() from the plyr package by Hadley Wickham
# ldply() applies a function to each element of a list, then combines the results into a data frame
power_simulations <- ldply(1:num_sims, two_way_anova_interaction_regression,
n = 100, b0 = 0, b1 = 0.2, b2 = 0.4, b3 = 0.5
)
# First 5 rows
kable(power_simulations[1:5, ], caption = "Two-way ANOVA Regression Model with Interaction") %>%
kable_styling(position = "center")
$\beta_{0}$ | $\beta_{1}$ | $\beta_{2}$ | $\beta_{3}$ | $\beta_{0}$_pvalue | $\beta_{1}$_pvalue | $\beta_{2}$_pvalue | $\beta_{3}$_pvalue | $r^2$ |
---|---|---|---|---|---|---|---|---|
-0.1657905 | 0.1852724 | 0.7968441 | 0.5459245 | 0.4342815 | 0.2469565 | 0.0103175 | 0.0136621 | 0.4588430 |
-0.1945950 | 0.2394697 | 0.7110830 | 0.3147623 | 0.5081150 | 0.0582744 | 0.0991120 | 0.1033851 | 0.3969188 |
-0.7726047 | 0.5142571 | 0.8882506 | 0.2458852 | 0.0712582 | 0.0000984 | 0.1898985 | 0.2411440 | 0.5116927 |
1.3787538 | -0.1250316 | -1.2926884 | 0.8758351 | 0.0884618 | 0.5130762 | 0.1906276 | 0.0003018 | 0.6237839 |
0.2492963 | 0.1122680 | -0.2806965 | 0.7228013 | 0.7372446 | 0.4271540 | 0.7963389 | 0.0010111 | 0.7472213 |
We can estimate power by finding the proportion of p-values that are significant. We are primarily interested in the p-value for the interaction effect, $\beta_{3}$, stored in the eighth column of the output. The estimated power is the share of the 1,000 simulations in which that p-value falls below .05.
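A minimal sketch of this calculation, indexing the interaction p-value by its column position, as the loop further below also does:

# Proportion of simulations in which the interaction p-value (column 8) is below .05
sum(power_simulations[[8]] < .05) / nrow(power_simulations)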
0.67
We can tweak the parameters to see how they affect our power estimate. For instance, we can investigate what happens as we increase the sample size n (the number of observations drawn for x1 and x2):
# Create a vector of sample sizes ranging from 50 to 500
sample_sizes <- c(50, 100, 200, 300, 500)
# Initialize variable
results <- list()
# For loop
for (val in sample_sizes) {
  # cycle through each value in "sample_sizes", setting val to that value
  # then pass that value to the simulation function as the sample size n
  power_simulations <- ldply(1:1000, two_way_anova_interaction_regression,
    n = val, b0 = 0, b1 = 0.3, b2 = 0.2, b3 = 0.3
  )
  # create a new variable called n in the output data frame
  # this variable functions as an identifier
  power_simulations$n <- as.factor(val)
  # rbind() combines data frames by rows
  # notice that "results" is an empty list on the first pass (val = 50)
  # after each cycle, 1,000 more rows are added to it
  # in the end we have a single data frame with 5,000 rows, one block per value in "sample_sizes"
  results <- rbind(results, power_simulations)
}
We could examine the results using a table and a plot:
# Split results into five individual data frames using the variable "n" as the identifier
list_of_results <- split(results, results[[10]])
# Find the power estimates associated with each sample size
power_n_50 <- sum(list_of_results[["50"]][[8]] < .05) / nrow(list_of_results[["50"]])
power_n_100 <- sum(list_of_results[["100"]][[8]] < .05) / nrow(list_of_results[["100"]])
power_n_200 <- sum(list_of_results[["200"]][[8]] < .05) / nrow(list_of_results[["200"]])
power_n_300 <- sum(list_of_results[["300"]][[8]] < .05) / nrow(list_of_results[["300"]])
power_n_500 <- sum(list_of_results[["500"]][[8]] < .05) / nrow(list_of_results[["500"]])
# Store all power estimates in one single vector
power_estimates <- c(power_n_50, power_n_100, power_n_200, power_n_300, power_n_500)
# Create data frame
power_table <- data.frame(
"Sample Size" = sample_sizes,
"Power Estimate" = power_estimates
)
# Generate table using kable () function from the knitr package
kable(power_table, caption = "Power Estimates by Sample Size") %>%
kable_styling(position = "center")
| Sample.Size | Power.Estimate |
|---|---|
| 50 | 0.175 |
| 100 | 0.284 |
| 200 | 0.542 |
| 300 | 0.728 |
| 500 | 0.910 |
# Using ggplot()
ggplot(
data = power_table,
mapping = aes(x = Sample.Size, y = Power.Estimate)
) +
geom_point(color = "orange") +
geom_line(color = "orange") +
geom_hline(yintercept = 0.8, linetype = "dashed") +
ylim(c(0, 1)) +
ggtitle("Power Estimates by Sample Size") +
xlab("Sample Size") +
ylab("Power Estimates") +
theme(
panel.background = element_rect(fill = "azure2"),
panel.grid = element_blank()
)
As can be seen, our power estimates increase as the sample size increases. If we take 0.8 as the rough rule of thumb for the desired level of power, then a sample size somewhere between 300 and 500 is needed to reliably detect the interaction effect with these parameter values.