library(tidyverse)
library(bbmle)
library(rlang)
Yang Wu
September 4, 2021
In my post on function factories, we discussed the R data structure that powers function factories: environments. Now, we will focus on some applications of function factories in statistics. Again, the content of this post is inspired by Advanced R.
The following packages, loaded above, will be used in this post: tidyverse, bbmle, and rlang.
The Box-Cox procedure is used in statistical analysis to identify a transformation, from the family of power transformations, of a univariate variable y:

y(λ) = (y^λ − 1) / λ  if λ ≠ 0
y(λ) = log(y)         if λ = 0

where λ is the transformation parameter, chosen so that the transformed values are as close to normally distributed as possible.
Here is the function factory:
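The original code chunk did not survive extraction; below is a sketch consistent with the usage box_cox_factory(lambda = 5) later in the post:

```r
# A sketch reconstructing the lost chunk, matching its usage below
box_cox_factory <- function(lambda) {
  # Force evaluation so the manufactured function is unaffected by
  # later changes to the promise
  force(lambda)
  # Manufactured function: applies the power transformation for this lambda
  function(y) {
    if (lambda == 0) {
      log(y)
    } else {
      (y^lambda - 1) / lambda
    }
  }
}
```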
Let us see it in action:
# Create a transformation function where lambda = 5
box_cox_5 <- box_cox_factory(lambda = 5)
# Create a vector of non-normal data
y <- rbeta(n = 500, shape1 = 6, shape2 = 1)
# Visualize
ggplot(data = tibble::tibble(y), mapping = aes(x = y)) +
  geom_histogram(
    binwidth = function(y) (max(y) - min(y)) / nclass.FD(y),
    color = "black", fill = "orange"
  ) +
  labs(title = "Pre-transformation") +
  theme(
    panel.background = element_rect(fill = "white"),
    panel.grid = element_blank()
  )
# After transformation
ggplot(data = tibble("new_y" = box_cox_5(y)), mapping = aes(x = new_y)) +
  geom_histogram(
    binwidth = function(y) (max(y) - min(y)) / nclass.FD(y),
    color = "black", fill = "orange"
  ) +
  labs(title = "Post-transformation") +
  theme(
    panel.background = element_rect(fill = "white"),
    panel.grid = element_blank()
  )
This is by no means the optimal transformation, but it allows us to see the significant change to the distribution of our data. An added benefit of using the function factory is that we can visually explore how different values of lambda reshape the transformation: we map a helper over several values of lambda, each call returning a ggplot2::stat_function layer built from a manufactured function:
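The definition of box_cox_plot_function was lost in extraction; a plausible sketch, assuming the box_cox_factory shown earlier, is itself a small factory whose output is a stat_function() layer:

```r
library(ggplot2)

# A sketch (the original definition was lost): each call returns a
# stat_function() layer drawing the transformation curve for one lambda,
# with the curve's color mapped to the value of lambda
box_cox_plot_function <- function(lambda) {
  stat_function(
    mapping = aes(color = lambda),
    fun = box_cox_factory(lambda)
  )
}
```

Because the color aesthetic is mapped to lambda, the scale_color_viridis_c() call below colors each curve by its lambda value.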
ggplot(data = tibble::tibble(y), mapping = aes(x = y)) +
  purrr::map(.x = c(0, 1, 2, 3, 4, 5), .f = box_cox_plot_function) +
  scale_color_viridis_c(limits = c(0, 5)) +
  labs(
    y = "Transformed Y",
    x = "Original Y"
  ) +
  theme(
    panel.background = element_rect(fill = "white"),
    panel.grid = element_blank()
  )
The transformation can be applied in linear regression analysis as a remedial measure when model conditions, such as linearity and constancy of error variance, are not met. I will cover this topic in future posts.
The bootstrap is a powerful statistical technique that allows us to quantify the uncertainty associated with a given estimator or statistic. Ideally, to quantify that uncertainty, we would obtain new samples of data from the population and compute the estimator or statistic for each sample. In reality, however, obtaining repeated samples from a given population can be costly. The bootstrap emulates the process of obtaining new samples, allowing us to estimate the variability of an estimator or statistic without actually generating additional samples. Rather than repeatedly drawing independent samples from the population, the method draws distinct samples by repeatedly sampling observations, with replacement, from the original sample. The following diagram illustrates the essence of bootstrapping:

Let us examine an application. Suppose we have a data set with two variables, x and y, and we wish to fit the simple linear regression model y = β0 + β1x + ε. However, we are not satisfied with just one point estimator of β1 from a single fit. We would like to generate bootstrap samples of the data, fit the simple regression model to each sample, obtain a point estimator of β1 from each, and use the standard deviation of these estimates to approximate the standard error of the estimator.
To solve this problem in R, we can combine function factories and functionals:
# Generate a random data frame
data <- tibble(
  y = sample(x = 1:200, size = 100, replace = FALSE),
  x = sample(x = 1:200, size = 100, replace = FALSE)
)
# Find the estimated standard error (the standard deviation, not the variance,
# of the sampling distribution) of beta hat: element 4 of the coefficients
# matrix is the standard error on x
lm(y ~ x, data = data) %>%
  summary() %>%
  purrr::pluck(.x = ., "coefficients") %>%
  `[`(4)

[1] 0.09776657
Next, we create a function factory that takes as input a data frame and the variable from which the bootstrap samples are to be generated, and returns a manufactured function:
bootstrap_factory <- function(df, var) {
  # Initialize n
  n <- nrow(df)
  # Force evaluation of var
  force(var)
  # Manufactured function
  function() {
    # Resample the selected variable with replacement
    # Use "drop" to prevent the one-column data frame from being simplified to a vector
    df[var] <- df[var][sample(x = 1:n, replace = TRUE), , drop = FALSE]
    df
  }
}
}

The benefit of creating a function factory is that this bootstrap generator is data-frame- and variable-agnostic: if we wish to bootstrap another data frame or other variables, we simply manufacture another function for that task. In the simple linear regression case with one independent variable, the benefit may not be obvious. But imagine a 20-variable data frame that we need to bootstrap; the function factory spares us a great deal of copying and pasting. Next, let us generate 1000 bootstrapped data frames based on the original data frame:
# Create a list of 1000 data frames
# Each list element is a bootstrapped data frame
list_of_data_frames <- map(
  .x = 1:1000,
  # Create and immediately call a manufactured function on each iteration
  .f = ~ bootstrap_factory(df = data, var = "y")()
)
# Next fit the model to each of the bootstrapped data frames and extract the coefficient
vector_of_betas <- list_of_data_frames %>%
  # Fit the model to each of the 1000 data frames
  # This is a list of "lm" model objects
  map(.x = ., .f = ~ lm(y ~ x, data = .x)) %>%
  # Extract the estimated coefficient on x from each model object
  map_dbl(.x = ., .f = ~ coef(.x)[[2]])

As can be seen, the whole process takes only about six lines of code. Here is the sampling distribution of the estimator of β1:
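The code that produced the figure and the bootstrap standard error was lost in extraction; a sketch, assuming the vector_of_betas computed above:

```r
library(tidyverse)

# Bootstrap standard error: the standard deviation of the 1000 estimates
sd(vector_of_betas)

# Histogram of the bootstrap sampling distribution of the estimator of beta1
ggplot(data = tibble(beta = vector_of_betas), mapping = aes(x = beta)) +
  geom_histogram(color = "black", fill = "orange", bins = 30) +
  labs(title = "Bootstrap sampling distribution", x = "Estimates of beta1") +
  theme(
    panel.background = element_rect(fill = "white"),
    panel.grid = element_blank()
  )
```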

And the standard deviation of this sampling distribution is 0.093936. Does this agree with the analytical standard error (0.0978) we computed earlier, or contradict it?
Another application of function factories is the estimation of the parameters of the normal error regression model:

y = β0 + β1x + ε, where the errors ε are independent and normally distributed with mean 0 and variance σ².

For this model, each observed yᵢ is normally distributed with mean β0 + β1xᵢ and variance σ², so its density is

f(yᵢ) = (1 / √(2πσ²)) exp(−(yᵢ − β0 − β1xᵢ)² / (2σ²))

In R, this density can be computed using the dnorm function. We will specify the following arguments:

- x: the observed values of y
- mean: the expected y, that is, β0 + β1x
- sd: the error standard deviation σ

The maximum likelihood function is the product of the densities of all the yᵢ:

L(β0, β1, σ) = ∏ᵢ f(yᵢ)

We may work with the natural log of L instead, which turns the product into a sum:

log L(β0, β1, σ) = ∑ᵢ log f(yᵢ)

This will simplify our implementation in R.
For R implementation, we first need to generate some random data:
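The data-generating chunk was lost in extraction. The sketch below is an assumption: the seed and the generating values (intercept 43, slope 0.01, error sd 7) are hypothetical, chosen only to match the scale of the estimates reported below.

```r
library(tibble)

set.seed(1) # hypothetical seed; the original is unknown
data <- tibble(
  x = runif(n = 100, min = 0, max = 200),
  # Hypothetical generating parameters, on the scale of the estimates below
  y = 43 + 0.01 * x + rnorm(n = 100, mean = 0, sd = 7)
)
```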
For comparison, let us also obtain the estimators of the parameters using the least squares method:
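The code for this step was also lost; a call of the following shape would produce output like that shown below (assuming the data frame generated above):

```r
# Least squares fit
fit <- lm(y ~ x, data = data)
# Least squares estimates of beta0 and beta1
unname(coef(fit))
# Residual standard error, the least squares estimate of sigma
sigma(fit)
```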
[1] 43.62319048 0.01186464
[1] 7.250221
Next, create the function factory:
likelihood_factory <- function(x, y) {
  # Initialize y and x
  y <- force(y)
  x <- force(x)
  # Manufactured function
  function(beta0, beta1, sigma) {
    # Linear regression model
    # The value of "x" is scoped from the enclosing environment of this manufactured function
    expected_y <- beta0 + beta1 * x
    # Negative log-likelihood
    # The value of "y" is scoped from the enclosing environment of this manufactured function
    # Negate the log-likelihood because the optimizer minimizes, while we want to maximize
    -sum(dnorm(x = y, mean = expected_y, sd = sigma, log = TRUE))
  }
}
}

The advantage of creating a function factory is that "x" and "y" are initialized in the execution environment of likelihood_factory, which becomes the enclosing environment of the manufactured function and is where that function scopes for their values. Without the captured, encapsulated environment of a function factory, "x" and "y" would have to be stored in the global environment, where they could be overwritten or deleted, or interfere with other bindings. In that sense, the function factory provides an extra layer of safeguarding. To find the maximum likelihood estimators, we supply a manufactured function to the bbmle::mle2 function from the bbmle package:
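The code chunk here was dropped in extraction, but the Call: line in the output below records the call exactly:

```r
# Reconstructed from the Call: line in the output below; assumes the
# data frame and likelihood_factory defined earlier
mle_fit <- bbmle::mle2(
  minuslogl = likelihood_factory(data$x, data$y),
  start = list(beta0 = 0, beta1 = 0, sigma = 50)
)
mle_fit
```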
Call:
mle2(minuslogl = likelihood_factory(data$x, data$y), start = list(beta0 = 0,
beta1 = 0, sigma = 50))
Coefficients:
beta0 beta1 sigma
43.62313572 0.01186424 7.20174955
Log-likelihood: -508.99
How do the maximum likelihood estimators compare to those of the least squares method? (Hint: For normal data, they should be consistent with each other.)
And that is all for this post. We have covered three interesting applications of function factories in statistics. Among all of R's functional programming toolkits, function factories are perhaps the least utilized as far as data science is concerned, likely because they tend to have more of a mathematical or statistical flavor. However, as we have seen in this post, there is a class of problems, particularly in mathematical statistics, to which function factories offer elegant solutions in R. Having them in our R toolkit may therefore reap long-term benefits, allowing us to solve a variety of problems beyond those typically encountered in data analysis.