A comprehensive reference for probability functions in R and Python, covering both continuous and discrete cases
R Programming
Python Programming
Probability & Statistics
Mathematics
Author
Yang Wu
Published
October 13, 2022
Motivation
The intent of this post is to serve as a quick reference for working with basic probability functions in R and Python, covering both continuous and discrete cases. This guide provides some theoretical background, properties, and practical examples in both R and Python.
To demonstrate the equivalence between R functions and Python computations, we will use the same simulated samples in both languages throughout this guide. That is, the reticulate package in R will be used to pass data between R and Python.
Packages and Libraries
We will utilize several packages in R and Python to perform our computations.
# R packages
library(stats)      # d/p/q/r distribution functions
library(moments)    # moment()
library(EnvStats)
library(purrr)      # map_dbl()
library(dplyr)
library(reticulate) # pass objects between R and Python
# Python libraries
from scipy.stats import gamma, chi2, binom, poisson, moment, describe
import numpy as np
import matplotlib.pyplot as plt
Simulate Samples
We begin by simulating samples from both continuous and discrete distributions, which will be used throughout this guide.
Continuous Distribution: Chi-squared Distribution
# Simulate samples in R from chi-squared distribution
# (a gamma with shape = 5/2 and rate = 1/2 is a chi-squared with 5 degrees of freedom)
r_sample_continuous <- rgamma(n = 1000, shape = 5/2, rate = 1/2)
hist(r_sample_continuous, main = "Histogram of Simulated Samples (R, Continuous)", xlab = "Value")
# Simulate samples in Python from chi-squared distribution
# (scale = 2 in SciPy corresponds to rate = 1/2 in R)
py_sample_continuous = gamma.rvs(a=5/2, scale=2, size=1000)
plt.hist(py_sample_continuous, bins=30)
plt.title("Histogram of Simulated Samples (Python, Continuous)")
plt.xlabel("Value")
plt.show()
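The simulation above draws from a gamma distribution rather than calling a chi-squared sampler directly. The following is a minimal sketch of why that works, assuming the standard identity that a gamma distribution with shape \(k/2\) and scale 2 (rate 1/2) is a chi-squared distribution with \(k\) degrees of freedom. The grid `x` and the variable names are illustrative choices, not part of the original code.

```python
import numpy as np
from scipy.stats import chi2, gamma

# Evaluate both densities on a grid of positive values (illustrative grid).
x = np.linspace(0.1, 20, 200)

# Gamma with shape a = k/2 and scale = 2 (equivalently, rate = 1/2 in R) ...
k = 5
gamma_pdf = gamma.pdf(x, a=k / 2, scale=2)

# ... should coincide with the chi-squared density with k degrees of freedom.
chi2_pdf = chi2.pdf(x, df=k)

same_density = np.allclose(gamma_pdf, chi2_pdf)
print(same_density)
```

With shape 5/2, the simulated samples follow a chi-squared distribution with 5 degrees of freedom. The PDF, CDF, and quantile examples in this guide evaluate chi-squared functions with df = 10; a gamma draw matching that choice would use shape 10/2 = 5.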
Discrete Distribution: Binomial Distribution
# Simulate samples in R from binomial distribution
r_sample_discrete <- rbinom(n = 1000, size = 10, prob = 0.5)
hist(r_sample_discrete, breaks = seq(-0.5, 10.5, 1), main = "Histogram of Simulated Samples (R, Discrete)", xlab = "Value")
# Simulate samples in Python from binomial distribution
py_sample_discrete = binom.rvs(n=10, p=0.5, size=1000)
plt.hist(py_sample_discrete, bins=np.arange(-0.5, 11.5, 1))
plt.title("Histogram of Simulated Samples (Python, Discrete)")
plt.xlabel("Value")
plt.show()
Probability Density Function (Continuous) and Probability Mass Function (Discrete)
Probability Density Function (PDF) for Continuous Variables
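The Probability Density Function (PDF) of a continuous random variable \(X\) is a function \(f_X(x)\) such that:
Non-negativity: \(f_X(x) \geq 0\) for all \(x \in \mathbb{R}\).
Normalization: \(\int_{-\infty}^{\infty} f_X(x) \, dx = 1\).
Probability Mass Function (PMF) for Discrete Variables
The Probability Mass Function (PMF) of a discrete random variable \(X\) is a function \(p_X(k) = \operatorname{Pr}(X = k)\) such that:
Non-negativity: \(p_X(k) \geq 0\) for all integers \(k\).
Normalization: \(\sum_{k} p_X(k) = 1\).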
Comments:
The PMF gives the exact probability that the random variable takes on a specific value.
It is useful when the random variable can only take on discrete values.
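The normalization properties can be checked numerically. The sketch below assumes the chi-squared (df = 10) and binomial (n = 10, p = 0.5) distributions used elsewhere in this guide, and uses `scipy.integrate.quad` for the integral. The final line illustrates that a density value, unlike a PMF value, is not itself a probability and may exceed 1.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import binom, chi2

# PDF normalization: integrate the chi-squared (df = 10) density over [0, inf).
pdf_total, _ = quad(lambda x: chi2.pdf(x, df=10), 0, np.inf)

# PMF normalization: sum the binomial (n = 10, p = 0.5) mass over its support.
support = np.arange(0, 11)
pmf_values = binom.pmf(support, n=10, p=0.5)
pmf_total = pmf_values.sum()

# A density can be larger than 1 at a point; here chi-squared with df = 1 near zero.
density_near_zero = chi2.pdf(0.01, df=1)

print(pdf_total, pmf_total, density_near_zero)
```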
Examples
Continuous Case: PDF of Chi-squared Distribution
# Compute PDF values in R
r_densities_continuous <- dchisq(r_sample_continuous, df = 10)
plot(r_sample_continuous, r_densities_continuous,
     xlab = "Quantiles", ylab = "Density",
     main = "PDF of Chi-squared Distribution (R)")
# Compute PDF values in Python
py_densities_continuous = chi2.pdf(r.r_sample_continuous, df=10)
plt.scatter(r.r_sample_continuous, py_densities_continuous)
plt.xlabel("Quantiles")
plt.ylabel("Density")
plt.title("PDF of Chi-squared Distribution (Python)")
plt.show()
Discrete Case: PMF of Binomial Distribution
# Compute PMF values in R
r_pmf_discrete <- dbinom(r_sample_discrete, size = 10, prob = 0.5)
plot(unique(r_sample_discrete), r_pmf_discrete[!duplicated(r_sample_discrete)],
     xlab = "Outcomes", ylab = "Probability",
     main = "PMF of Binomial Distribution (R)", type = "h")
# Compute PMF values in Python
py_pmf_discrete = binom.pmf(r.r_sample_discrete, n=10, p=0.5)
plt.stem(r.r_sample_discrete, py_pmf_discrete)
plt.xlabel("Outcomes")
plt.ylabel("Probability")
plt.title("PMF of Binomial Distribution (Python)")
plt.show()
Interpretation
Continuous Case: The PDF indicates the relative likelihood of the random variable near a point. Since the probability at an exact point is zero for continuous variables, we consider intervals.
Discrete Case: The PMF gives the exact probability of the random variable being equal to a specific value.
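To make the interpretation concrete, the sketch below integrates the chi-squared (df = 10) density over shrinking intervals centered at an arbitrary point (8 here, chosen only for illustration). The interval probability tends to zero, while the probability per unit width approaches the density at that point. For the binomial case, the PMF value is already a probability.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import binom, chi2

def pdf(x):
    # Chi-squared density with 10 degrees of freedom.
    return chi2.pdf(x, df=10)

# Continuous case: probability of an interval of width w centered at 8.
point = 8.0
widths = np.array([1.0, 0.1, 0.01, 0.001])
interval_probs = np.array(
    [quad(pdf, point - w / 2, point + w / 2)[0] for w in widths]
)

# Probability per unit width approaches the density at the point.
prob_per_width = interval_probs / widths
density_at_point = pdf(point)

# Discrete case: the PMF is an exact probability, here Pr(X = 5) = 252 / 1024.
prob_five = binom.pmf(5, n=10, p=0.5)

print(interval_probs, prob_per_width, density_at_point, prob_five)
```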
Cumulative Distribution Function (CDF)
The Cumulative Distribution Function (CDF) of a random variable \(X\) is a fundamental concept in probability theory that provides a complete description of the distribution of \(X\). It is defined as:
\[\begin{align*}
F_X(x) = \operatorname{Pr}(X \leq x)
\end{align*}\]
This definition holds for both continuous and discrete random variables. However, the way the CDF is calculated differs between the two cases due to the nature of their probability distributions.
Continuous Random Variables
For a continuous random variable, the CDF is calculated as an integral of its Probability Density Function (PDF) \(f_X(x)\):
\[\begin{align*}
F_X(x) = \int_{-\infty}^{x} f_X(u) \, du
\end{align*}\]
Key Points:
The PDF \(f_X(x)\) represents the density of probability at each point \(x\).
The CDF \(F_X(x)\) accumulates the probabilities from the lower bound \(-\infty\) up to \(x\).
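Since \(\operatorname{Pr}(X = x) = 0\) for continuous variables, the CDF is a continuous function with no jumps.
Discrete Random Variables
For a discrete random variable, the CDF is calculated by summing its Probability Mass Function (PMF) \(p_X(k)\) over all values up to \(x\):
\[\begin{align*}
F_X(x) = \sum_{k \leq x} p_X(k)
\end{align*}\]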
Right-Continuity: The CDF is always right-continuous, regardless of whether the random variable is discrete, continuous, or mixed. This means that at any point \(x\):
\[\begin{align*}
\lim_{t \to x^{+}} F_X(t) = F_X(x)
\end{align*}\]
For Discrete Variables: The CDF has jumps at points where the random variable has positive probability mass. At each jump the CDF takes the upper value, so it is continuous from the right but not from the left.
For Continuous Variables: The CDF is continuous everywhere, so it is trivially right-continuous.
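Right-continuity can be seen numerically at a jump point of the binomial CDF. The sketch below assumes the Binomial(10, 0.5) distribution from the examples. The point `k = 3` and the offset `eps` are arbitrary illustrative choices.

```python
from scipy.stats import binom

k = 3
eps = 1e-9  # small offset used to approach k from either side

at_k = binom.cdf(k, n=10, p=0.5)              # F(3)
from_right = binom.cdf(k + eps, n=10, p=0.5)  # F just to the right of 3
from_left = binom.cdf(k - eps, n=10, p=0.5)   # F just to the left of 3

# Approaching from the right recovers F(3); approaching from the left does not.
jump = at_k - from_left
mass_at_k = binom.pmf(k, n=10, p=0.5)

print(at_k, from_right, from_left, jump, mass_at_k)
```

The difference between the value at the point and the value just to its left equals the probability mass at that point.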
Comments
Accumulated Probability: The CDF represents the total probability accumulated up to a certain value \(x\). It answers the question: “What is the probability that the random variable \(X\) is less than or equal to \(x\)?”
Calculating Probabilities Over Intervals:
Continuous Case:
\[\begin{align*}
\operatorname{Pr}(a < X \leq b) = F_X(b) - F_X(a)
\end{align*}\]
Discrete Case (integer-valued \(X\) and integers \(a < b\)):
\[\begin{align*}
\operatorname{Pr}(a < X \leq b) = \sum_{k = a+1}^{b} p_X(k)
\end{align*}\]
Continuous Variables: The CDF is a smooth, continuous curve that increases from 0 to 1.
Discrete Variables: The CDF is a step function with jumps at the points where the random variable has non-zero probability.
Total Probability: The CDF approaches 1 as \(x\) approaches infinity, reflecting that the total probability over the entire space is 1.
Differentiation and Integration: For continuous random variables, the PDF and CDF are related through differentiation and integration: \(F_X(x) = \int_{-\infty}^{x} f_X(u) \, du\), and \(f_X(x) = \frac{d}{dx} F_X(x)\) wherever the derivative exists.
Discontinuities: In the discrete case, the size of the jump at a point \(k\) in the CDF is equal to \(p_X(k)\), the probability mass at that point.
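The interval formulas above can be verified directly. This is a minimal sketch using the chi-squared (df = 10) and Binomial(10, 0.5) distributions from the examples. The interval endpoints are arbitrary illustrative values.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import binom, chi2

# Continuous case: Pr(4 < X <= 12) as a CDF difference and as an integral of the PDF.
continuous_via_cdf = chi2.cdf(12, df=10) - chi2.cdf(4, df=10)
continuous_via_pdf, _ = quad(lambda x: chi2.pdf(x, df=10), 4, 12)

# Discrete case: Pr(2 < X <= 6) as a CDF difference and as a sum of PMF values.
a, b = 2, 6
discrete_via_cdf = binom.cdf(b, n=10, p=0.5) - binom.cdf(a, n=10, p=0.5)
discrete_via_pmf = binom.pmf(np.arange(a + 1, b + 1), n=10, p=0.5).sum()

# Total probability: the CDF approaches 1 for large arguments.
cdf_far_right = chi2.cdf(1e6, df=10)

print(continuous_via_cdf, continuous_via_pdf)
print(discrete_via_cdf, discrete_via_pmf, cdf_far_right)
```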
Examples
Continuous Case: CDF of Chi-squared Distribution
# Compute CDF values in R
r_cdf_continuous <- pchisq(r_sample_continuous, df = 10)
plot(r_sample_continuous, r_cdf_continuous,
     xlab = "Quantiles", ylab = "Cumulative Probability",
     main = "CDF of Chi-squared Distribution (R)")
# Compute CDF values in Python
py_cdf_continuous = chi2.cdf(r.r_sample_continuous, df=10)
plt.scatter(r.r_sample_continuous, py_cdf_continuous)
plt.xlabel("Quantiles")
plt.ylabel("Cumulative Probability")
plt.title("CDF of Chi-squared Distribution (Python)")
plt.show()
Discrete Case: CDF of Binomial Distribution
# Compute CDF values in R
r_cdf_discrete <- pbinom(r_sample_discrete, size = 10, prob = 0.5)
# Sort by outcome so the step plot is drawn from left to right
ord <- order(r_sample_discrete)
plot(r_sample_discrete[ord], r_cdf_discrete[ord],
     xlab = "Outcomes", ylab = "Cumulative Probability",
     main = "CDF of Binomial Distribution (R)", type = "s")
# Compute CDF values in Python
py_cdf_discrete = binom.cdf(r.r_sample_discrete, n=10, p=0.5)
# Sort by outcome so the step plot is drawn from left to right
order = np.argsort(r.r_sample_discrete)
plt.step(np.asarray(r.r_sample_discrete)[order], py_cdf_discrete[order], where='post')
plt.xlabel("Outcomes")
plt.ylabel("Cumulative Probability")
plt.title("CDF of Binomial Distribution (Python)")
plt.show()
Comments
Continuous Case: The CDF smoothly increases from 0 to 1 as the quantile increases.
Discrete Case: The CDF increases in steps, reflecting the discrete nature of the variable.
Quantile Function
The Quantile Function is the inverse of the CDF (a generalized inverse when the CDF is not strictly increasing, as in the discrete case) and maps probabilities to quantiles.
Definitions
For a given probability \(p \in [0,1]\), the quantile function \(Q_X(p)\) is defined as:
Continuous Case:
\[\begin{align*}
Q_X(p) = F_X^{-1}(p) = \inf\{ x \in \mathbb{R} : F_X(x) \geq p \}
\end{align*}\]
Discrete Case:
\[\begin{align*}
Q_X(p) = \min\{ k \in \mathbb{Z} : F_X(k) \geq p \}
\end{align*}\]
Comments:
The quantile function returns the value at or below which a given proportion \(p\) of the probability mass falls.
In the discrete case, it provides the smallest integer where the cumulative probability meets or exceeds the given probability.
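The discrete definition can be implemented by hand and compared against SciPy. In the sketch below, `discrete_quantile` is a hypothetical helper written for illustration (it is not a SciPy function). It searches the Binomial(10, 0.5) CDF for the smallest integer whose cumulative probability meets or exceeds \(p\).

```python
import numpy as np
from scipy.stats import binom

def discrete_quantile(p, n=10, prob=0.5):
    """Smallest integer k in {0, ..., n} with F(k) >= p (illustrative helper)."""
    support = np.arange(0, n + 1)
    cdf_values = binom.cdf(support, n=n, p=prob)
    # side="left" returns the first index i with cdf_values[i] >= p.
    return int(support[np.searchsorted(cdf_values, p, side="left")])

probs = [0.05, 0.25, 0.5, 0.75, 0.95]
by_hand = [discrete_quantile(p) for p in probs]
by_scipy = binom.ppf(probs, n=10, p=0.5).astype(int).tolist()

# The CDF at each quantile meets or exceeds the requested probability.
cdf_at_quantiles = binom.cdf(by_hand, n=10, p=0.5)

print(by_hand, by_scipy, cdf_at_quantiles)
```

Because the CDF is a step function, applying the CDF to a quantile generally overshoots \(p\), whereas applying the quantile function to the CDF of a support point returns that point.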
Examples
Continuous Case
r_cdf_continuous <- pchisq(r_sample_continuous, df = 10)
# Quantile function in R
r_quantiles_continuous <- qchisq(p = r_cdf_continuous, df = 10)
# Verify
all.equal(r_quantiles_continuous, r_sample_continuous)
[1] TRUE
py_cdf_continuous = chi2.cdf(py_sample_continuous, df=10)
# Quantile function in Python
py_quantiles_continuous = chi2.ppf(py_cdf_continuous, df=10)
# Verify
np.allclose(py_quantiles_continuous, py_sample_continuous)
True
Discrete Case
r_cdf_discrete <- pbinom(r_sample_discrete, size = 10, prob = 0.5)
# Quantile function in R
r_quantiles_discrete <- qbinom(p = r_cdf_discrete, size = 10, prob = 0.5)
# Verify
all.equal(r_quantiles_discrete, r_sample_discrete)
[1] TRUE
py_cdf_discrete = binom.cdf(py_sample_discrete, n=10, p=0.5)
# Quantile function in Python
py_quantiles_discrete = binom.ppf(py_cdf_discrete, n=10, p=0.5)
# Verify
np.allclose(py_quantiles_discrete, py_sample_discrete)
True
Comments
The quantile function allows us to find thresholds corresponding to specific probabilities.
This is useful in statistical analyses, such as determining critical values.
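As an example of determining a critical value, the sketch below finds the upper 5% point of a chi-squared distribution with 10 degrees of freedom. The significance level `alpha = 0.05` is an illustrative assumption, not a value taken from the text.

```python
from scipy.stats import chi2

alpha = 0.05  # illustrative significance level

# Critical value: the point with probability 1 - alpha at or below it.
critical_value = chi2.ppf(1 - alpha, df=10)

# Round trip: the upper-tail probability beyond the critical value is alpha.
upper_tail = 1 - chi2.cdf(critical_value, df=10)

print(round(critical_value, 3), upper_tail)
```

A test statistic larger than this critical value would fall in the upper 5% tail of the reference distribution.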
Moments
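Moments provide important characteristics of a probability distribution, such as its location, spread, and shape. The \(k\)-th moment of \(X\) about a point \(c\) is defined as:
\[\begin{align*}
\mu_k(c) = \operatorname{E}\left[(X - c)^k\right]
\end{align*}\]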
Raw Moments: When \(c = 0\), \(\mu_k' = \mu_k(0)\).
Central Moments: When \(c = \mu\), the mean of \(X\), \(\mu_k = \mu_k(\mu)\).
Standardized Moments: \(\mu_k^* = \dfrac{\mu_k}{\sigma^k}\), where \(\sigma\) is the standard deviation.
Comments:
The first raw moment (mean) measures central tendency.
The second central moment (variance) measures dispersion.
The third and fourth standardized moments (skewness and kurtosis) describe the shape of the distribution.
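The three kinds of moments can be computed from their definitions and checked against SciPy. The sketch below works with sample analogues (averages over simulated draws) rather than population expectations. `moment_about` is a hypothetical helper, and the seed and the chi-squared (df = 10) sample are illustrative assumptions.

```python
import numpy as np
from scipy.stats import kurtosis, moment, skew

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
sample = rng.chisquare(df=10, size=1000)

def moment_about(x, k, c=0.0):
    """Sample k-th moment of x about the point c (illustrative helper)."""
    return np.mean((x - c) ** k)

mean = moment_about(sample, 1)              # first raw moment
variance = moment_about(sample, 2, c=mean)  # second central moment
sigma = np.sqrt(variance)

# Standardized moments: central moments divided by sigma^k.
skewness = moment_about(sample, 3, c=mean) / sigma**3
kurt = moment_about(sample, 4, c=mean) / sigma**4

# SciPy equivalents (biased estimators, matching the plain averages above).
checks = (
    np.isclose(variance, moment(sample, 2)),
    np.isclose(skewness, skew(sample)),
    np.isclose(kurt, kurtosis(sample, fisher=False)),
)
print(mean, variance, skewness, kurt, checks)
```

Note that `kurtosis` in SciPy reports excess kurtosis by default, so `fisher=False` is needed to match the fourth standardized moment.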
Computing Moments
Continuous Case
# First four central moments in R
central_moments_continuous_r <- map_dbl(1:4, ~ moment(r_sample_continuous, order = .x, central = TRUE))
names(central_moments_continuous_r) <- paste0("Moment_", 1:4)
central_moments_continuous_r
# First four central moments in Python
central_moments_continuous_py = [moment(r.r_sample_continuous, order) for order in range(1, 5)]
central_moments_continuous_py_dict = {f"Moment_{i+1}": m for i, m in enumerate(central_moments_continuous_py)}
central_moments_continuous_py_dict
Discrete Case
# First four central moments in R
central_moments_discrete_r <- map_dbl(1:4, ~ moment(r_sample_discrete, order = .x, central = TRUE))
names(central_moments_discrete_r) <- paste0("Moment_", 1:4)
central_moments_discrete_r
# First four central moments in Python
central_moments_discrete_py = [moment(r.r_sample_discrete, order) for order in range(1, 5)]
central_moments_discrete_py_dict = {f"Moment_{i+1}": m for i, m in enumerate(central_moments_discrete_py)}
central_moments_discrete_py_dict