import warnings
from datetime import datetime, timedelta
from typing import List, Optional, Tuple, Union

import cvxpy as cp
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
import pypfopt as ppo
import yfinance as yf
from plotly.subplots import make_subplots
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from sklearn.covariance import ledoit_wolf
All-ETF Portfolio
The underlying investment vehicles for this portfolio optimization are Exchange-Traded Funds (ETFs), a popular choice offering non-institutional investors diversification, liquidity, and cost-effectiveness. For this analysis, we will focus on the following equity ETFs, with an emphasis on technology, financial services, real estate, utilities, and consumer staples.
ETF Selection
Technology ETFs
iShares U.S. Technology ETF (IYW): This ETF provides exposure to U.S. companies involved in technology, including hardware, software, and IT services.
Financial ETFs
iShares U.S. Financials ETF (IYF): This ETF provides exposure to U.S. companies in the financial services sector, including banks, investment firms, and insurance companies.
Real Estate ETFs
iShares U.S. Real Estate ETF (IYR): This ETF tracks the Dow Jones U.S. Real Estate Capped Index, which measures the performance of the real estate sector of the U.S. equity market, as defined by the index provider.
Consumer Staples ETFs
iShares U.S. Consumer Staples ETF (IYK): This ETF provides exposure to U.S. companies in the consumer staples sector, including food, beverage, and household products.
Other sectors and categories can be explored by utilizing resources like the ETF Database Categories on Yahoo Finance. One of the most crucial aspects of building a portfolio is aligning it with our personal convictions about the sectors and industries we are investing in; it’s about having a deep, long-term belief in the growth potential and sustainability of those industries.
For more insights into the pros and cons of an all-ETF portfolio, consider reading this article from Charles Schwab: 3 Ways to Build an All-ETF Portfolio.
Daily Price Time Series
The daily price time series can be downloaded from Yahoo Finance using the yfinance package:
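A minimal sketch of the download step, mirroring the tickers and the five-year lookback used by the backtesting script later in this post (the preamble imports above are assumed):

```python
# Sector ETFs used throughout this analysis (same tickers as in the backtesting script)
tickers = ["IYW", "VGT", "IYR", "IYF", "XLU", "IYK"]

# Five years of daily adjusted closing prices
start_date = (datetime.now() - timedelta(days=365 * 5)).strftime("%Y-%m-%d")
daily_prices = yf.download(", ".join(tickers), start=start_date)["Adj Close"]

# Daily simple returns derived from the prices
daily_returns = daily_prices.pct_change().dropna()
```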
From the daily prices, the daily returns are computed and annualized for each asset using the geometric mean:
\[\begin{align*}
\bar{r}_i = \left( \prod_{t=1}^{T_i} \left( 1 + r_{i,t} \right) \right)^{\frac{\text{frequency}}{T_i}} - 1
\end{align*}\]
where:
\(r_{i,t}\) is the daily return for asset \(i\) on day \(t\)
\(T_i\) is the total number of trading days observed for asset \(i\) over the sampling period
\(\text{frequency}\) is the number of time periods in a year (typically \(252\) for daily returns)
This formula computes the geometric mean of daily returns for each asset \(i\), and annualizes the returns by raising the product of returns to the power of the ratio \(\frac{\text{frequency}}{T_i}\), where \(T_i\) accounts for the (potentially) varying number of observed trading days for each asset.
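A sketch of this computation, assuming the daily_returns frame from the download step (annualized_mean_returns is the name used by the plotting code below):

```python
frequency = 252  # Trading periods per year for daily returns

# T_i: number of observed trading days per asset (missing values excluded)
observed_days = daily_returns.count()

# Geometric mean of daily returns, annualized per asset
annualized_mean_returns = (1 + daily_returns).prod() ** (frequency / observed_days) - 1
```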
conditional_colors = ["green" if value > 0 else "red" for value in annualized_mean_returns]

fig, ax = plt.subplots()
bars = ax.bar(
    annualized_mean_returns.index,
    annualized_mean_returns * 100,  # Convert to percentage
    color=conditional_colors,
)
for bar in bars:
    height = bar.get_height()
    ax.text(
        bar.get_x() + bar.get_width() / 2,
        height + np.sign(height) * 0.1,  # Add spacing to the value
        f"{height:.2f}%",
        ha="center",
        va="bottom" if height > 0 else "top",  # Adjust placement for negative values
        fontsize=10,
    )
title_text = (
    f"Annualized Mean Returns (%)\n"
    f"Data from {start_date} to {daily_prices.index[-1].strftime('%Y-%m-%d')}"
)
ax.set_title(title_text, fontsize=14)
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
Shrinkage Covariance Matrix
An essential component of portfolio optimization is the estimator of the covariance matrix, which quantifies the relationship between the daily returns series. While the sample covariance matrix is a standard, unbiased estimator, it often contains significant estimation errors in practice.
To address this issue, Ledoit and Wolf (2004) propose a shrinkage approach, which combines the sample covariance matrix with a more structured and stable estimator, such as the constant correlation model. The shrinkage process pulls extreme covariance values in noisy data towards a central target, reducing estimation error and enhancing the stability of the covariance matrix.
Shrinkage Target
Let the returns for asset \(i\) on day \(t\) be denoted by \(X_{i, t}\), where \(i = 1, \ldots, N\) indexes the assets (\(N\) being the number of assets) and \(t = 1, \ldots, T\) indexes the trading days over the sample period. Let the following matrices be defined:
\(\Sigma\) represents the true population covariance matrix, where the element \(\sigma_{i j}\) represents the covariance between asset \(i\) and asset \(j\).
\(S\) represents the sample covariance matrix, where the element \(s_{i j}\) represents the sample covariance between asset \(i\) and asset \(j\).
Population and Sample Correlations
The population correlation \(\varrho_{ij}\) between asset \(i\) and asset \(j\) is given by:
\[\begin{align*}
\varrho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\, \sigma_{jj}}}
\end{align*}\]
and the sample correlation \(r_{ij}\) is defined analogously from the sample covariances:
\[\begin{align*}
r_{ij} = \frac{s_{ij}}{\sqrt{s_{ii}\, s_{jj}}}
\end{align*}\]
From \(\varrho_{ij}\) and \(r_{ij}\), the average population correlation \(\bar{\varrho}\) and the average sample correlation \(\bar{r}\) are computed as follows:
\[\begin{align*}
\bar{\varrho} = \frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \varrho_{ij},
\qquad
\bar{r} = \frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} r_{ij}
\end{align*}\]
The sample constant correlation matrix \(F\) keeps the sample variances on the diagonal and replaces every pairwise correlation with the average sample correlation:
\[\begin{align*}
f_{ii} = s_{ii}, \qquad f_{ij} = \bar{r} \sqrt{s_{ii}\, s_{jj}} \quad (i \neq j)
\end{align*}\]
It serves as the shrinkage target for the covariance matrix estimation. It represents a balance between structured stability (constant correlations) and data-driven flexibility (sample covariances), reducing estimation error and improving the robustness of any downstream portfolio optimization procedures.
Note: The reference paper highlights that the constant correlation model may be unsuitable when assets belong to different classes, such as stocks and bonds. However, since all the ETFs selected here share the same underlying asset class (equities), the constant correlation model is a reasonable and appropriate choice for the shrinkage target.
Formally, let \(\hat{\Sigma}\) represent the shrinkage covariance matrix, obtained as a convex combination of the shrinkage target \(F\) and the sample covariance matrix \(S\):
\[\begin{align*}
\hat{\Sigma} = \hat{\delta} F + (1 - \hat{\delta}) S
\end{align*}\]
where \(\hat{\delta} \in [0, 1]\) is the estimated optimal shrinkage intensity. The element \(\hat{\sigma}_{ij}\) is the covariance between assets \(i\) and \(j\), and the matrix \(\hat{\Sigma}\) is of size \(N \times N\).
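The backtesting script later in this post obtains the shrunk covariance matrix with sklearn.covariance.ledoit_wolf; a minimal sketch of that call, assuming the daily_returns frame from above (the variable name shrinkage_covariance_matrix is illustrative):

```python
# Ledoit-Wolf shrunk covariance of the daily returns, labelled with the tickers
shrinkage_covariance_matrix = pd.DataFrame(
    ledoit_wolf(daily_returns, assume_centered=False)[0],
    index=daily_returns.columns,
    columns=daily_returns.columns,
)
```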
To convert this covariance matrix to the correlation matrix \(\hat{R}\), we normalize each element by dividing the covariance between assets \(i\) and \(j\) by the square root of the product of the variances of assets \(i\) and \(j\):
\[\begin{align*}
\hat{R}_{ij} = \frac{\hat{\sigma}_{ij}}{\sqrt{\hat{\sigma}_{ii}\, \hat{\sigma}_{jj}}}
\end{align*}\]
To compute the values \(\hat{R}_{ij}\) efficiently, define \(D\) as a diagonal matrix that contains the inverse of the square roots of the diagonal elements of \(\hat{\Sigma}\) (i.e., the variances of the assets), so that the full correlation matrix is obtained in a single matrix product:
\[\begin{align*}
D = \operatorname{diag}\left(\hat{\sigma}_{11}^{-1/2}, \ldots, \hat{\sigma}_{NN}^{-1/2}\right), \qquad \hat{R} = D\, \hat{\Sigma}\, D
\end{align*}\]
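A sketch of this normalization, assuming the shrinkage covariance matrix from the previous step (shrinkage_correlation_matrix is the name used by the heatmap code below):

```python
# D = diag(Sigma_hat)^(-1/2): inverse square roots of the asset variances
D = np.diag(1.0 / np.sqrt(np.diag(shrinkage_covariance_matrix)))

# R_hat = D @ Sigma_hat @ D, wrapped back into a labelled DataFrame
shrinkage_correlation_matrix = pd.DataFrame(
    D @ shrinkage_covariance_matrix.values @ D,
    index=shrinkage_covariance_matrix.columns,
    columns=shrinkage_covariance_matrix.columns,
)
```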
fig, ax = plt.subplots()
cax = ax.matshow(shrinkage_correlation_matrix, cmap="coolwarm")
ax.xaxis.set_ticks_position("bottom")
ax.set_xticks(np.arange(len(shrinkage_correlation_matrix.columns)))
ax.set_yticks(np.arange(len(shrinkage_correlation_matrix.columns)))
ax.set_xticklabels(shrinkage_correlation_matrix.columns)
ax.set_yticklabels(shrinkage_correlation_matrix.columns)
for i in range(shrinkage_correlation_matrix.shape[0]):
    for j in range(shrinkage_correlation_matrix.shape[1]):
        # Ensure text can be easily seen
        value = shrinkage_correlation_matrix.iloc[i, j]
        color = "white" if value < 0.5 else "black"
        ax.text(j, i, f"{value:.2f}", ha="center", va="center", color=color)
plt.title("Shrinkage Correlation Matrix")
plt.tight_layout()
plt.show()
Hierarchical Risk Parity (HRP)
The HRP algorithm consists of three key stages:
Clustering: Assets are grouped based on their correlations using hierarchical clustering methods like single, average, or complete linkage.
Quasi-Diagonalization: The covariance matrix is reordered based on the clustering structure, concentrating covariances around the diagonal.
Recursive Bisection: Weights are assigned recursively across clusters in a top-down manner.
The following sections will walk through a numerical example of the HRP algorithm using daily returns data from the previous section.
Clustering
Instead of treating assets independently, HRP groups similar assets based on their correlations. The goal is to cluster assets that are more closely related to each other.
Distance Matrix
The distance between assets is computed as a function of their correlations, with lower correlations indicating greater distances. First, convert the correlation matrix into a distance matrix:
\[\begin{align*}
d_{ij} = \sqrt{\frac{1 - \hat{R}_{ij}}{2}}
\end{align*}\]
so that perfectly correlated assets are at distance \(0\) and perfectly anti-correlated assets are at distance \(1\).
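A one-line sketch, assuming the shrinkage_correlation_matrix computed earlier:

```python
# d_ij = sqrt((1 - R_ij) / 2): 0 for perfectly correlated pairs, 1 for perfectly anti-correlated pairs
distance_matrix = np.sqrt((1 - shrinkage_correlation_matrix) / 2)
```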
Transform the distance matrix into a condensed form. The condensed form contains only the upper triangular part of the distance matrix (excluding the diagonal), which is what hierarchical clustering requires:
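A sketch using scipy's squareform on the distance matrix above:

```python
# Condensed (upper-triangular, diagonal excluded) form expected by scipy's linkage
condensed_distances = squareform(distance_matrix.values, checks=False)
```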
A hierarchical clustering algorithm is applied to the condensed distance matrix to group pairs of assets into clusters. The linkage function from scipy is used to perform the clustering.
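A sketch of the call, using single linkage as in the original HRP paper (the other linkage methods are evaluated in the bootstrapping section):

```python
# Hierarchical clustering on the condensed distances; linkage_matrix is used throughout the next steps
linkage_matrix = linkage(condensed_distances, method="single")
```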
Note: An important hyperparameter in hierarchical clustering is the linkage method, which determines how the distances between newly formed clusters are calculated. The “appropriate” linkage method depends on the data characteristics, and evaluation mechanisms can help determine the choice. This will be done in the next section.
The linkage_matrix contains the steps of hierarchical clustering and is structured as follows:
| Column | Description |
|--------|-------------|
| 1st | Index of the first cluster being merged. If the value is less than the number of original items, it refers to an original item (i.e., a ticker). Otherwise, it refers to a cluster formed in a previous step. |
| 2nd | Index of the second cluster being merged. Follows the same logic as the first column. |
| 3rd | Distance (or dissimilarity) between the two clusters being merged. |
| 4th | Number of original items in the newly formed cluster after this merge. |
The hierarchical clustering merges the original items step by step, forming new clusters that are indexed starting from 6 (the first new cluster). Cluster IDs start at the number of original items because the six original items occupy the zero-based indices 0 through 5.
Step 1: Tickers 3 and 4 are merged to form Cluster 6 (2 items).
Step 2: Tickers 0 and 2 are merged to form Cluster 7 (2 items).
Step 3: Tickers 1 and 5 are merged to form Cluster 8 (2 items).
Step 4: Clusters 7 (containing tickers 0, 2) and 8 (containing tickers 1, 5) are merged to form Cluster 9 (4 items).
Step 5: Cluster 6 (containing tickers 3, 4) and Cluster 9 are merged to form Cluster 10 (6 items).
In this process, the distance column shows the dissimilarity between the merged clusters. The number of original items reflects how many tickers are included in the newly formed cluster. The output can be visualized via a dendrogram:
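A sketch of the dendrogram plot, assuming the linkage_matrix and correlation matrix from above:

```python
fig, ax = plt.subplots()
dendrogram(linkage_matrix, labels=shrinkage_correlation_matrix.columns.tolist(), ax=ax)
ax.set_title("Hierarchical Clustering Dendrogram")
ax.set_ylabel("Distance")
plt.tight_layout()
plt.show()
```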
Quasi-Diagonalization
After clustering, the next step in the HRP algorithm is quasi-diagonalization. The goal of this step is to reorder the rows and columns of the correlation matrix based on the hierarchical clustering results obtained in the previous step. This seriation step allows similar investments to be grouped together, and dissimilar ones to be placed farther apart.
The following implementation is taken from the original paper with annotated print statements for better understanding:
def get_quasi_diag(linkage_matrix: np.ndarray) -> List[int]:
    """
    Sort clustered items by distance based on the linkage matrix.

    Parameters
    ----------
    linkage_matrix : np.ndarray
        The linkage matrix from hierarchical clustering.

    Returns
    -------
    List[int]
        The sorted indices of the clustered items.
    """
    linkage_matrix = linkage_matrix.astype(int)
    sorted_index = pd.Series([linkage_matrix[-1, 0], linkage_matrix[-1, 1]])
    print(f"Initial sorted index (last two merged clusters): {sorted_index.tolist()}")
    num_items = linkage_matrix[-1, 3]
    print(f"Number of original items: {num_items}")

    # Recursively sort clusters by distance until no clusters remain
    while sorted_index.max() >= num_items:
        print("\n--- New iteration ---\n")
        sorted_index.index = range(0, sorted_index.shape[0] * 2, 2)
        print(f"Expanded sorted index to make space: {sorted_index.tolist()}")
        cluster_indices = sorted_index[sorted_index >= num_items]
        i = cluster_indices.index  # Indices in sorted_index where clusters need to be expanded
        j = cluster_indices.values - num_items  # Row numbers in linkage_matrix to retrieve cluster members
        print("Identified clusters to expand:")
        print(f"  Indices in sorted_index (i): {i.tolist()}")
        print(f"  Corresponding linkage matrix rows (j): {j.tolist()}")

        # Replace clusters with their first constituent
        for idx, row_num in zip(i, j):
            row_dict = {
                'merged_1': linkage_matrix[row_num, 0],
                'merged_2': linkage_matrix[row_num, 1],
                'dist': linkage_matrix[row_num, 2],
                'number_of_original': linkage_matrix[row_num, 3],
            }
            original_cluster = sorted_index[idx]
            first_constituent = linkage_matrix[row_num, 0]
            print(f"  Row {row_num} of the linkage matrix: {row_dict}")
            print(f"  Cluster {original_cluster} was formed by merging {linkage_matrix[row_num, 0]} and {linkage_matrix[row_num, 1]}")
            print(f"  Replacing cluster {original_cluster} with its first constituent: {first_constituent}")
            sorted_index[idx] = first_constituent
        print(f"After replacing clusters with their first constituents: {sorted_index.tolist()}")

        # Append the second constituent of the clusters
        second_constituents = pd.Series(linkage_matrix[j, 1], index=i + 1)
        sorted_index = pd.concat([sorted_index, second_constituents])
        for idx, second_constituent in second_constituents.items():
            print(f"  Appending second constituent {second_constituent}")
        print(f"After appending the second constituents: {sorted_index.tolist()}")

        # Re-sort the index and reset indexing to keep proper order
        sorted_index = sorted_index.sort_index()
        sorted_index.index = range(sorted_index.shape[0])
        print(f"Final sorted index after re-sorting: {sorted_index.tolist()}")

    return sorted_index.tolist()
The quasi-diagonalization process is applied as follows:
sorted_index = get_quasi_diag(linkage_matrix)
Initial sorted index (last two merged clusters): [6, 9]
Number of original items: 6
--- New iteration ---
Expanded sorted index to make space: [6, 9]
Identified clusters to expand:
Indices in sorted_index (i): [0, 2]
Corresponding linkage matrix rows (j): [0, 3]
Row 0 of the linkage matrix: {'merged_1': 3, 'merged_2': 4, 'dist': 0, 'number_of_original': 2}
Cluster 6 was formed by merging 3 and 4
Replacing cluster 6 with its first constituent: 3
Row 3 of the linkage matrix: {'merged_1': 7, 'merged_2': 8, 'dist': 0, 'number_of_original': 4}
Cluster 9 was formed by merging 7 and 8
Replacing cluster 9 with its first constituent: 7
After replacing clusters with their first constituents: [3, 7]
Appending second constituent 4
Appending second constituent 8
After appending the second constituents: [3, 7, 4, 8]
Final sorted index after re-sorting: [3, 4, 7, 8]
--- New iteration ---
Expanded sorted index to make space: [3, 4, 7, 8]
Identified clusters to expand:
Indices in sorted_index (i): [4, 6]
Corresponding linkage matrix rows (j): [1, 2]
Row 1 of the linkage matrix: {'merged_1': 0, 'merged_2': 2, 'dist': 0, 'number_of_original': 2}
Cluster 7 was formed by merging 0 and 2
Replacing cluster 7 with its first constituent: 0
Row 2 of the linkage matrix: {'merged_1': 1, 'merged_2': 5, 'dist': 0, 'number_of_original': 2}
Cluster 8 was formed by merging 1 and 5
Replacing cluster 8 with its first constituent: 1
After replacing clusters with their first constituents: [3, 4, 0, 1]
Appending second constituent 2
Appending second constituent 5
After appending the second constituents: [3, 4, 0, 1, 2, 5]
Final sorted index after re-sorting: [3, 4, 0, 2, 1, 5]
sorted_index
[3, 4, 0, 2, 1, 5]
The sorted correlation matrix is obtained by rearranging the rows and columns based on the sorted index:
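A sketch of the reordering, assuming the sorted_index from the previous step:

```python
# Reorder rows and columns of the correlation matrix with the quasi-diagonal order
sorted_correlation_matrix = shrinkage_correlation_matrix.iloc[sorted_index, sorted_index]
```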
The quasi-diagonalized correlation matrix shows a more structured pattern, with higher correlations concentrated around the main diagonal. Store the ordered tickers for future reference:
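A sketch, with ordered_tickers as an illustrative name:

```python
# Tickers in quasi-diagonal order, reused by the recursive bisection step
ordered_tickers = shrinkage_correlation_matrix.columns[sorted_index]
```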
In the final step, recursive bisection, the hierarchical tree (built from hierarchical clustering) is split into two branches repeatedly until each branch contains only one asset. It can be shown that splitting weights inversely proportional to the variance of assets in each branch achieves optimal allocation when the correlation matrix is diagonal (see Appendix A.2. of the original paper).
Initialization:
Set the list of items \(L = \{L_0\}\), where \(L_0 = \{n\}_{n=1, \ldots, N}\). This list represents the starting set that includes every asset to be allocated.
Assign a unit weight to all items: \(w_n = 1\), for all \(n = 1, \ldots, N\).
Stopping Condition: If \(\left| L_i \right| = 1\) for all \(L_i \in L\), stop the algorithm.
Bisection (for each \(L_i \in L\) such that \(\left| L_i \right| > 1\)):
Bisect the set: Divide \(L_i\) into two subsets \(L_i^{(1)} \cup L_i^{(2)} = L_i\), where \[\begin{align*}
\left| L_i^{(1)} \right| = \operatorname{int}\left(\frac{1}{2} \left|L_i\right| \right)
\end{align*}\] The order is preserved in the bisection.
Variance of subsets: For each subset \(L_i^{(j)}\) (where \(j = 1, 2\)), calculate the variance \(\widetilde{V}_i^{(j)}\) using the quadratic form: \[\begin{align*}
\widetilde{V}_i^{(j)} \equiv \left( \widetilde{w}_i^{(j)} \right)^{\top} V_i^{(j)}\, \widetilde{w}_i^{(j)}
\end{align*}\] where \(V_i^{(j)}\) is the covariance matrix of the assets in \(L_i^{(j)}\), and \(\widetilde{w}_i^{(j)}\) is defined as: \[\begin{align*}
\widetilde{w}_i^{(j)} = \operatorname{diag}\left( V_i^{(j)} \right)^{-1} \frac{1}{\operatorname{tr}\left( \operatorname{diag}\left( V_i^{(j)} \right)^{-1} \right)}
\end{align*}\] Here, \(\operatorname{diag}[\cdot]\) denotes the diagonal matrix and \(\operatorname{tr}[\cdot]\) is the trace operator.
Compute the split factor \(\alpha_i = 1 - \dfrac{\widetilde{V}_i^{(1)}}{\widetilde{V}_i^{(1)} + \widetilde{V}_i^{(2)}}\), so that the less risky subset receives the larger share, and rescale allocations for all \(n \in L_i^{(1)}\) by a factor of \(\alpha_i\).
Rescale allocations for all \(n \in L_i^{(2)}\) by a factor of \(1 - \alpha_i\).
Repeat: Loop back to step 2 until the stopping condition is met.
Inverse Variance Portfolio
def get_ivp(cov: pd.DataFrame) -> np.ndarray:
    """
    Compute the inverse variance portfolio (IVP) for a given covariance matrix.

    Parameters
    ----------
    cov : pd.DataFrame
        Covariance matrix of the cluster.

    Returns
    -------
    np.ndarray
        Inverse variance portfolio weights for the cluster.
    """
    ivp = 1.0 / np.diag(cov)
    ivp /= ivp.sum()  # Normalize weights
    print(f"\n    IVP weights for cluster: {ivp}")
    return ivp
Cluster Variance
def get_cluster_var(cov: pd.DataFrame, cluster_items: List[str]) -> float:
    """
    Compute the variance for a given cluster.

    Parameters
    ----------
    cov : pd.DataFrame
        Covariance matrix for the assets.
    cluster_items : List[str]
        List of tickers in the cluster.

    Returns
    -------
    float
        The variance of the cluster.
    """
    # Subset of the covariance matrix containing only the cluster assets
    cov_slice = cov.loc[cluster_items, cluster_items]

    # Define the left space (4 spaces in this case)
    hspace = "    "
    formatted_cov_slice = str(cov_slice).replace("\n", "\n" + hspace)
    print(f"\n    Covariance matrix slice for cluster {cluster_items}:\n{hspace}{formatted_cov_slice}")

    # Calculate inverse variance portfolio (IVP) weights for the cluster and reshape to column vector (n x 1)
    weights = get_ivp(cov_slice).reshape(-1, 1)

    # Compute cluster variance as w.T * cov_slice * w, which is [1 x n] * [n x n] * [n x 1] = [1 x 1]
    cluster_var = np.dot(np.dot(weights.T, cov_slice), weights)[0, 0]
    print(f"\n    Cluster variance for {cluster_items}: {cluster_var}\n")
    return cluster_var
Recursive Bisection
def recursive_bisection(cov: pd.DataFrame, ordered_tickers: Union[pd.Index, List[str]]) -> pd.Series:
    """
    Compute the Hierarchical Risk Parity (HRP) portfolio allocation using recursive bisection.

    Parameters
    ----------
    cov : pd.DataFrame
        Covariance matrix for the assets.
    ordered_tickers : Union[pd.Index, List[str]]
        List of tickers ordered by hierarchical clustering.

    Returns
    -------
    pd.Series
        Portfolio weights for each asset.
    """
    # Initialize equal weights for all assets
    weights = pd.Series(1.0, index=ordered_tickers)
    hspace = "    "  # Define the left space (4 spaces in this case)
    formatted_weights = str(weights).replace("\n", "\n" + hspace)
    print(f"Initial weights:\n{hspace}{formatted_weights}\n")

    # Initialize the cluster as the full set of ordered tickers
    cluster_items = [ordered_tickers]
    step = 0

    while len(cluster_items) > 0:
        step += 1
        print(f"\n{'=' * 20} Step {step}: Recursive Bisection {'=' * 20}\n")
        print(f"Cluster items at Step {step}: {[list(cluster) for cluster in cluster_items]}")

        # Split each cluster into two halves only if the cluster contains more than one asset
        cluster_items = [
            i[j:k]
            for i in cluster_items
            for j, k in ((0, len(i) // 2), (len(i) // 2, len(i)))
            if len(i) > 1
        ]

        # For each pair of clusters, calculate variance and adjust weights
        for i in range(0, len(cluster_items), 2):
            first_cluster = cluster_items[i]
            second_cluster = cluster_items[i + 1]
            print(f"\n    ---- Processing clusters ----")
            print(f"    First cluster: {first_cluster}")
            print(f"    Second cluster: {second_cluster}")

            # Compute variance for each cluster
            first_var = get_cluster_var(cov, list(first_cluster))
            second_var = get_cluster_var(cov, list(second_cluster))

            # Compute the weighting factor alpha
            alpha = 1 - first_var / (first_var + second_var)
            print(f"    Weighting factor alpha for {list(first_cluster)}: {alpha}")
            print(f"    Weighting factor 1 - alpha for {list(second_cluster)}: {1 - alpha}")

            # Adjust weights for each cluster
            weights[first_cluster] *= alpha
            weights[second_cluster] *= 1 - alpha
            formatted_weights = str(weights).replace("\n", "\n" + hspace)
            print(f"    Updated weights after adjustment:\n{hspace}{formatted_weights}\n")

    return weights
Portfolio Allocation
Applying the recursive bisection algorithm to the covariance matrix:
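A sketch of the call, assuming the shrinkage covariance matrix and the ordered_tickers defined earlier (variable names are illustrative):

```python
# HRP weights from the annotated recursive bisection implementation above
hrp_weights = recursive_bisection(cov=shrinkage_covariance_matrix, ordered_tickers=ordered_tickers)
```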
The PyPortfolioOpt package provides a convenient implementation of the HRP algorithm. We can compare the results obtained from our manual implementation with the package's implementation to verify the correctness of our approach, as sketched below.
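A sketch of such a comparison with PyPortfolioOpt's HRPOpt, assuming the daily_returns, shrinkage covariance matrix, and hrp_weights from above:

```python
from pypfopt.hierarchical_portfolio import HRPOpt

hrp = HRPOpt(returns=daily_returns, cov_matrix=shrinkage_covariance_matrix)
hrp.optimize(linkage_method="single")  # Same linkage method as the manual example
package_weights = pd.Series(hrp.clean_weights())

# Side-by-side comparison with the manual recursive bisection weights
print(pd.DataFrame({"manual": hrp_weights, "pypfopt": package_weights}))
```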
To assess the out-of-sample performance of the HRP optimal allocation strategy, a backtest can be employed. Backtesting is a key technique for evaluating an investment strategy's performance using historical data. It allows for an assessment of how reliably the strategy could be replicated in real-world scenarios, while also providing valuable insights into its associated risk-return profile. To this end, Joubert et al. (2024) outline three primary backtesting methods:
| Backtesting Method | Description | Advantages | Limitations |
|---|---|---|---|
| Walk-Forward Backtest | Historical data is split into multiple segments. The strategy is trained on one segment and tested on the next. | Easy to implement. | Tests only a single path, risking overfitting. Past performance may not predict future results (e.g., Nvidia's growth since COVID). |
| Resampling | A statistical technique that creates multiple samples from historical data to assess performance, yielding a distribution of performance metrics. Bootstrap: draws samples with replacement from historical data, enabling performance estimation over many iterations. Cross-Validation: splits the data into multiple training and testing sets, providing multiple assessments of the strategy's performance. | Provides more robust analysis by generating multiple scenarios. | Does not create new data, but resamples existing data, which may miss unique future events. |
| Monte Carlo Simulation | Generates synthetic data with properties similar to historical data using an understanding of the data generation process. | Yields performance estimates indicative of future outcomes (assuming correct and stable data processes). | Requires accurate modeling of the data generation process, which is difficult for volatile financial markets. |
For evaluating the HRP portfolio allocation strategy and optimizing the hierarchical clustering linkage method, we will use the time-series bootstrap tools from the arch package.
Time-Series Bootstrapping
The arch.bootstrap.StationaryBootstrap class implements the method developed by Politis and Romano (1994). This method addresses the limitations of traditional bootstrapping techniques when applied to time-series data, aiming to preserve crucial time-dependent structures of the data while generating resampled datasets.
Let \(X_1, X_2, \dots, X_T\) represent the original time series. To generate a resampled time series using block bootstrapping, define a block \(B_{i,b}\) as consisting of \(b\) consecutive observations starting at \(X_i\):
\[\begin{align*}
B_{i,b} = \left( X_i, X_{i+1}, \dots, X_{i+b-1} \right)
\end{align*}\]
If \(i + b - 1 > T\) (i.e., the block extends beyond the end of the time series), periodic boundary conditions are applied. This means that the series wraps around, treating the first observations in the series as if they come after the last. In other words, for any index \(j > T\), the observation is taken as:
\[\begin{align*}
X_j = X_{j \bmod T}
\end{align*}\]
The expression \(j \mod T\) calculates the remainder when \(j\) is divided by \(T\), effectively cycling the indices back to the beginning of the series. For example, if \(T = 10\), and \(X_{12}\) is requested, it would be equivalent to sampling \(X_2\).
Block Selection
The starting indices of the blocks are determined by a sequence of independent and identically distributed (iid) random variables \(I_1, I_2, \dots\), drawn uniformly from \(\{1, \dots, T\}\). The lengths of the blocks are governed by another sequence of iid random variables \(L_1, L_2, \dots\), where each \(L_i\) follows a geometric distribution with parameter \(p \in [0,1]\), i.e.,
\[\begin{align*}
P(L_i = m) = (1 - p)^{m-1} p, \quad m = 1, 2, \dots.
\end{align*}\]
The expected block length is therefore \(1/p\); this is the quantity passed as block_size to StationaryBootstrap in the script below.
Resampling
To construct the resampled series \(X_1^*, X_2^*, \dots, X_T^*\):
Select the first block \(B_{I_1, L_1}\) and copy \(L_1\) consecutive observations starting at \(X_{I_1}\) into the resampled series.
Select the next block \(B_{I_2, L_2}\) and append \(L_2\) consecutive observations starting at \(X_{I_2}\), continuing this process.
Repeat the block selection and copying processes until a total of \(T\) observations have been generated.
It is possible for this procedure to result in overlapping blocks when consecutive starting indices \(I_1, I_2, \dots\) are close together, leading to the reuse of some observations. The process terminates once \(T\) observations are resampled, but it can be extended to generate longer series if desired.
Implementation
The following script implements the time-series bootstrapping:
import warnings
from argparse import ArgumentParser
from datetime import datetime, timedelta
from multiprocessing import Pool, cpu_count
from pathlib import Path
from typing import List

import numpy as np
import pandas as pd
import yfinance as yf
from arch.bootstrap import StationaryBootstrap, optimal_block_length
from pypfopt.hierarchical_portfolio import HRPOpt
from sklearn.covariance import ledoit_wolf

# Define the tickers for the assets in the portfolio
tech = np.array(["IYW", "VGT"])
real_estate = np.array(["IYR"])
bank_finance = np.array(["IYF"])
utilities = np.array(["XLU"])
consumer_staples = np.array(["IYK"])
tickers = np.concatenate(
    [tech, real_estate, bank_finance, utilities, consumer_staples], axis=0
)

# Download the daily adjusted closing prices of the assets
start_date = (datetime.now() - timedelta(days=365 * 5)).strftime("%Y-%m-%d")
daily_prices = yf.download(", ".join(tickers), start=start_date)["Adj Close"]

# Calculate the daily returns of the assets
daily_returns = daily_prices.pct_change().dropna()

frequency = 252
# Typical savings account
annual_risk_free_rate = 0.02
daily_risk_free_rate = (1 + annual_risk_free_rate) ** (1 / frequency) - 1

pd.set_option("mode.copy_on_write", True)


def compute_optimal_block_length(returns: pd.DataFrame) -> float:
    """
    Compute the optimal block length for the stationary bootstrap.

    The optimal block length is computed for the squared returns since the
    autocorrelation in the squares tends to be stronger than in the returns
    themselves. This approach helps capture more of the dependence in the
    returns for the bootstrap.

    Parameters
    ----------
    returns : pd.DataFrame
        Daily returns of the assets.

    Returns
    -------
    float
        The average optimal block length for the stationary bootstrap.
    """
    block_lengths = optimal_block_length(returns**2)
    avg_block_length = block_lengths.loc[:, "stationary"].mean()
    return avg_block_length


def compute_shrinkage_covariance(returns: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate the Ledoit-Wolf shrunk covariance matrix of the daily returns.

    Parameters
    ----------
    returns : pd.DataFrame
        Daily returns of the assets.

    Returns
    -------
    pd.DataFrame
        The shrunk covariance matrix of the daily returns.
    """
    shrinkage_cov = pd.DataFrame(
        ledoit_wolf(returns, assume_centered=False)[0],
        index=returns.columns,
        columns=returns.columns,
    )
    return shrinkage_cov


def compute_performance_metrics(
    returns: pd.DataFrame, linkage_method: str
) -> np.ndarray:
    """
    Calculate the expected return, volatility, and Sharpe ratio of the portfolio.

    Parameters
    ----------
    returns : pd.DataFrame
        Daily returns of the assets.
    linkage_method : str
        The method used for hierarchical clustering. Possible values:
        'centroid', 'weighted', 'complete', 'average', 'ward'.

    Returns
    -------
    np.ndarray
        An array containing the annualized expected return, volatility, and Sharpe ratio.
    """
    shrinkage_cov = compute_shrinkage_covariance(returns)
    hrp = HRPOpt(returns=returns, cov_matrix=shrinkage_cov)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        hrp.optimize(linkage_method=linkage_method)
    expected_return, volatility, sharpe_ratio = hrp.portfolio_performance(
        risk_free_rate=daily_risk_free_rate,
        frequency=frequency,
    )
    return np.array([expected_return, volatility, sharpe_ratio])


def bootstrap(seed_value: int, linkage_method: str) -> pd.DataFrame:
    """
    Perform stationary bootstrapping to compute portfolio performance metrics.

    Parameters
    ----------
    seed_value : int
        Seed for the random number generator to ensure reproducibility.
    linkage_method : str
        The method used for hierarchical clustering. Possible values:
        'centroid', 'weighted', 'complete', 'average', 'ward'.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the annualized expected return, volatility,
        Sharpe ratio, and the linkage method used for clustering.
    """
    rs = np.random.default_rng(seed_value)
    avg_block_length = compute_optimal_block_length(daily_returns)
    bs = StationaryBootstrap(
        block_size=avg_block_length,
        returns=daily_returns,
        seed=rs,
    )
    results_array = bs.apply(
        func=compute_performance_metrics,
        reps=5000,
        extra_kwargs={"linkage_method": linkage_method},
    )
    performance_metrics = pd.DataFrame(
        results_array,
        columns=[
            "annualized_expected_return",
            "annualized_volatility",
            "annualized_sharpe_ratio",
        ],
    )
    performance_metrics["linkage_method"] = linkage_method
    return performance_metrics


def main() -> int:
    """
    Main function to run stationary bootstrapping on portfolio returns using
    hierarchical clustering with various linkage methods.

    This function parses command line arguments to specify the output directory,
    performs bootstrapping using parallel processing, and saves the results.

    Returns
    -------
    int
        Returns 0 upon successful completion.
    """
    parser = ArgumentParser()
    parser.add_argument("--output_dir", type=str, default=Path.cwd())
    args, _ = parser.parse_known_args()

    linkage_methods = ["centroid", "weighted", "complete", "average", "ward"]
    n_cores = min(cpu_count() - 1, len(linkage_methods))

    # Generate unique seeds for each process
    seeds = np.random.randint(0, 1e6, size=len(linkage_methods))

    with Pool(n_cores) as pool:
        # Run bootstrap for each linkage method with a unique seed
        simulated_performance: List[pd.DataFrame] = pool.starmap(
            bootstrap, zip(seeds, linkage_methods)
        )

    final_results = pd.concat(simulated_performance, ignore_index=True)
    final_results.to_csv(
        Path(args.output_dir) / "bootstrapped_performance_metrics.csv", index=False
    )
    return 0


if __name__ == "__main__":
    main()
Data Collection:
Downloads daily adjusted closing prices for the specified ETFs
Computes the daily returns time series.
Risk-Free Rate: Converts a \(2\%\) annual risk-free rate to a daily rate so that it matches the frequency of the daily returns (\(252\) trading periods per year).
Block Length Calculation: Computes optimal block length for StationaryBootstrap using squared returns to capture dependencies. See docstring for details.
Ledoit-Wolf Covariance: Estimates the covariance matrix using the Ledoit-Wolf shrinkage method.
Bootstrapping: Runs StationaryBootstrap with 5000 replications for each linkage method to assess portfolio performance.
Performance Metrics: Computes annualized expected return, volatility, and Sharpe ratio from bootstrapped samples.
The script uses parallel processing to run the bootstrap for each linkage method with a unique seed in a separate process. The results are collected and saved to a CSV file for further analysis.
The bootstrapped performance metrics are stored in long format, with the linkage_method column indicating the hierarchical clustering method used for each set of bootstrapped samples. Each bootstrapped sample contains the annualized expected return, volatility, and Sharpe ratio, computed over a resampled set of all \(T\) observations from the original time series.
def plot_performance_metrics(performance_metrics: pd.DataFrame, metric: str) -> None:
    """
    Plot the performance metrics of the Hierarchical Risk Parity (HRP) portfolio.

    Parameters
    ----------
    performance_metrics : pd.DataFrame
        Performance metrics of the HRP portfolio.
    metric : str
        The performance metric to plot.
    """
    metric_title = metric.replace("_", " ").title()
    median_values = performance_metrics.groupby("linkage_method")[metric].median().reset_index()

    fig = px.violin(
        performance_metrics,
        x="linkage_method",
        y=metric,
        color="linkage_method",
        title=f"Bootstrapped {metric_title}s of Hierarchical Risk Parity Portfolios",
        labels={"metric": "Performance Metric", metric: "Value"},
    )
    for i, row in median_values.iterrows():
        value = row[metric]
        if metric in ["annualized_expected_return", "annualized_volatility"]:
            value = value * 100
            formated_value = f"{value:.4f}%"
        else:
            formated_value = f"{value:.4f}"
        fig.add_annotation(
            x=row["linkage_method"],
            y=row[metric],
            text=f"Median:<br>{formated_value}",
            showarrow=True,
            arrowhead=2,
            ax=0,
            ay=-40,  # Offset to place the label above the plot
        )
    fig.update_layout(
        xaxis_title="Linkage Method",
        yaxis_title=metric_title,
        legend_title="Linkage Method",
        legend=dict(
            orientation="h",   # Horizontal legend
            yanchor="bottom",  # Align the legend to the bottom of its position
            y=1.02,            # Place it slightly above the plot
            xanchor="center",  # Center the legend
            x=0.5,             # Place it at the middle horizontally
        ),
        autosize=True,
    )
    fig.show()
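For example, assuming performance_metrics holds the long-format results loaded from the CSV produced by the backtesting script:

```python
performance_metrics = pd.read_csv("bootstrapped_performance_metrics.csv")
plot_performance_metrics(performance_metrics, metric="annualized_sharpe_ratio")
```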
From the bootstrapped distributions, we can calculate the confidence intervals for the performance metrics using the quantiles directly from the bootstrapped samples.
def bootstrapped_confidence_intervals(
    performance_metrics: pd.DataFrame, metric: str, alpha: float = 0.05
) -> pd.DataFrame:
    """
    Calculate the (1 - alpha) confidence intervals for the performance metrics.

    Parameters
    ----------
    performance_metrics : pd.DataFrame
        Performance metrics of the HRP portfolio.
    metric : str
        The performance metric for which to calculate the confidence intervals.
    alpha : float, default 0.05
        The significance level for the confidence interval.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the (1 - alpha) confidence intervals for each performance metric.
    """
    lower_percentile = 100 * (alpha / 2)
    upper_percentile = 100 * (1 - (alpha / 2))

    def compute_ci(series: np.ndarray, is_lower: bool) -> float:
        percentile = lower_percentile if is_lower else upper_percentile
        bound = np.percentile(series, percentile)
        if metric in ["annualized_expected_return", "annualized_volatility"]:
            return bound * 100
        return bound

    confidence_intervals = performance_metrics.groupby("linkage_method")[metric].agg(
        lower_bound=lambda x: compute_ci(x, True),
        upper_bound=lambda x: compute_ci(x, False),
    )
    return confidence_intervals
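An example call for the Sharpe ratio, using the same bootstrapped samples:

```python
bootstrapped_confidence_intervals(performance_metrics, metric="annualized_sharpe_ratio", alpha=0.05)
```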
def plot_investment_growth_with_compounding(
    performance_metrics: pd.DataFrame, initial_investment: float, years: int
) -> None:
    """
    Plot the projected investment growth over a holding period by linkage method,
    using compounding with median annualized expected returns.

    Parameters
    ----------
    performance_metrics : pd.DataFrame
        Performance metrics of the HRP portfolio.
    initial_investment : float
        The initial amount of the investment.
    years : int
        The holding period in years.
    """
    median_values = (
        performance_metrics.groupby("linkage_method")["annualized_expected_return"]
        .median()
        .reset_index()
    )
    growth_data = pd.DataFrame({"year": range(1, years + 1)})

    # Calculate the investment growth for each linkage method for each year, incorporating compounding
    for _, row in median_values.iterrows():
        growth_data[row["linkage_method"]] = initial_investment * (
            (1 + row["annualized_expected_return"]) ** growth_data["year"]
        )

    # Reshape from wide to long format for plotting
    growth_data_melted = growth_data.melt(
        id_vars=["year"], var_name="linkage_method", value_name="investment_value"
    )
    fig = px.line(
        growth_data_melted,
        x="year",
        y="investment_value",
        color="linkage_method",
        title=f"Projected Growth Over {years} Years (Initial Investment: ${initial_investment:,.2f})",
        labels={"investment_value": "Investment Value ($)", "year": "Year"},
    )
    fig.update_layout(
        xaxis_title="Year",
        yaxis_title="Investment Value ($)",
        legend_title="Linkage Method",
        autosize=True,
        legend=dict(
            orientation="h",
            yanchor="bottom",
            y=1.02,
            xanchor="center",
            x=0.5,
        ),
    )
    fig.show()
Plot the projected investment growth over a \(5\)-year period with an initial investment of \(\$10,000\):
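A sketch of the call with the stated amount and horizon:

```python
plot_investment_growth_with_compounding(performance_metrics, initial_investment=10_000, years=5)
```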
def plot_risk_return_profile(performance_metrics: pd.DataFrame) -> None:
    """
    Plot the risk-return profile (annualized expected return vs. annualized volatility)
    for each linkage method.

    Parameters
    ----------
    performance_metrics : pd.DataFrame
        Performance metrics of the HRP portfolio.
    """
    linkage_methods = performance_metrics["linkage_method"].unique()
    n_cols = 2  # Number of columns in the grid
    n_rows = (len(linkage_methods) + 1) // n_cols

    fig = make_subplots(
        rows=n_rows,
        cols=n_cols,
        subplot_titles=linkage_methods,
        shared_xaxes=False,
        shared_yaxes=True,
        x_title="Annualized Volatility (Risk)",
        y_title="Annualized Expected Return",
    )

    # Find the min and max values of the Sharpe ratio for the color scale
    sharpe_min = performance_metrics["annualized_sharpe_ratio"].min()
    sharpe_max = performance_metrics["annualized_sharpe_ratio"].max()

    for i, method in enumerate(linkage_methods):
        row = i // n_cols + 1
        col = i % n_cols + 1
        filtered_data = performance_metrics[performance_metrics["linkage_method"] == method]

        # Add scatter plot trace with shared color scale
        fig.add_trace(
            go.Scatter(
                x=filtered_data["annualized_volatility"],
                y=filtered_data["annualized_expected_return"],
                mode="markers",
                name=method.capitalize(),
                hovertext=filtered_data["annualized_sharpe_ratio"],
                marker=dict(
                    size=8,
                    color=filtered_data["annualized_sharpe_ratio"],  # Color based on Sharpe ratio
                    cmin=sharpe_min,  # Set the same min for all subplots
                    cmax=sharpe_max,  # Set the same max for all subplots
                    colorscale="Viridis",
                    showscale=(i == 0),  # Only show the scale on the first subplot
                    colorbar=dict(title="Sharpe Ratio") if i == 0 else None,  # Add color bar on the first subplot
                ),
            ),
            row=row,
            col=col,
        )

    fig.update_layout(
        title="Risk-Return Profile by Linkage Method",
        showlegend=False,
        autosize=True,
        height=1000,
    )
    fig.show()
plot_risk_return_profile(performance_metrics)
Optimal Linkage Method
We will select the linkage method that yields the highest median annualized Sharpe ratio from the bootstrapped distribution.
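A sketch of that selection from the bootstrapped samples:

```python
median_sharpe = performance_metrics.groupby("linkage_method")["annualized_sharpe_ratio"].median()
optimal_linkage_method = median_sharpe.idxmax()
print(f"Optimal linkage method by median Sharpe ratio: {optimal_linkage_method}")
```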
The optimal portfolio weights derived from the HRP algorithm can now be used to construct the final asset allocation. The pypfopt.discrete_allocation.DiscreteAllocation class provides several methods to convert continuous weights into discrete share quantities. One such method treats this conversion as an integer programming problem formulated as follows:
\[\begin{align*}
\underset{x \in \mathbb{Z}^n}{\operatorname{minimise}} & \quad r + \| w T - x \odot p \|_1 \\
\text{subject to} & \quad r + x \cdot p = T
\end{align*}\]
Where:
\(T \in \mathbb{R}\) is the total dollar amount to be allocated.
\(p \in \mathbb{R}^n\) is the vector of the latest prices for each asset.
\(w \in \mathbb{R}^n\) is the vector of target portfolio weights for each asset.
\(x \in \mathbb{Z}^n\) is the integer allocation, representing the number of units of each asset to be purchased.
\(r \in \mathbb{R}\) is the remaining unallocated dollar amount, defined as \(r = T - x \cdot p\).
\(\| \cdot \|_1\) represents the \(L_1\) norm, i.e., the sum of the absolute values of the elements in the vector.
The goal is to minimize the remaining unallocated value \(r\) while also minimizing the difference between the target dollar allocation, \(wT\), and the actual allocation, \(x \odot p\).
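A sketch of the conversion using DiscreteAllocation's lp_portfolio method, which solves the integer program above; the HRP weights, latest prices, and the $10,000 budget are illustrative assumptions:

```python
from pypfopt.discrete_allocation import DiscreteAllocation

latest_prices = daily_prices.iloc[-1]  # Most recent price per asset
da = DiscreteAllocation(dict(hrp_weights), latest_prices, total_portfolio_value=10_000)

allocation, leftover = da.lp_portfolio()
print(f"Shares to buy: {allocation}")
print(f"Funds remaining: ${leftover:.2f}")
```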
De Prado, M. L. (2016). Building diversified portfolios that outperform out-of-sample. SSRN Electronic Journal. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2708678
Ledoit, O., & Wolf, M. (2003). Honey, I shrunk the sample covariance matrix. UPF Economics and Business Working Paper No. 691.
Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), 1303-1313. https://doi.org/10.1080/01621459.1994.10476870
Joubert, J., Sestovic, D., Barziy, I., Distaso, W., & Lopez de Prado, M. (2024). The three types of backtests. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.4897573
De Prado, M. L. (2018). Advances in Financial Machine Learning. John Wiley & Sons. https://www.wiley.com/en-fr/Advances+in+Financial+Machine+Learning-p-9781119482086