K-Nearest Neighbors (KNN) Classifier: Step-by-Step Python Implementation from Scratch
In this post, we will implement the K-Nearest Neighbors (KNN) algorithm from scratch in Python. KNN is a simple, yet powerful non-parametric algorithm commonly used for both classification and regression tasks. Unlike many machine learning algorithms, KNN doesn’t require an explicit training phase, as it falls under the category of instance-based learning. The core assumption behind KNN is that similar data points are typically located near one another in the feature space.
For classification tasks, KNN operates by measuring the distance between the new data point and all the points in the training set. It then identifies the K closest points, and the majority class among these neighbors is used to predict the class of the new point. The proximity of each neighbor can be weighted by its distance to ensure closer neighbors have a greater influence on the prediction.
In this post, we’ll break down how to implement this intuitive and flexible algorithm from scratch in Python.
Model Formulation
In K-Nearest Neighbors (KNN) classification, a query point (i.e., the one requiring a prediction) is classified based on the majority class of its nearest neighbors in the training data. Given a query point \(\mathbf{x}_q\), the classifier finds the \(k\) nearest points in the training set based on a chosen distance metric.
Distance Metrics
The choice of distance metric is an important hyperparameter in KNN, as it determines how the algorithm measures the similarity between data points. The most common distance metrics include Euclidean distance and cosine similarity.
For a query point $ _q $ and a training point $ _i $, the Euclidean distance is defined as:
\[\begin{align*} d(\mathbf{x}_q, \mathbf{x}_i) = \sqrt{\sum_{j=1}^{n} (x_{q,j} - x_{i,j})^2} \end{align*}\]
where \(n\) is the dimensionality of the data points. Cosine similarity measures the angle between the vectors of two points, calculated as:
\[\begin{align*} \text{sim}(\mathbf{x}_q, \mathbf{x}_i) = \frac{\mathbf{x}_q \cdot \mathbf{x}_i}{\|\mathbf{x}_q\|_{2} \|\mathbf{x}_i\|_{2}} \end{align*}\]
where \(\|\cdot\|_{2}\) denotes the \(L_2\) norm of a vector.
The predicted label for a query point is determined by the majority vote from its \(k\)-nearest neighbors. The vote can be either uniform or weighted based on the distance between the query point and its neighbors.
Note: the concept of distance and similarity is related but not identical. In KNN, we typically use distance metrics to measure the similarity between data points. The cosine similarity, which is between 0 and 1, can be converted to a distance metric by subtracting it from 1.
Decision Rule
Let \(N_k(\mathbf{x}_q)\) denote the set of \(k\)-nearest neighbors of \(\mathbf{x}_q\). The predicted class \(\hat{y}_q\) is:
\[\begin{align*} \hat{y}_q = \arg \max_{y} \sum_{\mathbf{x}_i \in N_k(\mathbf{x}_q)} w_i \cdot \mathbb{I}[y_i = y] \end{align*}\]
where $ w_i $ is a weight applied to neighbor $ _i $ and $ [y_i = y] $ is an indicator function that equals 1 if the neighbor’s class is $ y $, and 0 otherwise.
Python Implementation
Cosine Similarity
Let’s start by implementing the cosine similarity function in Python. We’ll use the numpy
library for vectorized operations.
def custom_cosine_similarity(sample: np.ndarray, X: np.ndarray) -> np.ndarray:
"""
Compute the cosine similarity between a sample and all samples in a dataset.
Parameters
----------
sample : np.ndarray
A single sample as a row vector with dimensions (1, n_features).
X : np.ndarray
The training data matrix with dimensions (n_samples, n_features).
Returns
-------
np.ndarray
The cosine similarities between the query sample and all samples in the training data matrix.
"""
# Compute the dot products between the sample and all samples in the dataset
# The result is a row vector with n_samples elements
dot_products = np.dot(X, sample.T)
# Compute the L2 norm of the sample and all samples in the dataset
# This is a scalar value
sample_norm = np.linalg.norm(sample)
# The axis=1 argument ensures that the norm is computed along the feature axis
# X_norm is a row vector with n_samples elements
X_norm = np.linalg.norm(X, axis=1)
# The broadcasting of sample_norm and X_norm is handled automatically
cosine_similarities = dot_products / (sample_norm * X_norm + 1e-8)
return cosine_similarities
Note the use of \(axis=1\) in the np.linalg.norm()
function to compute the norm along the feature axis. This ensures that the norm is calculated for each sample in the training set. By default, np.linalg.norm()
computes the Frobenius norm of the entire matrix (a 2D array).
Eucledian Distance
Next, we’ll implement the Euclidean distance function in Python. Again, we compute the distances between a single sample and all samples in the training set.
def custom_euclidean_distances(sample: np.ndarray, X: np.ndarray) -> np.ndarray:
"""
Compute the Euclidean distance between a sample and all samples in a dataset.
Parameters
----------
sample : np.ndarray
A single sample as a row vector with dimensions (1, n_features).
X : np.ndarray
The training data matrix with dimensions (n_samples, n_features).
Returns
-------
np.ndarray
The Euclidean distances between the query sample and all samples in the training data matrix.
"""
# X is (n_samples, n_features) and sample is (1, n_features), and so the subtraction is broadcasted along the first dimension (i.e., each row of X is subtracted by sample)
# The result is a matrix with dimensions (n_samples, n_features)
differences = X - sample
squared_differences = differences**2
# Sum across the feature axis to obtain the sum of squared differences for each sample
# The result is a row vector with n_samples elements
squared_distances = np.sum(squared_differences, axis=1)
euclidean_distances = np.sqrt(squared_distances)
return euclidean_distances
Once again, we utilize broadcasting and element-wise calculations:
Subtracting the sample from all samples in the training set is broadcasted along the first dimension (i.e., the rows of \(X\)). This results in a matrix with the same dimensions as \(X\), whose rows are the element-wise differences between the sample and each training sample.
Squaring the differences is performed element-wise, resulting in a matrix with the same dimensions as \(X\).
Summing across the feature axis (axis=1) calculates the sum of squared differences for each sample in the training set, resulting in a row vector with \(n_{\text{training samples}}\) elements.
Finally, taking the square root of each sum of squared differences yields the Euclidean distances between the query sample and all training samples.
KNN Classifier
Initialization
The initializer logic is straightforward as we only need to store the hyperparameters.
class KNNClassifier(BaseEstimator, ClassifierMixin):
"""
K-Nearest Neighbors classifier.
"""
def __init__(
self,
n_neighbors: int = 5,
weights: str = "distance",
metric: str = "cosine",
metric_params: Dict[str, Any] = None,
) -> None:
"""
Initialize the KNN classifier instance.
Parameters
----------
n_neighbors : int
The number of nearest neighbors to consider.
weights : str
The weight function to use in prediction. Possible values are 'uniform' and 'distance'.
- 'uniform' : All points in each neighborhood are weighted equally.
- 'distance' : Weight points by the inverse of their distance.
metric : str
The similarity function to use. Possible values are 'cosine' and 'euclidean'.
metric_params : Dict[str, Any]
Additional keyword arguments for the metric function. See the documentation of scipy.spatial.distance and the metrics
listed in distance_metrics for valid metric values.
Returns
-------
None
"""
self.n_neighbors = n_neighbors
self.weights = weights
self.metric = metric
self.metric_params = metric_params
Fit Method
As discussed earlier, KNN is a lazy learner and doesn’t learn any model during the training phase. Instead, it stores the training data for later use. The fit()
method in the KNN classifier simply stores the training data and the target labels. Some housekeeping tasks are also performed, such as validation logic (only needed if we want sklearn compatibility), encoding the target labels, and storing the unique classes seen during training.
def fit(
self,
X: Union[csr_matrix, np.ndarray, pd.DataFrame],
y: Union[np.ndarray, pd.Series, MutableSequence[Union[str, float]]],
) -> Self:
"""
The KNN classifier does not learn a model. Instead, it simply stores the training data. It is a 'lazy' learner.
Parameters
----------
X : Union[csr_matrix, np.ndarray, pd.DataFrame]
The input data.
y : Union[np.ndarray, pd.Series, MutableSequence[Union[str, float]]]
The target labels.
Returns
-------
Self
The instance itself.
"""
X, y = check_X_y(
X=X,
y=y,
accept_sparse="csr",
ensure_2d=True,
allow_nd=False,
y_numeric=False,
)
check_classification_targets(y)
# Attributes that have been estimated from the data must end with a trailing underscore
self.X_train_ = X
self.label_encoder_ = LabelEncoder()
self.y_train_ = self.label_encoder_.fit_transform(y)
self.classes_ = self.label_encoder_.classes_ # Unique classes seen during fit
self.n_features_in_ = X.shape[1] # Number of features seen during fit
return self
Predict Method
The predict()
method is where the actual prediction interface is implemented. For each query sample, the method calculates the distances to all training samples and identifies the \(k\) nearest neighbors. The class label of the query sample is then determined by the majority class among the neighbors.
def predict(self, X: Union[csr_matrix, np.ndarray, pd.DataFrame]) -> np.ndarray:
"""
Predict the labels for the test set using the KNN algorithm.
Parameters
----------
X : Union[csr_matrix, np.ndarray, pd.DataFrame]
The input data.
Returns
-------
np.ndarray
The predicted labels.
"""
check_is_fitted(
self,
["X_train_", "y_train_", "label_encoder_", "classes_", "n_features_in_"],
)
X = check_array(X, accept_sparse="csr")
# Handle edge case: only one class in training data
if len(self.classes_) == 1:
# The classes_ attribute was already in the format before encoding, so we can return it directly
return np.full(shape=X.shape[0], fill_value=self.classes_[0])
predictions = np.zeros(X.shape[0], dtype=int)
for i, test_sample in enumerate(tqdm(X)):
# Calculate distances to all training samples
test_sample_reshaped = test_sample.reshape(1, -1)
distances = self._compute_distances(test_sample_reshaped)
# Identify the indices of the k nearest neighbors
k_indices = np.argsort(distances, axis=-1)[: self.n_neighbors]
# Predict class based on nearest neighbors
prediction = self._predict_from_neighbors(k_indices, distances)
predictions[i] = prediction
return self.label_encoder_.inverse_transform(predictions)
The internal methods _compute_distances()
and _predict_from_neighbors()
are used to calculate the distances between the query sample and all training samples, and to predict the class label based on the nearest neighbors, respectively.
Note: In this implementation, we substituded the educlidean distance and cosine similarity functions with implementations from the from sklearn.metrics.pairwise module. On the one-hand, it is useful to implement these functions from scratch to understand the underlying concepts. On the other hand, the scikit-learn implementation is optimized and more efficient, so using them directly here speeds up the runtime quite a bit.
def _compute_distances(
self, test_sample: Union[np.ndarray, csr_matrix]
) -> np.ndarray:
"""
Compute the distances of the test sample to all training samples.
Parameters
----------
test_sample : Union[np.ndarray, csr_matrix]
A single test sample.
Returns
-------
distances : np.ndarray
Distances from the test sample to each training sample.
"""
metric_params = {} if self.metric_params is None else self.metric_params
if self.metric == "cosine":
return (
1
- cosine_similarity(
X=test_sample, Y=self.X_train_, **metric_params
).flatten()
)
elif self.metric == "euclidean":
return euclidean_distances(
X=test_sample, Y=self.X_train_, **metric_params
).flatten()
else:
raise NotImplementedError(
f"Unsupported similarity function '{self.metric}'"
)
Similarly, the _predict_from_neighbors()
method calculates the class label based on the majority class among the \(k\) nearest neighbors.
def _predict_from_neighbors(
self, k_indices: np.ndarray, distances: np.ndarray
) -> int:
"""
Predict the class of a sample based on its nearest neighbors.
Parameters
----------
k_indices : np.ndarray
Indices of the k nearest neighbors.
distances : np.ndarray
Distances from the sample to each neighbor.
Returns
-------
prediction : int
Predicted class label.
"""
k_nearest_labels = self.y_train_[k_indices]
if self.weights == "distance":
# Inverse distance weighting, considering a small epsilon to avoid division by zero
inv_distances = 1 / (distances[k_indices] + 1e-8)
weighted_votes = np.bincount(
k_nearest_labels, weights=inv_distances, minlength=len(self.classes_)
)
elif self.weights == "uniform":
# Uniform weights (each neighbor contributes equally)
weighted_votes = np.bincount(k_nearest_labels, minlength=len(self.classes_))
else:
raise ValueError(f"Unsupported weight function '{self.weights}'")
return np.argmax(weighted_votes)
Scikit-Learn Compatibility (Optional)
An additional feature of the implementation above is that it is compatible with the scikit-learn API. This means that the KNN classifier can be used in conjunction with scikit-learn’s utilities, such as GridSearchCV and cross_val_score.
To this end, we needed to take care of the following at the very least:
- Inherit from scikit-learn’s
BaseEstimator
andClassifierMixin
classes.
class KNNClassifier(BaseEstimator, ClassifierMixin):
- Implement the
fit()
andpredict()
methods with the correct validation logic and interface.
The
X
andy
inputs should be validated usingcheck_X_y()
andcheck_classification_targets()
during thefit()
method.The
X
input should be validated usingcheck_array()
during thepredict()
method, which must also check if the model has been fitted usingcheck_is_fitted()
.
For more details on scikit-learn’s API conventions, refer to the documentation. Needless to say, while rolling our implemenations can be a great learning experience, it is often more efficient to use the optimized implementations provided by scikit-learn for production-level code.
Putting It All Together
The following script implements the entire KNN classifier, the cosine similarity and Euclidean distance functions, and runs a test for its compatibility with scikit-learn using the check_estimator()
function.
from typing import Any, Dict, MutableSequence, Self, Union
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.multiclass import check_classification_targets
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y
from tqdm import tqdm
def custom_cosine_similarity(sample: np.ndarray, X: np.ndarray) -> np.ndarray:
# Compute the dot products between the sample and all samples in the dataset
# The result is a row vector with n_samples elements
dot_products = np.dot(X, sample.T)
# Compute the L2 norm of the sample and all samples in the dataset
# This is a scalar value
sample_norm = np.linalg.norm(sample)
# The axis=1 argument ensures that the norm is computed along the feature axis
# X_norm is a row vector with n_samples elements
X_norm = np.linalg.norm(X, axis=1)
# The broadcasting of sample_norm and X_norm is handled automatically
cosine_similarities = dot_products / (sample_norm * X_norm + 1e-8)
return cosine_similarities
def custom_euclidean_distances(sample: np.ndarray, X: np.ndarray) -> np.ndarray:
# X is (n_samples, n_features) and sample is (1, n_features), and so the subtraction is broadcasted along the first dimension (i.e., each row of X is subtracted by sample)
# The result is a matrix with dimensions (n_samples, n_features)
differences = X - sample
squared_differences = differences**2
# Sum across the feature axis to obtain the sum of squared differences for each sample
# The result is a row vector with n_samples elements
squared_distances = np.sum(squared_differences, axis=1)
euclidean_distances = np.sqrt(squared_distances)
return euclidean_distances
class KNNClassifier(BaseEstimator, ClassifierMixin):
"""
K-Nearest Neighbors classifier.
"""
def __init__(
self,
n_neighbors: int = 5,
weights: str = "distance",
metric: str = "cosine",
metric_params: Dict[str, Any] = None,
) -> None:
self.n_neighbors = n_neighbors
self.weights = weights
self.metric = metric
self.metric_params = metric_params
def fit(
self,
X: Union[csr_matrix, np.ndarray, pd.DataFrame],
y: Union[np.ndarray, pd.Series, MutableSequence[Union[str, float]]],
) -> Self:
X, y = check_X_y(
X=X,
y=y,
accept_sparse="csr",
ensure_2d=True,
allow_nd=False,
y_numeric=False,
)
check_classification_targets(y)
# Attributes that have been estimated from the data must end with a trailing underscore
self.X_train_ = X
self.label_encoder_ = LabelEncoder()
self.y_train_ = self.label_encoder_.fit_transform(y)
self.classes_ = self.label_encoder_.classes_ # Unique classes seen during fit
self.n_features_in_ = X.shape[1] # Number of features seen during fit
return self
def predict(self, X: Union[csr_matrix, np.ndarray, pd.DataFrame]) -> np.ndarray:
check_is_fitted(
self,
["X_train_", "y_train_", "label_encoder_", "classes_", "n_features_in_"],
)
X = check_array(X, accept_sparse="csr")
# Handle edge case: only one class in training data
if len(self.classes_) == 1:
# The classes_ attribute was already in the format before encoding, so we can return it directly
return np.full(shape=X.shape[0], fill_value=self.classes_[0])
predictions = np.zeros(X.shape[0], dtype=int)
for i, test_sample in enumerate(tqdm(X)):
# Calculate distances to all training samples
test_sample_reshaped = test_sample.reshape(1, -1)
distances = self._compute_distances(test_sample_reshaped)
# Identify the indices of the k nearest neighbors
k_indices = np.argsort(distances, axis=-1)[: self.n_neighbors]
# Predict class based on nearest neighbors
prediction = self._predict_from_neighbors(k_indices, distances)
predictions[i] = prediction
return self.label_encoder_.inverse_transform(predictions)
def _compute_distances(
self, test_sample: Union[np.ndarray, csr_matrix]
) -> np.ndarray:
metric_params = {} if self.metric_params is None else self.metric_params
if self.metric == "cosine":
return (
1
- cosine_similarity(
X=test_sample, Y=self.X_train_, **metric_params
).flatten()
)
elif self.metric == "euclidean":
return euclidean_distances(
X=test_sample, Y=self.X_train_, **metric_params
).flatten()
else:
raise NotImplementedError(
f"Unsupported similarity function '{self.metric}'"
)
def _predict_from_neighbors(
self, k_indices: np.ndarray, distances: np.ndarray
) -> int:
k_nearest_labels = self.y_train_[k_indices]
if self.weights == "distance":
# Inverse distance weighting, considering a small epsilon to avoid division by zero
inv_distances = 1 / (distances[k_indices] + 1e-8)
weighted_votes = np.bincount(
k_nearest_labels, weights=inv_distances, minlength=len(self.classes_)
)
elif self.weights == "uniform":
# Uniform weights (each neighbor contributes equally)
weighted_votes = np.bincount(k_nearest_labels, minlength=len(self.classes_))
else:
raise ValueError(f"Unsupported weight function '{self.weights}'")
return np.argmax(weighted_votes)
if __name__ == "__main__":
from sklearn.utils.estimator_checks import check_estimator
knn_classifier = KNNClassifier()
check_estimator(knn_classifier)
print(
"\nKNNClassifier class is a valid estimator compatible with scikit-learn API!"
)
News Group Multiclass Classification
To demonstrate the KNN classifier in action, we’ll use the 20 Newsgroups dataset available in scikit-learn. This dataset consists of approximately \(18,846\) newsgroup documents across 20 different newsgroups. We will use a subset of the dataset containing three categories: soc.religion.christian
, comp.graphics
, and sci.med
.
The goal is to train a KNN classifier to predict the category of a newsgroup post based on its content, which is vectorized using the TfidfVectorizer.
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
Global Settings
rs = np.random.RandomState(12)
Data
# Subset of 20 newsgroups dataset
categories = ["soc.religion.christian", "comp.graphics", "sci.med"]
twenty_train = fetch_20newsgroups(
subset="train", categories=categories, shuffle=True, random_state=rs
)
len(twenty_train.data)
1777
The features are the raw text data, which we will vectorize using the TfidfVectorizer
. We will also reduce the dimensionality of the feature space using the TruncatedSVD
transformer.
for i in range(3):
print("\n".join(twenty_train.data[i].split("\n")[:3]), "\n")
From: kene@acs.bu.edu (Kenneth Engel)
Subject: Re: Atheists and Hell
Organization: Boston University, Boston, MA, USA
From: ab@nova.cc.purdue.edu (Allen B)
Subject: Re: TIFF: philosophical significance of 42
Organization: Purdue University
From: grante@aquarius.rosemount.com (Grant Edwards)
Subject: Re: Krillean Photography
Reply-To: grante@aquarius.rosemount.com (Grant Edwards)
The target distribution for this toy dataset is as balanced as it can be compared to any real-world dataset.
target_names_map = {i: name for i, name in enumerate(twenty_train.target_names)}
fig, ax = plt.subplots(figsize=(10, 7))
ax.hist(twenty_train.target)
ax.set_xticks(range(len(target_names_map)))
ax.set_xticklabels(target_names_map.values(), rotation=45)
# Add the count at the top of the bar
for i, count in enumerate(np.bincount(twenty_train.target)):
ax.text(i, count, str(count), ha="center", va="bottom");
ax.set_xlabel("Category")
ax.set_ylabel("Count")
plt.title("Target Distribution")
plt.show()
Custom KNN Classifier
Pipeline
For the pipeline, we carry out the following preprocessing steps:
Vectorize the text data using the
TfidfVectorizer
.Reduce the dimensionality of the feature space using the
TruncatedSVD
. This is also known as Latent Semantic Analysis (LSA). This step can be a hyperparameter in the pipeline.Train the KNN classifier.
pipeline_knn = Pipeline(
[
(
"tfidf",
TfidfVectorizer(stop_words="english", max_features=None, analyzer="word"),
),
("lsa", TruncatedSVD()),
("knn", KNNClassifier()),
]
)
pipeline_knn
Pipeline(steps=[('tfidf', TfidfVectorizer(stop_words='english')), ('lsa', TruncatedSVD()), ('knn', KNNClassifier())])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('tfidf', TfidfVectorizer(stop_words='english')), ('lsa', TruncatedSVD()), ('knn', KNNClassifier())])
TfidfVectorizer(stop_words='english')
TruncatedSVD()
KNNClassifier()
Grid Search
We will perform a grid search to find the optimal hyperparameters for the KNN classifier. The hyperparameters we will tune are:
max_features
: The maximum number of features to consider in theTfidfVectorizer
.lsa
: The number of components to keep in theTruncatedSVD
transformer. We will also include the option to use the original feature space without LSA using thepassthrough
option. See the documentation for more details.n_neighbors
: The number of neighbors to consider in the KNN classifier.weights
: The weight function to use in prediction. Possible values areuniform
anddistance
.metric
: The similarity function to use. Possible values areeuclidean
andcosine
.
Since the class distribution is balanced, accuracy
is a valid metric to optimize.
cv_knn = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=rs)
param_grid_knn = {
"tfidf__max_features": [2048, 4096],
"lsa": [TruncatedSVD(n_components=n) for n in [100, 150]]
+ ["passthrough"], # Include 'passthrough' to test with or without LSA
"knn__n_neighbors": [5, 10],
"knn__weights": ["uniform", "distance"],
"knn__metric": ["euclidean", "cosine"],
}
grid_search_knn = GridSearchCV(
estimator=pipeline_knn,
param_grid=param_grid_knn,
scoring="accuracy",
cv=cv_knn,
n_jobs=-1,
verbose=2,
refit=True,
error_score="raise",
)
grid_search_knn.fit(twenty_train.data, twenty_train.target)
GridSearchCV(cv=StratifiedShuffleSplit(n_splits=3, random_state=RandomState(MT19937) at 0x13432E440, test_size=0.2, train_size=None), error_score='raise', estimator=Pipeline(steps=[('tfidf', TfidfVectorizer(stop_words='english')), ('lsa', TruncatedSVD()), ('knn', KNNClassifier())]), n_jobs=-1, param_grid={'knn__metric': ['euclidean', 'cosine'], 'knn__n_neighbors': [5, 10], 'knn__weights': ['uniform', 'distance'], 'lsa': [TruncatedSVD(n_components=100), TruncatedSVD(n_components=150), 'passthrough'], 'tfidf__max_features': [2048, 4096]}, scoring='accuracy', verbose=2)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=StratifiedShuffleSplit(n_splits=3, random_state=RandomState(MT19937) at 0x13432E440, test_size=0.2, train_size=None), error_score='raise', estimator=Pipeline(steps=[('tfidf', TfidfVectorizer(stop_words='english')), ('lsa', TruncatedSVD()), ('knn', KNNClassifier())]), n_jobs=-1, param_grid={'knn__metric': ['euclidean', 'cosine'], 'knn__n_neighbors': [5, 10], 'knn__weights': ['uniform', 'distance'], 'lsa': [TruncatedSVD(n_components=100), TruncatedSVD(n_components=150), 'passthrough'], 'tfidf__max_features': [2048, 4096]}, scoring='accuracy', verbose=2)
Pipeline(steps=[('tfidf', TfidfVectorizer(max_features=4096, stop_words='english')), ('lsa', 'passthrough'), ('knn', KNNClassifier(n_neighbors=10))])
TfidfVectorizer(max_features=4096, stop_words='english')
passthrough
KNNClassifier(n_neighbors=10)
Grid Search Results
grid_search_results = pd.DataFrame(grid_search_knn.cv_results_).sort_values("rank_test_score").head(5)
mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_knn__metric | param_knn__n_neighbors | param_knn__weights | param_lsa | param_tfidf__max_features | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
47 | 0.2953193 | 0.0077828 | 0.4920407 | 0.0108658 | cosine | 10 | distance | passthrough | 4096 | cosine , 10 , distance , passthrough, 4096 | 0.9438202 | 0.9578652 | 0.9550562 | 0.9522472 | 0.0060681 | 1 |
23 | 0.3585903 | 0.0821416 | 0.4670664 | 0.0075176 | euclidean | 10 | distance | passthrough | 4096 | euclidean , 10 , distance , passthrough, 4096 | 0.9410112 | 0.9578652 | 0.9494382 | 0.9494382 | 0.0068806 | 2 |
35 | 0.3102987 | 0.0427768 | 0.7355963 | 0.0443589 | cosine | 5 | distance | passthrough | 4096 | cosine , 5 , distance , passthrough, 4096 | 0.9325843 | 0.9466292 | 0.9606742 | 0.9466292 | 0.0114676 | 3 |
41 | 0.3300992 | 0.0152099 | 0.6898276 | 0.0764058 | cosine | 10 | uniform | passthrough | 4096 | cosine , 10 , uniform , passthrough, 4096 | 0.9325843 | 0.9550562 | 0.9466292 | 0.9447566 | 0.0092692 | 4 |
11 | 0.3165097 | 0.0270617 | 0.4828021 | 0.0226678 | euclidean | 5 | distance | passthrough | 4096 | euclidean , 5 , distance , passthrough, 4096 | 0.9297753 | 0.9438202 | 0.9606742 | 0.9447566 | 0.0126318 | 4 |
Prediction on Test Data
twenty_test = fetch_20newsgroups(
subset="test", categories=categories, shuffle=True, random_state=rs
)
test_data = twenty_test.data
predicted_targets = grid_search_knn.predict(test_data)
0it [00:00, ?it/s]121it [00:00, 1208.09it/s]244it [00:00, 1220.88it/s]368it [00:00, 1227.74it/s]492it [00:00, 1230.63it/s]616it [00:00, 1232.31it/s]740it [00:00, 1233.13it/s]864it [00:00, 1232.81it/s]988it [00:00, 1220.74it/s]1111it [00:00, 1205.15it/s]1183it [00:00, 1218.92it/s]
predicted_targets
array([0, 0, 2, ..., 2, 1, 2])
Reports
We can generate a classification report to evaluate the performance of the KNN classifier on the test data.
classification_reports = classification_report(
y_true=twenty_test.target,
y_pred=predicted_targets,
target_names=target_names_map.values(),
output_dict=True,
)
classification_reports_data = pd.DataFrame(classification_reports).T
precision | recall | f1-score | support | |
---|---|---|---|---|
comp.graphics | 0.8564593 | 0.9203085 | 0.8872367 | 389.0000000 |
sci.med | 0.9140401 | 0.8055556 | 0.8563758 | 396.0000000 |
soc.religion.christian | 0.8870192 | 0.9271357 | 0.9066339 | 398.0000000 |
accuracy | 0.8841927 | 0.8841927 | 0.8841927 | 0.8841927 |
macro avg | 0.8858396 | 0.8843332 | 0.8834155 | 1183.0000000 |
weighted avg | 0.8860154 | 0.8841927 | 0.8834321 | 1183.0000000 |
In addition, we can visualize the confusion matrix to better understand the classification performance.
ConfusionMatrixDisplay.from_predictions(
y_true=twenty_test.target,
y_pred=predicted_targets,
display_labels=target_names_map.values(),
)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x1445a8ed0>
Reference Implementation from Scikit-Learn
We can compare the results of our custom KNN classifier with the scikit-learn implementation.
pipeline_knn_sklearn = Pipeline(
[
(
"tfidf",
TfidfVectorizer(stop_words="english", max_features=None, analyzer="word"),
),
("knn", KNeighborsClassifier()),
]
)
param_grid_knn_sklearn = {
"tfidf__max_features": [2048, 4096],
"knn__n_neighbors": [5, 10],
"knn__weights": ["uniform", "distance"],
"knn__metric": ["euclidean", "cosine"],
}
grid_search_knn_sklearn = GridSearchCV(
estimator=pipeline_knn_sklearn,
param_grid=param_grid_knn_sklearn,
scoring="accuracy",
cv=cv_knn,
n_jobs=-1,
verbose=2,
refit=True,
error_score="raise",
)
grid_search_knn_sklearn.fit(twenty_train.data, twenty_train.target)
GridSearchCV(cv=StratifiedShuffleSplit(n_splits=3, random_state=RandomState(MT19937) at 0x13432E440, test_size=0.2, train_size=None), error_score='raise', estimator=Pipeline(steps=[('tfidf', TfidfVectorizer(stop_words='english')), ('knn', KNeighborsClassifier())]), n_jobs=-1, param_grid={'knn__metric': ['euclidean', 'cosine'], 'knn__n_neighbors': [5, 10], 'knn__weights': ['uniform', 'distance'], 'tfidf__max_features': [2048, 4096]}, scoring='accuracy', verbose=2)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=StratifiedShuffleSplit(n_splits=3, random_state=RandomState(MT19937) at 0x13432E440, test_size=0.2, train_size=None), error_score='raise', estimator=Pipeline(steps=[('tfidf', TfidfVectorizer(stop_words='english')), ('knn', KNeighborsClassifier())]), n_jobs=-1, param_grid={'knn__metric': ['euclidean', 'cosine'], 'knn__n_neighbors': [5, 10], 'knn__weights': ['uniform', 'distance'], 'tfidf__max_features': [2048, 4096]}, scoring='accuracy', verbose=2)
Pipeline(steps=[('tfidf', TfidfVectorizer(max_features=4096, stop_words='english')), ('knn', KNeighborsClassifier(metric='cosine', n_neighbors=10, weights='distance'))])
TfidfVectorizer(max_features=4096, stop_words='english')
KNeighborsClassifier(metric='cosine', n_neighbors=10, weights='distance')
grid_search_results_sklearn = pd.DataFrame(grid_search_knn_sklearn.cv_results_).sort_values("rank_test_score").head(5)
mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_knn__metric | param_knn__n_neighbors | param_knn__weights | param_tfidf__max_features | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | 0.2018224 | 0.0167407 | 0.0474807 | 0.0009341 | cosine | 10 | distance | 4096 | cosine , 10 , distance, 4096 | 0.9466292 | 0.9353933 | 0.9466292 | 0.9428839 | 0.0052967 | 1 |
7 | 0.1971777 | 0.0042258 | 0.0645568 | 0.0005940 | euclidean | 10 | distance | 4096 | euclidean, 10 , distance , 4096 | 0.9438202 | 0.9297753 | 0.9466292 | 0.9400749 | 0.0073727 | 2 |
3 | 0.2053347 | 0.0061259 | 0.0688587 | 0.0025981 | euclidean | 5 | distance | 4096 | euclidean, 5 , distance , 4096 | 0.9410112 | 0.9353933 | 0.9269663 | 0.9344569 | 0.0057719 | 3 |
11 | 0.2152572 | 0.0075339 | 0.0688361 | 0.0024215 | cosine | 5 | distance | 4096 | cosine , 5 , distance, 4096 | 0.9438202 | 0.9325843 | 0.9269663 | 0.9344569 | 0.0070068 | 3 |
5 | 0.2102303 | 0.0119457 | 0.0737727 | 0.0071226 | euclidean | 10 | uniform | 4096 | euclidean, 10 , uniform , 4096 | 0.9410112 | 0.9269663 | 0.9297753 | 0.9325843 | 0.0060681 | 5 |
predicted_targets_sklearn = grid_search_knn_sklearn.predict(test_data)
classification_reports_sklearn = classification_report(
y_true=twenty_test.target,
y_pred=predicted_targets_sklearn,
target_names=target_names_map.values(),
output_dict=True,
)
classification_reports_sklearn_data = pd.DataFrame(classification_reports_sklearn).T
precision | recall | f1-score | support | |
---|---|---|---|---|
comp.graphics | 0.8564593 | 0.9203085 | 0.8872367 | 389.0000000 |
sci.med | 0.9140401 | 0.8055556 | 0.8563758 | 396.0000000 |
soc.religion.christian | 0.8870192 | 0.9271357 | 0.9066339 | 398.0000000 |
accuracy | 0.8841927 | 0.8841927 | 0.8841927 | 0.8841927 |
macro avg | 0.8858396 | 0.8843332 | 0.8834155 | 1183.0000000 |
weighted avg | 0.8860154 | 0.8841927 | 0.8834321 | 1183.0000000 |
ConfusionMatrixDisplay.from_predictions(
y_true=twenty_test.target,
y_pred=predicted_targets_sklearn,
display_labels=target_names_map.values(),
)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x146502990>
The scikit-learn implementation of the KNN classifier exactly matches the results of our custom implementation, demonstrating the correctness of our implementation.