K-Nearest Neighbors (KNN) Classifier: Step-by-Step Python Implementation from Scratch

Wednesday, 02 Nov 2022 Python

Model Formulation
- Distance Metrics
- Decision Rule
Python Implementation
News Group Multiclass Classification
References

In this post, we will implement the K-Nearest Neighbors (KNN) algorithm from scratch in Python. KNN is a simple, yet powerful non-parametric algorithm commonly used for both classification and regression tasks. Unlike many machine learning algorithms, KNN doesn’t require an explicit training phase, as it falls under the category of instance-based learning. The core assumption behind KNN is that similar data points are typically located near one another in the feature space.

For classification tasks, KNN operates by measuring the distance between the new data point and all the points in the training set. It then identifies the K closest points, and the majority class among these neighbors is used to predict the class of the new point. The proximity of each neighbor can be weighted by its distance to ensure closer neighbors have a greater influence on the prediction.

In this post, we’ll break down how to implement this intuitive and flexible algorithm from scratch in Python.

Model Formulation

In K-Nearest Neighbors (KNN) classification, a query point (i.e., the one requiring a prediction) is classified based on the majority class of its nearest neighbors in the training data. Given a query point $\mathbf{x}_q$, the classifier finds the $k$ nearest points in the training set based on a chosen distance metric.

Distance Metrics

The choice of distance metric is an important hyperparameter in KNN, as it determines how the algorithm measures the similarity between data points. The most common distance metrics include Euclidean distance and cosine similarity.

For a query point $ _q $ and a training point $ _i $, the Euclidean distance is defined as:

\[\begin{align*} d(\mathbf{x}_q, \mathbf{x}_i) = \sqrt{\sum_{j=1}^{n} (x_{q,j} - x_{i,j})^2} \end{align*}\]

where $n$ is the dimensionality of the data points. Cosine similarity measures the angle between the vectors of two points, calculated as:

\[\begin{align*} \text{sim}(\mathbf{x}_q, \mathbf{x}_i) = \frac{\mathbf{x}_q \cdot \mathbf{x}_i}{\|\mathbf{x}_q\|_{2} \|\mathbf{x}_i\|_{2}} \end{align*}\]

where $\|\cdot\|_{2}$ denotes the $L_2$ norm of a vector.

The predicted label for a query point is determined by the majority vote from its $k$-nearest neighbors. The vote can be either uniform or weighted based on the distance between the query point and its neighbors.

Note: the concept of distance and similarity is related but not identical. In KNN, we typically use distance metrics to measure the similarity between data points. The cosine similarity, which is between 0 and 1, can be converted to a distance metric by subtracting it from 1.

Decision Rule

Let $N_k(\mathbf{x}_q)$ denote the set of $k$-nearest neighbors of $\mathbf{x}_q$. The predicted class $\hat{y}_q$ is:

\[\begin{align*} \hat{y}_q = \arg \max_{y} \sum_{\mathbf{x}_i \in N_k(\mathbf{x}_q)} w_i \cdot \mathbb{I}[y_i = y] \end{align*}\]

where $ w_i $ is a weight applied to neighbor $ _i $ and $ [y_i = y] $ is an indicator function that equals 1 if the neighbor’s class is $ y $, and 0 otherwise.

Python Implementation

Cosine Similarity

Let’s start by implementing the cosine similarity function in Python. We’ll use the numpy library for vectorized operations.

def custom_cosine_similarity(sample: np.ndarray, X: np.ndarray) -> np.ndarray:
    """
    Compute the cosine similarity between a sample and all samples in a dataset.

    Parameters
    ----------
    sample : np.ndarray
        A single sample as a row vector with dimensions (1, n_features).
    X : np.ndarray
        The training data matrix with dimensions (n_samples, n_features).

    Returns
    -------
    np.ndarray
        The cosine similarities between the query sample and all samples in the training data matrix.
    """
    # Compute the dot products between the sample and all samples in the dataset
    # The result is a row vector with n_samples elements
    dot_products = np.dot(X, sample.T)

    # Compute the L2 norm of the sample and all samples in the dataset
    # This is a scalar value
    sample_norm = np.linalg.norm(sample)

    # The axis=1 argument ensures that the norm is computed along the feature axis
    # X_norm is a row vector with n_samples elements
    X_norm = np.linalg.norm(X, axis=1)

    # The broadcasting of sample_norm and X_norm is handled automatically
    cosine_similarities = dot_products / (sample_norm * X_norm + 1e-8)

    return cosine_similarities

Note the use of $axis=1$ in the np.linalg.norm() function to compute the norm along the feature axis. This ensures that the norm is calculated for each sample in the training set. By default, np.linalg.norm() computes the Frobenius norm of the entire matrix (a 2D array).

Eucledian Distance

Next, we’ll implement the Euclidean distance function in Python. Again, we compute the distances between a single sample and all samples in the training set.

def custom_euclidean_distances(sample: np.ndarray, X: np.ndarray) -> np.ndarray:
    """
    Compute the Euclidean distance between a sample and all samples in a dataset.

    Parameters
    ----------
    sample : np.ndarray
        A single sample as a row vector with dimensions (1, n_features).
    X : np.ndarray
        The training data matrix with dimensions (n_samples, n_features).

    Returns
    -------
    np.ndarray
        The Euclidean distances between the query sample and all samples in the training data matrix.
    """
    # X is (n_samples, n_features) and sample is (1, n_features), and so the subtraction is broadcasted along the first dimension (i.e., each row of X is subtracted by sample)
    # The result is a matrix with dimensions (n_samples, n_features)
    differences = X - sample
    squared_differences = differences**2

    # Sum across the feature axis to obtain the sum of squared differences for each sample
    # The result is a row vector with n_samples elements
    squared_distances = np.sum(squared_differences, axis=1)
    euclidean_distances = np.sqrt(squared_distances)

    return euclidean_distances

Once again, we utilize broadcasting and element-wise calculations:

Subtracting the sample from all samples in the training set is broadcasted along the first dimension (i.e., the rows of $X$). This results in a matrix with the same dimensions as $X$, whose rows are the element-wise differences between the sample and each training sample.
Squaring the differences is performed element-wise, resulting in a matrix with the same dimensions as $X$.
Summing across the feature axis (axis=1) calculates the sum of squared differences for each sample in the training set, resulting in a row vector with $n_{\text{training samples}}$ elements.
Finally, taking the square root of each sum of squared differences yields the Euclidean distances between the query sample and all training samples.

KNN Classifier

Initialization

The initializer logic is straightforward as we only need to store the hyperparameters.

class KNNClassifier(BaseEstimator, ClassifierMixin):
    """
    K-Nearest Neighbors classifier.
    """

    def __init__(
        self,
        n_neighbors: int = 5,
        weights: str = "distance",
        metric: str = "cosine",
        metric_params: Dict[str, Any] = None,
    ) -> None:
        """
        Initialize the KNN classifier instance.

        Parameters
        ----------
        n_neighbors : int
            The number of nearest neighbors to consider.
        weights : str
            The weight function to use in prediction. Possible values are 'uniform' and 'distance'.
                - 'uniform' : All points in each neighborhood are weighted equally.
                - 'distance' : Weight points by the inverse of their distance.
        metric : str
            The similarity function to use. Possible values are 'cosine' and 'euclidean'.
        metric_params : Dict[str, Any]
            Additional keyword arguments for the metric function. See the documentation of scipy.spatial.distance and the metrics
            listed in distance_metrics for valid metric values.

        Returns
        -------
        None
        """
        self.n_neighbors = n_neighbors
        self.weights = weights
        self.metric = metric
        self.metric_params = metric_params

Fit Method

As discussed earlier, KNN is a lazy learner and doesn’t learn any model during the training phase. Instead, it stores the training data for later use. The fit() method in the KNN classifier simply stores the training data and the target labels. Some housekeeping tasks are also performed, such as validation logic (only needed if we want sklearn compatibility), encoding the target labels, and storing the unique classes seen during training.

def fit(
    self,
    X: Union[csr_matrix, np.ndarray, pd.DataFrame],
    y: Union[np.ndarray, pd.Series, MutableSequence[Union[str, float]]],
) -> Self:
    """
    The KNN classifier does not learn a model. Instead, it simply stores the training data. It is a 'lazy' learner.

    Parameters
    ----------
    X : Union[csr_matrix, np.ndarray, pd.DataFrame]
        The input data.
    y : Union[np.ndarray, pd.Series, MutableSequence[Union[str, float]]]
        The target labels.

    Returns
    -------
    Self
        The instance itself.
    """
    X, y = check_X_y(
        X=X,
        y=y,
        accept_sparse="csr",
        ensure_2d=True,
        allow_nd=False,
        y_numeric=False,
    )
    check_classification_targets(y)

    # Attributes that have been estimated from the data must end with a trailing underscore
    self.X_train_ = X
    self.label_encoder_ = LabelEncoder()
    self.y_train_ = self.label_encoder_.fit_transform(y)
    self.classes_ = self.label_encoder_.classes_  # Unique classes seen during fit
    self.n_features_in_ = X.shape[1]  # Number of features seen during fit

    return self

Predict Method

The predict() method is where the actual prediction interface is implemented. For each query sample, the method calculates the distances to all training samples and identifies the $k$ nearest neighbors. The class label of the query sample is then determined by the majority class among the neighbors.

def predict(self, X: Union[csr_matrix, np.ndarray, pd.DataFrame]) -> np.ndarray:
    """
    Predict the labels for the test set using the KNN algorithm.

    Parameters
    ----------
    X : Union[csr_matrix, np.ndarray, pd.DataFrame]
        The input data.

    Returns
    -------
    np.ndarray
        The predicted labels.
    """
    check_is_fitted(
        self,
        ["X_train_", "y_train_", "label_encoder_", "classes_", "n_features_in_"],
    )
    X = check_array(X, accept_sparse="csr")

    # Handle edge case: only one class in training data
    if len(self.classes_) == 1:
        # The classes_ attribute was already in the format before encoding, so we can return it directly
        return np.full(shape=X.shape[0], fill_value=self.classes_[0])

    predictions = np.zeros(X.shape[0], dtype=int)
    for i, test_sample in enumerate(tqdm(X)):
        # Calculate distances to all training samples
        test_sample_reshaped = test_sample.reshape(1, -1)
        distances = self._compute_distances(test_sample_reshaped)
        # Identify the indices of the k nearest neighbors
        k_indices = np.argsort(distances, axis=-1)[: self.n_neighbors]
        # Predict class based on nearest neighbors
        prediction = self._predict_from_neighbors(k_indices, distances)
        predictions[i] = prediction

    return self.label_encoder_.inverse_transform(predictions)

The internal methods _compute_distances() and _predict_from_neighbors() are used to calculate the distances between the query sample and all training samples, and to predict the class label based on the nearest neighbors, respectively.

Note: In this implementation, we substituded the educlidean distance and cosine similarity functions with implementations from the from sklearn.metrics.pairwise module. On the one-hand, it is useful to implement these functions from scratch to understand the underlying concepts. On the other hand, the scikit-learn implementation is optimized and more efficient, so using them directly here speeds up the runtime quite a bit.

def _compute_distances(
    self, test_sample: Union[np.ndarray, csr_matrix]
) -> np.ndarray:
    """
    Compute the distances of the test sample to all training samples.

    Parameters
    ----------
    test_sample : Union[np.ndarray, csr_matrix]
        A single test sample.

    Returns
    -------
    distances : np.ndarray
        Distances from the test sample to each training sample.
    """
    metric_params = {} if self.metric_params is None else self.metric_params
    if self.metric == "cosine":
        return (
            1
            - cosine_similarity(
                X=test_sample, Y=self.X_train_, **metric_params
            ).flatten()
        )
    elif self.metric == "euclidean":
        return euclidean_distances(
            X=test_sample, Y=self.X_train_, **metric_params
        ).flatten()
    else:
        raise NotImplementedError(
            f"Unsupported similarity function '{self.metric}'"
        )

Similarly, the _predict_from_neighbors() method calculates the class label based on the majority class among the $k$ nearest neighbors.

def _predict_from_neighbors(
    self, k_indices: np.ndarray, distances: np.ndarray
) -> int:
    """
    Predict the class of a sample based on its nearest neighbors.

    Parameters
    ----------
    k_indices : np.ndarray
        Indices of the k nearest neighbors.
    distances : np.ndarray
        Distances from the sample to each neighbor.

    Returns
    -------
    prediction : int
        Predicted class label.
    """
    k_nearest_labels = self.y_train_[k_indices]
    if self.weights == "distance":
        # Inverse distance weighting, considering a small epsilon to avoid division by zero
        inv_distances = 1 / (distances[k_indices] + 1e-8)
        weighted_votes = np.bincount(
            k_nearest_labels, weights=inv_distances, minlength=len(self.classes_)
        )
    elif self.weights == "uniform":
        # Uniform weights (each neighbor contributes equally)
        weighted_votes = np.bincount(k_nearest_labels, minlength=len(self.classes_))
    else:
        raise ValueError(f"Unsupported weight function '{self.weights}'")

    return np.argmax(weighted_votes)

Scikit-Learn Compatibility (Optional)

An additional feature of the implementation above is that it is compatible with the scikit-learn API. This means that the KNN classifier can be used in conjunction with scikit-learn’s utilities, such as GridSearchCV and cross_val_score.

To this end, we needed to take care of the following at the very least:

Inherit from scikit-learn’s BaseEstimator and ClassifierMixin classes.

class KNNClassifier(BaseEstimator, ClassifierMixin):

Implement the fit() and predict() methods with the correct validation logic and interface.

The X and y inputs should be validated using check_X_y() and check_classification_targets() during the fit() method.
The X input should be validated using check_array() during the predict() method, which must also check if the model has been fitted using check_is_fitted().

For more details on scikit-learn’s API conventions, refer to the documentation. Needless to say, while rolling our implemenations can be a great learning experience, it is often more efficient to use the optimized implementations provided by scikit-learn for production-level code.

Putting It All Together

The following script implements the entire KNN classifier, the cosine similarity and Euclidean distance functions, and runs a test for its compatibility with scikit-learn using the check_estimator() function.

from typing import Any, Dict, MutableSequence, Self, Union

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.multiclass import check_classification_targets
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y
from tqdm import tqdm


def custom_cosine_similarity(sample: np.ndarray, X: np.ndarray) -> np.ndarray:
    # Compute the dot products between the sample and all samples in the dataset
    # The result is a row vector with n_samples elements
    dot_products = np.dot(X, sample.T)

    # Compute the L2 norm of the sample and all samples in the dataset
    # This is a scalar value
    sample_norm = np.linalg.norm(sample)

    # The axis=1 argument ensures that the norm is computed along the feature axis
    # X_norm is a row vector with n_samples elements
    X_norm = np.linalg.norm(X, axis=1)

    # The broadcasting of sample_norm and X_norm is handled automatically
    cosine_similarities = dot_products / (sample_norm * X_norm + 1e-8)

    return cosine_similarities


def custom_euclidean_distances(sample: np.ndarray, X: np.ndarray) -> np.ndarray:
    # X is (n_samples, n_features) and sample is (1, n_features), and so the subtraction is broadcasted along the first dimension (i.e., each row of X is subtracted by sample)
    # The result is a matrix with dimensions (n_samples, n_features)
    differences = X - sample
    squared_differences = differences**2

    # Sum across the feature axis to obtain the sum of squared differences for each sample
    # The result is a row vector with n_samples elements
    squared_distances = np.sum(squared_differences, axis=1)
    euclidean_distances = np.sqrt(squared_distances)

    return euclidean_distances


class KNNClassifier(BaseEstimator, ClassifierMixin):
    """
    K-Nearest Neighbors classifier.
    """

    def __init__(
        self,
        n_neighbors: int = 5,
        weights: str = "distance",
        metric: str = "cosine",
        metric_params: Dict[str, Any] = None,
    ) -> None:
        self.n_neighbors = n_neighbors
        self.weights = weights
        self.metric = metric
        self.metric_params = metric_params

    def fit(
        self,
        X: Union[csr_matrix, np.ndarray, pd.DataFrame],
        y: Union[np.ndarray, pd.Series, MutableSequence[Union[str, float]]],
    ) -> Self:
        X, y = check_X_y(
            X=X,
            y=y,
            accept_sparse="csr",
            ensure_2d=True,
            allow_nd=False,
            y_numeric=False,
        )
        check_classification_targets(y)

        # Attributes that have been estimated from the data must end with a trailing underscore
        self.X_train_ = X
        self.label_encoder_ = LabelEncoder()
        self.y_train_ = self.label_encoder_.fit_transform(y)
        self.classes_ = self.label_encoder_.classes_  # Unique classes seen during fit
        self.n_features_in_ = X.shape[1]  # Number of features seen during fit

        return self

    def predict(self, X: Union[csr_matrix, np.ndarray, pd.DataFrame]) -> np.ndarray:
        check_is_fitted(
            self,
            ["X_train_", "y_train_", "label_encoder_", "classes_", "n_features_in_"],
        )
        X = check_array(X, accept_sparse="csr")

        # Handle edge case: only one class in training data
        if len(self.classes_) == 1:
            # The classes_ attribute was already in the format before encoding, so we can return it directly
            return np.full(shape=X.shape[0], fill_value=self.classes_[0])

        predictions = np.zeros(X.shape[0], dtype=int)
        for i, test_sample in enumerate(tqdm(X)):
            # Calculate distances to all training samples
            test_sample_reshaped = test_sample.reshape(1, -1)
            distances = self._compute_distances(test_sample_reshaped)
            # Identify the indices of the k nearest neighbors
            k_indices = np.argsort(distances, axis=-1)[: self.n_neighbors]
            # Predict class based on nearest neighbors
            prediction = self._predict_from_neighbors(k_indices, distances)
            predictions[i] = prediction

        return self.label_encoder_.inverse_transform(predictions)

    def _compute_distances(
        self, test_sample: Union[np.ndarray, csr_matrix]
    ) -> np.ndarray:
        metric_params = {} if self.metric_params is None else self.metric_params
        if self.metric == "cosine":
            return (
                1
                - cosine_similarity(
                    X=test_sample, Y=self.X_train_, **metric_params
                ).flatten()
            )
        elif self.metric == "euclidean":
            return euclidean_distances(
                X=test_sample, Y=self.X_train_, **metric_params
            ).flatten()
        else:
            raise NotImplementedError(
                f"Unsupported similarity function '{self.metric}'"
            )

    def _predict_from_neighbors(
        self, k_indices: np.ndarray, distances: np.ndarray
    ) -> int:
        k_nearest_labels = self.y_train_[k_indices]
        if self.weights == "distance":
            # Inverse distance weighting, considering a small epsilon to avoid division by zero
            inv_distances = 1 / (distances[k_indices] + 1e-8)
            weighted_votes = np.bincount(
                k_nearest_labels, weights=inv_distances, minlength=len(self.classes_)
            )
        elif self.weights == "uniform":
            # Uniform weights (each neighbor contributes equally)
            weighted_votes = np.bincount(k_nearest_labels, minlength=len(self.classes_))
        else:
            raise ValueError(f"Unsupported weight function '{self.weights}'")

        return np.argmax(weighted_votes)


if __name__ == "__main__":
    from sklearn.utils.estimator_checks import check_estimator

    knn_classifier = KNNClassifier()
    check_estimator(knn_classifier)

    print(
        "\nKNNClassifier class is a valid estimator compatible with scikit-learn API!"
    )

News Group Multiclass Classification

To demonstrate the KNN classifier in action, we’ll use the 20 Newsgroups dataset available in scikit-learn. This dataset consists of approximately $18,846$ newsgroup documents across 20 different newsgroups. We will use a subset of the dataset containing three categories: soc.religion.christian, comp.graphics, and sci.med.

The goal is to train a KNN classifier to predict the category of a newsgroup post based on its content, which is vectorized using the TfidfVectorizer.

import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

Global Settings

rs = np.random.RandomState(12)

Data

# Subset of 20 newsgroups dataset
categories = ["soc.religion.christian", "comp.graphics", "sci.med"]

twenty_train = fetch_20newsgroups(
    subset="train", categories=categories, shuffle=True, random_state=rs
)

len(twenty_train.data)

The features are the raw text data, which we will vectorize using the TfidfVectorizer. We will also reduce the dimensionality of the feature space using the TruncatedSVD transformer.

for i in range(3):
    print("\n".join(twenty_train.data[i].split("\n")[:3]), "\n")

From: kene@acs.bu.edu (Kenneth Engel)
Subject: Re: Atheists and Hell
Organization: Boston University, Boston, MA, USA 

From: ab@nova.cc.purdue.edu (Allen B)
Subject: Re: TIFF: philosophical significance of 42
Organization: Purdue University 

From: grante@aquarius.rosemount.com (Grant Edwards)
Subject: Re: Krillean Photography
Reply-To: grante@aquarius.rosemount.com (Grant Edwards)

The target distribution for this toy dataset is as balanced as it can be compared to any real-world dataset.

target_names_map = {i: name for i, name in enumerate(twenty_train.target_names)}

fig, ax = plt.subplots(figsize=(10, 7))
ax.hist(twenty_train.target)
ax.set_xticks(range(len(target_names_map)))
ax.set_xticklabels(target_names_map.values(), rotation=45)
# Add the count at the top of the bar
for i, count in enumerate(np.bincount(twenty_train.target)):
    ax.text(i, count, str(count), ha="center", va="bottom");
ax.set_xlabel("Category")
ax.set_ylabel("Count")
plt.title("Target Distribution")
plt.show()

Custom KNN Classifier

Pipeline

For the pipeline, we carry out the following preprocessing steps:

Vectorize the text data using the TfidfVectorizer.
Reduce the dimensionality of the feature space using the TruncatedSVD. This is also known as Latent Semantic Analysis (LSA). This step can be a hyperparameter in the pipeline.
Train the KNN classifier.

pipeline_knn = Pipeline(
    [
        (
            "tfidf",
            TfidfVectorizer(stop_words="english", max_features=None, analyzer="word"),
        ),
        ("lsa", TruncatedSVD()),
        ("knn", KNNClassifier()),
    ]
)
pipeline_knn

Pipeline(steps=[('tfidf', TfidfVectorizer(stop_words='english')),
                ('lsa', TruncatedSVD()), ('knn', KNNClassifier())])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Grid Search

We will perform a grid search to find the optimal hyperparameters for the KNN classifier. The hyperparameters we will tune are:

max_features: The maximum number of features to consider in the TfidfVectorizer.
lsa: The number of components to keep in the TruncatedSVD transformer. We will also include the option to use the original feature space without LSA using the passthrough option. See the documentation for more details.
n_neighbors: The number of neighbors to consider in the KNN classifier.
weights: The weight function to use in prediction. Possible values are uniform and distance.
metric: The similarity function to use. Possible values are euclidean and cosine.

Since the class distribution is balanced, accuracy is a valid metric to optimize.

cv_knn = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=rs)

param_grid_knn = {
    "tfidf__max_features": [2048, 4096],
    "lsa": [TruncatedSVD(n_components=n) for n in [100, 150]]
    + ["passthrough"],  # Include 'passthrough' to test with or without LSA
    "knn__n_neighbors": [5, 10],
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "cosine"],
}

grid_search_knn = GridSearchCV(
    estimator=pipeline_knn,
    param_grid=param_grid_knn,
    scoring="accuracy",
    cv=cv_knn,
    n_jobs=-1,
    verbose=2,
    refit=True,
    error_score="raise",
)

grid_search_knn.fit(twenty_train.data, twenty_train.target)

GridSearchCV(cv=StratifiedShuffleSplit(n_splits=3,
            random_state=RandomState(MT19937) at 0x13432E440,
            test_size=0.2, train_size=None),
             error_score='raise',
             estimator=Pipeline(steps=[('tfidf',
                                        TfidfVectorizer(stop_words='english')),
                                       ('lsa', TruncatedSVD()),
                                       ('knn', KNNClassifier())]),
             n_jobs=-1,
             param_grid={'knn__metric': ['euclidean', 'cosine'],
                         'knn__n_neighbors': [5, 10],
                         'knn__weights': ['uniform', 'distance'],
                         'lsa': [TruncatedSVD(n_components=100),
                                 TruncatedSVD(n_components=150),
                                 'passthrough'],
                         'tfidf__max_features': [2048, 4096]},
             scoring='accuracy', verbose=2)

Grid Search Results

grid_search_results = pd.DataFrame(grid_search_knn.cv_results_).sort_values("rank_test_score").head(5)

	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_knn__metric	param_knn__n_neighbors	param_knn__weights	param_lsa	param_tfidf__max_features	params	split0_test_score	split1_test_score	split2_test_score	mean_test_score	std_test_score	rank_test_score
47	0.2953193	0.0077828	0.4920407	0.0108658	cosine	10	distance	passthrough	4096	cosine , 10 , distance , passthrough, 4096	0.9438202	0.9578652	0.9550562	0.9522472	0.0060681	1
23	0.3585903	0.0821416	0.4670664	0.0075176	euclidean	10	distance	passthrough	4096	euclidean , 10 , distance , passthrough, 4096	0.9410112	0.9578652	0.9494382	0.9494382	0.0068806	2
35	0.3102987	0.0427768	0.7355963	0.0443589	cosine	5	distance	passthrough	4096	cosine , 5 , distance , passthrough, 4096	0.9325843	0.9466292	0.9606742	0.9466292	0.0114676	3
41	0.3300992	0.0152099	0.6898276	0.0764058	cosine	10	uniform	passthrough	4096	cosine , 10 , uniform , passthrough, 4096	0.9325843	0.9550562	0.9466292	0.9447566	0.0092692	4
11	0.3165097	0.0270617	0.4828021	0.0226678	euclidean	5	distance	passthrough	4096	euclidean , 5 , distance , passthrough, 4096	0.9297753	0.9438202	0.9606742	0.9447566	0.0126318	4

Prediction on Test Data

twenty_test = fetch_20newsgroups(
    subset="test", categories=categories, shuffle=True, random_state=rs
)

test_data = twenty_test.data

predicted_targets = grid_search_knn.predict(test_data)

0it [00:00, ?it/s]121it [00:00, 1208.09it/s]244it [00:00, 1220.88it/s]368it [00:00, 1227.74it/s]492it [00:00, 1230.63it/s]616it [00:00, 1232.31it/s]740it [00:00, 1233.13it/s]864it [00:00, 1232.81it/s]988it [00:00, 1220.74it/s]1111it [00:00, 1205.15it/s]1183it [00:00, 1218.92it/s]

predicted_targets

array([0, 0, 2, ..., 2, 1, 2])

Reports

We can generate a classification report to evaluate the performance of the KNN classifier on the test data.

classification_reports = classification_report(
    y_true=twenty_test.target,
    y_pred=predicted_targets,
    target_names=target_names_map.values(),
    output_dict=True,
)

classification_reports_data = pd.DataFrame(classification_reports).T

	precision	recall	f1-score	support
comp.graphics	0.8564593	0.9203085	0.8872367	389.0000000
sci.med	0.9140401	0.8055556	0.8563758	396.0000000
soc.religion.christian	0.8870192	0.9271357	0.9066339	398.0000000
accuracy	0.8841927	0.8841927	0.8841927	0.8841927
macro avg	0.8858396	0.8843332	0.8834155	1183.0000000
weighted avg	0.8860154	0.8841927	0.8834321	1183.0000000

In addition, we can visualize the confusion matrix to better understand the classification performance.

ConfusionMatrixDisplay.from_predictions(
    y_true=twenty_test.target,
    y_pred=predicted_targets,
    display_labels=target_names_map.values(),
)

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x1445a8ed0>

Reference Implementation from Scikit-Learn

We can compare the results of our custom KNN classifier with the scikit-learn implementation.

pipeline_knn_sklearn = Pipeline(
    [
        (
            "tfidf",
            TfidfVectorizer(stop_words="english", max_features=None, analyzer="word"),
        ),
        ("knn", KNeighborsClassifier()),
    ]
)

param_grid_knn_sklearn = {
    "tfidf__max_features": [2048, 4096],
    "knn__n_neighbors": [5, 10],
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "cosine"],
}

grid_search_knn_sklearn = GridSearchCV(
    estimator=pipeline_knn_sklearn,
    param_grid=param_grid_knn_sklearn,
    scoring="accuracy",
    cv=cv_knn,
    n_jobs=-1,
    verbose=2,
    refit=True,
    error_score="raise",
)

grid_search_knn_sklearn.fit(twenty_train.data, twenty_train.target)

GridSearchCV(cv=StratifiedShuffleSplit(n_splits=3,
            random_state=RandomState(MT19937) at 0x13432E440,
            test_size=0.2, train_size=None),
             error_score='raise',
             estimator=Pipeline(steps=[('tfidf',
                                        TfidfVectorizer(stop_words='english')),
                                       ('knn', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'knn__metric': ['euclidean', 'cosine'],
                         'knn__n_neighbors': [5, 10],
                         'knn__weights': ['uniform', 'distance'],
                         'tfidf__max_features': [2048, 4096]},
             scoring='accuracy', verbose=2)

grid_search_results_sklearn = pd.DataFrame(grid_search_knn_sklearn.cv_results_).sort_values("rank_test_score").head(5)

	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_knn__metric	param_knn__n_neighbors	param_knn__weights	param_tfidf__max_features	params	split0_test_score	split1_test_score	split2_test_score	mean_test_score	std_test_score	rank_test_score
15	0.2018224	0.0167407	0.0474807	0.0009341	cosine	10	distance	4096	cosine , 10 , distance, 4096	0.9466292	0.9353933	0.9466292	0.9428839	0.0052967	1
7	0.1971777	0.0042258	0.0645568	0.0005940	euclidean	10	distance	4096	euclidean, 10 , distance , 4096	0.9438202	0.9297753	0.9466292	0.9400749	0.0073727	2
3	0.2053347	0.0061259	0.0688587	0.0025981	euclidean	5	distance	4096	euclidean, 5 , distance , 4096	0.9410112	0.9353933	0.9269663	0.9344569	0.0057719	3
11	0.2152572	0.0075339	0.0688361	0.0024215	cosine	5	distance	4096	cosine , 5 , distance, 4096	0.9438202	0.9325843	0.9269663	0.9344569	0.0070068	3
5	0.2102303	0.0119457	0.0737727	0.0071226	euclidean	10	uniform	4096	euclidean, 10 , uniform , 4096	0.9410112	0.9269663	0.9297753	0.9325843	0.0060681	5

predicted_targets_sklearn = grid_search_knn_sklearn.predict(test_data)

classification_reports_sklearn = classification_report(
    y_true=twenty_test.target,
    y_pred=predicted_targets_sklearn,
    target_names=target_names_map.values(),
    output_dict=True,
)

classification_reports_sklearn_data = pd.DataFrame(classification_reports_sklearn).T

	precision	recall	f1-score	support
comp.graphics	0.8564593	0.9203085	0.8872367	389.0000000
sci.med	0.9140401	0.8055556	0.8563758	396.0000000
soc.religion.christian	0.8870192	0.9271357	0.9066339	398.0000000
accuracy	0.8841927	0.8841927	0.8841927	0.8841927
macro avg	0.8858396	0.8843332	0.8834155	1183.0000000
weighted avg	0.8860154	0.8841927	0.8834321	1183.0000000

ConfusionMatrixDisplay.from_predictions(
    y_true=twenty_test.target,
    y_pred=predicted_targets_sklearn,
    display_labels=target_names_map.values(),
)

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x146502990>

The scikit-learn implementation of the KNN classifier exactly matches the results of our custom implementation, demonstrating the correctness of our implementation.

References

Machine Learning Python Programming Statistical Learning Supervised Learning Classification