Implementing A Custom Kernel

In this notebook we will show how to implement a custom kernel in Falkon.

There are several complementary parts to a kernel, which can be added to support different operations. We will go through them one-by-one in this notebook:

  • Basic support: supports learning with Falkon!

  • Autodiff support: supports automatic hyperparameter tuning (in the hopt module)

  • KeOps support: faster kernel-vector products in low dimension

  • Sparse support: supports learning on sparse data.

[1]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn import datasets
import torch
import numpy as np

import falkon
from falkon import FalkonOptions
from falkon.kernels import Kernel, DiffKernel, KeopsKernelMixin
[pyKeOps]: Warning, no cuda detected. Switching to cpu only.

Set up a simple problem for testing

Load and preprocess the California housing dataset. The learn_with_kernel function sets up Falkon for learning on the California housing dataset with a given kernel.

[2]:
X, Y = datasets.fetch_california_housing(return_X_y=True)
num_train = int(X.shape[0] * 0.8)
num_test = X.shape[0] - num_train
shuffle_idx = np.arange(X.shape[0])
np.random.shuffle(shuffle_idx)
train_idx = shuffle_idx[:num_train]
test_idx = shuffle_idx[num_train:]

Xtrain, Ytrain = X[train_idx], Y[train_idx]
Xtest, Ytest = X[test_idx], Y[test_idx]
# convert numpy -> pytorch
Xtrain = torch.from_numpy(Xtrain).to(dtype=torch.float32)
Xtest = torch.from_numpy(Xtest).to(dtype=torch.float32)
Ytrain = torch.from_numpy(Ytrain).to(dtype=torch.float32)
Ytest = torch.from_numpy(Ytest).to(dtype=torch.float32)
# z-score normalization
train_mean = Xtrain.mean(0, keepdim=True)
train_std = Xtrain.std(0, keepdim=True)
Xtrain -= train_mean
Xtrain /= train_std
Xtest -= train_mean
Xtest /= train_std
[3]:
def rmse(true, pred):
    return torch.sqrt(torch.mean((true.reshape(-1, 1) - pred.reshape(-1, 1))**2))

def learn_with_kernel(kernel):
    flk_opt = FalkonOptions(use_cpu=True)
    model = falkon.Falkon(
        kernel=kernel, penalty=1e-5, M=1000, options=flk_opt,
        error_every=1, error_fn=rmse)
    model.fit(Xtrain, Ytrain)
    ts_err = rmse(Ytest, model.predict(Xtest))
    print("Test RMSE: %.2f" % (ts_err))

Basic Kernel Implementation

We must inherit from the falkon.kernels.Kernel class, and implement:

  • compute method: the core of the kernel implementation. Given two input matrices (of size \(n\times d\) and \(m\times d\)), and an output matrix (of size \(n\times m\)), compute the kernel function between the two inputs and store it in the output. The additional diag parameter is a boolean flag, indicating that a) \(n\) is equal to \(m\), and b) only the diagonal of the kernel matrix should be computed.

  • compute_sparse method: this should be used if you want your kernel to support sparse data. We will implement it in a later section.

We will implement a linear kernel:

\[k(x, x') = \sigma (x^\top x')\]

The parameter \(\sigma\) is the variance of the kernel; it is its only hyperparameter, and corresponds to the lengthscale attribute in the code below.
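As a quick illustration of the formula (a toy sketch using plain PyTorch, independent of Falkon), the kernel matrix between all pairs of rows is just a scaled matrix product:

# Toy check of k(x, x') = sigma * (x^T x'), evaluated for all pairs of rows.
sigma = 2.0
A = torch.randn(5, 3)
B = torch.randn(4, 3)
K = sigma * (A @ B.T)  # K[i, j] = sigma * <A[i], B[j]>, shape (5, 4)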

[4]:
class BasicLinearKernel(Kernel):
    def __init__(self, lengthscale, options):
        # The base class takes as inputs a name for the kernel, and
        # an instance of `FalkonOptions`.
        super().__init__("basic_linear", options)

        self.lengthscale = lengthscale

    def compute(self, X1: torch.Tensor, X2: torch.Tensor, out: torch.Tensor, diag: bool) -> torch.Tensor:
        # To support different devices/data types, you must make sure
        # the lengthscale is compatible with the data.
        lengthscale = self.lengthscale.to(device=X1.device, dtype=X1.dtype)

        scaled_X1 = X1 * lengthscale

        if diag:
            out.copy_(torch.sum(scaled_X1 * X2, dim=-1))
        else:
            # The dot-product row-by-row on `X1` and `X2` can be computed
            # on many rows at a time with matrix multiplication.
            out = torch.matmul(scaled_X1, X2.T, out=out)

        return out

    def compute_sparse(self, X1, X2, out, diag, **kwargs) -> torch.Tensor:
        raise NotImplementedError("Sparse not implemented")

Test the basic kernel

[5]:
# Initialize the kernel
lengthscale_init = torch.tensor([1.0])
k = BasicLinearKernel(lengthscale_init, options=falkon.FalkonOptions())
[6]:
# The kernel matrix
k(torch.randn(5, 3), torch.randn(5, 3))
[6]:
tensor([[-1.3538,  4.0383, -0.5058, -3.1306, -0.3159],
        [-0.9498, -2.0581,  0.4684,  0.8994,  0.7577],
        [ 0.3122, -0.1038, -0.5039,  2.5076, -0.4032],
        [ 0.8383,  3.8545, -1.4094,  1.0497, -1.4979],
        [ 0.8344, -4.5258,  2.9362, -7.7300,  2.0740]])
[7]:
# Kernel-vector product
k.mmv(torch.randn(4, 3), torch.randn(4, 3), v=torch.randn(4, 1))
[7]:
tensor([[6.1084],
        [3.6743],
        [1.2653],
        [1.2448]])
[8]:
# Double kernel-vector product
k.dmmv(torch.randn(3, 3), torch.randn(4, 3), v=torch.randn(4, 1), w=torch.randn(3, 1))
[8]:
tensor([[ -3.6467],
        [ -9.8628],
        [  1.4857],
        [-12.8557]])
[9]:
# Learning on the california housing dataset
learn_with_kernel(k)
Iteration   1 - Elapsed 0.07s - training error: 2.36367178
Iteration   2 - Elapsed 0.11s - training error: 2.19508219
Iteration   3 - Elapsed 0.14s - training error: 2.19265079
Iteration   4 - Elapsed 0.17s - training error: 2.19265032
Iteration   5 - Elapsed 0.20s - training error: 2.19262338
Iteration   6 - Elapsed 0.24s - training error: 2.19262123
Iteration   7 - Elapsed 0.27s - training error: 2.19261861
Iteration   8 - Elapsed 0.30s - training error: 2.19261885
Iteration   9 - Elapsed 0.33s - training error: 2.19261789
Iteration  10 - Elapsed 0.39s - training error: 2.19261765
Iteration  11 - Elapsed 0.42s - training error: 2.19261956
Iteration  12 - Elapsed 0.45s - training error: 2.19261932
Iteration  13 - Elapsed 0.48s - training error: 2.19261909
Iteration  14 - Elapsed 0.51s - training error: 2.19261813
Iteration  15 - Elapsed 0.55s - training error: 2.19261885
Iteration  16 - Elapsed 0.57s - training error: 2.19261742
Iteration  17 - Elapsed 0.61s - training error: 2.19261813
Iteration  18 - Elapsed 0.63s - training error: 2.19261980
Iteration  19 - Elapsed 0.66s - training error: 2.19261956
Iteration  20 - Elapsed 0.73s - training error: 2.19262052
Test RMSE: 2.19

Differentiable Kernel

A differentiable kernel is needed for automatic hyperparameter optimization with the hopt module (see the hyperparameter tuning notebook).

It requires inheriting from falkon.kernels.DiffKernel. In addition to the methods already discussed, we must implement:

  • compute_diff, which works similarly to the compute method but does not have an out parameter. The implementation should be fully differentiable with respect to its inputs and to the kernel hyperparameters.

  • detach, which essentially clones the kernel with its parameters detached from the computational graph.

Another important difference from the basic kernel is the constructor call, which must pass all kernel hyperparameters as keyword arguments. These will be available as attributes on the class. Hyperparameters do not need to be tensors.

The core_fn parameter (optional)

The constructor can also accept an optional core_fn parameter, which can simplify the implementation by unifying compute and compute_diff. Have a look at the kernels implemented in falkon.kernels.dot_prod_kernel.py and falkon.kernels.distance_kernel.py for how to use core_fn; a hypothetical sketch is also shown below.
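For illustration only, here is a hypothetical sketch of such a unified core function for the linear kernel. The exact argument list that DiffKernel passes to core_fn is an assumption here, so verify it against the files mentioned above:

# Hypothetical sketch: one function serving both code paths. With `out` given it
# behaves like `compute`; with `out=None` it behaves like `compute_diff`
# (fully differentiable). The signature is an assumption, not Falkon's contract.
def linear_core(X1, X2, out, diag, lengthscale):
    scaled_X1 = X1 * lengthscale
    if diag:
        res = torch.sum(scaled_X1 * X2, dim=-1)
        return res if out is None else out.copy_(res)
    if out is None:
        return torch.matmul(scaled_X1, X2.T)
    return torch.matmul(scaled_X1, X2.T, out=out)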

[10]:
class DiffLinearKernel(DiffKernel):
    def __init__(self, lengthscale, options):
        # Super-class constructor call. We do not specify core_fn
        # but we must specify the hyperparameter of this kernel (lengthscale)
        super().__init__("diff_linear",
                         options,
                         core_fn=None,
                         lengthscale=lengthscale)

    def compute(self, X1: torch.Tensor, X2: torch.Tensor, out: torch.Tensor, diag: bool):
        lengthscale = self.lengthscale.to(device=X1.device, dtype=X1.dtype)
        scaled_X1 = X1 * lengthscale
        if diag:
            out.copy_(torch.sum(scaled_X1 * X2, dim=-1))
        else:
            out = torch.matmul(scaled_X1, X2.T, out=out)

        return out

    def compute_diff(self, X1: torch.Tensor, X2: torch.Tensor, diag: bool):
        # The implementation here is similar to `compute` without in-place operations.
        lengthscale = self.lengthscale.to(device=X1.device, dtype=X1.dtype)
        scaled_X1 = X1 * lengthscale

        if diag:
            return torch.sum(scaled_X1 * X2, dim=-1)

        return torch.matmul(scaled_X1, X2.T)

    def detach(self):
        # Clones the class with detached hyperparameters
        return DiffLinearKernel(
            lengthscale=self.lengthscale.detach(),
            options=self.params
        )

    def compute_sparse(self, X1, X2, out, diag, **kwargs) -> torch.Tensor:
        raise NotImplementedError("Sparse not implemented")

Test the differentiable kernel

[11]:
# Initialize the kernel, with a lengthscale which requires grad.
lengthscale_init = torch.tensor([1.0]).requires_grad_()
k = DiffLinearKernel(lengthscale_init, options=falkon.FalkonOptions())
[12]:
# Kernel matrix. Notice how the output has a `grad_fn`
k_mat = k(torch.randn(5, 3), torch.randn(5, 3))
k_mat
[12]:
tensor([[ 2.7480,  1.6149, -1.2979, -2.3070, -1.1852],
        [ 4.2437,  2.8397, -2.6248, -3.1610, -1.1940],
        [ 2.6474,  0.9644, -0.4447, -1.1742, -1.0197],
        [-3.4735,  0.4214, -1.9773,  0.3380,  2.2361],
        [-1.8094, -0.2183, -0.5620,  1.8260,  1.8644]],
       grad_fn=<KernelMmFnFullBackward>)
[13]:
# Gradient of the kernel with respect to the lengthscale.
torch.autograd.grad(k_mat.sum(), k.lengthscale)
[13]:
(tensor([-0.7049]),)
[14]:
# kernel-vector product + gradient
m1 = torch.randn(4, 3).requires_grad_()
m2 = torch.randn(2, 3)
v = torch.randn(2, 1)
k_mmv = k.mmv(m1, m2, v)
print("Kernel-vector product")
print(k_mmv)
print("Gradients:")
print(torch.autograd.grad(k_mmv.sum(), [k.lengthscale, m1]))
Kernel-vector product
tensor([[ 0.0198],
        [-1.6055],
        [ 2.3654],
        [-0.6039]], grad_fn=<KernelMmvFnFullBackward>)
Gradients:
(tensor([0.1758]), tensor([[ 0.6192,  1.2183, -0.2544],
        [ 0.6192,  1.2183, -0.2544],
        [ 0.6192,  1.2183, -0.2544],
        [ 0.6192,  1.2183, -0.2544]]))
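Since kernel-vector products are differentiable with respect to the lengthscale, we can already take a manual gradient step on it. This is only a toy illustration of what the hopt module automates, with an arbitrary objective:

# One manual gradient step on the lengthscale (toy objective, illustration only).
loss = (k.mmv(m1, m2, v) ** 2).mean()
(grad_ls,) = torch.autograd.grad(loss, k.lengthscale)
updated_lengthscale = (k.lengthscale - 0.1 * grad_ls).detach()
print(updated_lengthscale)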
[15]:
# Learning on the california housing dataset
learn_with_kernel(k)
Iteration   1 - Elapsed 0.06s - training error: 2.20815659
Iteration   2 - Elapsed 0.10s - training error: 2.19324374
Iteration   3 - Elapsed 0.12s - training error: 2.19264197
Iteration   4 - Elapsed 0.15s - training error: 2.19263649
Iteration   5 - Elapsed 0.18s - training error: 2.19262934
Iteration   6 - Elapsed 0.21s - training error: 2.19261909
Iteration   7 - Elapsed 0.24s - training error: 2.19261813
Iteration   8 - Elapsed 0.26s - training error: 2.19262004
Iteration   9 - Elapsed 0.29s - training error: 2.19261765
Iteration  10 - Elapsed 0.34s - training error: 2.19261789
Iteration  11 - Elapsed 0.38s - training error: 2.19261909
Iteration  12 - Elapsed 0.40s - training error: 2.19261885
Iteration  13 - Elapsed 0.43s - training error: 2.19261956
Iteration  14 - Elapsed 0.46s - training error: 2.19261932
Iteration  15 - Elapsed 0.49s - training error: 2.19261932
Iteration  16 - Elapsed 0.52s - training error: 2.19262099
Iteration  17 - Elapsed 0.54s - training error: 2.19262123
Iteration  18 - Elapsed 0.57s - training error: 2.19262147
Iteration  19 - Elapsed 0.60s - training error: 2.19262195
Iteration  20 - Elapsed 0.65s - training error: 2.19262338
Test RMSE: 2.19

Adding KeOps Support

We must inherit from falkon.kernels.KeopsKernelMixin and implement the method keops_mmv_impl.

KeOps-enabled kernels will still use the implementation in the compute function for computing the kernel matrix itself, but will use KeOps to compute kernel-vector products (if the data dimension is small enough).

This method is responsible for kernel-vector products, and it should contain:

  1. A formula definition (see https://www.kernel-operations.io/keops/api/math-operations.html for the appropriate syntax).

  2. A definition of all variables (again, have a look at the KeOps documentation, or at the implementation of other kernels within Falkon).

  3. A call to the keops_mmv method of the KeopsKernelMixin class, which is responsible for calling into the KeOps formula.

For our kernel we will use the (X | Y) syntax for the dot-product between samples, followed by multiplication with the vector v. The aliases list maps the symbols used in the formula to the KeOps variable types.

For more examples, check the KeOps documentation or the implementation of existing kernels.

[16]:
class KeopsLinearKernel(DiffKernel, KeopsKernelMixin):
    def __init__(self, lengthscale, options):
        super().__init__("my-keops-linear",
                         options,
                         core_fn=None,
                         lengthscale=lengthscale)

    def compute(self, X1: torch.Tensor, X2: torch.Tensor, out: torch.Tensor, diag: bool):
        lengthscale = self.lengthscale.to(device=X1.device, dtype=X1.dtype)
        scaled_X1 = X1 * lengthscale

        if diag:
            out.copy_(torch.sum(scaled_X1 * X2, dim=-1))
        else:
            out = torch.matmul(scaled_X1, X2.T, out=out)

        return out

    def compute_diff(self, X1: torch.Tensor, X2: torch.Tensor, diag: bool):
        scaled_X1 = X1 * self.lengthscale

        if diag:
            return torch.sum(scaled_X1 * X2, dim=-1)

        return torch.matmul(scaled_X1, X2.T)

    def detach(self):
        return KeopsLinearKernel(
            lengthscale=self.lengthscale.detach(),
            options=self.params
        )

    def keops_mmv_impl(self, X1, X2, v, kernel, out, opt):
        # Keops formula for kernel-vector.
        formula = '(scale * (X | Y)) * v'
        aliases = [
            'X = Vi(%d)' % (X1.shape[1]),
            'Y = Vj(%d)' % (X2.shape[1]),
            'v = Vj(%d)' % (v.shape[1]),
            'scale = Pm(%d)' % (self.lengthscale.shape[0]),
        ]
        other_vars = [
            self.lengthscale.to(dtype=X1.dtype, device=X1.device),
        ]
        # Call to the executor of the formula.
        return self.keops_mmv(X1, X2, v, out, formula, aliases, other_vars, opt)


    def compute_sparse(self, X1, X2, out: torch.Tensor, diag: bool, **kwargs) -> torch.Tensor:
        raise NotImplementedError("Sparse not implemented")

Test the KeOps kernel

Note that KeOps will need to compile the kernels the first time they are run!

[17]:
lengthscale_init = torch.tensor([1.0]).requires_grad_()
k = KeopsLinearKernel(lengthscale_init, options=falkon.FalkonOptions(use_cpu=True))
[18]:
# kernel-vector product + gradient
m1 = torch.randn(4, 3).requires_grad_()
m2 = torch.randn(2, 3)
v = torch.randn(2, 1)
k_mmv = k.mmv(m1, m2, v)
print("Kernel-vector product")
print(k_mmv)
print("Gradients:")
print(torch.autograd.grad(k_mmv.sum(), [k.lengthscale, m1]))
Kernel-vector product
tensor([[-1.2121],
        [-0.1148],
        [ 2.2435],
        [ 0.9918]], grad_fn=<TilingGenredAutogradBackward>)
Gradients:
(tensor([1.9084]), tensor([[ 1.0124, -0.8363,  0.7706],
        [ 1.0124, -0.8363,  0.7706],
        [ 1.0124, -0.8363,  0.7706],
        [ 1.0124, -0.8363,  0.7706]], requires_grad=True))
[19]:
# Learning on the california housing dataset.
# Due to differences in floating point code, results may be slightly
# different from the other implementations.
learn_with_kernel(k)
Iteration   1 - Elapsed 0.17s - training error: 2.27769995
Iteration   2 - Elapsed 0.34s - training error: 2.19313025
Iteration   3 - Elapsed 0.51s - training error: 2.19323778
Iteration   4 - Elapsed 0.66s - training error: 2.19308257
Iteration   5 - Elapsed 0.82s - training error: 2.19269753
Iteration   6 - Elapsed 0.98s - training error: 2.19266987
Iteration   7 - Elapsed 1.13s - training error: 2.19262886
Iteration   8 - Elapsed 1.29s - training error: 2.19262505
Iteration   9 - Elapsed 1.45s - training error: 2.19262052
Iteration  10 - Elapsed 1.76s - training error: 2.19260979
Iteration  11 - Elapsed 1.92s - training error: 2.19261813
Iteration  12 - Elapsed 2.08s - training error: 2.19261646
Iteration  13 - Elapsed 2.25s - training error: 2.19263911
Iteration  14 - Elapsed 2.42s - training error: 2.19263911
Iteration  15 - Elapsed 2.58s - training error: 2.19264960
Iteration  16 - Elapsed 2.74s - training error: 2.19265103
Iteration  17 - Elapsed 2.91s - training error: 2.19268680
Iteration  18 - Elapsed 3.07s - training error: 2.19269395
Iteration  19 - Elapsed 3.23s - training error: 2.19270301
Iteration  20 - Elapsed 3.55s - training error: 2.19275403
Test RMSE: 2.19
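Whether KeOps is actually used for kernel-vector products can typically be controlled through the options. The keops_active option name and its values are an assumption here, so verify them against your version of FalkonOptions:

# Assumed option: `keops_active` ("auto" by default; "no" should disable KeOps
# and fall back to the `compute`-based kernel-vector products).
no_keops_opt = falkon.FalkonOptions(use_cpu=True, keops_active="no")
k_no_keops = KeopsLinearKernel(lengthscale_init, options=no_keops_opt)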

Supporting Sparse Data

Sparse support can be necessary for kernel learning in extremely high dimensions, when the inputs are sparse.

Sparse support requires using special functions for common operations such as matrix multiplication. Falkon implements sparse tensors in CSR format (PyTorch support for CSR, as opposed to COO, is still maturing) through the falkon.sparse.SparseTensor class.

We will implement the compute_sparse method below, supporting both diagonal and full kernels. However, only CPU support is added here (CUDA support is possible but requires a few more details), and differentiable sparse kernels are not supported.

[20]:
from falkon.sparse import SparseTensor
from falkon.sparse import sparse_matmul, bdot
[21]:
class SparseLinearKernel(Kernel):
    def __init__(self, lengthscale, options):
        # The base class takes as inputs a name for the kernel, and
        # an instance of `FalkonOptions`.
        super().__init__("sparse_linear", options)

        self.lengthscale = lengthscale

    def compute(self, X1: torch.Tensor, X2: torch.Tensor, out: torch.Tensor, diag: bool) -> torch.Tensor:
        lengthscale = self.lengthscale.to(device=X1.device, dtype=X1.dtype)

        scaled_X1 = X1 * lengthscale

        if diag:
            out.copy_(torch.sum(scaled_X1 * X2, dim=-1))
        else:
            # The dot-product row-by-row on `X1` and `X2` can be computed
            # on many rows at a time with matrix multiplication.
            out = torch.matmul(scaled_X1, X2.T, out=out)

        return out

    def compute_sparse(self,
                       X1: SparseTensor,
                       X2: SparseTensor,
                       out: torch.Tensor,
                       diag: bool,
                       **kwargs) -> torch.Tensor:
        # The inputs are the matrix X1 (n x d) in CSR format, and X2 (d x m)
        # in CSC format (i.e. already transposed).

        # To support different devices/data types, you must make sure
        # the lengthscale is compatible with the data.
        lengthscale = self.lengthscale.to(device=X1.device, dtype=X1.dtype)

        if diag:
            # The diagonal is a dot-product between rows of X1 and X2.
            # The batched-dot is only implemented on CPU.
            out = bdot(X1, X2.transpose_csr(), out)
        else:
            # Otherwise we need to matrix-multiply. Note that X2 is already
            # transposed correctly.
            out = sparse_matmul(X1, X2, out)

        out.mul_(lengthscale)
        return out

Testing sparse support

We generate two sparse matrices, and check that the sparse kernel is equivalent to the dense version.

[22]:
indexptr = torch.tensor([0, 1, 3, 4], dtype=torch.long)
index = torch.tensor([1, 0, 1, 0], dtype=torch.long)
value = torch.tensor([5, 1, 8, 2], dtype=torch.float32)
sp1 = SparseTensor(indexptr=indexptr, index=index, data=value, size=(3, 2), sparse_type="csr")
# Converted to dense:
dense1 = torch.from_numpy(sp1.to_scipy().todense())
dense1
[22]:
tensor([[0., 5.],
        [1., 8.],
        [2., 0.]])
[23]:
indexptr = torch.tensor([0, 1, 2, 4], dtype=torch.long)
index = torch.tensor([1, 0, 0, 1], dtype=torch.long)
value = torch.tensor([2, 1, 3, 4], dtype=torch.float32)
sp2 = SparseTensor(indexptr=indexptr, index=index, data=value, size=(3, 2), sparse_type="csr")
dense2 = torch.from_numpy(sp2.to_scipy().todense())
dense2
[23]:
tensor([[0., 2.],
        [1., 0.],
        [3., 4.]])
[24]:
# Initialize the kernel
lengthscale_init = torch.tensor([1.0])
k = SparseLinearKernel(lengthscale_init, options=falkon.FalkonOptions())
[25]:
k(sp1, sp2) == k(dense1, dense2)

[25]:
tensor([[True, True, True],
        [True, True, True],
        [True, True, True]])