Implementing A Custom Kernel
In this notebook we will show how to implement a custom kernel in Falkon.
There are several complementary parts to a kernel, which can be added to support different operations. We will go through them one by one in this notebook:
- Basic support: supports learning with Falkon!
- Autodiff support: supports automatic hyperparameter tuning (in the hopt module)
- KeOps support: faster kernel-vector products in low dimensions
- Sparse support: supports learning on sparse data.
[1]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from sklearn import datasets
import torch
import numpy as np
import falkon
from falkon import FalkonOptions
from falkon.kernels import Kernel, DiffKernel, KeopsKernelMixin
[pyKeOps]: Warning, no cuda detected. Switching to cpu only.
Setup a simple problem for testing
Load and preprocess the California housing dataset. The learn_with_kernel function sets up Falkon for learning on the California housing dataset with a given kernel.
[2]:
X, Y = datasets.fetch_california_housing(return_X_y=True)
num_train = int(X.shape[0] * 0.8)
num_test = X.shape[0] - num_train
shuffle_idx = np.arange(X.shape[0])
np.random.shuffle(shuffle_idx)
train_idx = shuffle_idx[:num_train]
test_idx = shuffle_idx[num_train:]
Xtrain, Ytrain = X[train_idx], Y[train_idx]
Xtest, Ytest = X[test_idx], Y[test_idx]
# convert numpy -> pytorch
Xtrain = torch.from_numpy(Xtrain).to(dtype=torch.float32)
Xtest = torch.from_numpy(Xtest).to(dtype=torch.float32)
Ytrain = torch.from_numpy(Ytrain).to(dtype=torch.float32)
Ytest = torch.from_numpy(Ytest).to(dtype=torch.float32)
# z-score normalization
train_mean = Xtrain.mean(0, keepdim=True)
train_std = Xtrain.std(0, keepdim=True)
Xtrain -= train_mean
Xtrain /= train_std
Xtest -= train_mean
Xtest /= train_std
[3]:
def rmse(true, pred):
return torch.sqrt(torch.mean((true.reshape(-1, 1) - pred.reshape(-1, 1))**2))
def learn_with_kernel(kernel):
flk_opt = FalkonOptions(use_cpu=True)
model = falkon.Falkon(
kernel=kernel, penalty=1e-5, M=1000, options=flk_opt,
error_every=1, error_fn=rmse)
model.fit(Xtrain, Ytrain)
ts_err = rmse(Ytest, model.predict(Xtest))
print("Test RMSE: %.2f" % (ts_err))
Basic Kernel Implementation
We must inherit from the falkon.kernels.Kernel class and implement:
- the compute method: the core of the kernel implementation. Given two input matrices (of size \(n\times d\) and \(m\times d\)) and an output matrix (of size \(n\times m\)), compute the kernel function between the two inputs and store it in the output. The additional `diag` parameter is a boolean flag indicating that a) \(n\) is equal to \(m\), and b) only the diagonal of the kernel matrix should be computed.
- the compute_sparse method: this should be implemented if you want your kernel to support sparse data. We will implement it in a later section.
We will implement a linear kernel, \(k(x, x') = \sigma \, (x^\top x')\), where the parameter \(\sigma\) is the variance of the kernel. It is the only hyperparameter.
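To make this concrete, here is a minimal sketch in plain PyTorch (independent of Falkon; the names sigma, X1, X2 are illustrative) checking the per-pair definition against the batched matrix-multiplication form used in compute below:
import torch

sigma = torch.tensor(0.5)                 # the variance hyperparameter
X1, X2 = torch.randn(5, 3), torch.randn(4, 3)

# Per-pair definition: k(x, x') = sigma * <x, x'>
K_naive = torch.empty(5, 4)
for i in range(5):
    for j in range(4):
        K_naive[i, j] = sigma * torch.dot(X1[i], X2[j])

# Batched form: scale one input, then a single matrix multiplication
K_fast = (X1 * sigma) @ X2.T
assert torch.allclose(K_naive, K_fast)

# The diag case (used when X1 and X2 have the same number of rows) is the
# row-wise dot product, i.e. the diagonal of the full kernel matrix.
diag = torch.sum((X1 * sigma) * X1, dim=-1)
assert torch.allclose(diag, torch.diag((X1 * sigma) @ X1.T))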
[4]:
class BasicLinearKernel(Kernel):
def __init__(self, lengthscale, options):
# The base class takes as inputs a name for the kernel, and
# an instance of `FalkonOptions`.
super().__init__("basic_linear", options)
self.lengthscale = lengthscale
def compute(self, X1: torch.Tensor, X2: torch.Tensor, out: torch.Tensor, diag: bool) -> torch.Tensor:
# To support different devices/data types, you must make sure
# the lengthscale is compatible with the data.
lengthscale = self.lengthscale.to(device=X1.device, dtype=X1.dtype)
scaled_X1 = X1 * lengthscale
if diag:
out.copy_(torch.sum(scaled_X1 * X2, dim=-1))
else:
# The dot-product row-by-row on `X1` and `X2` can be computed
# on many rows at a time with matrix multiplication.
out = torch.matmul(scaled_X1, X2.T, out=out)
return out
def compute_sparse(self, X1, X2, out, diag, **kwargs) -> torch.Tensor:
raise NotImplementedError("Sparse not implemented")
Test the basic kernel
[5]:
# Initialize the kernel
lengthscale_init = torch.tensor([1.0])
k = BasicLinearKernel(lengthscale_init, options=falkon.FalkonOptions())
[6]:
# The kernel matrix
k(torch.randn(5, 3), torch.randn(5, 3))
[6]:
tensor([[-1.3538, 4.0383, -0.5058, -3.1306, -0.3159],
[-0.9498, -2.0581, 0.4684, 0.8994, 0.7577],
[ 0.3122, -0.1038, -0.5039, 2.5076, -0.4032],
[ 0.8383, 3.8545, -1.4094, 1.0497, -1.4979],
[ 0.8344, -4.5258, 2.9362, -7.7300, 2.0740]])
[7]:
# Kernel-vector product
k.mmv(torch.randn(4, 3), torch.randn(4, 3), v=torch.randn(4, 1))
[7]:
tensor([[6.1084],
[3.6743],
[1.2653],
[1.2448]])
[8]:
# Double kernel-vector product
k.dmmv(torch.randn(3, 3), torch.randn(4, 3), v=torch.randn(4, 1), w=torch.randn(3, 1))
[8]:
tensor([[ -3.6467],
[ -9.8628],
[ 1.4857],
[-12.8557]])
[9]:
# Learning on the california housing dataset
learn_with_kernel(k)
Iteration 1 - Elapsed 0.07s - training error: 2.36367178
Iteration 2 - Elapsed 0.11s - training error: 2.19508219
Iteration 3 - Elapsed 0.14s - training error: 2.19265079
Iteration 4 - Elapsed 0.17s - training error: 2.19265032
Iteration 5 - Elapsed 0.20s - training error: 2.19262338
Iteration 6 - Elapsed 0.24s - training error: 2.19262123
Iteration 7 - Elapsed 0.27s - training error: 2.19261861
Iteration 8 - Elapsed 0.30s - training error: 2.19261885
Iteration 9 - Elapsed 0.33s - training error: 2.19261789
Iteration 10 - Elapsed 0.39s - training error: 2.19261765
Iteration 11 - Elapsed 0.42s - training error: 2.19261956
Iteration 12 - Elapsed 0.45s - training error: 2.19261932
Iteration 13 - Elapsed 0.48s - training error: 2.19261909
Iteration 14 - Elapsed 0.51s - training error: 2.19261813
Iteration 15 - Elapsed 0.55s - training error: 2.19261885
Iteration 16 - Elapsed 0.57s - training error: 2.19261742
Iteration 17 - Elapsed 0.61s - training error: 2.19261813
Iteration 18 - Elapsed 0.63s - training error: 2.19261980
Iteration 19 - Elapsed 0.66s - training error: 2.19261956
Iteration 20 - Elapsed 0.73s - training error: 2.19262052
Test RMSE: 2.19
Differentiable Kernel
A differentiable kernel is needed for automatic hyperparameter optimization (see the hyperparameter tuning notebook).
It requires inheriting from falkon.kernels.DiffKernel. In addition to the methods already discussed, we must implement:
- compute_diff, which works similarly to the compute method but does not have an out parameter. The implementation should be fully differentiable with respect to its inputs and to the kernel hyperparameters.
- detach, which essentially clones the kernel with its parameters detached from the computational graph (see the short illustration below).
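For intuition, "detached from the computational graph" means exactly what torch.Tensor.detach does; a tiny plain-PyTorch illustration:
import torch

lengthscale = torch.tensor([1.0], requires_grad=True)
detached = lengthscale.detach()

print(lengthscale.requires_grad)  # True: gradients flow through this tensor
print(detached.requires_grad)     # False: same values, but cut off from autograd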
Another important difference from the basic kernel is the constructor call, which must include all kernel hyperparameters as keyword arguments. These will be available as attributes on the class. Hyperparameters do not need to be tensors.
``core_fn`` parameter (optional)
The constructor can also optionally take a core_fn parameter, which can simplify the implementation by uniting the compute and compute_diff implementations. Have a look at the kernels implemented in falkon.kernels.dot_prod_kernel.py and falkon.kernels.distance_kernel.py for how to use the core_fn parameter.
[10]:
class DiffLinearKernel(DiffKernel):
def __init__(self, lengthscale, options):
# Super-class constructor call. We do not specify core_fn
# but we must specify the hyperparameter of this kernel (lengthscale)
super().__init__("diff_linear",
options,
core_fn=None,
lengthscale=lengthscale)
def compute(self, X1: torch.Tensor, X2: torch.Tensor, out: torch.Tensor, diag: bool):
lengthscale = self.lengthscale.to(device=X1.device, dtype=X1.dtype)
scaled_X1 = X1 * lengthscale
if diag:
out.copy_(torch.sum(scaled_X1 * X2, dim=-1))
else:
out = torch.matmul(scaled_X1, X2.T, out=out)
return out
def compute_diff(self, X1: torch.Tensor, X2: torch.Tensor, diag: bool):
# The implementation here is similar to `compute` without in-place operations.
lengthscale = self.lengthscale.to(device=X1.device, dtype=X1.dtype)
scaled_X1 = X1 * lengthscale
if diag:
return torch.sum(scaled_X1 * X2, dim=-1)
return torch.matmul(scaled_X1, X2.T)
def detach(self):
# Clones the class with detached hyperparameters
return DiffLinearKernel(
lengthscale=self.lengthscale.detach(),
options=self.params
)
def compute_sparse(self, X1, X2, out, diag, **kwargs) -> torch.Tensor:
raise NotImplementedError("Sparse not implemented")
Test the differentiable kernel
[11]:
# Initialize the kernel, with a lengthscale which requires grad.
lengthscale_init = torch.tensor([1.0]).requires_grad_()
k = DiffLinearKernel(lengthscale_init, options=falkon.FalkonOptions())
[12]:
# Kernel matrix. Notice how the output has a `grad_fn`
k_mat = k(torch.randn(5, 3), torch.randn(5, 3))
k_mat
[12]:
tensor([[ 2.7480, 1.6149, -1.2979, -2.3070, -1.1852],
[ 4.2437, 2.8397, -2.6248, -3.1610, -1.1940],
[ 2.6474, 0.9644, -0.4447, -1.1742, -1.0197],
[-3.4735, 0.4214, -1.9773, 0.3380, 2.2361],
[-1.8094, -0.2183, -0.5620, 1.8260, 1.8644]],
grad_fn=<KernelMmFnFullBackward>)
[13]:
# Gradient of the kernel with respect to the lengthscale.
torch.autograd.grad(k_mat.sum(), k.lengthscale)
[13]:
(tensor([-0.7049]),)
[14]:
# kernel-vector product + gradient
m1 = torch.randn(4, 3).requires_grad_()
m2 = torch.randn(2, 3)
v = torch.randn(2, 1)
k_mmv = k.mmv(m1, m2, v)
print("Kernel-vector product")
print(k_mmv)
print("Gradients:")
print(torch.autograd.grad(k_mmv.sum(), [k.lengthscale, m1]))
Kernel-vector product
tensor([[ 0.0198],
[-1.6055],
[ 2.3654],
[-0.6039]], grad_fn=<KernelMmvFnFullBackward>)
Gradients:
(tensor([0.1758]), tensor([[ 0.6192, 1.2183, -0.2544],
[ 0.6192, 1.2183, -0.2544],
[ 0.6192, 1.2183, -0.2544],
[ 0.6192, 1.2183, -0.2544]]))
[15]:
# Learning on the california housing dataset
learn_with_kernel(k)
Iteration 1 - Elapsed 0.06s - training error: 2.20815659
Iteration 2 - Elapsed 0.10s - training error: 2.19324374
Iteration 3 - Elapsed 0.12s - training error: 2.19264197
Iteration 4 - Elapsed 0.15s - training error: 2.19263649
Iteration 5 - Elapsed 0.18s - training error: 2.19262934
Iteration 6 - Elapsed 0.21s - training error: 2.19261909
Iteration 7 - Elapsed 0.24s - training error: 2.19261813
Iteration 8 - Elapsed 0.26s - training error: 2.19262004
Iteration 9 - Elapsed 0.29s - training error: 2.19261765
Iteration 10 - Elapsed 0.34s - training error: 2.19261789
Iteration 11 - Elapsed 0.38s - training error: 2.19261909
Iteration 12 - Elapsed 0.40s - training error: 2.19261885
Iteration 13 - Elapsed 0.43s - training error: 2.19261956
Iteration 14 - Elapsed 0.46s - training error: 2.19261932
Iteration 15 - Elapsed 0.49s - training error: 2.19261932
Iteration 16 - Elapsed 0.52s - training error: 2.19262099
Iteration 17 - Elapsed 0.54s - training error: 2.19262123
Iteration 18 - Elapsed 0.57s - training error: 2.19262147
Iteration 19 - Elapsed 0.60s - training error: 2.19262195
Iteration 20 - Elapsed 0.65s - training error: 2.19262338
Test RMSE: 2.19
Adding KeOps Support
We must inherit from falkon.kernels.KeopsKernelMixin and implement the keops_mmv_impl method.
KeOps-enabled kernels will still use the implementation in the compute function for computing the kernel matrix itself, but will use KeOps to compute kernel-vector products (if the data dimension is small enough).
The keops_mmv_impl method is responsible for kernel-vector products, and it should contain:
1. A formula definition (see https://www.kernel-operations.io/keops/api/math-operations.html for the appropriate syntax).
2. A definition of all variables (again, have a look at the KeOps documentation, or at the implementation of other kernels within Falkon).
3. A call to the keops_mmv method of the KeopsKernelMixin class, which is responsible for running the KeOps formula.
For our kernel we will use the (X | Y) syntax for the dot product between samples, followed by multiplication with the vector v. The aliases list maps the symbols used in the formula to the KeOps variable types.
For more examples, check the KeOps documentation or the implementation of existing kernels.
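For intuition, the formula '(scale * (X | Y)) * v' with a sum reduction over the j index corresponds to the dense computation sketched below (illustrative names; the point of KeOps is to obtain the same result without ever materializing the n-by-m kernel matrix):
import torch

scale = torch.tensor([1.0])                    # plays the role of the Pm(1) parameter
X1, X2 = torch.randn(6, 3), torch.randn(4, 3)
v = torch.randn(4, 1)

# (X | Y) is the dot product between a row of X1 and a row of X2; multiplying
# by v and summing over j is simply a second matrix multiplication.
dense_equivalent = (scale * (X1 @ X2.T)) @ v   # shape (6, 1)
print(dense_equivalent.shape)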
[16]:
class KeopsLinearKernel(DiffKernel, KeopsKernelMixin):
def __init__(self, lengthscale, options):
super().__init__("my-keops-linear",
options,
core_fn=None,
lengthscale=lengthscale)
def compute(self, X1: torch.Tensor, X2: torch.Tensor, out: torch.Tensor, diag: bool):
lengthscale = self.lengthscale.to(device=X1.device, dtype=X1.dtype)
scaled_X1 = X1 * lengthscale
if diag:
out.copy_(torch.sum(scaled_X1 * X2, dim=-1))
else:
out = torch.matmul(scaled_X1, X2.T, out=out)
return out
def compute_diff(self, X1: torch.Tensor, X2: torch.Tensor, diag: bool):
scaled_X1 = X1 * self.lengthscale
if diag:
return torch.sum(scaled_X1 * X2, dim=-1)
return torch.matmul(scaled_X1, X2.T)
def detach(self):
return KeopsLinearKernel(
lengthscale=self.lengthscale.detach(),
options=self.params
)
def keops_mmv_impl(self, X1, X2, v, kernel, out, opt):
# Keops formula for kernel-vector.
formula = '(scale * (X | Y)) * v'
aliases = [
'X = Vi(%d)' % (X1.shape[1]),
'Y = Vj(%d)' % (X2.shape[1]),
'v = Vj(%d)' % (v.shape[1]),
'scale = Pm(%d)' % (self.lengthscale.shape[0]),
]
other_vars = [
self.lengthscale.to(dtype=X1.dtype, device=X1.device),
]
# Call to the executor of the formula.
return self.keops_mmv(X1, X2, v, out, formula, aliases, other_vars, opt)
def compute_sparse(self, X1, X2, out: torch.Tensor, diag: bool, **kwargs) -> torch.Tensor:
raise NotImplementedError("Sparse not implemented")
Test the KeOps kernel
Note that KeOps will need to compile the kernels the first time they are run!
[17]:
lengthscale_init = torch.tensor([1.0]).requires_grad_()
k = KeopsLinearKernel(lengthscale_init, options=falkon.FalkonOptions(use_cpu=True))
[18]:
# kernel-vector product + gradient
m1 = torch.randn(4, 3).requires_grad_()
m2 = torch.randn(2, 3)
v = torch.randn(2, 1)
k_mmv = k.mmv(m1, m2, v)
print("Kernel-vector product")
print(k_mmv)
print("Gradients:")
print(torch.autograd.grad(k_mmv.sum(), [k.lengthscale, m1]))
Kernel-vector product
tensor([[-1.2121],
[-0.1148],
[ 2.2435],
[ 0.9918]], grad_fn=<TilingGenredAutogradBackward>)
Gradients:
(tensor([1.9084]), tensor([[ 1.0124, -0.8363, 0.7706],
[ 1.0124, -0.8363, 0.7706],
[ 1.0124, -0.8363, 0.7706],
[ 1.0124, -0.8363, 0.7706]], requires_grad=True))
[19]:
# Learning on the california housing dataset.
# Due to differences in floating-point computations, results may be slightly
# different from the other implementations.
learn_with_kernel(k)
Iteration 1 - Elapsed 0.17s - training error: 2.27769995
Iteration 2 - Elapsed 0.34s - training error: 2.19313025
Iteration 3 - Elapsed 0.51s - training error: 2.19323778
Iteration 4 - Elapsed 0.66s - training error: 2.19308257
Iteration 5 - Elapsed 0.82s - training error: 2.19269753
Iteration 6 - Elapsed 0.98s - training error: 2.19266987
Iteration 7 - Elapsed 1.13s - training error: 2.19262886
Iteration 8 - Elapsed 1.29s - training error: 2.19262505
Iteration 9 - Elapsed 1.45s - training error: 2.19262052
Iteration 10 - Elapsed 1.76s - training error: 2.19260979
Iteration 11 - Elapsed 1.92s - training error: 2.19261813
Iteration 12 - Elapsed 2.08s - training error: 2.19261646
Iteration 13 - Elapsed 2.25s - training error: 2.19263911
Iteration 14 - Elapsed 2.42s - training error: 2.19263911
Iteration 15 - Elapsed 2.58s - training error: 2.19264960
Iteration 16 - Elapsed 2.74s - training error: 2.19265103
Iteration 17 - Elapsed 2.91s - training error: 2.19268680
Iteration 18 - Elapsed 3.07s - training error: 2.19269395
Iteration 19 - Elapsed 3.23s - training error: 2.19270301
Iteration 20 - Elapsed 3.55s - training error: 2.19275403
Test RMSE: 2.19
Supporting Sparse Data
Sparse support can be necessary for kernel learning in extremely high dimensions, when the inputs are sparse.
Sparse support requires using special functions for common operations such as matrix multiplication. Falkon implements sparse tensors in a CSR format (PyTorch is slowly picking this format up, in place of COO), through the falkon.sparse.SparseTensor class.
We will implement the compute_sparse method below, supporting both diagonal and full kernels. However, only CPU support is added here (CUDA support is possible but requires a few more details), and differentiable sparse kernels are not supported.
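Before the implementation, here is a quick reminder of the CSR layout using scipy (purely illustrative; the same three arrays are what the SparseTensor constructor below receives as indexptr, index and data):
import numpy as np
from scipy import sparse

dense = np.array([[0., 5.],
                  [1., 8.],
                  [2., 0.]], dtype=np.float32)
csr = sparse.csr_matrix(dense)

print(csr.indptr)    # [0 1 3 4] -> where each row's values start/end in `data`
print(csr.indices)   # [1 0 1 0] -> the column index of each stored value
print(csr.data)      # [5. 1. 8. 2.] -> the non-zero values themselves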
[20]:
from falkon.sparse import SparseTensor
from falkon.sparse import sparse_matmul, bdot
[21]:
class SparseLinearKernel(Kernel):
def __init__(self, lengthscale, options):
# The base class takes as inputs a name for the kernel, and
# an instance of `FalkonOptions`.
super().__init__("sparse_linear", options)
self.lengthscale = lengthscale
def compute(self, X1: torch.Tensor, X2: torch.Tensor, out: torch.Tensor, diag: bool) -> torch.Tensor:
lengthscale = self.lengthscale.to(device=X1.device, dtype=X1.dtype)
scaled_X1 = X1 * lengthscale
if diag:
out.copy_(torch.sum(scaled_X1 * X2, dim=-1))
else:
# The dot-product row-by-row on `X1` and `X2` can be computed
# on many rows at a time with matrix multiplication.
out = torch.matmul(scaled_X1, X2.T, out=out)
return out
def compute_sparse(self,
X1: SparseTensor,
X2: SparseTensor,
out: torch.Tensor,
diag: bool,
**kwargs) -> torch.Tensor:
# The inputs are the matrix X1 (n x d) in CSR format, and X2 (d x m) in CSC format.
# To support different devices/data types, you must make sure
# the lengthscale is compatible with the data.
lengthscale = self.lengthscale.to(device=X1.device, dtype=X1.dtype)
if diag:
# The diagonal is a dot-product between rows of X1 and X2.
# The batched-dot is only implemented on CPU.
out = bdot(X1, X2.transpose_csr(), out)
else:
# Otherwise we need to matrix-multiply. Note that X2 is already
# transposed correctly.
out = sparse_matmul(X1, X2, out)
out.mul_(lengthscale)
return out
Testing sparse support
We generate two sparse matrices, and check that the sparse kernel is equivalent to the dense version.
[22]:
indexptr = torch.tensor([0, 1, 3, 4], dtype=torch.long)
index = torch.tensor([1, 0, 1, 0], dtype=torch.long)
value = torch.tensor([5, 1, 8, 2], dtype=torch.float32)
sp1 = SparseTensor(indexptr=indexptr, index=index, data=value, size=(3, 2), sparse_type="csr")
# Converted to dense:
dense1 = torch.from_numpy(sp1.to_scipy().todense())
dense1
[22]:
tensor([[0., 5.],
[1., 8.],
[2., 0.]])
[23]:
indexptr = torch.tensor([0, 1, 2, 4], dtype=torch.long)
index = torch.tensor([1, 0, 0, 1], dtype=torch.long)
value = torch.tensor([2, 1, 3, 4], dtype=torch.float32)
sp2 = SparseTensor(indexptr=indexptr, index=index, data=value, size=(3, 2), sparse_type="csr")
dense2 = torch.from_numpy(sp2.to_scipy().todense())
dense2
[23]:
tensor([[0., 2.],
[1., 0.],
[3., 4.]])
[24]:
# Initialize the kernel
lengthscale_init = torch.tensor([1.0])
k = SparseLinearKernel(lengthscale_init, options=falkon.FalkonOptions())
[25]:
k(sp1, sp2) == k(dense1, dense2)
[25]:
tensor([[True, True, True],
[True, True, True],
[True, True, True]])