Most kernel-based methods, such as kernel regression, kernel PCA, ICA or kernel k-means clustering, do not scale to large datasets, because constructing and storing the kernel matrix K requires at least O(n^2) time and space for n samples. A common approach used to achieve scalability is to construct a small dictionary of atoms that accurately approximate the spectrum of K, and compute approximate solutions efficiently on the dictionary.

Recent works in graph sparsification and randomized linear algebra show that sampling points with replacement according to their ridge leverage scores (RLS) is a simple way to construct dictionaries that are provably accurate. Moreover, the size of the generated dictionaries only depends on the effective dimensionality (deff) of the dataset, and not on its size n. The drawback of RLS-based methods is that, in a chicken and egg problem, computing exact RLS still requires constructing and storing the whole kernel matrix.

In this seminar I will present SQUEAK, a novel sequential algorithm for kernel dictionary learning based on RLS sampling. Starting from an empty dictionary SQUEAK processes the dataset one sample at a time, alternating between using the current dictionary to estimate RLS, and using the estimated RLS to add or remove samples from the dictionary. Dictionaries generated using SQUEAK are as small and accurate as those obtained with exact RLS sampling, but since all the RLS estimation are performed using only the small intermediate dictionaries, SQUEAK never constructs the whole matrix K, runs in linear time O(n deff) w.r.t. n, and requires only a single pass over the dataset. We also propose a parallel and distributed version of SQUEAK achieving similar accuracy in as little as O(log(n) deff) time.