In this talk, I will present new approaches to training models that predict probability distributions over complex output spaces. To do so, we will transform hard selection operators into probabilistic operators using entropy functions, thereby introducing a 'temperature' into the selection.
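As a minimal illustration of this idea (the function name and parameters below are my own, not the talk's), a hard argmax selection can be smoothed into a probabilistic selection governed by a temperature:

```python
import numpy as np

def tempered_softmax(scores, temperature):
    """Smoothed selection: a probability distribution over candidates.

    As temperature -> 0 this approaches the hard one-hot argmax;
    as temperature -> infinity it approaches the uniform distribution.
    """
    z = np.asarray(scores, dtype=float) / temperature
    z -= z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

With scores `[1.0, 3.0, 2.0]`, a temperature of 0.01 is essentially the one-hot vector selecting the second candidate, while a temperature of 100 is close to uniform.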

In the first part, we will seek to assign probabilities over an exponentially large set of outputs (e.g. sequences of tags, or alignment matrices between two time series). For this, we will consider dynamic programming (DP) algorithms, which find the highest-scoring output among a combinatorial set by iteratively breaking the selection problem down into smaller subproblems. By smoothing the hard operations performed in the DP recursion, we force DP to output a (potentially sparse) probability distribution over the full set. This relaxes both the optimal value and the optimal solution of the original highest-scoring selection problem, and turns a broad class of DP algorithms into differentiable operators that can be plugged into end-to-end trained models. We provide a new probabilistic perspective on backpropagating through these DP operators, and relate them to inference in graphical models. We showcase the new differentiable DP operators on several structured prediction tasks that benefit from sparse probability predictions.
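As a concrete sketch of this smoothing (a simplified soft-DTW-style recursion; the function names are mine, not the talk's), one can replace the hard min in the time-series alignment recursion with a smoothed soft-min. The DP value then becomes differentiable in the cost matrix, and its gradient is an expected (soft) alignment:

```python
import numpy as np

def softmin(args, gamma):
    # smoothed min: -gamma * log sum exp(-x / gamma); recovers min as gamma -> 0
    a = np.asarray(args, dtype=float) / -gamma
    m = a.max()
    return -gamma * (m + np.log(np.exp(a - m).sum()))

def dtw(cost, gamma=0.0):
    """DTW alignment value between two sequences, given their pairwise costs.

    With gamma == 0 this is the classical hard recursion; with gamma > 0
    every min is smoothed, making the value differentiable in `cost`.
    """
    n, m = cost.shape
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]]
            step = min(prev) if gamma == 0 else softmin(prev, gamma)
            r[i, j] = cost[i - 1, j - 1] + step
    return r[n, m]
```

Since the soft-min lower-bounds the hard min, the smoothed value lower-bounds the hard DTW value and converges to it as `gamma` shrinks.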

In the second part, we will design models that output probability distributions over a potentially continuous output metric space. For this, we will rely on optimal transport to define a cost-informed entropy function on the set of all distributions. We will use this entropy, instead of the classical Shannon entropy, in the highest-scoring selection layer of machine-learning models. This effectively replaces the celebrated softmax operator with a geometric-softmax operator that takes into account the geometry of the output space. We propose an adapted geometric logistic loss function for end-to-end model training. Unlike previous attempts to use optimal transport distances for learning, this loss is convex and supports infinite (or very large) class spaces. The geometric softmax is suitable for predicting sparse and singular distributions, for instance distributions supported on curves or hyper-surfaces. We showcase its use in two applications: ordinal regression and drawing generation.
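To make the underlying variational view concrete (an illustration in my own notation, not the talk's): the classical softmax is the unique maximizer of `<s, p> + H(p)` over the probability simplex, with optimal value `log sum exp(s)`; the geometric softmax keeps this structure but swaps the Shannon entropy `H` for the cost-informed entropy. The sketch below checks the Shannon case numerically:

```python
import numpy as np

def softmax(s):
    # classical softmax: closed-form maximizer of <s, p> + H(p) on the simplex
    e = np.exp(s - s.max())
    return e / e.sum()

def objective(p, s):
    # <s, p> plus the Shannon entropy H(p) = -sum p log p
    p = np.asarray(p)
    pos = p > 0
    return float(p @ s - (p[pos] * np.log(p[pos])).sum())

rng = np.random.default_rng(0)
s = rng.normal(size=5)
p_star = softmax(s)
best = objective(p_star, s)

# the optimal value is the log-partition log sum exp(s)
log_partition = np.log(np.exp(s).sum())

# random points on the simplex never beat the softmax
worst_gap = min(best - objective(rng.dirichlet(np.ones(5)), s)
                for _ in range(1000))
```

Replacing `H` with an optimal-transport (Sinkhorn-type) entropy changes the maximizer, spreading mass according to the ground cost between classes, which is what makes the resulting operator geometry-aware.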