Cluster analysis aims to identify homogeneous groups of units, called clusters, within data. Model-based clustering methods treat the overall population as a mixture of groups, and each component of this mixture is modeled through its conditional probability distribution. The choice of this distribution affects both the clustering results and the performance of the algorithm. Recently, the generalized hyperbolic distribution (GHD) has been used because of its flexibility: it can accommodate skewness and heavy tails.
Another challenging issue is modeling data sets that contain outliers; the p-variate contaminated Gaussian distribution (CGD) was proposed to address it.
Despite the advantages of the GHD and the CGD, both distributions have some univariate parameters, i.e., parameters that take the same value in every dimension. This is limiting in real applications; for example, the proportion of outliers may differ across dimensions. To address this issue, we proposed the use of multiple scaled distributions. The GHD and the CGD are Gaussian scale mixtures with univariate weights; we proposed to incorporate multi-dimensional weights via an eigendecomposition of the symmetric positive-definite scale matrix. A generalized EM algorithm is used for parameter estimation.
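As a rough illustration of the multiple scaled idea for the contaminated Gaussian case, the following minimal sketch draws samples in which each eigen-direction of the scale matrix has its own proportion of typical points and its own inflation factor. The function name, its arguments, and the two-point weight construction are assumptions made for illustration only; this is not the estimation procedure used in the talk.

```python
import numpy as np

def sample_multiple_scaled_cn(n, mu, sigma, alpha, eta, rng=None):
    """Draw n points from a sketch of a multiple scaled contaminated Gaussian.

    Hypothetical helper for illustration. alpha[j] is the proportion of
    typical (non-outlying) points along eigen-direction j, and eta[j] >= 1
    is the variance inflation applied to outlying points along that direction.
    Directions follow the ascending eigenvalue order returned by eigh.
    """
    rng = np.random.default_rng(rng)
    mu = np.asarray(mu, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    eta = np.asarray(eta, dtype=float)
    p = mu.size

    # Eigendecomposition of the scale matrix: sigma = gamma @ diag(lam) @ gamma.T
    lam, gamma = np.linalg.eigh(np.asarray(sigma, dtype=float))

    # One indicator per observation AND per direction (the multi-dimensional
    # weight); a single indicator per observation would give the ordinary CGD.
    good = rng.random((n, p)) < alpha

    # Variance along direction j is lam[j] for typical points, eta[j] * lam[j] otherwise.
    inflation = np.where(good, 1.0, eta)
    y = rng.standard_normal((n, p)) * np.sqrt(lam * inflation)

    # Rotate from the eigenbasis back to the original coordinates.
    x = mu + y @ gamma.T
    return x, good
```

For instance, calling the function with alpha = [1.0, 0.8] and eta = [1.0, 10.0] contaminates only the second eigen-direction, which a single univariate weight cannot express.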
In this talk, I’ll illustrate the use of multiple scaled distributions to detect clusters of flexible shape.