The Institute of Statistical Mathematics, Tachikawa, Japan. Seminar room 2 (D304)
Speaker: Paula Brito (Faculty of Economics & LIAAD – INESC Porto L.A., University of Porto, Portugal)
In multivariate data analysis, each single observation often contains some intrinsic variability, which should
be taken into account to avoid an important loss of information. This is the case when analyzing a group rather
than a single individual, where intra-group variability should not be overlooked, or when we observe a variable
over time and wish to record the set of observed values rather than just a specific one. Symbolic data extend the
classical tabular model, where individuals (usually in rows) take one single value for each variable (usually in
columns), by allowing for multiple, possibly weighted, values for each variable. New variable types have therefore
been introduced that can represent the variability inherent in the data. The main objective of Symbolic Data
Analysis is to extend classical data analysis techniques to symbolic data, addressing the issues raised by the new
representation spaces.
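As a minimal sketch of the idea, the following Python fragment aggregates the classical records of one group into a single symbolic row: a numeric variable becomes an interval and a categorical variable becomes a set of weighted values. The data, the variable names and the helper functions `to_interval` and `to_modal` are hypothetical and chosen only for illustration; other symbolic variable types exist.

```python
from collections import Counter

def to_interval(values):
    """Summarise a numeric sample by the interval [min, max]."""
    return (min(values), max(values))

def to_modal(values):
    """Summarise a categorical sample by its relative frequencies (weighted values)."""
    counts = Counter(values)
    n = len(values)
    return {category: count / n for category, count in counts.items()}

# Hypothetical classical data: one row per individual, one value per variable.
team = [
    {"age": 21, "position": "forward"},
    {"age": 34, "position": "defender"},
    {"age": 27, "position": "forward"},
]

# One symbolic row describing the whole group, keeping its internal variability.
symbolic_row = {
    "age": to_interval([player["age"] for player in team]),
    "position": to_modal([player["position"] for player in team]),
}
print(symbolic_row)
```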
As regards clustering, the problem is to develop methods that can cluster symbolic data and that
provide clusters directly interpretable in terms of the descriptive variables. The proposed symbolic clustering
method handles data where each element is described by variables of possibly different types. Here we
use a bottom-up approach, merging (aggregating) two clusters at each step.
The criterion that guides cluster formation is the duality between the intent of a cluster (a conjunctive
description expressing its properties) and its extent (the list of its members). Therefore, each cluster is a concept: its
description generalises its members, and it is false for any element outside the cluster. In other words, each cluster is
associated with a condition necessary and sufficient for cluster membership. In addition, a numerical criterion is
defined, a generality degree that makes it possible, at each step, to choose the “best” among several possible aggregations.
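The intent/extent duality and the role of the generality degree can be illustrated with a small Python sketch restricted to interval-valued variables. It rests on assumptions not stated in the abstract: generalisation is taken as the smallest covering interval per variable, the generality degree as the product over variables of interval length divided by domain length, and the candidate with the lowest degree is taken as the preferred one. The data and the function names are hypothetical.

```python
def generalise(d1, d2):
    """Smallest interval description covering both d1 and d2, variable by variable."""
    return [(min(a[0], b[0]), max(a[1], b[1])) for a, b in zip(d1, d2)]

def covers(description, element):
    """True if every interval of the element lies inside the description."""
    return all(lo <= e_lo and e_hi <= hi
               for (lo, hi), (e_lo, e_hi) in zip(description, element))

def extent(description, data):
    """Names of all elements whose description is covered by `description`."""
    return {name for name, element in data.items() if covers(description, element)}

def generality_degree(description, domains):
    """Assumed measure: product of (interval length / domain length) over variables."""
    g = 1.0
    for (lo, hi), (d_lo, d_hi) in zip(description, domains):
        g *= (hi - lo) / (d_hi - d_lo)
    return g

# Hypothetical data: three elements described by two interval-valued variables.
data = {
    "a": [(1, 3), (10, 20)],
    "b": [(2, 4), (15, 25)],
    "c": [(8, 9), (12, 18)],
}
domains = [(0, 10), (0, 50)]  # assumed observation domain of each variable

desc_ab = generalise(data["a"], data["b"])  # candidate intent for {a, b}
desc_ac = generalise(data["a"], data["c"])  # candidate intent for {a, c}

# A candidate is a concept when its extent is exactly the set being merged.
ab_is_concept = extent(desc_ab, data) == {"a", "b"}
ac_is_concept = extent(desc_ac, data) == {"a", "c"}

g_ab = generality_degree(desc_ab, domains)
g_ac = generality_degree(desc_ac, domains)
print(desc_ab, ab_is_concept, g_ab)
print(desc_ac, ac_is_concept, g_ac)
```

In this toy example both candidates are concepts, and {a, b} has the less general description, so under the stated assumption it would be merged first.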
We consider hierarchical and pyramidal clustering models. Pyramids extend the hierarchical model by allowing for
overlapping clusters which are not contained in each other, but the model imposes the existence of a total order on the
set being clustered, such that each cluster formed is an interval of this order. Pyramidal clustering thus provides a
richer structure than a hierarchy, in that it allows for the formation of a larger number of clusters, and it
also provides an ordering of the given set.
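The difference between the two models can be checked on a toy family of clusters. The Python sketch below tests only two of the conditions mentioned above: that every cluster is an interval of a given total order (pyramid) and that any two clusters are either nested or disjoint (hierarchy). It is not a full test of either model, and the order, the clusters and the function names are hypothetical.

```python
from itertools import combinations

def is_interval(cluster, order):
    """True if the members of `cluster` occupy consecutive positions in `order`."""
    positions = sorted(order.index(x) for x in cluster)
    return positions[-1] - positions[0] + 1 == len(positions)

def all_intervals(clusters, order):
    """Pyramid condition checked here: every cluster is an interval of the order."""
    return all(is_interval(c, order) for c in clusters)

def nested_or_disjoint(clusters):
    """Hierarchy condition checked here: any two clusters are nested or disjoint."""
    return all(a <= b or b <= a or not (a & b)
               for a, b in combinations(clusters, 2))

order = ["a", "b", "c", "d"]
clusters = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"a", "b", "c", "d"}]

print(all_intervals(clusters, order))   # {a, b} and {b, c} overlap but are both intervals
print(nested_or_disjoint(clusters))     # the same overlap is not allowed in a hierarchy
print(is_interval({"a", "c"}, order))   # skips "b", so it is not an interval
```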
In this talk, we start by recalling the hierarchical and pyramidal clustering models, showing how pyramids extend
hierarchies to a richer structure. We then introduce symbolic data, and the above-mentioned new variable types.
We proceed by presenting the symbolic hierarchical/pyramidal clustering method, detailing the generalisation procedure
and the determination of generality measures for the different variable types. We introduce the HIPYR module of
the SODAS software, which performs a hierarchical or a pyramidal clustering of a given symbolic data set.
Finally, we present current research developments on a whole family of hierarchical/pyramidal clustering methods,
based on quantile representation. In this setup, observed variable values are expressed by some predefined quantiles
of the underlying distribution or by its entire quantile function. At each step, the method merges the two clusters
with the closest quantile representations; the newly formed cluster is then represented according to the same model,
i.e., a discrete or continuous quantile representation for the new cluster is determined from a suitable mixture of the
respective distributions.
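One merging step with a discrete quantile representation might be sketched as follows in Python. Everything specific here is an assumption made for illustration: clusters are described by their values at the probabilities 0, 0.25, 0.5, 0.75 and 1, the quantile values are strictly increasing, the distribution is taken as uniform between consecutive quantiles, closeness is measured by the squared Euclidean distance between quantile vectors, and the mixture is weighted by cluster sizes. The talk's actual choices may differ.

```python
PROBS = [0.0, 0.25, 0.5, 0.75, 1.0]  # assumed predefined probability levels

def cdf(quantiles, x):
    """CDF at x, assuming a uniform distribution between consecutive quantiles."""
    if x <= quantiles[0]:
        return 0.0
    if x >= quantiles[-1]:
        return 1.0
    for i in range(len(PROBS) - 1):
        lo, hi = quantiles[i], quantiles[i + 1]
        if lo <= x <= hi:
            return PROBS[i] + (PROBS[i + 1] - PROBS[i]) * (x - lo) / (hi - lo)

def distance(q1, q2):
    """Assumed dissimilarity: squared Euclidean distance between quantile vectors."""
    return sum((a - b) ** 2 for a, b in zip(q1, q2))

def mixture_quantiles(q1, n1, q2, n2, tol=1e-9):
    """Quantiles of the size-weighted mixture, found by bisection on the mixture CDF."""
    w1 = n1 / (n1 + n2)
    w2 = 1.0 - w1
    lo_all, hi_all = min(q1[0], q2[0]), max(q1[-1], q2[-1])
    result = []
    for p in PROBS:
        if p == 0.0:
            result.append(lo_all)
            continue
        if p == 1.0:
            result.append(hi_all)
            continue
        lo, hi = lo_all, hi_all
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if w1 * cdf(q1, mid) + w2 * cdf(q2, mid) < p:
                lo = mid
            else:
                hi = mid
        result.append((lo + hi) / 2)
    return result

# Hypothetical clusters of equal size, uniform on [0, 4] and on [4, 8].
q_a = [0.0, 1.0, 2.0, 3.0, 4.0]
q_b = [4.0, 5.0, 6.0, 7.0, 8.0]
merged = mixture_quantiles(q_a, 10, q_b, 10)
print(distance(q_a, q_b), merged)
```

With equal weights the mixture in this example is uniform on [0, 8], so the merged quantiles come out close to 0, 2, 4, 6 and 8.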
Speaker: Manabu Ichino (School of Science and Engineering, Tokyo Denki University, Japan)
