“Œv”—Œ€‹†Š
Workshop on Symbolic Data Analysis

Date: 1st November 2011, 13:00-17:20
13:30-15:00 Talk 1:
Title:   Hierarchical and Pyramidal Symbolic Clustering
Speaker:   Paula Brito (Faculty of Economics & LIADD – INESC Porto L.A., University of Porto, Portugal)

   In multivariate data analysis, each single observation often contains some intrinsic variability, which should be taken into account to avoid an important loss of information. This is the case when analyzing a group rather than a single individual, where intra-group variability should not be overlooked, or when we observe a variable over time and wish to record the set of observed values rather than just a specific one. Symbolic data extend the classical tabular model, where individuals (usually in rows) take one single value for each variable (usually in columns), by allowing for multiple, possibly weighted, values for each variable. New variable types have therefore been introduced to represent the variability inherent in the data. The main objective of Symbolic Data Analysis is to extend classical data analysis techniques to symbolic data, addressing the issues raised by the new representation spaces.
   As for clustering, the problem consists in developing methods that can cluster symbolic data and that provide clusters directly interpretable in terms of the descriptive variables. The proposed symbolic clustering method can handle data where each element is described by variables of possibly different types. Here we use a bottom-up approach, merging (aggregating) two clusters at each step.
   The criterion that guides cluster formation is the duality between intent (a conjunctive description expressing the cluster's properties) and extent (the list of its members). Therefore, each cluster is a concept: its description generalises its members and is false for any element outside the cluster. In other words, each cluster is associated with a necessary and sufficient condition for cluster membership. In addition, a numerical criterion is defined, a generality degree, which allows choosing the “best” among several possible aggregations at each step.
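   As an illustration only, one bottom-up step of this kind can be sketched for interval-valued variables. The sketch assumes that generalisation takes the smallest covering interval for each variable and that the generality degree is the product of the interval lengths relative to the variable domains; the abstract does not state these formulas, and all function names and data below are hypothetical.

```python
from itertools import combinations

def generalise(d1, d2):
    """Smallest interval description covering both descriptions, per variable."""
    return [(min(a[0], b[0]), max(a[1], b[1])) for a, b in zip(d1, d2)]

def generality(desc, domain):
    """Assumed generality degree: product of interval length / domain length."""
    g = 1.0
    for (lo, hi), (dlo, dhi) in zip(desc, domain):
        g *= (hi - lo) / (dhi - dlo)
    return g

def extent(desc, data):
    """Names of all elements whose intervals lie inside the description."""
    return {name for name, x in data.items()
            if all(lo <= xlo and xhi <= hi
                   for (lo, hi), (xlo, xhi) in zip(desc, x))}

def best_merge(clusters, data, domain):
    """Among pairs whose merged description covers exactly their members
    (so the new cluster is a concept), pick the least general one."""
    best = None
    for (n1, (d1, m1)), (n2, (d2, m2)) in combinations(clusters.items(), 2):
        desc = generalise(d1, d2)
        if extent(desc, data) != m1 | m2:
            continue  # description also covers outsiders: not a concept
        g = generality(desc, domain)
        if best is None or g < best[2]:
            best = (n1, n2, g, desc)
    return best

# Hypothetical data: three elements described by two interval variables.
data = {"a": [(0, 2), (0, 1)], "b": [(1, 3), (0, 2)], "c": [(8, 10), (4, 5)]}
domain = [(0, 10), (0, 5)]
clusters = {name: (desc, {name}) for name, desc in data.items()}
result = best_merge(clusters, data, domain)
print(result[:3])  # merges "a" and "b", generality about 0.12
```

   In this toy data set, merging "a" with "c" is rejected because the resulting description would also cover "b", which is the necessary-and-sufficient condition at work.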
  We consider hierarchical and pyramidal clustering models. Pyramids extend the hierarchical model by allowing overlapping clusters that are not contained in one another, but the model imposes the existence of a total order on the set being clustered, such that each cluster formed is an interval of this order. Pyramidal clustering thus provides a richer clustering than a hierarchy, in that it allows for the formation of a larger number of clusters, and it also provides an ordering of the given set.
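  The difference between the two models can be made concrete with a small check. This sketch only tests the two conditions named above (nested-or-disjoint clusters for a hierarchy, interval clusters for a pyramid) and not the remaining axioms of either model; the function names and the example clusters are hypothetical.

```python
def is_interval(cluster, order):
    """True if the cluster occupies consecutive positions in the total order."""
    pos = sorted(order.index(x) for x in cluster)
    return pos[-1] - pos[0] + 1 == len(pos)

def is_hierarchy(clusters):
    """True if any two clusters are either nested or disjoint."""
    return all(a <= b or b <= a or not (a & b)
               for a in clusters for b in clusters)

def fits_order(clusters, order):
    """True if every cluster is an interval of the given total order."""
    return all(is_interval(c, order) for c in clusters)

order = ["a", "b", "c", "d"]
clusters = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"a", "b", "c", "d"}]

print(is_hierarchy(clusters))       # False: {a, b} and {b, c} overlap
print(fits_order(clusters, order))  # True: every cluster is an interval
```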
  In this talk, we start by recalling the hierarchical and pyramidal clustering models, showing how pyramids extend hierarchies to a richer structure. We then introduce symbolic data, and the above-mentioned new variable types. We proceed by presenting the symbolic hierarchical/pyramidal clustering method, detailing the generalisation procedure and the determination of generality measures for the different variable types. We introduce the HIPYR module of the SODAS software, which allows performing a hierarchical or a pyramidal clustering of a given symbolic data set.
  Finally, we present current research developments on a whole family of hierarchical/pyramidal clustering methods, based on quantile representation. In this setup, observed variable values are expressed by some predefined quantiles of the underlying distribution or by its entire quantile function. At each step, the method merges the two clusters with closest quantile representations; the newly formed cluster is then represented according to the same model, i.e., a discrete or continuous quantile representation for the new cluster is determined from a suitable mixture of the respective distributions.
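  A minimal sketch of one such merge step follows. It assumes that each cluster is represented by five predefined quantiles, that the distribution is uniform between consecutive quantiles, that closeness is measured by the Euclidean distance between quantile vectors, and that the mixture weights are proportional to cluster sizes. None of these choices is specified in the abstract, and the names are hypothetical.

```python
import math

PROBS = [0.0, 0.25, 0.5, 0.75, 1.0]  # predefined quantile levels (assumed)

def cdf(q, x):
    """Piecewise-linear CDF through the points (q[i], PROBS[i])."""
    if x < q[0]:
        return 0.0
    if x >= q[-1]:
        return 1.0
    for i in range(len(q) - 1):
        if q[i] <= x < q[i + 1]:
            t = (x - q[i]) / (q[i + 1] - q[i])
            return PROBS[i] + t * (PROBS[i + 1] - PROBS[i])

def distance(q1, q2):
    """Euclidean distance between two quantile vectors (assumed criterion)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q1, q2)))

def merge(q1, n1, q2, n2, iters=60):
    """Quantiles of the size-weighted mixture, found by bisection on its CDF."""
    w1 = n1 / (n1 + n2)
    lo0, hi0 = min(q1[0], q2[0]), max(q1[-1], q2[-1])
    merged = []
    for p in PROBS:
        lo, hi = lo0, hi0
        for _ in range(iters):
            mid = (lo + hi) / 2
            if w1 * cdf(q1, mid) + (1 - w1) * cdf(q2, mid) >= p:
                hi = mid
            else:
                lo = mid
        merged.append(hi)
    return merged

q1 = [0, 1, 2, 3, 4]  # hypothetical cluster, uniform on [0, 4]
q2 = [2, 3, 4, 5, 6]  # hypothetical cluster, uniform on [2, 6]
merged = merge(q1, 1, q2, 1)
print([round(v, 6) for v in merged])  # approximately [0, 2, 3, 4, 6]
```

  The merged cluster is again a vector of the same five quantiles, so it can take part in later merge steps in exactly the same way as the original clusters.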

15:00-15:30
Discussion and break
15:30-17:00 Talk 2:
Title: Analysis of Mixed Feature-type Data by the Cartesian System Model
Speaker:  Manabu Ichino (School of Science and Engineering, Tokyo Denki University, Japan)
  This paper presents some methods for mixed feature-type data analysis based on the Cartesian system model (CSM), which is a mathematical model to manipulate mixed feature-type data. The CSM is represented by (D(d), ⊞, ⊠). The feature space, written D(d), is represented as a Cartesian product of d domains for mixed types of features including interval features, multinominal features, and others. The Cartesian join operator, written ⊞, generates the least generalized description in the feature space from a given pair of objects, while the Cartesian meet operator, written ⊠, generates the greatest common description in the feature space from the pair of objects. We outline several applications of the CSM to mixed feature-type data analysis problems. In particular, we present and compare the properties of some possible similarity and dissimilarity measures based on the CSM.
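  To make the two operators concrete, the following sketch applies a join and a meet feature by feature to objects with one interval feature and one set-valued feature. It assumes that the join of two intervals is the smallest interval containing both, that the join of two value sets is their union, and that the meet is the intersection in both cases. The dissimilarity shown (size of the join minus size of the meet, summed over features) is only an illustration and is not necessarily one of the measures compared in the talk; all names are hypothetical.

```python
def join(a, b):
    """Least general description covering both feature values."""
    if isinstance(a, tuple):                      # interval feature
        return (min(a[0], b[0]), max(a[1], b[1]))
    return a | b                                  # set-valued feature

def meet(a, b):
    """Greatest common description; None stands for an empty interval."""
    if isinstance(a, tuple):
        lo, hi = max(a[0], b[0]), min(a[1], b[1])
        return (lo, hi) if lo <= hi else None
    return a & b

def size(v):
    """Interval length, or number of values for a set-valued feature."""
    if v is None:
        return 0.0
    if isinstance(v, tuple):
        return float(v[1] - v[0])
    return float(len(v))

def dissimilarity(x, y):
    """Illustrative measure: sum over features of |join| - |meet|."""
    return sum(size(join(a, b)) - size(meet(a, b)) for a, b in zip(x, y))

# Hypothetical objects: one interval feature and one set-valued feature.
x = [(1, 4), frozenset({"red", "blue"})]
y = [(3, 6), frozenset({"blue"})]

print(join(x[0], y[0]), meet(x[0], y[0]))  # (1, 6) (3, 4)
print(dissimilarity(x, y))                 # 5.0
```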
17:00-17:20
Discussion