Taking into account Qualitative and Textual Variables in Hierarchical Ascending Clustering (HAC)
Odilon Yapo M. ACHIEPO1, Kouassi Hilaire EDI2, Behou Gérard N’GUESSAN3, Patrice MENSAH4
1Odilon Yapo M. ACHIEPO*, University Péléforo Gon Coulibaly, Management Institut Agropastorale – Korhogo, Côte d’Ivoire.
2Kouassi Hilaire EDI, University Nangui Abrogoua, Mathematics and Computer Science Laboratory – Abidjan, Côte d’Ivoire.
3Behou Gérard N’GUESSAN, Virtual University of Côte d’Ivoire, Research and Digital Expertise Unit (UREN) – Abidjan, Côte d’Ivoire.
4Patrice Edoété MENSAH, National Polytechnic Institute Felix Houphouet Boigny of Yamoussoukro, Côte d’Ivoire.
Manuscript received on 5 August 2019. | Revised Manuscript received on 11 August 2019. | Manuscript published on 30 September 2019. | PP: 1555-1561 | Volume-8 Issue-3 September 2019 | Retrieval Number: C4276098319/19©BEIESP | DOI: 10.35940/ijrte.C4276.098319
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: In machine learning, clustering methods are the main unsupervised methods. Their objective is to partition a set of objects into homogeneous groups. Clustering methods in general, and Hierarchical Ascending Clustering (HAC) techniques in particular, are based on metrics and ultra-metrics. Metrics are used to evaluate the similarity between two objects, and ultra-metrics are used to estimate the similarity between two groups or between an element and a group. The main characteristic of these metrics and ultra-metrics is that they are suited only to numerical variables, or to variables that can be reduced to numerical ones. With the advent of Data Mining and Data Science, most datasets to be analyzed contain different types of variables. The same dataset very often contains numeric attributes, qualitative variables and free-text fields together. Despite this diversity of variables within a single dataset, existing clustering methods are generally built to use only a single kind of attribute. In this paper, we propose an approach that takes different types of attributes into account in the same clustering method. The proposed method is a variant of HAC that can handle numerical, qualitative and textual data together. Our approach is based on a metric called Phi-Similarity, which we developed to estimate the proximity of two objects, each described by a vector of attributes of different types. The method is implemented in the scientific computing language R and applied to real survey data. The results are compared with those of HAC techniques based on classical metrics, with the Ward criterion as the aggregation criterion. For the classical algorithms, we limit ourselves to the variables of the database that are compatible with them. This comparison highlights the gain in classification precision that our method brings over the classical versions of HAC.
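The general idea of running HAC on a single dissimilarity that combines numerical, qualitative and textual attributes can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define Phi-Similarity, so the sketch substitutes a Gower-style average of per-attribute dissimilarities (range-scaled difference for numeric values, simple mismatch for qualitative values, Jaccard distance on word sets for text). It also uses average linkage rather than the Ward criterion, and Python rather than the authors' R implementation. The function names, the `text_keys` parameter and the sample records are hypothetical.

```python
from itertools import combinations


def tokens(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def mixed_dissimilarity(a, b, numeric_ranges, text_keys):
    """Gower-style stand-in (NOT the paper's Phi-Similarity): mean of
    per-attribute dissimilarities, each in [0, 1]."""
    parts = []
    for key in a:
        if key in numeric_ranges:        # numeric: range-scaled absolute difference
            span = numeric_ranges[key]
            parts.append(abs(a[key] - b[key]) / span if span else 0.0)
        elif key in text_keys:           # text: Jaccard distance on word sets
            ta, tb = tokens(a[key]), tokens(b[key])
            union = ta | tb
            parts.append(1 - len(ta & tb) / len(union) if union else 0.0)
        else:                            # qualitative: simple mismatch
            parts.append(0.0 if a[key] == b[key] else 1.0)
    return sum(parts) / len(parts)


def agglomerate(records, n_clusters, text_keys=()):
    """Naive ascending clustering with average linkage (assumed here;
    the abstract mentions Ward only for the classical baselines)."""
    numeric_ranges = {}
    for key, value in records[0].items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            column = [r[key] for r in records]
            numeric_ranges[key] = max(column) - min(column)

    # Pairwise dissimilarities between individual records, keyed by (i, j), i < j.
    dist = {
        (i, j): mixed_dissimilarity(records[i], records[j], numeric_ranges, text_keys)
        for i, j in combinations(range(len(records)), 2)
    }

    def linkage(c1, c2):
        total = sum(dist[min(i, j), max(i, j)] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(len(records))]
    while len(clusters) > n_clusters:
        # Merge the two closest clusters; p < q, so deleting q keeps p valid.
        p, q = min(
            combinations(range(len(clusters)), 2),
            key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


# Hypothetical survey-like records: one numeric, one qualitative, one text field.
records = [
    {"age": 25, "region": "north", "comment": "price too high"},
    {"age": 27, "region": "north", "comment": "price is too high"},
    {"age": 60, "region": "south", "comment": "friendly staff"},
    {"age": 62, "region": "south", "comment": "very friendly staff"},
]
clusters = agglomerate(records, n_clusters=2, text_keys={"comment"})
print(clusters)  # [[0, 1], [2, 3]]
```

Because every attribute contributes a value in [0, 1], no attribute type dominates the combined dissimilarity by scale alone; a method such as the one the abstract describes could weight or combine the parts differently.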
Keywords: Hierarchical Ascending Clustering, Phi-similarity, R-Language
Scope of the Article: Clustering