{"id":362,"date":"2013-12-06T15:09:21","date_gmt":"2013-12-06T20:09:21","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=362"},"modified":"2013-12-06T15:09:21","modified_gmt":"2013-12-06T20:09:21","slug":"clustering-an-introduction","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2013\/12\/06\/clustering-an-introduction\/","title":{"rendered":"Clustering: An Introduction"},"content":{"rendered":"<div align=\"center\">\n<p align=\"justify\"><span style=\"font-family: Arial, Helvetica, sans-serif; font-size: xx-small;\"><em>What is Clustering?<\/em><\/span><br \/>\n<span style=\"font-family: 'Times New Roman', Times, serif;\">Clustering can be considered the most important\u00a0<em>unsupervised learning<\/em>\u00a0problem; so, as every other problem of this kind, it deals with finding a\u00a0<em>structure<\/em>\u00a0in a collection of unlabeled data.<br \/>\nA loose definition of clustering could be \u201cthe process of organizing objects into groups whose members are similar in some way\u201d.<br \/>\nA\u00a0<em>cluster<\/em>\u00a0is therefore a collection of objects which are \u201csimilar\u201d between them and are \u201cdissimilar\u201d to the objects belonging to other clusters.<br \/>\nWe can show this with a simple graphical example:<\/span><\/p>\n<p><img decoding=\"async\" alt=\"\" src=\"http:\/\/home.deib.polimi.it\/matteucc\/Clustering\/tutorial_html\/images\/clustering.gif\" \/><\/p>\n<p align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">In this case we easily identify the 4 clusters into which the data can be divided; the similarity criterion is\u00a0<em>distance<\/em>: two or more objects belong to the same cluster if they are \u201cclose\u201d according to a given distance (in this case geometrical distance). 
This is called\u00a0<em>distance-based clustering<\/em>.<br \/>\nAnother kind of clustering is\u00a0<em>conceptual clustering<\/em>: two or more objects belong to the same cluster if the cluster defines a concept\u00a0<em>common<\/em>\u00a0to all of those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p align=\"justify\"><span style=\"font-family: Arial, Helvetica, sans-serif; font-size: xx-small;\"><em>The Goals of Clustering<\/em><\/span><br \/>\n<span style=\"font-family: 'Times New Roman', Times, serif;\">So, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute \u201cbest\u201d criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering will suit their needs.<br \/>\nFor instance, we could be interested in finding representatives for homogeneous groups (<em>data reduction<\/em>), in finding \u201cnatural clusters\u201d and describing their unknown properties (<em>\u201cnatural\u201d data types<\/em>), in finding useful and suitable groupings (<em>\u201cuseful\u201d data classes<\/em>) or in finding unusual data objects (<em>outlier detection<\/em>).<\/span><\/p>\n<p align=\"justify\">\n<p align=\"justify\"><span style=\"font-family: Arial, Helvetica, sans-serif; font-size: xx-small;\"><em>Possible Applications<\/em><\/span><br \/>\n<span style=\"font-family: 'Times New Roman', Times, serif;\">Clustering algorithms can be applied in many fields, for instance:<\/span><\/p>\n<\/div>\n<ul>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\"><em>Marketing<\/em>: finding groups of customers with similar behavior given a large database of customer data containing 
their properties and past buying records;<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\"><em>Biology<\/em>: classification of plants and animals given their features;<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\"><em>Libraries<\/em>: book ordering;<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\"><em>Insurance<\/em>: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\"><em>City-planning<\/em>: identifying groups of houses according to their house type, value and geographical location;<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\"><em>Earthquake studies<\/em>: clustering observed earthquake epicenters to identify dangerous zones;<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\"><em>WWW<\/em>: document classification; clustering weblog data to discover groups of similar access patterns.<\/span><\/div>\n<\/li>\n<\/ul>\n<div align=\"center\">\n<p align=\"justify\">\n<p align=\"justify\"><span style=\"font-family: Arial, Helvetica, sans-serif; font-size: xx-small;\"><em>Requirements<\/em><\/span><br \/>\n<span style=\"font-family: 'Times New Roman', Times, serif;\">The main requirements that a clustering algorithm should satisfy are:<\/span><\/p>\n<\/div>\n<ul>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">scalability;<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">dealing with different types of attributes;<\/span><\/div>\n<\/li>\n<li>\n<div 
align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">discovering clusters with arbitrary shape;<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">minimal requirements for domain knowledge to determine input parameters;<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">ability to deal with noise and outliers;<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">insensitivity to order of input records;<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">high dimensionality;<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">interpretability and usability.<\/span><\/div>\n<\/li>\n<\/ul>\n<div align=\"center\">\n<p>&nbsp;<\/p>\n<p align=\"justify\"><span style=\"font-family: Arial, Helvetica, sans-serif; font-size: xx-small;\"><em>Problems<\/em><\/span><br \/>\n<span style=\"font-family: 'Times New Roman', Times, serif;\">There are a number of problems with clustering. 
Among them:<\/span><\/p>\n<\/div>\n<ul>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">current clustering techniques do not address all the requirements adequately (and concurrently);<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity;<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">the effectiveness of the method depends on the definition of \u201cdistance\u201d (for distance-based clustering);<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">if an\u00a0<em>obvious<\/em>\u00a0distance measure doesn\u2019t exist we must \u201cdefine\u201d it, which is not always easy, especially in multi-dimensional spaces;<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">the result of the clustering algorithm (which in many cases can itself be arbitrary) can be interpreted in different ways.<\/span><\/div>\n<\/li>\n<\/ul>\n<div align=\"center\">\n<hr \/>\n<p><strong><span style=\"font-family: Arial, Helvetica, sans-serif; font-size: small;\">Clustering Algorithms<\/span><\/strong><\/p>\n<p align=\"justify\"><span style=\"font-family: Arial, Helvetica, sans-serif; font-size: xx-small;\"><em>Classification<\/em><\/span><br \/>\n<span style=\"font-family: 'Times New Roman', Times, serif;\">Clustering algorithms may be classified as listed below:<\/span><\/p>\n<\/div>\n<ul>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">Exclusive Clustering<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">Overlapping Clustering<\/span><\/div>\n<\/li>\n<li>\n<div 
align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">Hierarchical Clustering<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">Probabilistic Clustering<\/span><\/div>\n<\/li>\n<\/ul>\n<div align=\"center\">\n<p align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">In the first case data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster then it could not be included in another cluster. A simple example of that is shown in the figure below, where the separation of points is achieved by a straight line on a bi-dimensional plane.<br \/>\nOn the contrary the second type, the overlapping clustering, uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership. In this case, data will be associated to an appropriate membership value.<\/span><\/p>\n<p align=\"center\"><img decoding=\"async\" alt=\"\" src=\"http:\/\/home.deib.polimi.it\/matteucc\/Clustering\/tutorial_html\/images\/image004.gif\" \/><\/p>\n<p align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">Instead, a hierarchical clustering algorithm is based on the union between the two nearest clusters. The beginning condition is realized by setting every datum as a cluster. 
After a number of merging iterations, it reaches the desired final clusters.<br \/>\nFinally, the last kind of clustering uses a completely probabilistic approach.<\/span><\/p>\n<p align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">In this tutorial we present four of the most widely used clustering algorithms:<\/span><\/p>\n<\/div>\n<ul>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">K-means<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">Fuzzy C-means<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">Hierarchical clustering<\/span><\/div>\n<\/li>\n<li>\n<div align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">Mixture of Gaussians<\/span><\/div>\n<\/li>\n<\/ul>\n<p><span style=\"font-family: 'Times New Roman', Times, serif;\">Each of these algorithms belongs to one of the clustering types listed above. Thus,\u00a0<a href=\"http:\/\/home.deib.polimi.it\/matteucc\/Clustering\/tutorial_html\/kmeans.html\">K-means<\/a>\u00a0is an\u00a0<em>exclusive clustering<\/em>\u00a0algorithm,\u00a0<a href=\"http:\/\/home.deib.polimi.it\/matteucc\/Clustering\/tutorial_html\/cmeans.html\">Fuzzy C-means<\/a>\u00a0is an\u00a0<em>overlapping clustering<\/em>\u00a0algorithm,\u00a0<a href=\"http:\/\/home.deib.polimi.it\/matteucc\/Clustering\/tutorial_html\/hierarchical.html\">Hierarchical clustering<\/a>\u00a0is, of course, a hierarchical one, and lastly\u00a0<a href=\"http:\/\/home.deib.polimi.it\/matteucc\/Clustering\/tutorial_html\/mixture.html\">Mixture of Gaussians<\/a>\u00a0is a\u00a0<em>probabilistic clustering<\/em>\u00a0algorithm. 
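To ground the mapping above, here is a minimal, self-contained K-means sketch in plain Python (an illustrative simplification, not the implementation from the linked tutorial; the sample points, k=2, and the fixed seed are arbitrary choices):

```python
import random

def minkowski(a, b, p=2):
    # Minkowski distance between two points; p=2 is Euclidean, p=1 is Manhattan
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize with k random data points
    clusters = []
    for _ in range(iters):
        # assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for pt in points:
            nearest = min(range(k), key=lambda j: minkowski(pt, centroids[j]))
            clusters[nearest].append(pt)
        # update step: move each centroid to the mean of its assigned points
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments are stable: converged
            break
        centroids = new_centroids
    return centroids, clusters

# two well-separated groups of 2-D points; K-means assigns each point to
# exactly one cluster (the "exclusive" behavior described above)
pts = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 5.1), (4.9, 5.3)]
centroids, clusters = kmeans(pts, k=2)
```

With these two well-separated groups, the centroids converge to the group means, roughly (0.1, 0.13) and (5.03, 5.13), and every point ends up in exactly one of the two clusters.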
We will discuss each clustering method in the following paragraphs.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-family: Arial, Helvetica, sans-serif; font-size: xx-small;\"><em>Distance Measure<\/em><\/span><br \/>\n<span style=\"font-family: 'Times New Roman', Times, serif;\">An important component of a clustering algorithm is the distance measure between data points. If the components of the data instance vectors are all in the same physical units, it is possible that the simple Euclidean distance metric is sufficient to successfully group similar data instances. However, even in this case the Euclidean distance can sometimes be misleading. The figure shown below illustrates this with an example of the width and height measurements of an object. Despite both measurements being taken in the same physical units, an informed decision has to be made as to the relative scaling. As the figure shows, different scalings can lead to different clusterings.<\/span><\/p>\n<p align=\"center\"><img decoding=\"async\" alt=\"\" src=\"http:\/\/home.deib.polimi.it\/matteucc\/Clustering\/tutorial_html\/images\/image005.gif\" \/><\/p>\n<p align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">Notice, however, that this is not only a graphical issue: the problem arises from the mathematical formula used to combine the distances between the single components of the data feature vectors into a unique distance measure that can be used for clustering purposes: different formulas lead to different clusterings.<br \/>\nAgain, domain knowledge must be used to guide the formulation of a suitable distance measure for each particular application.<\/span><\/p>\n<p align=\"left\"><strong>Minkowski Metric<br \/>\n<\/strong>For higher-dimensional data, a popular measure is the Minkowski metric,<\/p>\n<p align=\"center\"><img decoding=\"async\" alt=\"\" src=\"http:\/\/home.deib.polimi.it\/matteucc\/Clustering\/tutorial_html\/images\/image007.gif\" \/><\/p>\n<p 
align=\"justify\">where\u00a0<em>d<\/em>\u00a0is the dimensionality of the data. The\u00a0<em>Euclidean<\/em>\u00a0distance is a special case where\u00a0<em>p<\/em>=2, while\u00a0<em>Manhattan<\/em>\u00a0metric has\u00a0<em>p<\/em>=1. However, there are no general theoretical guidelines for selecting a measure for any given application.<\/p>\n<p>It is often the case that the components of the data feature vectors are not immediately comparable. It can be that the components are not continuous variables, like length, but nominal categories, such as the days of the week. In these cases again, domain knowledge must be used to formulate an appropriate measure.<\/p>\n<p>&nbsp;<\/p>\n<p align=\"justify\"><span style=\"font-family: Arial, Helvetica, sans-serif; font-size: xx-small;\"><em>Bibliography<\/em><\/span><\/p>\n<ul>\n<li><span style=\"font-family: 'Times New Roman', Times, serif;\">Tariq Rashid: \u201cClustering\u201d<br \/>\n<a href=\"http:\/\/www.cs.bris.ac.uk\/home\/tr1690\/documentation\/fuzzy_clustering_initial_report\/node11.html\">http:\/\/www.cs.bris.ac.uk\/home\/tr1690\/documentation\/fuzzy_clustering_initial_report\/node11.html<\/a><\/span><\/li>\n<li><span style=\"font-family: 'Times New Roman', Times, serif;\">Osmar R. 
Za\u00efane: \u201cPrinciples of Knowledge Discovery in Databases &#8211; Chapter 8: Data Clustering\u201d<br \/>\n<a href=\"http:\/\/www.cs.ualberta.ca\/~zaiane\/courses\/cmput690\/slides\/Chapter8\/index.html\">http:\/\/www.cs.ualberta.ca\/~zaiane\/courses\/cmput690\/slides\/Chapter8\/index.html<\/a><\/span><\/li>\n<li><span style=\"font-family: 'Times New Roman', Times, serif;\">Pier Luca Lanzi: \u201cIngegneria della Conoscenza e Sistemi Esperti \u2013 Lezione 2: Apprendimento non supervisionato\u201d<br \/>\n<a href=\"http:\/\/www.elet.polimi.it\/upload\/lanzi\/corsi\/icse\/2002\/Lezione%202%20-%20Apprendimento%20non%20supervisionato.pdf\">http:\/\/www.elet.polimi.it\/upload\/lanzi\/corsi\/icse\/2002\/Lezione%202%20-%20Apprendimento%20non%20supervisionato.pdf<\/a><\/span><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>What is Clustering? Clustering can be considered the most important\u00a0unsupervised learning\u00a0problem; so, as every other problem of this kind, it deals with finding a\u00a0structure\u00a0in a&hellip; 
<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[24],"tags":[],"class_list":["post-362","post","type-post","status-publish","format-standard","hentry","category-statistics"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/362","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=362"}],"version-history":[{"count":0,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/362\/revisions"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media?parent=362"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=362"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=362"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}