{"id":368,"date":"2013-12-06T15:12:05","date_gmt":"2013-12-06T20:12:05","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=368"},"modified":"2013-12-06T15:12:05","modified_gmt":"2013-12-06T20:12:05","slug":"clustering-as-a-mixture-of-gaussians","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2013\/12\/06\/clustering-as-a-mixture-of-gaussians\/","title":{"rendered":"Clustering as a Mixture of Gaussians"},"content":{"rendered":"<p align=\"justify\"><span style=\"font-family: Arial, Helvetica, sans-serif; font-size: xx-small;\"><em>Introduction to Model-Based Clustering<\/em><\/span><br \/>\n<span style=\"font-family: 'Times New Roman', Times, serif;\">There\u2019s another way to deal with clustering problems: a\u00a0<em>model-based<\/em>\u00a0approach, which consists in using certain models for clusters and attempting to optimize the fit between the data and the model.<br \/>\nIn practice, each cluster can be mathematically represented by a parametric distribution, like a Gaussian (continuous) or a Poisson (discrete). The entire data set is therefore modelled by a\u00a0<em>mixture<\/em>\u00a0of these distributions. An individual distribution used to model a specific cluster is often referred to as a\u00a0<em>component<\/em>\u00a0distribution.<\/span><\/p>\n<p align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">A mixture model with high likelihood tends to have the following traits:<\/span><\/p>\n<ul>\n<li><span style=\"font-family: 'Times New Roman', Times, serif;\">component distributions have high \u201cpeaks\u201d (data in one cluster are tight);<\/span><\/li>\n<li><span style=\"font-family: 'Times New Roman', Times, serif;\">the mixture model \u201ccovers\u201d the data well (dominant patterns in the data are captured by component distributions).<\/span><\/li>\n<\/ul>\n<p align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">Main advantages of model-based clustering:<\/span><\/p>\n<ul>\n<li><span style=\"font-family: 'Times New Roman', Times, serif;\">well-studied statistical inference techniques available;<\/span><\/li>\n<li><span style=\"font-family: 'Times New Roman', Times, serif;\">flexibility in choosing the component distribution;<\/span><\/li>\n<li><span style=\"font-family: 'Times New Roman', Times, serif;\">obtain a density estimation for each cluster;<\/span><\/li>\n<li><span style=\"font-family: 'Times New Roman', Times, serif;\">a \u201csoft\u201d classification is available.<\/span><\/li>\n<\/ul>\n<p align=\"justify\">\n<p align=\"justify\"><em><span style=\"font-family: Arial, Helvetica, sans-serif; font-size: xx-small;\">Mixture of Gaussians<\/span><\/em><br \/>\n<span style=\"font-family: 'Times New Roman', Times, serif;\">The most widely used clustering method of this kind is the one based on learning a\u00a0<em>mixture of Gaussians<\/em>: we can actually consider clusters as Gaussian distributions centred on their barycentres, as we can see in this picture, where the grey circle represents the first variance of the distribution:<\/span><\/p>\n<p align=\"center\"><img decoding=\"async\" alt=\"\" src=\"http:\/\/home.deib.polimi.it\/matteucc\/Clustering\/tutorial_html\/images\/image060.gif\" \/><\/p>\n<p align=\"justify\"><span style=\"font-family: 'Times New Roman', Times, serif;\">The algorithm works in this way:<\/span><\/p>\n<ul>\n<li><span style=\"font-family: 'Times New Roman', Times, serif;\">it chooses the component (the Gaussian) at random with probability\u00a0<img 
Let's suppose we have:

- the data points $x_1, x_2, \dots, x_N$;
- the mixing probabilities $P(\omega_1), \dots, P(\omega_k)$ and the common variance $\sigma^2$, while the centres $\mu_1, \dots, \mu_k$ of the Gaussians are the unknowns to estimate.

Each component gives the probability of a datum it generates, $P(x \mid \omega_i, \mu_1, \dots, \mu_k)$. What we really want to maximise is $P(x \mid \mu_1, \dots, \mu_k)$ (the probability of a datum given the centres of the Gaussians):

$$P(x \mid \mu_1, \dots, \mu_k) = \sum_{i=1}^{k} P(x \mid \omega_i, \mu_1, \dots, \mu_k)\, P(\omega_i)$$

This is the basis for writing the likelihood function:

$$L(\mu_1, \dots, \mu_k) = \prod_{j=1}^{N} P(x_j \mid \mu_1, \dots, \mu_k) = \prod_{j=1}^{N} \sum_{i=1}^{k} P(x_j \mid \omega_i, \mu_1, \dots, \mu_k)\, P(\omega_i)$$

Now we should maximise the likelihood function by solving $\partial L / \partial \mu_i = 0$ for each centre, but the sum trapped inside the product makes a closed-form solution too difficult. That is why we use an iterative algorithm called EM (Expectation-Maximization).
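The log of this likelihood is exactly the quantity EM will climb, so it may help to evaluate it directly. A minimal sketch, meant to be run after the sampling snippet above (it reuses the hypothetical `X`, `means`, `sigma` and `weights` defined there):

```python
import numpy as np

def log_likelihood(X, means, sigma, weights):
    """log L = sum_j log sum_i P(x_j | omega_i, mu_i) P(omega_i)
    for isotropic components N(mu_i, sigma^2 I)."""
    d = X.shape[1]
    # Squared distance of every point to every centre, shape (N, k).
    sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    # Component densities P(x_j | omega_i, mu_i), shape (N, k).
    dens = np.exp(-sq / (2 * sigma**2)) / (2 * np.pi * sigma**2) ** (d / 2)
    return np.log(dens @ weights).sum()

# The likelihood is higher at the true centres than at a shifted guess.
print(log_likelihood(X, means, sigma, weights))
print(log_likelihood(X, means + 2.0, sigma, weights))
```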
*The EM Algorithm*

The algorithm used in practice to find the mixture of Gaussians that models the data set is called EM (Expectation-Maximization) (Dempster, Laird and Rubin, 1977). Let's see how it works with an example.

Suppose the $x_k$ are the marks obtained by the students of a class, with probabilities governed by a single unknown parameter $\mu$:

- $x_1 = 30$, with probability $\tfrac{1}{2}$
- $x_2 = 18$, with probability $\mu$
- $x_3 = 0$, with probability $2\mu$
- $x_4 = 23$, with probability $\tfrac{1}{2} - 3\mu$

First case: we observe how the marks are distributed among the students:

- $x_1$: $a$ students
- $x_2$: $b$ students
- $x_3$: $c$ students
- $x_4$: $d$ students

The likelihood of this observation is

$$P(a, b, c, d \mid \mu) = K \left(\tfrac{1}{2}\right)^{a} \mu^{b} \,(2\mu)^{c} \left(\tfrac{1}{2} - 3\mu\right)^{d}$$

where $K$ is the multinomial coefficient, which does not depend on $\mu$. We should maximise this function by solving $\partial P / \partial \mu = 0$. Let's instead take the logarithm of the function and maximise that:

$$\log P(a, b, c, d \mid \mu) = \log K + a \log \tfrac{1}{2} + b \log \mu + c \log 2\mu + d \log\left(\tfrac{1}{2} - 3\mu\right)$$

Setting the derivative with respect to $\mu$ to zero, $\frac{b + c}{\mu} = \frac{3d}{\frac{1}{2} - 3\mu}$, gives

$$\mu = \frac{b + c}{6\,(b + c + d)}$$

Supposing $a = 14$, $b = 6$, $c = 9$ and $d = 10$, we can calculate that $\mu = \tfrac{1}{10}$.

Second case: we observe only a partially aggregated distribution of the marks:

- $x_1 + x_2$: $h$ students (we know how many students got $x_1$ or $x_2$, but not the split into $a$ and $b$)
- $x_3$: $c$ students
- $x_4$: $d$ students

We have thus obtained a circularity, which is divided into two steps:

- expectation: given the current $\mu$, compute the expected number of $x_2$ students, $b = h \,\dfrac{\mu}{\frac{1}{2} + \mu}$;
- maximization: given $b$, re-estimate $\mu = \dfrac{b + c}{6\,(b + c + d)}$.

This circularity can be solved in an iterative way, as in the sketch below.
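A minimal sketch of that iteration in Python, using the counts from the example above aggregated as $h = a + b = 20$ (the starting guess for $\mu$ is arbitrary):

```python
# EM for the marks example: h students scored x1 or x2 (split unknown),
# c scored x3, d scored x4; here h = a + b = 14 + 6 = 20.
h, c, d = 20, 9, 10

mu = 0.05  # arbitrary starting guess in (0, 1/6)
for _ in range(30):
    # E-step: expected number of x2 students, since
    # P(x2 | "x1 or x2") = mu / (1/2 + mu).
    b = h * mu / (0.5 + mu)
    # M-step: the closed-form maximiser derived in the first case.
    mu = (b + c) / (6 * (b + c + d))

print(mu)  # about 0.0912
```

Note that the limit (about 0.091) is close to, but not exactly, the complete-data estimate $\tfrac{1}{10}$: merging $x_1$ and $x_2$ discards some information about $\mu$.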
Let's now see how the EM algorithm works for a mixture of Gaussians (parameters estimated at the $p$-th iteration are marked by a superscript $(p)$):

1. Initialize the parameters: the centres and mixing probabilities (and, in general, the variances), collectively

   $$\lambda^{(0)} = \left\{ \mu_1^{(0)}, \dots, \mu_k^{(0)},\; P(\omega_1)^{(0)}, \dots, P(\omega_k)^{(0)} \right\}$$

2. E-step: for every record $x_j$ and every component $\omega_i$, compute the posterior probability that $x_j$ was generated by $\omega_i$:

   $$P(\omega_i \mid x_j, \lambda^{(p)}) = \frac{P(x_j \mid \omega_i, \lambda^{(p)})\, P(\omega_i \mid \lambda^{(p)})}{P(x_j \mid \lambda^{(p)})}$$

3. M-step: re-estimate the parameters as posterior-weighted averages,

   $$\mu_i^{(p+1)} = \frac{\sum_j P(\omega_i \mid x_j, \lambda^{(p)})\, x_j}{\sum_j P(\omega_i \mid x_j, \lambda^{(p)})}, \qquad P(\omega_i)^{(p+1)} = \frac{1}{R} \sum_j P(\omega_i \mid x_j, \lambda^{(p)})$$

   where $R$ is the number of records.
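Putting steps 1-3 together for the isotropic, fixed-$\sigma$ model used throughout: a minimal sketch, meant to be run after the sampling snippet above (it reuses `X`; the random initialization and fixed iteration count are arbitrary choices for illustration):

```python
import numpy as np

def em_gmm(X, k, sigma=1.0, iters=50, seed=0):
    """EM for a mixture of k isotropic Gaussians N(mu_i, sigma^2 I)
    with a fixed shared sigma; estimates centres and mixing weights."""
    rng = np.random.default_rng(seed)
    R, d = X.shape                                  # R = number of records
    mu = X[rng.choice(R, size=k, replace=False)]    # init centres at data points
    w = np.full(k, 1.0 / k)                         # init P(omega_i) uniformly

    for _ in range(iters):
        # E-step: responsibilities P(omega_i | x_j, lambda^(p)), shape (R, k).
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        dens = np.exp(-sq / (2 * sigma**2)) / (2 * np.pi * sigma**2) ** (d / 2)
        resp = dens * w                             # numerator of Bayes' rule
        resp /= resp.sum(axis=1, keepdims=True)     # divide by P(x_j | lambda)

        # M-step: posterior-weighted re-estimates.
        Ni = resp.sum(axis=0)                       # effective count per component
        mu = (resp.T @ X) / Ni[:, None]             # new mu_i
        w = Ni / R                                  # new P(omega_i)

    return mu, w

mu_hat, w_hat = em_gmm(X, k=3)
print(np.round(mu_hat, 2), np.round(w_hat, 2))
```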
*Bibliography*

- A.P. Dempster, N.M. Laird, and D.B. Rubin (1977): "Maximum Likelihood from Incomplete Data via the EM Algorithm", *Journal of the Royal Statistical Society*, Series B, vol. 39, no. 1, pp. 1-38.
- Osmar R. Zaïane: "Principles of Knowledge Discovery in Databases – Chapter 8: Data Clustering". http://www.cs.ualberta.ca/~zaiane/courses/cmput690/slides/Chapter8/index.html
- Jia Li: "Data Mining – Clustering by Mixture Models". http://www.stat.psu.edu/~jiali/course/stat597e/notes/mix.pdf