{"id":657,"date":"2014-07-15T09:46:31","date_gmt":"2014-07-15T16:46:31","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=657"},"modified":"2014-07-15T09:46:31","modified_gmt":"2014-07-15T16:46:31","slug":"ggplot2-cheatsheet-for-visualizing-distributions-2","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2014\/07\/15\/ggplot2-cheatsheet-for-visualizing-distributions-2\/","title":{"rendered":"ggplot2: Cheatsheet for Visualizing Distributions"},"content":{"rendered":"<p><span style=\"color: #444444;\">In the third and last of the ggplot series, this post will go over interesting ways to visualize the distribution of your data. I will make up some data, and make sure to set the seed.<\/span><\/p>\n<pre style=\"color: #444444;\"><code>library(ggplot2)\nlibrary(gridExtra)\nset.seed(10005)\n\nxvar &lt;- c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5))\nyvar &lt;- c(rnorm(1500, mean = 1), rnorm(1500, mean = 1.5))\nzvar &lt;- as.factor(c(rep(1, 1500), rep(2, 1500)))\nxy &lt;- data.frame(xvar, yvar, zvar)\n<\/code><\/pre>\n<h3 style=\"color: #222222;\">&gt;&gt; Histograms<\/h3>\n<p><span style=\"color: #444444;\">I&#8217;ve already done a\u00a0<\/span><a style=\"color: #205b87;\" href=\"http:\/\/rforpublichealth.blogspot.com\/2012\/12\/basics-of-histograms.html\" target=\"_blank\" rel=\"noopener\">post on histograms<\/a><span style=\"color: #444444;\">\u00a0using base R, so I won&#8217;t spend too much time on them. 
Here are the basics of doing them in ggplot.\u00a0<\/span><a style=\"color: #205b87;\" href=\"http:\/\/docs.ggplot2.org\/current\/geom_histogram.html\" target=\"_blank\" rel=\"noopener\">More on all options for histograms here.<\/a><span style=\"color: #444444;\">\u00a0The R cookbook has a nice page about it too:\u00a0<\/span><strong style=\"color: #444444;\"><a style=\"color: #205b87;\" href=\"http:\/\/www.cookbook-r.com\/Graphs\/Plotting_distributions_(ggplot2)\/\" target=\"_blank\" rel=\"noopener\">http:\/\/www.cookbook-r.com\/Graphs\/Plotting_distributions_(ggplot2)\/<\/a><\/strong><span style=\"color: #444444;\"> Also, I found\u00a0<\/span><a style=\"color: #205b87;\" href=\"http:\/\/sape.inf.usi.ch\/quick-reference\/ggplot2\/geom\" target=\"_blank\" rel=\"noopener\">this really great aggregation<\/a><span style=\"color: #444444;\">\u00a0of all of the possible geom layers and options you can add to a plot. In general, the site is a great reference for all things ggplot.<\/span><\/p>\n<pre style=\"color: #444444;\"><code>#counts on y-axis\ng1&lt;-ggplot(xy, aes(xvar)) + geom_histogram()                                      #horribly ugly default\ng2&lt;-ggplot(xy, aes(xvar)) + geom_histogram(binwidth=1)                            #change binwidth\ng3&lt;-ggplot(xy, aes(xvar)) + geom_histogram(fill=NA, color=\"black\") + theme_bw()   #nicer looking\n\n#density on y-axis\ng4&lt;-ggplot(xy, aes(x=xvar)) + geom_histogram(aes(y = ..density..), color=\"black\", fill=NA) + theme_bw()\n\ngrid.arrange(g1, g2, g3, g4, nrow=1)\n<\/code>\n<code>## stat_bin: binwidth defaulted to range\/30. Use 'binwidth = x' to adjust\n## this. stat_bin: binwidth defaulted to range\/30. Use 'binwidth = x' to\n## adjust this. stat_bin: binwidth defaulted to range\/30. Use 'binwidth = x'\n## to adjust this.\n<\/code><\/pre>\n<p><span style=\"color: #444444;\">Notice the warnings about the default binwidth, which are always reported unless you specify it yourself. 
I will omit the warnings from the output of all plots that follow to save space.<\/span><\/p>\n<h3 style=\"color: #222222;\">&gt;&gt; Density plots<\/h3>\n<p><span style=\"color: #444444;\">We can do basic density plots as well. Note that the default smoothing kernel is Gaussian, and you can change it to a number of different options, including\u00a0<\/span><strong style=\"color: #444444;\">kernel=\"epanechnikov\"<\/strong><span style=\"color: #444444;\">\u00a0and\u00a0<\/span><strong style=\"color: #444444;\">kernel=\"rectangular\"<\/strong><span style=\"color: #444444;\">, among others. You can\u00a0<\/span><a style=\"color: #205b87;\" href=\"http:\/\/docs.ggplot2.org\/current\/stat_density.html\" target=\"_blank\" rel=\"noopener\">find all of those options here<\/a><span style=\"color: #444444;\">.<\/span><\/p>\n<pre style=\"color: #444444;\"><code>#basic density\np1&lt;-ggplot(xy, aes(xvar)) + geom_density()\n\n#histogram with density line overlaid\np2&lt;-ggplot(xy, aes(x=xvar)) + \n  geom_histogram(aes(y = ..density..), color=\"black\", fill=NA) +\n  geom_density(color=\"blue\")\n\n#split and color by third variable, alpha fades the color a bit\np3&lt;-ggplot(xy, aes(xvar, fill = zvar)) + geom_density(alpha = 0.2)\n\ngrid.arrange(p1, p2, p3, nrow=1)\n<\/code><\/pre>\n<p>&nbsp;<\/p>\n<h3 style=\"color: #222222;\">&gt;&gt; Boxplots and more<\/h3>\n<p><span style=\"color: #444444;\">We can also look at other ways to visualize our distributions. Boxplots are probably the most useful for describing the summary statistics of a distribution, but sometimes other visualizations are nice. 
I show a jitter plot and a volcano plot (a mirrored density plot, similar to a violin plot).\u00a0<\/span><a style=\"color: #205b87;\" href=\"http:\/\/docs.ggplot2.org\/0.9.3.1\/geom_boxplot.html\" target=\"_blank\" rel=\"noopener\">More on boxplots here.<\/a><span style=\"color: #444444;\">\u00a0Note that I removed the legend from each one because it is redundant.<\/span><\/p>\n<pre style=\"color: #444444;\"><code>#boxplot\nb1&lt;-ggplot(xy, aes(zvar, xvar)) + \n  geom_boxplot(aes(fill = zvar)) +\n  theme(legend.position = \"none\")\n\n#jitter plot\nb2&lt;-ggplot(xy, aes(zvar, xvar)) + \n  geom_jitter(alpha=I(1\/4), aes(color=zvar)) +\n  theme(legend.position = \"none\")\n\n#volcano plot: the density mirrored around zero, one panel per group\nb3&lt;-ggplot(xy, aes(x = xvar)) +\n  stat_density(aes(ymax = ..density..,  ymin = -..density..,\n               fill = zvar, color = zvar),\n               geom = \"ribbon\", position = \"identity\") +\n  facet_grid(. ~ zvar) +\n  coord_flip() +\n  theme(legend.position = \"none\")\n\ngrid.arrange(b1, b2, b3, nrow=1)\n<\/code><\/pre>\n<p>&nbsp;<\/p>\n<h3 style=\"color: #222222;\">&gt;&gt; Putting multiple plots together<\/h3>\n<p><span style=\"color: #444444;\">Finally, it&#8217;s nice to put different plots together to get a real sense of the data. We can make a scatterplot of the data and add marginal density plots to each side. I adapted most of the code below from this\u00a0<\/span><a style=\"color: #205b87;\" href=\"http:\/\/stackoverflow.com\/questions\/8545035\/scatterplot-with-marginal-histograms-in-ggplot2\" target=\"_blank\" rel=\"noopener\">StackOverflow page<\/a><span style=\"color: #444444;\">. One way to do this is to add distribution information to a scatterplot as a \u201crug plot\u201d. 
It adds a little tick mark for every point in your data projected onto the axis.<\/span><\/p>\n<pre style=\"color: #444444;\"><code>#rug plot\nggplot(xy,aes(xvar,yvar))  + geom_point() + geom_rug(col=\"darkred\",alpha=.1)\n<\/code><\/pre>\n<p><span style=\"color: #444444;\">\u00a0Another way to do this is to add histograms or density plots or boxplots to the sides of a scatterplot. I followed the stackoverflow page, but let me know if you have suggestions on a better way to do this, especially without the use of the empty plot as a place-holder. I do the density plots by the zvar variable to highlight the differences in the two groups.<\/span><\/p>\n<pre style=\"color: #444444;\"><code>#placeholder plot - prints nothing at all\nempty &lt;- ggplot()+geom_point(aes(1,1), colour=\"white\") +\n     theme(                              \n       plot.background = element_blank(), \n       panel.grid.major = element_blank(), \n       panel.grid.minor = element_blank(), \n       panel.border = element_blank(), \n       panel.background = element_blank(),\n       axis.title.x = element_blank(),\n       axis.title.y = element_blank(),\n       axis.text.x = element_blank(),\n       axis.text.y = element_blank(),\n       axis.ticks = element_blank()\n     )\n\n#scatterplot of x and y variables\nscatter &lt;- ggplot(xy,aes(xvar, yvar)) + \n  geom_point(aes(color=zvar)) + \n  scale_color_manual(values = c(\"orange\", \"purple\")) + \n  theme(legend.position=c(1,1),legend.justification=c(1,1)) \n\n#marginal density of x - plot on top\nplot_top &lt;- ggplot(xy, aes(xvar, fill=zvar)) + \n  geom_density(alpha=.5) + \n  scale_fill_manual(values = c(\"orange\", \"purple\")) + \n  theme(legend.position = \"none\")\n\n#marginal density of y - plot on the right\nplot_right &lt;- ggplot(xy, aes(yvar, fill=zvar)) + \n  geom_density(alpha=.5) + \n  coord_flip() + \n  scale_fill_manual(values = c(\"orange\", \"purple\")) + \n  theme(legend.position = \"none\") \n\n#arrange the plots together, 
with appropriate height and width for each row and column\ngrid.arrange(plot_top, empty, scatter, plot_right, ncol=2, nrow=2, widths=c(4, 1), heights=c(1, 4))\n<\/code><\/pre>\n<p><span style=\"color: #444444;\">It&#8217;s really nice that grid.arrange() places the plots next to each other, and because the marginal plots use the same variables as the scatterplot, their scales match. You could get rid of the redundant axis labels by adding\u00a0<\/span><strong style=\"color: #444444;\">theme(axis.title.x = element_blank())<\/strong><span style=\"color: #444444;\">\u00a0to the density plot code. I think it comes out looking very nice, without a ton of effort. You could also add linear regression lines and confidence intervals to the scatterplot. Check out my first\u00a0<\/span><a style=\"color: #205b87;\" href=\"http:\/\/rforpublichealth.blogspot.com\/2013\/11\/ggplot2-cheatsheet-for-scatterplots.html\" target=\"_blank\" rel=\"noopener\">ggplot2 cheatsheet for scatterplots<\/a><span style=\"color: #444444;\">\u00a0if you need a refresher.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the third and last of the ggplot series, this post will go over interesting ways to visualize the distribution of your data. 
I will&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-657","post","type-post","status-publish","format-standard","hentry","category-r"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/657","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=657"}],"version-history":[{"count":0,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/657\/revisions"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media?parent=657"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=657"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=657"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}