{"id":809,"date":"2015-02-27T14:15:17","date_gmt":"2015-02-27T21:15:17","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=809"},"modified":"2015-02-27T14:15:17","modified_gmt":"2015-02-27T21:15:17","slug":"pivot-tables-in-r","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2015\/02\/27\/pivot-tables-in-r\/","title":{"rendered":"Pivot tables in R"},"content":{"rendered":"<p>A common data-munging operation is to compute <a href=\"http:\/\/en.wikipedia.org\/wiki\/Cross_tab\" target=\"_blank\" rel=\"noopener\">cross tabulations<\/a> of measurements by categories. SQL Server and Excel have a nice feature called <em>pivot tables<\/em> for this purpose. Here we&#8217;ll figure out how to do pivot operations in R.<\/p>\n<p>Let&#8217;s imagine an experiment where we&#8217;re measuring the gene activity of an organism under different conditions &#8212; exposure to different nutrients and toxins. Our conditions are silly: copper, beer, pizza, and cheetos. First we make a list of genes. Then <a href=\"http:\/\/stat.ethz.ch\/R-manual\/R-patched\/library\/base\/html\/expand.grid.html\" target=\"_blank\" rel=\"noopener\">expand.grid<\/a> generates all combinations of genes and conditions. Finally, we tack on a column of randomly generated measurements.<\/p>\n<pre name=\"code\" class=\"sql\">&gt; genes = paste('MMP', sprintf(\"%04d\",1:10), sep=\"\")\n&gt; data = expand.grid(gene=genes, condition=c('copper', 'cheetos', 'beer', 'pizza'))\n&gt; data$value = rnorm(40)\n&gt; data\n      gene condition       value\n1  MMP0001    copper  0.90412805\n2  MMP0002    copper  0.92664376\n3  MMP0003    copper  0.27772147\n4  MMP0004    copper  0.08958930\n5  MMP0005    copper -0.20132304\n6  MMP0006    copper  0.34524729\n7  MMP0007    copper -0.33910206\n8  MMP0008    copper  1.21006486\n9  MMP0009    copper  0.78008022\n10 MMP0010    copper  1.05364315\n11 MMP0001   cheetos -2.31796229\n12 MMP0002   cheetos  0.76706591\n13 MMP0003   cheetos -2.93692935\n14 MMP0004   cheetos  0.25452306\n15 MMP0005   cheetos  0.24168329\n16 MMP0006   cheetos  0.28739734\n17 MMP0007   cheetos  0.69233543\n18 MMP0008   cheetos  0.48865250\n19 MMP0009   cheetos -0.11129319\n20 MMP0010   cheetos  0.53322842\n21 MMP0001      beer -0.74965948\n22 MMP0002      beer  0.27105205\n23 MMP0003      beer -0.99261363\n24 MMP0004      beer  0.65143639\n25 MMP0005      beer -0.35589696\n26 MMP0006      beer  1.40147484\n27 MMP0007      beer  0.37492710\n28 MMP0008      beer  0.64453865\n29 MMP0009      beer  0.35925345\n30 MMP0010      beer  0.96394785\n31 MMP0001     pizza -1.91818504\n32 MMP0002     pizza  0.31690523\n33 MMP0003     pizza -1.20566043\n34 MMP0004     pizza -1.91750166\n35 MMP0005     pizza  1.98010023\n36 MMP0006     pizza  0.90468249\n37 MMP0007     pizza  0.04284970\n38 MMP0008     pizza -0.08141461\n39 MMP0009     pizza -0.72471771\n40 MMP0010     pizza -0.01085060\n<\/pre>\n<p>We want to pivot the conditions into columns so that we end up with one column for each condition and one row for each gene. <strong>The easy way<\/strong> is to use the <a href=\"http:\/\/had.co.nz\/reshape\/\" target=\"_blank\" rel=\"noopener\">reshape package<\/a> by <a href=\"http:\/\/had.co.nz\/\" target=\"_blank\" rel=\"noopener\">Hadley Wickham<\/a>, which is made for restructuring data and does this job nicely. If you don&#8217;t already have it, you&#8217;ll have to run<a href=\"http:\/\/stat.ethz.ch\/R-manual\/R-patched\/library\/utils\/html\/INSTALL.html\" target=\"_blank\" rel=\"noopener\">install.packages<\/a>, then load the library.<\/p>\n<pre name=\"code\" class=\"sql\">&gt; install.packages('reshape')\n&gt; library(reshape)\n<\/pre>\n<p>Using cast to move conditions into columns is a snap.<\/p>\n<pre name=\"code\" class=\"sql\">&gt; cast(data, gene ~ condition)\n      gene     copper    cheetos       beer       pizza\n1  MMP0001  0.9041281 -2.3179623 -0.7496595 -1.91818504\n2  MMP0002  0.9266438  0.7670659  0.2710521  0.31690523\n3  MMP0003  0.2777215 -2.9369294 -0.9926136 -1.20566043\n4  MMP0004  0.0895893  0.2545231  0.6514364 -1.91750166\n5  MMP0005 -0.2013230  0.2416833 -0.3558970  1.98010023\n6  MMP0006  0.3452473  0.2873973  1.4014748  0.90468249\n7  MMP0007 -0.3391021  0.6923354  0.3749271  0.04284970\n8  MMP0008  1.2100649  0.4886525  0.6445386 -0.08141461\n9  MMP0009  0.7800802 -0.1112932  0.3592535 -0.72471771\n10 MMP0010  1.0536432  0.5332284  0.9639479 -0.01085060\n<\/pre>\n<p>Done!<\/p>\n<p><strong>That was too easy<\/strong><\/p>\n<p>Just as an exercise, what would we have to do without reshape? And, just to keep ourselves honest, let&#8217;s make sure we can deal with missing data (as reshape can). Make some data go missing:<\/p>\n<pre name=\"code\" class=\"sql\">&gt; data.incomplete &lt;- data[data$value &gt; -1.0,]\n&gt; dim(data.incomplete)\n[1] 35  3\n<\/pre>\n<p>Now, <a href=\"http:\/\/stat.ethz.ch\/R-manual\/R-patched\/library\/base\/html\/split.html\" target=\"_blank\" rel=\"noopener\">split<\/a> the data frame up by condition. This produces a list where each element is a <a href=\"http:\/\/stat.ethz.ch\/R-manual\/R-patched\/library\/base\/html\/data.frame.html\" target=\"_blank\" rel=\"noopener\">data frame<\/a> containing a subset of the data for each condition. Notice that the cheetos data frame has values for 8 of the 10 genes.<\/p>\n<pre name=\"code\" class=\"sql\">&gt; data.by.condition &lt;- split(data.incomplete, data.incomplete$condition)\n&gt; typeof(data.by.condition)\n[1] \"list\"\n&gt; names(data.by.condition)\n[1] \"copper\"  \"cheetos\" \"beer\"    \"pizza\"  \n&gt; data.by.condition$cheetos\n      gene condition      value\n12 MMP0002   cheetos  0.7670659\n14 MMP0004   cheetos  0.2545231\n15 MMP0005   cheetos  0.2416833\n16 MMP0006   cheetos  0.2873973\n17 MMP0007   cheetos  0.6923354\n18 MMP0008   cheetos  0.4886525\n19 MMP0009   cheetos -0.1112932\n20 MMP0010   cheetos  0.5332284\n<\/pre>\n<p>We&#8217;re going to recombine the data into a data frame with one row for each gene, so let&#8217;s get that started:<\/p>\n<pre name=\"code\" class=\"sql\">&gt; result = data.frame(gene=genes)\n<\/pre>\n<p>Now comes some executable line noise. We&#8217;re going to loop through the list and add a column to the result data frame during each iteration of the loop. We pull the column out of the data frame in the list, but we have to make sure the column has an element for each gene. Merging with the all parameter set is like an <b>outer join<\/b>. We get a row for each gene, inserting NA&#8217;s where there data is missing.<\/p>\n<pre name=\"code\" class=\"sql\">&gt; for (i in seq(along=data.by.condition)) { result[[names(data.by.condition)[i]]] &lt;- merge(data.by.condition[[i]], genes, by.x='gene', by.y=1, all=T)$value }\n\n&gt; result\n      gene     copper    cheetos       beer       pizza\n1  MMP0001  0.9041281         NA -0.7496595          NA\n2  MMP0002  0.9266438  0.7670659  0.2710521  0.31690523\n3  MMP0003  0.2777215         NA -0.9926136          NA\n4  MMP0004  0.0895893  0.2545231  0.6514364          NA\n5  MMP0005 -0.2013230  0.2416833 -0.3558970  1.98010023\n6  MMP0006  0.3452473  0.2873973  1.4014748  0.90468249\n7  MMP0007 -0.3391021  0.6923354  0.3749271  0.04284970\n8  MMP0008  1.2100649  0.4886525  0.6445386 -0.08141461\n9  MMP0009  0.7800802 -0.1112932  0.3592535 -0.72471771\n10 MMP0010  1.0536432  0.5332284  0.9639479 -0.01085060\n<\/pre>\n<p>Extra finesse points if you can figure out how to do that last step with <a href=\"http:\/\/stat.ethz.ch\/R-manual\/R-patched\/library\/base\/html\/funprog.html\" target=\"_blank\" rel=\"noopener\">Reduce<\/a>instead of a loop.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A common data-munging operation is to compute cross tabulations of measurements by categories. SQL Server and Excel have a nice feature called pivot tables for&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-809","post","type-post","status-publish","format-standard","hentry","category-r"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/809","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=809"}],"version-history":[{"count":0,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/809\/revisions"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media?parent=809"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=809"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=809"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}