{"id":166,"date":"2013-04-05T12:07:45","date_gmt":"2013-04-05T17:07:45","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=166"},"modified":"2013-04-05T12:07:45","modified_gmt":"2013-04-05T17:07:45","slug":"what-apply-does","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2013\/04\/05\/what-apply-does\/","title":{"rendered":"What &#8220;Apply&#8221; does"},"content":{"rendered":"<h3>Lapply and sapply: avoiding loops on lists and data frames<\/h3>\n<h3>Tapply: avoiding loops when applying a function to subsets<\/h3>\n<p>&#8220;Apply&#8221; functions keep you from having to write loops to perform some operation on every row or every column of a\u00a0<a href=\"http:\/\/faculty.nps.edu\/sebuttre\/home\/R\/matrices.html#Matrices\">matrix<\/a>\u00a0or\u00a0<a href=\"http:\/\/faculty.nps.edu\/sebuttre\/home\/R\/matrices.html#DataFrames\">data frame<\/a>, or on every element in a\u00a0<a href=\"http:\/\/faculty.nps.edu\/sebuttre\/home\/R\/lists.html\">list<\/a>. For example, the built-in data set\u00a0<tt>state.x77<\/tt>\u00a0contains eight columns of data describing the 50 U.S. states in 1977. If you wanted the average of each of the eight columns, you could do this:<\/p>\n<pre>&gt; avgs &lt;- numeric (8)\n&gt; for (i in 1:8)\n+     avgs[i] &lt;- mean (state.x77[,i])   # The \"+\" is R's continuation character; don't type it\n&gt; avgs\n[1]  4246.4200  4435.8000     1.1700    70.8786     7.3780    53.1080   104.4600 70735.8800<\/pre>\n<p>This is verbose and, in large datasets, can be comparatively slow: explicit loops in interpreted R code carry overhead. A more compact way to do this is to use the\u00a0<tt>apply()<\/tt>\u00a0function. 
In this example,\u00a0<tt>apply<\/tt>\u00a0extracts each column as a vector, one at a time, and passes it to the\u00a0<tt>median()<\/tt>\u00a0function.<\/p>\n<pre>&gt; apply (state.x77, 2, median)\n Population Income Illiteracy Life Exp Murder HS Grad Frost  Area \n     2838.5   4519       0.95   70.675   6.85   53.25 114.5 54277<\/pre>\n<p>The 2 means &#8220;go by column&#8221; &#8212; a 1 would have meant &#8220;go by row.&#8221; Of course, if we had used a 1, we would have computed 50 medians, one for each row. If we had had a\u00a0<a href=\"http:\/\/faculty.nps.edu\/sebuttre\/home\/R\/matrices.html#HigherDimArrays\">three-dimensional array<\/a>\u00a0we could have used a 3 there. The third argument specifies the function to be applied to each column. We can use any function that makes sense there. We can use our own function or even pass in a function that we write on the spot. If your function returns a vector of constant length, R will stick the vectors together into a matrix. However, if your function returns vectors of different lengths, R will have to create a list (see more details\u00a0<a href=\"http:\/\/faculty.nps.edu\/sebuttre\/home\/R\/apply.html#ReturnList\">below<\/a>).<\/p>\n<p>The special cases of mean and sum have been taken care of already with the built-in\u00a0<tt>colMeans<\/tt>,\u00a0<tt>colSums<\/tt>,\u00a0<tt>rowMeans<\/tt>, and\u00a0<tt>rowSums<\/tt>\u00a0functions. These are highly efficient and worth using.<\/p>\n<p>In this example, we construct a function &#8220;on the fly&#8221; and pass it to apply. 
This particular function computes the median and maximum of each column of\u00a0<tt>state.x77<\/tt>.<\/p>\n<pre>&gt; apply (state.x77, 2, function(x) c(median (x), max(x)))\n     Population Income Illiteracy Life Exp Murder HS Grad Frost   Area \n[1,]     2838.5   4519       0.95   70.675   6.85   53.25 114.5  54277\n[2,]    21198.0   6315       2.80   73.600  15.10   67.30 188.0 566432<\/pre>\n<p>If you pass additional arguments to apply, those arguments get passed down to the function you&#8217;re having apply call. So if you wanted to calculate the mean of each column after trimming the highest and lowest 10%, you could do this:<\/p>\n<pre>&gt; apply (state.x77, 2, mean, trim=.1)\n Population      Income  Illiteracy    Life Exp      Murder     HS Grad       Frost        Area \n 3384.27500  4430.07500     1.09750    70.91775     7.29750    53.33750   106.80000 56575.72500<\/pre>\n<p>This is particularly handy for passing the\u00a0<tt>na.rm=TRUE<\/tt>\u00a0argument to functions like\u00a0<tt>max<\/tt>.<\/p>\n<h4>Does\u00a0<tt>apply()<\/tt>\u00a0loop?<\/h4>\n<p>Yes. Clearly\u00a0<em>something<\/em>\u00a0has to loop. Functions such as\u00a0<a href=\"http:\/\/faculty.nps.edu\/sebuttre\/home\/R\/apply.html#lapply\"><tt>lapply<\/tt><\/a>\u00a0and\u00a0<tt>colMeans<\/tt>\u00a0are fast because their looping is done in compiled code (C or Fortran), not in R&#8217;s own interpreted code. The difference can be the difference between finishing and crashing.\u00a0<b>Note:<\/b>\u00a0After writing this I got curious about the extent to which\u00a0<tt>apply()<\/tt>\u00a0increases speed. I used commands like this:<\/p>\n<pre>&gt; system.time (for (j in 1:20000) colMeans (state.x77))\n&gt; system.time (for (j in 1:20000) apply (state.x77, 2, mean))\n&gt; system.time (for (j in 1:20000) for (i in 1:8) mean (state.x77[,i]))<\/pre>\n<p>expecting the last one to be reported as the slowest. 
Actually, though, the middle one was. The likely story is that in R\u00a0<tt>apply()<\/tt>\u00a0is itself written in R: it loops with an ordinary\u00a0<tt>for<\/tt>\u00a0loop and does extra work rearranging its input first, so it is a convenience rather than a guaranteed speedup.<a name=\"ReturnList\"><\/a><\/p>\n<h4>Sometimes you expect\u00a0<tt>apply()<\/tt>\u00a0to return a vector but you get a list<\/h4>\n<p>I include this topic because it has bedeviled me in the past. Suppose I have this matrix\u00a0<tt>a<\/tt>, and I want to find the smallest number in each row. This is easy:<\/p>\n<pre>&gt; a &lt;- matrix (c(5, 2, 7, 1, 2, 8, 4, 5, 6), 3, 3)\n&gt; a\n     [,1] [,2] [,3] \n[1,]    5    1    4\n[2,]    2    2    5\n[3,]    7    8    6\n&gt; apply (a, 1, min)\n[1] 1 2 6<\/pre>\n<p>So\u00a0<tt>apply()<\/tt>\u00a0works on each row, one at a time, to tell me the smallest number in each row. What if I want the\u00a0<em>index<\/em> of the smallest number in each row? That is, I want the answer to the question &#8220;in which column can the minimum value be found&#8221;? That sounds easy, too: we&#8217;ll use the\u00a0<tt>which()<\/tt>\u00a0function, which returns the indices within a vector for which the vector holds the value TRUE.<\/p>\n<pre>&gt; which (c(F, F, T, F, T, T, F))   # Example of \"which\" : where are the Trues?\n[1] 3 5 6\n#\n# For each row, find the column in which that row has its smallest value.\n# \n&gt; apply (a, 1, function(x) which(x == min(x)))\n[[1]]\n[1] 2\n\n[[2]]\n[1] 1 2\n\n[[3]]\n[1] 3<\/pre>\n<p>What has happened here is that there&#8217;s a tie in the second row. Our function returns a single value for rows 1 and 3, but two values for row 2; R can&#8217;t arrange results of different lengths into a vector or matrix, so\u00a0<tt>apply()<\/tt>\u00a0returns a list. 
The\u00a0<tt>[[1]]<\/tt>\u00a0tells us that the first element of the list has no name.<\/p>\n<p>If we needed to do this we might impose a rule like &#8220;if there&#8217;s a tie pick out the first one.&#8221;<\/p>\n<pre>&gt; apply (a, 1, function(x) which(x == min(x))[1])\n[1] 2 1 3<\/pre>\n<h2><a name=\"lapply\"><\/a>Lapply and sapply: avoiding loops on lists and data frames<\/h2>\n<p>The regular\u00a0<tt>apply()<\/tt>\u00a0function can be used on a data frame, because\u00a0<tt>apply()<\/tt>\u00a0first converts the data frame to a matrix. When you use it on the columns of a data frame, passing the number 2 for the second argument, it does what you expect. It will work on the rows of a data frame, too, but remember: apply extracts each row as a vector, one at a time. Every element of a vector must have the same kind of data, so unless every column of the data frame has the same kind of data, R will end up converting the elements of the row to a common format (like character).<\/p>\n<p>The\u00a0<tt>lapply()<\/tt>\u00a0function works on any list, not just a rectangular one. (The &#8220;l&#8221; in &#8220;lapply&#8221; stands for &#8220;list.&#8221;) In that way it&#8217;s more general than\u00a0<tt>apply()<\/tt>, although it does not work on the rows or columns of matrices or higher-dimensional arrays. You don&#8217;t need to specify the &#8220;direction&#8221; as you do with\u00a0<tt>apply()<\/tt>; just pass the function.\u00a0<b>However,\u00a0<tt>lapply()<\/tt>\u00a0always returns a list.<\/b>\u00a0Usually I want a vector, and that&#8217;s what\u00a0<tt>sapply()<\/tt>\u00a0tries to do. The &#8220;s&#8221; in &#8220;sapply&#8221; stands for &#8220;simplify.&#8221; Here&#8217;s an example using the built-in\u00a0<tt>barley<\/tt>\u00a0data frame. My question is, how many levels of each variable are there? 
We can count the number by seeing how many unique entries there are: so\u00a0<tt>length(unique(x))<\/tt>\u00a0will do the trick.<\/p>\n<pre>&gt; library (lattice)                                # Make this data available\n&gt; dim (barley)                                     # Barley has 120 rows\n[1] 120   4\n&gt; lapply (barley, function(x) length(unique(x)))   # returns a list\n$yield\n[1] 114\n\n$variety\n[1] 10\n\n$year\n[1] 2\n\n$site\n[1] 6\n\n&gt; sapply (barley, function(x) length(unique(x)))   # Simplifies output to a vector\n yield variety year site \n   114      10    2    6\n&gt; apply (barley, 2, function(x) length(unique(x))) # Also works on data frames (but not non-data frame lists).\n yield variety year site \n   114      10    2    6<\/pre>\n<h2><a name=\"tapply\"><\/a>Tapply: avoiding loops when applying a function to subsets<\/h2>\n<p><tt>tapply()<\/tt>\u00a0is a very powerful function that lets you break a vector into pieces, and then apply some function to each of the pieces. (For you Excel users,\u00a0<tt>tapply()<\/tt>\u00a0produces things that correspond to Excel&#8217;s pivot tables.) It&#8217;s sort of like\u00a0<tt>sapply()<\/tt>, except that with\u00a0<tt>sapply()<\/tt>\u00a0the pieces are always elements of a list. With\u00a0<tt>tapply()<\/tt>\u00a0you get to specify how the breakdown is done. For example, suppose I want to find the average yield for each site in the barley data from the last example.<\/p>\n<pre>&gt; tapply (barley$yield, barley$site, mean)\n Grand Rapids   Duluth University Farm Morris Crookston   Waseca \n     24.93167 27.99667        32.66667   35.4     37.42 48.10833<\/pre>\n<p><tt>tapply()<\/tt>\u00a0returns a vector with one element for each unique value of\u00a0<tt>barley$site<\/tt>. 
The element for Grand Rapids, for example, gives the average of all the elements of\u00a0<tt>barley$yield<\/tt>\u00a0for which\u00a0<tt>barley$site == \"Grand Rapids\"<\/tt>. I have found\u00a0<tt>tapply()<\/tt>\u00a0to be incredibly useful. If you want to cross-tabulate by more than one variable, construct a list of your tabulating variables and pass that to\u00a0<tt>tapply()<\/tt>. Here we break yields down by year and site.<\/p>\n<pre>&gt; tapply (barley$yield, list (barley$year, barley$site), mean)\n     Grand Rapids   Duluth University Farm   Morris Crookston   Waseca \n1932     20.81000 25.70000        29.50667 41.51333     31.18 41.87000\n1931     29.05334 30.29333        35.82667 29.28667     43.66 54.34667<\/pre>\n<p>We&#8217;ve learned something: 1931 was a much better year, except in Morris. (There&#8217;s some suspicion that Morris was in fact incorrectly recorded in this well-known data set.) 1932 appears before 1931 in the table because that&#8217;s how the levels of &#8220;year&#8221; are set up in the data set. (If this bothers you see\u00a0<a href=\"http:\/\/faculty.nps.edu\/sebuttre\/home\/R\/factors.html#Reordering\">Reordering the levels of a factor<\/a>.) Years appear in the rows because they came first in the list. 
Of course a three- or higher-way table can be made in this way as well.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Lapply and sapply: avoiding loops on lists and data frames Tapply: avoiding loops when applying a function to subsets &#8220;Apply&#8221; functions keep you from having&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-166","post","type-post","status-publish","format-standard","hentry","category-r"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/166","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=166"}],"version-history":[{"count":0,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/166\/revisions"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media?parent=166"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=166"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=166"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}