{"id":551,"date":"2014-02-28T13:21:07","date_gmt":"2014-02-28T18:21:07","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=551"},"modified":"2014-02-28T13:21:07","modified_gmt":"2014-02-28T18:21:07","slug":"using-apply-sapply-lapply-in-r","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2014\/02\/28\/using-apply-sapply-lapply-in-r\/","title":{"rendered":"Using apply, sapply, lapply in R"},"content":{"rendered":"<div>This is an introductory post about using apply, sapply and lapply, best suited for people relatively new to R or unfamiliar with these functions. There is a part 2 coming that will look at\u00a0<a href=\"http:\/\/petewerner.blogspot.com\/2012\/12\/density-plot-with-ggplot.html\" target=\"_blank\" rel=\"noopener\">density plots with ggplot<\/a>, but first\u00a0I thought I would go on a tangent to give some examples of the apply family, as they come up a lot when working with R.<\/div>\n<div><\/div>\n<div>I have been comparing three methods on a data set. A sample from the data set was generated, and three different methods were applied to that subset. I wanted to see how their results differed from one another.<\/div>\n<div><\/div>\n<div>I would run my test harness, which returned a matrix. The column values were the metric used to evaluate each method, and each row held the results for a given subset. 
We have three columns, one for each method, and let&#8217;s say 30 rows, representing 30 different subsets that the three methods were applied to.<\/div>\n<div><\/div>\n<div>It looked a bit like this:<\/div>\n<div><\/div>\n<div>\u00a0 \u00a0 \u00a0 \u00a0 method1 \u00a0method2 \u00a0 \u00a0method3<\/div>\n<div>[1,] 0.05517714 0.014054038 0.017260447<\/div>\n<div>[2,] 0.08367678 0.003570883 0.004289079<\/div>\n<div>[3,] 0.05274706 0.028629661 0.071323030<\/div>\n<div>[4,] 0.06769936 0.048446559 0.057432519<\/div>\n<div>[5,] 0.06875188 0.019782518 0.080564474<\/div>\n<div>[6,] 0.04913779 0.100062929 0.102208706<\/div>\n<div><\/div>\n<div>We can simulate this data using\u00a0rnorm, to create three sets of observations. The first has a mean of 0, the second a mean of 2 and the third a mean of 5, each with 30 observations (one per row).<\/div>\n<div><\/div>\n<div>m &lt;- matrix(data=cbind(rnorm(30, 0), rnorm(30, 2), rnorm(30, 5)), nrow=30, ncol=3)<\/div>\n<div><\/div>\n<h3><span style=\"text-decoration: underline;\">Apply<\/span><\/h3>\n<div><\/div>\n<div>When do we use apply? When we have some\u00a0structured blob of data that we wish to perform operations on. Here structured means in some form of matrix.\u00a0The operations may be informational, or they may transform or subset the data.<\/p>\n<p>As a commenter pointed out, if you are using a data frame the data types must all be the same, otherwise they will be subjected to type conversion. 
This may or may not be what you want: if the data frame has string\/character data as well as numeric data, the numeric data will be converted to strings\/characters and numerical operations will probably not give what you expected.<\/p><\/div>\n<div><\/div>\n<div><\/div>\n<div>Needless to say such circumstances arise quite frequently when working in R, so spending some time getting familiar with\u00a0apply\u00a0can be a great boon to our productivity.<\/div>\n<div><\/div>\n<div>Which actual apply function and which specific incantation is required depends on your data, the function you wish to use, and what you want the end result to look like. Hopefully the right choice should be a bit clearer by the end of these examples.<\/div>\n<div><\/div>\n<div>First I want to make sure I created that matrix correctly, with three columns that have means of 0, 2 and 5 respectively. We can use\u00a0apply\u00a0and the base\u00a0mean\u00a0function to check this.<\/div>\n<div><\/div>\n<div>We tell\u00a0apply\u00a0to traverse row wise or column wise by the second argument. In this case we expect to get three numbers at the end, the mean value for each column, so tell\u00a0apply\u00a0to work along columns by passing 2 as the second argument. But let&#8217;s do it wrong for the point of illustration:<\/div>\n<div><\/div>\n<div>apply(m, 1, mean)<\/div>\n<div># [1] 2.408150 2.709325 1.718529 0.822519 2.693614 2.259044 1.849530 2.544685 2.957950 2.219874<\/div>\n<div>#[11] 2.582011 2.471938 2.015625 2.101832 2.189781 2.319142 2.504821 2.203066 2.280550 2.401297<\/div>\n<div>#[21] 2.312254 1.833903 1.900122 2.427002 2.426869 1.890895 2.515842 2.363085 3.049760 2.027570<\/div>\n<div><\/div>\n<div>Passing a 1 in the second argument, we get 30 values back, giving the mean of each row. Not the three numbers we were expecting, so let&#8217;s try again.<\/div>\n<div><\/div>\n<div>apply(m, 2, mean)<\/div>\n<div>#[1] -0.02664418\u00a0 1.95812458\u00a0 4.86857792<\/div>\n<div><\/div>\n<div>Great. 
We can see the mean of each column is roughly 0, 2, and 5 as we expected.<\/div>\n<div><\/div>\n<h3><span style=\"text-decoration: underline;\">Our own functions<\/span><\/h3>\n<div><\/div>\n<div>Let&#8217;s say I see that negative number and realise I wanted to only look at positive values. Let&#8217;s see how many negative numbers each column has, using apply again:<\/div>\n<div><\/div>\n<div>apply(m, 2, function(x) length(x[x&lt;0]))<\/div>\n<div>#[1] 14\u00a0 1\u00a0 0<\/div>\n<div><\/div>\n<div>So 14 negative values in column one, 1 negative value in column two, and none in column three. More or less what we would expect for three normal distributions with the given means and an sd of 1.<\/div>\n<div><\/div>\n<div>Here we have used a simple function we defined in the call to\u00a0apply, rather than some built in function. Note we did not specify a return value for our function. R returns the value of the last evaluated expression. The actual function is using subsetting to extract all the elements in\u00a0x\u00a0that are less than 0, and then counting how many are left using\u00a0length.<\/div>\n<div><\/div>\n<div>The function takes one argument, which I have arbitrarily called\u00a0x. In this case\u00a0x\u00a0will be a single column of the matrix. Is it a 1 column matrix or just a vector? Let&#8217;s have a look:<\/div>\n<div><\/div>\n<div>apply(m, 2, function(x) is.matrix(x))<\/div>\n<div>#[1] FALSE FALSE FALSE<\/div>\n<div><\/div>\n<div>Not a matrix. Here the function definition is not required; we could instead just pass the\u00a0is.matrix\u00a0function, as it only takes one argument and has already been wrapped up in a function for us. Let&#8217;s check they are vectors as we might expect.<\/div>\n<div><\/div>\n<div>apply(m, 2, is.vector)<\/div>\n<div>#[1] TRUE TRUE TRUE<\/div>\n<div><\/div>\n<div><\/div>\n<div>Why then did we need to wrap up our length function? 
When we want to define our own handling function for apply, we must at a minimum give a name to the incoming data, so we can use it in our function.<\/div>\n<div><\/div>\n<div>apply(m, 2, length(x[x&lt;0]))<\/div>\n<div>#Error in match.fun(FUN) : object &#8216;x&#8217; not found<\/div>\n<div><\/div>\n<div>We are referring to some value\u00a0x\u00a0in the function, but R does not know where that is and so gives us an error. There are other forces at play here, but for simplicity just remember to wrap any code up in a function. For example, let&#8217;s look at the mean value of only the positive values:<\/div>\n<div><\/div>\n<div>apply(m, 2, function(x) mean(x[x&gt;0]))<\/div>\n<div>#[1] 0.4466368 2.0415736 4.8685779<\/div>\n<div><\/div>\n<h3><span style=\"text-decoration: underline;\">Using sapply and lapply<\/span><\/h3>\n<div><span style=\"text-decoration: underline;\">\u00a0<\/span><\/div>\n<div>These two functions work in a similar way, traversing over a set of data like a list or vector, and calling the specified function for each item.<\/div>\n<div><\/div>\n<div>Sometimes we require traversal of our data in a less than linear way. Say we wanted to compare the current observation with the value 5 periods before it. 
You can probably use\u00a0rollapply\u00a0for this (from the zoo package, which quantmod loads), but a quick and dirty way is to run\u00a0sapply\u00a0or\u00a0lapply\u00a0passing a set of index values.<\/div>\n<div><\/div>\n<div>Here we will use\u00a0sapply, which works on a list or vector of data.<\/div>\n<div><\/div>\n<div>sapply(1:3, function(x) x^2)<\/div>\n<div>#[1] 1 4 9<\/div>\n<div><\/div>\n<div>lapply\u00a0is very similar, however it will return a list rather than a vector:<\/div>\n<div><\/div>\n<div>lapply(1:3, function(x) x^2)<\/div>\n<div>#[[1]]<\/div>\n<div>#[1] 1<\/div>\n<div>#<\/div>\n<div>#[[2]]<\/div>\n<div>#[1] 4<\/div>\n<div>#<\/div>\n<div>#[[3]]<\/div>\n<div>#[1] 9<\/div>\n<div><\/div>\n<div>Passing\u00a0simplify=FALSE\u00a0to\u00a0sapply\u00a0will also give you a list:<\/div>\n<div><\/div>\n<div>sapply(1:3, function(x) x^2, simplify=FALSE)<\/div>\n<div>#[[1]]<\/div>\n<div>#[1] 1<\/div>\n<div>#<\/div>\n<div>#[[2]]<\/div>\n<div>#[1] 4<\/div>\n<div>#<\/div>\n<div>#[[3]]<\/div>\n<div>#[1] 9<\/div>\n<div><\/div>\n<div>And you can use\u00a0unlist\u00a0with\u00a0lapply\u00a0to get a vector.<\/div>\n<div><\/div>\n<div>unlist(lapply(1:3, function(x) x^2))<\/div>\n<div>#[1] 1 4 9<\/div>\n<div><\/div>\n<div>However the behaviour is not as clean when things have names, so it is best to use\u00a0sapply\u00a0or\u00a0lapply\u00a0as makes sense for your data and what you want to receive back. If you want a list returned, use\u00a0lapply. If you want a vector, use\u00a0sapply.<\/div>\n<div><\/div>\n<h3><span style=\"text-decoration: underline;\">Dirty Deeds<\/span><\/h3>\n<div><span style=\"text-decoration: underline;\">\u00a0<\/span><\/div>\n<div>Anyway, a cheap trick is to pass\u00a0sapply\u00a0a vector of indexes and write your function making some assumptions about the structure of the underlying data. 
Let&#8217;s look at our\u00a0mean\u00a0example again:<\/div>\n<div><\/div>\n<div>sapply(1:3, function(x) mean(m[,x]))<\/div>\n<div>#[1] -0.02664418\u00a0 1.95812458\u00a0 4.86857792<\/div>\n<div><\/div>\n<div>We pass the column indexes (1,2,3) to our function, which assumes some variable\u00a0m\u00a0has our data. Fine for quickies but not very nice, and will likely turn into a maintainability bomb down the line.<\/div>\n<div><\/div>\n<div>We can neaten things up a bit by passing our data in an argument to our function, and using the\u00a0\u2026\u00a0special argument which all the apply functions have for passing extra arguments:<\/div>\n<div><\/div>\n<div>sapply(1:3, function(x, y) mean(y[,x]), y=m)<\/div>\n<div>#[1] -0.02664418\u00a0 1.95812458\u00a0 4.86857792<\/div>\n<div><\/div>\n<div>This time, our function has 2 arguments,\u00a0x\u00a0and\u00a0y. The\u00a0x\u00a0variable will be as it was before, whatever\u00a0sapply\u00a0is currently going through. The\u00a0y\u00a0variable we will pass using the optional arguments to\u00a0sapply.<\/div>\n<div><\/div>\n<div>In this case we have passed in\u00a0m, explicitly naming the\u00a0y\u00a0argument in the\u00a0sapply\u00a0call. Not strictly necessary but it makes for code that is easier to read &amp; maintain. The\u00a0y\u00a0value will be the same for each call\u00a0sapply\u00a0makes to our function.<\/div>\n<div><\/div>\n<div>I don&#8217;t really recommend passing the index arguments like this, as it is error prone and can be quite confusing to others reading your code.<\/div>\n<div><\/div>\n<div>I hope you found these examples helpful. Please check out part 2 where we create a density plot of the values in our matrix.<\/div>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is an introductory post about using apply, sapply and lapply, best suited for people relatively new to R or unfamiliar with these functions. 
There&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-551","post","type-post","status-publish","format-standard","hentry","category-r"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/551","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=551"}],"version-history":[{"count":0,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/551\/revisions"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media?parent=551"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=551"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=551"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}