{"id":862,"date":"2015-07-30T14:52:11","date_gmt":"2015-07-30T21:52:11","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=862"},"modified":"2015-07-30T14:52:11","modified_gmt":"2015-07-30T21:52:11","slug":"mutiple-dataframes-within-lists-from-the-start","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2015\/07\/30\/mutiple-dataframes-within-lists-from-the-start\/","title":{"rendered":"Multiple Dataframes within Lists from the start"},"content":{"rendered":"<h2>Lists from the start<\/h2>\n<p>Again: Don&#8217;t ever create <code>d1<\/code>, <code>d2<\/code>, <code>d3<\/code> in the first place; just create a list <code>d<\/code> with 3 elements.<\/p>\n<h3>Reading multiple files into a list of data frames<\/h3>\n<p>This is done pretty easily when reading in files. Maybe you&#8217;ve got files <code>data1.csv, data2.csv, ...<\/code> in a directory. Your goal is a list of data.frames called <code>mydata<\/code>. The first thing you need is a vector with all the file names. 
You can construct this with <code>paste0<\/code> (e.g., <code>myfiles = paste0(\"data\", 1:5, \".csv\")<\/code>), but it&#8217;s probably easier to use <code>list.files<\/code> to grab all the appropriate files: <code>myfiles &lt;- list.files(pattern = \"\\\\.csv$\")<\/code>. Note that <code>pattern<\/code> is a regular expression, not a shell-style glob.<\/p>\n<p>At this point, most R beginners will use a <code>for<\/code> loop, and there&#8217;s nothing wrong with that; it works.<\/p>\n<pre class=\"lang-r prettyprint prettyprinted\"><code><span class=\"pln\">mydata <\/span><span class=\"pun\">&lt;-<\/span><span class=\"pln\"> list<\/span><span class=\"pun\">()<\/span>\n<span class=\"kwd\">for<\/span> <span class=\"pun\">(<\/span><span class=\"pln\">i <\/span><span class=\"kwd\">in<\/span><span class=\"pln\"> seq_along<\/span><span class=\"pun\">(<\/span><span class=\"pln\">myfiles<\/span><span class=\"pun\">))<\/span> <span class=\"pun\">{<\/span><span class=\"pln\">\n    mydata<\/span><span class=\"pun\">[[<\/span><span class=\"pln\">i<\/span><span class=\"pun\">]]<\/span> <span class=\"pun\">&lt;-<\/span><span class=\"pln\"> read.csv<\/span><span class=\"pun\">(<\/span><span class=\"pln\">file <\/span><span class=\"pun\">=<\/span><span class=\"pln\"> myfiles<\/span><span class=\"pun\">[<\/span><span class=\"pln\">i<\/span><span class=\"pun\">])<\/span>\n<span class=\"pun\">}<\/span><\/code><\/pre>\n<p>A more R-native way to do it is with <code>lapply<\/code>:<\/p>\n<pre class=\"lang-r prettyprint prettyprinted\"><code><span class=\"pln\">mydata <\/span><span class=\"pun\">&lt;-<\/span><span class=\"pln\"> lapply<\/span><span class=\"pun\">(<\/span><span class=\"pln\">myfiles<\/span><span class=\"pun\">,<\/span><span class=\"pln\"> read.csv<\/span><span class=\"pun\">)<\/span><\/code><\/pre>\n<p>Either way, it&#8217;s handy to name the list elements to match the files:<\/p>\n<pre class=\"lang-r prettyprint prettyprinted\"><code><span class=\"pln\">names<\/span><span class=\"pun\">(<\/span><span class=\"pln\">mydata<\/span><span class=\"pun\">)<\/span> <span 
class=\"pun\">&lt;-<\/span><span class=\"pln\"> gsub<\/span><span class=\"pun\">(<\/span><span class=\"str\">\"\\\\.csv$\"<\/span><span class=\"pun\">,<\/span> <span class=\"str\">\"\"<\/span><span class=\"pun\">,<\/span><span class=\"pln\"> myfiles<\/span><span class=\"pun\">)<\/span>\n<span class=\"com\"># or, if you prefer the consistent syntax of stringr<\/span><span class=\"pln\">\nnames<\/span><span class=\"pun\">(<\/span><span class=\"pln\">mydata<\/span><span class=\"pun\">)<\/span> <span class=\"pun\">&lt;-<\/span><span class=\"pln\"> stringr<\/span><span class=\"pun\">::<\/span><span class=\"pln\">str_replace<\/span><span class=\"pun\">(<\/span><span class=\"pln\">myfiles<\/span><span class=\"pun\">,<\/span><span class=\"pln\"> pattern <\/span><span class=\"pun\">=<\/span> <span class=\"str\">\"\\\\.csv$\"<\/span><span class=\"pun\">,<\/span><span class=\"pln\"> replacement <\/span><span class=\"pun\">=<\/span> <span class=\"str\">\"\"<\/span><span class=\"pun\">)<\/span><\/code><\/pre>\n<h3>Splitting a data frame into a list of data frames<\/h3>\n<p>This is super easy: the base function <code>split()<\/code> does it for you. You can split by a column (or columns) of the data, or by anything else you want:<\/p>\n<pre class=\"lang-r prettyprint prettyprinted\"><code><span class=\"pln\">mt_list <\/span><span class=\"pun\">=<\/span><span class=\"pln\"> split<\/span><span class=\"pun\">(<\/span><span class=\"pln\">mtcars<\/span><span class=\"pun\">,<\/span><span class=\"pln\"> f <\/span><span class=\"pun\">=<\/span><span class=\"pln\"> mtcars<\/span><span class=\"pun\">$<\/span><span class=\"pln\">cyl<\/span><span class=\"pun\">)<\/span>\n<span class=\"com\"># This gives a list of three data frames, one for each value of cyl<\/span><\/code><\/pre>\n<p>This is also a nice way to break a data frame into pieces for cross-validation. 
Maybe you want to split <code>mtcars<\/code> into training, test, and validation pieces.<\/p>\n<pre class=\"lang-r prettyprint prettyprinted\"><code><span class=\"pln\">groups <\/span><span class=\"pun\">=<\/span><span class=\"pln\"> sample<\/span><span class=\"pun\">(<\/span><span class=\"pln\">c<\/span><span class=\"pun\">(<\/span><span class=\"str\">\"train\"<\/span><span class=\"pun\">,<\/span> <span class=\"str\">\"test\"<\/span><span class=\"pun\">,<\/span> <span class=\"str\">\"validate\"<\/span><span class=\"pun\">),<\/span><span class=\"pln\">\n                size <\/span><span class=\"pun\">=<\/span><span class=\"pln\"> nrow<\/span><span class=\"pun\">(<\/span><span class=\"pln\">mtcars<\/span><span class=\"pun\">),<\/span><span class=\"pln\"> replace <\/span><span class=\"pun\">=<\/span> <span class=\"lit\">TRUE<\/span><span class=\"pun\">)<\/span><span class=\"pln\">\nmtsplit <\/span><span class=\"pun\">=<\/span><span class=\"pln\"> split<\/span><span class=\"pun\">(<\/span><span class=\"pln\">mtcars<\/span><span class=\"pun\">,<\/span><span class=\"pln\"> f <\/span><span class=\"pun\">=<\/span><span class=\"pln\"> groups<\/span><span class=\"pun\">)<\/span>\n<span class=\"com\"># and mtsplit has appropriate names already!<\/span><\/code><\/pre>\n<h3>Simulating a list of data frames<\/h3>\n<p>Maybe you&#8217;re simulating data, something like this:<\/p>\n<pre class=\"lang-r prettyprint prettyprinted\"><code><span class=\"pln\">my.sim.data <\/span><span class=\"pun\">=<\/span><span class=\"pln\"> data.frame<\/span><span class=\"pun\">(<\/span><span class=\"pln\">x <\/span><span class=\"pun\">=<\/span><span class=\"pln\"> rnorm<\/span><span class=\"pun\">(<\/span><span class=\"lit\">50<\/span><span class=\"pun\">),<\/span><span class=\"pln\"> y <\/span><span class=\"pun\">=<\/span><span class=\"pln\"> rnorm<\/span><span class=\"pun\">(<\/span><span class=\"lit\">50<\/span><span class=\"pun\">))<\/span><\/code><\/pre>\n<p>But who does only one simulation? 
You want to do this 100 times, 1000 times, more! But you <em>don&#8217;t<\/em> want 10,000 data frames in your workspace. Use <code>replicate<\/code> and put them in a list:<\/p>\n<pre class=\"lang-r prettyprint prettyprinted\"><code><span class=\"pln\">sim_list <\/span><span class=\"pun\">=<\/span><span class=\"pln\"> replicate<\/span><span class=\"pun\">(<\/span><span class=\"pln\">n <\/span><span class=\"pun\">=<\/span> <span class=\"lit\">10<\/span><span class=\"pun\">,<\/span><span class=\"pln\">\n                     expr <\/span><span class=\"pun\">=<\/span> <span class=\"pun\">{<\/span><span class=\"pln\">data.frame<\/span><span class=\"pun\">(<\/span><span class=\"pln\">x <\/span><span class=\"pun\">=<\/span><span class=\"pln\"> rnorm<\/span><span class=\"pun\">(<\/span><span class=\"lit\">50<\/span><span class=\"pun\">),<\/span><span class=\"pln\"> y <\/span><span class=\"pun\">=<\/span><span class=\"pln\"> rnorm<\/span><span class=\"pun\">(<\/span><span class=\"lit\">50<\/span><span class=\"pun\">))},<\/span><span class=\"pln\">\n                     simplify <\/span><span class=\"pun\">=<\/span> <span class=\"lit\">FALSE<\/span><span class=\"pun\">)<\/span><\/code><\/pre>\n<p>In this case especially, you should also consider whether you really need separate data frames, or whether a single data frame with a &#8220;group&#8221; column would work just as well. Using <code>data.table<\/code>, <code>dplyr<\/code>, or <code>plyr<\/code>, it&#8217;s quite easy to do things &#8220;by group&#8221; to a data frame.<\/p>\n<h3>I didn&#8217;t put my data in a list \ud83d\ude41 I will next time, but what can I do now?<\/h3>\n<p>If you have data frames named in a pattern, e.g., <code>df1<\/code>, <code>df2<\/code>, <code>df3<\/code>, and you want them in a list, you can get them if you can write a regular expression to match the names. 
Something like:<\/p>\n<pre class=\"lang-r prettyprint prettyprinted\"><code><span class=\"pln\">df_list <\/span><span class=\"pun\">=<\/span><span class=\"pln\"> lapply<\/span><span class=\"pun\">(<\/span><span class=\"pln\">ls<\/span><span class=\"pun\">(<\/span><span class=\"pln\">pattern <\/span><span class=\"pun\">=<\/span> <span class=\"str\">\"^df[0-9]+$\"<\/span><span class=\"pun\">),<\/span><span class=\"pln\"> get<\/span><span class=\"pun\">)<\/span><\/code><\/pre>\n<p>Start by double-checking just the <code>ls<\/code> part to make sure you&#8217;re getting the right variables. And next time, use lists from the start.<\/p>\n<h2>Why put the data in a list?<\/h2>\n<p>Put similar data in lists because you probably want to do similar things to each data.frame, and functions like <code>lapply<\/code>, <code>sapply<\/code>, <code>do.call<\/code>, and the <code>plyr<\/code> <code>l*ply<\/code> functions make it really easy to do that. Examples of people easily doing things with lists are all over Stack Overflow.<\/p>\n<p>A common task is combining them. If you want to stack them on top of each other, you could use <code>rbind<\/code> for a pair of them, but for a whole list use <code>do.call<\/code> with <code>rbind<\/code> or (for speed) <code>dplyr::bind_rows<\/code>. (Similarly, use <code>cbind<\/code> or <code>dplyr::bind_cols<\/code> for columns.) To merge (join) a list of data frames, you can see <a href=\"http:\/\/stackoverflow.com\/q\/8091303\/903061\">these answers<\/a>.<\/p>\n<p>Think of scalability. If you really only need three variables, it&#8217;s fine to use <code>d1<\/code>, <code>d2<\/code>, <code>d3<\/code>. But then if it turns out you really need 6, that&#8217;s a lot more typing. And next time, when you need 10 or 20, you find yourself copying and pasting lines of code, maybe using find\/replace to change <code>d14<\/code> to <code>d15<\/code>, and you&#8217;re thinking <em>this isn&#8217;t how programming should be<\/em>. 
If you use a list, the difference between 3 cases, 30 cases, and 300 cases is at most one line of code&#8212;no change at all if your number of cases is automatically detected by, e.g., how many <code>.csv<\/code> files are in your directory.<\/p>\n<p>Even if you use a lowly for loop, it&#8217;s much easier to loop over the elements of a list than it is to construct variable names with <code>paste<\/code> and access the objects with <code>get<\/code>.<\/p>\n<p>You can name the elements of a list, in case you want to use something other than numeric indices to access your data frames (and you can use both, this isn&#8217;t an XOR choice).<\/p>\n<p>Overall, using lists will lead you to write cleaner, easier-to-read code, which will result in fewer bugs and less confusion.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Lists from the start Again: Don&#8217;t ever create d1 d2 d3 in the first place, just create a list d with 3 elements. Reading multiple&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-862","post","type-post","status-publish","format-standard","hentry","category-r"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/862","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=862"}],"version-history":[{"count":0,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/862\/revisions"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media?parent=862"}],
"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=862"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=862"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}