{"id":327,"date":"2013-11-21T17:40:15","date_gmt":"2013-11-21T22:40:15","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=327"},"modified":"2013-11-21T17:40:15","modified_gmt":"2013-11-21T22:40:15","slug":"modifying-data-with-lookup-tables","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2013\/11\/21\/modifying-data-with-lookup-tables\/","title":{"rendered":"Modifying data with lookup tables"},"content":{"rendered":"<p>In many analyses, data is read from a file, but must be modified before it can be used. For example you may want to add a new column of data, or do a \u201cfind\u201d and \u201creplace\u201d on a site, treatment or species name. There are 3 ways one might add such information. The first involves editing the original data frame \u2013 although you should\u00a0<em>never<\/em>\u00a0do this, I suspect this method is quite common. A second \u2013 and widely used \u2013 approach for adding information is to modify the values using code in your script. The third \u2013 and nicest \u2013 way of adding information is to use a lookup table.<\/p>\n<p>One of the most common things we see in the code of researchers working with data are long slabs of code modifying a data frame based on some logical tests.Such code might correct, for example, a species name:<\/p>\n<div>\n<figure><figcaption><\/figcaption><div>\n<table>\n<tbody>\n<tr>\n<td>\n<pre>1\n2<\/pre>\n<\/td>\n<td>\n<pre><code>raw$species[raw$id==\"1\"] &lt;- \"Banksia oblongifolia\"\nraw$species[raw$id==\"2\"] &lt;- \"Banksia ericifolia\"\n<\/code><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/figure>\n<\/div>\n<p>or add some details to the data set, such as location, latitude, longitude and mean annual precipitation:<\/p>\n<div>\n<figure><figcaption><\/figcaption><div>\n<table>\n<tbody>\n<tr>\n<td>\n<pre>1\n2\n3\n4\n5<\/pre>\n<\/td>\n<td>\n<pre><code>raw$location[raw$id==\"1\"] &lt;-\"NSW\"\nraw$latitude[raw$id==\"1\"] &lt;- -37\nraw$longitude[raw$id==\"1\"] &lt;- 40\nraw$map[raw$id==\"1\"] &lt;- 1208\nraw$map[raw$id==\"1\"] &lt;- 1226\n<\/code><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/figure>\n<\/div>\n<p>In large analyses, this type of code may go for hundreds of lines.<\/p>\n<p><img decoding=\"async\" alt=\"\" src=\"http:\/\/nicercode.github.io\/images\/2013-07-09-modifying-data-with-lookup-tables\/messy_script.png\" \/><\/p>\n<p>Now before we go on, let me say that this approach to adding data is\u00a0<em>much<\/em>\u00a0better than editing your datafile directly, for the following two reasons:<\/p>\n<ol>\n<li>It maintains the integrity of your raw data file<\/li>\n<li>You can see where the new value came from (it was added in a script), and modify it later if needed.<\/li>\n<\/ol>\n<p>There is also nothing\u00a0<em>wrong<\/em>\u00a0with adding data this way. However, it is what we would consider\u00a0<em>messy<\/em>\u00a0code, for these reasons:<\/p>\n<ul>\n<li>Long chunks of code modifying data is inherently difficult to read.<\/li>\n<li>There\u2019s a lot of typing involved, so lot\u2019s of work, and thus opportunities for error.<\/li>\n<li>It\u2019s harder to change variable names when they are embedded in code all over the place.<\/li>\n<\/ul>\n<p>A far\u00a0<em>nicer<\/em>\u00a0way to add data to an existing data frame is to use a lookup table. Here is an example of such a table, achieving similar (but not identical) modifications to the code above:<\/p>\n<div>\n<figure><figcaption><\/figcaption><div>\n<table>\n<tbody>\n<tr>\n<td>\n<pre>1<\/pre>\n<\/td>\n<td>\n<pre><code>read.csv(\"dataNew.csv\")\n<\/code><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/figure>\n<\/div>\n<div>\n<pre><code>##   lookupVariable lookupValue newVariable              newValue\n## 1             id           1     species  Banksia oblongifolia\n## 2             id           2     species    Banksia ericifolia\n## 3             id           3     species       Banksia serrata\n## 4             id           4     species       Banksia grandis\n## 5                         NA      family            Proteaceae\n## 6                         NA    location                   NSW\n## 7             id           4    location                    WA\n##            source\n## 1  Daniel Falster\n## 2  Daniel Falster\n## 3  Daniel Falster\n## 4  Daniel Falster\n## 5  Daniel Falster\n## 6  Daniel Falster\n## 7  Daniel Falster<\/code><\/pre>\n<\/div>\n<p>The columns of this table are<\/p>\n<ul>\n<li><strong>lookupVariable<\/strong>\u00a0is the name of the variable in the parent data we want to match against. If left blank, change all rows.<\/li>\n<li><strong>lookupValue<\/strong>\u00a0is the value of lookupVariable to match against<\/li>\n<li><strong>newVariable<\/strong>\u00a0is the variable to be changed<\/li>\n<li><strong>newValue<\/strong>\u00a0is the value of\u00a0<code>newVariable<\/code>\u00a0for matched rows<\/li>\n<li><strong>source<\/strong>\u00a0includes any notes about where the data came from (e.g., who made the change)<\/li>\n<\/ul>\n<p>So the table documents the changes we want to make to our dataframe. The function\u00a0<a href=\"https:\/\/gist.github.com\/dfalster\/5589956\">addNewData.R<\/a>\u00a0takes the file name for this table as an argument and applies it to the data frame. For example let\u2019s assume we have a data frame called\u00a0<code>data<\/code><\/p>\n<div>\n<pre><code>myData<\/code><\/pre>\n<\/div>\n<div>\n<pre><code>##          x     y id\n## 1  0.93160 5.433  1\n## 2  0.24875 3.868  2\n## 3  0.92273 5.944  2\n## 4  0.85384 5.541  2\n## 5  0.30378 3.985  2\n## 6  0.41205 4.415  2\n## 7  0.35158 4.440  2\n## 8  0.13920 3.007  2\n## 9  0.16579 2.976  2\n## 10 0.66290 5.315  3\n## 11 0.25720 3.755  3\n## 12 0.88086 5.345  3\n## 13 0.11784 3.183  3\n## 14 0.01423 3.749  4\n## 15 0.23359 4.264  4\n## 16 0.33614 4.433  4\n## 17 0.52122 4.393  4\n## 18 0.11616 3.603  4\n## 19 0.90871 6.379  4\n## 20 0.75664 5.838  4<\/code><\/pre>\n<\/div>\n<p>and want to apply the table given above, we simply write<\/p>\n<div>\n<figure><figcaption><\/figcaption><div>\n<table>\n<tbody>\n<tr>\n<td>\n<pre>1\n2\n3<\/pre>\n<\/td>\n<td>\n<pre><code>source(\"addNewData.r\")\nallowedVars &lt;- c(\"species\", \"family\", \"location\")\naddNewData(\"dataNew.csv\", myData, allowedVars)\n<\/code><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/figure>\n<\/div>\n<div>\n<pre><code>##          x     y id              species     family location\n## 1  0.93160 5.433  1 Banksia oblongifolia Proteaceae      NSW\n## 2  0.24875 3.868  2   Banksia ericifolia Proteaceae      NSW\n## 3  0.92273 5.944  2   Banksia ericifolia Proteaceae      NSW\n## 4  0.85384 5.541  2   Banksia ericifolia Proteaceae      NSW\n## 5  0.30378 3.985  2   Banksia ericifolia Proteaceae      NSW\n## 6  0.41205 4.415  2   Banksia ericifolia Proteaceae      NSW\n## 7  0.35158 4.440  2   Banksia ericifolia Proteaceae      NSW\n## 8  0.13920 3.007  2   Banksia ericifolia Proteaceae      NSW\n## 9  0.16579 2.976  2   Banksia ericifolia Proteaceae      NSW\n## 10 0.66290 5.315  3      Banksia serrata Proteaceae      NSW\n## 11 0.25720 3.755  3      Banksia serrata Proteaceae      NSW\n## 12 0.88086 5.345  3      Banksia serrata Proteaceae      NSW\n## 13 0.11784 3.183  3      Banksia serrata Proteaceae      NSW\n## 14 0.01423 3.749  4      Banksia grandis Proteaceae       WA\n## 15 0.23359 4.264  4      Banksia grandis Proteaceae       WA\n## 16 0.33614 4.433  4      Banksia grandis Proteaceae       WA\n## 17 0.52122 4.393  4      Banksia grandis Proteaceae       WA\n## 18 0.11616 3.603  4      Banksia grandis Proteaceae       WA\n## 19 0.90871 6.379  4      Banksia grandis Proteaceae       WA\n## 20 0.75664 5.838  4      Banksia grandis Proteaceae       WA<\/code><\/pre>\n<\/div>\n<p>The large block of code is now reduced to a single line that clearly expresses what we want to achieve. Moreover, the new values (data) are stored as a table of\u00a0<em>data<\/em>\u00a0in a file, which is preferable to having data mixed in with our code.<\/p>\n<p>You can use this approach You can find the example files used here, as a\u00a0<a href=\"https:\/\/gist.github.com\/dfalster\/5589956\">github gist<\/a>.<\/p>\n<p><strong>Acknowledgements:<\/strong>\u00a0Many thanks to Rich FitzJohn and Diego Barneche for valuable discussions.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In many analyses, data is read from a file, but must be modified before it can be used. For example you may want to add&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-327","post","type-post","status-publish","format-standard","hentry","category-r"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/327","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=327"}],"version-history":[{"count":0,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/327\/revisions"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media?parent=327"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=327"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=327"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}