{"id":642,"date":"2014-06-18T12:09:57","date_gmt":"2014-06-18T17:09:57","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=642"},"modified":"2014-06-18T12:09:57","modified_gmt":"2014-06-18T17:09:57","slug":"data-frames","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2014\/06\/18\/data-frames\/","title":{"rendered":"DATA FRAMES"},"content":{"rendered":"<p style=\"color: #000000;\" align=\"center\"><b><span style=\"font-size: small;\">DATA FRAMES<\/span><\/b><\/p>\n<hr style=\"color: #000000;\" \/>\n<p style=\"color: #000000;\"><b>Preamble<\/b><\/p>\n<p style=\"color: #000000;\">There is plenty to say about data frames because they are the primary data structure in R. Some of what follows is essential knowledge. Some of it will be satisfactorily learned for now if you remember that &#8220;R can do that.&#8221; I will try to point out which parts are which. Set aside some time. This is a long one!<\/p>\n<hr style=\"color: #000000;\" \/>\n<p style=\"color: #000000;\"><b>Definition and Examples<\/b>\u00a0<i>(essential)<\/i><\/p>\n<p style=\"color: #000000;\">A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case. As we shall see, a &#8220;case&#8221; is not necessarily the same as an experimental subject or unit, although they are often the same. Technically, in R a data frame is a list of column vectors, although there is only one reason why you might need to remember such an arcane thing. Unlike an array, the data you store in the columns of a data frame can be of various types. I.e., one column might be a numerical variable, another might be a factor, and a third might be a character variable. 
All columns have to be the same length (contain the same number of data items).<\/p>\n<p style=\"color: #000000;\">Let&#8217;s say we&#8217;ve collected data on one response variable or DV from 15 subjects, who were divided into three experimental groups called control (&#8220;contr&#8221;), treatment one (&#8220;treat1&#8221;), and treatment two (&#8220;treat2&#8221;). We might be tempted to table the data as follows&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">contr     treat1    treat2\n---------------------------\n  22        32        30\n  18        35        28\n  25        30        25\n  25        42        22\n  20        31        33\n---------------------------<\/span><\/pre>\n<p><span style=\"color: #000000;\">While this is a perfectly acceptable table, it is NOT a data frame, because values on our one response variable have been divided into three columns (and so have values on the grouping or independent variable). A data frame has the name of the variable at the top of the column, and values of that variable in the column under the variable name. So the data above should be tabled as follows&#8230;\u00a0<\/span><\/p>\n<pre>scores     group\n----------------\n  22       contr\n  18       contr\n  25       contr\n  25       contr\n  20       contr\n  32      treat1\n  35      treat1\n  30      treat1\n  42      treat1\n  31      treat1\n  30      treat2\n  28      treat2\n  25      treat2\n  22      treat2\n  33      treat2\n----------------<\/pre>\n<p><span style=\"color: #000000;\">This is a proper data frame (and leave out the dashed lines, although in actual fact R could read this table just as you see it here). 
It does not matter what order you type the columns in, as long as each column contains values of one variable, and every recorded value of that variable is in that column.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">In a previous tutorial we used the data object &#8220;women&#8221; as an example of a data frame&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; women\n   height weight\n1      58    115\n2      59    117\n3      60    120\n4      61    123\n5      62    126\n6      63    129\n7      64    132\n8      65    135\n9      66    139\n10     67    142\n11     68    146\n12     69    150\n13     70    154\n14     71    159\n15     72    164<\/span><\/pre>\n<p><span style=\"color: #000000;\">In this data frame we have two numerical variables and no real explanatory variables (IVs) or response variables (DVs). Notice when R prints out a data frame, it numbers the rows. These numbers are for convenience only and are not part of the data frame, and I&#8217;ll have much more to say about them shortly.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">We can refer to any value, or subset of values, in this data frame using the already familiar notation&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; women[12,2]                          # row 12, column 2; note: square brackets\n[1] 150\n&gt; women[8,]                            # row 8, all columns\n  height weight\n8     65    135\n&gt; women[1:5,]                          # rows 1 to 5, all columns\n  height weight\n1     58    115\n2     59    117\n3     60    120\n4     61    123\n5     62    126\n&gt; women[,2]                            # all rows, column 2\n [1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164\n&gt; women[c(1,3,7,13),]                  # rows 1, 3, 7, and 13, all columns\n   height weight\n1      58    115\n3      60    120\n7      64    132\n13     
70    154\n&gt; women[c(1,3,7,13),1]                 # rows 1, 3, 7, and 13, column 1\n[1] 58 60 64 70<\/span><\/pre>\n<p><span style=\"color: #000000;\">Here&#8217;s the catch. Those index numbers do NOT necessarily correspond to the numbers you see printed out with the data frame. This can be confusing at first, and it is something you need to keep in mind. I will explain in a moment.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">Another built-in data object that is a data frame is &#8220;warpbreaks&#8221;. This data frame contains 54 cases, so I will print out only every third one. I do this with the sequence function, since this function creates a vector just as the\u00a0<span style=\"font-family: courier;\">c(\u00a0)<\/span>\u00a0function did in the above examples&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; warpbreaks[seq(1,54,3),]\n   breaks wool tension\n1      26    A       L\n4      25    A       L\n7      51    A       L\n10     18    A       M\n13     17    A       M\n16     35    A       M\n19     36    A       H\n22     18    A       H\n25     28    A       H\n28     27    B       L\n31     19    B       L\n34     41    B       L\n37     42    B       M\n40     16    B       M\n43     21    B       M\n46     20    B       H\n49     17    B       H\n52     15    B       H<\/span><\/pre>\n<p><span style=\"color: #000000;\">In this data frame we have one numerical variable (number of breaks), and two categorical variables (type of wool and tension on the wool). We don&#8217;t have to look at the data frame itself to get this information. We can also use the\u00a0<\/span><span style=\"color: #000000; font-family: courier;\">str(\u00a0)<\/span><span style=\"color: #000000;\">\u00a0function, which displays a breakdown of the structure of a data frame&#8230;\u00a0<\/span><\/p>\n<pre>&gt; str(warpbreaks)\n'data.frame':   54 obs. 
of  3 variables:\n $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...\n $ wool   : Factor w\/ 2 levels \"A\",\"B\": 1 1 1 1 1 1 1 1 1 1 ...\n $ tension: Factor w\/ 3 levels \"L\",\"M\",\"H\": 1 1 1 1 1 1 1 1 1 2 ...<\/pre>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">Another example is the data object &#8220;sleep&#8221;&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; sleep\n   extra group\n1    0.7     1\n2   -1.6     1\n3   -0.2     1\n4   -1.2     1\n5   -0.1     1\n6    3.4     1\n7    3.7     1\n8    0.8     1\n9    0.0     1\n10   2.0     1\n11   1.9     2\n12   0.8     2\n13   1.1     2\n14   0.1     2\n15  -0.1     2\n16   4.4     2\n17   5.5     2\n18   1.6     2\n19   4.6     2\n20   3.4     2<\/span><\/pre>\n<p><span style=\"color: #000000;\">Here we have two variables, the change in sleep time a subject got (&#8220;extra&#8221;), and what drug the subject received (&#8220;group&#8221;). In this case, the first variable (the dependent variable, DV, response variable, etc.) is numerical and the second (the independent variable, IV, explanatory variable, grouping variable, etc.) is categorical, even though the categorical variable is coded as a number. Once again, it does not matter in what order the columns occur. Put the IV in the first column and the DV in the second column if you want.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">However, if categorical variables are coded as numbers (a common practice), R will not know this until you tell it&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; str(sleep)\n'data.frame':   20 obs. 
of  2 variables:\n $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...\n $ group: Factor w\/ 2 levels \"1\",\"2\": 1 1 1 1 1 1 1 1 1 1 ...<\/span><\/pre>\n<p><span style=\"color: #000000;\">In this case, the fact that &#8220;group&#8221; is a factor is stored internally in the data frame, but that will not always be the case. So it&#8217;s worth taking a look to make sure things you intend to be factors are being interpreted as factors by R. You can do this with\u00a0<\/span><span style=\"color: #000000; font-family: courier;\">str(\u00a0)<\/span><span style=\"color: #000000;\">, but you can also do it with\u00a0<\/span><span style=\"color: #000000; font-family: courier;\">summary(\u00a0)<\/span><span style=\"color: #000000;\">, because numerical variables and factors are summarized differently&#8230;\u00a0<\/span><\/p>\n<pre>&gt; summary(sleep)\n     extra        group \n Min.   :-1.600   1:10  \n 1st Qu.:-0.025   2:10  \n Median : 0.950         \n Mean   : 1.540         \n 3rd Qu.: 3.400         \n Max.   : 5.500<\/pre>\n<p><span style=\"color: #000000;\">Notice that numerical variables (extra) are summarized with numerical summary statistics, while factors are summarized with a frequency table. In these data, there are 10 subjects in group 1 and 10 subjects in group 2.<\/span><\/p>\n<p style=\"color: #000000;\">\n<hr style=\"color: #000000;\" \/>\n<p style=\"color: #000000;\"><b>An Ambiguous Case<\/b>\u00a0<i>(essential)<\/i><\/p>\n<p style=\"color: #000000;\">Entering data into a data frame sometimes involves making a tough decision as to what your variables are. The following example is from a built-in data object called &#8220;anorexia&#8221;. This data set is not in the libraries that are loaded by default when R starts, so to see it, the first thing we need to do is attach the correct library to the search path. 
Let&#8217;s see how that works&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; search()\n [1] \".GlobalEnv\"        \"tools:RGUI\"        \"package:stats\"    \n [4] \"package:graphics\"  \"package:grDevices\" \"package:utils\"    \n [7] \"package:datasets\"  \"package:methods\"   \"Autoloads\"        \n[10] \"package:base\"<\/span><\/pre>\n<p><span style=\"color: #000000;\">This is the default search path, the one you have right after R starts. (It will be a little different in different operating systems.) We want to see an object in the MASS library (or package), which is not currently in the search path. So to get it into the search path, do this&#8230;\u00a0<\/span><\/p>\n<pre>&gt; library(MASS)\n&gt; search()\n [1] \".GlobalEnv\"        \"package:MASS\"      \"tools:RGUI\"       \n [4] \"package:stats\"     \"package:graphics\"  \"package:grDevices\"\n [7] \"package:utils\"     \"package:datasets\"  \"package:methods\"  \n[10] \"Autoloads\"         \"package:base\"<\/pre>\n<p><span style=\"color: #000000;\">Notice we have added &#8220;package:MASS&#8221; to the search path in position 2. This means if we request an R object, R will look first in the global environment (the workspace), and if the object is not found there, R will look next in MASS, then in RGUI, then in stats, and so on, until the object either is found or R runs out of places to look for it. 
The &#8220;anorexia&#8221; data frame is 72 cases long, so to conserve space we will look at only every fifth row of it&#8230;\u00a0<\/span><\/p>\n<pre>&gt; anorexia[seq(1,72,5),]\n   Treat Prewt Postwt\n1   Cont  80.7   80.2\n6   Cont  88.3   78.1\n11  Cont  77.6   77.4\n16  Cont  77.3   77.3\n21  Cont  85.5   88.3\n26  Cont  89.0   78.8\n31   CBT  79.9   76.4\n36   CBT  80.5   82.1\n41   CBT  70.0   90.9\n46   CBT  84.2   83.9\n51   CBT  83.3   85.2\n56    FT  83.8   95.2\n61    FT  79.6   76.7\n66    FT  81.6   77.8\n71    FT  86.0   91.7<\/pre>\n<p><span style=\"color: #000000;\">The data frame contains data from women who underwent treatment for anorexia. In the first column we have the treatment variable (&#8220;Treat&#8221;). The second column contains the pretreatment body weight in pounds (&#8220;Prewt&#8221;). The third column contains the posttreatment body weight in pounds (&#8220;Postwt&#8221;). So where is the ambiguity?<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">Here&#8217;s the awkward question. In our analysis of these data, do we wish to treat weight as two variables (pre and post) each measured once on each subject, or as one variable (weight) measured twice on each subject? The data frame is currently arranged as if the plan was for an analysis of covariance, with &#8220;Postwt&#8221; being the response, &#8220;Treat&#8221; the explanatory variable, and &#8220;Prewt&#8221; the covariate. Prewt and Postwt are treated as two variables.<\/p>\n<p style=\"color: #000000;\">If the plan was for a repeated measures ANOVA, then the data frame is wrong, because in this case, &#8220;weight&#8221; is ONE variable measured twice (&#8220;pre&#8221; and &#8220;post&#8221;) on each woman. 
In this analysis, we would also need to add a &#8220;subject&#8221; variable to the data frame, since each subject would have two lines, a &#8220;pre&#8221; line and a &#8220;post&#8221; line.<\/p>\n<p style=\"color: #000000;\">It&#8217;s not a disaster. The data frame is easy enough to rearrange on the fly, and we will do so below.<\/p>\n<p style=\"color: #000000;\">By the way, this is how you get the MASS package out of the search path if you no longer need it&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; detach(\"package:MASS\")<\/span><\/pre>\n<p style=\"color: #000000;\">\n<hr style=\"color: #000000;\" \/>\n<p style=\"color: #000000;\"><b>Creating a Data Frame in R<\/b>\u00a0<i>(essential)<\/i><\/p>\n<p style=\"color: #000000;\">The easiest way, and the usual way, of getting a data frame into the R workspace is to read it in from a file. We will do that in the next tutorial. Sometimes it becomes necessary to create one at the console, however. 
Here are the steps involved:<\/p>\n<ul style=\"color: #000000;\">\n<li>Type each variable into a vector.<\/li>\n<li>Use the\u00a0<span style=\"font-family: courier;\">data.frame(\u00a0)<\/span>\u00a0function to create a data frame from the vectors.<\/li>\n<\/ul>\n<p style=\"color: #000000;\">You may remember these data from the &#8220;Objects&#8221; tutorial&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">name     age  hgt  wgt  race year   SAT \nBob       21   70  180  Cauc   Jr  1080\nFred      18   67  156 Af.Am   Fr  1210\nBarb      18   64  128 Af.Am   Fr   840\nSue       24   66  118  Cauc   Sr  1340\nJeff      20   72  202 Asian   So   880<\/span><\/pre>\n<p><span style=\"color: #000000;\">Let&#8217;s make a data frame of this&#8230;\u00a0<\/span><\/p>\n<pre>&gt; ls()                            # A clean workspace is a good start!\ncharacter(0)\n&gt; name = scan(what=\"character\")\n1: Bob Fred Barb Sue Jeff         # Remember: press Enter twice to end data entry.\n6: \nRead 5 items\n&gt; age = scan()\n1: 21 18 18 24 20\n6: \nRead 5 items\n&gt; hgt = scan()\n1: 70 67 64 66 72\n6: \nRead 5 items\n&gt; wgt = scan()\n1: 180 156 128 1118 202\n6: \nRead 5 items\n&gt; race = scan(what=\"character\")\n1: Cauc Af.Am Af.Am Cauc Asian\n6: \nRead 5 items\n&gt; year = scan(what=\"character\")\n1: Jr Fr Fr Sr So\n6: \nRead 5 items\n&gt; SAT = scan()\n1: 1080 1210 840 1340 880\n6: \nRead 5 items\n&gt; my.data = data.frame(name, age, hgt, wgt, race, year, SAT)\n&gt; my.data\n  name age hgt  wgt  race year  SAT\n1  Bob  21  70  180  Cauc   Jr 1080\n2 Fred  18  67  156 Af.Am   Fr 1210\n3 Barb  18  64  128 Af.Am   Fr  840\n4  Sue  24  66 1118  Cauc   Sr 1340\n5 Jeff  20  72  202 Asian   So  880<\/pre>\n<p><span style=\"color: #000000;\">Tah dah! It&#8217;s as simple as that. You wouldn&#8217;t want to have to do that with a large data set, however, and that&#8217;s why we&#8217;ll learn how to read them in from a file in the next tutorial. 
DON&#8217;T clean up your workspace. We will carry this example over into the next section.<\/span><\/p>\n<p style=\"color: #000000;\">\n<hr style=\"color: #000000;\" \/>\n<p style=\"color: #000000;\"><b>Accessing Information Inside a Data Frame<\/b>\u00a0<i>(essential)<\/i><\/p>\n<p style=\"color: #000000;\">First, let&#8217;s look at a few functions that allow us to get general information about a data frame&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; dim(my.data)               # Get size in rows by columns.\n[1] 5 7\n&gt; names(my.data)             # Get the names of variables in the data frame.\n[1] \"name\" \"age\"  \"hgt\"  \"wgt\"  \"race\" \"year\" \"SAT\" \n&gt; str(my.data)               # See the internal structure of the data frame.\n'data.frame':   5 obs. of  7 variables:\n $ name: Factor w\/ 5 levels \"Barb\",\"Bob\",\"Fred\",..: 2 3 1 5 4\n $ age : num  21 18 18 24 20\n $ hgt : num  70 67 64 66 72\n $ wgt : num  180 156 128 1118 202\n $ race: Factor w\/ 3 levels \"Af.Am\",\"Asian\",..: 3 1 1 3 2\n $ year: Factor w\/ 4 levels \"Fr\",\"Jr\",\"So\",..: 2 1 1 4 3\n $ SAT : num  1080 1210 840 1340 880<\/span><\/pre>\n<p><span style=\"color: #000000;\">These are self-explanatory, with the exception of\u00a0<\/span><span style=\"color: #000000; font-family: courier;\">str(\u00a0)<\/span><span style=\"color: #000000;\">. First, notice that our character variables were entered into the data frame as factors. This was the default in R before version 4.0.0; from 4.0.0 on, character vectors stay character unless you ask for factors with stringsAsFactors=TRUE. Either way, it may not be what you want. Second, notice on the lines giving info about factors that there are strange numbers at the ends of those lines. You don&#8217;t have to worry about these. What R is telling you is that factors are coded internally in R as numbers. 
R will keep it all straight for you, so don&#8217;t sweat the details.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">The\u00a0<span style=\"font-family: courier;\">summary(\u00a0)<\/span>\u00a0function is also useful here&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; summary(my.data)\n   name        age            hgt            wgt            race   year  \n Barb:1   Min.   :18.0   Min.   :64.0   Min.   : 128.0   Af.Am:2   Fr:2  \n Bob :1   1st Qu.:18.0   1st Qu.:66.0   1st Qu.: 156.0   Asian:1   Jr:1  \n Fred:1   Median :20.0   Median :67.0   Median : 180.0   Cauc :2   So:1  \n Jeff:1   Mean   :20.2   Mean   :67.8   Mean   : 356.8             Sr:1  \n Sue :1   3rd Qu.:21.0   3rd Qu.:70.0   3rd Qu.: 202.0                   \n          Max.   :24.0   Max.   :72.0   Max.   :1118.0                   \n      SAT      \n Min.   : 840  \n 1st Qu.: 880  \n Median :1080  \n Mean   :1070  \n 3rd Qu.:1210  \n Max.   :1340<\/span><\/pre>\n<p><span style=\"color: #000000;\">Or at least that would be useful if the data frame were larger!<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">There are four ways to get at the data inside a data frame, and this is NOT one of them&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; SAT\n[1] 1080 1210  840 1340  880<\/span><\/pre>\n<p><span style=\"color: #000000;\">That only seemed to work, because remember when you created the data frame, you started by putting a vector called &#8220;SAT&#8221; into the workspace. THAT&#8217;S what you&#8217;re seeing now! 
You are not seeing the SAT variable from inside the data frame.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">Let&#8217;s erase all those vectors EXCEPT &#8220;age&#8221;, which we will keep to illustrate something that you will need to remember about R&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; ls()\n[1] \"age\"     \"hgt\"     \"my.data\" \"name\"    \"race\"    \"SAT\"     \"wgt\"    \n[8] \"year\"   \n&gt; rm(hgt, name, race, SAT, wgt, year)  ### Don't erase my.data!\n&gt; ls()\n[1] \"age\"     \"my.data\"<\/span><\/pre>\n<p><span style=\"color: #000000;\">Now if we try to see SAT as we did above&#8230;\u00a0<\/span><\/p>\n<pre>&gt; SAT\nError: object 'SAT' not found<\/pre>\n<p><span style=\"color: #000000;\">&#8230;we get an error. R will not look inside data frames for variables unless you tell it to. Here are the four ways to do that&#8230;<\/span><\/p>\n<p style=\"color: #000000;\">\n<ul style=\"color: #000000;\">\n<li>by using $<\/li>\n<li>by using\u00a0<span style=\"font-family: courier;\">with( )<\/span><\/li>\n<li>by using data=<\/li>\n<li>by using\u00a0<span style=\"font-family: courier;\">attach( )<\/span><\/li>\n<\/ul>\n<p style=\"color: #000000;\">A data frame is a list of column vectors. We can extract items from inside it by using the usual list indexing device, $. 
To do this, type the name of the data frame, a dollar sign, and the name of the variable you want to work with&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; my.data$SAT\n[1] 1080 1210  840 1340  880\n&gt; mean(my.data$SAT)\n[1] 1070<\/span><\/pre>\n<p><span style=\"color: #000000;\">If the dollar sign notation gets hard to read, you can put spaces around the $ to make the command line easier to read&#8230;\u00a0<\/span><\/p>\n<pre>&gt; mean(my.data $ SAT)\n[1] 1070<\/pre>\n<p><span style=\"color: #000000;\">The $ notation can certainly be a nuisance, because in some commands you have to type the data frame name multiple times. An example is the command that calculates a correlation&#8230;\u00a0<\/span><\/p>\n<pre>&gt; cor(my.data$hgt, my.data$wgt)\n[1] -0.2531835<\/pre>\n<p><span style=\"color: #000000;\">In this case, you can use the\u00a0<\/span><span style=\"color: #000000; font-family: courier;\">with(\u00a0)<\/span><span style=\"color: #000000;\">\u00a0function to tell R where to get the data from&#8230;\u00a0<\/span><\/p>\n<pre>&gt; with(my.data, cor(hgt, wgt))\n[1] -0.2531835<\/pre>\n<p><span style=\"color: #000000;\">It doesn&#8217;t save much typing in this example, but there are cases where it will save a LOT of typing! Notice the syntax of this function. You type the name of the data frame first, followed by a comma, followed by the function you want to execute, then you close the parentheses on\u00a0<\/span><span style=\"color: #000000; font-family: courier;\">with(\u00a0).<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">As we will learn later, some functions, especially significance tests, offer what&#8217;s called a formula interface. When that&#8217;s the case, there is always a data= option to specify the name of the data frame where the variables are to be found. I&#8217;ll just show you an example for now. 
We&#8217;ll have plenty of time to examine the formula interface later&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; cor.test( ~ hgt + wgt, data=my.data)\n\n        Pearson's product-moment correlation\n\ndata:  hgt and wgt \nt = -0.4533, df = 3, p-value = 0.6811\nalternative hypothesis: true correlation is not equal to 0 \n95 percent confidence interval:\n -0.9281289  0.8100218 \nsample estimates:\n       cor \n-0.2531835<\/span><\/pre>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">Finally, there is the dreaded\u00a0<span style=\"font-family: courier;\">attach(\u00a0)<\/span>\u00a0function. This attaches the data frame to your search path (in position 2) so that R will know to look there for data objects that are referenced by name. Some people use this device routinely when working with data frames, but it can cause problems, and we are about to see one&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; attach(my.data)\n\n        The following object(s) are masked _by_ .GlobalEnv :\n\n         age<\/span><\/pre>\n<p><span style=\"color: #000000;\">Say what? When an object is masked (or shadowed) by the global environment, that means there is a data object in the workspace that has this name AND there is a variable inside the data frame that has this name. I can now ask for any variable inside the data frame EXCEPT age&#8230;<\/span><\/p>\n<pre>&gt; SAT\n[1] 1080 1210  840 1340  880\n&gt; mean(SAT)\n[1] 1070\n&gt; table(year)\nyear\nFr Jr So Sr \n 2  1  1  1 \n&gt; age\n[1] 21 18 18 24 20<\/pre>\n<p><span style=\"color: #000000;\">You might think you are seeing my.data$age here, but YOU ARE NOT! You&#8217;re seeing &#8220;age&#8221; from the workspace. 
In this case they&#8217;re the same, but that won&#8217;t always be true&#8230;\u00a0<\/span><\/p>\n<pre>&gt; age = 112\n&gt; age\n[1] 112<\/pre>\n<p><span style=\"color: #000000;\">The assignment changed the value of &#8220;age&#8221; in the workspace, but not in the data frame&#8230;\u00a0<\/span><\/p>\n<pre>&gt; my.data$age\n[1] 21 18 18 24 20<\/pre>\n<p><span style=\"color: #000000;\">If we remove age from the workspace, R will then search inside the data frame for it&#8230;\u00a0<\/span><\/p>\n<pre>&gt; rm(age)\n&gt; age\n[1] 21 18 18 24 20<\/pre>\n<p><span style=\"color: #000000;\">The lesson is, when you get one of these masking (or shadowing) conflicts, WATCH OUT! Be extra careful to know which version of the variable you&#8217;re working with. This has tripped up many an R user, including me. This is why you want to keep your workspace as clean as possible. The best strategy here is to remove the masking variable from the workspace. If you want to keep it, at least rename it and then remove the conflicting version from the workspace. You&#8217;ll eventually be sorry if you don&#8217;t!<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">One more lesson&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; detach(my.data)<\/span><\/pre>\n<p><span style=\"color: #000000;\">When you&#8217;re done with an attached data frame, ALWAYS detach it. This will remove it from the search path so that R will no longer look inside it for variables. You&#8217;ll have to go back to using $ to reference variables inside the data frame after it is detached. This isn&#8217;t necessary if you&#8217;re going to quit your R session right away. Quitting detaches everything that was attached. But if you&#8217;re going to continue working, detach data frames you no longer need. 
Otherwise, your search path will get messy, and you&#8217;ll get more and more masking conflicts as other objects are attached.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">DON&#8217;T erase my.data. We still need it.<\/p>\n<hr style=\"color: #000000;\" \/>\n<p style=\"color: #000000;\"><b>Data Frame Indexing and Row Names<\/b>\u00a0<i>(critical)<\/i><\/p>\n<p style=\"color: #000000;\">This will cost you BIGTIME eventually if you don&#8217;t pay close attention!<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; ls()                            # Still there?\n[1] \"my.data\"\n&gt; my.data\n  name age hgt  wgt  race year  SAT\n1  Bob  21  70  180  Cauc   Jr 1080\n2 Fred  18  67  156 Af.Am   Fr 1210\n3 Barb  18  64  128 Af.Am   Fr  840\n4  Sue  24  66 1118  Cauc   Sr 1340\n5 Jeff  20  72  202 Asian   So  880<\/span><\/pre>\n<p><span style=\"color: #000000;\">Let&#8217;s talk about those line numbers at the leftmost verge of the printed data frame. THEY ARE NOT NUMBERS. Let me repeat that. THEY ARE NOT NUMBERS. They are row names. So the rows and columns of this data frame are NAMED as follows:\u00a0<\/span><\/p>\n<pre>&gt; dimnames(my.data)\n[[1]]\n[1] \"1\" \"2\" \"3\" \"4\" \"5\"\n\n[[2]]\n[1] \"name\" \"age\"  \"hgt\"  \"wgt\"  \"race\" \"year\" \"SAT\"<\/pre>\n<p><span style=\"color: #000000;\">What&#8217;s the big deal?<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">Look at the printed data frame. Suppose we wanted to extract Barb&#8217;s weight. 
That&#8217;s the value in row 3 and column 4, so we could get it this way&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; my.data[3,4]                    # Remember to use square brackets for indexing.\n[1] 128<\/span><\/pre>\n<p><span style=\"color: #000000;\">&#8220;Yeah, so?&#8221; We could also get it this way&#8230;\u00a0<\/span><\/p>\n<pre>&gt; my.data[3,\"wgt\"]\n[1] 128<\/pre>\n<p><span style=\"color: #000000;\">&#8230;and this way&#8230;\u00a0<\/span><\/p>\n<pre>&gt; my.data[\"3\",\"wgt\"]\n[1] 128<\/pre>\n<p><span style=\"color: #000000;\">Those last two ways seem to be the same, BUT THEY ARE NOT!!!<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">Let&#8217;s sort the data frame using the age variable. Sorting a data frame is done using the\u00a0<span style=\"font-family: courier;\">order(\u00a0)<\/span>\u00a0function. Remember how it worked when we sorted a vector? If a call to the\u00a0<span style=\"font-family: courier;\">order(\u00a0)<\/span>\u00a0function is put in place of the row index, the data frame will be sorted on whatever variable is specified inside that function. You will have to use the full name of the variable; i.e., you will have to use the $ notation. (Why?) Otherwise, R will look in your workspace for a variable called &#8220;age&#8221;, fail to find it, and give a &#8220;not found&#8221; error. It happens to me a lot, so you might as well just get used to it!<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; my.data[order(my.data$age),]\n  name age hgt  wgt  race year  SAT\n2 Fred  18  67  156 Af.Am   Fr 1210\n3 Barb  18  64  128 Af.Am   Fr  840\n5 Jeff  20  72  202 Asian   So  880\n1  Bob  21  70  180  Cauc   Jr 1080\n4  Sue  24  66 1118  Cauc   Sr 1340<\/span><\/pre>\n<p><span style=\"color: #000000;\">Observe the row names! They have been reordered along with the rows, haven&#8217;t they? 
Let&#8217;s save this into a new data object so we can play with it a bit&#8230;\u00a0<\/span><\/p>\n<pre>&gt; my.data[order(my.data$age),] -&gt; my.data.sorted      # Did you remember up arrow?\n&gt; my.data.sorted\n  name age hgt  wgt  race year  SAT\n2 Fred  18  67  156 Af.Am   Fr 1210\n3 Barb  18  64  128 Af.Am   Fr  840\n5 Jeff  20  72  202 Asian   So  880\n1  Bob  21  70  180  Cauc   Jr 1080\n4  Sue  24  66 1118  Cauc   Sr 1340<\/pre>\n<p><span style=\"color: #000000;\">Now let&#8217;s try to extract Barb&#8217;s weight from this new data frame&#8230;\u00a0<\/span><\/p>\n<pre>&gt; my.data.sorted[3,4]                  ### Wrong!\n[1] 202\n&gt; my.data.sorted[3,\"wgt\"]              ### Also wrong!\n[1] 202\n&gt; my.data.sorted[\"3\",\"wgt\"]            ### Correct!\n[1] 128\n&gt; my.data.sorted[2,4]                  ### Also correct!\n[1] 128<\/pre>\n<p><span style=\"color: #000000;\">Confused yet?<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">Here&#8217;s what you have to remember. Those numbers that often print out on the left side of a data frame ARE NOT NUMBERS. They&#8217;re row names. So data frames have both row and column names, whether you like it or not! The point becomes clearer when we give the rows actual names. 
Let&#8217;s erase the names from my.data and then re-enter them as row names&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; rm(my.data.sorted)                   # Get rid of that first.\n&gt; my.data$name &lt;- NULL                 # This is how you erase a variable.\n&gt; my.data                              # See?\n  age hgt  wgt  race year  SAT\n1  21  70  180  Cauc   Jr 1080\n2  18  67  156 Af.Am   Fr 1210\n3  18  64  128 Af.Am   Fr  840\n4  24  66 1118  Cauc   Sr 1340\n5  20  72  202 Asian   So  880\n&gt; rownames(my.data) &lt;- c(\"Bob\",\"Fred\",\"Barb\",\"Sue\",\"Jeff\")\n&gt; my.data\n     age hgt  wgt  race year  SAT\nBob   21  70  180  Cauc   Jr 1080\nFred  18  67  156 Af.Am   Fr 1210\nBarb  18  64  128 Af.Am   Fr  840\nSue   24  66 1118  Cauc   Sr 1340\nJeff  20  72  202 Asian   So  880\n&gt; my.data[\"Barb\", \"wgt\"]               # Makes getting Barb's weight a lot easier!\n[1] 128<\/span><\/pre>\n<p><span style=\"color: #000000;\">Notice the numbers are gone now because we have actual row names. And OF COURSE they sort with the rest of the data frame&#8230;\u00a0<\/span><\/p>\n<pre>&gt; my.data[order(my.data$age),]\n     age hgt  wgt  race year  SAT\nFred  18  67  156 Af.Am   Fr 1210\nBarb  18  64  128 Af.Am   Fr  840\nJeff  20  72  202 Asian   So  880\nBob   21  70  180  Cauc   Jr 1080\nSue   24  66 1118  Cauc   Sr 1340<\/pre>\n<p><span style=\"color: #000000;\">It would be absolutely silly if they didn&#8217;t! Just remember: Data frames ALWAYS have row names. Sometimes those row names just happen to look like numbers. It&#8217;s the row names that print out to your console when you ask to see the data frame, or any part of it, and NOT the index numbers.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">Don&#8217;t remove my.data yet. 
We still need it.<\/p>\n<hr style=\"color: #000000;\" \/>\n<p style=\"color: #000000;\"><b>Modifying a Data Frame<\/b>\u00a0<i>(not so essential just now)<\/i><\/p>\n<p style=\"color: #000000;\">Rule number one with a bullet:<\/p>\n<ul style=\"color: #000000;\">\n<li>NEVER MODIFY AN ATTACHED DATA FRAME!<\/li>\n<\/ul>\n<p style=\"color: #000000;\">While this isn&#8217;t strictly against the law, it&#8217;s a bad idea and can get very confusing as to exactly what it is you&#8217;ve modified. The short explanation is that attaching puts a copy of the data frame on the search path, so changes you later make to the data frame itself do not show up in the attached copy. So just don&#8217;t do it!<\/p>\n<p style=\"color: #000000;\">The time will come when you want to change a data frame in some way. Here are some examples of how to do that. You may have noticed that Sue (in the my.data data frame) is a wee bit on the chunky side. This was an innocent mistake. I really didn&#8217;t do that on purpose. How do we fix it? The value was supposed to be 118, but let&#8217;s change it to 135 just for kicks&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; ls()                                 # Still there?\n[1] \"my.data\"\n&gt; my.data\n     age hgt  wgt  race year  SAT\nBob   21  70  180  Cauc   Jr 1080\nFred  18  67  156 Af.Am   Fr 1210\nBarb  18  64  128 Af.Am   Fr  840\nSue   24  66 1118  Cauc   Sr 1340\nJeff  20  72  202 Asian   So  880\n&gt; my.data[\"Sue\",\"wgt\"] &lt;- 135\n&gt; my.data\n     age hgt wgt  race year  SAT\nBob   21  70 180  Cauc   Jr 1080\nFred  18  67 156 Af.Am   Fr 1210\nBarb  18  64 128 Af.Am   Fr  840\nSue   24  66 135  Cauc   Sr 1340\nJeff  20  72 202 Asian   So  880<\/span><\/pre>\n<p><span style=\"color: #000000;\">That&#8217;s all there is to it. Use any kind of indexing you like. 
Let&#8217;s use numerical indexing to give Sue her correct weight&#8230;\u00a0<\/span><\/p>\n<pre>&gt; my.data[4,3] &lt;- 118\n&gt; my.data\n     age hgt wgt  race year  SAT\nBob   21  70 180  Cauc   Jr 1080\nFred  18  67 156 Af.Am   Fr 1210\nBarb  18  64 128 Af.Am   Fr  840\nSue   24  66 118  Cauc   Sr 1340\nJeff  20  72 202 Asian   So  880<\/pre>\n<p><span style=\"color: #000000;\">Just remember that &#8220;wgt&#8221; is now in column 3, since the row names don&#8217;t count as a column.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">I have to warn you about modifying data frames. It&#8217;s always a good idea to make a backup copy in the workspace first. Because there are some commands that modify data frames that, if they go wrong, can really screw things up! But let&#8217;s live dangerously. Suppose we wanted &#8220;wgt&#8221; to be in kilograms instead of pounds. Easy enough&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; my.data$wgt \/ 2.2\n[1] 81.81818 70.90909 58.18182 53.63636 91.81818\n&gt; my.data                                # Nothing has changed yet. Why not?\n     age hgt wgt  race year  SAT\nBob   21  70 180  Cauc   Jr 1080\nFred  18  67 156 Af.Am   Fr 1210\nBarb  18  64 128 Af.Am   Fr  840\nSue   24  66 118  Cauc   Sr 1340\nJeff  20  72 202 Asian   So  880\n&gt; my.data$wgt \/ 2.2 -&gt; my.data$wgt       # Aha! 
It has to be stored back into my.data.\n&gt; my.data\n     age hgt      wgt  race year  SAT\nBob   21  70 81.81818  Cauc   Jr 1080\nFred  18  67 70.90909 Af.Am   Fr 1210\nBarb  18  64 58.18182 Af.Am   Fr  840\nSue   24  66 53.63636  Cauc   Sr 1340\nJeff  20  72 91.81818 Asian   So  880\n&gt; round(my.data$wgt, 1) -&gt; my.data$wgt   # A little rounding for good measure.\n&gt; my.data\n     age hgt  wgt  race year  SAT\nBob   21  70 81.8  Cauc   Jr 1080\nFred  18  67 70.9 Af.Am   Fr 1210\nBarb  18  64 58.2 Af.Am   Fr  840\nSue   24  66 53.6  Cauc   Sr 1340\nJeff  20  72 91.8 Asian   So  880<\/span><\/pre>\n<p><span style=\"color: #000000;\">Now that we&#8217;ve rounded them off, we&#8217;ve lost the original weight data in pounds&#8230;\u00a0<\/span><\/p>\n<pre>&gt; my.data$wgt*2.2\n[1] 179.96 155.98 128.04 117.92 201.96<\/pre>\n<p><span style=\"color: #000000;\">We could have avoided this by making a backup copy of my.data first, or by putting the new weight in kilograms into a new column in the data frame.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">Let&#8217;s see how to create a new column in the data frame&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; my.data$IQ = c(115, 122, 100, 144, 96)\n&gt; my.data\n     age hgt  wgt  race year  SAT  IQ\nBob   21  70 81.8  Cauc   Jr 1080 115\nFred  18  67 70.9 Af.Am   Fr 1210 122\nBarb  18  64 58.2 Af.Am   Fr  840 100\nSue   24  66 53.6  Cauc   Sr 1340 144\nJeff  20  72 91.8 Asian   So  880  96<\/span><\/pre>\n<p><span style=\"color: #000000;\">Just name it and assign values to the name in a vector. The new vector has to be the same length as the other variables already in the data frame.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">You can clean up now. 
We&#8217;re done with this data frame.<\/p>\n<hr style=\"color: #000000;\" \/>\n<p style=\"color: #000000;\"><b>Missing Values<\/b>\u00a0<i>(kinda important, so listen up!)<\/i><\/p>\n<p style=\"color: #000000;\">Do this&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; library(MASS)\n&gt; data(Cars93)\n&gt; attach(Cars93)\n&gt; str(Cars93)                     # Output not shown.<\/span><\/pre>\n<p><span style=\"color: #000000;\">This is a data frame with 93 observations on 27 variables. You can see what the variables represent by looking at the help page for this data set:\u00a0<\/span><span style=\"color: #000000; font-family: courier;\">?Cars93<\/span><span style=\"color: #000000;\">. We&#8217;re interested in the variable &#8220;Luggage.room&#8221; in particular, which is the trunk space in cubic feet, to the nearest cubic foot&#8230;\u00a0<\/span><\/p>\n<pre>&gt; summary(Luggage.room)\n   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's \n   6.00   12.00   14.00   13.89   15.00   22.00   11.00<\/pre>\n<p><span style=\"color: #000000;\">This is a numerical variable, so we get the summary we are accustomed to by now. But what are those NAs? Whether we like it or not, data sets often have missing values, and we need to know how to deal with them. R&#8217;s standard code for missing values is &#8220;NA&#8221;, for &#8220;not available&#8221;. The number associated with NA is a frequency. There are 11 cases in this data frame in which &#8220;Luggage.room&#8221; is a missing value. 
If you looked at the help page, you know why.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">Some functions fail to work when there are missing values, but this can (almost always) be fixed with a simple option&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; mean(Luggage.room)\n[1] NA\n&gt; mean(Luggage.room, na.rm=TRUE)\n[1] 13.89024\n&gt; mean(Luggage.room, na.rm=T)\n[1] 13.89024<\/span><\/pre>\n<p><span style=\"color: #000000;\">There is no mean when some of the values are missing, so the &#8220;na.rm&#8221; option removes them when set to TRUE (must be all caps, but the shorter form T also works provided you haven&#8217;t assigned another value to it). If you want to clean the data set by removing all cases with a missing value on any variable (casewise deletion), use the\u00a0<\/span><span style=\"color: #000000; font-family: courier;\">na.omit(\u00a0)<\/span><span style=\"color: #000000;\">\u00a0function&#8230;\u00a0<\/span><\/p>\n<pre>&gt; na.omit(Cars93)                 # Output not shown.<\/pre>\n<p><span style=\"color: #000000;\">I will not reproduce the output here because it is extensive, but it is also instructive, so take a look at it. Scroll the console window backwards to see all of it. Of course, to use this cleaned data frame, you would have to assign it to a new data object.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">The\u00a0<span style=\"font-family: courier;\">which(\u00a0)<\/span>\u00a0function by itself does not work to identify which of the values are missing, because a comparison with == NA never evaluates to TRUE. 
Use\u00a0<span style=\"font-family: courier;\">is.na(\u00a0)<\/span>\u00a0instead&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; which(Luggage.room == NA)\ninteger(0)\n&gt; is.na(Luggage.room)\n [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[12] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE\n[23] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[34] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[56]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE\n[67] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE\n[89]  TRUE FALSE FALSE FALSE FALSE\n&gt; which(is.na(Luggage.room))\n [1] 16 17 19 26 36 56 57 66 70 87 89<\/span><\/pre>\n<p><span style=\"color: #000000;\">Finally, some data sets come with other codes for missing values. 999 is a common missing value code, as are blank spaces. Blanks are a very bad idea. If you find a data set with blanks in it, it may have to be edited in a text editor or spreadsheet before the file can be read into R. It depends on how the file is formatted. In some cases, R will automatically assign NA to blank values, but in other cases it will not. 
Other missing value codes are not a problem, as they can be recoded&#8230;\u00a0<\/span><\/p>\n<pre>&gt; ifelse(is.na(Luggage.room), 999, Luggage.room) -&gt; temp\n&gt; temp\n [1]  11  15  14  17  13  16  17  21  14  18  14  13  14  13  16 999 999\n[18]  20 999  15  14  17  11  13  14 999  16  11  11  15  12  12  13  12\n[35]  18 999  18  21  10  11   8  12  14  11  12   9  14  15  14   9  19\n[52]  22  16  13  14 999 999  12  15   6  15  11  14  12  14 999  14  14\n[69]  16 999  17   8  17  13  13  16  18  14  12  10  15  14  10  11  13\n[86]  15 999  10 999  14  15  14  15\n&gt; # first we'll mess it up\n&gt; # and then we'll fix it\n&gt; ifelse(temp == 999, NA, temp) -&gt; fixed\n&gt; fixed\n [1] 11 15 14 17 13 16 17 21 14 18 14 13 14 13 16 NA NA 20 NA 15 14 17 11\n[24] 13 14 NA 16 11 11 15 12 12 13 12 18 NA 18 21 10 11  8 12 14 11 12  9\n[47] 14 15 14  9 19 22 16 13 14 NA NA 12 15  6 15 11 14 12 14 NA 14 14 16\n[70] NA 17  8 17 13 13 16 18 14 12 10 15 14 10 11 13 15 NA 10 NA 14 15 14\n[93] 15<\/pre>\n<p><span style=\"color: #000000;\">The\u00a0<\/span><span style=\"color: #000000; font-family: courier;\">ifelse(\u00a0)<\/span><span style=\"color: #000000;\">\u00a0function is very handy for recoding a data vector, so let me take a moment to explain it. Inside the parentheses, the first thing you give is a test. In the second of these commands above, where we are going from the messed up variable back to &#8220;fixed&#8221;, the test was &#8220;if any value of temp is equal to 999&#8221;. Notice the double equals sign meaning &#8220;equal&#8221;. (I still get this wrong a lot!) The second thing you give is how to recode those values, and finally you tell what to do with the values that don&#8217;t pass the test. 
So the whole command reads like this: &#8220;If any value of temp is equal to 999, assign it the value NA, else assign it the value that is currently in temp.&#8221;<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">In the first instance of the function, we had to use is.na, since nothing can really be &#8220;equal to&#8221; something that is not available! Try these, and say them in words as you&#8217;re typing them&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; ifelse(fixed == 10, 0, 100)          # Output not shown.\n&gt; ifelse(fixed &gt; 10, 100, 0)           # Output not shown.\n&gt; ifelse(fixed &gt; 10, \"big\", \"small\")   # Output not shown.<\/span><\/pre>\n<p><span style=\"color: #000000;\">If you stored that last one, it would create a character vector.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">Don&#8217;t forget to clean up your workspace and search path!!<\/p>\n<hr style=\"color: #000000;\" \/>\n<p style=\"color: #000000;\"><b>Subsetting a Data Frame<\/b>\u00a0<i>(optional)<\/i><\/p>\n<p style=\"color: #000000;\">We will use a data frame called USArrests for this exercise&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; data(USArrests)\n&gt; head(USArrests)\n           Murder Assault UrbanPop Rape\nAlabama      13.2     236       58 21.2\nAlaska       10.0     263       48 44.5\nArizona       8.1     294       80 31.0\nArkansas      8.8     190       50 19.5\nCalifornia    9.0     276       91 40.6\nColorado      7.9     204       78 38.7<\/span><\/pre>\n<p><span style=\"color: #000000;\">Here is another useful function for looking at a data frame. The\u00a0<\/span><span style=\"color: #000000; font-family: courier;\">head(\u00a0)<\/span><span style=\"color: #000000;\">\u00a0function shows the first six lines of data (cases) inside a data frame. 
There is also a\u00a0<\/span><span style=\"color: #000000; font-family: courier;\">tail(\u00a0)<\/span><span style=\"color: #000000;\">\u00a0function that shows the last six lines, and the number of lines shown can be changed with an option (see the help pages).<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">In this case we have a data frame with row names set to state names and containing variables that give the crime rates (per 100,000 population) for Murder, Assault, and Rape, as well as the percentage of the population that lives in urban areas. These data are from 1973 so are not current.<\/p>\n<p style=\"color: #000000;\">Because state names are used as row names, to see the data for any state, all we have to be able to do is spell the name of the state&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; USArrests[\"Pennsylvania\",]      # No column index, so all columns displayed.\n             Murder Assault UrbanPop Rape\nPennsylvania    6.3     106       72 14.9<\/span><\/pre>\n<p><span style=\"color: #000000;\">We do not have to figure out what the index number would be for that row. Thus, explicit row names can be very handy. To display the entire row of data for PA, we just left out the column index, but THE COMMA STILL HAS TO BE THERE! 
Otherwise, you are trying to index a two-dimensional data object using only one index, and R will tell you to knock it off!<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">Let&#8217;s answer the following questions from these data&#8230;<\/p>\n<ul style=\"color: #000000;\">\n<li>Which state has the lowest murder rate?<\/li>\n<li>Which states have murder rates less than 4.0?<\/li>\n<li>Which states are in the top quartile for urban population?<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<pre>&gt; min(USArrests$Murder)                     # What is the minimum murder rate?\n[1] 0.8\n&gt; which(USArrests$Murder == 0.8)            # Which line of the data is that?\n[1] 34\n&gt; USArrests[34,]                            # Give me the data from that line.\n             Murder Assault UrbanPop Rape\nNorth Dakota    0.8      45       44  7.3\n&gt;\n&gt; which(USArrests$Murder &lt; 4.0)             # Gives the result in a vector.\n [1]  7 12 15 19 23 29 34 39 41 44 45 49\n&gt; USArrests[which(USArrests$Murder &lt; 4.0),] # Use that vector as an index.\n              Murder Assault UrbanPop Rape\nConnecticut      3.3     110       77 11.1\nIdaho            2.6     120       54 14.2\nIowa             2.2      56       57 11.3\nMaine            2.1      83       51  7.8\nMinnesota        2.7      72       66 14.9\nNew Hampshire    2.1      57       56  9.5\nNorth Dakota     0.8      45       44  7.3\nRhode Island     3.4     174       87  8.3\nSouth Dakota     3.8      86       45 12.8\nUtah             3.2     120       80 22.9\nVermont          2.2      48       32 11.2\nWisconsin        2.6      53       66 10.8\n&gt;\n&gt; summary(USArrests$UrbanPop)\n   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
\n  32.00   54.50   66.00   65.54   77.75   91.00 \n&gt; USArrests[which(USArrests$UrbanPop &gt;= 77.75),]\n              Murder Assault UrbanPop Rape\nArizona          8.1     294       80 31.0\nCalifornia       9.0     276       91 40.6\nColorado         7.9     204       78 38.7\nFlorida         15.4     335       80 31.9\nHawaii           5.3      46       83 20.2\nIllinois        10.4     249       83 24.0\nMassachusetts    4.4     149       85 16.3\nNevada          12.2     252       81 46.0\nNew Jersey       7.4     159       89 18.8\nNew York        11.1     254       86 26.1\nRhode Island     3.4     174       87  8.3\nTexas           12.7     201       80 25.5\nUtah             3.2     120       80 22.9<\/pre>\n<p style=\"color: #000000;\">Suppose we wanted to work with data only from these states. How can we extract them from the data frame and make a new data frame that contains only those states? I&#8217;m glad you asked&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; subset(USArrests, subset=(UrbanPop &gt;=77.75)) -&gt; high.urban\n&gt; high.urban\n              Murder Assault UrbanPop Rape\nArizona          8.1     294       80 31.0\nCalifornia       9.0     276       91 40.6\nColorado         7.9     204       78 38.7\nFlorida         15.4     335       80 31.9\nHawaii           5.3      46       83 20.2\nIllinois        10.4     249       83 24.0\nMassachusetts    4.4     149       85 16.3\nNevada          12.2     252       81 46.0\nNew Jersey       7.4     159       89 18.8\nNew York        11.1     254       86 26.1\nRhode Island     3.4     174       87  8.3\nTexas           12.7     201       80 25.5\nUtah             3.2     120       80 22.9<\/span><\/pre>\n<p><span style=\"color: #000000;\">The\u00a0<\/span><span style=\"color: #000000; font-family: courier;\">subset(\u00a0)<\/span><span style=\"color: #000000;\">\u00a0function does the trick. The syntax is a little squirrelly, so let me go through it. 
The first thing you give is the name of the data frame. That is followed by the subset= option. Then inside of parentheses (which actually aren&#8217;t necessary) give the test that defines the subset. Store the output into a new data object so that you can then work with it. Functions that take a data= option can often also take a subset= option, so it&#8217;s a useful thing to know.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">You can clean up your workspace now.<\/p>\n<hr style=\"color: #000000;\" \/>\n<p style=\"color: #000000;\"><b>Stacking and Unstacking<\/b>\u00a0<i>(optional)<\/i><\/p>\n<p style=\"color: #000000;\">Suppose someone has retained your services as a data analyst and gives you his data (from an Excel file or something) in this format&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">contr     treat1    treat2\n---------------------------\n  22        32        30\n  18        35        28\n  25        30        25\n  25        42        22\n  20        31        33\n---------------------------<\/span><\/pre>\n<p><span style=\"color: #000000;\">If you&#8217;re working for free, you can yell at him and make him do it the right way, but if you&#8217;re being paid, you probably really shouldn&#8217;t. Here&#8217;s how to deal with it. 
First, let&#8217;s get these data into a &#8220;data frame&#8221; in this format, and I will leave out the command prompts so that you can just copy and paste these three lines directly into R&#8230;\u00a0<\/span><\/p>\n<pre>### start copying here\nwrong.data = data.frame(contr = c(22,18,25,25,20),\n                        treat1 = c(32,35,30,42,31),\n                        treat2 = c(30,28,25,22,33))\n### stop copying here\n&gt; wrong.data\n  contr treat1 treat2\n1    22     32     30\n2    18     35     28\n3    25     30     25\n4    25     42     22\n5    20     31     33<\/pre>\n<p><span style=\"color: #000000;\">Now do this&#8230;\u00a0<\/span><\/p>\n<pre>&gt; stack(wrong.data) -&gt; correct.data\n&gt; correct.data\n   values    ind\n1      22  contr\n2      18  contr\n3      25  contr\n4      25  contr\n5      20  contr\n6      32 treat1\n7      35 treat1\n8      30 treat1\n9      42 treat1\n10     31 treat1\n11     30 treat2\n12     28 treat2\n13     25 treat2\n14     22 treat2\n15     33 treat2<\/pre>\n<p><span style=\"color: #000000;\">And there you go. Now you have a proper data frame.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">There is also an\u00a0<span style=\"font-family: courier;\">unstack(\u00a0)<\/span>\u00a0function that does the reverse of this, and it will work automatically on a data frame that has been created by <span style=\"font-family: courier;\">stack(\u00a0)<\/span>, but otherwise is a little trickier to use. You probably won&#8217;t have to use it much, so I&#8217;ll refer you to the help page if you ever need it.<\/p>\n<p style=\"color: #000000;\">You can remove these data objects. 
We won&#8217;t use them again.<\/p>\n<hr style=\"color: #000000;\" \/>\n<p style=\"color: #000000;\"><b>Going From Wide to Long and Long to Wide<\/b>\u00a0<i>(eventually you&#8217;ll probably need to know this)<\/i><\/p>\n<p style=\"color: #000000;\">I mention this above under &#8220;An Ambiguous Case.&#8221; There are two kinds of data frames in R, and in most statistical software: wide ones and long ones. Let&#8217;s fetch the &#8220;anorexia&#8221; data again (and we&#8217;ll do it without attaching the MASS package this time)&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; data(anorexia, package=\"MASS\")<\/span><\/pre>\n<p><span style=\"color: #000000;\">What we are about to do is a little confusing until you get some experience with it, so it will be necessary to be able to see what&#8217;s happening. The anorexia data frame is too long to print to a single console screen without causing it to scroll, so I&#8217;m going to cut it down to only nine cases, three from each group. This will help us to see the difference between wide and long data frames without constantly scrolling the console window&#8230;\u00a0<\/span><\/p>\n<pre>&gt; anorexia[c(1,2,3,27,28,29,56,57,58),] -&gt; anor\n&gt; anor\n   Treat Prewt Postwt\n1   Cont  80.7   80.2\n2   Cont  89.4   80.1\n3   Cont  91.8   86.4\n27   CBT  80.5   82.2\n28   CBT  84.9   85.6\n29   CBT  81.5   81.4\n56    FT  83.8   95.2\n57    FT  83.3   94.3\n58    FT  86.0   91.5<\/pre>\n<p><span style=\"color: #000000;\">I also shortened up the name of our data frame, because we&#8217;re going to be typing it a lot.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">This is a wide data frame. It&#8217;s wide because each line of the data frame contains information on ONE SUBJECT, even though that subject was measured multiple times (twice) on weight (Prewt, Postwt). 
So all the data for each subject goes on ONE LINE, even though we could interpret this as a repeated measures design, or longitudinal data.<\/p>\n<p style=\"color: #000000;\">In a long data frame, each value of weight would define a case. So each of these subjects would have two lines in such a data frame, one for the subject&#8217;s Prewt, and one for her Postwt. A wide data frame would be used, for example, in analysis of covariance. A long data frame would be used in repeated measures analysis of variance. Do we have to retype the data frame to get from wide to long? Fortunately not! Because R has a function called\u00a0<span style=\"font-family: courier;\">reshape(\u00a0)<\/span>\u00a0which will do the work for us.<\/p>\n<p style=\"color: #000000;\">It is not an easy function to understand, however (and don&#8217;t count on the help page being a whole lot of help!). So let me illustrate it, and then I will explain what&#8217;s happening&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; reshape(data=anor, direction=\"long\",\n+         varying=c(\"Prewt\",\"Postwt\"), v.names=\"Weight\",\n+         idvar=\"subject\", ids=row.names(anor),\n+         timevar=\"PrePost\", times=c(\"Prewt\",\"Postwt\")\n+        ) -&gt; anor.long\n&gt; anor.long\n          Treat PrePost Weight subject\n1.Prewt    Cont   Prewt   80.7       1\n2.Prewt    Cont   Prewt   89.4       2\n3.Prewt    Cont   Prewt   91.8       3\n27.Prewt    CBT   Prewt   80.5      27\n28.Prewt    CBT   Prewt   84.9      28\n29.Prewt    CBT   Prewt   81.5      29\n56.Prewt     FT   Prewt   83.8      56\n57.Prewt     FT   Prewt   83.3      57\n58.Prewt     FT   Prewt   86.0      58\n1.Postwt   Cont  Postwt   80.2       1\n2.Postwt   Cont  Postwt   80.1       2\n3.Postwt   Cont  Postwt   86.4       3\n27.Postwt   CBT  Postwt   82.2      27\n28.Postwt   CBT  Postwt   85.6      28\n29.Postwt   CBT  Postwt   81.4      29\n56.Postwt    FT  Postwt   95.2      56\n57.Postwt    FT  
Postwt   94.3      57\n58.Postwt    FT  Postwt   91.5      58<\/span><\/pre>\n<p><span style=\"color: #000000;\">In this example, the first argument I gave to the\u00a0<\/span><span style=\"color: #000000; font-family: courier;\">reshape(\u00a0)<\/span><span style=\"color: #000000;\">\u00a0function was the name of the data frame to be reshaped, and that was given in the data= option. Then I specified the direction= option as &#8220;long&#8221; so that the data frame would be converted TO a long format.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">In the second line of this command, I specified varying= as a vector of variable names in anor that correspond to the repeated measures or longitudinal measures (the time-varying variables). These values will be given in one column in the new data frame, so I named that new column using the v.names= option.<\/p>\n<p style=\"color: #000000;\">A long data frame needs two things that a wide one does not have. One of those things is a column identifying the subject (case or experimental unit) from which the data in a row of the data frame come. This is necessary because each subject will have multiple rows of data in a long data frame. So I used the idvar= option to specify the name of this new column that would identify the subjects. I then used ids= to specify how the subjects were to be named. I told it to use the row names from anor, which is a sensible thing to do.<\/p>\n<p style=\"color: #000000;\">The other thing a long format data frame needs that a wide one does not is a variable giving the condition (or time) in which the subject is being measured for this particular row of data. In the wide format, this information is in the column (variable) names, but that will no longer be true in the long format. We need to know which measure is Prewt and which measure is Postwt for each subject, since these will be on different rows of the data frame in long format. 
I named this new variable using the timevar= option, and I gave its possible values in a vector using the times= option. The order in which those values should be listed is the same as the order in which the corresponding columns occur in the wide data frame.<\/p>\n<p style=\"color: #000000;\">Finally, I closed the parentheses on the\u00a0<span style=\"font-family: courier;\">reshape(\u00a0)<\/span>\u00a0function and assigned the output to a new data object. Done!<\/p>\n<p style=\"color: #000000;\">This can also be made to work if you have more than one repeated measures variable, in which case all I can say is may the saints be with you!<\/p>\n<p style=\"color: #000000;\">If the data frame results from a\u00a0<span style=\"font-family: courier;\">reshape(\u00a0)<\/span>\u00a0command, then it can be converted back very simply. All you have to do is this&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; reshape(anor.long)\n         Treat subject Prewt Postwt\n1.Prewt   Cont       1  80.7   80.2\n2.Prewt   Cont       2  89.4   80.1\n3.Prewt   Cont       3  91.8   86.4\n27.Prewt   CBT      27  80.5   82.2\n28.Prewt   CBT      28  84.9   85.6\n29.Prewt   CBT      29  81.5   81.4\n56.Prewt    FT      56  83.8   95.2\n57.Prewt    FT      57  83.3   94.3\n58.Prewt    FT      58  86.0   91.5<\/span><\/pre>\n<p><span style=\"color: #000000;\">The row names have gone a little screwy, but all the correct information is there. This isn&#8217;t very useful actually, because we already have the data in wide format in the data frame anor, which we were smart enough not to overwrite. 
So let&#8217;s see how to convert from long to wide the hard way.<\/span><\/p>\n<p style=\"color: #000000;\">\n<p style=\"color: #000000;\">First, we will get rid of those ridiculous row names&#8230;<\/p>\n<pre style=\"color: #000000;\"><span style=\"font-family: courier;\">&gt; rownames(anor.long) &lt;- as.character(1:18)      # Just do it!\n&gt; anor.long\n   Treat PrePost Weight subject\n1   Cont   Prewt   80.7       1\n2   Cont   Prewt   89.4       2\n3   Cont   Prewt   91.8       3\n4    CBT   Prewt   80.5      27\n5    CBT   Prewt   84.9      28\n6    CBT   Prewt   81.5      29\n7     FT   Prewt   83.8      56\n8     FT   Prewt   83.3      57\n9     FT   Prewt   86.0      58\n10  Cont  Postwt   80.2       1\n11  Cont  Postwt   80.1       2\n12  Cont  Postwt   86.4       3\n13   CBT  Postwt   82.2      27\n14   CBT  Postwt   85.6      28\n15   CBT  Postwt   81.4      29\n16    FT  Postwt   95.2      56\n17    FT  Postwt   94.3      57\n18    FT  Postwt   91.5      58<\/span><\/pre>\n<p><span style=\"color: #000000;\">And now for the reshaping. I won&#8217;t bother storing it&#8230;\u00a0<\/span><\/p>\n<pre>&gt; reshape(data=anor.long, direction=\"wide\",\n+         v.names=c(\"Weight\"),\n+         idvar=\"subject\",\n+         timevar=\"PrePost\"\n+        )\n  Treat subject Weight.Prewt Weight.Postwt\n1  Cont       1         80.7          80.2\n2  Cont       2         89.4          80.1\n3  Cont       3         91.8          86.4\n4   CBT      27         80.5          82.2\n5   CBT      28         84.9          85.6\n6   CBT      29         81.5          81.4\n7    FT      56         83.8          95.2\n8    FT      57         83.3          94.3\n9    FT      58         86.0          91.5<\/pre>\n<p><span style=\"color: #000000;\">We didn&#8217;t quite recover the original table, but then we probably didn&#8217;t really want to. The first two options name the data frame we are reshaping and tell the direction we are reshaping TO. 
The next option, v.names=, gives the name of the time-varying variable that will be split into two (or more) columns. The idvar= option gives the name of the variable that is the subject identifier. Finally, the timevar= option gives the name of the variable that contains the conditions under which the longitudinal information was collected; i.e., there were two weights, a Prewt and a Postwt. Notice that these values were appended to Weight to name the two new columns, Weight.Prewt and Weight.Postwt. Want a mnemonic to help you remember all that? Yeah, me too!<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>DATA FRAMES Preamble There is plenty to say about data frames because they are the primary data structure in R. Some of what follows is&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-642","post","type-post","status-publish","format-standard","hentry","category-r"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/642","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=642"}],"version-history":[{"count":0,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/642\/revisions"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media?parent=642"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=642"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=64
2"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}