{"id":247,"date":"2013-07-01T09:31:59","date_gmt":"2013-07-01T14:31:59","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=247"},"modified":"2013-07-01T09:31:59","modified_gmt":"2013-07-01T14:31:59","slug":"reading-in-data-from-an-external-file","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2013\/07\/01\/reading-in-data-from-an-external-file\/","title":{"rendered":"Reading in data from an external file"},"content":{"rendered":"<h3>Reading in data from an external file<\/h3>\n<blockquote><p>The data sets:\u00a0<a href=\"http:\/\/www.ats.ucla.edu\/stat\/r\/modules\/test.txt\">test.txt<\/a>,\u00a0<a href=\"http:\/\/www.ats.ucla.edu\/stat\/r\/modules\/cars.txt\">cars.txt<\/a>,\u00a0<a href=\"http:\/\/www.ats.ucla.edu\/stat\/r\/modules\/test_missing.txt\">test_missing.txt<\/a>,\u00a0<a href=\"http:\/\/www.ats.ucla.edu\/stat\/r\/modules\/test_missing_comma.txt\">test_missing_comma.txt<\/a>,\u00a0<a href=\"http:\/\/www.ats.ucla.edu\/stat\/r\/modules\/test_fixed.txt\">test_fixed.txt<\/a>,\u00a0<a href=\"http:\/\/www.ats.ucla.edu\/stat\/r\/modules\/scan.txt\">scan.txt<\/a><\/p><\/blockquote>\n<h4>1. Reading in data from the console using the\u00a0<b>scan<\/b>\u00a0function<\/h4>\n<blockquote><p>For very small data vectors it is sometimes handy to read in data directly from the prompt. This can be accomplished using the\u00a0<b>scan<\/b>\u00a0function from the command line. The\u00a0<b>scan<\/b>\u00a0function reads fields of data as specified by the\u00a0<b>what<\/b>\u00a0option, with the default being numeric. If the\u00a0<b>what<\/b>\u00a0option is specified to be\u00a0<b>what<\/b>=character() or\u00a0<b>what<\/b>=&quot; &quot; then all the fields will be read as strings. If the data are a mix of numeric, string or complex data, then a list can be used in the\u00a0<b>what<\/b>\u00a0option. The default separator for the\u00a0<b>scan<\/b>\u00a0function is any white space (single space, tab, or new line). 
Because the default separator is white space, you can enter data on separate lines. When all the data have been entered, press the Enter key on an empty line to end the scan.<\/p><\/blockquote>\n<pre><b># Reading in numeric data\nx &lt;- scan()<\/b>\n\n1: 3 5 6 \n4: 3 5 78 29\n8: 34 5 1 78\n12: \nRead 11 items\n\n<b>x<\/b>\n\n[1]  3  5  6  3  5 78 29 34  5  1 78\n\n<b>mode(x)<\/b>\n\n[1] \"numeric\"\n\n<b># Reading in string data\n# empty quotes indicates character input \ny &lt;- scan(what=\" \")<\/b>\n\n1: red blue\n3: green red \n5: blue yellow\n7: \nRead 6 items\n\n<b>y<\/b>\n\n[1] \"red\"    \"blue\"   \"green\"  \"red\"    \"blue\"   \"yellow\"\n\n<b>mode(y)<\/b>\n\n[1] \"character\"<\/pre>\n<h4>2. Importing data files using the scan function<\/h4>\n<blockquote><p>The\u00a0<b>scan<\/b>\u00a0function is an extremely flexible tool for importing data.\u00a0 Unlike the\u00a0<b>read.table<\/b>\u00a0function, however, which returns a data frame, the\u00a0<b>scan<\/b>\u00a0function returns a list or a vector.\u00a0 This makes the\u00a0<b>scan<\/b>\u00a0function less useful for inputting &#8220;rectangular&#8221; data such as the\u00a0<b>car<\/b>\u00a0data set that will be seen in later examples.\u00a0 In the previous example we input first numeric data and then string data directly from the console; in the following example, we input the text file,\u00a0<a href=\"http:\/\/www.ats.ucla.edu\/stat\/r\/modules\/scan.txt\">scan.txt<\/a>.\u00a0 For the\u00a0<b>what<\/b>\u00a0option, we use list and then list the variables, and after each variable, we tell R what type of variable (e.g., numeric, string) it is.\u00a0 In the first example, the first variable is\u00a0<b>age<\/b>, and we tell R that\u00a0<b>age<\/b>\u00a0is a numeric variable by setting it equal to 0.\u00a0 The second variable is called\u00a0<b>name<\/b>, and it is denoted as a string variable by the empty quote marks.\u00a0 In the second example, we list NULL first, indicating that we do not want the 
first variable to be read.\u00a0 After using the\u00a0<b>scan<\/b>\u00a0function, we use the\u00a0<b>sapply<\/b>\u00a0function to compute the length of each component of\u00a0<b>x<\/b>\u00a0and keep only the components that are not empty, which drops the NULL placeholder.\u00a0 The result is still a list with a single component;\u00a0<b>is.vector<\/b>\u00a0returns TRUE because a list is also a vector in R.<\/p><\/blockquote>\n<pre><b># inputting a text file and outputting a list\nx &lt;- scan(\"c:\/scan.txt\", what=list(age=0, name=\"\"))<\/b>\n\nRead 4 records\n\n<b>x<\/b>\n\n$age\n[1] 12 24 35 20\n\n$name\n[1] \"bobby\"   \"kate\"    \"david\"   \"michael\"\n\n<b># using the same text file and saving only the names as a vector\nx &lt;- scan(\"c:\/scan.txt\", what=list(NULL, name=character()))<\/b>\n\nRead 4 records\n\n<b>x &lt;- x[sapply(x, length) &gt; 0] \n\nx<\/b>\n\n$name\n[1] \"bobby\"   \"kate\"    \"david\"   \"michael\"\n\n<b>is.vector(x)<\/b>\n\n[1] TRUE<\/pre>\n<h4>3. Reading in free formatted data from an ASCII file using the\u00a0<b>read.table<\/b>\u00a0function<\/h4>\n<blockquote><p>The\u00a0<b>read.table<\/b>\u00a0function will let you read in any type of delimited ASCII file. It can read in both numeric and character values. By default it works out the type of each column from its contents: columns that contain only numbers are read in as numeric, and other columns are read in as character data (or as factors, depending on the R version and the\u00a0<b>stringsAsFactors<\/b>\u00a0setting). If a column is not stored the way you intended, it is usually easiest to convert it once the data have been read in; the\u00a0<b>mode<\/b>\u00a0function shows how a variable is stored. 
This is by far the easiest and most reliable method of entering data into R.<\/p><\/blockquote>\n<pre><b># complete data, space delimited, variable names in first row\ntest &lt;-  read.table(\"c:\/test.txt\", header=T)\n\ntest<\/b>\n\n   prgtype gender  id ses schtyp level \n1  general      0  70   4      1     1\n2   vocati      1 121   4      2     1\n3  general      0  86   4      3     1\n4   vocati      0 141   4      3     1\n5 academic      0 172   4      2     1\n6 academic      0 113   4      2     1\n7  general      0  50   3      2     1\n8 academic      0  11   1      2     1<\/pre>\n<blockquote><p>The default delimiter in\u00a0<b>read.table<\/b>\u00a0is the space delimiter, but this could create problems if there are missing data. The function will not work unless every data line has the same number of values. Thus, if there are missing data, the data lines will have different number of values, and you will receive an error. If there are missing values the easiest way to fix this problem is to change the type of delimiter. 
In the\u00a0<b>read.table<\/b>\u00a0function the\u00a0<b>sep<\/b>\u00a0argument is used to specify the delimiter.<\/p><\/blockquote>\n<pre><b># showing the file with missing values, space delimited (test_missing.txt data file)<\/b>\nprgtype  gender  id ses schtyp  level\n  general    0  70    4   1      1  \n <b>  vocati    1 121    4          1  \n  general    0  86               1<\/b>  \n   vocati    0 141    4   3      1  \n academic    0 172    4   2      1  \n academic    0 113    4   2      1  \n  general    0  50    3   2      1  \n academic    0  11    1   2      1\n\n<b>test.missing &lt;- read.table(\"c:\/test_missing.txt\", header = T)<\/b>\nError in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : \nline 2 did not have 6 elements\n\n<b># showing the file with missing data, comma delimited (test_missing_comma.txt data file)<\/b>\nprgtype,  gender,  id, ses, schtyp,  level\n  general,    0,  70,    4,   1,      1  \n<b>   vocati,    1, 121,    4,    ,      1  \n  general,    0,  86,     ,    ,      1 <\/b> \n   vocati,    0, 141,    4,   3,      1  \n academic,    0, 172,    4,   2,      1  \n academic,    0, 113,    4,   2,      1  \n  general,    0,  50,    3,   2,      1  \n academic,    0,  11,    1,   2,      1  \n\n<b>test.missing &lt;- read.table(\"c:\/test_missing_comma.txt\", header = T, sep = \",\")\n\ntest.missing<\/b>\n\n    prgtype gender  id ses schtyp level \n1   general      0  70   4      1     1\n2    vocati      1 121   4     NA     1\n3   general      0  86  NA     NA     1\n4    vocati      0 141   4      3     1\n5  academic      0 172   4      2     1\n6  academic      0 113   4      2     1\n7   general      0  50   3      2     1\n8  academic      0  11   1      2     1<\/pre>\n<blockquote><p>The\u00a0<b>read.table<\/b>\u00a0function is very useful when reading in ASCII files that contain rectangular data.\u00a0 As mentioned above, the default delimiter is blank space; other delimiters must be specified by using 
the\u00a0<b>sep<\/b>\u00a0option and setting it equal to the delimiter in quotes (e.g.,\u00a0<b>sep=&quot;;&quot;<\/b>\u00a0for a semicolon delimited data file).\u00a0 Another very common type of file is the comma delimited file. The file\u00a0<a href=\"http:\/\/www.ats.ucla.edu\/stat\/r\/faq\/test.csv\">test.csv<\/a>\u00a0has been saved out of Excel as a comma delimited file. This file can be read in by the\u00a0<b>read.table<\/b>\u00a0function by using the\u00a0<b>sep<\/b>\u00a0option, but it can also be read in by the\u00a0<b>read.csv<\/b>\u00a0function, which was written specifically for comma delimited files.\u00a0 We use the\u00a0<b>print<\/b>\u00a0function to display the contents of the object\u00a0<b>test.csv<\/b>\u00a0just to show its use.<\/p><\/blockquote>\n<pre><b>test.csv &lt;- read.csv(\"c:\/test.csv\", header=T)\n\nprint(test.csv)<\/b>\n\n print(test.csv)\n   make   model mpg weight price \n1   AMC Concord  22   2930  4099\n2   AMC   Pacer  17   3350  4749\n3   AMC  Spirit  22   2640  3799\n4 Buick Century  20   3250  4816\n5 Buick Electra  15   4080  7827\n\n<b>test.csv1 &lt;- read.table(\"c:\/test.csv\", header=T, sep=\",\")\n\nprint(test.csv1)<\/b>\n\nprint(test.csv1)\n   make   model mpg weight price \n1   AMC Concord  22   2930  4099\n2   AMC   Pacer  17   3350  4749\n3   AMC  Spirit  22   2640  3799\n4 Buick Century  20   3250  4816\n5 Buick Electra  15   4080  7827<\/pre>\n<blockquote><p>It is, of course, also possible to use the\u00a0<b>read.table<\/b>\u00a0function for reading in files with other delimiters. 
The file\u00a0<a href=\"http:\/\/www.ats.ucla.edu\/stat\/r\/faq\/testsemicolon.txt\">testsemicolon.txt<\/a>\u00a0has semicolon delimiters and the file\u00a0<a href=\"http:\/\/www.ats.ucla.edu\/stat\/r\/faq\/testz.txt\">testz.txt<\/a>\u00a0uses the letter z as a delimiter, both of which are acceptable delimiters in R.<\/p><\/blockquote>\n<pre><b>test.semi &lt;- read.table(\"c:\/testsemicolon.txt\", header=T, sep=\";\")\n\nprint(test.semi)<\/b>\n\n print(test.semi)\n   make   model mpg weight price \n1   AMC Concord  22   2930  4099\n2   AMC   Pacer  17   3350  4749\n3   AMC  Spirit  22   2640  3799\n4 Buick Century  20   3250  4816\n5 Buick Electra  15   4080  7827\n\n<b>test.z &lt;- read.table(\"c:\/testz.txt\", header=T, sep=\"z\")\n\nprint(test.z)<\/b>\n\nprint(test.z)\n   make   model mpg weight price \n1   AMC Concord  22   2930  4099\n2   AMC   Pacer  17   3350  4749\n3   AMC  Spirit  22   2640  3799\n4 Buick Century  20   3250  4816\n5 Buick Electra  15   4080  7827<\/pre>\n<h4>4. Reading in fixed formatted files<\/h4>\n<blockquote><p>We use the\u00a0<b>read.fwf<\/b>\u00a0function to read in data with fixed formats, and we use the\u00a0<b>width<\/b>\u00a0argument (which R matches to the function's full argument name,\u00a0<b>widths<\/b>) to indicate the width (number of characters) of each variable. In a fixed format file we do not have the names of the variables on the first line, and therefore they must be added after we have read in the data. 
We add the variable names using the\u00a0<b>dimnames<\/b>\u00a0function and the bracket notation to indicate that we are attaching names to the variables (columns) of the data file.\u00a0 Please note that there are several different ways to accomplish this task; this is just one of them.<\/p><\/blockquote>\n<pre><b>test.fixed &lt;- read.fwf('c:\/test_fixed.txt', width=c(8, 1, 3, 1, 1, 1))\n\ndimnames(test.fixed)[[2]] &lt;- c(\"prgtyp\", \"gender\", \"id\", \"ses\", \"schtyp\", \"level\")\n\ntest.fixed<\/b>\n\n    prgtyp gender  id ses schtyp level\n1 general       0  70   4      1     1\n2 vocati        1 121   4      2     1\n3 general       0  86   4      3     1\n4 vocati        0 141   4      3     1\n5 academic      0 172   4      2     1\n6 academic      0 113   4      2     1\n7 general       0  50   3      2     1\n8 academic      0  11   1      2     1<\/pre>\n<blockquote><p>For fixed format files the variable names are often in a separate file from the data. In this example the variable names are in a file called\u00a0<a href=\"http:\/\/www.ats.ucla.edu\/stat\/r\/faq\/names.txt\">names<\/a>\u00a0and the data are in a file called\u00a0<a href=\"http:\/\/www.ats.ucla.edu\/stat\/r\/faq\/testfixed.txt\">testfixed.txt<\/a>.\u00a0 This is especially convenient when the fixed format file is very large and has many variables; then it becomes rather impractical to type in all the variable names.\u00a0 In this situation the\u00a0<b>width<\/b>\u00a0option is used to specify the width of each variable and the\u00a0<b>col.names<\/b>\u00a0option supplies the variable names read from the names file.\u00a0 So, first we read in the file for the names using the\u00a0<b>scan<\/b>\u00a0function.\u00a0 We specify that the file contains character values by setting the\u00a0<b>what<\/b>\u00a0option to equal\u00a0<b>character()<\/b>.\u00a0 By using the\u00a0<b>col.names<\/b>\u00a0option in the\u00a0<b>read.fwf<\/b>\u00a0function, the object\u00a0<b>names<\/b>\u00a0will supply the 
variable names.<\/p><\/blockquote>\n<pre><b>names &lt;- scan(\"c:\/names.txt\", what=character() )\n\nprint(names)<\/b>\n\n[1] \"model\"  \"make\"   \"mph\"    \"weight\" \"price\" \n\n<b>test.fixed &lt;- read.fwf(\"c:\/testfixed.txt\", col.names=names, width = c(5, 7, 2, 4, 4))\n\nprint(test.fixed)<\/b>\n\n  model    make mph weight price \n1   AMC Concord  22   2930  4099\n2   AMC   Pacer  17   3350  4749\n3   AMC  Spirit  22   2640  3799\n4 Buick Century  20   3250  4816\n5 Buick Electra  15   4080  7827<\/pre>\n<h4>5. Exporting files using the write.table function<\/h4>\n<blockquote><p>The\u00a0<b>write.table<\/b>\u00a0function outputs data files. The first argument specifies which data frame in R is to be exported. The next argument specifies the file to be created. The default separator is a blank space but any separator can be specified in the\u00a0<b>sep<\/b>\u00a0option. The default value for both the\u00a0<b>row.names<\/b>\u00a0and\u00a0<b>col.names<\/b>\u00a0options is TRUE. In the example we specify that we do not wish to include row names. The default setting for the\u00a0<b>quote<\/b>\u00a0option is to include quotes around all the character values, i.e., around values in string variables and around the column names. As the example shows, it is very common not to want the quotes when creating a text file.<\/p><\/blockquote>\n<pre><b># using the test.csv data frame to write a text file with no row names \n# and without quotes around the character values (both column names and string variables)\nwrite.table(test.csv, \"c:\/test1.txt\", row.names=F, quote=F)<\/b><\/pre>\n<h4>6. Exporting files in Stata 6\/7 format using the write.dta function<\/h4>\n<blockquote><p>The\u00a0<b>write.dta<\/b>\u00a0function is part of the\u00a0<b>foreign<\/b>\u00a0package and writes an R data frame to a Stata data file in either Stata 6 or 7 format. 
Although these are older versions of Stata, Stata has no difficulty reading files written in older versions.\u00a0 (To download the\u00a0<b>foreign<\/b>\u00a0package, click on Packages in the menu bar at the top, click on Install package(s) from CRAN, and then scroll down in the menu until you find\u00a0<b>foreign<\/b>.)\u00a0 It takes at least two arguments, the first one being the data frame and the second one being the output Stata data file name.\u00a0 If you look at the help file for\u00a0<b>write.dta<\/b>, you will see that the function writes out a Stata 6 data file, but there are comments and options for those using later versions of Stata.\u00a0 In the example below, we use the\u00a0<b>anscombe<\/b>\u00a0data set that comes with R. It happens that the\u00a0<b>anscombe<\/b>\u00a0data is already a data frame, this being checked with the\u00a0<b>is.data.frame<\/b>\u00a0function.<\/p><\/blockquote>\n<pre><b>library(foreign)<\/b><\/pre>\n<pre><b>data(anscombe)\n\nis.data.frame(anscombe)\n\n<\/b>[1] TRUE<b>\n\nanscombe\n\n<\/b>   x1 x2 x3 x4    y1   y2    y3    y4\n1  10 10 10  8  8.04 9.14  7.46  6.58\n2   8  8  8  8  6.95 8.14  6.77  5.76\n3  13 13 13  8  7.58 8.74 12.74  7.71\n4   9  9  9  8  8.81 8.77  7.11  8.84\n5  11 11 11  8  8.33 9.26  7.81  8.47\n6  14 14 14  8  9.96 8.10  8.84  7.04\n7   6  6  6  8  7.24 6.13  6.08  5.25\n8   4  4  4 19  4.26 3.10  5.39 12.50\n9  12 12 12  8 10.84 9.13  8.15  5.56\n10  7  7  7  8  4.82 7.26  6.42  7.91\n11  5  5  5  8  5.68 4.74  5.73  6.89<\/pre>\n<pre><b>write.dta(anscombe, file=\"d:\/data\/anscombe.dta\")<\/b><\/pre>\n<blockquote><p>Now let&#8217;s see an example where the data is not yet a data frame. 
We can use the\u00a0<b>as.data.frame<\/b>\u00a0function to convert the data into a data frame.\u00a0 Again, these data come with R.<\/p><\/blockquote>\n<pre><b>data(WorldPhones)\n\nis.data.frame(WorldPhones)\n<\/b>\n[1] FALSE\n\n<b>WorldPhones\n<\/b>\n      N.Amer Europe  Asia  S.Amer  Oceania  Africa  Mid.Amer\n1951   45939  21574  2876    1815     1646      89       555\n1956   60423  29990  4708    2568     2366     1411      733\n1957   64721  32510  5230    2695     2526     1546      773\n1958   68484  35218  6662    2845     2691     1663      836\n1959   71799  37598  6856    3000     2868     1769      911\n1960   76036  40341  8220    3145     3054     1905     1008\n1961   79831  43173  9053    3338     3224     2005     1076\n\n<b>phones_d &lt;- as.data.frame(WorldPhones)\n\nphones_d\n<\/b>\n      N.Amer Europe  Asia  S.Amer  Oceania  Africa  Mid.Amer\n1951   45939  21574  2876    1815     1646      89       555\n1956   60423  29990  4708    2568     2366     1411      733\n1957   64721  32510  5230    2695     2526     1546      773\n1958   68484  35218  6662    2845     2691     1663      836\n1959   71799  37598  6856    3000     2868     1769      911\n1960   76036  40341  8220    3145     3054     1905     1008\n1961   79831  43173  9053    3338     3224     2005     1076<\/pre>\n<pre><b>is.data.frame(phones_d)<\/b>\n\n[1] TRUE\n\n<b>write.dta(phones_d, file=\"d:\/data_stata8\/phones.dta\")<\/b><\/pre>\n<blockquote><p>To give you an idea of what types of data can be read into R using the\u00a0<b>foreign<\/b>\u00a0package, part of the help file is shown below.<\/p><\/blockquote>\n<pre>data.restore   Read an S3 Binary File\nlookup.xport   Lookup Information on a SAS XPORT Format Library\nread.dbf       Read a DBF File\nread.dta       Read Stata binary files\nread.epiinfo   Read Epi Info data files\nread.mtp       Read a Minitab Portable Worksheet\nread.octave    Read Octave Text Data Files\nread.spss      Read an SPSS data file\nread.ssd       Obtain a 
Data Frame from a SAS Permanent Dataset, via read.xport\nread.systat    Obtain a Data Frame from a Systat File\nread.xport     Read a SAS XPORT Format Library\nwrite.dbf      Write a DBF File\nwrite.dta      Write Files in Stata Binary Format\nwrite.foreign  Write text files and code to read them.<\/pre>\n<div>\n<p>The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Reading in data from an external file The data sets:\u00a0test.txt,\u00a0cars.txt,\u00a0test_missing.txt,\u00a0test_missing_comma.txt,\u00a0test_fixed.txt,\u00a0scan.txt, 1. Reading in data from the console using the\u00a0scan\u00a0function For very small data vectors it&hellip; 
<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-247","post","type-post","status-publish","format-standard","hentry","category-r"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/247","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=247"}],"version-history":[{"count":0,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/247\/revisions"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media?parent=247"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=247"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=247"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}