{"id":896,"date":"2016-01-28T10:35:17","date_gmt":"2016-01-28T17:35:17","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=896"},"modified":"2022-11-22T05:15:15","modified_gmt":"2022-11-22T05:15:15","slug":"working-with-xml-data-in-r","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2016\/01\/28\/working-with-xml-data-in-r\/","title":{"rendered":"Working with XML Data in R"},"content":{"rendered":"<h2>Working with XML Data in R<\/h2>\n<p>A common task for programmers these days is writing code to analyze data from various sources and output information for use by non-coders or business executives. Although you can use any language for this type of analysis, I&#8217;ve found that R simplifies working with almost any modern data type, including XML, a popular choice for storing large amounts of complex data. In this article, I&#8217;ll step through the process of examining XML data in an R program, so you can see how easy R makes working with XML files.<\/p>\n<p>I first began analyzing XML data while I was working within the public health department at Cornell&#8217;s Medical School. We&#8217;ll use the same data source, PubMed, for this article. For this example, I pulled data about papers authored by Dr. Madhu Mazumdar in 2012 from PubMed&#8217;s API, which is called <a href=\"http:\/\/www.ncbi.nlm.nih.gov\/books\/NBK25501\/\">Entrez<\/a>.<\/p>\n<div class=\"sidebar\">\n<p class=\"title\"><strong>NOTE<\/strong><\/p>\n<p>To view or use the code examples in this article, download the <a href=\"http:\/\/www.informit.com\/content\/images\/art_bosede1_xml_data_r\/elementLinks\/code.zip\">code<\/a> file. The full code referenced in this article is included for the purposes of helping readers follow along in R. 
In the <tt>pubmed_sample.xml<\/tt> file, each data entry begins with the <tt>&lt;PubmedArticle&gt;<\/tt> tag and ends with <tt>&lt;\/PubmedArticle&gt;<\/tt>.<\/p>\n<\/div>\n<p>When dealing with XML data, the main package we&#8217;ll rely on is <tt>XML<\/tt>. For our analysis in this example, we&#8217;ll also need the <tt>plyr<\/tt>, <tt>ggplot2<\/tt>, and <tt>gridExtra<\/tt> packages:<\/p>\n<ul>\n<li><tt>plyr<\/tt> to turn the XML into a dataframe<\/li>\n<li><tt>ggplot2<\/tt> to create aesthetically pleasing graphs<\/li>\n<li><tt>gridExtra<\/tt> to put multiple graphs on one canvas<\/li>\n<\/ul>\n<p>Our first step is to get these packages from the nearest <a href=\"http:\/\/cran.us.r-project.org\/\">Comprehensive R Archive Network<\/a> (CRAN) mirror and load them into our current R session:<\/p>\n<pre>install.packages<span class=\"blue-grey\">(<\/span><span class=\"green\">\"XML\"<\/span><span class=\"blue-grey\">)<\/span>\ninstall.packages<span class=\"blue-grey\">(<\/span><span class=\"green\">\"plyr\"<\/span><span class=\"blue-grey\">)<\/span>\ninstall.packages<span class=\"blue-grey\">(<\/span><span class=\"green\">\"ggplot2\"<\/span><span class=\"blue-grey\">)<\/span>\ninstall.packages<span class=\"blue-grey\">(<\/span><span class=\"green\">\"gridExtra\"<\/span><span class=\"blue-grey\">)<\/span>\n\n<span class=\"blue\">require<\/span><span class=\"blue-grey\">(<\/span><span class=\"green\">\"XML\"<\/span><span class=\"blue-grey\">)<\/span>\n<span class=\"blue\">require<\/span><span class=\"blue-grey\">(<\/span><span class=\"green\">\"plyr\"<\/span><span class=\"blue-grey\">)<\/span>\n<span class=\"blue\">require<\/span><span class=\"blue-grey\">(<\/span><span class=\"green\">\"ggplot2\"<\/span><span class=\"blue-grey\">)<\/span>\n<span class=\"blue\">require<\/span><span class=\"blue-grey\">(<\/span><span class=\"green\">\"gridExtra\"<\/span><span class=\"blue-grey\">)<\/span><\/pre>\n<p>Next we need to set our working directory and parse the XML file as a 
matter of practice, so we&#8217;re sure that R can access the data within the file. Parsing simply reads the file into R. Then, to confirm that R recognizes the file as XML, we check the class of the parsed object, which is <tt>XMLInternalDocument<\/tt>.<\/p>\n<pre>setwd<span class=\"blue-grey\">(<\/span><span class=\"green\">\"C:\/Users\/Tobi\/Documents\/R\/InformIT\"<\/span><span class=\"blue-grey\">)<\/span> <span class=\"teal\">#you will need to change the filepath on your machine<\/span>\nxmlfile<span class=\"blue-grey\">=<\/span>xmlParse<span class=\"blue-grey\">(<\/span><span class=\"green\">\"pubmed_sample.xml\"<\/span><span class=\"blue-grey\">)<\/span>\nclass<span class=\"blue-grey\">(<\/span>xmlfile<span class=\"blue-grey\">)<\/span> <span class=\"teal\">#\"XMLInternalDocument\" \"XMLAbstractDocument\"<\/span><\/pre>\n<p>Now we can begin to explore our XML. Perhaps we want to confirm that our HTTP query on Entrez pulled the correct results, just as when we query PubMed&#8217;s website. We start by looking at the contents of the first node, or root, <tt>PubmedArticleSet<\/tt>. We can also find out how many child nodes the root has and their names. This process corresponds to checking how many entries are in the XML file. 
The root&#8217;s child nodes are all named <tt>PubmedArticle<\/tt>.<\/p>\n<pre>xmltop <span class=\"blue-grey\">=<\/span> xmlRoot<span class=\"blue-grey\">(<\/span>xmlfile<span class=\"blue-grey\">)<\/span> <span class=\"teal\">#gives content of root<\/span>\nclass<span class=\"blue-grey\">(<\/span>xmltop<span class=\"blue-grey\">)<\/span> <span class=\"teal\">#\"XMLInternalElementNode\" \"XMLInternalNode\" \"XMLAbstractNode\"<\/span>\nxmlName<span class=\"blue-grey\">(<\/span>xmltop<span class=\"blue-grey\">)<\/span> <span class=\"teal\">#gives name of node, PubmedArticleSet<\/span>\nxmlSize<span class=\"blue-grey\">(<\/span>xmltop<span class=\"blue-grey\">)<\/span> <span class=\"teal\">#how many children in node, 19<\/span>\nxmlName<span class=\"blue-grey\">(<\/span>xmltop<span class=\"blue-grey\">[[<\/span>1<span class=\"blue-grey\">]])<\/span> <span class=\"teal\">#name of the root's first child, PubmedArticle<\/span><\/pre>\n<p>To see the first two entries, we can index into the root by position:<\/p>\n<pre><span class=\"teal\"># have a look at the content of the first child entry<\/span>\nxmltop<span class=\"blue-grey\">[[<\/span><span class=\"blue\">1<\/span><span class=\"blue-grey\">]]<\/span>\n<span class=\"teal\"># have a look at the content of the 2nd child entry<\/span>\nxmltop<span class=\"blue-grey\">[[<\/span><span class=\"blue\">2<\/span><span class=\"blue-grey\">]]<\/span><\/pre>\n<p>Our exploration continues by looking at subnodes of the root. As with the root node, we can list the name and size of the subnodes as well as their attributes. 
In this case, the subnodes are <tt>MedlineCitation<\/tt> and <tt>PubmedData<\/tt>.<\/p>\n<pre><span class=\"teal\">#Root Node's children<\/span>\nxmlSize<span class=\"blue-grey\">(<\/span>xmltop<span class=\"blue-grey\">[[<\/span><span class=\"blue\">1<\/span><span class=\"blue-grey\">]]<\/span><span class=\"blue-grey\">)<\/span> <span class=\"teal\">#number of nodes in each child<\/span>\nxmlSApply<span class=\"blue-grey\">(<\/span>xmltop<span class=\"blue-grey\">[[<\/span><span class=\"blue\">1<\/span><span class=\"blue-grey\">]]<\/span>, xmlName<span class=\"blue-grey\">)<\/span> <span class=\"teal\">#name(s)<\/span>\nxmlSApply<span class=\"blue-grey\">(<\/span>xmltop<span class=\"blue-grey\">[[<\/span><span class=\"blue\">1<\/span><span class=\"blue-grey\">]]<\/span>, xmlAttrs<span class=\"blue-grey\">)<\/span> <span class=\"teal\">#attribute(s)<\/span>\nxmlSApply<span class=\"blue-grey\">(<\/span>xmltop<span class=\"blue-grey\">[[<\/span><span class=\"blue\">1<\/span><span class=\"blue-grey\">]]<\/span>, xmlSize<span class=\"blue-grey\">)<\/span> <span class=\"teal\">#size<\/span><\/pre>\n<p>We can also separate each of the 19 entries by these subnodes. 
Here we do so for the first and second entries:<\/p>\n<pre><span class=\"teal\">#take a look at the MedlineCitation subnode of 1st child<\/span>\nxmltop<span class=\"blue-grey\">[[<\/span><span class=\"blue\">1<\/span><span class=\"blue-grey\">]]<\/span><span class=\"blue-grey\">[[<\/span><span class=\"blue\">1<\/span><span class=\"blue-grey\">]]<\/span>\n<span class=\"teal\">#take a look at the PubmedData subnode of 1st child<\/span>\nxmltop<span class=\"blue-grey\">[[<\/span><span class=\"blue\">1<\/span><span class=\"blue-grey\">]]<\/span><span class=\"blue-grey\">[[<\/span><span class=\"blue\">2<\/span><span class=\"blue-grey\">]]<\/span>\n\n<span class=\"teal\">#subnodes of 2nd child<\/span>\nxmltop<span class=\"blue-grey\">[[<\/span><span class=\"blue\">2<\/span><span class=\"blue-grey\">]]<\/span><span class=\"blue-grey\">[[<\/span><span class=\"blue\">1<\/span><span class=\"blue-grey\">]]<\/span>\nxmltop<span class=\"blue-grey\">[[<\/span><span class=\"blue\">2<\/span><span class=\"blue-grey\">]]<\/span><span class=\"blue-grey\">[[<\/span><span class=\"blue\">2<\/span><span class=\"blue-grey\">]]<\/span><\/pre>\n<p>Separating the entries is really just indexing into the tree structure of the XML. We can keep indexing until we exhaust a path or, in XML terminology, reach the end of a branch. 
We can do this using the positions of the child nodes or their names:<\/p>\n<pre><span class=\"teal\">#we can keep going till we reach the end of a branch<\/span>\nxmltop<span class=\"blue-grey\">[[<\/span><span class=\"blue\">1<\/span><span class=\"blue-grey\">]]<\/span><span class=\"blue-grey\">[[<\/span><span class=\"blue\">1<\/span><span class=\"blue-grey\">]]<\/span><span class=\"blue-grey\">[[<\/span>5<span class=\"blue-grey\">]]<\/span><span class=\"blue-grey\">[[<\/span><span class=\"blue\">2<\/span><span class=\"blue-grey\">]]<\/span> <span class=\"teal\">#title of first article<\/span>\nxmltop<span class=\"blue-grey\">[[<\/span><span class=\"green\">'PubmedArticle'<\/span><span class=\"blue-grey\">]][[<\/span><span class=\"green\">'MedlineCitation'<\/span><span class=\"blue-grey\">]][[<\/span><span class=\"green\">'Article'<\/span><span class=\"blue-grey\">]][[<\/span><span class=\"green\">'ArticleTitle'<\/span><span class=\"blue-grey\">]]<\/span> <span class=\"teal\">#same node, selected by name, which is more readable<\/span><\/pre>\n<p>Next, we can transform the XML into a more familiar structure: a dataframe. Our command completes with warnings because the data and nodes aren&#8217;t uniformly formatted, so we must check that all the data from the XML was loaded correctly into our dataframe. Indeed, there are duplicate rows, because separate rows are created for tag attributes. For instance, the <tt>ELocationID<\/tt> node has two attributes, <tt>ValidYN<\/tt> and <tt>EIDType<\/tt>. 
Take the time to note how the duplicates arise from this separation.<\/p>\n<pre><span class=\"teal\">#Turning XML into a dataframe<\/span>\nMadhu2012<span class=\"blue-grey\">=<\/span>ldply<span class=\"blue-grey\">(<\/span>xmlToList<span class=\"blue-grey\">(<\/span><span class=\"green\">\"pubmed_sample.xml\"<\/span><span class=\"blue-grey\">)<\/span>, data.frame<span class=\"blue-grey\">)<\/span> <span class=\"teal\">#completes with warnings: \"row names were found from a short variable and have been discarded\"<\/span>\nView<span class=\"blue-grey\">(<\/span>Madhu2012<span class=\"blue-grey\">)<\/span> <span class=\"teal\">#for easy checking that the data is properly formatted<\/span>\nMadhu2012.Clean<span class=\"blue-grey\">=<\/span>Madhu2012<span class=\"blue-grey\">[<\/span>Madhu2012<span class=\"blue-grey\">[<\/span><span class=\"blue\">25<\/span><span class=\"blue-grey\">]==<\/span><span class=\"green\">'Y'<\/span>,<span class=\"blue-grey\">]<\/span> <span class=\"teal\">#keeps only rows where column 25 is 'Y', which gets rid of duplicated rows<\/span><\/pre>\n<p>If we look at the column headings after calling the <tt>View<span class=\"blue-grey\">()<\/span><\/tt> function on the dataframe, it&#8217;s clear that each heading is the path from an entry through its child branches to the data of interest. Try to go through the XML document to see if you can follow the path to a specific piece of data in a column.<\/p>\n<p>Now that our XML data is a dataframe, we can do some analysis. For instance, we might be interested in which authors contributed most to the articles that Dr. Mazumdar coauthored in 2012. We can make histograms for the first through eleventh authors, on the assumption that author order reflects contribution, so that first authors contribute the most and eleventh authors the least. 
The code below labels the relevant data for easy tracking and removes blank entries, which we don&#8217;t want appearing in our histogram.<\/p>\n<pre><span class=\"teal\">#looking at which authors played most active role<\/span>\nFirstAuthor<span class=\"blue-grey\">=<\/span>Madhu2012.Clean<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName\nSecondAuthor<span class=\"blue-grey\">=<\/span>Madhu2012.Clean<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.1\nThirdAuthor<span class=\"blue-grey\">=<\/span>Madhu2012.Clean<span class=\"blue-grey\"><span class=\"blue-grey\">$<\/span><\/span>MedlineCitation.Article.AuthorList.Author.LastName.2\n\n<span class=\"teal\">#removing NAs<\/span>\nMadhu2012.Na.Rm4<span class=\"blue-grey\">=<\/span>Madhu2012.Clean<span class=\"blue-grey\">[<\/span>!is.na<span class=\"blue-grey\">(<\/span>Madhu2012.Clean<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.3<span class=\"blue-grey\">)<\/span>,<span class=\"blue-grey\">]<\/span>\nFourthAuthor<span class=\"blue-grey\">=<\/span>Madhu2012.Na.Rm4<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.3\nMadhu2012.Na.Rm5<span class=\"blue-grey\">=<\/span>Madhu2012.Clean<span class=\"blue-grey\">[<\/span>!is.na<span class=\"blue-grey\">(<\/span>Madhu2012.Clean<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.4<span class=\"blue-grey\">)<\/span>,<span class=\"blue-grey\">]<\/span>\nFifthAuthor<span class=\"blue-grey\">=<\/span>Madhu2012.Na.Rm5<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.4\nMadhu2012.Na.Rm6<span class=\"blue-grey\">=<\/span>Madhu2012.Clean<span class=\"blue-grey\">[<\/span>!is.na<span class=\"blue-grey\">(<\/span>Madhu2012.Clean<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.5<span class=\"blue-grey\">)<\/span>,<span 
class=\"blue-grey\">]<\/span>\nSixthAuthor<span class=\"blue-grey\">=<\/span>Madhu2012.Na.Rm6<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.5\nMadhu2012.Na.Rm7<span class=\"blue-grey\">=<\/span>Madhu2012.Clean<span class=\"blue-grey\">[<\/span>!is.na<span class=\"blue-grey\">(<\/span>Madhu2012.Clean<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.6<span class=\"blue-grey\">)<\/span>,<span class=\"blue-grey\">]<\/span>\nSeventhAuthor<span class=\"blue-grey\">=<\/span>Madhu2012.Na.Rm7<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.6\nMadhu2012.Na.Rm8<span class=\"blue-grey\">=<\/span>Madhu2012.Clean<span class=\"blue-grey\">[<\/span>!is.na<span class=\"blue-grey\">(<\/span>Madhu2012.Clean<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.7<span class=\"blue-grey\">)<\/span>,<span class=\"blue-grey\">]<\/span>\nEighthAuthor<span class=\"blue-grey\">=<\/span>Madhu2012.Na.Rm8<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.7\nMadhu2012.Na.Rm9<span class=\"blue-grey\">=<\/span>Madhu2012.Clean<span class=\"blue-grey\">[<\/span>!is.na<span class=\"blue-grey\">(<\/span>Madhu2012.Clean<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.8<span class=\"blue-grey\">)<\/span>,<span class=\"blue-grey\">]<\/span>\nNinthAuthor<span class=\"blue-grey\">=<\/span>Madhu2012.Na.Rm9<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.8\nMadhu2012.Na.Rm10<span class=\"blue-grey\">=<\/span>Madhu2012.Clean<span class=\"blue-grey\">[<\/span>!is.na<span class=\"blue-grey\">(<\/span>Madhu2012.Clean<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.9<span class=\"blue-grey\">)<\/span>,<span class=\"blue-grey\">]<\/span>\nTenthAuthor<span class=\"blue-grey\">=<\/span>Madhu2012.Na.Rm10<span 
class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.9\nMadhu2012.Na.Rm11<span class=\"blue-grey\">=<\/span>Madhu2012.Clean<span class=\"blue-grey\">[<\/span>!is.na<span class=\"blue-grey\">(<\/span>Madhu2012.Clean<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.10<span class=\"blue-grey\">)<\/span>,<span class=\"blue-grey\">]<\/span>\nEleventhAuthor<span class=\"blue-grey\">=<\/span>Madhu2012.Na.Rm11<span class=\"blue-grey\">$<\/span>MedlineCitation.Article.AuthorList.Author.LastName.10<\/pre>\n<p>Next, we create histograms showing how often each last name appears in each author position. We then arrange the 11 plots on three pages of a single PDF file for easy comparison. The PDF file is included in the <a href=\"http:\/\/www.informit.com\/content\/images\/art_bosede1_xml_data_r\/elementLinks\/code.zip\">code<\/a> file for this article.<\/p>\n<pre><span class=\"teal\">#write all the graphs to pdf on 3 canvases<\/span>\n<span class=\"teal\">#geom_bar counts each last name; current ggplot2 versions reject geom_histogram for a non-numeric x<\/span>\na<span class=\"blue-grey\">=<\/span>ggplot<span class=\"blue-grey\">(<\/span>Madhu2012.Clean, aes<span class=\"blue-grey\">(<\/span>x<span class=\"blue-grey\">=<\/span>FirstAuthor<span class=\"blue-grey\">)) + <\/span>geom_bar<span class=\"blue-grey\">(<\/span>colour<span class=\"blue-grey\">=<\/span><span class=\"green\">\"pink\"<\/span>, fill<span class=\"blue-grey\">=<\/span><span class=\"green\">\"purple\"<\/span><span class=\"blue-grey\">)<\/span>+coord_flip<span class=\"blue-grey\">()<\/span>\nb<span class=\"blue-grey\">=<\/span>ggplot<span class=\"blue-grey\">(<\/span>Madhu2012.Clean, aes<span class=\"blue-grey\">(<\/span>x<span class=\"blue-grey\">=<\/span>SecondAuthor<span class=\"blue-grey\">)) + <\/span>geom_bar<span class=\"blue-grey\">(<\/span>colour<span class=\"blue-grey\">=<\/span><span 
class=\"green\">\"pink\"<\/span>, fill<span class=\"blue-grey\">=<\/span><span class=\"green\">\"purple\"<\/span><span class=\"blue-grey\">)<\/span>+coord_flip<span class=\"blue-grey\">()<\/span>\nc<span class=\"blue-grey\">=<\/span>ggplot<span class=\"blue-grey\">(<\/span>Madhu2012.Clean, aes<span class=\"blue-grey\">(<\/span>x<span class=\"blue-grey\">=<\/span>ThirdAuthor<span class=\"blue-grey\">)) + <\/span>geom_bar<span class=\"blue-grey\">(<\/span>colour<span class=\"blue-grey\">=<\/span><span class=\"green\">\"pink\"<\/span>, fill<span class=\"blue-grey\">=<\/span><span class=\"green\">\"purple\"<\/span><span class=\"blue-grey\">)<\/span>+coord_flip<span class=\"blue-grey\">()<\/span>\nd<span class=\"blue-grey\">=<\/span>ggplot<span class=\"blue-grey\">(<\/span>Madhu2012.Na.Rm4, aes<span class=\"blue-grey\">(<\/span>x<span class=\"blue-grey\">=<\/span>FourthAuthor<span class=\"blue-grey\">)) + <\/span>geom_bar<span class=\"blue-grey\">(<\/span>colour<span class=\"blue-grey\">=<\/span><span class=\"green\">\"pink\"<\/span>, fill<span class=\"blue-grey\">=<\/span><span class=\"green\">\"purple\"<\/span><span class=\"blue-grey\">)<\/span>+coord_flip<span class=\"blue-grey\">()<\/span>\ne<span class=\"blue-grey\">=<\/span>ggplot<span class=\"blue-grey\">(<\/span>Madhu2012.Na.Rm5, aes<span class=\"blue-grey\">(<\/span>x<span class=\"blue-grey\">=<\/span>FifthAuthor<span class=\"blue-grey\">)) + <\/span>geom_bar<span class=\"blue-grey\">(<\/span>colour<span class=\"blue-grey\">=<\/span><span class=\"green\">\"pink\"<\/span>, fill<span class=\"blue-grey\">=<\/span><span class=\"green\">\"purple\"<\/span><span class=\"blue-grey\">)<\/span>+coord_flip<span class=\"blue-grey\">()<\/span>\nf<span 
class=\"blue-grey\">=<\/span>ggplot<span class=\"blue-grey\">(<\/span>Madhu2012.Na.Rm6, aes<span class=\"blue-grey\">(<\/span>x<span class=\"blue-grey\">=<\/span>SixthAuthor<span class=\"blue-grey\">)) + <\/span>geom_bar<span class=\"blue-grey\">(<\/span>colour<span class=\"blue-grey\">=<\/span><span class=\"green\">\"pink\"<\/span>, fill<span class=\"blue-grey\">=<\/span><span class=\"green\">\"purple\"<\/span><span class=\"blue-grey\">)<\/span>+coord_flip<span class=\"blue-grey\">()<\/span>\ng<span class=\"blue-grey\">=<\/span>ggplot<span class=\"blue-grey\">(<\/span>Madhu2012.Na.Rm7, aes<span class=\"blue-grey\">(<\/span>x<span class=\"blue-grey\">=<\/span>SeventhAuthor<span class=\"blue-grey\">)) + <\/span>geom_bar<span class=\"blue-grey\">(<\/span>colour<span class=\"blue-grey\">=<\/span><span class=\"green\">\"pink\"<\/span>, fill<span class=\"blue-grey\">=<\/span><span class=\"green\">\"purple\"<\/span><span class=\"blue-grey\">)<\/span>+coord_flip<span class=\"blue-grey\">()<\/span>\nh<span class=\"blue-grey\">=<\/span>ggplot<span class=\"blue-grey\">(<\/span>Madhu2012.Na.Rm8, aes<span class=\"blue-grey\">(<\/span>x<span class=\"blue-grey\">=<\/span>EighthAuthor<span class=\"blue-grey\">)) + <\/span>geom_bar<span class=\"blue-grey\">(<\/span>colour<span class=\"blue-grey\">=<\/span><span class=\"green\">\"pink\"<\/span>, fill<span class=\"blue-grey\">=<\/span><span class=\"green\">\"purple\"<\/span><span class=\"blue-grey\">)<\/span>+coord_flip<span class=\"blue-grey\">()<\/span>\ni<span class=\"blue-grey\">=<\/span>ggplot<span class=\"blue-grey\">(<\/span>Madhu2012.Na.Rm9, aes<span class=\"blue-grey\">(<\/span>x<span class=\"blue-grey\">=<\/span>NinthAuthor<span class=\"blue-grey\">)) + <\/span>geom_bar<span 
class=\"blue-grey\">(<\/span>colour<span class=\"blue-grey\">=<\/span><span class=\"green\">\"pink\"<\/span>, fill<span class=\"blue-grey\">=<\/span><span class=\"green\">\"purple\"<\/span><span class=\"blue-grey\">)<\/span>+coord_flip<span class=\"blue-grey\">()<\/span>\nj<span class=\"blue-grey\">=<\/span>ggplot<span class=\"blue-grey\">(<\/span>Madhu2012.Na.Rm10, aes<span class=\"blue-grey\">(<\/span>x<span class=\"blue-grey\">=<\/span>TenthAuthor<span class=\"blue-grey\">)) + <\/span>geom_bar<span class=\"blue-grey\">(<\/span>colour<span class=\"blue-grey\">=<\/span><span class=\"green\">\"pink\"<\/span>, fill<span class=\"blue-grey\">=<\/span><span class=\"green\">\"purple\"<\/span><span class=\"blue-grey\">)<\/span>+coord_flip<span class=\"blue-grey\">()<\/span>\nk<span class=\"blue-grey\">=<\/span>ggplot<span class=\"blue-grey\">(<\/span>Madhu2012.Na.Rm11, aes<span class=\"blue-grey\">(<\/span>x<span class=\"blue-grey\">=<\/span>EleventhAuthor<span class=\"blue-grey\">)) + <\/span>geom_bar<span class=\"blue-grey\">(<\/span>colour<span class=\"blue-grey\">=<\/span><span class=\"green\">\"pink\"<\/span>, fill<span class=\"blue-grey\">=<\/span><span class=\"green\">\"purple\"<\/span><span class=\"blue-grey\">)<\/span>+coord_flip<span class=\"blue-grey\">()<\/span>\n\npdf<span class=\"blue-grey\">(<\/span><span class=\"green\">\"AuthorHistogram.pdf\"<\/span><span class=\"blue-grey\">)<\/span>\ngrid.arrange<span class=\"blue-grey\">(<\/span>a,b,c,d<span class=\"blue-grey\">)<\/span>\ngrid.arrange<span class=\"blue-grey\">(<\/span>e,f,g,h<span class=\"blue-grey\">)<\/span>\ngrid.arrange<span class=\"blue-grey\">(<\/span>i,j,k<span class=\"blue-grey\">)<\/span>\ndev.off<span class=\"blue-grey\">()<\/span><\/pre>\n<p>Page 1 of the 
results is shown in <a>Figure 1<\/a>. (The complete PDF is included in the <a href=\"http:\/\/www.informit.com\/content\/images\/art_bosede1_xml_data_r\/elementLinks\/code.zip\">code<\/a> file for download.) When we look at the other pages as well, we see that of the 19 articles published under Dr. Mazumdar&#8217;s name in 2012 (whether in print or online), she was most often listed fourth, fifth, or sixth among a paper&#8217;s authors. This suggests that she was only moderately active in terms of writing the papers. By contrast, Dr. Memtsoudis was the primary writer of about one-third of the papers, with Dr. Ma as the next most active. We should also note that most of the papers have six or fewer authors. Hence, Dr. Mazumdar was sometimes the last author and thus contributed in a minor way.<\/p>\n<div class=\"figure\"><a> <img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/www.informit.com\/content\/images\/art_bosede1_xml_data_r\/elementLinks\/thbosede1_fig01.jpg\" alt=\"\" width=\"400\" height=\"399\"><\/a><a>Figure 1<\/a> Histogram for the first four authors used in analysis of level of contribution.<\/p>\n<\/div>\n<p>Finally, we can write the dataframe to a <tt>.csv<\/tt> or <tt>.txt<\/tt> file and import it into another tool, such as a SQL database, Python, or SAS. Or we might want to share our data with a collaborator. 
The output files are included in the <a href=\"http:\/\/www.informit.com\/content\/images\/art_bosede1_xml_data_r\/elementLinks\/code.zip\">code<\/a> file for this article.<\/p>\n<pre><span class=\"teal\">#exporting data<\/span>\nwrite.table<span class=\"blue-grey\">(<\/span>Madhu2012.Clean, <span class=\"green\">\"Madhu2012.txt\"<\/span>, sep<span class=\"blue-grey\">=<\/span><span class=\"green\">\"\\t\"<\/span>, row.names<span class=\"blue-grey\">=<\/span>FALSE<span class=\"blue-grey\">)<\/span>\nwrite.csv<span class=\"blue-grey\">(<\/span>Madhu2012.Clean, <span class=\"green\">\"Madhu2012.csv\"<\/span>, row.names<span class=\"blue-grey\">=<\/span>FALSE<span class=\"blue-grey\">)<\/span><\/pre>\n<p>That was interesting to learn. No big deal, right?<\/p>\n<p>As you can imagine, R allows us to perform many other analyses on our XML data with ease. For example, we might be interested in finding out which journals published 2012 papers by Dr. Mazumdar, or which grants supported her work.<\/p>\n<p>As we&#8217;ve seen in this example, with R, there&#8217;s no need to shy away from unfamiliar data types such as XML. 
The world is your oyster!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Working with XML Data in R A common task for programmers these days is writing code to analyze data from various sources and output information&hellip; <\/p>\n","protected":false},"author":1,"featured_media":953,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-896","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-r"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/896","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=896"}],"version-history":[{"count":1,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/896\/revisions"}],"predecessor-version":[{"id":961,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/896\/revisions\/961"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media\/953"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media?parent=896"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=896"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=896"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}