{"id":754,"date":"2015-02-14T21:55:07","date_gmt":"2015-02-15T04:55:07","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=754"},"modified":"2015-02-14T21:55:07","modified_gmt":"2015-02-15T04:55:07","slug":"pdf-2-text-or-csv-r-2","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2015\/02\/14\/pdf-2-text-or-csv-r-2\/","title":{"rendered":"PDF-2-text-or-CSV.r"},"content":{"rendered":"<pre class=\"\"># Here are a few methods for getting text from PDF files. Do read through \n# the instructions carefully! Note that this code is written for Windows 7;\n# slight adjustments may be needed for other OSs.\n\n# Tell R what folder contains your 1000s of PDFs\ndest &lt;- \"G:\/somewhere\/with\/many\/PDFs\"\n\n# make a vector of PDF file names\nmyfiles &lt;- list.files(path = dest, pattern = \"pdf\",  full.names = TRUE)\n\n# now there are a few options...\n\n############### PDF (image of text format) to TXT ##########\n# Use this if your PDF is an image of text. That is the case\n# if you open the PDF in a PDF viewer and cannot select\n# words or lines with your cursor.\n\n                     ##### Wait! #####\n# Before proceeding, make sure you have a copy of Tesseract\n# on your computer! Details &amp; download:\n# https:\/\/code.google.com\/p\/tesseract-ocr\/\n# and a copy of ImageMagick: http:\/\/www.imagemagick.org\/\n# and a copy of pdftoppm on your computer! \n# Download: http:\/\/www.foolabs.com\/xpdf\/download.html\n# And then after installing those three, restart to \n# ensure R can find them on your path. 
\n# And note that this process can be quite slow...\n\n# PDF filenames can't have spaces in them for these operations\n# so let's get rid of the spaces in the filenames\n\nsapply(myfiles, FUN = function(i){\n  file.rename(from = i, to =  paste0(dirname(i), \"\/\", gsub(\" \", \"\", basename(i))))\n})\n\n# get the PDF file names without spaces\nmyfiles &lt;- list.files(path = dest, pattern = \"pdf\",  full.names = TRUE)\n\n# Now we can do the OCR to the renamed PDF files. If you get\n# messages like 'Config Error: No display \n# font for...', they can be safely ignored\n\nlapply(myfiles, function(i){\n  # convert pdf to ppm (an image format), just pages 1-10 of the PDF\n  # but you can change that easily, just remove or edit the \n  # -f 1 -l 10 bit in the line below\n  shell(shQuote(paste0(\"pdftoppm \", i, \" -f 1 -l 10 -r 600 ocrbook\")))\n  # convert ppm to tif ready for tesseract\n  shell(shQuote(paste0(\"convert *.ppm \", i, \".tif\")))\n  # convert tif to text file\n  shell(shQuote(paste0(\"tesseract \", i, \".tif \", i, \" -l eng\")))\n  # delete tif file\n  file.remove(paste0(i, \".tif\" ))\n  # delete the ppm files in the working directory so that pages\n  # from this PDF are not picked up by *.ppm for the next one\n  file.remove(list.files(pattern = \"[.]ppm$\"))\n  })\n\n\n# where are the txt files you just made?\ndest # in this folder\n\n# And now you're ready to do some text mining on the text files\n\n############### PDF (text format) to TXT ###################\n\n                  ##### Wait! #####\n# Before proceeding, make sure you have a copy of pdftotext\n# on your computer! 
Details: https:\/\/en.wikipedia.org\/wiki\/Pdftotext\n# Download: http:\/\/www.foolabs.com\/xpdf\/download.html\n\n# If you have a PDF with text, i.e. you can open the PDF in a \n# PDF viewer and select text with your cursor, then use this \n# line to convert each PDF file named in the vector into a \n# text file, created in the same directory as the PDFs.\n# Note that your pdftotext.exe may be in a different location from mine\nlapply(myfiles, function(i) system(paste('\"C:\/Program Files\/xpdf\/bin64\/pdftotext.exe\"', paste0('\"', i, '\"')), wait = FALSE) )\n\n# where are the txt files you just made?\ndest # in this folder\n\n# And now you're ready to do some text mining on the text files\n\n############### PDF to CSV (DfR format) ####################\n\n# or if you want DFR-style csv files...\n# read txt files into R\nmytxtfiles &lt;- list.files(path = dest, pattern = \"txt\",  full.names = TRUE)\n\nlibrary(tm)\nmycorpus &lt;- Corpus(DirSource(dest, pattern = \"txt\"))\n# warnings may appear after you run the previous line, they\n# can be ignored\nmycorpus &lt;- tm_map(mycorpus,  removeNumbers)\nmycorpus &lt;- tm_map(mycorpus,  removePunctuation)\nmycorpus &lt;- tm_map(mycorpus,  stripWhitespace)\nmydtm &lt;- DocumentTermMatrix(mycorpus)\n# remove some OCR weirdness:\n# drop terms in which a character repeats 3 or more times in a row\nmydtm &lt;- mydtm[,!grepl(\"(.)\\\\1{2,}\", mydtm$dimnames$Terms)]\n\n# get each doc as a csv with words and counts\nfor(i in 1:nrow(mydtm)){\n  # get word counts for document i\n  counts &lt;- as.vector(as.matrix(mydtm[i,]))\n  # get words\n  words &lt;- mydtm$dimnames$Terms\n  # combine into data frame\n  df &lt;- data.frame(word = words, count = counts, stringsAsFactors = FALSE)\n  # exclude words with count of zero\n  df &lt;- df[df$count != 0,]\n  # write to CSV with original txt filename\n  write.csv(df, paste0(mydtm$dimnames$Docs[i],\".csv\"), row.names = FALSE) \n}\n\n# and now you're ready to work with the csv files\n\n############### PDF to TXT (all text 
between two words) ####\n\n## The section below splits the text files at certain words;\n## it can be skipped...\n\n# if you just want the abstracts, we can use regex to extract that part of\n# each txt file. This assumes that the abstract is always between the words\n# 'Abstract' and 'Introduction'\n\nabstracts &lt;- lapply(mytxtfiles, function(i) {\n  j &lt;- paste0(scan(i, what = character()), collapse = \" \")\n  regmatches(j, gregexpr(\"(?&lt;=Abstract).*?(?=Introduction)\", j, perl=TRUE))\n})\n\n# write abstracts as txt files \n# (or use them in the list for whatever you want to do next)\nlapply(seq_along(abstracts),  function(i) write.table(abstracts[i], file=paste(mytxtfiles[i], \"abstract\", \"txt\", sep=\".\"), quote = FALSE, row.names = FALSE, col.names = FALSE, eol = \" \" ))\n\n# And now you're ready to do some text mining on the txt files\n\n# originally on http:\/\/stackoverflow.com\/a\/21449040\/1036500<\/pre>\n","protected":false},"excerpt":{"rendered":"<p># Here are a few methods for getting text from PDF files. Do read through # the instructions carefully! 
Note that this code is written&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-754","post","type-post","status-publish","format-standard","hentry","category-r"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/754","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=754"}],"version-history":[{"count":0,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/754\/revisions"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media?parent=754"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=754"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=754"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}