{"id":743,"date":"2015-02-13T10:50:15","date_gmt":"2015-02-13T17:50:15","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=743"},"modified":"2015-02-13T10:50:15","modified_gmt":"2015-02-13T17:50:15","slug":"extracting-text-from-pdfs-doing-ocr-all-within-r","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2015\/02\/13\/extracting-text-from-pdfs-doing-ocr-all-within-r\/","title":{"rendered":"Extracting Text from PDFs; Doing OCR; all within R"},"content":{"rendered":"<p><em>I am a huge fan of <a href=\"https:\/\/gist.github.com\/benmarwick\">Ben Marwick. He has so many useful pieces of code<\/a> for the programming archaeologist or historian!<\/em><\/p>\n<p><em>Edit July 17 1.20 pm: Mea culpa: I originally titled this post, \u2018Doing OCR within R\u2019. But, what I\u2019m describing below \u2013 that\u2019s not OCR. That\u2019s extracting text from pdfs. It\u2019s very fast and efficient, but it\u2019s not OCR. So, brain fart. But I leave the remainder of the post as it was. For command line OCR (really, actual OCR) on a Mac, see the link to Ben Schmidt\u2019s piece at the bottom. Sorry.<br \/>\n<\/em><\/p>\n<p><em>Edit July 17 10 pm: I am now an even bigger fan of Ben\u2019s. He\u2019s updated his script to either a) perform OCR by calling Tesseract from within R or b) grab the text layer from a pdf image. So this post no longer misleads. Thank you Ben!<\/em><\/p>\n<p>Optical Character Recognition, or OCR, is something that most historians will need to use at some point when working with digital documents. That is, you will often encounter pdf files of texts that you wish to work with in more detail (digitized newspapers, for instance). Often, there is a layer within the pdf image containing the text already: if you can highlight text by clicking and dragging over the image, you can copy and paste the text from the image. 
But this is often not the case, or worse, you have tens or hundreds or even thousands of documents to examine. There is commercial software that can do this for you, but it can be quite expensive.<\/p>\n<p>One way of doing OCR on your own machine with free tools is to use Ben Marwick\u2019s pdf-2-text-or-csv.r script for the R programming language. Marwick\u2019s script uses R as a wrapper for the Xpdf programme from <a href=\"http:\/\/www.foolabs.com\/\">Foolabs<\/a>. Xpdf is a pdf viewer, much like Adobe Acrobat. Using Xpdf on its own can be quite tricky, so Marwick\u2019s script will feed your pdf files to Xpdf, and have Xpdf perform the text extraction. For OCR, the script acts as a wrapper for Tesseract, which is not an easy piece of software to work with. There\u2019s a final\u00a0part to Marwick\u2019s script that will pre-process the resulting text files for various kinds of text analysis, but you can ignore that part for now.<\/p>\n<ol>\n<li>Make sure you have R downloaded and installed on your machine (available from <a href=\"http:\/\/www.r-project.org\/\" rel=\"nofollow\">http:\/\/www.r-project.org\/<\/a>)<\/li>\n<li>Make sure you have Xpdf downloaded and installed (available from <a href=\"ftp:\/\/ftp.foolabs.com\/pub\/xpdf\/xpdfbin-win-3.04.zip\">ftp:\/\/ftp.foolabs.com\/pub\/xpdf\/xpdfbin-win-3.04.zip<\/a>). Make a note of where you unzipped it. In particular, you are looking for the location of the file \u2018pdftotext.exe\u2019. 
Also, make sure you know where \u2018pdftoppm\u2019 is located (it\u2019s in that download).<\/li>\n<li>Download and install Tesseract\u00a0<a href=\"https:\/\/code.google.com\/p\/tesseract-ocr\/\">https:\/\/code.google.com\/p\/tesseract-ocr\/\u00a0<\/a><\/li>\n<li>Download and install ImageMagick\u00a0<a href=\"http:\/\/www.imagemagick.org\/\">http:\/\/www.imagemagick.org\/<\/a><\/li>\n<li>Have a folder with the pdfs you wish to extract text from.<\/li>\n<li>Open R, and paste Marwick\u2019s script into the script editor window.<\/li>\n<li>Make sure you adjust the path for \u201cdest\u201d and the path to \u201cpdftotext.exe\u201d to the correct locations.<\/li>\n<li>Run the script! But read the script carefully and make sure you run the bits you need. Ben has commented the code very well, so it should be fairly straightforward.<\/li>\n<\/ol>\n<p>Obviously, the above is framed for Windows users. For Mac users, the steps are all the same, except that you use the versions of Xpdf, Tesseract, and ImageMagick built for OS X, and your paths to the other software\u00a0are going to be different. And of course you\u2019re using R for Mac, which means the \u2018shell\u2019 commands have to be swapped for \u2018system\u2019! (As of July 2014, the Xpdf file for Mac that you want is at <a href=\"ftp:\/\/ftp.foolabs.com\/pub\/xpdf\/xpdfbin-mac-3.04.tar.gz\">ftp:\/\/ftp.foolabs.com\/pub\/xpdf\/xpdfbin-mac-3.04.tar.gz<\/a>.) I\u2019m not 100% certain of any other Mac\/PC differences in the R script \u2013 these should only exist\u00a0at those points where R is calling on other resources (rather than on R packages). 
Caveat lector, eh?<\/p>\n<p>The full R script may be found at <a href=\"https:\/\/gist.github.com\/benmarwick\/11333467\">https:\/\/gist.github.com\/benmarwick\/11333467<\/a>.\u00a0So here is\u00a0the section that does the text extraction from pdf images (i.e., pdfs in which you can highlight and copy the text):<\/p>\n<pre class=\"line number1 index0 alt2\"><code class=\"r comments\">###Note: there's some preprocessing that I (sg) haven't shown here: go see the original gist\n<\/code>################# Wait! ####################################\n# Before proceeding, make sure you have a copy of pdftotext\n# on your computer! Details: <a href=\"https:\/\/en.wikipedia.org\/wiki\/Pdftotext\">https:\/\/en.wikipedia.org\/wiki\/Pdftotext\n<\/a># Download: <a href=\"http:\/\/www.foolabs.com\/xpdf\/download.html\">http:\/\/www.foolabs.com\/xpdf\/download.html\n\n\n# Tell R what folder contains your 1000s of PDFs\n<\/a><code class=\"r plain\">dest &lt;- <\/code><code class=\"r string\">\"G:\/somehere\/with\/many\/PDFs\"\n\n# make a vector of PDF file names\n<\/code><code class=\"r plain\">myfiles &lt;- <\/code><code class=\"r functions\">list.files<\/code><code class=\"r plain\">(path = dest, pattern = <\/code><code class=\"r string\">\"pdf\"<\/code><code class=\"r plain\">,\u00a0 full.names = <\/code><code class=\"r keyword\">TRUE<\/code><code class=\"r plain\">)\n# now there are a few options...\n############### PDF to TXT #################################\n<\/code># convert each PDF file that is named in the vector into a text file\n# text file is created in the same directory as the PDFs\n# note that my pdftotext.exe is in a different location to yours\n<code class=\"r functions\">lapply<\/code><code class=\"r plain\">(myfiles, <\/code><code class=\"r functions\">function<\/code><code class=\"r plain\">(i) <\/code><code class=\"r functions\">system<\/code><code class=\"r plain\">(<\/code><code class=\"r functions\">paste<\/code><code class=\"r plain\">(<\/code><code class=\"r 
string\">'\"C:\/Program Files\/xpdf\/bin64\/pdftotext.exe\"'<\/code><code class=\"r plain\">, <\/code><code class=\"r functions\">paste0<\/code><code class=\"r plain\">(<\/code><code class=\"r string\">'\"'<\/code><code class=\"r plain\">, i, <\/code><code class=\"r string\">'\"'<\/code><code class=\"r plain\">)), wait = <\/code><code class=\"r keyword\">FALSE<\/code><code class=\"r plain\">) )\n<\/code># where are the txt files you just made?\n<code class=\"r plain\">dest <\/code><code class=\"r comments\"># in this folder<\/code><\/pre>\n<div><\/div>\n<p>And here\u2019s the bit that does the OCR:<\/p>\n<pre class=\"line number2 index1 alt1\"><code class=\"r spaces\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/code><code class=\"r comments\">##### Wait! #####\n<\/code># Before proceeding, make sure you have a copy of Tesseract\n# on your computer! 
Details &amp; download:\n# <a href=\"https:\/\/code.google.com\/p\/tesseract-ocr\/\">https:\/\/code.google.com\/p\/tesseract-ocr\/\n<\/a># and a copy of ImageMagick: <a href=\"http:\/\/www.imagemagick.org\/\">http:\/\/www.imagemagick.org\/\n<\/a># and a copy of pdftoppm on your computer!\n# Download: <a href=\"http:\/\/www.foolabs.com\/xpdf\/download.html\">http:\/\/www.foolabs.com\/xpdf\/download.html\n# And then after installing those three, restart to\n<\/a># ensure R can find them on your path.\n# And note that this process can be quite slow...\n# PDF filenames can't have spaces in them for these operations\n# so let's get rid of the spaces in the filenames\n<code class=\"r functions\">sapply<\/code><code class=\"r plain\">(myfiles, FUN = <\/code><code class=\"r keyword\">function<\/code><code class=\"r plain\">(i){\n<\/code><code class=\"r spaces\">\u00a0\u00a0<\/code><code class=\"r functions\">file.rename<\/code><code class=\"r plain\">(from = i, to =\u00a0 <\/code><code class=\"r functions\">paste0<\/code><code class=\"r plain\">(<\/code><code class=\"r functions\">dirname<\/code><code class=\"r plain\">(i), <\/code><code class=\"r string\">\"\/\"<\/code><code class=\"r plain\">, <\/code><code class=\"r functions\">gsub<\/code><code class=\"r plain\">(<\/code><code class=\"r string\">\" \"<\/code><code class=\"r plain\">, <\/code><code class=\"r string\">\"\"<\/code><code class=\"r plain\">, <\/code><code class=\"r functions\">basename<\/code><code class=\"r plain\">(i))))\n<\/code>})\n# get the PDF file names without spaces\n<code class=\"r plain\">myfiles &lt;- <\/code><code class=\"r functions\">list.files<\/code><code class=\"r plain\">(path = dest, pattern = <\/code><code class=\"r string\">\"pdf\"<\/code><code class=\"r plain\">,\u00a0 full.names = <\/code><code class=\"r keyword\">TRUE<\/code><code class=\"r plain\">)\n<\/code># Now we can do the OCR to the renamed PDF files. 
Don't worry\n# if you get messages like 'Config Error: No display\n# font for...' it's nothing to worry about\n<code class=\"r functions\">lapply<\/code><code class=\"r plain\">(myfiles, <\/code><code class=\"r keyword\">function<\/code><code class=\"r plain\">(i){\n<\/code><code class=\"r spaces\">\u00a0\u00a0<\/code><code class=\"r comments\"># convert pdf to ppm (an image format), using\n<\/code><code class=\"r spaces\">\u00a0\u00a0<\/code><code class=\"r functions\">shell<\/code><code class=\"r plain\">(<\/code><code class=\"r functions\">shQuote<\/code><code class=\"r plain\">(<\/code><code class=\"r functions\">paste0<\/code><code class=\"r plain\">(<\/code><code class=\"r string\">\"pdftoppm \"<\/code><code class=\"r plain\">, i, <\/code><code class=\"r string\">\" -f 1 -l 10 -r 600 ocrbook\"<\/code><code class=\"r plain\">)))\n<\/code><code class=\"r spaces\">\u00a0\u00a0<\/code><code class=\"r comments\"># convert ppm to tif ready for tesseract\n<\/code><code class=\"r spaces\">\u00a0\u00a0<\/code><code class=\"r functions\">shell<\/code><code class=\"r plain\">(<\/code><code class=\"r functions\">shQuote<\/code><code class=\"r plain\">(<\/code><code class=\"r functions\">paste0<\/code><code class=\"r plain\">(<\/code><code class=\"r string\">\"convert *.ppm \"<\/code><code class=\"r plain\">, i, <\/code><code class=\"r string\">\".tif\"<\/code><code class=\"r plain\">)))\n<\/code><code class=\"r spaces\">\u00a0\u00a0<\/code><code class=\"r comments\"># convert tif to text file\n<\/code><code class=\"r spaces\">\u00a0\u00a0<\/code><code class=\"r functions\">shell<\/code><code class=\"r plain\">(<\/code><code class=\"r functions\">shQuote<\/code><code class=\"r plain\">(<\/code><code class=\"r functions\">paste0<\/code><code class=\"r plain\">(<\/code><code class=\"r string\">\"tesseract \"<\/code><code class=\"r plain\">, i, <\/code><code class=\"r string\">\".tif \"<\/code><code class=\"r plain\">, i, <\/code><code class=\"r string\">\" -l 
eng\"<\/code><code class=\"r plain\">)))\n<\/code><code class=\"r spaces\">\u00a0\u00a0<\/code><code class=\"r comments\"># delete tif file\n<\/code><code class=\"r spaces\">\u00a0\u00a0<\/code><code class=\"r functions\">file.remove<\/code><code class=\"r plain\">(<\/code><code class=\"r functions\">paste0<\/code><code class=\"r plain\">(i, <\/code><code class=\"r string\">\".tif\"<\/code> <code class=\"r plain\">))\n<\/code><code class=\"r spaces\">\u00a0\u00a0<\/code><code class=\"r plain\">})\n<\/code># where are the txt files you just made?\n<code class=\"r plain\">dest <\/code><code class=\"r comments\"># in this folder<\/code><\/pre>\n<div><\/div>\n<p>Besides showing how to do your own OCR, Marwick\u2019s script shows some of the power of R for doing more than statistics. Mac users might be interested in Ben Schmidt\u2019s tutorial \u2018Command-line OCR on a Mac\u2019 from his digital history graduate seminar at Northeastern University, online at <a href=\"http:\/\/benschmidt.org\/dighist13\/?page_id=129\" rel=\"nofollow\">http:\/\/benschmidt.org\/dighist13\/?page_id=129<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I am a huge fan of Ben Marwick. He has so many useful pieces of code for the programming archaeologist or historian! 
Edit July 17&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-743","post","type-post","status-publish","format-standard","hentry","category-r"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/743","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=743"}],"version-history":[{"count":0,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/743\/revisions"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media?parent=743"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=743"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=743"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}