{"id":767,"date":"2015-02-22T15:36:02","date_gmt":"2015-02-22T22:36:02","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=767"},"modified":"2015-02-22T15:36:02","modified_gmt":"2015-02-22T22:36:02","slug":"r-programmingtext-processing-2","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2015\/02\/22\/r-programmingtext-processing-2\/","title":{"rendered":"R Programming\/Text Processing"},"content":{"rendered":"<p>This page covers the material you need to work with strings in R. The section on regular expressions helps in understanding the rest of the page, although you can skip it if you only need to perform simple tasks.<\/p>\n<p>This page may be useful to:<\/p>\n<ul>\n<li>perform statistical text analysis.<\/li>\n<li>collect data from an unformatted text file.<\/li>\n<li>deal with character variables.<\/li>\n<\/ul>\n<p>In this page, we learn how to read a text file and how to use R functions for character data. There are two kinds of functions for characters: simple functions and functions based on regular expressions. Many of them are part of the standard R <b>base<\/b> package.<\/p>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"kw8\">help.<span class=\"me1\">search<\/span><\/span><span class=\"br0\">(<\/span>keyword <span class=\"sy0\">=<\/span> <span class=\"st0\">\"character\"<\/span>, package <span class=\"sy0\">=<\/span> <span class=\"st0\">\"base\"<\/span><span class=\"br0\">)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<p>However, their names and syntax are not intuitive to all users. 
Hadley Wickham has developed the <b>stringr<\/b> package, which defines functions with similar behaviour but with names that are easier to remember and a much more systematic syntax<sup id=\"cite_ref-stringr_1-0\" class=\"reference\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#cite_note-stringr-1\">[1]<\/a><\/sup>.<\/p>\n<ul>\n<li>Keywords: <i>text mining<\/i>, <i>natural language processing<\/i><\/li>\n<li>See the CRAN Task View on Natural Language Processing<sup id=\"cite_ref-2\" class=\"reference\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#cite_note-2\">[2]<\/a><\/sup><\/li>\n<li>See also the following packages: <b>tm<\/b>, <b>tau<\/b>, <b>languageR<\/b>, <b>scrapeR<\/b>.<\/li>\n<\/ul>\n<div id=\"toc\" class=\"toc\">\n<div id=\"toctitle\">\n<h2>Contents<\/h2>\n<\/div>\n<ul>\n<li class=\"toclevel-1 tocsection-1\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Reading_and_writing_text_files\"><span class=\"tocnumber\">1<\/span> <span class=\"toctext\">Reading and writing text files<\/span><\/a><\/li>\n<li class=\"toclevel-1 tocsection-2\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Character_encoding\"><span class=\"tocnumber\">2<\/span> <span class=\"toctext\">Character encoding<\/span><\/a>\n<ul>\n<li class=\"toclevel-2 tocsection-3\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Example\"><span class=\"tocnumber\">2.1<\/span> <span class=\"toctext\">Example<\/span><\/a><\/li>\n<\/ul>\n<\/li>\n<li class=\"toclevel-1 tocsection-4\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Regular_Expressions\"><span class=\"tocnumber\">3<\/span> <span class=\"toctext\">Regular Expressions<\/span><\/a>\n<ul>\n<li class=\"toclevel-2 tocsection-5\"><a 
href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Functions_which_use_regular_expressions_in_R\"><span class=\"tocnumber\">3.1<\/span> <span class=\"toctext\">Functions which use regular expressions in R<\/span><\/a><\/li>\n<li class=\"toclevel-2 tocsection-6\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Extended_regular_expressions_.28The_default.29\"><span class=\"tocnumber\">3.2<\/span> <span class=\"toctext\">Extended regular expressions (The default)<\/span><\/a><\/li>\n<li class=\"toclevel-2 tocsection-7\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Perl-like_regular_expressions\"><span class=\"tocnumber\">3.3<\/span> <span class=\"toctext\">Perl-like regular expressions<\/span><\/a>\n<ul>\n<li class=\"toclevel-3 tocsection-8\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Examples\"><span class=\"tocnumber\">3.3.1<\/span> <span class=\"toctext\">Examples<\/span><\/a><\/li>\n<li class=\"toclevel-3 tocsection-9\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#See_also\"><span class=\"tocnumber\">3.3.2<\/span> <span class=\"toctext\">See also<\/span><\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li class=\"toclevel-1 tocsection-10\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Concatenating_strings\"><span class=\"tocnumber\">4<\/span> <span class=\"toctext\">Concatenating strings<\/span><\/a>\n<ul>\n<li class=\"toclevel-2 tocsection-11\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Examples_2\"><span class=\"tocnumber\">4.1<\/span> <span class=\"toctext\">Examples<\/span><\/a><\/li>\n<\/ul>\n<\/li>\n<li class=\"toclevel-1 tocsection-12\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Splitting_a_string\"><span class=\"tocnumber\">5<\/span> <span class=\"toctext\">Splitting a string<\/span><\/a><\/li>\n<li class=\"toclevel-1 tocsection-13\"><a 
href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Counting_the_number_of_characters_in_a_string\"><span class=\"tocnumber\">6<\/span> <span class=\"toctext\">Counting the number of characters in a string<\/span><\/a><\/li>\n<li class=\"toclevel-1 tocsection-14\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Detecting_the_presence_of_a_substring\"><span class=\"tocnumber\">7<\/span> <span class=\"toctext\">Detecting the presence of a substring<\/span><\/a>\n<ul>\n<li class=\"toclevel-2 tocsection-15\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Detecting_a_pattern_in_a_string_.3F\"><span class=\"tocnumber\">7.1<\/span> <span class=\"toctext\">Detecting a pattern in a string\u00a0?<\/span><\/a><\/li>\n<li class=\"toclevel-2 tocsection-16\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Counting_the_occurrence_of_each_pattern_in_a_string_.3F\"><span class=\"tocnumber\">7.2<\/span> <span class=\"toctext\">Counting the occurrence of each pattern in a string\u00a0?<\/span><\/a><\/li>\n<\/ul>\n<\/li>\n<li class=\"toclevel-1 tocsection-17\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Extracting_the_position_of_a_substring_or_a_pattern_in_a_string\"><span class=\"tocnumber\">8<\/span> <span class=\"toctext\">Extracting the position of a substring or a pattern in a string<\/span><\/a>\n<ul>\n<li class=\"toclevel-2 tocsection-18\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Extracting_the_position_of_a_substring_.3F\"><span class=\"tocnumber\">8.1<\/span> <span class=\"toctext\">Extracting the position of a substring\u00a0?<\/span><\/a><\/li>\n<li class=\"toclevel-2 tocsection-19\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Extracting_the_position_of_a_pattern_in_a_string_.3F\"><span class=\"tocnumber\">8.2<\/span> <span class=\"toctext\">Extracting the position of a pattern in a 
string\u00a0?<\/span><\/a><\/li>\n<\/ul>\n<\/li>\n<li class=\"toclevel-1 tocsection-20\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Extracting_a_substring_from_a_string\"><span class=\"tocnumber\">9<\/span> <span class=\"toctext\">Extracting a substring from a string<\/span><\/a>\n<ul>\n<li class=\"toclevel-2 tocsection-21\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Extracting_a_fixed_width_substring_.3F\"><span class=\"tocnumber\">9.1<\/span> <span class=\"toctext\">Extracting a fixed width substring\u00a0?<\/span><\/a><\/li>\n<li class=\"toclevel-2 tocsection-22\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Extracting_the_first_word_in_a_string_.3F\"><span class=\"tocnumber\">9.2<\/span> <span class=\"toctext\">Extracting the first word in a string\u00a0?<\/span><\/a><\/li>\n<li class=\"toclevel-2 tocsection-23\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Extracting_a_pattern_in_a_string_.3F\"><span class=\"tocnumber\">9.3<\/span> <span class=\"toctext\">Extracting a pattern in a string\u00a0?<\/span><\/a><\/li>\n<\/ul>\n<\/li>\n<li class=\"toclevel-1 tocsection-24\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Making_some_substitution_inside_a_string\"><span class=\"tocnumber\">10<\/span> <span class=\"toctext\">Making some substitution inside a string<\/span><\/a>\n<ul>\n<li class=\"toclevel-2 tocsection-25\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Substituting_a_pattern_in_a_string\"><span class=\"tocnumber\">10.1<\/span> <span class=\"toctext\">Substituting a pattern in a string<\/span><\/a><\/li>\n<li class=\"toclevel-2 tocsection-26\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Substituting_characters_in_a_string_.3F\"><span class=\"tocnumber\">10.2<\/span> <span class=\"toctext\">Substituting characters in a 
string\u00a0?<\/span><\/a><\/li>\n<\/ul>\n<\/li>\n<li class=\"toclevel-1 tocsection-27\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Converting_letters_to_lower_or_upper-case\"><span class=\"tocnumber\">11<\/span> <span class=\"toctext\">Converting letters to lower or upper-case<\/span><\/a><\/li>\n<li class=\"toclevel-1 tocsection-28\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Filling_a_string_with_some_character\"><span class=\"tocnumber\">12<\/span> <span class=\"toctext\">Filling a string with some character<\/span><\/a><\/li>\n<li class=\"toclevel-1 tocsection-29\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Removing_leading_and_trailing_spaces\"><span class=\"tocnumber\">13<\/span> <span class=\"toctext\">Removing leading and trailing spaces<\/span><\/a><\/li>\n<li class=\"toclevel-1 tocsection-30\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Comparing_two_strings\"><span class=\"tocnumber\">14<\/span> <span class=\"toctext\">Comparing two strings<\/span><\/a>\n<ul>\n<li class=\"toclevel-2 tocsection-31\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Assessing_if_they_are_identical\"><span class=\"tocnumber\">14.1<\/span> <span class=\"toctext\">Assessing if they are identical<\/span><\/a><\/li>\n<li class=\"toclevel-2 tocsection-32\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Computing_distance_between_strings\"><span class=\"tocnumber\">14.2<\/span> <span class=\"toctext\">Computing distance between strings<\/span><\/a>\n<ul>\n<li class=\"toclevel-3 tocsection-33\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Example_with_utils\"><span class=\"tocnumber\">14.2.1<\/span> <span class=\"toctext\">Example with utils<\/span><\/a><\/li>\n<li class=\"toclevel-3 tocsection-34\"><a 
href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Example_with_MiscPsycho\"><span class=\"tocnumber\">14.2.2<\/span> <span class=\"toctext\">Example with MiscPsycho<\/span><\/a><\/li>\n<li class=\"toclevel-3 tocsection-35\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Approximate_matching\"><span class=\"tocnumber\">14.2.3<\/span> <span class=\"toctext\">Approximate matching<\/span><\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li class=\"toclevel-1 tocsection-36\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Miscellaneous\"><span class=\"tocnumber\">15<\/span> <span class=\"toctext\">Miscellaneous<\/span><\/a><\/li>\n<li class=\"toclevel-1 tocsection-37\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#References\"><span class=\"tocnumber\">16<\/span> <span class=\"toctext\">References<\/span><\/a><\/li>\n<\/ul>\n<\/div>\n<h2><span id=\"Reading_and_writing_text_files\" class=\"mw-headline\">Reading and writing text files<\/span><\/h2>\n<p><b>R<\/b> can read any text file using <code>readLines()<\/code> or <code>scan()<\/code>. <code>readLines()<\/code> reads the whole file into a character vector with one element per line, and its <code>encoding<\/code> argument lets you declare the encoding of the imported text. <code>scan()<\/code> is more flexible. 
The kind of data expected can be specified in the second argument (e.g., character(0) for a string).<\/p>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"kw4\">text<\/span> <span class=\"sy0\">&lt;-<\/span> <span class=\"kw2\">readLines<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"file.txt\"<\/span>,encoding<span class=\"sy0\">=<\/span><span class=\"st0\">\"UTF-8\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"kw2\">scan<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"file.txt\"<\/span>, <span class=\"kw2\">character<\/span><span class=\"br0\">(<\/span><span class=\"nu0\">0<\/span><span class=\"br0\">)<\/span><span class=\"br0\">)<\/span> <span class=\"co1\"># separate each word<\/span>\n<span class=\"kw2\">scan<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"file.txt\"<\/span>, <span class=\"kw2\">character<\/span><span class=\"br0\">(<\/span><span class=\"nu0\">0<\/span><span class=\"br0\">)<\/span>, <span class=\"kw2\">quote<\/span> <span class=\"sy0\">=<\/span> NULL<span class=\"br0\">)<\/span> <span class=\"co1\"># get rid of quotes<\/span>\n<span class=\"kw2\">scan<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"file.txt\"<\/span>, <span class=\"kw2\">character<\/span><span class=\"br0\">(<\/span><span class=\"nu0\">0<\/span><span class=\"br0\">)<\/span>, sep <span class=\"sy0\">=<\/span> <span class=\"st0\">\".\"<\/span><span class=\"br0\">)<\/span> <span class=\"co1\"># separate each sentence<\/span>\n<span class=\"kw2\">scan<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"file.txt\"<\/span>, <span class=\"kw2\">character<\/span><span class=\"br0\">(<\/span><span class=\"nu0\">0<\/span><span class=\"br0\">)<\/span>, sep <span class=\"sy0\">=<\/span> <span class=\"st0\">\"<span class=\"es0\">\\n<\/span>\"<\/span><span class=\"br0\">)<\/span> <span class=\"co1\"># separate each 
line<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<p>We can write the content of an R object into a text file using <code>cat()<\/code> or <code>writeLines()<\/code>. By default, <code>cat()<\/code> writes the elements of a vector separated by single spaces. You can change this with the options <code>sep=\"\\n\"<\/code> or <code>fill=TRUE<\/code>. The default encoding depends on your computer.<\/p>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"kw2\">cat<\/span><span class=\"br0\">(<\/span><span class=\"kw4\">text<\/span>,<span class=\"kw2\">file<\/span><span class=\"sy0\">=<\/span><span class=\"st0\">\"file.txt\"<\/span>,sep<span class=\"sy0\">=<\/span><span class=\"st0\">\"<span class=\"es0\">\\n<\/span>\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"kw2\">writeLines<\/span><span class=\"br0\">(<\/span><span class=\"kw4\">text<\/span>, con <span class=\"sy0\">=<\/span> <span class=\"st0\">\"file.txt\"<\/span>, sep <span class=\"sy0\">=<\/span> <span class=\"st0\">\"<span class=\"es0\">\\n<\/span>\"<\/span>, useBytes <span class=\"sy0\">=<\/span> FALSE<span class=\"br0\">)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<p>Before reading a text file, you can look at its properties. <code>nlines()<\/code> (<b>parser<\/b> package) and <code>countLines()<\/code> (<b>R.utils<\/b> package) count the number of lines in the file. <code>count.chars()<\/code> (<b>parser<\/b> package) counts the number of bytes and characters in each line of a file. 
You can also display a text file using <code>file.show()<\/code>.<\/p>\n<h2><span id=\"Character_encoding\" class=\"mw-headline\">Character encoding<\/span><\/h2>\n<table class=\"plainlinks noprint messagebox notice\">\n<tbody>\n<tr>\n<td><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/upload.wikimedia.org\/wikipedia\/commons\/thumb\/6\/63\/Wikipedia-logo.png\/40px-Wikipedia-logo.png\" srcset=\"\/\/upload.wikimedia.org\/wikipedia\/commons\/thumb\/6\/63\/Wikipedia-logo.png\/60px-Wikipedia-logo.png 1.5x, \/\/upload.wikimedia.org\/wikipedia\/commons\/thumb\/6\/63\/Wikipedia-logo.png\/80px-Wikipedia-logo.png 2x\" alt=\"Wikipedia-logo.png\" width=\"40\" height=\"40\" data-file-width=\"200\" data-file-height=\"200\" \/><\/td>\n<td><a class=\"extiw\" title=\"w:\" href=\"http:\/\/en.wikipedia.org\/wiki\/\" target=\"_blank\" rel=\"noopener\">Wikipedia<\/a> has related information at <a class=\"external text\" href=\"http:\/\/en.wikipedia.org\/wiki\/Character_encoding\" target=\"_blank\" rel=\"noopener\"><i><b>Character encoding<\/b><\/i><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>R provides functions to deal with various encoding schemes. This is useful if you work with text files created on another operating system, especially if the language is not English and uses many accented or other special characters. For instance, the usual encoding on Linux is &#8220;UTF-8&#8221;, whereas Windows has traditionally used a locale-specific encoding such as &#8220;Latin1&#8221; in Western European locales. The <code>Encoding()<\/code> function returns the declared encoding of a string. 
<code>iconv()<\/code> is similar to the Unix command <a class=\"extiw\" title=\"w:iconv\" href=\"http:\/\/en.wikipedia.org\/wiki\/iconv\" target=\"_blank\" rel=\"noopener\">iconv<\/a> and converts a string from one encoding to another.<\/p>\n<ul>\n<li><code>iconvlist()<\/code> gives the list of encoding schemes available on your computer.<\/li>\n<li><code>readLines()<\/code>, <code>scan()<\/code> and <code>file.show()<\/code> also have an encoding option.<\/li>\n<li><code>is.utf8()<\/code> (<b>tau<\/b>) tests if the encoding is &#8220;utf8&#8221;.<\/li>\n<li><code>is.locale()<\/code> (<b>tau<\/b>) tests if the encoding is the same as the default encoding on your computer.<\/li>\n<li><code>translate()<\/code> (<b>tau<\/b>) translates the encoding into the current locale.<\/li>\n<li><code>fromUTF8()<\/code> (<b>descr<\/b>) is less general than <code>iconv()<\/code>.<\/li>\n<li><code>utf8ToInt()<\/code> (<b>base<\/b>) converts a UTF-8 encoded string into a vector of integer code points.<\/li>\n<\/ul>\n<h3><span id=\"Example\" class=\"mw-headline\">Example<\/span><\/h3>\n<p>The following example was run under Windows. 
Thus, the default encoding is &#8220;latin1&#8221;.<\/p>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> texte <span class=\"sy0\">&lt;-<\/span> <span class=\"st0\">\"H\u00e9 h\u00e9\"<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">Encoding<\/span><span class=\"br0\">(<\/span>texte<span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"latin1\"<\/span>\n<span class=\"sy0\">&gt;<\/span> texte2 <span class=\"sy0\">&lt;-<\/span>  <span class=\"kw2\">iconv<\/span><span class=\"br0\">(<\/span>texte,<span class=\"st0\">\"latin1\"<\/span>,<span class=\"st0\">\"UTF-8\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">Encoding<\/span><span class=\"br0\">(<\/span>texte2<span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"UTF-8\"<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h2><span id=\"Regular_Expressions\" class=\"mw-headline\">Regular Expressions<\/span><span class=\"mw-editsection\"><span class=\"mw-editsection-bracket\">[<\/span><a title=\"Edit section: Regular Expressions\" href=\"http:\/\/en.wikibooks.org\/w\/index.php?title=R_Programming\/Text_Processing&amp;action=edit&amp;section=4\">edit<\/a><span class=\"mw-editsection-bracket\">]<\/span><\/span><\/h2>\n<table class=\"plainlinks noprint messagebox notice\">\n<tbody>\n<tr>\n<td><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/upload.wikimedia.org\/wikipedia\/commons\/thumb\/d\/d5\/Wikibooks-logo.png\/40px-Wikibooks-logo.png\" srcset=\"\/\/upload.wikimedia.org\/wikipedia\/commons\/thumb\/d\/d5\/Wikibooks-logo.png\/60px-Wikibooks-logo.png 1.5x, \/\/upload.wikimedia.org\/wikipedia\/commons\/thumb\/d\/d5\/Wikibooks-logo.png\/80px-Wikibooks-logo.png 2x\" alt=\"Wikibooks-logo.png\" 
width=\"40\" height=\"40\" data-file-width=\"135\" data-file-height=\"135\" \/><\/td>\n<td>Also see the <i><a class=\"mw-redirect\" title=\"Regular expressions\" href=\"http:\/\/en.wikibooks.org\/wiki\/Regular_expressions\">Regular expressions<\/a><\/i> book.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table class=\"plainlinks noprint messagebox notice\">\n<tbody>\n<tr>\n<td><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/upload.wikimedia.org\/wikipedia\/commons\/thumb\/6\/63\/Wikipedia-logo.png\/40px-Wikipedia-logo.png\" srcset=\"\/\/upload.wikimedia.org\/wikipedia\/commons\/thumb\/6\/63\/Wikipedia-logo.png\/60px-Wikipedia-logo.png 1.5x, \/\/upload.wikimedia.org\/wikipedia\/commons\/thumb\/6\/63\/Wikipedia-logo.png\/80px-Wikipedia-logo.png 2x\" alt=\"Wikipedia-logo.png\" width=\"40\" height=\"40\" data-file-width=\"200\" data-file-height=\"200\" \/><\/td>\n<td><a class=\"extiw\" title=\"w:\" href=\"http:\/\/en.wikipedia.org\/wiki\/\" target=\"_blank\" rel=\"noopener\">Wikipedia<\/a> has related information at <a class=\"external text\" href=\"http:\/\/en.wikipedia.org\/wiki\/Regular_expression\" target=\"_blank\" rel=\"noopener\"><i><b>Regular expression<\/b><\/i><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>A regular expression is a pattern that describes a set of strings. For instance, one could have the following pattern: 2 digits, 2 letters and 4 digits. <b>R<\/b> provides powerful functions to deal with regular expressions. 
Two types of regular expressions are used in <b>R<\/b><sup id=\"cite_ref-3\" class=\"reference\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#cite_note-3\">[3]<\/a><\/sup>:<\/p>\n<ul>\n<li>extended regular expressions, used with <code>\u2018perl = FALSE\u2019<\/code> (the default),<\/li>\n<li>Perl-like regular expressions, used with <code>\u2018perl = TRUE\u2019<\/code>.<\/li>\n<\/ul>\n<p>There is also an option called <code>\u2018fixed = TRUE\u2019<\/code>, which treats the pattern as a literal string rather than a regular expression. <code>fixed()<\/code> (<b>stringr<\/b>) is equivalent to <code>fixed=TRUE<\/code> in the standard regex functions. These functions are case sensitive by default. This can be changed by specifying the option <code>ignore.case = TRUE<\/code>.<\/p>\n<p>If you are not a specialist in regular expressions, you may find <code>glob2rx()<\/code> useful. This function converts a wildcard (glob) pattern into the corresponding regular expression:<\/p>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> <span class=\"kw8\">glob2rx<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"abc.*\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"^abc<span class=\"es0\">\\\\<\/span>.\"<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h3><span id=\"Functions_which_use_regular_expressions_in_R\" class=\"mw-headline\">Functions which use regular expressions in R<\/span><\/h3>\n<ul>\n<li><code>sub()<\/code>, <code>gsub()<\/code>, <code>str_replace()<\/code> 
(<b>stringr<\/b>) make substitutions in a string.<\/li>\n<li><code>grep()<\/code> returns the elements that match a pattern (or their indices), and <code>str_extract()<\/code> (<b>stringr<\/b>) extracts the matching substring.<\/li>\n<li><code>grepl()<\/code>, <code>str_detect()<\/code> (<b>stringr<\/b>) detect the presence of a pattern.<\/li>\n<li>See also <code>splitByPattern()<\/code> (<b>R.utils<\/b>).<\/li>\n<li>See also <code>gsubfn()<\/code> in the <b>gsubfn<\/b> package.<\/li>\n<\/ul>\n<h3><span id=\"Extended_regular_expressions_.28The_default.29\" class=\"mw-headline\">Extended regular expressions (The default)<\/span><\/h3>\n<ul>\n<li><code>\".\"<\/code> stands for any character.<\/li>\n<li><code>\"[ABC]\"<\/code> means A, B or C.<\/li>\n<li><code>\"[A-Z]\"<\/code> means any upper-case letter between A and Z.<\/li>\n<li><code>\"[0-9]\"<\/code> means any digit between 0 and 9.<\/li>\n<\/ul>\n<p>Here is the list of metacharacters: <code>\u2018$ * + .\u00a0? [ ] ^ { } | ( ) \\\u2019<\/code>. 
If you need to match one of those characters literally, precede it with a doubled backslash in the R string (for example, <code>\"\\\\.\"<\/code> matches a literal dot).<\/p>\n<p>Here are some character classes. They are only valid inside a bracket expression, so in practice they are written as, for example, <code>\"[[:digit:]]\"<\/code>.<\/p>\n<p>For numbers:<\/p>\n<ul>\n<li><code>\u2018[:digit:]\u2019<\/code> Digits: <code>\u20180 1 2 3 4 5 6 7 8 9\u2019<\/code>.<\/li>\n<\/ul>\n<p>For letters:<\/p>\n<ul>\n<li><code>\u2018[:alpha:]\u2019<\/code> Alphabetic characters: <code>\u2018[:lower:]\u2019<\/code> and <code>\u2018[:upper:]\u2019<\/code>.<\/li>\n<li><code>\u2018[:upper:]\u2019<\/code> Upper-case letters.<\/li>\n<li><code>\u2018[:lower:]\u2019<\/code> Lower-case letters.<\/li>\n<\/ul>\n<p>Note that the set of alphabetic characters includes accented letters such as <code>\u00e9 \u00e8 \u00ea<\/code>, which are very common in some languages like French. Therefore, it is more general than <code>\"[A-Za-z]\"<\/code>, which does not include accented letters.<\/p>\n<p>For other characters:<\/p>\n<ul>\n<li><code>\u2018[:punct:]\u2019<\/code> Punctuation characters: <code>\u2018! \" # $\u00a0% &amp; ' ( ) * + , - . \/\u00a0:\u00a0; &lt; = &gt;\u00a0? 
@ [ \\ ] ^ _ ` { | } ~\u2019<\/code>.<\/li>\n<li><code>\u2018[:space:]\u2019<\/code> Space characters: tab, newline, vertical tab, form feed, carriage return, and space.<\/li>\n<li><code>\u2018[:blank:]\u2019<\/code> Blank characters: space and tab.<\/li>\n<li><code>\u2018[:cntrl:]\u2019<\/code> Control characters.<\/li>\n<\/ul>\n<p>For combinations of other classes:<\/p>\n<ul>\n<li><code>\u2018[:alnum:]\u2019<\/code> Alphanumeric characters: <code>\u2018[:alpha:]\u2019<\/code> and <code>\u2018[:digit:]\u2019<\/code>.<\/li>\n<li><code>\u2018[:graph:]\u2019<\/code> Graphical characters: <code>\u2018[:alnum:]\u2019<\/code> and <code>\u2018[:punct:]\u2019<\/code>.<\/li>\n<li><code>\u2018[:print:]\u2019<\/code> Printable characters: <code>\u2018[:alnum:]\u2019<\/code>, <code>\u2018[:punct:]\u2019<\/code> and space.<\/li>\n<li><code>\u2018[:xdigit:]\u2019<\/code> Hexadecimal digits: <code>\u20180 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f\u2019<\/code>.<\/li>\n<\/ul>\n<p>You can specify the number of repetitions by adding one of the following quantifiers after the regular expression:<\/p>\n<ul>\n<li><code>\u2018?\u2019<\/code> The preceding item is optional and will be matched at most once.<\/li>\n<li><code>\u2018*\u2019<\/code> The preceding item will be matched zero or more times.<\/li>\n<li><code>\u2018+\u2019<\/code> The preceding item will be matched one or more times.<\/li>\n<li><code>\u2018{n}\u2019<\/code> The preceding item is matched exactly \u2018n\u2019 times.<\/li>\n<li><code>\u2018{n,}\u2019<\/code> The preceding item is matched \u2018n\u2019 or more times.<\/li>\n<li><code>\u2018{n,m}\u2019<\/code> The preceding item is matched at least \u2018n\u2019 times, but not more than \u2018m\u2019 times.<\/li>\n<\/ul>\n<p>Two anchors constrain where the match may occur:<\/p>\n<ul>\n<li><code>^<\/code> forces the regular expression to match at the beginning of the string.<\/li>\n<li><code>$<\/code> forces the regular expression to match at the end of the string.<\/li>\n<\/ul>\n<p>If you want to know more, have a look at the following two 
help files:<\/p>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;?<\/span>regexp <span class=\"co1\"># gives some general explanations<\/span>\n<span class=\"sy0\">&gt;?<\/span><span class=\"kw2\">grep<\/span> <span class=\"co1\"># help file for grep(),regexpr(),sub(),gsub(),etc<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h3><span id=\"Perl-like_regular_expressions\" class=\"mw-headline\">Perl-like regular expressions<\/span><\/h3>\n<p>It is also possible to use &#8220;Perl-like&#8221; regular expressions. 
You just need to use the option <code>perl=TRUE<\/code>.<\/p>\n<h4><span id=\"Examples\" class=\"mw-headline\">Examples<\/span><\/h4>\n<p>If you want to remove all space characters in a string, you can use the <code>\\\\s<\/code> Perl character class with <code>gsub()<\/code>; <code>sub()<\/code> would remove only the first match.<\/p>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"kw2\">gsub<\/span><span class=\"br0\">(<\/span><span class=\"st0\">'<span class=\"es0\">\\\\<\/span>s'<\/span>, <span class=\"st0\">''<\/span>, x, perl <span class=\"sy0\">=<\/span> TRUE<span class=\"br0\">)<\/span> <span class=\"co1\"># removes every whitespace character in x<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h4><span id=\"See_also\" class=\"mw-headline\">See also<\/span><\/h4>\n<ul>\n<li><a title=\"Perl Programming\/Regular Expressions\" href=\"http:\/\/en.wikibooks.org\/wiki\/Perl_Programming\/Regular_Expressions\">Perl Programming\/Regular Expressions<\/a><\/li>\n<\/ul>\n<h2><span id=\"Concatenating_strings\" class=\"mw-headline\">Concatenating strings<\/span><\/h2>\n<ul>\n<li><code>paste()<\/code> concatenates strings.<\/li>\n<li><code>str_c()<\/code> (<b>stringr<\/b>) does 
a similar job.<\/li>\n<li><code>cat()<\/code> prints and concatenates strings.<\/li>\n<\/ul>\n<h3><span id=\"Examples_2\" class=\"mw-headline\">Examples<\/span><\/h3>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">paste<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"toto\"<\/span>,<span class=\"st0\">\"tata\"<\/span>,sep<span class=\"sy0\">=<\/span><span class=\"st0\">' '<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"toto tata\"<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">paste<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"toto\"<\/span>,<span class=\"st0\">\"tata\"<\/span>,sep<span class=\"sy0\">=<\/span><span class=\"st0\">\",\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"toto,tata\"<\/span>\n<span class=\"sy0\">&gt;<\/span> str_c<span class=\"br0\">(<\/span><span class=\"st0\">\"toto\"<\/span>,<span class=\"st0\">\"tata\"<\/span>,sep<span class=\"sy0\">=<\/span><span class=\"st0\">\",\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"toto,tata\"<\/span>\n<span class=\"sy0\">&gt;<\/span> x <span class=\"sy0\">&lt;-<\/span> <span class=\"kw2\">c<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"a\"<\/span>,<span class=\"st0\">\"b\"<\/span>,<span class=\"st0\">\"c\"<\/span><span 
class=\"br0\">)<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">paste<\/span><span class=\"br0\">(<\/span>x,collapse<span class=\"sy0\">=<\/span><span class=\"st0\">\" \"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"a b c\"<\/span>\n<span class=\"sy0\">&gt;<\/span> str_c<span class=\"br0\">(<\/span>x, collapse <span class=\"sy0\">=<\/span> <span class=\"st0\">\" \"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"a b c\"<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">cat<\/span><span class=\"br0\">(<\/span><span class=\"kw2\">c<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"a\"<\/span>,<span class=\"st0\">\"b\"<\/span>,<span class=\"st0\">\"c\"<\/span><span class=\"br0\">)<\/span>, sep <span class=\"sy0\">=<\/span> <span class=\"st0\">\"+\"<\/span><span class=\"br0\">)<\/span>\na<span class=\"sy0\">+<\/span>b<span class=\"sy0\">+<\/span><span class=\"kw2\">c<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h2><span id=\"Splitting_a_string\" class=\"mw-headline\">Splitting a string<\/span><\/h2>\n<ul>\n<li><code>strsplit()<\/code>\u00a0: Split the elements of a character vector \u2018x\u2019 into substrings according to the matches to substring \u2018split\u2019 within them.<\/li>\n<li>See also <code>str_split()<\/code> (<b>stringr<\/b>).<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> <span 
class=\"kw2\">unlist<\/span><span class=\"br0\">(<\/span><span class=\"kw2\">strsplit<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"a.b.c\"<\/span>, <span class=\"st0\">\"<span class=\"es0\">\\\\<\/span>.\"<\/span><span class=\"br0\">)<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"a\"<\/span> <span class=\"st0\">\"b\"<\/span> <span class=\"st0\">\"c\"<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<ul>\n<li><code>tokenize()<\/code> (<b>tau<\/b>) splits a string into tokens.<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> tokenize<span class=\"br0\">(<\/span><span class=\"st0\">\"abc defghk\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"abc\"<\/span>    <span class=\"st0\">\" \"<\/span>      <span class=\"st0\">\"defghk\"<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h2><span id=\"Counting_the_number_of_characters_in_a_string\" class=\"mw-headline\">Counting the number of characters in a string<\/span><\/h2>\n<ul>\n<li><code>nchar()<\/code> gives the number of characters in a string.<\/li>\n<li>See also <code>str_length()<\/code> (<b>stringr<\/b>).<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">nchar<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"abcdef\"<\/span><span 
class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"nu0\">6<\/span>\n<span class=\"sy0\">&gt;<\/span> str_length<span class=\"br0\">(<\/span><span class=\"st0\">\"abcdef\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"nu0\">6<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">nchar<\/span><span class=\"br0\">(<\/span>NA<span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"nu0\">2<\/span>\n<span class=\"sy0\">&gt;<\/span> str_length<span class=\"br0\">(<\/span>NA<span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> NA\n<\/pre>\n<\/div>\n<\/div>\n<h2><span id=\"Detecting_the_presence_of_a_substring\" class=\"mw-headline\">Detecting the presence of a substring<\/span><\/h2>\n<h3><span id=\"Detecting_a_pattern_in_a_string_.3F\" class=\"mw-headline\">Detecting a pattern in a string\u00a0?<\/span><\/h3>\n<ul>\n<li><code>grepl()<\/code> returns a logical value (TRUE or FALSE) indicating whether the pattern matches.<\/li>\n<li><code>str_detect()<\/code> (<b>stringr<\/b>) does a similar job.<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code 
mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> string <span class=\"sy0\">&lt;-<\/span> <span class=\"st0\">\"23 mai 2000\"<\/span>\n<span class=\"sy0\">&gt;<\/span> string2 <span class=\"sy0\">&lt;-<\/span> <span class=\"st0\">\"1 mai 2000\"<\/span>\n<span class=\"sy0\">&gt;<\/span> regexp <span class=\"sy0\">&lt;-<\/span> <span class=\"st0\">\"([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})\"<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">grepl<\/span><span class=\"br0\">(<\/span>pattern <span class=\"sy0\">=<\/span> regexp, x <span class=\"sy0\">=<\/span> string<span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> TRUE\n<span class=\"sy0\">&gt;<\/span> str_detect<span class=\"br0\">(<\/span>string, regexp<span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> TRUE\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">grepl<\/span><span class=\"br0\">(<\/span>pattern <span class=\"sy0\">=<\/span> regexp, x <span class=\"sy0\">=<\/span> string2<span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> FALSE\n<\/pre>\n<\/div>\n<\/div>\n<p>The pattern matches <code>string<\/code> but not <code>string2<\/code>, whose day has only one digit whereas the pattern requires two.<\/p>\n<h3><span id=\"Counting_the_occurrence_of_each_pattern_in_a_string_.3F\" class=\"mw-headline\">Counting the occurrence of each pattern in a string\u00a0?<\/span><\/h3>\n<ul>\n<li><code>textcnt()<\/code> 
(<b>tau<\/b>) counts the occurrence of each pattern or each term in a text.<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> string <span class=\"sy0\">&lt;-<\/span> <span class=\"st0\">\"blabla 23 mai 2000 blabla 18 mai 2004\"<\/span>\n<span class=\"sy0\">&gt;<\/span> textcnt<span class=\"br0\">(<\/span>string,n<span class=\"sy0\">=<\/span>1L,method<span class=\"sy0\">=<\/span><span class=\"st0\">\"string\"<\/span><span class=\"br0\">)<\/span>\nblabla    mai \n     <span class=\"nu0\">2<\/span>      <span class=\"nu0\">2<\/span> \n<span class=\"kw2\">attr<\/span><span class=\"br0\">(<\/span>,<span class=\"st0\">\"class\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"textcnt\"<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h2><span id=\"Extracting_the_position_of_a_substring_or_a_pattern_in_a_string\" class=\"mw-headline\">Extracting the position of a substring or a pattern in a string<\/span><\/h2>\n<h3><span id=\"Extracting_the_position_of_a_substring_.3F\" class=\"mw-headline\">Extracting the position of a substring\u00a0?<\/span><\/h3>\n<ul>\n<li><code>cpos()<\/code> 
(<b>cwhmisc<\/b>) returns the position of a substring in a string.<\/li>\n<li><code>substring.location()<\/code> (<b>cwhmisc<\/b>) does the same job but returns the first and the last position.<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"> \n<span class=\"sy0\">&gt;<\/span> cpos<span class=\"br0\">(<\/span><span class=\"st0\">\"abcdefghijklmnopqrstuvwxyz\"<\/span>,<span class=\"st0\">\"p\"<\/span>,<span class=\"kw7\">start<\/span><span class=\"sy0\">=<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"nu0\">16<\/span>\n<span class=\"sy0\">&gt;<\/span> substring.<span class=\"me1\">location<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"abcdefghijklmnopqrstuvwxyz\"<\/span>,<span class=\"st0\">\"def\"<\/span><span class=\"br0\">)<\/span>\n$first\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"nu0\">4<\/span>\n \n$last\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"nu0\">6<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h3><span id=\"Extracting_the_position_of_a_pattern_in_a_string_.3F\" class=\"mw-headline\">Extracting the position of a pattern in a string\u00a0?<\/span><\/h3>\n<ul>\n<li><code>regexpr()<\/code> returns the position of the first match of the regular expression. 
<code>str_locate()<\/code> (<b>stringr<\/b>) does the same job. <code>gregexpr()<\/code> is similar to <code>regexpr()<\/code> but returns the starting position of every match. <code>str_locate_all()<\/code> (<b>stringr<\/b>) does the same job.<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> regexp <span class=\"sy0\">&lt;-<\/span> <span class=\"st0\">\"([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})\"<\/span>\n<span class=\"sy0\">&gt;<\/span> string <span class=\"sy0\">&lt;-<\/span> <span class=\"st0\">\"blabla 23 mai 2000 blabla 18 mai 2004\"<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">regexpr<\/span><span class=\"br0\">(<\/span>pattern <span class=\"sy0\">=<\/span> regexp, <span class=\"kw4\">text<\/span> <span class=\"sy0\">=<\/span> string<span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"nu0\">8<\/span>\n<span class=\"kw2\">attr<\/span><span class=\"br0\">(<\/span>,<span class=\"st0\">\"match.length\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"nu0\">11<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">gregexpr<\/span><span class=\"br0\">(<\/span>pattern <span class=\"sy0\">=<\/span> regexp, <span class=\"kw4\">text<\/span> <span class=\"sy0\">=<\/span> string<span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span><span class=\"br0\">]<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span>  <span class=\"nu0\">8<\/span> <span class=\"nu0\">27<\/span>\n<span class=\"kw2\">attr<\/span><span class=\"br0\">(<\/span>,<span class=\"st0\">\"match.length\"<\/span><span class=\"br0\">)<\/span>\n<span 
class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"nu0\">11<\/span> <span class=\"nu0\">11<\/span>\n<span class=\"sy0\">&gt;<\/span> str_locate<span class=\"br0\">(<\/span>string,regexp<span class=\"br0\">)<\/span>\n     <span class=\"kw7\">start<\/span> <span class=\"kw7\">end<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span>,<span class=\"br0\">]<\/span>     <span class=\"nu0\">8<\/span>  <span class=\"nu0\">18<\/span>\n<span class=\"sy0\">&gt;<\/span> str_locate_all<span class=\"br0\">(<\/span>string,regexp<span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span><span class=\"br0\">]<\/span>\n     <span class=\"kw7\">start<\/span> <span class=\"kw7\">end<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span>,<span class=\"br0\">]<\/span>     <span class=\"nu0\">8<\/span>  <span class=\"nu0\">18<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">2<\/span>,<span class=\"br0\">]<\/span>    <span class=\"nu0\">27<\/span>  <span class=\"nu0\">37<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h2><span id=\"Extracting_a_substring_from_a_string\" class=\"mw-headline\">Extracting a substring from a string<\/span><\/h2>\n<h3><span id=\"Extracting_a_fixed_width_substring_.3F\" class=\"mw-headline\">Extracting a fixed width substring\u00a0?<\/span><\/h3>\n<ul>\n<li><code>substr()<\/code> extracts a substring given its start and stop positions.<\/li>\n<li><code>str_sub()<\/code> (<b>stringr<\/b>) is similar.<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">substr<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"simple text\"<\/span>,<span class=\"nu0\">1<\/span>,<span class=\"nu0\">3<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"sim\"<\/span>\n<span class=\"sy0\">&gt;<\/span> str_sub<span class=\"br0\">(<\/span><span class=\"st0\">\"simple text\"<\/span>,<span class=\"nu0\">1<\/span>,<span class=\"nu0\">3<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"sim\"<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h3><span id=\"Extracting_the_first_word_in_a_string_.3F\" class=\"mw-headline\">Extracting the first word in a string\u00a0?<\/span><\/h3>\n<ul>\n<li><code>first.word()<\/code> (<b>Hmisc<\/b>) returns the first word of a string or expression.<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> first.<span class=\"me1\">word<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"abc def 
ghk\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"abc\"<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h3><span id=\"Extracting_a_pattern_in_a_string_.3F\" class=\"mw-headline\">Extracting a pattern in a string\u00a0?<\/span><\/h3>\n<ul>\n<li><code>grep()<\/code> returns the elements of <code>x<\/code> that match the regular expression if <code>value=T<\/code> and their indices in <code>x<\/code> if <code>value=F<\/code>.<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">grep<\/span><span class=\"br0\">(<\/span>pattern <span class=\"sy0\">=<\/span> regexp, x <span class=\"sy0\">=<\/span> string , value <span class=\"sy0\">=<\/span> <span class=\"kw2\">T<\/span><span class=\"br0\">)<\/span> \n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"23 mai 2000\"<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">grep<\/span><span class=\"br0\">(<\/span>pattern <span class=\"sy0\">=<\/span> regexp, x <span class=\"sy0\">=<\/span> string2 , value <span class=\"sy0\">=<\/span> <span class=\"kw2\">T<\/span><span class=\"br0\">)<\/span> \n<span class=\"kw2\">character<\/span><span class=\"br0\">(<\/span><span class=\"nu0\">0<\/span><span class=\"br0\">)<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">grep<\/span><span class=\"br0\">(<\/span>pattern <span class=\"sy0\">=<\/span> regexp, x <span class=\"sy0\">=<\/span> string , value <span class=\"sy0\">=<\/span> <span 
class=\"kw2\">F<\/span><span class=\"br0\">)<\/span> \n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"nu0\">1<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">grep<\/span><span class=\"br0\">(<\/span>pattern <span class=\"sy0\">=<\/span> regexp, x <span class=\"sy0\">=<\/span> string2 , value <span class=\"sy0\">=<\/span> <span class=\"kw2\">F<\/span><span class=\"br0\">)<\/span> \n<span class=\"kw2\">integer<\/span><span class=\"br0\">(<\/span><span class=\"nu0\">0<\/span><span class=\"br0\">)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<ul>\n<li><code>str_extract()<\/code>, <code>str_extract_all()<\/code>, <code>str_match()<\/code>, <code>str_match_all()<\/code> (<b>stringr<\/b>) and <code>m()<\/code> (<b>caroline<\/b> package) extract the matched text itself. <code>str_extract()<\/code> returns a character vector and <code>str_extract_all()<\/code> a list of character vectors. <code>str_match()<\/code> returns a matrix, <code>str_match_all()<\/code> a list of matrices, and <code>m()<\/code> a data frame.<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">library<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"stringr\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"sy0\">&gt;<\/span> regexp <span class=\"sy0\">&lt;-<\/span> <span class=\"st0\">\"([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})\"<\/span>\n<span class=\"sy0\">&gt;<\/span> string <span class=\"sy0\">&lt;-<\/span> <span class=\"st0\">\"blabla 23 mai 2000 blabla 18 mai 2004\"<\/span>\n<span class=\"sy0\">&gt;<\/span> str_extract<span class=\"br0\">(<\/span>string,regexp<span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"23 mai 2000\"<\/span>\n<span class=\"sy0\">&gt;<\/span> str_extract_all<span class=\"br0\">(<\/span>string,regexp<span 
class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span><span class=\"br0\">]<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"23 mai 2000\"<\/span> <span class=\"st0\">\"18 mai 2004\"<\/span>\n \n<span class=\"sy0\">&gt;<\/span> str_match<span class=\"br0\">(<\/span>string,regexp<span class=\"br0\">)<\/span>\n     <span class=\"br0\">[<\/span>,<span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span>          <span class=\"br0\">[<\/span>,<span class=\"nu0\">2<\/span><span class=\"br0\">]<\/span> <span class=\"br0\">[<\/span>,<span class=\"nu0\">3<\/span><span class=\"br0\">]<\/span>  <span class=\"br0\">[<\/span>,<span class=\"nu0\">4<\/span><span class=\"br0\">]<\/span>  \n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span>,<span class=\"br0\">]<\/span> <span class=\"st0\">\"23 mai 2000\"<\/span> <span class=\"st0\">\"23\"<\/span> <span class=\"st0\">\"mai\"<\/span> <span class=\"st0\">\"2000\"<\/span>\n<span class=\"sy0\">&gt;<\/span> str_match_all<span class=\"br0\">(<\/span>string,regexp<span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span><span class=\"br0\">]<\/span>\n     <span class=\"br0\">[<\/span>,<span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span>          <span class=\"br0\">[<\/span>,<span class=\"nu0\">2<\/span><span class=\"br0\">]<\/span> <span class=\"br0\">[<\/span>,<span class=\"nu0\">3<\/span><span class=\"br0\">]<\/span>  <span class=\"br0\">[<\/span>,<span class=\"nu0\">4<\/span><span class=\"br0\">]<\/span>  \n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span>,<span class=\"br0\">]<\/span> <span class=\"st0\">\"23 mai 2000\"<\/span> <span class=\"st0\">\"23\"<\/span> <span class=\"st0\">\"mai\"<\/span> <span class=\"st0\">\"2000\"<\/span>\n<span class=\"br0\">[<\/span><span 
class=\"nu0\">2<\/span>,<span class=\"br0\">]<\/span> <span class=\"st0\">\"18 mai 2004\"<\/span> <span class=\"st0\">\"18\"<\/span> <span class=\"st0\">\"mai\"<\/span> <span class=\"st0\">\"2004\"<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">library<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"caroline\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"sy0\">&gt;<\/span> m<span class=\"br0\">(<\/span>pattern <span class=\"sy0\">=<\/span> regexp, vect <span class=\"sy0\">=<\/span> string, <span class=\"kw2\">names<\/span> <span class=\"sy0\">=<\/span> <span class=\"kw2\">c<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"day\"<\/span>,<span class=\"st0\">\"month\"<\/span>,<span class=\"st0\">\"year\"<\/span><span class=\"br0\">)<\/span>, types <span class=\"sy0\">=<\/span> <span class=\"kw2\">rep<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"character\"<\/span>,<span class=\"nu0\">3<\/span><span class=\"br0\">)<\/span><span class=\"br0\">)<\/span>\n  day month year\n<span class=\"nu0\">1<\/span>  <span class=\"nu0\">18<\/span>   mai <span class=\"nu0\">2004<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h2><span id=\"Making_some_substitution_inside_a_string\" class=\"mw-headline\">Making some substitution inside a string<\/span><\/h2>\n<h3><span id=\"Substituting_a_pattern_in_a_string\" class=\"mw-headline\">Substituting a pattern in a string<\/span><\/h3>\n<ul>\n<li><code>sub()<\/code> replaces the first occurrence of a pattern.<\/li>\n<li><code>gsub()<\/code> is similar to <code>sub()<\/code> but replaces all occurrences of the pattern.<\/li>\n<li><code>str_replace()<\/code> (<b>stringr<\/b>) is similar to <code>sub()<\/code>.<\/li>\n<\/ul>\n<p>In the following example, we have a French date. The pattern is: 2 digits, a blank, some letters, a blank, 4 digits. We capture the 2 digits with the <code>[[:digit:]]{2}<\/code> expression, the letters with <code>[[:alpha:]]+<\/code> and the 4 digits with <code>[[:digit:]]{4}<\/code>. Each of these three subexpressions is surrounded by parentheses. In the replacement, <code>\"\\\\1\"<\/code> refers to the first captured substring, <code>\"\\\\2\"<\/code> to the second and <code>\"\\\\3\"<\/code> to the third.<\/p>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\">string <span class=\"sy0\">&lt;-<\/span> <span class=\"st0\">\"23 mai 2000\"<\/span>\nregexp <span class=\"sy0\">&lt;-<\/span> <span class=\"st0\">\"([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})\"<\/span>\n<span class=\"kw2\">sub<\/span><span class=\"br0\">(<\/span>pattern <span class=\"sy0\">=<\/span> regexp, replacement <span class=\"sy0\">=<\/span> <span class=\"st0\">\"<span class=\"es0\">\\\\<\/span>1\"<\/span>, x <span class=\"sy0\">=<\/span> string<span class=\"br0\">)<\/span> <span class=\"co1\"># returns the first part of the regular expression<\/span>\n<span class=\"kw2\">sub<\/span><span class=\"br0\">(<\/span>pattern <span class=\"sy0\">=<\/span> regexp, replacement <span class=\"sy0\">=<\/span> <span class=\"st0\">\"<span class=\"es0\">\\\\<\/span>2\"<\/span>, x <span class=\"sy0\">=<\/span> string<span 
class=\"br0\">)<\/span> <span class=\"co1\"># returns the second part<\/span>\n<span class=\"kw2\">sub<\/span><span class=\"br0\">(<\/span>pattern <span class=\"sy0\">=<\/span> regexp, replacement <span class=\"sy0\">=<\/span> <span class=\"st0\">\"<span class=\"es0\">\\\\<\/span>3\"<\/span>, x <span class=\"sy0\">=<\/span> string<span class=\"br0\">)<\/span> <span class=\"co1\"># returns the third part<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<p>In the following example, we compare the outcome of <code>sub()<\/code> and <code>gsub()<\/code>. The first one removes the first space whereas the second one removes all spaces in the text.<\/p>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> <span class=\"kw4\">text<\/span> <span class=\"sy0\">&lt;-<\/span> <span class=\"st0\">\"abc def ghk\"<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">sub<\/span><span class=\"br0\">(<\/span>pattern <span class=\"sy0\">=<\/span> <span class=\"st0\">\" \"<\/span>, replacement <span class=\"sy0\">=<\/span> <span class=\"st0\">\"\"<\/span>,  x <span class=\"sy0\">=<\/span> <span class=\"kw4\">text<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"abcdef ghk\"<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">gsub<\/span><span class=\"br0\">(<\/span>pattern <span class=\"sy0\">=<\/span> <span class=\"st0\">\" \"<\/span>, replacement <span class=\"sy0\">=<\/span> <span class=\"st0\">\"\"<\/span>,  x <span class=\"sy0\">=<\/span> <span class=\"kw4\">text<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"abcdefghk\"<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h3><span id=\"Substituting_characters_in_a_string_.3F\" class=\"mw-headline\">Substituting characters in a 
string\u00a0?<\/span><\/h3>\n<ul>\n<li><code>chartr()<\/code> substitutes characters in an expression. It stands for &#8220;character translation&#8221;.<\/li>\n<li><code>replacechar()<\/code> (<b>cwhmisc<\/b>) does the same job &#8230;<\/li>\n<li>as well as <code>str_replace_all()<\/code> (<b>stringr<\/b>).<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">chartr<\/span><span class=\"br0\">(<\/span>old<span class=\"sy0\">=<\/span><span class=\"st0\">\"a\"<\/span>,<span class=\"kw6\">new<\/span><span class=\"sy0\">=<\/span><span class=\"st0\">\"o\"<\/span>,x<span class=\"sy0\">=<\/span><span class=\"st0\">\"baba\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"bobo\"<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">chartr<\/span><span class=\"br0\">(<\/span>old<span class=\"sy0\">=<\/span><span class=\"st0\">\"ab\"<\/span>,<span class=\"kw6\">new<\/span><span class=\"sy0\">=<\/span><span class=\"st0\">\"ot\"<\/span>,x<span class=\"sy0\">=<\/span><span class=\"st0\">\"baba\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"toto\"<\/span>\n<span class=\"sy0\">&gt;<\/span> replacechar<span class=\"br0\">(<\/span><span class=\"st0\">\"abc.def.ghi.jkl\"<\/span>,<span class=\"st0\">\".\"<\/span>,<span class=\"st0\">\"_\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span 
class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"abc_def_ghi_jkl\"<\/span>\n<span class=\"sy0\">&gt;<\/span> str_replace_all<span class=\"br0\">(<\/span><span class=\"st0\">\"abc.def.ghi.jkl\"<\/span>,<span class=\"st0\">\"<span class=\"es0\">\\\\<\/span>.\"<\/span>,<span class=\"st0\">\"_\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"abc_def_ghi_jkl\"<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h2><span id=\"Converting_letters_to_lower_or_upper-case\" class=\"mw-headline\">Converting letters to lower or upper-case<\/span><\/h2>\n<ul>\n<li><code>tolower()<\/code> converts upper-case characters to lower-case.<\/li>\n<li><code>toupper()<\/code> converts lower-case characters to upper-case.<\/li>\n<li><code>capitalize()<\/code> (<b>Hmisc<\/b>) capitalizes the first letter of a string.<\/li>\n<li>See also <code>cap()<\/code>, <code>capitalize()<\/code>, <code>lower()<\/code>, <code>lowerize()<\/code> and <code>CapLeading()<\/code> in the <b>cwhmisc<\/b> package.<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">tolower<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"ABCdef\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"abcdef\"<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">toupper<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"ABCdef\"<\/span><span 
class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"ABCDEF\"<\/span>\n<span class=\"sy0\">&gt;<\/span> capitalize<span class=\"br0\">(<\/span><span class=\"st0\">\"abcdef\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"Abcdef\"<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h2><span id=\"Filling_a_string_with_some_character\" class=\"mw-headline\">Filling a string with some character<\/span><\/h2>\n<ul>\n<li><code>padding()<\/code> (<b>cwhmisc<\/b>) fills a string with some characters to fit a given length. 
See also <code>str_pad()<\/code> (<b>stringr<\/b>).<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">library<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"cwhmisc\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"sy0\">&gt;<\/span> padding<span class=\"br0\">(<\/span><span class=\"st0\">\"abc\"<\/span>,<span class=\"nu0\">10<\/span>,<span class=\"st0\">\" \"<\/span>,<span class=\"st0\">\"center\"<\/span><span class=\"br0\">)<\/span> <span class=\"co1\"># adds blanks such that the length of the string is 10.<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"   abc    \"<\/span>\n<span class=\"sy0\">&gt;<\/span> str_pad<span class=\"br0\">(<\/span><span class=\"st0\">\"abc\"<\/span>,width<span class=\"sy0\">=<\/span><span class=\"nu0\">10<\/span>,side<span class=\"sy0\">=<\/span><span class=\"st0\">\"both\"<\/span>, pad <span class=\"sy0\">=<\/span> <span class=\"st0\">\"+\"<\/span><span class=\"br0\">)<\/span> <span class=\"co1\"># side is one of \"left\", \"right\" or \"both\"<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"+++abc++++\"<\/span>\n<span class=\"sy0\">&gt;<\/span> str_pad<span class=\"br0\">(<\/span><span class=\"kw2\">c<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"1\"<\/span>,<span class=\"st0\">\"11\"<\/span>,<span class=\"st0\">\"111\"<\/span>,<span class=\"st0\">\"1111\"<\/span><span class=\"br0\">)<\/span>,<span class=\"nu0\">3<\/span>,side<span class=\"sy0\">=<\/span><span class=\"st0\">\"left\"<\/span>,pad<span class=\"sy0\">=<\/span><span class=\"st0\">\"0\"<\/span><span class=\"br0\">)<\/span> \n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"001\"<\/span>  <span class=\"st0\">\"011\"<\/span>  <span class=\"st0\">\"111\"<\/span>  <span 
class=\"st0\">\"1111\"<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<p>Note that <code>str_pad()<\/code> was very slow when this was measured: padding a vector of length 10,000 took about 51 seconds of user time in the run below. <code>padding()<\/code> does not seem to handle character vectors, so calling it through <code>sapply()<\/code> may be the best solution; the same job then took under 8 seconds.<\/p>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> library<span class=\"br0\">(<\/span><span class=\"st0\">\"stringr\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"sy0\">&gt;<\/span> library<span class=\"br0\">(<\/span><span class=\"st0\">\"cwhmisc\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"sy0\">&gt;<\/span> a <span class=\"sy0\">&lt;-<\/span> <span class=\"kw2\">rep<\/span><span class=\"br0\">(<\/span><span class=\"nu0\">1<\/span>,<span class=\"nu0\">10<\/span><span class=\"sy0\">^<\/span><span class=\"nu0\">4<\/span><span class=\"br0\">)<\/span>\n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">system.<span class=\"me1\">time<\/span><\/span><span class=\"br0\">(<\/span>b <span class=\"sy0\">&lt;-<\/span> str_pad<span class=\"br0\">(<\/span>a,<span class=\"nu0\">3<\/span>,side<span class=\"sy0\">=<\/span><span class=\"st0\">\"left\"<\/span>,pad<span class=\"sy0\">=<\/span><span class=\"st0\">\"0\"<\/span><span class=\"br0\">)<\/span><span class=\"br0\">)<\/span>\n       user      system     elapsed \n     <span class=\"nu0\">50.968<\/span>       <span class=\"nu0\">0.208<\/span>      <span class=\"nu0\">73.322<\/span> \n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">system.<span class=\"me1\">time<\/span><\/span><span class=\"br0\">(<\/span><span class=\"kw2\">c<\/span> <span class=\"sy0\">&lt;-<\/span> <span class=\"kw2\">sapply<\/span><span class=\"br0\">(<\/span>a, padding, space <span class=\"sy0\">=<\/span> <span class=\"nu0\">3<\/span>, <span 
class=\"kw2\">with<\/span> <span class=\"sy0\">=<\/span> <span class=\"st0\">\"0\"<\/span>, to <span class=\"sy0\">=<\/span> <span class=\"st0\">\"left\"<\/span><span class=\"br0\">)<\/span><span class=\"br0\">)<\/span>\n       user      system     elapsed \n      <span class=\"nu0\">7.700<\/span>       <span class=\"nu0\">0.020<\/span>      <span class=\"nu0\">12.206<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h2><span id=\"Removing_leading_and_trailing_spaces\" class=\"mw-headline\">Removing leading and trailing spaces<\/span><\/h2>\n<ul>\n<li><code>trimws()<\/code> (<b>memisc<\/b> package) trims leading and trailing white space.<\/li>\n<li><code>trim()<\/code> (<b>gdata<\/b> package) does the same job.<\/li>\n<li>See also <code>str_trim()<\/code> (<b>stringr<\/b>).<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">library<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"memisc\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"sy0\">&gt;<\/span> trimws<span class=\"br0\">(<\/span><span class=\"st0\">\"  abc def   \"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"abc def\"<\/span> \n<span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">library<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"gdata\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"sy0\">&gt;<\/span> trim<span class=\"br0\">(<\/span><span class=\"st0\">\" abc def \"<\/span><span class=\"br0\">)<\/span>\n<span 
class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"abc def\"<\/span>\n<span class=\"sy0\">&gt;<\/span> str_trim<span class=\"br0\">(<\/span><span class=\"st0\">\"  abd def  \"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"abd def\"<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h2><span id=\"Comparing_two_strings\" class=\"mw-headline\">Comparing two strings<\/span><\/h2>\n<h3><span id=\"Assessing_if_they_are_identical\" class=\"mw-headline\">Assessing if they are identical<\/span><\/h3>\n<ul>\n<li><code>==<\/code> returns <code>TRUE<\/code> if both strings are the same and <code>FALSE<\/code> otherwise.<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> <span class=\"st0\">\"abc\"<\/span><span class=\"sy0\">==<\/span><span class=\"st0\">\"abc\"<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> TRUE\n<span class=\"sy0\">&gt;<\/span> <span class=\"st0\">\"abc\"<\/span><span class=\"sy0\">==<\/span><span class=\"st0\">\"abd\"<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> FALSE\n<\/pre>\n<\/div>\n<\/div>\n<h3><span 
id=\"Computing_distance_between_strings\" class=\"mw-headline\">Computing distance between strings<\/span><\/h3>\n<p>A few packages implement the <a class=\"extiw\" title=\"w:Levenshtein distance\" href=\"http:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance\" target=\"_blank\" rel=\"noopener\">Levenshtein distance<\/a> between two strings:<\/p>\n<ul>\n<li><code>adist()<\/code> in the base package <b>utils<\/b><\/li>\n<li><code>stringMatch()<\/code> in <b>MiscPsycho<\/b><\/li>\n<li><code>stringdist()<\/code> in <b>stringdist<\/b><\/li>\n<li><code>levenshteinDist()<\/code> in <b>RecordLinkage<\/b><\/li>\n<\/ul>\n<p>A benchmark comparing the speed of <code>levenshteinDist()<\/code> and <code>stringdist()<\/code> is available here: <a class=\"external autonumber\" href=\"http:\/\/www.markvanderloo.eu\/yaRb\/2013\/09\/07\/a-bit-of-benchmarking-with-string-distances\/\" target=\"_blank\" rel=\"nofollow noopener\">[1]<\/a>.<\/p>\n<h4><span id=\"Example_with_utils\" class=\"mw-headline\">Example with utils<\/span><\/h4>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> adist<span class=\"br0\">(<\/span><span class=\"st0\">\"test\"<\/span>,<span class=\"st0\">\"tester\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span 
class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"nu0\">2<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h4><span id=\"Example_with_MiscPsycho\" class=\"mw-headline\">Example with MiscPsycho<\/span><\/h4>\n<p><code>stringMatch()<\/code> (<b>MiscPsycho<\/b>) computes the Levenshtein distance between two strings. If <code>normalize=\"YES\"<\/code>, the distance is divided by the length of the longer string.<\/p>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">library<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"MiscPsycho\"<\/span><span class=\"br0\">)<\/span>\n<span class=\"sy0\">&gt;<\/span> stringMatch<span class=\"br0\">(<\/span><span class=\"st0\">\"test\"<\/span>,<span class=\"st0\">\"tester\"<\/span>,normalize<span class=\"sy0\">=<\/span><span class=\"st0\">\"NO\"<\/span>,penalty<span class=\"sy0\">=<\/span><span class=\"nu0\">1<\/span>,case.<span class=\"me1\">sensitive<\/span> <span class=\"sy0\">=<\/span> TRUE<span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"nu0\">2<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h4><span id=\"Approximate_matching\" class=\"mw-headline\">Approximate matching<\/span><\/h4>\n<p><code>agrep()<\/code> searches for approximate matches using 
the <a class=\"extiw\" title=\"w:Levenshtein distance\" href=\"http:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance\" target=\"_blank\" rel=\"noopener\">Levenshtein distance<\/a>.<\/p>\n<ul>\n<li>If <code>value = TRUE<\/code>, it returns the matching strings themselves.<\/li>\n<li>If <code>value = FALSE<\/code>, it returns their positions in <code>x<\/code>.<\/li>\n<li><code>max<\/code> (short for <code>max.distance<\/code>) sets the maximum distance allowed for a match.<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span>  <span class=\"kw2\">agrep<\/span><span class=\"br0\">(<\/span>pattern <span class=\"sy0\">=<\/span> <span class=\"st0\">\"laysy\"<\/span>, x <span class=\"sy0\">=<\/span> <span class=\"kw2\">c<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"1 lazy\"<\/span>, <span class=\"st0\">\"1\"<\/span>, <span class=\"st0\">\"1 LAZY\"<\/span><span class=\"br0\">)<\/span>, <span class=\"kw2\">max<\/span> <span class=\"sy0\">=<\/span> <span class=\"nu0\">2<\/span>, value <span class=\"sy0\">=<\/span> TRUE<span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"1 lazy\"<\/span>\n<span class=\"sy0\">&gt;<\/span>  <span class=\"kw2\">agrep<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"laysy\"<\/span>, <span class=\"kw2\">c<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"1 lazy\"<\/span>, <span class=\"st0\">\"1\"<\/span>, <span class=\"st0\">\"1 LAZY\"<\/span><span class=\"br0\">)<\/span>, <span class=\"kw2\">max<\/span> <span class=\"sy0\">=<\/span> <span class=\"nu0\">3<\/span>, value <span class=\"sy0\">=<\/span> TRUE<span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"1 lazy\"<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h2><span id=\"Miscellaneous\" class=\"mw-headline\">Miscellaneous<\/span>
<\/h2>\n<ul>\n<li><code>deparse()<\/code> turns unevaluated expressions into character strings.<\/li>\n<li><code>char.expand()<\/code> (<b>base<\/b>) expands a string with respect to a target.<\/li>\n<li><code>pmatch()<\/code> (<b>base<\/b>) and <code>charmatch()<\/code> (<b>base<\/b>) seek partial matches for the elements of their first argument among those of their second.<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">pmatch<\/span><span class=\"br0\">(<\/span><span class=\"kw2\">c<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"a\"<\/span>,<span class=\"st0\">\"b\"<\/span>,<span class=\"st0\">\"c\"<\/span>,<span class=\"st0\">\"d\"<\/span><span class=\"br0\">)<\/span>,<span class=\"kw2\">table<\/span> <span class=\"sy0\">=<\/span> <span class=\"kw2\">c<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"b\"<\/span>,<span class=\"st0\">\"c\"<\/span><span class=\"br0\">)<\/span>, nomatch <span class=\"sy0\">=<\/span> <span class=\"nu0\">0<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"nu0\">0<\/span> <span class=\"nu0\">1<\/span> <span class=\"nu0\">2<\/span> <span class=\"nu0\">0<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<ul>\n<li><code>make.unique()<\/code> makes the elements of a character vector unique by appending sequence numbers to duplicates. 
This is useful if you want to use a string as an identifier in your data.<\/li>\n<\/ul>\n<div class=\"mw-geshi mw-code mw-content-ltr\" dir=\"ltr\">\n<div class=\"rsplus source-rsplus\">\n<pre class=\"de1\"><span class=\"sy0\">&gt;<\/span> <span class=\"kw2\">make.<span class=\"me1\">unique<\/span><\/span><span class=\"br0\">(<\/span><span class=\"kw2\">c<\/span><span class=\"br0\">(<\/span><span class=\"st0\">\"a\"<\/span>, <span class=\"st0\">\"a\"<\/span>, <span class=\"st0\">\"a\"<\/span><span class=\"br0\">)<\/span><span class=\"br0\">)<\/span>\n<span class=\"br0\">[<\/span><span class=\"nu0\">1<\/span><span class=\"br0\">]<\/span> <span class=\"st0\">\"a\"<\/span>   <span class=\"st0\">\"a.1\"<\/span> <span class=\"st0\">\"a.2\"<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<h2><span id=\"References\" class=\"mw-headline\">References<\/span><\/h2>\n<div class=\"reflist references references-column-count references-column-count-2\">\n<ol class=\"references\">\n<li id=\"cite_note-stringr-1\"><span class=\"mw-cite-backlink\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#cite_ref-stringr_1-0\"><span class=\"cite-accessibility-label\">Jump up<\/span>\u2191<\/a><\/span> <span class=\"reference-text\">Hadley Wickham, &#8220;stringr: modern, consistent string processing&#8221;, The R Journal, December 2010, Vol 2\/2, <a class=\"external free\" href=\"http:\/\/journal.r-project.org\/archive\/2010-2\/RJournal_2010-2_Wickham.pdf\" target=\"_blank\" rel=\"nofollow noopener\">http:\/\/journal.r-project.org\/archive\/2010-2\/RJournal_2010-2_Wickham.pdf<\/a><\/span><\/li>\n<li id=\"cite_note-2\"><span class=\"mw-cite-backlink\"><a 
href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#cite_ref-2\"><span class=\"cite-accessibility-label\">Jump up<\/span>\u2191<\/a><\/span> <span class=\"reference-text\"><a class=\"external free\" href=\"http:\/\/cran.r-project.org\/web\/views\/NaturalLanguageProcessing.html\" target=\"_blank\" rel=\"nofollow noopener\">http:\/\/cran.r-project.org\/web\/views\/NaturalLanguageProcessing.html<\/a><\/span><\/li>\n<li id=\"cite_note-3\"><span class=\"mw-cite-backlink\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#cite_ref-3\"><span class=\"cite-accessibility-label\">Jump up<\/span>\u2191<\/a><\/span> <span class=\"reference-text\">In former versions (&lt; 2.10) we had also basic regular expressions in <b>R<\/b>\u00a0:<\/span><\/li>\n<\/ol>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>This page includes all the material you need to deal with strings in R. The section on regular expressions may be useful to understand the&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-767","post","type-post","status-publish","format-standard","hentry","category-r"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/767","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=767"}],"version-history":[{"count":0,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/767\/revisions"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2
\/media?parent=767"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=767"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=767"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}