{"id":293,"date":"2013-11-03T17:25:44","date_gmt":"2013-11-03T22:25:44","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=293"},"modified":"2013-11-03T17:25:44","modified_gmt":"2013-11-03T22:25:44","slug":"r-programmingtext-processing","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2013\/11\/03\/r-programmingtext-processing\/","title":{"rendered":"R Programming\/Text Processing"},"content":{"rendered":"<p><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing\">http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing<\/a><\/p>\n<p>This page covers the main tools you need to work with strings in R. The section on regular expressions helps in understanding the rest of the page, but you can skip it if you only need to perform simple tasks.<\/p>\n<p>This page may be useful if you want to:<\/p>\n<ul>\n<li>perform statistical text analysis.<\/li>\n<li>collect data from an unformatted text file.<\/li>\n<li>deal with character variables.<\/li>\n<\/ul>\n<p>On this page, we learn how to read a text file and how to use R functions for character data. There are two kinds of functions for characters: simple functions and functions based on regular expressions. Many of them are part of the standard R\u00a0<b>base<\/b>\u00a0package.<\/p>\n<div dir=\"ltr\">\n<div>\n<pre>help.search(keyword = \"character\", package = \"base\")<\/pre>\n<\/div>\n<\/div>\n<p>However, their names and syntax are not intuitive to all users. 
Hadley Wickham has developed the\u00a0<b>stringr<\/b>\u00a0package, which defines functions with similar behaviour but with names that are easier to remember and a much more systematic syntax<sup id=\"cite_ref-stringr_1-0\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#cite_note-stringr-1\">[1]<\/a><\/sup>.<\/p>\n<ul>\n<li>Keywords:\u00a0<i>text mining<\/i>,\u00a0<i>natural language processing<\/i><\/li>\n<li>See the CRAN Task View on Natural Language Processing<sup id=\"cite_ref-2\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#cite_note-2\">[2]<\/a><\/sup><\/li>\n<li>See also the following packages:\u00a0<b>tm<\/b>,\u00a0<b>tau<\/b>,\u00a0<b>languageR<\/b>,\u00a0<b>scrapeR<\/b>.<\/li>\n<\/ul>\n<div id=\"toc\">\n<div id=\"toctitle\">\n<h2>Contents<\/h2>\n<\/div>\n<ul>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Reading_and_writing_text_files\">1\u00a0Reading and writing text files<\/a><\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Character_encoding\">2\u00a0Character encoding<\/a>\n<ul>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Example\">2.1\u00a0Example<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Regular_Expressions\">3\u00a0Regular Expressions<\/a>\n<ul>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Functions_which_use_regular_expressions_in_R\">3.1\u00a0Functions which use regular expressions in R<\/a><\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Extended_regular_expressions_.28The_default.29\">3.2\u00a0Extended regular expressions (The default)<\/a><\/li>\n<li><a 
href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Perl-like_regular_expressions\">3.3\u00a0Perl-like regular expressions<\/a>\n<ul>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Examples\">3.3.1\u00a0Examples<\/a><\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#See_also\">3.3.2\u00a0See also<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Concatenating_strings\">4\u00a0Concatenating strings<\/a>\n<ul>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Examples_2\">4.1\u00a0Examples<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Splitting_a_string\">5\u00a0Splitting a string<\/a><\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Counting_the_number_of_characters_in_a_string\">6\u00a0Counting the number of characters in a string<\/a><\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Detecting_the_presence_of_a_substring\">7\u00a0Detecting the presence of a substring<\/a>\n<ul>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Detecting_a_pattern_in_a_string_.3F\">7.1\u00a0Detecting a pattern in a string\u00a0?<\/a><\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Counting_the_occurrence_of_each_pattern_in_a_string_.3F\">7.2\u00a0Counting the occurrence of each pattern in a string\u00a0?<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Extracting_the_position_of_a_substring_or_a_pattern_in_a_string\">8\u00a0Extracting the position of a substring or a pattern in a string<\/a>\n<ul>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Extracting_the_position_of_a_substring_.3F\">8.1\u00a0Extracting the 
position of a substring\u00a0?<\/a><\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Extracting_the_position_of_a_pattern_in_a_string_.3F\">8.2\u00a0Extracting the position of a pattern in a string\u00a0?<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Extracting_a_substring_from_a_string\">9\u00a0Extracting a substring from a string<\/a>\n<ul>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Extracting_a_fixed_width_substring_.3F\">9.1\u00a0Extracting a fixed width substring\u00a0?<\/a><\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Extracting_the_first_word_in_a_string_.3F\">9.2\u00a0Extracting the first word in a string\u00a0?<\/a><\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Extracting_a_pattern_in_a_string_.3F\">9.3\u00a0Extracting a pattern in a string\u00a0?<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Making_some_substitution_inside_a_string\">10\u00a0Making some substitution inside a string<\/a>\n<ul>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Substituting_a_pattern_in_a_string\">10.1\u00a0Substituting a pattern in a string<\/a><\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Substituting_characters_in_a_string_.3F\">10.2\u00a0Substituting characters in a string\u00a0?<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Converting_letters_to_lower_or_upper-case\">11\u00a0Converting letters to lower or upper-case<\/a><\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Filling_a_string_with_some_character\">12\u00a0Filling a string with some character<\/a><\/li>\n<li><a 
href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Removing_leading_and_trailing_spaces\">13\u00a0Removing leading and trailing spaces<\/a><\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Comparing_two_strings\">14\u00a0Comparing two strings<\/a>\n<ul>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Assessing_if_they_are_identical\">14.1\u00a0Assessing if they are identical<\/a><\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Computing_distance_between_strings\">14.2\u00a0Computing distance between strings<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#Miscellaneous\">15\u00a0Miscellaneous<\/a><\/li>\n<li><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#References\">16\u00a0References<\/a><\/li>\n<\/ul>\n<\/div>\n<h2>Reading and writing text files<\/h2>\n<p><b>R<\/b>\u00a0can read any text file using\u00a0<code>readLines()<\/code>\u00a0or\u00a0<code>scan()<\/code>. You can specify the encoding of the imported text file with\u00a0<code>readLines()<\/code>. The entire contents of the text file can be read into an R object (e.g., a character vector).\u00a0<code>scan()<\/code>\u00a0is more flexible. 
The kind of data expected can be specified in the second argument (e.g.,\u00a0<code>character(0)<\/code>\u00a0for strings).<\/p>\n<div dir=\"ltr\">\n<div>\n<pre>text &lt;- readLines(\"file.txt\",encoding=\"UTF-8\")\nscan(\"file.txt\", character(0)) # separate each word\nscan(\"file.txt\", character(0), quote = NULL) # disable quote handling (quotes are kept as ordinary characters)\nscan(\"file.txt\", character(0), sep = \".\") # separate each sentence\nscan(\"file.txt\", character(0), sep = \"\\n\") # separate each line<\/pre>\n<\/div>\n<\/div>\n<p>We can write the content of an R object into a text file using\u00a0<code>cat()<\/code>\u00a0or\u00a0<code>writeLines()<\/code>. By default,\u00a0<code>cat()<\/code>\u00a0writes the elements of a vector on a single line, separated by spaces. You can change this with the options\u00a0<code>sep=\"\\n\"<\/code>\u00a0or\u00a0<code>fill=TRUE<\/code>. The default encoding depends on your computer.<\/p>\n<div dir=\"ltr\">\n<div>\n<pre>cat(text,file=\"file.txt\",sep=\"\\n\")\nwriteLines(text, con = \"file.txt\", sep = \"\\n\", useBytes = FALSE)<\/pre>\n<\/div>\n<\/div>\n<p>Before reading a text file, you can look at its properties.\u00a0<code>nlines()<\/code>\u00a0(<b>parser<\/b>\u00a0package) and\u00a0<code>countLines()<\/code>\u00a0(<b>R.utils<\/b>\u00a0package) count the number of lines in the file.\u00a0<code>count.chars()<\/code>\u00a0(<b>parser<\/b>\u00a0package) counts the number of bytes and characters in each line of a file. 
You can also display a text file using\u00a0<code>file.show()<\/code>.<\/p>\n<h2>Character encoding<\/h2>\n<table>\n<tbody>\n<tr>\n<td><a title=\"w:\" href=\"http:\/\/en.wikipedia.org\/wiki\/\" target=\"_blank\" rel=\"noopener\">Wikipedia<\/a>\u00a0has related information at\u00a0<a href=\"http:\/\/en.wikipedia.org\/wiki\/Character_encoding\" target=\"_blank\" rel=\"noopener\"><i><b>Character encoding<\/b><\/i><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>R provides functions to deal with a variety of encoding schemes. This is useful if you deal with text files that were created on another operating system, especially if the language is not English and uses many accented or language-specific characters. For instance, the standard encoding scheme in Linux is &#8220;UTF-8&#8221; whereas the standard encoding scheme in Windows is &#8220;Latin1&#8221;. 
The\u00a0<code>Encoding()<\/code>\u00a0function returns the encoding of a string.\u00a0<code>iconv()<\/code>\u00a0is similar to the Unix command\u00a0<a title=\"w:iconv\" href=\"http:\/\/en.wikipedia.org\/wiki\/iconv\" target=\"_blank\" rel=\"noopener\">iconv<\/a>\u00a0and converts a string from one encoding to another.<\/p>\n<ul>\n<li><code>iconvlist()<\/code>\u00a0gives the list of encoding schemes available on your computer.<\/li>\n<li><code>readLines()<\/code>,\u00a0<code>scan()<\/code>\u00a0and\u00a0<code>file.show()<\/code>\u00a0also have an encoding option.<\/li>\n<li><code>is.utf8()<\/code>\u00a0(<b>tau<\/b>) tests if the encoding is &#8220;utf8&#8221;.<\/li>\n<li><code>is.locale()<\/code>\u00a0(<b>tau<\/b>) tests if the encoding is the same as the default encoding on your computer.<\/li>\n<li><code>translate()<\/code>\u00a0(<b>tau<\/b>) translates the encoding into the current locale.<\/li>\n<li><code>fromUTF8()<\/code>\u00a0(<b>descr<\/b>) is less general than\u00a0<code>iconv()<\/code>.<\/li>\n<li><code>utf8ToInt()<\/code>\u00a0(<b>base<\/b>) converts a UTF-8 string into a vector of integer code points.<\/li>\n<\/ul>\n<h3>Example<\/h3>\n<p>The following example was run under Windows. 
Thus, the default encoding is &#8220;latin1&#8221;.<\/p>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; texte &lt;- \"H\u00e9 h\u00e9\"\n&gt; Encoding(texte)\n[1] \"latin1\"\n&gt; texte2 &lt;-  iconv(texte,\"latin1\",\"UTF-8\")\n&gt; Encoding(texte2)\n[1] \"UTF-8\"<\/pre>\n<\/div>\n<\/div>\n<h2>Regular Expressions<\/h2>\n<table>\n<tbody>\n<tr>\n<td>Also see the\u00a0<i><a title=\"Regular expressions\" href=\"http:\/\/en.wikibooks.org\/wiki\/Regular_expressions\">Regular expressions<\/a><\/i>\u00a0book.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table>\n<tbody>\n<tr>\n<td><a title=\"w:\" href=\"http:\/\/en.wikipedia.org\/wiki\/\" target=\"_blank\" rel=\"noopener\">Wikipedia<\/a>\u00a0has related information at\u00a0<a href=\"http:\/\/en.wikipedia.org\/wiki\/Regular_expression\" target=\"_blank\" rel=\"noopener\"><i><b>Regular expression<\/b><\/i><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>A regular expression is a 
pattern that describes a set of strings. For instance, one could have the following pattern: 2 digits, 2 letters and 4 digits.\u00a0<b>R<\/b>\u00a0provides powerful functions to deal with regular expressions. Two types of regular expressions are used in\u00a0<b>R<\/b><sup id=\"cite_ref-3\"><a href=\"http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing#cite_note-3\">[3]<\/a><\/sup>:<\/p>\n<ul>\n<li>extended regular expressions, used when\u00a0<code>\u2018perl = FALSE\u2019<\/code>\u00a0(the default),<\/li>\n<li>Perl-like regular expressions, used when\u00a0<code>\u2018perl = TRUE\u2019<\/code>.<\/li>\n<\/ul>\n<p>There is also an option called\u00a0<code>\u2018fixed = TRUE\u2019<\/code>, which treats the pattern as a literal string rather than a regular expression.\u00a0<code>fixed()<\/code>\u00a0(<b>stringr<\/b>) is equivalent to\u00a0<code>fixed=TRUE<\/code>\u00a0in the standard regex functions. These functions are case sensitive by default. This can be changed by specifying the option\u00a0<code>ignore.case = TRUE<\/code>.<\/p>\n<p>If you are not a specialist in regular expressions, you may find\u00a0<code>glob2rx()<\/code>\u00a0useful. 
This function converts a wildcard (glob) pattern into the corresponding regular expression:<\/p>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; glob2rx(\"abc.*\")\n[1] \"^abc\\\\.\"<\/pre>\n<\/div>\n<\/div>\n<h3>Functions which use regular expressions in R<\/h3>\n<ul>\n<li><code>sub()<\/code>,\u00a0<code>gsub()<\/code>,\u00a0<code>str_replace()<\/code>\u00a0(<b>stringr<\/b>) make substitutions in a string.<\/li>\n<li><code>grep()<\/code>,\u00a0<code>str_extract()<\/code>\u00a0(<b>stringr<\/b>) extract matching values.<\/li>\n<li><code>grepl()<\/code>,\u00a0<code>str_detect()<\/code>\u00a0(<b>stringr<\/b>) detect the presence of a pattern.<\/li>\n<li>See also\u00a0<code>splitByPattern()<\/code>\u00a0(<b>R.utils<\/b>).<\/li>\n<li>See also\u00a0<code>gsubfn()<\/code>\u00a0in the\u00a0<b>gsubfn<\/b>\u00a0package.<\/li>\n<\/ul>\n<h3>Extended regular expressions (The default)<\/h3>\n<ul>\n<li><code>\".\"<\/code>\u00a0stands for any character.<\/li>\n<li><code>\"[ABC]\"<\/code>\u00a0means A, B or C.<\/li>\n<li><code>\"[A-Z]\"<\/code>\u00a0means any upper-case letter between A and Z.<\/li>\n<li><code>\"[0-9]\"<\/code>\u00a0means any digit between 0 and 9.<\/li>\n<\/ul>\n<p>Here is the list of metacharacters\u00a0<code>\u2018$ * + .\u00a0? [ ] ^ { } | ( ) \\\u2019<\/code>. 
If you need to match one of those characters literally, precede it with a doubled backslash.<\/p>\n<p>Here are some predefined character classes. They are only valid inside a bracket expression, e.g.\u00a0<code>[[:digit:]]<\/code>. For numbers:<\/p>\n<ul>\n<li><code>\u2018[:digit:]\u2019<\/code>\u00a0Digits:\u00a0<code>\u20180 1 2 3 4 5 6 7 8 9\u2019<\/code>.<\/li>\n<\/ul>\n<p>For letters:<\/p>\n<ul>\n<li><code>\u2018[:alpha:]\u2019<\/code>\u00a0Alphabetic characters:\u00a0<code>\u2018[:lower:]\u2019<\/code>\u00a0and\u00a0<code>\u2018[:upper:]\u2019<\/code>.<\/li>\n<li><code>\u2018[:upper:]\u2019<\/code>\u00a0Upper-case letters.<\/li>\n<li><code>\u2018[:lower:]\u2019<\/code>\u00a0Lower-case letters.<\/li>\n<\/ul>\n<p>Note that the set of alphabetic characters includes accented letters such as\u00a0<code>\u00e9 \u00e8 \u00ea<\/code>, which are very common in some languages like French. Therefore, it is more general than\u00a0<code>\"[A-Za-z]\"<\/code>, which does not include accented letters.<\/p>\n<p>For other characters:<\/p>\n<ul>\n<li><code>\u2018[:punct:]\u2019<\/code>\u00a0Punctuation characters:\u00a0<code>\u2018! \" # $\u00a0% &amp; ' ( ) * + , - . \/\u00a0:\u00a0; &lt; = &gt;\u00a0? 
@ [ \\ ] ^ _ ` { | } ~\u2019<\/code>.<\/li>\n<li><code>\u2018[:space:]\u2019<\/code>\u00a0Space characters: tab, newline, vertical tab, form feed, carriage return, and space.<\/li>\n<li><code>\u2018[:blank:]\u2019<\/code>\u00a0Blank characters: space and tab.<\/li>\n<li><code>\u2018[:cntrl:]\u2019<\/code>\u00a0Control characters.<\/li>\n<\/ul>\n<p>For combinations of other classes:<\/p>\n<ul>\n<li><code>\u2018[:alnum:]\u2019<\/code>\u00a0Alphanumeric characters:\u00a0<code>\u2018[:alpha:]\u2019<\/code>\u00a0and\u00a0<code>\u2018[:digit:]\u2019<\/code>.<\/li>\n<li><code>\u2018[:graph:]\u2019<\/code>\u00a0Graphical characters:\u00a0<code>\u2018[:alnum:]\u2019<\/code>\u00a0and\u00a0<code>\u2018[:punct:]\u2019<\/code>.<\/li>\n<li><code>\u2018[:print:]\u2019<\/code>\u00a0Printable characters:\u00a0<code>\u2018[:alnum:]\u2019<\/code>,\u00a0<code>\u2018[:punct:]\u2019<\/code>\u00a0and space.<\/li>\n<li><code>\u2018[:xdigit:]\u2019<\/code>\u00a0Hexadecimal digits:\u00a0<code>\u20180 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f\u2019<\/code>.<\/li>\n<\/ul>\n<p>You can specify the number of repetitions by adding one of the following quantifiers after the regular expression:<\/p>\n<ul>\n<li><code>\u2018?\u2019<\/code>\u00a0The preceding item is optional and will be matched at most once.<\/li>\n<li><code>\u2018*\u2019<\/code>\u00a0The preceding item will be matched zero or more times.<\/li>\n<li><code>\u2018+\u2019<\/code>\u00a0The preceding item will be matched one or more times.<\/li>\n<li><code>\u2018{n}\u2019<\/code>\u00a0The preceding item is matched exactly \u2018n\u2019 times.<\/li>\n<li><code>\u2018{n,}\u2019<\/code>\u00a0The preceding item is matched \u2018n\u2019 or more times.<\/li>\n<li><code>\u2018{n,m}\u2019<\/code>\u00a0The preceding item is matched at least \u2018n\u2019 times, but not more than \u2018m\u2019 times.<\/li>\n<\/ul>\n<p>You can also anchor the regular expression:<\/p>\n<ul>\n<li><code>^<\/code>\u00a0to force the regular expression to be at the beginning of the string<\/li>\n<li><code>$<\/code>\u00a0to force the 
regular expression to be at the end of the string<\/li>\n<\/ul>\n<p>If you want to know more, have a look at the following two help files:<\/p>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; ?regexp # gives some general explanations\n&gt; ?grep # help file for grep(), regexpr(), sub(), gsub(), etc.<\/pre>\n<\/div>\n<\/div>\n<h3>Perl-like regular expressions<\/h3>\n<p>It is also possible to use &#8220;perl-like&#8221; regular expressions. 
You just need to use the option\u00a0<code>perl=TRUE<\/code>.<\/p>\n<h4>Examples<\/h4>\n<p>If you want to remove all space characters in a string, you can use the\u00a0<code>\\\\s<\/code>\u00a0Perl macro with\u00a0<code>gsub()<\/code>\u00a0(<code>sub()<\/code>\u00a0would remove only the first one).<\/p>\n<div dir=\"ltr\">\n<div>\n<pre>gsub('\\\\s', '', x, perl = TRUE)<\/pre>\n<\/div>\n<\/div>\n<h4>See also<\/h4>\n<ul>\n<li><a title=\"Perl Programming\/Regular Expressions\" href=\"http:\/\/en.wikibooks.org\/wiki\/Perl_Programming\/Regular_Expressions\">Perl Programming\/Regular Expressions<\/a><\/li>\n<\/ul>\n<h2>Concatenating strings<\/h2>\n<ul>\n<li><code>paste()<\/code>\u00a0concatenates strings.<\/li>\n<li><code>str_c()<\/code>\u00a0(<b>stringr<\/b>) does a similar job.<\/li>\n<li><code>cat()<\/code>\u00a0prints and concatenates strings.<\/li>\n<\/ul>\n<h3>Examples<\/h3>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; paste(\"toto\",\"tata\",sep=' ')\n[1] \"toto tata\"\n&gt; paste(\"toto\",\"tata\",sep=\",\")\n[1] \"toto,tata\"\n&gt; str_c(\"toto\",\"tata\",sep=\",\")\n[1] \"toto,tata\"\n&gt; x &lt;- c(\"a\",\"b\",\"c\")\n&gt; paste(x,collapse=\" \")\n[1] \"a b c\"\n&gt; str_c(x, collapse = \" \")\n[1] \"a b c\"\n&gt; cat(c(\"a\",\"b\",\"c\"), sep = \"+\")\na+b+c<\/pre>\n<\/div>\n<\/div>\n
<h2>Splitting a string<\/h2>\n<ul>\n<li><code>strsplit()<\/code>\u00a0splits the elements of a character vector \u2018x\u2019 into substrings according to the matches of the pattern \u2018split\u2019 within them.<\/li>\n<li>See also\u00a0<code>str_split()<\/code>\u00a0(<b>stringr<\/b>).<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; unlist(strsplit(\"a.b.c\", \"\\\\.\"))\n[1] \"a\" \"b\" \"c\"<\/pre>\n<\/div>\n<\/div>\n<ul>\n<li><code>tokenize()<\/code>\u00a0(<b>tau<\/b>) splits a string into tokens.<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; tokenize(\"abc defghk\")\n[1] \"abc\"    \" \"      \"defghk\"<\/pre>\n<\/div>\n<\/div>\n<h2>Counting the number of characters in a string<\/h2>\n<ul>\n<li><code>nchar()<\/code>\u00a0gives the number of characters in a string.<\/li>\n<li>See also\u00a0<code>str_length()<\/code>\u00a0(<b>stringr<\/b>).<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; nchar(\"abcdef\")\n[1] 6\n&gt; str_length(\"abcdef\")\n[1] 6\n&gt; nchar(NA)\n[1] 2\n&gt; str_length(NA)\n[1] NA<\/pre>\n<\/div>\n<\/div>\n<h2>Detecting the presence of a substring<\/h2>\n<h3>Detecting a pattern in a string\u00a0?<\/h3>\n<ul>\n<li><code>grepl()<\/code>\u00a0returns a logical value (TRUE or FALSE).<\/li>\n<li><code>str_detect()<\/code>\u00a0(<b>stringr<\/b>) does a similar 
job.<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; string &lt;- \"23 mai 2000\"\n&gt; string2 &lt;- \"1 mai 2000\"\n&gt; regexp &lt;- \"([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})\"\n&gt; grepl(pattern = regexp, x = string)\n[1] TRUE\n&gt; str_detect(string, regexp)\n[1] TRUE\n&gt; grepl(pattern = regexp, x = string2)\n[1] FALSE<\/pre>\n<\/div>\n<\/div>\n<p>The first test returns TRUE, whereas the second returns FALSE because the first number in\u00a0<code>string2<\/code>\u00a0has only one digit.<\/p>\n<h3>Counting the occurrence of each pattern in a string\u00a0?<\/h3>\n<ul>\n<li><code>textcnt()<\/code>\u00a0(<b>tau<\/b>) counts the occurrences of each pattern or each term in a text.<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; string &lt;- \"blabla 23 mai 2000 blabla 18 mai 2004\"\n&gt; textcnt(string,n=1L,method=\"string\")\nblabla    mai \n     2      2 \nattr(,\"class\")\n[1] \"textcnt\"<\/pre>\n<\/div>\n<\/div>\n<h2>Extracting the position of a substring or a pattern in a string<\/h2>\n<h3>Extracting the position of a substring\u00a0?<\/h3>\n<ul>\n<li><code>cpos()<\/code>\u00a0(<b>cwhmisc<\/b>) returns the position of a substring in a string.<\/li>\n<li><code>substring.location()<\/code>\u00a0(<b>cwhmisc<\/b>) does the same job but returns the first and the last position.<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre> \n&gt; 
cpos(\"abcdefghijklmnopqrstuvwxyz\",\"p\",start=1)\n[1] 16\n&gt; substring.location(\"abcdefghijklmnopqrstuvwxyz\",\"def\")\n$first\n[1] 4\n\n$last\n[1] 6<\/pre>\n<\/div>\n<\/div>\n<h3>Extracting the position of a pattern in a string\u00a0?<\/h3>\n<ul>\n<li><code>regexpr()<\/code>\u00a0returns the position of the first match of the regular expression.\u00a0<code>str_locate()<\/code>\u00a0(<b>stringr<\/b>) does the same job.\u00a0<code>gregexpr()<\/code>\u00a0is similar to\u00a0<code>regexpr()<\/code>\u00a0but returns the starting position of every match.\u00a0<code>str_locate_all()<\/code>\u00a0(<b>stringr<\/b>) does the same job.<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; regexp &lt;- \"([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})\"\n&gt; string &lt;- \"blabla 23 mai 2000 blabla 18 mai 2004\"\n&gt; regexpr(pattern = regexp, text = string)\n[1] 8\nattr(,\"match.length\")\n[1] 11\n&gt; gregexpr(pattern = regexp, text = string)\n[[1]]\n[1]  8 27\nattr(,\"match.length\")\n[1] 11 11\n&gt; str_locate(string,regexp)\n     start end\n[1,]     8  18\n&gt; str_locate_all(string,regexp)\n[[1]]\n     start end\n[1,]     8  18\n[2,]    27  37<\/pre>\n<\/div>\n<\/div>\n<h2>Extracting a substring from a string<\/h2>\n<h3>Extracting a fixed width substring\u00a0?<\/h3>\n<ul>\n<li><code>substr()<\/code>\u00a0extracts a substring.<\/li>\n<li><code>str_sub()<\/code>\u00a0(<b>stringr<\/b>) is 
similar.<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; substr(\"simple text\",1,3)\n[1] \"sim\"\n&gt; str_sub(\"simple text\",1,3)\n[1] \"sim\"<\/pre>\n<\/div>\n<\/div>\n<h3>Extracting the first word in a string\u00a0?<\/h3>\n<ul>\n<li><code>first.word()<\/code>\u00a0(<b>Hmisc<\/b>\u00a0package) extracts the first word of a string or expression.<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; first.word(\"abc def ghk\")\n[1] \"abc\"<\/pre>\n<\/div>\n<\/div>\n<h3>Extracting a pattern in a string\u00a0?<\/h3>\n<ul>\n<li><code>grep()<\/code>\u00a0returns the elements that match the regular expression if\u00a0<code>value=TRUE<\/code>\u00a0and their indices if\u00a0<code>value=FALSE<\/code>.<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; grep(pattern = regexp, x = string , value = TRUE) \n[1] \"23 mai 2000\"\n&gt; grep(pattern = regexp, x = string2 , value = TRUE) \ncharacter(0)\n&gt; grep(pattern = regexp, x = string , value = FALSE) \n[1] 1\n&gt; grep(pattern = regexp, x = string2 , value = FALSE) \ninteger(0)<\/pre>\n<\/div>\n<\/div>\n<ul>\n<li><code>str_extract()<\/code>,\u00a0<code>str_extract_all()<\/code>,\u00a0<code>str_match()<\/code>,\u00a0<code>str_match_all()<\/code>\u00a0(<b>stringr<\/b>) and\u00a0<code>m()<\/code>\u00a0(<b>caroline<\/b>\u00a0package) extract the matched substrings rather than the whole elements.\u00a0<code>str_extract()<\/code>\u00a0returns a character vector and\u00a0<code>str_extract_all()<\/code>\u00a0a list of character vectors.\u00a0<code>str_match()<\/code>\u00a0returns a matrix,\u00a0<code>str_match_all()<\/code>\u00a0a list of matrices and\u00a0<code>m()<\/code>\u00a0a data frame.<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; 
library(\"stringr\")\n&gt; regexp &lt;- \"([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})\"\n&gt; string &lt;- \"blabla 23 mai 2000 blabla 18 mai 2004\"\n&gt; str_extract(string,regexp)\n[1] \"23 mai 2000\"\n&gt; str_extract_all(string,regexp)\n[[1]]\n[1] \"23 mai 2000\" \"18 mai 2004\"\n\n&gt; str_match(string,regexp)\n     [,1]          [,2] [,3]  [,4]  \n[1,] \"23 mai 2000\" \"23\" \"mai\" \"2000\"\n&gt; str_match_all(string,regexp)\n[[1]]\n     [,1]          [,2] [,3]  [,4]  \n[1,] \"23 mai 2000\" \"23\" \"mai\" \"2000\"\n[2,] \"18 mai 2004\" \"18\" \"mai\" \"2004\"\n&gt; library(\"caroline\")\n&gt; m(pattern = regexp, vect = string, names = c(\"day\",\"month\",\"year\"), types = rep(\"character\",3))\n  day month year\n1  18   mai 2004<\/pre>\n<\/div>\n<\/div>\n<h2>Making substitutions inside a string<\/h2>\n<h3>Substituting a pattern in a string<\/h3>\n<ul>\n<li><code>sub()<\/code>\u00a0replaces the first occurrence of a pattern.<\/li>\n<li><code>gsub()<\/code>\u00a0is similar to\u00a0<code>sub()<\/code>\u00a0but replaces all occurrences of the pattern.<\/li>\n<li><code>str_replace()<\/code>\u00a0(<b>stringr<\/b>) is similar to\u00a0<code>sub()<\/code>, and\u00a0<code>str_replace_all()<\/code>\u00a0to\u00a0<code>gsub()<\/code>.<\/li>\n<\/ul>\n<p>In the following example, we have a French date. The pattern is the following: two digits, a space, some letters, a space, four digits. We capture the two digits with the\u00a0<code>[[:digit:]]{2}<\/code>\u00a0expression, the letters with\u00a0<code>[[:alpha:]]+<\/code>\u00a0and the four digits with\u00a0<code>[[:digit:]]{4}<\/code>. 
Each of these three substrings is surrounded with parentheses, which makes it a capture group. The first captured substring can be referenced in the replacement as\u00a0<code>\"\\\\1\"<\/code>, the second one as\u00a0<code>\"\\\\2\"<\/code>\u00a0and the third one as\u00a0<code>\"\\\\3\"<\/code>.<\/p>\n<div dir=\"ltr\">\n<div>\n<pre>string &lt;- \"23 mai 2000\"\nregexp &lt;- \"([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})\"\nsub(pattern = regexp, replacement = \"\\\\1\", x = string) # returns the first captured group, \"23\"\nsub(pattern = regexp, replacement = \"\\\\2\", x = string) # returns the second captured group, \"mai\"\nsub(pattern = regexp, replacement = \"\\\\3\", x = string) # returns the third captured group, \"2000\"<\/pre>\n<\/div>\n<\/div>\n<p>In the following example, we compare the outcome of\u00a0<code>sub()<\/code>\u00a0and\u00a0<code>gsub()<\/code>. The first one removes only the first space whereas the second one removes all spaces in the text.<\/p>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; text &lt;- \"abc def ghk\"\n&gt; sub(pattern = \" \", replacement = \"\",  x = text)\n[1] \"abcdef ghk\"\n&gt; gsub(pattern = \" \", replacement = \"\",  x = text)\n[1] \"abcdefghk\"<\/pre>\n<\/div>\n<\/div>\n<h3>Substituting characters in a string<\/h3>\n<ul>\n<li><code>chartr()<\/code>\u00a0replaces each character of\u00a0<code>old<\/code>\u00a0with the corresponding character of\u00a0<code>new<\/code>. 
It stands for &#8220;character translation&#8221;.<\/li>\n<li><code>replacechar()<\/code>\u00a0(<b>cwhmisc<\/b>) does a similar job,<\/li>\n<li>as well as\u00a0<code>str_replace_all()<\/code>\u00a0(<b>stringr<\/b>), which takes a regular expression, so a literal dot must be escaped as\u00a0<code>\"\\\\.\"<\/code>.<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; chartr(old=\"a\",new=\"o\",x=\"baba\")\n[1] \"bobo\"\n&gt; chartr(old=\"ab\",new=\"ot\",x=\"baba\")\n[1] \"toto\"\n&gt; library(\"cwhmisc\")\n&gt; replacechar(\"abc.def.ghi.jkl\",\".\",\"_\")\n[1] \"abc_def_ghi_jkl\"\n&gt; library(\"stringr\")\n&gt; str_replace_all(\"abc.def.ghi.jkl\",\"\\\\.\",\"_\")\n[1] \"abc_def_ghi_jkl\"<\/pre>\n<\/div>\n<\/div>\n<h2>Converting letters to lower or upper-case<\/h2>\n<ul>\n<li><code>tolower()<\/code>\u00a0converts upper-case characters to lower-case.<\/li>\n<li><code>toupper()<\/code>\u00a0converts lower-case characters to upper-case.<\/li>\n<li><code>capitalize()<\/code>\u00a0(<b>Hmisc<\/b>) capitalizes the first letter of a string.<\/li>\n<li>See also\u00a0<code>cap()<\/code>,\u00a0<code>capitalize()<\/code>,\u00a0<code>lower()<\/code>,\u00a0<code>lowerize()<\/code>\u00a0and\u00a0<code>CapLeading()<\/code>\u00a0in the\u00a0<b>cwhmisc<\/b>\u00a0package.<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; tolower(\"ABCdef\")\n[1] \"abcdef\"\n&gt; toupper(\"ABCdef\")\n[1] \"ABCDEF\"\n&gt; library(\"Hmisc\")\n&gt; capitalize(\"abcdef\")\n[1] \"Abcdef\"<\/pre>\n<\/div>\n<\/div>\n<h2>Filling a string with some character<\/h2>\n<ul>\n<li><code>padding()<\/code>\u00a0(<b>cwhmisc<\/b>) pads a string with a given character to reach a given length. 
See also\u00a0<code>str_pad()<\/code>\u00a0(<b>stringr<\/b>).<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; library(\"cwhmisc\")\n&gt; library(\"stringr\")\n&gt; padding(\"abc\",10,\" \",\"center\") # adds blanks such that the length of the string is 10.\n[1] \"   abc    \"\n&gt; str_pad(\"abc\",width=10,side=\"center\", pad = \"+\")\n[1] \"+++abc++++\"\n&gt; str_pad(c(\"1\",\"11\",\"111\",\"1111\"),3,side=\"left\",pad=\"0\") \n[1] \"001\"  \"011\"  \"111\"  \"1111\"<\/pre>\n<\/div>\n<\/div>\n<p>Note that\u00a0<code>str_pad()<\/code>\u00a0was very slow in the timing below: padding a vector of length 10,000 took about 50 seconds of user time.\u00a0<code>padding()<\/code>\u00a0does not seem to handle character vectors, but it can be combined with\u00a0<code>sapply()<\/code>, which was noticeably faster in this test.<\/p>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; library(\"stringr\")\n&gt; library(\"cwhmisc\")\n&gt; a &lt;- rep(1,10^4)\n&gt; system.time(b &lt;- str_pad(a,3,side=\"left\",pad=\"0\"))\n   user  system elapsed \n 50.968   0.208  73.322 \n&gt; system.time(c &lt;- sapply(a, padding, space = 3, with = \"0\", to = \"left\"))\n   user  system elapsed \n  7.700   0.020  12.206<\/pre>\n<\/div>\n<\/div>\n<h2>Removing leading and trailing spaces<\/h2>\n<ul>\n<li><code>trimws()<\/code>\u00a0(<b>memisc<\/b>\u00a0package) trims leading and trailing white spaces.<\/li>\n<li><code>trim()<\/code>\u00a0(<b>gdata<\/b>\u00a0package) does the same job.<\/li>\n<li>See also\u00a0<code>str_trim()<\/code>\u00a0(<b>stringr<\/b>).<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; library(\"memisc\")\n&gt; trimws(\"  abc def   \")\n[1] \"abc def\" \n&gt; library(\"gdata\")\n&gt; trim(\" abc def \")\n[1] \"abc def\"\n&gt; str_trim(\" 
 abd def  \")\n[1] \"abd def\"<\/pre>\n<\/div>\n<\/div>\n<h2>Comparing two strings<\/h2>\n<h3>Assessing if they are identical<\/h3>\n<ul>\n<li><code>==<\/code>\u00a0returns TRUE if both strings are identical and FALSE otherwise.<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; \"abc\"==\"abc\"\n[1] TRUE\n&gt; \"abc\"==\"abd\"\n[1] FALSE<\/pre>\n<\/div>\n<\/div>\n<h3>Computing distance between strings<\/h3>\n<p><code>stringMatch()<\/code>\u00a0(<b>MiscPsycho<\/b>) computes the\u00a0<a title=\"w:Levenshtein distance\" href=\"http:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance\" target=\"_blank\" rel=\"noopener\">Levenshtein distance<\/a>\u00a0between two strings. 
If\u00a0<code>normalize=\"YES\"<\/code>\u00a0the Levenshtein distance is divided by the maximum of the two string lengths.<\/p>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; library(\"MiscPsycho\")\n&gt; stringMatch(\"test\",\"tester\",normalize=\"NO\",penalty=1,case.sensitive = T)\n[1] 2<\/pre>\n<\/div>\n<\/div>\n<p><code>agrep()<\/code>\u00a0searches for approximate matches using the\u00a0<a title=\"w:Levenshtein distance\" href=\"http:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance\" target=\"_blank\" rel=\"noopener\">Levenshtein distance<\/a>.<\/p>\n<ul>\n<li>If\u00a0<code>value = T<\/code>, it returns the matching strings.<\/li>\n<li>If\u00a0<code>value = F<\/code>, it returns the positions of the matching strings.<\/li>\n<li><i>max<\/i>\u00a0sets the maximal Levenshtein distance allowed for a match.<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt;  agrep(pattern = \"laysy\", x = c(\"1 lazy\", \"1\", \"1 LAZY\"), max = 2, value = TRUE)\n[1] \"1 lazy\"\n&gt;  agrep(\"laysy\", c(\"1 lazy\", \"1\", \"1 LAZY\"), max = 3, value = TRUE)\n[1] \"1 lazy\"<\/pre>\n<\/div>\n<\/div>\n<h2>Miscellaneous<\/h2>\n<ul>\n<li><code>deparse()<\/code>\u00a0(<b>base<\/b>) turns unevaluated expressions into character strings.<\/li>\n<li><code>char.expand()<\/code>\u00a0(<b>base<\/b>) expands a string with respect to a target.<\/li>\n<li><code>pmatch()<\/code>\u00a0(<b>base<\/b>) and\u00a0<code>charmatch()<\/code>\u00a0(<b>base<\/b>) seek partial matches for the elements of their first argument among those of their second.<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; pmatch(c(\"a\",\"b\",\"c\",\"d\"),table = c(\"b\",\"c\"), nomatch = 0)\n[1] 0 1 2 0<\/pre>\n<\/div>\n<\/div>\n<ul>\n<li><code>make.unique()<\/code>\u00a0makes the elements of a character vector unique by appending sequence numbers to duplicates. 
This is useful if you want to use a string as an identifier in your data.<\/li>\n<\/ul>\n<div dir=\"ltr\">\n<div>\n<pre>&gt; make.unique(c(\"a\", \"a\", \"a\"))\n[1] \"a\"   \"a.1\" \"a.2\"<\/pre>\n<\/div>\n<\/div>\n<h2>References<\/h2>\n<div>\n<ol>\n<li id=\"cite_note-stringr-1\">Hadley Wickham, &#8220;stringr: modern, consistent string processing&#8221;, The R Journal, December 2010, Vol 2\/2,\u00a0<a href=\"http:\/\/journal.r-project.org\/archive\/2010-2\/RJournal_2010-2_Wickham.pdf\" target=\"_blank\" rel=\"nofollow noopener\">http:\/\/journal.r-project.org\/archive\/2010-2\/RJournal_2010-2_Wickham.pdf<\/a><\/li>\n<li id=\"cite_note-2\"><a href=\"http:\/\/cran.r-project.org\/web\/views\/NaturalLanguageProcessing.html\" target=\"_blank\" rel=\"nofollow noopener\">http:\/\/cran.r-project.org\/web\/views\/NaturalLanguageProcessing.html<\/a><\/li>\n<li id=\"cite_note-3\">In former versions (&lt; 2.10)\u00a0<b>R<\/b>\u00a0also had basic regular expressions:\n<ul>\n<li>extended regular expressions, used by\u00a0<code>extended = TRUE<\/code>\u00a0(the default),<\/li>\n<li>basic regular expressions, as used by\u00a0<code>extended = FALSE<\/code>\u00a0(obsolete in\u00a0<b>R 2.10<\/b>).<\/li>\n<\/ul>\n<p>Since basic regular expressions (<code>\u2018extended = FALSE\u2019<\/code>) are now obsolete, the<\/li>\n<\/ol>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>http:\/\/en.wikibooks.org\/wiki\/R_Programming\/Text_Processing &nbsp; This page includes all the 
material you need to deal with strings in R. The section on regular expressions may be useful to&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-293","post","type-post","status-publish","format-standard","hentry","category-r"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/293","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=293"}],"version-history":[{"count":0,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/293\/revisions"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media?parent=293"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=293"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=293"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}