{"id":638,"date":"2014-06-12T17:35:46","date_gmt":"2014-06-12T22:35:46","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=638"},"modified":"2014-06-12T17:35:46","modified_gmt":"2014-06-12T22:35:46","slug":"regular-expression-tutorial-2-commands-in-r","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2014\/06\/12\/regular-expression-tutorial-2-commands-in-r\/","title":{"rendered":"Regular Expression Tutorial 2: Commands in R"},"content":{"rendered":"<p style=\"color: #555555;\">The second part of this regular expression tutorial covers the R commands most commonly used with regular expressions. Once you know how to write a regular expression that matches a string, you may want to manipulate the matches, for example by deleting or replacing them. Here is the list of string matching &amp; manipulation commands commonly used with regular expressions in R. Similar commands also appear in many other languages.<\/p>\n<pre style=\"color: #555555;\"><strong>Command          Function<\/strong>\ngrep( )          Return the indices of the elements\n                 that the regular expression matches\ngrepl( )         Return a logical value per element\n                 (TRUE where the pattern matches)\nregexpr( )       Return the position of the first match\n                 in each string\ngregexpr( )      Return the positions of all matches\n                 in each string\nsub( )           Substitute a pattern with a given string\n                 (first occurrence only)\ngsub( )          Globally substitute a pattern with a\n                 given string (all occurrences)\nsubstr( )        Return the substring between the given\n                 character positions (start and stop)\n                 of a given string\nstrsplit( )      Split the input string into parts\n                 at each match of a separator pattern\nregexec( )       Return the position of the first match\n                 and of its parenthesized subexpressions\nregmatches ( )   
Extract or replace matched substrings\n                 from match data obtained by\u00a0regexpr,\n                 gregexpr, or regexec<\/pre>\n<h4 style=\"color: #333333;\"><span style=\"color: #0000ff;\">Find &amp; Display Matching Strings: grep<\/span><\/h4>\n<pre style=\"color: #555555;\">grep(pattern,vector)\n&gt; x&lt;-c(\"abc\",\"bcd\",\"cde\",\"def\")\n&gt; grep(\"bc\",x)\n[1] 1 2<\/pre>\n<p style=\"color: #555555;\">The first command is grep(), which originated on Unix. Its name comes from\u00a0<strong>g<\/strong>lobally search a\u00a0<strong>r<\/strong>egular\u00a0<strong>e<\/strong>xpression and\u00a0<strong>p<\/strong>rint. Here \u201cbc\u201d appears in the first two entries of x, so grep() returns the indices 1 and 2. To show the matched entries instead of their indices, either set the value option or subset x with square brackets.<\/p>\n<pre style=\"color: #555555;\">&gt; grep(\"bc\",x,value=TRUE)\n[1] \"abc\" \"bcd\"\n&gt; x[grep(\"bc\",x)]\n[1] \"abc\" \"bcd\"<\/pre>\n<h4 style=\"color: #333333;\"><span style=\"color: #0000ff;\">Show Matched Pattern Using Find &amp; Replace<\/span><\/h4>\n<p style=\"color: #555555;\">If you want only the matched pattern rather than the whole entry, the approach is a bit awkward: take the output above and remove the unmatched part with sub() (on Linux, you would just use grep -o).<\/p>\n<p style=\"color: #555555;\">First, the syntax of the sub function is<\/p>\n<pre style=\"color: #555555;\">sub(\"matching_string\",\"replacing_string\", input_vector)<\/pre>\n<p style=\"color: #555555;\">This function works like \u201cfind and replace\u201d. 
Use it to remove the unmatched part.<\/p>\n<pre style=\"color: #555555;\">&gt; sub(\".*(bc).*\",\"\\\\1\",grep(\"bc\",x,value=TRUE))\n[1] \"bc\" \"bc\"<\/pre>\n<p style=\"color: #555555;\">Remember that .* matches any sequence of characters of any length, and \\\\1 refers to the text captured by the first pair of parentheses.\u00a0In this case you see only \u201cbc\u201d, but with a more general regular expression as the pattern, the extracted matches can differ from entry to entry.<\/p>\n<h4 style=\"color: #333333;\"><span style=\"color: #0000ff;\">Remove Matched String<\/span><\/h4>\n<p style=\"color: #555555;\">To return the indices of the\u00a0<span style=\"color: #ff0000;\">unmatched<\/span>\u00a0entries, add the invert option.<\/p>\n<pre style=\"color: #555555;\">&gt; grep(\"bc\",x,<span style=\"color: #ff0000;\">invert=TRUE<\/span>)\n[1] 3 4<\/pre>\n<p style=\"color: #555555;\">Combined with the value option, this removes the matched entries from the vector.<\/p>\n<pre style=\"color: #555555;\">&gt; grep(\"bc\",x,<span style=\"color: #ff0000;\">invert=TRUE, value=TRUE<\/span>)\n[1] \"cde\" \"def\"<\/pre>\n<p style=\"color: #555555;\">To make the search case-insensitive, set the ignore.case option.<\/p>\n<pre style=\"color: #555555;\">&gt; grep(\"BC\",x,<span style=\"color: #ff0000;\">ignore.case=TRUE<\/span>)\n[1] 1 2<\/pre>\n<p style=\"color: #555555;\">If you want\u00a0<span style=\"color: #ff0000;\">logical values<\/span>\u00a0for the matches, use grepl().<\/p>\n<pre style=\"color: #555555;\">&gt; grepl(\"bc\",x)\n[1]  TRUE  TRUE FALSE FALSE<\/pre>\n<h4 style=\"color: #333333;\"><span style=\"color: #0000ff;\">Manipulating Strings with Matched String Positions<\/span><\/h4>\n<p style=\"color: #555555;\">To get the position of the first match of the pattern in the string, use<strong><span style=\"color: #ff0000;\">\u00a0regexpr()<\/span><\/strong>.<\/p>\n<pre style=\"color: #555555;\">&gt; y&lt;-\"Waikiki\"\n&gt; regexpr(\"ki\",y)\n[1] 4\nattr(,\"match.length\")\n[1] 2\nattr(,\"useBytes\")\n[1] TRUE<\/pre>\n<p style=\"color: 
#555555;\">Since the first match starts at the 4th character of y, the first value returned is 4. If there is no match, regexpr() returns -1.<\/p>\n<p style=\"color: #555555;\">If you want this value only,<\/p>\n<pre style=\"color: #555555;\">&gt; regexpr(\"ki\",y)[1]\n[1] 4<\/pre>\n<p style=\"color: #555555;\">You see that regexpr() also returns two attributes, \u201cmatch.length\u201d and \u201cuseBytes\u201d. These values can be accessed with attr().<\/p>\n<pre style=\"color: #555555;\">&gt; attr(regexpr(\"ki\",y),\"match.length\")\n[1] 2\n&gt; attr(regexpr(\"ki\",y),\"useBytes\")\n[1] TRUE<\/pre>\n<p style=\"color: #555555;\">If you want the positions of all matches, use\u00a0<strong><span style=\"color: #ff0000;\">gregexpr()<\/span><\/strong>, which returns a list with one element per input string.<\/p>\n<pre style=\"color: #555555;\">&gt; gregexpr(\"ki\",y)\n[[1]]\n[1] 4 6\nattr(,\"match.length\")\n[1] 2 2\nattr(,\"useBytes\")\n[1] TRUE<\/pre>\n<p style=\"color: #555555;\">To show only the position values, index the first list element; subsetting drops the attributes. One way, a bit awkward, uses the length function.<\/p>\n<pre style=\"color: #555555;\">&gt; z&lt;-gregexpr(\"ki\",y)\n&gt; z[[1]][1:length(z[[1]])]\n[1] 4 6<\/pre>\n<p style=\"color: #555555;\"><strong><span style=\"color: #ff0000;\">regexec()<\/span>\u00a0<\/strong>works very similarly to regexpr(). However, if the pattern contains parenthesized subexpressions, it returns both the position of the whole match and\u00a0the position of each parenthesized submatch.<\/p>\n<pre style=\"color: #555555;\">&gt; regexec(\"kik\",y)\n[[1]]\n[1] 4\nattr(,\"match.length\")\n[1] 3\n&gt; regexec(\"k(ik)\",y)\n[[1]]\n[1] 4 5\nattr(,\"match.length\")\n[1] 3 2<\/pre>\n<p style=\"color: #555555;\">To extract a substring from an input string, use\u00a0<strong><span style=\"color: #ff0000;\">substr()<\/span><\/strong><\/p>\n<pre style=\"color: #555555;\">substr(x,start,stop)\n&gt; x&lt;-\"abcdef\"\n&gt; substr(x,3,5)\n[1] \"cde\"<\/pre>\n<p style=\"color: #555555;\">This function can also\u00a0<span style=\"color: 
#ff0000;\">replace a substring<\/span>\u00a0in a string. The assignment itself prints nothing, so print x to see the result.<\/p>\n<pre style=\"color: #555555;\">&gt; substr(x,3,4)&lt;-\"XX\"\n&gt; x\n[1] \"abXXef\"<\/pre>\n<h4 style=\"color: #333333;\"><span style=\"color: #0000ff;\">Another Way to Show Matched Strings Using regmatches()<\/span><\/h4>\n<p style=\"color: #555555;\">I showed one way to list the matched strings using sub() and grep(). You can do the same thing with regmatches() together with gregexpr() or regexec().<br \/>\nFirst, gregexpr() gives you the position and length of every match in the input; you pass this information on to regmatches(),\u00a0which returns all the matched strings from the input string. With regexec(), regmatches() returns the first matched substring together with the substrings matched inside the parentheses.<\/p>\n<pre style=\"color: #555555;\">&gt; a&lt;-\"Mississippi contains a palindrome ississi.\"\n&gt; b&lt;-gregexpr(\".(ss)\",a)\n&gt; c&lt;-regexec(\".(ss)\",a)\n\n&gt; regmatches(a,b)\n[[1]]\n[1] \"iss\" \"iss\" \"iss\" \"iss\"\n\n&gt; regmatches(a,c)\n[[1]]\n[1] \"iss\" \"ss\"<\/pre>\n<p style=\"color: #555555;\">The syntax of regmatches() is<\/p>\n<pre style=\"color: #555555;\">regmatches(input, position&amp;length)<\/pre>\n<p style=\"color: #555555;\">In other words, the position and length information obtained from either gregexpr() or regexec() is used to extract the matched strings from the input. Note that regexec() takes only the first match, so you see only \u201ciss\u201d and \u201css\u201d.<\/p>\n<h4 style=\"color: #333333;\"><span style=\"color: #0000ff;\">Split Strings with Common Separator Using strsplit Function<\/span><\/h4>\n<p style=\"color: #555555;\">Suppose you have a date string \u201c11\/03\/2013\u201d and want to extract the numbers \u201c11\u201d, \u201c03\u201d and \u201c2013\u201d. 
Since the numbers are separated by the common character \u201c\/\u201d, you can use the strsplit function to do the job.<\/p>\n<pre style=\"color: #555555;\">&gt; strsplit(\"11\/03\/2013\",\"\/\")\n[[1]]\n[1] \"11\"   \"03\"   \"2013\"<\/pre>\n<p style=\"color: #555555;\">If you use the empty string \u201c\u201d as the separator, you get each character separately.<\/p>\n<pre style=\"color: #555555;\">&gt; strsplit(\"11\/03\/2013\",\"\")\n[[1]]\n [1] \"1\" \"1\" \"\/\" \"0\" \"3\" \"\/\" \"2\" \"0\" \"1\" \"3\"<\/pre>\n<p style=\"color: #555555;\">One thing to remember: when the string starts with a separator, strsplit puts an empty string first in the result vector.<\/p>\n<pre style=\"color: #555555;\">&gt; strsplit(\".a.b.c\",\"\\\\.\")\n[[1]]\n[1] \"\"  \"a\" \"b\" \"c\"<\/pre>\n<p style=\"color: #555555;\">When a dot (.) is the separator, escape it with two backslashes, because strsplit treats the separator as a regular expression and an unescaped dot matches any character.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The second part of this regular expression tutorial covers the R commands most commonly used with regular expressions. 
Once you know how to&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-638","post","type-post","status-publish","format-standard","hentry","category-r"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/638","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=638"}],"version-history":[{"count":0,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/638\/revisions"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media?parent=638"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=638"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=638"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}