Guide to the Regular expressions

Table of contents
Defining a search pattern
Metacharacters and "escape" character
Anchors
Character classes
Unicode characters
Operators
Syntaxes
Flags
Defining a replace pattern

Defining a search pattern
To use regular expressions, you pass a special string as the search string. Instead of being taken literally, that string defines the pattern to search for. In a regular expression search pattern most characters keep their literal meaning, while a few take a special role: the metacharacters, described in the sections below.

The AppleScript compiler treats backslash \ and double-quote " as special characters, so you have to "escape" them with a backslash: "\"" is how you enter a double-quote, and "\\" is how you enter a backslash.
Metacharacters and "escape" character
To make a metacharacter (for instance, the bracket [) recover its literal meaning, prefix it with a backslash \. For instance, [a-z]\[[0-9]\] may match c[8]. In other cases, as you will see below, the character with the backslash is the metacharacter, while the character alone keeps its literal meaning.

Metacharacters which do not make sense inside brackets (brackets define a character class, for example [a-z], see below) recover their literal meaning inside the brackets. For instance, [.] and [\] stand for the period and for the backslash, respectively.

The backslash itself must be escaped with a backslash to recover its literal meaning. Since the AppleScript compiler also requires every backslash to be escaped, finding one backslash with a regular expression in an AppleScript takes four backslashes, whereas a literal (non-regexp) search takes only two:

    find text "\\\\" in "escape with \\" with regexp
    find text "\\" in "escape with \\"
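The two layers of escaping can be checked outside Smile. The sketch below uses Python's re module purely as a stand-in (an assumption: it is not Smile's engine, and only the escaping logic carries over), since Python string literals also require backslashes to be doubled.

```python
import re

# The text to search ends with ONE backslash (escaped once for the Python source).
text = "escape with \\"

# Regex search: the engine needs the two-character pattern \\ to mean one literal
# backslash, and the string literal doubles each of them, hence four in the source.
regex_hit = re.search("\\\\", text)

# Literal search: no regex layer, so only the source-level escaping applies.
literal_pos = text.find("\\")

# A backslash in front of a metacharacter restores its literal meaning.
bracket_hit = re.fullmatch(r"[a-z]\[[0-9]\]", "c[8]")

print(regex_hit.start(), literal_pos, bracket_hit is not None)
```

Both searches locate the same single character; only the number of backslashes written in the source differs.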
Anchors
You can use the characters below as tags which stand for a specific kind of location in the text.

^ (hat)  beginning of a line, or the beginning of the selection in a window, or the beginning of the text stored in a variable
$  end of a line, or the end of the selection in a window, or the end of the text stored in a variable

    change "^to be" into "" in "to be or not to be" with regexp
    -- " or not to be"
    change "to be$" into "" in "to be or not to be" with regexp
    -- "to be or not "

This is the default behavior; see the Flags section below for flags which change the meaning of the tags above.

\b  beginning or end of a word
\B  strictly within a word

    find text "\\bbe" in "tobe or not to be" with regexp

will match the last word.
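The anchor examples above can be reproduced with Python's re module, used here only as a stand-in for Smile (an assumption: in Python, ^ and $ match at line boundaries only with the re.MULTILINE flag, which makes no difference for this single-line text).

```python
import re

line = "to be or not to be"

# ^ anchors the pattern at the beginning, $ at the end.
start_removed = re.sub(r"^to be", "", line)
end_removed = re.sub(r"to be$", "", line)

# \b requires a word boundary, so the "be" inside "tobe" is skipped
# and only the final word "be" is found.
word_hit = re.search(r"\bbe", "tobe or not to be")

print(repr(start_removed), repr(end_removed), word_hit.start())
```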
Character classes
. (period)  stands for any character except CR (ASCII 13).

    find text "n(.)*" in "to be or not to be" with regexp

will match the end of the line from "not". This is the default behavior; see the Flags section below for flags which allow . (period) to match CR (ASCII 13).

[]  the brackets encapsulate the definition of a class of characters. For instance, [0-9] matches any digit.
-  defines the range of characters which lie (in ASCII order) between the characters on each side of the hyphen, for instance [a-zA-Z] matches any of the 52 uppercase and lowercase Roman letters
^  defines a class by excluding the characters which follow the hat character.

    find text "[^@]*" in "homer@lol.com" with regexp and string result
    -- "homer"

(the meaning of the star * is explained below)

\w  any of the characters which are allowed in words
\W  any of the characters which are allowed as word separators
\r  CR, carriage return, ASCII 13
\n  LF, line feed, ASCII 10
\t  tab, ASCII 9
[:alnum:]  pre-defined set, the Roman letters and the digits. The pre-defined sets work only when encapsulated within brackets. For instance, ^[[:alnum:]]{5}@ will match a set of exactly 5 alphanumeric characters located at the beginning of a line and followed by "@".
[:alpha:]  the Roman letters
[:lower:]  the lowercase Roman letters
[:upper:]  the uppercase Roman letters
[:digit:]  the digits
[:xdigit:]  the hexadecimal digits (lowercase and uppercase)
[:blank:]  space or tab
[:space:]  space, tab, CR, LF or FF
[:cntrl:]  the set of the characters with an ASCII code < 32 or = 127
[:punct:]  neither a control character nor alphanumeric

    change "^" into "--" in selection of window 1 with regexp
    change "^[[:space:]]*--" into "" in selection of window 1 with regexp

would comment out and uncomment, respectively, the block of text selected in the active window.

To include a literal ] in a [] range, escape it: [\]]. To include a literal ^, place it anywhere but first. To include a literal -, place it last.

\w and \W are considered metacharacters only outside brackets []. \r, \n and \t are considered metacharacters inside and outside brackets, except when they immediately follow a backslash. Thus, to match a literal backslash followed by an r (or to search for the literal sequence \n or \t), insert an additional backslash to escape the backslash: search for \\r.
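A few of the class rules above can be tried in Python's re module, used only as a stand-in for Smile. Two assumptions are worth stating loudly: Python does not support the pre-defined sets such as [[:alnum:]], so an explicit ASCII range replaces them here, and [ \t] is used instead of [[:space:]] so that the match cannot run across line ends.

```python
import re

# A negated class: everything up to, but not including, the first "@".
user = re.match(r"[^@]*", "homer@lol.com").group(0)

# Stand-in for ^[[:alnum:]]{5}@ (Python's re lacks POSIX bracket classes).
five_then_at = re.compile(r"^[A-Za-z0-9]{5}@")

# Literal ] is escaped, literal ^ is not first, literal - is last.
special = re.findall(r"[\]^-]", "a]b^c-d")

# Comment out, then uncomment, every line of a block.
block = "a\n  b"
commented = re.sub(r"^", "--", block, flags=re.MULTILINE)
uncommented = re.sub(r"^[ \t]*--", "", commented, flags=re.MULTILINE)

print(user, special, repr(commented), repr(uncommented))
```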
Unicode characters
\u is an additional metacharacter for regular expressions on Unicode: \u followed by the hexadecimal Unicode code of a character specifies that character. The example below finds the mathematical symbols (Unicode 0x2200 to 0x22FF) in a given formula. (Because of the ASCII limitation of AppleScript source, we have to enter the original string in a Unicode window.)

    set x to text of Unicode window 1
    -- "∀x|x∈[0,+∞] ∃y, y∉Ω, ∫y(x).dx ≪ y∧1"
    find text "[\\u2200-\\u22FF]" in x with regexp, string result and all occurrences
    -- {"∀", "∈", "∞", "∃", "∉", "∫", "≪", "∧"}
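The same range can be checked with Python's re module, used as a stand-in for Smile (an assumption: only the idea of a \u code range inside brackets carries over, not the exact escaping rules). Ω is left out of the result because the Greek letter lies outside the 0x2200 to 0x22FF block.

```python
import re

formula = "∀x|x∈[0,+∞] ∃y, y∉Ω, ∫y(x).dx ≪ y∧1"

# Character class covering the Unicode "Mathematical Operators" block.
math_symbols = re.findall("[\u2200-\u22FF]", formula)

print(math_symbols)
```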
Operators
*  zero or more occurrences of the preceding group, for example ^[[:space:]]* will match any combination of spaces and tabs at the beginning of a line
+  one or more occurrences
?  zero or one occurrence
{i,j}  i to j occurrences, for instance [0-9]{2,4} will match a group of 2, 3 or 4 digits
{i,}  i occurrences or more
{i}  exactly i occurrences
|  or, for example begin|end will match either "begin" or "end"
()  groups characters, for example ([0-9]{3},)+ may match "123,234,345,"

You also use groups when you want to reference them later: \1, \2 ... \9 are references to the successive groups of the pattern. Those references can be used in the search string itself, in the using parameter of find text, or in the into parameter of change. \0 stands for the whole match.

^(.*)\r(.*\r)*\1$ will match a block of text bracketed between two identical lines, while:

    find text "^(.*)\\r(.*\\r)*\\1$" in someText with regexp using "\\1"

will match the same pattern, but will return only the duplicate line.

References to groups may be helpful with the change verb.

    change "([0-9]{2})/([0-9]{2})" into "\\2/\\1" in someText with regexp

will change, for instance, "25/12" into "12/25".

The order of the groups is the order of their opening parentheses. If a group is repeated in the pattern, its reference stands for the last occurrence.

    find text "(^|[^0-9])(([0-9]{1,3}\\.){3}[0-9]{1,3})" in theText using "\\2" with regexp, all occurrences and string result

will return (as strings) the list of all dotted numeric IP addresses found in theText.

    find text "(^|[^0-9])(([0-9]{1,3}\\.){3}[0-9]{1,3})" in theText using "\\3" with regexp, all occurrences and string result

will return (as strings followed by a dot) the list of the third bytes of the dotted numeric IP addresses found in theText.

Some syntaxes, including the default one ("RUBY"), define additional operators. For example, laziness is available in some syntaxes through the +? operator.

    find text "<.+>" in "<div>Satimage-Software</div>" with regexp
    -- is different from:
    find text "<.+?>" in "<div>Satimage-Software</div>" with regexp

The first pattern is greedy and extends to the last ">" of the text; the second is lazy and stops at the first ">".

Lookaround is available in some syntaxes through the (?=), (?!), (?<=) and (?<!) operators.

    find text "m(?!i)" in "min, max" with regexp
    -- does not match the first "m", and the letter following the second "m" is not part of the result
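The group, laziness and lookaround behaviors described above can be sketched with Python's re module, used only as a stand-in for Smile (assumptions: sample_text is a made-up input, and Python's handling of repeated groups happens to agree with the "last occurrence" rule stated above).

```python
import re

# Back-references in the replacement: swap the two groups.
swapped = re.sub(r"([0-9]{2})/([0-9]{2})", r"\2/\1", "25/12")

# Greedy versus lazy quantifier.
html = "<div>Satimage-Software</div>"
greedy = re.search(r"<.+>", html).group(0)
lazy = re.search(r"<.+?>", html).group(0)

# Negative lookahead: an "m" that is not followed by "i".
lookahead_hit = re.search(r"m(?!i)", "min, max")

# Groups are numbered by their opening parenthesis; a repeated group
# keeps its last repetition.
sample_text = "from 192.168.1.20 to 10.0.0.1"  # hypothetical input
ip_pattern = re.compile(r"(^|[^0-9])(([0-9]{1,3}\.){3}[0-9]{1,3})")
addresses = [hit.group(2) for hit in ip_pattern.finditer(sample_text)]
third_bytes = [hit.group(3) for hit in ip_pattern.finditer(sample_text)]

print(swapped, greedy, lazy, lookahead_hit.start(), addresses, third_bytes)
```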
Syntaxes
Several syntaxes exist for regular expressions. find text and change let you choose which one to use through the syntax parameter. The default is "RUBY", but you can use any of the following: "POSIX", "POSIX_EXTENDED", "EMACS", "GREP", "GNU_REGEX", "JAVA", "PERL", "RUBY". Each syntax supports its own set of operators, and their behaviors differ. For example, the "POSIX" syntax does not support the + operator.

    find text p in s with regexp syntax "PERL"
Flags
The dictionary of Smile states that the regular expressions for find text and change support the following flags: "IGNORECASE", "EXTEND", "MULTILINE", "SINGLELINE", "FIND LONGEST", "FIND NOT EMPTY", "DONT CAPTURE GROUP", "NOTBOL" and "NOTEOL". The flags can only be used from a script. They modify the behavior of the regexp syntax.

    find text "(b?|.*)" in "aaa" regexpflag {"FIND NOT EMPTY"} with regexp
Defining a replace pattern
The replace string, as entered in the Find panel or as the into argument of the change verb, can include the following metacharacters: \0, \1 ... \9 (the references to the groups which were defined in the search pattern; \0 refers to the whole match) and the special characters \r, \n and \t. The same characters are valid in the string passed as the using argument of find text. The using argument of find text supports strings and lists of strings; every string recognizes the metacharacters listed just above.

    find text "(.+) (.+)" in "Mickey Mouse" using {"Dest: Mr \\2", "Dear \\1,"} with regexp and string result
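Applying a list of replace templates to a single match can be sketched with Python's re module, used only as a stand-in for Smile's using parameter (assumptions: Match.expand plays the role of using, and the templates list is illustrative; note that Python writes the whole match as \g<0> rather than \0).

```python
import re

hit = re.search(r"(.+) (.+)", "Mickey Mouse")

# Each template refers to the groups captured by the search pattern.
templates = [r"Dest: Mr \2", r"Dear \1,"]
results = [hit.expand(template) for template in templates]

print(results)
```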


Copyright ©2009 Paris, Satimage