Regular Expression, or regex or regexp in short,
is extremely and amazingly powerful in searching and manipulating text strings.
One line of regex can easily replace several dozen lines of programming codes.
Regex is supported in almost all the scripting languages (such as Perl, Python,
PHP, and JavaScript); as well as general purpose programming languages such as
Java; and even word processors such as Word for searching texts. Getting
started with regex may not be easy due to its geeky syntax, but it is certainly
worth the investment of your time.
1. Regex By
Examples
Example 1: Numeric String /^[0-9]+$/
1.
A regex is typically
delimited by a pair of forward slash, in the form of /.../.
2.
The leading ^ and
the trailing $ are known as position anchors, which match the
beginning and ending of the input string, respectively. As a result, the entire
input string shall be matched, instead of a portion of the input string.
Without these position anchors, the regex can match any part of the input
string, i.e., with leading and trailing sub-string unmatched.
3.
The [...] encloses
a list of characters, and matches any character in the list. [0-9] matches
any character between 0 and 9 (i.e., a digit), where dash (-) denotes the
range.
4.
The + indicate
1 or more occurrences of the previous sub-expression. In this case, [0-9]+ matches
one or more digits.
5.
This regex matches any
non-empty numeric string (of digits 0 to 9), e.g., "0" and "12345".
It does not match with "" (empty string), "abc", "a123",
etc. However, it also matches "000", "0123" and "0001".
6.
You can also use /^\d+$/,
where \d is a metacharacter that matches any digit, which is
identical to [0-9].
Example 2: Positive Integer Literal /[1-9][0-9]*|0/
1.
[1-9] matches any
character between 1 to 9; [0-9]* matches zero or more digits. The * is
an occurrence indicator representing zero or more occurrences. Together, [1-9][0-9]* matches
any number without a leading zero.
2.
| represents the OR
operator; which is used to include the number 0.
3.
This expression matches
"0" and "123"; but does not match "000" and
"0123".
4.
You can replace [0-9] by
metacharacter \d.
5.
We did not use position
anchors ^ and $ in this regex. Hence, it can match any part
of the input string, instead of the whole string.
6.
Integers, both positive
and negative, can be written as /[+-]?[1-9]\d*|0/, where [+-] matches
either + or - sign, and ? is an occurrence
indicator denoting 0 or 1 occurrence (i.e. optional).
Example 3: An Identifier (or Name) /[a-zA-Z_][0-9a-zA-Z_]*/
1.
Begin with one letters
or underscore, followed by zero or more digits, letters and underscore.
2. You can use metacharacter \w (word
character) for [a-zA-Z0-9_]; \d (digit) for [0-9]. Hence,
it can be written as /[a-zA-Z_]\w*/.
3. To include dash (-) in the identifier, use /[a-zA-Z_][\w-]*/.
Nonetheless, dash conflicts with subtraction and is often excluded from
identifier.
Example 4: An Image Filename /^\w+\.(gif|png|jpg|jpeg)$/i
1. The two forward-slashes /.../ contains
a regex.
2. The leading ^ and trailing $ match
the beginning and the ending of the input string, respectively. That is, the
entire input string shall match with this regex, instead of a part of the input
string.
3. \w+ matches one or more word characters
(i.e., [a-zA-Z0-9_]+).
4. \. matches the . character. We
need to use \. to represent . as . has special
meaning in regex. The \ is known as the escape code, which restore
the original literal meaning of the following character.
5. (gif|png|jpg|jpeg) matches either "gif",
"png", "jpg" or "jpeg". The | denotes
"OR" operator.
6. The modifier i after the regex
specifies case-insensitive matching. That is, it accepts "test.GIF"
and "test.Gif".
Example 5: Email Address /^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$/
1. The two forward-slashes /.../ contains
a regex.
2. The leading ^ and trailing $ match
the beginning and the ending of the input string, respectively. That is, the
entire input string shall match with this regex, instead of a part of the input
string.
3. \w+ matches 1 or more word characters
(i.e., [a-zA-Z0-9_]).
4. [\.-] matches character . or -.
We need to use \. to represent . as . has special
meaning in regex. The \ is known as the escape code, which restore
the original literal meaning of the following character.
5. [\.-]? matches 0 or 1 occurrence of [\.-].
6. Again, \w+ matches 1 or more word
characters.
7. ([\.-]?\w+)* matches 0 or more occurrences
of [\.-]?\w+.
8. The sub-expression \w+([\.-]?\w+)* is
used to match the username in the email, before the @ sign. It begins
with at least one word character (a-z, A-Z, 0-9 and underscore), followed by
more word characters or . or -. However, a . or - must follow by
a word character (a-z, A-Z, 0-9 and underscore). That is, the string cannot
contain "..", "--", ".-" or "-.".
Example of valid string are "a.1-2-3".
9. The @ matches itself.
10. Again, the sub-expression \w+([\.-]?\w+)* is
used to match the email domain name, with the same pattern as the username
described above.
11. The sub-expression \.\w{2,3} matches a
. followed by two or three word characters, e.g., ".com", ".edu",
".us", ".uk", ".co".
12. (\.\w{2,3})+ specifies that the above
sub-expression could occur one or more times, e.g., ".com", ".co.uk",
".edu.sg" etc.
Exercise: Interpret this regex, which provide another representation
of email address: /^[\w\-\.\+]+\@[a-zA-Z0-9\.\-]+\.[a-zA-z0-9]{2,4}$/.
Example 6: Swapping Words using
Back-References /^(\S+)\s+(\S+)$/
1.
The two forward-slashes /.../ contain
a regex.
2.
The ^ and $ match
the beginning and ending of the input string, respectively
3.
The \s (lowercase s)
matches a whitespace (blank, tab \t, form-feed \f and newline \r or \n).
On the other hand, the \S+ (uppercase S) matches anything that
is not matched by \s, i.e., non-whitespace. In regex, the uppercase
metacharacter denotes the inverse of the lowercase
counterpart, for example, \w for word character and \W for
non-word character.
4.
The above regex matches
two words separated by one or more whitespaces.
5.
Parentheses () has two
meanings in regex: to group a sub-expression (e.g., (abc)?) and to provide
a so-called back-reference for capturing and extracting matches.
6.
The parentheses in (\S+),
called parenthesized back-reference, is used to extract the matched
sub-string from the input string. In this regex, there are two (\S+),
match the first two words, separated by one or more whitespaces \s+. The
two matched words are extracted from the input string and typically kept in
special variables $1 and $2 respectively.
7.
To swap the two words,
you can access the special variables, and print "$2 $1" (via a
programming language); or substitute operator "s/(\S+)\s+(\S+)/$2 $1/"
(in Perl).
Example 7: HTTP Address /^http:\/\/\S+(\/\S+)*(\/)?$/
1.
Begin with http://.
Take note that / needs to be written as \/ with an escape
code.
2.
Followed by \S+,
one or more non-whitespaces, for the domain name.
3.
Followed by (\/\S+)*,
zero or more "/...", for the sub-directories.
4.
Followed by (\/)?,
an optional (0 or 1) trailing /, for directory request.
A Regular Expression (or Regex) is a pattern (or filter)
that accepts a set of strings that matches the pattern.
A regex is typically delimited by a pair of forward slashes /.../.
It consists of a sequence of characters, meta-characters (such as ., \d, \s, \S)
and operators (such as +, *, ?, |, ^)
For examples,
Regex
|
Description
|
/^[1-9][0-9]*|0$/
|
A positive integer
|
/^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$/
|
An email address
|
The fundamental building blocks are patterns that match a single character. Most characters, including all letters (a-z and A-Z) and digits (0-9), match themselves. In other words, the regex /a/ matches the string "a"; regex /x/matches the string "x".2.1 Matching a Character
A regex is constructed by combining many smaller sub-expressions or atoms.
For example, the regex /Friday/ matches the string "Friday".
Escape Sequence: To match some special characters, such as +, *, / and \ which
are regex's operators, we need to precede the character with a backslash, known
as escape sequence (i.e., \+, \*, \/ and \\).
For example, the regex /a\*b\+/ matches the string "a*b+".
2.2 OR (|)
Operator
You can provide alternatives using the
"OR" operator, denoted by a vertical bar '|'. For example, the
regex /four|for|floor|4/ matches strings "four", "for",
"floor" or "4".
2.3 Bracket
List [...], [^...] and Range [ - ] Expressions
A bracket expression is a list of
characters enclosed by [ ], also called character class.
It matches any ONE character in the list. However, if the first character of
the list is the caret (^), then it matches any ONE character NOT in the list.
For example, the regex [02468] matches a single digit 0, 2, 4, 6, or
8; the regex [^02468] matches any single character other than 0, 2,
4, 6, or 8.
Instead of listing all characters, you could use a range
expression inside the bracket. A range expression consists of 2
characters separated by a hyphen (-). It matches any single character that
sorts between the two characters, inclusive. For example, [a-d] is
the same as [abcd]. You could include a caret (^) in front of the range to
invert the matching. For example, [^a-d] is equivalent to [^abcd].
Special characters:
·
To include a ],
place it first in the list.
·
To include a ^,
place it anywhere but first.
·
To include a - place
it last.
·
Most special characters
(such as +, *) lose their special meaning inside bracket list, and
can be used as they are.
2.4 Metacharacters ., \w, \W, \d, \D, \s, \S
A metacharacter is a symbol with a special
meaning inside a regex.
·
The metacharacter dot (.)
matches any single character except newline \n (same as [^\n]).
For example, /.../ matches any 3 characters (including alphabets,
numbers, whitespaces, but except newline); /the../ matches "there",
"these", "the ", and etc.
·
\w (word character)
matches any single letter, number or underscore (same as [a-zA-Z0-9_]).
The uppercase counterpart \W (non-word-character) matches any single
character that doesn't match by \w (same as [^a-zA-Z0-9_]).
·
In regex, the uppercase
metacharacter is always the inverse of the lowercase counterpart.
·
\d (digit) matches
any single digit (same as [0-9]). The uppercase counterpart \D (non-digit)
matches any single character that is not a digit (same as [^0-9]).
·
\s (space) matches
any single whitespace (same as [ \t\n\r\f], blank, tab, newline,
carriage-return and form-feed). The uppercase counterpart \S (non-space)
matches any single character that doesn't match by \s (same as [^
\t\n\r\f]).
Examples:
/\s\s/ # Matches two
whitespaces
/\S\S\s/ # Two
non-whitespaces followed by a whitespace
/\s+/ # one or more whitespaces
/\S+\s\S+/ # two words
(non-whitespaces) separated by a whitespace
2.5 Occurrence
Indicators: +, *, ?, {m}, {m,n}, {m,}
A regex sub-expression may be followed by an occurrence
indicator (aka repetition operator):
·
?: The preceding item is
optional and matched at most once (i.e., occurs 0 or 1 times).
·
*: The preceding item
will be matched zero or more times.
·
+: The preceding item
will be matched one or more times.
·
{m}: The preceding item
is matched exactly m times.
·
{m,}: The preceding item
is matched m or more times.
·
{m,n}: The preceding
item is matched at least m times, but not more than n times.
For example: The regex /xy{2,4}/ ('x' followed by 2
to 4 'y') matches "xyy", "xyyy" and "xyyyy".
2.6 Modifier
You can attach modifiers after a regex, in the
form of /.../modifiers, to control the matching behavior. For
example,
·
i: case insensitive
matching. The default is case-sensitive.
·
g: global matching,
i.e., search for all the matches. The default searches only the first match.
Examples: [TODO]
2.7 Greediness,
Backtracking and Laziness
Greediness: The *, +, ?, {m,n} repetition
operators are greedy operators, and by default grasp as many
characters as possible for a match. For example, the regex /xy{2,4}/ try
to match for "xyyyy", then "xyyy", and then
"xyy".
Backtracking: If a regex reaches a state where a match
cannot be completed, it backtracks by unwinding one character from the greedy
match. For example, if the regex /z*zzz/ is matched against the
string "zzzz", the z* first matches "zzzz";
unwinds to match "zzz"; unwinds to match "zz"; and finally
unwinds to match "z", such that the rest of the patterns can find a
match.
Laziness: You can put an extra ? after the repetition operator (*, +, ?,
or {}) to curb its greediness (i.e., stop at the shortest match) in the
form of *?, +?, ??, {}?. For example,
input = "The <code>first</code> and
<code>second</code> instances"
/<code>.*</code>/g matches
"<code>first</code> and <code>second</code>".
But
/<code>.*?</code>/g produces two matches:
"<code>first</code>" and
"<code>second</code>".
2.8 Positional
Metacharacters (aka Position Anchors) ^, $, \b, \B, \<, \>, \A, \Z
Positional anchors do NOT match actual character but matches position in
a string, such as begin-of-line, end-of-line, begin-of-word, and end-of-word.
·
The metacharacter caret
(^) matches the beginning of the line (or input string); and the metacharacter
dollar ($) matches the end of line, excluding newline (or end of input string).
These two are the most commonly used position anchors.
·
The metacharacter \b matches
the the edge of a word (i.e., word boundary after a whitespace); and \B matches
a string provided it's not at the edge of a word. For example, the regex /\bcat\b/ matches
the word "cat" in input string "This is a cat."; but does
not match string "This is a catalog.".
·
The metacharacters \< and \> match
the empty string at the beginning and the end of a word respectively (compared
with \b, which can match both the beginning and the end of a word).
·
\A matches the
beginning of the line, just like ^. \Z matches the end of the
line, just like $. However, \A and \Z match only the
actual beginning and ending, and ignore the embedded newlines within the
string.
You can use positional anchors liberally to
increase the speed of matching. For examples:
/ing$/ # ending with 'ing'
/^testing 123$/ # Matches
only one pattern. Should use equality comparison instead.
2.9 Capturing
Matches via Parenthesized Back-References & Matched Variables $1,... , $9
Parentheses ( ) serve two purposes in regex.
1. Firstly, parentheses ( ) can be used
to group sub-expressions for overriding the precedence or applying a repetition
operator. For example, /(a|e|i|o|u){3,5}/ is the same as /a{3,5}|e{3,5}|i{3,5}|o{3,5}|u{3,5}/.
2. Secondly, parentheses are used to provide the so
called back-references. A back-reference contains the matched
sub-string. For examples, the regex /(\S+)/ creates one
back-reference (\S+), which contains the first word (consecutive
non-spaces) of the input string; the regex /(\S+)\s+(\S+)/ creates
two back-references: (\S+) and another (\S+), containing the
first two words, separated by one or more spaces \s+.
The back-references are stored in special variables $1, $2,
…, $9, where $1 contains the substring matched the first pair of
parentheses, and so on. For example, /(\S+)\s+(\S+)/ creates two
back-references which matched with the first two words. The matched words are
stored in $1 and $2, respectively.
Back-references are important if you wish to
manipulate the string.
For example, the following Perl expression swap
the first and second words separate by a space:
s/(\S+) (\S+)/$2 $1/; #
Swap the first and second words separated by a single space
No comments:
Post a Comment