Regex - Vinod Sebastian

Regex or Regular Expression

What is RegEx?
A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. It is typically used to find text, to find and replace text, or to validate input. For example, \d outside [ ] normally matches any digit character, ^ matches the start and $ matches the end. So /^\d\d$/ matches text consisting of exactly two digits. Hence we can find such strings, replace them or validate them.
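
The three uses mentioned above (validate, find, replace) can be sketched in Python with the standard re module. This is only a minimal illustration; the sample strings and variable names are invented for the example.

```python
import re

# Validate: the whole text must be exactly two digits.
two_digits = re.compile(r"^\d\d$")
is_valid = bool(two_digits.match("42"))
is_invalid = bool(two_digits.match("421"))
print(is_valid, is_invalid)  # True False

# Find: collect every pair of adjacent digits in a longer text.
found = re.findall(r"\d\d", "Room 42, floor 07")
print(found)  # ['42', '07']

# Replace: mask every pair of adjacent digits.
masked = re.sub(r"\d\d", "##", "Room 42, floor 07")
print(masked)  # Room ##, floor ##
```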

To Remember

The asterisk (*) matches the preceding element zero or more times, the plus / addition sign (+) matches it one or more times, and the question mark (?) matches it zero or one time. The preceding element can be a single character, a character class or a group.
The dot (.) matches exactly one character (any character except a newline, unless the dot-all flag is set).
[…] matches any character inside brackets.
[^…] matches every character except the ones inside brackets.
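\D matches any non-digit character.
\d matches any digit character.
\w is the same as regex [A-Za-z0-9_] for ASCII text (for str patterns, Python 3 also matches Unicode letters and digits unless re.ASCII is set).
\W is the same as regex [^A-Za-z0-9_] under the same conditions.
\b matches a word boundary, roughly the positions found by regex (^\w|\w$|\W\w|\w\W), but it is zero-width and consumes no characters.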
\s matches white space.
\S matches anything but white space.
^ matches the beginning of the string (or of each line in multiline mode) and $ matches the end of the string (or of each line in multiline mode).
\A matches the beginning of the string, not the beginning of an internal line, and \z matches the end of the string, not the end of an internal line.
The multiline flag is written m. An example is /\d\d/m.
Parentheses () group a series of pattern elements into a single element. They are also used to match and then remember the match. The remembered substrings are stored in special variables labelled $1, $2, and so on up to $9.
Pipe | separates alternative possibilities. For example, a|b means a or b.
z{5} means zzzzz.
{M,N} denotes the minimum M and the maximum N match count. {M} matches exactly M times; {M,} matches at least M times; {0,N} matches at most N times.
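
The \b entry in the list above is easier to see with a concrete case: in Python's re module, \b is a zero-width assertion, so it marks a position and consumes no characters. A small sketch with made-up sample text:

```python
import re

text = "cat concat cat."

# \b consumes nothing, so only the whole-word occurrences are returned.
whole_words = re.findall(r"\bcat\b", text)
print(whole_words)  # ['cat', 'cat']

# Without the boundaries, the "cat" inside "concat" is found too.
anywhere = re.findall(r"cat", text)
print(anywhere)  # ['cat', 'cat', 'cat']

# Each span covers only the three letters, not the neighbouring space or dot.
word_spans = [m.span() for m in re.finditer(r"\bcat\b", text)]
print(word_spans)  # [(0, 3), (11, 14)]
```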

import re

#The dot (.) matches exactly one character.
print(re.findall(".", "Hello")) 
# ['H', 'e', 'l', 'l', 'o']


#The asterisk (*) matches the character preceding it zero or more times.
print(re.findall(".*", "Hello")) 
# ['Hello', '']


#The plus / addition sign (+)  matches the character preceding it one or more times.
print(re.findall(".+", "Hello")) 
# ['Hello']


#The question mark (?) matches the character preceding it zero or one time.
print(re.findall(".?", "Hello"))
# ['H', 'e', 'l', 'l', 'o', '']


#[…] matches any character inside brackets.
print(re.findall("[Hl]", "Hello and How"))
#['H', 'l', 'l', 'H']


#[^…] matches every character except the ones inside brackets.
print(re.findall("[^Hl]", "Hello and How"))
#['e', 'o', ' ', 'a', 'n', 'd', ' ', 'o', 'w']


#\D represents Non digit character.
print(re.findall(r"\D", "H3ll0 and H0w"))
#['H', 'l', 'l', ' ', 'a', 'n', 'd', ' ', 'H', 'w']


#\d represents Digit character.
print(re.findall(r"\d", "H3ll0 and H0w"))
#['3', '0', '0']


#\w is same as regex [A-Za-z0-9_].
print(re.findall(r"\w", "H3ll0 a?d H0*"))
#['H', '3', 'l', 'l', '0', 'a', 'd', 'H', '0']


#\W is same as regex [^A-Za-z0-9_].
print(re.findall(r"\W", "H3ll0 a?d H0*"))
#[' ', '?', ' ', '*']


#\b is same as regex (^\w|\w$|\W\w|\w\W).
print(re.findall(r"\bH", "Hello and How"))
#['H', 'H']

#\s matches white space.
print(re.findall(r"\s", "H3ll0 a?d H0*"))
#[' ', ' ']


#\S matches anything but white space.
print(re.findall(r"\S", "H3ll0 a?d H0*"))
#['H', '3', 'l', 'l', '0', 'a', '?', 'd', 'H', '0', '*']


#In Python, re.MULTILINE is the multiline flag.
#In regex literal notation, flags follow the pattern as /pattern/flags. An example is /^.{12}/m, where the flag is m.
#Another common flag in Python is re.IGNORECASE, which ignores the case of the text.
#The flag i is usually used for this, as in /^.{12}/i.
#Flags can be combined as /^.{12}/im, or in Python as re.MULTILINE | re.IGNORECASE passed as the flags parameter.

#^ matches beginning of line or string.
print(re.findall("^H", "Hello\nand\nHow", re.MULTILINE))
#['H', 'H']


#\A matches beginning of string and not beginning of internal line.
print(re.findall(r"\AH", "Hello\nand\nHow", re.MULTILINE))
#['H']


#$ matches end of line or string.
print(re.findall("d$", "Hello\nand\nHow", re.MULTILINE))
#['d']


#\z matches end of string and not end of internal line. 
#In Python, the token \Z does what \z does in other engines: it only matches at the very end of the string. 
print(re.findall(r"d\Z", "Hello\nand\nHow", re.MULTILINE))
#[]


#Pipe | separates alternative possibilities. For example, a|b means a or b.
print(re.findall("He|Ho?", "Hello and How"))
# ['He', 'Ho']


#z{5} means zzzzz.
print(re.findall("l{2}", "Hello and How"))
#['ll']


#{M,N} denotes the minimum M and the maximum N match count. {M} matches exactly M times; {M,} matches at least M times; {0,N} matches at most N times.
print(re.findall("l{3,4}", "Hellllo and Hlllw"))
#['llll', 'lll']


#Parentheses group a series of pattern elements into a single element.
#They can be used along with special variables, labelled $1, $2, and so on up to $9, that store the substrings captured by the parentheses.
#The captured groups can then be used in replacement text ($1 and so on; Python's re.sub writes them as \1 or \g<1>).
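
The capturing parentheses described above have no example in the listing, so here is a minimal sketch. Note that Python's re module writes backreferences in replacement text as \1 or \g<1> rather than $1. The sample date string and variable names are invented for illustration.

```python
import re

date = "2024-05-17"
date_pattern = r"(\d{4})-(\d{2})-(\d{2})"

# Three capturing groups: year, month, day.
match = re.search(date_pattern, date)
year, month, day = match.groups()
print(year, month, day)  # 2024 05 17

# Reuse the captured groups in the replacement text.
reordered = re.sub(date_pattern, r"\3/\2/\1", date)
print(reordered)  # 17/05/2024

# Parentheses also group elements so that a quantifier applies to all of them.
# (?:...) groups without capturing, so findall returns the full matches.
repeats = re.findall(r"(?:ab)+", "ababab ab")
print(repeats)  # ['ababab', 'ab']
```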

Greedy vs. Reluctant vs. Possessive Quantifiers

In Python and some other implementations (e.g. Java), the three common quantifiers (*, + and ?) are greedy by default: they match as many characters as possible. A greedy quantifier can backtrack, giving characters back if the rest of the pattern fails to match.

In Java, quantifiers may be made possessive by appending a plus sign, which disables backing off (in a backtracking engine), even if doing so would allow the overall match to succeed. Python supports the same syntax from version 3.11.
A related construct is atomic grouping, which disables backtracking for a parenthesized group. The typical syntax is (?>group). For example, while ^(wi|w)i$ matches both wi and wii, ^(?>wi|w)i$ matches only wii, because the engine is forbidden from backtracking to retry with the group set to “w”. A possessive quantifier is similar to a greedy one but does not backtrack, which also makes it easier to implement.
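
The wi / wii behaviour can be checked in Python. The sketch below assumes nothing beyond the standard re module: because the (?>group) syntax is only accepted by Python 3.11 and later, it also shows a commonly used emulation (a lookahead that captures, followed by a backreference), which is a substitute technique and not the native construct.

```python
import re
import sys

plain = re.compile(r"^(wi|w)i$")

# Emulated atomic group: the lookahead keeps its first successful capture,
# and \1 then consumes exactly that text with no chance to retry "w".
emulated_atomic = re.compile(r"^(?=(wi|w))\1i$")

results = {}
for word in ("wi", "wii"):
    results[word] = (bool(plain.match(word)), bool(emulated_atomic.match(word)))
    print(word, results[word])
# wi (True, False)
# wii (True, True)

# Native atomic group syntax, compiled only where the interpreter supports it.
if sys.version_info >= (3, 11):
    native_atomic = re.compile(r"^(?>wi|w)i$")
    native_results = (bool(native_atomic.match("wi")), bool(native_atomic.match("wii")))
    print(native_results)  # (False, True)
```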

A reluctant or “non-greedy” quantifier, written by appending a question mark (*?, +? or ??), first matches as little as possible and expands only as far as needed. It can backtrack.

Enter your regex: .*test // Greedy quantifier
Enter input string to search: xtestxxxxxxtest
I found the text "xtestxxxxxxtest" starting at index 0 and ending at index 15.

Enter your regex: .*?test // Reluctant quantifier
Enter input string to search: xtestxxxxxxtest
I found the text "xtest" starting at index 0 and ending at index 5.
I found the text "xxxxxxtest" starting at index 5 and ending at index 15.

Enter your regex: .*+test // Possessive quantifier
Enter input string to search: xtestxxxxxxtest
No match found.
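
The three runs above can be reproduced in Python as a rough sketch. The helper name spans is hypothetical, and the possessive case is guarded by a version check because Python accepts the .*+ syntax only from 3.11 onwards.

```python
import re
import sys

text = "xtestxxxxxxtest"

def spans(pattern):
    """Return (matched text, start, end) for every match of pattern in text."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]

greedy = spans(r".*test")
reluctant = spans(r".*?test")
print(greedy)     # [('xtestxxxxxxtest', 0, 15)]
print(reluctant)  # [('xtest', 0, 5), ('xxxxxxtest', 5, 15)]

# Possessive: .*+ swallows the rest of the text and never gives it back,
# so the literal "test" that follows can never match.
if sys.version_info >= (3, 11):
    possessive = spans(r".*+test")
    print(possessive)  # []
```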

A Complete Example In Python

#Finds every match that starts with "This" and ends with "P followed by 3 to 5 word characters and a dot" on a single line:

import re

text = "This matches given regular expression in PHP.\n"
text += "This matches given regular expression in Python.\n"
text += "This matches given regular expression in C.\n"
text += "This matches given regular expression in Perl."

#A raw string keeps the backslashes intact; \. matches a literal dot.
result = re.findall(r"This.* P\w{3,5}\.", text)

if result:
    print(result)
else:
    print("No match")

['This matches given regular expression in Python.', 'This matches given regular expression in Perl.']