Regular Expressions in Python


A regular expression is a special sequence of characters that describes a search pattern. Using a specialized syntax, it lets us match or find strings, or sets of strings, in a piece of text.

Python's built-in re module handles regular expressions.

Character Description
. Any Character Except New Line
\d Digit (0-9)
\D Not a Digit (0-9)
\w Word Character (a-z, A-Z, 0-9, _)
\W Not a Word Character
\s Whitespace (space, tab, newline)
\S Not Whitespace (space, tab, newline)
\b Word Boundary
\B Not a Word Boundary
^ Beginning of a String
$ End of a String
[] Matches Characters in brackets
[^ ] Matches Characters NOT in brackets
| Either Or
( ) Group
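As a quick illustration of a few entries in the table, here is a minimal sketch using re.findall. The sample string is made up for this example and is not part of the rest of the post.

```python
import re

# Hypothetical sample text, chosen only to exercise the classes above.
sample = "Room 42, floor_3\tEast wing"

digits = re.findall(r'\d', sample)           # every single digit
words = re.findall(r'\w+', sample)           # runs of word characters
spaces = re.findall(r'\s', sample)           # each whitespace character
not_words = re.findall(r'[^\w\s]', sample)   # neither a word character nor whitespace

print(digits)
print(words)
print(spaces)
print(not_words)
```

Note that the underscore in floor_3 counts as a word character, so only the comma ends up in the last list.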

Quantifiers:

    * - 0 or More
    + - 1 or More
    ? - 0 or One
    {3} - Exact Number
    {3,4} - Range of Numbers (Minimum, Maximum)
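
To see how the quantifiers differ, here is a small sketch that applies each one to the same made-up string. The sample text and variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample: an 'a' followed by zero to four 'b' characters.
sample = "a ab abb abbb abbbb"

star = re.findall(r'\bab*\b', sample)        # 'a' followed by 0 or more 'b'
plus = re.findall(r'\bab+\b', sample)        # 'a' followed by 1 or more 'b'
optional = re.findall(r'\bab?\b', sample)    # 'a' followed by 0 or 1 'b'
exact = re.findall(r'\bab{3}\b', sample)     # 'a' followed by exactly 3 'b'
ranged = re.findall(r'\bab{3,4}\b', sample)  # 'a' followed by 3 to 4 'b'

for name, result in [('*', star), ('+', plus), ('?', optional),
                     ('{3}', exact), ('{3,4}', ranged)]:
    print(name, result)
```

The \b on both sides forces each pattern to cover a whole word, which makes the effect of each quantifier easier to compare.
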
import re

text_to_search = "Ha HaHa Hall"

pattern = re.compile(r'\bHa')
matches = pattern.finditer(text_to_search)
for match in matches:
	print(match.span(), match.group())

Output of the above code is:

((0, 2), 'Ha')
((3, 5), 'Ha')
((8, 10), 'Ha')

The text contains four occurrences of Ha, but only three were found. This is because we used \b, which matches at a word boundary, so the second Ha in HaHa is skipped. If we use \B (not a word boundary) instead, we get only the second Ha in HaHa:

((5, 7), 'Ha')
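
The two cases can be compared side by side. This is a minimal sketch that collects only the match spans, using the same text as above; the variable names are chosen for this example.

```python
import re

text_to_search = "Ha HaHa Hall"

# Spans where Ha starts at a word boundary.
boundary = [m.span() for m in re.finditer(r'\bHa', text_to_search)]
# Spans where Ha is preceded by another word character.
no_boundary = [m.span() for m in re.finditer(r'\BHa', text_to_search)]

print(boundary)
print(no_boundary)
```

Together the two lists cover all four occurrences of Ha, since every position is either a word boundary or not.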

We use the ^ symbol to indicate the beginning of a string. It is useful when we want to match something only at the start of the text, such as the first word of a sentence.

import re

text_to_search = "Starting of a sentence and it ends here"

pattern = re.compile(r'^Start')
matches = pattern.finditer(text_to_search)
for match in matches:
	print(match.span(), match.group())

Here the output is

((0, 5), 'Start')

If we use something else here, say ^of, we get no output, because of is not at the beginning of the string.
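
A short sketch of that difference, using re.search (which returns None when nothing matches) instead of the finditer loop used in this post:

```python
import re

text_to_search = "Starting of a sentence and it ends here"

at_start = re.search(r'^Start', text_to_search)
not_at_start = re.search(r'^of', text_to_search)

print(at_start.group())  # the matched text at the start of the string
print(not_at_start)      # None, because 'of' is not at the start
```

Checking the result against None is a common way to test whether a pattern matched at all.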

Similarly, $ indicates the end of a string.

import re

text_to_search = "Starting of a sentence and it ends here"

pattern = re.compile(r'here$')
matches = pattern.finditer(text_to_search)
for match in matches:
	print(match.span(), match.group())

Here the output is

((35, 39), 'here')

In regular expressions, the . matches any character except a newline. We are going to search for phone numbers. A phone number has 10 digits, with a separator after the 3rd and 6th digits, e.g. 147.236.5547.

import re

text_to_search = '''
123-456-7890
123.345.4152
985_456_1247
'''

pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')
matches = pattern.finditer(text_to_search)
for match in matches:
	print(match.span(), match.group())

The output will be

((1, 13), '123-456-7890')
((14, 26), '123.345.4152')
((27, 39), '985_456_1247')

Here -, . and _ were all matched, because the dot matches any character. We use \d to match a digit.
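
To make the behaviour of the dot concrete, here is a small sketch contrasting . with an escaped \. on a made-up string (the sample text is an assumption for this example):

```python
import re

# Hypothetical sample: the same digits with a dot, a dash and a newline between them.
sample = "123.456 123-456 123\n456"

any_char = re.findall(r'\d\d\d.\d\d\d', sample)      # . matches any character except a newline
literal_dot = re.findall(r'\d\d\d\.\d\d\d', sample)  # \. matches only a literal dot

print(any_char)
print(literal_dot)
```

The last pair of numbers is never matched, because the character between them is a newline.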

Suppose we want only phone numbers that are separated by . or -. How can we do that?

import re

text_to_search = '''
123-456-7890
123.345.4152
985_456_1247
'''

pattern = re.compile(r'\d\d\d[.-]\d\d\d[.-]\d\d\d\d')
matches = pattern.finditer(text_to_search)
for match in matches:
	print(match.span(), match.group())

Here, inside the square brackets, we specify that the character after the third and sixth digits should be a . or a -.

The output will be

((1, 13), '123-456-7890')
((14, 26), '123.345.4152')

Now we add a new constraint: we only want the numbers that start with 1 or 2. This can be done as shown below.

import re

text_to_search = '''
123-456-7890
723.345.4152
285-456-1247
'''

pattern = re.compile(r'[12]\d\d[.-]\d\d\d[.-]\d\d\d\d')
matches = pattern.finditer(text_to_search)
for match in matches:
	print(match.span(), match.group())

Here the first digit has to be 1 or 2.

The output will be

((1, 13), '123-456-7890')
((27, 39), '285-456-1247')

Note that _ is still not accepted as a separator, and the number starting with 7 is skipped.

import re

text_to_search = r'''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
coreyms.com
321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T

cat mat pat bat
'''

sentence = 'Start a sentence and then bring it to an end'

# print('\tTab')
# print(r'\tTab')  # This is a raw string. Python does not process backslash escapes in it, so it is printed as written.
# Output:
#     Tab
# \tTab
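
The effect of the r prefix can also be checked by comparing string lengths. This is a minimal sketch; the variable names are chosen for this example.

```python
plain = '\tTab'   # \t is interpreted as a single tab character
raw = r'\tTab'    # the backslash and the t are kept as two separate characters

print(len(plain), len(raw))
print(raw == '\\tTab')  # a raw string equals the same text with the backslash doubled
```

This is why regular expression patterns are usually written as raw strings: sequences such as \b and \d reach the re module unchanged.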

pattern = re.compile(r'\bHa')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match.span(), match.group())

# The standalone Ha and the first Ha in HaHa are matched because \b requires a word boundary before Ha.

pattern = re.compile(r'\BHa')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match.span(), match.group())

Here it matches the second Ha in HaHa, as there is no word boundary before it.

pattern = re.compile(r'^Start')
matches = pattern.finditer(sentence)
for match in matches:
    print(match.span(), match.group())

# Here the Start at the beginning is found.
# If we use ^a instead, we get no output, as there is no a at the start of the sentence.

pattern = re.compile(r'end$')
matches = pattern.finditer(sentence)
for match in matches:
    print(match.span(), match.group())

# Here end is matched. If we use and$, there is no match even though and is present, because it is not at the end.

pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match.span(), match.group())

# A . between the numbers matches any character. To be specific about which characters are allowed, we can list them in square brackets.

pattern = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match.span(), match.group())

# Suppose we want to match only the 800 and 900 numbers. We can do it like this:

pattern = re.compile(r'[89]00[-.]\d\d\d[-.]\d\d\d\d')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match.span(), match.group())

# [1-5] matches any digit between 1 and 5, and [a-f] matches any letter between a and f. [a-fA-F] matches both upper and lower case.

pattern = re.compile(r'[1-5]')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match.span(), match.group())

re.compile(r'[^a-fA-F]') -> matches any character other than those inside the brackets

#re.compile(r'[^b]at') -> matches cat, mat and pat, but not bat
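
Here is a minimal sketch of ranges and negated sets. The cat/mat/pat/bat words come from the text above; the second sample string is made up for this example.

```python
import re

rhymes = "cat mat pat bat"
# Any single character except b, followed by 'at'.
not_b = re.findall(r'[^b]at', rhymes)

# Hypothetical sample for the a-f range.
sample = "Beef 42 Zoo"
hex_letters = re.findall(r'[a-fA-F]', sample)   # letters a to f in either case
others = re.findall(r'[^a-fA-F\s]', sample)     # anything else, ignoring whitespace

print(not_b)
print(hex_letters)
print(others)
```

Inside square brackets the ^ only negates the set when it is the first character; elsewhere in a pattern it still means the beginning of the string.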

Instead of re.compile(r'\d\d\d.\d\d\d.\d\d\d\d'), I can use re.compile(r'\d{3}.\d{3}.\d{4}').
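
A quick way to convince yourself that the two spellings are equivalent is to run both on the same input. This sketch uses two of the phone numbers from the text above on one line; the variable names are assumptions for this example.

```python
import re

sample = "321-555-4321 123.555.1234"

long_form = re.findall(r'\d\d\d.\d\d\d.\d\d\d\d', sample)
short_form = re.findall(r'\d{3}.\d{3}.\d{4}', sample)

print(long_form)
print(long_form == short_form)  # both patterns find the same numbers
```

The {n} form is shorter and makes the intended digit counts easier to read.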

pattern = re.compile(r'M(r|s|rs)\.?\s[A-Z]\w*')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match.span(), match.group())

# Mr
((216, 218), 'Mr')
((228, 230), 'Mr')
((246, 248), 'Mr')
((260, 262), 'Mr')

# Mr. (the unescaped dot matches any character, so 'Mr ' and 'Mrs' are matched too)
((216, 219), 'Mr.')
((228, 231), 'Mr ')
((246, 249), 'Mrs')
((260, 263), 'Mr.')

# Mr\.?
((216, 219), 'Mr.')
((228, 230), 'Mr')
((246, 248), 'Mr')
((260, 263), 'Mr.')

# Mr\.?\s[A-Z]\w+
((216, 227), 'Mr. Schafer')
((228, 236), 'Mr Smith')

# Mr\.?\s[A-Z]\w*
((216, 227), 'Mr. Schafer')
((228, 236), 'Mr Smith')
((260, 265), 'Mr. T')

# M(r|s|rs)\.?\s[A-Z]\w*
((216, 227), 'Mr. Schafer')
((228, 236), 'Mr Smith')
((237, 245), 'Ms Davis')
((246, 259), 'Mrs. Robinson')
((260, 265), 'Mr. T')
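
The group in the last pattern can also be inspected on its own. The following is a small sketch, assuming the dot is meant to be escaped and using only the five names from the text; group(1) returns whatever the (r|s|rs) alternation matched.

```python
import re

# The five names from the text, one per line.
names = "Mr. Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T"

pattern = re.compile(r'M(r|s|rs)\.?\s[A-Z]\w*')

full_matches = [m.group() for m in pattern.finditer(names)]
titles = [m.group(1) for m in pattern.finditer(names)]

print(full_matches)
print(titles)
```

For Mrs. Robinson the engine first tries r, fails on the following s, and then backtracks to the rs alternative, so the order of the alternatives does not prevent the match here.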

emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''

pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
matches = pattern.finditer(emails)
for match in matches:
    print(match.span(), match.group())
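
As a rough check of the email pattern, the sketch below runs it over the three addresses plus one made-up string that should not match. It assumes the dot before the domain suffix is escaped, and it is only an illustration, not a complete email validator.

```python
import re

# The three addresses from the text plus one hypothetical non-address.
emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
not-an-address
'''

pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
found = pattern.findall(emails)

for address in found:
    print(address)
```

Because the pattern has no groups, findall returns the full text of each match.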

urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(urls)
for match in matches:
    print(match.span(), match.group(), match.group(1), match.group(2), match.group(3))

((1, 23), 'https://www.google.com', 'www.', 'google', '.com')
((24, 42), 'http://coreyms.com', None, 'coreyms', '.com')
((43, 62), 'https://youtube.com', None, 'youtube', '.com')
((63, 83), 'https://www.nasa.gov', 'www.', 'nasa', '.gov')

subbed_urls = pattern.sub(r'\2\3', urls)
print(subbed_urls)

# Here each matched URL is replaced with its second and third groups (the domain name and the top-level domain). Groups are the parts of the pattern inside ( ).

google.com
coreyms.com
youtube.com
nasa.gov
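
The substitution step can be tried in isolation. This is a minimal sketch with two of the URLs placed on one line, assuming the dots in the pattern are escaped; \2 and \3 in the replacement refer to the second and third groups.

```python
import re

urls = "https://www.google.com http://coreyms.com"

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

# Replace every matched URL with its domain name and top-level domain.
domains = pattern.sub(r'\2\3', urls)

print(domains)
```

Group 1 is None for the URL without www., which is harmless here because the replacement never refers to it.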

pattern = re.compile(r'start', re.IGNORECASE)  # Here we want to ignore case.
matches = pattern.finditer(sentence)
for match in matches:
    print(match.span(), match.group())
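
The effect of the flag is easiest to see by running the same pattern with and without it. This is a short sketch using the sentence from above; the variable names are chosen for this example.

```python
import re

sentence = 'Start a sentence and then bring it to an end'

case_sensitive = re.findall(r'start', sentence)
case_insensitive = re.findall(r'start', sentence, re.IGNORECASE)

print(case_sensitive)    # lowercase 'start' does not appear in the sentence
print(case_insensitive)  # with the flag, 'Start' is found
```

re.I is a shorter alias for re.IGNORECASE.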
