Regex in Python

Python's re module provides the following functions for searching a string for a match. The content is derived from w3schools.com.

Function	Description
findall	Returns a list containing all matches
search	Returns a match object for the first match, or None if there is no match
split	Returns a list where the string has been split at each match
sub	Replaces one or more matches with a string
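
Of the four functions, split is the only one not used in the examples that follow, so here is a minimal sketch of it. The sample text and the separator pattern are made up for illustration.

```python
import re

text = "one, two;three  four"

# Split wherever a comma, semicolon, or whitespace occurs one or more times.
parts = re.split(r"[,;\s]+", text)
print(parts)  # ['one', 'two', 'three', 'four']

# The maxsplit keyword limits how many splits are performed.
first_split = re.split(r"[,;\s]+", text, maxsplit=1)
print(first_split)  # ['one', 'two;three  four']
```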

Example I: Search the provided strings and check whether each one starts with The and ends with dead.

The following code shows this:

    import re

    strings = ['The shaun of the dead', 'the man is dead', 'the cat is not dead.']
    for each_str in strings:
        x = re.search("^The.*dead$", each_str)
        print(x)

The output is as follows:

    <re.Match object; span=(0, 21), match='The shaun of the dead'>
    None
    None

The output shows that the pattern matches the first string. The None results for the other two strings indicate that the pattern did not match: both start with a lowercase the, and the third also ends with a period rather than dead.

In regex, certain symbols have special meanings. In the example above:

^ : Matches the start of the string
. : Matches any character (except newline)
* : Matches the preceding item any number of times (zero or more occurrences)
$ : Matches the end of the string

Similarly, the following are other symbols:

+ : Indicates one or more occurrences (as opposed to *, which is 0 or more occurrences)
? : Indicates zero or one occurrence
[] : Indicates a set of characters
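
As a brief sketch of these three symbols in use (the sample strings are made up for illustration):

```python
import re

# '+' needs at least one 'b'; with '*' a bare 'a' matches as well.
plus = re.findall(r"ab+", "a ab abb")
star = re.findall(r"ab*", "a ab abb")
print(plus)  # ['ab', 'abb']
print(star)  # ['a', 'ab', 'abb']

# '?' makes the preceding character optional (zero or one occurrence).
optional = re.findall(r"colou?r", "color colour")
print(optional)  # ['color', 'colour']

# '[]' matches any single character listed in the set.
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)  # ['e', 'e']
```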

Please refer to the w3schools.com page for other symbols and their meanings.

Example II: Here we use both search() and findall() with the same pattern.

    import re

    strings = ['hello', 'hello ladies hello', 'oye hello']
    for each_str in strings:
        x = re.search("he.+o", each_str)
        print(x)
        y = re.findall("he.+o", each_str)
        print(y)
        print("------")

The output is as follows:

    <re.Match object; span=(0, 5), match='hello'>
    ['hello']
    ------
    <re.Match object; span=(0, 18), match='hello ladies hello'>
    ['hello ladies hello']
    ------
    <re.Match object; span=(4, 9), match='hello'>
    ['hello']
    ------
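
The second string produces a single match covering the whole string because + is greedy: .+ consumes as many characters as it can while still letting the final o match. The sketch below contrasts this with the non-greedy form +?, which is not covered in the text above and is shown only to explain the output.

```python
import re

text = "hello ladies hello"

greedy = re.findall(r"he.+o", text)   # .+ takes as much as it can
lazy = re.findall(r"he.+?o", text)    # .+? stops at the first possible 'o'

print(greedy)  # ['hello ladies hello']
print(lazy)    # ['hello', 'hello']
```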

Example III: Let us find a set of characters. The .group() method of the re.Match object returns the text that was matched.
Similarly, the .start() method returns the starting position of the match.
The .span() method returns a tuple with the start and end indices of the match.

    import re

    strings = ['123andd f567', 'adad 564 .;sfs']
    for each_str in strings:
        # A single digit
        x = re.search('[0-9]', each_str)
        print(x)
        print(x.group())
        y = re.findall('[0-9]', each_str)
        print(y)
        print('------')
        # Zero or more digits
        xx = re.search('[0-9]*', each_str)
        print(xx)
        print(xx.group())
        z = re.findall('[0-9]*', each_str)
        print(z)
        print('------')
        # One or more digits
        xx = re.search('[0-9]+', each_str)
        print(xx)
        print(xx.group())
        z = re.findall('[0-9]+', each_str)
        print(z)
        print('------')
        print('------')

The output is as follows:

    <re.Match object; span=(0, 1), match='1'>
    1
    ['1', '2', '3', '5', '6', '7']
    ------
    <re.Match object; span=(0, 3), match='123'>
    123
    ['123', '', '', '', '', '', '', '567', '']
    ------
    <re.Match object; span=(0, 3), match='123'>
    123
    ['123', '567']
    ------
    ------
    <re.Match object; span=(5, 6), match='5'>
    5
    ['5', '6', '4']
    ------
    <re.Match object; span=(0, 0), match=''>

    ['', '', '', '', '', '564', '', '', '', '', '', '', '']
    ------
    <re.Match object; span=(5, 8), match='564'>
    564
    ['564']
    ------
    ------

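If we look closely, we can see how * differs from + when repeating [0-9]. Because * also accepts zero digits, it produces an empty match at every position where there is no digit, and search() can even succeed with an empty match at position 0. With +, only runs of at least one digit are matched.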
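
The example above prints .group() only. As a small sketch, the position methods can be read off the same match; .end() is included here alongside .start() and .span() for completeness.

```python
import re

text = "adad 564 .;sfs"
m = re.search(r"[0-9]+", text)

print(m.group())  # 564
print(m.start())  # 5
print(m.end())    # 8
print(m.span())   # (5, 8)

# The span can be used to slice the matched text out of the original string.
start, end = m.span()
print(text[start:end])  # 564
```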

Example IV:

Let us look at the special sequences.

    import re

    strings = ['this is sentence', 'his book']
    for each_str in strings:
        x = re.search(r'\bhis', each_str)
        print(x)
        if x:
            print(x.group())
        print('--')
        x = re.search(r'his\b', each_str)
        print(x)
        if x:
            print(x.group())
        print('--')
        x = re.search(r'\Ath', each_str)
        print(x)
        if x:
            print(x.group())
        print('--')
        print('--')

\b matches at a word boundary: r'\bhis' finds his only at the start of a word, and r'his\b' finds it only at the end of a word. \A matches only at the beginning of the string.

The output is as follows:

    None
    --
    <re.Match object; span=(1, 4), match='his'>
    his
    --
    <re.Match object; span=(0, 2), match='th'>
    th
    --
    --
    <re.Match object; span=(0, 3), match='his'>
    his
    --
    <re.Match object; span=(0, 3), match='his'>
    his
    --
    None
    --
    --
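
As a further sketch, placing \b on both sides of a pattern restricts it to whole words. The sample sentence is made up for illustration. Note the raw string prefix r: in an ordinary string literal, '\b' is a backspace character rather than a word boundary.

```python
import re

text = "this is his history"

# Without boundaries, 'his' is also found inside 'this' and 'history'.
anywhere = re.findall(r"his", text)
print(anywhere)  # ['his', 'his', 'his']

# With \b on both sides, only the standalone word matches.
whole_word = re.findall(r"\bhis\b", text)
print(whole_word)  # ['his']
```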

Example V:

Let us see how to exclude characters. Characters can be excluded by placing a caret (^) at the start of a set. The following example looks for a sequence of characters that are not digits (as indicated by [^0-9]). The curly braces give the number of repetitions to match; in this case, the pattern matches exactly 4 consecutive characters that are not 0-9.

    import re

    strings = ['this is sentence', 'his book']
    for each_str in strings:
        x = re.search('[^0-9]{4}', each_str)
        print(x)
        if x:
            print(x.group())
        x = re.findall('[^0-9]{4}', each_str)
        print(x)
        print('--')

The output is as follows:

    <re.Match object; span=(0, 4), match='this'>
    this
    ['this', ' is ', 'sent', 'ence']
    --
    <re.Match object; span=(0, 4), match='his '>
    his 
    ['his ', 'book']
    --

Example VI: Sub

Let us modify the above code slightly and replace the first 4 non-digit characters of each string with kera. The ^ at the start of the pattern anchors the match to the beginning of the string.

    import re

    strings = ['this is sentence', 'his book']
    for each_str in strings:
        x = re.sub('^[^0-9]{4}', 'kera', each_str)
        print(x)
        print('--')

The output is as follows:

    kera is sentence
    --
    kerabook
    --

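For these strings, the same result can be obtained without the ^ anchor by using the optional count argument of re.sub(), which limits the number of replacements. (maxsplit is the corresponding argument of re.split(), not re.sub().) Passing count by keyword makes the intent explicit.

    import re

    strings = ['this is sentence', 'his book']
    for each_str in strings:
        # Replace only the first run of 4 non-digit characters
        x = re.sub('[^0-9]{4}', 'kera', each_str, count=1)
        print(x)
        print('--')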
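
The anchored pattern and the count-limited call agree only when the string begins with four non-digit characters. The sketch below uses a made-up string that starts with digits to show where the two approaches diverge.

```python
import re

text = "12 this is sentence"

# Anchored: the string starts with digits, so nothing matches at position 0.
anchored = re.sub(r"^[^0-9]{4}", "kera", text)

# count=1: the first run of four non-digits anywhere in the string is replaced.
first_only = re.sub(r"[^0-9]{4}", "kera", text, count=1)

print(anchored)    # 12 this is sentence
print(first_only)  # 12keras is sentence
```

Here the first four non-digit characters are ' thi' (including the leading space), which is why the result reads 12keras.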