Regex in Python
The re module provides the following functions for searching a string for a match. The content is derived from w3schools.com.
Function | Description |
---|---|
findall | returns a list containing all matches |
search | returns a Match object for the first match only, or None if there is no match |
split | returns a list where the string has been split at each match |
sub | replaces one or more matches with a string |
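Of the four functions in the table, split is the only one not exercised in the examples that follow. The sketch below is a minimal illustration; the sample text and the separator pattern are assumptions chosen for the demonstration, not taken from the original examples.

```python
import re

# Hypothetical sample text with mixed separators.
text = "apple, banana;cherry  date"

# Split on a comma, semicolon, or whitespace character,
# followed by any amount of extra whitespace.
parts = re.split(r"[,;\s]\s*", text)
print(parts)      # ['apple', 'banana', 'cherry', 'date']

# maxsplit limits the number of splits; the rest stays in the last item.
first_two = re.split(r"[,;\s]\s*", text, maxsplit=1)
print(first_two)  # ['apple', 'banana;cherry  date']
```

Note that maxsplit belongs to re.split; the corresponding limit for re.sub is called count.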
Example I: Search each provided string and check whether it starts with The and ends with dead.
    import re

    strings = ['The shaun of the dead', 'the man is dead', 'the cat is not dead.']
    for each_str in strings:
        # ^The anchors the match at the start, dead$ at the end
        x = re.search("^The.*dead$", each_str)
        print(x)
The output of which is as follows:
    <re.Match object; span=(0, 21), match='The shaun of the dead'>
    None
    None

The output above shows that the pattern matches the first string, whereas None for the next two strings indicates that the pattern did not match: the second string starts with a lowercase the, and the third ends with a period rather than dead.
In regex, certain symbols have special meanings. The example above uses ^, ., * and $; the remaining symbols are listed for comparison.
- ^ : Indicates starts with
- . : Indicates any character (except newline)
- * : Indicates any number of times (zero or more occurrences)
- $ : Indicates the end.
- + : Indicates one or more occurrences (as opposed to *, which is 0 or more occurrences)
- ? : Indicates zero or one occurrence
- [] : Indicates a set of characters
Please refer to the w3schools.com page for other symbols and their meanings.
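The symbols +, ? and [] from the list above do not appear in Example I. The following sketch shows them on a small word list; the words and patterns are assumptions made up for illustration.

```python
import re

# Hypothetical word list for the demonstration.
words = ["color", "colour", "colouur", "clr"]

# ? makes the preceding 'u' optional (zero or one occurrence).
optional_u = [w for w in words if re.search(r"^colou?r$", w)]
print(optional_u)     # ['color', 'colour']

# + requires at least one 'u' (one or more occurrences).
one_or_more_u = [w for w in words if re.search(r"^colou+r$", w)]
print(one_or_more_u)  # ['colour', 'colouur']

# [] matches any single character from the set; here, one vowel.
vowels = re.findall(r"[aeiou]", "colour")
print(vowels)         # ['o', 'o', 'u']
```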
Example II: Here we apply both search() and findall() with the same pattern.

    import re

    strings = ['hello', 'hello ladies hello', 'oye hello']
    for each_str in strings:
        x = re.search("he.+o", each_str)   # first match only
        print(x)
        y = re.findall("he.+o", each_str)  # list of all matches
        print(y)
        print("------")
Following is the output:
    <re.Match object; span=(0, 5), match='hello'>
    ['hello']
    ------
    <re.Match object; span=(0, 18), match='hello ladies hello'>
    ['hello ladies hello']
    ------
    <re.Match object; span=(4, 9), match='hello'>
    ['hello']
    ------
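In the second string, the whole text 'hello ladies hello' is returned as a single match because + is greedy: it consumes as many characters as possible before the final o. As a brief aside that goes slightly beyond the original example, appending ? to the quantifier makes it non-greedy, so the match stops at the first possible o.

```python
import re

text = "hello ladies hello"

# Greedy: .+ extends to the last 'o' in the string.
greedy = re.findall(r"he.+o", text)
print(greedy)  # ['hello ladies hello']

# Non-greedy: .+? stops at the first 'o' that completes a match.
lazy = re.findall(r"he.+?o", text)
print(lazy)    # ['hello', 'hello']
```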
Example III: Let us find a set of characters. The .group() function can be used to print the first matching sequence of the re.Match object. Similarly, the .start() function can be used to print the starting position of the first matching sequence, and the .span() function can be used to access the tuple with the start and end index of the matching sequence.
    import re

    strings = ['123andd f567', 'adad 564 .;sfs']
    for each_str in strings:
        # A single digit
        x = re.search('[0-9]', each_str)
        print(x)
        print(x.group())
        y = re.findall('[0-9]', each_str)
        print(y)
        print('------')
        # Zero or more digits
        xx = re.search('[0-9]*', each_str)
        print(xx)
        print(xx.group())
        z = re.findall('[0-9]*', each_str)
        print(z)
        print('------')
        # One or more digits
        xx = re.search('[0-9]+', each_str)
        print(xx)
        print(xx.group())
        z = re.findall('[0-9]+', each_str)
        print(z)
        print('------')
        print('------')
Following is the output. For the second string, [0-9]* matches the empty string at position 0, so xx.group() prints an empty line.

    <re.Match object; span=(0, 1), match='1'>
    1
    ['1', '2', '3', '5', '6', '7']
    ------
    <re.Match object; span=(0, 3), match='123'>
    123
    ['123', '', '', '', '', '', '', '567', '']
    ------
    <re.Match object; span=(0, 3), match='123'>
    123
    ['123', '567']
    ------
    ------
    <re.Match object; span=(5, 6), match='5'>
    5
    ['5', '6', '4']
    ------
    <re.Match object; span=(0, 0), match=''>

    ['', '', '', '', '', '564', '', '', '', '', '', '', '']
    ------
    <re.Match object; span=(5, 8), match='564'>
    564
    ['564']
    ------
    ------
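If we look closely, we can see how * differs from + for repetition of the digits [0-9]. Because * also accepts zero occurrences, it produces an empty match at every position that holds no digit, whereas + returns only runs of one or more digits.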
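The accessor methods mentioned at the start of this example can be seen side by side in the short sketch below, which reuses the second string from the example. Since search() returns None when nothing matches, the result is checked before any method is called on it.

```python
import re

m = re.search(r"[0-9]+", "adad 564 .;sfs")

# search() returns None on failure, so guard before using the match object.
if m:
    print(m.group())  # matched text: '564'
    print(m.start())  # index where the match starts: 5
    print(m.end())    # index just past the end of the match: 8
    print(m.span())   # (start, end) tuple: (5, 8)
```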
Example IV:
    import re

    strings = ['this is sentence', 'his book']
    for each_str in strings:
        # 'his' at the start of a word
        x = re.search(r'\bhis', each_str)
        print(x)
        if x:
            print(x.group())
        print('--')
        # 'his' at the end of a word
        x = re.search(r'his\b', each_str)
        print(x)
        if x:
            print(x.group())
        print('--')
        # 'th' at the beginning of the string
        x = re.search(r'\Ath', each_str)
        print(x)
        if x:
            print(x.group())
        print('--')
        print('--')
\b matches at a word boundary, that is, at the start or end of a word: \bhis requires his to begin a word, and his\b requires it to end one. \A matches only at the beginning of the string. Following is the output.

    None
    --
    <re.Match object; span=(1, 4), match='his'>
    his
    --
    <re.Match object; span=(0, 2), match='th'>
    th
    --
    --
    <re.Match object; span=(0, 3), match='his'>
    his
    --
    <re.Match object; span=(0, 3), match='his'>
    his
    --
    None
    --
    --
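Placing \b on both sides of a pattern restricts it to whole words. The sketch below is a small illustration of that idea; the sample sentence is an assumption, not part of the original example.

```python
import re

# Hypothetical sentence containing 'his' both inside a word and on its own.
text = "this is his book"

# Without boundaries, 'his' is also found inside 'this'.
anywhere = [m.span() for m in re.finditer(r"his", text)]
print(anywhere)    # [(1, 4), (8, 11)]

# With \b on both sides, only the standalone word matches.
whole_word = [m.span() for m in re.finditer(r"\bhis\b", text)]
print(whole_word)  # [(8, 11)]
```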
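Example V: Match exactly four consecutive non-digit characters. Inside a set, a leading ^ negates the set, and {4} requires exactly four occurrences.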
    import re

    strings = ['this is sentence', 'his book']
    for each_str in strings:
        x = re.search('[^0-9]{4}', each_str)
        print(x)
        if x:
            print(x.group())
        x = re.findall('[^0-9]{4}', each_str)
        print(x)
        print('--')
Following shows the output:

    <re.Match object; span=(0, 4), match='this'>
    this
    ['this', ' is ', 'sent', 'ence']
    --
    <re.Match object; span=(0, 4), match='his '>
    his 
    ['his ', 'book']
    --
Example VI: Sub
    import re

    strings = ['this is sentence', 'his book']
    for each_str in strings:
        # Replace the first four non-digit characters at the start of the string
        x = re.sub('^[^0-9]{4}', 'kera', each_str)
        print(x)
        print('--')

Following is the output:

    kera is sentence
    --
    kerabook
    --
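For these strings, the same result can be achieved without the ^ anchor by using the optional count argument of re.sub, which limits the number of replacements. (maxsplit is the corresponding argument of re.split, not of re.sub.) Passing count as a keyword is preferable, since passing it positionally is deprecated in recent Python versions.

    import re

    strings = ['this is sentence', 'his book']
    for each_str in strings:
        # count=1 replaces only the first match
        x = re.sub('[^0-9]{4}', 'kera', each_str, count=1)
        print(x)
        print('--')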