Regular Expression

RegEx, short for regular expression, is a sequence of characters that defines a search pattern, which is then used to find, replace, and perform other operations on strings and other data.


Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match.

Metacharacters:

. ^ $ * + ? { } [ ] \ | ( )

Metacharacters don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning.

[]: The first metacharacters to look at are [ and ]. They’re used for specifying a character class, which is a set of characters that you wish to match.

Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].

Metacharacters (except \) are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it’s stripped of its special nature.

You can match the characters not listed within the class by complementing the set. This is indicated by including a '^' as the first character of the class. For example, [^5] will match any character except '5'. If the caret appears elsewhere in a character class, it does not have special meaning. For example, [5^] will match either a '5' or a '^'.
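As a rough illustration of the character-class rules above, the following sketch runs a few classes through re.findall(). The sample strings are made up for the example.

```python
import re

# [abc] and [a-c] describe the same set of characters.
listed = re.findall(r'[abc]', 'a big cab')
ranged = re.findall(r'[a-c]', 'a big cab')
print(listed == ranged)   # True

# '$' loses its special meaning inside a class.
dollar = re.findall(r'[akm$]', 'make $5')
print(dollar)             # ['m', 'a', 'k', '$']

# A leading '^' complements the class: anything except '5'.
not_five = re.findall(r'[^5]', '1555')
print(not_five)           # ['1']

# Anywhere else in the class, '^' is an ordinary character.
five_or_caret = re.findall(r'[5^]', '2^5')
print(five_or_caret)      # ['^', '5']
```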

\: The backslash is used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
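A small sketch of escaping in practice, assuming the patterns are written as Python raw strings (r'...') so that the backslashes reach the regex engine unchanged. The sample strings are invented for the example.

```python
import re

# \[ matches a literal '[' instead of opening a character class.
bracket = re.search(r'\[', 'list[0]')
print(bracket.span())     # (4, 5)

# \\ matches a single literal backslash in the searched text.
backslash = re.search(r'\\', r'C:\temp')
print(backslash.span())   # (2, 3)
```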


\d Matches any decimal digit; this is equivalent to the class [0-9].
\D Matches any non-digit character; this is equivalent to the class [^0-9].
\s Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
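The special sequences above can be tried out with a short sketch like the following; the sample text is made up.

```python
import re

text = 'Room 42, floor 3\n'

digits = re.findall(r'\d', text)
print(digits)          # ['4', '2', '3']

words = re.findall(r'\w+', text)
print(words)           # ['Room', '42', 'floor', '3']

non_space_runs = re.findall(r'\S+', text)
print(non_space_runs)  # ['Room', '42,', 'floor', '3']

whitespace = re.findall(r'\s', text)
print(whitespace)      # [' ', ' ', ' ', '\n']
```

Note that for str patterns these sequences also match non-ASCII digits, letters, and whitespace unless the ASCII flag is used, so the bracketed equivalents listed above describe the ASCII-only behaviour.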


Repeating Things:

The first metacharacter for repeating things that we’ll look at is *. * doesn’t match the literal character '*'; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.

For example, ca*t will match 'ct' (0 'a' characters), 'cat' (1 'a'), 'caaat' (3 'a' characters), and so forth.

Repetitions such as * are greedy; when repeating a RE, the matching engine will try to repeat it as many times as possible. If later portions of the pattern don’t match, the matching engine will then back up and try again with fewer repetitions.
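The greedy behaviour can be observed directly. This is only a sketch with an invented sample string; it uses . (any character except a newline) as the repeated item.

```python
import re

# .* first runs to the end of the string, then backs up one character
# at a time until the final 'b' in the pattern can match.
greedy = re.match(r'a.*b', 'axbxbx')
print(greedy.group())   # axbxb
```

The match ends at the last 'b' rather than the first one, because the engine only gives back as many characters as it has to.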

A step-by-step example will make this more obvious. Consider the RE a[bcd]*b, which matches the letter 'a', zero or more letters from the class [bcd], and finally a 'b'. Now imagine matching this RE against the string 'abcbd'.

Step  Matched  Explanation
1     a        The a in the RE matches.
2     abcbd    The engine matches [bcd]*, going as far as it can, which is to the end of the string.
3     Failure  The engine tries to match b, but the current position is at the end of the string, so it fails.
4     abcb     Back up, so that [bcd]* matches one less character.
5     Failure  Try b again, but the current position is at the last character, which is a 'd'.
6     abc      Back up again, so that [bcd]* is only matching bc.
7     abcb     Try b again. This time the character at the current position is 'b', so it succeeds.

The end of the RE has now been reached, and it has matched 'abcb'.

Another repeating metacharacter is +, which matches one or more times. For example, ca+t will match 'cat' (1 'a') and 'caaat' (3 'a's), but won’t match 'ct'.

The question mark character, ?, matches either once or zero times; you can think of it as marking something as being optional. For example, home-?brew matches either 'homebrew' or 'home-brew'.

The most complicated quantifier is {m,n}, where m and n are decimal integers. This quantifier means there must be at least m repetitions, and at most n. For example, a/{1,3}b will match 'a/b', 'a//b', and 'a///b'. It won’t match 'ab', which has no slashes, or 'a////b', which has four.

{0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?. It’s better to use *, +, or ? when you can, simply because they’re shorter and easier to read.
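The quantifier examples above can be checked with a short sketch. The helper matches() is a hypothetical name introduced only for this illustration; it uses re.fullmatch() so that a candidate counts only when the whole string matches.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full (illustrative helper)."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = matches(r'ca*t', ['ct', 'cat', 'caaat'])
print(star)       # ['ct', 'cat', 'caaat']

plus = matches(r'ca+t', ['ct', 'cat', 'caaat'])
print(plus)       # ['cat', 'caaat']

optional = matches(r'home-?brew', ['homebrew', 'home-brew', 'home--brew'])
print(optional)   # ['homebrew', 'home-brew']

bounded = matches(r'a/{1,3}b', ['ab', 'a/b', 'a//b', 'a///b', 'a////b'])
print(bounded)    # ['a/b', 'a//b', 'a///b']
```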

https://docs.python.org/3/howto/regex.html#simple-patterns.

Find and replace:

find(): The find() string method returns the starting index of the first occurrence of a word or series of words. It returns -1 if the text is not found in the string. An optional start and end index can be given to restrict the search to part of the string.

replace(): The replace() string method replaces every occurrence of some text in a string with other text and returns the result. It returns an unchanged copy of the string if it doesn't find the text to replace.

index(): The index() string method works like find(), but instead of returning -1 it raises a ValueError if it doesn't find the text being searched for.
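A brief sketch of the three string methods described above; note that these are plain str methods and do not involve the re module. The sample text is made up.

```python
text = 'regular expressions are regular'

first = text.find('regular')          # index of the first occurrence
later = text.find('regular', 1)       # search only from index 1 onwards
missing = text.find('python')         # not present
print(first, later, missing)          # 0 24 -1

replaced = text.replace('regular', 'regex')
print(replaced)                       # regex expressions are regex

unchanged = text.replace('python', 'regex')
print(unchanged == text)              # True

# index() behaves like find() but raises instead of returning -1.
try:
    text.index('python')
    index_failed = False
except ValueError:
    index_failed = True
print(index_failed)                   # True
```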

Performing Matches

    match(): Determine if the RE matches at the beginning of the string.

    search(): Scan through a string, looking for any location where this RE matches.

    findall(): Find all substrings where the RE matches, and returns them as a list.

    finditer(): Find all substrings where the RE matches, and returns them as an iterator.
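The four methods listed above can be compared side by side. This is a minimal sketch on an invented sample string, using a compiled pattern object.

```python
import re

p = re.compile(r'\d+')
text = '12 drummers, 11 pipers, 10 lords'

# match() only succeeds if the RE matches at the very start.
at_start = p.match(text)
print(at_start.group())            # 12
print(p.match('lords: 10'))        # None

# search() scans forward and reports the first match.
found = p.search('lords: 10')
print(found.group())               # 10

# findall() returns all matching substrings as a list.
all_numbers = p.findall(text)
print(all_numbers)                 # ['12', '11', '10']

# finditer() yields match objects one at a time.
spans = [m.span() for m in p.finditer(text)]
print(spans)                       # [(0, 2), (13, 15), (24, 26)]
```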

match() versus search()

The match() function only checks if the RE matches at the beginning of the string while search() will scan forward through the string for a match. It’s important to keep this distinction in mind. Remember, match() will only report a successful match which will start at 0; if the match wouldn’t start at zero, match() will not report it.

>>> import re
>>> print(re.match('super', 'superstition').span())
(0, 5)
>>> print(re.match('super', 'insuperable'))
None

On the other hand, search() will scan forward through the string, reporting the first match it finds.

>>> print(re.search('super', 'superstition').span())
(0, 5)
>>> print(re.search('super', 'insuperable').span())
(2, 7)


Modifying Strings

Up to this point, we’ve simply performed searches against a static string. Regular expressions are also commonly used to modify strings in various ways, using the following pattern methods:

Method/Attribute  Purpose
split()           Split the string into a list, splitting it wherever the RE matches
sub()             Find all substrings where the RE matches, and replace them with a different string
subn()            Does the same thing as sub(), but returns the new string and the number of replacements

>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']
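sub() and subn() can be sketched in the same way as split(). The pattern and sample sentences below are chosen only for illustration.

```python
import re

p = re.compile(r'blue|white|red')

# sub() replaces every match with the replacement string.
all_replaced = p.sub('colour', 'blue socks and red shoes')
print(all_replaced)     # colour socks and colour shoes

# The optional count argument limits the number of replacements.
one_replaced = p.sub('colour', 'blue socks and red shoes', count=1)
print(one_replaced)     # colour socks and red shoes

# subn() returns a (new_string, number_of_replacements) tuple.
with_count = p.subn('colour', 'blue socks and red shoes')
print(with_count)       # ('colour socks and colour shoes', 2)

no_match = p.subn('colour', 'no colours at all')
print(no_match)         # ('no colours at all', 0)
```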

Compiling Regular Expressions

Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.

>>> import re
>>> p = re.compile('ab*')
>>> p
re.compile('ab*')

re.compile() also accepts an optional flags argument, used to enable various special features and syntax variations. We’ll go over the available settings later, but for now a single example will do:

>>> p = re.compile('ab*', re.IGNORECASE)

A successful match() or search() returns a match object, which has several methods; the most important ones are:

Method/Attribute  Purpose
group()           Return the string matched by the RE
start()           Return the starting position of the match
end()             Return the ending position of the match
span()            Return a tuple containing the (start, end) positions of the match
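A minimal sketch of the match object methods listed above. Since match() and search() return None when nothing matches, the result is checked before it is used.

```python
import re

p = re.compile(r'[a-z]+')
m = p.search('::: message')

if m:
    print(m.group())   # message
    print(m.start())   # 4
    print(m.end())     # 11
    print(m.span())    # (4, 11)
else:
    print('No match')

# match() anchors at the start, where ':' is not a lowercase letter.
no_match = p.match('::: message')
print(no_match)        # None
```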


Flag                           Meaning
ASCII, A                       Makes several escapes like \w, \b, \s and \d match only on ASCII characters with the respective property.
DOTALL, S                      Make . match any character, including newlines.
IGNORECASE, I                  Do case-insensitive matches.
LOCALE, L                      Do a locale-aware match.
MULTILINE, M                   Multi-line matching, affecting ^ and $.
VERBOSE, X (for ‘extended’)    Enable verbose REs, which can be organized more cleanly and understandably.
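The effect of most of these flags can be seen in a short sketch. The sample strings and the small number pattern are invented for the example; LOCALE is left out because its behaviour depends on the system locale.

```python
import re

text = 'First line\nsecond line'

# IGNORECASE: letters match regardless of case.
any_case = re.findall(r'line', 'Line LINE line', re.IGNORECASE)
print(any_case)                    # ['Line', 'LINE', 'line']

# MULTILINE: ^ also matches right after each newline.
first_words = re.findall(r'^\w+', text)
line_starts = re.findall(r'^\w+', text, re.MULTILINE)
print(first_words, line_starts)    # ['First'] ['First', 'second']

# DOTALL: . matches newlines too, so the match can span both lines.
without_dotall = re.search(r'First.*second', text)
with_dotall = re.search(r'First.*second', text, re.DOTALL)
print(without_dotall is None, with_dotall is not None)   # True True

# ASCII: \w no longer matches non-ASCII letters such as 'é' (U+00E9).
unicode_word = re.findall(r'\w+', 'caf\u00e9')
ascii_word = re.findall(r'\w+', 'caf\u00e9', re.ASCII)
print(unicode_word, ascii_word)    # ['café'] ['caf']

# VERBOSE: whitespace and # comments inside the pattern are ignored.
number = re.compile(r"""
    \d+          # integer part
    (\.\d+)?     # optional fractional part
""", re.VERBOSE)
print(bool(number.fullmatch('3.14')), bool(number.fullmatch('3.')))   # True False
```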


More Symbols

$ Matches at the end of the string.

^ Matches at the start of the string.

. Matches any single character except a newline.

[] Matches any single character from the set listed inside the brackets.

| Works like OR: A|B matches anything that matches either A or B.
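Each of these symbols can be tried in a quick sketch; the sample strings are made up for the example.

```python
import re

# $ anchors the pattern at the end of the string.
ends_with_cat = re.search(r'cat$', 'wildcat')
not_at_end = re.search(r'cat$', 'cats')
print(ends_with_cat is not None, not_at_end is None)        # True True

# ^ anchors the pattern at the start of the string.
starts_with_cat = re.search(r'^cat', 'catalog')
not_at_start = re.search(r'^cat', 'wildcat')
print(starts_with_cat is not None, not_at_start is None)    # True True

# . matches any single character except a newline.
dot = re.findall(r'c.t', 'cat cot c\nt')
print(dot)           # ['cat', 'cot']

# [] matches one character out of the listed set.
either_vowel = re.findall(r'gr[ae]y', 'gray grey groy')
print(either_vowel)  # ['gray', 'grey']

# | matches the alternative on its left or the one on its right.
alternation = re.findall(r'cat|dog', 'cat, dog, bird')
print(alternation)   # ['cat', 'dog']
```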