RegEx stands for Regular Expression: a sequence of characters that defines a search pattern, which is then used to find, replace, and perform other operations on strings and data.
Regular expressions (called REs, regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match.
Metacharacters:
. ^ $ * + ? { } [ ] \ | ( )
Metacharacters don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched or they affect other portions of the RE by repeating them or changing their meaning.
[ and ]: These are used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].
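As a minimal sketch of the character classes described above, the snippet below uses re.findall() to list every character a class matches. The sample strings and variable names are only illustrative.

```python
import re

# [abc] and [a-c] describe the same set of characters.
listed = re.findall(r"[abc]", "a1b2c3d4")
ranged = re.findall(r"[a-c]", "a1b2c3d4")

# [a-z] matches any single lowercase ASCII letter.
lowercase = re.findall(r"[a-z]", "Hello World")

print(listed)     # ['a', 'b', 'c']
print(ranged)     # ['a', 'b', 'c']
print(lowercase)  # ['e', 'l', 'l', 'o', 'o', 'r', 'l', 'd']
```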
Metacharacters (except \) are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it's stripped of its special nature.
You can match the characters not listed within the class by complementing the set. This is indicated by including a '^' as the first character of the class. For example, [^5] will match any character except '5'. If the caret appears elsewhere in a character class, it does not have special meaning. For example, [5^] will match either a '5' or a '^'.
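The behaviour of '$' and '^' inside a class can be checked with a short sketch like the one below. The input strings are made up for illustration.

```python
import re

# Inside a class, '$' is an ordinary character.
in_class = re.findall(r"[akm$]", "make $5")

# A leading '^' complements the set: everything except '5'.
not_five = re.findall(r"[^5]", "1535")

# A caret that is not first in the class is a literal '^'.
caret_literal = re.findall(r"[5^]", "2^5")

print(in_class)       # ['m', 'a', 'k', '$']
print(not_five)       # ['1', '3']
print(caret_literal)  # ['^', '5']
```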
\: The backslash is used to escape metacharacters so you can still match them in patterns. For example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
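The sketch below shows the two escapes mentioned above. It also uses two things the notes do not mention: Python raw strings (r"..."), which keep backslashes intact for the regex engine, and re.escape(), a standard-library helper that escapes a literal string for you. The sample inputs are illustrative.

```python
import re

# \[ matches a literal '[' and \\ matches a literal backslash.
bracket = re.search(r"\[", "list[0]")
backslash = re.search(r"\\", "C:\\temp")  # the text is C:\temp

print(bracket.start())    # 4
print(backslash.start())  # 2

# re.escape() backslash-escapes every metacharacter in a literal string.
escaped = re.escape("a.b*c")
literal_ok = re.fullmatch(escaped, "a.b*c") is not None
wildcard_ok = re.fullmatch(escaped, "aXbbc") is not None

print(literal_ok)   # True
print(wildcard_ok)  # False
```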
\d: Matches any decimal digit; this is equivalent to the class [0-9].
\D: Matches any non-digit character; this is equivalent to the class [^0-9].
\s: Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S: Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w: Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W: Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
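To see how the six shorthand classes partition a string, here is a small sketch run against one ASCII-only sample. The sample text is an arbitrary choice that includes an underscore and a hyphen to show the difference between \w and \W.

```python
import re

text = "a_1 b-2"

digits = re.findall(r"\d", text)
non_digits = re.findall(r"\D", text)
spaces = re.findall(r"\s", text)
non_spaces = re.findall(r"\S", text)
word_chars = re.findall(r"\w", text)
non_word_chars = re.findall(r"\W", text)

print(digits)          # ['1', '2']
print(non_digits)      # ['a', '_', ' ', 'b', '-']
print(spaces)          # [' ']
print(non_spaces)      # ['a', '_', '1', 'b', '-', '2']
print(word_chars)      # ['a', '_', '1', 'b', '2']
print(non_word_chars)  # [' ', '-']
```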
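*: The * metacharacter doesn't match the literal character '*'; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once. For example, ca*t will match 'ct' (0 'a' characters), 'cat' (1 'a'), 'caaat' (3 'a' characters), and so forth.
Repetitions such as * are greedy; when repeating a RE, the matching engine will try to repeat it as many times as possible. If later portions of the pattern don't match, the matching engine will then back up and try again with fewer repetitions.
A step-by-step example makes this clearer. Consider the expression a[bcd]*b matched against the string 'abcbd':
Step | Matched | Explanation |
---|---|---|
1 | a | The a in the RE matches. |
2 | abcbd | The engine matches [bcd]*, going as far as it can, which is to the end of the string. |
3 | Failure | The engine tries to match b, but the current position is at the end of the string, so it fails. |
4 | abcb | Back up, so that [bcd]* matches one less character. |
5 | Failure | Try b again, but the current position is at the last character, which is a 'd'. |
6 | abc | Back up again, so that [bcd]* is only matching bc. |
7 | abcb | Try b again. This time the character at the current position is 'b', so it succeeds. |
The end of the RE has now been reached, and it has matched 'abcb'.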
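Both points above, zero-or-more repetition and greedy matching with backtracking, can be observed with a short sketch. The candidate list (including the non-matching 'cbt') is an illustrative addition.

```python
import re

# ca*t accepts zero or more 'a' characters between 'c' and 't'.
candidates = ["ct", "cat", "caaat", "cbt"]
star_matches = [s for s in candidates if re.fullmatch(r"ca*t", s)]
print(star_matches)  # ['ct', 'cat', 'caaat']

# [bcd]* first grabs 'bcbd', then backs off until the final 'b' can match.
greedy = re.match(r"a[bcd]*b", "abcbd")
print(greedy.group())  # abcb
```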
Another repeating metacharacter is +, which matches one or more times.
The question mark character, ?, matches either once or zero times; you can think of it as marking something as being optional. For example, home-?brew matches either 'homebrew' or 'home-brew'.
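A quick sketch of + and ? follows. The ca+t pattern and the extra test words are illustrative choices, picked to contrast with the ca*t example.

```python
import re

# '+' needs at least one repetition, so 'ct' no longer matches.
plus_matches = [s for s in ["ct", "cat", "caaat"] if re.fullmatch(r"ca+t", s)]
print(plus_matches)  # ['cat', 'caaat']

# '?' makes the hyphen optional, but allows it at most once.
words = ["homebrew", "home-brew", "home--brew"]
optional_matches = [w for w in words if re.fullmatch(r"home-?brew", w)]
print(optional_matches)  # ['homebrew', 'home-brew']
```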
The most complicated quantifier is {m,n}, where m and n are decimal integers. This quantifier means there must be at least m repetitions, and at most n. For example, a/{1,3}b will match 'a/b', 'a//b', and 'a///b'. It won't match 'ab', which has no slashes, or 'a////b', which has four.
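The bounds of {m,n} can be checked directly. This sketch runs the a/{1,3}b pattern from the text against the five strings it mentions.

```python
import re

candidates = ["ab", "a/b", "a//b", "a///b", "a////b"]

# Only strings with one to three slashes between 'a' and 'b' match.
bounded_matches = [s for s in candidates if re.fullmatch(r"a/{1,3}b", s)]
print(bounded_matches)  # ['a/b', 'a//b', 'a///b']
```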
{0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?. It's better to use *, +, or ? when you can, because they are shorter and easier to read.
Source: https://docs.python.org/3/howto/regex.html#simple-patterns
Find and replace:
find(): Returns the starting index of the first occurrence of a word or series of words in a string, or -1 if it is not found. An index range (start and end) to search within can optionally be provided.
replace(): Replaces occurrences of a substring in a string and returns the result. If the substring is not found, it returns the string unchanged.
index(): Works like find(), but raises an error (ValueError) instead of returning -1 when the substring is not found.
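These three string methods need no regular expressions at all. The sketch below exercises each one on a made-up sample string; the variable names are illustrative.

```python
text = "spam and eggs and spam"

# find() returns the first index, or -1 when nothing is found.
first = text.find("spam")           # 0
later = text.find("spam", 1)        # 18, searching from index 1 onward
in_range = text.find("and", 6, 17)  # 14, searching only text[6:17]
missing = text.find("tofu")         # -1

# replace() returns a new string; the original is left untouched.
replaced = text.replace("spam", "ham")   # 'ham and eggs and ham'
unchanged = text.replace("tofu", "ham")  # same content as text

# index() behaves like find() but raises ValueError on failure.
eggs_at = text.index("eggs")  # 9
try:
    text.index("tofu")
    index_failed = False
except ValueError:
    index_failed = True

print(first, later, in_range, missing)
print(replaced)
print(eggs_at, index_failed)
```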
Performing Matches
match(): Determine if the RE matches at the beginning of the string.
search(): Scan through a string, looking for any location where this RE matches.
findall(): Find all substrings where the RE matches, and return them as a list.
finditer(): Find all substrings where the RE matches, and return them as an iterator.
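As a rough illustration of the difference between findall() and finditer(), the sketch below extracts numbers from a sample sentence. The pattern \d+ and the text are illustrative choices.

```python
import re

pattern = re.compile(r"\d+")
text = "12 drummers, 11 pipers, 10 lords"

# findall() returns the matched substrings as a list.
numbers = pattern.findall(text)
print(numbers)  # ['12', '11', '10']

# finditer() yields match objects, which also carry positions.
located = [(m.group(), m.span()) for m in pattern.finditer(text)]
print(located)  # [('12', (0, 2)), ('11', (13, 15)), ('10', (24, 26))]
```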
match() versus search()
The match() function only checks if the RE matches at the beginning of the string, while search() will scan forward through the string for a match. It's important to keep this distinction in mind. Remember, match() will only report a successful match that starts at index 0; if the match wouldn't start at zero, match() will not report it.
On the other hand, search() will scan forward through the string, reporting the first match it finds.
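The distinction shows up as soon as the text you want is not at the very start of the string. A minimal sketch, using an illustrative sample string:

```python
import re

text = "::: message"

# match() fails: the string starts with ':' rather than a letter.
at_start = re.match(r"[a-z]+", text)
print(at_start)  # None

# search() scans forward and reports the first match it finds.
anywhere = re.search(r"[a-z]+", text)
print(anywhere.group(), anywhere.span())  # message (4, 11)
```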
Modifying Strings
Up to this point, we’ve simply performed searches against a static string. Regular expressions are also commonly used to modify strings in various ways, using the following pattern methods:
Method/Attribute | Purpose |
---|---|
split() | Split the string into a list, splitting it wherever the RE matches |
sub() | Find all substrings where the RE matches, and replace them with a different string |
subn() | Does the same thing as sub(), but returns the new string and the number of replacements |
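The three modifying methods from the table above can be tried out with a sketch like this one. The patterns and sample strings are illustrative.

```python
import re

# split(): break the string wherever one or more non-word characters occur.
separators = re.compile(r"\W+")
parts = separators.split("one, two;three")
print(parts)  # ['one', 'two', 'three']

colours = re.compile(r"blue|white|red")

# sub(): replace every match and return the new string.
substituted = colours.sub("colour", "blue socks and red shoes")
print(substituted)  # colour socks and colour shoes

# subn(): same replacement, but also report how many were made.
result, count = colours.subn("colour", "blue socks and red shoes")
print(result, count)  # colour socks and colour shoes 2

# With no match, sub() returns the string unchanged.
untouched = colours.sub("colour", "no colours at all")
print(untouched)  # no colours at all
```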
Compiling Regular Expressions
Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.
re.compile() also accepts an optional flags argument, used to enable various special features and syntax variations. We'll go over the available settings later, but for now a single example will do: re.compile('ab*', re.IGNORECASE).
Match objects returned by a successful match provide the following methods:
Method/Attribute | Purpose |
---|---|
group() | Return the string matched by the RE |
start() | Return the starting position of the match |
end() | Return the ending position of the match |
span() | Return a tuple containing the (start, end) positions of the match |
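Putting compilation and match objects together, the sketch below compiles a pattern with a flag and then inspects the resulting match. The input string and variable names are illustrative.

```python
import re

# Compile once; the pattern object can then be reused for many searches.
p = re.compile(r"ab*", re.IGNORECASE)

m = p.search("xxABBBz")

print(m.group())  # ABBB
print(m.start())  # 2
print(m.end())    # 6
print(m.span())   # (2, 6)

# A failed search returns None, so test the result before calling methods on it.
no_match = p.search("xyz")
print(no_match)  # None
```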
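Flag | Meaning |
---|---|
ASCII, A | Makes several escapes like \w, \b, \s and \d match only on ASCII characters with the respective property. |
DOTALL, S | Make . match any character, including newlines. |
IGNORECASE, I | Do case-insensitive matches. |
LOCALE, L | Do a locale-aware match. |
MULTILINE, M | Multi-line matching, affecting ^ and $. |
VERBOSE, X | Enable verbose REs, which can be organized more cleanly and understandably. |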
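The effect of most of these flags can be seen by running the same pattern with and without the flag. This is only a sketch with made-up inputs; LOCALE is left out because its behaviour depends on the system locale.

```python
import re

# IGNORECASE: letter case no longer matters.
any_case = re.findall(r"python", "Python PYTHON python", re.IGNORECASE)
print(any_case)  # ['Python', 'PYTHON', 'python']

# DOTALL: '.' also matches a newline.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)
print(dot_default is None, dot_all is not None)  # True True

# MULTILINE: '^' and '$' also match at the start and end of each line.
first_only = re.findall(r"^\w+", "one\ntwo\nthree")
every_line = re.findall(r"^\w+", "one\ntwo\nthree", re.MULTILINE)
print(first_only)  # ['one']
print(every_line)  # ['one', 'two', 'three']

# VERBOSE: whitespace and '#' comments inside the pattern are ignored.
decimal = re.compile(r"""
    \d+   # integer part
    \.    # literal decimal point
    \d+   # fractional part
""", re.VERBOSE)
found = decimal.search("pi is 3.14").group()
print(found)  # 3.14

# ASCII: \d stops matching non-ASCII digits such as ARABIC-INDIC DIGIT THREE.
mixed = "\u06633"
unicode_digits = re.findall(r"\d", mixed)
ascii_digits = re.findall(r"\d", mixed, re.ASCII)
print(len(unicode_digits), ascii_digits)  # 2 ['3']
```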
More Symbols
$ Matches a pattern at the end of the string
^ Matches a pattern at the start of the string
. Matches any character (except a newline)
[] Matches any single character out of the set listed inside the brackets
| Works like OR: matches the pattern on either side
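The anchors, the dot, and alternation can be tried together in one short sketch. The sample strings are illustrative.

```python
import re

greeting = "Hello world"

# '^' anchors at the start, '$' at the end of the string.
starts_with_hello = re.search(r"^Hello", greeting) is not None
ends_with_world = re.search(r"world$", greeting) is not None
starts_with_world = re.search(r"^world", greeting) is not None
print(starts_with_hello, ends_with_world, starts_with_world)  # True True False

# '.' matches any one character except a newline, so 'c\nt' is skipped.
dotted = re.findall(r"c.t", "cat cot c\nt cut")
print(dotted)  # ['cat', 'cot', 'cut']

# '|' matches either alternative.
pets = re.findall(r"cat|dog", "cat, dog, bird")
print(pets)  # ['cat', 'dog']
```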