https://www.sitepoint.com/demystifying-regex-with-practical-examples/
Scenario:
- 6 to 12 characters in length
- Must have at least one uppercase letter
- Must have at least one lower case letter
- Must have at least one digit
- May contain other characters
Pattern:
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{6,12}$
This expression is based on multiple positive lookaheads of the form (?=(regex)). A lookahead asserts that the declared (regex) can be matched starting at the current position, without consuming any characters. Because every lookahead is checked from the same position, the order of the conditions doesn’t affect the result. Lookaround expressions are very useful when several conditions must hold at once.
We could also use the negative lookahead (?!(regex)) to exclude some characters. For example, I could exclude the # character with (?!.*#).
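As a rough illustration of the negative lookahead, the Python sketch below puts (?!.*#) in front of the password pattern, so any string containing # is rejected. The variable names and sample strings are made up for this example; only the pattern pieces come from the text.

```python
import re

# Password pattern from the text, prefixed with a negative lookahead that rejects '#'.
NO_HASH = re.compile(r"^(?!.*#)(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{6,12}$")

# Hypothetical sample inputs, chosen only for illustration.
allowed = NO_HASH.match("Passw0rd") is not None   # no '#', all conditions met
rejected = NO_HASH.match("Pass#0rd") is None      # contains '#', so the lookahead fails

print(allowed, rejected)
```

The negative lookahead sits next to the positive ones and, like them, can be placed in any order after ^.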
- ^ asserts position at the start of the string
- (?=.*[a-z]) positive lookahead, asserts that the regex .*[a-z] can be matched:
  - .* matches any character (except newline) between zero and unlimited times
  - [a-z] matches a single character in the range between a and z (case sensitive)
- (?=.*[A-Z]) positive lookahead, asserts that the regex .*[A-Z] can be matched:
  - .* matches any character (except newline) between zero and unlimited times
  - [A-Z] matches a single character in the range between A and Z (case sensitive)
- (?=.*\d) positive lookahead, asserts that the regex .*\d can be matched:
  - .* matches any character (except newline) between zero and unlimited times
  - \d matches a digit [0-9]
- .{6,12} matches any character (except newline) between 6 and 12 times
- $ asserts position at the end of the string
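To see the breakdown above in action, here is a minimal Python sketch that applies the password pattern to a few inputs. The function name is_valid_password and the sample strings are assumptions made for this example, not part of the original article.

```python
import re

# Pattern copied from the text: three lookaheads plus a length check.
PASSWORD = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{6,12}$")

def is_valid_password(candidate: str) -> bool:
    """Return True when the candidate satisfies every condition in the pattern."""
    return PASSWORD.match(candidate) is not None

# Hypothetical inputs: one valid password, then one per failed condition.
samples = ["Abc123", "abc123", "ABC123", "Abcdef", "Ab1", "Abcdefgh12345"]
for sample in samples:
    print(sample, is_valid_password(sample))
```

Only the first sample passes; the others miss an uppercase letter, a lowercase letter, a digit, the minimum length and the maximum length, in that order.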
Matching URL
Scenario:
- Must start with http or https or ftp followed by ://
- Must match a valid domain name
- Could contain a port specification (http://www.sitepoint.com:80)
- Could contain digits, letters, dots, hyphens, forward slashes, multiple times
Pattern:
^(http|https|ftp):[\/]{2}([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4})(:[0-9]+)?\/?([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*)
The first condition is pretty easy to satisfy with ^(http|https|ftp):[\/]{2}.
To match the domain name we need to bear in mind that, to be valid, it can only contain letters, digits, hyphens and dots. In my example, I limited the number of characters after the last dot to between 2 and 4, but this could be extended for newer domains like .rocks or .codes. The domain name is matched by ([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4}).
The optional port specification is matched by the simple (:[0-9]+)?.
A URL can contain multiple slashes and many other characters repeated many times (see RFC3986). This is matched by using a range of characters in a group: ([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*).
It’s really useful to wrap every important element in a capturing group (), because the result will then contain just the parts we need. Remember that certain characters need to be escaped with \.
Below, every single subpattern is explained:
- ^ asserts position at the start of the string
- capturing group (http|https|ftp), captures http or https or ftp
- : matches the character : literally
- [\/]{2} matches the escaped character / exactly 2 times
- capturing group ([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4}):
  - [a-zA-Z0-9\-\.]+ matches, one or more times, a character in the range a to z, A to Z or 0 to 9, the character - literally or the character . literally
  - \. matches the character . literally
  - [a-zA-Z]{2,4} matches a character between a and z or A and Z (case sensitive), between 2 and 4 times
- capturing group (:[0-9]+)?:
  - quantifier ? matches the group zero or one time
  - : matches the character : literally
  - [0-9]+ matches a single character between 0 and 9 one or more times
- \/? matches the character / literally zero or one time
- capturing group ([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*):
  - [a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]* matches between zero and unlimited times a single character in the range a-z, A-Z, 0-9, or one of the characters -._?,'/\+&%$#=~
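The capturing groups can be read back individually once the pattern matches. The following Python sketch shows one way to do that; the helper name split_url and the test URLs other than the sitepoint.com one are assumptions for illustration.

```python
import re

# Pattern copied verbatim from the text, split over two lines for readability.
URL = re.compile(
    r"^(http|https|ftp):[\/]{2}([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4})"
    r"(:[0-9]+)?\/?([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*)"
)

def split_url(url: str):
    """Return (scheme, domain, port, path) or None if the pattern does not match."""
    match = URL.match(url)
    return match.groups() if match else None

print(split_url("http://www.sitepoint.com:80"))
print(split_url("https://www.sitepoint.com/demystifying-regex-with-practical-examples/"))
print(split_url("mailto:someone@example.com"))
```

In Python, a group that did not take part in the match (here the optional port) comes back as None, while a group that matched zero characters comes back as an empty string.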
Matching HTML TAG
Scenario:
- The start tag must begin with < followed by one or more characters and end with >
- The end tag must start with </ followed by one or more characters and end with >
- We must match the content inside a TAG element
Pattern:
<([\w]+).*>(.*?)<\/\1>
Matching the start tag and the content inside it is pretty easy with <([\w]+).*> and (.*?), but in the pattern above I have added a useful thing: a reference to a capturing group.
Every capturing group defined by parentheses () can be referred to by its position number, as in (first)(second)(third), where \1 refers to the first group, \2 to the second, and so on. This allows for further operations.
The expression above could be explained as:
- Start with <
- Capture the tag name
- Followed by zero or more characters and a closing >
- Capture the content inside the tag
- The closing tag must be </tag name captured before>
Including only two capture groups in the expression, the tag name and the content, will return a very clear match, a list of tag names with related content.
Let’s dig a little deeper and explain the subpatterns:
- < matches the character < literally
- capturing group ([\w]+) matches any word character a-zA-Z0-9_ one or more times
- .* matches any character (except newline) zero or more times
- > matches the character > literally
- capturing group (.*?) matches any character (except newline) zero or more times, as few times as possible
- < matches the character < literally
- \/ matches the character / literally
- \1 matches the same text matched by the first capturing group: ([\w]+)
- > matches the character > literally
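To make the back-reference concrete, this small Python sketch runs the tag pattern over a few lines of markup and collects the (tag name, content) pairs. The sample markup is invented for the example, and re.findall is simply one convenient way to list all matches.

```python
import re

# Pattern from the text: \1 must repeat whatever tag name group 1 captured.
TAG = re.compile(r"<([\w]+).*>(.*?)<\/\1>")

# Hypothetical markup, one element per line; the last line has a mismatched closing tag.
html = "<h1>Title</h1>\n<p>Some text</p>\n<b>mismatched</i>"

pairs = TAG.findall(html)  # list of (tag name, content) tuples
print(pairs)
```

The last line is skipped because </i> does not repeat the captured name b. This only works for simple, single-line elements; a regular expression like this is not a general HTML parser.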
Matching duplicated words
Scenario:
- The words are space separated
- We must match every duplication – non-consecutive ones as well
Pattern:
\b(\w+)\b(?=.*\1)
This regular expression seems challenging but uses some of the concepts previously shown.
The pattern introduces the concept of word boundaries.
A word boundary \b matches a position rather than a character. It matches between a word character (e.g. abcDE) and a non-word character (e.g. -~,!), in either order, and also at the start or end of the string when the adjacent character is a word character.
Below you can find some example uses of word boundary to make it clearer:
- Given the phrase Regular expressions are awesome
- The pattern \bare\b matches are
- The pattern \w{3}\b could match the last three letters of the words: lar, ons, are, ome
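The two word-boundary examples can be checked with a short Python sketch. The variable names are chosen for this illustration; the phrase and both patterns come from the text.

```python
import re

phrase = "Regular expressions are awesome"

# \bare\b only matches "are" as a whole word.
whole_word = re.findall(r"\bare\b", phrase)

# \w{3}\b matches three word characters that end at a word boundary,
# that is, the last three letters of each word.
last_three = re.findall(r"\w{3}\b", phrase)

print(whole_word)
print(last_three)
```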
The expression above could be explained as:
- Match every whole word, that is, a run of word characters delimited by word boundaries (in our case spaces)
- Check whether the matched word appears again later in the text
Below you will find the explanation for each subpattern:
- \b word boundary
- capturing group (\w+) matches any word character a-zA-Z0-9_ one or more times
- \b word boundary
- (?=.*\1) positive lookahead, asserts that the following can be matched:
  - .* matches any character (except newline) zero or more times
  - \1 matches the same text as the first capturing group
The expression will make more sense if we return all the matches instead of returning only the first one. See the PHP function preg_match_all for more information.
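The text points to PHP for collecting all matches; as a sketch of the same idea in Python, re.findall returns every match of the pattern. The sample sentences and variable names are assumptions made for this example.

```python
import re

# Pattern from the text: a whole word that appears again later on the same line.
DUPLICATE = re.compile(r"\b(\w+)\b(?=.*\1)")

text = "the cat saw the dog and the cat ran"

# Each occurrence that is followed by another occurrence is reported,
# so the final "the" and the final "cat" are not in the list.
repeated = DUPLICATE.findall(text)
print(repeated)

# Caveat: \1 in the lookahead is not wrapped in \b, so a word also counts
# as duplicated when it reappears inside a longer word ("is" inside "this").
inside_longer_word = DUPLICATE.findall("is this")
print(inside_longer_word)
```

If only whole-word repeats should count, one possible variation is to write the lookahead as (?=.*\b\1\b).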