None of the online regex tutorials I have visited follow the progression in understanding that, I personally think, is the most appropriate. So I wrote my own tutorial with the definite objective that, in roughly one hour, you'll become quite good at regular expressions. I even hope that most of you will start to love them for their absolutely magical power, and the enjoyable mind-teasing opportunities that they offer.
One last piece of advice: this is a tutorial, not reference documentation. As suggested above, there is a progression where each lesson builds on the previous one. Read through at your own pace and take the time to understand, but it is wise not to skip a section.
Table of Contents
- What are regex'es used for?
- Special characters
- Regular expressions without special characters!
- Matching the start and end of the input
- Matching characters in lists
- Optional and repeating characters
- Consuming input characters in greedy or reluctant mode
PART II:
- Backslash escapes
- OR-expressions and groupings
- Capturing and non-capturing groups
- Negative match, lookahead and lookbehind...
- Conclusion
Lesson 1: What are regex'es used for?
A regular expression (regex) is basically a pattern that you will match over an input string.
The very first error is to consider matching as the purpose of using a regex. It is not! Matching is a technique and the purpose for which the technique will be used is—to my knowledge—one of the following four:
Identify
The existence of a specific pattern of characters inside the input string (or a match with the entire string) is tested; the TRUE or FALSE outcome is then used for branching to different processing activities.
Extract
A sub-string is drawn out of the input string, getting rid of filler and syntax characters so that the extracted value can be used in onward processing activities. This objective can take advantage of a regular expression feature named capturing groups.
Validate
A data value (rendered as the input string) is verified to comply with a pre- or post-condition expressed as a pattern (for instance an Internet mail address, a bank account number, a date or number format, an expected length and character subset). Non-compliance usually triggers error handling.
Cut-out
The input string results from the composition and syntactic encoding of several data pieces, typically for the sake of data transmission between computer applications. It is now time to reverse the procedure, decoding and decomposing the input string back into its constituents. This is a variant of the extraction case where repeating patterns or capturing groups are now used for cutting out of the input string the parts that require individual processing.
The next time you read or develop a regex, know the goal: identify, extract, validate, or cut out? The goal may strongly influence the design of the right regular expression pattern, and knowing it speeds up reading.
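As a rough illustration of the four goals, here is a minimal sketch in Python's re module. The choice of Python, the sample strings and the patterns are assumptions made for illustration only; the parentheses in the second pattern form a capturing group, a feature covered in Part II, and the other special characters are explained in the lessons that follow.

```python
import re

# Identify: does the pattern occur somewhere in the input?
is_error = re.search("ERROR", "2024-01-01 ERROR disk full") is not None

# Extract: pull a value out of its surrounding syntax (capturing group).
found = re.search("id=([0-9]+);", "user id=4711; name=Bob")
user_id = found.group(1) if found else None

# Validate: the whole value must comply with the pattern.
is_valid = re.fullmatch("[0-9]{4}", "2024") is not None

# Cut-out: split a composite string back into its constituents.
parts = re.split(";", "AB;CD;EF")

print(is_error, user_id, is_valid, parts)
```

Each goal maps to a different library call, which is one reason why knowing the goal first helps in reading a regex.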
Lesson 2: Special characters
The screwiness of a regular expression pattern will quickly fade away with your capability to spot the special characters in it in order to discover the boundaries of sub-expressions, and how these sub-expressions relate to each other (in sequence? like A-or-B? as optional bit? as repeated stuff?) in the making of the final pattern.
Pattern matching mechanisms are not complex: it's all about describing a sequence of string pieces that are mandatory, optional, or repeating, or exclude each other, and defining which precise character—or set of characters—is expected at every position. The inventors of regular expressions have opted for a very compact notation which is both magical and a challenge for the eye. Such compactness and the use of commonplace characters are—to my understanding—behind most of the apparent complexity.
So you shall train your eye to recognize . * + ? | ( [ { \ ^ $ as characters with a special meaning. And every time you see one of ( [ { look for the corresponding ) ] } as they always go in pairs, possibly nested as in (A[B]C) but never overlapping as in {A[B}C].
In a moment, we will calmly learn the effects of each of them, in turn. There is not much more to learn!
Take the time to consider the list again: . dot * star + plus ? question mark | pipe ( brackets [ square brackets { curly brackets \ backslash ^ caret (formally a circumflex accent) and the $ dollar sign. And do compare them to their peers ! / : ; , - ' # " ~ & @ = % _ and space which are not special characters.
Lesson 3: Regular expressions without special characters!
Before getting saturated with special characters, what happens if we don't use any of them? Well, we get the basic and essential pattern matching behavior: simply matching character against character, case sensitive by default.
Let's consider the input string:
ABCaBCABBACCCAAABCCC and the regular expression ABC
Side question: is there a standard way of quoting a regular expression? For instance, shall we write the above as "ABC"? Reply: No! It depends entirely on the language or tool dealing with regular expressions. For instance, in Java and C# you will write it "ABC", in PHP it becomes 'ABC', but when used in an awk or sed command in UNIX it will be written by default as /ABC/, whereas when passed as an argument to the grep command you'll specify "ABC" or 'ABC' or even ABC (without any quotes because there is no space character inside this regex, spaces being interpreted as delimiters of command line arguments unless included within single or double quotes).
In the reverseXSL software, you can use any character as delimiter; indeed, the first non-space character of a regex specification automatically becomes the delimiter. But then you shall not use that character elsewhere in the regex—there is no way to escape. For instance, a regex in DEF files can appear as "ABC", or 'ABC', or #ABC#, or xABCx (this last one is not very clever!). You may also use |ABC|, or (ABC(, or )ABC) and so forth, but it is not recommended because | and () and the others are special characters often needed inside a regex.
Within this tutorial, strings and regular expressions are quoted with background colors
So, if we match the regex ABC onto ABCaBCABBACCCAAABCCC we get two matches: the ABC at the very start, and the ABC inside AAABCCC near the end. There is no match over aBC because matching is case sensitive by default.
Remembering that ! / : ; , - ' # " ~ & @ = % _ and space are not special characters, we can solve the following examples:
Regex | Input String | Solution |
-!&/ | #_!&/'-/-!&/=:@;% | one match: the -!&/ just before =:@;% |
Hello | Hello | one match: the whole input Hello |
"a" | "aaaaaaaaaaa" | no match: no part of the input is the 3 char sequence "a" |
1,50 | 1,5= 1,5 =1,50=01,500=1.50=3/2 | two matches: the 1,50 in =1,50= and the 1,50 in 01,500 |
" ~ | [" ~].*?|\^${+(' # " ~)&%_ | two matches: the " ~ right after [ and the " ~ right before ) |
In the third case, remember that " and ' are characters like any others. Forget about string delimiters in programming languages: we deal here with regular expressions.
In the last case, remember that the special characters . * + ? | ( [ { \ ^ $ seen in the input string are only special for the regex side. Characters within the input string shall always be considered as plain, simple, characters that we may like to match.
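To check such literal matches by machine, a small sketch in Python's re module can list every match found from left to right. Python is an assumption here (the tutorial itself is tool-neutral), and re.findall is simply used as a convenient way to enumerate matches.

```python
import re

# Patterns without special characters match character against character.
abc = re.findall("ABC", "ABCaBCABBACCCAAABCCC")
price = re.findall("1,50", "1,5= 1,5 =1,50=01,500=1.50=3/2")
quoted = re.search('"a"', '"aaaaaaaaaaa"')  # looks for quote, a, quote

print(abc)     # aBC is skipped: matching is case sensitive
print(price)   # 1.50 is skipped: the comma only matches a comma
print(quoted)  # None, the three-character sequence never occurs
```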
In practice
Are regular expressions without any special characters of any use? Yes, for identification: such regex'es indicate that a given tag or value exists in your input, and that may just be enough to drive a processing decision.
In the reverseXSL parser there is also a segment cut mode CUT-ON-"regex". For instance CUT-ON-"--" applied to ABC--DEF--G-H--IJK yields segmented pieces ABC , DEF , G-H , and IJK
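The effect of cutting on a literal delimiter can be imitated with Python's re.split. This is only an analogy chosen for illustration, not the reverseXSL CUT-ON implementation.

```python
import re

# Cut the input on every occurrence of the two-character delimiter "--".
pieces = re.split("--", "ABC--DEF--G-H--IJK")
print(pieces)  # the single dash inside G-H is not a delimiter
```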
Lesson 4: Matching the start and end of the input
In most of the examples above, we matched string portions anywhere in the middle of the input. We can force a regular expression to match only at the very beginning or the very end of the input string.
Special character | Effect in pattern matching |
^ | matches the start of input |
$ | matches the end of input |
Expert note: in MULTILINE mode (cfr advanced tutorial) we can use ^ and $ to match intermediate line boundaries
So you shall understand the effect of the following regular expressions:
Regex | Input String | Solution |
ABC | ABCaBCABBACAAABCCCABC | three matches: the ABC at the start, the ABC inside AAABCCC, and the ABC at the end |
^ABC | ABCaBCABBACAAABCCCABC | one match: the ABC at the start |
ABC$ | ABCaBCABBACAAABCCCABC | one match: the ABC at the end |
^ABC$ | ABCaBCABBACAAABCCCABC | no match |
^ABC$ | ABC | one match: the whole input ABC |
Regular expressions like ABC^ , $ABC or A^B$C would not cause any error; they are simply stupid, and never able to match anything.
The last case (^MyRegex$) is quite interesting: you force the pattern to match the entire input string or nothing.
Regular expression software libraries actually provide means of testing whether a pattern matches the entire input string or only a part of it, without having to explicitly supply the ^ and $ in the former case. For instance, in Java, we have a matcher.matches() method that returns TRUE when the pattern "ABC" is applied to the input "ABC" and FALSE when the same pattern "ABC" is applied to "XABCZ". However, the alternative matcher.find() method returns TRUE in the latter case (finding "ABC" in "XABCZ"). In other words, a matcher.find() using "^myRegex$" is equivalent to matcher.matches() with "myRegex" alone.
The conclusion is that find-operations are more appropriate for identification, whereas full-match-operations are well suited to validation. So, if you are using software that drives some processing based on the regex library find() method, you are well advised to frame the regular expressions used for validation purposes with explicit ^ and $, as in ^validValueRegex$.
Conversely, software built over the regex matches() method may require the regular expressions used for identification purposes to be extended with .* (which matches any string, see further).
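The difference between find-operations and full-match-operations can be sketched in Python, where re.search plays the role of a find and re.fullmatch the role of an entire match. These Python calls are stand-ins chosen for illustration; they are not the Java methods named above.

```python
import re

# "find" style: the pattern may match anywhere in the input.
found_inside = re.search("ABC", "XABCZ") is not None

# "matches" style: the pattern must cover the entire input.
entire_match = re.fullmatch("ABC", "XABCZ") is not None

# Explicit anchors make a find behave like a validation.
anchored_find = re.search("^ABC$", "XABCZ") is not None

# Widening with .* makes a full match behave like an identification.
widened_full = re.fullmatch(".*ABC.*", "XABCZ") is not None

print(found_inside, entire_match, anchored_find, widened_full)
```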
In practice
For identification purposes, in addition to the default matching anywhere in the input, we now have the capability to match something that begins-with, ends-with, or matches-exactly.
EDIFACT, X12, IATA, TRADACOMS and numerous other EDI standards are organized into collections of records/segments with leading tags. It is quite easy to identify them with regular expressions like:
^NAD for an EDIFACT Name and Address segment (e.g. NAD+OY++HYDROGAS)
^N1 for the X12 Name and Address segment (e.g. N1*PR*HYDROGAS)
^SHP for Shipper details in CARGO-IMP (e.g. SHP/456)
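A minimal Python sketch of this kind of tag identification follows. The three sample segments come from the examples above; XNAD+1 is a made-up line added only to show that a tag in the wrong position is not identified.

```python
import re

segments = ["NAD+OY++HYDROGAS", "N1*PR*HYDROGAS", "SHP/456", "XNAD+1"]

# Keep only the segments whose leading tag matches the anchored pattern.
nad = [s for s in segments if re.search("^NAD", s)]
n1 = [s for s in segments if re.search("^N1", s)]
shp = [s for s in segments if re.search("^SHP", s)]

print(nad, n1, shp)
```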
Within reverseXSL software, regular expressions immediately following a SEG (segment) or GRP (group) definition keyword are only used for identification purposes and built over find-operations.
For validation purposes, leaving portions of the input string outside of the scope of the validation pattern is obviously not a good strategy. Therefore, only regular expressions like ^MyValidationPattern$ make sense, unless you use API functions that enforce an entire match (making ^ and $ implicit).
Within reverseXSL software, regular expressions immediately following a D (data element) definition keyword are used for validation and extraction purposes (see further), and built over full-match-operations. The use of ^ and $ as in ^myExp$ is implicit and may be omitted.
Lesson 5: Matching characters in lists
By default, a regular expression matches characters in the pattern against the same exact character in the input string, case sensitive. If we want to open up the possibility of accepting several characters at any position, we can use the following notations:
Special characters | Effect in pattern matching |
[character_list] | matches any single character in the list, e.g. [ABC] accepts A or B or C. Character lists may contain ranges noted A-Z , 0-9 , or d-h (for lowercase d to h inclusive). You may combine ranges as in [X-Z0-3] which is equivalent to [XYZ0123] |
[^character_list] | matches any single character not in the list, e.g. [^a-z] accepts all characters, including spaces and punctuation, but not lowercase letters a to z |
. | matches any single character but new line characters (LF, CRLF, CR, and Unicode line and paragraph separators) |
Expert note: the DOTALL mode entitles . to match also new line characters (cfr advanced tutorial).
Regular expressions also feature backslash-notations for numerous built-in character ranges like \d for the set of digits (0 to 9), \w for any word-character (A to Z, a to z, 0 to 9 and _), \s for whitespace (tab, space, CR, LF, FF), \p{ASCII} for all 7bit ASCII (with binary representations 0 to 127), \p{Punct} for any punctuation (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~). We shall not bother with them now. The exhaustive inventory is copied in the documentation.
Have you noted that the ^ which previously meant start-of-input is now overloaded with the meaning of negating the following character list, but only when used in [^ ]? Similarly, - which is a regular character gets a role in range notations, but only within the scope of [ ] and [^ ] .
Let us illustrate the use of character lists with a few examples:
Regex | Input String | Solution |
[A-Z]C[AC] | BCABCxaBCA-BCAA.CCC.BC | four matches: BCA (at the start), BCA (after xa), BCA (after the dash), and CCC |
.BC | BCABCxaBCA-BCAA.CCC.BC | four matches: ABC, aBC, -BC, and .BC |
[^A-C]BC | BCABCxaBCA-BCAA.CCC.BC | three matches: aBC, -BC, and .BC |
[.]BC | BCABCxaBCABBCAA.CCC.BC | one match: the .BC at the end |
[a-]BC | BCABCxaBCA-BCAA.CCC.BC | two matches: aBC and -BC |
The last two examples are worth some comments:
- In [.] the dot special character loses its special any-character status and falls back to just mean itself, i.e. a dot punctuation character. A similar change of semantics applies to the start-of-input caret ^ which means itself within [ ] except when used at the very first position, in which case it negates the character list. Thus, [^abc] means neither a nor b nor c, whereas [abc^] means a or b or c or ^ caret, [^.] means not_a_dot, [.^] stands for dot_or_caret, and [^^] means not_a_caret! Got the fun?
- In [a-] the dash just means itself because it is not part of a proper range notation. The same consideration applies to [-ABC] for instance, or to [^-a] (i.e. neither - nor a, so that [^-a]BC will match ABC and .BC in the example input string). However, [z-a] or [A-9] are interpreted as ranges but are invalid (z comes after a, and 9 comes before A in character order); they will cause an error. Note that a range like [ -z] (from space up to lowercase z) is actually valid, but is really poor practice because it challenges our memory of the order of punctuation, letters, and numbers in the Unicode character table.
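The character-list examples can be replayed with a short Python sketch. Python's re module is an assumption of this illustration; the input string is the one used in most rows of the table, and the last lines show that a reversed range is rejected when the pattern is compiled.

```python
import re

text = "BCABCxaBCA-BCAA.CCC.BC"

upper_c = re.findall("[A-Z]C[AC]", text)   # letter, then C, then A or C
any_bc = re.findall(".BC", text)           # any character before BC
not_abc = re.findall("[^A-C]BC", text)     # anything but A, B, C before BC
dot_bc = re.findall("[.]BC", text)         # a literal dot before BC
a_or_dash = re.findall("[a-]BC", text)     # a or a literal dash before BC

# A reversed range such as z-a is refused at compile time.
try:
    re.compile("[z-a]")
    bad_range_rejected = False
except re.error:
    bad_range_rejected = True

print(upper_c, any_bc, not_abc, dot_bc, a_or_dash, bad_range_rejected)
```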
In practice
With cut-out purposes in mind, a character list may specify all possible delimiters; for instance when
[: /-] is applied to
AF245-Z/63:10:12//15 KG
in order to separate all fields as in:
AF245-Z/63:10:12/<empty_field>/15 KG
The expression [: /-] matches single characters, hence delimits an empty field in between the two // .
Within reverseXSL, the segment cut mode CUT-ON-"regex" is just about using a regex to specify field delimiters.
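The cut on the delimiter list [: /-] can be reproduced with Python's re.split. This is a sketch in a language picked for illustration, not the reverseXSL mechanism itself; the empty string in the result is the empty field between the two slashes.

```python
import re

# Every single colon, space, slash or dash acts as a field delimiter.
fields = re.split("[: /-]", "AF245-Z/63:10:12//15 KG")
print(fields)
```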
On the other hand, the use of character lists for validation purposes is obvious.
More subtle is, for instance, the validation of a fixed two-character field in the range 0 to 29, for which the regular expression is
[ 12][0-9]
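This two-character check can be tried out with a short Python sketch, where re.fullmatch stands in for a full-match validation call and the sample values are made up. Note that, under this pattern, a single-digit value must be padded with a leading space, so 07 is rejected.

```python
import re

pattern = "[ 12][0-9]"  # space, 1 or 2, followed by one digit
samples = [" 7", "07", "15", "29", "30", "7 "]

accepted = [v for v in samples if re.fullmatch(pattern, v)]
print(accepted)
```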
Lesson 6: Optional and repeating characters
So far, we are only capable of building fixed-length regular expression patterns where any position may match a fixed character, or a [ ] list of characters, else . any character. We can also hook the pattern to the ^ start of the input string, or the $ end, or both ^ $ (forcing an entire match).
Obviously, we need means to deal with the length of the pattern by letting characters repeat a variable number of times.
Special characters | Effect in pattern matching |
C{min,max} | means that the character C (or range specification) can occur from min to max times. For instance A{2,5} matches AA and AAA and AAAA and AAAAA, [0-9]{1,3} stands for any numeric value of 1 to 3 digits, and .{5,15} matches any string from 5 to 15 characters long. |
C{fixed-n} | means that the character C (or range specification) is expected exactly fixed-n times. Obviously notations like A{3} , A{3,3} and AAA are equivalent. |
C{min,} | means that the character C (or range specification) must occur at least min times, up to unlimited size. For instance, [A-Z][A-Za-z0-9_]{4,} specifies a string starting with an uppercase letter, and followed by letters, digits and underscore characters whose total length is at least 5 characters (1+4) but without a maximum length (in practice up to the input string length!). |
C? | means that C is optional; formally the character C (or range specification) must occur once or not at all. X? is clearly a short hand notation for X{0,1}. |
C+ | means that C can repeat; formally the character C (or range specification) must occur once or more times. X+ is clearly a short hand notation for X{1,}. |
C* | means that C is optional or repeating; formally the character C (or range specification) must occur zero or more times. X* is clearly a short hand notation for X{0,}. |
Let us consider a few samples:
Regex | Input String | Solution |
AB*C+ | ABCCCBCACCCAABBCAB | three matches: ABCCC, ACCC, and ABBC |
[A-Z0-9]+ | AF245-Z/63:10:12//15 KG | seven matches: AF245, Z, 63, 10, 12, 15, and KG (matching 1 to indefinite-n alphanumericals) |
[^/-]{3,} | AF245-Z/63:10:12//15 KG | three matches: AF245, 63:10:12, and 15 KG (sequences of at least 3 characters that are neither slash nor dash) |
^.*Z | AF245-Z/63:10:12//15 KG | AF245-Z (from start of input to Z inclusive) |
/.*$ | AF245-Z/63:10:12//15 KG | /63:10:12//15 KG (from the first slash, followed by any char repeated till end of input) |
.* or ^.*$ idem | AF245-Z/63:10:12//15 KG | the whole input AF245-Z/63:10:12//15 KG (any char repeated without limit = everything) |
The last example is worth a comment: .* can also match an empty input string whereas .+ requires at least one character as input.
Note that like the . , the + , ? and * also lose their special status when used within [ ] or [^ ] range specifications: they can only match a single + ? or * character.
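The repetition samples can be verified with a few lines of Python (an illustrative choice of language). re.findall lists successive matches, and re.search returns the first one only.

```python
import re

flight = "AF245-Z/63:10:12//15 KG"

m1 = re.findall("AB*C+", "ABCCCBCACCCAABBCAB")
m2 = re.findall("[A-Z0-9]+", flight)
m3 = re.findall("[^/-]{3,}", flight)
m4 = re.search("^.*Z", flight).group()
m5 = re.search("/.*$", flight).group()

# .* accepts an empty input, .+ needs at least one character.
star_on_empty = re.fullmatch(".*", "") is not None
plus_on_empty = re.fullmatch(".+", "") is not None

print(m1, m2, m3, m4, m5, star_on_empty, plus_on_empty)
```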
In practice
We have now powerful means of validating data.
For instance, signed decimal data values like 123,5 , -12,06 or +3.05
can be validated with: [+-]?[0-9]+[.,][0-9]+
The regex reads: a character belonging to the [] set of [ + (plus) or - (minus) ] which is ? optional, followed by a character in the [] set of [ 0-9 (range from 0 to 9) ] which is + repeated at least once, followed by a character in the [] set of [ . (dot) or , (comma) ], followed by a character in the [] set of [ 0-9 (range from 0 to 9) ] which is + repeated at least once.
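A minimal Python sketch of this validation follows, with re.fullmatch enforcing an entire match. The first three samples are the values quoted above; the last three are made-up counter-examples.

```python
import re

decimal = "[+-]?[0-9]+[.,][0-9]+"
samples = ["123,5", "-12,06", "+3.05", "42", "1,", ",5"]

# A value is valid only if the pattern covers it entirely.
valid = [s for s in samples if re.fullmatch(decimal, s)]
print(valid)
```

Note that the pattern requires a decimal separator with digits on both sides, so a plain integer such as 42 is rejected.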
The syntax of a Bank Identifier Code is 8 to 11 uppercase characters, which can be validated with
[A-Z0-9]{4}[A-Z]{2}[A-Z0-9]{2,5}
A flight number is made of an airline code of 2 to 3 alphanumeric characters (the third one is always alphabetical when used), followed by 3 to 5 digits. A validation regex is then
[A-Z0-9]{2}[A-Z]?[0-9]{3,5}
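Both validation patterns can be exercised with a short Python sketch. The sample codes below are illustrative values chosen for this example, and re.fullmatch is used as the entire-match operation.

```python
import re

bic = "[A-Z0-9]{4}[A-Z]{2}[A-Z0-9]{2,5}"
flight_no = "[A-Z0-9]{2}[A-Z]?[0-9]{3,5}"

bic_samples = ["DEUTDEFF", "DEUTDEFF500", "deutdeff", "DEUTDE"]
flight_samples = ["AF245", "BAW1234", "AF12"]

valid_bics = [s for s in bic_samples if re.fullmatch(bic, s)]
valid_flights = [s for s in flight_samples if re.fullmatch(flight_no, s)]
print(valid_bics, valid_flights)
```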
The identification of interchange and segment/record structures is often based on tags at certain positions.
For instance, an EDIFACT interchange usually starts with one of UNA+... or UNB+... which we can match with ^UN[AB][+].
The + has been placed in a standalone character list [ ] such as to enforce its meaning as a plus-sign character, instead of standing for the special repeat-at-least-once marker.
In a classical file made of fixed-records we have for instance a 5 digit record number followed by a one character record type A, B or C. We can identify the respective records with
^[0-9]{5}A and ^[0-9]{5}B and ^[0-9]{5}C .
A variant that cares only about the one-char record type field at pos 6 could be
^.....A and ^.....B and ^.....C
which is guaranteed to work even if record numbers are not used and actually left blank (space filled!).
Numerous segments/lines in IATA airline messages bear no tag and can only be identified from their layout. For instance, an AWB_Consignment segment starts with the Air Waybill Number, itself made of a 3 character airline code followed by a dash, followed by the waybill number and so forth. Such a segment is simply identified with:
^.{3}-
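The record identification patterns above can be compared in a small Python sketch. The record contents and the waybill-like string are made up for illustration; only the patterns come from the text.

```python
import re

numbered = "00042C record with a number"
blank = "     C record with a blank number"

# The digit-based pattern needs a real record number...
digits_numbered = re.search("^[0-9]{5}C", numbered) is not None
digits_blank = re.search("^[0-9]{5}C", blank) is not None

# ...whereas the position-based pattern only looks at position 6.
position_numbered = re.search("^.....C", numbered) is not None
position_blank = re.search("^.....C", blank) is not None

# Layout-based identification: 3 characters followed by a dash.
awb_like = re.search("^.{3}-", "057-12345675") is not None

print(digits_numbered, digits_blank, position_numbered, position_blank, awb_like)
```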
Lesson 7: Consuming input characters in greedy or reluctant mode
How many matches do we have of ... in ABCDEFG ?
We can indeed think about the five ABC, BCD, CDE, DEF, EFG, but the right answer is only ABC and DEF , G being left out because 3 characters are no longer available for a third match.
Indeed, as matching progresses throughout the input string, characters matched by the pattern are like 'consumed'. In the above example, the first match takes ABC, so the next attempt starts from D and takes DEF, and the next attempt starts from G and fails with insufficient characters left to match once more the pattern of 3 . any-char.
Assume now that we try to match ....? meaning 3 chars plus an optional fourth. Will we get ABC plus DEFG or ABCD plus EFG ? The answer is ABCD and then EFG , because matching is by default greedy; in other words, it takes as much as it can match as soon as possible.
Note that another formally correct solution to matching ....? over ABCDEFG is ABC plus DEF because the fourth character is optional. This last result is actually obtained by enforcing the reluctant mode; in other words, it takes as few as it can while trying to match optional and repeating elements.
The default mode is greedy, and reluctant mode is invoked by adding an extra ? next to the optional and repeating indicators described in the previous lesson. The above pattern becomes: ....?? which reads ... 3 any-chars plus . any-char ? optional and ? reluctant.
If we match ABCD plus EFG with ....?, and ABC plus DEF (no G!) with ....?? , how can we match ABC plus DEFG ? The reply (based on what we have seen so far) is ....??G? , try it!
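For readers who want to try it, here is a minimal Python sketch of the four patterns discussed in this lesson. Python's re module is an assumption of this illustration; it uses the same ? suffix for reluctant matching.

```python
import re

text = "ABCDEFG"

plain = re.findall("...", text)         # three chars at a time, G is left over
greedy = re.findall("....?", text)      # the optional 4th char is taken when available
reluctant = re.findall("....??", text)  # the optional 4th char is never needed
mixed = re.findall("....??G?", text)    # reluctant 4th char, then an optional G

print(plain, greedy, reluctant, mixed)
```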
The notation for reluctant matches becomes:
Special characters | Effect in pattern matching |
C{min,max}? | means that the character C (or range specification) can occur from min to max times, trying to match as few as possible to satisfy the overall pattern in which this is used. |
C{fixed-n}? | The notation is accepted but this is obviously stupid: greedy and reluctant modes cannot affect fixed counts! |
C{min,}? | means that the character C (or range specification) must occur at least min times up to unlimited count, trying here to match as few as possible to satisfy the overall pattern in which this is used. |
C?? | C is optional; formally C can be matched at most once, and only if that is needed to satisfy the overall pattern in which this is used. |
C+? | C is compulsory and may repeat, but only in case this is required to satisfy the overall pattern in which this is used. |
C*? | means that C is optional and may repeat; the matching strategy is here to consume as few instances as needed to satisfy the overall pattern in which this is used. |
The reluctant modifier (an extra ?) is one of the most powerful features of regular expressions.
A few examples comparing greedy and reluctant outcomes will immediately clarify the point.
Regex | Input String | Solution |
.*/ greedy | AF245-Z/63:10:12//15 KG | AF245-Z/63:10:12// (there is only one match, up to the very last / in the string) |
.*?/ reluctant | AF245-Z/63:10:12//15 KG | AF245-Z/ (the first match goes up to the first / ; additional matches yield 63:10:12/ and /) |
/.*: greedy | AF245-Z/63:10:12//15 KG | /63:10: (from the first / up to the last :) |
/.*?: reluctant | AF245-Z/63:10:12//15 KG | /63: (from the first / up to the first : that follows) |
/.*$ greedy | AF245-Z/63:10:12//15 KG | /63:10:12//15 KG |
/.*?$ reluctant | AF245-Z/63:10:12//15 KG | /63:10:12//15 KG (why don't we match just /15 KG? Because parsing works from left to right: the match starts on the first / and then takes the minimum needed to satisfy $ end-of-input) |
.* greedy | AF245-Z/63:10:12//15 KG | the whole input AF245-Z/63:10:12//15 KG |
.*? reluctant | AF245-Z/63:10:12//15 KG | the 1st match is an empty string located just before the very first character of the input. Subsequent matches produce 23 additional empty strings, thus yielding all possible zero-length strings inside a 23 char long string! (in front, in between any two chars, at the end) |
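The greedy and reluctant outcomes can be compared side by side with a Python sketch (Python being an illustrative choice). The bare .*? case is left out on purpose, because regex engines may differ in how they step past empty matches when asked to enumerate them.

```python
import re

flight = "AF245-Z/63:10:12//15 KG"

greedy_slash = re.findall(".*/", flight)      # one long match
reluctant_slash = re.findall(".*?/", flight)  # three short matches

greedy_colon = re.search("/.*:", flight).group()
reluctant_colon = re.search("/.*?:", flight).group()

# Reluctant, yet anchored to the end: the match still starts at the first /.
reluctant_end = re.search("/.*?$", flight).group()

print(greedy_slash, reluctant_slash)
print(greedy_colon, reluctant_colon, reluctant_end)
```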
In practice
Reluctant mode is useless in validating data, because validation is intrinsically about one big full match, checking that the data value entirely complies with a given pattern. There is no reason to match reluctantly less than the entire input data value.
By the same token, identification purposes do not need the reluctant modifier, because in principle if you have several identification possibilities, regular optional markers will cater for the variants: you do not need to favor one identification match versus another when all are valid by definition!
However, there are activities where the reluctant mode makes a great difference and is even indispensable: data value extraction, and cut-out (segmentation). These activities require help from capturing groups, which we investigate in the second part of this tutorial.