Intermediate wrap-up

Did you realize that, at this point in our study of regular expressions, we are left with only 3 special characters out of 11: | pipe, ( bracket, and \ backslash?
The meanings of the others . * + ? [ { ^ $ should now be clear:

Special char Pattern matching effect
  . matches any single char (loses special status when used in [ ] )
  [char_list] matches a single character within the specified list
  * previous match repeated 0 or more times (* loses special status when used in [ ] )
  + previous match repeated 1 or more times (+ loses special status when used in [ ] )
  {n,m} previous match repeated n to m times
Variants: {n} fixed n-times, {n,} at least n-times ({ and } lose special status when used in [ ] )
  ? 1st rank : previous match is optional
2nd rank next to ? * + and {} : reluctant mode (? loses special status when used in [ ] )
  ^ matches the start-of-input, except when used in [^char_list] where it negates the char_list (^ loses special status when used in [ ] anywhere but immediately next to [ )
  $ matches the end of input ($ loses special status when used in [ ] )
Do not forget the base rule: by default, a character in regex matches itself in input, case sensitive

Table of Contents 

PART I (about special characters, greedy and reluctant modes, ...)

PART II:

  1. Backslash escapes
  2. OR-expressions and groupings
  3. Capturing and non-capturing groups
  4. Negative match, lookahead and lookbehind...
  5. Conclusion

Lesson 8: backslash escapes

If we have any character like . * + ? | ( [ { \ ^ $ in the input string itself, and we would like to match this input character in the pattern, how do we proceed?

  • A first technique has already been illustrated: we include the special character within a character-list [ ], making it lose its special status.
    That obviously works for [.] [+] [*] [$], and it also works for [?] [(] [{] [)] [}] [|], but ^ , [ and ] themselves clearly pose a problem: [^] negates a missing char-list and is invalid, and [[] or []] are obviously ambiguous.
  • The second technique is to use an escape sequence. Just like in C# or Java strings, the backslash is the escape character and we have:
Escape sequence Interpretation in pattern matching
  \. matches a single dot char (hence [.] , \. and [\.] have the same effect!)
Similar to the above we can use \^ , \$ , \* , \+ , \? , \| , \( , \) , \{ , \} , \[ and \]
  \\ matches the backslash itself as a character
(indeed [\] is invalid, and [\]] is identical to \] )
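
As a quick check of the equivalences above, here is a minimal sketch in Python. The choice of Python and its re module is an assumption made for illustration (other regex engines may differ in details), and the sample strings are made up.

```python
import re

# Three ways to write "a literal dot"; they should all behave the same.
dot_patterns = [r"[.]", r"\.", r"[\.]"]
results = [re.findall(p, "v1.2") for p in dot_patterns]

# An unescaped dot, by contrast, matches every single character.
unescaped = re.findall(r".", "v1.2")

# A backslash in the input is matched by the escape sequence \\ .
backslash_hits = re.findall(r"\\", r"C:\temp\x")

# A closing square bracket: inside a list it must be escaped, outside too.
closing = [re.findall(p, "a]b") for p in (r"[\]]", r"\]")]

print(results, unescaped, backslash_hits, closing)
```

Raw strings (the r"..." prefix) keep Python itself from interpreting the backslash before the regex engine sees it.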

However, be very careful not to think that \a_char is a general mechanism to force a_char to stand for itself. In fact, \c does not mean the letter c in the way that \. means a dot. We have:

Escape sequence Interpretation in pattern matching
  \t matches the tab char (HEX 09 or ctrl-I)
  \d matches a digit, equivalent to [0-9]
  \D a non-digit, equivalent to [^0-9] and [^\d]
  \s matches all whitespace chars, comprising tab, carriage-return, linefeed, vertical tab, formfeed and of course space (HEX 20), equivalent to [\t\r\n\x0B\f ]
  \S matches all non-whitespace chars, equivalent to [^\s]
  \xHH matches the ASCII character with hex value HH. For instance \x01 matches the SOH (start of heading) control character, which is ctrl-A.
  \uHHHH matches the UNICODE character with hex value HHHH. For instance \u00E9 matches é and \u03B1 matches the Greek letter alpha α.

The above are just a few (useful) examples. You may want to check some regex reference documentation for all additional possibilities.
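
The escapes listed above can be tried out with a short sketch. It assumes Python's re module; in Python 3, \d and \s also match non-ASCII digits and whitespace by default, so the re.ASCII flag is used here to stay close to the [0-9] and [\t\r\n\x0B\f ] equivalences given in the table. The sample string is invented.

```python
import re

# Sample input: letters, digits, a tab, a space, an e-acute and a ctrl-A.
sample = "A1\tB22 C\u00e9\x01"

digits = re.findall(r"\d", sample, re.ASCII)
digits_explicit = re.findall(r"[0-9]", sample)

whitespace = re.findall(r"\s", sample, re.ASCII)
whitespace_explicit = re.findall(r"[\t\r\n\x0B\f ]", sample)

non_digits = re.findall(r"\D", "a1b")

soh = re.findall(r"\x01", sample)        # control character ctrl-A
e_acute = re.findall(r"\u00E9", sample)  # Unicode escape for e-acute

print(digits, whitespace, non_digits, soh, e_acute)
```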

In practice

Personally, I am quite reluctant to use more than the above. I even use \d and \D with caution, as I tend to forget which of d or D stands for digit and which for non-digit. The equivalent [0-9] and [^0-9] are often more explicit.

By the same token, \s is useful for its ability to match both spaces and tabs, but dangerous when using regular expressions in MULTILINE mode (cf. the advanced tutorial), as we often forget that it also matches carriage return and linefeed!

The \xHH notation is very useful for matching control characters. Many legacy formats use non-printable characters as record and field delimiters. This notation allows matching them just like other characters. You may combine them in lists such as [\x01\x02\x03].
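
To make the point about control characters concrete, here is a small sketch assuming Python's re module. The record layout (ctrl-A between records, ctrl-B between fields) is hypothetical and only serves the illustration.

```python
import re

# Hypothetical legacy data: \x01 separates records, \x02 separates fields.
raw = "AGT3\x02JFK\x0112345\x02P8"

# Cut on the record delimiter only.
records = re.split(r"\x01", raw)

# Cut on any of the control characters in the list.
fields = re.split(r"[\x01\x02\x03]", raw)

print(records)
print(fields)
```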

Lesson 9: OR-expressions and groupings

Imagine now that we want to match "AB or BA" in an input string. If we use the two-char pattern [AB][BA] we would also match AA and BB. Similarly, if we want to match a sequence of "three or five #", but not four, a pattern like ####?#? is inadequate as it also accepts #### (four #); the same problem arises with #{3,5}.

The solution makes use of the last two special characters that we have not yet explained:

Special char Interpretation in pattern matching
  exp1|exp2 matches the regular expression exp1 or exp2.
One can chain OR-expressions as in exp1|exp2|exp3.
For instance, compared with [AB][AB] the regex AB|BA can match AB and BA but not AA nor BB. Whereas AA|AB|BA|BB is equivalent to [AB][AB], that we may also write [AB]{2}
  (sub_exp) groups the elements in regex sub_exp as a sub-pattern which you can use as a block in following constructs:
(exp)* exp repeated 0 or more, (exp)+ exp repeated once or more, (exp)? exp is optional, (exp){n,m} exp n to m times, plus (exp)*? , (exp)+? , (exp)?? and (exp){n,m}? for the corresponding reluctant mode expressions.
For instance, the regex ###(##)? accepts only ### and ##### but not ####. We have also the equivalent (##)?### , #(##)?## and more like #{3}(#{2})? .

Beware! The regex AA(BB)|(CC)DD means "AABB or CCDD" and not "AABBDD or AACCDD". You would not have been confused by AB|CD into thinking that it could mean ABD or ACD! If brackets are not followed by any of + * ? or {n,m}, such groupings are grammatically useless: AB(CD) is identical to ABCD, but AB(CD)? does match AB and ABCD, and can also be written as AB|ABCD. Conversely, AA(BB|CC)DD does mean "AABBDD or AACCDD", and removing the brackets would change it to "AABB or CCDD".
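
The behaviour of OR-expressions and groupings discussed in this lesson can be checked with a short sketch. It assumes Python's re module and uses full-match operations; the helper name accepted is hypothetical.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

pairs = ["AB", "BA", "AA", "BB"]
two_char_class = accepted(r"[AB][BA]", pairs)  # too permissive
alternation = accepted(r"AB|BA", pairs)        # only the two wanted pairs

hashes = ["###", "####", "#####"]
three_to_five = accepted(r"#{3,5}", hashes)    # also lets four # through
three_or_five = accepted(r"###(##)?", hashes)  # three or five, never four

# The | splits the whole expression, not just the neighbouring groups.
scope = accepted(r"AA(BB)|(CC)DD", ["AABB", "CCDD", "AABBDD", "AACCDD"])

# A group followed by ? makes the whole block optional.
optional = accepted(r"AB(CD)?", ["AB", "ABCD", "ABC"])

print(alternation, three_or_five, scope, optional)
```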

There are interesting effects when we put special characters like ^ or $ within an OR-expression scope. For instance:

Regex               Input String               Solution (matched parts)
(^..|..$)           AF245-Z/63:10:12//15 KG    AF , KG
first 2 or last 2 characters
(^|/)[A-Z0-9 ]+     AF245-Z/63:10:12//15 KG    AF245 , /63 , /15 KG
fields from the start or with a leading / and made of at least one uppercase alphanumeric or space
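
The two find operations in this table can be reproduced with a minimal sketch, assuming Python's re module (re.finditer performs repeated find operations over the input).

```python
import re

s = "AF245-Z/63:10:12//15 KG"

# First two or last two characters.
first_or_last = [m.group() for m in re.finditer(r"(^..|..$)", s)]

# Fields from the start, or with a leading slash, made of at least
# one uppercase alphanumeric or space.
fields = [m.group() for m in re.finditer(r"(^|/)[A-Z0-9 ]+", s)]

print(first_or_last)
print(fields)
```

Note that the empty field between the two slashes is not reported, because the + indicator requires at least one character after the delimiter.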

In practice

OR-expressions are invaluable for validation purposes.

For instance, the details of payment code in a SWIFT MT101 Request for Transfer message is one of INV, IPI, RFB, ROC or TSU, which we can validate with the regex INV|IPI|RFB|ROC|TSU .

Load Category codes in the IATA-AHM Container Placement Message are one letter codes, or two letter codes starting with B or C. There is a limited set of combinations that we can match with the regex
   [BCDEFHMNQSTUWXZ]|B[FTHSGD]|C[AGILP]

A year field formatted as YYYY or YY can be validated for the past century and next thousand years with (19|2[0-9])?[0-9]{2}

Valid numerical fields in fixed size legacy record formats are often either all blanks or all digits, which we can validate with the regex [0-9]{10}| {10} . Note that factoring the field size as in ([0-9]| ){10} is not valid because a value like 0 00 12 4 is now accepted.

The discharged quantity in a proprietary freight message is either "NIL" or a valid number of pallets up to 3 digits, which is validated by NIL|[0-9]{1,3} ; an improved version that prevents leading zeros in the numerical value is NIL|[1-9][0-9]{0,2} .
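
The validation patterns of this section can be exercised with a short sketch. It assumes Python's re module, where re.fullmatch plays the role of a full-match operation; the helper name is_valid and the test values are made up for illustration.

```python
import re

def is_valid(pattern, value):
    """Return True when the entire value matches the pattern."""
    return re.fullmatch(pattern, value) is not None

PAYMENT_CODE = r"INV|IPI|RFB|ROC|TSU"
LOAD_CATEGORY = r"[BCDEFHMNQSTUWXZ]|B[FTHSGD]|C[AGILP]"
YEAR = r"(19|2[0-9])?[0-9]{2}"
FIXED_NUMERIC = r"[0-9]{10}| {10}"
FACTORED_NUMERIC = r"([0-9]| ){10}"  # the looser, factored form
QUANTITY = r"NIL|[1-9][0-9]{0,2}"

mixed = "0 00 12 45"  # ten chars mixing digits and blanks

print(is_valid(PAYMENT_CODE, "RFB"), is_valid(PAYMENT_CODE, "XYZ"))
print(is_valid(LOAD_CATEGORY, "BF"), is_valid(LOAD_CATEGORY, "BA"))
print(is_valid(YEAR, "1999"), is_valid(YEAR, "07"), is_valid(YEAR, "1899"))
print(is_valid(FIXED_NUMERIC, mixed), is_valid(FACTORED_NUMERIC, mixed))
print(is_valid(QUANTITY, "NIL"), is_valid(QUANTITY, "120"), is_valid(QUANTITY, "007"))
```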

The use of OR-expressions for identification purposes often supports the requirement to group, under a common XML element tag, a collection of different optional pieces from the non-XML input message. Assuming for instance that id_exp1, id_exp2 and id_exp3 are valid identification patterns for the 3 possible pieces/members of such a semantic grouping, we will be able to detect the break in semantics at the current parsing point by testing for the presence of any member with a regex like
    id_exp1|id_exp2|id_exp3

There are also occasions where a given message segment gives no means of testing for a tag but must be identified from its 'shape'. The relevant identification patterns become layout models where groupings and OR-expressions are essential tools.

For instance, in IATA messages, one may need to identify a given message segment by its count of fields. Assuming hash-separated fields as in AGT3#JFK#12345#P8, we can identify such four-field segments with the regex:  ((^|#)[^#]*){4} , which also handles empty fields!

The above regex reads as : a ( ) group that contains ( (^|#) start-of-input or hash followed by [^#] non-hash chars * repeated zero or more ) altogether {4} repeated 4 times exactly.
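
The field-count pattern can be tried with a small sketch, assuming Python's re module and full-match operations. Engines differ in how they treat an anchor like ^ inside a repeated group, so the sketch also includes a hypothetical variant (not from the tutorial) that counts the three separators directly.

```python
import re

FOUR_FIELDS = r"((^|#)[^#]*){4}"       # the pattern from the text
FOUR_FIELDS_ALT = r"[^#]*(#[^#]*){3}"  # assumed variant: exactly 3 separators

def matches_whole(pattern, segment):
    """Return True when the entire segment matches the pattern."""
    return re.fullmatch(pattern, segment) is not None

four = "AGT3#JFK#12345#P8"
four_with_empty = "AGT3##12345#P8"
five = "AGT3#JFK#12345#P8#X"
three = "AGT3#JFK#12345"

print(matches_whole(FOUR_FIELDS, four), matches_whole(FOUR_FIELDS, five))
print(matches_whole(FOUR_FIELDS_ALT, three), matches_whole(FOUR_FIELDS_ALT, four))
```

With either pattern, a full-match operation is needed: a find operation would happily report the first four fields of a longer segment.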

The following example is about airline messages exchanged through the SITA network. Feel free to skip it and go straight to the next lesson.

The first line in an IATA Type-B header starts with an optional priority code like QD or QN (and then a space), and then a list of up to eight IATA recipient addresses separated by spaces; these addresses are 3-letter city codes followed by 4 to 5 alphanumericals. There is no fixed line-tag to test. So, if you want to check the presence of such a header while matching the first line extracted from an input message, you may need a regex like (for a strict version):
    ^(Q[UPKDNXS] )?([A-Z]{3}[A-Z0-9]{4,5} ){0,7}[A-Z]{3}[A-Z0-9]{4,5}$

This regex reads as : ^ start of input followed by a ( )? optional group comprising letter Q plus a letter in [ ] set UPKDNXS plus a   space, followed by a () group comprising ( {} fixed 3 chars in [ ] set A to Z plus a {,} repeated 4 to 5 chars in [ ] set A to Z and 0 to 9, followed by a   space ) altogether {,} repeated 0 to 7 times and always followed by a {} fixed 3 chars in [ ] set A to Z plus a {,} repeated 4 to 5 chars in [ ] set A to Z and 0 to 9, always followed by $ end of input.

The length of the above may lead one to prefer a more compact but looser identification with:
    ^(Q.\s+)?[A-Z0-9]{7,8}(\s|$)
which is enough in practice.

Note that this regex accepts \s+ repeated generic whitespace (including tab) in place of the strict single space char after the optional priority code. Moreover, the test is reduced to only the first 7-or-8 char recipient address (simplified to all alphanumerics) followed by a () group comprising \s generic whitespace (separating this recipient from subsequent ones), | or, the immediate $ end of input (in case there is only one recipient).
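
The difference between the strict and the loose header patterns can be seen in a short sketch, assuming Python's re module. The address lines below are invented and do not come from real messages.

```python
import re

STRICT = r"^(Q[UPKDNXS] )?([A-Z]{3}[A-Z0-9]{4,5} ){0,7}[A-Z]{3}[A-Z0-9]{4,5}$"
LOOSE = r"^(Q.\s+)?[A-Z0-9]{7,8}(\s|$)"

def looks_like_header(pattern, line):
    """Both patterns carry their own anchors, so a find operation is enough."""
    return re.search(pattern, line) is not None

with_priority = "QD JFKXYAB LHRTT9Z"   # hypothetical addresses
single = "JFKXYAB"
double_space = "QD  JFKXYAB LHRTT9Z"   # two spaces after the priority code
not_a_header = "HELLO WORLD"

for line in (with_priority, single, double_space, not_a_header):
    print(repr(line), looks_like_header(STRICT, line), looks_like_header(LOOSE, line))
```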

Lesson 10: Capturing and non-capturing groups

So far we have mostly explored the use of regular expressions for identification and validation purposes, and seldom mentioned extraction and cut-out applications. There are three techniques for extracting bits or cutting the input string into pieces with regular expressions. We have seen that regular expression software libraries support full-match operations and find operations. In the case of find operations, the API supplies methods that return the offsets within the input string at which a match of the pattern takes place. Therefore we can take advantage of repeated find operations to get all the offsets at which we shall slice the input. In doing so, we can use the regular expression to match either the delimiters or the data values. This is the basis of the first two techniques illustrated below:

Regex Input String Solution
[/:] AF245-Z/63:10:12//15 KG AF245-Z / 63 : 10 : 12 / Ø / 15 KG
Technique 1: matching delimiters
[^/:]+ AF245-Z/63:10:12//15 KG AF245-Z / 63 : 10 : 12 // 15 KG
Technique 2: matching values

We may observe that the second technique poses a problem: the empty data value between the two // is missed, which may lead us to misinterpret 15 KG as the fifth field instead of the sixth. Of course you may think of changing the regex to [^/:]* in order to accept empty values, but the consequence is that numerous empty fields will be revealed, yielding 11 matches as follows AF245-Z , Ø , 63 , Ø , 10 , Ø , 12 , Ø , Ø , 15 KG , Ø !
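
The first two techniques, and the problem with empty values, can be reproduced with a minimal sketch. It assumes Python's re module, where re.split slices between matched delimiters and re.findall collects matched values.

```python
import re

s = "AF245-Z/63:10:12//15 KG"

# Technique 1: match the delimiters and slice in between.
by_delimiters = re.split(r"[/:]", s)

# Technique 2: match the values themselves (the empty field is missed).
values = re.findall(r"[^/:]+", s)

# Accepting empty values with * reveals many spurious empty matches.
values_star = re.findall(r"[^/:]*", s)

print(by_delimiters)
print(values)
print(len(values_star), values_star)
```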

The above two techniques force us to design patterns that strictly match only delimiters, or else only values. We cannot benefit from the interplay between delimiter chars and value chars in the same regex, which is very limiting. Fortunately, regular expressions feature an additional and extremely powerful mechanism: capturing groups; in other words, the ability to indicate explicitly, within the pattern, the boundaries of the pieces of the input string that we want to capture, i.e. for extraction or cut-out purposes. Actually, capturing groups are just the ( ) groupings studied in the previous lesson! Let us immediately consider some examples, marking what each group captures in a regex with several groups:

Regex Input String Solution (matched , captured as G1 , captured as G2)
/([^/]*) AF245-Z/63:10:12//15 KG AF245-Z /63:10:12 /Ø /15 KG
slash followed by a capturing group of repeated non-slash
(^|/)([^/]*) AF245-Z/63:10:12//15 KG ØAF245-Z /63:10:12 /Ø /15 KG
capturing ( zero-length-start-of-input or slash ) followed by capturing ( repeated non-slash )
^([^/]*)|/([^/]*) AF245-Z/63:10:12//15 KG ØAF245-Z /63:10:12 /Ø /15 KG
start-of-input followed by capturing ( repeated non-slash ) OR slash followed by capturing ( repeated non-slash )

The first example above failed to capture the first data field. To fix it, we use in the second example an OR-expression grouping that generalizes the delimiter as ( start-of-input or slash ). By the same token, we introduce a second capturing group with the effect of capturing both delimiters and data values. The third example fixes the issue with capturing delimiters and captures data with two different groups.

The above considerations suggest that there will be numerous occasions where we need groups for the sake of composing OR-expressions (with ( | ) ), or else for marking optional and repeated sub-patterns (with ( )? , ( )* , ( )+ and ( ){n,m} ), without wanting all these grammatical groups to become capturing groups automatically. We would like to be able to tell explicitly which groups shall be capturing and which shall not. That is exactly what regular expressions allow with the help of the following notations.

Notation Interpretation in pattern matching
(sub_exp) is a group that also captures whatever matches sub_exp.
In practice, the API of regex software provides means to query the boundaries of matched sub_exp's within the input string for the sake of extracting or cutting out data.
(?:sub_exp) is a non-capturing group. Like any group, it can be used to compose OR-expressions and mark optional and repeated cardinality, but it will hide the offsets of sub_exp within the input string.
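
The difference between the two notations shows up directly in what the API reports. The following sketch assumes Python's re module, where re.findall returns the captured groups and a match object exposes their offsets through span().

```python
import re

s = "AF245-Z/63:10:12//15 KG"

# Two capturing groups: each match reports (delimiter, value).
both = re.findall(r"(^|/)([^/]*)", s)

# Non-capturing delimiter group: only the values are reported.
only_values = re.findall(r"(?:^|/)([^/]*)", s)

# Offsets of the single capturing group within the input string.
value_spans = [m.span(1) for m in re.finditer(r"(?:^|/)([^/]*)", s)]

print(both)
print(only_values)
print(value_spans)
```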

Let us illustrate now the third string cutting/extraction technique:

Regex Input String Solution (matched , captured )
Technique 3: capturing groups
(?:^|/)([^/]*) AF245-Z/63:10:12//15 KG ØAF245-Z /63:10:12 /Ø /15 KG
(.*)/ AF245-Z/63:10:12//15 KG AF245-Z/63:10:12// 15 KG
capture of ( any-char repeated greedy-mode-by-default ) up to a slash
(.*?)/ AF245-Z/63:10:12//15 KG AF245-Z/ 63:10:12/ Ø/ 15 KG
capture of ( any-char reluctantly repeated ) up to a slash
(.*?)(?:/|$) AF245-Z/63:10:12//15 KG AF245-Z/ 63:10:12/ Ø/ 15 KGØ ØØ
capture of ( any-char reluctantly repeated ) up to ( slash or zero-length-end-of-input )

One may wonder why an extra pair of zero-length ØØ is matched in the last example. The fact is that a match against ^ start-of-input or $ end-of-input does not consume any character. Matching ^^^. against ABC does work and returns A . Similarly, .$$$$$$$$ applied to ABC yields C . So when parsing resumes at offset 18 next to the last / , it captures 15 KG and matches $ end-of-input. The offset becomes 23 (the actual length of the string) and parsing resumes again from the virtual end-of-string. It can capture an empty value (as allowed by the * repetition indicator) and match the $ end-of-input again. This time the offset has not moved and so parsing terminates.
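
The greedy and reluctant captures of technique 3, including the extra zero-length match at the end of input, can be observed with a short sketch. It assumes Python's re module; other engines may handle the final empty match differently.

```python
import re

s = "AF245-Z/63:10:12//15 KG"

greedy = re.findall(r"(.*)/", s)            # runs up to the last slash
reluctant = re.findall(r"(.*?)/", s)        # stops at each slash
with_end = re.findall(r"(.*?)(?:/|$)", s)   # slash or end-of-input

# Anchors do not consume characters, so they can be repeated freely.
first_char = re.search(r"^^^.", "ABC").group()
last_char = re.search(r".$$$$$$$$", "ABC").group()

print(greedy)
print(reluctant)
print(with_end)
print(first_char, last_char, len(s))
```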

For cases like the above, the reverseXSL software supplies simpler cutting techniques with the help of built-in functions like CUT-ON-(/) or else, CUT-ON-"[-:/]" in order to cut smaller data pieces from the above example.

In practice

We recommend combining extraction purposes with validation as follows: a regular expression that fully matches the input string is developed, with non-capturing groups where needed. Then one (or several) pair of brackets ( ) is added inside the regex so as to delimit the capture zone and exclude unwanted field parts as well as syntax characters.

On the other hand, cut-out operations will take advantage of the three techniques presented in this lesson: (1) slicing in between matched delimiters, (2) slicing out matched values, or (3) using capturing groups.

Lesson 11: Negative match, lookahead and lookbehind...

From the previous lesson you should have noticed that the special character ? which had already three meanings (as optional marker by default, as reluctant modifier next to + * ? and {}, and as itself in [?] or \? ) got a fourth meaning as marker for non-capturing groups—as in (?: ) . The notation (? is actually an escape (like the \ alone) to a bunch of advanced pattern-matching facilities that are specified by what follows just next to the (? sequence. We have notably:

Notation Interpretation in pattern matching
(?:sub_exp) We already know this notation for non-capturing groups.
(?idmsux)
   turn flags ON
(?-idmsux)
   turn flags OFF
The flags can be turned on/off individually as follows:
 i ignore case, e.g. A matches both A and a
 d UNIX lines mode: only the \n linefeed is a line terminator
 m MULTILINE mode : ^ and $ also match start/end of lines (cfr advanced tutorial)
 s DOTALL mode : . also matches line terminators (cfr advanced tutorial)
 u UNICODE case folding, notably in (?iu)
 x allow comments starting with #, e.g. in (?x)(.*)(?:/|$) #a comment
(?=sub_exp) Zero-width positive lookahead... like a regular non-capturing group match of sub_exp, but the parsing cursor within the input string does not move forward
(?!sub_exp) Zero-width negative lookahead... like the previous (?= ) but where the non-match of the regex sub_exp is now expected.
(?<=sub_exp) Zero-width positive lookbehind... we try to match sub_exp against characters preceding the current parsing offset within the input string, and the cursor will not move forward nor backward
(?<!sub_exp) Zero-width negative lookbehind... like the previous (?<= ) but where a non-match of the regex sub_exp is expected.

A few examples are better than lengthy explanations:

Regex / Input String / Solution (captured text)
(?=\d*:\d*:\d*)(.*?):
zero-width positive lookahead
AF245-Z/63:10:12//15 KG     captured: 63
capture the first colon-separated field in three colon-separated numericals
:(\d*)(?<=\d*:\d*:\d*)
zero-width positive lookbehind
AF245-Z/63:10:12//15 KG     captured: 12
capture the last colon-separated field in three colon-separated numericals
(?:^|/|:|-)(.*?)(?![A-Z0-9 ])
zero-width negative lookahead
AF245-Z/63:10:12//15 KG     captured: AF245 , Z , 63 , 10 , 12 , Ø , 15 KG
( matching start-of-input or slash or colon or dash ) followed by the capture of ( any-char reluctantly repeated ) up to but not including a following char that is not an uppercase alphanumerical or space

In the last example, we may question the use of a zero-width negative lookahead group (?![A-Z0-9 ]) in place of the simple expression [^A-Z0-9 ]. Why does the latter not yield the right cuts? The difference is that [^A-Z0-9 ] consumes the matched non-alphanumerical (and thus advances the parsing cursor) whereas (?![A-Z0-9 ]) performs a similar match but does not advance the cursor. Therefore, with [^A-Z0-9 ] the non-alphanumerical delimiter under the cursor is, once matched, no longer available for the next find-operation. The cursor is already positioned on the first data char of the next field. The next find-operation expects to match first a field delimiter defined by (?:^|/|:|-) . So the find-operation must skip the current data field until the next delimiter, from where a new instance of the whole regex can be matched. In essence, it skips every other data field value. A second difference is that it also fails to match /15 KG, because the end-of-input complies with (?![A-Z0-9 ]) but not with [^A-Z0-9 ] !
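
The lookahead examples, and the contrast with the consuming expression [^A-Z0-9 ], can be checked with a minimal sketch. It assumes Python's re module; the lookbehind example is left out because that module only accepts fixed-width lookbehind patterns.

```python
import re

s = "AF245-Z/63:10:12//15 KG"

# Positive lookahead: only start where three colon-separated numericals follow.
first_of_three = re.findall(r"(?=\d*:\d*:\d*)(.*?):", s)

# Negative lookahead: stop before a char that is not alphanumerical or space,
# without consuming it, so it remains available as the next delimiter.
cuts = re.findall(r"(?:^|/|:|-)(.*?)(?![A-Z0-9 ])", s)

# Consuming variant: the delimiter is eaten, so every other field is skipped
# and the last field is lost because nothing follows it.
consuming = re.findall(r"(?:^|/|:|-)(.*?)[^A-Z0-9 ]", s)

print(first_of_three)
print(cuts)
print(consuming)
```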

In practice

If you think that the last three examples are really screwy, I agree with you! I must admit that I wanted to challenge you with the last regular expressions of the whole tutorial: they contain tricks from every lesson.

You will seldom need to play with zero-width positive-or-negative lookahead-or-lookbehind patterns, especially for the cut-out purposes illustrated above, which are purely academic cases. Remembering that such mechanisms exist in the collection of regex tricks is enough. When you need one, it will be the right moment to review the details and test your prototype regular expressions against target strings in a regex test tool.

The practical application I have encountered is for identification purposes. For instance, you identify (and possibly skip) a repetition of some arbitrary string because it does not start with one of the TAGs of what is expected to follow. In such contexts the identification of such arbitrary string segments can be based on a regex like:   ^(?!TAG1|TAG2|TAG3)  which reads as ^ start-of-input followed by (?! is-not ( TAG1 |or TAG2 |or TAG3 ) ).

IATA AHM Container Placement Messages end with any number of SI (Supplemental Information) lines. The format is arbitrary because the repetition runs till an END keyword is encountered alone in a line. So, one can only identify repeated SI lines with a negative match, using the regex ^(?!END *$) reading as : start-of-input followed by is-not ( END followed by zero or more spaces followed by end-of-input ).

Another case occurs in SWIFT messages in which elements like ":50F:" (Ordering Customer in format option "F") are optionally followed by 1 up to 4 free text address lines. Other elements like :56D: :57D: :59: share the same logic. In all these, the free text lines are identified by not being like anything that can follow: namely lines starting with an element code (featuring a tag like :[0-9]{2}[A-Z]?:) or the end of block "-}". The identification of address lines is thus ensured with a regex like :
   ^(?!:[0-9]{2}[A-Z]?:|-\}$)
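
Both negative-match identifications can be tried with a short sketch, assuming Python's re module. The sample lines are invented for illustration and are not taken from real AHM or SWIFT messages.

```python
import re

SI_LINE = r"^(?!END *$)"                      # anything but a lone END
ADDRESS_LINE = r"^(?!:[0-9]{2}[A-Z]?:|-\}$)"  # neither a tag nor end of block

def identified(pattern, lines):
    """Return the lines identified by the pattern (find operation)."""
    return [line for line in lines if re.search(pattern, line)]

ahm_lines = ["SI PALLET 3 DAMAGED", "ENDORSED BY AGENT", "END  ", "END"]
swift_lines = ["1/JOHN SMITH", "2/HIGH STREET 1", ":59:/12345", "-}"]

si_lines = identified(SI_LINE, ahm_lines)
address_lines = identified(ADDRESS_LINE, swift_lines)

print(si_lines)
print(address_lines)
```

A line that merely starts with END, like the second sample, is still identified as an SI line because the lookahead requires END to stand alone.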

Conclusion

At this point, of course you do not know everything about regular expressions, but you do know everything that you may need for parsing arbitrary character-based data, and possibly transform it to XML with the help of reverseXSL software.

If you look at a complete regex reference documentation, you will note numerous additional notations, but nothing really new that you cannot express with what we have explained. For instance [a-z&&[^opq]] defines the intersection of character lists, which is equivalent to [a-nr-z] , \cA denotes ctrl-A which can also be noted \x01 , \p{Alpha} denotes [A-Za-z] and so forth.

There are actually only very few things that we have not covered:

  • Subtleties in applying regular expressions to input strings made of multiple lines. MULTILINE and DOTALL modes notably change the meaning of ^ , $ and . .
  • The \n notation, where n is a digit (e.g. \3 ), stands for whatever the n'th capturing group matched...
  • There is a possessive mode, in addition to the greedy and reluctant behaviors, which is like super-greedy...

These topics are covered elsewhere, e.g. in regex API reference documentation.

This is the end of the tutorial! How long did you take?

Over 3 hours

Honestly, I believe that you were distracted by a call or a visit, or kept on doing another job in parallel...

Over 2 hours

Either you digested every bit to the tiniest detail, or you played much with a regex testing tool in parallel to the tutorial. This is very good: you have anticipated the next step.

Just over 1 hour

This is it! The objective has been fulfilled. The points which may still be obscure will soon become clear with practice. You know enough to be self-supporting.

I hope you enjoyed this tutorial. Do not hesitate to provide me with feedback. I would love to learn from your experiences and improve further.