Overview

A message structure is described hierarchically in terms of nested:

  • Segments, which tell how to identify message pieces and how to cut them into smaller pieces, discarding any delimiters in the process. Segments may contain nested (sub)segments, data elements, and groups.
  • Data elements, which identify message pieces at the data element level and extract the relevant data values, discarding fillers, tags, and delimiters in the process.
  • Groups, that frame repeated sequences of segments and/or data elements within the message, and also allow controlling the hierarchy of elements in the output XML document.

In addition, one can also introduce Marks in the message definition with the effect of generating XML elements that result from the evaluation of Conditions on data element inter-dependencies. In other words, you can use the structure of the message itself to generate element values in the output XML document.

The whole message itself is simply a top level Segment.
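As a rough illustration of the three roles described above, here is a minimal Python sketch. It is not the ReverseXSL parser and does not use its DEF syntax; the toy message layout, the `/` delimiter, and every tag and element name are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy message: each line is a piece, '/' separates values.
message = "HDR/ORDER42\nITM/widget/3\nITM/gadget/5"

root = ET.Element("Message")              # the whole message: top level Segment
for line in message.split("\n"):          # cut the message into line pieces
    tag, *values = line.split("/")        # cut each line; the '/' delimiters are dropped
    if tag == "HDR":
        # Data element: extract the value, the 'HDR' tag itself is thrown away
        ET.SubElement(root, "OrderId").text = values[0]
    elif tag == "ITM":
        # Group: frames one repetition of the item sequence in the output
        item = ET.SubElement(root, "Item")
        ET.SubElement(item, "Name").text = values[0]
        ET.SubElement(item, "Quantity").text = values[1]

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The output XML keeps only the extracted values and the element hierarchy; tags and delimiters from the input are gone.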

If you are accustomed to the definitions of group, segment and data in EDIFACT, X12 or other similar legacy EDI formats

... forget the meaning of these terms for a while and reset your mind.

Within the context of the ReverseXSL parser,

  • Segments are used to 'segment'! Namely, to cut pieces into smaller pieces. A segment therefore stands for an unfinished chunk of the message at some level of parsing, one that you do want to cut (say 'segment') further down.
  • Data is data, what else? These are the smallest atoms of information that you do not need (or want) to cut further down but that you may like to validate.
  • And Groups are just 'groups'! Simply, they group several pieces of the message under a name, and allow this sequence to repeat; hence, Groups are sequences of Segments (pieces that you want to cut further), Data elements (pieces that you have finished cutting and whose value you want to extract), and also (sub)Groups (which give a name to a (sub)sequence of pieces of the message).

Legacy EDI formats restrict data elements to coexist only with other data elements (some of which can be composite elements). In legacy EDI formats, data pieces are then assembled into segments, and repeatable sequences of segments form groups. The message itself is a top level group.

None of these restrictions hold here. The message itself is a top level thing you are willing to cut down, hence a Segment, and as its name suggests, it will be segmented further. The pieces that result from this segmentation (and from any lower level segmentation onward) may comprise pieces that you will not want to cut further down, hence Data; other pieces that need further segmentation, hence (sub)Segments; and sequences of pieces that form logical associations or repetitions, hence Groups. Therefore, Segments may contain Data elements, (sub)Segments and (sub)Groups in any order, Groups may contain Data elements, (sub)Segments and (sub)Groups in any order, and Data elements are the final atoms, each containing only one data value to be extracted.
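The containment rules above can be pictured as a small recursive data model. The sketch below is only an illustration in Python, not the parser's internal representation; the class names mirror the concepts, while `children`, `depth`, and all element names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Data:
    """Final atom: carries one data value and contains nothing else."""
    name: str


@dataclass
class Segment:
    """A piece to be cut further; may hold Data, Segments and Groups in any order."""
    name: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Group:
    """A named sequence; may hold Data, Segments and Groups in any order."""
    name: str
    children: List["Node"] = field(default_factory=list)


Node = Union[Data, Segment, Group]


def depth(node: Node) -> int:
    """Nesting depth of a definition tree; a lone Data element has depth 1."""
    if isinstance(node, Data):
        return 1
    return 1 + max((depth(child) for child in node.children), default=0)


# The whole message is a top level Segment mixing all three kinds of children.
message = Segment("Message", [
    Data("Sender"),
    Group("Items", [
        Segment("ItemLine", [Data("Code"), Data("Quantity")]),
        Data("Note"),
    ]),
    Segment("Trailer", [Data("Checksum")]),
])
print(depth(message))
```

Only `Data` lacks a `children` list, which is the one restriction the model imposes.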

Everything can be nested and repeated

Data elements are, by nature, the atoms of information and contain nothing but a data value. Apart from that, you can nest Data within Segments or Groups, and Groups and Segments within each other, in any order, down to any depth.

Moreover, Groups, Segments, and Data elements can each be repeated. Even the top level Segment matching the whole message can be repeated. You fix the minimum and maximum number of repetitions in each case. The parser is actually more subtle: it distinguishes the official repetition limits from the acceptable ones. For instance, the formal specification may indicate a maximum of 5 quantity schedules per item ordered, while your application can accept 100 without any problem. A message arriving with 10 can then raise a warning about the formal violation, but will be processed anyhow.
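The difference between official and acceptable limits can be sketched as a two-tier check. This is an assumption about the general idea, not the parser's actual code: the names `RepetitionLimits` and `check_repetitions` are hypothetical, and for brevity only the maximum is split into two tiers.

```python
from dataclasses import dataclass


@dataclass
class RepetitionLimits:
    min_count: int        # minimum number of repetitions
    official_max: int     # maximum stated by the formal specification
    acceptable_max: int   # maximum the application actually tolerates


def check_repetitions(count: int, limits: RepetitionLimits) -> str:
    """Classify a repetition count as 'ok', 'warning' or 'error'."""
    if count < limits.min_count or count > limits.acceptable_max:
        return "error"      # outside what the application can process
    if count > limits.official_max:
        return "warning"    # formal violation, but still processed
    return "ok"


# The quantity schedule example: 5 officially, 100 accepted in practice.
schedules = RepetitionLimits(min_count=1, official_max=5, acceptable_max=100)
print(check_repetitions(10, schedules))   # prints "warning"
```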

Interdependencies are verified

In addition to Groups, Segments, Data, and Marks, a message DEFinition file can also contain specifications of named Conditions. Conditions formalize the interdependencies between elements of the message at diverse levels:

  • The presence of Data, Segments or Groups that constrain the presence of other Data, Segments, or Groups 
  • Data values that constrain the presence of other Data, Segments, or Groups
  • The presence of Data, Segments or Groups that constrain values of data elements
  • Data values that bear restrictions when linked, or not, to other values
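To make the four kinds of interdependency more concrete, the sketch below expresses one condition of each kind as a named Python predicate over a dictionary of parsed values. It does not reproduce the syntax of Conditions in a DEF file; the element names, condition names, and the dictionary representation are all assumptions for illustration.

```python
# Parsed pieces keyed by hypothetical element names; an absent piece is a missing key.
parsed = {"PaymentType": "CARD", "CardNumber": "4111111111111111"}

conditions = {
    # Presence constrains presence: an expiry date requires a card number.
    "ExpiryNeedsCard":
        lambda m: "CardExpiry" not in m or "CardNumber" in m,
    # Value constrains presence: a CARD payment requires a card number.
    "CardPaymentHasNumber":
        lambda m: m.get("PaymentType") != "CARD" or "CardNumber" in m,
    # Presence constrains value: an IBAN implies a TRANSFER payment type.
    "IbanImpliesTransfer":
        lambda m: "Iban" not in m or m.get("PaymentType") == "TRANSFER",
    # Value constrains value: a US country code implies USD as currency.
    "UsImpliesUsd":
        lambda m: m.get("Country") != "US" or m.get("Currency") == "USD",
}

# Evaluate every named condition against the parsed message.
results = {name: check(parsed) for name, check in conditions.items()}
print(results)
```

Each condition is written as an implication ("not A, or B"), so it holds trivially when its premise is absent from the message.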

Marks, a concept introduced above, take advantage of Conditions. They provide the opportunity to evaluate a named condition at any point in the parsing of an input data message, and to insert an extra element with the outcome into the generated XML document.

Marks can also be used to map coded values to more readable text equivalents.
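Both uses of Marks can be mimicked in a few lines of Python. This is a sketch of the idea only, not of how Marks are declared or evaluated by the parser; the element names, the `has_quantity` condition, and the `UNIT_LABELS` table are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical table mapping coded values to readable text.
UNIT_LABELS = {"KGM": "kilogram", "LTR": "litre"}

# Elements already produced from the input message.
item = ET.Element("Item")
ET.SubElement(item, "UnitCode").text = "KGM"
ET.SubElement(item, "Quantity").text = "12"


def has_quantity(element: ET.Element) -> bool:
    """A named condition: is a Quantity element present?"""
    return element.find("Quantity") is not None


# Mark 1: insert an extra element carrying the outcome of the condition.
ET.SubElement(item, "HasQuantity").text = str(has_quantity(item)).lower()

# Mark 2: insert an extra element with the readable equivalent of a code.
code = item.findtext("UnitCode")
ET.SubElement(item, "UnitLabel").text = UNIT_LABELS.get(code, "unknown")

mark_xml = ET.tostring(item, encoding="unicode")
print(mark_xml)
```

Neither `HasQuantity` nor `UnitLabel` exists in the input; both are derived from the structure and values of the message.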

Groups, Segments, Data, Marks, and Conditions are all that you have to know.