89.40.112.148 - CSV to XML: real-life irregular CSV file handling

Table of Contents

A Product Unlike Others
A Simple CSV Example
The Parser DEF file
Improvements:

Grouping elements
Commas and double quotes in CSV
Advanced CSV segmentation

A Product Unlike Others: compact, simple to learn, very simple to deploy, and extremely flexible

ReverseXSL is a software tool digesting meta-data definitions of arbitrary formats, and turning the corresponding message instances to XML. The meta language used in reverseXSL DEF files can describe all kinds of formats, in ASCII, ISO, JIS, or Unicode brands, with plenty of non-printable characters, highly structured like EDI brands, or much unstructured like printed data, and all intermediate character-based formats inclusive of CSV, fixed, and variable format syntaxes, possibly mixed. If you know how to use it for CSV, you can also apply it to IATA Type-B messages, EDIFACT and X12 interchanges, fixed records, tagged values, property files, and all kinds of mixtures with explicit or implicit delimiters including printed forms redirected to files

{mosimage}

ReverseXSL software is NOT workstation package with graphical editors for numerous embedded tools and formats where CSV is one buried in hundred others. ReverseXSL software is neither a library of parsing functions that you'll need to call in a proper sequence in order to advance in the source file and capture data (e.g. like the SAX API). Instead, reverseXSL software is contained in a single java archive, which you invoke from the command line, an Excel macro, or MS-Windows script, or a shell script, or else the java API. The reverseXSL software takes your data as input, uses a message DEFinition file describing the transformation, and produces an XML output.

The reverseXSL software plus a DEF file describing the input syntax is all what is needed!

You can actually put the DEF file in the software archive itself, so you would just need to pass your input data file as argument to a java program contained in a single jar (inclusive of meta-data). Moreover, The reverseXSL Transformer automates the combined use of XSLT and the reverseXSL Parser, with the help of a mapping decision table. The Transformer identifies arbitrary input messages, looks up the decision table for relevant Transformation steps, and executes these. Therefore, the Transformer maps XML to XML, XML to flat / text or EDI, EDI to XML, text to XML, text to flat, EDI to flat, and so forth. It can also act as a pass-through for selected messages, thus generalizing the means to import and export data for target applications.

The magic is the ease of learning the DEF meta language; the DEF meta language contains only five constructs: segments, groups, data elements, marks and conditions. These five constructs organise the use of regular expressions and simple built-in functions to perform four activities in turn and recursively: identify, cut, extract and validate.

A Simple CSV Example

The software distribution jar (reverseXSL.jar) contains an example VAT declaration form in MS-Excel®, the associated CSV export, and sample meta-data showing how to transform it to an imposed XML schema.

We describe below a simpler example which you can download from the tutorials section. Of course CSV files and the like are not exclusive to MS-Excel®, but we illustrate this case here. The Excel worksheet is shown in the enclosed figure.

Fig. 1 {mosimage}

The example is drawn from the project management method STEP. Project members fill up the grid with planned tasks and estimated work time at the beginning of each week, and then report the actual time spent on these tasks, plus unplanned tasks that slept in, by the end of the week. They also fill up their subjective feeling about the overall completion degree of each task in %; 100% marks completed tasks. This data is then fed into an application that consolidates data from all project participants and measures the projects' progress.

A click on the EXPORT button invokes a Macro that executes worksheet.SaveAs Filename:="WeeklyReport.csv", FileFormat:=xlCSV and produces the file illustrated here.

Fig. 2 {mosimage}

There are several features to notice:

The sheet contains heading info and two data matrices: one about planned tasks and the second about UNplanned tasks. Although the structure is similar, distinct processing is required.
Empty and Zero values shall not be mixed. Their meaning is different.
Nothing prevents the user to add or remove rows in matrices. Thus the row number can't be used to identify data.
You can't rely on the systematic presence of double quotes around values; their presence is dynamic and based on the occurrence of a comma in the value (numerical or string).

The EXPORT Macro also calls the reverseXSL Transformer via a batch command file (a Windows script would be better but may require additional configuration). The XML document is directly produced from the CSV input by the Parser step from the reverseXSL software.

Ideally, the macro shall also automate the submission of the document to the target application.

<?xml version="1.0" encoding="UTF-8"?>
<STEPReport xmlns="http://www.reverseXSL.com/FreeParser" messageID="1">
   <References>
      <EmployeeNbr>99.87.65</EmployeeNbr>
      <Name>John Seehacevee</Name>
      <Year>2009</Year>
      <WeekNo>21</WeekNo>
   </References>
   <Planned>
      <Task>
         <Project>CCS</Project>
         <WBS>03.07.01</WBS>
         <Description>IATA envelope filtering</Description>
         <ManDays>
            <Planned>2.0</Planned>
            <Effective>1.5</Effective>
         </ManDays>
         <OverallCompletionPercent>80</OverallCompletionPercent>
      </Task>
      <Task>
         <Project>CCS</Project>
         <WBS>05.02.06</WBS>
         <Description>cargo pool traffic migration</Description>
         <ManDays>
            <Planned>2.0</Planned>
            <Effective>2.5</Effective>
         </ManDays>
         <OverallCompletionPercent>100</OverallCompletionPercent>
      </Task>
      <Task>
         <Project>Praxeme</Project>
         <WBS>01.01.04</WBS>
         <Description>proof-reading of training mat.</Description>
         <ManDays>
            <Planned>0.5</Planned>
            <Effective>0.0</Effective>
         </ManDays>
         <OverallCompletionPercent>30</OverallCompletionPercent>
      </Task>
   </Planned>
   <Unplanned>
      <Task>
         <Project>consulting</Project>
         <WBS>99.268.01</WBS>
         <Description>E.A.I. expertise, platform sel.</Description>
         <ManDays>
            <Effective>0.5</Effective>
         </ManDays>
         <OverallCompletionPercent>50</OverallCompletionPercent>
      </Task>
      <Task>
         <Project>sales</Project>
         <Description>edit project refs for quote</Description>
         <ManDays>
            <Effective>0.5</Effective>
         </ManDays>
         <OverallCompletionPercent>100</OverallCompletionPercent>
      </Task>
   </Unplanned>
</STEPReport>

The Parser DEF file

The DEF file entirely defines the parsing of the CSV input and its transformation to XML. The file simply contains nested segments (SEG), groups (GRP) and data (D) definitions.

You may like to open a separate window on the entire DEF file. Next to a full page of comments, a DEF file is always comparable in size to an XML schema for the target document.

The first significant piece of meta data in the DEF file is the MSG line, which is actually a SEGment matching the entire file:

MSG "^Weekly P.* P.* Rep" STEPReport M 1 1 ACC 1 T F "WeeklyReport root element" CUT-ON-NL

MSG is the keyword. MSG is the topmost SEGment.

"^Weekly P.* P.* Rep" is a regex (regular expression) used to identify this segment, here the entire file, as being a weekly report format; ^ matches the start of the segment/entire-file, Weekly matches 'Weekly ', the first P.* matches 'Project ' ('.*' means anychar'.' repeated '*'), the second matches 'Progress ', then Rep matches the first three letters of 'Report' and then doesn't look any further.

STEPReport defines the XML root element name.

M 1 1 ACC 1 is fixed in case of a MSG. It tells that this segment is Mandatory, formally occurs min=1 and max=1, and the Parser can actually ACCept 1.

T F instructs to immediately Throw a Fatal exception if anything goes wrong.

then follows a "segment description..." to be reported in error messages.

then follows the segment cutting function, here CUT-ON-NL, a built-in function telling to simply CUT the segment/entire-file on New Line boundaries.

The outcome of all this is that the Parser identifies the entire file/segment as being a weekly report and cuts it in individual lines. A Fatal exception is thrown if the match or the cut fails.

Let's jump further to the point the Parser will tentatively match one of the cut lines like:

CCS,03.07.01,IATA envelope filtering,2.0,1.5,80%,

Obviously, such line must be cut further down into smaller fields, hence a SEGment definition must be used to ... segment it! A working DEFinition can be:

||SEG "^(?!,+$)" Task M 1 5 ACC 10 R W "Table 1 rows - Planned Task" CUT-ON-(,)

MSG was a special SEG so this meta-data structure is indeed very similar to the above.
A similar SEG DEF line can be found nearby lines 82 and 100 in the sample WeeklyReport.def file.

|| marks the element depth and controls the nesting hierarchy; here a second-level child of the root STEPReport.

SEG is the keyword for a segment definition.

"^(?!,+$)" is the segment identification regex. Indeed, pieces of a message can be repeated, others are optional. A match with such identification regex tells to the Parser that at this point in the file, we have an instance of this segment. Question is: what means "^(?!,+$)"? Well, ^ matches the start of the segment/line; (?!exp) is a negative lookahead of expression exp; in other words, it is true when it does NOT match it; and exp is actually ,+$ which means , comma, + repeated once or more, till $ end of segment/line. In other words, this segment definition matches any line not made only of repeated commas.

Task is the associated XML element name/tag.

M 1 5 ACC 10 means that this segment is Mandatory, as confirmed by the formal minimum 1 occurence; there is also a formal maximum of 5, but the parser is allowed to ACCept up to 10! Such subtelty allows to continue parsing a file with too many occurences and later report the deviation.

R W : Indeed in case of such deviation or cut-error with this segment, the Parser will Record a Warning.

then follows the "segment description..."

When this segment is identified (i.e. matching the identification regex), segmentation takes place based on CUT-ON-(,) meaning to cut on comma as separators.

The outcome of all this is that the Parser identifies the line as being a Task segment instance, and cuts it down in the following list of comma-separated fields:

CCS
03.07.01
IATA envelope filtering
2.0
1.5
80%
<an empty field next to the last comma separator>

Pieces issued from the above segmentation do not need further cutting, so we can just match them against a sequence of Data elements, as follows:

|||D "(.*)" Project M 1 1 ACC 1 T F "..Col 1..." REPEATED-"[A-Za-z _]" [1..]

|||D "(.*)" WBS M 1 1 ACC 1 T F "...Col 2 ..." NUMERIC [1..]

|||D "(.*)" Description O 0 1 ACC 1 T F "...Col 3..." ASMATCHED [1..]

|||D "(.*)" Planned M 1 1 ACC 1 T F "...Col 4..." NUMERIC [1..]

|||D "(.*)" Effective M 1 1 ACC 1 T F "...Col 5..." NUMERIC [1..]

|||D "(.*)%" OverallCompletionPercent O 0 1 ACC 1 T F "...Col 6..." NUMERIC [1..]

|||D "(.*)" SKIP O 0 1 ACC 5 R W "trailing column - ignored" ASMATCHED

A Data definition is similar to a SEGment definition, where the CUT function is replaced by data value validation functions!

||| all these Data elements are defined at depth 3, hence they are children of the previous Task segment at depth 2.

D is the keyword for a data element definition.

"(.*)" or "(.*)%" are regex'es for data identification and value extraction. Alike segments, data element pieces can be repeated or optional, and this regex helps the Parser in deciding whether we encounter or not an new instance of the respective data element. In addition, (exp) defines a capturing group for whatever is matched by exp, where exp is actually .* which means . any character * repeated zero or more times. In other words, the regex "(.*)" matches any string and extracts the entire string as data value, whereas "(.*)%" requires a % char in the matched string and leaves this % char away from the extracted value.

Project, WBS, Description, etc. are the future XML element tags. SKIP is a reserved tag telling to omit the corresponding piece from the generated XML document.

O 0 1 ACC 5 specfies for instance an Optional element, formally occurs min=0 max=1 but the Parser can ACCept 5.

In case of an element with an invalid value, or not within the range of occurences, one can Throw-or-Record, a Fatal-or-Warning exception.

element "descriptions..." above were truncated for convenience.

NUMERIC validates the extracted data value against the built-in numercial character set, whereas REPEATED-"[A-Za-z _]" requires the value to be a repetition of characters in the explicit set A-Z (from A to Z), plus a-z (from A to Z in lowercase), or (a space) or yet _ (the underscore).

[1..] requires the value to contain at least one character, whereas [3..15] would require 3 to 15 characters for instance.

Execution

The above sequence of SEGment and Data definitions applied to:

CCS,03.07.01,IATA envelope filtering,2.0,1.5,80%,

yields:

      <Task>
          <Project>CCS</Project>
          <WBS>03.07.01</WBS>
          <Description>IATA envelope filtering</Description>
          <Planned>2.0</Planned>
          <Effective>1.5</Effective>
          <OverallCompletionPercent>80</OverallCompletionPercent>
       </Task>

Unlike EDI standards it's important to realize that SEGments do not need containing only Data elements; they often contain subSEGments, thus cutting further down some pieces of the parent segment while other pieces can already be used as data values. SEGments may also contain Groups as will be seen further.

The material for this tutorial can be downloaded. The file set provides a complete working message DEFinition for such a CSV example.

Improvements

Grouping elements

Beside SEGments and Data elements, the Parser handles Groups (GRP). Groups do not consume any piece from the source message or file. They are pure logical beasts, often used to repeat the parsing of a sequence of segments, data elements and other (sub)groups, else used to mark element hierarchies in the generated XML.

For instance, we may like to group the worktime elements Planned and Effective under a same parent tag like, say, ManDays. This is very simply achieved by making the Data definitions for Planned and Effective children of a new Group definition as follows:

||SEG "^(?!,+$)" Task M 1 5 ACC 10 R W "Table 1 rows - Planned Task" CUT-ON-(,)

|||D "(.*)" Project M 1 1 ACC 1 T F "..Col 1..." REPEATED-"[A-Za-z _]" [1..]

|||D "(.*)" WBS M 1 1 ACC 1 T F "...Col 2 ..." NUMERIC [1..]

|||D "(.*)" Description O 0 1 ACC 1 T F "...Col 3..." ASMATCHED [1..]

|||GRP "" ManDays M 1 1 ACC 1 T F "...grouping Col. 4&5..."

||||D "(.*)" Planned M 1 1 ACC 1 T F "...Col 4..." NUMERIC [1..]

||||D "(.*)" Effective M 1 1 ACC 1 T F "...Col 5..." NUMERIC [1..]

|||D "(.*)%" OverallCompletionPercent O 0 1 ACC 1 T F "...Col 6..." NUMERIC [1..]

|||D "(.*)" SKIP O 0 1 ACC 5 R W "trailing column - ignored" ASMATCHED

Like SEGment definitions, Groups have an identification regex; groups can be optional; groups can repeat; but they never cut nor validate anything. The next two Data element DEFinitions were moved to depth 4 as indicated by |||| preceding the keyword D. Therefore they become children of whatever precedes at depth 3: here the group ManDays that is itself a child of the SEGment Task.

Whereas "^$" matches only an empty string or line, "" does match everything (the empty string can be found elsewhere in between two characters of a string). Then, the Group is always considered to occur whatever is left to match in the input file, and consequently produced by the Parser.

When DEFinitions are modified as above, the reverseXSL Parser now transforms a Task row into the following XML fragment:

      <Task>
          <Project>CCS</Project>
          <WBS>03.07.01</WBS>
          <Description>IATA envelope filtering</Description>
          <ManDays>
             <Planned>2.0</Planned>
             <Effective>1.5</Effective>
          </ManDays>
          <OverallCompletionPercent>80</OverallCompletionPercent>
       </Task>

Commas and double quotes in CSV

If a cell value in the Excel sheet contains a comma, the CSV export will show the value between double quotes, alike in:

consulting,99.268.01,"E.A.I. expertise, platform sel.",,0.5,50%,

Note that numeric cells may be displayed like 12,5 (according to regional settings) but still exported as 12.5 (by default in MS-Excel)

If we proceed with segment cutting based on CUT-ON-(,), we will get:

consulting
99.268.01
"E.A.I. expertise
platform sel."
<an empty field>
0.5
50%
<an empty field next to the last comma separator>

There is obviously a wrong cut in the middle of the field containing a comma as part of the value and not as separator. To handle this case, we must use a more sophisticated cut function. A first possibility is offered by the segment cut mode based on capturing groups in regex (alike data value extraction). The following quick guess (!) offers an occasion to deepen your understanding of regular expressions:

SEG ... ... ... CUT '^([^,]*),([^,]*),"([^"]*)",([^,]*),([^,]*),([^,]*),$'

where we shall read:

the very first ^ matches the start of line

(f1),(f2),"(f3)",(f4),(f5),(f6), captures all comma separated fields

$ matches the end of line

and f1 to f6 (but not f3) are like [^,]* which reads: [] characters in the range [ ^ not being , comma ] * repeated zero or more times.

Indeed, the meaning of ^ changes from start-of-matched-string to become the negation of the explicit character set that follows (here a single comma!) when used in between [ ] square brackets. Other special regex characters like * . - + | ? also change meaning when used within or outside [ ]. It seems screwy, but it is actually very easy to get used to it. Moreover, there's only one other case like this with the question mark ? within ( ).

f3 is like [^"]* which reads: [] characters in the range [ ^ not being " double quote ] * repeated zero or more times.

One may not much like the above because:

This segment cutting regex does require the double quotes around field f3, but double quotes actually appear only when the value contains a comma
The notation will not be convenient at all with longer CSV records

The tutorial sample file WeeklyReport.def nearby line 82 illustrates a solution based on the use of a | OR-conjunction in the regular expression

A simple solution consists in preventing the use of comma within text cells, for instance with MS-Excel data validation functions, else changing the separator to something else. But let us now propose something quite sophisticated. Few CSV-format parsers are actually capable to adjust themselves automatically to whatever separator occurs in an instance file.

Advanced CSV segmentation

We present here a generic solution for the cutting of CSV lines containing fields with and without double quotes. It demonstrates how the reverseXSL Parser can adapt itself automatically to syntaxical variants where all other parsers can only deal with fixed delimiters.

This is no longer 'simple'. We present below a sophisticated case with the purpose to let you feel how powerful the reverseXSL Parser can be. If you do not understand, take the time to browse the generic regex tutorial and come back here. Within an hour, the apparent complexity will forever vanish.

The solution is a variant use of the segment CUT mode based on capturing groups that takes advantage of repeated regex matches. The idea is to define how one generic CSV field is captured with a regex, and let the Parser automatically repeat the cut for every matched field. The segment CUT function specification becomes:

||SEG ... ... ... CUT '(?:^|,)(?:"([^"]*)"|([^,]*))'

One shall take note that the regex does not begin with ^, nor ends with $ which would otherwise indicate to match respectively the very start and very end of the matched-string or line, and thus prevent the repeated matching and cuts.

We shall read:

(?:ncg1)(?:ncg2) a sequence of two non-capturing groups/expressions ncg1 and ncg2

ncg1 matches (but does not capture) the start delimiter of the CSV field, which is ^|, reading as the ^ start of the entire line (case of the first CSV field in the line), | or a , comma.

ncg2 matches (but does not capture yet) the value of the CSV field, which is
"([^"]*)"|([^,]*)
reading as:

"([^"]*)" case of a value delimited by " double quotes inside which we ( ) capture the [ ]characters in range [ ^ not being " double quote ] * repeated zero or more times.

| or (in case the above did not work)

([^,]*) case of a value that we directly ( ) capture as the [ ]characters in range [ ^ not being , comma ] * repeated zero or more times.

You may want to try this with the free regular expression tester.