Wednesday, April 30, 2014

Converting a .rpt (or other fixed-width file) to a tsv (or: a crash course in inline perl)


Bottom-Line Up-Front
  1. Execute: head -2 [inputFile].rpt | tail -1 | perl -pe '$_ =~ s/ /\n/g' | perl -pe '$_ = "A".length($_)' | perl -pe '$_ .= "\n"'
  1. Get a string like A11A7A26A101A51A12A12A26A26A26A16
  1. Insert that string into the following command and execute: perl -pe '$_ = ($. == 2) ? "" : do {$_ =~ s/\xef\xbb\xbf//g; $" = "\t"; @arr = unpack("A11A7A26A101A51A12A12A26A26A26A16",$_); "@arr\n"}' [inputFile].rpt > [outputFile].tsv
Overview


You can do this with two perl one-liners.

A .rpt file is the standard export format from Microsoft SQL Server Management Studio. A fixed-width file is one where each column is left padded such that it occupies a fixed number of columns, but there can be spaces within columns or blank columns.
RPT files look like:
columnNameOne columnName2 columnName3
------------- ----------- -----------
row1col1      row1col2    row1col3   
row2col1      row2col2    row2col3   

Occasionally you will see a funky byte-order-mark at the beginning of the rpt file (http://stackoverflow.com/questions/3255993/how-do-i-remove-i-from-the-beginning-of-a-file). These two lines account for that byte order mark

First Command

The great thing is that we can use the second line in an rpt file to figure out column widths. To get column widths, do this:
Do:
head -2 [inputFile].rpt | tail -1 | perl -pe '$_ =~ s/ /\n/g' | perl -pe '$_ = "A".length($_)' | perl -pe '$_ .= "\n"'
This prints out a string indicating the lengths of each of the columns, eg: "A11A7A26A101A51A12A12A26A26A26A16"

Anatomy of the command
  • head -2 [inputfile]
    • Takes the first two lines of the inputfile 
  • tail -1
    • Takes the last one line of what's fed to it
  • The "|" operator chains commands - it means "take the output of the previous command and supply it as the input of the next command". The result of "head -2 [inputFile] | tail -1" is to get the second line of [inputFile]
  • perl -pe '$_ =~ s/ /\n/g'
    • The -pe flag says 'for every line in the input, read the line into the $_ variable, execute the command in quotes, and print out the final contents of $_'
    • "=~" means "execute the regex to the right on the variable to the left
    • "s/ /\n/g" means "replace every space with a newline character. s/.../.../g replaces everything in the first pair of slashes with everything in the second pair of slashes. \n stands for a newline character. so s/ /\n/g replaces spaces with newlines.
    • So the above command transforms the single line "--- --- ---" into three lines, each containing "---"
  • perl -pe '$_ = "A".length($_)'
    • The -pe flag says 'for every line in the file, read the line into the $_ variable, execute the command in quotes, and print out the final contents of $_'
    • length($_) gives the length of the $_ string (inclusive of newline characters)
    • The period operator in perl represents string concatenation. So "A".length($_) means "append the length of $_ to the string A"
    • Note that the final value of $_ does not contain a newline character. So what this command accomplishes is to read in the lengths of every line that is fed into it, append that length to the letter A, and print out the result all as one contiguous string. 
    • So if your input is three lines, each containing "---", your output after this command would be "A3A3A3".
    • Combined with the previous command, if your input is "--- ---- --", then after "perl -pe '$_ =~ s/ /\n/g' | perl -pe '$_ = "A".length($_)'" you will get "A3A4A2".
  • perl -pe '$_ .= "\n"'
    • This just says "take every line in the input and append a newline character". I just do this to make the output of the whole command look nice, because the previous command did not append any terminal newline character.

Second Command


Now that you have your string indicating the widths of all the columns from the first command, execute the following command:
perl -pe '$_ = ($. == 2) ? "" : do {$_ =~ s/\xef\xbb\xbf//g; $" = "\t"; @arr = unpack("A11A7A26A101A51A12A12A26A26A26A16",$_); "@arr\n"}' [inputFile].rpt > [outputFile].tsv

Anatomy of the command
  • perl -pe '$_ = ($. == 2) ? "" : do {s/\xef\xbb\xbf//g; $" = "\t"; @arr = unpack("A11A7A26A101A51A12A12A26A26A26A16",$_); "@arr\n"}' [inputfile].rpt > [outputFile].tsv
    • The -pe flag says 'for every line in the input, read the line into the $_ variable, execute the command in quotes, and print out the final contents of $_'
    • In this case, the 'input' is just the contents of [inputfile]
  • '$_ = ($. == 2) ? "" : do {s/\xef\xbb\xbf//g; $" = "\t"; @arr = unpack("A11A7A26A101A51A12A12A26A26A26A16",$_); "@arr\n"}'
    • $_ = ($. == 2) ? "" : ...
      • The "[condition] ? [1] : [2]" is a ternary operator. It means "if condition is true, return [1] - otherwise, return [2]"
      • The $. variable stores the line of the input file you are on
      • So $_ = ($. == 2) ? "" : [2] means "if you are on the second line of the file, store a blank string to $_. Otherwise, store whatever [2] is to $_.
      • The command above is a way to strip off the second line of the file, which if you recall in an rpt file is just a bunch of dashes.
    • do {...}
      • This just means "execute whatever is in the curly braces. Then, return the value of the last thing you evaluated".
    • $_ =~ s/\xef\xbb\xbf//g
      • \xef\xbb\xbf refers to the icky byte-order-mark at the beginning of the file that we need to drop. http://stackoverflow.com/questions/3255993/how-do-i-remove-i-from-the-beginning-of-a-file
      • "=~" means "execute the regex to the right on the variable to the left"
      • s/.../.../g replaces everything in the first pair of slashes with everything in the second pair of slashes. s/\xef\xbb\xbf//g says "replace the byte-order mark with a blank".
      • So the above command strips away byte-order marks from $_.
    • The semicolon operator is used to separate the lines of a perl command.
    • $" = "\t"
      • $" stores the value of the 'interpolation' string. Basically, if I want to turn an array into a string, the element of the array will get separated with the contents of whatever is in $". I am setting this to "\t" (the tab character) because I want a tsv.
    • @arr = unpack("A11A7A26A101A51A12A12A26A26A26A16",$_)
      • The unpack command is at the heart of my fixed-width to tsv conversion. It basically says "take the contents of $_, and separate the records into an array according to the template "A11A7A26A101A51A12A12A26A26A26A16", then store the result to the @arr variable. More documentation on the unpack command is here: http://perldoc.perl.org/functions/unpack.html
    • "@arr\n"
      • This just says "take the elements of the @arr array and convert them to a string by separating them by whatever delimiter is stored in the $" variable (which in a preceding command we set to a tab character), and also append a newline character at the end"
    • The full consequence of this is to go over each line in the input fixed-width file, skip the line if it's the second line, otherwise convert it to a tab-separated file according to the template in "A11A7A26A101A51A12A12A26A26A26A16".
  • > [outputfile].tsv
    • The ">" operator means "take the output of the previous command and write it to the file specified on the right". In this case, the file is outputFile.tsv
Note that an rpt file probably also has lines at the end that say "blah blah blah rows affected" - you probably want to ditch these rows:
  • To get the number of lines in a file, use "wc -l [name of file]" (that's a dash and a lowercase L)
  • To take only the first n rows of a file and write them to an output file, use: "head -n [inputfile] > [outputfile]"
  • So, if you want to ditch the last three lines of a file, first execute "wc -l [name of input file]", note the number returned (eg: 100), and then execute "head -97 [name of input file] > [name of output file]
References
  • http://stackoverflow.com/questions/6302025/perl-flags-pe-pi-p-w-d-i-t
  • http://www.perl.com/pub/2004/06/18/variables.html
  • http://perldoc.perl.org/functions/unpack.html
  • http://perldoc.perl.org/functions/do.html
More cool perl
This command drops the 2nd-4th and 6th columns from the input tsv:

perl -pe '$_ = do {%exclude = map {$_,1} (2..4,6); @arr = split(/\t/,$_); @arr = map {defined($exclude{$_}) ? () : $arr[$_]} (0..$#arr); $" = "\t"; "@arr"}' input.tsv > output.tsv

The most confusing aspects of the command above that haven't been covered elsewhere in this page are probably:

  1. The creation of the %exclude hash. Hashes are made using the syntax %nameOfHash = (even,sized,array,containing,alternating,key,value,pairs).
  2. The 2..4 syntax: this produces an array with the elements (2,3,4).
  3. The fact that parentheses are used to declare an array; (2,3,4) actually creates an array with those elements.
  4. The map {..} (anarray) syntax. This takes every element of (anarray), stores it to $_ (note: variable scoping is key! This doesn't overwrite the value of $_ in outer blocks!), applies the operation in {...}, and appends the result to a resulting array. In other words, map {$_,1} (2,3,4,6) produces the following array: (2,1,3,1,4,1,6,1). When this is used to create the %exclude hash, it makes a hash where the keys are 2,3,4 and 6 and the values are all 1.
  5. That $arr[$_] returns what is at index $_ of the @arr array (I know, the syntax is awful).
  6. That $#arr returns "size of @arr minus 1"