Cleaning CSV data with AWK

AWK is a brilliant little program for processing delimited (any kind of delimiter you like really) data very quickly and easily. Its great for cleaning data before passing trying to push it into a database.

I have a large data file from a client in CSV format that needs cleaning up before trying to import into Ubercart. Here is a sample of what I was working with:

Stock Code               ,Desc                                    ,Group               ,Item                ,Ext1                                              ,Ext2                                              ,Qty on Hand,RRP 1      ,RRP 2      ,Bus Id.   ,Rec Qty   ,Colour    ,Size                ,Min                 ,Max                 ,Season              ,Label               ,Story               ,COLOUR DESC ,Weight        ,Report1       ,Report2       ,Report3       ,Sort1         ,Sort2         ,Sort3         ,
10A90                    ,SNAP IN HOLSTER                         ,ACCESSORIES         ,                    ,9333532508586                                     ,10A90BLK*                                         , 000000000, 0000005.00, 0000013.50,          , 000000000,BLK                 ,*                   , 0000000, 0000000,CURRENT             ,KAKADU              ,ACCESSORIES         ,BLACK                         , 000000000.0000,            ,            ,            ,CC                       ,                         ,                         ,
10A91                    ,BRIEFCASE SNAP IN HOLSTER               ,ACCESSORIES         ,                    ,9333532508593                                     ,10A91BLK*                                         , 000000005, 0000006.50, 0000015.50,          , 000000000,BLK                 ,*                   , 0000000, 0000000,CURRENT             ,KAKADU              ,ACCESSORIES         ,BLACK                         , 000000000.0000,BAGS        ,            ,            ,CC                       ,                         ,                         ,
9L05                     ,RHINO CONVERTIBLE BAG                   ,ACCESSORIES         ,                    ,9333532479763                                     ,9L05BLK*                                          , 000000020, 0000046.50, 0000149.99,          , 000000000,BLK                 ,*                   , 0000000, 0000000,CURRENT             ,KAKADU              ,LUGGAGE             ,BLACK                         , 000000001.2000,BAGS        ,            ,            ,                         ,                         ,                         ,
9L05                     ,RHINO CONVERTIBLE BAG                   ,ACCESSORIES         ,                    ,9333532479770                                     ,9L05BONE*                                         , 000000017, 0000046.50, 0000149.99,          , 000000000,BONE                ,*                   , 0000000, 0000000,CURRENT             ,KAKADU              ,LUGGAGE             ,BONE                          , 000000001.2000,BAGS        ,            ,            ,                         ,                         ,                         ,

Which was spat out by the clients nasty stock tracking system. The objectives are:

  1. Only output certain columns, as most of them I don't need.
  2. Get rid of the asterisks * on the size and replace it with something else more consistent.
  3. Remove the white space.
  4. Rewrite first line which is the column headers.

The last objective is not a big priority but should help to reduce the file size.

The basic Awk script anatomy

BEGIN   { ... }
pattern { ... }
END     { ... }

Awk basic syntax is

pattern { ... }

where pattern is a regular expression or some boolean expression and this is followed by a block to execute. BEGIN and END are special key words that execute only once and before and after all the lines from the script are executed. All other patterns are executed for each line of the input file.

I'm only going to need BEGIN and a pattern block

BEGIN { # this is executed only once before any rows are processed. }
NR > 1 { #NR is the Row number. I'm going to skip the 1st row. }

Because its a simple CSV file I'm going to set the delimiter to a comma ',' in the BEGIN block.

BEGIN { 
        FS = ","    # input field seperator 
        OFS = ","   # output field seperator
}

These can also be regular expressions and you can even change then on a line by line basis. Pretty cool!

Output only certain columns

This is the easy part with Awk. Each column gets a number and you can reference each with the dollar sign, so $1 = column 1, $2 = column 2 etc. Coz I don't feel certain about the future of the column order I'm going to take it a step further and give them my own variable names so I don't have to change much code of the column order changes:

NR > 1 {

    style = $1

    # tolower() is a Awk function that converts text to lowercase
    name = tolower($2)
    tag1 = tolower($3)
    tag2 = tolower($4)
    tag3 = tolower($18)
    sku = $6

    # convert to numbers and remove leading zeros by adding a zero
    qty = $7 + 0    
    rrp1 = $8 + 0
    rrp2 = $9 + 0
    weight = $20 + 0

    season = $16
    label = $17
    colour = tolower($19)
    size = $13

    # This prints the columns in the new order. The commas tell Awk to use the character set in OFS
    print name, style, sku, qty, rrp1, rrp2, weight, size, colour, tag1, tag2, tag3, label 
}

Removing unwanted characters

The data as some asterisks (*) in there that are going to confuse the import script. Also, I want to replace the * with a nicer 'One Size' value which will make more sense in the shop when people are looking are products. Awk does Regular Expression (RegEx) natively. So a RegEx to find an asterisks would be

/\*/

Awk also has a fund and replace function called gsub which takes RegEx's so I can just stick this little 'if' statement in before the final print statement and we're done:

# Getting rid  of the * for size and sku. ~ means regular expression
if(sku ~ /\*/)
{
    // Replace the * with nothing ""
    gsub(/\*/, "", sku)
    // Change the size from * to "One Size"
    size = "One Size"
}

Removing leading and trailing whitespace

I fond this 'for' loop in the Awk one liners. Its a basic pattern on how to do something to all the columns of each row, one at a time:

# delete BOTH leading and trailing whitespace from each line
for (i=1; i<=NF; i++)
    gsub(/^[ \t]+|[ \t]+$/, "", $i) 

Again we're using the RegEx here (which looks ugly even for simple RegEx!). The main trick here is that we're using the for loop variable 'i' to reference each column '$i'. The NF variable is an Awk variable that hold the number of columns (fields) for the current row.

Custom column headings

My import script will match the column headings to fields for products so I want to ignore the first line of the file and print my own. To ignore the first line I'm using the block level pattern (as discussed above)

NR > 1 { ... }

Which skips the first line. Then in the BEGIN block I'm going to print my first line

# Print nice column headings, remember to update if final print order changes!
print "name, style, sku, qty, rrp1, rrp2, weight, size, colour, tag1, tag2, tag3, label";

which is just me copy&paste'ing the last print line and putting quotes around it (so should be fairly painless in the future if the column order changes).

Using it

Finally to execute it I'm taking Barnett's advice and . Also, I'm actually using gawk, which is the GNU Awk but works pretty much the same with a few extras that I didn't use, so I just put this line at the beginning of my file.

#!/usr/local/bin/gawk -f

and then change the permissions on the file to executable like so

chmod +x myScipt.awk

the at the command prompt (assuming the script is in my path)

myScipt.awk EXCEL_DATA_FILE.csv > clean_output_file.csv

and clean_output_file.csv should have the clean data in it. YAY!

References

Awk - A Tutorial and Introduction - by Bruce Barnett
Good one for getting you up to speed and down with the basics.
Awk one liners
Some neat tricks in here.
10 Awk Tips, Tricks and Pitfalls
Once you know a little AWK this help you get good style. Also there is a section called "Parse CSV" which is worth a read.