AWK is a brilliant little program for processing delimited (any kind of delimiter you like really) data very quickly and easily. Its great for cleaning data before passing trying to push it into a database.
I have a large data file from a client in CSV format that needs cleaning up before trying to import into Ubercart. Here is a sample of what I was working with:
Stock Code ,Desc ,Group ,Item ,Ext1 ,Ext2 ,Qty on Hand,RRP 1 ,RRP 2 ,Bus Id. ,Rec Qty ,Colour ,Size ,Min ,Max ,Season ,Label ,Story ,COLOUR DESC ,Weight ,Report1 ,Report2 ,Report3 ,Sort1 ,Sort2 ,Sort3 , 10A90 ,SNAP IN HOLSTER ,ACCESSORIES , ,9333532508586 ,10A90BLK* , 000000000, 0000005.00, 0000013.50, , 000000000,BLK ,* , 0000000, 0000000,CURRENT ,KAKADU ,ACCESSORIES ,BLACK , 000000000.0000, , , ,CC , , , 10A91 ,BRIEFCASE SNAP IN HOLSTER ,ACCESSORIES , ,9333532508593 ,10A91BLK* , 000000005, 0000006.50, 0000015.50, , 000000000,BLK ,* , 0000000, 0000000,CURRENT ,KAKADU ,ACCESSORIES ,BLACK , 000000000.0000,BAGS , , ,CC , , , 9L05 ,RHINO CONVERTIBLE BAG ,ACCESSORIES , ,9333532479763 ,9L05BLK* , 000000020, 0000046.50, 0000149.99, , 000000000,BLK ,* , 0000000, 0000000,CURRENT ,KAKADU ,LUGGAGE ,BLACK , 000000001.2000,BAGS , , , , , , 9L05 ,RHINO CONVERTIBLE BAG ,ACCESSORIES , ,9333532479770 ,9L05BONE* , 000000017, 0000046.50, 0000149.99, , 000000000,BONE ,* , 0000000, 0000000,CURRENT ,KAKADU ,LUGGAGE ,BONE , 000000001.2000,BAGS , , , , , ,
Which was spat out by the clients nasty stock tracking system. The objectives are:
The last objective is not a big priority but should help to reduce the file size.
BEGIN { ... }
pattern { ... }
END { ... }
Awk basic syntax is
pattern { ... }
where pattern is a regular expression or some boolean expression and this is followed by a block to execute. BEGIN and END are special key words that execute only once and before and after all the lines from the script are executed. All other patterns are executed for each line of the input file.
I'm only going to need BEGIN and a pattern block
BEGIN { # this is executed only once before any rows are processed. }
NR > 1 { #NR is the Row number. I'm going to skip the 1st row. }
Because its a simple CSV file I'm going to set the delimiter to a comma ',' in the BEGIN block.
BEGIN {
FS = "," # input field seperator
OFS = "," # output field seperator
}
These can also be regular expressions and you can even change then on a line by line basis. Pretty cool!
This is the easy part with Awk. Each column gets a number and you can reference each with the dollar sign, so $1 = column 1, $2 = column 2 etc. Coz I don't feel certain about the future of the column order I'm going to take it a step further and give them my own variable names so I don't have to change much code of the column order changes:
NR > 1 {
style = $1
# tolower() is a Awk function that converts text to lowercase
name = tolower($2)
tag1 = tolower($3)
tag2 = tolower($4)
tag3 = tolower($18)
sku = $6
# convert to numbers and remove leading zeros by adding a zero
qty = $7 + 0
rrp1 = $8 + 0
rrp2 = $9 + 0
weight = $20 + 0
season = $16
label = $17
colour = tolower($19)
size = $13
# This prints the columns in the new order. The commas tell Awk to use the character set in OFS
print name, style, sku, qty, rrp1, rrp2, weight, size, colour, tag1, tag2, tag3, label
}
The data as some asterisks (*) in there that are going to confuse the import script. Also, I want to replace the * with a nicer 'One Size' value which will make more sense in the shop when people are looking are products. Awk does Regular Expression (RegEx) natively. So a RegEx to find an asterisks would be
/\*/
Awk also has a fund and replace function called gsub which takes RegEx's so I can just stick this little 'if' statement in before the final print statement and we're done:
# Getting rid of the * for size and sku. ~ means regular expression
if(sku ~ /\*/)
{
// Replace the * with nothing ""
gsub(/\*/, "", sku)
// Change the size from * to "One Size"
size = "One Size"
}
I fond this 'for' loop in the Awk one liners. Its a basic pattern on how to do something to all the columns of each row, one at a time:
# delete BOTH leading and trailing whitespace from each line
for (i=1; i<=NF; i++)
gsub(/^[ \t]+|[ \t]+$/, "", $i)
Again we're using the RegEx here (which looks ugly even for simple RegEx!). The main trick here is that we're using the for loop variable 'i' to reference each column '$i'. The NF variable is an Awk variable that hold the number of columns (fields) for the current row.
My import script will match the column headings to fields for products so I want to ignore the first line of the file and print my own. To ignore the first line I'm using the block level pattern (as discussed above)
NR > 1 { ... }
Which skips the first line. Then in the BEGIN block I'm going to print my first line
# Print nice column headings, remember to update if final print order changes!
print "name, style, sku, qty, rrp1, rrp2, weight, size, colour, tag1, tag2, tag3, label";
which is just me copy&paste'ing the last print line and putting quotes around it (so should be fairly painless in the future if the column order changes).
Finally to execute it I'm taking Barnett's advice and . Also, I'm actually using gawk, which is the GNU Awk but works pretty much the same with a few extras that I didn't use, so I just put this line at the beginning of my file.
#!/usr/local/bin/gawk -f
and then change the permissions on the file to executable like so
chmod +x myScipt.awk
the at the command prompt (assuming the script is in my path)
myScipt.awk EXCEL_DATA_FILE.csv > clean_output_file.csv
and clean_output_file.csv should have the clean data in it. YAY!