Skip to Content

AWK to the rescue!

I mentioned a few posts back that I had built myself a GPS logging tool. Since then, I tweaked it some by putting it into an enclosure (a plastic box), hooking up an external power connector (aka a barrel jack), wired in an OpenLog device and changing the code to log the GPS values to a CSV type file on the SD Card. It is in my truck, powered by a cigarette lighter power chord. Those changes were fun and educational but I hit another wall when I moved on to the "next step" with the project.

The problem is that I am not storing my date/time values in a way that GPSBabel can make use of. I've changed the code on my logger to handle this, but I needed to "fix" the existing log files.

Here's a snippet of one of my log files (lat/long changed to protect the innocent):

year,month,day,hour,minute,seconds,latitude,longitude,altitude(cm),speed(kmph),course
2010,4,1,4,27,39.27,51.83587,-114.02490,103770,2.54,332.87
2010,4,1,4,28,0.0,51.83589,-114.02489,104930,1.83,276.81
2010,4,1,4,28,4.0,51.83591,-114.02491,104980,1.91,324.95
2010,4,1,4,28,5.0,51.83592,-114.02491,104920,1.07,339.55
2010,4,1,4,28,10.0,51.83591,-114.02489,105260,1.35,341.63
2010,4,1,4,28,13.0,51.83590,-114.02489,105350,1.15,339.99
2010,4,1,4,28,17.0,51.83588,-114.02488,105250,1.72,284.23
2010,4,1,4,28,18.0,51.83588,-114.02489,105240,2.15,281.07
2010,4,1,4,28,20.0,51.83588,-114.02492,105230,3.33,264.10

As you can see, I have created a separate field for year, month, day, hour, minute, and seconds.

When working with GPSBabel and CSV files, you can set up a filter to treat CSV fields as a specific type of value that GPSBabel understands (latitude, longitude, altitude, course, speed, etc.). I *could* do this with PHP, Python, or any programming language. But this was going to be a quick and dirty solution to massage some files. So I opted to do this as a Bash script.

The first draft of my script looked like this:

#!/bin/bash

IFS=$','
FILE="20100521.log"

rm -rf ./csv
mkdir csv

for currentfile in *.log
do
    csvfile=csv/`basename $currentfile .log`.csv
    echo "$currentfile -> $csvfile"

    #prepare the file
    echo "date_time,latitude,longitude,altitude,speed,course" > $csvfile

    while read currentline
    do
        YEAR=`echo $currentline | gawk '{print $1}'`
        if [ -z "$YEAR" ]
        then
            continue
        fi

        if [ "$YEAR" == "year" ]
        then
            continue
        fi

        MONTH=`echo $currentline | gawk '{print $2}'`
        DAY=`echo $currentline | gawk '{print $3}'`
        HOUR=`echo $currentline | gawk '{print $4}'`
        MINUTE=`echo $currentline | gawk '{print $5}'`
        SECOND=`echo $currentline | gawk '{print $6}'`
        LAT=`echo $currentline | gawk '{print $7}'`
        LONG=`echo $currentline | gawk '{print $8}'`
        ALT=`echo $currentline | gawk '{print $9}'`
        KPH=`echo $currentline | gawk '{print $10}'`
        COURSE=`echo $currentline | gawk '{print $11}'`
        TEMP=`echo $currentline | gawk '{print $12}'`

        #make sure the elements are zero padded, if needed
        MONTH=`printf "%02d" $MONTH`
        DAY=`printf "%02d" $DAY`
        HOUR=`printf "%02d" $HOUR`
        MINUTE=`printf "%02d" $MINUTE`
        SECOND=`printf "%02.3f" $SECOND`

        #IS_TIME_MS (as defined by/for gpsbabel -
        #   http://www.gpsbabel.org/htmldoc-1.3.6/style_define.html)
        #yyyy-mm-ddThh:mm:ss.SSSzzzzz
        ISOTIME="${YEAR}-${MONTH}-${DAY}T${HOUR}:${MINUTE}:${SECOND}Z"

        #create the revised output line
        echo -n $ISOTIME >> $csvfile
        echo -n "," >> $csvfile
        echo -n $LAT >> $csvfile
        echo -n "," >> $csvfile
        echo -n $LONG >> $csvfile
        echo -n "," >> $csvfile
        echo -n $ALT >> $csvfile
        echo -n "," >> $csvfile
        echo -n $KPH >> $csvfile
        echo -n "," >> $csvfile
        echo $COURSE >> $csvfile
    done < $currentfile
done
exit 

(watch for the &gt; text - this is supposed to be a greater than sign ">" for file redirection)

This worked, but was SOOOOOOOOOOOO slow. It was taking 10 minutes to process a single log file - approx 800 KB in size. I know the many echo statements were likely the root cause, as well as the repeated calls to AWK to extract the various fields.

My first attempt to speed this up was to replace the echo statements with a printf statement. This worked, but only cut the 10 minute process for one file down to 8 minutes.

Next up was to take a look to see if I could make AWK get all my data in one shot. After a little research (GAWK is the the GNU foundations version of AWK), I found that not only could I extract the fields in one shot, I could also eliminate the loop over the records and output my changes in one shot.

Here's the revised script:

#!/bin/bash

rm -rf ./csv
mkdir csv

for currentfile in *.log
do
    echo $currentfile
    csvfile=csv/`basename $currentfile .log`.csv
    echo "date_time,latitude,longitude,altitude,speed,course" > $csvfile
    
    awk 'BEGIN {FS = ","};
        {   #skip the column header row(s)
            if ($1 == "year")
            next
        }
        {   #skip blank rows
            if ($1 == "")
                next
        }
        {   #skip any rows that do not have a valid latitude (assuming an invalid entry if this is the case)
            if ($7 == "")
                next
        }
        {printf "%04d-%02d-%02dT%02d:%02d:%02.3f,%f,%f,%d,%0.2f,%s\n", $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11 }' $currentfile >> $csvfile

done

This script now process all of my 59 log files (about 74 MB) in under 30 seconds. MUCH MUCH better!

Here's a run down of the script:

  • first we delete our working directory (csv), then recreate it.
  • we then create a loop over all the log files
  • we indicate which file we are processing, and create the output file (using the same name as the file we are working on - with the .log extension stripped off. i.e. mylogs.log would become "csv/mylogs.csv")
  • we push out the field headers to our new file with an echo statement
  • then we run an awk "program" against each line of text in our file
    • First we indicate that we are using commas as our field separator
    • Then we check the current record to see if the year value is "year" or blank. If so, we skip to the next record.
    • We run a similar check on the latitude value - if it is blank, we skip to the next record
    • We then make use of the printf capabilities within awk to format our output.
    • finally, we indicate which file to apply this awk program against, and dump the output into our new file.

And here is a snippet of the new output file:

date_time,latitude,longitude,altitude,speed,course
2010-04-01T04:27:39.270,51.835870,-114.024900,103770,2.54,332.87
2010-04-01T04:28:0.000,51.835890,-114.024890,104930,1.83,276.81
2010-04-01T04:28:4.000,51.835910,-114.024910,104980,1.91,324.95
2010-04-01T04:28:5.000,51.835920,-114.024910,104920,1.07,339.55
2010-04-01T04:28:10.000,51.835910,-114.024890,105260,1.35,341.63
2010-04-01T04:28:13.000,51.835900,-114.024890,105350,1.15,339.99
2010-04-01T04:28:17.000,51.835880,-114.024880,105250,1.72,284.23
2010-04-01T04:28:18.000,51.835880,-114.024890,105240,2.15,281.07
2010-04-01T04:28:20.000,51.835880,-114.024920,105230,3.33,264.10

So, by learning a little more about awk, I was able to shorten and simplify the script, as well as dramatically improve the performance. The only "hard" part was learning the syntax that awk needed.

Let this be a lesson to you (and ME!). If you need to process large volumes of text, Linux (and/or Unix) has tools explicitly for this and your task has probably been done many times before. Be patient, search the web, and dig deeper. You (and ME!) will save time and effort doing so. (my original script was STILL running while I worked through setting up the new awk based routine - a few hours later. I was able to build, troubleshoot, and run the awk routine to get my final output before the first routine was even 3/4 of the way done...)

Next on my TODO: list is to get the GPSBabel routine working right. I'll post that process once I've worked it out.