I mentioned a few posts back that I had built myself a GPS logging tool. Since then, I tweaked it some by putting it into an enclosure (a plastic box), hooking up an external power connector (aka a barrel jack), wired in an OpenLog device and changing the code to log the GPS values to a CSV type file on the SD Card. It is in my truck, powered by a cigarette lighter power chord. Those changes were fun and educational but I hit another wall when I moved on to the "next step" with the project.
The problem is that I am not storing my date/time values in a way that GPSBabel can make use of. I've changed the code on my logger to handle this, but I needed to "fix" the existing log files.
Here's a snippet of one of my log files (lat/long changed to protect the innocent):
year,month,day,hour,minute,seconds,latitude,longitude,altitude(cm),speed(kmph),course 2010,4,1,4,27,39.27,51.83587,-114.02490,103770,2.54,332.87 2010,4,1,4,28,0.0,51.83589,-114.02489,104930,1.83,276.81 2010,4,1,4,28,4.0,51.83591,-114.02491,104980,1.91,324.95 2010,4,1,4,28,5.0,51.83592,-114.02491,104920,1.07,339.55 2010,4,1,4,28,10.0,51.83591,-114.02489,105260,1.35,341.63 2010,4,1,4,28,13.0,51.83590,-114.02489,105350,1.15,339.99 2010,4,1,4,28,17.0,51.83588,-114.02488,105250,1.72,284.23 2010,4,1,4,28,18.0,51.83588,-114.02489,105240,2.15,281.07 2010,4,1,4,28,20.0,51.83588,-114.02492,105230,3.33,264.10
As you can see, I have created a separate field for year, month, day, hour, minute, and seconds.
When working with GPSBabel and CSV files, you can set up a filter to treat CSV fields as a specific type of value that GPSBabel understands (latitude, longitude, altitude, course, speed, etc.). I *could* do this with PHP, Python, or any programming language. But this was going to be a quick and dirty solution to massage some files. So I opted to do this as a Bash script.
The first draft of my script looked like this:
#!/bin/bash
IFS=$','
FILE="20100521.log"
rm -rf ./csv
mkdir csv
for currentfile in *.log
do
csvfile=csv/`basename $currentfile .log`.csv
echo "$currentfile -> $csvfile"
#prepare the file
echo "date_time,latitude,longitude,altitude,speed,course" > $csvfile
while read currentline
do
YEAR=`echo $currentline | gawk '{print $1}'`
if [ -z "$YEAR" ]
then
continue
fi
if [ "$YEAR" == "year" ]
then
continue
fi
MONTH=`echo $currentline | gawk '{print $2}'`
DAY=`echo $currentline | gawk '{print $3}'`
HOUR=`echo $currentline | gawk '{print $4}'`
MINUTE=`echo $currentline | gawk '{print $5}'`
SECOND=`echo $currentline | gawk '{print $6}'`
LAT=`echo $currentline | gawk '{print $7}'`
LONG=`echo $currentline | gawk '{print $8}'`
ALT=`echo $currentline | gawk '{print $9}'`
KPH=`echo $currentline | gawk '{print $10}'`
COURSE=`echo $currentline | gawk '{print $11}'`
TEMP=`echo $currentline | gawk '{print $12}'`
#make sure the elements are zero padded, if needed
MONTH=`printf "%02d" $MONTH`
DAY=`printf "%02d" $DAY`
HOUR=`printf "%02d" $HOUR`
MINUTE=`printf "%02d" $MINUTE`
SECOND=`printf "%02.3f" $SECOND`
#IS_TIME_MS (as defined by/for gpsbabel -
# http://www.gpsbabel.org/htmldoc-1.3.6/style_define.html)
#yyyy-mm-ddThh:mm:ss.SSSzzzzz
ISOTIME="${YEAR}-${MONTH}-${DAY}T${HOUR}:${MINUTE}:${SECOND}Z"
#create the revised output line
echo -n $ISOTIME >> $csvfile
echo -n "," >> $csvfile
echo -n $LAT >> $csvfile
echo -n "," >> $csvfile
echo -n $LONG >> $csvfile
echo -n "," >> $csvfile
echo -n $ALT >> $csvfile
echo -n "," >> $csvfile
echo -n $KPH >> $csvfile
echo -n "," >> $csvfile
echo $COURSE >> $csvfile
done < $currentfile
done
exit
(watch for the > text - this is supposed to be a greater than sign ">" for file redirection)
This worked, but was SOOOOOOOOOOOO slow. It was taking 10 minutes to process a single log file - approx 800 KB in size. I know the many echo statements were likely the root cause, as well as the repeated calls to AWK to extract the various fields.
My first attempt to speed this up was to replace the echo statements with a printf statement. This worked, but only cut the 10 minute process for one file down to 8 minutes.
Next up was to take a look to see if I could make AWK get all my data in one shot. After a little research (GAWK is the the GNU foundations version of AWK), I found that not only could I extract the fields in one shot, I could also eliminate the loop over the records and output my changes in one shot.
Here's the revised script:
#!/bin/bash
rm -rf ./csv
mkdir csv
for currentfile in *.log
do
echo $currentfile
csvfile=csv/`basename $currentfile .log`.csv
echo "date_time,latitude,longitude,altitude,speed,course" > $csvfile
awk 'BEGIN {FS = ","};
{ #skip the column header row(s)
if ($1 == "year")
next
}
{ #skip blank rows
if ($1 == "")
next
}
{ #skip any rows that do not have a valid latitude (assuming an invalid entry if this is the case)
if ($7 == "")
next
}
{printf "%04d-%02d-%02dT%02d:%02d:%02.3f,%f,%f,%d,%0.2f,%s\n", $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11 }' $currentfile >> $csvfile
done
This script now process all of my 59 log files (about 74 MB) in under 30 seconds. MUCH MUCH better!
Here's a run down of the script:
- first we delete our working directory (csv), then recreate it.
- we then create a loop over all the log files
- we indicate which file we are processing, and create the output file (using the same name as the file we are working on - with the .log extension stripped off. i.e. mylogs.log would become "csv/mylogs.csv")
- we push out the field headers to our new file with an echo statement
- then we run an awk "program" against each line of text in our file
- First we indicate that we are using commas as our field separator
- Then we check the current record to see if the year value is "year" or blank. If so, we skip to the next record.
- We run a similar check on the latitude value - if it is blank, we skip to the next record
- We then make use of the printf capabilities within awk to format our output.
- finally, we indicate which file to apply this awk program against, and dump the output into our new file.
And here is a snippet of the new output file:
date_time,latitude,longitude,altitude,speed,course 2010-04-01T04:27:39.270,51.835870,-114.024900,103770,2.54,332.87 2010-04-01T04:28:0.000,51.835890,-114.024890,104930,1.83,276.81 2010-04-01T04:28:4.000,51.835910,-114.024910,104980,1.91,324.95 2010-04-01T04:28:5.000,51.835920,-114.024910,104920,1.07,339.55 2010-04-01T04:28:10.000,51.835910,-114.024890,105260,1.35,341.63 2010-04-01T04:28:13.000,51.835900,-114.024890,105350,1.15,339.99 2010-04-01T04:28:17.000,51.835880,-114.024880,105250,1.72,284.23 2010-04-01T04:28:18.000,51.835880,-114.024890,105240,2.15,281.07 2010-04-01T04:28:20.000,51.835880,-114.024920,105230,3.33,264.10
So, by learning a little more about awk, I was able to shorten and simplify the script, as well as dramatically improve the performance. The only "hard" part was learning the syntax that awk needed.
Let this be a lesson to you (and ME!). If you need to process large volumes of text, Linux (and/or Unix) has tools explicitly for this and your task has probably been done many times before. Be patient, search the web, and dig deeper. You (and ME!) will save time and effort doing so. (my original script was STILL running while I worked through setting up the new awk based routine - a few hours later. I was able to build, troubleshoot, and run the awk routine to get my final output before the first routine was even 3/4 of the way done...)
Next on my TODO: list is to get the GPSBabel routine working right. I'll post that process once I've worked it out.