Batch Scanning of Documents

It's tax season once again. Usually, I bundle up my paperwork, classify and sort it, then ship it off to my accountant to work on. In the past I've run into issues where I needed to look up something that had already been passed to the accountant. So this year I decided to do things a little differently. Instead of just blindly handing things over, I want to scan every document and file the resulting image in a directory structure that mirrors the file cabinet.

So, this means I have a crap load of paperwork to scan. I wouldn't even consider this if I had to manually change every single page. Luckily my multi-function printer (Brother DCP-7065DN) has a sheet feeder, which means I can bulk scan all the standard letter-sized documents - probably about 80% of my pile. On the downside, though, there is no bulk scanning tool that comes with the Linux drivers.

BASH to the rescue. By the time you get your scanner set up under Ubuntu (or most Linux distros, if I'm understanding things right), you'll have a command line tool called scanimage. This command allows you to, well, scan a document/image from the command line, which means it can be scripted. So I whipped together a quick bash script for this:

#!/bin/bash

#build the default label
YEAR=$(date +%Y)
LABEL="${YEAR}_unknown"

#prompt the user for a label
echo
read -r -p "Filename label [$LABEL] : " NEWLABEL
echo

#if the user specified a label, make sure we are using that one
[ -n "$NEWLABEL" ] && LABEL=$NEWLABEL

#remove old scans and converted images (-f keeps rm quiet when there is nothing to delete)
echo "deleting *.tiff and *.jpg files..."
rm -f ./*.tiff
rm -f ./*.jpg

#the SCANIMAGE command assumes ONE scanner is connected and has been properly set up
scanimage --format=tiff --batch="./${LABEL}_%d.tiff"

#for each scanned document, use ImageMagick to convert it to a JPG to condense the storage needs.
for f in *.tiff
do
    #skip the unexpanded "*.tiff" pattern when nothing was scanned
    [ -e "$f" ] || continue
    FNAME=$(basename "$f" .tiff)

    echo "$FNAME"
    convert "$f" "$FNAME.jpg"
done

#remove the scanned images
rm -f ./*.tiff

A quick rundown:

  • First we define a default label. This label becomes the name of the scanned file(s), with a sequence number tacked onto the end.
  • Next we prompt the user to enter a custom label if the default is not desired.
  • Then, IF the user entered a new label, we replace the default label with the user entered value.
  • We then remove all .tiff and .jpg files from the current directory. (Careful! Don't delete anything important!) Our script works with .tiff and .jpg files, so we want to make sure we have a clean slate to start with.
  • Then we start scanning. The scanimage command takes many more options (run man scanimage to see them all). We indicate we want .tiff files and are working in batch mode, with the output going to the current directory; the %d in the batch pattern is replaced with the page's sequence number. We didn't specify a scanner - if you only have one scanner, this is not an issue. If you have more than one scanner, you'll probably need to specify the device you intend to use.
  • Once the scanning part is done, we loop over the resulting .tiff files and convert them to .jpg. The .tiff scans are approx 13MB each on my system; converting them to .jpg results in files that are under 500KB each - MUCH less storage, with no loss of readability. We are using ImageMagick to do the conversions.
  • And finally, we delete the .tiff files to save storage space.
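
If you do have more than one scanner attached, the scanimage call needs a device name. Here is a minimal sketch of how that could be bolted on. The scan_args helper and the PLACEHOLDER_DEVICE string are made up for illustration (they are not part of my script); running scanimage -L will list the real device names on your system.

```shell
#!/bin/bash

# scan_args is a hypothetical helper (not part of the script above).
# It prints one scanimage argument per line; --device-name is only
# added when a device string is passed as the second parameter.
scan_args() {
    local label="$1" device="${2:-}"
    local args=(--format=tiff "--batch=./${label}_%d.tiff")
    if [ -n "$device" ]; then
        args=("--device-name=${device}" "${args[@]}")
    fi
    printf '%s\n' "${args[@]}"
}

# Usage (commented out so nothing scans by accident):
#   scanimage -L        # list the devices SANE can see
#   mapfile -t ARGS < <(scan_args "2011_bankstatements" "PLACEHOLDER_DEVICE")
#   scanimage "${ARGS[@]}"
```

Building the arguments in an array keeps the quoting intact, so a device name containing a semicolon or a space should still reach scanimage as a single argument.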

Now with this script ready to go, I plop a batch of papers into the scanner's sheet feeder, then run the script. I give a custom file label to identify the batch - maybe 2011_bankstatements. Once the scan process is done, I move all the jpg files into the appropriate virtual file cabinet directory (maybe .../2011/bank/statements). Then while I get the next batch going, I'll go through the files and rename them to have a more specific and meaningful name - perhaps 20110101_bankstatement_1. I make use of the _1 on the end to indicate page numbers, as some of the documents have 9 or more pages.
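
I do the moving and renaming by hand, but when a whole batch is a single document the step could be scripted. The sketch below assumes exactly that (one batch = one document, pages scanned in order). The file_batch name and its three parameters are hypothetical, and the directory in the usage line is just an example.

```shell
#!/bin/bash

# file_batch LABEL DEST NAME  (hypothetical helper, not part of the scan script)
# Moves ./LABEL_<n>.jpg into DEST as NAME_<page>.jpg, numbering pages
# from 1 in scan order. Assumes the whole batch is one document.
file_batch() {
    local label="$1" dest="$2" name="$3"
    local f n seq page=1 target
    mkdir -p "$dest" || return 1

    # Sort the sequence numbers numerically so that page 10 follows
    # page 9 instead of page 1 (a plain glob sorts them as text).
    for seq in $(for f in "./${label}"_*.jpg; do
                     [ -e "$f" ] || continue
                     n="${f#"./${label}_"}"
                     n="${n%.jpg}"
                     # ignore files whose suffix is not a plain number
                     case "$n" in ''|*[!0-9]*) continue ;; esac
                     echo "$n"
                 done | sort -n); do
        target="${dest}/${name}_${page}.jpg"
        if [ -e "$target" ]; then
            echo "refusing to overwrite $target" >&2
            return 1
        fi
        mv "./${label}_${seq}.jpg" "$target"
        page=$((page + 1))
    done
}

# Usage (example paths only):
#   file_batch 2011_bankstatements ~/cabinet/2011/bank/statements 20110101_bankstatement
```

A batch that mixes several documents still needs the manual pass, since the script has no way of knowing where one document ends and the next begins.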

Oh, I use a split panel view in the Dolphin file manager - one panel shows the scanning directory, the second panel shows the folder I'm currently scanning documents for. I also set up the image preview panel so that when I mouse over an image I can see it. This way I can look at the document and determine the date, page number, and any other pertinent info without having to open the image in a viewer or editor. This process works for me, though I'm sure there are other ways that may work better for you.

While this process is time-consuming, it is not as bad as having to manually scan each document one at a time. Most of the effort goes into organizing the documents to begin with, and removing staples so the pages can be scanned. I've only been working on this for a day, but I'm about 50% done with the job.

Hope this script is helpful to someone else in the tax prep phase.