My sysadmin toolbox

By: Karl Vogel

I've been a system administrator since 1988, working mainly with Solaris and one or two versions of BSD. Here are some of the things I use all the time; they're not flashy, but they save me a ton of keystrokes.

Scripts can read their own source code

On every version of Unix I've ever used, shell and Perl scripts know their own name; $0 holds the pathname of the script being run. This lets any script read its own source code, which is useful when you want to keep online help in sync with the program documentation. Here's a shell script which displays its own usage information if it gets confused:

   1  #!/bin/sh
   2  #
   3  # $Id: doit,v 1.5 2001/08/04 21:44:39 vogelke Exp $
   4  # $Source: /src/scripts/RCS/doit,v $
   5  #
   6  # NAME:
   7  #    doit
   8  #
   9  # SYNOPSIS:
  10  #    doit [-hv] [pattern]
  11  #
  12  # DESCRIPTION:
  13  #    Some blather here about what this script does.
  14  #
  15  # OPTIONS:
  16  #    -h          print this message
  17  #    -v          print the version and exit
  18  #
  19  # EXAMPLE:
  20  #    doit arg    presumably does something with "arg".
  21  #
  22  # AUTHOR:
  23  #    Based on Free Software Foundation configure scripts.
  24  #    Your name <your@email.addr>
  25  #    Your company, Inc.
  26
  27  PATH=/bin:/usr/sbin:/usr/bin:/usr/local/bin
  28  export PATH
  29  umask 022
  30  tag=`basename $0`
  31
  32  # ======================== FUNCTIONS =============================
  33  # die: prints an optional argument to stderr and exits.
  34
  35  die () {
  36      echo "$tag: error: $*" 1>&2
  37      exit 1
  38  }
  39
  40  # usage: prints an optional string plus part of the comment
  41  #   header (if any) to stderr, and exits with code 1.
  42
  43  usage () {
  44      lines=`egrep -n '^# (NAME|AUTHOR)' $0 | sed -e 's/:.*//'`
  45
  46      (
  47          case "$#" in
  48              0)  ;;
  49              *)  echo "usage error: $*"; echo ;;
  50          esac
  51
  52          case "$lines" in
  53              "") ;;
  54              *)  set `echo $lines | sed -e 's/ /,/'`
  55                  sed -n ${1}p $0 | sed -e 's/^#//g' |
  56                      egrep -v AUTHOR:
  57                  ;;
  58          esac
  59      ) 1>&2
  60
  61      exit 1
  62  }
  63
  64  # version: prints the current version to stdout.
  65
  66  version () {
  67      lsedscr='s/RCSfile: //
  68      s/.Date: //
  69      s/,v . .Revision: /  v/
  70      s/\$//g'
  71
  72      lrevno='$RCSfile: doit,v $ $Revision: 1.5 $'
  73      lrevdate='$Date: 2001/08/04 21:44:39 $'
  74      echo "$lrevno $lrevdate" | sed -e "$lsedscr"
  75  }
  76
  77  # ======================== MAIN PROGRAM ==========================
  78
  79  ac_invalid="invalid option; use -h to show usage"
  80  argv=
  81
  82  for ac_option; do
  83      case "$ac_option" in
  84          -h) usage ;;
  85          -v) version; exit 0 ;;
  86          -*) die "$ac_option: $ac_invalid" ;;
  87
  88          *)  case "$argv" in
  89                  "") argv="$ac_option" ;;
  90                  *)  argv="$argv $ac_option" ;;
  91              esac ;;
  92      esac
  93  done
  94
  95  # Real work starts here.
  96
  97  echo "Arguments: $argv"
  98  test -f "$argv" || die "$argv: not a file"
  99  exit 0

To get the version:

  % doit -v
  doit  v1.5  2001/08/04 21:44:39

Here's the usage information:

  % doit -h
   NAME:
      doit

   SYNOPSIS:
      doit [-hv] [pattern]

   DESCRIPTION:
      Some blather here about what this script does.

   OPTIONS:
      -h          print this message
      -v          print the version and exit

   EXAMPLE:
      doit arg    presumably does something with "arg".

The die function (lines 35-38) makes it easy to write tests like the one at line 98, instead of messing around with an if-then block.

The usage function (lines 43-62) reads the comment header, skips to the line holding NAME, and prints everything until the line holding AUTHOR to stderr.

The version function (lines 66-75) prints the program name, version, and last checkin time. Since the version information is kept by RCS, I don't have to do anything but make sure I check my changes in.
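The header-reading trick can be boiled down to a few lines. The sketch below is not taken from doit: show_header is a hypothetical helper that assumes the same "# NAME" and "# AUTHOR" markers, and it takes the file to read as an argument so you can try it on any script.

```shell
# show_header FILE: print FILE's comment header from the "# NAME" line
# up to, but not including, the "# AUTHOR" line, minus the leading "#".
# (Hypothetical helper; assumes the header layout used by doit.)
show_header () {
    sed -n -e '/^# NAME/,/^# AUTHOR/p' "$1" |
        sed -e '/^# AUTHOR/d' -e 's/^#//'
}
```

Inside a script you would call it as show_header "$0", redirecting to stderr if you want the same behavior as usage.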

Locate and xargs

I was looking for a certain CSS stylesheet, and since I update my locate database every night, all I needed was a one-liner. This looks through my notebook files for any HTML documents, and checks each one for a stylesheet entry:

  % locate $HOME/notebook | grep '\.htm$' | xargs grep rel=.stylesheet

I was also looking for an example of how to create a solid border:

  % locate $HOME/notebook | grep '\.css$' | xargs grep solid
  /home/vogelke/notebook/2005/0704/2col.css: border-left: 1px solid gray;
  /home/vogelke/notebook/2005/0704/3col.css: border: 1px solid gray;
  ...

The locate databases on our fileservers are also updated every night, so it's easy to tell if someone's deleted something since yesterday. This comes in handy when there's an irate customer on the phone; if I can find their files using locate, it means the files were on the system as of late yesterday, so the customer can find out if someone in their workgroup did something clever this morning.

Either someone fesses up to deleting the files, or they've been moved; usually, the mover thought they were putting the files in one folder, and they ended up either hitting the parent folder by mistake or creating an entirely new folder somewhere else. A quick find will often fix this without my having to restore anything.

If I can't find the files using locate, they were zapped at least a day or two ago, which generally means a trip to the backup server.
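Since the locate database is a snapshot from last night, anything it lists that is no longer on disk has gone missing since then. Here is a minimal sketch of that check; gone is a hypothetical helper name, and it assumes one pathname per line with no embedded newlines.

```shell
# gone: read pathnames on stdin (one per line) and print the ones
# that no longer exist on disk.
gone () {
    while IFS= read -r path; do
        test -e "$path" || printf '%s\n' "$path"
    done
}
```

You might run it as "locate /some/project/dir | gone", where the directory is whatever the customer says is missing.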

Shell aliases for process control

I spend most of my time in an xterm flipping around between programs, and it's nice to be able to suspend and restart jobs quickly. On my workstation, I always have emacs plus a shell running as root as the first two jobs. Under the Z-shell, I use "j" as an alias for "jobs -dl":

  % j
  [1]    92178 suspended  sudo ksh
  (pwd : ~)
  [2]  - 92188 suspended  emacs
  (pwd : ~)
  [3]  + 96064 suspended  vi 003-shell-alias.mkd
  (pwd : ~/notebook/2006/0618/newsforge-article)

This way, I get the process IDs (in case something gets wedged) plus the working directories for each process.

The Z-shell lets you bring a job to the foreground by typing a percent-sign followed by the job number. I hate typing two characters when one's enough, so these aliases are convenient:

  alias 1='%1'
  alias 2='%2'
  alias 3='%3'
  alias 4='%4'
  alias 5='%5'
  alias 6='%6'
  alias 7='%7'
  alias 8='%8'
  alias 9='%9'
  alias z='suspend'

I can type '1' to become root, quickly check something, and then just type 'z' to become me again.
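If nine nearly identical lines in a startup file bother you, a loop can generate the same digit aliases. This is just a compact equivalent, assuming your shell accepts a bare digit as an alias name (zsh and bash do):

```shell
# Define aliases 1 through 9 so that typing a bare digit resumes that job.
for n in 1 2 3 4 5 6 7 8 9; do
    alias "$n=%$n"
done
unset n
```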

Tools to keep a sitelog

I learned the hard way (several times) that messing with a server and neglecting to write down what you did can easily screw up an entire weekend.

My first few attempts at writing a site-logging program weren't terribly successful. I've been working with the Air Force for nearly 25 years, and when someone from the federal government tells you that you tend to over-design things, your process clearly needs a touchup. A basic text file with time-stamped entries solves 90% of the problem with about 10% of the effort.

The sitelog file format is pretty simple - think of a weblog with all the entries jammed together in ascending time order. Timestamp lines are left-justified; everything else has at least 4 leading spaces or a leading tab. Code listings and program output are delimited by dashed lines ending with a single capital 'S' or 'E', for start and end. The whole idea was to be able to write a Perl parser for this in under an hour.

Here's an example, created when I installed Berkeley DB. I've always used LOG for the filename, mainly because README was already taken.

  BEGINNING OF LOG FOR db-4.4.20 ======================================

  Fri, 23 Jun 2006 19:34:15 -0400   Karl Vogel   (vogelke at myhost)

      To build:
      https://localhost/mis/berkeley-db/ref/build_unix/intro.html

      --------------------------------------------------------------S
      me% cd build_unix

      me% CC=gcc CFLAGS="-O" ../dist/configure --prefix=/usr/local
      installing in /usr/local
      checking build system type... sparc-sun-solaris2.8
      checking host system type... sparc-sun-solaris2.8
      [...]
      config.status: creating db.h
      config.status: creating db_config.h

      me% make
      /bin/sh ./libtool --mode=compile gcc -c -I. -I../dist/..
          -D_REENTRANT -O2 ../dist/../mutex/mut_pthread.c
      [...]
      creating db_verify
      /bin/sh ./libtool --mode=execute true db_verify
      --------------------------------------------------------------E

  Fri, 23 Jun 2006 20:32:34 -0400   Karl Vogel   (vogelke at myhost)

      Install:

      --------------------------------------------------------------S
      root# make install_setup install_include install_lib install_utilities
      Installing DB include files: /usr/local/include ...
      Installing DB library: /usr/local/lib ...

      [...]
      cp -p .libs/db_verify /usr/local/bin/db_verify
      --------------------------------------------------------------E
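Because only timestamp lines start in the first column with a weekday, pulling them out of a LOG file takes one pattern. The function below is a rough sketch rather than one of the sitelog tools; log_entries is a made-up name, and it assumes timestamps begin with an abbreviated English weekday and a comma, as described above.

```shell
# log_entries FILE: list the timestamp lines in a LOG file.
# Assumes timestamps are left-justified and start with an abbreviated
# weekday followed by a comma; everything else is indented.
log_entries () {
    grep -E '^(Mon|Tue|Wed|Thu|Fri|Sat|Sun), ' "$1"
}
```

The "BEGINNING OF LOG" banner is also left-justified, but it doesn't match the weekday pattern, so it is skipped.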

These scripts do most of the heavy lifting:

  • timestamp: writes a line holding the current time in ARPA-standard format, your full name, your userid, and the name of the host you're on. A short version is included below; I have a longer one that can parse most date formats and return a line with that time instead of the current time.

  • remark: starts VIM on the LOG file and puts me at the last line so I can append entries. The 'v' key (often unused) is mapped to call timestamp and append its output after the current line.

  • mfmt: originally stood for "make format"; intended to take the output of make (or any program), break up and indent long lines to make them more readable, and wrap the whole thing in dashed lines ending with 'S' and 'E'.

  • site2html: reads a LOG file and generates a decent-looking webpage, like this one.

  • log2troff: reads a LOG file and generates something that looks good on paper.
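To give a feel for what mfmt does, here is a rough approximation. It is not the real script: mfmt_sketch is a hypothetical name, the 70-column width is an assumption, and it simply folds long lines rather than indenting continuation lines the way the LOG example does.

```shell
# mfmt_sketch: read program output on stdin, fold long lines at 70
# columns, indent by four spaces, and wrap the result in dashed
# lines ending in S (start) and E (end).
mfmt_sketch () {
    dashes='--------------------------------------------------------------'
    printf '    %sS\n' "$dashes"
    fold -s -w 70 | sed -e 's/^/    /'
    printf '    %sE\n' "$dashes"
}
```

Something like "make 2>&1 | mfmt_sketch >> LOG" would then append a delimited block to the log.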

Here's a short version of timestamp:

  #!/bin/sh
  PATH=/usr/local/bin:/bin:/usr/bin; export PATH

  name=`grep "^$USER:" /etc/passwd | cut -f5 -d:`
  host=`hostname | cut -f1 -d.`
  exec date "+%n%n%a, %d %b %Y %T %z   $name   ($USER at $host)%n"
  exit 0

If I know I've logged something, it's also nice to be able to do something like

  me% locate LOG | xargs grep something

The W3M text web-browser

w3m is a text-based web-browser which does a wonderful job of rendering HTML tables correctly. If I want a halfway-decent text-only copy of a webpage that includes tables, I run a script that calls wget to fetch the HTML page and then w3m to render it:

   1  #!/bin/ksh
   2  # Fetch files via wget, w3m.  Usage: www URL
   3  
   4  PATH=/usr/local/bin:$PATH
   5  export PATH
   6  
   7  die () {
   8      echo "$*" >& 2
   9      exit 1
  10  }
  11  
  12  #
  13  # Don't go through a proxy server for local hosts.
  14  #
  15  
  16  case "$1" in
  17      "")      die "usage: $0 url" ;;
  18      *local*) opt="--proxy=off $1" ;;
  19      http*)   opt="$1" ;;
  20      ftp*)    opt="$1" ;;
  21  esac
  22  
  23  #
  24  # Fetch the URL back to a temporary file using wget, then render
  25  # it using w3m: better support for tables.
  26  #
  27  
  28  tfile="wget.$RANDOM.$$"
  29  wget -F -O $tfile $opt
  30  test -f $tfile || die "wget failed"
  31  
  32  #
  33  # Set the output width from the environment.
  34  #
  35  
  36  case "$WCOLS" in
  37      "") cols=70 ;;
  38      *)  cols="$WCOLS" ;;
  39  esac
  40  
  41  w3m="/usr/local/bin/w3m -no-graph -dump -T text/html -cols $cols"
  42  result="w3m.$RANDOM.$$"
  43  $w3m $tfile > $result
  44  
  45  test -f "$result" && $EDITOR $result
  46  rm -f $tfile
  47  exit 0

Line 18 lets me specify URLs on the local subnet which should not go through our proxy server; traffic through that server is assumed to be coming from the outside world, which requires a username and password.

Lines 28 and 42 build temporary filenames in the current directory that are unlikely to collide, by combining the Korn shell's $RANDOM variable with the process ID ($$).

I call wget on line 29, using the -F option to force any input to be treated as HTML. The -O option lets me pick the output filename. You might be able to use w3m to do everything, but here it seems to have some problems with the outgoing proxies (which I don't control), and wget doesn't.

Lines 36-39 let me specify the output width as an environment variable:

  % WCOLS=132 www http://some.host/url

would give me wider output for landscape printing. When w3m returns, you're placed in an editor in case you want to make any final touchups. After you exit the editor, you should have a new file in the current directory named something like w3m.19263.26012.
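Two pieces of the script have shorter modern equivalents. The fragment below is a sketch, not a drop-in patch: it assumes your system has mktemp(1), which is common but not required by POSIX, and it puts the temporary file under $TMPDIR (or /tmp) instead of the current directory.

```shell
# Output width: use $WCOLS if set and non-empty, else 70 columns.
cols=${WCOLS:-70}

# Let mktemp pick an unused name and create the file; remove it on exit.
tfile=$(mktemp "${TMPDIR:-/tmp}/wget.XXXXXX") || exit 1
trap 'rm -f "$tfile"' EXIT
```

The trap means the temporary file is cleaned up even if wget or w3m fails partway through.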

Dealing with different archive formats

I got fed up with remembering how to deal with archives that might be tar files, zip files, compressed, frozen, gzipped, bzipped, or whatever bizarre format comes along next. Three short scripts take care of that for me:

  • tc: shows the contents of an archive file

  • tcv: shows the verbose contents of an archive file

  • tx: extracts the contents of an archive file in the current directory

tc and tcv are hard-linked together:

  #!/bin/sh
  # tc: check a gzipped archive file
  # if invoked as "tcv", print verbose listing.

  case "$#" in
      0)  exit 1 ;;
      *)  file="$1" ;;
  esac

  name=`basename "$0"`
  case "$name" in
      tcv) opt='tvf' ;;
      *)   opt='tf' ;;
  esac

  case "$file" in
      *.zip)    exec unzip -lv "$file" ;;
      *.tgz)    exec gunzip -c "$file" | tar $opt - ;;
      *.bz2)    exec bunzip2 -c "$file" | tar $opt - ;;
      *.tar.gz) exec gunzip -c "$file" | tar $opt - ;;
      *.tar.Z)  exec uncompress -c "$file" | tar $opt - ;;
      *)        exec tar $opt "$file" ;;
  esac

tx is very similar:

  #!/bin/sh
  # tx: extract a gzipped archive file

  case "$#" in
      0)  exit 1 ;;
      *)  file="$1"; pat="$2" ;;
  esac

  # $pat is left unquoted so that it vanishes when no pattern is given.
  case "$file" in
      *.zip)    exec unzip -a "$file" ;;
      *.tgz)    exec gunzip -c "$file" | tar xvf - $pat ;;
      *.bz2)    exec bunzip2 -c "$file" | tar xvf - $pat ;;
      *.tar.gz) exec gunzip -c "$file" | tar xvf - $pat ;;
      *.tar.Z)  exec uncompress -c "$file" | tar xvf - $pat ;;
      *)        exec tar xvf "$file" $pat ;;
  esac
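tc and tx repeat nearly the same case statement, so one possible refinement is to keep the extension-to-decompressor mapping in a single place. The functions below are a sketch under that assumption: decompressor and tlist are hypothetical names, and the .xz entry is an addition that is not in the original scripts.

```shell
# decompressor FILE: print the command that writes FILE's uncompressed
# contents to stdout, chosen by filename extension.
decompressor () {
    case "$1" in
        *.tgz|*.tar.gz) echo "gunzip -c" ;;
        *.bz2)          echo "bunzip2 -c" ;;
        *.tar.Z)        echo "uncompress -c" ;;
        *.xz)           echo "xz -dc" ;;      # assumption: not in the original
        *)              echo "cat" ;;         # plain tar file
    esac
}

# tlist FILE: list an archive's contents, like tc without the zip case.
tlist () {
    # $(...) is left unquoted on purpose so "gunzip -c" splits into words.
    $(decompressor "$1") "$1" | tar tf -
}
```

Supporting the next bizarre format then means adding one line to decompressor instead of editing every script.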

Z-shell and Bash aliases

I've tried bash and tcsh, but the Z-shell is definitely my favorite. Here are some of my aliases:

To view command-line history:

  h          fc -l 1 | less
  history    fc -l 1

To check the tail end of the syslog file:

  syslog     less +G /var/log/syslog

To beep my terminal when a job's done (i.e., /run/long/job && yell):

  yell       echo done | write $LOGNAME

To quickly find all the directories or executables in the current directory:

  d          /bin/ls -ld *(-/)
  x          ls -laF | fgrep "*"

For listing dot-files:

  dot        ls -ldF .[a-zA-Z0-9]*

Largest files shown first or last:

  lsl        ls -ablprtFT | sort -n +4
  lslm       ls -ablprtFT | sort -n +4 -r | less

Smallest files shown first or last:

  lss        ls -ablprtFT | sort -n +4 -r
  lssm       ls -ablprtFT | sort -n +4 | less

Files sorted by name:

  lsn        ls -ablptFT | sort +9
  lsnm       ls -ablptFT | sort +9 | less

Newly-modified files shown first or last:

  lst        ls -ablprtFT
  lstm       ls -ablptFT | less

Converting decimal to hex and back:

  d2h        perl -e 'printf qq|%X\n|, int( shift )'
  h2d        perl -e 'printf qq|%d\n|, hex( shift )'

Most of these aliases (except for the fc stuff) work just fine in bash, with just a few minor tweaks in the formatting. Some examples:

  alias   1='%1'
  alias   2='%2'
  alias   3='%3'
  alias   4='%4'
  alias   5='%5'
  alias   6='%6'
  alias   7='%7'
  alias   8='%8'
  alias   9='%9'

  alias   d2h='perl -e "printf qq|%X\n|, int(shift)"'
  alias   d='(ls -laF | fgrep "/")'
  alias   dot='ls -ldF .[a-zA-Z0-9]*'
  alias   h2d='perl -e "printf qq|%d\n|, hex(shift)"'
  alias   h='history | less'
  alias   j='jobs -l'
  alias   p='less'
  alias   x='ls -laF | fgrep "*"'
  alias   z='suspend'
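If Perl isn't handy, the two base-conversion aliases can also be written as shell functions around printf. This is a sketch of an alternative, not a replacement for the aliases above; it relies on printf accepting a 0x-prefixed argument for %d, which POSIX specifies.

```shell
# d2h N: print decimal N in hexadecimal.
d2h () { printf '%X\n' "$1"; }

# h2d N: print hexadecimal N (given without a 0x prefix) in decimal.
h2d () { printf '%d\n' "0x$1"; }
```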

If you want to pass arguments to an alias, it might be easier to use a function. For example, I use mk to make a new directory with mode 755, regardless of my umask setting. The "$@" will be replaced by whatever arguments you pass, with each argument kept intact even if it contains spaces:

  mk () {
      mkdir "$@"
      chmod 755 "$@"
  }
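If your mkdir supports the POSIX -m option, the two steps can be folded into one. This variant is a sketch of that alternative; note that when combined with -p, only the final directory gets the requested mode.

```shell
# mk DIR...: create each directory with mode 755, whatever the umask.
mk () {
    mkdir -m 755 "$@"
}
```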

You can use seq to generate sequences, like 10 to 20:

  seq () {
      local lower upper output;
      lower=$1 upper=$2;
      while [ $lower -le $upper ];
      do
          output="$output $lower";
          lower=$(( lower + 1 ));
      done;
      echo $output
  }

Sample use:

  % seq 10 20
  10 11 12 13 14 15 16 17 18 19 20

Functions can call other functions. For example, if you want to repeat a given command some number of times:

  repeat () {
      local count="$1" i;
      shift;
      for i in $(seq 1 "$count");
      do
          eval "$@";
      done
  }

Sample use:

  % repeat 10 'date; sleep 1'
  Wed Jul  5 21:29:18 EDT 2006
  Wed Jul  5 21:29:19 EDT 2006
  Wed Jul  5 21:29:20 EDT 2006
  Wed Jul  5 21:29:21 EDT 2006
  Wed Jul  5 21:29:22 EDT 2006
  Wed Jul  5 21:29:23 EDT 2006
  Wed Jul  5 21:29:24 EDT 2006
  Wed Jul  5 21:29:25 EDT 2006
  Wed Jul  5 21:29:26 EDT 2006
  Wed Jul  5 21:29:27 EDT 2006

Using PGP to create a password safe

How many different passwords do you have to remember, and how often do you have to change them? Lots of organizations seem to believe that high change frequency makes a password safe, even if the one you ultimately pick is only three characters long or your name spelled backwards.

PGP or the GNU Privacy Guard can help you safely keep track of dozens of nice, long passwords, even if you have to change them weekly. There are several commercial packages which serve as password safes, but PGP is free, and all you need is a directory with one script to encrypt your password list and one to decrypt it.

The most important thing to remember: DO NOT use the password for your safe for anything else!

I use GNU Privacy Guard for encryption, but any strong crypto will do. You can set up your own private/public key in just a few minutes by following the directions in the GNU Privacy Handbook. Let's say you put your passwords in the file "pw". Follow these steps to create a GPG public/private keypair and encrypt the password file:

Generate a keypair

  % gpg --gen-key
  gpg (GnuPG) 1.4.1; Copyright (C) 2005 Free Software Foundation, Inc.
  This program comes with ABSOLUTELY NO WARRANTY.
  This is free software, and you are welcome to redistribute it
  under certain conditions. See the file COPYING for details.

  Please select what kind of key you want:
     (1) DSA and Elgamal (default)
     (2) DSA (sign only)
     (5) RSA (sign only)
  Your selection?                                     [hit return]

  DSA keypair will have 1024 bits.
  ELG-E keys may be between 1024 and 4096 bits long.
  What keysize do you want? (2048)                    [hit return]

  Requested keysize is 2048 bits
  Please specify how long the key should be valid.
           0 = key does not expire
        <n>  = key expires in n days
        <n>w = key expires in n weeks
        <n>m = key expires in n months
        <n>y = key expires in n years
  Key is valid for? (0)                               [hit return]

  Key does not expire at all
  Is this correct? (y/N) y

  You need a user ID to identify your key; the software constructs the
  user ID from the Real Name, Comment and Email Address in this form:
      "Heinrich Heine (Der Dichter) <heinrichh@duesseldorf.de>"

  Real name: Your Name
  Email address: yourid@your.host.com
  Comment: 
  You selected this USER-ID:
      "Your Name <yourid@your.host.com>"

  Change (N)ame, (C)omment, (E)mail or (O)kay/(Q)uit? o
  You need a Passphrase to protect your secret key.
  [enter your passphrase]

Generate a revocation certificate in case you forget your passphrase or your key's been compromised

  % gpg --output revoke.asc --gen-revoke "Your Name"
  sec  1024D/B3D36900 2006-06-27 Your Name <yourid@your.host.com>

  Create a revocation certificate for this key? (y/N) y
  Please select the reason for the revocation:
    0 = No reason specified
    1 = Key has been compromised
    2 = Key is superseded
    3 = Key is no longer used
    Q = Cancel
  (Probably you want to select 1 here)
  Your decision? 1

  Enter an optional description; end it with an empty line:
  > Revoking my key just in case it gets lost
  > 

  Reason for revocation: Key has been compromised
  Revoking my key just in case it gets lost
  Is this okay? (y/N) y

  You need a passphrase to unlock the secret key for
  user: "Your Name <yourid@your.host.com>"
  1024-bit DSA key, ID B3D36900, created 2006-06-27

  ASCII armored output forced.
  Revocation certificate created.

Your revocation certificate is now in the file revoke.asc. Store it on a medium which you can hide; otherwise someone can use it to render your key unusable.

Export your public key

  % gpg --armor --output public.gpg --export yourid@your.host.com

will store your public key in public.gpg, if you want to put it on your website or mail it.

Encrypt the pw file

  % gpg --armor --output pw.gpg --encrypt --recipient yourid@your.host.com pw

will encrypt the pw file as pw.gpg. To be able to decrypt it yourself later, you must include your own key in the --recipient list.

Test decrypting the pw file

  % gpg --output testpw --decrypt pw.gpg

  You need a passphrase to unlock the secret key for
  user: "Your Name <yourid@your.host.com>"
  2048-bit ELG-E key, ID 19DF3967, created 2006-06-27 (main key ID B3D36900)

  Enter passphrase:

  gpg: encrypted with 2048-bit ELG-E key, ID 19DF3967, created 2006-06-27
        "Your Name <yourid@your.host.com>"

The file testpw should be identical to pw, or something's wrong.
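The comparison is easy to script with cmp. The function below is a small sketch; check_roundtrip is a hypothetical name, and removing the decrypted copy on success is an assumption about what you'd want, since it holds your passwords in the clear.

```shell
# check_roundtrip ORIGINAL COPY: if the two files are byte-for-byte
# identical, remove COPY; otherwise complain and leave both alone.
check_roundtrip () {
    if cmp -s "$1" "$2"; then
        rm -f "$2"
    else
        echo "$2 differs from $1" 1>&2
        return 1
    fi
}
```

For the files above, that would be "check_roundtrip pw testpw".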

I use one script with two hardlinks for reading and updating passwords. When invoked as readp, the script decrypts my password safe; after I finish checking or editing the decrypted file, updatep encrypts it.

  #!/bin/ksh
  # read or encrypt a file.
  # use with GPG v1.4.1 or better.

  PATH=/bin:/usr/bin:/usr/sbin:/usr/local/bin
  export PATH
  name=`basename "$0"`

  case "$1" in
      "") file="pw" ;;
      *)  file="$1" ;;
  esac

  # clear = plaintext file.
  # enc = ascii-armor encrypted file.

  case "$file" in
      *.gpg) enc="$file"
             clear=`echo "$file" | sed -e 's/\.gpg$//'`
             ;;

      *)     clear="$file"
             enc="$file.gpg"
             ;;
  esac

  case "$name" in
      "readp")
          if test -f "$enc"
          then
              gpg --output "$clear" --decrypt "$enc"
          else
              echo "encrypted file $enc not found"
          fi
          ;;

      "updatep")
          if test -f "$clear"
          then
              # keep the previous safe, if there is one.
              test -f "$enc" && mv "$enc" "$enc.old"
              gpg --armor --output "$enc" --encrypt \
                    --recipient yourid@your.host.com "$clear" && rm "$clear"
          else
              echo "cleartext file $clear not found"
          fi
          ;;
  esac

  exit 0

The mutt mail-reader

Mutt is very useful for taking a quick look at a mailbox or correctly sending messages with attachments from the command line; there's more to it than just concatenating a few files together and piping the results to mail.

If you poke around Google for a while, you can find many setups that make mutt quite suitable for general mail-handling. Dave Pearson's site has some great configuration files.

Setting up a full-text index for code and documents

I started trying to index my files for fast lookup back when WAIS was all the rage; I also tried Glimpse and Swish-e, neither of which really did the trick for me.

The QDBM, Estraier, and Hyper Estraier programs are without a doubt the best full-text index and search programs I've ever used. They're faster and less memory-intensive than any version of Swish, and the Hyper Estraier package includes an excellent CGI program which lets you do things like search for similar files.

Keeping a copy of my browser history

Having command-line access to my browser links from any given day has occasionally been helpful. I know Mozilla and Firefox store your history for you, but it's either for a limited time, or you end up with the history logfile from hell. If I have logfiles that are updated on the fly, I'd rather keep them relatively small.

The biggest advantage is being able to search my browser history using the same interface as I use for my regular files (Estraier), as well as standard command-line tools. I keep my working files in dated folders, and I was recently looking for something I did in June on the same day that I looked up some outlining sites:

  % locate browser-history | xargs grep -i outlin
  .../2006/0610/browser-history: 19:07:27 http://webservices.xml.com/pub/a/ws/2002/04/01/outlining.html
  .../2006/0610/browser-history: 19:07:27 http://www.oreillynet.com/pub/a/webservices/2002/04/01/outlining.html
  .../2006/0610/browser-history: 19:08:57 http://radio.weblogs.com/0001015/instantOutliner/daveWiner.opml
  .../2006/0610/browser-history: 19:10:27 http://www.deadlybloodyserious.com/instantOutliner/garthKidd.opml
  .../2006/0610/browser-history: 19:10:43 http://www.decafbad.com/deus_x/radio/instantOutliner/l.m.orchard.opml
  .../2006/0610/browser-history: 19:10:47 http://radio.weblogs.com/0001000/instantOutliner/jakeSavin.opml

Here's a Perl script by Jamie Zawinski which parses the Mozilla history file. According to Jamie, the history format is "just about the stupidest file format I've ever seen", and after trying to write my own parser for it, I agree.

The cron script below is run every night at 23:59 to store my browser history (minus some junk) in my notebook.

   1  #!/bin/sh
   2  # mozhist: save mozilla history for today
   3  
   4  PATH=/bin:/usr/bin:/usr/local/bin:$HOME/bin
   5  export PATH
   6  umask 022
   7  
   8  # your history file.
   9  hfile="$HOME/.mozilla/$USER/nwh6n09i.slt/history.dat"
  10  
  11  # sed script: remove crap like trailing slashes, doubleclick ads, etc.
  12  sedscr='
  13    s/\/$//
  14    /view.atdmt.com/d
  15    /ad.doubleclick.net/d
  16    /tv.yahoo.com/d
  17    /adq.nextag.com\/buyer/d
  18  '
  19  
  20  # split today's date into year, month, and day.
  21  set X `date "+%Y %m %d"`
  22  case "$#" in
  23      4) yr=$2; mo=$3; da=$4 ;;
  24      *) exit 1 ;;
  25  esac
  26  
  27  dest="$HOME/notebook/$yr/${mo}${da}"
  28  test -d "$dest" || exit 2
  29  
  30  exec mozilla-history $hfile |           # get history...
  31     sed -e "$sedscr" |                   # ... strip crap ...
  32     sort -u |                            # ... remove duplicates ...
  33     tailocal |                           # ... change date to ISO ...
  34     grep "$yr-$mo-$da" |                 # ... look for today ...
  35     cut -c12- |                          # ... zap the date ...
  36     cut -f1,3 |                          # ... keep time and URL ...
  37     expand -1 > $dest/browser-history    # ... and store
  38  
  39  exit 0

mozilla-history (line 30) is Jamie's Perl script.

tailocal (line 33) is a program written by Dan Bernstein which reads lines timestamped with the raw Unix date, and writes them with an ISO-formatted date like so:

  % echo 1151637537 howdy | tailocal
  2006-06-29 23:18:57 howdy

If you don't have tailocal, here's a short Perl equivalent:

  #!/usr/bin/perl
  use POSIX qw(strftime);

  while (<>) {
      if (m/(\d+)\s(.*)/) {
          print strftime("%Y-%m-%d %T ", localtime($1)), "$2\n";
      }
  }
  exit (0);

The resulting file has entries for one day which look like this:

  15:55:27 http://mediacast.sun.com/share/bobn/SMF-migrate.pdf
  16:02:36 http://www.sun.com/bigadmin/content/selfheal

Reading whitespace-delimited fields

At least once a day, I need the third or fourth column of words from either an existing file or the output of a program. It's usually something simple like checking the output from ls -lt, weeding a few things out by eye, and then getting just the filenames for use elsewhere.

I use one script with nine hard-links:

  -rwxr-xr-x  9 vogelke  vogelke  546 Oct  1  2003 f1*
  -rwxr-xr-x  9 vogelke  vogelke  546 Oct  1  2003 f2*
  -rwxr-xr-x  9 vogelke  vogelke  546 Oct  1  2003 f3*
  -rwxr-xr-x  9 vogelke  vogelke  546 Oct  1  2003 f4*
  -rwxr-xr-x  9 vogelke  vogelke  546 Oct  1  2003 f5*
  -rwxr-xr-x  9 vogelke  vogelke  546 Oct  1  2003 f6*
  -rwxr-xr-x  9 vogelke  vogelke  546 Oct  1  2003 f7*
  -rwxr-xr-x  9 vogelke  vogelke  546 Oct  1  2003 f8*
  -rwxr-xr-x  9 vogelke  vogelke  546 Oct  1  2003 f9*

The script is just a wrapper for awk:

  #!/bin/sh
  # print space-delimited fields.

  PATH=/bin:/usr/bin; export PATH
  tag=`basename $0`

  case "$tag" in
      f1) exec awk '{print $1}' ;;
      f2) exec awk '{print $2}' ;;
      f3) exec awk '{print $3}' ;;
      f4) exec awk '{print $4}' ;;
      f5) exec awk '{print $5}' ;;
      f6) exec awk '{print $6}' ;;
      f7) exec awk '{print $7}' ;;
      f8) exec awk '{print $8}' ;;
      f9) exec awk '{print $9}' ;;
      *)  ;;
  esac

f3 gets the third field, etc.
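Since the field number is already embedded in the name, the nine cases could also be collapsed by stripping the leading "f" and handing the digit to awk. This is a sketch of that idea as a function; print_field is a hypothetical name, and in the real script you would pass it the basename of $0.

```shell
# print_field TAG: TAG is a name like f3; print that field of each
# input line, or complain if TAG isn't f1 through f9.
print_field () {
    n=${1#f}
    case "$n" in
        [1-9]) awk -v n="$n" '{ print $n }' ;;
        *)     echo "usage: f1 .. f9 < input" 1>&2; return 1 ;;
    esac
}
```

For example, "ls -lt | print_field f9" prints the filename column on systems where ls -l puts the name in the ninth field.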

Using ifile for SPAM control

If you're still plagued by spam, or you need a generic method of categorizing text files, have a look at ifile. It's one of many "Bayesian mail filters", but unlike bogofilter and spamassassin, it can do n-way filtering rather than simply spam vs. non-spam.

Author bio

Karl is a Solaris/BSD system administrator at Wright-Patterson Air Force Base, Ohio.

He graduated from Cornell University with a BS in Mechanical and Aerospace Engineering, and joined the Air Force in 1981. After spending a few years on DEC and IBM mainframes, he became a contractor and started using Berkeley Unix on a Pyramid system.

He likes FreeBSD, trashy supermarket tabloids, Perl, cats, teen-angst TV shows, and movies.