2010-02-18

No more Voodoo SysAdmin allowed for Tech Support organizations

In my past life as a systems administrator and tech support person, one of the terms I remember hearing about was "Voodoo sysadmin" -- which I basically think of as doing something to a system when you're having a problem in a vain and superstitious hope that it'll fix the problem, without really understanding what's going on.  (An old acquaintance from that life, Mark Verber, has an article about Voodoo Sysadmin (among other things) if you want to learn more.)

This frequently takes the form of "it's acting up" -- "reboot it, and try again".  Sometimes, this solves the problem, at least temporarily.  A lot of other times, it doesn't solve the problem, but it doesn't do any real harm.  Put those two together, and you get a classic recipe for superstition: accidental reinforcement (more on Wikipedia, or in Karen Pryor's book, Don't Shoot the Dog).  That is to say, because it works some of the time, you're bound to try it almost all the time, just in case it works.  Often, this is mostly harmless, if perhaps wasteful of time.  Sometimes, it does real harm.  For example, rebooting a UNIX system that's having problems because some critical file got corrupted may leave you with no way to log back in and fix it without getting out installation media.  Or worse, it may remove the temporary copy of the still-working version of the file that was lying around until the reboot process cleaned it up.

I fear, though, that I'm getting overly geeky for a point that's more universal, so let's instead go to a real-life example from my very recent past.

I was having a problem with my iPhone syncing its photos to my Mac.  Image Capture was saying "No camera or scanner connected."  iPhoto simply didn't have a "Devices" section under which it would show up (until and unless I connected another digital camera, which would show up fine).  Aperture 3 wasn't showing it either, though it had had partial success earlier (and Image Capture used to work).

After a long conversation with Apple (and paying the roughly $75 for AppleCare protection on my iPhone, since I was beyond the original 90 days of phone support -- I'd started the call as a Mac call, which was still under warranty), a Senior iPhone Advisor named Kurt told me that there was a known issue: sometimes certain images that were downloaded from online, or from e-mail, or maybe even saved from apps on the phone other than the Camera app, would somehow get corrupted (or at least some metadata about them would), and this corruption would cause the various photo-related apps on Mac OS X to simply fail to see the phone as a device with images on it.  This was exactly what I was experiencing.  They're working on a long-term fix for this, and in the meantime the workaround given to me was: e-mail myself any images not created by the Camera application (screenshots, downloaded images, etc.), and then delete those images from the phone.  Note that this does not involve doing a hard reset on the phone, or rebooting my Mac, or, and here's the kicker, resetting all settings on the phone, which is exactly what a previous associate had asked me to do (and all the other things were asked, too).

Resetting all settings did not help.  It did, however, cause me some general annoyance at having to restore things like my ring tone of choice, and enabling caps lock, and that sort of thing.  And to re-enter wifi passwords (I'm glad I knew off-hand the main one of those that I care about; I'll have to go and find the others again, as they become relevant).  But that's not what really got me.  It also deleted all of my alarms from the Clock application.  Uhm, I rely on those to remind me to take some medications each day.  And others to make it to regular appointments.  I noticed this a couple hours after I was supposed to have taken a dose of medication that I take daily.  Now, I take it daily, and being a couple hours late isn't a big deal for the medication in question.  But what if it had been something I had to take on a very regular schedule?  Or what if I hadn't noticed that the alarm hadn't gone off, and I just missed it today completely?  Or missed the appointment that I have later in the day today?  Who's to say what would have happened?  I'll say, though, that it could have been bad news, if not for me then for someone else who didn't realize they'd lost their alarms.

So, step one was to complain to Apple about this, and ask them to make sure that their associates are all trained to make sure they let a user know that their alarms will go away.  Had I known that, I could have made a list of them first, before resetting all settings.  I've called Apple, and spoken to another Senior Advisor, and he seemed to take it all fairly seriously, so I have hopes that good things will happen there. Better yet would be to have the UI for resetting settings actually tell you this -- or maybe even be able to turn on and off which settings get reset.  This, too, has been suggested to Apple.

What would really make me happy, though, is for them not to have asked me to do something that was totally unnecessary! Resetting my settings didn't help, and I'm sure that while the guy I was speaking to suspected that it might, the reality of the situation was that he didn't understand why my phone wasn't being recognized, and thus didn't know if it would fix it or not.  Maybe he knew he didn't know, maybe he thought he knew and was wrong.  Either way, the fact is the same: he didn't know, and so he went with a superstition-based or "voodoo sysadmin" approach to fixing the problem. He even claimed he resets all settings on his own iPhone every week, as a matter of course.  If that's not superstition, I don't know what is.  If I did that, it would drive me nuts...  I have a lot of settings changed.  And a lot of alarms -- the important ones of which I think are all back, though I know I had others in there that I'll have to re-create from scratch (ones that were off because they weren't set to repeat, but which I would occasionally turn on for certain things -- now, I'll have to actually re-create them when those situations arise again, instead of just turning them on).  I can live with that.

But here's the thing:

If there were a concerted effort within the Tech Support industry to try to eliminate all superstitious practices from their support calls, this kind of thing simply wouldn't happen.  And I believe it shouldn't have had to happen.  Because, as this blog is all about, I believe there is a better way.

"Eliminate all superstitious practices?  Shyeah, right."  I know, I know, it'll never happen.  See Verber's article (linked above): superstition is part of Human Nature.  True.  I have no argument there.  Still, it's the effort to eliminate them that would bring about the change that I want.  As I see it, there would be several main components to such an effort:

  1. Educate tech support personnel on what superstition is, how to recognize it, and how to avoid being trapped into it.
  2. Teach these same folks alternative ways of doing things so that they can actually find actions to suggest that will be known to be helpful.  Now, this will be impossible in some cases, because they just won't have a way of figuring out what's wrong, which brings me to item #3:
  3. Have software (and hardware) developers provide better instrumentation in their products, and analysis tools which can either be used by support staff, or given to end users to run on behalf of the support staff, with results being given back to them in the latter case.  Also involved in this is more and better error reporting, and/or more use of any extant error reporting by support staff.  Many of these tools could be built into applications.  Others would be separate tools.  Either way, more troubleshooting information would be helpful.

If Image Capture had given me some sort of message saying "iPhone detected, but the image database looks to be corrupt", or if there'd been a menu option for "Detailed device detection", or a help item for "Why doesn't my device show up?" with instructions on how to gather some debug information, or something, I would have kept my alarms and had a lower degree of frustration.  It might even have saved both Apple and myself some money: I wouldn't have had to buy AppleCare (yes, I get it, that *makes* them money), and they wouldn't have had to take the call (this is where they get it back -- I was on the phone with them for a while, bouncing between different people, and calling back the next day to let them know of the problem I had with the service I'd gotten -- that may or may not add up to the cost of the plan, but I bet it came close, at least).
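
To make that a bit more concrete, here's a minimal sketch in Python of the kind of staged diagnostic I have in mind.  To be clear, this is purely illustrative: the `device` dictionary, the check functions, and the stage names are all hypothetical things I made up for the example, and none of it reflects how Image Capture or the iPhone actually work.  The idea is just that each stage either passes or says specifically what went wrong, so the user (or the support person) sees where things broke instead of a generic "not connected" message.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# A check returns None on success, or a human-readable description of the problem.
Check = Callable[[Dict], Optional[str]]


@dataclass
class StageResult:
    stage: str
    ok: bool
    detail: str


def check_connected(device: Dict) -> Optional[str]:
    # Hypothetical stage 1: is anything plugged in at all?
    return None if device.get("connected") else "no device detected on USB"


def check_image_index(device: Dict) -> Optional[str]:
    # Hypothetical stage 2: can every image's metadata be read?
    bad = device.get("unreadable_images", [])
    if bad:
        return ("device detected, but these images have unreadable metadata: "
                + ", ".join(bad))
    return None


STAGES: List[Tuple[str, Check]] = [
    ("connection", check_connected),
    ("image index", check_image_index),
]


def run_diagnostics(device: Dict,
                    stages: List[Tuple[str, Check]] = STAGES) -> List[StageResult]:
    """Run each stage in order and stop at the first one that fails."""
    results = []
    for name, check in stages:
        problem = check(device)
        results.append(StageResult(name, problem is None, problem or "ok"))
        if problem is not None:
            break
    return results


def summarize(results: List[StageResult]) -> str:
    """Turn the stage results into one message a user could act on."""
    for r in results:
        if not r.ok:
            return f"Problem at stage '{r.stage}': {r.detail}"
    return "All checks passed."


if __name__ == "__main__":
    # A made-up phone that is plugged in but has one bad image on it.
    phone = {"connected": True, "unreadable_images": ["screenshot.png"]}
    print(summarize(run_diagnostics(phone)))
```

With a message like the one this produces for the made-up phone above, the sensible next step (get the named image off the phone) is obvious, and nobody has to guess at resetting anything.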

And so...

In summary:


Software developers:
Instrument your code, and provide tools for analyzing problems as they're going on.  (Example of a fairly decent (if under-technical for the hard cases) version of this: the network connection assistant in Apple's network preferences on Mac OS X.  It goes through each stage of trying to get online, and tries different things, giving you things to try along any stage that's showing difficulty.)
Hardware developers:
Uhh, I dunno.  I'm not a hardware guy.  But something like the above.  And make sure you work with the software folks to build the tools to make use of the instrumentation you're providing.
Tech support managers:
Train your people on the perils of superstitious support behaviors, and reinforce them when they go through the admittedly often-more-difficult process of actually trying to figure out what's wrong.  (Oh yeah, and this probably means training them how to do that, too, and rewarding them for doing it, and costing you lots of money in more advanced employees.  I bet you it's worth it in the long run, though -- actually fixing the problem without negative side effects makes for happier customers, less likely to call back again about the same problem, and more likely to exhibit loyalty to your brand.  But that's just speculation on my part.)


Thank you for reading.  I hope this is somehow helpful to someone -- even if only for having listed the workaround to the iPhone connectivity problem, so the next person hitting it can fix it themselves, and save themselves a bit of heartache.

Wishing for a better world,

  David

P.S.  Oh yeah, and the medical industry could probably do a lot of this, as well.  But that's another rant, for another day.  See Karen Pryor's book for a brief discussion of this point, when she's introducing the idea of superstitious behaviors.
