What to do when you kill your database with your own hand

Couple of months ago, we had an interesting day. It was a bright
summer morning. We were all in the office doing our regular work.
Developers writing codes, QA testing the site. We were
all happy and smiling as there was no major bug. Some were
drinking coffee and having nice chit chat. Our honorable Sys Admin
came to office as usual late after noon. We all welcomed him
remembering his day and night inhuman effort to keep our systems up
and running fighting against germs and bacteria. As usual he
started his day logging into servers one by one checking their
health. He connected to our maintenance server via Remote Desktop.
Did some routine check up and found that there was no space on hard
drive. So, he decided to delete the database on the maintenance
server as it was quite old. He started SQL Server Management
Studio, entered sa and password, selected the database, pressed
delete. As his habit, on the confirmation dialog he pressed enter
without looking into it. Things went cool, database got deleted.
But strangely the maintenance server did not free up any hard drive
space.

In the meantime, there were sharp screams from other rooms. We
all ran to see whether somebody fell down or had an electric shock
or not. Seemed like every one was intact. But with horror, we
looked into all the screens on our desktops and saw this:

This is our error handler page where users are redirected when
there’s any unknown catastrophic failure. This page appears when
something goes wrong, really wrong. We do this by having the
block in web.config:


< customErrors mode =”RemoteOnly” defaultRedirect =”GenericErrorPage.html” > < error statusCode =”403″ redirect =”403.html” /> < error statusCode =”404″ redirect =”404.html” /> customErrors >

If somebody fell down or had electric shock, we would not mind
at all but this is generally beginning of a nightmare for
us. Then we saw our Sys Admin was shaking and his face turned
white out of blood. He somehow stood up trembling with fear and
came to me and said, “My blood pressure is going sky high. I think
I have deleted production database”.

I told him, “no worries, I have done this before”, of course on
the outside. Let’s not say what I felt inside. So, we first looked
into Recycle Bin. Nope, no trace. It was a 30 GB file. There’s
no way Windows is going to store it on Recycle Bin. Then we
searched for several undelete utilities both commercial ones and
free ones. All failed on regular attempt. Some showed an advanced
recovery can restore the file but it will take 2 hours of scanning.
This will be disastrous if we go down for 2 hours. But we had no
choice. Last backup we took was the previous day. If we restore it,
we are going to lose thousands of users and their page setup. We
will also lose hundreds of signups. So, we went for the 2 hours
scan. After 2 hours, the scan reported it found the MDF file. But
when we said recover it, it recovered a 0 byte file. We tried
another product, same result. In the meantime 4 hours past by. So,
we had to take a decision. We has no other choice but to restore
the previous day’s backup. So, we did, and the site went live.

Now it was time to find out all those poor souls who lost their
data. We maintain a log file where we record all important
activities like adding new page, registering, first visit etc. It
was a tab delimited file like this:

01.01.2007 08:39:35 e5ca904c-0348-42cf-9d1b-6fb932ec930d Create
Anonymous user if necessary 0,25
01.01.2007 08:39:36 e5ca904c-0348-42cf-9d1b-6fb932ec930d Get Page:
__RSSFEED__ 0,109375
01.01.2007 08:39:36 e5ca904c-0348-42cf-9d1b-6fb932ec930d Load
Modules in page #1413882 0,1875
01.01.2007 08:39:36 e5ca904c-0348-42cf-9d1b-6fb932ec930d
GetPageflake 0,546875

It logs date and time of each action and also the duration of
the process. This log helps us identify slow running operations and
we can see almost in real time what’s going on with the site.

After importing this log into database, we ran a query to find
out user name of all users who signed up since database delete.
Here, we discovered something interesting. We were expecting the
date and time will be in server’s own time. But Enterprise Library
logs are always in GMT. So, we had to figure out what is the time
difference from regional settings of the servers. Then we
subsctracted it to match GMT. Then we calculated when the database
was deleted and started finding the users after that. We got a list
of email addresses whose signup was lost, almost around 400. We
were lucky it happened on a weekend, so there was not much signup
that day. Then we sent that email list to our marketing team and
they sent apology letters to those users.

So, what we did wrong which you should make sure you never
do:

  • Sys Admin became too comfortable with the servers. There was
    lack of seriousness while working on remote desktop. It became
    routine monotonous absent minded work to him. This is a real
    problem with sys admins. On first month, you will see him very
    serious about his role. Everytime he logs into remote desktop on
    production or maintenance servers, there’s a considerable amount of
    curves on his forehead. But day by day, it reduces and he starts
    working on production server as if he is working on his own
    laptop. At some point, someone needs to make him realize what is
    the gravity of his actions. He should wash his hands before sitting
    in front of remote desktop (or perform wudhu if he is muslim) and
    then say his prayer: “O Lord! I am going to work on remote desktop.
    Grant me tranquility and absolute concentration and protect me from
    the devil who whispers in my soul foul words and lures me to cause
    great harm to production servers”
  • All databases had the same “sa” password. If we had different
    password, at least while typing the password, sys admin could
    realize where he is connecting to. Although he did connect to
    remote desktop on maintenance server, but from SQL Server
    Management Studio, he connected to primary database server as he
    did last time. SQL Server Management Studio remembered the last
    machine name and user name. So, all he did was enter password and
    hit enter and delete the database. Now we have put the server name
    inside the password. So, while typing the password, we know
    conciously which server we are going to connect.
  • Don’t ignore confirmation dialogs on remote desktops as you do
    on your local machine. Nowadays, we consider ourselves super expert
    on everything and never read the confirmation dialog. I myself
    don’t remember when was the last time I read any confirmation
    dialog seriously. Definitely this attitude must change while
    working on servers. When Sys Admin tried to delete the database,
    there was a confirmation that there are active connections on the
    database. SQL Server tried its best to inform him that this is a
    database being used and don’t delete it, please. But as he does
    hundred times per day on his laptop, clicked OK without reading the
    confirmation dialog.
  • Don’t put same administrator password on all servers. This
    makes life easier while copying files from one server to another,
    but don’t do it. You will accidentally delete file on another
    server just like we do.
  • DO NOT use Administrator user account to do your day to day
    work. We started using a Power User account for our day to day
    operation which has limited access on couple of folders only. Using
    Administrator account on remote desktop means you are opening doors
    to all possible accidents to happen. If you use a restricted
    account, there’s no possibility of such accidents.
  • Always have someone beside you when you work on production
    server and do something important like cleaning up free space or
    running scripts, restoring database etc. Make sure the other
    person is not sleeping on his chair beside you.

7 thoughts on “What to do when you kill your database with your own hand”

  1. Omar, I would disagree with you on the last point on your recommendations.

    For starters, your sysadmin didn’t do a professional job and the business cannot spend time, money & resource to provide a babysitter to ensure the sysadmin doesn’t screw things up.

    One of the fundamental issue here is Disk Space management. You shouldn’t have to delete files to free up disk space. If a business is growing and there is need for proper infrastructure to be in place (ie. Server HW/SW Spec) then it should be addressed first. Also the database should be configured for better disk space usage (ie. auto shrink, log file truncation etc)

    Also instead of daily mid-night backups, hourly incremental backup should be considered.

    This would have saved time & money for your organisation.

  2. Great article, I know that feeling very well. Fortunately I don’t have to administer my DBs myself anymore :o)

    Cheers

    Andreas

    PS: Happy new Year!

  3. Couldn’t agree more with the last commenter’s final point. Why don’t you have hourly back ups of something so critical to your business?

  4. Dear Omar,

    hmmm! So you guys are human. you also make mistakes. I do agree with you that there are areas in our daily work where we forget to take proper precautions before touching.

    Areas where one should perform wudhu and take off shoes before attempting to handle.

    Thank you for sharing these thought with us human.

    Regards

    Ghazi Wadi

    PMP, ASQ SSGB

Leave a Reply