100% CPU, 100% IO, a near death experience for SQL Server 2005 and us

For last two weeks we were having pretty hard time at
Pageflakes. The
database server was having 100% CPU, and 100% IO usage most
of time. It was running hot, almost about to go to a coma.
It’s a 64 bit Dual Core Dual Xeon Dell server with 2 GB RAM running
around 30 GB database on 4 SCSI drives. So, it’s more or less the
best hardware money can buy (except for RAM ofcourse). But
still the performance counter looked like this:


Pretty horrible situation isn’t it? Users are having connection
timeout on their browser. Event Log is full of “SQLConnection:
Timeout”. User’s can’t see their page. Those who can see their
pages, have serous poor performance and slow response from the
server. You can imagine the rest. Email flood, phone calls,
management screaming on your ears etc etc.

After a lot of diagnostics, we came to conclusion that SQL
Server 2005 is the culprit. So, we ran SQL Profiler on a different
server (don’t ever run SQL Profiler on the same server where you
database is running). We saw there’s an SP which is taking
thousands of read and cpu cycles.


So, we were pretty much sure that SP was the culprit. So, we
took one of the long executing ones and ran it through Execution
Plan to see what goes wrong.

So, we first did this in order to see IO usage during the SP
execution:


set statistics IO on

GO

Here’s the output after we run the SP:

Table ‘RSSItem’. Scan count 1, logical reads 40, physical reads
0, read-ahead reads 0, lob logical reads 0, lob physical reads 0,
lob read-ahead reads 0.

Nothing special. Runs pretty quick, very low read count. Looks
like no read count at all, completely cached and returned from
memory. So, SQL Server 2005 is doing pretty good caching on
it. After this, we verified whether the index was being
used properly or not:


‘Clustered Index Seek’ the best possible thing. If it was
Clustered Index Scan, then we would have thought there’s something
wrong. So, we were making best use of Index also. We were pulling
our hairs out.

So, I decided to look at the SP in order to see if I can find
out something. I stared at this for half an hour:


ALTER PROCEDURE 
[dbo].[prcRSSItemGetByChannelIDPageSize]

    @ChannelID 
int
,

    @StartIndex 
int
,

    @PageSize 
int

AS


WITH 
SelectedRSSItems(
[ID],[ChannelID],[Hash],...)

AS

(

    
SELECT


        [ID],

        [ChannelID],

        [Hash],

        [Title],

        [Guid],

        [Description],

        [EncodedContent],

        [Link],

        [PublishDate],

        [XML],

        [SavedCount],

        ROW_NUMBER() 
OVER

(
ORDER
BY 
[PublishDate]

DESC
)

AS

[RowNumber]

    
FROM

        dbo.[RSSItem]

    
WHERE

        [ChannelID] = @ChannelID 

)


SELECT 

        *

FROM

        SelectedRSSItems

WHERE

        RowNumber 
BETWEEN

(@StartIndex
+ 1) 
AND

(@StartIndex
+ @PageSize)

ORDER BY 
PublishDate

desc

Those who are wondering what’s with the “WITH” block. This is
called Common Table Expression (CTE). This is a new feature in SQL
Server 2005 which is the best so far for paging. Previously we had
to page rows by doing any of the following:

  1. Copy primary key in a temporary table after sorting and
    use IDENTITY column to generate row number. Then select rows by
    joining primary key from temporary table to the actual table. The
    paging is done based on row number on the temporary table. Very
    expensive. You end up creating temporary table on every call to the
    SP. Too many IO on tempdb.
  2. Use subquery. Very complicated SQL. Not so good result.
  3. Use cursor to skip rows and select only those which fall within
    the page. It was the best solution so far.

Now you have the mighty ROW_NUMBER() function which we all have
been waiting for 10 years!

Back to our disaster, you see the SQL is nice and clean, nothing
suspicious. But SQL Server 2005 is drowning taking us and a
thousands of users all over the world with it.

Finally, it struck my mind! SQL Server is doing something on the
PublishDate field because it needs to sort the rows based on
it.


ROW_NUMBER()


OVER

(

ORDER BY

[PublishDate]

DESC

)

AS

[RowNumber]

What if I make a new Non-Clustered index on ChannelID and
PublishDate? The query filters using ChannelID and sorts using
Publishdate. So, it should get everything it needs from the
index.

I created a brand new index on ChannelID and PublishDate. It
took 12 mins to create the index. Then the CPU usage looked like
this:


We are saved! We are back in business!

So, what we learn from this experience? We learn nothing.

This blog post is dedicated to my friend Shahed who wrote the
above SP two weeks ago and gave us a lot of fun all these
days.

6 thoughts on “100% CPU, 100% IO, a near death experience for SQL Server 2005 and us”

  1. 1) You learned from this experience (and I from your post) how properly create indexes for “SELECT WITH” queries.

    2) How long did it take to write this post after you found the problem?

  2. Great post! Saved my bacon too. I had exactly the same kind of issue and was going bald trying to find it. You turned on the light for me. Thanks.

  3. Sir,
    I am having SQL server 2005 database server,its taskmanager shows 100% CPU utilization,even when no acess is there.What may be the cause for it.only sqlserver.exe taking 99% CPU usage,so user feel so much problem while generating reports.Please send the solution on my mail.I applied Indexes also.

  4. my stored proecedure run 4 min off peak time and 30 mins in peak time, i am not able to find the cause for 30 mins duration.Please tell me the various ways to find the bottleneck cause solution.
    we has transaction replication setup on the same server.

Leave a Reply