Atlas 4: Only 2 calls at a time and don’t expect any order

Browsers make at most 2 concurrent AJAX calls to a domain. If
you make 5 AJAX calls, the browser issues the first 2, waits for
one of them to complete, and then issues the next call, until all
5 calls are done. Moreover, you cannot expect the calls to complete
in the same order you made them. For example, if Call 3's response
is quite big and takes longer to download than Call 5's, Call 5
actually finishes before Call 3.

So, the world of HTTP is unpredictable.

Atlas 5: Bad calls make good calls timeout

If 2 HTTP calls somehow get stuck for too long, those two bad
calls will also make the good calls that got queued in the meantime
time out. Here's a nice example:

function timeoutTest()
{
    PageMethods.Timeout( { timeoutInterval : 3000, onMethodTimeout:
        function() { debug.dump("Call 1 timed out"); } } );

    PageMethods.Timeout( { timeoutInterval : 3000, onMethodTimeout:
        function() { debug.dump("Call 2 timed out"); } } );

    PageMethods.DoSomething( 'Call 1', { timeoutInterval : 3000,
        onMethodTimeout: function() { debug.dump("DoSomething 1 timed out"); } } );

    PageMethods.DoSomething( 'Call 2', { timeoutInterval : 3000,
        onMethodTimeout: function() { debug.dump("DoSomething 2 timed out"); } } );

    PageMethods.DoSomething( 'Call 3', { timeoutInterval : 3000,
        onMethodTimeout: function() { debug.dump("DoSomething 3 timed out"); } } );
}

I am calling a method named "Timeout" on the server which does
nothing but wait for a long time so that the call times out. After
that I am calling a method which does not time out. But guess what
the output is: in one run, only one call ("DoSomething 1")
succeeded. Try again and you might see two calls succeed. So, if at
any moment the browser's two connections get jammed, you can expect
the other waiting calls to time out as well.

In Pageflakes, we used to get nearly 400 to 600 timeout error
reports from users' browsers. We could never figure out how this
could happen. First we suspected slow internet connections, but
that could not be the case for so many users. Then we suspected
something was wrong with the hosting provider's network. We did a
lot of network analysis to find a problem on the network, but we
could not detect any. We used SQL Profiler to see whether there was
any long-running query which exceeded the ASP.NET request execution
timeout, but no luck. We finally discovered that it mostly happened
because some bad calls got stuck and made the good calls expire
too. So, we modified the Atlas runtime, introduced automatic retry
in it, and the problem disappeared completely. However, this auto
retry requires a sophisticated open-heart bypass surgery on the
Atlas runtime JavaScript, which you have to perform again and again
whenever Microsoft releases a newer version of the Atlas runtime.
You also can no longer use the <atlas:ScriptManager> tag which
emits the Atlas runtime references; instead you have to manually
put links to the Atlas runtime and compatibility JavaScript files.
So, you had better implement auto retry in your own code from day
one: in the onMethodTimeout callback, just retry the call once to
be on the safe side.

Atlas 3: Atlas batch calls are not always faster

Atlas provides a batch call feature which combines multiple web
service calls into one call. It works transparently: you won't
notice anything, nor do you need to write any special code. Once
you turn on the batch feature, all web service calls made within a
short duration get batched into one call. This saves roundtrip time
and total response time.

The actual response time might be reduced, but the perceived
delay is higher. If 3 web service calls are batched, the 1st call
does not finish first; all 3 calls finish at the same time. If you
are doing some UI update upon completion of each call, the updates
do not happen one by one: all of the calls complete in one shot and
then the UI is updated in one shot. As a result, you do not see
incremental updates on the UI, just a long delay before the UI
updates. If any of the calls, say the 3rd one, downloads a lot of
data, the user sees nothing happening until all 3 calls complete.
So, the duration of the 1st call becomes nearly the duration of the
sum of all 3 calls. Although the actual total duration is reduced,
the perceived duration is higher. Batch calls are handy when each
call transmits only a small amount of data; then 3 small calls get
executed in one roundtrip.

Let's work on a scenario where 3 calls are made one by one,
without batching. The second call takes a bit of time to reach the
server because the first call is eating up the bandwidth, and for
the same reason it takes longer to download. Browsers open 2
simultaneous connections to the server, so only 2 calls are in
flight at a time. Once the first or second call completes, the
third call is made.

When these 3 calls are batched into one, the total download time
is reduced (if IIS compression is enabled) and there's only one
network latency overhead. All 3 calls get executed on the server in
one shot and the combined response is downloaded in one call. But
to the user, the perceived speed is slower because all the UI
updates happen after the entire batch call completes. The total
duration the batch call takes to complete will always be higher
than that of the first 2 calls. Moreover, if you do a lot of UI
updates one after another, Internet Explorer freezes for a while,
giving the user a bad impression. Sometimes an expensive UI update
makes the browser screen go blank and white. Firefox and Opera do
not have this problem.

Batch calls have some advantages too. The total download time is
less than downloading the individual call responses because, if you
use gzip compression in IIS, the combined result is compressed as a
whole instead of each result being compressed individually. So,
generally, batching is better for small calls. But if a call is
going to send a large amount of data, or is going to return, say,
20 KB of response, then it's better not to batch it. Another
problem with batching: say 2 calls are very small but the 3rd call
is quite big. If these 3 calls get batched, the smaller calls
suffer a long delay because of the 3rd, larger call.

Beginning Atlas series: Why Atlas?

This is the first question everyone asks me when they see
Pageflakes. Why not Protopage or the Dojo library? Microsoft Atlas
is a very promising AJAX library. They are putting a lot of effort
into Atlas, building lots of reusable components that can really
save you a lot of time and give your web application a complete
face lift at a reasonably low cost of change. It integrates with
ASP.NET very well and it is compatible with the ASP.NET Membership
and Profile providers.

When we first started developing Pageflakes, Atlas was in its
infancy. We were only able to use the Page Method and Web Service
Method call features of Atlas. We had to build our own drag &
drop, component architecture, popups, collapse/expand features,
etc. But now you can get all of these from Atlas and save a lot of
development time. The web service proxy feature of Atlas is a
marvel. You can point a <script> tag to a .asmx file and you get a
JavaScript class generated right out of the web service definition.
The JavaScript class contains the exact methods that you have on
the web service class. This makes it really easy to add or remove
web services and their methods without requiring any changes on the
client side. It also offers a lot of control over the AJAX calls
and provides rich exception trapping in JavaScript. Server-side
exceptions are nicely delivered to the client-side JavaScript code,
where you can trap them and show nicely formatted error messages to
the user. Atlas works really well with ASP.NET 2.0, eliminating the
integration problem completely. You need not worry about
authentication and authorization on page methods and web service
methods. So, you save a lot of code on the client side (of course,
the Atlas runtime is huge for this reason) and you can concentrate
more on your own code than on building all this framework-related
plumbing.
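
For example, a plain .asmx web service like the hypothetical one
below (the class and method names are made up for illustration) is
all you need on the server; referencing it from the page gives you
a client-side JavaScript proxy class with the same methods:

using System.Web.Services;

[WebService(Namespace = "http://pageflakes.example/")]
public class FlakeService : WebService
{
    // Each [WebMethod] appears as a method with the same name and
    // parameters on the generated JavaScript proxy class.
    [WebMethod]
    public string GetGreeting(string name)
    {
        return "Hello, " + name;
    }
}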

Recent versions of Atlas work nicely with the ASP.NET Membership
and Profile services, giving you login/logout features from
JavaScript without page postbacks, and you can read/write the
Profile object directly from JavaScript. This comes in very handy
when you use the ASP.NET membership and profile providers heavily
in your web application, which we do at Pageflakes.

In earlier versions of Atlas, there was no way to make HTTP GET
calls. All calls were HTTP POST and thus quite expensive. Now you
can specify which calls should be HTTP GET. Once you have HTTP GET,
you can utilize HTTP response caching features, which I will
explain soon.
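
As a rough idea of what that enables (a minimal sketch; the service
and method names are illustrative, and the exact way to mark a
method as HTTP GET depends on your Atlas version), a GET-friendly
web method can emit standard cache headers from server-side code:

using System;
using System.Web;
using System.Web.Services;

public class CachedService : WebService
{
    // Hypothetical method: ask the browser and proxies to cache
    // this response for 5 minutes so repeated GET calls are free.
    [WebMethod]
    public string GetServerTime()
    {
        Context.Response.Cache.SetCacheability(HttpCacheability.Public);
        Context.Response.Cache.SetMaxAge(TimeSpan.FromMinutes(5));
        return DateTime.Now.ToString();
    }
}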

I will be writing about lots of Atlas tips and tricks. I am
assuming you are familiar with Atlas, have already tried some of
the quick start tutorials, and know the concepts of Page Method,
Web Service Proxy, Script Manager, etc.

Do you have problems with users who cannot use the Forgot Password option?

Here's a scenario. We use the email address as the user name in
the ASP.NET 2.0 Membership provider. There were several places
where we used to create user accounts like this:

Membership.CreateUser( email, password );

We did not notice what it was doing. After some days, users
started complaining. This is what the users whose accounts were
created by the above code said:

“Hi,

I got the email invitation. I went to your site. I tried login,
it said user name or password is wrong. So, I tried Signup. Signup
said user name already taken. Then I went to forgot password to
retrieve the password. It shows something is wrong and password
email cannot be sent.

I am stuck. Please help!”

Here's the problem. When we use the above code, it creates a row
in the aspnet_users table using the email address as the user name.
Fine, no problem. But in the aspnet_membership table, the row it
creates has a NULL Email column. So, the user cannot use the
"Forgot Password" option to request the password, because the email
address is null. Our database contained 908 such unfortunate users,
so we had to run the following SQL to fix them:

update aspnet_membership
set    email        = ( select username from aspnet_users
                        where applicationID = '<YourApplicationID>'
                          and userID = aspnet_membership.userID ),
       loweredemail = ( select loweredusername from aspnet_users
                        where applicationID = '<YourApplicationID>'
                          and userID = aspnet_membership.userID )
where  loweredemail is null
  and  applicationID = '<YourApplicationID>'

The applicationID is something you need to supply for your own
application. You can find the ID in the aspnet_Applications
table.

Then we changed the code to create user accounts to this:

Membership.CreateUser( email, password, email );

The third parameter is the email address, which we had
overlooked.
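
A small wrapper can keep this mistake from creeping back in. Here
is a minimal sketch (the helper name is ours, not part of the
Membership API): it always passes the email address as the third
argument and surfaces the failure reason instead of silently
creating a broken account:

using System;
using System.Web.Security;

public static class AccountHelper
{
    public static MembershipUser CreateUserWithEmail(string email, string password)
    {
        try
        {
            // The third argument populates the Email/LoweredEmail columns
            // in aspnet_membership that "Forgot Password" depends on.
            return Membership.CreateUser(email, password, email);
        }
        catch (MembershipCreateUserException ex)
        {
            // ex.StatusCode says whether the user name is taken,
            // the password is too weak, and so on.
            throw new InvalidOperationException(
                "Could not create user: " + ex.StatusCode, ex);
        }
    }
}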

Large log file can bring SQL Server down when transaction log shipping runs

We were getting very poor performance after we turned on
transaction log shipping on our SQL Server. We use SQL Server 2005.
The transaction log file was around 30 GB because the database was
in Full recovery mode. Every 15 minutes, when the log shipping ran,
the server used to become very slow and sometimes nonresponsive.
The event log was getting full of SqlTimeout exceptions generated
by the web site, and the web site started showing the ASP.NET error
page very frequently. We could not even log in to SQL Server with
Management Studio in order to do something about it.

An external monitoring site reported connection times with peaks
of 30 seconds, which means those connections timed out.

So, here's what we did:

  1. Turned off log shipping.
  2. Restarted SQL Server.
  3. Switched the database to the Simple recovery model and shrank the
    log file. This brought the log file down to a couple of megabytes.
  4. Ran like this for some days. All looked OK.
  5. Then switched the database back to the Full recovery model and
    configured log shipping again.

It has been running fine so far. But we still go down for about
an hour every Saturday when we run index defragmentation on the
indexes. While the defrag runs, the log shipping produces around 5
or 6 log backups which are each 1 or 2 GB in size.

How to set up SQL Server 2005 Transaction Log Shipping on a large database so that it really works

I have tried a lot of combinations over the years in order to
find an effective method for implementing transaction log shipping
between servers which are in a workgroup, not in a domain. I
realized the things you learn from articles and books are for small
and medium sized databases. When your database becomes 10 GB or
bigger, things become a lot harder than they look. Additionally,
many things changed in SQL Server 2005, so it's even more difficult
to configure log shipping properly nowadays.

Here are the steps that I finally found to work. Let's assume
there are 2 servers with SQL Server 2005. Make sure both servers
have the latest service pack; Service Pack 1 has already been
released.

1. Create a new user account named "SyncAccount" on both
computers. Use the exact same user name and password.

2. Make sure file sharing is enabled on the local area
connection between the servers. Also enable file sharing in the
firewall.

3. Make sure the network connection between the servers is not a
regular LAN link. It must be a gigabit card with near-zero data
corruption; both the cable and the switch need to be perfect. If
possible, connect both servers directly NIC-to-NIC using fibre
optic cable in order to avoid a separate switch.

4. Now create a folder named "TranLogs" on both servers. Let's
assume the folder is at E:\TranLogs.

5. On the primary database server, share the "TranLogs" folder
and allow SyncAccount "Full Access" to the share. Then also allow
SyncAccount "Full Access" on the TranLogs folder itself, so you are
setting the same permission from both the "Sharing" tab and the
"Security" tab.

6. On the secondary database server, allow SyncAccount "Full
Access" on the TranLogs folder. There is no need to share it.

7. Test whether SyncAccount can really connect between the
servers. On the secondary server, open a command prompt running
under SyncAccount, for example:

8. runas /user:SyncAccount cmd

9. Now you have a command prompt which is running with
SyncAccount's privileges. Let's confirm the account can read the
"TranLogs" share on the primary server and write to the local
TranLogs folder, for example (assuming the primary server is named
PrimaryServer):

10. dir \\PrimaryServer\TranLogs
    echo test > E:\TranLogs\test.txt

11. This is exactly what SQL Agent will be doing during log
shipping: it will copy log files from the primary server's network
share to its own log file folder. So, SyncAccount needs to be able
to both read files from the primary server's network share and
write into the local TranLogs folder. The above test verifies
exactly that.

12. This is something new in SQL Server 2005: add SyncAccount to
the SQL Server Agent Windows group "SQLServer2005SQLAgentUser…".
You will find this Windows user group after installing SQL Server
2005.

13. Now go to Control Panel->Administrative
Tools->Services and find the SQL Server Agent service. Open its
properties and set SyncAccount as the account on the Log On tab.
Restart the service. Do this on both servers.

15. I use the sa account to configure the log shipping. So, do
this on both servers:

a. Enable the "sa" account. By default, sa is disabled in SQL
Server 2005.

b. On the "sa" account, turn off the password expiration policy.
This prevents the sa password from expiring automatically.

16. On the secondary server, you need to allow remote
connections. By default, SQL Server 2005 disables TCP/IP
connections; as a result, you cannot log in to the server from
another server. Launch the Surface Area Configuration tool from
Start->Programs->MS SQL Server 2005, go to the "Remote
Connections" section, and choose the third option, which allows
both TCP/IP-based remote connections and local named pipe
connections.

17. On the secondary server's firewall, open port 1433 so that
the primary server can connect to it. Then restart SQL Server. Yes,
you need to restart SQL Server; the remote connection settings do
not take effect until you do.

18. On the primary server, go to Database Properties->Options
and set the Recovery Model to "Full". If it was already set to Full
before, it is wise to first set it to Simple, shrink the
transaction log file, and then set it back to "Full". This
truncates the transaction log file for sure.

19. Now take a full backup of the database. Make sure you put
the backup file on a physically separate hard drive from the drive
where the MDF is located. Remember: not different logical drives,
different physical drives. So, you should have at least 2 hard
drives in the server. During backup, SQL Server reads from the MDF
and writes to the backup file; if both the MDF and the backup are
on the same hard drive, it is going to take more than double the
time to back up the database. It will also keep the disk fully
occupied, and the server will become very slow.

20. After the backup is done, RAR the backup file. This ensures
there is no data corruption while the file is being transferred to
the other server: if you fail to unRAR the file on the secondary
server, you know there is a problem in the network and you must fix
the network infrastructure. The RAR file should also be written to
a separate hard drive from the one where the backup is located, for
the same reason: the read is on one drive and the write is on
another drive. It is even better if you can RAR directly to the
destination server over a network share. That has two benefits:

a. Your server's I/O is saved; there is no local write, only a
read.

b. Both the RAR and the network copy are done in one step.

22. By the time you are done with the backup, the RAR, the copy
over the network, and the restore on the other server, the
transaction log file (LDF) on the primary database server may have
become very big. For us, it grows to around 2 to 3 GB. So, we have
to manually take a transaction log backup and ship it to the
secondary server before we configure transaction log shipping.

24. When you are done copying the transaction log backup to the
second server, first restore the full backup on the secondary
server.

26. But before restoring, go to the Options tab and choose
RESTORE WITH STANDBY.

28. When the full backup is restored, restore the transaction
log backup.

29. REMEMBER: go to the Options tab and set the Recovery State
to "RESTORE WITH STANDBY" before you hit the OK button.

30. This generally takes a long time. Too long, in fact. Every
time I do the manual full backup, RAR, copy, unRAR, and restore,
the transaction log (LDF) file grows to 2 to 3 GB. As a result, it
takes a long time to do a transaction log backup, copy and restore;
restoring alone takes more than an hour. Within this time, the log
file on the primary server grows large again, so when log shipping
starts, the first log ship is huge. You need to plan this carefully
and do it only when you have the least amount of traffic.

31. I usually have to do this manual transaction log backup
twice. The first one is around 3 GB; the second one is around 500
MB.

32. Now you have a database on the secondary server ready to be
configured for Log shipping.

33. Go to the primary server, select the database, right-click,
"Tasks" -> "Shrink", and shrink the log file.

34. On the primary server, open the database's properties, go to
the Transaction Log Shipping section, and enable log shipping.

36. Now configure the backup settings.

38. Remember, the first path is the network path that we tested
from the command prompt on the secondary server. The second path is
the local folder on the primary server which is shared and
accessible at that network path.

39. Add a secondary server. This is the server where you have
restored the database backup.

41. Choose “No, the secondary database is initialized” because
we have already restored the database.

42. Go to the second tab, "Copy Files", and enter the path on
the secondary server where the log files will be copied to. Note:
the secondary server fetches the log files from the primary
server's network share into its own local folder, so the path you
specify here is on the secondary server. Do not be confused that it
can look like the same path as on the primary server; I just happen
to have the same folder layout on all servers. It could be
D:\TranLogs if the TranLogs folder is on the D: drive of the
secondary server.

44. On the third tab, "Restore Transaction Log", configure the
restore settings.

46. It is very important to choose "Disconnect users in
database…". If you don't, and by any chance Management Studio is
left open on the database on the secondary server, log shipping
will keep failing. So, force a disconnect of all users while the
log backup is being restored.

47. Set up a monitor server which will take care of making the
secondary server the primary server when your primary server
crashes.

49. In the end, review the complete transaction log shipping
configuration before applying it.

51. When you press OK, SQL Server applies the configuration and
reports the result of each step.

52. Do not be happy at all if everything shows "Success". Even
if you got all the paths and settings wrong, you will still see it
reported as successful. Log in to the secondary server, go to SQL
Server Agent->Jobs and find the log ship restore job. If the job
is not there, your configuration was wrong. If it is there,
right-click it and select "View History". Wait 15 minutes for one
log ship to complete, then refresh and check the list. If you see
all OK, then it is really OK. If not, there are two possibilities:

a. See whether the log ship copy job failed. If it fails, you
entered an incorrect path. The problem can be one of the
following:

  1. The network location of the primary server's share is wrong.
  2. The local folder was specified incorrectly.
  3. You did not set SyncAccount as the account which runs SQL Agent,
    or you did but forgot to restart the service.

b. If the restore fails, the problem can be one of the
following:

i. SyncAccount is not a valid login in SQL Server. From SQL
Server Management Studio, add SyncAccount as a login.

ii. You forgot to restore the database on the secondary server
in Standby mode.

iii. You probably took a manual transaction log backup on the
primary server in the meantime. As a result, the backups that log
shipping takes are no longer in the right sequence.

53. If everything is OK, the job history will show each log
restore succeeding.

Be careful when querying the aspnet_users, aspnet_membership and aspnet_profile tables used by the ASP.NET 2.0 Membership and Profile providers

Such queries will happily run on your development
environment:

Select * from aspnet_users where UserName = 'blabla'

Or you can get a user's profile without any problem
using:

Select * from aspnet_profile where userID = '……'

You can even nicely update a user's email in the
aspnet_membership table like this:

Update aspnet_membership SET Email = 'newemailaddress@somewhere.com' Where Email = '…'

But when you have a giant database on your production server,
running any of these will bring your server down. The reason is
that, although these queries look like obvious ones you will be
using frequently, none of these columns is part of any index. As a
result, all of the above cause a "Table Scan" (the worst case for
any query) over millions of rows in the respective tables.

Here's what happened to us. We used fields like UserName, Email,
UserID, IsAnonymous, etc. in lots of marketing reports at
Pageflakes. These are reports which only the marketing team uses,
no one else. The site runs fine, but several times a day the
marketing team and users would call us and scream "the site is
slow", "users are reporting extremely slow performance", "some
pages are timing out", and so on. Usually when they called, we told
them "hold on, checking right now" and checked the site thoroughly.
We used SQL Profiler to see what was going wrong, but we could not
find any problem anywhere: Profiler showed queries running fine,
CPU load was within parameters, and the site ran nice and smooth.
We would tell them on the phone, "We can't see any problem, what's
wrong?"

So, why couldn't we see any slowness when we tried to
investigate, even though the site became really slow several times
throughout the day when we were not investigating?

The marketing team sometimes runs those reports several times
per day. Whenever they run any of those queries, because the fields
are not part of any index, the server's disk I/O goes super high
and the CPU goes super high as well.

We have 15,000 RPM SCSI drives, very expensive and very fast,
and the CPUs are dual-core, dual Xeon 64-bit. Both are very
powerful hardware of their kind. Still, these queries brought us
down, due to the huge database size.

But this never happened while the marketing team was on the
phone with us trying to find out what was wrong, because while they
are calling us and talking to us, they are not running any of the
reports that bring the servers down. They are working somewhere
else on the site, mostly trying to reproduce what the complaining
users are doing.

Let’s look at the indexes:

Table: aspnet_users
Clustered Index = ApplicationID, LoweredUserName
NonClustered Index = ApplicationID, LastActivityDate
Primary Key = UserID

Table: aspnet_membership
Clustered Index = ApplicationID, LoweredEmail
NonClustered Index = UserID

Table: aspnet_Profile
Clustered Index = UserID

Most of the indexes have ApplicationID in them. Unless you put
ApplicationID = '…' in the WHERE clause, none of the indexes is
used, and all the queries suffer a table scan. Just put
ApplicationID in the WHERE clause (find your ApplicationID in the
aspnet_Applications table) and the queries will become blazingly
fast.

DO NOT use the Email or UserName fields in the WHERE clause.
They are not part of any index; instead, the LoweredUserName and
LoweredEmail fields are indexed in conjunction with the
ApplicationID field. All queries must have ApplicationID in the
WHERE clause.
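
For example, here is a minimal ADO.NET sketch of how such a lookup
might be written so it can use the clustered index on
aspnet_membership (the class and method names are only illustrative,
and the connection string and application ID are placeholders you
have to supply):

using System;
using System.Data.SqlClient;

public class MembershipLookup
{
    // Index-friendly lookup: filter on ApplicationID + LoweredEmail,
    // never on Email alone, so the clustered index can be used.
    public static Guid? FindUserIdByEmail(string connectionString, Guid applicationId, string email)
    {
        const string sql =
            "SELECT UserID FROM aspnet_Membership " +
            "WHERE ApplicationID = @ApplicationID AND LoweredEmail = LOWER(@Email)";

        using (SqlConnection connection = new SqlConnection(connectionString))
        using (SqlCommand command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@ApplicationID", applicationId);
            command.Parameters.AddWithValue("@Email", email);
            connection.Open();

            object result = command.ExecuteScalar();
            if (result == null)
                return null;
            return (Guid)result;
        }
    }
}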

Our admin site contains several such reports, and each contains
lots of such queries on the aspnet_users, aspnet_membership and
aspnet_Profile tables. As a result, whenever the marketing team
tried to generate reports, the queries took all the power of the
CPU and disks, and the rest of the site became very slow and
sometimes non-responsive.

Make sure you always cross-check the WHERE and JOIN clauses of
all your queries against the index definitions. Otherwise you are
doomed for sure when you go live.

Calculate code block execution time using “using”

Here’s an interesting way to calculate the execution time of a
code block:

private void SomeFunction()
{
    using( new TimedLog(Profile.UserName, "Some Function") )
    {
        ...
        ...
    }
}

You get an output like this:

6/14/2006 10:58:26 AM    4b1f6098-8c9d-44a5-93d8-e37394b6ef18    SomeFunction    9.578125

You can measure the execution time of not only a whole function
but also smaller blocks of code: whatever is inside the "using"
block gets logged.

Here's how the TimedLog class does the work:

public class TimedLog : IDisposable
{
    private string _Message;
    private long _StartTicks;

    public TimedLog(string userName, string message)
    {
        this._Message = userName + "\t" + message;
        this._StartTicks = DateTime.Now.Ticks;
    }

    #region IDisposable Members

    void IDisposable.Dispose()
    {
        EntLibHelper.PerformanceLog(this._Message + "\t" +
            TimeSpan.FromTicks(DateTime.Now.Ticks - this._StartTicks)
                .TotalSeconds.ToString());
    }

    #endregion
}

We are using Enterprise Library to do the logging. You can use
anything you like in the Dispose method.

The benefit of such a log is that we get a tab-delimited file
which we can use to do many types of analysis in MS Excel. For
example, we can generate graphs to see how performance goes up and
down during peak and off-peak hours. We can also see whether there
are unusually high response times and what the pattern is. All of
this gives us valuable indications of where the bottleneck is. You
can also find out which calls take most of the time by sorting on
the duration column.
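
If you prefer code over Excel, here is a minimal sketch of that
kind of analysis. The file name and the column layout are
assumptions based on the sample output above, namely that each line
is tab-delimited and ends with the duration in seconds:

using System;
using System.Collections.Generic;
using System.IO;

class PerformanceLogReport
{
    static void Main()
    {
        // Assumed log file name; each tab-delimited line ends with the
        // duration in seconds (see the sample output above).
        string[] lines = File.ReadAllLines("performance.log");
        List<KeyValuePair<double, string>> entries = new List<KeyValuePair<double, string>>();

        foreach (string line in lines)
        {
            string[] fields = line.Split('\t');
            double seconds;
            if (double.TryParse(fields[fields.Length - 1], out seconds))
                entries.Add(new KeyValuePair<double, string>(seconds, line));
        }

        // Sort by duration, slowest first, and show the top 10 offenders.
        entries.Sort(delegate(KeyValuePair<double, string> a, KeyValuePair<double, string> b)
        {
            return b.Key.CompareTo(a.Key);
        });

        for (int i = 0; i < Math.Min(10, entries.Count); i++)
            Console.WriteLine(entries[i].Value);
    }
}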