Multiply By Pi…

The migration saga continues…

I’m now trying to figure out whether this is a Xen/Ubuntu issue or just an Ubuntu issue…

In my professional work environment we use VMware ESXi in production. We run several virtuals in that environment, including Ubuntu (Hardy) and CentOS 5.x. We run a heavily loaded site that deals with a weekly newsletter which draws hundreds of concurrent connections per second for several hours during the peak period. The only stability problem we have ever noticed is that if there is too little memory, the system starts to swap heavily, and then Apache stops responding. Even in that low-memory situation we could still connect to the console or ssh in and restart Apache. (Adding memory and tweaking the Apache configuration for MaxClients, etc. fixed that issue.) It has never locked up the virtual so badly that the entire server had to be cold booted. Even in its worst case, we could still restart the virtual from the ESXi console. And Ubuntu has been rock-solid and easy to work with.
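For what it’s worth, the MaxClients tweak mostly comes down to arithmetic: cap the number of Apache processes so that they all fit in RAM without pushing the box into swap. Here is a rough sketch of that calculation. The function name and all the numbers are made up for illustration (they are not measurements from our servers), and it assumes the prefork model, where each client gets its own process.

```python
def estimate_max_clients(total_ram_mb, reserved_mb, per_process_mb):
    """Rough upper bound for Apache's MaxClients so workers fit in RAM.

    total_ram_mb:   memory given to the virtual machine
    reserved_mb:    memory set aside for the OS, database, etc. (a guess)
    per_process_mb: average resident size of one Apache process (a guess)
    """
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    usable_mb = total_ram_mb - reserved_mb
    if usable_mb <= 0:
        # Nothing left for Apache once the reserve is taken out.
        return 0
    return int(usable_mb // per_process_mb)


# Hypothetical example: 2 GB virtual, 512 MB reserved, 24 MB per process.
print(estimate_max_clients(2048, 512, 24))  # → 64
```

The real per-process size depends on which modules are loaded, so it is worth measuring on the actual server before trusting a number like this.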

The VPS.NET environment is a 64-bit Xen environment. After I went to bed this morning (at 3:30am), support left a message in my ticket thread saying that my lockups may be due to the fact that the Turnkey appliance is 32-bit and they are running it on a 64-bit Xen implementation, and asking whether I wanted to move to a 64-bit kernel. (In other words, did I want to start over from scratch? Yeah. Right.) Then the support rep came back and said that this may be incorrect: it seems there are also problems in some instances with the 32-bit Xen kernel and native 32-bit Ubuntu implementations, as reported in the Xen lists. The net result is that they have updated my kernel (to 2.6.24-24-xen) and the VPS is now up and running, and has been for 6 hours. That’s not exactly a record yet, however.

I did some googling on this issue, and found that it may be an Ubuntu thing. It seems that other people have had the same problem with Ubuntu under VMware. Stuart suggested a kernel parameter fix which worked for 3 out of 4 of his VMs. The Ubuntu forum he links to in his article about the issue makes reference to upgrading the kernel. Let’s just hope that works for my situation.

I’m nervous to “flip the switch” (change my DNS and move this implementation live) lest there be issues with the site locking up again. On the other hand, the discussion forum has been migrated, and it was an arduous three-hour exercise that I don’t want to have to do again.

I think I’ll keep the old site running in parallel for the next month, in case I have to fail back to it. I’ve sub-domained some key functions like the discussion forum, so I can leave the main part of the site on the old system, put the forum on the new system, and see how it fares.

Now I just need to wave a dead rubber chicken around my head, say a few incantations, and pray…
