ISP losing email

Whilst I was consultant/I.T. Director at Totalise, a rapidly growing ISP, a major email problem developed. Reports came in that subscribers' email folders were losing their stored email. As Totalise focused on webmail backed by IMAP4, this was a major issue. To begin with, the reports were so random that we couldn't see any clues as to the cause. What we did discover was that the emails would often reappear!

The mail system was Infinite Interchange running on Windows NT4 on a Dell PowerEdge cluster. The lead server administrator and I spent hours watching the servers over remote connections to the data center. Just as we thought we weren't going to discover anything, a set of user account directories, all starting with the letter 'M', disappeared in front of our eyes.

We were totally at a loss to understand why. Just as we started to investigate the PowerVault storage the directories reappeared - we were so surprised we just laughed in relief; joking about the ghost in the machine. Throughout the day more email directories would go missing only to return later.

We became convinced it was an issue with the Dell PowerVault, so we called Dell. Despite the expensive support contract we had with them, they wouldn't send out an engineer to help us. This caused a storm when the CEO heard, because by then Totalise had thousands of customers affected and no explanation or solution in sight.

With Dell refusing to help us, the decision was made to replace the Dell PowerVault with an IBM Netfinity array. I phoned around as if I was doing my weekly shop and was amazed to discover there was one sitting in a distributor's warehouse about 4 miles away - if I had £15,000 to spare. We quickly completed the purchase and arranged for a van to carry the new disk array to the data center on the Friday night.

We had advised the subscribers we'd be shutting down the email servers for repairs overnight, so three senior engineers and I headed to the data center after close of business on Friday. An IBM engineer also arrived to provide assistance if needed.

I was amazed and angry to discover that the 3-server PowerEdge cluster, with its fibre-channel connected RAID PowerVault, had been incorrectly configured by the company engineer when it was first installed. What we actually had was a single unclustered NT4 server in use, one standing idle, and one switched off. To make matters worse, the PowerVault disk array wasn't configured for RAID - we had no redundancy for 160,000 email accounts!

At 1am we shut down the server and began swapping the disk arrays. We had to leave reconfiguring the servers as a cluster for another time. Once the Netfinity array was in place, we began the process of formatting it for RAID-5. A couple of hours later it was ready and we began copying the mail folders from the PowerVault to the Netfinity.

We had to use some special software tools to update the Interchange email-server indexes, and for that we had the U.S. programmers on an open line with us, writing utilities on the fly. It took a couple of aborted attempts before a good transfer was under way. It was obvious the transfer of email from the PowerVault was going to take a day rather than a few hours, so we advised the subscribers of that when we opened the mail-server to clients again, on time, at 12 noon.

We all headed off for a good sleep and what we hoped would be a relaxing weekend. On Monday morning email directories were still being transferred to the Netfinity, but by late afternoon the transfer was complete.

Around 2am on Tuesday morning I was woken with reports that the email-server was losing email again. Knowing we didn't have a cluster configured, and worrying now that the problem was caused by the server, I headed for the office and began investigating.

When the engineers arrived at 9am I was working through every bit of information and every decision we had gone through the previous week. I was also reading reams of Microsoft Knowledge Base articles for any clues. Around lunch-time I discovered a KB article (Q229607 "File Corruption on an NTFS Volume with More Than 4 Million Files" archived by Jeff Par) which reported that NTFS volumes would appear to lose directory entries if the volume held more than 4 million files, because of a pointer wrap-around bug. The bug was fixed in Service Pack 6. I checked the mail servers and discovered they were running SP5!
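With hindsight, a simple file count on the mail volume would have pointed at this much sooner. The following is only a rough sketch of such a check, not anything we ran at the time: the function names are made up for illustration, the 4,000,000 figure is taken from the KB article title, and the exact conditions that trigger the bug are assumed rather than known.

```python
import os

# Threshold taken from the KB article title; the precise trigger
# conditions of the bug are an assumption here.
FILE_LIMIT = 4_000_000


def count_files(root):
    """Count regular files under root, walking subdirectories iteratively."""
    total = 0
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        total += 1
        except OSError:
            # Skip directories that cannot be read (permissions, races).
            continue
    return total


def over_limit(file_count, limit=FILE_LIMIT):
    """Return True when the count exceeds the assumed danger threshold."""
    return file_count > limit
```

Pointed at the root of the mail store (the path would depend on the installation), `over_limit(count_files(path))` gives a yes/no answer. With one file per message across 160,000 accounts, it takes an average of only 25 messages per account to reach 4 million files.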

I was ready to explode at this point but I reviewed our decisions again and saw that we'd already asked about this bug and the Interchange engineers had told us that it was fixed in NT SP5. We asked them again, and they repeated that it was fixed in SP5. We pointed them at the MS KB article and listened to the deafening silence as they realised the error.

We quickly developed a plan to update the server to SP6 that afternoon before the early-evening demand kicked in. We were able to restart the server with SP6 by 5pm and were relieved to see all directories present. We hung about in the office until 9pm just in case, but this time the problem was solved.

It was a very stressful week for everyone, especially as we couldn't diagnose the issue. It was also very costly in time and money, but it taught everyone a lesson about ensuring systems are correctly configured and not trusting to hearsay. A couple of weeks later, when the fuss had died down, the systems admin and I spent another overnight session in the data center configuring the cluster and adding two more CPUs to the servers so each had four. From that time on, server loads dropped from 80% to 10% and the email system didn't fail again.