
How Bad is it? Answer: Really Bad

Monday…
After another exciting day on the phone with tech support, I am left wondering: when do you say enough is enough?

The morning started at 7am, and when I found out support didn’t open till 9am, I should have known it was going to be a long day.

The first 5 hours were spent calling Intel every 30 minutes for updates.  While the support agents spoke clear English, they weren’t exactly helpful.  They were only able to let me know that engineering was looking at the case and had requested remote access to our network.  They did note that engineering NEVER asks to remotely connect… wow, wasn’t I excited to hear that.

A little after 1pm a support tech named Ruonan called my cell phone and said she would be taking the case.  I have to say she is probably one of the most pleasant tech support agents I have spoken with.  So Ruonan contacted Galen at LeftHand Networks and we started digging into the problem.  Galen connected to the SAN via SSH and noticed that the files that should be on the drives to create the configuration were missing… Hence we can’t find the data on the SAN.

Ruonan then instructs us to boot to SLAX pocket Linux and browse for the files on the DOM.  So we load SLAX onto a CD and boot from the disc to create a bootable jump drive…  One item learned: the SLAX boot ISO doesn’t have an installer as the instructions state, so we can’t make the USB jump drive bootable.  So Jeremie downloads DSL (D*** Small Linux) and we load that onto the jump drive instead.  All this monkeying around takes about 3 hours…  We boot up the SAN to find that the files are missing on the old DOM too…
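For anyone curious, here is the kind of quick check we were doing by hand once the DOM was mounted under the live Linux environment.  It is only a sketch: the mount point and config file names below are placeholders, since the actual files to look for are whatever the vendor’s support engineer names.

#!/usr/bin/env python
# Quick check for the SAN configuration files on a mounted DOM.
# The mount point and file names are placeholders -- use whatever paths
# support tells you to look for.
import os

DOM_MOUNT = "/mnt/dom"                 # where the live CD mounted the DOM
EXPECTED_FILES = [                     # hypothetical config file names
    "etc/storage/config.xml",
    "etc/storage/volume_table.db",
]

for rel_path in EXPECTED_FILES:
    full_path = os.path.join(DOM_MOUNT, rel_path)
    status = "FOUND" if os.path.isfile(full_path) else "MISSING"
    print("%-8s %s" % (status, full_path))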

So where does this leave us?  According to Galen, an 80% chance of needing to rebuild the SAN.  So when do you wave good-bye to the data you know is on the SAN and start the backup recovery process?  Ugh, the thought of restoring 13 virtual servers, including a SQL server, doesn’t give me warm fuzzies.

So at 5pm I make the dreaded call… we are going to restore the servers from backup.  We contact Mark Moreno, our SAN channel partner, and the rebuilding of the SAN begins.

For the rest of the night we build and boot up virtual server templates and start the restore process.  At 1 am it’s clear we aren’t going to be up and running by morning, so I tell Jeremie it’s time for us to head home.

Tuesday…
I arrive at the office at 7am and we continue to work on rebuilding the virtual servers. We did a system state restore to the print server and all the printers are back… but the server isn’t stable… So we take the info about the print shares, etc., and start up a new template.  Around 9am we have our print server back online (minus 2 printers)…

I do have to say the restore process from our SonicWall CDP is really nice… We install the CDP client on the new virtual servers, right-click on the directories, tell it to restore the deleted directories, and the process is under way. Need I say it: no searching for tapes.  As soon as the data is back online, the directory is a watched folder again.

Around 2pm the SAN is configured and ready for us to start restoring data. One bonus from the whole process: we configure the SAN with load-balancing this time to utilize both NICs on the VM_Hosts, and by configuring the mirroring and array as RAID 5 we gain almost a full TB of storage space.
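To put the storage gain in rough perspective, here is some back-of-the-envelope math.  The drive count and size below are assumptions for illustration only (they are not the actual SSR212MA configuration), so the exact numbers will differ from ours; the point is just that RAID 5 gives up one drive to parity instead of half the drives to mirroring.

# Back-of-the-envelope usable capacity.  The drive count and size below are
# assumptions for the example, not the actual SSR212MA configuration.
def usable_tb(drives, drive_tb, level):
    if level == "raid10":              # mirrored pairs: half the raw space
        return drives * drive_tb / 2.0
    if level == "raid5":               # one drive's worth of parity
        return (drives - 1) * drive_tb
    raise ValueError("unknown RAID level: %s" % level)

drives, drive_tb = 12, 0.25            # e.g. twelve 250 GB drives (assumed)
print("RAID 10:", usable_tb(drives, drive_tb, "raid10"), "TB usable")
print("RAID 5 :", usable_tb(drives, drive_tb, "raid5"), "TB usable")
# With these assumed numbers: 1.5 TB mirrored vs 2.75 TB in RAID 5.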

We attempted to restore our SQL databases to the newly built SQL server, but no luck.  A 2-hour call to Microsoft, a few tweaks to the permissions, and a pass through the cliconfg tool, and we are able to remotely access the SQL server.  But now the Desktop Authority application fails when we install it… And of course DA support is closed.
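For what it is worth, a quick connectivity test like the sketch below is how you can confirm the server really is reachable remotely once the client protocol and permission tweaks are in place.  The server name, port, and credentials here are placeholders, not our actual setup.

# Minimal remote-connectivity test against the rebuilt SQL server.
# Server name, port, and credentials are placeholders, not our real setup.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=sql01,1433;DATABASE=master;"
    "UID=restore_admin;PWD=example-password"
)
row = conn.cursor().execute("SELECT @@SERVERNAME").fetchone()
print("Connected to:", row[0])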

While I am talking with MS about SQL, Jeremie starts a case with MS support about restoring the system state of the file server.  We have the virtual server with the data restored, but we need the system state to bring back the permissions and shares.  We called MS since we hadn’t had much luck with system state restores on other servers earlier in the day, and our lives would be much easier if we could restore the file server properties.  So we snapshot the virtual server and MS helps restore the registry… no luck on reboot one, but suddenly after reboots two and three the file shares appear.  A little tweaking of the registry and we are back up… but minus permissions.  The tech tells us that we will have to rebuild the permissions… so at least we got the shares back.  Now that it’s 2 am I tell Jeremie it’s time for us to go home.
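One takeaway for next time: the non-administrative file-share definitions live in the registry under the LanmanServer “Shares” key, so a small script run ahead of time could have captured them for us.  A minimal sketch, assuming it is run on the file server itself:

# Dump the file-share definitions to a text file so they can be recreated
# after a rebuild.  Run on the file server itself (Windows only).
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Services\LanmanServer\Shares"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
    with open("share-definitions.txt", "w") as out:
        subkeys, value_count, modified = winreg.QueryInfoKey(key)
        for i in range(value_count):
            name, data, value_type = winreg.EnumValue(key, i)
            out.write("[%s]\n" % name)
            for entry in data:         # each share is a list of "Key=Value" strings
                out.write("  %s\n" % entry)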

Wednesday…
At 7:15am I call ScriptLogic and the tech points us to the fact that we just need to do a clean install of DA and all is well… And once again our CDP works well… right-click on the database in the client application on the SQL server and magically the DB is restored.  Logging into the DA console, all our settings are there… we can now finish our print server since we know the printer share names we need.

It takes about 2 hours to restore the permissions to the file server, but that is now back online.  So after 4 days we have our file, SQL, print, web, and antivirus servers back online.  The help-desk, SharePoint, Ghost, and a few other less important servers will have to wait until after Thanksgiving weekend.

Our next steps: finish restoring the remaining servers, and review and evaluate the crisis.  That review will include asking what we can do to prevent this from happening again.

We found some issues with our backup process, primarily around documenting configurations.  You don’t realize how much stuff you store on your file server or in specific application databases until your file server and SQL server are MIA.

One thing we will seriously look into: storage space and a process to back up the whole virtual server.  If we had a backup of the whole VMDK (and other files) for each virtual server, we could have just restored those files to the offline server and this process would have been much quicker… Maybe a rack-mounted NAS would be a good solution.  It needs to be off the SAN, but still on drives in a RAID array.
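Something as simple as a nightly copy job would cover it.  A rough sketch of the idea, with hypothetical paths, and assuming the guests are shut down or snapshotted first so the copies are consistent:

# Nightly copy of each virtual server's folder (VMDK and friends) to a NAS.
# Paths are placeholders; guests should be shut down or snapshotted first.
import shutil
from pathlib import Path

VM_STORE = Path(r"D:\VirtualServers")      # where the host keeps the VMs (assumed)
NAS_TARGET = Path(r"\\nas01\vm-backups")   # rack-mounted NAS share (assumed)

for vm_dir in VM_STORE.iterdir():
    if not vm_dir.is_dir():
        continue
    dest = NAS_TARGET / vm_dir.name
    if dest.exists():                      # keep only the most recent copy
        shutil.rmtree(dest)
    shutil.copytree(vm_dir, dest)
    print("copied", vm_dir.name)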

One thing we have learned: how to back up the configuration from the SAN, and that we should.  Even though the support folks at Intel said that recovery of the SAN after a DOM failure would probably work if we had a backup of the config, they said that restore sometimes fails too… But for the future, we have that config saved and archived just in case.

I must say a few thanks, starting with the prayers!  Several people, including some Northwoods staff, JP, and Ed, called or emailed just to check in on us or say they were praying for us… very cool… this is what CITRT is all about.  I even had a vendor call and ask how they could help… Dean Lisenby from ACS is top notch in my book.  He calls after his work day is done, on his way home, to remind me that I have his cell phone number and can call for any reason… even just to check in… Dean isn’t your normal vendor.

I have to give huge thanks to my staff.  Each person on my team has responded extremely well during this crisis; no task was too trivial, whether it was laughing at me when my thoughts come out less than clear at 1 am, fixing my forgetting to run a virtual server as a service rather than under the local login account, answering the question of “when will I be able to…” for the rest of the staff, or unlocking doors after I leave my keys in the server room, or my office, or the new server room, or the bathroom, or… well, you get the picture.  Thanks Jeremie, Jim, and Linda, you 3 are an awesome team!!!!

Now I must sleep; 10 hours of sleep since Monday morning leaves a sleepy IT Director.


How Bad is it? Saga Continues

Well, I had expected that today’s update on our SAN saga would conclude the drama, but that isn’t the case.

Little frustrates me more than when a vendor’s tech support doesn’t do their homework and tells you incorrect information.  When we talked with tech support on Thursday while ordering the DOM, we were clearly told that the configuration of the array in our SAN would not be affected when the new module was installed.  I wondered about the accuracy, so we asked several times, and the tech reassured us that the SAN would power up and all would be well once the new DOM was installed.  Well… as you can guess, we received the DOM Friday morning and all isn’t well.  Jeremie installed the DOM and hmm… the SAN powers up, but doesn’t display any configuration.  While I appreciate the technician’s correct diagnosis of the failed module, I don’t like being told wrong information.

So, that means we just work with support to rebuild the configuration and get the SAN back online… If only it were that easy.

Jeremie spent most of the day camping out in our server room, leaving messages and waiting for our buddy Ray at Intel to call us back.  Even our channel partner got in the mix of leaving messages for Ray.  Mid-day I asked Jeremie to start the process with a new technician since Ray was MIA.  This new tech sends Jeremie a form to complete and drops off the call.  Jeremie receives the form, but without any explanation of where to send it or what information to fill in (the old configuration or our current situation).  So we wait some more, trying to contact Ray to find out where this form fits in the mix.

At 5pm Jeremie conferences me into another call to Intel and guess who we get a hold of… our buddy Ray, who starts to explain that he was calling all afternoon but nobody answered…  (He was calling our main number instead of the DID we told him to call.)  I quickly told Ray I didn’t care about the why… we needed a resolution.

So Ray then tells us what information to put on the form and gives us the email address where we should send it… and we wait some more.  Ray informs us our issue requires transferring our case to the engineering department, which might be closing at 7pm.  How can support close at 7pm?  It’s at that point I start asking for a supervisor (which Ray says will take 60 minutes for a callback).

So as time ticks away toward the possibly not-so-magic 7pm, we wait.  At 6pm I receive a call from Oscar, the support supervisor, and he says the email hasn’t come through.  It’s been 60 minutes since Jeremie sent over the form, but they still don’t have it.  I put a little pressure on Oscar, and 2 minutes later he is able to search the email account, the magic happens, and suddenly they have our email.  But the craziness doesn’t stop: Oscar says that it takes 2-5 days to transfer our case to engineering.  I about blow my stack at that point and again I request a supervisor.  So at this point I tell Jeremie to get some dinner and we wait.

About 50 minutes later I receive a call on my cell, and the caller ID displays a 916 phone number; it must be Intel calling to help us resolve our case.  The caller identifies himself as the supervisor of server support for the central region… you know this call isn’t going to be good.  He informs me that all our paperwork is in place and our account is ready to be transferred over to engineering.  So I wait for that hated word ‘BUT’, and here it is: but the engineering department closes at 7pm.

So… now we wait.  We have weighed re-working the RAID array without support, but I am not willing to take that chance.  There is too much data on the SAN to risk, and the thought of a total recovery from backup of 14+ servers… that just doesn’t sound like an attractive option just for not being patient.  So the decision is made: we wait until Monday.


How bad is it? SAN issues updated.

Well, after 15 minutes of explaining to the nice lady at first-level support that I was calling about an Intel SAN and not an Intel FAN, and about the 10th time of saying it’s a SAN, model number SSR212MA, things have started to look up.  Our Intel channel partner told us afterwards there is a better support number to call… well, we will note that for the future.

[Photo: the DOM]

The technician has identified our problem as the DOM (Disk on Module).  It’s a little module that plugs into the IDE port on the motherboard.

Funny thing about the DOM we noticed after we unplugged it as instructed by the tech… a little notice that reads: “Warranty Void if Removed”.

So now we wait, unfortunately with 16 of 18 servers offline.  The SAN is wonderful, except when it’s down we don’t have anywhere near enough local hard drive space to run all the virtual servers we have…

Maybe we’ll throw a little celebration for the UPS man when he arrives before 10am tomorrow morning.


How bad is it?

So you arrive in the office to learn that none of the virtual servers are responsive… not good.
You visit the console of the VM_Host and it says the drive from the iSCSI connection to the SAN is not responsive… not good.
You look at the console of the SAN and you see a screen full of errors… not good.
You send a screenshot of the errors to Intel… they have to call you back after they look into it… not good.

I thought I would share our lovely errors with you; maybe sharing the pain will help our frustration, or maybe you would like to give your two cents’ worth.

[Screenshot: SAN errors]

