Customer testimonials are not easy to recount or to write. They come out either too ‘soapy’ or too critical; a bit forced either way.
So let me try writing one without quite doing so.
A little over a year ago, we were having big performance problems. Everything was at a standstill; nothing was moving either way. Our databases were big, so big: about 5 to 6 TB.
And so, what to do but query the Microsoft SQL Server ‘wait stats’? What was causing the problem? We settled on I/O bottlenecks.
So, was it writes or reads? Mostly reads; I can’t recollect how many MB/sec, but really quite high.
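For readers who want to retrace this step, the checks we ran boil down to two standard SQL Server DMV queries. The exact queries we used are long gone, so take this as a sketch:

```sql
-- Top wait types by accumulated wait time (filtering a few benign sleeps):
SELECT TOP 10 wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;

-- Per-file I/O volume, to see whether reads or writes dominate:
SELECT DB_NAME(database_id) AS database_name, file_id,
       num_of_bytes_read, num_of_bytes_written
FROM sys.dm_io_virtual_file_stats(NULL, NULL)
ORDER BY num_of_bytes_read DESC;
```

If PAGEIOLATCH waits top the first list and reads dwarf writes in the second, you land roughly where we did: a read-heavy I/O bottleneck.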
Met with a few gentlemen: Steve (Storage), Chris, and another engineer.
So they took the time to explain the topology of our Storage Filers to me: via Fibre Channel, we are directly connected to the Filers. There are no storage switches in between, so that couldn’t possibly be the problem. Almost bare metal, as you might say; none of this virtual stuff.
So I nodded with a knowing head. But I still did not know what to do to get better performance.
On the database side, we have an Active database and a Passive database. They are on two different machines, and each machine is connected to its own Storage Filer.
The more we thought about it, the more we said: let us switch the Filer each database is connected to. We made the change; nothing substantial came of it in terms of performance throughput.
Storage – Flash Memory
So things were bad and stayed bad. Then Steve went away for a while and came back to let us know that he might be able to help.
He said we should run perfstat and collect performance metrics, and so we did. Upon reviewing the numbers, he said that adding Flash Memory would help. As we were read-heavy, Flash Memory would keep more of our data in the Filer’s cache, and we would not have to go to the Filer’s physical disks as much.
So we took his advice and added Flash Memory to the Filer. Since we knew we were memory-bound on the hosts as well, we took the time to add memory to our database hosts.
We were a bit constrained as to how much memory we could add to the database hosts, the reason being that we did not want to throw away any of the existing memory SIMMs. But thankfully, the manager was committed, and he ‘maxed out’ the boxes.
Storage – Additional LUNs
Things were good; performance was quite a bit better. But the nightly out-of-disk calls remained. We were running out of database log room.
The reasons were related to Database Mirroring problems, device driver issues, and the like.
So, what to do: update device drivers, install database service packs, etc.
And, cup in hand, go begging for more storage.
The manager ponied up the money again, and we went to each host and added secondary local disks.
As I thought about the best way to divvy up the new storage, Steve was roaming around, and I brought up the fact that we were about to add it. I wanted to know whether it would be better to use the new local disks or to have him carve out new LUNs on the Filers.
Steve said to let him go back and look at how much storage we had available on each host.
A week or two later, he came back with a nicely prepared document. It had charts and hard facts about our current setup in each environment: how much storage there was in total, how much was already allotted, and how much more remained.
The document was so beautifully prepared I walked it in to my manager. It is always easy to go to management with something you know they will get behind quickly.
Management said yes immediately, and since no new purchase was needed, we could start right away.
All I needed to do was write a deployment plan & schedule.
I did so, then got on the Internet and researched each Storage Filer command that we would be running. Easy, as Steve had each step written out already.
Our rollout of the new LUNs went very well. We took care to partition-align the new drives, an OS requirement for proper I/O performance.
There was zero downtime.
But then came some alarms. Our support staff started getting alerts that a volume was almost full. This was due to me not following one of our internal guidelines; Steve had been aware of that guideline and had taken it into consideration in his original proposal.
With the Log LUN provisioning having gone so well, we created new LUNs for TempDB as well.
This meant that our three database pillars (Data, Log, and TempDB) each had their own LUN.
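For anyone retracing this, moving TempDB onto its own LUN is a one-time change. The logical file names below are SQL Server’s defaults, and the T: drive letter is just a stand-in for the new LUN:

```sql
-- Point TempDB's data and log files at the new LUN; the change takes
-- effect on the next SQL Server service restart.
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, FILENAME = 'T:\TempDB\tempdb.mdf');
ALTER DATABASE tempdb
MODIFY FILE (NAME = templog, FILENAME = 'T:\TempDB\templog.ldf');
```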
Slow Performance on a couple of hosts
With more tweaking here and there, things were generally good. But we continued to have noticeable problems on one or two hosts.
Trying to determine why these hosts were slow became all-consuming.
We spent months in this holding pattern; nothing we tried budged the problem.
Opened Ticket with Vendor
We opened a ticket with the Vendor for this one host. While the ticket was open, I read, and read some more, on what we could possibly do to help storage I/O.
Ideas came up, such as using compression, which is a bit of an industry term for what marketing folks are now calling de-duplication.
We tried compressing some of our database files, and that helped, but still not nearly enough on these troublesome hosts.
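I no longer remember whether this was NTFS file compression or SQL Server’s own row/page compression; if it was the latter, the statement would have looked something like this (the table name is made up):

```sql
-- Rebuild a large table with page compression to shrink its on-disk
-- footprint and, on a read-heavy workload, reduce the I/O per query.
ALTER TABLE dbo.SomeBigTable REBUILD WITH (DATA_COMPRESSION = PAGE);
```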
The Storage Vendor called a meeting, and we attended. In a nutshell, they held to the position that these Filers were configured just like the others, and that there was nothing in their setup that might contribute to slowness. They suggested the problem was probably related more to something going on with our database hosts, and noted that in some cases we were bumping into the upper limit of our queue length.
We all knew that the only way to get more queue length was to drop in yet another LUN and split our database files between the current Data LUN and the new one(s).
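Splitting a database across LUNs is just a matter of adding a data file on the new drive; SQL Server then spreads new allocations proportionally across the files in the filegroup. The names and sizes here are illustrative, not our actual values:

```sql
-- F: is the freshly carved LUN; the new file joins the PRIMARY filegroup,
-- so new allocations stripe across both the old and the new file.
ALTER DATABASE OurBigDatabase
ADD FILE (
    NAME = OurBigDatabase_data2,
    FILENAME = 'F:\Data\OurBigDatabase_data2.ndf',
    SIZE = 100GB,
    FILEGROWTH = 10GB
);
```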
Hearing this yet again confirmed that there was nothing we had missed while configuring the database or the storage hosts.
Nowhere else to go
A great leader said words that have stayed with me for a while here — “Having nowhere else to go, but on my knees”.
And that was me: nowhere else to go but to the database hosts.
I stopped a couple of database jobs so that I could better examine our system. As we continued stopping more jobs and looking ever deeper, we noticed that the system was still chewing up a lot of I/O, even though the jobs still running were not the ones we had once thought the most expensive.
As we looked more closely at these newly found expensive queries, we noticed what is best described as anomalous behavior in otherwise simple queries.
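We surfaced those expensive queries with the plan-cache DMVs; something along these lines (a sketch, not our exact query) lists the worst offenders by logical reads:

```sql
-- Top cached statements by total logical reads:
SELECT TOP 10
       qs.total_logical_reads,
       qs.execution_count,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                       WHEN -1 THEN DATALENGTH(st.text)
                       ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;
```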
We planned on calling our Database Vendor to possibly gain more clarity.
But the helping hand of God came sooner than we could have asked. We were able to fix the SQL.
I am sure Steve and the other engineers helped focus and constrain our thinking.
In the last few months, there were many things we considered doing. Some avenues were not fully tried, and others only half-heartedly. But I think somehow God really wanted this problem to be one that He would solve with five loaves of bread and two fishes.
All along, I have held this little feeling that if we had tried a few different things, we would have gotten further. But again, nothing helped until we gave the five loaves of bread and two fishes.
The whole silliness is that I can no longer hold on to the sad feeling of how we could have solved it through engineering might.
All I have to say is that God had His way, and I have to release the grudge to Him and let Him wash it away. And He is looking down and saying, “you silly blogger.”