Since it’s been nearly ten years since I wrote my book, Scaling Oracle8i, I thought it was about time that I started writing again. I thought I would start with the new-fangled blogging thing, and see where it takes me. Here goes.
As some will know, I run a small consulting company called Scale Abilities, based out of the UK. We get involved in all sorts of fun projects and problems (or are they the same thing?), but one area that I seem to find myself focusing on a lot is storage. Specifically, the performance and architecture of storage in Oracle database environments. In fact I’m doing this so much that, whenever I am writing presentations for conferences these days, it always seems to be the dominant subject at the front of my mind.
One particular common thread has been the effect of latency. This isn’t just a storage issue, of course, as I endeavoured to point out in my Hotsos Symposium 2008 presentation “Latency and Skew”. Latency, as the subtitle of that particular talk said, is a silent killer. Silent, in that it often goes undetected, and the effects of it can kill performance (and still remain undetected). I’m not going to go into all the analogies about latency here, but let’s try and put a simple definition out for it:
Latency is the time taken between a request and a response.
If that’s such a simple definition, why is it so difficult to spot? Surely if a long period of time passes between a request and a response, the latency will be simple to nail? No.
The problem is that it is the small latencies that cause the problems. Specifically, it is the “small, but not so small that they are not important” ones that are so difficult to spot and yet cause so many problems. Perhaps an example is now in order:
A couple of years ago, a customer of mine was experiencing a performance problem on their newly virtualised database server (VMware 3.5). The problem statement went a little bit like this: Oracle on VMware is broken – it runs much slower on VMware than on physical servers. The customer was preparing a new physical server in order to remove VMware from the equation. Upon further investigation, I determined the following:
- The VMware host (physical server running VMware) was a completely different architecture to the previous dedicated server. The old server was one of the Intel Prescott core type (3.2GHz, Global Warming included at no extra cost), and the new one was one of the Core 2 type with VT instructions.
- Most measurable jobs were actually faster on the virtualised platform
- Only one job was slower
The single job that was slower was so much slower that it overshadowed all the other timings that had improved. Of the four critical batch jobs, the timings looked like this:
Physical server:

- Job 1: 26s
- Job 2: 201s
- Job 3: 457s
- Job 4: 934s
- Total: 1618s

Virtualised server:

- Job 1: 15s
- Job 2: 111s
- Job 3: 208s
- Job 4: 2820s
- Total: 3154s
It can be seen that, if one takes the total as the yardstick, the virtualised server is almost twice as slow. Therein lies the danger of using averages and leaping to conclusions: if Job 4 is excluded, the totals are 684s (physical) vs 334s (virtualised), making the virtualised server more than twice as quick as the physical one.
Upon tracing Job 4 on the VMware platform with Oracle extended SQL tracing (10046 level 8), I discovered that it was making a lot of roundtrips to the database. Hang on, let me give that the right emphasis: A LOT of roundtrips. However, each roundtrip was really fast – about 0.2ms if memory serves. So where’s the problem with that? It’s not exactly high latency, is it? Well it is if you have to do several million of them.
As it turns out, there was something in the VMware stack (perhaps just additional codepath to get through the vSwitch) that was adding around 0.1ms of latency to each roundtrip. When this tenth of a millisecond is multiplied by several million roundtrips (something like 20 million in this case), it becomes a long time: about 2000s, to be precise, which more than wiped out the gains everywhere else. The real answer: make fewer roundtrips, and the new server will be at least twice as fast as the old one.
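The arithmetic behind that conclusion is simple enough to sketch. Here's a minimal back-of-envelope calculation, using the approximate figures quoted above (20 million roundtrips, 0.1ms of added latency each):

```python
# Back-of-envelope: total time added by a small per-roundtrip latency.
# Figures are the approximate ones from the example above.
roundtrips = 20_000_000      # roundtrip count seen in the 10046 trace
added_latency_s = 0.0001     # ~0.1ms of extra latency per roundtrip

extra_time_s = roundtrips * added_latency_s
print(f"Extra time from latency alone: {extra_time_s:.0f}s")
```

Each individual roundtrip is still "fast"; only in aggregate does the tenth of a millisecond become roughly 2000 seconds.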
So what does this have to do with I/O? Plenty. The simple fact is that roundtrips are a classic source of unwanted latency. Latency is the performance killer.
Let’s look at some I/O examples from a real customer system. Note: this is not a particularly well-tuned customer system:
- Latency of 8KB sequential reads: 0.32ms
- Latency of 4MB sequential reads: 6ms
Obviously, the 4MB reads are taking a lot longer (18x), but that makes sense, right? Apart from one thing: The 4MB reads are serving 512x more data in each payload. The net result of these numbers is as follows:
- Time to sequentially read 2.1GB in 8KB pieces: 76s
- Time to sequentially read 2.1GB in 4MB pieces: 24s
So what happened to the 52s difference in these examples? Was this some kind of tuning problem? Was this a fault on the storage array? No. The 52s of time was lost in latency, nowhere else.
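One way to see why the larger reads win is to convert each measured per-read latency into an effective throughput. This is a rough sketch, not a reproduction of the measured 76s and 24s totals (real runs include transfer and array effects that simple division ignores):

```python
# Effective throughput implied by each measured read latency:
# payload size divided by the time one read takes.
KB, MB = 1024, 1024 * 1024

def effective_mb_per_s(payload_bytes, latency_s):
    return payload_bytes / latency_s / MB

small = effective_mb_per_s(8 * KB, 0.00032)  # 8KB reads at 0.32ms each
large = effective_mb_per_s(4 * MB, 0.006)    # 4MB reads at 6ms each
print(f"8KB reads: {small:.0f} MB/s, 4MB reads: {large:.0f} MB/s")
```

The fixed per-request cost is amortised over a payload 512 times larger, so the effective data rate of the 4MB reads is well over an order of magnitude higher, even though each individual read takes 18 times longer.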
Here’s another definition of latency: Latency is wasted time.
So, look out for that latency. Think about it when selecting nested-loop table joins instead of full-table scans. Think about it when doing single-row processing in your Java app. Think about it before you blame your I/O system!
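That last point about single-row processing deserves one more back-of-envelope sketch. The numbers here are illustrative (a hypothetical million-row job, reusing the ~0.2ms roundtrip figure from the VMware example, and an assumed array fetch size of 100):

```python
# Illustrative only: fixed roundtrip latency cost of row-at-a-time
# processing versus array (batched) fetching. All figures hypothetical.
rows = 1_000_000
roundtrip_latency_s = 0.0002   # ~0.2ms per roundtrip, as measured earlier

row_at_a_time = rows * roundtrip_latency_s       # one roundtrip per row
batched = (rows / 100) * roundtrip_latency_s     # 100 rows per roundtrip

print(f"Row-at-a-time: {row_at_a_time:.0f}s, batched: {batched:.0f}s")
```

The work done per row hasn't changed at all; only the number of request/response pairs has. That is the whole point: latency is paid per roundtrip, so the cheapest roundtrip is the one you never make.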