
March 16th, 2012 00:00

How to Tune the 'Log File Sync' Wait Event

I am looking at an AWR report with 'log file sync' as one of the top wait events. Could anyone help explain the root causes of these 'log file sync' waits? Any guidelines to resolve this problem?

Thanks in advance!

643 Posts

March 18th, 2012 19:00

Below are my findings from the internet; the root causes of ‘log file sync’ waits essentially boil down to a few scenarios. The following is not an exhaustive list, by any means; a rough triage sketch follows the list.

  1. LGWR is unable to complete writes fast enough for one of the following reasons:
    1. Disk I/O performance to log files is not good enough. Even though LGWR can use asynchronous I/O, redo log files are opened with the DSYNC flag and buffers must be flushed to the disk (or at least, written to the disk array cache in the case of SAN) before LGWR can mark a commit as complete.
    2. LGWR is starving for CPU resource. If the server is very busy, LGWR can starve for CPU too. This will lead to slower response from LGWR, increasing ‘log file sync’ waits. After all, these system calls and I/O calls must use CPU. In this case, ‘log file sync’ is a secondary symptom and resolving the root cause for high CPU usage will reduce ‘log file sync’ waits.
    3. Due to memory starvation issues, LGWR can be paged out. This also can lead to slower response from LGWR.
    4. LGWR is unable to complete writes fast enough due to file system or Unix buffer cache limitations.
  2. LGWR is unable to post the processes fast enough, due to excessive commits. It is quite possible that there is no starvation for CPU or memory, and that I/O performance is decent enough. Still, if there are excessive commits, LGWR has to perform many writes/semctl calls, and this can increase ‘log file sync’ waits. This can also result in a sharp increase in redo wastage statistics.
  3. IMU undo/redo threads. With private strands, a process can generate a few megabytes of redo before committing. LGWR must write all of the redo generated so far, and processes must wait on ‘log file sync’, even if the redo generated by other processes is small.
  4. LGWR is suffering from other database contention such as enqueue waits or latch contention. For example, I have seen LGWR freeze due to CF enqueue contention. This is a possible scenario, albeit an unlikely one.
  5. Various bugs. Oh yes, there are bugs introducing unnecessary ‘log file sync’ waits.
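
To narrow down which of these scenarios you are hitting, a quick first check is to compare the average ‘log file sync’ wait with the average ‘log file parallel write’ wait (LGWR's own I/O) and the commit rate from the same AWR report. Here is a minimal Python sketch of that triage; the 0.8 ratio, the 1000 commits/sec threshold, and the sample figures are my own illustrative assumptions, not Oracle-documented limits.

    # Rough triage of 'log file sync' from two AWR wait averages (ms) and the commit rate.
    # Thresholds are illustrative assumptions, not Oracle-documented limits.

    def triage_log_file_sync(avg_sync_ms, avg_lgwr_write_ms, commits_per_sec):
        """Suggest which scenario from the list above is the most likely starting point."""
        if avg_lgwr_write_ms >= 0.8 * avg_sync_ms:
            # Most of the sync time is LGWR's own write: scenario 1 (I/O, file system).
            return "LGWR write I/O dominates - look at redo log placement and storage"
        if commits_per_sec > 1000:
            # Sync time far exceeds the write time and commits are very frequent: scenario 2.
            return "Excessive commits - review the application's commit rate"
        # Otherwise suspect LGWR CPU/memory starvation, IMU flushes, contention, or bugs.
        return "Large gap between sync and write times - check LGWR CPU/memory and other waits"

    # Example with made-up AWR figures:
    print(triage_log_file_sync(avg_sync_ms=12.0, avg_lgwr_write_ms=2.0, commits_per_sec=2500))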

46 Posts

March 19th, 2012 04:00

Something that also might happen is not even contention but simple I/O queueing.

For example, you have 2 HBAs and 20 LUNs, and the redo logs are on the same filesystems or ASM disk groups as the rest of the data. Assume we have PowerPath or another load balancer, so each of the 20 LUNs is visible on both paths.

DBWR wants to flush a bunch of dirty SGA buffers and issues a series of writes. So there might be, say, 10 I/Os on each LUN queue, meaning 200 total I/Os distributed over 2 HBA queues, 100 each. It might be that the storage backend is servicing the I/O requests pretty quickly, say 1 ms per write, and that the utilization as measured on the storage side is low (say, less than 10% at all levels). No bottlenecks.

Now the redo writer needs to complete a commit and wants to flush 512K worth of data in 64K writes (so 8 writes of 64K each). These writes go to the back of the queues and have to wait for the DBWR writes to finish first (unless some optimization algorithm picks them up earlier, but I know of no such thing).

So now the I/O wait time for the redo writes is at least 10 x 1 ms (the time it takes to handle the other 10 requests per LUN), plus the time spent in the HBA queue, plus the service time of the redo write itself. I bet you end up with 12 ms or more for the redo write, where you would expect 1-2 ms because of the storage write cache.
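
Just to make that back-of-envelope math concrete, here is a tiny Python sketch of the FIFO queueing effect described above, using the illustrative figures from this post (the HBA queue time is an extra assumption):

    # Simple FIFO model of a redo write queued behind DBWR writes on a shared LUN.
    # All figures are the illustrative ones from the scenario above.

    queued_dbwr_ios = 10      # DBWR writes already sitting in the LUN queue
    service_time_ms = 1.0     # backend service time per write
    hba_queue_ms = 1.0        # assumed extra time spent waiting in the HBA queue
    redo_write_ms = 1.0       # service time of the redo write itself

    # The redo write waits for everything ahead of it, then gets serviced itself.
    estimated_latency_ms = queued_dbwr_ios * service_time_ms + hba_queue_ms + redo_write_ms

    print(f"estimated redo write latency: {estimated_latency_ms:.0f} ms "
          f"(vs the 1-2 ms you would expect from the write cache alone)")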

If you were to isolate the redo logs on completely separate I/O channels (at least separate LUNs, but maybe also separate HBAs), you would not have this effect. As the DB writes are background writes, I would always give priority to redo writes. A few ms more for a DB buffer flush is not a problem, but an increased redo write delay can be.

Not sure if my interpretation is completely correct but this is what I have noticed several times at customers. Unfortunately I haven't been able to verify by changing the configuration in those cases.

So as soon as the redo write has to wait in line anywhere in the process (Host LUN queue, HBA, storage frontend, you name it) you will see log file sync time going up (and this is completely unnecessary IMO).

Thoughts?

225 Posts

March 19th, 2012 19:00

Log file I/O tends to be sequential I/O; in most cases, the storage write cache is able to handle it well if the storage is running under reasonable pressure and without issues.

But you also need to speak with the application developers to see how they handle commits in the system; committing too often can cause a lot of problems.
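
To illustrate the commit-frequency point, here is a hedged Python sketch contrasting a commit-per-row pattern with batched commits, using the cx_Oracle driver; the table, columns, batch size, and connection details are made up for the example.

    # Illustrative only: the table, columns, and connection details are assumptions.
    import cx_Oracle

    conn = cx_Oracle.connect("app_user", "app_password", "dbhost/ORCL")  # hypothetical DSN
    cur = conn.cursor()
    rows = [(i, f"payload-{i}") for i in range(100_000)]

    # Anti-pattern: commit per row -> one 'log file sync' wait per row for the session.
    # for r in rows:
    #     cur.execute("INSERT INTO demo_t (id, val) VALUES (:1, :2)", r)
    #     conn.commit()

    # Better: batch the work and commit once per batch; far fewer LGWR posts.
    batch_size = 10_000
    for start in range(0, len(rows), batch_size):
        cur.executemany("INSERT INTO demo_t (id, val) VALUES (:1, :2)",
                        rows[start:start + batch_size])
        conn.commit()      # one 'log file sync' per 10,000 rows instead of per row

    cur.close()
    conn.close()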

27 Posts

March 22nd, 2012 08:00

Here is another situation that caused the log file sync wait event to pop up in the Top 5 Events, with the Avg Wait Time above 6 ms AND the % Total Call Time at 10% or more... (kind of a tipping point for me).
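
As an aside, that tipping point is easy to script if you are scanning a pile of AWR reports; here is a minimal Python sketch with the thresholds from above (the example figures are made up):

    # The 'tipping point' described above: average 'log file sync' wait over 6 ms
    # AND the event accounting for 10% or more of total call time.

    def worth_investigating(avg_wait_ms, pct_total_call_time):
        return avg_wait_ms > 6.0 and pct_total_call_time >= 10.0

    print(worth_investigating(avg_wait_ms=8.5, pct_total_call_time=12.0))  # True
    print(worth_investigating(avg_wait_ms=3.0, pct_total_call_time=15.0))  # False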

For a large and active SAP DB running on a DMX-4, there was a convergence of a few activities dramatically impacting the Symm cache (we found this out when we, EMC and the customer, sat down and discussed the issue and the nature of redo log I/O to the Symm).

For this SAP DB, they were doing pretty large bulk loads while, at the same time, backing up the DB and refreshing 5-6 TF Clones for other DBs, all while still servicing normal business transactions for this DB. So the Symm cache was saturated and working overtime to flush dirty blocks (i.e., write pendings were being hit). This caused elongated log buffer flushes and therefore log file sync events.

And to top it off, they could not put any more cache in the Symm, so their workaround was to offload some of the workload to another array.

225 Posts

March 22nd, 2012 20:00

Matt, yours is a very good example of a backend storage bottleneck causing high waits.

Have you checked the CPU utilization of your DMX? If it is not too high, say <= 50-60%, you might consider placing EFDs on it to absorb dirty pages more quickly.

27 Posts

March 23rd, 2012 06:00

Eddie, this is an old issue, and I just posted my reply in response to the community thread on log file sync.

