September 24th, 2007 00:00

Solaris10 + ZFS + Sybase IQ + SAN EMC (Clariion / DMX)

Hi,

We are currently defining a database server based on a Solaris 10 + ZFS platform and Sybase IQ, and because of the database storage capacity requirements we expect that many customers will want to integrate it into their existing SAN.

In the last months I have seen plenty of comments on the net stating that an EMC array (or any non-JBOD array) is not a good fit for ZFS. ZFS does really well when it controls the disks: this allows it to self-correct data from two mirrors, decide the location of data blocks, and so forth. Some of these features are available on a SAN, but the generic solution we are defining will not necessarily include a SAN, so ZFS is the way to have these features in place for all customers.

Not being an expert on EMC products, could you please comment on the feasibility of running Solaris 10 + ZFS + Sybase IQ on EMC SAN storage (Clariion / DMX) in a production environment?

What configuration options exist on the EMC side to give ZFS what it requires (that is, control over the disks as in a JBOD configuration)?

Thanks,
Tiglat

2 Intern

 • 

2.8K Posts

September 24th, 2007 02:00

Forgive me but I'm not a Solaris/ZFS expert .. But what do you mean by "give ZFS what it requires (that is, control over disks as in a JBOD configuration)" ??

Which specific features does ZFS need from a JBOD ?? Both DMX and Clariion try to offer the best possible emulation of a disk device to the hosts .. So it's a little unclear to me what ZFS needs .. Could you please explain this idea a bit more ??

Thx in advance... :-)

5 Posts

September 24th, 2007 21:00

Hi Stefano,

Some of the ZFS features I am concerned about are:

ZFS manages the disk caches (or at least tries to), and orders a flush of the write cache when required. If the array cache is in between and the disk cache is not really under the control of ZFS, then data may not always be reliable. ZFS knows this and will wait for the array cache to flush to disk. This usually results in terrible performance, since the array cache does not always flush immediately when asked. The workaround seems to be to configure the array to answer flush requests positively when ZFS issues them, even if they are not actually performed. See more on this issue here:

http://blogs.digitar.com/jjww/?itemid=44

Another issue may be the self-healing and corruption-detection functions for disks. ZFS implements these, and I believe EMC has such features as well. However, from my perspective it looks more interesting to have ZFS do these functions, since ZFS has the top-level view of the data stored. Is this a problem for EMC arrays? How can this be configured?

ZFS also implements a pre-fetching algorithm, and I think EMC implements this as well. Are they compatible? I would initially think it is a better approach to have ZFS make the pre-fetching decisions rather than letting EMC do so, for reasons similar to the previous paragraph.

So basically my question was: what is the best set-up for a ZFS system on EMC, keeping these requirements in mind? What RAID level should be used to provide the best performance for a Sybase IQ system (sequential writes, very much in line with what ZFS does, so no big benefit from a big RAID cache)?

Perhaps a bunch of RAID 0 LUNs matching exactly each physical disk behind the RAID controller, and having the RAID controller answer flush requests from ZFS immediately, without waiting for the data to be on disk? Any better option to simulate a JBOD disk?

Thanks for your comments!
Tiglat


2 Intern

 • 

2.8K Posts

September 25th, 2007 00:00

Unfortunately I can't look at your link from this site ... But IMHO it's difficult for ZFS to know what's going on in the cache of the DMX, since ZFS would need a deep knowledge of the DMX to talk with the microcode inside it and understand whether a track has been destaged to the disks or not. I don't know if it's even possible to know if a specific track is still in the cache of the box or has been destaged to the disks ..

I'll take a look tonight at the blog you posted .. and let you know :-)

-s-

2 Intern

 • 

2.8K Posts

October 1st, 2007 08:00

Still looking ... :-)

2 Intern

 • 

2.8K Posts

October 3rd, 2007 02:00

Finally I've managed to read the blog you cited .. And frankly I don't understand why the blogger is so grateful to the guys that helped him :-) .. Let me explain what I think..

ZFS manages the disk caches (or at least tries to),
and orders flushing of write cache when required. If
the array cache is in between and disk cache is not
really under control of the ZFS system, then data may
not always be reliable. ZFS knows this and will wait
for the array cache to flush to disk. This usually
results in a terrible performance since the array
cache does not always flush immediately when
required. The option seems to be to have the array
configured to answer flush requests positively when
ZFS requests them even if they are not performed. See
more on this issue here:

http://blogs.digitar.com/jjww/?itemid=44


I've read the blog many times and I'm still unable to understand what they are really talking about, since at first the blogger cites the "fsync()" syscall and later talks about the cache of the storage. But if you go to the following page

http://docs.sun.com/app/docs/doc/806-0627/6j9vhfms0?l=it&a=view

you'll clearly read that the fsync() syscall flushes the HOST cache to the storage.
And since fsync() is a standard syscall that almost every Unix OS offers, I think the behaviour will be the same on HP-UX, AIX, Tru64 and Linux. It's different from the "sync()" syscall: "fsync()" returns only when ALL the data has been flushed from the HOST cache to the disks, while "sync()" schedules the flush but actually returns control to the calling program before the flush has completed.
AFAIK each and every OS will trust a SCSI write command as soon as the target acknowledges the write to the initiator :-) .. But I can be wrong.
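
To make the distinction concrete, here is a minimal C sketch of the two calls as just described (the file path is a placeholder): the write() may still sit in the host cache, fsync() blocks until that file's data has reached the storage, and sync() merely schedules a system-wide flush.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical path on a ZFS file system; adjust as needed. */
    const char *path = "/iqdata/fsync_demo";
    static char buf[128 * 1024];          /* one ZFS-record-sized (128KB) write */
    memset(buf, 0xAB, sizeof (buf));

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* After write() the data may still be only in the host cache. */
    if (write(fd, buf, sizeof (buf)) != (ssize_t)sizeof (buf)) {
        perror("write");
        return 1;
    }

    /* fsync() returns only when this file's data has left the host cache
     * and the storage has acknowledged the writes. */
    if (fsync(fd) != 0) {
        perror("fsync");
        return 1;
    }

    /* sync(), by contrast, schedules a system-wide flush and returns
     * to the caller before the flush has necessarily completed. */
    sync();

    (void) close(fd);
    return 0;
}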

And if you look at the "fix" that the blogger posts, you'll notice that it seems to modify something in the behaviour of the non-volatile RAM of the controllers.. It's something strictly related to the architecture of those arrays, not to some strange SCSI command, AFAIK.

IMHO the change they apply to the entry-level arrays the blogger is using (they are all Engenio OEM'd arrays, as the blogger wrote) is only a trick to avoid cache congestion, since Engenio produces small controllers with small caches and -maybe- slow processors .. Since the Engenio boxes have only 2 or 4 GB of cache, the idea is "we'll skip the NVRAM and write directly to the disks", because writing to the NVRAM when cache is short adds overhead to the IO and reduces IOps. And ZFS writes a lot to the ZIL, so delaying writes to the ZIL will slow down the whole system.

If you look at the comments on the blog you'll find a lot more useful information. Just search for "zfs_nocacheflush" :-)
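
(For reference, and as far as I understand those comments, that tunable is set on Solaris 10 by adding the line "set zfs:zfs_nocacheflush = 1" to /etc/system and rebooting; it simply tells ZFS to stop sending cache-flush requests to the storage, so it only makes sense on arrays with battery-backed cache.)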

Every Clariion or DMX has smarter algorithms, faster processors and sometimes A LOT MORE CACHE than the Engenio storage boxes the blogger is talking about.. If you look at Sun's storage page you'll quickly find that Sun has better things to offer (http://www.sun.com/storagetek/disk.jsp)

IMHO this is a non-issue.

Another issue may be self-healing and corruption
detections functions for disks. ZFS implements these,
and I believe EMC has such features as well. However,
from my perspective it looks more interesting to have
ZFS doing these functions since ZFS has the top view
for the data stored. Is this a problem to EMC arrays?
How can this be configured?


The blog doesn't explain this specific feature .. what do you mean by "self-healing" and "corruption detection" ?? A DMX will always try to trick you and let you believe that you have a single disk, while you are really using a bunch of disks, a lot of cache, and up to 128 processors.. The DMX will show you a single device while your data is mirrored on two different drives, or spread across 4 or 8 disks when using RAID5. The DMX will show you only a single device even while it keeps different devices in different boxes (maybe in different sites) synchronized with your beloved data. The DMX will always cross-check what's in the cache against the data on the disks to be sure that all the bits are in their place. If you want, a DMX can even talk directly with your database and cross-check the checksum that Oracle sends to the storage before accepting every single write from the database. A DMX will tell you that everything is all right even when you have a broken disk and your precious data is being mirrored again onto a hot spare.
Those are the features that a DMX offers.
If you think that your host has spare CPU power and memory to accomplish the same tasks, then ask ZFS to do this dirty job. If your CPU power is precious and your memory isn't cheap, then let the DMX do this dirty job :-)

ZFS implements a pre-fetching algorithms, and I think
EMC implements this as well. Are they compatible? I
would initially think it may be a better approach to
have ZFS take pre-fetching decisions rather than
letting EMC do so for similar reasons to previous
paragraph.


Since your host will usually move 4KB or smaller blocks from/to the storage, while the storage uses 32KB or 64KB tracks, the storage will always "automagically" prefetch data from the disks. With Unix hosts the prefetching algorithm of a DMX will rarely be triggered, since sequential IOs are very rare in an "open systems" world. Both prefetching algorithms will work seamlessly together.

So basically my question was, what is the best set-up
for a ZFS system on EMC, having in mind these
requirements? What raid level may be used to provide
best performance on a SybaseIQ system (sequential
writes, very much in line with what ZFS does, so no
big benefit from having big raid cache)?


Do you want the best possible performance ?? On a DMX, configure small hypers (18414 cyls) as 2-Way Mirror devices and form striped metas with 8 members, using devices on different drives in the backend. Map the devices to AT LEAST 4 frontend ports, 2 FAs for every HBA in your host. Avoid SRDF/S (SRDF/A is good). The DMX is good at sequential reads, don't be scared about them ;-)

Perhaps a bunch of Raid 0 LUNs matching exactly each
physical disk behind the raid controller, and having
raid controller answer immediatly to ZFS on flush
requests without waiting for data to be on disk? Any
better option to simulate a JBOD disk?


No need to emulate a jbod .. read above ;-)

-s-

5 Posts

October 3rd, 2007 04:00

That was a bunch of good answers, and I thank you for that! :-D

Just want to comment a few things.

On the fsync issue, the fear was the possibility that the ack signal from the array would take too long. I am not sure how the DMX/Clariion cache handles this. Would it issue an ack to the OS immediately (as soon as the data is in cache), or would it wait for the data to actually be on the disks (and not just in cache) before sending an ack? I would not want to have ZFS waiting for the cache to be written to the disks (especially if the cache is non-volatile), but on the other hand, getting an ack when the data is not yet on the disks might put data consistency at risk.

A big cache is always an advantage, but in this case I have doubts about the actual benefit. ZFS uses mainly sequential writes, and it always intends to write on free space (COW). The average IO size for our calculations is 100KB, since ZFS writes in 128KB data chunks.

At the same time Sybase IQ, being a sort of data-warehouse database, uses a sequential access pattern as well (128KB), with huge peaks of read and write operations (60% / 40%) that often (every 15 min.) reach up to 400MB/s of throughput.

What would be necessary in terms of array cache size to cope with this requirement and remove the limitation of the speed of the disks behind the cache? Probably A LOT. :-)

ZFS would access data in 128KB blocks, so I assume the pre-fetching done by ZFS would put additional demands on the array cache, but I am sure this makes no big difference.

You are right, and I would probably prefer the storage device to do the self-healing (copying to a hot spare), but I am afraid our system is forced to use host mirroring (I cannot get out of this set-up unless EMC guarantees Sybase IQ consistency on mirror splitting and snapshots). In that case, I was just wondering if you see any risk of conflict between the algorithms the array implements and those from ZFS. I believe it should be no big deal; they should either be transparent to the OS, or at least there should be a way of disabling them at the array.

On your suggestion, I agree, particularly considering that we have a sequential access pattern: as many spindles as possible behind the LUN to allow for higher throughput, in a stripe configuration (cache may not be relevant, unless we talk about A LOT of it). However, we need to keep a decent hyper size so as not to face limitations on file sizes (Sybase IQ files could easily be 150GB), or else include many hypers in the stripe.

My initial intention was to publish a bunch of LUNs based on your recommended set-up (perhaps 2 HBAs at 4Gb/s, again for the peak throughput), let ZFS stripe them and mirror them at host level, and keep SRDF/A as an option for DR (this depends on the DB consistency guarantee, which is not sorted out yet). Do you see this as an alternative to your recommendation?

Again thanks a lot, and I will welcome any more comments! :-)

2 Intern

 • 

2.8K Posts

October 3rd, 2007 05:00

I did a little research .. and I think that the blogger is looking in the wrong direction ...

There is a SCSI command (defined in SBC-2) that asks a magnetic disk to move data from its volatile cache to non-volatile storage. This command is issued by the OS -for example- when you are shutting down the system and are going to power down the drives: the OS issues this SCSI command and requests the drives to "save" their cache to the disks before the power is unplugged.
Speaking in "plain SCSI", this CDB (command descriptor block) is an optional SCSI command, since you can issue it only if the target (the disk) advertises itself as having a cache. But a DMX will NEVER tell the host that it has a cache :-) .. So the host shouldn't issue the SYNCHRONIZE CACHE SCSI command to a DMX device. And if the host does send a Synchronize Cache command, the DMX will immediately acknowledge it, without any further action.
So IMHO the fsync/SYNC CACHE issue is a NON-ISSUE :-) (and now I can tell you why :D )
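
For the curious, here is a rough sketch of how a host could hand that command to a device through Solaris's uscsi(7I) SCSI pass-through interface. The device path is a placeholder, the program normally needs root privileges, and it is only meant to illustrate the CDB described above, not something you need to run against a DMX.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <stropts.h>
#include <sys/scsi/impl/uscsi.h>

int main(int argc, char **argv)
{
    /* Raw (character) device, e.g. /dev/rdsk/c2t0d0s2 -- placeholder. */
    const char *dev = (argc > 1) ? argv[1] : "/dev/rdsk/c2t0d0s2";

    int fd = open(dev, O_RDONLY | O_NDELAY);
    if (fd < 0) { perror("open"); return 1; }

    /* SYNCHRONIZE CACHE (10), opcode 0x35, defined in SBC-2.
     * All other CDB bytes left at zero: flush the whole cache. */
    unsigned char cdb[10];
    (void) memset(cdb, 0, sizeof (cdb));
    cdb[0] = 0x35;

    struct uscsi_cmd cmd;
    (void) memset(&cmd, 0, sizeof (cmd));
    cmd.uscsi_cdb     = (caddr_t)cdb;
    cmd.uscsi_cdblen  = sizeof (cdb);
    cmd.uscsi_timeout = 60;             /* seconds */
    cmd.uscsi_flags   = USCSI_SILENT;   /* suppress kernel warnings; the
                                           command has no data phase, so no
                                           buffer is set */

    if (ioctl(fd, USCSICMD, &cmd) < 0)
        perror("USCSICMD(SYNCHRONIZE CACHE)");
    else
        (void) printf("target acknowledged, scsi status 0x%x\n",
            cmd.uscsi_status);

    (void) close(fd);
    return 0;
}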

If ZFS uses big IOs instead of smaller ones, it will make better use of both the HBA (bigger SCSI frames bring bigger throughput) and of the DMX cache, since each and every read/write from the host will use ALL the content of the fetched track.
You talked about writes .. I'm pretty sure that ZFS is good at writing sequentially .. but prefetching works on reads .. and reads will hardly be sequential :-)

If you need big throughput, forget about faster HBAs and try to use a bigger number of slower HBAs .. What happens if you have a 2-lane super-highway where you can run at 200 mph and it gets congested ?? You'll go VEEERY slowly and you'll have only 2 longer queues. Isn't it better to have a 4-lane normal highway where you can run only at 100 mph .. but when it gets congested you have 4 different (and shorter) queues ?? ;-)

Don't worry about the cache .. it's only part of the equation. If you build mirrored devices you are able to read from two different devices at a time .. And if you build striped metavolumes (inside the storage) you'll be able to read from 16 different drives at the same time. At roughly 100 MB/s per spindle, that gives at least 1600 MB/sec of throughput from a single metadevice .. a lot more than what you need .. and you can use more devices and get even bigger throughput ;-)

Host-level mirroring has nothing to do with hot spares.. You don't need host-level mirroring to protect your data: the DMX itself will give you the mirroring AND the hot spare. And if you want, you can even use host-level striping and mix it with the striping that our metavolumes offer. Just make sure NOT to use the same HDAs in the backend when you stripe at the OS level and at the metavolume level ;-)

If you (your company) don't trust the ability of TimeFinder and SRDF/A to give you "consistent" images of your data, that's a completely different story ;-) .. We can talk about that, but it's a huge chapter ... :-)

BTW you talk about "a lot of cache" .. Do you think that 256 GB of cache is "a lot" ?? :-)


-s-

2 Intern

 • 

2.8K Posts

October 3rd, 2007 06:00

On the fsync issue, the fear was the possibility that
the ack signal from the array would take too long. I
am not sure how the DMX/Clariion cache would handle
this. Would it immediately (as soon as data is in
cache) issue an ack to OS, or would it wait for the
data to be actually in the disks (and not just in
cache) before sending an ack? I would not want to
have ZFS waiting for the cache to be written to disks
(specially if cache is not volatile), but on the
other hand, getting an ack when data is not yet at
the disks might put at risk data consistency.


Both DMX and Clariion share the same behaviour ..
They both will acknowledge your writes as soon as the data is in the cache of the storage.

They both put a big effort into preserving their own cache. In the DMX and Symmetrix you find batteries that will keep the storage up and running for 15 minutes after your datacenter loses power, and while the datacenter is out of power the storage will write ALL the content of the cache to the disks; it will power off leaving your data on the disks. The newer DMX3 and DMX4 arrays still have batteries but take a different approach: at power-off the cache is saved to a dedicated area of the drives (VAULT devices) and is loaded back into cache at power-on.
In both cases data is ALWAYS on disks at power-off.

5 Posts

October 3rd, 2007 06:00

Yes, 256GB = A LOT, :) much more than we are willing to dedicate to such a system. That's why I meant that cache size would not make a substantial difference for sequential access patterns (5GB to 15GB).
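
(A rough back-of-the-envelope check using the numbers above: 40% of a 400MB/s peak is about 160MB/s of writes, so even a 15GB write cache would absorb only about a minute and a half of such a burst before the spindles behind it have to sustain the full write rate anyway.)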

Nothing against mirroring and replication at the SAN level :), actually I am very much interested in making use of it whenever possible. Unfortunately other constraints on this system force us to use host mirroring, snapshots, and striping at the ZFS level as the basis for this analysis.

Now, considering the storage capacity and throughput requirements, instead of using a DAS solution (where a JBOD config provides the best performance in our experience) I was considering a SAN, and I see there are plenty of possibilities to find a satisfying solution, and that the SAN will not be a problem but rather a solution for many issues, provided we use common sense when defining all the variables.

Thanks for the support. What goes around comes around ;)