In March, we launched our storage utilization and efficiency guarantee: customers can use at least 80% of the raw storage they buy – and we're confident enough in that number to put a guarantee behind it.
It is important to realize that we're not just guaranteeing that customers can cram bits onto platters; we're guaranteeing that they can utilize the storage they paid for – not only can they store data on the system, they can also meet their performance and reliability requirements while doing so. This is a key distinction when comparing Isilon storage to competitors such as NetApp.
Regardless of implementation (which can differ significantly), every modern storage system divides its overhead into similar classes:
- FileSystem Overhead: This is the space required for any file system to function correctly; OneFS needs to use storage to keep track of the location of files and directories, just as WAFL does. Unless you’ve done something really wrong in designing your system, this overhead should be minimal.
- Protection Overhead: This is the space required to provide redundancy through some form of forward-error-correcting code. NetApp puts the bulk of its engineering effort behind RAID-DP, which provides hardware-based protection against two simultaneous drive failures – WAFL itself is unaware of the per-volume protection mechanism. OneFS, on the other hand, has designed hardware-based protection mechanisms out of the storage system entirely. Instead, OneFS relies on FlexProtect, a software implementation of Reed-Solomon codes that protects against up to four simultaneous drive failures. (A rough sketch of the overhead arithmetic follows this list.)
- Snapshot Reserve: This is the space pre-provisioned for snapshots. The actual space consumed by snapshots will vary with the workflow (i.e. how much 'snapshotted' data gets modified), but whereas WAFL requires the user to specify a snapshot reserve up front, OneFS does not require any snapshot reservation.
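To make the protection-overhead category concrete, here is a minimal sketch of the arithmetic. The group and stripe widths below are illustrative assumptions, not NetApp's or Isilon's actual defaults; the point is simply that parity consumes a predictable fraction of raw space in either scheme.

```python
# Illustrative only: real RAID-DP group sizes and FlexProtect stripe widths
# depend on hardware and configuration; the widths below are assumptions.

def protection_overhead(data_units, parity_units):
    """Fraction of raw capacity consumed by protection in one group/stripe."""
    return parity_units / (data_units + parity_units)

# RAID-DP-style group: double parity tolerates two simultaneous drive failures.
raid_dp = protection_overhead(data_units=14, parity_units=2)   # 2/16 = 12.5%

# Reed-Solomon N+4 stripe: tolerates up to four simultaneous drive failures.
n_plus_4 = protection_overhead(data_units=16, parity_units=4)  # 4/20 = 20.0%

print(f"Hypothetical RAID-DP 14+2 overhead: {raid_dp:.1%}")
print(f"Hypothetical N+4 16+4 overhead:     {n_plus_4:.1%}")
```

Either way, protection overhead is a known, bounded slice of raw capacity – which is exactly why it isn't the interesting part of the utilization story.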
These are the major categories of overhead in a storage system, with the rest of the space being available to user data. Is this the entire story? Not even close.
Based on NetApp's own customer data, the space available to the customer averages 60% of the raw storage, while the amount actually used averages only 37%. Why would a NetApp customer leave so much of their storage unused?
Here are the two key reasons I've seen when speaking with our customer base (if you know of others, please share):
- Performance: Once a NetApp volume is configured and deployed, it is backed by a particular aggregate (with a set number of disks), so it is very easy to under- or over-provision the storage. Under-provision and you won't have enough disk bandwidth to serve your data set – you'll have to break the data set up across multiple volumes, leaving some of the storage unused. You may also exceed the bandwidth a single filer can serve – again, you'll have no choice but to break up the data set and move part of it to another filer, which leaves still more storage unused. Yes, an administrator could come along and try to fill all the nooks and crannies left behind, but who has time for that?
- Flexibility: Just as performance can force you to split your data, so can your organization's structure, your projects, your customers, and so on. As an example, say you put 10 customers on a single 16 TB volume, using qtrees to partition the space. Before long, some customers will have exceeded their 1.6 TB of space and need to be moved off to another volume – leaving under-utilized storage behind. Another prime example is expanding your data set: if you need to add a project that consumes 10 TB but you only have 8 TB free, you're going to deploy an entirely new volume – wasting 14 TB!
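Spelling out the arithmetic behind that last example (all sizes are the hypothetical figures from above):

```python
# Hypothetical figures from the example above: a new 10 TB project won't fit
# in the 8 TB left on the existing 16 TB volume, so a second 16 TB volume
# gets deployed just for that project.
volume_size_tb = 16
free_on_old_tb = 8            # stranded: too small for the project
project_tb     = 10

free_on_new_tb = volume_size_tb - project_tb      # 6 TB idle on the new volume
stranded_tb    = free_on_old_tb + free_on_new_tb  # 8 + 6 = 14 TB

print(f"Stranded capacity: {stranded_tb} TB")     # 14 TB of paid-for, unused space
```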
As far as I'm concerned, these are really instances of the classic bin-packing problem: your data consists of files with particular sizes and performance characteristics, while NetApp (and other traditional SAN/NAS systems) gives you a set of fixed buckets (volumes, LUNs, and filers) and forces you to try to maximize your use of them – which is time-consuming, costly, and ultimately (as NetApp's own data shows) ineffective. NetApp will point to thin provisioning, cloning, and deduplication as solutions to this, which I will address in more detail in another post, but for now I will ask two questions:
- What happens if your data diverges/expands such that the clone/deduplication/provisioning method is no longer effective?
- Can you effectively utilize this data in terms of end-user performance?
My experience shows these techniques will help when initially deploying a system (or for very particular workloads), but over time (and in general) they will break down.
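To see the bin-packing analogy in code, here is a toy first-fit packer. The dataset sizes and the 16 TB volume size are made up for illustration, and real placement decisions also have to weigh performance and organizational boundaries, not just capacity – which only makes the packing worse.

```python
# Toy first-fit bin packing: place each data set on the first fixed-size
# volume ("bin") with room, opening a new volume when none fits.
# All sizes in TB and purely illustrative.

def first_fit(datasets, volume_size):
    volumes = []                       # each volume is a list of data set sizes
    for size in datasets:
        for vol in volumes:
            if sum(vol) + size <= volume_size:
                vol.append(size)
                break
        else:                          # no existing volume has room
            volumes.append([size])
    return volumes

datasets = [10, 9, 9, 12, 5]           # hypothetical projects, in TB
volumes = first_fit(datasets, volume_size=16)

used = sum(datasets)                   # 45 TB of actual data
raw = 16 * len(volumes)                # 64 TB of provisioned volumes
print(f"{len(volumes)} volumes, {used}/{raw} TB used ({used / raw:.0%})")
```

Even in this toy version nearly a third of the provisioned capacity sits idle, and that's before performance constraints force further splits.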
Isilon and OneFS give you a single, large bucket – every item placed into the bucket is striped across the system, balancing performance and maximizing utilization – a bucket which grows seamlessly, transparently, and effortlessly.
Who has time to pack bins?