March 4th, 2018 18:00

DellEMC ScaleIO SDS shows 250x storage bloat for random write, along with performance issues

I downloaded the “trial” version of ScaleIO, set it up as a 3-node cluster, and started running small tests to see how it performs and scales. I noticed a bizarre behavior, which led to this post.

First of all, I want to encourage others to try the same test themselves and hopefully prove me wrong, because this looks like a huge architectural issue for any enterprise storage product. I am sharing my test details, along with the vdbench script I used to reproduce the issue.

Issue: Huge storage bloat happens when performing random writes on a large disk, and the system never recovers from it. In short, I performed 5 GB worth of random writes, and ScaleIO ended up consuming 2.4 TB of storage instead of ~10 GB (including the replica), and it never recovered from it. That is equivalent to nearly 250x bloat.
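For reference, here is the arithmetic behind that figure, written out as a small sketch (all numbers are the ones reported above; whether it comes out as ~250x or ~500x depends on whether the replica is counted in the baseline):

# Quick check of the bloat factor, using the numbers reported above.
written_gb = 5                     # data written by vdbench (maxdata=5G)
expected_gb = 2 * written_gb       # ~10 GB once the second copy is included
consumed_gb = 2.4 * 1024           # ~2.4 TB reported as consumed by ScaleIO
print(round(consumed_gb / expected_gb))   # ~246x vs. the expected footprint incl. replica
print(round(consumed_gb / written_gb))    # ~492x vs. the raw data written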

Setup:

1. Downloaded and installed “trial” version of ScaleIO with a default 3-node/3-ScaleIO Data Server (SDS) setup

2. Added 6 physical HDDs (4 TB each) to each SDS. Total ScaleIO capacity: 65.5 TB, unused capacity: 43.2 TB, spare capacity: 22.3 TB

3. Each SDS had: 128 GB RAM and 32 cores (with hyper-threading)

4. The client/SDC (ScaleIO Data Client) used for this test was the same node as one of the 3 SDSs

5. Created a 10 TB block disk (thinly provisioned) and mapped only 1 SDC to it

Test:

Ran a 4K, 100% random write workload using only 1 SDC against the 10 TB disk. I used vdbench for this, with the maxdata parameter limiting total writes to 5 GB. This is what the vdbench manifest looked like:

debug=25

*SD:    Storage Definitions

sd=sd1,lun=/dev/scinia,threads=16,openflags=o_direct

*WD:    Workload Definitions

wd=WD_4K_Write_Random_1,sd=sd*,rdpct=0,seekpct=100,xfersize=4k

*RD:    Run Definitions

rd=run_4K_Write_Random_1,wd=WD_4K_Write_Random_1,iorate=max,maxdata=5G,elapsed=120000,interval=1,pause=5

And this is how I ran vdbench test:

#!/u/tools/bin/bash

/u/tools/bin/vdbench -f manifest -o /root/`hostname`/

vdbench output after the test started:

Mar 04,2018  interval        i/o   MB/sec   bytes   read     resp     read    write     resp 

                             rate  1024**2     i/o    pct     time     resp     resp      max  

16:25:53.052         1     541.00     2.11    4096   0.00   26.925    0.000   26.925  356.967  

vdbench output after the test ended (these are average values):

17:12:21.045 avg_2-2789    470.46     1.84    4096   0.00   34.006    0.000   34.006  642.025

I saw 2 issues with this test:

1. Write performance: I only got an average of 1.84 MB/sec from this run. That by itself is pretty bad, but I won’t go into too much detail here, because one could argue that hardware improvements (SSDs, network bandwidth, CacheCade, etc.) would make it better. Anyone who can only afford HDDs should definitely look into this.

2. Storage bloat: ScaleIO reported 2.4 TB of capacity consumed even though I wrote only 5 GB, using a single block disk and a single client. Even after I waited more than 12 hours, this number never came down. It doesn’t make sense for a thinly provisioned 10 TB disk to consume 2.4 TB after just 5 GB has been written to it.

Also, I created such a large disk only to show the bloat clearly. It should be just as easy to reproduce by creating a large number of small disks and performing random writes to them (a sketch of one way to script that follows). In a production environment with a large number of disks, this issue can lead to storage bloat and severe under-utilization of capacity.
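If anyone wants to try that many-small-disks variant, a small generator like the one below could emit the vdbench parameter file. It is only a sketch: the /dev/scini* device names, the count of 8 volumes, and the output file name "manifest_multi" are assumptions and must be adjusted to match your own SDC mapping.

# Hypothetical helper: emit a vdbench parameter file that runs the same 4K
# random-write workload against several mapped ScaleIO volumes at once.
# The /dev/scini* names and the volume count are assumptions.
devices = ["/dev/scini" + c for c in "abcdefgh"]   # e.g. 8 small volumes

lines = ["debug=25", ""]
for i, dev in enumerate(devices, start=1):
    lines.append(f"sd=sd{i},lun={dev},threads=16,openflags=o_direct")
lines += [
    "",
    "wd=WD_4K_Write_Random_1,sd=sd*,rdpct=0,seekpct=100,xfersize=4k",
    "",
    "rd=run_4K_Write_Random_1,wd=WD_4K_Write_Random_1,iorate=max,maxdata=5G,elapsed=120000,interval=1,pause=5",
]

with open("manifest_multi", "w") as f:
    f.write("\n".join(lines) + "\n")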

March 6th, 2018 04:00

It depends on the average block size and also on how much space has been over-provisioned. In this case it has been over-provisioned by 2000x, which is extremely high. If you were writing to a 10 GB volume (which would be rounded up to 16 GB), you would see much less "bloat", because those 4K blocks would have a much higher probability of being written within the same 1 MB chunk. Likewise, if you were using a 1 MB block size (I know, uncommon), you would see 0% bloat.
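As a rough back-of-the-envelope model of that probability argument (a sketch, not anything from the ScaleIO documentation): assume a full 1 MB chunk is allocated the first time any 4K write lands in it, and that the writes are spread uniformly over the volume. The expected number of chunks touched by N writes over C chunks is C * (1 - (1 - 1/C)^N):

# Expected 1 MB chunks touched by uniform random 4K writes (one copy).
def expected_alloc_gib(volume_gib, written_gib=5, io_kib=4, chunk_kib=1024):
    chunks = volume_gib * 1024 * 1024 // chunk_kib   # C: 1 MB chunks in the volume
    writes = written_gib * 1024 * 1024 // io_kib     # N: 4K writes issued
    touched = chunks * (1 - (1 - 1 / chunks) ** writes)
    return touched * chunk_kib / (1024 * 1024)       # GiB allocated for one copy

for vol_gib in (10 * 1024, 16):                      # the 10 TB volume vs. a 16 GB volume
    one_copy = expected_alloc_gib(vol_gib)
    print(f"{vol_gib} GiB volume: ~{one_copy:.0f} GiB per copy, ~{2 * one_copy:.0f} GiB with both copies")

The 10 TB case works out to roughly 1.2 TiB per copy, about 2.4 TiB with both copies, which lines up with the consumption reported in the original post; the 16 GB case tops out at ~32 GiB, which is why a smaller (or more fully used) volume hides the effect.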

The other way to look at it is the actual over-allocation of the underlying physical capacity. For example, in your system with 72 TB raw capacity (3 nodes, giving you about 24 TB usable), you can create volumes that add up to about 5x that physical amount. This way you get the benefit of thin provisioning without necessarily having the physical space.

If you have time, I'd suggest repeating the test with more volumes that over-allocate your physical hardware. Eventually you should start getting warnings, and it will also stop letting you create new volumes.

Regarding the performance: your 3 nodes x 6 (4 TB) 7200rpm drives can do about 80 random IOPS each, so you should max out around 1440 IOPS for this environment. There is also some initial performance penalty with thin provisioning when a chunk is written to for the first time, which needs to be taken into consideration.
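As a quick sanity check of those figures (the division by two for the second copy is an assumption about how to relate the backend ceiling to the ~470 write IOPS vdbench reported, not something stated above):

# Back-of-the-envelope IOPS ceiling for 3 nodes x 6 x 7200rpm HDDs.
drives = 3 * 6
backend_iops = drives * 80         # ~80 random IOPS per spindle -> ~1440 total
frontend_iops = backend_iops / 2   # each write is mirrored to two drives (assumption)
print(backend_iops, frontend_iops) # 1440 720.0 -- vdbench averaged ~470 write IOPS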

March 5th, 2018 01:00

Hi Andy,

I believe this is due to ScaleIO using 1 MB chunks: each of those 4K random writes has been translated into a 1 MB chunk being written somewhere on disk. That would explain the perceived bloat of ~250x (i.e., 1 MB equals 1024K, and 4K into 1024K is 256).
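Spelled out with the numbers from the test above (the same arithmetic, as a sketch):

# 4K writes each allocating a full 1 MB chunk: 1024 / 4 = 256x per copy.
chunk_kib, io_kib = 1024, 4
per_copy_factor = chunk_kib // io_kib                 # 256
written_gib = 5
worst_case_gib = written_gib * per_copy_factor * 2    # both copies -> 2560 GiB (~2.5 TiB)
print(per_copy_factor, worst_case_gib)

That worst case (~2.5 TiB) sits just above the ~2.4 TB actually observed; the small gap would be the minority of writes that happened to land in an already-allocated chunk.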

At proper scale, though, once the volume was fully utilized, this would not be seen as an issue.

With the upcoming ScaleIO 3.0 there is a new space-efficient format, which will use 4K chunks instead (when using all-flash).

Matt

March 5th, 2018 08:00

That would make sense for the storage bloat. But saying that someone has to fully utilize the storage volume in order to avoid this bloat is not practical. No one really fills an entire volume to the brim. More often than not there are writes, overwrites, and deletes, which can lead to huge fragmentation across the storage cluster if 1 MB chunks are always preallocated.
