
April 23rd, 2018 11:00

Minimize impact to client I/O when an SDS crashes with a kernel panic

Hello!

I'm currently testing ScaleIO version R2_0.14000.231 on CentOS 7.2.

I have 12 SDSs in one storage pool, with one protection domain and four fault sets of three SDSs each.

Each SDS has two network interfaces: 1 Gb/s Ethernet for management traffic and 56 Gb/s InfiniBand for data traffic.

I'm testing 2 cases:

  1. Full unavailability of the data network interface
  2. Kernel panic on an SDS node
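
For context, here is roughly how I inject the two failures on an SDS node (a minimal sketch; "ib0" is a placeholder for the InfiniBand data interface name, and the panic trigger requires magic SysRq to be enabled via kernel.sysrq):

    #!/usr/bin/env python3
    """Failure injection for the two test cases. Run as root on the SDS node.

    Assumptions: "ib0" is a placeholder for the InfiniBand data interface,
    and magic SysRq (kernel.sysrq) must be enabled for the panic trigger.
    """
    import subprocess

    DATA_IFACE = "ib0"  # placeholder; substitute your actual data interface

    def down_data_interface():
        """Case 1: take the data network interface fully offline."""
        subprocess.run(["ip", "link", "set", DATA_IFACE, "down"], check=True)

    def trigger_kernel_panic():
        """Case 2: crash the kernel immediately via magic SysRq ('c')."""
        with open("/proc/sysrq-trigger", "w") as f:
            f.write("c")  # node panics instantly; nothing after this runs

    if __name__ == "__main__":
        down_data_interface()  # or trigger_kernel_panic()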

In both cases I see the same issue on clients behind the SDC: when the network interface goes down, or the node crashes with a kernel panic, client I/O hangs at 0 IOPS for 7-9 seconds, then resumes normally with no I/O errors on the file system. After these 7-9 seconds, ScaleIO starts a rebuild normally.
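
To quantify the stall, I run a small probe on the client (a minimal sketch; /dev/scinia is a placeholder for the mapped ScaleIO volume, and the probe overwrites the first 4 KiB of the device, so point it at a scratch volume):

    #!/usr/bin/env python3
    """Measure the client-side I/O stall during an SDS failure. Run on the SDC.

    Assumption: /dev/scinia is a placeholder for the mapped ScaleIO volume.
    Warning: overwrites the first 4 KiB of the device; use a scratch volume.
    """
    import mmap
    import os
    import time

    DEV = "/dev/scinia"   # placeholder device path
    BLOCK = 4096          # one logical block per write
    STALL = 1.0           # report any gap longer than one second

    buf = mmap.mmap(-1, BLOCK)                    # page-aligned, as O_DIRECT requires
    fd = os.open(DEV, os.O_WRONLY | os.O_DIRECT)  # bypass the page cache

    prev = time.monotonic()
    while True:
        os.lseek(fd, 0, os.SEEK_SET)
        os.write(fd, buf)                         # blocks while ScaleIO holds I/O
        now = time.monotonic()
        if now - prev > STALL:
            print(f"I/O stalled for {now - prev:.1f} s")
        prev = now
        time.sleep(0.1)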

How can I minimize this impact?


April 24th, 2018 19:00

This is expected behaviour in my experience: various checks occur before an SDS is marked offline, and I/O is paused until that decision has been made, which typically takes about this long. There is no tuning that can reduce it. (If the decision were made too quickly, it could also cause problems when the issue turns out to be transient.)

In the vast majority of situations, operating systems and applications can tolerate this amount of I/O holding (the default disk I/O timeout on most OSs is around 30 seconds). For most storage solutions, I/O holding is a fact of life at some stage or another, typically during controller failovers, for example.
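
For what it's worth, you can see that OS-level budget on Linux by reading the SCSI disk command timeout from sysfs (a minimal sketch; it lists SCSI-attached disks only, so ScaleIO's /dev/scini* volumes will not appear there):

    #!/usr/bin/env python3
    """Print the per-disk SCSI command timeout (the classic ~30 s Linux default).

    Note: ScaleIO volumes (/dev/scini*) are not SCSI devices, so this shows
    the underlying drives only; it illustrates the OS-level holding budget.
    """
    import glob

    for path in glob.glob("/sys/block/sd*/device/timeout"):
        with open(path) as f:
            print(f"{path}: {f.read().strip()} seconds")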

If you have an application that is not tolerant of this, I would be curious to know more detail about it.
