Unsolved

1 Rookie

 • 

40 Posts

November 5th, 2014 09:00

Replication Throughput Issue

Good Afternoon All,

I would like to find out if there are any limitations on how replication data is pushed across the wire.  We have a lot of data to replicate, a 50 Mbps Metro connection, and all the correct Cisco switches in place.  The issue that I can see is that most of the time we have data moving across, we are only utilizing about 60% of that 50 Mbps pipe.  On occasion (very rarely) we see it spike up to about 80% for a bit.  What does it take to get all replication jobs to utilize all or most of the bandwidth?  Attached is a pic of what I am referring to.  Any comments or suggestions are welcome.

Some of these replication jobs are taking forever to finish and we need to do something about it.  Some jobs take 20 hours.  Thanks

1 Attachment

1 Rookie

 • 

40 Posts

November 5th, 2014 13:00

The WAN link is the only constant in the picture.  Based on the picture attached, at one point the throughput was pretty good since it was running above 80%.  As for how many jobs, that depends, but sometimes there is one and sometimes as many as 4.  What I am trying to fix is the speed.  If we could get full use of our Metro connection we would never have more than two at once.  We have one volume that needs to copy about 50 GB a day and all the others are closer to 5 GB.  As you can see, there is plenty to copy, and only utilizing 50% of my bandwidth isn't going to cut it.  As for the latency, it varies as well.  I am going to get in touch with the carrier and have them monitor the latency.
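To put rough numbers on that, here is a hypothetical back-of-the-envelope sketch (the helper name and the decimal-GB assumption are mine, not from the thread) of how long a 50 GB replication job takes at different utilization levels of a 50 Mbps link:

```python
def transfer_hours(data_gb: float, link_mbps: float, utilization: float) -> float:
    """Hours needed to move data_gb over a link_mbps link at a given utilization."""
    bits = data_gb * 8e9                  # decimal GB -> bits
    rate = link_mbps * 1e6 * utilization  # effective bits per second
    return bits / rate / 3600

# The 50 GB daily volume over the 50 Mbps Metro link:
print(round(transfer_hours(50, 50, 1.0), 2))  # 2.22 h at full utilization
print(round(transfer_hours(50, 50, 0.5), 2))  # 4.44 h at the ~50% observed
```

This ignores replication protocol overhead, but it shows the gap: the raw data volumes alone do not explain 20-hour jobs on this link.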

1 Message

November 7th, 2014 11:00

Full disclosure, I am a product manager for Silver Peak.

Don is correct, latency is a limiting factor for replication throughput. You didn't mention what the latency on your network is or where your two sites are, so it is difficult to comment on how much latency is limiting your throughput. We have an online throughput calculator that will let you determine what performance you should be getting on your network: www.silver-peak.com/.../throughput-calculator.

There is one other thing that really slows down replication throughput, especially when there is latency: packet loss. If you are losing any packets during replication, your throughput will slow down until the packet loss stops, and then it will slowly ramp back up. If you look at a bandwidth chart you will see a sawtooth pattern. Your provider might tell you that packet loss is very low, 0.1%-1%. But even 0.1% loss with 80 ms of latency will limit you to about 4 Mbps, regardless of how much bandwidth you have. This is a TCP/network issue, not an EqualLogic issue.
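The ~4 Mbps figure above matches the Mathis approximation for loss-limited TCP throughput, throughput ≈ MSS / (RTT · √loss). A minimal sketch, assuming a standard 1460-byte MSS (the function name is mine):

```python
def mathis_throughput_mbps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate steady-state TCP throughput under random packet loss (Mathis model)."""
    return (mss_bytes * 8) / (rtt_s * loss_rate ** 0.5) / 1e6

# 0.1% loss with 80 ms of latency:
print(mathis_throughput_mbps(1460, 0.080, 0.001))  # ~4.6 Mbps, no matter how big the pipe
```

Note how the result is independent of link capacity, which is exactly why a 50 Mbps circuit can crawl along at single-digit Mbps under loss.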

Silver Peak can fix network problems and overcome the effects of latency on the network. We have hundreds of joint EQ customers with Dell, and even a dedicated replication accelerator to help you use all of your bandwidth while also getting a 5-10X reduction on the data as it is replicated. This means your 50 Mb connection could behave like a 250-500 Mb connection.

I recommend checking with your provider to get your SLA for latency and loss. Once they give you that info, you need to use a monitoring tool to verify that they are meeting their targets (a lot of providers will hide behind measurable loss and latency happening in a different part of the network). If you want to get the stats, you can download one of our appliances for free, run it with your EQ replication traffic, and it will report on the loss, latency, and out-of-order packets on your network in real time. You might find that the issue is easily correctable by changing something on your network, or it will give you ammunition to use in your conversation with your provider.

1 Rookie

 • 

40 Posts

November 9th, 2014 07:00

Well Guys,

I have a ticket open with Dell now and had one open before for about a month with no resolution.  They are saying it's a networking issue and not the hardware or setup.  However, I have an issue with that thought, since we are getting great throughput at some times and not others.  If the issue were related to setup or the network, I wouldn't expect it to ever go any higher than it is.  Yesterday, for example, we had outstanding throughput for about 10 hours, then it dropped back off.  I just don't get that!  If anyone has any suggestions on how to help pinpoint the issue, please let me know.  I've looked at all the switches, checked all the firmware and cables, installed new fiber in the office, changed out fiber modules, replaced switches, and a host of other things.  Any comments or suggestions are welcome.  Thanks to all that have replied!  Much appreciated!

Perry

1 Rookie

 • 

40 Posts

November 10th, 2014 06:00

The latency when I ping the remote site is <2 ms.  The remote site is about 2 miles away from our building.  Our provider is coming out to check the circuit on Tuesday.  I have also reopened a ticket with Dell support.  The old ticket was open for 2 months with no resolution.  Have another tech working on it now.  It's just hard for me to understand why at some point in the day I can get 80% throughput and 30 minutes later only get 40%.

1 Rookie

 • 

40 Posts

November 10th, 2014 07:00

I did specify 1024.  The only thing that goes over this link is replication traffic, and we have a VMware host on the DR site as well.  No major traffic with that, just heartbeat traffic.

1 Rookie

 • 

40 Posts

November 10th, 2014 07:00

There are 3 things on the DR side: a host, a switch, and another EqualLogic.  There is one server on the host and it's just a vCenter server.  That is all that is there.  Replication runs 24/7.  I will shut down the vCenter server just to eliminate that part.

1 Rookie

 • 

40 Posts

November 10th, 2014 11:00

Checked out a couple more things and found nothing out of the ordinary.  Started replication again and checked the latency with a ping (1408) and it is still only 4 ms currently; however, we are only utilizing 40% of the transmit bandwidth.  I would try to upload a few pics but can't find how to attach a file.

1 Rookie

 • 

40 Posts

November 10th, 2014 13:00

If you are referring to the MTU size, yes.  TCP window size, no.  What is the recommended setting?
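For reference, the window a single TCP stream needs to keep a link full is the bandwidth-delay product. A hypothetical sketch (function name mine) for this 50 Mbps case at the latencies mentioned in the thread:

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to keep the pipe full."""
    return bandwidth_bps * rtt_s / 8

print(bdp_bytes(50e6, 0.002))  # 12500.0 bytes at the reported ~2 ms RTT
print(bdp_bytes(50e6, 0.080))  # 500000.0 bytes if latency ever spikes to 80 ms
```

At 2 ms even the classic 64 KB default window comfortably covers the product, so window size alone is unlikely to be the bottleneck here; it only becomes one if latency climbs well above what the pings show.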

1 Rookie

 • 

40 Posts

November 10th, 2014 13:00

Can I ask how, at some times during the evening, I do get good bandwidth?  Sunday, for example, I got almost 90% throughput for about a 9-hour period and then it went to *** again.

1 Rookie

 • 

40 Posts

November 10th, 2014 13:00

Currently only about 2.  Tried it with just one and still the same issue.  Here is a blurb from the SAN HQ email.  Brace yourself, it's ugly!

• 11/10/2014 1:14:49 PM to 11/10/2014 2:16:41 PM

o [ID: 3.28] Member SAN2 TCP retransmit percentage of 2.0%. If this trend persists for an hour or more, this could be indicative of a problem on the member's SAN network, resulting in an e-mail notification.

 TCP retransmit rate greater than 1% should be investigated.   Check the network connections and switch settings.

 Condition already generated an e-mail message. If the condition persists, additional messages will be sent approximately every 6 hours.

o TCP outbound packet counts for polling period: 8,025,511

o TCP retransmit packet counts for polling period: 163,823

o eth0 send rate for polling period: 3.34 MB/sec

o eth1 send rate for polling period: 1.93 MB/sec

o eth2 send rate for polling period: 946.47 KB/sec

o eth3 send rate for polling period: 1.92 KB/sec

• 11/10/2014 10:09:29 AM to 11/10/2014 11:11:16 AM

o [ID: 3.28] Member SAN2 TCP retransmit percentage of 1.6%. If this trend persists for an hour or more, this could be indicative of a problem on the member's SAN network, resulting in an e-mail notification.

 TCP retransmit rate greater than 1% should be investigated.   Check the network connections and switch settings.

 Condition already generated an e-mail message. If the condition persists, additional messages will be sent approximately every 6 hours.

o TCP outbound packet counts for polling period: 5,637,941

o TCP retransmit packet counts for polling period: 90,600

o eth0 send rate for polling period: 1.02 MB/sec

o eth1 send rate for polling period: 941.27 KB/sec

o eth2 send rate for polling period: 1.87 MB/sec

o eth3 send rate for polling period: 2.24 KB/sec

• 11/10/2014 9:07:38 AM to 11/10/2014 10:09:29 AM

o [ID: 3.28] Member SAN2 TCP retransmit percentage of 3.3%. If this trend persists for an hour or more, this could be indicative of a problem on the member's SAN network, resulting in an e-mail notification.

 TCP retransmit rate greater than 1% should be investigated.   Check the network connections and switch settings.

 Condition already generated an e-mail message. If the condition persists, additional messages will be sent approximately every 6 hours.

o TCP outbound packet counts for polling period: 7,326,481

o TCP retransmit packet counts for polling period: 249,518

o eth0 send rate for polling period: 1.07 MB/sec

o eth1 send rate for polling period: 1.81 MB/sec

o eth2 send rate for polling period: 2.94 MB/sec

o eth3 send rate for polling period: 2.65 KB/sec

• 11/10/2014 8:05:51 AM to 11/10/2014 9:07:38 AM

o [ID: 3.28] Member SAN2 TCP retransmit percentage of 3.7%. If this trend persists for an hour or more, this could be indicative of a problem on the member's SAN network, resulting in an e-mail notification.

 TCP retransmit rate greater than 1% should be investigated.   Check the network connections and switch settings.

 Condition already generated an e-mail message. If the condition persists, additional messages will be sent approximately every 6 hours.

o TCP outbound packet counts for polling period: 6,321,521

o TCP retransmit packet counts for polling period: 245,541

o eth0 send rate for polling period: 2.16 MB/sec

o eth1 send rate for polling period: 2.19 MB/sec

o eth2 send rate for polling period: 1.05 MB/sec

o eth3 send rate for polling period: 2.66 KB/sec
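The retransmit percentages in those alerts can be checked directly against the packet counts in the same report (hypothetical helper; the counts below are copied from the first polling period above):

```python
def retransmit_pct(retransmits: int, outbound: int) -> float:
    """TCP retransmit rate as a percentage of outbound packets."""
    return retransmits / outbound * 100

# 1:14-2:16 PM period: 163,823 retransmits out of 8,025,511 outbound packets
print(round(retransmit_pct(163_823, 8_025_511), 1))  # 2.0 -- double the 1% warning threshold
```

Sustained 2-3.7% retransmits on a SAN interface points at real packet loss somewhere in the path, which lines up with the loss-limited throughput behavior described earlier in the thread.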

1 Rookie

 • 

40 Posts

November 10th, 2014 13:00

That is correct.  That is really the only thing this pipe is used for.  

1 Rookie

 • 

40 Posts

November 10th, 2014 14:00

We have the ISP coming tomorrow to test the connection, but I'm guessing that it will be OK.  Like I mentioned before, the latency only shows 4 ms when I have replication running.  If I could post a file I would show you what our SolarWinds looks like.
