High number dropped rx packets


I would appreciate assistance from someone with experience of debugging network problems.

 

I looked at the ifconfig output and was horrified to see that the count of dropped rx packets on my eth0 interface is more than 50% of the count of rx packets.

 

I'm not sure how long this has been happening, but probably for some considerable time - I have been aware that the time for the login prompt to be displayed when making a telnet connection to the server can be as long as 10 seconds.  It seems significant, to me, that once the telnet connection is established, there are no further appreciable delays.

 

If I reduce network traffic by stopping all significant addons (particularly transmission and minidlna), then the login prompt is displayed virtually instantaneously.  In this situation, the dropped packet count stops incrementing.  However, even then, a simple copy of,  for instance, bzimage, from the flash drive to an ubuntu desktop over smb, will bump the dropped packet count significantly.

 

On the other hand emhttp and unmenu interfaces respond almost instantaneously when making a connection.

 

Even though there are twice as many tx packets as there are rx packets, the dropped tx packet count is zero.  I have tried changing the network cable and switch port - that makes no significant difference.

 

I plan to connect unRAID and Ubuntu desktop back-to-back with no switch and run iperf to see what happens.

 

No other computers on the network are showing any dropped packets, so I don't think it's an inherent problem with my network infrastructure.

 

My hardware configuration is in my .sig - main points are Xeon e3-1230V2/X9scm-iiF mobo, running rc12a.

 

Can anyone suggest what steps I can take to identify the cause of this problem or, better still, fix it?

 

I guess that I should mention that the server, and most other devices, are connected directly to a TP-Link unmanaged 16 port Gb switch.  I have two other Gb switches both 8 port unmanaged, which I could test with.  However, being unmanaged, none of them can provide any diagnostics to assist with tracking down the problem.

Link to comment

I have a few on my server in sig, uses the same Intel ethernet:  Intel® 82574L Dual port GbE LAN.

 

I usually have mine link aggregated, but currently eth0 has 282 million packets received and 12k dropped, so nowhere near as bad as yours!

 

Interesting - I have seen reports of dropped packets with the 82579, and frequent driver code changes to attempt to fix it.  However, the 82574 is very different and I've not found any suggestion that the 82574 is similarly afflicted.

 

What is your MTU setting on your switch?  unRAID server?  Hosts on your network?

 

My switches are not managed - I have no control over MTU on the switches.  My TP-Link switch is specified as supporting jumbo frames.  Server and desktop machines are configured to 1500. The wan side of my router is set to 1442 (max permissible is 1492).

 

Are you running any encapsulation protocols on the network?  Between the hosts and server? IPSec, GRE or SSL?

 

Not within the lan, but the server is using ssl for connections to my remote mail service.

 

Have you checked all of your cables for signal loss?

 

No, but I have tried swapping the cable between server and switch.

Link to comment

Okay, after some testing, I have come to the following conclusion:

My unRAID server drops an enormous quantity of rx packets when it is transmitting data to a slower device - ie a 100Mb interface.

 

For instance, transferring a 4.7GB mkv from my unRAID server (1Gb) to an Ubuntu machine (100Mb) the following increases were reported:

rx packets: 1.6 million, dropped rx packets: 2.1 million, tx packets: 3.26 million

 

Performing the same transfer from Ubuntu (1Gb) to Ubuntu (100Mb) the rx packet and tx packet counts were almost exactly the same as before, but there were no dropped packets.

 

Transfers from unRAID to another 1Gb-equipped machine don't produce any dropped packets.

 

So my question now is: Why does my unRAID server drop so many rx packets when it is transmitting data and how can I rectify it?

 

There are no significant syslog entries at the time of the file transfers but, if the startup information may be of use, I can post a syslog.

Link to comment

Thanks for the suggestions.  I have found netperf pre-built for both Ubuntu and Slackware.  I will examine the options, and see what I can learn from the package.

 

However, with a Gb interface at either end, a quick test with default settings achieves in excess of 939*10^6bps when run from unRAID to Ubuntu.  If I run it the other way, I only get 922-923(*10^6)bps.  I don't know whether this is significant.

 

iperf, which I have run in the past, gives me around 924000 Kbits/sec in either direction.  Now, I'm not sure whether the 'Kbits' (sic) means 1024 bits or 1000 bits.  If it's 1024 bits, then iperf is producing slightly higher numbers than netperf in the faster direction.  On the other hand, if it's 1000 bits, then the results match those of netperf in the slower direction.
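
The two possible readings of iperf's 'Kbits' can be compared with a little arithmetic. This is only a sketch: it assumes the candidates are 1000 and 1024 bits per Kbit, and the function name kbits_to_bps is made up for illustration.

```shell
# kbits_to_bps VALUE MULTIPLIER
# Convert a rate reported in "Kbits/sec" into bits/sec for a given
# interpretation of K (1000 or 1024). Hypothetical helper, not part of iperf.
kbits_to_bps() {
    echo $(( $1 * $2 ))
}
```

With the figure quoted above, kbits_to_bps 924000 1000 gives 924000000, close to netperf's slower direction, while kbits_to_bps 924000 1024 gives 946176000, slightly above netperf's faster direction.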

 

For interest, I repeated my file copy, this time to an Ubuntu machine with Gb interface.

This produced the following figures (with those from the 100Mb interface repeated for comparison):

1Gb interface: rx packets: 215 thousand, dropped rx packets: 0, tx packets: 3.3 million

100Mb interface: rx packets: 1.6 million, dropped rx packets: 2.1 million, tx packets: 3.26 million

 

I've also recorded the byte counts actually transferred:

With the 100Mb interface: rx: 118 million, tx: 4,500 million.

With the 1Gb interface: rx: 38 million, tx : 4,800 million

 

The most glaring difference, apart from dropped packet count, is the tremendous reduction in rx packets (and rx bytes) when using the Gb interface at each end.

 

Anyway, I'm still at a loss to understand why there are no dropped return packets when copying Ubuntu Gb to Ubuntu 100Mb, but a very large number of dropped return packets when copying unRAID Gb to Ubuntu 100Mb.

 

I have not, knowingly, changed any network configurations from the unRAID default values, so is unRAID misconfigured by default, or do I have a hardware problem?

 

Does no one else (apart from, possibly, Concorde Rules) suffer from this problem?

Link to comment

After much googling, I came up with the following advice:

Increase the rx ring buffer using ethtool.  Default is 256 - I bumped it to the maximum 4096.  That appeared to make absolutely no difference to my problem.
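
For anyone wanting to repeat the ring buffer change, something along these lines should do it. It is a sketch rather than a tested recipe: the function name set_rx_ring and the DRY_RUN switch are inventions for this example, and the maximum your driver accepts may differ from 4096.

```shell
# set_rx_ring IFACE SIZE
# Show the ring settings, then request a larger rx ring.
# With DRY_RUN=1 (a convention made up for this sketch) the ethtool
# commands are only printed, not executed.
set_rx_ring() (
    iface=$1
    size=$2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ethtool -g $iface"
        echo "ethtool -G $iface rx $size"
    else
        ethtool -g "$iface"              # lists pre-set maximums and current values
        ethtool -G "$iface" rx "$size"   # needs root; the link may bounce briefly
    fi
)
```

Running DRY_RUN=1 set_rx_ring eth0 4096 first shows what would be executed. The setting is not expected to survive a reboot unless it is added to a startup script.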

 

The next thing I found was:

Beginning with kernel 2.6.37, the rx_dropped counter shows statistics for dropped frames because of:

 

Softnet backlog full

Bad / Unintended VLAN tags

Unknown / Unregistered protocols

 

If any frames meet those conditions, they are dropped before the protocol stack and the rx_dropped counter is incremented. 
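
Of the three causes listed, the first can be checked separately, because the kernel keeps a per-CPU count of backlog drops in /proc/net/softnet_stat. The sketch below assumes the usual layout of that file (one row per CPU, hexadecimal fields, second field being the drop count); the function name softnet_dropped is made up.

```shell
# softnet_dropped [FILE]
# Sum the second column of softnet_stat: packets dropped because the
# per-CPU input backlog was full. FILE defaults to the live /proc entry.
softnet_dropped() (
    file=${1:-/proc/net/softnet_stat}
    total=0
    while read -r _processed dropped _rest; do
        total=$(( total + 0x$dropped ))   # fields are hexadecimal
    done < "$file"
    echo "$total"
)
```

If this total stays flat while the interface's dropped count climbs, the backlog is probably not the cause, which leaves the VLAN tag and unknown protocol cases.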

 

and:

Care should be taken to confirm that frames are not being legitimately dropped.  A quick way to test this is to start a packet capture:

 

host:~# tcpdump

 

And then watching the rx_dropped counter.  If it stops incrementing while the tcpdump is running; then it is more than likely showing drops because of the reasons listed earlier.  If frames continue to be dropped while running tcpdump, investigation should take place to determine root cause.

 

I found, and installed, tcpdump on my unRAID server.  Sure enough, if I run it, the dropped packet count stops incrementing - even the drops resulting from copying a file from unRAID to Ubuntu desktop with 100Mb interface.
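
The comparison described above can be made less manual by measuring how far the drop counter moves while a command runs. This is a rough sketch, assuming the sysfs counter /sys/class/net/IFACE/statistics/rx_dropped follows the 'dropped' figure that ifconfig shows (ifconfig may fold in other counters). The names rx_dropped, drops_during and SYSFS_NET are invented for the example.

```shell
# Directory holding the per-interface counters. Overridable only so the
# sketch can be tried without real hardware (SYSFS_NET is a made-up name).
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

# rx_dropped IFACE: print the current rx_dropped counter.
rx_dropped() {
    cat "$SYSFS_NET/$1/statistics/rx_dropped"
}

# drops_during IFACE COMMAND [ARGS...]
# Print how much rx_dropped grew while COMMAND was running.
drops_during() (
    iface=$1
    shift
    before=$(rx_dropped "$iface")
    "$@" >/dev/null 2>&1
    after=$(rx_dropped "$iface")
    echo $(( after - before ))
)
```

While a copy to the 100Mb machine is in progress, drops_during eth0 sleep 30 gives a baseline, and drops_during eth0 timeout 30 tcpdump -i eth0 -n -w /dev/null gives the figure with a capture running (this assumes the coreutils timeout command is available). A much smaller second number matches what is described above.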

 

So this suggests that the packets are being dropped for legitimate reasons - but why should my file copy respond with invalid packets?

 

tcpdump produces quite a bit of output in default mode (and there are two verbose modes!), but I've not read enough to be able to interpret this.

 

I still have a lot of questions:

If transfers work okay at Gb speeds, but fail at 100Mb speeds, doesn't it seem unlikely that I have a cable fault?

If I copy a file to the server (either from a 100Mb or from a 1Gb interface), I see no dropped rx packets at either end, doesn't this also suggest that there's nothing wrong with the infrastructure?

Is the rx traffic at the server, during a file copy from the server, simply handshaking (ACK/NAK/XOFF/XON) type of traffic?  What could possibly go wrong with these, presumably simple, packets?

 

Apart from attempting to interpret the output from tcpdump, I think that I need to run some tests with my unRAID network interface configured to run at 100Mb speeds.

Link to comment

There should be less than 1% dropped packets. Most systems have zero.

Indeed ... which is why I want/need to get to the bottom of this.  I'm using 'fairly standard' Supermicro/Xeon hardware - if this is hardware related, the problem must lie in the network interface/microcode.  I'm using a standard rc12a build, admittedly with a number of addons - but this problem doesn't appear to be linked to any particular addon.

My network infrastructure should be okay - it can achieve a high throughput on 1Gb interfaces without any problem - the failure is only occurring when there's a 100Mb link involved.  I've swapped cables and switch ports.

 

There is a lot more I can try, given time - try another switch, try the other ethernet port on my x9scm-iiF, try configuring the ethernet port down to 100Mb, try connecting two computers back-to-back etc., but I'm still very puzzled why I encounter this problem, and no one else seems to be similarly afflicted.

 

Also, I suspect that the output from tcpdump ought to be able to tell me something useful - if only I knew how to interpret it.

 

unmenu includes a netperf package.

 

Strange - I'm on the latest (1.6) unmenu, but can't find netperf in the list of packages - only iperf, which I've been using for a couple of years.  Perhaps I'm blind and just can't see it!  :-[  Anyway, no problem - as I said, I did find a ready-built Slackware package, which is working fine.

Link to comment

I've now run tests connecting two machines via a different switch (a Netgear GS608).  Nothing else is connected to the switch - just my unRAID server and an Ubuntu machine with a 10/100 interface.  The dropped packets are still occurring during the file copy from server to client, and the count stops incrementing if I run tcpdump on the server.

 

I tried connecting the two machines back-to-back, with no switch (obviously, in this case, the server interface is restricted to 100Mb).  Now, there are no dropped packets - I believe that this is because the two interfaces run at the same speed, but the only way to prove that is to use a switch, but configure the server to only connect at 100Mb.

 

Edit to add:

Interestingly, even though no dropped packets are being reported by ifconfig, tcpdump still logs packets.  The terminating summary produced by tcpdump is:

 

In the case of 100Mb connection (no dropped packets)

26906 packets captured
363737 packets received by filter
336798 packets dropped by kernel

 

In the case of 1Gb connection (with dropped packets)

44011 packets captured
672360 packets received by filter
628318 packets dropped by kernel
52 packets dropped by interface

 

Tests continue ...

Link to comment

I have now reverted to a virgin system (ie. no addons installed - my go file has a single executable line, starting emhttp, and I renamed the plugins directory on my flash drive).  I will attach my ps -eaf output for some kind soul to check.

 

I have also tried going back to rc10, with no addons and still I get dropped packets during file copy.

 

I can think of two further things to try - booting an alternative OS on my unRAID hardware, and booting unRAID on other hardware.  Oh, I could also try enabling the second ethernet interface on my x9scm-iiF mobo, and using that - I currently have it turned off in the bios.

 

I will also attach a syslog.

syslog.txt

virgin_unRAID_ps.txt

Link to comment

I've had my unRAID hardware booted into Ubuntu and performed a file copy from memory disk on that to my Ubuntu machine with 100Mb interface.  There were no dropped packets anywhere.

 

I have to conclude, therefore, that the dropped packets are not the result of any inherent hardware fault, but are a feature of unRAID.

 

I still wonder why ... and why no one else seems to have encountered the problem ...

 

Puzzled!

 

I think that expert analysis of my tcpdump is becoming important - the question is: Where do I find an expert?

Link to comment

Peterb

 

I am interested in the outcome of your work.

 

I have the same issue, on RC10, huge numbers of dropped rx packets. The MB I have has a Realtek card (different from yours). I NEVER saw this under unRAID 4.7, but I have recently switched to the RC10 to be able to use larger drives.

 

I had just assumed that the NIC was going bad. I have had bad experiences with Realtek in the past.

 

Now I am not so sure.

 

bkasten

 

Link to comment

Thanks for your reply, bkasten.  I was worried that I was the only person to encounter this problem.  Do you have a lot of traffic going from your unRAID server to devices with a slower interface?  Are the dropped packets shown on the rx side at the unRAID server?  If you stop traffic to those slower devices, do the dropped packets stop?  If you constrain your unRAID machine ethernet interface to 100Mb, do the dropped packets stop?

 

I had wondered whether my problem was related to the e1000e driver, because unRAID is using v1.9.5 and Ubuntu is using 2.0.0.  However, if you're suffering the same problem with a Realtek chipset, then it must lie elsewhere in the stack, or something is delaying the servicing of interrupts.

Link to comment

None of the 10/100 devices on my network talk to the unRAID. All the slower items are printers and the like. All the client computers are 10/1000. 1 old Mac, 1 Windows 7 and 3 Linux computers. The Windows box uses SMB and the Mac and Linux computers all use NFS. I have not tried constraining the unRAID to 10/100 yet.

 

All the dropped packets are on the RX side.

 

eth0      Link encap:Ethernet  HWaddr 00:30:18:ad:e3:b6
          inet addr:10.0.1.10  Bcast:10.0.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2168231165 errors:0 dropped:4268832 overruns:0 frame:0
          TX packets:2069464208 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:249601419 (238.0 MiB)  TX bytes:3616151215 (3.3 GiB)
          Interrupt:43

 

None of the client computers show dropped packets.
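
To put a figure like the one in the ifconfig output above into proportion, the RX line can be turned into a percentage. This is a small sketch that assumes the older ifconfig layout shown in this thread (RX packets:N ... dropped:N on one line); newer net-tools versions print a different layout and would need a different pattern. The name rx_drop_percent is made up.

```shell
# rx_drop_percent
# Read ifconfig-style text on stdin and print dropped RX packets as a
# percentage of RX packets, to two decimal places.
rx_drop_percent() {
    awk '/RX packets:/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^packets:/) { split($i, p, ":"); packets = p[2] + 0 }
            if ($i ~ /^dropped:/) { split($i, d, ":"); dropped = d[2] + 0 }
        }
        if (packets > 0) printf "%.2f\n", 100 * dropped / packets
    }'
}
```

Usage would be ifconfig eth0 | rx_drop_percent. For the RX line quoted above it prints 0.20, that is, roughly 0.2% of received packets.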

 

On my server, I issue the command:

 

watch ifconfig

 

And I can see that it's dropping packets at the rate of approximately 2 every two seconds. That rate is steady whether I am using it or not. The rate doesn't seem to vary up or down with use.

 

As I said before, I had assumed that the on-board NIC was going bad. Nothing new there for me. So I ordered an Intel Pro/1000 card to put in. Now I am not so sure.

 

Frankly, I am at a loss at the moment for what/how to test this.

 

I am open to any ideas.

 

I might just throw the discrete card in and see what happens.

 

Bruce

Link to comment

None of the 10/100 devices on my network talk to the unRAID. All the slower items are printers and the like. All the client computers are 10/1000. 1 old Mac, 1 Windows 7 and 3 Linux computers. The Windows box uses SMB and the Mac and Linux computers all use NFS.

... all connected to a Gb switch - are you sure that they have all negotiated a Gb connection?  What about wireless clients?

What about Internet traffic?

 

You could try disconnecting all clients, one at a time, including disconnecting your router, and note whether the dropped packets stop.

 

Apart from two older computers which have 10/100 interfaces, I also have several Squeezeboxes (both 10/100 hardwired, and wireless) and a digital photoframe on a wireless connection.  I honestly cannot remember whether my TVs, set-top boxes etc. have Gb or 10/100 interfaces.  To make life easier, I performed a lot of my testing with just two connections to a spare switch.

 

 

I have not tried constraining the unRAID to 10/100 yet.

 

Okay - worth doing as an experiment, and simply accomplished with an ethtool command:

ethtool -s eth0 autoneg off speed 100 duplex full  

Restore Gb operation with:

ethtool -s eth0 autoneg on speed 1000 duplex full

 

All the dropped packets are on the RX side.

 

Exactly what I'm seeing.

On my server, I issue the command:

 

watch ifconfig

 

And I can see that it's dropping packets at the rate of approximately 2 every two seconds. That rate is steady whether I am using it or not. The rate doesn't seem to vary up or down with use.

 

This makes me think that you have some background process passing packets to a slower interface - perhaps the Internet.

 

My dropped packet counts increment much more rapidly while I have transmission running, or during file transfer to a slow machine.

 

As I said before, I had assumed that the on-board NIC was going bad. Nothing new there for me. So I ordered an Intel Pro/1000 card to put in. Now I am not so sure.

 

Frankly, I am at a loss at the moment for what/how to test this.

 

I am open to any ideas.

 

Me too!

 

I might just throw the discrete card in and see what happens.

 

Sure, it's worth a try!

Link to comment

Other than slowing down file transfers, is this issue causing any functional problem?    The network protocol should, of course, still work fine and transfer files with no problem -- right?

 

I know how to test this in Windows, but have no idea how to do the measurements on the Linux (UnRAID) side.    And this thread now has me curious as to whether my system may have the same issue [I have no reason to think it does, but am just curious]

 

Is this simply a matter of opening a Telnet login and running a command to monitor the packets?  If so, what is the appropriate command?    I see your commands to change the network speeds; but it's not clear what will capture the actual transfers.   

Link to comment

FWIW I fired up Wireshark on my PC and set it to monitor retransmissions;  then copied a large file to/from each of my UnRAID servers.  No retransmissions -- so it's "working perfectly" from the PC side.

 

But it'd still be interesting to know whether the same is true on the UnRAID end ... so a quick tutorial would be very much appreciated.    From a bit of Googling/reading it seems it's just a matter of using iperf and netstat; but I have no idea exactly what commands they need to do this.

 

Link to comment

None of the 10/100 devices on my network talk to the unRAID. All the slower items are printers and the like. All the client computers are 10/1000. 1 old Mac, 1 Windows 7 and 3 Linux computers. The Windows box uses SMB and the Mac and Linux computers all use NFS.

... all connected to a Gb switch - are you sure that they have all negotiated a Gb connection?  What about wireless clients?

What about Internet traffic?

 

OK, that got me thinking. I have wireless extended all over the house with routers! Two subordinate routers to my main one. So, I unplugged each one in turn and checked the result.

 

No difference.  :(

 

You could try disconnecting all clients, one at a time, including disconnecting your router, and note whether the dropped packets stop.

 

That will have to wait for the weekend

 

Apart from two older computers which have 10/100 interfaces, I also have several Squeezeboxes (both 10/100 hardwired, and wireless) and a digital photoframe on a wireless connection.  I honestly cannot remember whether my TVs, set-top boxes etc. have Gb or 10/100 interfaces.  To make life easier, I performed a lot of my testing with just two connections to a spare switch.

 

 

I have not tried constraining the unRAID to 10/100 yet.

 

Okay - worth doing as an experiment, and simply accomplished with an ethtool command:

ethtool -s eth0 autoneg off speed 100 duplex full  

Restore Gb operation with:

ethtool -s eth0 autoneg on speed 1000 duplex full

 

Easy enough, so I tried it. No difference! Wild since for you, it made a difference. Mine was exactly the same. That one surprised me.

 

All the dropped packets are on the RX side.

 

Exactly what I'm seeing.

On my server, I issue the command:

 

watch ifconfig

 

And I can see that it's dropping packets at the rate of approximately 2 every two seconds. That rate is steady whether I am using it or not. The rate doesn't seem to vary up or down with use.

 

This makes me think that you have some background process passing packets to a slower interface - perhaps the Internet.

 

My first instinct is to say no, but I had even forgotten about the subordinate routers. I may have to go through the house and turn off all devices one at a time this weekend. I cannot do this on a weekday. It might cause an insurrection. The unRAID has been so reliable, and is so integrated in the household, that it is an essential part of the house, like heat in the winter, running water, etc.

 

My dropped packet counts increment much more rapidly while I have transmission running, or during file transfer to a slow machine.

 

As I said before, I had assumed that the on-board NIC was going bad. Nothing new there for me. So I ordered an Intel Pro/1000 card to put in. Now I am not so sure.

 

Frankly, I am at a loss at the moment for what/how to test this.

 

I am open to any ideas.

 

Me too!

 

I might just throw the discrete card in and see what happens.

 

Sure, it's worth a try!

 

I will give it a try, installing the Intel NIC. I want it in anyway, having had a bad time with Realtek NICs.

 

I will keep you posted

 

Bruce

 

Link to comment

Gary, to provoke this problem you must, first of all, be transmitting data from the unRAID server (1Gb ethernet interface, I assume) to another device which has a slower interface (eg 10 or 100Mb wire, or a wireless connection) - this can be a simple file copy from a standard disk or user share.

 

You can check for dropped packets via a telnet connection to your server - simply type the command: ifconfig eth0 before and after the data transfer.  If you have this problem, it will show up as an increase in dropped RX packets (fourth line of the output - see bkasten's example in reply #18).

Link to comment
