traffic shaping in high-bandwidth environments

Tue Apr 9 15:39:47 BST 2019

By way of introduction, I'm a long-time Linux user/admin (since 1992),
but have only recently stumbled into the area of traffic
shaping/control in a Linux environment.  The "fireqos" package looked
like a great way to ease into this, but I'm obviously missing
something fundamental about the way this is supposed to work, at least
in terms of what I expected to see :-).

Hardware environment is an IBM blade server with Broadcom NetXtreme II
BCM57711 NICs (10G PCIe).  There is an "eth5" interface associated
with the "external" network, and an "eth7" interface associated with
the "internal" network.  The two interfaces are bridged: the idea is for
traffic to pass transparently between the two interfaces, subject to
whatever shaping/control I might want to apply.  I have access to a
BreakingPoint appliance I can use to generate different mixes of
application traffic at various rates, and that's what I've been using
for testing.  For now, I'm limiting things to IPv4 UDP traffic, and
looking at various scenarios involving unidirectional and
bidirectional traffic flows.

Per the recommendations found in the FireQOS tutorial and elsewhere, I
did a bit of tuning with "ethtool" on the "eth5" and "eth7"
interfaces.  Specifically, I turned off "gro" and "lro", and increased
the number of receive buffers ("rx") to the indicated maximum of 4078.
Early testing of a trivial "fireqos" configuration resulted in a
massive number of receiver overruns.  Raising the "rx" value was part
of the solution to that problem, as was setting the interrupt
coalescing parameters to rx-usecs 0, tx-usecs 0, rx-frames 1 and
tx-frames 1.  Another individual suggested setting
"net.core.netdev_max_backlog = 5000" (default value is 1000).  The
explanation offered was that this is the maximum number of packets that
can be queued on the input side when an interface receives packets
faster than the kernel can process them.
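For reference, the tuning steps described above can be written down as a small script.  This is only a sketch of what I ran, not a recommendation: the "run" and "tune_nic" helper names and the APPLY switch are made up for illustration, and the script only prints the commands unless APPLY=1 is set (applying them needs root and the real interfaces).

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set APPLY=1 in the environment to actually apply the settings.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Apply the per-interface tuning described in the text to one NIC.
tune_nic() {
    dev="$1"
    run ethtool -K "$dev" gro off lro off                                 # disable GRO/LRO
    run ethtool -G "$dev" rx 4078                                         # max RX ring on this NIC
    run ethtool -C "$dev" rx-usecs 0 tx-usecs 0 rx-frames 1 tx-frames 1  # no interrupt coalescing
}

for dev in eth5 eth7; do
    tune_nic "$dev"
done

# Allow a deeper input backlog (default is 1000).
run sysctl -w net.core.netdev_max_backlog=5000
```

"ethtool -g eth5" and "ethtool -c eth5" show the ring and coalescing values that are actually in effect afterwards.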

With "fireqos" inactive, the bridge has no problems processing traffic
at an aggregate (bidirectional) rate of 10 Gbit/s.  Neither the
BreakingPoint nor the SUT (system under test, i.e. the bridge) reports
any issues whatsoever.

I started with the following trivial "fireqos.conf" file, as suggested by the tutorial:

DEVICE=eth5
IN_SPD=9000mbit
OUT_SPD=10000mbit
LINKTYPE=ethernet

interface $DEVICE ext-in input rate $IN_SPD $LINKTYPE

interface $DEVICE ext-out output rate $OUT_SPD $LINKTYPE

After typing "fireqos start", I see the expected "eth5-ifb" device
created to help with the "input" side of things.  With the default
"sfq" qdisc, throughput with this configuration would best be
described as "abysmal".  1.0 Gbit/s aggregate (bidirectional) is
completely error-free.  1.5 Gbit/s aggregate is pretty good at 98%+.
At a 2.0 Gbit/s aggregate data rate, traffic control reports I'm
dropping upwards of 120,000 packets per second.  Appending "qdisc
pfifo" to each interface statement in the conf file (which replaces the
default "sfq" qdisc with "pfifo") helps somewhat, which I would expect
because a simple FIFO does far less work per packet.  The odd thing is
that the SUT doesn't seem to be in any "distress" as far as CPU
utilization, inability to service interrupts promptly, memory/buffer
issues, etc.  The SUT has no other job to do than process traffic, and
it has 24 hyperthreaded 2.4 GHz Xeon cores and 64 GB RAM available to
throw at the task.
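In case it helps anyone reproduce the numbers, a drops-per-second figure like the one above can be derived by sampling the counters in "tc -s qdisc show" twice.  The sketch below is one way to do that; the "sum_dropped" and "drop_rate" function names are hypothetical, and it assumes the usual "(dropped N, overlimits ...)" statistics line.  Note that it adds up every qdisc on the device, so with a parent/child hierarchy the same drop may be counted more than once.

```shell
#!/bin/sh
# Sum every "dropped N" counter found in `tc -s qdisc show` output on stdin.
sum_dropped() {
    sed -n 's/.*(dropped \([0-9][0-9]*\),.*/\1/p' |
        awk '{ s += $1 } END { print s + 0 }'
}

# Print the average drops per second on one device over an interval
# (interval in whole seconds, default 1).
drop_rate() {
    dev="$1"
    secs="${2:-1}"
    before=$(tc -s qdisc show dev "$dev" | sum_dropped)
    sleep "$secs"
    after=$(tc -s qdisc show dev "$dev" | sum_dropped)
    echo $(( (after - before) / secs ))
}

# Example (needs the real devices): drop_rate eth5-ifb 5
```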

Other things I've tried, a few of which seem to have helped:

(1) Since "eth5" and "eth7" are bridged, why not use "eth7" as the
"ext-in" interface and "eth5" as the "ext-out" interface (both in the
"output" direction only)?  I can then run both at "rate 10000mbit" since
traffic shaping/control is only being applied to the outbound direction
of each interface.  This has the further advantage of eliminating the
IFB layer and whatever latency it introduces.

(2) For each "interface" statement, I experimented with adding "class
default commit X%" for various values of "X".  So far, this has had
the biggest effect on improving overall throughput when "fireqos"
was active: aggregate rates of up to 3.0 Gbit/s look pretty good for a
commit value of 90%.
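
To make (1) and (2) concrete, the combined configuration looks roughly like the fragment below.  Treat it as a sketch: the SPD variable name is just shorthand I introduced here, and placing the "class" line under each "interface" statement reflects my reading of the FireQOS syntax rather than anything confirmed.

```sh
# Variants (1) and (2) together: shape egress only on each bridge port,
# so no IFB device is needed.
LINKTYPE=ethernet
SPD=10000mbit

# Traffic leaving eth7 toward the internal network (what arrived on eth5)
interface eth7 ext-in output rate $SPD $LINKTYPE
    class default commit 90%

# Traffic leaving eth5 toward the external network
interface eth5 ext-out output rate $SPD $LINKTYPE
    class default commit 90%
```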

So, is this behaving "as designed"?  I would expect higher throughput
in the absence of any explicit controls, but I'm obviously missing
something.  Thanks in advance for improving my understanding of what's
going on here :-).

Respectfully,
--Bob