[Firehol-support] better understanding link-balancer and PBR

Wed Dec 7 11:12:24 GMT 2016

I am replying again, this time to the list; my first reply went only to
Spike.

My answers are inline in your text:

On Wed, Dec 7, 2016 at 7:02 AM, Spike <spike at drba.org> wrote:

> Dear all,
>
> Q1) why does l-b copy over the main routing table to all the custom chains?
> Or at least, in a common/default scenario where main just contains a couple
> of local routes, what's the benefit of it?
>

It simplifies routing significantly. Without this inheritance, policy-based
routing would be a lot more complicated. Imagine you have your static
routes and 2 upstream providers. How would you say that LAN server1 is to
be routed via ISP2 without losing your static routes?
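
A rough sketch of what that inheritance amounts to (this is not
link-balancer's actual code): copy every non-default route from main into a
per-ISP table, give that table its own default via the ISP's gateway, and
add a rule that sends one LAN host to it. The table number, gateway,
interface and host address in the usage note are made-up values, and the
helper names are hypothetical. Commands are only printed unless DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only: helper names and all example values are assumptions.
DRY_RUN=${DRY_RUN:-1}

# Print the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Read "ip route show table main" output on stdin and re-add every
# non-default route to the given table.
copy_routes() {
    table=$1
    while read -r route; do
        case "$route" in
            default*|"") continue ;;  # the per-ISP default is set separately
        esac
        # $route is left unquoted on purpose so it splits into ip arguments
        run ip route replace $route table "$table"
    done
}

# Args: table gateway device host
route_host_via_isp() {
    run ip route replace default via "$2" dev "$3" table "$1"
    run ip rule add from "$4" table "$1"
}
```

Example use, with hypothetical values for ISP2:
"ip route show table main | copy_routes 102" followed by
"route_host_via_isp 102 203.0.113.1 eth2 192.168.1.10". Because table 102
still holds the static routes, server1 keeps reaching them and only its
default traffic leaves via ISP2.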

> Q2) l-b generates a nexthop default route using the GWs I configured as
> default . When the packet encounters that do they go back to look at the
> rules and then match Table1 for GW1 or Table2 for GW2 depending on nexthop
> selected? If not, then what are those tables set up for? the main table
> would already know how to reach those destinations since they are local.
>

This is done with policy-based routing. Check the output of "ip rule show"
or the policy section in link-balancer.conf.
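
To see how the rules and tables fit together on a running system, something
like the following may help. It is only a sketch with hypothetical helper
names: it prints the rules and then dumps every table that a rule looks up.
It only reads state and changes nothing.

```shell
#!/bin/sh
# Sketch only: lists the routing tables that the policy rules point to.

# Read "ip rule show" output on stdin and print each table named after
# "lookup" once, in the order the rules reference them.
rule_tables() {
    awk '{ for (i = 1; i < NF; i++)
               if ($i == "lookup" && !seen[$(i + 1)]++) print $(i + 1) }'
}

# Show the rules, then the contents of every table they reference.
dump_policy_tables() {
    ip rule show
    ip rule show | rule_tables | while read -r table; do
        echo "== table $table"
        ip route show table "$table"
    done
}
```

Running dump_policy_tables before and after a link-balancer run makes it
easier to see which rule selects which table for a given source or mark.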

> Q3) my understanding is that routes are cached, so even after a link has
> gone down a client will still make the same choice in terms of routing a
> certain ip. Is that correct? ie it won't look at the rule or tables and
> just pick the cached route. So for example if when 2 GWs were up, and
> packets were routed through GW1, with Table1 having GW1 as its default
> route, and then GW1 went down, subsequent packets would still route through
> GW1 until the cached route expired. Is that correct? If that's true, then
> what's the point of changing the default route in Table1 to use GW2 when
> the rule that pointed to GW1 is removed anyway?
>

Hm... I don't know exactly how the routing cache works. I do know, however,
that in all the cases I have encountered so far, my only problem was the
iptables connection tracker, especially when NAT is involved or CONNMARK is
used.
I had to run conntrack to delete all the entries of the failed gateway, to
prevent long timeouts.
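
A minimal sketch of that cleanup, assuming connections through the failed
link were masqueraded to that link's local address (so the address appears
as the reply destination), or were tagged with a per-link connmark. The
address and mark in the usage note are hypothetical, as are the helper
names. Commands are only printed unless DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only: helper names and example values are assumptions.
DRY_RUN=${DRY_RUN:-1}

# Print the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Delete tracked connections that were NATed to the failed link's address.
flush_by_wan_ip() {
    run conntrack -D --reply-dst "$1"
}

# Delete tracked connections carrying the failed link's connmark.
flush_by_mark() {
    run conntrack -D --mark "$1"
}
```

For example, "flush_by_wan_ip 198.51.100.2" or "flush_by_mark 1". New
packets of the affected flows then create fresh conntrack entries and are
routed and NATed according to the current rules.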

> Q4) for some reason I'm not understanding, two subsequent runs of l-b give
> opposite results regarding a failed GW: the first run detects it as FAILED,
> but it then adjusts the routes for that GW's table which seems the reason
> why the second run succeeds, even tho the GW is still in failed mode.
>

This ping-pong case is common if the check depends on whether certain
routes are present.
If you check what exactly fails, you should get an indication of how to
make it work in that state. The most common case is that a policy-based
rule is required to make something work after link-balancer has
applied its rules.

Costa