This article describes a Linux routing configuration that distributes outgoing connections across two separate interfaces to the public internet, for example two different ISPs.

Caveat

A router is an appliance: it should do its task effectively and save you time. The goal of this article is to describe a multi-homed system, but it will not help you if you lack the understanding to maintain it. This means debugging it, testing it, repairing it, and customizing it further in the future.

An advanced network setup like this can easily cost you more time in construction and maintenance than it saves you in file transfers. Keep this in mind as you are designing your system.

Solution Concepts

This system's basic premise is to use a simple closed-loop control of throughput. Instead of distributing packets and connections across the interfaces blindly, we will measure the load on each interface to avoid congestion.

A naive solution might be to just send each IP packet out on the least busy interface. As we'll see, there are inherent problems with doing this.

Let's assume that somewhere between each interface and the public internet, there is a router doing NAT so that some of our "external" interfaces are not bound to public internet addresses. NAT will imply that, from the outside, each packet will appear to originate from whatever address is bound to each of the two outward interfaces. If the next IP packet is sent via one of these two available interfaces at random then the outside world will not see a single, stable source address associated with a connection.

This is a problem because a TCP connection is identified by source and destination addresses (and ports) that every packet in the connection shares. With naive per-packet distribution, the remote end sees packets from two different source addresses, and TCP connections will break.

We need a way to ensure that subsequent packets in a connection are always sent out the same interface. In other words, our solution needs to have connection integrity.
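The difference between the two strategies can be illustrated with a toy simulation. This is only a sketch: both addresses are placeholder documentation addresses, and the function names are made up for the example. It prints the source addresses a remote server would see for a six-packet connection under each strategy.

```shell
#!/bin/sh
# Placeholder addresses (assumptions for this example only).
SRC_A=198.51.100.10   # source address seen when a packet leaves via the first interface
SRC_B=203.0.113.7     # source address seen when a packet leaves via the second interface

# naive_sources N: alternate interfaces for every packet.
naive_sources() {
    i=0
    while [ "$i" -lt "$1" ]; do
        if [ $(( i % 2 )) -eq 0 ]; then echo "$SRC_A"; else echo "$SRC_B"; fi
        i=$(( i + 1 ))
    done
}

# pinned_sources N: choose an interface once, then keep it for the whole connection.
pinned_sources() {
    i=0
    while [ "$i" -lt "$1" ]; do
        echo "$SRC_A"
        i=$(( i + 1 ))
    done
}

echo "naive:  $(naive_sources 6 | sort -u | grep -c .) distinct source addresses"
echo "pinned: $(pinned_sources 6 | sort -u | grep -c .) distinct source address"
```

Only the pinned variant presents a single source address to the remote end, which is the property the rest of this article works to preserve.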

Simple Load Balancing of Workstation Connections

Let's start with the basic IP configuration of the interfaces:

Interface  CIDR Address      Gateway
eth0       183.100.0.10/24   183.100.0.1
eth1       192.168.46.11/24  192.168.46.1

We'll say that eth0 is on a public IP thanks to a passthrough mode on an ISP home router, whereas eth1 is on another private ISP network. So we're guaranteed that the eth1 interface will need to go through a layer of NAT to get out.

We want traffic on this machine destined for the public internet to be routed out the interface that is currently the least busy. Let's dive in.

iptables
Reference: [1] man iptables(8). Linux manual page. man7.org.
Reference: [2] man iptables-extensions(8). Linux manual page. man7.org.
Reference: [3] The iptables tutorial. Oskar Andreasson. frozentux.net
Reference: [4] Diagram of netfilter packet flow through iptables. en.wikipedia.org.

The iptables suite provides the rules for blocking, forwarding, and masquerading traffic, but it also offers a few information-gathering mechanisms for logging and tracking packets. See [1] and [3]. You should be familiar with how to control access to a port or to masquerade a virtual machine before continuing. You can also take a look at [4] to see the packet flow into and around the kernel's routing tables.

We'll use a standard extension called RATEEST that you can read about in [2]. It can generate a "traffic rate estimate" for an interface when you add rules such as the following to the iptables table named mangle. These estimators will provide the information required to help the kernel route each packet along the least busy interface.

mangle-prerouting-1.chain
*mangle
-A PREROUTING -i eth0 -j RATEEST --rateest-name rate_eth0 --rateest-interval 100ms --rateest-ewma 1.0s
-A PREROUTING -i eth1 -j RATEEST --rateest-name rate_eth1 --rateest-interval 100ms --rateest-ewma 1.0s
COMMIT

Here we're creating two ingress rate estimators for the interfaces eth0 and eth1, named respectively rate_eth0 and rate_eth1. Each estimator will have both byte and packet counters associated with it. The two rightmost parameters cover sampling and averaging details that you should look up in [2].

You can load these rules into your kernel with the command:

iptables-restore --noflush -T mangle < mangle-prerouting-1.chain

You can check that these rules have been added with the command iptables-save, which prints all current tables and their rules to stdout.

Next, iptables can generate a small integer tag called a mark that will be attached to the packet on its journey through the kernel's routing framework. The mark doesn't persist on the packet when it gets sent out on the interface. In our solution, we'll use this mark (or at least the lowest-order bit of it) to serve as an indicator of which outbound interface we prefer to route the packet on.

In this next excerpt, we use iptables rules to test the rate estimators from the previous section. If the test passes, the CONNMARK target assigns a mark value to the connection that the packet belongs to. You can also specify a bit mask with CONNMARK, allowing you to set/unset/ignore all the bits and use them for different purposes.

mangle-output-1.chain
*mangle
-A OUTPUT -m conntrack --ctstate NEW -m rateest --rateest rate_eth0 --rateest-bps1 0bit --rateest-lt --rateest2 rate_eth1 --rateest-bps2 0bit -j CONNMARK --set-mark 1
-A OUTPUT -m conntrack --ctstate NEW -m rateest --rateest rate_eth0 --rateest-bps1 0bit --rateest-gt --rateest2 rate_eth1 --rateest-bps2 0bit -j CONNMARK --set-mark 2
COMMIT

Note that we're using the OUTPUT chain here. These marks will be applied to new outbound connections originating at this host, but not to any connections that it is forwarding. This is relevant in certain virtual machine setups -- a host acts as a router for its guests and would need to forward outbound connections.

It's also relevant for a router with two public interfaces. This is a much more common way of using two public internet connections and we'll explore it later. For now, think of the current host as a workstation that is only concerned with getting its own packets to the right place. In this case, OUTPUT is the right chain to use.

Note when using RATEEST: if you try to put the contents of both files into a single file and import it into iptables that way, it won't work. The statements using the estimators need to be in a different COMMIT block than the statements defining the estimators. This is the reason for having them in two separate files.
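The load order can be captured in a small wrapper script. This is a minimal sketch: DRY_RUN, load_chain, and load_all are names invented for the example, and the script only prints the commands unless DRY_RUN is set to 0, in which case it needs root.

```shell
#!/bin/sh
# Print the commands by default; set DRY_RUN=0 to actually run them (as root).
DRY_RUN="${DRY_RUN:-1}"

# load_chain FILE: feed one rules file to iptables-restore without flushing.
load_chain() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "iptables-restore --noflush -T mangle < $1"
    else
        iptables-restore --noflush -T mangle < "$1"
    fi
}

# load_all: the estimator definitions must be committed before the rules
# that reference them, so the files are loaded one at a time, in this order.
load_all() {
    for f in mangle-prerouting-1.chain mangle-output-1.chain; do
        load_chain "$f" || return 1
    done
}

load_all
```

Stopping at the first failure keeps the script from loading rules that refer to estimators which were never created.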

To summarize, a mark of 1 on a connection implies that we should prefer eth0 for that connection. Conversely, a mark of 2 implies that we should prefer eth1.
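The decision made by the two OUTPUT rules can be modelled in a few lines of shell. This is a sketch of the comparison only, not of the kernel's implementation; choose_mark is a made-up name and the rates are arbitrary numbers in the same unit.

```shell
#!/bin/sh
# choose_mark RATE_ETH0 RATE_ETH1
# Prints the mark the two rules would set for a NEW connection, or nothing
# when the rates are equal (neither the -lt nor the -gt rule matches then).
choose_mark() {
    if [ "$1" -lt "$2" ]; then
        echo 1      # eth0 is quieter: prefer eth0
    elif [ "$1" -gt "$2" ]; then
        echo 2      # eth1 is quieter: prefer eth1
    fi
}

choose_mark 1200 90000
choose_mark 90000 1200
choose_mark 0 0
```

The last call prints nothing: when both estimators report the same rate, for example when both links are idle, the connection stays unmarked and is routed by the main routing table.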

You'll notice that these output rules only match on a conntrack clause --ctstate NEW. This is part of our commitment to connection integrity. Only the first packet in a connection has its CONNMARK set. This connection mark is separate from the packet mark, and is maintained throughout the life of the tracked connection.

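In order to force subsequent packets in a connection to inherit the mark from their connection, we need to restore the packet mark from the connection mark. Packets arriving on an interface traverse PREROUTING, while packets generated on this host traverse OUTPUT instead, so we restore in both chains. Load this file after mangle-output-1.chain so that the OUTPUT restore rule runs after the rules that set the connection mark:

mangle-prerouting-2.chain
*mangle
-A PREROUTING -j CONNMARK --restore-mark
-A OUTPUT -j CONNMARK --restore-mark
COMMIT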

Now each packet in a connection will inherit the mark that was decided for it when the connection was opened. The drawback is that packets may not be distributed evenly among the interfaces, but this is a small price to pay for the guarantee that a connection will have a stable source IP to the outside world.

Finally, these estimators assume that the desired throughput out of either interface should be roughly equal. You can read [2] if you'd like to tweak the desired rates.
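One way to handle links of unequal capacity is the --rateest-delta mode described in [2], which compares the spare capacity of each link (a configured rate minus the measured rate, never below zero) instead of the raw rates. The sketch below models that comparison and prints a matching pair of rules. The capacities of 50mbit and 20mbit are invented for the example, and the function names are hypothetical.

```shell
#!/bin/sh
# headroom CAPACITY RATE: spare capacity, clamped at zero.
headroom() {
    if [ "$1" -gt "$2" ]; then echo $(( $1 - $2 )); else echo 0; fi
}

# choose_mark_delta CAP0 RATE0 CAP1 RATE1: prefer the link with more headroom.
choose_mark_delta() {
    h0=$(headroom "$1" "$2")
    h1=$(headroom "$3" "$4")
    if [ "$h0" -gt "$h1" ]; then
        echo 1      # eth0 has more spare capacity
    elif [ "$h0" -lt "$h1" ]; then
        echo 2      # eth1 has more spare capacity
    fi
}

# emit_delta_rules CAP0 CAP1: print the two rules in iptables-restore form.
# Note that -gt now selects mark 1, because larger headroom is better.
emit_delta_rules() {
    common="-A OUTPUT -m conntrack --ctstate NEW -m rateest --rateest-delta"
    echo "$common --rateest1 rate_eth0 --rateest-bps1 $1 --rateest-gt --rateest2 rate_eth1 --rateest-bps2 $2 -j CONNMARK --set-mark 1"
    echo "$common --rateest1 rate_eth0 --rateest-bps1 $1 --rateest-lt --rateest2 rate_eth1 --rateest-bps2 $2 -j CONNMARK --set-mark 2"
}

choose_mark_delta 50000000 10000000 20000000 1000000
emit_delta_rules 50mbit 20mbit
```

Keep in mind that the estimators defined earlier measure ingress traffic, so the capacities you configure should be the downstream rates of each link.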

ip route
Reference: [5] man ip-route(8). Linux manual page. man7.org.

Since Linux kernel 2.4, we have the ability to define several routing tables and to decide which table to use with a set of rules. I'll show you how to create the alternate tables first using ip route [5], then show you ip rule in the next section.

If you haven't before, type ip route and look at what's there.

The important columns are:
  • a class of destinations given by a subnet mask
  • a host by which to get there
  • an interface by which to get there
The important rows are:
  • a "default" class of destinations (aka 0.0.0.0/0) via some gateway address with global scope
  • a gateway destination (or sometimes a subnet that includes the gateway) via a dev (interface) with link scope

You might see two "default" rows in your routing table if you have more than one public interface. If you've configured them both using DHCP then your ISP servers likely gave them both default routes. In that form, the one that comes first will simply get all the default packets.

What we'll be doing is copying these lines into separate tables so that each table has only one default route, out the interface that we care about. Choosing a table for a packet then becomes the way of deciding how it gets routed.

Choose two unused numbers for your new routing tables. IDs 253, 254, and 255 are reserved for the default, main, and local tables, and 0 is reserved as well. Here I'll use 1000 for sending out of eth0 and 1001 for sending out of eth1.

With this information, we can write a custom routing table for each:

ip route add default table 1000 proto static via 183.100.0.1 dev eth0
ip route add 183.100.0.0/24 table 1000 proto static scope link dev eth0

...and...

ip route add default table 1001 proto static via 192.168.46.1 dev eth1
ip route add 192.168.46.0/24 table 1001 proto static scope link dev eth1

For the uninitiated, proto static means that the route won't be modified or deleted by network-aware daemons like NetworkManager or systemd-networkd. The scope link in the second rule of each part implies that this rule will cause a packet to be pushed down to the link layer and out the specified interface directly, rather than following the trail of IP address rules further (which are scope global by default).

The astute reader will notice that the second route in each pair covers the interface's whole subnet, written as the network address with the host's CIDR prefix length (ip route rejects a prefix with host bits set, so the gateway address itself cannot be used with a /24 prefix). This assumes that the gateway is in the same subnet as the host. The route therefore covers both the gateway and any other resources on that subnet that the ISP intends to expose, such as DNS and NTP servers. If your gateway is outside this subnet, then provide a similar rule with a /32 prefix for the gateway, and a separate rule for the subnet.

Now you have routing tables with a single default gateway in each. You can view these tables with a command like ip route show table 1000.
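If you would rather derive the subnet route from the interface's own address, the network address can be computed in plain shell. This is a sketch under the assumption of IPv4 addresses in dotted-quad CIDR form; network_of and table_routes are names made up for the example, and the script only prints the ip route commands rather than running them.

```shell
#!/bin/sh
# network_of A.B.C.D/LEN: print the network address with host bits cleared.
network_of() {
    addr=${1%/*}
    len=${1#*/}
    old_ifs=$IFS; IFS=.
    set -- $addr            # split the address into four octets
    IFS=$old_ifs
    ip=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    if [ "$len" -eq 0 ]; then
        mask=0
    else
        mask=$(( (0xFFFFFFFF << (32 - $len)) & 0xFFFFFFFF ))
    fi
    net=$(( ip & mask ))
    echo "$(( (net >> 24) & 255 )).$(( (net >> 16) & 255 )).$(( (net >> 8) & 255 )).$(( net & 255 ))/$len"
}

# table_routes TABLE HOST_CIDR GATEWAY DEV: print the two commands for one table.
table_routes() {
    echo "ip route add default table $1 proto static via $3 dev $4"
    echo "ip route add $(network_of "$2") table $1 proto static scope link dev $4"
}

table_routes 1000 183.100.0.10/24 183.100.0.1 eth0
table_routes 1001 192.168.46.11/24 192.168.46.1 eth1
```

Review the printed commands before running them as root; the same function can be reused if your ISP hands you a different prefix length.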

ip rule
Reference: [6] man ip-rule(8). Linux manual page. man7.org.

Ok, so your packets have a mark attached to them, and you have tables that say much more explicitly where a packet should be routed by default. How do we connect these? With ip rule.

We'll use this command to set up a policy which states that packets with mark=1 will use routing table 1000, while packets with mark=2 will use routing table 1001.

# ip rule add fwmark 1 table 1000
# ip rule add fwmark 2 table 1001

Note that fwmark stands for "firewall mark" and is a remnant codeword that refers to the packet mark discussed here.

We can conduct a small test of the routing framework we've just set up with ip route get.

# ip route get 8.8.8.8 from 183.100.0.10 mark 1
# ip route get 8.8.8.8 from 183.100.0.10 mark 2

This asks the routing framework to find a route from 183.100.0.10 to 8.8.8.8 using mark 1, then mark 2 (ip route get spells the selector mark rather than fwmark). Note that the source address doesn't matter here as long as it is a routable address on this host.
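To turn this check into something scriptable, you can extract the outgoing device from the command's output. The following is a sketch: route_dev and check_mark are made-up helper names, the selector is written as mark (the spelling documented for ip route get in [5]), and the sample line piped in at the end is illustrative rather than captured from a real system.

```shell
#!/bin/sh
# route_dev: read `ip route get` output on stdin, print the word after "dev".
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# check_mark MARK EXPECTED_DEV: query the live routing policy for one mark.
# Not called here, because it needs the tables and rules to be in place.
check_mark() {
    got=$(ip route get 8.8.8.8 mark "$1" | route_dev)
    if [ "$got" = "$2" ]; then
        echo "mark $1: ok ($got)"
    else
        echo "mark $1: expected $2, got ${got:-nothing}"
        return 1
    fi
}

# Illustrative input only; real output will differ in detail.
echo "8.8.8.8 via 183.100.0.1 dev eth0 table 1000 src 183.100.0.10 uid 0" | route_dev
```

On a configured host, check_mark 1 eth0 and check_mark 2 eth1 should both report ok.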

If you use NetworkManager (as Fedora and Ubuntu do), then you'll likely want to put these commands into scripts so that the rules are defined when both eth0 and eth1 are up, and removed again when either interface is brought down. These scripts go into NetworkManager's dispatcher.d directory and will receive arguments indicating the interface and the event (up/down).
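A dispatcher script along these lines might look like the sketch below. Several things here are assumptions for illustration: the DRY_RUN and SYSFS_NET variables, the helper names, and the use of the operstate file under /sys/class/net to decide whether a link is up. NetworkManager passes the interface as the first argument and the action as the second.

```shell
#!/bin/sh
# Sketch of a script for NetworkManager's dispatcher.d directory.
DRY_RUN="${DRY_RUN:-0}"                    # 1 = print commands instead of running
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"   # overridable so the logic can be tested

# run CMD...: execute a command, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# link_is_up IFACE: true when the kernel reports the link as operational.
link_is_up() {
    [ "$(cat "$SYSFS_NET/$1/operstate" 2>/dev/null)" = "up" ]
}

# handle_event IFACE ACTION: add the rules once both links are up,
# remove them as soon as either one goes down.
handle_event() {
    case "$1" in eth0|eth1) ;; *) return 0 ;; esac
    case "$2" in
        up)
            if link_is_up eth0 && link_is_up eth1; then
                run ip rule add fwmark 1 table 1000
                run ip rule add fwmark 2 table 1001
            fi
            ;;
        down)
            # Ignore failures: the rules may already be gone.
            run ip rule del fwmark 1 table 1000 || true
            run ip rule del fwmark 2 table 1001 || true
            ;;
    esac
}

handle_event "${1:-}" "${2:-}"
```

The kernel drops the routes that go through an interface when it goes down, so a complete script would probably re-add the contents of tables 1000 and 1001 in the up branch as well.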

A Router Configuration

Let's expand the scenario a bit by saying that the host now has two outgoing interfaces to separate internet providers, and a third internal connection which serves a private LAN. Our address/interface table might now look like this:

Interface  CIDR Address      Gateway
eth0       183.100.0.10/24   183.100.0.1
eth1       192.168.46.11/24  192.168.46.1
lan0       10.0.10.1/26      (none)

Our iptables rules now look a bit different. We'll add rules to the prerouting phase that mirror the ones in mangle-output-1.chain:

mangle-prerouting-3.chain
*mangle
-A PREROUTING -m conntrack --ctstate NEW -m rateest --rateest rate_eth0 --rateest-bps1 0bit --rateest-lt --rateest2 rate_eth1 --rateest-bps2 0bit -j CONNMARK --set-mark 1
-A PREROUTING -m conntrack --ctstate NEW -m rateest --rateest rate_eth0 --rateest-bps1 0bit --rateest-gt --rateest2 rate_eth1 --rateest-bps2 0bit -j CONNMARK --set-mark 2
COMMIT

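These ensure that forwarded connections are also marked. Note that CONNMARK --set-mark changes only the connection mark, so a CONNMARK --restore-mark rule has to run after these rules in PREROUTING if the first packet of a forwarded connection is to carry the mark too.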

Another change is that, since we now have a 3rd IP address that represents our internal network, we can require that any packets on the router that are bound to one of the two "external" addresses must also leave out the interface that is associated with that address. While this seems redundant, we need to make it explicit because our previous OUTPUT rule was extremely unselective -- it marks traffic for a specific outbound interface with no consideration of what the source address is.

The easiest way to do this is simply to make the OUTPUT rule from mangle-output-1.chain more selective -- so that it only applies to connections that bind the router's internal address:

mangle-output-1.chain
*mangle
-A OUTPUT -s 10.0.10.1 -m conntrack --ctstate NEW -m rateest --rateest rate_eth0 --rateest-bps1 0bit --rateest-lt --rateest2 rate_eth1 --rateest-bps2 0bit -j CONNMARK --set-mark 1
-A OUTPUT -s 10.0.10.1 -m conntrack --ctstate NEW -m rateest --rateest rate_eth0 --rateest-bps1 0bit --rateest-gt --rateest2 rate_eth1 --rateest-bps2 0bit -j CONNMARK --set-mark 2
COMMIT

The resulting behaviour is that processes on the router get the behaviour they want when they bind either eth0 or eth1 directly -- traffic will be directed out of the proper interface with no balancing performed. It's only when the router binds the internal address 10.0.10.1 that the connection is marked for use on the least busy external interface.

Since we're no longer only handling packets for our own host, the router needs to provide return addresses to the private network in each of the special tables 1000 and 1001 that we defined earlier:

# ip route add 10.0.10.0/26 table 1000 proto static scope link dev lan0
# ip route add 10.0.10.0/26 table 1001 proto static scope link dev lan0

This is so that connections with either mark still have a way to route packets back to their source on the private network.

Source-based Routing Address Blocks

Another nice touch is to set aside a few small IP blocks on the private network for hosts to use as indicators that they wish to be "routed out" on specific interfaces. Let's take for example a small block 10.0.10.36-10.0.10.39, or in CIDR notation 10.0.10.36/30. We can make the addresses in this range behave in the following way:

Private Address  router:interface
10.0.10.36       eth0
10.0.10.37       eth0
10.0.10.38       eth1
10.0.10.39       eth1

These addresses still belong to the router's subnet and can be provisioned by DHCP or given a static assignment on chosen hosts. What's different is that we can make the router exempt these addresses from the load-balancing scheme above and force their traffic to leave according to the interfaces mapped above. This could be handy if certain hosts want to behave as if their packets originate on just one of the two ISP networks. These hosts don't even need to relinquish their original (load-balanced) IP -- they can assign both IPs to the same network card and let applications and administrators choose whether to use the load-balanced IP or the provider-specific IP.

We'll simply append more rules to the mangle chain:

mangle-prerouting-4.chain
*mangle
-A PREROUTING -s 10.0.10.36/31 -j CONNMARK --set-mark 1
-A PREROUTING -s 10.0.10.38/31 -j CONNMARK --set-mark 2
COMMIT

If added to the mangle/PREROUTING chain after the rules given in mangle-prerouting-3.chain, these new rules will overwrite the mark based on the source address.
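To double-check which /31 block an address falls into, the membership test can be reproduced in shell. This is only a sketch of the address arithmetic, with made-up function names; the router itself makes this decision through the -s matches in the rules above.

```shell
#!/bin/sh
# ip_to_int A.B.C.D: print the address as a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1               # split the address into four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# in_cidr ADDR NET/LEN: succeed when ADDR lies inside the block.
in_cidr() {
    len=${2#*/}
    a=$(ip_to_int "$1")
    n=$(ip_to_int "${2%/*}")
    [ $(( a >> (32 - len) )) -eq $(( n >> (32 - len) )) ]
}

# egress_for ADDR: which interface the source-address rules would pick.
egress_for() {
    if in_cidr "$1" 10.0.10.36/31; then
        echo eth0
    elif in_cidr "$1" 10.0.10.38/31; then
        echo eth1
    else
        echo balanced       # no override: the rate-based marking applies
    fi
}

for a in 10.0.10.36 10.0.10.37 10.0.10.38 10.0.10.39 10.0.10.40; do
    echo "$a -> $(egress_for "$a")"
done
```

The first two addresses map to eth0, the next two to eth1, and 10.0.10.40 falls outside both blocks, so it stays load-balanced.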

Another way that you can choose the routing table for a connection uses ip rule and is sometimes referred to as source-based policy routing. The ip rules look like this:

# ip rule add from 10.0.10.36/31 lookup 1000
# ip rule add from 10.0.10.38/31 lookup 1001

If these rules are evaluated before the rules given in the previous section, they will divert the route lookup to tables 1000/1001 based on the source address, ahead of the rules that match on the packet's fwmark. When no priority is given, each new rule receives a lower priority number than the rules already present, so running the ip rule commands in the sequence given by this document means the later rules are evaluated first.
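If you prefer not to depend on the order in which the commands are run, ip rule accepts an explicit priority. The sketch below prints the four rules with priorities chosen arbitrarily for the example (100, 101, 200, 201); any values below that of the main table's rule (32766) would do. emit_rules is a made-up name, and the commands are only printed, not executed.

```shell
#!/bin/sh
# One rule per line: PRIORITY followed by the selector and table.
# They are listed out of order on purpose; the kernel evaluates by priority.
rules='200 fwmark 1 lookup 1000
100 from 10.0.10.36/31 lookup 1000
201 fwmark 2 lookup 1001
101 from 10.0.10.38/31 lookup 1001'

# emit_rules: print the ip rule commands in the order they will be evaluated.
emit_rules() {
    echo "$rules" | sort -n | while read -r prio rest; do
        echo "ip rule add $rest priority $prio"
    done
}

emit_rules
```

With explicit priorities, the source-based rules are consulted first no matter when they were added.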

Conclusion and a Nice Product

I really enjoyed constructing a multi-homed router for my home network. The experience was definitely worth the extra $90 CAD for an extra ISP month. I might even keep the two ISPs for an extra month just to see what other interesting things I can try.

I want to conclude by showing a link to the DIY router product that started me on this little quest. The device I am using for my home router is a ClearFog Pro by SolidRun. It is a very nifty little device that I paid $152 USD for.

Processor           Dual-core Cortex-A9, 1.6 GHz (ARMv7)
Network Interfaces  One GbE switched into 6 downstream ports
                    One GbE dedicated upstream port
                    One 2.5GbE SFP+ port
Expansion           2 mPCI-E slots, one with SIM card slot for LTE modem
                    1 M.2 B-key slot for SATA SSD (I put in 500GB)
                    1 mikroBus header, 20-pin Dupont GPIO
Peripheral I/O      1 USB 3.0 host slot
                    1 USB micro-B peripheral slot for serial console

It ran both ARM Arch and Fedora ARM right out of the box, fanless, and it performs beautifully. The optional aluminum case is quite nice and has ports for three aftermarket wifi antennas.

While you do have to know your operating systems to get everything running, it was, for the most part, a pleasure to work with this device.


I hope that this article is of benefit to you -- it took great effort to comb through all of the IP tutorials and manual pages on the web before I got exactly the configuration that I wanted. I'm sure that nearly everyone that tries a project like this will have their own unique needs, but hopefully this demonstrates a very typical scenario that many people can learn from.

© 2020 michaeljoya.com