Saturday, March 27, 2010

Networking is a little more than IPs and netmasks


Case one

Very recently I asked a question (which is still open) at www.linuxquestions.org (the first place I go when I have a question regarding Linux or GNU, by the way). While I was there I took a brief look at the open questions on the networking forum and hit this beauty.

It's a guy who has set up DNAT on netfilter to forward packets that are sent to one host to another server that does the real work. Think of it as a proxy. In his example, he wanted to forward packets that arrive at his host on port 3306 to port 3197 on another host (let's use IP a.a.a.a). So, he set up a simple rule on (nat) PREROUTING:

$ iptables -t nat -A PREROUTING -p tcp --dport 3306 -j DNAT --to a.a.a.a:3197

What this rule does is tell the kernel to change the destination IP address of any packet that arrives at his host on port 3306, through any network interface, to IP address a.a.a.a (reachable from his server, maybe not from the host that originated said packets) and the destination port to 3197 (the port where the real service is listening on the real server). When the routing decision is made on those packets a while later, the destination IP address will be a.a.a.a and so the packets are sent to the real server. The source address/port of those packets remains the same (unless a little more natting is done, of course). Nice and dandy.

Then, when the packets arrive at server a.a.a.a port 3197 the response will be sent to the originating host/port and the "networking cycle is complete". A word of caution: this works if the packets that are sent back from the real server go through the same host that is doing the natting. If the real server is sending the packets to the originating host through another host, the trick is broken as packets arriving from a.a.a.a:3197 to the originating host don't match the IP:port he sent traffic to, so the connection is not established. This can be solved by SNATting this same traffic on the server that does that DNAT before the traffic is sent to the real server (making sure traffic will come back through it on the way back).
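The SNAT step just mentioned could look roughly like the sketch below. Take it as an illustration only: a.a.a.a and 3197 are the placeholders used in this post, MASQUERADE is just one way to do it (an explicit -j SNAT --to-source with the natting host's own address is the other), and RUN is a helper variable introduced here so that the script only prints the command unless you ask it to run for real.

```shell
#!/bin/sh
# Sketch only. REAL_SERVER and REAL_PORT are the placeholders from the post.
# RUN is a hypothetical helper: by default the command is only printed.
# Call the script with RUN= (empty) as root to actually apply the rule.
RUN="${RUN-echo}"
REAL_SERVER="a.a.a.a"
REAL_PORT="3197"

# Rewrite the source address of the forwarded traffic to the address of the
# host doing the DNAT, so replies from the real server come back through it.
snat_forwarded_traffic() {
    $RUN iptables -t nat -A POSTROUTING -p tcp -d "$REAL_SERVER" \
        --dport "$REAL_PORT" -j MASQUERADE
}

snat_forwarded_traffic
```

The rule goes on the host that does the DNAT, not on the real server. The real server then sees the natting host as the client, which is what forces the replies back through it.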

He tests it and it's working. Traffic is reaching the real server and going back to clients.

He then tried to replicate that same behavior but using localhost instead. So he added a rule that looks very much the same on OUTPUT, like this:

$ iptables -t nat -A OUTPUT -p tcp --dport 3306 -o lo -j DNAT --to a.a.a.a:3197

It should work, shouldn't it? Try telnet to localhost port 3306 and nothing happens. No connection is established. It doesn't work. But why? Using a sniffer you can see that when the -t nat OUTPUT rule is not set, traffic to localhost port 3306 moves through interface lo, nothing wrong with that, but when the rule is set up again, traffic gets lost. It doesn't go through lo or any other network interface... so the IP stack is discarding it. Weird. The counter for the -t nat OUTPUT rule is increasing, so the rule is doing its job as required... still, no traffic.
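For anyone who wants to repeat those two checks (the rule counter and the sniffer), here is a minimal sketch. The function names and the RUN helper are made up for this example, and tcpdump is assumed as the sniffer; any other would do.

```shell
#!/bin/sh
# Sketch only. RUN is a hypothetical helper: commands are printed by default.
# Call the script with RUN= (empty) as root to actually run them.
RUN="${RUN-echo}"
PORT="3306"   # local port used in the post

# List the nat OUTPUT chain with packet/byte counters and numeric addresses.
show_nat_output_counters() {
    $RUN iptables -t nat -L OUTPUT -v -n
}

# Watch TCP traffic for the port on the given interface (lo, eth0, ...).
sniff_port() {
    $RUN tcpdump -n -i "$1" tcp port "$PORT"
}

show_nat_output_counters
sniff_port lo
```

If the counter on the DNAT rule keeps growing while the sniffer stays silent on every interface, you are looking at the same symptom described above.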

So what's going on? Let's think about what happens to the traffic. When it reaches -t nat OUTPUT, these packets have the loopback address as source (port whatever) and the loopback address as destination, port 3306. After the rule is applied, the source is still the loopback address, port whatever, but the destination is now a.a.a.a port 3197. As the packet was changed in nat, a second routing decision is made on it. As the destination host is a.a.a.a, traffic should be sent to the real server, but the source IP address is still the loopback one. If there were an SNAT rule on POSTROUTING, it should take care of this problem (there was a MASQUERADE rule in place, so no problem there). The real problem (which is a little buried in the networking stack of Linux) is that by the time the second routing decision is made, the source IP address is not consistent with the new destination. Let me show you my routing table:

$ ip route show
dev eth0 proto kernel scope link src metric 1
dev eth0 scope link metric 1000
default via dev eth0 proto static

Nothing about the loopback network. Why is that? It's because it is set up in another routing table (Linux supports multiple routing tables, in case you didn't know). You can see the routing tables available by taking a look at the file /etc/iproute2/rt_tables. I have default, main and local. Let's take a look at them:

$ ip route show table default
$ ip route show table main
dev eth0 proto kernel scope link src metric 1
dev eth0 scope link metric 1000
default via dev eth0 proto static
$ ip route show table local
broadcast dev lo proto kernel scope link src
broadcast dev eth0 proto kernel scope link src
local dev eth0 proto kernel scope host src
broadcast dev eth0 proto kernel scope link src
broadcast dev lo proto kernel scope link src
local dev lo proto kernel scope host src
local dev lo proto kernel scope host src

And this is where things start to make sense. If you look carefully, all the routes that carry the loopback src address have a local scope, which means that they are not used outside of the scope of the actual host. In our case the dest address is a.a.a.a and, with the loopback src address, it's impossible to route this traffic... so it gets dumped.

It fails because we tried it against the loopback address. If you telnet to the IP address your host has on your intranet instead, the test is successful (the traffic still goes through interface lo, the kernel can figure that out, and so the rule applies). The src address will be that same intranet address, the DNAT will change the dest address to a.a.a.a, and the trick will work.
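A quick way to see what the kernel would do, and to repeat the test against the intranet address, is sketched below. It's only a sketch: x.x.x.x is a hypothetical placeholder for the address your host has on the intranet, a.a.a.a is the real server from the example, and RUN is a helper introduced here so that nothing is executed unless you ask for it.

```shell
#!/bin/sh
# Sketch only. RUN is a hypothetical helper: commands are printed by default.
# Call the script with RUN= (empty) to actually run them.
RUN="${RUN-echo}"
REAL_SERVER="a.a.a.a"   # placeholder from the post
LAN_IP="x.x.x.x"        # hypothetical placeholder: this host's intranet address
PORT="3306"

check_local_dnat() {
    # Ask the kernel which route and source address it picks for the real server.
    $RUN ip route get "$REAL_SERVER"
    # Traffic to one of the host's own addresses is expected to go through lo.
    $RUN ip route get "$LAN_IP"
    # Repeat the test against the intranet address instead of localhost.
    $RUN telnet "$LAN_IP" "$PORT"
}

check_local_dnat
```

ip route get doesn't send any traffic; it only reports the route and source address the kernel would choose, which makes it handy for this kind of "why did my packet vanish" question.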

Hope you find this trick useful.

Case two

Think of a situation where you have two internet connections through two different ISPs. You get two ethernet cables from them, they provide you with two static addresses/netmasks/default gateways/dns etc.

You connect each cable to a different box, set up networking and everything works like a charm.

Now, you want to get a little wacky and connect those two cables to a single switch (layer two) and connect those two boxes to the switch as well.

Connections should work fine, right? And they do... but then, what happens if you try to send traffic between those two boxes? Say, from box A you ping box B. In this case box A checks its routing table and realizes there's no network defined for that host, so it goes through its gateway. It sends an ARP request to get the MAC of its gateway, the gateway responds with its MAC address, and the packets go out with the MAC address of the gateway, src address A, dest address B. The traffic heads to the internet through one ISP. Then the traffic comes in through the other ISP to box B, and box B gets it. It's going to respond to host A, there's no route for it, so it sends the reply through its own gateway. The reply goes out through the same ISP that delivered the request to host B, comes back through the first ISP to host A, and we see a reply on host A. Great.

But wasn't that too long a trip to reach a host that is two ethernet connections away from host A? There should be a way to make the trip shorter, right? And sure there is. You can set up routes to be reached through gateways (layer 3 routing) but also through devices (layer two routing). How does it work?

Let's add a layer two route for host B on host A:

ip route add b.b.b.b dev ethx

ethx being the interface we use to connect to the switch. And that's it.

Now what happens when host A tries to ping B? Now there's a route to reach B through interface ethx, so an ARP request for IP b.b.b.b is sent through said interface. The request is sent to the switch. The switch broadcasts this ARP request and it reaches B, and B responds to the ARP request with its MAC address. A learns B's MAC address and sends traffic to it. Source IP address is A's, dest address is B's, dest MAC address is B's. B is able to see this traffic (it's got its MAC as destination) and sees A's ping request. Now, to respond to A, it checks its routing table. Remember you didn't change anything on B? Well, there's no route to A, so the reply has to go through the gateway. Traffic is sent to the gateway, through the ISPs, and then it reaches A.

To get the trick working to avoid using ISPs at all, you have to do the same thing on B:

ip route add a.a.a.a dev ethx

ethx being the interface B uses to connect to the switch.
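Put together, the steps on each box could be sketched like this. It's only an illustration: ethx, a.a.a.a and b.b.b.b are the placeholders from this post, and direct_route and RUN are names made up for the example (RUN makes the script print the commands instead of running them).

```shell
#!/bin/sh
# Sketch only. RUN is a hypothetical helper: commands are printed by default.
# Call the script with RUN= (empty) as root to actually run them.
RUN="${RUN-echo}"
IFACE="ethx"   # placeholder from the post: the interface plugged into the switch

# Add a device route to a peer on the same switch and check the result.
direct_route() {
    peer="$1"
    # Reach the peer directly through the interface, no gateway involved.
    $RUN ip route add "$peer" dev "$IFACE"
    # Show the route the kernel now picks for the peer.
    $RUN ip route get "$peer"
    # After some traffic, the peer's MAC should show up in the neighbour table.
    $RUN ip neigh show dev "$IFACE"
}

direct_route b.b.b.b     # on host A
# direct_route a.a.a.a   # on host B
```

Once both hosts have their route, the neighbour table on each box should list the other box's MAC instead of everything going to the gateway's.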

After not writing for so long, I had to have something interesting to write about, right?

Have fun!

2 comments:

  1. This comment has been removed by a blog administrator.

  2. Agree about the quirks regarding iptables, but for connecting to an external database you have to use an ssh tunnel, e.g. ssh -L 3301:localhost:3306 a.b.c.d
    See http://www.ssh.com/support/documentation/online/ssh/adminguide/32/Port_Forwarding.html