# -*- mode: Outline; fill-column: 78; fill-prefix: " " -*- # # klips2-design.txt # Richard Guy Briggs # # RCSID $Id: klips2-design.txt,v 1.18 2001/07/06 07:32:43 rgb Exp $ # * Outline Commands cheat sheet (C-c C-s to see this) C-c C-t Hide EVERYTHING in buffer C-c C-a Show EVERYTHING in buffer C-c C-d Hide THIS item and subitems (subtree) C-c C-s Show THIS item and subitems (subtree) C-c C-c Hide ONE item C-c C-e Show ONE item * Introduction Linux FreeS/WAN IPSec -- KLIPS2 DESIGN ====================================== # This document outlines the proposed design for KLIPS2, the second # generation Linux FreeS/WAN IPSec kernel implementation. It is # accompanied by the following: # # klips2-design.dia dia(1) diagram # klips2-design-legend.txt diagram legend # klips2-design-api.txt API descriptions # klips2-design-api-trips.txt scenarios that force trips through the # various APIs to verify all needed resources # are available # This document is devided up into Introduction, Goals, Feature requests, # Packet path overview, NetFilter overview, IPSec path description, and # Other issues. # # This document was originally written 2.5 weeks after OLS2000, inspired # from a meeting with Rusty and Marc Boucher in Montreal in November 1999 # and two meetings at OLS2000. # # Please comment to the linux-ipsec, netfilter-devel or netdev lists. # # Current kernel version reference is 2.4.4 with iptables 1.2 * Goals: To get rid of the current IPSec virtual interfaces that associate with a specific physical network interfaces and replace them with IPSec virtual interfaces that specify a local gateway address as a source address, a remote gateway address as a destination address and specific tunnel policy or SAs. To use existing packet matching engines rather than re-invent. To support more of the required selectors, especially source and destination ports, and possibly userid. Security labels are not of obvious value, but the selectors will be easy to add in the future if they are implemented in the Linux kernel. To get access to *all* packets incoming and outgoing to enforce policy in both directions. To better support opportunistic encryption. To take advantage of the parallellism of SMP and H/W encryption. To make encryption and authentication modular. The idea is to redesign KLIPS (kernel parts of FreeS/WAN) to avoid all the 'stoopid routing tricks' (TM) to which we have had to resort since the project started by disassociating any IPSec devices from physical devices and to add a proper SPDB to do proper incoming IPSec policy checks. We are hoping to use existing pattern-matching tools rather than invent our own. NetFilter appears to have all the pattern matching capabilities with the exception of security labels which Linux doesn't appear to have anyways, but may be limited in other ways. There is also a significant interest in enabling FreeS/WAN to communicate with routing daemons and be able to do load sharing and failover: http://www.quintillion.com/fdis/moat/ipsec+routing/ * Feature requests: Code=Level/Status/Priority Level: S = strategic M = middle T = tactical U = unset Status: U = unstarted C = coded T = tested D = deleted Priority: H = high M = medium L = low U = unset D = deprecated Code Feature Implementation ==== ======= ============== S/U/U changeable gw wild-side addresses on-the-fly - road warriors with RSA keys and hooks from DHCP to move to a new set of SAs upon expiry of previous DHCP lease. Notify peers. Negociate new tunnels. Handle delayed or denied renewal see: conn up, down, wanted. M/U/U address inertia for remote gw's with changeable wild-side addresses so local gw reboots will initiate reconnect to remotes. - this requires a disk cache. - 3 possible levels: - save list of connections - save IKE phase 1 keys - save IPSEC SA keys (requires KLIPS mods) T/U/U mini-database of road warriors that persists across reboots. - this requires a disk cache. M/U/U connection up, down, wanted - KLIPS2? S/U/U routing below tunnel layer to support mobility and multi-homing - M/U/U tunnel identified by subnets served? - M/U/U why do equalizing schedulers not play well with tunnels? - M/U/U decouple SA retrieval from DADDR (don't care how it arrived) - protocol redesign sysctl and ifdef for dstaddr T/U/U SPIs unique, independant of protocol and DADDR - sysctl and ifdef for protocol S/U/U routing above tunnel layer - S/U/U granularity smaller than host - SPORT, DPORT, UID, SecLev M/U/U /dev/ipsecNNN devices that could be chown(1)ed and chmod(1)ed. M/U/U process to process tunnels T/U/U netfilter,pf_key,ioctl,/dev/ipsecNNN ways to manipulate tunnel perms. S/U/U KLIPS as a loadable module (isn't it already?) S/U/U stats: {number,time_of_last} packets {out,good_in,error_in} S/U/U integrate IPSec and firewall policy into Security Policy. (What APIs and user-level tools?) S/U/U full inbound policy checking S/U/U secure ciphers and hashes T/U/U kernel implementation (should be faster) S/U/U plays well with routing daemons S/U/U free of export restrictions T/U/U standard crypto api to add newer ciphers and hashes S/U/U opportunistic T/U/U SADB hash table will be locked for additions/deletions T/U/U use a refcount on each SA to increase locking granularity * Packet path overview: The basic path through the kernel as it concerns IPSec for the three types of packets is as follows: IN: NIC basic sanity checks NF_IP_PRE_ROUTING hook route-in ip-options processing defragmentation NF_IP_LOCAL_IN hook transport layer demux application FORWARD: NIC basic sanity checks NF_IP_PRE_ROUTING hook routing-in ip-options processing ttl decrement and check NF_IP_FORWARD hook fragmentation NF_IP_POST_ROUTING hook fragmentation output() NIC OUT: application transport layer mux NF_IP_LOCAL_OUT hook route-out NF_IP_POST_ROUTING hook fragmentation output() NIC * NetFilter overview: The basic architecture of NetFilter is: --->[1]--->(ROUTE)--->[3]--->[4]---> where: | ^ [1] NF_IP_PRE_ROUTING | | [2] NF_IP_LOCAL_IN | (ROUTE) [3] NF_IP_FORWARD v | [4] NF_IP_POST_ROUTING [2] [5] [5] NF_IP_LOCAL_OUT | ^ | | v | Destination NAT (port forwarding) gets applied in NF_IP_PRE_ROUTING and NF_IP_LOCAL_OUT at priority NF_IP_PRIORITY_DNAT = -100, and Source NAT (masquerading) gets applied in NF_IP_POST_ROUTING at priority NF_IP_PRIORITY_SNAT = 100, . Filtering is applied in NF_IP_LOCAL_IN, NF_IP_FORWARD and NF_IP_LOCAL_OUT at priority NF_IP_PRIORITY_FILTER = 0. Hook processing order would generally be:
PRE	IN	FWD	OUT	POST	PRIORITY MACRO         PRI
=======.=======.=======.=======.=======.=================== = ====
-500?  .       .       .       .       .NF_IP_PRI_IPSEC_IN  = -500 (?)
-200   .       .       .-200   .       .NF_IP_PRI_CONNTRACK = -200
-175? . . . . . . . . . . . . . . . . .	NF_IP_PRI_IPSEC_IN  = -175 (?)
-150   .       .       .-150   .       .NF_IP_PRI_MANGLE    = -150
-100   .       .       .-100   .       .NF_IP_PRI_NAT_DST   = -100
. . . . . .0. . . .0. .	. .0. . . . . . NF_IP_PRI_FILTER    =    0
       .       .       .       . 100   .NF_IP_PRI_NAT_SRC   =  100
       .       .       .       . 500   .NF_IP_PRI_IPSEC_OUT =  500
=======.=======.=======.=======.=======.=================== = ====
Not all modules are present at each hook. I am uncertain still if IPSEC_IN should be before or after CONNTRACK. Any comments? * IPSec path description: Treat incoming IPSec encapsulation as a transport layer protocol and decapsulate it at the transport layer demultiplexer since it appears as a transport layer protocol from the bottom of the Internet Protocol network stack. For outgoing, we treat IPSec as a network layer protocol since that is what IPSec appears to be from the top of the IP stack. An incoming packet starts off with a sanity check. It then goes through all the NF_IP_PRE_ROUTING hooks starting with the SPDB checking. It would have several possible targets: DROP; REJECT; ACCEPT; PEEK. DROP, REJECT, ACCEPT are standard NetFilter targets. It would DROP if it should have been encrypted. REJECT is a special case of DROP where an ICMP is returned. It would ACCEPT if it was an encrypted IPSec packet bound for this machine and no other policy was expected first, it had already been decrypted from expected SAs indicated by nfmarks or virtual IPSec device or there was policy to allow it through. PEEK would let the KMd have a look at the packet to see if it needed to start thinking about opportunistic and then pass it on. Since it is a fresh ESP or AH packet, it will not have any nfmarks or virtual IPSec device association and unless that outer IP header should have been processed by another SG in between, no policy will have been required, letting it through. The rest of the NF_IP_PRE_ROUTING hooks may cause it to be DNATed and defragmented. It then goes through routing which thinks it is a local packet, deals with any outer header IP options, then defragmentation and NF_IP_LOCAL_IN filter (allow ESP,AH) before getting to ipsec_rcv() where the outer bundle is authenticated and decrypted and nfmarked or associated with a virtual IPSec device to indicate what decapsulation happenned before being passed back to netif_rx(). The next IP header is now visible. The packet now gets re-injected at the beginning. It goes through the incoming sanity check again, getting checked at NF_IP_PRE_ROUTING for policy using previously set nfmark or virtual IPSec device from decryption. It may again be DNATed and defragmented. Routing looks at the now-visible next IP header and routes it locally or via the forward hook. If it is a local packet, IP options and defragmentation are processed. NF_IP_LOCAL_IN then gets to check filtering policy for other transport layer protocols. If it is the endpoint for nested bundles, it is sent back to netif_rx(), having exposed the next IP header. If it is not a local packet, routing has selected a route, potentially through an existing virtual IPSec device, one per connection, not per physical I/F. IP options and TTL are processed before being filtered at NF_IP_FORWARD, fragmented, then sent to NF_IP_POST_ROUTING. If it is a locally generated packet, it would go through normal filtering at NF_IP_LOCAL_OUT, then go through routing, then be sent to NF_IP_POST_ROUTING. At NF_IP_POST_ROUTING, the ipsec table would make a decision about the fate of the packet. It would have several possible targets: ACCEPT; IPSEC SAList; DROP; REJECT; TRAP; HOLD. ACCEPT would allow the packet through with no processing. IPSEC would return NF_STOLEN, stealing the packet and applying the policy specified by its parameter of an SA list. If the SA(s) do(es)n't exist(s) or if the TRAP target was specified, it would send up an ACQUIRE to all listening key management daemons via PF_KEYv2 and put in a HOLD that would keep only the last packet that matched for that HOLD, waiting for the appropriate SA(s). If or once the SA(s) is/are available, it then IPSec processes the packet, then re-injects the packet at NF_IP_LOCAL_OUT (since the packet now appears to originate from this host) and sets nfmark or associates it with a virtual IPSec device to indicate what processing happenned. The packet would then be routed and sent back to NF_IP_POST_ROUTING. If the IPSec remote security gateway is not different upon policy lookup, the ipsec table would ACCEPT it. DROP would drop the packet if previous attempts to do opportunistic encryption failed and the default policy was to block non-IPSec packets. REJECT would be almost the same as DROP, except that it returns ICMPs. ACCEPT, DROP, REJECT are standard NetFilter targets. A packet routed through an optional IPSec virtual I/F simply gets assigned a specific source address, and has the nfmark/SA list preloaded. * Other issues: The way that nfmark is used is rather vague. It is presently only 32 bits. Ideally, I would like to be able to indicate exactly which SAs were processed on the way in, which would most easily be represented by as many as 4 SAs (AH, ESP, IPCOMP, IPIP), each having an 8 bit protocol field (absolute minimum of 2-bits), 32-bit destination address field (for IPv4, IPv6 would be 128) and a 32-bit SPI. This is a potential maximum of 672 bits. A way of mapping 672 bits on to the 32 bits available would be required to use this. A lookup table could be used to map nfmarks to SAIDs, not the SAs themselves, since the SAs could disappear at any time the SADB is not locked. It should be able to represent a bundle of SAs where one SA could be used in more than one bundle. There could also be more than one right answer for the incoming SPDB. I have an idea how to accomplish this by changing/extending nfmark by converting it to a list of nfmark structures that contain a pointer to the next item on the list, a cookie for the specific netfilter function that owns the data and a pointer to a data structure. nfmark may not be the right tool for this. Another possible solution is to add a member to the struct sk_buff to point to this information. This has the benefit of not depending on anyone else, but the drawback of needing to patch a header file *and recompiling the entire kernel*. There is also the possibility of using the NetFilter Connection Tracking facility. The SADB would be managed via the PF_KEYv2 socket I/F. The SPDB would be managed via a combination of PF_KEYv2 socket I/F extensions and iptables. A separate NetFilter table called 'ipsec' (as opposed to 'filter', 'nat' or 'mangle') would have the first hook at NF_IP_PRE_ROUTING and the last hook at NF_IP_POST_ROUTING. iptables(8) currently uses get/set_sockopt(2) system calles, but there is discussion of having it converted to use the AF_NETLINK socket family. Having it use a PF_POLICY interface that was interoperable with multiple platforms would be a big win. For matches, we have source/destination address/mask/port, userid (owner) and security label. source/destination address/mask/port and userid are already supplied by iptables. We only need to supply security label, if we even think we need it. For targets, how do we do this? SPDB with policies, or name specific SAs, SA chains or SA lists or even virtual IPSec devices. Currently, we name specific SAs which are chained exclusively together. We could have a target iptables library and kernel module that has a target of IPSEC which: 1. names a specific SA/chain which are unsharable 2. lists SAs to apply which are sharable 3. names a specific virtual IPSec device which implies a list of SAs 4. spec's req'd policy, ie.: cipher, hash, shared?, remote gw (pfs should not be and option) I am favouring option number 2. Option number 3 would map to number 2 in the SPDB. SAs would still be stored in a SADB hash table. The prev and next fields of struct ipsec_sa would be removed if SAs were no longer chained, but were listed in lists, either from a direct list, virtual IPSec device or from an SPDB. We could use one table for ipsec matching or maybe use one table for each of ipsec_in and ipsec_out. It would have to use a table separate from filter, nat or mangle since we need an input NF_IP_PREROUTE priority of less than -200 (CONNTRACK) and an output NF_IP_POST_ROUTING priority of more than 200 (CONNTRACK?). I suggest NF_IP_PRIORITY_IPSEC_IN = -500 and NF_IP_PRIORITY_IPSEC_OUT = 500. There might be a security level iptables library and netfilter kernel module, if needed. There would be "IPSEC", "TRAP" and "HOLD" iptables target libraries and netfilter target kernel modules. Equivalent ipsec_rcv() functionality would be installed as an IP transport layer protocol handler and be included with the netfilter ipsec table module. Equivalent ipsec_tunnel_start_xmit() functionality would be in the IPSEC netfilter target kernel module.