Building a multi-institute VPN solution for Munich’s 130,000 user community
The Leibniz Supercomputing Centre (Leibniz-Rechenzentrum, LRZ) is the computing centre of both Munich Universities of Excellence, Ludwig Maximilian University of Munich (LMU) and the Technical University of Munich (TUM), and is also a national supercomputing centre. The LRZ operates the Munich Scientific Network (MWN) for all universities and other research institutions in the greater Munich area. The MWN connects more than 130,000 users.
The structure of LRZ differs from that of many university computing centres: we provide services for several universities in the Munich area instead of only one. The biggest ones, counted by VPN users, are:
- Technical University of Munich (TUM), tum.de
- Ludwig Maximilian University of Munich (LMU), lmu.de
- HM Hochschule München University of Applied Sciences (HM), hm.edu
- Weihenstephan-Triesdorf University of Applied Sciences, hswt.de
plus a number of smaller institutions.
Existing VPN Service
The current VPN solution works with several VPN servers pooled into a VPN cluster with a single virtual address. Users connect with their clients to the cluster address and are assigned to different IP pools. Which IP pool is appropriate is decided based on the login, the prefix, and where the user connects from. Authentication (checking login and password), authorization (selecting the IP pool), and accounting are done externally on a central Radius server. After a successful login, the Radius server returns a tag for the VPN IP pool to the VPN server. The Radius server authenticates against a central LDAP directory and several other Radius servers.
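The pool decision can be pictured as a small lookup. The following is only a sketch of the idea: the login format (`prefix/user`), the prefix names, the pool tags, and the "internal" network are all invented for illustration and do not reflect the actual Radius configuration.

```python
import ipaddress

# Hypothetical mapping from login prefix to pool tag (not real LRZ data).
PREFIX_POOLS = {"tum": "pool-tum", "lmu": "pool-lmu", "hm": "pool-hm"}
DEFAULT_POOL = "pool-generic"
# Hypothetical network; clients connecting from it get a separate pool.
INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")


def select_pool(login: str, client_ip: str) -> str:
    """Return the pool tag for a login such as 'tum/alice' coming from client_ip."""
    prefix, sep, _user = login.partition("/")
    # Logins without a prefix, or with an unknown prefix, fall back to the default.
    pool = PREFIX_POOLS.get(prefix, DEFAULT_POOL) if sep else DEFAULT_POOL
    if ipaddress.ip_address(client_ip) in INTERNAL_NET:
        pool += "-internal"
    return pool
```

In the setup described here, the equivalent decision is taken on the Radius server, and only the resulting tag is handed back to the VPN server.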
During the pandemic peak we had over 6,500 concurrent users on the VPN cluster. Before March 2020 these numbers were around 1,500 to 2,000 concurrent connections.
Because we still have an operating VPN environment, the requirements for the new service were easy to define. It should serve 6,000+ concurrent users, as in pandemic times. Automatic client updates are very important because of the large number of active registered users (typically about ten times the maximum number of concurrent users); the same applies to configuration profile updates. Users must be able to choose between multiple client profiles: one for a split tunnel scenario, where only the Munich Scientific Network is accessed via VPN, and one for a full tunnel, where all traffic from the client passes through the VPN tunnel. For LRZ employees using VPN, two-factor authentication is mandatory.
Why migrate? Hardware and Licensing Scheme
The current VPN hardware appliances are end-of-life and go out of service in 2022. A new licensing scheme introduced several years ago, which switched from counting concurrent users to counting every potential user, does not fit our requirements because of the potentially large number of registered users.
So we faced a crucial decision: continue with the established system, with new licenses and new hardware, or try something different?
Another service we provide to institutes is access control with virtual firewalls. These are based on pfSense instances running as VM pairs. pfSense offers VPN access out of the box, based on OpenVPN, with a nice GUI. We therefore have several years of support experience with OpenVPN.
So we started with an OpenVPN testing environment for a small circle of voluntary users. Feedback showed that this would work but lacked several features our existing VPN solution had: automatic client updates, automatic client configuration updates, and automatic selection of affiliation (IP pools). The automatic updates were very important to us, given the large number of users.
eduVPN – Setting up eduVPN for LRZ
Luckily, eduVPN is a VPN solution that eases most of these OpenVPN inconveniences.
After some testing with different authentication schemes of eduVPN, we decided to set up one server for each of the three biggest universities by VPN usage (TUM, LMU and HM), one “catch all” server for the other institutions, and one dedicated server for us, the LRZ.
One controller, multiple nodes
Each of these servers was realized with one controller and two nodes. Controller redundancy relies on the virtual infrastructure; node redundancy is achieved by running multiple nodes, and we started with two.
Installation and Configuration
After carefully reading and following the eduVPN documentation on GitHub, the installation ran smoothly.
Debian as OS on VM
Debian was our choice for the operating system, as it is one of the supported systems at the LRZ. All systems run on our virtual infrastructure. Configuration management is moving from manual to Puppet. Monitoring is done with Check_MK. As the initial setup, the controllers were deployed with two cores and 2 GB of RAM (see Shibboleth below). The nodes run with one core and 1 GB of RAM. VPN use of the nodes is still moderate.
Shibboleth, LDAP or Radius?
Shibboleth was our choice for the big universities and the LRZ installation. The reasons for this choice were: a) these institutions are already registered as identity providers (IdP); b) additional security mechanisms, like two-factor authentication, can be deployed on the IdP side and not on the local system; c) user passwords are not processed on the local system. Caveat: you have to install a Shibboleth daemon on the system and register it as a service provider (SP). The Shibboleth daemon, shibd, turned out to be quite hungry for resources. LDAP, the second candidate, was dropped because of the advantages of Shibboleth.
Radius was our anchor for the transition from the old VPN service to eduVPN. With our Radius server we have a working AAA system into which you can feed login and password. If the authentication is successful, you get back the affiliation, which determines the IP pool for the login. The trade-off is that login data is processed on the local system.
eduVPN supports access control lists (ACLs) to manage access to VPN profiles. The ACLs can be mapped to a “permissionAttribute” retrieved from the authorizing server. Unfortunately, this was implemented only for Shibboleth and LDAP and not for Radius. After we outlined our use case to the eduVPN support team, the feature was promptly implemented for Radius as well, and we could proceed.
The ACL for profile mapping turned out to be quite flexible. We can therefore offer a) a generic VPN profile for access to the Munich Scientific Network and b) institution-bound VPN profiles with user-designated IP pools.
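The effect of such an ACL can be sketched in a few lines. This is an illustration of the idea only, not eduVPN code: the `Profile` class, its field names, and the permission values are hypothetical, and a profile without an ACL is assumed to be open to every authenticated user.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    profile_id: str
    # Hypothetical field: permissions that unlock this profile.
    # An empty list means no ACL, so every authenticated user may use it.
    acl_permissions: list = field(default_factory=list)


def visible_profiles(profiles, user_permissions):
    """Return the IDs of the profiles a user with these permissions may use."""
    granted = set(user_permissions)
    return [
        p.profile_id
        for p in profiles
        if not p.acl_permissions or granted.intersection(p.acl_permissions)
    ]


profiles = [
    Profile("mwn-generic"),              # generic access to the MWN
    Profile("tum-pool", ["tum"]),        # institution-bound profiles
    Profile("lmu-pool", ["lmu"]),
]
print(visible_profiles(profiles, ["tum"]))  # ['mwn-generic', 'tum-pool']
```

Here the value returned by the authorizing server for the permission attribute plays the role of `user_permissions`.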
Inherited IPv4 subnet oddities
With the old VPN solution it was possible to assign multiple non-contiguous IPv4 subnets to one VPN IP pool. This was done quite often with non-RFC 1918 subnets when a growing number of VPN users needed more address space: a free /24 or, if possible, /23 subnet in our allocation was assigned to VPN. Over time the VPN subnets became scattered across our allocation.
These VPN subnets are registered with multiple internal and external services that are not under our control, e.g., online libraries. So renumbering the VPN subnets was not really an option. With IPv6 this challenge never came up.
In eduVPN a single VPN IP pool defined in a profile can be split across multiple server processes, but you cannot use two non-contiguous subnets for it. So we will roll out new nodes on demand.
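Whether a set of legacy subnets can serve as one pool, and how one pool divides among processes, can be checked with Python's standard `ipaddress` module. This is a sketch under stated assumptions: the function names are ours, the addresses are documentation ranges, and the even power-of-two split is an assumption for illustration rather than a statement about eduVPN's internals.

```python
import ipaddress


def merge_to_single_pool(subnets):
    """Return the one network exactly covering `subnets`, or None if no single prefix does."""
    nets = [ipaddress.ip_network(s) for s in subnets]
    collapsed = list(ipaddress.collapse_addresses(nets))
    return collapsed[0] if len(collapsed) == 1 else None


def split_for_processes(pool, processes):
    """Split `pool` into equal subnets, one per server process (power of two assumed)."""
    diff = processes.bit_length() - 1
    if processes < 1 or (1 << diff) != processes:
        raise ValueError("processes must be a power of two")
    return list(ipaddress.ip_network(pool).subnets(prefixlen_diff=diff))


# Two adjacent halves merge into one pool; two scattered /24s do not.
print(merge_to_single_pool(["192.0.2.0/25", "192.0.2.128/25"]))    # 192.0.2.0/24
print(merge_to_single_pool(["198.51.100.0/24", "203.0.113.0/24"]))  # None
print([str(n) for n in split_for_processes("192.0.2.0/24", 4)])
```

Scattered subnets for which the first function returns `None` are the case that, in the setup described here, ends up on separate nodes.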
Using one controller and multiple nodes helped us circumvent our IPv4 subnet limitations. Although very well documented, this setup can be tricky. Our mantra: keep things transparent to the user, with only one VPN profile, which then connects to one of the nodes via load balancing. The mechanism used here is very simple to implement: round-robin DNS. This implies an identical setup on the different nodes, except for the VPN subnets of course. Things to watch out for: Are the nodes listening on the same UDP/TCP ports for this profile? Do all nodes have a common TLS key? Did you allow enough OpenVPN processes to be started? Is the firewall configured for the additional ports? Configuration errors here may result in strange “works, then does not work” scenarios, depending on which node a client happens to reach. All this can easily be learnt the hard way.
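Mismatches of this kind can be caught mechanically before they reach users. The sketch below compares per-node settings and reports every key whose value differs, ignoring the VPN subnets. The dictionary layout and the key names (`ports`, `tls_key_sha256`, `max_processes`, `subnets`) are hypothetical and are not eduVPN configuration keys.

```python
def find_mismatches(nodes, ignore=("subnets",)):
    """Return {key: {node_name: value}} for every setting that differs between nodes."""
    keys = set()
    for cfg in nodes.values():
        keys.update(cfg)
    mismatches = {}
    for key in sorted(keys - set(ignore)):
        values = {name: cfg.get(key) for name, cfg in nodes.items()}
        # repr() lets us compare unhashable values such as lists.
        if len({repr(v) for v in values.values()}) > 1:
            mismatches[key] = values
    return mismatches


# Hypothetical example data: node2 is missing the TCP port.
nodes = {
    "node1": {"ports": ["udp/1194", "tcp/1194"], "tls_key_sha256": "ab12",
              "max_processes": 4, "subnets": ["192.0.2.0/24"]},
    "node2": {"ports": ["udp/1194"], "tls_key_sha256": "ab12",
              "max_processes": 4, "subnets": ["198.51.100.0/24"]},
}
print(find_mismatches(nodes))
```

With round-robin DNS a client lands on either node, so a difference like the one reported here for `ports` would show up as an intermittent failure.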
Network setup for Performance
The nodes handle the VPN traffic. We therefore set up the nodes with two network interfaces: an external one to which the clients connect and an internal one to which the VPN subnets are routed. Traffic selection is done via source-based routing.
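On Linux, routing by source address is commonly expressed with iproute2 policy rules. The helper below only renders such commands as text so they can be reviewed; it does not execute anything. The interface name, gateway, table number, and subnets are assumptions for illustration, and the actual LRZ configuration is not described in this article.

```python
import ipaddress


def policy_routing_commands(vpn_subnets, gateway, device, table=100):
    """Render iproute2 commands that route traffic sourced from the VPN subnets via `device`."""
    ipaddress.ip_address(gateway)  # raises ValueError on a malformed gateway
    commands = [f"ip route add default via {gateway} dev {device} table {table}"]
    for subnet in vpn_subnets:
        net = ipaddress.ip_network(subnet)  # validates and normalizes the prefix
        commands.append(f"ip rule add from {net} table {table}")
    return commands


for line in policy_routing_commands(["192.0.2.0/24"], "198.51.100.1", "eth1"):
    print(line)
```

Packets whose source address lies in a VPN subnet are looked up in the separate table and leave through the internal interface, while everything else keeps using the main routing table.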
Two-Factor authentication for LRZ employees
For LRZ employees using VPN, two-factor authentication (2FA) is mandatory. eduVPN has built-in 2FA support, for which users can enroll on their own. The algorithm used is Time-based One-Time Password (TOTP). The 2FA backend at LRZ is based on privacyIDEA, a modular authentication server; users get YubiKeys that are preregistered in privacyIDEA. By using Shibboleth as the authentication protocol, we were able to activate 2FA at the identity provider (IdP). So YubiKeys worked for eduVPN with 2FA, and no external application for generating TOTP codes was needed.
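For readers unfamiliar with TOTP, the algorithm is small enough to show in full. The following is a minimal standard-library sketch of RFC 6238 with HMAC-SHA1; it is not the code used by eduVPN or privacyIDEA, and the shared secret is the RFC's published test value.

```python
import hashlib
import hmac
import struct
import time


def totp(secret: bytes, at=None, step=30, digits=6) -> str:
    """Compute an RFC 6238 TOTP code (HMAC-SHA1) for the Unix time `at`."""
    if at is None:
        at = time.time()
    counter = int(at) // step                      # number of elapsed time steps
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                     # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# RFC 6238 test vector: ASCII secret "12345678901234567890", time 59 s.
print(totp(b"12345678901234567890", at=59, digits=8))  # 94287082
```

Server and token derive the same code from the shared secret and the current time step, which is why no network connection between them is needed at login time.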
We had an issue with Windows 10 LTSC installations, where .NET 4.8 had to be installed manually. With the latest upgrade of the eduVPN app to version 2.1 this should be fixed. Linux users complained that the split tunnel profile was not working as expected: NetworkManager always added a default route for the VPN profile. This problem cropped up with plain OpenVPN as well. Manually adding the option never-default=true to the NetworkManager VPN profile helped.
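For users who prefer to script the NetworkManager workaround, the sketch below sets the option in the text of a connection keyfile. It assumes the option belongs in the `[ipv4]` section of a keyfile-format profile, and the sample profile content is invented. Note that `configparser` drops comments when rewriting, so treat this as an illustration rather than a finished tool.

```python
import configparser
import io


def set_never_default(keyfile_text: str) -> str:
    """Return the keyfile text with never-default=true set in the [ipv4] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str          # keep key names exactly as written
    parser.read_string(keyfile_text)
    if not parser.has_section("ipv4"):
        parser.add_section("ipv4")
    parser.set("ipv4", "never-default", "true")
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Hypothetical minimal profile for demonstration.
sample = "[connection]\nid=eduVPN\ntype=vpn\n\n[ipv4]\nmethod=auto\n"
print(set_never_default(sample))
```

The function works on text only; writing the result back to the profile and reloading the connection is left to the administrator.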
Some users connected to the “Secure Internet” VPN provided by the German NREN DFN and complained about not getting access to internal resources. We had to emphasize the use of “Institute Access” VPN in the documentation.
We offer VPN-only access in public places over a restricted network, where users can only connect to VPN Servers. For these networks we had to allow access to the Shibboleth identity providers.
Installation and configuration were free of show-stopping obstacles, thanks to the good documentation and the very responsive support. We still have to face the start of the winter term with lots of new students.
Although a big chunk of the work is done, there are still things to do: enhancing the user documentation and automating janitor processes, such as cleaning out users who are no longer eligible but still have active profiles.