The interconnect network is a critical component of a RAC environment. It is very important to minimize its latency, maximize its throughput and avoid the loss of UDP packets.
There’s more than a heartbeat
Sometimes, when working with system or network administrators, you find they expect a “classic” interconnect network for deploying a RAC: a network used only to exchange configuration and keepalive information between nodes. With Oracle RAC, we must think of the interconnect as an I/O system, because it is the component Cache Fusion uses to send data blocks from one cluster node to any other without having to write them to disk for the other instance to read a current value. So we can find high network traffic, with 8 KiB blocks being sent constantly (if we use the default block size for our database).
Some years ago, Oracle stated that a dedicated physical network, separate from the public network, was required as interconnect. Many customers did not use specific hardware for this implementation and instead integrated it into the consolidated LAN infrastructure, using dedicated VLANs for these networks. In 2012, Oracle published a white paper called Oracle Real Application Clusters (RAC) and Oracle Clusterware Interconnect Virtual Local Area Networks (VLANs) Deployment Considerations. This document begins by discussing the concepts of private and separate.
As a summary, we should keep in mind:
- We can use tagged or untagged VLANs; specific hardware is not required.
- RAC servers should have OSI layer 2 adjacency, within the same broadcast domain and with single-hop communication.
- Disabling or restricting STP (Spanning Tree Protocol) is very important to avoid traffic suspension that could result in a split brain.
- Enable pruning or private VLANs, so multicast and broadcast traffic is never propagated beyond the access layer.
Additionally, from 11.2.0.2 onward, we need to enable multicast traffic for network 230.0.1.0 or, from 11.2.0.2.1, for network 224.0.0.251. We can find further information in the corresponding MOS note, which also provides a tool for checking multicast traffic in a network.
There are many points to keep in mind when configuring the interconnect. We should take a look at the corresponding MOS note; I would highlight these recommendations:
- Jumbo frames. With a 1500-byte MTU, an 8 KiB block must be fragmented into 6 packets, each one processed with its corresponding headers, sent, and reassembled at the destination to rebuild the original 8 KiB block. This adds overhead to the transmission, as we are sending more control data and need extra processing for fragmentation and reassembly. With a 9000-byte MTU, an 8 KiB block needs just one IP packet, with no fragmentation or reassembly, increasing overall performance.
- HAIP. This feature, available from 11.2.0.2, is an alternative to OS bonding and is the preferred option for interconnect configuration. We can give the cluster up to 4 different networks, and Clusterware will automatically manage high availability across all of them, moving the virtual IPs HAIP uses between these networks when a failure arises. We can use HAIP on top of OS bonding, but this adds complexity over functionality. In a virtual environment, where virtual networks are already protected by hypervisor-level physical bondings, using HAIP may make no sense, as we are not adding fault tolerance. It is, however, a good solution for physical environments, where HAIP can tolerate a switch or network card failure in an almost transparent way (database services do not suffer an outage, but interconnect traffic stops for some seconds while HAIP reconfigures, so some slowness can be seen in this situation).
- Speed. We can use 1GbE networks, but more powerful solutions, like 10GbE, are recommended. If we think again of the interconnect as an I/O network, there is nothing more to say.
- When using bonding on Linux systems, we must avoid mode 3. This is the broadcast mode, which transmits all packets on all bonding interfaces.
- Remove Reverse Path Filtering interference by configuring the rp_filter kernel parameter with value 0 (disabled) or 2 (loose). This is an IP-level functionality, so if we are using HAIP over bonding, this parameter must be configured on the bonding OS device and not on the individual interface devices. With this configuration we prevent, on kernels 2.6.31 and up, a situation where packets get blocked or discarded. We can find more specific information in MOS note rp_filter for multiple private interconnects and Linux Kernel 2.6.32+ (Doc ID 1286796.1).
- Network kernel buffers. The general recommendation for 11g and up is net.core.rmem_default=262144, net.core.rmem_max=4194304, net.core.wmem_default=262144 and net.core.wmem_max=1048576. These values can be changed, for example, after checking network test results, to improve speed or solve network errors.
- Tests. Interconnect validation is a critical task. It is a common and frequent point of failure that goes undetected because no tests are performed when deploying a RAC. Using netperf and ethtool (or other tools) lets us identify any hardware or configuration problem even before installing any software. This will especially prevent low performance and UDP packet loss in the network. Cache Fusion uses UDP, a connectionless protocol, and losing UDP packets can be the origin of a high-severity performance problem in a RAC, so we must test communication and network statistics to identify and solve any possible problem before going live. This information should be checked with the network administrators.
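As a quick illustration of the jumbo frames and kernel parameter points above, this sketch computes how many packets one 8 KiB block needs at each MTU (assuming standard 20-byte IP and 8-byte UDP headers) and prints the recommended settings as sysctl key=value pairs for comparison against a running system (eno53 is a placeholder interface name, use your interconnect device):

```shell
# Packets needed to carry one 8 KiB (8192-byte) Cache Fusion block.
# Usable UDP payload per packet = MTU - 20 (IP header) - 8 (UDP header).
for mtu in 1500 9000; do
  payload=$((mtu - 28))
  packets=$(( (8192 + payload - 1) / payload ))   # ceiling division
  echo "MTU $mtu: $packets packet(s) per 8 KiB block"
done

# Kernel settings discussed above, ready to check against 'sysctl -a'
# or to persist in /etc/sysctl.conf (rp_filter goes on the device
# actually carrying interconnect traffic).
cat <<'EOF'
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576
net.ipv4.conf.eno53.rp_filter = 2
EOF
```

The loop prints 6 packets for MTU 1500 and 1 packet for MTU 9000, which is exactly the fragmentation saving described above.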
What to check in interconnect tests?
Now let's review a real example of a 3-node RAC with 2 interconnect networks configured with HAIP. Each node has 2 connections to every other node, through interfaces named priv1 and priv2. We can find more information regarding the use of netperf for these tests in MOS Doc ID 810394.1.
These tables show maximum and minimum values because the tests are repeated several times, both to make sure the results are OK and to check the stability and reliability of the networks.
Tests are done using netperf, taking results from two different modes: STREAM, which uses 8 KiB blocks to measure network throughput in MiB/s, and RESPONSE, which sends 1-byte packets to check network latency with minimum load. We should verify:
- When using UDP, it's important to check results on both the sending and the receiving side, because without any connection control a sender can go as fast as it can while the receiver gets flooded and discards packets, so they are never actually received at the application level. In this example, values are exactly the same on both sides, which is great news when you perform a test.
- Stability of the results. Check whether results are stable across different test runs.
- Maximum throughput is reached. We should know the theoretical throughput our network should deliver, and that is what we should get in the STREAM test. In our example, we are using a 20 Gbps bonding, but with a 10 Gbps throughput limit for a single connection, so 1,180 MiB/s is a very good result.
- Network latency. The RESPONSE test reports the number of 1-byte packets exchanged. From it we calculate the average response time, which was 0.03 ms in all of our tests. These are very good latencies compared with tests performed at other customers. A deeper view of the expected latency can be discussed again with the network administrators.
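The arithmetic behind the last two checks can be sketched as follows (the 10-second duration and the 333,000-transaction count are hypothetical round numbers for illustration, not the actual measurements above):

```shell
# Theoretical ceiling of a 10 Gbps link expressed in MiB/s:
# 10^10 bits/s -> divide by 8 for bytes -> divide by 2^20 for MiB.
awk 'BEGIN { printf "10 Gbps = %.0f MiB/s\n", 1e10 / 8 / 2^20 }'

# Average latency from a RESPONSE-style test: test duration divided
# by the number of 1-byte round trips, converted to milliseconds.
awk 'BEGIN { printf "avg latency = %.2f ms\n", (10 / 333000) * 1000 }'
```

The first line gives roughly 1192 MiB/s, so a measured 1,180 MiB/s is essentially line rate; the second shows how a 10-second run with about 333,000 round trips yields the 0.03 ms average mentioned above.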
In these tests, the results are very clear and there is no suspicion of any problem in the network. Even so, it is recommended to take a look at the OS statistics using netstat or ethtool, especially when performing UDP tests. Doing so, we can identify potentially problematic results and where the problem is being generated (sender, network card, or receiver).
We are comparing sender and receiver statistics. Without going into much detail, due to how the shell script used to perform the tests gathers the information, it is normal to find receiver packet counts at 50% of sender transmissions, because of the use of 2 networks and the execution of 2 different tests. We can see that more than 227 million packets were received and only 254 packets ended in error, which is insignificant and a great result. We can also see how using jumbo frames avoids fragmentation: no packet has been fragmented and no packet has been reassembled. Sending errors are insignificant too. Finally, ethtool shows that no error has been registered at the interface level.
Once interconnect network performance and latency have been tested, which is usually done before installing the software, and after we have deployed our infrastructure (Grid Infrastructure and RAC database), we must test high availability. Our tests must include the loss of every interconnect interface. We will validate:
- Initial state before testing.
- HAIP IP failover to a surviving network seconds after disconnecting one of the interconnects.
- The routing table must be updated at OS level.
- The HAIP IP must fail over again to its original interface once the network is recovered.
As an example, we perform a test on a node with two interconnects. The initial state is:
[oracle@rac1 ~]$ oifcfg iflist
eno53  10.181.18.240
eno53  169.254.0.0
eno54  10.181.18.248
eno54  169.254.128.0
bond0  10.181.18.160
[oracle@rac1 ~]$ oifcfg getif
bond0  10.181.18.160  global  public
eno53  10.181.18.240  global  cluster_interconnect
eno54  10.181.18.248  global  cluster_interconnect
[oracle@rac1 ~]$ route
Kernel IP routing table
Destination    Gateway  Genmask          Flags Metric Ref Use Iface
default        gateway  0.0.0.0          UG    0      0    0  bond0
10.181.18.160  0.0.0.0  255.255.255.224  U     0      0    0  bond0
10.181.18.240  0.0.0.0  255.255.255.248  U     0      0    0  eno53
10.181.18.248  0.0.0.0  255.255.255.248  U     0      0    0  eno54
link-local     0.0.0.0  255.255.128.0    U     0      0    0  eno53
169.254.128.0  0.0.0.0  255.255.128.0    U     0      0    0  eno54
With two private networks (OS interfaces eno53 and eno54), Oracle is managing two virtual IPs: 169.254.0.0 (eno53) and 169.254.128.0 (eno54). Now let's disconnect the cable from eno53. This is the result:
[oracle@rac1 ~]$ oifcfg iflist
eno53  10.181.18.240
eno54  10.181.18.248
eno54  169.254.128.0
eno54  169.254.0.0
bond0  10.181.18.160
[oracle@rac1 ~]$ route
Kernel IP routing table
Destination    Gateway  Genmask          Flags Metric Ref Use Iface
default        gateway  0.0.0.0          UG    0      0    0  bond0
10.181.18.160  0.0.0.0  255.255.255.224  U     0      0    0  bond0
10.181.18.240  0.0.0.0  255.255.255.248  U     0      0    0  eno53
10.181.18.248  0.0.0.0  255.255.255.248  U     0      0    0  eno54
link-local     0.0.0.0  255.255.128.0    U     0      0    0  eno54
169.254.128.0  0.0.0.0  255.255.128.0    U     0      0    0  eno54
The 169.254.0.0 IP is still up and available for interconnect communication, but it is now listening on interface eno54. Additionally, we can check that the routing table has been updated to reflect the change. When the cable is reconnected, the situation returns to the initial state after several seconds:
[oracle@rac1 ~]$ oifcfg iflist
eno53  10.181.18.240
eno53  169.254.0.0
eno54  10.181.18.248
eno54  169.254.128.0
bond0  10.181.18.160
[oracle@rac1 ~]$ route
Kernel IP routing table
Destination    Gateway  Genmask          Flags Metric Ref Use Iface
default        gateway  0.0.0.0          UG    0      0    0  bond0
10.181.18.160  0.0.0.0  255.255.255.224  U     0      0    0  bond0
10.181.18.240  0.0.0.0  255.255.255.248  U     0      0    0  eno53
10.181.18.248  0.0.0.0  255.255.255.248  U     0      0    0  eno54
link-local     0.0.0.0  255.255.128.0    U     0      0    0  eno53
169.254.128.0  0.0.0.0  255.255.128.0    U     0      0    0  eno54
In general, the stability and performance of our infrastructure cannot be achieved through improvisation; they must be matched to expectations, keeping in mind our hardware capabilities. We can quickly deploy a RAC by skipping tests and not looking for potential problems, or we can stress our hardware and software to validate that it works as expected, solving every issue we find while executing the tests. Interconnect issues are quite frequent, so this is not a minor review point.
Post image by Vera Pereiro. Arumel’s Meerkat.
Text Reviewer: Alicia Constela.