As part of the high availability tests in a Grid Infrastructure cluster (usually a RAC cluster), we must validate that every possible infrastructure failure is handled as expected according to the solution design. Hardware failures, and especially connectivity failures (network and I/O), require special effort because they usually need cooperation between different teams, typically DBAs, network administrators and sysadmins.
When we deploy an extended cluster, an important test is the split-brain condition caused by an inter-site communication failure. Frequently, one or more dedicated inter-site connections carry both network and SAN traffic, for example over DWDM links. An outage in DWDM connectivity can lead to a partial and simultaneous outage of our cluster networks (public and interconnect) and I/O access. The cluster must survive this issue, but how can we minimize the impact of split tests in a production environment?
Testing a real split between two sites can run into serious limitations when isolating the affected hardware (shared switches, shared storage…). If the test does not work as expected, it must be repeated after some adjustments. If things get harder and we hit a problem such as a possible bug, we may need to open an SR, and Oracle engineers may require us to perform additional tests. This can be unaffordable in a production environment, or at least very uncomfortable and expensive, with night or weekend work needed to perform the tests.
To prepare and simplify the execution of split tests between two sites, we can use escacha, a tool created ad hoc by Arumel to overcome the limitations of repeating real split tests in an environment where the results were not as expected. It is a shell script that simulates a simultaneous network and/or disk split without any further work from network or system administrators and without disrupting any other production systems.
The tool was developed in an environment where normal redundancy ASM is used to mirror data between two storage arrays, one at each site. At the OS level, native RHEL7 multipath was used, and disk devices were presented over iSCSI. Escacha can:
- Identify all block devices in the storage array of a specific site. When the tool runs, all the devices that belong to the remote storage array are taken offline. Local storage remains available (we only lose one copy of the mirrored ASM data).
- Block interconnect communication. Using iptables rules, only inter-site traffic is blocked. Nodes in the same site still receive each other's heartbeats.
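The two actions above can be pictured with a minimal dry-run sketch. It is not escacha's actual code: the function names, IP addresses and device names are hypothetical, and it only prints the commands a simulated split might run (iptables DROP rules for remote-site addresses, and the sysfs `state` file to put SCSI path devices offline), without executing any of them.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands a simulated split could run; it executes nothing.
# All names and addresses below are hypothetical, not taken from escacha.

# Print iptables commands that drop traffic to and from each remote-site address.
build_block_rules() {
    local ip
    for ip in "$@"; do
        echo "iptables -I INPUT -s ${ip} -j DROP"
        echo "iptables -I OUTPUT -d ${ip} -j DROP"
    done
}

# Print commands that set each SCSI path device offline through sysfs.
build_offline_cmds() {
    local dev
    for dev in "$@"; do
        echo "echo offline > /sys/block/${dev}/device/state"
    done
}

# Example values (assumptions): interconnect IPs of the remote-site nodes
# and the path devices that belong to the remote storage array.
REMOTE_IPS=(192.168.10.3 192.168.10.4)
REMOTE_DEVS=(sdc sdd)

build_block_rules "${REMOTE_IPS[@]}"
build_offline_cmds "${REMOTE_DEVS[@]}"
```

Printing the commands first makes it possible to review exactly which addresses and devices would be affected before anything is cut. Undoing a simulated split would need the inverse steps (deleting the rules and writing `running` to the same sysfs file), which are left out of this sketch.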
Escacha integrates with arumon, another tool made of simple shell scripts, which gathers information periodically during the split-brain test to check basic functionality such as database write access, new session creation, site status (12.2), ASM disk status… Escacha and arumon were tested on 12.2 GI, so other GI versions may need additional code changes.
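As an illustration of this kind of periodic monitoring, here is a minimal sketch of a check logger. It is not arumon's code: `run_check` and `monitor` are hypothetical names, and the two checks are placeholders (`true` and `false`) standing in for real probes such as a database write or a new session attempt.

```shell
#!/usr/bin/env bash
# Hypothetical periodic check logger; not arumon's actual code.

# Run a command and print one line: timestamp, check name, OK or FAIL.
run_check() {
    local name="$1"; shift
    local status=FAIL
    if "$@" >/dev/null 2>&1; then
        status=OK
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') ${name} ${status}"
}

# Run every check for a fixed number of rounds, sleeping between rounds.
monitor() {
    local rounds="$1" interval="$2" i
    for ((i = 0; i < rounds; i++)); do
        run_check always_ok true       # placeholder for a real probe
        run_check always_fail false    # placeholder for a failing probe
        if ((i < rounds - 1)); then
            sleep "$interval"
        fi
    done
}

# Two rounds, one second apart.
monitor 2 1
```

Timestamping every result makes it easier to line up the moment a check starts failing with the moment the split was triggered, and the moment it recovers with the eviction or reconfiguration seen in the cluster logs.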
Escacha and arumon access on GitHub
Escacha and arumon are publicly available on GitHub and can be downloaded using git clone https://github.com/arumel/escacha and git clone https://github.com/arumel/arumon. Escacha is an Arumel internal tool that could be useful for other people, which is why we want to share it, but please keep in mind that it has no warranty or support, so you use it at your own risk. Because it was created ad hoc, the tool is currently not very flexible, and its code is surely not the most efficient possible, but any DBA or sysadmin should be able to adapt it to their needs with little effort. It is very important to be aware that in real mode escacha will cut disk and network access just like a real split would, so be cautious when executing it, especially if you are using Ansible to launch it against a group of servers.
Finally, escacha is a useful tool, but it does not replace real split tests. It can reduce the time spent performing real tests and also reduce their impact, but we recommend using it as a test tool before a real test and not as an alternative to one.
Cover Artwork: Olalla Núñez. “Atrapadas”.
Revision: Alicia Constela