Greetings netizens, pandax381 here. A new version of Keepalived (ver. 1.3.0) was released on November 20. I take it we’re all up-to-date on our version upgrades?


2016-11-20 | Release 1.3.0

New MAJOR release with stabilization fixes. Support to DBus. Conf extensions. Parser error log. Security extensions to run scripts more securely. Refer to ChangeLog for more info.


(The announcement on the dev mailing list also hinted that ver. 2.0.0 is close at hand.)


This is a quick email to announce a new major Keepalived release. We are planning with Quentin to push a new release soon as the 2.0.0 release. This one will fix and extend previous parts. It also comes with a Security fix for those making extensive use of scripts.


The biggest changes in version 1.3.0, released on November 20, are added support for DBus and a new security structure for running scripts. Surprise! The release also quietly includes two patches contributed by KLab. Since they slipped in without much fanfare, I’ve written about them in detail below to set the record straight.


Improved Healthcheck Patch

Keepalived comes with a healthcheck feature that lets you monitor real servers. However, the protocols it supports by default are limited to TCP, HTTP(S), and SMTP. For anything other than HTTP and SMTP, you had to add your own healthcheck script if you wanted to do more than just check that a TCP connection could be made.


MISC_CHECK, the part of the program that runs these hand-made healthcheck scripts, has a few problems of its own. That’s why the engineers over here at KLab developed and released a patch that works wonders for Keepalived’s healthcheck. (I’ll talk more about this little issue later.)



In this patch, we added healthcheck support for FTP, DNS, and SSL. The article says, “this patch hasn’t been applied to the DSAS live environment.” However, soon after the article was posted (way back in 2007), we officially started using the patch. In fact, we still use it today. Or at least we did, until Keepalived 1.3.0 came out.


Independently developed patches are often made obsolete by new and improved official versions of software. Such is the fate of most independent patches. We wrote our healthcheck patch over a decade ago, and a lot has changed since then. The patch can’t be applied as-is to later versions of Keepalived, so every time a new version came out we had to update it to keep it usable. Needless to say, it was becoming a bit of a chore. The only way to escape this vicious cycle of endless updates was for our independently developed baby patch to grow up, leave the nest, and be merged into the official version of Keepalived.



Just in case you missed the title of this blog, KLab’s in-house patch has been officially merged into Keepalived. It is now a standard feature as of the version 1.3.0 update.



Here’s the pull request for the healthcheck improvement patch.

The original patch included support for FTP, DNS, and SSL. These days DNS is the only one we use with DSAS, so the DNS healthcheck is the part we contributed to the standard healthcheck functionality. Here’s the format for DNS_CHECK.


   # one entry for each realserver
   real_server <IPADDR> <PORT>
   {
          # DNS healthchecker
          DNS_CHECK
          {
              # ======== generic connection options
              # Optional IP address to connect to.
              # The default is the realserver IP
              connect_ip <IP ADDRESS>
              # Optional port to connect to
              # The default is the realserver port
              connect_port <PORT>
              # Optional interface to use to
              # originate the connection
              bindto <IP ADDRESS>
              # Optional source port to
              # originate the connection from
              bind_port <PORT>
              # Optional connection timeout in seconds.
              # The default is 5 seconds
              connect_timeout <INTEGER>
              # Optional fwmark to mark all outgoing
              # checker packets with
              fwmark <INTEGER>

              # Number of times to retry a failed check
              # The default is 3 times.
              retry <INTEGER>
              # DNS query type
              #   A | NS | CNAME | SOA | MX | TXT | AAAA
              # The default is SOA
              type <STRING>
              # Domain name to use for the DNS query
              # The default is . (dot)
              name <STRING>
          }
   }


There are a lot of options in there. However, as long as you set “type” and “name” correctly, the checker will do its job.


   real_server 192.0.2.100 53 {
           DNS_CHECK {
               type A
               name www.klab.com
           }
   }


The healthcheck succeeds if the response contains one or more records in its ANSWER SECTION. Be careful here: even if a response packet comes back, it is ignored if the ANSWER SECTION is empty. I’m not suggesting this is practical, but it does mean you could control the result of the healthcheck simply by adding or removing the DNS record.
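
The pass/fail rule described above comes down to a single field in the DNS header. The following is only a rough Python sketch of that idea, not Keepalived’s actual code (Keepalived is written in C). The function names `build_query` and `has_answer` and the fixed transaction ID are made up for this illustration.

```python
import struct

# Query type codes from the DNS RFCs, for the types DNS_CHECK accepts.
QTYPES = {"A": 1, "NS": 2, "CNAME": 5, "SOA": 6, "MX": 15, "TXT": 16, "AAAA": 28}


def build_query(name: str, qtype: str, txid: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet (hypothetical helper)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels ending with a zero byte; "." becomes just b"\x00"
    labels = [label for label in name.split(".") if label]
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    # QTYPE followed by QCLASS 1 (IN)
    return header + qname + struct.pack("!HH", QTYPES[qtype], 1)


def has_answer(response: bytes, txid: int = 0x1234) -> bool:
    """Return True only for a matching response with at least one answer record."""
    if len(response) < 12:
        return False  # too short to contain a DNS header
    rid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    is_response = bool(flags & 0x8000)  # QR bit set means "this is a response"
    # An empty ANSWER SECTION (ANCOUNT == 0) does not count as success.
    return rid == txid and is_response and ancount >= 1
```

To try it against a real server, you would send the result of `build_query("www.klab.com", "A")` over a UDP socket to port 53 and pass the reply to `has_answer`.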



Fixing Bugs, Getting Presents

As previously mentioned, MISC_CHECK, which runs our healthcheck script, had a few problems of its own. Until the latest version of Keepalived was released, a script called by MISC_CHECK to perform the healthcheck had the nasty habit of causing the number of running processes to multiply out of control. An unfortunate bug indeed.


Here’s an easy-to-follow example. You can reproduce the problem with the following MISC_CHECK settings.


MISC_CHECK {
   misc_path "/bin/sleep 3600"
   misc_timeout 10
}


MISC_CHECK works in an interesting way. If the script has not finished by the time misc_timeout expires, MISC_CHECK sends a signal that forces the script to end. However, the code that sends this signal has a bug: the script stays alive even though it should have been ended once misc_timeout expired. Meanwhile, each new check starts yet another process, which leaves us with an entire nest of them.


UID   PID  PPID  PGID   SID COMMAND
 0 41010     1 41010 41010 /sbin/keepalived
 0 41013 41010 41010 41010  \_ /sbin/keepalived
 0 41361 41013 41010 41010  |   \_ /sbin/keepalived
 0 41362 41361 41010 41010  |   |   \_ sh -c /bin/sleep 3600
 0 41363 41362 41010 41010  |   |       \_ /bin/sleep 3600
 0 41364 41013 41010 41010  |   \_ /sbin/keepalived
 0 41365 41364 41010 41010  |   |   \_ sh -c /bin/sleep 3600
 0 41366 41365 41010 41010  |   |       \_ /bin/sleep 3600
 0 41367 41013 41010 41010  |   \_ /sbin/keepalived
 0 41368 41367 41010 41010  |       \_ sh -c /bin/sleep 3600
 0 41369 41368 41010 41010  |           \_ /bin/sleep 3600
 0 41014 41010 41010 41010  \_ /sbin/keepalived
 0 41019     1 41010 41010 sh -c /bin/sleep 3600
 0 41020 41019 41010 41010  \_ /bin/sleep 3600
 0 41025     1 41010 41010 sh -c /bin/sleep 3600
 0 41026 41025 41010 41010  \_ /bin/sleep 3600
 0 41031     1 41010 41010 sh -c /bin/sleep 3600
 0 41032 41031 41010 41010  \_ /bin/sleep 3600

To run the healthcheck script, MISC_CHECK calls fork(2) and then system(3), which in turn starts the script through sh -c. From the perspective of the process sending the signal, the process it wants to end is therefore a great-grandchild. In the original code, the signal is only sent to the child process. This means that only the child process ends, while the grandchild and great-grandchild processes continue to run. Sneaky, sneaky.
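
This failure mode is easy to reproduce outside Keepalived. The Python sketch below is only an illustration under my own assumptions, not Keepalived’s code: `os.fork()` and `os.system()` stand in for fork(2) and system(3), and the function name `kill_child_only` is made up.

```python
import os
import signal
import time


def kill_child_only(command: str, timeout: float) -> None:
    """Fork, run `command` via the shell, then SIGTERM only the direct child."""
    pid = os.fork()
    if pid == 0:
        # Child: make sure SIGTERM has its default (fatal) action here.
        signal.signal(signal.SIGTERM, signal.SIG_DFL)
        # os.system() starts "sh -c command", so the script itself is at
        # least one more generation away from the original parent.
        os.system(command)
        os._exit(0)
    time.sleep(timeout)
    os.kill(pid, signal.SIGTERM)  # reaches the direct child only
    os.waitpid(pid, 0)            # the child is gone; its descendants are not
```

Calling `kill_child_only("sleep 3600", 10)` mirrors the configuration shown earlier: the function returns after ten seconds, yet `sleep` keeps running as an orphan.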



The pull request above fixes this bug. Right after MISC_CHECK calls fork(2), setpgid(2) is called to put the child into a process group of its own. kill(2) can target a whole process group rather than a single process, so the signal is delivered to every process that belongs to that group. With this fix, Keepalived signals the child, grandchild, and great-grandchild processes all at once. Problem solved!
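
The idea behind the fix can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions rather than the code from the pull request: `os.setpgid()` and `os.killpg()` wrap setpgid(2) and a process-group kill(2), and the function name `run_with_timeout` and the polling loop are invented for the example.

```python
import os
import signal
import time


def run_with_timeout(command: str, timeout: float) -> int:
    """Run `command` via the shell; on timeout, signal its whole process group.

    Returns the raw wait status of the forked child.
    """
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            # Child: restore the default SIGTERM action, then become the
            # leader of a new process group (group ID == our own PID).
            signal.signal(signal.SIGTERM, signal.SIG_DFL)
            os.setpgid(0, 0)
            status = os.system(command)  # starts "sh -c command"
        finally:
            os._exit(0 if status == 0 else 1)
    try:
        os.setpgid(pid, pid)  # set it from the parent too, to avoid a race
    except OSError:
        pass  # the child already did it, or has already exited
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return status  # the script finished in time
        time.sleep(0.05)
    try:
        # Timed out: one call signals the child, the shell, and the script.
        os.killpg(pid, signal.SIGTERM)
    except ProcessLookupError:
        pass
    _, status = os.waitpid(pid, 0)
    return status
```

Because the shell and the script inherit the child’s process group, nothing is left behind when the timeout fires.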


This fix has also been merged into Keepalived version 1.3.0. Now we should be able to use MISC_CHECK without any problems.


*2: There were actually even more bugs than this. The child processes, the very ones that matter here, unintentionally ignored the termination signal (SIGTERM). Things only appeared to work because the force-quit signal (SIGKILL), sent as a last-resort measure, did kill them, but I think we can all agree this was a less-than-elegant solution to the problem. This has all been resolved now.



All in a Day’s Work

In the article above I made it sound like they took our code as-is and just added it into the package. To be completely honest, they pretty much re-wrote our patch from scratch. Sadly, most of the code we wrote can’t be found inside the official version of Keepalived.


There’s a very good reason for this. The healthcheck framework that comes with Keepalived is designed for TCP. It wasn’t designed with UDP in mind. That’s why we wrote all the socket-related code from scratch in our original patch. This left us with some interesting results. Compared to the standard healthcheckers, our code was full of programming faux pas and missed out on a lot of best practices. Sending it as a pull request as-is might have upset the entire program.


That being said, the framework wasn’t designed for UDP. We racked our brains for days, but since there really wasn’t any other good way to go about it, we ended up revising the core of Keepalived and changing the very fabric of the framework. (The individual healthcheck functions are structured like modules. You can add new ones easily, but revising the main body is a pretty big undertaking.)



In this commit, we took a framework that was designed with only TCP in mind and made it support UDP. To avoid changing the code that already uses the framework, we used inline functions as wrappers so that the existing interfaces stay compatible. This fix makes it much easier to create a UDP-based healthchecker, so there’s a chance that the number of supported protocols will increase in the future.
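
To give a feel for the wrapper approach, here is a small Python sketch. It only mirrors the general idea (a generic helper that takes the socket type, plus thin wrappers that keep the old TCP-only signature). Keepalived itself is written in C and uses inline functions, and all of the names below are hypothetical.

```python
import socket


def socket_connect_check(host: str, port: int, sock_type: int,
                         timeout: float = 5.0) -> socket.socket:
    """Generic helper: open a socket of the given type and connect it."""
    sock = socket.socket(socket.AF_INET, sock_type)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
    except OSError:
        sock.close()
        raise
    return sock


def tcp_connect_check(host: str, port: int, timeout: float = 5.0) -> socket.socket:
    """Old TCP-only entry point, now a thin wrapper; callers need no changes."""
    return socket_connect_check(host, port, socket.SOCK_STREAM, timeout)


def udp_connect_check(host: str, port: int, timeout: float = 5.0) -> socket.socket:
    """New entry point that UDP-based checkers can share."""
    return socket_connect_check(host, port, socket.SOCK_DGRAM, timeout)
```

Existing TCP checkers keep calling `tcp_connect_check` exactly as before, while a UDP-based checker gets the same plumbing through `udp_connect_check`.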


At the end of the day, writing the message for the pull request was harder than writing the actual code itself. The hardest part for me personally was the language barrier. My English is pretty deplorable (thank you, ghost-writer, for translating this blog!), and if the maintainers had replied, “just do this with MISC_CHECK,” that would have been the end of the story. So I poured my heart and soul into writing a message in English that conveyed exactly why I felt it was so important to add this to the default healthcheckers. I relied heavily on Google Translate (which was pretty bad back in the day for Japanese→English), and it gave me a lot of grief, but in the end I got the result I had been after for so long.

This may very well be the biggest and best-known product my own code has ever been included in. Simply put, I am ecstatic. I’m not sure exactly how much demand there is for this kind of thing, but if you’re reading this blog, please try out the DNS healthcheck feature!