KLabGames Tech Blog - English

KLab develops and provides service for a variety of smartphone games. The world of mobile games is growing by leaps and bounds, and the development process calls for a different set of skills than traditional console games.

Greetings netizens, pandax381 here. A new version of Keepalived (ver. 1.3.0) was released on November 20. I take it we’re all up-to-date on our version upgrades?


2016-11-20 | Release 1.3.0

New MAJOR release with stabilization fixes. Support to DBus. Conf extensions. Parser error log. Security extensions to run scripts more securely. Refer to ChangeLog for more info.


(In the dev mailing list, they included an announcement hinting that ver. 2.0.0 was close at hand.)


This is a quick email to announce a new major Keepalived release. We are planning with Quentin to push a new release soon as the 2.0.0 release. This one will fix and extend previous parts. It also comes with a Security fix for those making extensive use of scripts.


The biggest changes in version 1.3.0, released on November 20, include added support for DBus as well as a new security mechanism for running scripts. Surprise! These two features came from patches KLab created specifically for this package. They were quietly included in the recent update, so in order to set the record straight, I’ve written about them in detail below.


Improved Healthcheck Patch

Keepalived comes with a healthcheck feature that lets you monitor real servers. However, the protocols it supports by default are limited to TCP, HTTP(S), and SMTP. Apart from HTTP and SMTP, you had to add your own healthcheck script if you wanted to do more than just check your TCP connections.


MISC_CHECK, the part of the program that runs a hand-made healthcheck script, has a few problems of its own. That’s why the engineers over here at KLab developed and released a patch that does wonders for Keepalived’s healthcheck. (I’ll talk more about the MISC_CHECK issue later.)



In this patch, we added healthcheck support for FTP, DNS, and SSL. Inside the article it says, “this patch hasn’t been applied to the DSAS live environment.” However, soon after that article was posted (way back in 2007), we officially started using the patch. In fact, we still use it today. Or at least we did, until Keepalived 1.3.0 came out.


Independently developed patches are often made obsolete by new official versions of the software. Such is the fate of most independent patches. We wrote our healthcheck patch over a decade ago, and a lot has changed since then. The patch could not be used as-is with later versions of Keepalived, so every time a new version came out, we had to update the patch to keep it usable. Needless to say, it was becoming a bit of a chore. The only way to escape this endless cycle of updates was for our independently developed baby patch to grow up, leave the nest, and be merged into the official version of Keepalived.



Just in case you missed the title of this blog, KLab’s in-house patch has been officially merged into Keepalived. It is now a standard feature as of the version 1.3.0 update.



Here’s the pull request from the healthcheck improvement patch.

The original patch only included support for FTP, DNS, and SSL. These days we only use DNS with DSAS, so we added a DNS healthcheck into the standard healthcheck functionality. Here’s the format for DNS_CHECK.


  # one entry for each realserver
   real_server <IPADDR> <PORT>
   {
          # DNS healthchecker
          DNS_CHECK
          {
              # ======== generic connection options
              # Optional IP address to connect to.
              # The default is the realserver IP
              connect_ip <IP ADDRESS>
              # Optional port to connect to
              # The default is the realserver port
              connect_port <PORT>
              # Optional interface to use to
              # originate the connection
              bindto <IP ADDRESS>
              # Optional source port to
              # originate the connection from
              bind_port <PORT>
              # Optional connection timeout in seconds.
              # The default is 5 seconds
              connect_timeout <INTEGER>
              # Optional fwmark to mark all outgoing
              # checker packets with
              fwmark <INTEGER>

              # Number of times to retry a failed check
              # The default is 3 times.
              retry <INTEGER>
              # DNS query type
              #   A | NS | CNAME | SOA | MX | TXT | AAAA
              # The default is SOA
              type <STRING>
              # Domain name to use for the DNS query
              # The default is . (dot)
              name <STRING>
          }
   }


There are a lot of options in there. However, as long as you set “type” and “name” correctly, the check will do its job.


   real_server 192.0.2.100 53 {
          DNS_CHECK {
              type A
              name www.klab.com
          }
   }


The healthcheck succeeds if the response contains one or more records in the ANSWER SECTION. You have to be careful here: even if a response packet comes back, the check does not count as successful when the ANSWER SECTION is empty. I’m not suggesting this is practical, but you could even control the result of the healthcheck by registering or removing the DNS record if you wanted to.
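The success rule above can be sketched in a few lines of Python. This is only an illustration of the idea and not Keepalived’s actual C implementation: the function names build_query and check_passes, the fixed query ID, and the choice to set the recursion-desired flag are all assumptions made for this example. The record type numbers are the standard DNS values.

```python
import struct

# Standard DNS record type numbers for the types DNS_CHECK accepts.
QTYPES = {"A": 1, "NS": 2, "CNAME": 5, "SOA": 6, "MX": 15, "TXT": 16, "AAAA": 28}


def build_query(name, qtype="SOA", query_id=0x1234):
    """Build a DNS query packet with a single question (hypothetical helper)."""
    # Header: ID, flags (recursion desired; an assumption), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Encode the name as length-prefixed labels; "." becomes the root (a single zero byte).
    labels = [label for label in name.rstrip(".").split(".") if label]
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    question = qname + struct.pack("!HH", QTYPES[qtype], 1)  # class IN
    return header + question


def check_passes(response, query_id=0x1234):
    """Return True only if the reply matches our query and has at least one answer."""
    if len(response) < 12:
        return False
    rid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    is_response = bool(flags & 0x8000)  # QR bit
    return rid == query_id and is_response and ancount >= 1
```

Note how a reply with an empty ANSWER SECTION fails the check even though a packet did come back, which is exactly the caveat described above.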



Fixing Bugs, Getting Presents

As previously mentioned, MISC_CHECK, which runs our healthcheck script, had a few problems of its own. Until the latest version of Keepalived was released, the script called by MISC_CHECK to perform the healthcheck had the nasty habit of causing the number of running processes to multiply out of control. An unfortunate bug indeed.


Here’s an easy-to-follow example. You can easily recreate this problem by using the following settings when running MISC_CHECK.


MISC_CHECK {
   misc_path "/bin/sleep 3600"
   misc_timeout 10
}


MISC_CHECK works in an interesting way. If the script doesn’t finish before misc_timeout elapses, MISC_CHECK sends a signal to force it to end. However, there was a problem in the code that sends this signal: the script process stayed alive even though it should have been ended once misc_timeout expired. Since a new process is started for every subsequent check, we are left with an entire nest of processes.


UID   PID  PPID  PGID   SID COMMAND
 0 41010     1 41010 41010 /sbin/keepalived
 0 41013 41010 41010 41010  \_ /sbin/keepalived
 0 41361 41013 41010 41010  |   \_ /sbin/keepalived
 0 41362 41361 41010 41010  |   |   \_ sh -c /bin/sleep 3600
 0 41363 41362 41010 41010  |   |       \_ /bin/sleep 3600
 0 41364 41013 41010 41010  |   \_ /sbin/keepalived
 0 41365 41364 41010 41010  |   |   \_ sh -c /bin/sleep 3600
 0 41366 41365 41010 41010  |   |       \_ /bin/sleep 3600
 0 41367 41013 41010 41010  |   \_ /sbin/keepalived
 0 41368 41367 41010 41010  |       \_ sh -c /bin/sleep 3600
 0 41369 41368 41010 41010  |           \_ /bin/sleep 3600
 0 41014 41010 41010 41010  \_ /sbin/keepalived
 0 41019     1 41010 41010 sh -c /bin/sleep 3600
 0 41020 41019 41010 41010  \_ /bin/sleep 3600
 0 41025     1 41010 41010 sh -c /bin/sleep 3600
 0 41026 41025 41010 41010  \_ /bin/sleep 3600
 0 41031     1 41010 41010 sh -c /bin/sleep 3600
 0 41032 41031 41010 41010  \_ /bin/sleep 3600

In order to run the healthcheck script, MISC_CHECK calls fork(2), then system(3). From the perspective of the process sending the signal, the process it wants to end is a great-grandchild. In the original code, the signal is only sent to the child process. This means that only the child process is ended, while the grandchild and great-grandchild processes continue to run. Sneaky, sneaky.



The pull request above fixes this bug. Right after MISC_CHECK calls fork(2), setpgid(2) is called to put the child into its own process group. kill(2) can target a process group as well as a single process, sending the signal to every process that belongs to the group. This fix allows Keepalived to send signals to all of the child, grandchild, and great-grandchild processes. Problem solved!
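The same technique can be tried outside Keepalived. The following is a minimal Python sketch, not the code from the pull request: the helper name run_with_timeout is made up for this example, and it uses subprocess with a preexec_fn hook in place of a hand-written fork(2). The child calls setpgid(0, 0) before running sh -c, and on timeout the parent signals the whole group with killpg.

```python
import os
import signal
import subprocess


def run_with_timeout(command, timeout):
    """Run `command` via sh -c in its own process group (hypothetical helper).

    Returns (timed_out, returncode, output).
    """
    proc = subprocess.Popen(
        ["sh", "-c", command],
        stdout=subprocess.PIPE,
        # Runs in the child before exec: make the child the leader of a new
        # process group, so its pgid equals its pid.
        preexec_fn=lambda: os.setpgid(0, 0),
    )
    timed_out = False
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        timed_out = True
        # Signal every process in the group, not just the direct child.
        os.killpg(proc.pid, signal.SIGTERM)
        proc.wait()
    # The pipe reaches EOF only after every process holding it has exited,
    # so this read would block if a grandchild had survived.
    output = proc.stdout.read()
    proc.stdout.close()
    return timed_out, proc.returncode, output
```

For example, run_with_timeout("sleep 3600; echo never", 1) returns after about a second, and no stray sleep process is left behind. Sending the signal with os.kill(proc.pid, ...) instead would reproduce the original bug: sh would die while sleep kept running.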


This fix has also been merged into Keepalived version 1.3.0. Now we should be able to use MISC_CHECK without any problems.


*2: There were actually even more bugs than this. The child process, the very one that mattered, was unintentionally ignoring the termination signal (SIGTERM). Things only appeared to work because the force-quit signal (SIGKILL), which was sent as a last resort, ended up killing the process. I think we can all agree this was a less-than-elegant solution to the problem. This has all been resolved now.



All in a Day’s Work

In the article above I made it sound like they took our code as-is and just added it into the package. To be completely honest, they pretty much re-wrote our patch from scratch. Sadly, most of the code we wrote can’t be found inside the official version of Keepalived.


There’s a very good reason for this. The healthcheck framework that comes with Keepalived is designed for TCP. It wasn’t designed with UDP in mind. That’s why we wrote all the socket-related code from scratch in our original patch. This left us with some interesting results. Compared to the standard healthchecks, our code was full of programming faux pas and missed out on a lot of best practices. Had we sent it as a pull request as-is, it could have upset the entire program.


That being said, the framework wasn’t designed for UDP. We racked our brains for days, but since there really wasn’t any other good way to go about it, we ended up revising the core of Keepalived, changing the very fabric of the framework. (The individual functions of healthcheck are positioned like modules. You can add new functions easily, but it’s a pretty big undertaking to revise the main body.)



In this commit, we took a framework that was designed with only TCP in mind and made it support UDP. In order to avoid changing the code being used by the framework, we used inline functions to create a wrapper, all while keeping their compatibility intact. This fix makes it much easier to create a healthchecker with a UDP base, so there’s a chance that the number of supported protocols will increase in the future.
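The wrapper pattern described here can be illustrated with a short Python sketch. Keepalived itself is written in C, so this is only an analogy: socket_connect, tcp_connect, and udp_connect are hypothetical names, not functions from the Keepalived source. The idea is that the core helper gains a socket-type parameter, while a thin wrapper keeps the old TCP-only call signature so existing callers do not change.

```python
import socket


def socket_connect(address, sock_type, timeout=5.0):
    """Generalized helper: open a socket of the given type and connect it."""
    family = socket.AF_INET6 if ":" in address[0] else socket.AF_INET
    sock = socket.socket(family, sock_type)
    sock.settimeout(timeout)
    try:
        sock.connect(address)
    except OSError:
        sock.close()
        raise
    return sock


def tcp_connect(address, timeout=5.0):
    """Thin wrapper with the old TCP-only signature; existing callers stay as they are."""
    return socket_connect(address, socket.SOCK_STREAM, timeout)


def udp_connect(address, timeout=5.0):
    """New entry point for UDP-based checkers such as a DNS check."""
    return socket_connect(address, socket.SOCK_DGRAM, timeout)
```

A new UDP-based checker only needs to call udp_connect, which is why adding further protocols becomes easier once the core accepts a socket type.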


At the end of the day, writing the message for the pull request was harder than writing the actual code itself. The hardest part for me personally was the language barrier. My English is pretty deplorable (thank you ghost-writer for translating this blog!), so if someone told me to do all this with MISC_CHECK, that would have been the end of the story. I poured my heart and soul into writing a message in English that conveyed exactly why I felt it was so important that they add this part to the default healthchecker. I was relying a lot on Google Translate (which was pretty bad back in the day for Japanese→English), and it was giving me a lot of grief, but in the end, I was able to get the results I had sought after for so long.

This may very well be the biggest and most well-known product my own code has ever been included in. Simply put, I am ecstatic. I’m not sure exactly how much demand there is for this kind of thing, but if you’re reading this blog, please try out the DNS healthcheck feature!

Hello denizens of the internet, this is KLabGames infrastructure engineer kensei. Today I’m going to talk about how we notify players when their health (or “HP”) has fully recovered in our mobile games.


Getting Started

The idea of “HP” is strongly tied to mobile games these days. Players receive a limited number of play tickets (a.k.a. “health” or “HP”) used to play the game. When these are used up, the player must wait until their health recovers on its own, or use some sort of in-game item to recharge their health. Naturally, many players choose to wait until their HP recovers before playing again.


What if there was a way to let players know exactly when their HP was fully recovered? It would certainly improve the overall gaming experience, saving both time and effort for the player.


One of the ways we let players know when their HP has fully recovered is to send them a message via local push notifications.


Think Global, Act Local

According to Apple’s Documentation, local notifications are scheduled for sending by the application itself.


As long as the app isn’t running in the foreground, a notification, icon badge, and sound are delivered to the user when the clock strikes the preset time. If the app is running in the foreground, users will be alerted with a simple notification.


Android implements local push notifications via AlarmManager and Notification.Builder.


Notifications 101

The most important part about sending HP recovery notifications is remembering to cancel unnecessary notifications. Smartphones are always multi-tasking. There’s no telling when a player will pause the game to switch to some other task before picking up where they left off.


What happens if you send a local push message via timer in this situation? If you don’t cancel the process, the notification will hit the player in the middle of their game after they've resumed gameplay.


What happens if a player uses some of their HP, or uses an item to recover all of their HP? If you don’t cancel the notification, you’ll end up sending a recovery notification when their HP isn’t fully recharged or long after it’s fully recovered.


Canceling Requests: Timing is Everything

So, when should you send cancellation requests?


For KLabGames’ titles, we always run cancellation processes at the three points mentioned below.

  • When the app is launched.

  • The number of seconds the HP needs to fully recover is retrieved when the game first makes contact with the server and is stored on the device. A cancellation is processed if this number is already at 0.

  • If the time it takes to reach full HP > 0 seconds, a cancellation request is sent just before the local push timer is set.


Here’s the logic behind each of the points listed above.

  • There’s no need for a notification when the app first starts up, so the notification is canceled.

  • The second time the app connects to the server, multiple APIs are used to calculate the amount of time it will take to fully recover the HP on the server side. The result is then returned to the app.

  • Notifications can be canceled when a player uses an item to recover their HP, or recovers their HP by leveling up.

  • For the third point, in order to keep the local notification timer constantly updated, a cancellation process is run right before setting the timer for the local push message.

  • By keeping only one timer set and up-to-date at all times, you can be sure not to send any false alarms, avoiding any unnecessary “whoopsies” and other slip-ups.


When sending full HP recovery notifications, the most important thing to remember is to make sure you’re only setting one timer, and that you’re constantly updating it. Wait, what? That’s two things...
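The “cancel first, then set one timer” rule can be sketched in a platform-neutral way. The Python below is only an illustration of the logic, not code from the sample project: RecoveryNotifier and FakePlatform are hypothetical names, and FakePlatform stands in for the native iOS and Android code that actually schedules and cancels the notification.

```python
class FakePlatform:
    """Stand-in for the native notification code; records pending fire times."""

    def __init__(self):
        self.pending = []

    def schedule(self, fire_at):
        self.pending.append(fire_at)

    def cancel(self):
        self.pending.clear()


class RecoveryNotifier:
    """Keeps at most one pending 'HP is full' notification."""

    def __init__(self, platform):
        self.platform = platform

    def on_app_launch(self):
        # Point 1: the player is already in the app, so no notification is needed.
        self.platform.cancel()

    def on_recovery_time(self, seconds_to_full, now):
        # Points 2 and 3: always cancel before scheduling, so a stale timer
        # from an earlier HP value can never fire.
        self.platform.cancel()
        if seconds_to_full > 0:
            self.platform.schedule(now + seconds_to_full)
```

Calling on_recovery_time every time the server reports a new recovery time keeps exactly one timer alive, and a reported time of 0 seconds leaves none.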


Unity-Side Program

I made a few samples which run on iOS and Android.

https://github.com/kensei/klab_advent_calendar_2015


So here I’ve got a game that only lets players use and recover their HP. Perhaps in some twisted universe, you could call this program a game. I made it in a hurry so I apologize if it’s a bit buggy.


Quick Overview of the Program


Processing the Local Push Notification

Create the client plugin, then encapsulate the different processes to be run per platform. Initialization, local push notification settings and local push notification cancellations are all bridged via Unity’s native code.


On iOS, local push notifications can be handled with Unity’s standard API.


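// In Unity 5, these classes live in the UnityEngine.iOS namespace.
using UnityEngine.iOS;

// Schedule a local notification that fires 10 seconds from now.
LocalNotification l = new LocalNotification();
l.applicationIconBadgeNumber = 1;                  // number shown on the app icon badge
l.fireDate = System.DateTime.Now.AddSeconds(10);   // when the notification fires
l.alertBody = "test";                              // message text
NotificationServices.ScheduleLocalNotification(l);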


However, you can’t do really complicated things like sending local push notifications that repeat themselves.


That’s why I decided to use native code from the start to implement my local push notifications. Additionally, local push notifications on iOS 8 require permission from the user. In order to get the user’s permission, I’ve extended part of the UnityAppController we imported.


Here’s the code.


On Android, AndroidManifest needs the permission settings used for local push notifications, as well as the settings for the receiver that handles the timer.


Initialization

The native code for each platform is initialized on startup.


Local Push Notification Settings

  • C#

    • Call native code.

  • iOS

    • Create an instance of UILocalNotification, then pass it to UIApplication.

  • Android

    • Create the Intent that will be passed to the receiver, and set it to target LocalNotificationReceiver.

    • Set the time at which the event should fire in a Calendar instance. In the sample I made, this time is specified in seconds.

    • Set the intent and Calendar you created inside AlarmManager.

  • Android Receiver

    • Receive information from intent.

    • Create Notification and notify the player.


Cancelling Local Push Notifications

  • C#

    • Call native code.

  • iOS

    • Retrieve all scheduled UILocalNotification instances from UIApplication.

    • Cancel anything that matches with notificationId.

  • Android

    • Retrieve the Action that matches the PendingIntent.

    • Send a cancellation request to AlarmManager.


Closing Thoughts
As you can see from the article above, notifying users that they’re ready to play the game again is important. However, sending them notifications at the right time is equally important. As with most things in life, timing is everything!

Virtual machines. These days it seems like you can’t even turn a corner without running into one of these faux operating systems. As a matter of fact, you probably already have a virtual machine of choice. My coworkers and I tend to go with VirtualBox. The following is how we usually roll. 


  • Host OS: Mac/Windows

  • Guest OS: Linux

  • File sharing: Shared folders connecting host and guest OSes


Although it doesn’t exactly stand out from the crowd, the most important part of this setup may very well be shared folders.


The shared folder system provided by VirtualBox allows the user to mount the host OS’s file system from inside the guest OS. While this is extremely useful, the time it takes to access files across OSes feels excruciatingly slow. find never ends, and git status seems to drag on forever.


I tried to find a way around these problems, and here’s what I found.



However, neither of these “solutions” offered any real answers to the problem at hand. That’s when it occurred to me to look into how vboxsf (VBox’s shared folder system) works compared to other file systems and see if I could find a way to make vboxsf run faster. What I found may surprise you.


The first discovery I’d like to highlight is the fact that find is surprisingly slow.


Test Environment

The test environments I used for these trials are laid out in the table below. The directory I used contained around 30,000 files, weighing in at 4GB.


Environment      Tool Name            Version
---------------  -------------------  ----------------------
VirtualBox                            4.3.28
VMWare Fusion                         7.1.2
Host OS                               OS X Yosemite 10.10.2
Guest OS                              Debian 8.1 (Jessie)
                 Findutils            GNU findutils 4.4.2
                 Coreutils            GNU coreutils 8.23
                 Glibc                2.19
                 Linux Kernel         3.16.0-4-amd64
                 VBoxGuestAdditions   4.3.18
                 VMWareTools          9.9.3


Detective Work

First things first. I needed to determine if the find command itself was running slowly, so I decided to run a test with a few other file systems. For this comparison, I selected vmhgfs, the file system VMWare uses for sharing files between the host and guest OS.


I chose this particular file system because vmhgfs is very similar to vboxsf. The chart below shows the time/strace results that were produced when I ran the find command in both file systems. One look at the results reveals that vboxsf is significantly slower than vmhgfs. Additionally, when we use strace to look at the system calls being run, we see that the call taking up most of vboxsf’s time is newfstatat. It also becomes painfully obvious that it is called far more often on vboxsf than on vmhgfs.


I also found that the processes being called inside the find command are different for vboxsf and vmhgfs, for reasons we’ll look into next.


Parameter    vmhgfs     vboxsf
-----------  ---------  ---------
time(real)   0m7.205s   0m19.774s
time(sys)    0m5.740s   0m8.088s


// vboxsf
$ strace -c find .
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
76.59    0.509083          15     34145           newfstatat   <-Here
10.86    0.072196          73       993           openat
10.24    0.068043          68      1005         6 open
 2.14    0.014249           0     34145           write
 0.14    0.000952           1       998           fstat
 0.03    0.000173           0      2013           getdents
(omitted)


// vmhgfs
$ strace -c find .
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
94.99    3.207002        1593      2013           getdents
 2.36    0.079565          79      1005         6 open
 1.96    0.066032          66       993           openat
 0.38    0.012751          13       993           newfstatat   <-Here
 0.19    0.006551           0     34145           write
 0.12    0.004198           2      1993           close
 0.00    0.000139           0       998           fstat
(omitted)


Follow That Code!

Next, I decided to take a closer look into why the number of times newfstatat is called differs for vboxsf and vmhgfs.


The Code for find

I found that when the code below is run, the result determines whether or not newfstatat is called.


// findutils-4.4.2/gnulib/lib/fts.c 1135-1140
           bool skip_stat = (ISSET(FTS_PHYSICAL)
                     && ISSET(FTS_NOSTAT)
                     && DT_IS_KNOWN(dp)
                     && ! DT_MUST_BE(dp, DT_DIR));
           p->fts_info = FTS_NSOK;
           fts_set_stat_required(p, !skip_stat);


dp is a dirent structure retrieved using glibc’s readdir() function. A dirent structure holds a file name and its inode number, and is mainly used for mapping names to inode numbers.


In order to make this decision, the program needs to know the value of d_type. If it’s DT_UNKNOWN or DT_DIR, it runs stat. When we examine the values at this point in the process for vboxsf and vmhgfs via gdb, we see that vboxsf returns dp->d_type=0 (DT_UNKNOWN), while vmhgfs returns dp->d_type=4 (DT_DIR) and dp->d_type=8 (DT_REG). The explanation of readdir() also has the following to say about d_type.


Linux aside, d_type fields generally only exist in BSD systems.

d_type This field contains a value indicating the file type, making
             it possible to avoid the expense of calling lstat(2) if
             further actions depend on the type of the file.
             When a suitable feature test macro is defined
             (_DEFAULT_SOURCE on glibc versions since 2.19, or
              _BSD_SOURCE on glibc versions 2.19 and earlier), glibc
             defines the following macro constants for the value
             returned in d_type.


This problem can be explained as follows. Since d_type is always returned as DT_UNKNOWN in vboxsf, the program is forced to call stat more times than it needs to, making vboxsf run that much slower than vmhgfs.
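To see how much this one field matters, the decision from fts.c can be modeled in a few lines of Python. This is a simplified sketch of the skip_stat condition quoted above, not a port of gnulib: needs_stat and count_stat_calls are hypothetical names, and the entry counts (34,145 entries, 993 of them directories) are simply chosen to mirror the strace figures shown earlier.

```python
# d_type values as seen in gdb above: 0 = DT_UNKNOWN, 4 = DT_DIR, 8 = DT_REG.
DT_UNKNOWN, DT_DIR, DT_REG = 0, 4, 8


def needs_stat(d_type, physical=True, nostat=True):
    """Mirror of the skip_stat condition: stat is skipped only when the
    type is known and the entry is not a directory."""
    type_is_known = d_type != DT_UNKNOWN
    skip_stat = physical and nostat and type_is_known and d_type != DT_DIR
    return not skip_stat


def count_stat_calls(d_types):
    """Count how many entries would trigger a stat call."""
    return sum(1 for d_type in d_types if needs_stat(d_type))


# vmhgfs reports real types: only the directories need a stat call.
vmhgfs_entries = [DT_DIR] * 993 + [DT_REG] * (34145 - 993)
# vboxsf reports DT_UNKNOWN for everything: every entry needs a stat call.
vboxsf_entries = [DT_UNKNOWN] * 34145

print(count_stat_calls(vmhgfs_entries))  # 993
print(count_stat_calls(vboxsf_entries))  # 34145
```

Under these assumptions the model lands on the same newfstatat counts that strace reported for the two file systems.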


Code for Glibc and Linux Kernel

Having gleaned this valuable information from examining the source code, I decided to press deeper into the heart of find. I now needed to find out why vboxsf caused all d_types to be returned as DT_UNKNOWN. To that end, I wanted to track down how readdir() was acquiring information from the file system. readdir() is defined inside glibc. We can see that it calls the syscall getdents() in the following code excerpt.


// glibc-2.19/sysdeps/posix/readdir.c
     bytes = __GETDENTS (dirp->fd, dirp->data, maxread);


getdents() is defined inside the Linux kernel, which is where file->f_op->iterate is called.


// linux/fs/readdir.c
   if (!IS_DEADDIR(inode)) {
       ctx->pos = file->f_pos;
       res = file->f_op->iterate(file, ctx);
       file->f_pos = ctx->pos;
       fsnotify_access(file);
       file_accessed(file);
   }


file->f_op, defined per file system, is a collection of methods used for handling files.


Code for vboxsf

Since file->f_op is defined within the file system, we now need to take a look at the implementation for vboxsf. file->f_op->iterate is defined in the following way for vboxsf. As with other file systems, a little fishing around with grep turns up the results we’re looking for fairly easily.


// VirtualBox-4.3.28/src/VBox/Additions/linux/sharedfolders/dirops.c
struct file_operations sf_dir_fops =
{
   .open    = sf_dir_open,
   .iterate = sf_dir_iterate,
   .release = sf_dir_release,
   .read    = generic_read_dir,
   .llseek  = generic_file_llseek
};


From there, we found the following lines of code when we followed up on vboxsf’s sf_dir_iterate.


// VirtualBox-4.3.28/src/VBox/Additions/linux/sharedfolders/dirops.c
       if (!dir_emit(ctx, d_name, strlen(d_name), fake_ino, DT_UNKNOWN))
       {
           LogFunc(("dir_emit failed\n"));
           return 0;
       }


dir_emit() is a function used to register acquired file names and inode numbers as directory entries. In the same way, an inode number and a file name are returned when getdents() is called in vboxsf. We have also found the root of the speed issue: d_type is always returned as DT_UNKNOWN.


The solution I came up with to fix the problem at hand was fairly simple. If I could retrieve the dentry type from the host side via vboxsf and use that to return d_type instead, find should become much faster, even for vboxsf.


Hot-Rodding vboxsf

In order to register the correct d_type, we need to retrieve the d_type on the host side. A quick glance at how dentry is retrieved reveals that vboxsf, running on the guest side, is asking for the result from the service running on the host side. The code on the host side for retrieving dentry for Mac and Windows machines is as follows.


// VirtualBox-4.3.28/src/VBox/Runtime/r3/posix/dir-posix.cpp
RTDECL(int) RTDirRead(PRTDIR pDir, PRTDIRENTRY pDirEntry, size_t *pcbDirEntry)
...
           pDirEntry->INodeId = pDir->Data.d_ino; /* may need #ifdefing later */
           pDirEntry->enmType = rtDirType(pDir->Data.d_type);
           pDirEntry->cbName  = (uint16_t)cchName;

// VirtualBox-4.3.28/src/VBox/Runtime/r3/win/direnum-win.cpp
RTDECL(int) RTDirRead(PRTDIR pDir, PRTDIRENTRY pDirEntry, size_t *pcbDirEntry)
...
   pDir->fDataUnread  = false;
   pDirEntry->INodeId = 0; /** @todo we can use the fileid here if we must (see GetFileInformationByHandle). */
   pDirEntry->enmType = pDir->Data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY
                      ? RTDIRENTRYTYPE_DIRECTORY : RTDIRENTRYTYPE_FILE;
   pDirEntry->cbName  = (uint16_t)cchName;


This code shows that when the info for dentry is retrieved, d_type (pDirEntry->enmType) is retrieved along with it.


Herein lies the problem. The value is returned to the guest side, but the guest side never sets it. That’s all there is to it, open and shut. I went ahead and modified the code so that the d_type retrieved from the host side is returned.


Results

After making these modifications to the code, I tried running the find command again. The time/strace results are organized in the chart below. As you can see, the number of times newfstatat is called has dropped significantly, and the amount of time used to process these calls has become much shorter, transforming vboxsf into a real speed machine that, in this test, finishes even faster than vmhgfs.



Parameter    vmhgfs     vboxsf (Before)  vboxsf (After)
-----------  ---------  ---------------  --------------
time(real)   0m7.205s   0m19.774s        0m3.385s
time(sys)    0m5.740s   0m8.088s         0m0.860s


% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
55.75    0.090346          90      1005         6 open
37.71    0.061110          62       993           openat
 4.58    0.007425           0     34145           write
 0.97    0.001571           2       993           newfstatat   <- Here
 0.66    0.001071           1       998           fstat
 0.20    0.000326           0      1989           fchdir
 0.12    0.000194           0      2013           getdents
 0.00    0.000000           0         4           read
(omitted)


Bringing It Home

By returning the correct d_type value from vboxsf, we can make find and other commands that rely on d_type much faster by cutting out unnecessary stat calls. It’s also interesting to note that my revision was included in the fixes for VirtualBox 5.0.2.
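find is not the only tool that benefits. As one more illustration, Python’s os.scandir() hands back entries whose is_dir() check can usually be answered from d_type alone, and only falls back to a stat call when the file system reports DT_UNKNOWN. The sketch below is a hypothetical helper (list_tree is a name made up for this example) that walks a tree the way find does, relying on that behavior.

```python
import os


def list_tree(root):
    """Return every path under `root`, descending into directories.

    entry.is_dir(follow_symlinks=False) normally needs no system call because
    the type comes from d_type; on a file system that returns DT_UNKNOWN,
    each call costs an extra stat.
    """
    paths = []
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                paths.append(entry.path)
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
    return paths
```

Running a walker like this on a shared folder before and after the fix would be one way to observe the same effect outside of find.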


Thanks for reading!


by kokukuma2
