KLabGames Tech Blog - English


I happened to come across this article: Configuring sql.DB for Better Performance. It's a good article that clearly lays out the three settings for controlling connection pool sizes. That said, I just can't bring myself to agree with the settings it recommends, so I'd like to explain the settings I would recommend instead.

Limit ConnMaxLifetime instead of MaxIdleConns

As the article puts it: "Allowing just 1 idle connection to be retained and reused makes a massive difference to this particular benchmark — it cuts the average runtime by about 8 times and reduces memory usage by about 20 times. Going on to increase the size of the idle connection pool makes the performance even better, although the improvements are less pronounced."

The issue lies in the words "to this particular benchmark." In this benchmark, the database receives queries at a constant parallelism of 8. The moment one query finishes, the next one is issued, so DB.SetMaxIdleConns(1) is enough to produce significant results.

A workload like this benchmark's does occur in practice, for example when batch-inserting large amounts of data into a database. It does not, however, resemble the workload of a web application.

I've written a simple simulation of an application that handles 1,000 queries a second. The queries arrive randomly and evenly distributed, and forming a new connection takes 10 ms. (View the GitHub gist for this simulator here.)

Let's compare how MaxIdleConns(4) and MaxIdleConns(10) perform with MaxOpenConns(20). The orange line is the total number of connections, the blue line is the number of connections in use, and the green line is the maximum wait time (in milliseconds) before a connection becomes available.


Despite handling the same 1,000 queries, MaxIdleConns(4) created 285 connections, while MaxIdleConns(10) brought that down to 69. On the other hand, with the larger idle pool, more connections stay open even after the initial load subsides.

Next, let's look at a simulation with SetMaxIdleConns(100); SetConnMaxLifetime(time.Millisecond * 300).


This creates 80 (20 x 4) connections, more than the 69 from MaxIdleConns(10), but the lifetime is shortened here so that the behavior is easier to see. If we extend the simulation to 100 seconds, the number of connections comes to about 690 for MaxIdleConns(10) versus about 80 for SetConnMaxLifetime(time.Second * 30).

You might notice that in this graph the reconnections cluster at specific times, where latency spikes. This is because the simulated load is perfectly uniform, so all of the initial connections are created at the same moment. In a real application, where load varies over time, connection creation times would be spread out and spikes like this would not happen. The next graph shows the same situation, except that the load ramps up gradually over 200 ms; the spike at 1000 ms is the same phenomenon as in the graph above.


Other Reasons for Using SetConnMaxLifetime

I proposed and implemented DB.SetConnMaxLifetime(). This is a better method than SetMaxIdleConns() for reducing the number of idle connections, but that’s not all it can do.

As introduced in "Configuring sql.DB for Better Performance," MySQL's wait_timeout setting may automatically close idle connections on the server side. The OS or an intermediate router may also drop TCP connections that have gone unused for extended periods. In either case, go-sql-driver/mysql only discovers that the TCP connection has been closed when it tries to read the response after sending a query. It may take dozens of seconds before the disconnection is detected, and there is no way to know whether the query that was sent has been executed. This makes a safe retry impossible.

In order to avoid this danger, we should close connections that haven't been used for extended periods rather than reuse them, and simply create new ones. SetConnMaxLifetime() is the API that sets the lifetime of a connection: if you set it to 10 seconds, a connection will not be reused once 10 seconds have passed since it was created.

By setting the lifetime of connections, you can mop up all sorts of other problems.

  • It makes it easier to increase and decrease servers when the database server is load balanced.

  • It makes failover easier for database servers.

  • When changing MySQL settings online, you can make sure that connections still operating under the old settings are eventually closed.

I also considered adding a separate API that limits a connection's idle time. But after weighing how little it would matter for performance in a real environment against the difficulty of implementing it in sql.DB, I decided not to add one.

Recommended sql.DB Settings

  • Definitely set SetMaxOpenConns(). You need it to stop opening new connections and sending queries when the load is high and server responses slow down. Ideally, you would run a load test and set it to the minimum number of connections that ensures maximum throughput; failing that, decide on a reasonably appropriate number based on max_connections and the number of cores.

  • Configure SetMaxIdleConns() to be equal to or higher than SetMaxOpenConns(). Let SetConnMaxLifetime handle closing idle connections.

  • Set SetConnMaxLifetime() to the maximum number of connections x 1 second. In most environments, creating one new connection per second is a negligible load. If you want to set it longer than an hour, discuss it with an infrastructure/network engineer.


Greetings fellow Python enthusiasts, @methane here. compact dict was merged into Python 3.6 back in September (right before it went into beta). As a result of this fortunate turn of events, I received a recommendation and became a Python Core Developer in October of last year.


Please allow me to clarify: it's not like I'm a full-time Python committer employed by KLab. However, thanks to KLab's flex-time system, I've been able to spend the majority of my work hours contributing to OSS and poring over code. Especially in the last 3 months, I've spent a lot of my time working specifically on Python, so it's almost like I'm a full-time Python coder.


I don’t really get very many opportunities to share this part of my engineering efforts here in Japan, so after being absolutely blown away by what Money Forward’s Urabe-san wrote about in his recent article on ruby-core, I decided to buckle down and write about some of the stuff that’s been going on in the world of Python recently.

Python 3.6 Released

Python 3.6 was released on December 23. There were so many important improvements in this release that there's no way I could cover them all in one blog post. But the easiest one to explain, even for people who don't normally work with Python, is probably f-strings.


f-strings are a feature that lets you embed expressions directly inside string literals. Many lightweight languages have a similar feature, but when overused it can make code harder to maintain. My suggestion is this: just as you would write f"{foo.name} = {foo.value}" instead of "{foo.name} = {foo.value}".format(foo=foo), simply use f-strings to replace .format(name=name)-style calls and leave it at that.
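As a quick illustration (the variable names here are my own):

```python
name = "methane"
value = 42

# Old style: every name has to be repeated as a keyword argument
old = "{name} = {value}".format(name=name, value=value)

# f-string: the expression is embedded directly in the literal
new = f"{name} = {value}"

assert old == new == "methane = 42"
```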


Personally, I mostly contributed to compact dict and to speeding up asyncio. This is a little off-topic, but I thought I'd share this tidbit: 2 days after Python 3.6 was released, Ruby 2.4.0 shipped with a hash table implementation that's almost identical to compact dict. What a coincidence.

UTF-8 Support Improved for C locale (Mainly)

On Linux, Python determines the encoding for the terminal, standard input/output, and file paths by looking at the locale.


However, the POSIX C locale (also known as the POSIX locale) uses ASCII as a general rule, so a UnicodeEncodeError is raised whenever non-ASCII characters appear. For standard input/output you can override this with the environment variable PYTHONIOENCODING, but that cannot be used to set the encoding for command-line arguments or file paths.


C locale is the default locale, so it gets used at somewhat inopportune times: when you don't set one in crontab, or when ssh forwards LANG=ja_JP.UTF-8 to a server that doesn't have the ja_JP.UTF-8 locale. Many engineers also choose C locale on purpose, to avoid translated error messages (it's hard to write reports in English around them) or to keep the behavior of commands from changing. And in small Linux environments built with containers and the like, no locale other than C may even exist, in order to save space.


For these reasons and more, UnicodeEncodeErrors sometimes occur when using Python in the C locale. This is especially a problem for people who aren't Python engineers themselves but find themselves using tools written in Python.


Python has a broad base of users, and the way they use locales varies just as widely. To be brutally honest, there just aren't many users running the C locale who actually want ASCII. That's why it has been proposed that the default encoding under the C locale be changed to UTF-8.


(Because this is still in the proposal phase, you can’t use it even if you check out Python’s in-development branch.)

PEP 540: Add a New UTF-8 Mode

We've proposed adding a UTF-8 mode to Python. In this mode, Python ignores the encoding designated by the locale and uses UTF-8 for file paths and standard input/output.


This mode has three settings: disabled, enabled, and strict. The difference between enabled and strict is as follows: use enabled when you want the surrogateescape error handler to pass byte strings that aren't valid UTF-8 through transparently; with strict, such byte strings raise errors. Under the C locale the mode defaults to enabled, which makes it possible to handle file paths and standard input/output that aren't encoded in UTF-8. You may rarely need that these days, but it's extremely useful when, say, mounting an external file system from back in the day.
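The difference is easy to demonstrate with the surrogateescape error handler, which is what the enabled setting relies on (the byte string below is my own example, a latin-1 encoded "café"):

```python
raw = b"caf\xe9"  # not valid UTF-8: 0xe9 is a bare latin-1 byte

# strict behavior: invalid UTF-8 raises an error
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass  # this is what strict mode does

# enabled behavior: surrogateescape smuggles the bad byte through
# as a lone surrogate, and encoding it back restores the original
text = raw.decode("utf-8", errors="surrogateescape")
assert text.encode("utf-8", errors="surrogateescape") == raw
```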


For locales other than C, the encoding set by the locale is used in strict mode, meaning any data that isn't valid UTF-8 raises an error. This is useful when you want to reject everything but UTF-8: for example, to catch data that would otherwise be corrupted into an illegible string of mojibake before it becomes a problem.


This mode can be controlled with the environment variable PYTHONUTF8 or the -X utf-8 option. If you want to ignore the locale altogether, you can add export PYTHONUTF8=1 to your .bashrc, or write PYTHONUTF8=1 in /etc/environment if you prefer.

PEP 538: Coercing the Legacy C locale to C.UTF-8

This PEP (Python Enhancement Proposal) proposes switching to the C.UTF-8 locale (if the system supports it) whenever the environment variables designate the C locale. Doing so affects not just Python itself: we can also expect other libraries to operate in UTF-8 instead of ASCII or latin1.


For example, this applies to readline as used by Python's REPL. We've received reports that, with this change, readline handles UTF-8 in the REPL without any extra configuration when Python is used on an Android device.

Expanding New Calling Conventions (METH_FASTCALL)

Fundamentally, looked at from the C side, Python function calls pass positional arguments as a tuple and keyword arguments as a dict.


Python 3.6 introduced a new calling convention that, instead of building a tuple, passes a pointer to a C array of arguments together with the number of positional arguments. This means the caller doesn't have to pack the values sitting on its stack into a tuple; it can pass them along just as they are.


Work is already underway to make "functions callable from Python" written in C use the new calling convention, and the most important parts are already finished. Additionally, though this is less well known, besides the functions and methods created from PyMethodDef (a structure holding the name, a flag indicating the calling convention, and a function pointer), whole objects can be made callable. (If you're familiar with Python, think of the objects returned by operator.itemgetter().) For those, the function pointer tp_call lives in the structure holding the type's metadata.

Because PyMethodDef has always used flags to support multiple calling conventions, calls from C must go through a dedicated API. tp_call, on the other hand, has so far supported only the standard calling convention that receives a tuple and a dict, so external libraries may be calling the function pointer directly without going through any API.


To maintain compatibility, a separate function pointer, tp_fastcall, has therefore been added. For types that provide tp_fastcall, a patch currently under review automatically installs a tp_call that converts the call to tp_fastcall.


Finally, what looks like a single call from the Python side often passes through several functions on the C side. When the new and old conventions are mixed within that chain, needless internal conversions between them can pile up. With Python 3.7, internal argument passing is being unified on the new convention, and that is where I believe the new calling convention will really show its true power.


There was an article about ordered dictionaries posted on the PyPy developers blog back in January 2015:

Faster, more memory efficient and more ordered dictionaries on PyPy

The version of PyPy released after that (2.5.1) made dict preserve insertion order while also reducing the amount of memory used.

In other news, PEP 468 proposed an idea for CPython: preserve the order in which keyword arguments are received via the **kwargs syntax.

For example, with an SQLAlchemy query, if you wrote .filter_by(name="methane", age=32), there was previously no way to know whether it would generate WHERE name = "methane" AND age = 32 or WHERE age = 32 AND name = "methane". This change brings the added benefit of preserving the argument order.

(filter_by and the like are special shortcut functions. You can also preserve the order by using the filter method, which doesn't take keyword arguments.)
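With PEP 468 accepted for Python 3.6, the order of **kwargs is observable directly (the function name here is my own):

```python
def kwarg_order(**kwargs):
    # In Python 3.6+, **kwargs preserves the order the caller wrote
    return list(kwargs)

# The same arguments in a different order give a different result
assert kwarg_order(name="methane", age=32) == ["name", "age"]
assert kwarg_order(age=32, name="methane") == ["age", "name"]
```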

The person who proposed this idea re-implemented OrderedDict, which had been pure Python, in C. As a result, since Python 3.5 OrderedDict has been significantly faster and much more memory-efficient. (The reason he avoided revising dict itself is that it has already been optimized in multifaceted and complicated ways for the sake of the Python interpreter.)

However, even re-implemented in C, OrderedDict, which maintains order with a doubly linked list, still carried some pretty costly overhead of its own. It was using almost twice the memory of a plain dict:

Python 3.5.1 (default, Dec  7 2015, 17:23:22)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.1.76)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> d = {i:i for i in range(100)}
>>> from collections import OrderedDict
>>> od = OrderedDict((i,i) for i in range(100))
>>> sys.getsizeof(d), sys.getsizeof(od)
(6240, 11816)

Because of that and other factors, PEP 468 was put on the shelf.

All that being said, I too was resistant to changing the spec if there was a chance it would slow down all keyword arguments, however useful it might be in some use cases. But I had also taken an interest in PyPy's dict, so I decided to take a stab at it, with just enough time left for it to possibly make Python 3.6.

(The first beta was planned for release in the first half of September, after which no one is allowed to add any more features. I would need to implement the change, verify it with measurements, and discuss it on the mailing list before that deadline in order for it to be merged into the new version of Python.)

Data Structures

The only data structure I changed was PyDictKeysObject. However, the memory layout has become much more dynamic than before.

struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    Py_ssize_t dk_size;
    dict_lookup_func dk_lookup;
    Py_ssize_t dk_usable;
    Py_ssize_t dk_nentries;  /* How many entries are used. */
    char dk_indices[8];      /* Dynamically sized. 8 is the minimum. */
};

#define DK_SIZE(dk) ((dk)->dk_size)
#define DK_IXSIZE(dk) (DK_SIZE(dk) <= 0xff ? 1 : DK_SIZE(dk) <= 0xffff ? 2 : \
                       DK_SIZE(dk) <= 0xffffffff ? 4 : sizeof(Py_ssize_t))
#define DK_ENTRIES(dk) ((PyDictKeyEntry*)(&(dk)->dk_indices[DK_SIZE(dk) * \
                        DK_IXSIZE(dk)]))

static Py_ssize_t
dk_get_index(PyDictKeysObject *keys, Py_ssize_t i)
{
    Py_ssize_t s = DK_SIZE(keys);
    if (s <= 0xff) {
        return ((char*)&keys->dk_indices[0])[i];
    }
    else if (s <= 0xffff) {
        return ((PY_INT16_T*)&keys->dk_indices[0])[i];
    }
    else if (s <= 0xffffffff) {
        return ((PY_INT32_T*)&keys->dk_indices[0])[i];
    }
    else {
        return ((Py_ssize_t*)&keys->dk_indices[0])[i];
    }
}

/* dk_set_index(keys, i, ix) stores ix into dk_indices the same way. */
static void
dk_set_index(PyDictKeysObject *keys, Py_ssize_t i, Py_ssize_t ix);

The previous hash table was an array of three-word PyDictKeyEntry structures (hash, key, value). In the new approach I was trying, the hash table elements are instead just integers. char dk_indices[8] appears in the struct declaration, but only to give the minimum dk_size of 8 somewhere to live; a larger area is allocated when needed. The integer type is char while dk_size is at most 128, and becomes int16_t from 256 upward. This is how I was able to cut the size of the hash table down as much as possible.

Furthermore, because the size of dk_indices is dynamic, the structure cannot be declared with its true size; instead, the PyDictKeyEntry array is carefully placed immediately after this structure in memory. The size of that array is not dk_size but 2/3 of it. (I introduced this in a previous article: it's the maximum number of elements that can be inserted into the hash table.) Inserting a new element simply appends it to this array and saves its index in dk_indices. dk_nentries is the number of entries in use in the array.
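To make the layout concrete, here is a minimal Python sketch of the scheme described above: a sparse table of small integers plus a dense, insertion-ordered entries array. The class and its probing step are my own simplification (CPython also mixes a perturb term into the probe and resizes/compacts the tables), not the actual implementation.

```python
EMPTY, DUMMY = -1, -2

class CompactDict:
    def __init__(self):
        self.indices = [EMPTY] * 8   # the hash table: small integers only
        self.entries = []            # dense (hash, key, value) triples

    def _probe(self, key, h):
        # Simplified open addressing; returns (slot, entry index or None)
        mask = len(self.indices) - 1
        i = h & mask
        while True:
            ix = self.indices[i]
            if ix == EMPTY:
                return i, None
            if ix >= 0 and self.entries[ix][1] == key:
                return i, ix
            i = (5 * i + 1) & mask   # keeps probing past DUMMY slots

    def __setitem__(self, key, value):
        h = hash(key)
        i, ix = self._probe(key, h)
        if ix is None:               # new key: append to the dense array
            self.indices[i] = len(self.entries)
            self.entries.append((h, key, value))
        else:                        # existing key: overwrite in place
            self.entries[ix] = (h, key, value)

    def __delitem__(self, key):
        i, ix = self._probe(key, hash(key))
        if ix is None:
            raise KeyError(key)
        self.entries[ix] = (None, None, None)  # NULL out the entry...
        self.indices[i] = DUMMY                # ...and leave a dummy index

    def items(self):
        # Scanning the dense array yields items in insertion order
        return [(k, v) for _, k, v in self.entries if k is not None]


d = CompactDict()
d["one"] = 1
d["two"] = 2
d["three"] = 3
del d["two"]
d["one"] = 10
assert d.items() == [("one", 10), ("three", 3)]
```

Note how deletion leaves a hole in the dense array rather than shifting elements, which is exactly the trade-off discussed in the "Deleting Items" section below.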

The following pseudo-code shows what inserting a new element looks like, assuming the key doesn't already exist.

// Search for the place of insertion inside dk_indices
pos = lookup(keys, key, hash);

// Write the new entry into the entries array

DK_ENTRIES(keys)[keys->dk_nentries].me_hash = hash;
DK_ENTRIES(keys)[keys->dk_nentries].me_key = key;
DK_ENTRIES(keys)[keys->dk_nentries].me_value = value;

// Save the index of that entry into dk_indices
dk_set_index(keys, pos, keys->dk_nentries);

// Lastly, increment the number of used entries
keys->dk_nentries++;

Deleting Items

To delete an item from this version of dict, we store a placeholder "dummy" in the corresponding slot of dk_indices. (The reason a 1-byte index handles entry numbers only up to 128 rather than 256 is that negative values are needed to represent the dummy and empty slots.)

There are two ways to delete items from our entries.

When the "compact dict" idea was first posted to the Python developers mailing list, it was suggested that the entries array could be kept dense by moving the last element into the slot of the deleted element. But that moves the last element forward, which means deletion loses the property of preserving the order in which elements were inserted.

What the PyPy community, and I as well, went with instead is simply to leave a NULL in the newly vacated entry.

// Search for the position of the index of the element that was deleted inside dk_indices
pos = lookup(keys, key, hash);

// Get the position of the element to be deleted inside the entry array
index = dk_get_index(keys, pos);

// Delete the element
DK_ENTRIES(keys)[index].me_key = NULL;
DK_ENTRIES(keys)[index].me_value = NULL;

// Add a “dummy” inside dk_indices
dk_set_index(keys, pos, DUMMY);

Handled this way, the entries array fills up with dummy (NULLed-out) entries when many insertions and deletions are performed, which eventually calls for a compaction to clean them up. However, with the first method as well, repeated insertion and deletion fills the hash table with dummies until searches can no longer proceed, so a compaction has to run either way. Given that, I believe it's better to preserve the insertion order.

By the way, .popitem() deletes the last element of the entries array and decrements dk_nentries, which keeps its average cost at O(1). In this case too, dk_usable (the remaining number of insertable elements) is not incremented, so repeatedly deleting and inserting will still eventually trigger a compaction and a rebuild of the hash table.
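With the compact ordered dict (Python 3.6 and later), this is visible from the Python side: popitem() removes and returns the most recently inserted item.

```python
d = {"a": 1, "b": 2, "c": 3}

# popitem() pops from the end of the dense entries array: LIFO order
assert d.popitem() == ("c", 3)
assert d.popitem() == ("b", 2)
assert list(d) == ["a"]
```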

Shared-Key Dict

Now we come to the problem known as shared-key dict.

When I first started on this, my thinking was: for shared keys we simply do NOT insert dummy elements into the hash table, and if the slot on the values side is NULL we can treat it as a dummy, just as we did before implementing compact dict.

Boy, did I have a lot to learn. With that approach, insertion order stops being preserved the first time an instance adds elements to the shared key in a different order.

>>> class A:
...     pass
>>> a = A()
>>> b = A()
>>> a.a = 1
>>> a.b = 2
>>> b.b = 1
>>> b.a = 2
>>> a.__dict__.items()
dict_items([('a', 1), ('b', 2)])
>>> b.__dict__.items()  # Even though the actual insertion order is b, a…
dict_items([('a', 2), ('b', 1)])

To address this problem, the following three approaches are worth considering. The matter is being discussed on the mailing list, and we have no way of knowing which way the language will go until a final decision is made by Guido or by a core developer Guido delegates it to.

  1. Simply Accept Things the Way They Are

As the language spec stands today, the order of dict is undefined. So even though the current implementation happens to preserve insertion order everywhere EXCEPT the dicts that hold instance attributes, there is no problem from the point of view of the actual language spec.

And when using compact dict, a shared-key dict's ma_values array is only 2/3 the size of dk_keys, making it even more compact. So we could simply accept this as a blessing from above.

The downside is that insertion order is preserved in almost all cases, so programmers who don't take the time to check the language spec may mistake it for guaranteed behavior, and it's hardly fair to blame them for that. (Go, for example, deliberately leaves map iteration order indefinite in practice as well, by using fast pseudo-random numbers, precisely to avoid this trap.)

  2. Stop Using Shared Key If the Insertion Order Is Wrong

If the new element attempts to insert itself in an order that differs from the order held by shared key, you could always just stop using shared key immediately.

While this looks like the safest route at first glance, it becomes hard to tell how long a shared key will stay in use, which makes resource consumption difficult to predict. Also, if some rarely used code path inserts attributes in a different order, the shared key is dropped, and even though the program keeps using about the same number of similar-sized dicts, memory usage keeps creeping up slowly but steadily. These are some of the problems this method could cause.

For programs such as web applications that run for extended periods of time, no one is happy about the fact that the amount of memory used becomes difficult to predict and, on top of that, gradually increases. Choosy programmers should choose to stay away from this method.

  3. Stop Using Shared Key Dict

Shared-key dict is an interesting beast: when it fits just right, it is extremely effective, but compact ordered dict is more stable and more effective overall. On top of that, supporting shared keys makes the implementation of dict considerably more complicated. I tried removing the shared-key implementation to see how much would go away: out of a total of about 4,100 lines, I was able to remove around 500, and that was straight deletion only; with a little refactoring you could probably get rid of even more.

To gauge the effect, I built the Python documentation with Sphinx and measured maxrss using /usr/bin/time, with the following results:

  • shared: 176312k

  • compact + shared: 158104k

  • compact only: 166888k

As you can see, even if we stop using shared keys, the memory saved by compact dict is larger than the memory lost: compact-only still beats the original shared-key implementation.

(Of course, this is just the result of one application. If anyone knows of another practical benchmark with stable run times and memory usage that makes reasonable use of classes and instances, please drop me a line. I'd love to chat.)

Also, if shared-key dict were removed, we could aim even higher than compact + shared by implementing a separate, specialized dict that is even more efficient. I have an idea for this and am currently building a proof of concept; if it gets adopted, I'll share it here later.


I'll close out this article with a little more detail on how the compact dict layout could be used to speed up OrderedDict.

Python 3 includes the move_to_end(key, last=True) method, which was not available in Python 2.7. The keyword argument is a little tricky, but with move_to_end(key, last=False) you can move an element to the front of the line. (The functionality aside, I think this method is extremely poorly named; it should have been move_to_front(key).)
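A quick demonstration of both forms:

```python
from collections import OrderedDict

od = OrderedDict([("a", 1), ("b", 2), ("c", 3)])

od.move_to_end("a")              # default last=True: move to the back
assert list(od) == ["b", "c", "a"]

od.move_to_end("c", last=False)  # last=False: move to the front
assert list(od) == ["c", "b", "a"]
```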

Here's my idea: to implement this, treat the dk_entries array as a fixed-capacity deque rather than a fixed-capacity dynamic array. That is, today we use everything from dk_entries[0] to dk_entries[dk_nentries-1]; in addition to that, when adding elements to the front, we would start from the back end of dk_entries and fill toward the front.

To make this idea a reality, we would add a counterpart to dk_nentries for the front side and modify the hash table's scanning and resizing functions to use it. That should do the trick nicely. By adding 1 word (8 bytes) per OrderedDict, we should be able to cut its memory consumption roughly in half.

All that being said, our hands are still full with the shared-key problem. On top of that, once dict itself preserves insertion order, there will be fewer occasions to use OrderedDict, leaving little motivation to implement the idea. At the very least, I don't think this will be ready in time for Python 3.6 (unless someone does the heavy lifting for me).

