The other day I got to check a major milestone off my to-do list when a patch I wrote was officially merged into the Go programming language. I am now a contributor to Go.

This patch fixes a pesky problem: when database/sql.Stmt is used heavily, performance fails to scale once you reach 16 or more cores. How did I go about investigating this problem? I’ll answer that question as I go over the steps involved in getting my patch into Go.

Background

I help run maintenance for a TechEmpower project called Framework Benchmarks. I’m mainly in charge of Python-related tests. The project involves creating multiple versions of the same app using different programming languages and frameworks, and then noting the associated benchmarks. I’m fascinated by Go, so I play around with it by adding the Gin framework and participating in code reviews when I get the chance.

For Round 9, the most recent round of the benchmark project, released on May 1, 2014, a physical environment provided by Peak Hosting was added to the test lineup. This very high-spec environment features dual Xeon E5-2660 v2 machines and a 10 Gigabit Ethernet connection.

This environment produced some unusual results. In other environments, a large number of Go frameworks placed near the top, alongside the Java- and nginx-based ones. In this particular environment, however, these former success stories performed at an almost mediocre level, losing to lightweight frameworks based on Scala, Node.js, and PHP.

On November 2, 2014, I received a pull request addressing this problem. It worked around the lock contention by creating multiple instances of *sql.DB and using them in a round-robin fashion. I was initially opposed to the idea. The purpose of this project is not merely to chase a high score; my goal is to provide practical code samples for each framework to as many people as possible, along with performance figures that serve as a useful reference. Not only was creating multiple instances of *sql.DB not part of my original design, I didn’t even think the idea was practical.

About the Architecture for database/sql

I’m going to take a short detour and talk about the basic architecture for database/sql. Feel free to skip this section if you’ve ever written a program in Go that accesses a database.

database/sql is a lot like PDO for PHP. It sits on top of the driver used to connect to each database and provides a uniform interface. For MySQL, for instance, go-sql-driver/mysql is the most well-known driver. Users generally don’t use these drivers directly.

sql.DB is the heart of database/sql. It serves as a connection pool that can be used in the following ways.

  1. Send queries directly with DB.Exec() or DB.Query().

  2. Create a Stmt object representing a prepared statement with DB.Prepare(), then use Stmt.Exec() or Stmt.Query().

  3. Create a Tx object representing a transaction with DB.Begin(), then use Tx.Exec() or Tx.Query().

When using a transaction, the Tx object holds on to a single connection it has checked out from the pool. In every other case, queries run on connections pooled by the DB, and the user cannot choose which connection is used. (An API that lets you check out a connection without starting a transaction is planned for Go 1.5 to cover use cases not handled here.)
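For reference, here is a minimal sketch of the three usage patterns above with go-sql-driver/mysql. The DSN and the table and column names are placeholders for illustration, not anything taken from the benchmark project.

package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // registers the "mysql" driver
)

func main() {
	// sql.DB is a connection pool, not a single connection.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// 1. Send the query directly through the DB.
	var n int
	if err := db.QueryRow("SELECT COUNT(*) FROM world").Scan(&n); err != nil {
		log.Fatal(err)
	}

	// 2. Prepare once, then reuse the Stmt.
	stmt, err := db.Prepare("SELECT id FROM world WHERE id = ?")
	if err != nil {
		log.Fatal(err)
	}
	defer stmt.Close()
	var id int
	if err := stmt.QueryRow(1).Scan(&id); err != nil {
		log.Fatal(err)
	}

	// 3. Use a transaction, which pins one connection until Commit/Rollback.
	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := tx.Exec("UPDATE world SET randomNumber = ? WHERE id = ?", 42, 1); err != nil {
		tx.Rollback()
		log.Fatal(err)
	}
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}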

With the first use case, if the driver doesn’t substitute placeholders on the client side, DB.Query() internally runs a Prepare(), the query itself, and a Close() on the resulting statement. That means three round trips to the database per query, which takes a toll on overall performance. go-sql-driver/mysql doesn’t do placeholder substitution either. (I submitted a pull request that adds it, but it’s still under review.) So the best option performance-wise is use case number two.

In the second use case, Stmt takes care of the driver-level statements produced by Prepare() on each connection. Thanks to this design, the user doesn’t have to worry about which connection Prepare() was run on. This design is also the deciding factor in the problem at hand.
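To make the stack traces later in this post easier to follow, here is a rough, paraphrased sketch of the bookkeeping involved. The shape roughly matches the Go 1.4 sources, but it is simplified and not a verbatim copy.

package sqlsketch // a standalone sketch, not the real database/sql package

import (
	"database/sql/driver"
	"sync"
)

// DB and driverConn stand in for the real types; only the fields that
// matter for this post are shown.
type DB struct {
	mu sync.Mutex // protects the connection pool (the DB.mu seen later)
	// ... free connection list, dependency map, etc.
}

type driverConn struct {
	ci driver.Conn // the underlying driver connection
	// ... in-use flag, closed flag, etc.
}

// connStmt pairs a pooled connection with the driver-level statement
// that was prepared on that connection.
type connStmt struct {
	dc *driverConn
	si driver.Stmt
}

// Stmt is what DB.Prepare() returns.
type Stmt struct {
	db    *DB    // the owning connection pool
	query string // the SQL text passed to Prepare()

	mu     sync.Mutex // protects the fields below (the Stmt.mu seen later)
	closed bool
	css    []connStmt // one entry per connection the statement has been prepared on
}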

Given this architecture, it is pretty odd to create multiple *sql.DB instances for a single database. Ideally, a single DB per database should scale as far as you need it to.

Thorough Testing

First, I had to create an environment close to the one in which the problem occurred at Peak Hosting. I launched two c3.8xlarge instances on Amazon Web Services (AWS), enabled enhanced networking, and put them both in the same placement group. This gave me two 32-core machines connected by a stable 10 Gigabit network. The Amazon Linux AMI has enhanced networking enabled by default, which makes everything a lot easier. If you use AWS Spot Instances, you can probably rent two c3.8xlarge machines for roughly $0.70/hour.

Next, I ran the benchmark while keeping a close eye on CPU usage with the top command, trying to reproduce the problem. A mutex is cheap as long as it can be acquired immediately; once there is contention, CPU cycles start being burned on it. If the problematic mutex is held for long stretches, the number of runnable threads drops and CPU usage goes down. If instead the lock is acquired very frequently for short periods, contention becomes fierce: performance fails to scale with the number of cores, yet CPU usage tends to go up. In this particular test, I was able to confirm the latter pattern.

Finally, I used net/http/pprof to track down the root of the problem. net/http/pprof includes not only a CPU profiler but also several other diagnostic endpoints for taking a closer look at a running program. The one that helps the most with lock contention is the full goroutine stack dump. Here’s the actual stack dump.
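As an aside, enabling net/http/pprof in the benchmark app takes only a blank import and a listener on a side port. Here is a minimal sketch; the port is arbitrary.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers on http.DefaultServeMux
)

func main() {
	// Serve the profiling endpoints on a side port while the benchmark runs.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the actual application here ...
	select {}
}

With this running, the full goroutine stack dump can be fetched from http://localhost:6060/debug/pprof/goroutine?debug=2.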

Let’s start by locating the Mutex.Lock() calls on which many goroutines are blocked. In this case, you’ll find a lot of stack traces like the following. You can check out the corresponding source code here.

goroutine 81 [semacquire]:
sync.(*Mutex).Lock(0xc2080c42d0)
/home/ec2-user/local/go/src/sync/mutex.go:66 +0xd3
database/sql.(*Stmt).connStmt(0xc2080c4280, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/ec2-user/local/go/src/database/sql/sql.go:1357 +0xa6
database/sql.(*Stmt).Query(0xc2080c4280, 0xc2089a1be8, 0x1, 0x1, 0x0, 0x0, 0x0)
/home/ec2-user/local/go/src/database/sql/sql.go:1438 +0x120

We’ve gotten pretty close to the root of the problem here, but we can’t be absolutely certain about the culprit just yet. The severity of lock contention is roughly the product of how often the lock is acquired and how long it is held each time. A stack dump, on the other hand, simply shows the most frequent traces most often. By filtering out the identical stack traces, we can more easily look for the trace that is actually holding the Stmt.mu lock. That’s when we come across the following stack dump.

goroutine 158 [semacquire]:
sync.(*Mutex).Lock(0xc208096de0)
/home/ec2-user/local/go/src/sync/mutex.go:66 +0xd3
database/sql.(*DB).connIfFree(0xc208096dc0, 0xc2083080c0, 0x0, 0x0, 0x0)
/home/ec2-user/local/go/src/database/sql/sql.go:695 +0x67
database/sql.(*Stmt).connStmt(0xc2080c4280, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/ec2-user/local/go/src/database/sql/sql.go:1378 +0x316

The trace above shows the same location as before, this time holding Stmt.mu but waiting for a DB.mu lock. Examining the code for Stmt.connStmt() and DB.connIfFree(), connStmt() calls connIfFree() while looping over Stmt.css (the list that manages pairs of connections and the statements prepared on them). Because connIfFree() locks DB.mu on every iteration, we know DB.mu is being locked at a very high frequency. However, only one goroutine at a time can be at this spot, since it holds Stmt.mu, so it must be colliding with DB.mu locks taken somewhere else. Going back to the stack dump and searching for other sections that take DB.mu, we finally find lock contention in DB.addDep().

# DB.addDep holding DB.mu
goroutine 18 [runnable]:
database/sql.(*DB).addDep(0xc208096dc0, 0x7f0932f0e4d8, 0xc2080c4280, 0x73d340, 0xc2082ec540)
/home/ec2-user/local/go/src/database/sql/sql.go:362
database/sql.(*Stmt).Query(0xc2080c4280, 0xc2087cbbe8, 0x1, 0x1, 0x0, 0x0, 0x0)
/home/ec2-user/local/go/src/database/sql/sql.go:1455 +0x49a

# DB.addDep waiting to lock DB.mu. There are multiple stack dumps exactly like this.
goroutine 64 [semacquire]:
sync.(*Mutex).Lock(0xc208096de0)
/home/ec2-user/local/go/src/sync/mutex.go:66 +0xd3
database/sql.(*DB).addDep(0xc208096dc0, 0x7f0932f0e4d8, 0xc2080c4280, 0x73d340, 0xc2085f0540)
/home/ec2-user/local/go/src/database/sql/sql.go:364 +0x38
database/sql.(*Stmt).Query(0xc2080c4280, 0xc20899fbe8, 0x1, 0x1, 0x0, 0x0, 0x0)
/home/ec2-user/local/go/src/database/sql/sql.go:1455 +0x49a

We finally have a complete grasp of the problem at hand.

  • Stmt.connStmt() holds Stmt.mu for a (relatively) long time, preventing operations on the same Stmt from running in parallel.

  • Stmt.connStmt() calls DB.connIfFree() over and over, and the resulting DB.mu locks slow it down through lock contention.

  • DB.addDep() also takes DB.mu, contending for the same lock.

Solution #1: Isolate DB.mu

Taking a closer look at DB.addDep(), it manages resources with reference counting and releases each resource when its last reference is removed. DB.mu is locked to protect the map used for this reference counting and release processing. However, that map can be protected by its own independent mutex.

If DB.mu no longer suffers this contention, DB.connIfFree() gets faster, and the problem is solved without changing the overall structure of the code.
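Here is a sketch of the idea, not the actual patch; the pool type and the depsMu field name are stand-ins for illustration.

package depsketch

import "sync"

// Illustration of solution #1: the reference-count bookkeeping (what
// addDep/removeDep do in database/sql) gets its own mutex instead of
// sharing mu with the connection pool.
type pool struct {
	mu sync.Mutex // protects the connection pool only
	// ... free connections, etc.

	depsMu sync.Mutex // protects deps only (hypothetical field name)
	deps   map[interface{}]map[interface{}]bool
}

// addDep records that dep keeps x alive; note that it no longer touches p.mu.
func (p *pool) addDep(x, dep interface{}) {
	p.depsMu.Lock()
	defer p.depsMu.Unlock()
	if p.deps == nil {
		p.deps = make(map[interface{}]map[interface{}]bool)
	}
	set := p.deps[x]
	if set == nil {
		set = make(map[interface{}]bool)
		p.deps[x] = set
	}
	set[dep] = true
}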

However, once we actually isolate the lock, performance only increases by 20%. That is disappointing compared to the workaround of creating multiple DB instances and calling them in a round-robin fashion. Even with the DB.mu contention solved, Stmt.connStmt() still does its slow work while holding Stmt.mu, so the lock contention on Stmt.mu remains.

We’ll throw in a patch to isolate the mutex above, then we’ll work on speeding up Stmt.connStmt() next.

Solution #2: Accelerate Stmt.connStmt() and DB.connIfFree()

Stmt.connStmt() calls DB.connIfFree() in a loop over Stmt.css. The purpose of this loop is to reuse a prepared connection that is currently free; it also removes connections from Stmt.css once they have been closed. However, as long as DB.mu is held, whether a connection is closed or in use can be checked quickly from the connection itself, without calling back into DB for every entry.

So, inside Stmt.connStmt(), we can lock DB.mu once, outside the Stmt.css loop, and run the connection check inline within the loop. By cutting out the repeated DB.connIfFree() calls, the code becomes as fast as the multiple-DB round-robin workaround.
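In generic form, the change looks something like this. The conn, pool, and stmt types below are stand-ins for illustration, not the real database/sql types, and the real change lives in Stmt.connStmt() and DB.connIfFree().

package connsketch

import "sync"

type conn struct {
	inUse  bool
	closed bool
}

type pool struct {
	mu sync.Mutex // guards every conn's inUse/closed flags (plays the role of DB.mu)
}

type stmt struct {
	mu    sync.Mutex // held by the caller while searching (plays the role of Stmt.mu)
	pool  *pool
	conns []*conn // connections this statement has been prepared on
}

// Before: the pool lock is taken and released once per iteration.
func (s *stmt) freeConnBefore() *conn {
	for _, c := range s.conns {
		s.pool.mu.Lock() // repeated lock/unlock -> heavy contention on pool.mu
		ok := !c.inUse && !c.closed
		if ok {
			c.inUse = true
		}
		s.pool.mu.Unlock()
		if ok {
			return c
		}
	}
	return nil
}

// After: the pool lock is taken once, outside the loop, and the cheap
// availability check runs inline.
func (s *stmt) freeConnAfter() *conn {
	s.pool.mu.Lock()
	defer s.pool.mu.Unlock()
	for _, c := range s.conns {
		if !c.inUse && !c.closed {
			c.inUse = true
			return c
		}
	}
	return nil
}

The win comes purely from taking the pool lock once per search instead of once per element.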

This is the solution that looked the most promising, so I submitted a patch based on it. You can check it out here.

Steps for Adding Patches to Go

Let’s start by reviewing the Contribution Guidelines for Go. For those of you short on time, here’s a quick rundown of the most pertinent points.

First, you must reach a consensus on your general plan for the change on the GitHub issue tracker.

Before you begin your revision, be sure to sign Google’s CLA. (Details here)

Although the Go repository recently transitioned from Mercurial to Git, that doesn’t mean it moved to GitHub. GitHub is used as the issue tracker and hosts a mirror of the repository. The canonical repository lives at https://go.googlesource.com/go, so if you cloned from GitHub you’ll have to make a few minor adjustments, such as pointing the origin remote at it with git remote set-url.

Code reviews for Go take place on Gerrit. Head over to the site and log in. Remember to generate the password you’ll need when uploading patches from the command-line tool.

Next, we’ll install the git-codereview tool.

go get -u golang.org/x/review/git-codereview


This tool manages your patches as one commit per branch and sends them to Gerrit. If you’re already used to Git, the following rundown should feel familiar.

  • git codereview change fix-abc is equivalent to git checkout -b fix-abc.

  • The first time you run it on a branch, git codereview change is equivalent to git commit. After that, it’s closer to git commit --amend. (You can use -a just like with commit.)

  • git codereview sync is equivalent to git fetch origin && git rebase origin/master.

  • git codereview mail -diff is equivalent to git diff HEAD..master.

  • git codereview mail is used to send your patch to Gerrit.

Your patch will be reviewed once it is uploaded. (Go is constantly under development. I was totally blown away when I got a reply almost immediately after uploading a patch on New Year's Day. Do these guys ever take a break?)

Here’s a list of criteria that will be reviewed:

  1. Overall architecture

  2. Code and commit log review

  3. Test structure

  4. Comment review

The points above paint a general picture of things to watch out for as your patch progresses through the different stages of the review process. Keep in mind that your patch will be reviewed by multiple Googlers several times.

Closing Thoughts

At the beginning of the review, we were able to greatly simplify the design by dropping the logic that gave priority to connections on which the statement was already prepared. This idea had already come up while the issue was being discussed on GitHub. If I had only taken the time to read those comments carefully, I could have saved everyone a lot of trouble.

As for the tests, I had simply made small edits to the pre-existing benchmarks. For benchmarks that measure parallelism, however, a reviewer pointed out that I should use RunParallel, and even posted a comment with sample code I could use as a starting point. Even though the original file already contained benchmarks written in the older style, they weren’t about to let new code in that legacy style slip through.
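For reference, a RunParallel benchmark has roughly this shape. This is a generic illustration with a placeholder DSN and query, not the benchmark from the actual review.

package bench

import (
	"database/sql"
	"testing"

	_ "github.com/go-sql-driver/mysql" // registers the "mysql" driver
)

func BenchmarkStmtQueryParallel(b *testing.B) {
	// The DSN is a placeholder; point it at a real database to run this.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/test")
	if err != nil {
		b.Fatal(err)
	}
	defer db.Close()

	stmt, err := db.Prepare("SELECT 1")
	if err != nil {
		b.Fatal(err)
	}
	defer stmt.Close()

	b.ResetTimer()
	// The body runs concurrently on GOMAXPROCS goroutines, which is exactly
	// what you want when the thing being measured is lock contention.
	b.RunParallel(func(pb *testing.PB) {
		var n int
		for pb.Next() {
			if err := stmt.QueryRow().Scan(&n); err != nil {
				b.Error(err)
				return
			}
		}
	})
}

Running it with go test -bench . -cpu 1,2,4,8,16,32 shows how the result scales with the number of cores.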

Toward the end of the review, I received a lot of advice on phrasing and the composition of my English comments. For a short comment written to the right of a line of code, it’s fine to omit the subject; for longer comments that go above the code, I was advised to start each sentence with an explicit subject. These points weren’t really about my code, but they were still part of the process. I ended up uploading my patch a total of 13 times before it was finally accepted.

To be completely honest, it would’ve been a lot easier to just file an issue reporting my findings and leave the actual fix to the Googlers. That would’ve been simpler for both me and the reviewers. However, having multiple Googlers review my code with so much care was an extremely valuable experience. I’m glad they didn’t just say, “We’ll take care of it,” but instead stuck with me to the end of the review process. Thanks, Googlers.



@methane