Performance of Go's regexp package vs. PCRE

Recently I wrote a tiny Go program to calculate statistics from nginx logfiles. As usual, the source of logstats is available on GitHub.

After I was more or less done, I started experimenting with it on one of my projects, which up until this point had used a PHP script to perform the same logic. To my great surprise, logstats was slower than the PHP script. What gives?

Profiling

Before we start changing code, let’s measure what is actually eating up the CPU cycles. With a tiny benchmark program, we can easily generate a CPU profile, run it through go tool pprof, and see which functions account for most of the runtime:

logstats> go tool pprof cpu.out

Main binary filename not available.
Type: cpu
Time: Apr 21, 2018 at 5:21pm (CEST)
Duration: 2.71s, Total samples = 2.67s (98.46%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20 --cum
Showing nodes accounting for 2.23s, 83.52% of 2.67s total
Dropped 43 nodes (cum <= 0.01s)
Showing top 50 nodes out of 83
      flat  flat%   sum%        cum   cum%
         0     0%     0%      2.46s 92.13%  github.com/xrstf/logstats/cmd/logstats.BenchmarkSample
         0     0%     0%      2.46s 92.13%  github.com/xrstf/logstats/cmd/logstats.benchmarkRun
     0.02s  0.75%  0.75%      2.46s 92.13%  github.com/xrstf/logstats/cmd/logstats.processFile
         0     0%  0.75%      2.46s 92.13%  testing.(*B).runN
         0     0%  0.75%      2.16s 80.90%  github.com/xrstf/logstats/parser.(*nginxParser).ParseLine
         0     0%  0.75%      2.14s 80.15%  testing.(*B).launch
     0.02s  0.75%  1.50%      1.73s 64.79%  regexp.(*Regexp).FindStringSubmatch
     0.01s  0.37%  1.87%      1.66s 62.17%  regexp.(*Regexp).doExecute
     0.01s  0.37%  2.25%      1.64s 61.42%  regexp.(*machine).backtrack
     0.54s 20.22% 22.47%      1.59s 59.55%  regexp.(*machine).tryBacktrack
     0.01s  0.37% 22.85%      0.37s 13.86%  time.Parse
     0.14s  5.24% 28.09%      0.36s 13.48%  time.parse
         0     0% 28.09%      0.32s 11.99%  testing.(*B).run1.func1
     0.31s 11.61% 39.70%      0.31s 11.61%  regexp.(*bitState).shouldVisit (inline)
     0.10s  3.75% 43.45%      0.27s 10.11%  regexp/syntax.(*Inst).MatchRune
     0.01s  0.37% 43.82%      0.24s  8.99%  bufio.(*Scanner).Scan
     0.19s  7.12% 50.94%      0.19s  7.12%  regexp.(*inputString).step
     0.18s  6.74% 57.68%      0.18s  6.74%  runtime.cgocall
         0     0% 57.68%      0.18s  6.74%  runtime.mstart
         0     0% 57.68%      0.18s  6.74%  runtime.systemstack

Good thing we measured. I had suspected my evaluation logic was the culprit; another candidate was I/O, so I was considering decoupling file reading from parsing. As it turns out, most of the time was actually spent running the regular expression that parses the logfile lines (visible in the regexp.(*Regexp).FindStringSubmatch entry)!

The regular expression is pretty straightforward and should not perform this badly (badly compared to PHP, which uses the same regular expression, but via PCRE):

func makeRegex() string {
    ipRegex := `[a-f0-9.:]+`
    userRegex := `[^ ]+`
    timeRegex := `.+?`
    methodRegex := `[^ ]+`
    uriRegex := `[^ ]+`
    protocolRegex := `HTTP/[0-9.]+`
    statusRegex := `[0-9]+`
    sizeRegex := `[0-9]+`

    return fmt.Sprintf(`^(%s) - %s \[(%s)\] "(%s) (%s) (%s)" (%s) (%s)`, ipRegex, userRegex, timeRegex, methodRegex, uriRegex, protocolRegex, statusRegex, sizeRegex)
}

Go’s regexp package

I’ve always known that Go’s regexp package does not use PCRE, but instead implements the RE2 syntax, with a matching engine based on Thompson’s NFA construction. See the Go documentation for more information.

One of the supposed advantages of RE2 over PCRE is good and, much more importantly, guaranteed linear matching performance. But given that the PHP script runs faster (while doing more work, because it also gathers some database metrics, CPU load etc.), it’s obvious that RE2 is not the best choice here.

Switching to PCRE

Thankfully, it’s relatively easy to link against libpcre and use the mighty Perl Compatible Regular Expressions. Surprisingly this even works without many issues on Windows.

On Linux, just install the development package for libpcre via the package manager of your choice, e.g. sudo apt-get install libpcre++-dev on Debian/Ubuntu.

On Windows, you can just download the sources from SourceForge (I’m sorry), extract them somewhere and set

set PKG_CONFIG_PATH=D:\opt\libpcre\pcre-7.0

Now you can pull the Go wrapper and let Go take care of building it:

go get github.com/gijsbers/go-pcre

Rewriting The Code

The documentation is rather minimal, but it’s straightforward to change the code from regexp to PCRE:

diff --git a/parser/nginx.go b/parser/nginx.go
index 6b161c8..b6172f3 100644
--- a/parser/nginx.go
+++ b/parser/nginx.go
@@ -3,17 +3,17 @@ package parser
 import (
 	"errors"
 	"fmt"
-	"regexp"
 	"strconv"
 	"time"

+	pcre "github.com/gijsbers/go-pcre"
 	"github.com/xrstf/logstats"
 )

 var nginxInvalidLine = errors.New("Could not parse log line")

 type nginxParser struct {
-	lineRegex *regexp.Regexp
+	lineRegex pcre.Regexp
 }

 func NewNginxParser() *nginxParser {
@@ -29,37 +29,37 @@ func NewNginxParser() *nginxParser {
 	regex := fmt.Sprintf(`^(%s) - %s \[(%s)\] "(%s) (%s) (%s)" (%s) (%s)`, ipRegex, userRegex, timeRegex, methodRegex, uriRegex, protocolRegex, statusRegex, sizeRegex)

 	return &nginxParser{
-		lineRegex: regexp.MustCompile(regex),
+		lineRegex: pcre.MustCompile(regex, 0),
 	}
 }

 func (s *nginxParser) ParseLine(line string) (*logstats.LogLine, error) {
-	match := s.lineRegex.FindStringSubmatch(line)
-	if match == nil {
+	matcher := s.lineRegex.MatcherString(line, 0)
+	if !matcher.Matches() {
 		return nil, nginxInvalidLine
 	}

-	parsed, err := time.Parse("02/Jan/2006:15:04:05 -0700", match[2])
+	parsed, err := time.Parse("02/Jan/2006:15:04:05 -0700", matcher.GroupString(2))
 	if err != nil {
 		return nil, nginxInvalidLine
 	}

-	statusCode, err := strconv.ParseInt(match[6], 10, 16)
+	statusCode, err := strconv.ParseInt(matcher.GroupString(6), 10, 16)
 	if err != nil {
 		return nil, nginxInvalidLine
 	}

-	size, err := strconv.ParseInt(match[7], 10, 64)
+	size, err := strconv.ParseInt(matcher.GroupString(7), 10, 64)
 	if err != nil {
 		return nil, nginxInvalidLine
 	}

 	return &logstats.LogLine{
-		IP:       match[1],
+		IP:       matcher.GroupString(1),
 		Date:     parsed,
-		Method:   match[3],
-		Uri:      match[4],
-		Protocol: match[5],
+		Method:   matcher.GroupString(3),
+		Uri:      matcher.GroupString(4),
+		Protocol: matcher.GroupString(5),
 		Status:   int(statusCode),
 		Size:     size,
 	}, nil
 }

Results

My test case involves parsing 100 MB worth of logfiles: the original code took 3.1 seconds for the job, while the PCRE variant needed only 2.5 seconds. A noticeable improvement.

But we forgot one thing: the evaluation logic is still using regexp. Maybe we could change that code to use PCRE as well!

Fast forward 15 minutes: the evaluation logic is updated as well, and the measured execution time is now 3.6 seconds. Suddenly our program got even slower than before.

Lessons Learned

What do we learn from this? Switching to PCRE is not a general solution, and we should not blindly go with “Muh PCRE is faster than regexp!”. Measure, then measure again, and optimize where it makes sense.