
2009-09-09

GNU grep is slow on UTF-8

Filed under: Software — rg3 @ 22:23

Update on 2010/10/28: GNU grep is no longer slow on UTF-8. The problem was fixed with the release of GNU grep 2.7. The rest of the article can now be considered obsolete.

Thanks to someone on the ##slackware FreeNode IRC channel who mentioned the problem some weeks ago, I discovered that GNU grep is very slow when working on UTF-8 files, and possibly other Unicode encodings. This, apparently, is a long-standing bug that hasn’t been officially fixed yet. The problem manifests itself when you run grep under locale settings that involve UTF-8. Consider the following example:

$ echo $LANG
en_US.UTF-8
$ time grep '^....' /usr/share/dict/words >/dev/null 

real    2m16.795s
user    2m10.536s
sys     0m0.087s
$ export LANG=C
$ time grep '^....' /usr/share/dict/words >/dev/null 

real    0m0.031s
user    0m0.028s
sys     0m0.003s

In the example above, /usr/share/dict/words is a file that is part of the bsd-games package on my Slackware system. It contains a list of English words, one per line, and it’s not too long: fewer than 40000 lines, weighing about 345 KB. Still, as you can see, it takes more than 2 minutes on my computer to search for words having at least 4 characters. When I change my locale settings to “C” (ASCII), it only takes 31 milliseconds. The difference is amazing. Does grep behave differently in the two cases? The answer is yes.
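The slowdown can also be sidestepped for a single command rather than the whole session. A minimal sketch, assuming a small stand-in word list instead of /usr/share/dict/words:

```shell
# Create a tiny stand-in word list (assumption: we don't want to
# depend on /usr/share/dict/words being present).
printf 'one\ntwo\nthree\nfour\n' > words.txt

# LC_ALL overrides LANG and every LC_* variable for this one command
# only, so grep runs byte-oriented while the rest of the shell
# session keeps its UTF-8 locale.
LC_ALL=C grep -c '^....' words.txt   # counts lines with 4+ characters: 2
```

Because LC_ALL has the highest precedence among the locale variables, this works no matter what LANG is exported to.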

When grep runs in UTF-8 mode, the dot matches a single character, which may be encoded as several bytes, while in ASCII mode it matches a single byte. See, for example, the following, using an accented Spanish character to form a 5-letter word.

$ echo ámbar | LANG=C grep '^.....$'
$ echo ámbar | LANG=en_US.UTF-8 grep '^.....$'
ámbar

The á character is represented using two bytes in UTF-8. Under the UTF-8 locale, grep correctly identifies it as a single character, so the search for a 5-character word correctly returns one result. With LANG=C, no results are found. This feature is not, however, worth making grep this slow.
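The two-byte encoding is easy to verify from the shell, since wc -c counts bytes regardless of locale:

```shell
# 'á' is U+00E1, encoded in UTF-8 as the two bytes 0xC3 0xA1.
printf 'á' | wc -c      # 2 bytes for one character
printf 'ámbar' | wc -c  # 6 bytes for the 5-character word
```

This is exactly the discrepancy the two grep invocations above disagree on: five dots match "ámbar" character-wise but not byte-wise.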

If you try to reproduce the problem above, you will probably not succeed, at least on your Linux system. This is because most Linux distributions are well aware of the problem and have been shipping a patched GNU grep for years. Debian does it (and with it, Ubuntu), Arch Linux does it, Fedora does it, etc. Other distributions like Slackware traditionally ship software as vanilla as possible, and the problem shows, as seen above: Slackware’s GNU grep is completely vanilla. Most distributions use slightly different versions of the same patch, which replaces the MBS (Multi-Byte Sequence) treatment almost completely.

In my most recent scripts, I avoid GNU grep altogether and use the fantastic and very efficient PCRE library (Perl Compatible Regular Expressions), used by many open source projects (e.g. the Apache web server). The pcre package is present in most Linux distributions and BSD ports systems, and usually ships the pcregrep tool. This is an alternative grep that is compatible, option-wise, with the most common POSIX and GNU options, like -n, -l, -r, -w, etc. It expects, however, a Perl regular expression. These are, in the most common cases, like every other regular expression syntax out there, but closer to egrep than grep. By default, pcregrep behaves like grep with the LANG=C locale, even if your locale specifies UTF-8. It’s this fast:

$ time pcregrep '^....' /usr/share/dict/words >/dev/null 

real    0m0.061s
user    0m0.042s
sys     0m0.003s

A bit slower than grep with the C locale, yes, but not a problem. In addition, you can enable UTF-8 mode, and with it multi-byte character support, by explicitly passing the -u option. In this mode, pcregrep is not much slower:

$ time pcregrep -u '^....' /usr/share/dict/words >/dev/null

real    0m0.068s
user    0m0.049s
sys     0m0.002s

Of course, with the -u flag it behaves correctly in the earlier UTF-8 test:

$ echo ámbar | pcregrep -u '^.....$'
ámbar

Moving from GNU grep to pcregrep is not a bad option. You get consistently fast behavior, regular expression syntax compatible with Perl, and you choose whether you want UTF-8 support by passing an explicit option. So long, GNU grep! Welcome, pcregrep!
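As a side note on the Perl syntax: GNU grep exposes the same PCRE engine through its -P option (assuming your grep was built with PCRE support), so Perl character classes like \d work there as well as in pcregrep:

```shell
# \d is a Perl character class for digits; plain POSIX grep would
# need [0-9] or [[:digit:]] instead.
printf 'room 42\nno digits here\n' | grep -P '\d+'   # prints: room 42
```

This can be handy for testing a Perl-style pattern even on systems where pcregrep itself is not installed.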

Final note: GNU awk suffers from this problem too, but its behavior under a UTF-8 locale is more or less equivalent to a patched grep’s. Still a bit slow, though.

$ time awk '/^..../' /usr/share/dict/words >/dev/null

real    0m0.373s
user    0m0.342s
sys     0m0.003s
$ export LANG=C
$ time awk '/^..../' /usr/share/dict/words >/dev/null

real    0m0.075s
user    0m0.055s
sys     0m0.002s

2 Comments »

  1. Thanks so much for this post, I must have upgraded cygwin and standard grep queries were taking forever.
    I don’t work with unicode much yet, so I just ‘export LANG=C’ for now, but did notice that pcregrep is installed. If it acts up again, I’ll switch.

    Comment by Juraj — 2010-08-13 @ 05:01 | Reply

  2. Another option is ‘grin’, another ack-like solution (Python based) and in some (possibly subjective) testing, was about 70% faster than ack. http://pypi.python.org/pypi/grin/

    Comment by David — 2011-12-15 @ 06:13 | Reply

