Never use regular expressions to enforce type

Regular expressions are dangerous beasts, and Unicode can be a foggy minefield. Fun ensues when you attempt to ride the former through the latter. Let's take Django's URL dispatching as an example:

(r'^(\d\d\d\d)/$', 'jl6.core.views.showyear'),

That's a typical line in a Django URLconf. The first parameter is a regular expression to match against the requested URL. If it matches, one group is captured and passed to the second parameter - a Django view. In this case, the intention is to capture exactly four digits, representing a year.

So why on earth does this URL ("/٢٠٠٨/") map to a year view?!

And what about a URL like this ("/𝟮𝟬𝟬𝟴/")? If using Firefox, you'll see it rendered in the status bar and in the address bar:

[Two screenshots: the Firefox status bar and the Firefox address bar showing this URL]

You might think: huh? Did I just somehow embed a boldface control code in plain text? Well, given the title of this post, you can probably guess that the answer has more to do with exotic Unicode characters than with control codes.

When Django resolves the URLs above, it feeds them through Python's regular expression engine. Crucially, Django has chosen to enable Unicode mode for all regular-expression-based URL matching! Here's the relevant line from the Django 1.2.3 source. The re.UNICODE flag changes the meaning of regular expression character classes such as \d. Without it, \d means [0-9]. With it, \d means [0-9], plus 410 other characters from around the world which are categorized as "Number, Decimal Digit".
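To see the flag's effect outside Django, here is a minimal sketch. It assumes Python 3 (the post itself targets Python 2.6), where str patterns are Unicode-aware by default and re.ASCII restores the [0-9] meaning of \d. The names YEAR_PATTERN, unicode_mode and ascii_mode are purely illustrative, and the paths are written without a leading slash so that they fit the pattern's ^ anchor.

```python
import re

# The pattern from the URLconf line above.
YEAR_PATTERN = r'^(\d\d\d\d)/$'

# Python 3 str patterns behave as if re.UNICODE were set.
unicode_mode = re.compile(YEAR_PATTERN)
# re.ASCII restricts \d to [0-9].
ascii_mode = re.compile(YEAR_PATTERN, re.ASCII)

ascii_path = '2008/'
arabic_indic_path = '\u0662\u0660\u0660\u0668/'  # the Arabic-Indic "2008/"

print(bool(unicode_mode.match(ascii_path)))         # True
print(bool(unicode_mode.match(arabic_indic_path)))  # True
print(bool(ascii_mode.match(ascii_path)))           # True
print(bool(ascii_mode.match(arabic_indic_path)))    # False
```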

Our first strange URL, /٢٠٠٨/, uses Arabic-Indic digits:

  • ٢ U+0662 ARABIC-INDIC DIGIT TWO
  • ٠ U+0660 ARABIC-INDIC DIGIT ZERO
  • ٠ U+0660 ARABIC-INDIC DIGIT ZERO
  • ٨ U+0668 ARABIC-INDIC DIGIT EIGHT

The second strange URL, /𝟮𝟬𝟬𝟴/, does not really involve control characters:

  • 𝟐 U+1D7D0 MATHEMATICAL BOLD DIGIT TWO
  • 𝟎 U+1D7CE MATHEMATICAL BOLD DIGIT ZERO
  • 𝟎 U+1D7CE MATHEMATICAL BOLD DIGIT ZERO
  • 𝟖 U+1D7D6 MATHEMATICAL BOLD DIGIT EIGHT

(I should mention at this point that you might be seeing a pile of square boxes instead of actual characters, depending on the sophistication of your browser and OS.)
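If your font lets you down, you can ask Python what the characters are instead. This is a small sketch using the standard unicodedata module under Python 3; the describe helper is a made-up name for illustration, and the strings are spelled with escapes so the source stays readable whatever your editor can display.

```python
import unicodedata

def describe(text):
    """Return (code point, name, general category) for each character."""
    return [
        ('U+%04X' % ord(ch), unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]

arabic_indic = '\u0662\u0660\u0660\u0668'
math_bold = '\U0001D7D0\U0001D7CE\U0001D7CE\U0001D7D6'

# Every row ends in 'Nd', the "Number, Decimal Digit" category.
for row in describe(arabic_indic) + describe(math_bold):
    print(row)
```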

What ties it all together is Python's ability to convert any of these Unicode digits to an integer:

Python 2.6.5 (r265:79063, Jul  5 2010, 11:46:13) 
[GCC 4.5.0 20100604 [gcc-4_5-branch revision 160292]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> int(u'໕')
5
>>> 

N.B.: the u'' syntax is required in Python 2.x to create a Unicode string; all strings are Unicode in Python 3.x, so you could write int('໕') and get the same effect.
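The same conversion works on the two year strings from the strange URLs. A short sketch, assuming Python 3 and using illustrative variable names:

```python
arabic_indic_year = '\u0662\u0660\u0660\u0668'
math_bold_year = '\U0001D7D0\U0001D7CE\U0001D7CE\U0001D7D6'
lao_five = '\u0ED5'  # the character from the session above

print(int(arabic_indic_year))  # 2008
print(int(math_bold_year))     # 2008
print(int(lao_five))           # 5

# Converting back to text yields plain ASCII digits.
print(str(int(arabic_indic_year)))  # 2008
```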

Now, I can't decide if this is a really awesome example of end-to-end internationalization, or a nasty class of security vulnerabilities waiting to happen. If you are expecting a \d to always match an ASCII digit, you could end up with corrupted or malformed output when you write the captured text to a protocol that demands 7-bit US-ASCII, such as HTTP headers.
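Here is a rough sketch of that failure mode and one way around it, assuming Python 3. The helper to_ascii_header_value is hypothetical, not part of Django or any HTTP library; it simply round-trips the captured text through int(), which also drops leading zeros, so it only suits values that are genuinely numbers.

```python
def to_ascii_header_value(captured):
    """Hypothetical helper: normalize regex-captured digits to ASCII text."""
    return str(int(captured))

captured = '\u0662\u0660\u0660\u0668'  # what \d\d\d\d may hand you

# The raw capture cannot be encoded for a 7-bit protocol.
try:
    captured.encode('ascii')
    raw_is_ascii = True
except UnicodeEncodeError:
    raw_is_ascii = False

safe = to_ascii_header_value(captured)

print(raw_is_ascii)          # False
print(safe)                  # 2008
print(safe.encode('ascii'))  # b'2008'
```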

The key message should be: never use regular expressions for type enforcement. Or, if you do, be careful with your flags, and always treat the result of a regular expression match as a string with untrusted content.
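In practice that could look something like the sketch below, which treats the captured group as an untrusted string and checks it explicitly before converting. It assumes Python 3; parse_year and _YEAR_RE are made-up names for illustration, not anything Django provides.

```python
import re

# Explicit ASCII range, so the result does not depend on any flags.
_YEAR_RE = re.compile(r'[0-9]{4}')

def parse_year(captured):
    """Return captured as an int, or None unless it is exactly four ASCII digits."""
    if not isinstance(captured, str):
        return None
    # fullmatch rejects trailing characters, including a trailing newline.
    if _YEAR_RE.fullmatch(captured) is None:
        return None
    return int(captured)

print(parse_year('2008'))                      # 2008
print(parse_year('\u0662\u0660\u0660\u0668'))  # None
print(parse_year('2008\n'))                    # None
```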

I have not mentioned the other regular expression character classes: \w and \s. Clearly Unicode vastly expands the alphabet of matching characters for these tokens too. I'm guessing most people already expect all sorts of Unicode bumf to match these classes, but it was a surprise to me that not even digits were safe.
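For completeness, a quick sketch of \w and \s showing the same behaviour, again assuming Python 3, with re.ASCII standing in for "Unicode mode off". The sample strings are my own choices for illustration.

```python
import re

name = 'Sigur\u00f0sson'  # contains U+00F0 LATIN SMALL LETTER ETH
em_space = '\u2003'       # U+2003 EM SPACE
arabic_indic_year = '\u0662\u0660\u0660\u0668'

print(bool(re.fullmatch(r'\w+', name)))                # True
print(bool(re.fullmatch(r'\w+', name, re.ASCII)))      # False
print(bool(re.fullmatch(r'\s', em_space)))             # True
print(bool(re.fullmatch(r'\s', em_space, re.ASCII)))   # False

# \w covers digits too, so the exotic digits match here as well.
print(bool(re.fullmatch(r'\w+', arabic_indic_year)))   # True
```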

Comments

  1. Hinrik Örn Sigurðsson, on Saturday 13th November 2010, said:

    No. I'd say the key message here is: know your regular expressions.

    r'^([0-9]{4})/$' is very clear and would work just fine. Just because you used the wrong regular expression doesn't mean that regular expressions are unfit for this task.

  2. Frozenball, on Saturday 13th November 2010, said:

    How exactly are regular expressions at fault here?

  3. Adam, on Saturday 13th November 2010, said:

    This is good to know for sure. However, like Hinrik mentioned, [0-9] will save you from this type of problem.

  4. James, on Saturday 13th November 2010, said:

    Hinrik, I actually don't object to the behaviour of \d here, and I haven't "fixed" the regular expression I'm using. I like the fact that these exotic digits still count as digits. Think of it as case insensitivity for numbers.

    The point of the post is that you can't necessarily trust \d to match a simple ASCII integer, unless you take care with the flags (not possible in Django, as the re.UNICODE flag is hard-coded), or perform additional checks such as running it through int(), or, like you say, the set [0-9].

  5. Stefan Rusek, on Sunday 14th November 2010, said:

    James, I think your point should be to encourage people to stop limiting themselves to ASCII. We no longer live in an ASCII world, but people keep being surprised that this is so. Every regular expression implementation that I know of when passed a Unicode string (several don't support anything but Unicode strings), matches all characters that the Unicode spec defines as a number when \d is specified. So if you want to match only digits between 0 and 9, there is only one correct way to match it, [0-9].

    Python: http://docs.python.org/library/re.html

    .NET http://msdn.microsoft.com/en-us/library/20bw873z(v=VS.71).aspx
