otsukare Thoughts after a day of work

Parsing UA string is not a regex job

We all do it. It is quick and it answers most of our needs. But if we really want to dig into the details parsing a UA string with regex is a sure way to fail. DailyMotion Team with gave me a few files with a list of user agent strings for a few of their domains. The biggest one representing 7 days of traffic contains 386 844 unique strings. We are not talking about access here. It's about the variability of user agent strings a server receives. It's properly insane. Digging into the data is a trove of information, broken delights, privacy insanity, etc. I will write more about these later on. Today we are talking about parsing one part of the UA string and a specific part of it.

Shape Of A User Agent String

Let's take the Firefox OS user agent string.

Mozilla/5.0 (Mobile; rv:18.0) Gecko/18.0 Firefox/18.0

Quite simple. So far so good. We can identify by eye patterns. We could even break down things more or less easily. Let's say, I'm interested only by the things which are inside the first paranthesis couple. A very simple regex could be:

\((.*)\)

The matching group will then be

Mobile; rv:18.0

Yeah! Love is everything. We can go lunch. (I did go to lunch 🍕, but then I had to face the reality). Let's try another string.

Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36

The predictable result is for a greedy pattern.

Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko

Oops. Yeah I can fix that. It's easy, I will not get scared by such a simple thing. Let's exclude anything which looks like the start of a paranthesis.

\(([^(]*)\)

Now we get

Windows NT 5.1

Pfew. We are done this time. Let's check

(Linux; U; Android 4.3; ko-kr; Nexus 7 Build/JSS15Q) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30 NAVER(inapp; search; 230; 4.7.7)

(Passing comment, we start noticing strings becoming insanely long). At least our pattern is working.

Linux; U; Android 4.3; ko-kr; Nexus 7 Build/JSS15Q

Another try with this user agent string

Mozilla/5.0 (Linux; Android 4.0.4; Micromax P500(Funbook) Build/IMM76D) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.59 Safari/537.36

The result is

(Funbook) Build/IMM76D)

Nested parenthesis, aka nested patterns, are not an easy task in regex, except if you rely on additional knowledge about subpatterns. But… remember we are dealing with user agent strings and the list of patterns are a known unknown. For those who want more details, check the Automata Theory (I don't pretend to understand 10% of what is there). If you want the answer for dummies like me:

No. It's that easy. A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.

You can match nested/paired elements up to a fixed depth, where the depth is only limited by your memory, because the automaton gets very large. In practice, however, you should use a push-down automaton, i.e a parser for a context-free grammar, for instance LL (top-down) or LR (bottom-up). You have to take the worse runtime behavior into account: O(n^3) vs. O(n), with n = length(input).

There are many parser generators available, for instance ANTLR for Java. Finding an existing grammar for Java (or C) is also not difficult.

After regex, there is python

Steve Frécinaux (@sfreci) helped on IRC with one way of doing it in python.

import re
def find_parens(s):
    s = s[s.index('(')+1:]
    level = 1
    for match in re.finditer('[()]', s):
        level = level + (1 if match.group(0) == '(' else -1)
        if level == 0: return s[:match.start()]

Let's check

>>> find_parens("Mozilla/5.0 (Linux; Android 4.0.4; Micromax P500(Funbook) Build/IMM76D) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.59 Safari/537.36")
'Linux; Android 4.0.4; Micromax P500(Funbook) Build/IMM76D'

This is working!

>>> find_parens("Opera/9.80 (Windows NT 6.1; WOW64; MRA 6.1 (build 6578)) Presto/2.12.388 Version/12.15")
'Windows NT 6.1; WOW64; MRA 6.1 (build 6578)'

With this we can start studying one part of the user agent string and make some stats. Notice that at first I was interested by the devices. You will notice that there is in fact no logical patterns for devices in user agent strings. We have in this article 2 patterns with regards to the build version: Build/IMM76D and (build 6578).

Maybe what it would take instead of whinning on how bad the UA strings are irregular is to define a specification to write them. Though I'm uncertain on how effective it would be given the history of it.

Otsukare!