Tokyo PC Users Group

 

Searching, matching and replacing successfully - Part 2

by Kurt Keller

Have you had a chance to use regular expressions since you read the first part of this article? What we covered last time allows only quite limited use of regular expressions; it was a starting point. To use regex efficiently, a few more things absolutely need to be covered. The examples in this second part might look a bit complicated, but you should be able to apply the concepts shown to various purposes.

Checking logfiles

One area where I often use regular expressions is getting information from logfiles, be it for troubleshooting or for gathering statistics. Let's assume we're running a DNS server with query logging turned on. Every request made to the DNS server is written to the logfile in the following format:


   29-Jun-2002 17:07:00.634 XX+/127.0.0.1/mouse.pinboard.com/ANY/IN
   29-Jun-2002 17:07:01.255 XX+/127.0.0.1/194.87.164.210.in-addr.arpa/PTR/IN
   29-Jun-2002 17:07:01.256 XX+/127.0.0.1/7.195.209.194.in-addr.arpa/PTR/IN
   29-Jun-2002 17:07:04.158 XX /210.164.87.195/sysstats.com/A/IN
   29-Jun-2002 17:07:04.475 XX /210.164.87.195/sysstats.com/MX/IN
   29-Jun-2002 17:07:04.488 XX /210.164.87.195/sysstats.com/MX/IN
   29-Jun-2002 17:07:05.876 XX+/127.0.0.1/pinboard.com/ANY/IN
   29-Jun-2002 17:07:06.161 XX+/127.0.0.1/sysstats.com/ANY/IN

As you can see, there is always one line per query, and it consists of seven fields. I'm not going to explain in detail what these fields mean. In our examples we're only concerned with the fourth field, which is the IP address of the requestor.

Who's bombarding us?

Sometimes I see on our internal DNS server that some client goes haywire and sends queries every 20 or so milliseconds. Yes, 50 queries a second! While that's peanuts for our DNS server to keep up with, the client's CPU usage goes right up to 100 %, it generates lots of unnecessary network traffic and it fills my log disks. Even though such a client sends so many queries, with over 5000 clients it can be hard to spot in a live logfile. Moreover, any sysadmin prefers to be informed proactively when something is out of the ordinary, rather than finding out the hard way when things stop working.

The idea is to have a script go over the live logfile (which is cycled daily) once an hour, count how many queries each client has issued and send an alert to the sysadmin if any client is over a certain threshold. I'm not going to present the whole script here. We're only concentrating on the part which uses regular expressions to extract the client IP addresses from the logfile.

Regular expression work is, in fact, pattern matching work; the better you can describe the pattern you want to match, the more successful you're going to be. Last time we knew the exact string we wanted to match, but this time we don't. Obviously we should extract the fourth field of each line in the logfile, the IP address of the requestor, and count how often each one appears. What we know about this field is that it is always an IP address, that it is the fourth field in the line and that it sits right between the first and the second forward slash. We'll be using sed, the Stream EDitor, to do this extraction work.

Let's see how we can translate "show me everything between the first and the second forward slash" into a regular expression. I suggest separating each line into three parts: the part before the fourth field, including the first forward slash; the requestor IP address, which we're after; and the rest, which follows it.

We don't know what's in the first part, but we know that its last character is a forward slash and that no character before it is a forward slash. So the first part consists of two subparts: anything which is not a forward slash, as often as possible, followed by exactly one forward slash. Remember that the dot stands for any character at all? Unfortunately, that includes the forward slash. You might be inclined to simply append a forward slash after the dot (./), but that will not work either, because it means any single character followed by a forward slash. As we have multiple characters before the first forward slash, we need some kind of quantifier to go with the dot. There are a number of quantifiers available:

* zero or more of the preceding
? zero or one of the preceding
+ one or more of the preceding
{n,m} at least n and at most m of the preceding
{n} exactly n of the preceding
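
To get a feel for the quantifiers listed above, here is a minimal sketch. It uses grep -E rather than sed, because extended regular expressions accept all five quantifiers without backslashes. The sample words and the helper name count_matches are made up for illustration.

```shell
# Hypothetical sample words, one per line.
words='ct
cat
caat
caaat'

# Count the lines that match an extended regular expression.
count_matches() {
    printf '%s\n' "$words" | grep -c -E "$1"
}

count_matches '^ca*t$'       # zero or more a: ct, cat, caat, caaat
count_matches '^ca?t$'       # zero or one a: ct, cat
count_matches '^ca+t$'       # one or more a: cat, caat, caaat
count_matches '^ca{2,3}t$'   # two to three a: caat, caaat
count_matches '^ca{2}t$'     # exactly two a: caat
```

The five calls print 4, 2, 3, 2 and 1, the number of sample words each pattern accepts.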

Not all tools support all the quantifiers, and some, including sed, require writing the {n,m} construct as \{n,m\}. So let's add an asterisk after the dot (the basic regular expressions sed uses by default have no + quantifier), which gives us .*/ as the pattern to try. We want to replace the first part of our three-part line with nothing, which is equivalent to deleting it. For this we use the sed command s (substitute), which has the following syntax: s,pattern_to_match,replacement_string,


	%> sed "s,.*/,," bindqueries.log
	IN
	IN
	IN
	IN
	IN
	IN
	IN
	IN

Oops, not quite what we expected. Everything up to and including the last forward slash has been deleted, instead of only up to and including the first one. This is because regular expressions have a resemblance to humans: they are greedy. A quantified element will at first match as much as it possibly can, and it only gives characters back (backtracks) if the rest of the pattern can not match otherwise.
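
The greedy behaviour is easier to watch on a very short line. The following is only a sketch with a made-up sample string (a/b/c) and X as a visible replacement marker:

```shell
# Hypothetical one-line sample.
line='a/b/c'

# .* first grabs the whole line, then gives characters back until
# a forward slash can match: the LAST slash is the one that is used.
printf '%s\n' "$line" | sed 's,.*/,X,'     # prints Xc

# With a longer tail (/b) the engine has to give back more
# before the remaining pattern can match.
printf '%s\n' "$line" | sed 's,.*/b,X,'    # prints X/c
```

In both cases .* takes as much as it can get away with; only the rest of the pattern decides how much it has to hand back.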

Last time we also spoke about character classes; for example, [Tt] matches both the upper and lower case letter t. Character classes can also be negated by putting a caret in the very first position. Thus [^Tt] means any character except upper or lower case t, and [^/] means any character except a forward slash. So let's try this:


	%> sed "s,[^/]*/,," bindqueries.log
	127.0.0.1/mouse.pinboard.com/ANY/IN
	127.0.0.1/194.87.164.210.in-addr.arpa/PTR/IN
	127.0.0.1/7.195.209.194.in-addr.arpa/PTR/IN
	210.164.87.195/sysstats.com/A/IN
	210.164.87.195/sysstats.com/MX/IN
	210.164.87.195/sysstats.com/MX/IN
	127.0.0.1/pinboard.com/ANY/IN
	127.0.0.1/sysstats.com/ANY/IN

Looks much better already. What's left to do is to delete the third part. We're simply going to pipe the output of the first sed command into another sed command. By now it should be pretty straightforward to create a regular expression that matches everything from the first forward slash onwards, including that slash, and to translate it into a sed substitute command. The two commands piped together are:


	%> sed "s,[^/]*/,," bindqueries.log | sed "s,/.*,,"
	127.0.0.1
	127.0.0.1
	127.0.0.1
	210.164.87.195
	210.164.87.195
	210.164.87.195
	127.0.0.1
	127.0.0.1

Great, exactly what we wanted. Now running this through sort and uniq, we get an ordered list, showing for each IP address how many queries it sent:


	%> sed "s,[^/]*/,," bindqueries.log | sed "s,/.*,," | sort | uniq -c | sort -nr
	   5 127.0.0.1
	   3 210.164.87.195

The rest of our proactive monitoring script would now only have to check whether there are any clients with more than, let's say 20'000 requests, and if so send an alert to the sysadmin.
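
The threshold check described above could be sketched roughly as follows. This is not the actual monitoring script: the function name clients_over_limit and the way the limit is passed in are assumptions for illustration, and sending the alert itself is left out.

```shell
# Print "address count" for every client with more queries than the
# limit given as first argument. Reads query log lines on standard input.
clients_over_limit() {
    sed 's,[^/]*/,,' | sed 's,/.*,,' | sort | uniq -c |
        awk -v limit="$1" '$1 > limit { print $2, $1 }'
}

# Small demonstration with three sample lines and a limit of 1 query.
printf '%s\n' \
    '29-Jun-2002 17:07:04.158 XX /210.164.87.195/sysstats.com/A/IN' \
    '29-Jun-2002 17:07:05.876 XX+/127.0.0.1/pinboard.com/ANY/IN' \
    '29-Jun-2002 17:07:06.161 XX+/127.0.0.1/sysstats.com/ANY/IN' |
    clients_over_limit 1
```

Against the real logfile, the call would look something like clients_over_limit 20000 < bindqueries.log; any output at all means somebody is over the limit.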

Backreferences

We used two sed commands for our DNS logfile extraction example to match and delete the parts we don't want. The same result can also be achieved with a single command that matches what we actually want to keep and replaces the whole line with that. By enclosing parts of the match pattern in round brackets (which sed requires to be written as \( and \)), these parts can be referenced in the substitution part with \1 for the first bracket pair, \2 for the second one and so on.
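
A minimal sketch of the idea, using a made-up sample string instead of a logfile line: two bracketed parts are remembered and written back in reverse order. The variable name swapped is only for illustration.

```shell
# Hypothetical sample: a name and a record type separated by a slash.
# \([^/]*\) remembers the part before the slash as \1,
# \(.*\) remembers the part after it as \2.
swapped=$(printf '%s\n' 'sysstats.com/MX' | sed 's,\([^/]*\)/\(.*\),\2 \1,')
printf '%s\n' "$swapped"    # prints MX sysstats.com
```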

As mentioned earlier, each line in the logfile should be divided into three parts. The matching expressions for the first and the third part we already saw, but the second part we omitted up to now. What would it look like? As usual, there is more than one way to do it. You could say that the second part can only consist of digits and dots and I could say that it is just everything up to, but not including, the next forward slash. As I'm the one writing the article, we're trying my solution. Let's first analyze the command before trying it out:


	sed "s,[^/]*/\([^/]*\).*,\1," bindqueries.log

	s,   ,   ,
sed's substitute command and the characters to separate command, matching pattern, substitution pattern and end of command. Usually a forward slash is being used as the dividing character, but because we use that in the pattern itself, I use commas instead, sparing me the trouble of escaping all the slashes in the pattern.

	[^/]*   (first occurrence)
Match anything except for a forward slash, zero or more times.

	/
Match one forward slash.

	\(   \)
Copy everything between \( and \) into a temporary buffer which can be referenced later by \1.

	[^/]*   (second occurrence, between \( and \))
Match anything except for a forward slash, zero or more times.

	.*
Match anything zero or more times. We don't need to explicitly address the forward slash after the matched IP address, as we're not really interested in it and we exclude it in the pattern we in fact are interested in. Using .* instead of /.* yields better performance.

	\1
Replace everything with the contents of the first buffer.

So now let's check whether the command gives the right output:


	%> sed "s,[^/]*/\([^/]*\).*,\1," bindqueries.log
	127.0.0.1
	127.0.0.1
	127.0.0.1
	210.164.87.195
	210.164.87.195
	210.164.87.195
	127.0.0.1
	127.0.0.1

Perfect!
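
For comparison, the other solution mentioned earlier (the second part consists only of digits and dots) could look like the following sketch. It is shown on a single sample line taken from the logfile excerpt; the variable names are made up for illustration.

```shell
# One line from the sample logfile.
line='29-Jun-2002 17:07:04.158 XX /210.164.87.195/sysstats.com/A/IN'

# [0-9.]* accepts only digits and dots, so the remembered part
# ends at the first character that is neither.
address=$(printf '%s\n' "$line" | sed 's,[^/]*/\([0-9.]*\).*,\1,')
printf '%s\n' "$address"    # prints 210.164.87.195
```

On this logfile format both variants give the same result; the digits-and-dots class is simply a stricter description of what an address may contain.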

A word about performance

Depending on the regex engine your tools are built upon, there can be a heavy performance penalty using one method or another. Yes, there are different regex engines and they can differ drastically. I'm not going into details here on performance tuning regular expressions, but just to give you an idea about the scope of difference here are a few numbers. On one platform, using a sample file with somewhat less than 200'000 log entries, I timed the mentioned single sed command, the double sed command and an awk (Aho, Weinberger, Kernighan - names of program authors) command doing the same thing. The awk command finished in less than 3 seconds, the double sed command in 34 seconds and the single sed command ran for 146 seconds. Making a very small change in the single sed command I brought time down to 133 seconds. On another platform, using a sample file with close to one million entries, the single sed command cost 8 seconds, the double sed command 9 seconds and the awk version required 86 seconds. So if you develop scripts relying heavily on regular expressions, it pays off to know which regex engine your tools are using and how they work internally. It may also be well worth spending some time testing different approaches.
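
The awk command that was timed is not shown in this article. One plausible equivalent, and this is purely an assumption, treats the forward slash as the field separator and prints the second field. The function name extract_with_awk is made up for illustration.

```shell
# Assumed awk counterpart of the sed commands: split each line at
# forward slashes and print the second piece, the requestor address.
extract_with_awk() {
    awk -F/ '{ print $2 }'
}

printf '%s\n' \
    '29-Jun-2002 17:07:01.255 XX+/127.0.0.1/194.87.164.210.in-addr.arpa/PTR/IN' |
    extract_with_awk    # prints 127.0.0.1

# To compare run times on your own platform, prefix each variant with
# time, for example:
#   time awk -F/ '{ print $2 }' bindqueries.log > /dev/null
```

Your numbers will almost certainly differ from the ones quoted above, which is exactly the point: measure on the platform the script will run on.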

Further reading

We have still only scratched the surface of regular expressions, but the scratch is a bit deeper than last time. Hopefully I was able to show you that this character salad can be extremely handy and a big time saver, and to whet your appetite for regular expressions a little. If you have O'Reilly's book Unix in a Nutshell (http://www.oreilly.com/catalog/unixnut3/), you'll find a chapter about pattern matching which can get you a little further. If you want to know the nitty-gritty details of regular expressions, I can recommend Jeffrey Friedl's book Mastering Regular Expressions (http://www.oreilly.com/catalog/regex/), but you might want to wait a month or two for the second edition to be published (http://www.oreilly.com/catalog/regex2/).

If you found these two articles about regular expressions useful and would like to see more of them, let the editor know, preferably with some examples you would like to have solved.

References

Mastering Regular Expressions http://www.oreilly.com/catalog/regex/
PINBOARD http://www.pinboard.com/
HighTechSamurai http://kurt.www.pinboard.com/


© Algorithmica Japonica Copyright Notice: Copyright of material rests with the individual author. Articles may be reprinted by other user groups if the author and original publication are credited. Any other reproduction or use of material herein is prohibited without prior written permission from TPC. The mention of names of products without indication of Trademark or Registered Trademark status in no way implies that these products are not so protected by law.

Algorithmica Japonica

January, 2003

The Newsletter of the Tokyo PC Users Group



Tokyo PC Users Group, Post Office Box 103, Shibuya-Ku, Tokyo 150-8691, JAPAN