Searching, matching & replacing successfully - Part I
by Kurt Keller
Could be useful for other stuff, couldn't it?
AJ editor <editors@TPC.org>
John Doe <firstname.lastname@example.org>
Alice Springs <email@example.com>
TPC President <firstname.lastname@example.org>
TPC program director <email@example.com>
Bob Hacker <firstname.lastname@example.org>
Dave Cybercop <email@example.com>

Now we're informed that the Tokyo PC Club changes its domain name from tpc.org to tokyopc.org, because the Transaction Processing Performance Council so badly wants to get the domain tpc.org and has offered the Club a lot of money for it. We could now simply open the aliases.txt file with a text editor and change every occurrence of tpc.org into tokyopc.org. If our aliases.txt file consists only of the seven entries above, this is no problem. But the whole file has about 700 addresses and unfortunately I don't have a secretary to whom I could assign the task of adapting my aliases file. This is where a few tools that understand regular expressions can help us.
Regular expressions are always about matching some text. The better we can describe what we want to match, the more successful we're going to be. Regex matching is line oriented: the whole match must occur on a single line to be recognized (exceptions are possible with certain tools but very rare). As there is exactly one email address per line in our aliases.txt file, we use the command grep (Global Regular Expression Print) to show us all the tpc.org addresses in the file. So first we must know what text we want to match. Should be tpc.org, right? So let's try:
%> grep tpc.org aliases.txt
TPC President <firstname.lastname@example.org>
TPC program director <email@example.com>
Bob Hacker <firstname.lastname@example.org>

Well, not quite what we expected. First of all, the AJ editor's address is missing. This is because regular expressions are usually case sensitive. The command line switch -i makes grep match case insensitively and will solve that problem. But why do we have Bob Hacker in this list? After all, tpcsorg is not the same as tpc.org, right? Well, the dot is one of the special characters and means any character at all. So a single dot matches anything, but not nothing. If we want to match a real dot, we need to take the special meaning away from the dot: we need to escape it. This is done by preceding it with a backslash. To prevent interpretation of the backslash by the Unix shell, we'd better put the whole regular expression between single quotes (I'll spare you the detailed explanation of this at the moment, just believe me for now). So with this added knowledge let's try again:
%> grep -i 'tpc\.org' aliases.txt
AJ editor <editors@TPC.org>
TPC President <email@example.com>
TPC program director <firstname.lastname@example.org>

Cool, that's the list of all the tpc.org addresses we need to adapt. But I'm not yet satisfied with the regular expression we used. What if I had the email address <email@example.com> in my aliases.txt file? It would be matched as well because it contains the string tpc.org. We know much more about the string we actually want to match than what we put into our regex. It is an email address and the domain is tpc.org. There will always be an at sign (@) right in front of it. By including the at sign in our regex, we can further safeguard what we'll be matching. Add the following line to your aliases.txt file to see whether I'm having you on or telling you the truth:
Mike Oldfield <firstname.lastname@example.org>

%> grep -i 'tpc\.org' aliases.txt
AJ editor <editors@TPC.org>
TPC President <email@example.com>
TPC program director <firstname.lastname@example.org>
Mike Oldfield <email@example.com>

%> grep -i '@tpc\.org' aliases.txt
AJ editor <editors@TPC.org>
TPC President <firstname.lastname@example.org>
TPC program director <email@example.com>

Now we have safeguarded one side of the string we want to match. And now you expect me to take precautions at the end of the string as well, right? Hey, you're getting the hang of it, good. Add another line to your aliases.txt file:
Donna Summer <firstname.lastname@example.org>

When checking the syntax of our aliases.txt file, we see that all the email addresses are enclosed in angle brackets. If this is really so, it will help us to form a more reliably matching regular expression. I don't want to strain my eyes visually checking my own 700-line aliases.txt file to see whether really ALL the email addresses end with an angle bracket. Instead I use another regular expression to confirm it. First we need to know how many lines there are in our aliases.txt. We count them by piping the whole file into the wc (Word Count) utility. The -l command line switch tells it to count only lines:
%> cat aliases.txt | wc -l
9

And now we count how many lines end with closing angle brackets:
%> grep '>$' aliases.txt | wc -l
9

With the help of grep we filter all the lines which have a closing angle bracket (>) just before the end of the line ($) and pipe this output into wc, which then counts the lines for us. And we're in luck: the number is the same as with the previous command, which means that all the lines end with closing angle brackets. So our final grep command looks like this:
%> grep -i '@tpc\.org>' aliases.txt
AJ editor <editors@TPC.org>
TPC President <email@example.com>
TPC program director <firstname.lastname@example.org>

As you can see, the output is correct even though we now have the additional two 'troublemaker addresses' in our file, so our matching expression should be fine and reliable. With this output you can open your aliases.txt file and look for these entries to adapt them. No need to check each line in the file separately and possibly miss one.
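
The step-by-step narrowing of the pattern can be replayed in one go. The following is only a sketch: it works on a throwaway file with five invented entries (the names and addresses are assumptions made up for this illustration, not the real contents of aliases.txt), and it uses -i in every step so that only the pattern itself changes from one step to the next.

```shell
# Hypothetical sample data: all five entries are invented for this sketch.
sample=$(mktemp)
printf '%s\n' \
  'Editor <editors@TPC.org>' \
  'President <president@tpc.org>' \
  'Hacker <bob@tpcsorg.example>' \
  'Oldfield <mike@mtpc.org>' \
  'Summer <donna@tpc.org.example>' > "$sample"

# Count the lines that do NOT end in '>' (-v inverts the match).
# grep exits with status 1 when it selects nothing, hence "|| true".
unclosed=$(grep -cv '>$' "$sample" || true)

# Each step describes the wanted text a little more precisely.
step1=$(grep -ci 'tpc.org'    "$sample")   # dot matches any character
step2=$(grep -ci 'tpc\.org'   "$sample")   # only a literal dot
step3=$(grep -ci '@tpc\.org'  "$sample")   # must follow the at sign
step4=$(grep -ci '@tpc\.org>' "$sample")   # must end the address

echo "lines not ending in '>': $unclosed"
echo "matching lines per step: $step1 $step2 $step3 $step4"

# The lines the final pattern selects.
final=$(grep -i '@tpc\.org>' "$sample")
echo "$final"

rm -f "$sample"
```

With this sample data the four steps match 5, 4, 3 and 2 lines: each refinement drops exactly one of the invented troublemakers, and only the two genuine tpc.org entries remain.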
Lazy people need to know a little bit more
I'm going to do something else, though. I'm using regular expressions to actually do the whole work for me. On the one hand I'm lazy, and on the other hand, even though I'm a fast typist, my typing is not so reliable: I make too many mistakes. So I'm going to use another of those handy little Unix utilities: sed, the Stream EDitor. Like most other tools and editors that support regular expressions, it has a substitute function.
What you match with a regular expression (or parts of it, if you want to) can be substituted with something else. Standard sed does not, however, have an option for case insensitivity (some implementations, such as GNU sed, add one as an extension). So in order to match both upper and lower case I need to use character classes. When you want to match any one of a bunch of possible characters or signs, you can put the list of possibilities within square brackets. For example [abc] will match any one of the letters a, b or c. So if I want to match either an upper or lower case t I can use [Tt], for either case of the letter p it would be [Pp] and so on. The same can be used with grep:
%> grep '@[Tt][Pp][Cc]\.[Oo][Rr][Gg]>' aliases.txt
AJ editor <editors@TPC.org>
TPC President <email@example.com>
TPC program director <firstname.lastname@example.org>

I'm not going to explain the whole sed command, but if you have been following along, it should not be too difficult to at least guess what is going on:
%> sed 's/@[Tt][Pp][Cc]\.[Oo][Rr][Gg]>/@tokyopc.org>/' aliases.txt >aliases.new
%> cat aliases.new >aliases.txt
%> rm aliases.new

Even though I use regular expressions often, I wouldn't dare do this without first using grep in this way to check what is being matched.
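
For the cautious, the substitution can be wrapped in a check before and after. The sketch below is one possible way to do that, not part of the procedure above: it runs the same sed command on a throwaway file with three invented entries (assumptions for this illustration), uses diff to show what would change before the original is overwritten, and finally uses grep to confirm that no old addresses are left.

```shell
# Hypothetical stand-in for aliases.txt; the entries are invented.
aliases=$(mktemp)
printf '%s\n' \
  'Editor <editors@TPC.org>' \
  'President <president@tpc.org>' \
  'Summer <donna@tpc.org.example>' > "$aliases"

# Write the substituted text to a new file; the original stays untouched.
sed 's/@[Tt][Pp][Cc]\.[Oo][Rr][Gg]>/@tokyopc.org>/' "$aliases" > "$aliases.new"

# Preview the changes. diff exits with status 1 when the files differ,
# which is the expected case here, hence "|| true".
changes=$(diff "$aliases" "$aliases.new" || true)
echo "$changes"

# Happy with the preview: copy the new content over the old file.
cat "$aliases.new" > "$aliases"
rm -f "$aliases.new"

# Confirm that the old domain no longer matches anywhere.
leftover=$(grep -ci '@tpc\.org>' "$aliases" || true)
echo "old addresses left: $leftover"

result=$(cat "$aliases")
rm -f "$aliases"
```

The third invented entry is left alone because its address does not end in tpc.org followed by the closing angle bracket, which is exactly what the trailing > in the pattern is there for.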
The whole thing could also be done reliably in a couple of seconds with regular expressions in my favourite text editor, vim. For those of you who know vi or vim:
:% s/@tpc\.org>/@tokyopc.org>/igc

More to follow
As the sed and vim examples show, knowing how to use regular expressions can simplify your life and job a lot in many situations. Even adapting my 700-line aliases.txt file wouldn't take more than a couple of seconds with regular expressions and vim. Without regular expressions it would be a tedious, error-prone and time-consuming task. We have only scratched the surface of what regular expressions can do.
Some very important concepts, such as quantifiers, or backreferences in substitution have not even been mentioned yet. Watch out for part 2 of this introduction to regular expressions.
© Algorithmica Japonica Copyright Notice: Copyright of material rests with the individual author. Articles may be reprinted by other user groups if the author and original publication are credited. Any other reproduction or use of material herein is prohibited without prior written permission from TPC. The mention of names of products without indication of Trademark or Registered Trademark status in no way implies that these products are not so protected by law.
January 2003
The Newsletter of the Tokyo PC Users Group
Submissions : Editor
Tokyo PC Users Group, Post Office Box 103, Shibuya-Ku, Tokyo 150-8691, JAPAN