Justin's Blog

Written by: Justin Fisher
4/12/2008 5:18 PM

Searching for a phrase using Regex

 

If you aren’t familiar with regular expressions, you should be!  There are lots of websites ready to tell you more than you are likely to care to know about them.  http://www.regular-expressions.info/ has tutorials, and the Wikipedia entry ( http://en.wikipedia.org/wiki/Regular_expression ) is worth reading.

 

The Problem

 

So here is the problem to be solved: we want an easy way to use a regex to search for a phrase in a text.  If we just match on the characters in the phrase, the pattern will include space characters between the words, and if the text has multiple spaces, tabs, or line breaks between words the pattern won’t match.  The space between words doesn’t need to match exactly, so long as the words themselves match:

Pattern:   lazy dog

Text:      The quick brown fox jumps over the lazy   dog.

The pattern doesn’t match because there are three spaces between “lazy” and “dog” in the text, but only one space in the pattern. 

 

Wildcards to the rescue! 

 

In addition to matching literal characters (like our pattern “lazy dog”), regexes support a rich assortment of metacharacters: characters that tell the regex engine how to operate.  For example, a period (“.”) matches any single character (other than a line break, by default).  If you want to match just a period, you can “escape” it by preceding it with a “\”; the “\” tells the regex to ignore the special meaning of the next character and treat it as a literal character.  The metacharacter we’ll need to solve our problem is the “match 0 or more times” quantifier (“*”), which applies to whatever comes immediately before it.
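To make the escaping rule concrete, here is a small sketch.  The class name DotDemo and the helper Matches are made up for this illustration; Regex.IsMatch is the standard .NET call.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // Hypothetical helper: true when the pattern matches somewhere in the text.
    public static bool Matches(string text, string pattern)
    {
        return Regex.IsMatch(text, pattern);
    }

    public static void Main()
    {
        // Unescaped ".": matches any single character, so both texts match.
        Console.WriteLine(Matches("abc", "a.c"));    // True
        Console.WriteLine(Matches("a.c", "a.c"));    // True

        // Escaped "\.": matches only a literal period.
        Console.WriteLine(Matches("abc", @"a\.c"));  // False
        Console.WriteLine(Matches("a.c", @"a\.c"));  // True
    }
}
```

The @"..." verbatim string keeps the backslash as typed, so the regex engine sees exactly a\.c.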

So let’s adjust our pattern to allow for an unknown number of blanks between words in our phrase.  Now we have:

Pattern:   lazy *dog

The “*” means to match 0-or-more spaces (the character immediately before the “*”), so now our pattern matches the “lazy   dog” in the text.  Cool!  We’re done – or are we?
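A quick way to check this is to run both patterns against the sample text.  This is only a sketch; the class name StarDemo is invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StarDemo
{
    // The sample text from above, with three spaces between "lazy" and "dog".
    public const string Text = "The quick brown fox jumps over the lazy   dog.";

    public static void Main()
    {
        // One literal space in the pattern, three in the text: no match.
        Console.WriteLine(Regex.IsMatch(Text, "lazy dog"));        // False

        // " *" allows zero or more spaces between the words.
        Console.WriteLine(Regex.IsMatch(Text, "lazy *dog"));       // True

        // Zero spaces also counts, so "lazydog" matches as well.
        Console.WriteLine(Regex.IsMatch("lazydog", "lazy *dog"));  // True
    }
}
```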

 

Complications

 

What if our text had a tab character between “lazy” and “dog”?  Tab doesn’t match space, but both are considered whitespace to human readers; after all, we can’t see the tab or space characters!  The regex metacharacters provide a way to match a tab character: “\t”.  There is also a way to describe alternatives, so we could match a sequence of either spaces or tabs (or both, of course!).  It’s all getting a bit complicated!
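Here is a rough sketch of what the “spaces or tabs” approach might look like.  The alternation form ( |\t)+ follows the description above; the character-class form [ \t]+ is an equivalent spelling added here for comparison.  The class name TabDemo is invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TabDemo
{
    public static void Main()
    {
        string tabbed = "the lazy\tdog";

        // The space-only pattern does not accept a tab.
        Console.WriteLine(Regex.IsMatch(tabbed, "lazy *dog"));                   // False

        // Alternation: one or more of "space or tab".
        Console.WriteLine(Regex.IsMatch(tabbed, @"lazy( |\t)+dog"));             // True

        // Character class: the same idea, written more compactly.
        Console.WriteLine(Regex.IsMatch(tabbed, @"lazy[ \t]+dog"));              // True

        // A mix of spaces and tabs is fine too.
        Console.WriteLine(Regex.IsMatch("the lazy \t dog", @"lazy[ \t]+dog"));   // True

        // A newline is neither a space nor a tab, so this still fails.
        Console.WriteLine(Regex.IsMatch("the lazy\ndog", @"lazy[ \t]+dog"));     // False
    }
}
```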

Fortunately, the regex in .NET provides us with a metacharacter for just this situation: “\s”.  It matches any whitespace character, including tab, space, and newline.  If we add the metacharacter for “match 1 or more of the preceding element”, which is “+”, we can get the desired effect by making the pattern like this:

Pattern:   lazy\s+dog

This version will match our text even if there is a newline between “lazy” and “dog”.  Note that we used the “+” because we didn’t want to treat “lazydog” as a match; if we wanted that to match, we’d use the pattern “lazy\s*dog” to do the job.
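The difference between “\s+” and “\s*” is easy to see in a short test.  This is a sketch only; the class name WhitespaceDemo is invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    public static void Main()
    {
        // \s+ accepts any run of whitespace: spaces, tabs, line breaks, or a mix.
        Console.WriteLine(Regex.IsMatch("lazy   dog", @"lazy\s+dog"));        // True
        Console.WriteLine(Regex.IsMatch("lazy\tdog", @"lazy\s+dog"));         // True
        Console.WriteLine(Regex.IsMatch("lazy\ndog", @"lazy\s+dog"));         // True
        Console.WriteLine(Regex.IsMatch("lazy \t\r\n dog", @"lazy\s+dog"));   // True

        // "+" requires at least one whitespace character...
        Console.WriteLine(Regex.IsMatch("lazydog", @"lazy\s+dog"));           // False

        // ...while "*" also accepts none at all.
        Console.WriteLine(Regex.IsMatch("lazydog", @"lazy\s*dog"));           // True
    }
}
```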

 

Back to the original problem

 

So we know how to make a pattern that will match a phrase – we just separate the words with “\s+”.  But we are going to have users type in the phrases, and we don’t want them to have to remember to put “\s+” between their words instead of using spaces!  We can cope with this by using a regex method to replace sequences of whitespace between words with the “\s+” notation.  For example,

Input:     quick  brown     fox

Pattern:   quick\s+brown\s+fox

Here is a code snippet to do this:

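    // Requires: using System.Text.RegularExpressions;
    private String MakePattern(String argPattern)
    {
      // "\\s+" is the C# literal for the regex \s+ (one or more whitespace characters)
      String strMatchWhitespace = "\\s+";

      // Convert internal whitespace to whitespace-spanning expressions
      return Regex.Replace(argPattern, strMatchWhitespace, strMatchWhitespace);
    }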

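Note something rather neat here: we’re using the strMatchWhitespace pattern to match sequences of whitespace to be replaced by the strMatchWhitespace pattern!  This works because .NET treats only “$” specially in a replacement string, so the “\s+” in the third argument is inserted as literal text.  One caveat: any other regex metacharacters the user types (such as “.” or “*”) pass through unchanged and keep their special meaning.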
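If the phrase comes straight from a user, it may contain characters such as “.” that the regex would treat as metacharacters.  One possible variation, not part of the original approach, is to split the phrase into words, escape each word with Regex.Escape, and join the words with “\s+”.  The names PhrasePatterns and MakeEscapedPattern are invented for this sketch, and rejecting an empty phrase is an assumption about what a caller would want.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhrasePatterns
{
    // Hypothetical variant of MakePattern that treats every word literally.
    public static string MakeEscapedPattern(string phrase)
    {
        string trimmed = phrase.Trim();
        if (trimmed.Length == 0)
        {
            // Assumption: an empty phrase is a caller error, since an
            // empty pattern would match any text.
            throw new ArgumentException("Phrase must contain at least one word.");
        }

        // Split on runs of whitespace, then escape each word separately.
        // (Regex.Escape also escapes whitespace, so we escape after splitting.)
        string[] words = Regex.Split(trimmed, @"\s+");
        for (int i = 0; i < words.Length; i++)
        {
            words[i] = Regex.Escape(words[i]);
        }

        // Rejoin the words with the whitespace-spanning expression.
        return string.Join(@"\s+", words);
    }

    public static void Main()
    {
        string pattern = MakeEscapedPattern("3.5   mm");
        Console.WriteLine(pattern);                                   // 3\.5\s+mm

        // Unescaped, the "." would match any character, including the "6".
        Console.WriteLine(Regex.IsMatch("a 365 mm lens", @"3.5\s+mm")); // True
        Console.WriteLine(Regex.IsMatch("a 365 mm lens", pattern));     // False
        Console.WriteLine(Regex.IsMatch("a 3.5  mm jack", pattern));    // True
    }
}
```

The trade-off is that users can no longer type regex syntax on purpose; whether that is desirable depends on who the users are.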

 

Finding the phrase

 

How do we use the MakePattern method to find our phrase?  Let's suppose that we aren't interested in where the phrase occurs, or whether it occurs several times, but just whether or not it appears at all.   So our approach will be to take the original phrase, turn it into a pattern, match the pattern, and return true if the pattern has been matched:

 

 

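    public Boolean PhraseFound(String argPhrase, String argText)
    {
      // Turn the user's phrase into a whitespace-tolerant pattern
      String strPattern = MakePattern(argPhrase);

      // We only care whether the phrase occurs, not where or how often
      Match match = Regex.Match(argText, strPattern);
      return match.Success;
    }

    public Boolean PhraseNotFound(String argPhrase, String argText)
    {
      // Simple negation of PhraseFound, for readability at the call site
      return !PhraseFound(argPhrase, argText);
    }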
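To see the pieces working together, here is a self-contained sketch.  The methods are the ones from this post, made static so they can be called from Main; the wrapper class PhraseSearch is invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhraseSearch
{
    // Same logic as MakePattern above: whitespace runs become \s+.
    public static string MakePattern(string phrase)
    {
        string matchWhitespace = "\\s+";
        return Regex.Replace(phrase, matchWhitespace, matchWhitespace);
    }

    public static bool PhraseFound(string phrase, string text)
    {
        return Regex.Match(text, MakePattern(phrase)).Success;
    }

    public static bool PhraseNotFound(string phrase, string text)
    {
        return !PhraseFound(phrase, text);
    }

    public static void Main()
    {
        string text = "The quick brown fox jumps over the lazy   dog.";

        // Extra spaces in the text no longer matter.
        Console.WriteLine(PhraseFound("lazy dog", text));              // True

        // Extra spaces in the phrase do not matter either.
        Console.WriteLine(PhraseFound("quick  brown     fox", text));  // True

        // A phrase that is not in the text.
        Console.WriteLine(PhraseNotFound("lazy cat", text));           // True

        // Matching is case-sensitive by default.
        Console.WriteLine(PhraseFound("Lazy Dog", text));              // False
    }
}
```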

 


1 comment so far...

Re: Just going through a phrase

Hmmm... That's an interesting idea. Using a regex to replace what it matches and thereby produce a regex. There's something beautiful and brilliant about that; something that could be applied in more than this case...

By Will on   4/17/2008 2:07 PM

Copyright © 2004-2007 by Fingerfuel.com.