Searching for a phrase using Regex
If you aren’t familiar with regular expressions, you should be! Plenty of websites will tell you more than you are likely to care to know about them: http://www.regular-expressions.info/ has tutorials, and the Wikipedia entry at http://en.wikipedia.org/wiki/Regular_expression is worth reading.
The Problem
So here is the problem to be solved: we want an easy way to use a regex to search for a phrase in a text. If we just match on the characters in the phrase, the pattern will include space characters between the words, and if the text has multiple spaces, tabs, or line breaks between words the pattern won’t match. The space between words doesn’t need to match exactly, so long as the words themselves match:
Pattern: lazy dog
Text: The quick brown fox jumps over the lazy   dog.
The pattern doesn’t match because there are three spaces between “lazy” and “dog” in the text, but only one space in the pattern.
Wildcards to the rescue!
In addition to being able to search for literal characters (like our pattern “lazy dog”), regexes allow a rich assortment of metacharacters: characters which tell the regex how to operate. For example, a period (“.”) matches any single character (except, by default, a newline). If you want to match just a period, you can “escape” the period by preceding it with a “\”; the “\” tells the regex to ignore the special meaning of the next character and treat it as a literal character. A metacharacter we’ll need to solve our problem is the “match 0 or more times” character (“*”).
So let’s adjust our pattern to allow for an unknown number of blanks between words in our phrase. Now we have:
Pattern: lazy *dog
The “*” means to match 0-or-more spaces (the character immediately before the “*”), so now our pattern matches the “lazy dog” in the text. Cool! We’re done – or are we?
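To see the difference between the two patterns in practice, here is a minimal C# sketch. The class name StarDemo and its Matches helper are made up for illustration; the sample text is assumed to contain three spaces between “lazy” and “dog”.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the article's code.
public static class StarDemo
{
    // Sample text with three spaces between "lazy" and "dog".
    public const string Text = "The quick brown fox jumps over the lazy   dog.";

    // Returns true when the pattern matches anywhere in the sample text.
    public static bool Matches(string pattern)
    {
        return Regex.IsMatch(Text, pattern);
    }
}
```

With this in place, StarDemo.Matches("lazy dog") returns false, while StarDemo.Matches("lazy *dog") returns true.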
Complications
What if our text had a tab character between “lazy” and “dog”? Tab doesn’t match space, but both are considered whitespace to human readers; after all, we can’t see the tab or space characters! The regex metacharacters provide a way to match a tab character: “\t”. There is also a way to describe alternatives, so we could match a sequence of either spaces or tabs (or both, of course!). It’s all getting a bit complicated!
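For the curious, the alternatives approach might look something like the sketch below. The class and member names are invented for this example; the pattern uses “|” to mean “either a space or a tab” and “*” to repeat that choice.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the article's code.
public static class AlternationDemo
{
    // "( |\t)*" means: zero or more characters, each either a space or a tab.
    public const string SpaceOrTab = "lazy( |\\t)*dog";

    // Returns true when the text contains "lazy" and "dog"
    // separated only by spaces and/or tabs.
    public static bool Matches(string text)
    {
        return Regex.IsMatch(text, SpaceOrTab);
    }
}
```

This handles "lazy \t dog", but a line break between the words still defeats it, which is why a more general notation is worth having.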
Fortunately, the regex in .NET provides us with a metacharacter for just this situation: “\s”. It matches any whitespace character, including tab, space, and newline. If we add the metacharacter for “match 1 or more of the preceding character”, which is “+”, we can get the desired effect by making the pattern like this:
Pattern: lazy\s+dog
This version will match our text even if there is a newline in between “lazy” and “dog”. Note that we used the “+” because we didn’t want to treat “lazydog” as a match; if we wanted that to match, we’d use the pattern “lazy\s*dog” to do the job.
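A short sketch comparing the two variants, using a made-up class name (WhitespaceDemo) and helper methods purely for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the article's code.
public static class WhitespaceDemo
{
    // "\s+" requires at least one whitespace character between the words.
    public static bool MatchesPlus(string text)
    {
        return Regex.IsMatch(text, @"lazy\s+dog");
    }

    // "\s*" also accepts the words run together with no whitespace at all.
    public static bool MatchesStar(string text)
    {
        return Regex.IsMatch(text, @"lazy\s*dog");
    }
}
```

Both methods accept "lazy \t\n dog"; only MatchesStar accepts "lazydog".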
Back to the original problem
So we know how to make a pattern that will match a phrase – we just separate the words with “\s+”. But we are going to have users type in the phrases, and we don’t want them to have to remember to put “\s+” between their words instead of using spaces! We can cope with this by using a regex method to replace sequences of whitespace between words with the “\s+” notation. For example,
Input: quick brown fox
Pattern: quick\s+brown\s+fox
Here is a code snippet to do this:
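// Requires: using System.Text.RegularExpressions;
private String MakePattern(String argPattern)
{
    // Pattern that matches a run of one or more whitespace characters
    String strMatchWhitespace = "\\s+";
    // Replace each run of whitespace in the phrase with the text \s+
    return Regex.Replace(argPattern, strMatchWhitespace, strMatchWhitespace);
}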
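Note something rather neat here: we’re using the strMatchWhitespace pattern to match sequences of whitespace to be replaced by the strMatchWhitespace pattern! This works because the same string plays two different roles: as the search pattern it is interpreted as a regex, while as the replacement it is inserted as literal text (in a .NET replacement string only “$” has a special meaning, so the backslash is left alone).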
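One thing MakePattern does not do is escape metacharacters in what the user typed, so a phrase such as “dog.” would treat the period as a wildcard. If that matters, one possible variation (a sketch only, with the hypothetical names PhrasePatternSketch and MakeEscapedPattern) is to split the phrase into words, escape each word with Regex.Escape, and join the words with “\s+”:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical variation on MakePattern, not part of the article's code.
public static class PhrasePatternSketch
{
    public static string MakeEscapedPattern(string phrase)
    {
        // Trim first so leading/trailing whitespace does not produce empty words,
        // then split the phrase on runs of whitespace.
        string[] words = Regex.Split(phrase.Trim(), @"\s+");

        // Escape each word so characters like "." or "*" are matched literally,
        // then join the words with the whitespace-spanning expression.
        string[] escaped = words.Select(w => Regex.Escape(w)).ToArray();
        return string.Join(@"\s+", escaped);
    }
}
```

The escaping has to happen per word rather than on the whole phrase, because Regex.Escape also escapes whitespace characters, which would interfere with the whitespace replacement.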
Finding the phrase
How do we use the MakePattern method to find our phrase? Let's suppose that we aren't interested in where the phrase occurs, or whether it occurs several times, but just whether or not it appears at all. So our approach will be to take the original phrase, turn it into a pattern, match the pattern, and return true if the pattern has been matched:
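// Returns true if the phrase occurs anywhere in the text,
// however its words happen to be separated by whitespace.
public Boolean PhraseFound(String argPhrase, String argText)
{
    // Turn the phrase into a whitespace-tolerant pattern
    String strPattern = MakePattern(argPhrase);
    Match match = Regex.Match(argText, strPattern);
    return match.Success;
}

// Convenience inverse of PhraseFound
public Boolean PhraseNotFound(String argPhrase, String argText)
{
    return !PhraseFound(argPhrase, argText);
}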
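To try the whole thing out, here is a self-contained sketch that repeats the logic of MakePattern and PhraseFound as static methods and runs them against a sample text. The class name PhraseSearchDemo, the Demo method, and the sample text are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class; the two methods mirror the article's snippets.
public static class PhraseSearchDemo
{
    // Same logic as MakePattern: whitespace runs become the text \s+
    public static string MakePattern(string phrase)
    {
        return Regex.Replace(phrase, @"\s+", @"\s+");
    }

    // Same logic as PhraseFound: true if the phrase occurs anywhere in the text
    public static bool PhraseFound(string phrase, string text)
    {
        return Regex.Match(text, MakePattern(phrase)).Success;
    }

    // Call this from your program's entry point to see the results.
    public static void Demo()
    {
        // Two spaces, a tab, and a line break separate some of the words.
        string text = "The quick  brown\tfox jumps over the lazy\r\ndog.";

        Console.WriteLine(PhraseFound("quick brown fox", text)); // True
        Console.WriteLine(PhraseFound("lazy dog", text));        // True
        Console.WriteLine(PhraseFound("lazy cat", text));        // False
    }
}
```

If all you need is the yes/no answer, Regex.IsMatch(text, pattern) gives the same result as checking Regex.Match(text, pattern).Success.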