As it happens, my friend was not satisfied with the solution to the phrase-searching problem I posted previously. The proposed approach was fine for the positive case (match if the phrase is found), but wasn’t usable in the negative case: he wanted “match if the phrase isn’t found”, and what I had provided was “don’t match if the phrase isn’t found”. So back to the drawing board to whip up a simple negation of the original solution regex.
Turns out, negation in regex isn’t so simple after all.
The Obvious Approach
In many languages the “!” character is used to represent negation: for example, “!=” for “not equals”. The obvious thing to try, then, is to put “!” in front of the regex we want to negate. Going back to our example, we have:
Pattern: !(cat)
Text: The quick brown dog jumps over the lazy fox.
What’s this? We get a “false” from attempting to match the pattern. What happened? The “!” character turns out to be treated as a literal, and not as a negation. I warned you it wasn’t going to be that easy!
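Here is a minimal sketch of that experiment in Python (my choice of language for illustration; the example above doesn't depend on any particular one). The second test string is made up, just to show that the “!” really is being matched as an ordinary character:

```python
import re

text = "The quick brown dog jumps over the lazy fox."

# "!" has no special meaning in a regex: this pattern looks for a literal
# exclamation mark immediately followed by "cat".
negated = re.compile(r"!(cat)")

# There is no "!cat" in the text, so the search finds nothing.
no_match = negated.search(text) is None

# A (hypothetical) string that does contain a literal "!cat" is matched.
literal_match = negated.search("Look, a !cat") is not None
```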
The Hard Way
OK, the easy way didn’t work; what about a harder way? Let’s give it a whirl. First, we need to remember that the positive match didn’t care where in the target string the match occurred. That was easy to code for – just allow an arbitrary sequence of characters before the pattern we were looking for, by putting the wildcard in front like so:
Pattern: .*dog
Text: The quick brown dog jumps over the lazy fox.
Match: The quick brown dog jumps over the lazy fox.
The wildcard matches the leading characters “The quick brown ”, then the literal “dog” matches, and we have success.
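As a quick sketch in Python, assuming a capture group is wrapped around the wildcard (my addition, only so we can see what it consumed):

```python
import re

text = "The quick brown dog jumps over the lazy fox."

# The parentheses around ".*" are added here only to expose what the
# wildcard matched; the pattern is otherwise the same as ".*dog".
m = re.match(r"(.*)dog", text)

prefix = m.group(1)   # what the wildcard consumed
matched = m.group(0)  # the whole match: wildcard part plus the literal "dog"
```

Here `prefix` is the text up to and including the space before “dog”, and `matched` ends with the literal “dog”.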
When we need to determine that the regex isn’t matched anywhere in the target string, it gets a lot more complicated:
Alternative 1: ^[^c]*$
This part matches target strings that don’t have any “c” characters at all, and so can’t have a “cat” in them. Now we need to allow for strings that have a “c” but not followed by “a”:
Alternative 2: ^([^c]*c[^a])*$
But what if there is a “c” as the last character? We need to allow an optional sequence at the end:
Alternative 2: ^([^c]*c[^a])*([^c]*c)?$
What if the last two characters are “cc”? That’s OK, because the first part will match (we have a “c” followed by something other than “a”). But we still have to deal with strings that have “ca” but not “cat”, and the complexity is getting out of hand.
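To get a feel for how fragile this is, here is a rough Python sketch that joins the alternatives exactly as written above with “|” (that is my assumption about how they are meant to be combined) and tries a few sample words of my own choosing:

```python
import re

# Alternative 1 and the extended Alternative 2, joined with "|".
hand_built = re.compile(r"^[^c]*$|^([^c]*c[^a])*([^c]*c)?$")

def accepted(word: str) -> bool:
    """True if the hand-built pattern says the word has no "cat"."""
    return hand_built.search(word) is not None

samples = ["dog", "music", "cat", "car", "cop"]
results = {word: accepted(word) for word in samples}
```

With these samples, “dog” and “music” are accepted and “cat” is rejected, as intended. But “car” is rejected too (the “ca”-without-“cat” case is still unhandled), and so is “cop”, because nothing in the pattern yet allows ordinary characters after the last “c” pair. Every gap like this needs yet another piece bolted on.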
The Better Way
Perl introduced a very useful feature to regex: lookahead. For our purposes, it gets even better – there is syntax for a negative lookahead, which matches only if the lookahead fails to match! Using this syntax, we have:
Pattern: ^(?!(?s:.*)cat)
The “^” matches the start of the text, and the “(?!” negative lookahead succeeds only if the rest of the pattern in parentheses fails to match. The “(?s:.*)” lets “.” match newlines as well, then matches any number of characters, which allows the search to run across line breaks. So the lookahead will fail if there is any number of characters followed by “cat”, which is just the situation we were trying to detect.
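Here is a small sketch of the pattern in action, using Python (which accepts the scoped “(?s:...)” syntax from version 3.6 on). The helper name `lacks_cat` and the sample strings are mine, purely for illustration:

```python
import re

# Succeeds (with an empty match at position 0) only if "cat" appears
# nowhere in the text, even after a line break.
NO_CAT = re.compile(r"^(?!(?s:.*)cat)")

def lacks_cat(text: str) -> bool:
    """True if the text does not contain "cat" anywhere."""
    return NO_CAT.search(text) is not None

single_line = lacks_cat("The quick brown dog jumps over the lazy fox.")
with_cat = lacks_cat("The quick brown cat jumps over the lazy dog.")
multi_line = lacks_cat("The quick brown dog\njumps over the lazy cat.")

# For comparison: without the (?s:...) group, "." stops at the newline,
# so a "cat" on the second line goes unnoticed.
naive = re.search(r"^(?!.*cat)",
                  "The quick brown dog\njumps over the lazy cat.") is not None
```

The first call reports that there is no “cat”, while the second and third correctly detect one. The `naive` variant wrongly reports success on the two-line text, which is why the “(?s:.*)” is worth the extra typing.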