PHP regexp replace word(s) in html string if not inside tags

The problem was to find and replace text inside HTML without breaking the HTML. Take this example string:

<img title="My image" alt="My image" src="/gfx/this is my image.gif"><p>This is my string</p>

Suppose you want to replace the string “my” with another string, or wrap it in another tag (let’s assume <strong></strong>), but only where “my” occurs outside the HTML tags. After the transformation it would look like:

<img title="My image" alt="My image" src="/gfx/this is my image.gif"><p>This is <strong>my</strong> string</p>

With PHP’s regular expression functions, the typical solution, a find and replace with word boundaries, fails here:

preg_replace('/\b(my)\b/i',
             '<strong>$1</strong>',
             $html_string);

You end up with broken HTML:

<img title="<strong>My</strong> image" alt="<strong>My</strong> image" src="/gfx/this is <strong>my</strong> image.gif"><p>This is <strong>my</strong> string</p>

Now imagine the wonderful mess you would get when replacing words like “form” or “alt”, which can appear as a text node, an HTML tag or an attribute.

So how to fix this? I figured that the only thing common to all tags is the opening and closing characters, < and >. From there, you search for the word you want to replace together with everything up to the next closing character (the > sign). Within that match you then look for an opening character (<). If you find one, a new tag starts after the word, so the word sits in a text node and can be replaced. If you don’t find one, the word is inside a tag, so you abort the replacement. Here is the code:

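// Callback: $matches[1] is the word, $matches[2] is everything after it
// up to and including the next '>'.
function checkOpenTag($matches) {
    if (strpos($matches[0], '<') === false) {
        // No '<' before the next '>': the word is inside a tag, leave it alone.
        return $matches[0];
    } else {
        // A tag opens after the word, so the word is in a text node.
        return '<strong>'.$matches[1].'</strong>'.$matches[2];
    }
}

// preg_replace_callback() returns the new string; it does not modify its argument.
$html_string = preg_replace_callback('/(\bmy\b)(.*?>)/i',
                                     'checkOpenTag',
                                     $html_string);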
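
To see the idea described above at work on the sample string, here is a minimal, self-contained sketch. The function name highlightFirstOutsideTags is hypothetical, the word is passed as a parameter and escaped with preg_quote(), and the s modifier is my own addition so that . also matches newlines; none of these are part of the original snippet.

```php
<?php
// Hypothetical helper (not from the original post): wraps the callback idea
// in a function that takes the word as a parameter.
function highlightFirstOutsideTags(string $html, string $word): string
{
    // Assumption: the "s" modifier is added so "." also matches newlines.
    $pattern = '/(\b' . preg_quote($word, '/') . '\b)(.*?>)/is';

    return preg_replace_callback($pattern, function (array $m): string {
        // No '<' between the word and the next '>': the word is inside a tag.
        if (strpos($m[0], '<') === false) {
            return $m[0];
        }
        // Otherwise the word is in a text node: wrap it.
        return '<strong>' . $m[1] . '</strong>' . $m[2];
    }, $html);
}

$sample = '<img title="My image" alt="My image" src="/gfx/this is my image.gif">'
        . '<p>This is my string</p>';

echo highlightFirstOutsideTags($sample, 'my'), "\n";
// prints: <img title="My image" alt="My image" src="/gfx/this is my image.gif"><p>This is <strong>my</strong> string</p>
```

Like the snippet it is based on, this sketch replaces only the first occurrence of the word before each closing > character.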

If you are going to use this kind of code to search for several words in an HTML text (e.g. a glossary implementation), test for performance and think about a caching system.
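
As a rough sketch of what a glossary variant could look like: all terms are joined into a single alternation so the HTML is scanned once instead of once per word, and results are kept in a static array. The function name highlightGlossaryTerms, the recursive closure and the in-memory cache are all assumptions for illustration; a real site would more likely use a persistent cache, and you should still measure performance on your own content.

```php
<?php
// Hypothetical glossary helper (names and caching strategy are assumptions).
function highlightGlossaryTerms(string $html, array $terms): string
{
    static $cache = [];   // in-memory cache, lives only for the current request

    if (!$terms) {
        return $html;     // nothing to search for
    }

    $key = md5($html . '|' . implode('|', $terms));
    if (isset($cache[$key])) {
        return $cache[$key];
    }

    // Escape every term and join them into one alternation: (form|alt|...)
    $escaped = array_map(function ($t) { return preg_quote($t, '/'); }, $terms);
    $pattern = '/\b(' . implode('|', $escaped) . ')\b(.*?>)/is';

    // Recursive closure: after replacing a term, process the rest of the
    // segment again so later terms before the same '>' are handled too.
    $replace = function (string $segment) use (&$replace, $pattern): string {
        return preg_replace_callback($pattern, function (array $m) use (&$replace): string {
            if (strpos($m[0], '<') === false) {
                return $m[0];   // term is inside a tag: leave it alone
            }
            return '<strong>' . $m[1] . '</strong>' . $replace($m[2]);
        }, $segment);
    };

    return $cache[$key] = $replace($html);
}

echo highlightGlossaryTerms('<p title="form">my form is my alt</p>', ['form', 'alt']), "\n";
// prints: <p title="form">my <strong>form</strong> is my <strong>alt</strong></p>
```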

That’s it. Remember: although this solution worked fine for me, it can work terribly for you, so proceed at your own risk (aka liability disclaimer).

UPDATE 19-04-14
A comment on this post warns that only the first occurrence is replaced in each HTML segment. Here is an updated version of the PHP example with this issue corrected:

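<?php

class replaceIfNotInsideTags {

  // Word to search for, already escaped for use inside the pattern.
  private $word;

  private function checkOpenTag($matches) {
    if (strpos($matches[0], '<') === false) {
      // No '<' before the next '>': the word is inside a tag, leave it alone.
      return $matches[0];
    } else {
      // Replace this occurrence, then process the rest of the segment again
      // so that later occurrences before the same '>' are replaced too.
      return '<strong>'.$matches[1].'</strong>'.$this->doReplace($matches[2]);
    }
  }

  private function doReplace($html) {
    return preg_replace_callback('/(\b'.$this->word.'\b)(.*?>)/i',
                                 array($this, 'checkOpenTag'),
                                 $html);
  }

  public function replace($html, $word) {
    // preg_quote() keeps characters such as '.' or '/' in $word literal.
    $this->word = preg_quote($word, '/');

    return $this->doReplace($html);
  }
}

$html = '<p>my bird is my life is my dream</p>';

$obj = new replaceIfNotInsideTags();
echo $obj->replace($html, 'my');

?>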
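
The pattern above only matches a word when a > follows it somewhere later in the string, so text with no tag after it is never touched. One possible way around that, which is a different technique from the callback approach in this post, is to split the string into tags and text with preg_split() and replace only in the text parts. The sketch below assumes simple markup (no > inside attribute values, no script or style blocks), and the function name highlightInTextNodes is hypothetical.

```php
<?php
// Hypothetical alternative (not the method from this post): split the HTML
// on tags and run a plain word-boundary replace on the text parts only.
function highlightInTextNodes(string $html, string $word): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates text, tag, text, ...
    // so even indexes hold text and odd indexes hold the captured tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $pattern = '/\b(' . preg_quote($word, '/') . ')\b/i';

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Text between tags: every occurrence is replaced here.
            $parts[$i] = preg_replace($pattern, '<strong>$1</strong>', $part);
        }
    }

    return implode('', $parts);
}

echo highlightInTextNodes('The brown dog jumped over the yellow dog.', 'dog'), "\n";
// prints: The brown <strong>dog</strong> jumped over the yellow <strong>dog</strong>.
```

For anything beyond simple markup, a real HTML parser is likely a safer choice than either regex approach.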

10 thoughts on “PHP regexp replace word(s) in html string if not inside tags”

  1. WOW! It is exactly what I was looking for. How can I make it also match the text “sh.pk” with the text “shpk” and make it strong? Thanks in advance.

  2. This is a simple approach that will work in some circumstances, but it will fall short in the following two ways otherwise.

    (1) If there is more than one occurrence of the target word before the next html tag, only the first occurrence of the word will be replaced. The other occurrences will not be changed.

    For example, given

    The brown dog jumped over the yellow dog.

    where “dog” is the target to be replaced with “rat”, the result will be

    The brown rat jumped over the yellow dog.

    (2) If there is an occurrence of the target word that is not followed by any html tag, the match will fail.

    For example, given

    The brown dog jumped overThe yellow dog

    the first occurrence of dog will match but the second will not.

    1. (1) That is a good point, and I will shortly update the PHP code to fix this.
      (2) I really didn’t catch your point here.

      1. HI Marco,
        I’m trying to use the code on a health and fitness website: http://sophisticatedbooty.com to highlight a search word when you do a search. When I search “sprouts” without the quotes for example, the first sentence of the first article that comes up right below the title of the article, the word doesn’t get highlighted, but the other occurrences lower down do. (The way I have it implemented the occurrence in the title won’t be highlighted, that’s not the issue.)
        Thanks for getting back to me!
