Oct 10

PHP: Parse HTML returning links

Tag: PHP | Grant Perry @ 12:49 am

My goal was more complex than what’s described here, but I wanted to share a simple function for returning the links in some HTML (now that I know what I’m doing)… Hopefully someone finds this useful; I noticed it was a common question in forums.

Regular expressions are a powerful tool for working with strings. PHP provides support for a couple of different flavours, but I’m using preg (a.k.a. the Perl-compatible one).

The regular expression I put together for this was:

/<a\s[^>]*href="(?P<href>[^"]*)"[^>]*>(?P<name>.*?)<\/a>/si

What this means is:

  • / - Perl-compatible regular expression patterns are enclosed in delimiters, here forward slashes (this is the opening one)
  • <a - is matched literally (the start of the HTML a tag)
  • \s - is a single whitespace character (includes line breaks etc.)
  • [^>]* - matched by any characters except >, zero or more times (allows for any other attributes before href)
    • [ ] - a character class
    • ^ - except the following
    • > - is matched literally
    • * - the character class can occur zero or more times
  • href=" - is matched literally
  • (?P<href>[^"]*) - match and return as ‘href’ any characters except ", zero or more times (gets everything inside the href attribute)
    • ( ) - match and return
    • ?P<href> - the name we’ll return it as; ‘href’ could be anything you like!
    • [^"]* - matched by any characters except ", zero or more times
  • " - is matched literally (the closing quote of the href attribute)
  • [^>]* - any characters except >, zero or more times (allows for any other attributes after href)
  • > - is matched literally (the close of the opening a tag)
  • (?P<name>.*?) - match and return as ‘name’ any characters, as few as possible (gets everything inside the a tag)
    • ( ) - match and return
    • ?P<name> - the name we’ll return it as, ‘name’.
    • .*? - matched by any character, zero or more times; the trailing ? makes it lazy, so it stops at the first </a> rather than the last one in the document
  • <\/a> - is matched literally (we escape the forward slash because we don’t want to end the pattern here)
  • / - now we want to end our pattern!
  • si - the trailing s and i are modifiers that change the way the expression is interpreted
    • s - means the . we’ve used can also match line breaks (normally it doesn’t)
    • i - means the entire thing is case insensitive!
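
To see the named groups in action, here is a small sketch (the sample HTML and variable names are made up for illustration). It runs a pattern with a lazy .*? next to the same pattern with a greedy .* so you can compare how many links each one finds:

```php
<?php
// Sample input invented for illustration.
$html = '<p><a href="/one">First</a> and <a class="x" href="/two" id="b">Second</a></p>';

// Lazy .*? stops at the first </a>, so each link is matched separately.
$lazy = '/<a\s[^>]*href="(?P<href>[^"]*)"[^>]*>(?P<name>.*?)<\/a>/si';
// Same pattern with a greedy .* for comparison.
$greedy = '/<a\s[^>]*href="(?P<href>[^"]*)"[^>]*>(?P<name>.*)<\/a>/si';

$lazyCount = preg_match_all($lazy, $html, $lazyMatches);
$greedyCount = preg_match_all($greedy, $html, $greedyMatches);

// Named groups appear as string keys in the result array.
print_r($lazyMatches['href']); // the two href values
print_r($lazyMatches['name']); // the two link texts
echo "lazy: $lazyCount, greedy: $greedyCount\n";
```

The greedy version finds only one match, because its .* runs from the first link all the way to the last </a> in the string.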

A PHP function using this might look like so:

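// Intended as a method inside a class (hence "private").
private function getLinks($responseBody){
    // [^>]* after the href value lets the tag close straight away,
    // and the lazy .*? stops at the first </a> so every link is matched separately.
    $_regexp = '/<a\s[^>]*href="(?P<href>[^"]*)"[^>]*>(?P<name>.*?)<\/a>/si';
    preg_match_all($_regexp, $responseBody, $matches);

    // Start with an empty array so we return array() when nothing matches.
    $links = array();
    foreach ($matches['name'] as $i => $name) {
        $links[$i]['name'] = trim($name);
        $links[$i]['href'] = $matches['href'][$i];
    }

    return $links;
}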
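
Since the function is private it needs to live in a class. Here is a minimal sketch of how it might be called. The LinkExtractor class, its extract() method and the sample HTML are all hypothetical names made up for this example, and the pattern used is one with a lazy .*? that doesn’t require whitespace after the href value:

```php
<?php
// LinkExtractor is a hypothetical wrapper class, invented for this example.
class LinkExtractor
{
    // Public entry point, since getLinks() is private.
    public function extract($html)
    {
        return $this->getLinks($html);
    }

    private function getLinks($responseBody)
    {
        $_regexp = '/<a\s[^>]*href="(?P<href>[^"]*)"[^>]*>(?P<name>.*?)<\/a>/si';
        preg_match_all($_regexp, $responseBody, $matches);

        $links = array();
        foreach ($matches['name'] as $i => $name) {
            $links[] = array('name' => trim($name), 'href' => $matches['href'][$i]);
        }
        return $links;
    }
}

// $responseBody is just a string of HTML. It is hard-coded here, but it could
// equally come from an HTTP response or from file_get_contents() on a saved page.
$responseBody = '<ul>
  <li><a href="/about" title="About us"> About </a></li>
  <li><a href="http://example.com/">Example</a></li>
</ul>';

$extractor = new LinkExtractor();
$links = $extractor->extract($responseBody);
print_r($links); // one array per link, each with a 'name' and an 'href' key
```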

Issues with this regular expression that I know I haven’t addressed are:

  • Your link may not be text; it could be an image or anything!
  • Not everyone uses double quotes for their attributes.
  • Browsers support sloppy HTML; this expression doesn’t! E.g. <a href = /link/>
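
If those cases matter to you, one option is to skip the regular expression and let an HTML parser do the work. The sketch below is a different technique from the one in this post: it uses PHP’s DOM extension (DOMDocument), and the function name getLinksWithDom and the sample HTML are made up for illustration. Treat it as a starting point rather than a finished solution:

```php
<?php
// Hypothetical helper, not part of the original function: it uses PHP's DOM
// extension instead of a regular expression.
function getLinksWithDom($html)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (!$a->hasAttribute('href')) {
            continue; // skip anchors that have no href at all
        }
        $links[] = array(
            'name' => trim($a->textContent), // empty string for image-only links
            'href' => $a->getAttribute('href'),
        );
    }
    return $links;
}

// Single quotes, an unquoted "sloppy" href, and an image-only link.
$html = "<p><a href='/single'>Single</a> <a href = /link/>Sloppy</a> "
      . '<a href="/img"><img src="pic.png" alt=""></a></p>';
$domLinks = getLinksWithDom($html);
print_r($domLinks);
```

The trade-off is that you get the link text rather than the raw inner HTML, so an image-only link comes back with an empty name unless you inspect the child nodes yourself.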

I’d be pleased to hear any corrections or feedback!


3 Responses to “PHP: Parse HTML returning links”

  1. lenin says:

    Hi, what is $responseBody?
    Please, I need an example.
    Thanks

  2. Grant Perry says:

    $responseBody is a string containing the HTML you want parsed. In hindsight it may not be the best name: I was obtaining my HTML using an XHR, hence we are parsing the response body as opposed to the response header. But you could use it with a file loaded from the filesystem as well.

  3. Tim Porter says:

    Thanks!
