Oct 10 2007
PHP: Parse HTML returning links
My goal was more complex than what’s described here, but I wanted to share a simple function for returning the links in some HTML (now that I know what I’m doing)… Hopefully someone finds this useful; I noticed it was a common question in forums.
Regular expressions are a powerful tool for working with strings. PHP provides support for a couple of different types, but I’m using preg (aka the Perl-compatible one).
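For anyone who hasn't used the preg functions before, here is a minimal sketch of preg_match() with named groups. The sample string and the group names `year` and `month` are made up for illustration and have nothing to do with the link pattern itself.

```php
<?php
// Hypothetical sample string, for illustration only.
$subject = 'Posted on 2007-10-10';

// preg_match() returns 1 if the pattern matched, 0 if it did not, false on error.
// Each (?P<label>...) group is returned under its label as well as by position.
$found = preg_match('/(?P<year>\d{4})-(?P<month>\d{2})/', $subject, $m);

if ($found === 1) {
    echo $m['year'], "\n";  // the capture, by name
    echo $m[1], "\n";       // the same capture, by position
}
```

Named groups are optional, but they make the result array easier to read once a pattern has more than one capture.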
The regular expression I put together for this was:
/<a\s[^>]*href="(?P<href>[^"]*)"\s[^>]*>(?P<name>.*)<\/a>/si
What this means is:
- / - Perl regular expression patterns are enclosed in forward slashes (this is the opening one)
- <a - is satisfied literally (the opening of the HTML a tag)
- \s - is a single whitespace character (includes line breaks etc.)
- [^>]* - satisfied by any character except >, zero or more times (allows for anything else inside the HTML a tag)
  - [ ] - a character class
  - ^ - except the following
  - > - is satisfied literally
  - * - the character class can occur zero or more times
- href=" - is satisfied literally
- (?P<href>[^"]*) - match and return as ‘href’: any character except ", zero or more times (gets everything inside the href attribute)
  - ( ) - match and return
  - ?P<href> - nominates the name we’ll return it as; ‘href’ could be anything you like!
  - [^"]* - satisfied by any character except ", zero or more times
- " - is satisfied literally (the closing quote of the href attribute)
- \s[^>]* - a whitespace character, then any characters except >, as before (allows for further attributes after href)
- > - is satisfied literally (the close of the opening HTML a tag)
- (?P<name>.*) - match and return as ‘name’: any character, zero or more times (gets everything inside the a tag)
  - ( ) - match and return
  - ?P<name> - nominates the name we’ll return it as, ‘name’.
  - .* - satisfied by any character, zero or more times
- <\/a> - is satisfied literally (the forward slash is escaped because we don’t want to end the pattern here)
- / - now we want to end our pattern!
- si - the trailing s and i are modifiers that change the way the expression is interpreted
  - s - means the . we’ve used can also match line breaks (normally it doesn’t)
  - i - means the entire thing is case insensitive!
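To see what the two modifiers actually do, here is a small sketch that runs the same pattern body with both modifiers, then with only one of each. The sample markup and the variable names are my own assumptions, chosen so the tag is upper case and the link text spans several lines.

```php
<?php
// Made-up sample: upper-case tag, link text spread over several lines.
$html = "<A href=\"/about/\" title=\"About\">\nAbout us\n</A>";

// The pattern body from the post, without delimiters or modifiers.
$body = '<a\s[^>]*href="(?P<href>[^"]*)"\s[^>]*>(?P<name>.*)<\/a>';

// Both modifiers: matches, because i accepts <A ... </A> and s lets . cross line breaks.
$withBoth = preg_match('/' . $body . '/si', $html, $m);

// Only i: no match, because . stops at the line break before "About us".
$onlyI = preg_match('/' . $body . '/i', $html);

// Only s: no match, because <a and <\/a> must now be lower case.
$onlyS = preg_match('/' . $body . '/s', $html);

var_dump($withBoth, $onlyI, $onlyS);
echo $m['href'], ' => ', trim($m['name']), "\n";
```

Note that \s and [^>] match line breaks with or without the s modifier; it only changes the behaviour of the dot.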
A PHP function using this might look like so:
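// A method of a class (hence "private"); drop the modifier to use it as a plain function.
private function getLinks($responseBody)
{
    $_regexp = '/<a\s[^>]*href="(?P<href>[^"]*)"\s[^>]*>(?P<name>.*)<\/a>/si';
    preg_match_all($_regexp, $responseBody, $matches);

    // Start with an empty array so that HTML with no links returns array()
    // instead of an undefined variable.
    $links = array();

    // $matches['name'] and $matches['href'] are parallel arrays, one entry per match.
    foreach ($matches['name'] as $i => $name) {
        $links[$i]['name'] = trim($name);
        $links[$i]['href'] = $matches['href'][$i];
    }

    return $links;
}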
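As a possible alternative, preg_match_all() accepts a PREG_SET_ORDER flag that hands back each match as its own array, so the name and href arrive together. The sketch below uses the same pattern; the function name `getLinksBySet` and the sample markup are hypothetical, and it is written as a plain function so it can run on its own.

```php
<?php
// Hypothetical helper name; same pattern as in the post, but with
// PREG_SET_ORDER so each match arrives as a single array.
function getLinksBySet($html)
{
    $pattern = '/<a\s[^>]*href="(?P<href>[^"]*)"\s[^>]*>(?P<name>.*)<\/a>/si';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = array();
    foreach ($matches as $match) {
        // Each $match holds the named groups for one link.
        $links[] = array(
            'name' => trim($match['name']),
            'href' => $match['href'],
        );
    }

    return $links;
}

// Made-up sample markup containing a single link.
$html = '<p>See <a href="/docs/" title="Docs">the docs</a> for more.</p>';
$links = getLinksBySet($html);
print_r($links);
```

Which flag to use is mostly a matter of taste: the default grouping is handy when you only want all the hrefs, while PREG_SET_ORDER reads more naturally when you want one record per link.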
Issues with this regular expression that I know I haven’t addressed are:
- Your link may not be text; it could be an image or anything!
- Not everyone uses double quotes for their attributes.
- Browsers support sloppy HTML; this expression doesn’t! E.g. <a href = /link/>
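Here is a rough sketch of one way the pattern could be loosened to cover some of these cases. It is a variation of my own choosing, not a finished solution. It also works around two other behaviours of the original pattern: the \s after the closing quote means a tag with nothing after the href attribute is not matched, and the greedy .* runs on to the last </a> in the document when there are several links. The sample markup and variable names are assumptions.

```php
<?php
// The pattern from the post, for comparison.
$original = '/<a\s[^>]*href="(?P<href>[^"]*)"\s[^>]*>(?P<name>.*)<\/a>/si';

// A looser variation (an assumption, not a definitive fix):
//  - (["']) accepts either quote style and \1 requires the same quote to close
//  - \s* around = tolerates "href = ..."
//  - no whitespace is required after the attribute value
//  - .*? is lazy, so each match stops at the first </a>
$loose = '/<a\s[^>]*?href\s*=\s*(["\'])(?P<href>.*?)\1[^>]*>(?P<name>.*?)<\/a>/si';

// Made-up sample: one plain link, one single-quoted link wrapping an image.
$html = <<<'HTML'
<a href="/one">One</a>
<A class='x' href = '/two'><img src="two.png"></A>
HTML;

$originalCount = preg_match_all($original, $html, $ignored);
$looseCount    = preg_match_all($loose, $html, $matches, PREG_SET_ORDER);

echo "original: $originalCount, loose: $looseCount\n";
foreach ($matches as $match) {
    echo $match['href'], ' => ', trim($match['name']), "\n";
}
```

With this sample the original pattern finds nothing and the looser one finds both links, returning the img tag as the name of the second. Unquoted values such as <a href = /link/> are still not handled. For HTML that sloppy, PHP's DOM extension (DOMDocument::loadHTML) may be a better fit than any regular expression.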
Any corrections or feedback are welcome; I’d be pleased to hear from you!
