Regular Expressions in Ruby and PHP

14 June 2007 PHP, Ruby, Programming

Here is a short comparison between Ruby and PHP. The task was to print the location of all HTML links in a webpage using regular expressions.

If anyone knows of a cleaner or more efficient way to do this in Ruby or PHP, please post it in the comments!

Ruby

require 'net/http'

#connect and get the webpage
host = Net::HTTP.new('www.site.com.au', 80)
body = host.get('/index.php', nil ).body

puts "Links found..."

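#find link URIs (the "/" in the closing tag must be escaped inside /.../)
links = body.scan(/<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>/)

#print all link URIs (the second capture group holds the href value)
links.each {|attrs,uri| puts uri}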

PHP

<?php

$page = file_get_contents('http://www.site.com.au/index.php');

// find links

preg_match_all('/<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>/', $page, $links);


// links found

foreach($links[2] as $link)
{
   print "$link\n";
}

?>

Gavin

Very cool article Rich! I'm really liking these posts about comparing ruby/php - great help for learning ruby... ;)

Aaron Saray

I had to write a page scraping application at work to use to preview development CMS pages (long story) - so we had to be very extensive with our regular expressions. I noticed that you might get some false positives and miss some other links with your current regular expression. Since you've demonstrated some great refactoring in your past articles, let me do so with this comment.

Start:
<a(.*?)href="(.*?)"(.*?)>(.*?)</a>

Problem: the user makes up their own tag, such as <amanda href...
Unfortunately, 'manda ' will match the first group. We know there are normally one or more spaces between the a and the href, but there could also be line breaks.
Additionally, with the above pattern, <ahref will also get matched.
So let's use the whitespace class and the one-or-more operator...

Next:
<a\s+href="(.*?)"(.*?)>(.*?)</a>
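
To make the difference concrete, here is a small Ruby sketch that runs both patterns against a few fragments. The whitespace-plus-one-or-more step is written as \s+. The sample fragments and the names LOOSE, STRICT, SAMPLES and RESULTS are made up for this illustration and are not from the article.

```ruby
# Original pattern: anything (or nothing) may sit between "<a" and "href".
LOOSE  = /<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>/

# Tightened pattern: at least one whitespace character must follow "<a".
STRICT = /<a\s+href="(.*?)"(.*?)>(.*?)<\/a>/

# Hypothetical fragments, invented for this comparison.
SAMPLES = {
  'plain link'  => '<a href="real.html">a link</a>',
  'made-up tag' => '<amanda href="bogus.html">not a link</a>',
  'no space'    => '<ahref="glued.html">not a link</a>',
  'line break'  => "<a\nhref=\"wrapped.html\">a link</a>"
}

# Record whether each pattern matches each fragment.
RESULTS = SAMPLES.transform_values do |html|
  { loose: LOOSE.match?(html), strict: STRICT.match?(html) }
end

RESULTS.each do |name, r|
  puts "#{name}: loose=#{r[:loose]} strict=#{r[:strict]}"
end
```

The loose pattern accepts the made-up tag and the glued <ahref, and it misses the link broken across lines because . does not match a newline by default. The strict pattern rejects the first two and accepts the line break, since \s also matches newlines.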

OK, so here's your challenge ;) What if the user uses single quotes, or doesn't use any quotes at all? (The link will still work: it's not valid HTML, but it is still a usable link.) Here's a hint: you're going to need grouping in regular expressions.

-aaron
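
One possible answer to the challenge above, sketched in Ruby. This is not Aaron's solution, only an illustration of the grouping hint: a non-capturing group holds three alternatives, one per quoting style. The names HREF, link_uris and SAMPLE, and the sample markup, are assumptions made for this example. The pattern also allows other attributes before href, which goes slightly beyond the patterns discussed above.

```ruby
# Three alternatives inside one group: double-quoted, single-quoted, unquoted.
# Only one of the three capture groups is filled for any given match.
HREF = /<a\s+(?:[^>]*?\s+)?href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/i

# Returns the href value of every <a> tag found in the given HTML string.
def link_uris(html)
  html.scan(HREF).map { |double, single, bare| double || single || bare }
end

# Hypothetical markup covering each quoting style plus a made-up tag.
SAMPLE = <<~HTML
  <a href="double.html">one</a>
  <a href='single.html'>two</a>
  <a href=bare.html>three</a>
  <a class="nav" href="later.html">four</a>
  <amanda href="bogus.html">not a link</amanda>
HTML

puts link_uris(SAMPLE)
```

Because the tag name must be followed by whitespace, <amanda is skipped, and the unquoted alternative stops at the first space or >. Regular expressions remain a rough tool for HTML, so a sketch like this will still miss unusual markup.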
