Data mining and content aggregation with PHP

This is an interesting article I came across. It is mirrored and the link is provided at the end of the page.

The functions described in this article will allow you to parse unstructured data from HTML pages using PHP and Perl-compatible regular expressions.

The unstructured data can then be stored in a structured fashion in a database of your choosing.

Data Mining Basics Tutorial:

Luckily, there are alternatives. Using PHP and MySQL you can effectively accomplish the same content aggregation tasks with little cost, but you have to learn the basics first. Most data mining projects follow these basic steps.

Basic Data Mining Steps
  1. Fetch the HTML page(s) of interest using the Snoopy PHP class
  2. Split the page HTML into a more manageable portion
  3. Remove unwanted HTML tags and attributes
  4. Reformat HTML, adjust spacing and remove entities
  5. Match content with regular expressions
  6. Store content into a MySQL database for future use

Step 1. - Fetching the Data / HTML Content

Developers have done this work already. The Snoopy PHP class, which can be downloaded at http://sourceforge.net/projects/snoopy/, has all the necessary tools to download an HTML page from the internet. Be considerate when you retrieve content from any site: contact the Web site admin before fetching with any automated script, don't bombard the site with a flood of HTTP requests, take notice of any copyrights, and set up a contract for content sharing if needed. Leeching content from a Web site without permission can lead to serious legal issues.
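As a rough sketch of the fetching step, the code below pauses between requests and delegates the actual download to a callable. The names fetchPolitely() and snoopyFetch(), the user-agent string and the two-second delay are assumptions made for illustration, not part of the original article. The Snoopy calls used (fetch(), the agent and results properties) are written from memory of the Snoopy class and should be checked against the version you download.

```php
<?php
/**
 * Fetch each URL in turn, pausing between requests so the remote
 * server is not flooded. $fetch is any callable that takes a URL
 * and returns the page HTML, or false on failure.
 */
function fetchPolitely(array $urls, $fetch, $delaySeconds = 2)
{
    $pages = array();
    $first = true;
    foreach ($urls as $url) {
        // Wait before every request except the first one
        if (!$first && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
        $first = false;

        $html = call_user_func($fetch, $url);
        if ($html !== false) {
            $pages[$url] = $html;
        }
    }
    return $pages;
}

/**
 * Snoopy-backed fetcher. Assumes Snoopy.class.php from the Snoopy
 * distribution is on the include path.
 */
function snoopyFetch($url)
{
    require_once 'Snoopy.class.php';

    $snoopy = new Snoopy();
    // Hypothetical agent string; identify your script honestly
    $snoopy->agent = 'ExampleAggregator/0.1 (admin@example.com)';

    if ($snoopy->fetch($url)) {
        return $snoopy->results;
    }
    return false;
}
```

A call such as fetchPolitely(array('http://www.example.com/listings.html'), 'snoopyFetch') would then return an array of HTML strings keyed by URL. Passing the fetcher in as a callable keeps the throttling logic independent of Snoopy, so it can be tried out with a stub.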

Steps 2, 3 & 4 - Parsing the Data

Once you have the HTML page(s), you will need to parse the data into a friendlier format. This is done so that pattern matching will be more reliable and consistent in the future. The functions shown below will greatly help in this matter.

PHP Data Parsing Functions:

 

First, use the splitPageHTML() function to split the page. You can do this by looking for comment tags in the HTML that bracket the content you want, or by simply splitting out the <body> section.

Next, use the removeTags() function to remove most of the unwanted HTML tags. You still want to keep the table tags and other structural tags, as they will usually help you identify the content.

Finally, use the reformatHTML() function to reformat the HTML and adjust spacing. This cleans up the remaining data and prepares it for pattern matching.
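The listings for these three functions are not reproduced on this page, so what follows is only a minimal sketch of what they might look like. The function names come from the article; the bodies, parameters and the list of tags to keep are assumptions, not the original author's code.

```php
<?php
/**
 * Step 2: keep only the part of the page between two markers.
 * Defaults to the <body> section; comment tags work the same way.
 */
function splitPageHTML($html, $startMarker = '<body', $endMarker = '</body>')
{
    $start = stripos($html, $startMarker);
    if ($start === false) {
        return $html; // marker not found, leave the page untouched
    }
    $end = stripos($html, $endMarker, $start);
    if ($end === false) {
        return substr($html, $start);
    }
    return substr($html, $start, $end + strlen($endMarker) - $start);
}

/**
 * Step 3: drop scripts and styles entirely, then strip every tag
 * except the structural ones listed in $allowed.
 */
function removeTags($html, $allowed = '<table><tr><td><th><div><p>')
{
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    return strip_tags($html, $allowed);
}

/**
 * Step 4: decode entities and collapse runs of whitespace.
 */
function reformatHTML($html)
{
    // Treat non-breaking spaces as ordinary spaces before decoding
    $html = str_replace('&nbsp;', ' ', $html);
    $html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');
    $html = preg_replace('/\s+/', ' ', $html);
    return trim($html);
}
```

Chained together as reformatHTML(removeTags(splitPageHTML($page))), these leave a single-line string in which the table, div and p tags (and their class attributes) survive, which is the shape the pattern matching in the next step expects.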

Step 5 - Matching Content with Perl Regular Expressions

This is where it gets more complicated. If you are not familiar with regular expressions, please take a moment to visit one of the following resources.

PHP Builder Tutorial on Regular Expressions
Google Search for Regular Expressions
Or purchase the very informative O'Reilly book Mastering Regular Expressions

Now that you've mastered regular expressions, it's time to look at your content up to this step. Use the PHP var_dump() function to output your resulting content to the screen. Take notice of the structural HTML tags (table, tr, td, div, p). Also, note any class attributes, since they will greatly help you target certain patterns. It's beyond the scope of this article to discuss all possibilities, but we can go over a simple example pattern.
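Reading a long var_dump() by eye can be tedious, so one option is to list the class attributes that are present before writing any patterns. The helper below is a hypothetical convenience, not something from the original article, and it only handles double-quoted class attributes.

```php
<?php
/**
 * Return the distinct values of double-quoted class attributes
 * found in $html, in order of first appearance.
 */
function listClassAttributes($html)
{
    preg_match_all('/\bclass\s*=\s*"([^"]*)"/i', $html, $matches);
    return array_values(array_unique($matches[1]));
}
```

Calling var_dump(listClassAttributes($document)) gives a short list of candidate hooks, such as "details", to build patterns around.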

Shown below is example PHP code that matches a phone number in a snippet of HTML. It's probably one of the simplest regular expression matches, and I'll go over it piece by piece. Note that $document holds the HTML document text and $phone_pattern is the pattern passed to PHP's Perl-compatible matching function, preg_match().

    // HTML snippet from the var_dump($document) call
    // <td class="details"> Phone: (222) 222-2222 </td>

    $phone_pattern = '/[\s\n\r\t]*Phone:[\s\n\r\t]*([ \(\)0-9\-]+)[\s\n\r\t]*/i';
    if (preg_match($phone_pattern, $document, $matches)) {
        // The capture class allows spaces, so trim any stray trailing one
        $info['phone'] = trim($matches[1]);
    }

Note that I've shown the class attribute, class="details", as it can sometimes help in the pattern matching scheme. Table structures and other HTML tags may also help. In this simple case, though, we are only trying to match a phone number. Below are the pieces of this regular expression.

[\s\n\r\t]* - white space, newline, carriage return, tab; 0 or more occurrences (\s on its own already covers the other three)
[ \(\)0-9\-]+ - spaces, parentheses, digits 0 through 9, dash; 1 or more occurrences
( ) - capturing group; the text matched by each group is stored in $matches[1], $matches[2] and so on
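The class attribute mentioned above can also be folded into the pattern itself. The sketch below is one possible variation, not the article's own code: the function name extractPhones() and the stricter phone format are assumptions chosen for illustration. It uses preg_match_all() to collect every phone number that sits in a details cell.

```php
<?php
/**
 * Collect every phone number found inside <td class="details"> cells.
 * The number format (optional parentheses around the area code) is an
 * assumption; adjust it to the pages you are actually parsing.
 */
function extractPhones($document)
{
    $pattern = '/<td\s+class="details"[^>]*>\s*Phone:\s*'
             . '(\(?\d{3}\)?[ \-]?\d{3}-\d{4})\s*<\/td>/i';
    preg_match_all($pattern, $document, $matches);
    return $matches[1];
}
```

Anchoring on the surrounding td tag means a "Phone:" label elsewhere on the page is ignored, at the cost of breaking if the site renames the class.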

This was just a simple example, but it shows a very important aspect of regular expressions. Because the pattern tolerates variations such as extra white space, it can keep matching your data even if the page changes slightly. In combination with the three functions mentioned above, splitPageHTML(), removeTags() and reformatHTML(), your scripts should continue to match data properly well into the life of the script.
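The last of the basic steps, storing the matched content, is not walked through in this article. A minimal sketch using PDO prepared statements might look like the following; the listings table, its columns and the function name storeListing() are all hypothetical.

```php
<?php
/**
 * Insert one matched record into a hypothetical "listings" table.
 *
 * Assumed MySQL schema:
 *   CREATE TABLE listings (
 *       id INT AUTO_INCREMENT PRIMARY KEY,
 *       phone VARCHAR(32),
 *       fetched_at DATETIME
 *   );
 */
function storeListing(PDO $pdo, array $info)
{
    // A prepared statement keeps scraped text from being run as SQL
    $stmt = $pdo->prepare(
        'INSERT INTO listings (phone, fetched_at) VALUES (:phone, :fetched_at)'
    );
    return $stmt->execute(array(
        ':phone'      => $info['phone'],
        ':fetched_at' => date('Y-m-d H:i:s'),
    ));
}
```

With a connection such as new PDO('mysql:host=localhost;dbname=mining', $user, $password), where the database name and credentials are placeholders, each $info array produced by the matching step can be passed straight to storeListing().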
    Last Updated ( Friday, 18 July 2008 )