This is an interesting article I came across. It is mirrored here, and a link to the original is provided at the end of the page. The functions described in this article let you parse unstructured data from HTML pages using PHP and Perl-compatible regular expressions. The extracted data can then be stored in a structured form in a database of your choosing.
Data Mining Basics Tutorial:
Luckily, there are alternatives. Using PHP and MySQL, you can
accomplish the same content aggregation tasks at little
cost, but you have to learn the basics first. Most data mining projects
follow these basic steps:
Basic Data Mining Steps
- Fetch the HTML page(s) of interest using the Snoopy PHP class
- Split the page HTML into a more manageable portion
- Remove unwanted HTML tag attributes
- Reformat HTML, adjust spacing and remove entities
- Match content with regular expressions
- Store content into a MySQL database for future use
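As a rough illustration of steps 2 through 5 above, here is a minimal sketch that works on a small inline sample instead of a fetched page. The sample HTML, the comment markers used to split the page, and the function names (extract_section, strip_attributes, normalize_text) are all hypothetical and are not the article's own functions. It assumes PHP 7 or later.

```php
<?php
// Hypothetical sample page standing in for fetched HTML (step 1 is not shown).
$html = <<<'HTML'
<html><body>
<div id="header">Site navigation</div>
<!-- begin listing -->
<table class="items" border="0">
<tr bgcolor="#fff"><td class="name">Blue&nbsp;Widget</td><td align="right">$9.99</td></tr>
<tr bgcolor="#eee"><td class="name">Red   Widget</td><td align="right">$12.50</td></tr>
</table>
<!-- end listing -->
<div id="footer">Copyright notice</div>
</body></html>
HTML;

// Step 2: keep only the text between two known markers.
function extract_section(string $html, string $start, string $end): string
{
    $from = strpos($html, $start);
    if ($from === false) {
        return '';
    }
    $from += strlen($start);
    $to = strpos($html, $end, $from);
    return $to === false ? '' : substr($html, $from, $to - $from);
}

// Step 3: reduce every opening tag to its bare name, e.g. <td class="x"> to <td>.
// Closing tags are left alone. Attribute values containing ">" are not handled.
function strip_attributes(string $html): string
{
    return preg_replace('/<([a-z][a-z0-9]*)\b[^>]*>/i', '<$1>', $html);
}

// Step 4: replace entities and collapse runs of whitespace.
function normalize_text(string $html): string
{
    $html = str_replace('&nbsp;', ' ', $html);               // plain space, not U+00A0
    $html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');  // remaining entities
    return trim(preg_replace('/\s+/', ' ', $html));
}

$section = extract_section($html, '<!-- begin listing -->', '<!-- end listing -->');
$clean   = normalize_text(strip_attributes($section));

// Step 5: with predictable markup, one pattern captures each row.
preg_match_all(
    '#<tr><td>(.*?)</td><td>\$([0-9.]+)</td></tr>#',
    $clean,
    $rows,
    PREG_SET_ORDER
);

$items = [];
foreach ($rows as $row) {
    $items[] = ['name' => $row[1], 'price' => (float) $row[2]];
}

foreach ($items as $item) {
    printf("%s: %.2f\n", $item['name'], $item['price']);
}
```

Stripping attributes and normalizing whitespace before matching keeps the final pattern short: it does not have to account for class names, colors, or uneven spacing in the source markup.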
Step 1 - Fetching the Data / HTML Content
Developers have done this work already. The Snoopy PHP class, which can be downloaded at http://sourceforge.net/projects/snoopy/,
has all the necessary tools to download an HTML page from the internet.
It's advisable to be considerate when retrieving content from
any site. Contact the Web site admin before fetching with any automated
script, don't bombard the Web site with thousands of HTTP requests a
second, always take note of any copyrights, and set up a contract for
content sharing if needed. Leeching Web content from a Web site without
permission can lead to serious legal issues.
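A minimal sketch of a considerate fetch loop might look like the following. It assumes Snoopy.class.php from the Snoopy distribution is on the include path. The function names (snoopy_fetch, fetch_politely), the user agent string, the two-second delay, and the URL in the usage comment are hypothetical choices for illustration, not values the article prescribes.

```php
<?php
// Fetch one page with Snoopy. Returns the page body, or null on failure.
// Assumption: Snoopy.class.php is available on the include path.
function snoopy_fetch(string $url)
{
    require_once 'Snoopy.class.php';

    $snoopy = new Snoopy();
    // Hypothetical agent string; identifying yourself lets the site admin reach you.
    $snoopy->agent = 'ExampleAggregator/0.1 (admin@example.com)';
    $snoopy->read_timeout = 30; // seconds to wait while reading the response

    if (!$snoopy->fetch($url)) {
        error_log("Fetch failed for $url: " . $snoopy->error);
        return null;
    }
    return $snoopy->results;
}

// Fetch several pages one at a time, pausing between requests so the
// remote server is never hit with a burst of traffic.
// $fetchOne is any callable that takes a URL and returns a body or null.
function fetch_politely(array $urls, callable $fetchOne, int $delaySeconds = 2): array
{
    $pages = [];
    $first = true;
    foreach ($urls as $url) {
        if (!$first && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
        $first = false;

        $body = $fetchOne($url);
        if ($body !== null) {
            $pages[$url] = $body; // failed URLs are skipped
        }
    }
    return $pages;
}

// Usage (hypothetical URL):
// $pages = fetch_politely(['http://example.com/listing.html'], 'snoopy_fetch');
```

Passing the fetcher in as a callable keeps the throttling logic separate from Snoopy, so the loop can be exercised with a stub before it is pointed at a real site.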
Steps 2, 3 & 4 - Parsing the Data
Once you have the HTML page(s), you will need to parse the data
into a friendlier format. This is done so that pattern matching will be
more reliable and consistent in the future. The functions shown below
will greatly help in this matter.
PHP Data Parsing Functions: