FRIHOST FORUMS SEARCH FAQ TOS BLOGS COMPETITIONS
You are invited to Log in or Register a free Frihost Account!


Data Mining From tables and then insert into MySQL





flatliner
Hello there I need help

There is a website with about 8 different colunms across and over 58 rows and will have new information in them every week. I need to extract the data from all these colunms every week and insert it into an identical (with the same colunm headings) MySQL table. I have looked around at data mining programs but I just cant get it to work. Any1 recommand one? or infrom me of a way of doing it.

Kind Regards
Ciaran Mc cann
MrBlueSky
If you give the url of the page, I can probably write a PHP or Perl function which extracts the information from the table. You will have to write the additional code to put the information in MySQL yourself, but that shouldn't be too difficult.
flatliner
oh jas thanks very much thats very kind of you.

The the url is http://webiplan.kildarecoco.ie/publiciplan/Email/week.asp?SD=9/5/07&ED=16/5/07&AC=K

PHP solutins would be best thanks.

It has about 4 tables in it but I only want the secound on.

The table with the heading

Quote:
PLANNING APPLICATIONS GRANTED UNDER SECTION 34 OF THE ACT SUBJECT TO CONDITIONS FROM 9/5/07 TO 16/5/07
In deciding a planning application the Planning Authority has had regard to submissions or observations received in accordance with the Planning and Development Regulations 2006.


and it has 12 colunms.

Thanks very much again.

Kind Regards
Ciaran Mc Cann
MrBlueSky
Oke, have a look at this. The script mines all the data from the second table. You can see it in action here:

http://www.mrbluesky.frih.net/test/scrape.php (look at the source of the page to see how the data is stored)

Take a look at the comments in the source code below for more information. Basically it's just one function, getTable($url), which returns a data-structure which contains all the data from the table.

You might want to do two things:
1) Check how the data from the table is stored, and see if it needs changing depending on the format in which you want to store the data to MySQL.
2) Check if the script also works with pages (urls) that contain older versions of the table (if available) to do some extra testing

Let me know about any bugs, or changes and fune-tuning you want.

Note:
* The page containing the tables is big, so I have set the execution time to 120 seconds instead of the standard 30, or else it won't work.
* I have tried to take care that trivial changes in the layout of the page won't break the script. If large changes to layout are made the script propably need to be changed.

The script:

Code:

<?php

   error_reporting(E_ALL);
   //
   // BIG FILE, we need some extra time to load it!
   //
   ini_set("max_execution_time", 120);

   function getTable($url) {
   
     //
     // 1. Fetch content of page
     //
     $ch = curl_init($url);
     curl_setopt($ch, CURLOPT_HEADER, 0);
     curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
     
     $content = curl_exec($ch);
     
     curl_close($ch);
     
     //
     // 2. Process content of page
     //
     
     // Remove everything before the table
     $temp = preg_split("/section.34.*?<table[^>]*>/si", $content);
     $content = $temp[1];
     
     // Remove everything after table
     $content = preg_replace("/<\/table>.*$/si", "", $content);
     
     // Spit rows
     $rows = preg_split("/<\s*tr[^>]*>/i", $content);
       
     // Remove the first element, which doesn't contain a row
     // and the second element which contains the header
     array_shift($rows);
     array_shift($rows);
     
     //
     // Process each row
     //
     
     $data = array();
     $rowcount = 0;
     
     foreach ($rows as $row) {
        
        $data[$rowcount] = array();
        
        // Remove HTML comments
        $row = preg_replace("/<!--.*?-->/s", "", $row);
        
        // Convert &nbsp; to spaces
        $row = preg_replace("/&nbsp;/i", " ", $row);
        
        // Split row to cells
        $cells = preg_split("/<\s*td[^>]*>/i", $row);
        
        // Remove first element, which doesn't contain a cell
        array_shift($cells);
        
        // Process each cell
        $cellcount = 0;
        foreach ($cells as $cell) {
           // Convert <BR>'s to spaces
           $cell = preg_replace("/<br>/i", " ", $cell);
           // Remove HTML-tags
           $cell = preg_replace("/<.*?>/s", "", $cell);
           // Remove starting and trailing whitespace
           $cell = ltrim(rtrim($cell));
           // Remove multiple spaces
           $cell = preg_replace('/\s\s+/', ' ', $cell);
           // Store cell contents in $data[row][column]
           $data[$rowcount][$cellcount] = $cell;
           $cellcount++;
        }
        $rowcount++;
     }
        
     return $data;   
        
   }
   
   // The URL of the page which contains the table to get the data from
   $url = 'http://webiplan.kildarecoco.ie/publiciplan/Email/week.asp?SD=9/5/07&ED=16/5/07&AC=K';
   
   // getTable($url) returns an array containing all data from the table:
   //   $data[row][column] (both start at zero)
   //   
   // for example:
   //  $data[0][0] the data from the first cell from the first row
   //  $data[2][4] the fifth cell from the third row
   //
   // to get the last cell from the last row:
   //  $rows = count($data);
   //  $cols = count($data[0]);
   //  $lastcell = $data[$rows-1][$cols-1];
   //
   $data = getTable($url);
   
   // Lets dump all the contents of $data to see what's in it
   // Look at the source of the page!!!
   var_dump($data);
   
?>
flatliner
Thanks very much that was very kind of you to do that for me.
flatliner
Hello Mrbluesky, thanks very much for writing the script for me. But I am in need of a mod. About 3 or 4 more mods. In the next few days. If you could please. There are a few different verisons of the tables, old versions and new ones. Here is the url to the older version of the table I need the script modified 2 thanks.

URL removed by user

Again its the secound table I wish to mine the data from.

Also I had the code modifted by some1 else and I did some myself also, as I am sure ya will notice. I will donate some $FIRH to you for this, thanks.

Code:

<?php
   
   $myurl = $_POST[url];
   
   error_reporting(E_ALL);
   //
   // BIG FILE, we need some extra time to load it!
   //
   ini_set("max_execution_time", 120);
   function getTable($url) {
   
     //
     // 1. Fetch content of page
     //
     $ch = curl_init($url);
     curl_setopt($ch, CURLOPT_HEADER, 0);
     curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
     
     $content = curl_exec($ch);
     
     curl_close($ch);
     
     //
     // 2. Process content of page
     //
     
     // Remove everything before the table
     $temp = preg_split("/section.34.*?<table[^>]*>/si", $content);
     $content = $temp[1];
     
     // Remove everything after table
     $content = preg_replace("/<\/table>.*$/si", "", $content);
     
     // Spit rows
     $rows = preg_split("/<\s*tr[^>]*>/i", $content);
       
     // Remove the first element, which doesn't contain a row
     // and the second element which contains the header
     array_shift($rows);
     array_shift($rows);
     
     //
     // Process each row
     //
     
     $data = array();
     $rowcount = 0;
     
     foreach ($rows as $row) {
       
        $data[$rowcount] = array();
       
        // Remove HTML comments
        $row = preg_replace("/<!--.*?-->/s", "", $row);
       
        // Convert &nbsp; to spaces
        $row = preg_replace("/&nbsp;/i", " ", $row);
       
        // Split row to cells
        $cells = preg_split("/<\s*td[^>]*>/i", $row);
       
        // Remove first element, which doesn't contain a cell
        array_shift($cells);
       
        // Process each cell
        $cellcount = 0;
  $cell_value = "";
  $field_count = "";
        foreach ($cells as $cell) {
           // Convert <BR>'s to spaces
           $cell = preg_replace("/<br>/i", " ", $cell);
           // Remove HTML-tags
           $cell = preg_replace("/<.*?>/s", "", $cell);
           // Remove starting and trailing whitespace
           $cell = ltrim(rtrim($cell));
           // Remove multiple spaces
           $cell = preg_replace('/\s\s+/', ' ', $cell);
           // Store cell contents in $data[row][column]
            $cell = preg_replace('/\'/', ' ', $cell);
     
     $data[$rowcount][$cellcount] = $cell;
       
     $cell_value .= !empty($cell) ? "'".$cell."'~" : "NULL~";//get the cell value
     if($cellcount == 0){$field_name = "FILE_NUMBER_$cellcount";}//and so on 11 times
     if($cellcount == 1){$field_name = "APPLICANT_NAME_$cellcount";}
     if($cellcount == 2){$field_name = "APPLICATION_TYPE_$cellcount";}
     if($cellcount == 3){$field_name = "APPLICATION_RECEIVED_$cellcount";}
     if($cellcount == 4){$field_name = "DEVELOPMENT_DESCRIPTION_$cellcount";}
     if($cellcount == 5){$field_name = "DEVELOPMENT_LOCATION_$cellcount";}
     if($cellcount == 6){$field_name = "DECISION_DATE_$cellcount";}
     if($cellcount == 7){$field_name = "DEVELOPMENT_TYPE_$cellcount";}
     if($cellcount == 8){$field_name = "PROTECTED_STRUCTURE_$cellcount";}
     if($cellcount == 9){$field_name = "WASTE_LICENCE_$cellcount";}
     if($cellcount == 10){$field_name = "RECEIVE_MARKETING_MATERIAL_$cellcount";}
     $field_count .= $field_name."~";//and so on 11 times
     $field_name = ""; //reset value
           $cellcount++;   
        }

     $field_count = rtrim($field_count,"~");//trim last ~
     $cell_value = rtrim($cell_value,"~");//trim last ~
     
     
     
     //here prepare the fields for insert and run mysql_query
     echo "insert into(".str_replace("~",",",$field_count).") values(".str_replace("~",",",$cell_value).")replace<br>";
     $cell_value = "";//clear values after insert
   $field_count = "";//clear values after insert
        $rowcount++;
     }
       
     return $data;   
       
   }
   
   // The URL of the page which contains the table to get the data from
   $url = "$myurl";
   
   // getTable($url) returns an array containing all data from the table:
   //   $data[row][column] (both start at zero)
   //   
   // for example:
   //  $data[0][0] the data from the first cell from the first row
   //  $data[2][4] the fifth cell from the third row
   //
   // to get the last cell from the last row:
   //  $rows = count($data);
   //  $cols = count($data[0]);
   //  $lastcell = $data[$rows-1][$cols-1];
   //
   $data = getTable($url);
   
   // Lets dump all the contents of $data to see what's in it
   // Look at the source of the page!!!
   
   
?>



flatliner
actually no sorry i worked it out. thanks
Related topics
script backup database
Build an online dictionary by PHP/MySQL
How To : Improve Your PHP Programming
[PhP] News Posting Tutorial (code, actually ^^')
MySQL statement error (got me going)
MySQL DB Backup script - minor problem
Access to MySQL
text formatting inside of textarea adds white space
insert data to mySQL database problem
how to insert data into mysql base from a web page
mysql insert into multiple tables at once?
cannot add data to mysql from PHP form
data mining
PHP/MySQL Dropdown value, and insert into database
Reply to topic    Frihost Forum Index -> Scripting -> Php and MySQL

FRIHOST HOME | FAQ | TOS | ABOUT US | CONTACT US | SITE MAP
© 2005-2011 Frihost, forums powered by phpBB.