FRIHOST FORUMS SEARCH FAQ TOS BLOGS COMPETITIONS
You are invited to Log in or Register a free Frihost Account!


counting the unique word frequencies





bukaida
I have a table articles which has the following structure
Code:

CREATE TABLE `articles` (
 `body` text,
 `title` varchar(250) default NULL,
 `id` int(11) NOT NULL auto_increment,
 PRIMARY KEY  (`id`),
 FULLTEXT KEY `body` (`body`,`title`)
) ENGINE=MyISAM AUTO_INCREMENT=8 DEFAULT CHARSET=utf8


Now my first question is --

Can I store a UTF textfile of say 2MB under 'body' column ? OR any other better alternative to do the same without storing the file into MYSQL? I have to perform a FULL TEXT SEARCH.

I want to count each of the unique word's frequency and print them in a html tabular format. I can now count the total number of words by a PHP function
Code:

function adv_count_words($str)
          {     $words = 0; 
             $str = eregi_replace(" +", " ", $str);
            $array = explode(" ", $str);
             for($i=0;$i < count($array);$i++)
                 {
                    if (eregi("[0-9A-Za-z---]", $array[$i]))
                    $words++;
                } 
                return $words;
                }

But it returns the duplicate words also. At the same time it doesnot take into account, the individual words and their frequencies. I have mysql 5 with PHP 4.4.Please help.
SonLight
What you need is to keep a list of each word you have seen, with the word itself and the number of times you have seen it in each entry. For the sake of efficiency, a hash table is probably the best method. However, the most important thing is getting the logic right so that each time you find a word, you call a function to look for it in the list. If it is there, just increment the count. If it is not there, call another function to add it and set the count to 1.

I would recommend getting it working with simple arrays first, unless you are comfortable with using hash tables. That way you can demonstrate the correctness of the code first.
bukaida
@SonLight
I am not very proficient in PHP. So it will be very helpful if you can provide a sample code (with the array implementation, as the hash table is not my cup of tea at this level).Thank you in advance.
Fire Boar
First things first, that PHP code is pretty messy, I'd strongly advise cleaning up the indentation. Now, here's some sample code.

Code:
function find_words($str) {
  $str = preg_replace('/[^0-9A-Za-z---]/', ' ', $str);
  $str = preg_replace('/\s+/', ' ', $str);
  $words = explode(' ', $str);

  $found = array();
  foreach ($words as $word) {
    if (!empty($word)) {
      $found[strtolower($word)]++;
    }
  }
  return $found;
}


You can then use array_keys on the return value to find the words, with each key's value being the number of words. Here's an example of how it works:

Code:
$a = find_words('She sells sea shells on the sea shore; the shells she sells are sea shore shells.');
/*
$a == array(
  'she'    => 2,
  'sells'  => 2,
  'sea'    => 3,
  'shells' => 3,
  'on'     => 1,
  'the'    => 2,
  'shore'  => 2,
  'are'    => 1
)
*/


And here's an example of how you could tabulate a string:

Code:
$words = find_words($somestring);

echo '<table><thead><tr><th scope="col">Word</th><th scope="col">Count</th></tr></thead><tbody>';
foreach ($words as $word => $count) {
  echo "<tr><td>$word</td><td>$count</td></tr>";
}
echo '</tbody></table>';
bukaida
Thanx a lot Fire Boar. Giving the code a try. Will inform you about the result soon.

PS . Trying to learn PHP. So following your coding standard ( Although it will be always difficult to follow an expert, but thats how people learn to do new things Very Happy )

THANK YOU VERY MUCH AGAIN.
Fire Boar
As far as coding standards go, I tend to follow the Drupal coding standard, though sometimes I like to put { on its own line indented the same amount as the corresponding }. It doesn't really matter what standard you follow as long as it's consistent throughout any given project.
bukaida
Thanx Man. This is a nice tutorial. Lots of new things to learn.
bukaida
The word frequency counter is working with english characters only but not working with UTF-8.
Code:

function find_words($str) {
  $str = preg_replace('/[^0-9A-Za-z---]/', ' ', $str);
  $str = preg_replace('/\s+/', ' ', $str);
  $words = explode(' ', $str);

  $found = array();
  foreach ($words as $word) {
    if (!empty($word)) {
      $found[strtolower($word)]++;
    }
  }
  return $found;
}




Since strtolower() doesnot work with UTF-8 characters so I have omitted it.
Code:

function find_words($str) {
  $str = preg_replace('/[^0-9A-Za-z---]/', ' ', $str);
  $str = preg_replace('/\s+/', ' ', $str);
  $words = explode(' ', $str);

  $found = array();
  foreach ($words as $word) {
    if (!empty($word)) {
      $found[$word]++;
    }
  }
  return $found;
}


But still it is not working. Please Help.

Total word count from my old function is working.So the total count is ok for UTF-8 Characters.
Fire Boar
You probably shouldn't use UTF-8 characters in the key of a PHP array. Try this instead.

Code:
function find_words_mb($str) {
  $str   = preg_replace('/[^0-9A-Za-z---]/', ' ', $str);
  $str   = preg_replace('/\s+/', ' ', $str);
  $str   = strtr($str, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
  $words = explode(' ', $str);

  $found_words = array();
  $found       = array();
  foreach ($words as $word) {
    if (!empty($word)) {
      if (($key = array_search($word, $found_words)) !== false) {
        $found[$key]['count']++;
      }
      else {
        $found[]       = array('word' => $word, 'count' => 1);
        $found_words[] = $word;
      }
    }
  }
  return $found;
}


To "to lower" anything not in the roman alphabet, just add an upper/lower pair to strtr. Here is the example output for this new function.

Code:
$a = find_words_mb('She sells sea shells on the sea shore; the shells she sells are sea shore shells.');
/*
$a == array(
  0 => array('word' => 'she', 'count' => 2),
  1 => array('word' => 'sells', 'count' => 2),
  2 => array('word' => 'sea', 'count' => 3),
  3 => array('word' => 'shells', 'count' => 3),
  4 => array('word' => 'on', 'count' => 1),
  5 => array('word' => 'the', 'count' => 2),
  6 => array('word' => 'shore', 'count' => 2),
  7 => array('word' => 'are', 'count' => 1)
)
*/


And to get data out...

Code:
$words = find_words_mb($somestring);

echo '<table><thead><tr><th scope="col">Word</th><th scope="col">Count</th></tr></thead><tbody>';
foreach ($words as $word) {
  echo "<tr><td>$word[word]</td><td>$word[count]</td></tr>";
}
echo '</tbody></table>';
bukaida
This modification is somehow working for me--
Code:

 function find_words($str) {
 
  $str = eregi_replace(" +", " ", $str);
 
$newstr = preg_replace("/[|]/","",$str); // For a single value |
$newstr = preg_replace("/[,]/","",$newstr); // For a single value ,
$newstr = preg_replace("/[:]/","",$newstr); // For a single value :
$newstr = preg_replace("/[.]/","",$newstr); // For a single value .
$newstr = preg_replace("/[[]/","",$newstr); // For a single value [
$newstr = preg_replace("/[]]/","",$newstr); // For a single value ]
$newstr = preg_replace("/[\n\r]/","",$newstr); // For a newline
$newstr = preg_replace("/[\"]/","",$newstr); // For a "

echo('<h3 align="center"><u>The Text after trimming</u></h3><font color="green">');
echo $newstr;
echo('</font>');
$mywords = explode(' ', $newstr);

  $found = array();
  foreach ($mywords as $myword) {
     
    if (!empty($myword)) {
       
      $found[$myword]++;
   
  }
  }
 
  return $found;
}

I know the preg_replace() can accept array , so all the replacements can be made in a single statement. But except that , any other improvement can be done to the code?
Also if a paganitation can be done, it will be very helpful as all the records are coming in one page now.
Fire Boar
Code:
function trim_chars($str) {
  $str = preg_replace('/[\|,:\.\[\]\"]/', ' ',$str);
  return preg_replace('/\s+/', ' ', $str);
}

function find_words($str) {
  $words = explode(' ', $str);

  $found_words = array();
  $found       = array();
  foreach ($words as $word) {
    if (!empty($word)) {
      if (($key = array_search($word, $found_words)) !== false) {
        $found[$key]['count']++;
      }
      else {
        $found[]       = array('word' => $word, 'count' => 1);
        $found_words[] = $word;
      }
    }
  }

  return $found;
}

$trimmed = trim_chars($mystring);

echo '<h3 style="text-align: center; text-decoration: underline;">The Text after trimming</h3><p style="color: green;">';
echo $trimmed;
echo '</p>';

$mywords = find_words($trimmed);


Try that instead. Some general markup things:
- In HTML, include all "how it looks" code as a CSS declaration: either using the style attribute (as used above) or the class attribute. The font tag should never be used, and the text should usually appear in a <p> tag.
- Don't use the ereg functions: they are deprecated.
- Avoid putting anything unrelated to the function in it. You're using echo statements in a function that is supposed to take a string as an input and return an array as output. This is generally not such a great idea. In the example above, I split the trim operation out into another function for this.
- You're still using UTF-8 characters as array keys. This isn't a great idea.
- Make use of preg_replace's character classes. Don't split them into multiple expressions, that's very inefficient. Also you don't need to worry about \r and \n when using the /\s+/ pattern match.
- Be more consistent in indentation. Make sure the start of every line between any pair of { and } is the same number of spaces in.
- Try to avoid having trailing spaces at the end of a line, especially blank lines.
- Try to avoid using tabs: some editors display them as 8 spaces, some as 4, some as 2, some are configurable. Spaces are more consistent.
bukaida
Thanx man, a very nice piece of cleanup. Any help in pagination ?
Fire Boar
This doesn't really translate too well to pagination, but I suppose you could use array_slice on the result of find_words.
bukaida
preg_replace() is not working good with UTF-8. ergi_replace() is working fine (specialy for blank space and newlines).
example--

$str = eregi_replace(" +", " ", $str); is working but

$str = preg_replace('/[^0-9A-Za-z---]/', ' ', $str);
$str = preg_replace('/\s+/', ' ', $str);

is producing garbled output.

However preg_replace() is working fine with other english character replacement.
Fire Boar
That's a bug with Windows' way of handling UTF-8. It works fine on anything apart from Windows. Try this instead:

Code:
function trim_chars($str) {
  $str = preg_replace('/[\|,:\.\[\]\"]/', ' ',$str);
  return preg_replace('/[ \t\r\n]+/', ' ', $str);
}
bukaida
There is a major problem with this approach with my current project. The full text search function of mysql is fine when the database is small. But the server is collapsing for my current data (50 MB of text data). So I am trying to go for an indexed based search by integrating Sphinx with my code. The basic queries worked fine but for having the same level of result as I was getting earlier ( showed in screenshot) requires editing the sphinxapi.php file which is quite tough for a beginner like me. I am currently using the supplied test.php file with minor modifications--

Code:


require ( "connect.php" );
require ( "sphinxapi.php" );
$key=$_POST['q'];
echo('Search key is:');
echo $key;
echo('<br><br>');
$args = array();
foreach ( $_SERVER["argv"] as $arg )
   $args[] = $arg;

$cl = new SphinxClient ();

$q = $key;
$sql = "SELECT * from articles";
$mode = SPH_MATCH_ALL;  //Changing this to boolean didnot help
$host = "localhost";
$port = 9312;
$index = "*";
$groupby = "";
$groupsort = "@group desc";
$filter = "group_id";
$filtervals = array();
$distinct = "";
$sortby = "";
$limit = 20;
$ranker = SPH_RANK_PROXIMITY_BM25;
$select = "";

|
|
|
|


////////////
// do query
////////////

$cl->SetServer ( $host, $port );
$cl->SetConnectTimeout ( 1 );
$cl->SetArrayResult ( true );
$cl->SetWeights ( array ( 100, 1 ) );
$cl->SetMatchMode ( $mode );
if ( count($filtervals) )   $cl->SetFilter ( $filter, $filtervals );
if ( $groupby )            $cl->SetGroupBy ( $groupby, SPH_GROUPBY_ATTR, $groupsort );
if ( $sortby )            $cl->SetSortMode ( SPH_SORT_EXTENDED, $sortby );
if ( $sortexpr )         $cl->SetSortMode ( SPH_SORT_EXPR, $sortexpr );
if ( $distinct )         $cl->SetGroupDistinct ( $distinct );
if ( $select )            $cl->SetSelect ( $select );
if ( $limit )            $cl->SetLimits ( 0, $limit, ( $limit>1000 ) ? $limit : 1000 );
$cl->SetRankingMode ( $ranker );
$res = $cl->Query ( $q, $index );

////////////////
// print me out
////////////////

if ( $res===false )
{
   print "Query failed: " . $cl->GetLastError() . ".\n";

} else
{
   if ( $cl->GetLastWarning() )
      print "WARNING: " . $cl->GetLastWarning() . "\n\n";

   print "Query '$q' retrieved $res[total] of $res[total_found] matches in $res[time] sec.\n";
   echo('<br><br>');
   print "Query stats:\n";
   if ( is_array($res["words"]) )
      foreach ( $res["words"] as $word => $info )
         print "    '$word' found $info[hits] times in $info[docs] documents\n";
   print "\n";

   if ( is_array($res["matches"]) )
   {
      $n = 1;
      echo('<br><br><h2>');
      print "Matches:\n";
      echo('</h2><br>');
      foreach ( $res["matches"] as $docinfo )
      {
         echo('<br>');
         print "$n. doc_id=$docinfo[id]";
         
         //Query addition
            $query="SELECT title
             FROM articles
             WHERE id='$docinfo[id]'";
             $result=mysql_query($query)or die(mysql_error());
               if($result){
              while($row=mysql_fetch_array($result)){
              extract($row);
                                            }
                      }
     echo('&nbsp;&nbsp; Title=');
    echo($title);
            
            //Query end
            
            
         foreach ( $res["attrs"] as $attrname => $attrtype )
         {
            $value = $docinfo["attrs"][$attrname];
            if ( $attrtype & SPH_ATTR_MULTI )
            {
               $value = "(" . join ( ",", $value ) .")";
            } else
            {
               if ( $attrtype==SPH_ATTR_TIMESTAMP )
                  $value = date ( "Y-m-d H:i:s", $value );
            }
            print ", $attrname=$value";
            
         }
         print "\n";
         $n++;
      }
   }
}


How can I search something with a wild character(e.g. hu* will bring all the words starting with hu). Previously boolean mode search was doing the trick. But now even setting the mode in the code is not helping.
jmraker
In mysql a "text" field type can store 64k of data, if you want to store more the "mediumtext" field type can store about 1.6meg and the "longtext" field type can store about 4.2gig of data
bukaida
Storing the data is not an issue, searching in resonable time is the thing that is required.
Nemesis234
what exactly do you want? do you want a unique word counter or do you want a mysql bulk text search engine?
Related topics
The future of Wi-Fi
Windows Tips&tricks!
Bumper Stickers
Another (almost) worst Word problem. Ever ;)
convert ur PDF files to word documents
Problem with opening a word doc from site hosted in frihost
Making Your Website Look Professional
Bluetooth
Grab your very unique Gangsa Name!
Most the return card that praises
Carl's Start-to-Finish Professional Website Tutorial
Mod Projects lots of info
Define Religion?
What is the usefulness of religion?
Reply to topic    Frihost Forum Index -> Scripting -> Php and MySQL

FRIHOST HOME | FAQ | TOS | ABOUT US | CONTACT US | SITE MAP
© 2005-2011 Frihost, forums powered by phpBB.