Blog coding and discussion of coding about JavaScript, PHP, CGI, general web building etc.

Monday, January 18, 2016

Finding keywords in texts


I have an array of incidents that have happened. They are written in free text and therefore don't follow a pattern, except for some keywords, e.g. "robbery", "murderer", "housebreaking", "car accident" etc. Those keywords can be anywhere in the text, and I want to find them and add them to categories, e.g. "Robberies".

In the end, when I have checked all the incidents I want to have a list of categories like this:

    Robberies: 14
    Murder attempts: 2
    Car accidents: 5
    ...

The array elements can look like this:

    incidents[0] = "There was a robbery on Amest Ave last night...";
    incidents[1] = "There has been a report of a murder attempt...";
    incidents[2] = "Last night there was a housebreaking in...";
    ...

I guess the best here is to use regular expressions to find the keywords in the texts, but I really suck at regexp and therefore need some help here.

The regular expressions below are not correct, but I guess this structure would work? Is there a better way of doing this that avoids the repetition and keeps the code DRY?

    var trafficAccidents = 0,
        robberies = 0,
        ...

    function FindIncident(incident) {
        if (incident.match(/car accident/g)) {
            trafficAccidents += 1;
        }
        else if (incident.match(/robbery/g)) {
            robberies += 1;
        }
        ...
    }

Thanks a lot in advance!

Answer by Bruno for Finding keywords in texts


Use an object to store your data.

    var events = [
        {
            // \b ... \b            word boundaries on both sides
            // (?:robbery|robberies) singular or plural, grouped so that
            //                       both boundaries apply to either word
            // /i                    case insensitive
            exp: /\b(?:robbery|robberies)\b/i,
            name: "robbery",
            count: 0
        }
        // other objects here
    ];

    var i = events.length;
    while (i--) {
        var j = incidents.length;
        while (j--) {
            // only checks whether a particular event exists in the incident,
            // not the number of occurrences
            if (events[i].exp.test(incidents[j])) {
                events[i].count++;
            }
        }
    }
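One detail worth checking when combining \b with alternation: `|` has the lowest precedence in a regular expression, so without a group each boundary attaches to only one of the alternatives. The sketch below compares the two spellings; the variable names and the sample sentence are made up for illustration.

```javascript
// Reads as (\brobbery) OR (robberies\b): each alternative gets only one boundary.
var ungrouped = /\brobbery|robberies\b/i;

// The non-capturing group makes both boundaries apply to either alternative.
var grouped = /\b(?:robbery|robberies)\b/i;

// Made-up sample: "robbery" is followed directly by more letters.
var sample = "A robberyproof safe was delivered";

var ungroupedHit = ungrouped.test(sample); // true: no trailing boundary is required
var groupedHit = grouped.test(sample);     // false: "robbery" must end at a boundary

console.log(ungroupedHit, groupedHit);
```

Both patterns still match an ordinary sentence such as "There was a robbery on Amest Ave last night...".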

Answer by rbtLong for Finding keywords in texts


Actually, I would kind of disagree with you here . . . I think string functions like indexOf will work perfectly fine.

I would use JavaScript's indexOf method which takes 2 inputs:

string.indexOf(value,startPos);

So one thing you can do is define a simple temporary variable as your cursor as such . . .

    function FindIncident(phrase, word) {
        var cursor = 0;
        var wordCount = 0;
        while (phrase.indexOf(word, cursor) > -1) {
            // move the cursor past this match so the next search starts after it
            cursor = phrase.indexOf(word, cursor) + word.length;
            ++wordCount;
        }
        return wordCount;
    }

I have not tested the code but hopefully you get the idea . . .

Be particularly careful of the starting position if you do use it.
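To make the point about the starting position concrete, here is a minimal sketch of counting occurrences with indexOf. The function name countOccurrences and the sample text are hypothetical; the key step is restarting each search just past the previous match, since restarting at the match itself would find the same occurrence forever.

```javascript
// Counts non-overlapping, case-sensitive occurrences of word in phrase.
function countOccurrences(phrase, word) {
    if (word.length === 0) {
        return 0; // an empty search string would never advance the cursor
    }
    var count = 0;
    var cursor = phrase.indexOf(word);
    while (cursor !== -1) {
        count += 1;
        // restart the search just past the match that was found
        cursor = phrase.indexOf(word, cursor + word.length);
    }
    return count;
}

var text = "Murder, murder, murder";
var caseSensitive = countOccurrences(text, "murder");                 // 2
var caseInsensitive = countOccurrences(text.toLowerCase(), "murder"); // 3
console.log(caseSensitive, caseInsensitive);
```

Since indexOf is case sensitive, lowercasing the text first is one simple way to catch "Murder" as well.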

Answer by Ed Johnson for Finding keywords in texts


RegEx makes my head hurt too. ;) If you're looking for exact matches and aren't worried about typos and misspellings, I'd search the incident strings for substrings containing the keywords you're looking for.

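    incident = incident.toLowerCase();
    // search() returns the index of the first match, or -1 if there is none,
    // so compare against -1 (a match at the very start has index 0)
    if (incident.search("car accident") !== -1) {
        trafficAccidents += 1;
    }
    else if (incident.search("robbery") !== -1) {
        robberies += 1;
    }
    ...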
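One thing to watch with position-based checks like this: a keyword at the very start of the text is found at index 0, so the result should be compared against -1 rather than 0. A small sketch with a made-up incident string:

```javascript
// Made-up incident where the keyword is the first word.
var incidentText = "Robbery reported on Amest Ave last night...".toLowerCase();

var position = incidentText.indexOf("robbery"); // 0: found at the very start

var greaterThanZero = position > 0;  // false: the match at index 0 is missed
var notMinusOne = position !== -1;   // true: the match is detected

console.log(position, greaterThanZero, notMinusOne);
```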

Answer by Dancrumb for Finding keywords in texts


The following code shows an approach you can take.

    var INCIDENT_MATCHES = {
      trafficAccidents: /(traffic|car) accident(?:s){0,1}/ig,
      robberies: /robbery|robberies/ig,
      murder: /murder(?:s){0,1}/ig
    };

    function FindIncidents(incidentReports) {
      var incidentCounts = {};
      var incidentTypes = Object.keys(INCIDENT_MATCHES);
      incidentReports.forEach(function(incident) {
        incidentTypes.forEach(function(type) {
          if(typeof incidentCounts[type] === 'undefined') {
            incidentCounts[type] = 0;
          }
          var matchFound = incident.match(INCIDENT_MATCHES[type]);
          if(matchFound){
            incidentCounts[type] += matchFound.length;
          }
        });
      });

      return incidentCounts;
    }

Regular expressions make sense, since you'll have a number of strings that meet your 'match' criteria, even if you only consider the differences in plural and singular forms of 'robbery'. You also want to ensure that your matching is case-insensitive.

You need to use the 'global' modifier on your regexes so that you match strings like "Murder, Murder, murder" and increment your count by 3 instead of just 1.
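The effect of the global modifier on String.prototype.match can be seen in a short sketch (the variable names are only for illustration):

```javascript
var report = "Murder, Murder, murder";

// Without g, match() stops at the first occurrence.
var firstOnly = report.match(/murder/i);

// With g, match() returns every occurrence, or null when there is none.
var allMatches = report.match(/murder/ig);
var total = allMatches ? allMatches.length : 0;

console.log(firstOnly[0]); // "Murder"
console.log(total);        // 3
```

The null check matters because match() returns null, not an empty array, when nothing is found.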

This allows you to keep the relationship between your match criteria and incident counters together. It also avoids the need for global counters (granted, INCIDENT_MATCHES is a global variable here, but you can readily put that elsewhere and take it out of the global scope).

Answer by FrankieTheKneeMan for Finding keywords in texts


Use an array of objects to store all the different categories you're searching for, each complete with an appropriate regular expression and a count member, and you can write the whole thing in four lines.

    var categories = [
        {
            regexp: /\brobbery\b/i
            , display: "Robberies"
            , count: 0
        }
        , {
            regexp: /\bcar accidents?\b/i
            , display: "Car Accidents"
            , count: 0
        }
        , {
            regexp: /\bmurder\b/i
            , display: "Murders"
            , count: 0
        }
    ];

    var incidents = [
        "There was a robbery on Amest Ave last night..."
        , "There has been a report of a murder attempt..."
        , "Last night there was a housebreaking in..."
    ];

    for(var x = 0; x

Now, no matter what you need, you can simply edit one section of code, and it will propagate through your code.

This code has the potential to categorize each incident in multiple categories. To prevent that, just add a 'break' statement to the if block.
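The loop in the listing above breaks off in this copy. As a rough sketch of what a loop over these structures might look like (a guess based on the surrounding description, not necessarily the original code), the following tests every category against every incident. The firstMatchOnly flag and the summary variable are hypothetical additions to show where the 'break' would go.

```javascript
var categories = [
    { regexp: /\brobbery\b/i, display: "Robberies", count: 0 },
    { regexp: /\bcar accidents?\b/i, display: "Car Accidents", count: 0 },
    { regexp: /\bmurder\b/i, display: "Murders", count: 0 }
];

var incidents = [
    "There was a robbery on Amest Ave last night...",
    "There has been a report of a murder attempt...",
    "Last night there was a housebreaking in..."
];

// Hypothetical switch: set to true to file each incident under at most one category.
var firstMatchOnly = false;

for (var x = 0; x < incidents.length; x++) {
    for (var y = 0; y < categories.length; y++) {
        if (categories[y].regexp.test(incidents[x])) {
            categories[y].count++;
            if (firstMatchOnly) {
                break; // stop checking further categories for this incident
            }
        }
    }
}

// Build the "Category: count" listing asked for in the question.
var summary = categories.map(function (category) {
    return category.display + ": " + category.count;
}).join("\n");

console.log(summary);
```

With the three sample incidents this counts one robbery and one murder; the housebreaking report matches no category, so adding a fourth object to the array is all it takes to cover it.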

Answer by elclanrs for Finding keywords in texts


You could do something like this which will grab all words found on each item in the array and it will return an object with the count:

    var words = ['robbery', 'murderer', 'housebreaking', 'car accident'];

    function getAllIncidents( incidents ) {
      var re = new RegExp('('+ words.join('|') +')', 'i')
        , result = {};
      incidents.forEach(function( txt ) {
        var match = ( re.exec( txt ) || [,0] )[1];
        match && (result[ match ] = ++result[ match ] || 1);
      });
      return result;
    }

    console.log( getAllIncidents( incidents ) );
    //^= { housebreaking: 1, car accident: 2, robbery: 1, murderer: 2 }

This is more of a quick prototype, but it could be improved with plurals and multiple keywords.
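One possible way to handle plurals and multiple keywords, sketched under some assumptions: the keywordToCategory table, its category names and the function name countByCategory are all hypothetical, and the keywords are assumed to contain no regex metacharacters. Each keyword variant maps to a display category, and the matched text is lowercased before the lookup so that "Robbery" and "robberies" land in the same bucket.

```javascript
// Hypothetical table: every keyword variant points at its display category.
var keywordToCategory = {
    "robbery": "Robberies",
    "robberies": "Robberies",
    "car accident": "Car accidents",
    "car accidents": "Car accidents",
    "murder": "Murder attempts"
};

function countByCategory(incidents) {
    // Longer keywords first, so "car accidents" is tried before "car accident".
    var keywords = Object.keys(keywordToCategory).sort(function (a, b) {
        return b.length - a.length;
    });
    // Assumes the keywords contain no regex metacharacters.
    var re = new RegExp("(" + keywords.join("|") + ")", "i");
    var result = {};
    incidents.forEach(function (txt) {
        var match = re.exec(txt);
        if (match) {
            var category = keywordToCategory[match[1].toLowerCase()];
            result[category] = (result[category] || 0) + 1;
        }
    });
    return result;
}

var sampleCounts = countByCategory([
    "Two robberies on Amest Ave",
    "A car accident downtown",
    "Another Robbery reported"
]);
console.log(sampleCounts);
```

Like the prototype, this counts only the first keyword found in each incident.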

Demo: http://jsbin.com/idesoc/1/edit

Answer by caiosm1005 for Finding keywords in texts


Yes, that's one way to do it, although matching plain words with regex is a bit of overkill; in that case, you should be using indexOf as rbtLong suggested.

You can refine it further by:

  • appending the i flag (match lowercase and uppercase characters).
  • adding possible word variations to your expression. robbery could be translated into robber(y|ies), thus matching both singular and plural variations of the word. car accident could be (car|truck|vehicle|traffic) accident.

Word boundaries \b

Don't use this. It requires your matching word to be surrounded by non-word characters (or the start or end of the string) and will prevent matching some typos, such as a missing space. You should make your queries as broad as possible.
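The trade-off can be seen in a small sketch that compares a bounded and an unbounded pattern on a few made-up sentences (the variable names are only for illustration):

```javascript
var bounded = /\bmurder\b/i;
var unbounded = /murder/i;

// Made-up samples: a clean sentence, a typo with a missing space, and a longer word.
var clean = "A murder was reported";
var typo = "There was amurder attempt";
var longer = "The murderer was caught";

console.log(bounded.test(clean), unbounded.test(clean));   // true true
console.log(bounded.test(typo), unbounded.test(typo));     // false true
console.log(bounded.test(longer), unbounded.test(longer)); // false true
```

Whether the last case counts as a wanted match or a false positive depends on the keyword, so the broader pattern also brings in hits that \b would have filtered out.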


    if (incident.match(/(car|truck|vehicle|traffic) accident/i)) {
        trafficAccidents += 1;
    }
    else if (incident.match(/robber(y|ies)/i)) {
        robberies += 1;
    }

Notice how I discarded the g flag; it stands for "global match" and makes the parser continue searching the string after the first match. This seems unnecessary as just one confirmed occurrence is enough for your needs.
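A related reason to leave out g shows up if you ever swap match() for RegExp.prototype.test (an assumption; the code above uses match, which is not affected): a global regex remembers where its last match ended in its lastIndex property, so reusing the same regex object across strings can skip a match. The sample strings and variable names below are made up.

```javascript
var globalRe = /robbery/ig;
var plainRe = /robbery/i;

var first = "There was a robbery on Amest Ave last night...";
var second = "Robbery reported downtown";

// The global regex keeps state between calls.
var globalFirst = globalRe.test(first);   // true; lastIndex now points past "robbery"
var globalSecond = globalRe.test(second); // false; the search resumed from that index

// The non-global regex always starts from the beginning.
var plainFirst = plainRe.test(first);     // true
var plainSecond = plainRe.test(second);   // true

console.log(globalFirst, globalSecond, plainFirst, plainSecond);
```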

This website offers an excellent introduction to regular expressions

http://www.regular-expressions.info/tutorial.html

