
Monday, May 30, 2016

How to parse product titles (unstructured) into structured data?


I am looking to parse unstructured product titles like "Canon D1000 4MP Camera 2X Zoom LCD" into structured data like {brand: canon, model number: d1000, lens: 4MP, zoom: 2X, display type: LCD}.

So far I have:

  1. Removed stopwords and cleaned up the text (removing characters like - ; : /)
  2. Tokenized long strings into words.
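
For reference, the two preprocessing steps above could look roughly like the following. This is only a sketch: the stopword set and the function name `tokenize_title` are placeholders made up for illustration, not part of any library.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real one would be larger.
STOPWORDS = {"a", "an", "the", "with", "for", "and", "of"}

def tokenize_title(title):
    """Replace the separator characters - ; : / with spaces, split on
    whitespace, and drop stopwords (compared case-insensitively)."""
    cleaned = re.sub(r"[-;:/]", " ", title)
    return [tok for tok in cleaned.split() if tok.lower() not in STOPWORDS]

tokens = tokenize_title("Canon D1000: 4MP Camera with 2X Zoom / LCD")
```

Note that blindly replacing "-" would also split hyphenated model numbers, so the character list may need tuning per data source.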

Any techniques, libraries, methods, or algorithms would be much appreciated!

EDIT: There is no fixed pattern for the product titles. A seller can input anything as a title. For example, the title can be just 'Canon D1000'. Also, this exercise is not only for camera datasets; the title can be of any product.

Answer by dragonxlwang for How to parse product titles (unstructured) into structured data?


If you are only getting titles (like Amazon product titles), then you can view each title as a sentence and consider sequential labeling.

Depending on whether the attributes (brand, model, etc.) are given or unknown, there are two cases:

1: If the attributes are given, then the problem is "easy" and you can use any sequential labeling method to solve it. Methods include CRFs (conditional random fields) and Markov models (HMM, MEMM, etc.).
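
To make the sequential labeling framing concrete, here is a minimal sketch of the input side: every token of a title is turned into a feature dictionary, and a CRF or similar model would then be trained to assign one label per token. The feature names, the function names, and the label set are assumptions chosen for illustration; no particular toolkit is implied.

```python
def token_features(tokens, i):
    """Features for the token at position i, including its neighbours,
    so a sequence model can use context as well as the token itself."""
    tok = tokens[i]
    has_digit = any(c.isdigit() for c in tok)
    has_alpha = any(c.isalpha() for c in tok)
    return {
        "lower": tok.lower(),
        "is_first": i == 0,
        "has_digit": has_digit,
        "is_alnum_mix": has_digit and has_alpha,  # e.g. D1000, 4MP
        "is_upper": tok.isupper(),
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

def title_to_features(title):
    """One feature dict per whitespace-separated token."""
    tokens = title.split()
    return [token_features(tokens, i) for i in range(len(tokens))]

features = title_to_features("Canon D1000 4MP Camera 2X Zoom LCD")
# Hypothetical gold labels for the same title, one per token ("O" = no attribute).
labels = ["BRAND", "MODEL", "LENS", "O", "ZOOM", "O", "DISPLAY"]
```

A trained model would see many such (features, labels) pairs and learn, for instance, that a mixed letter-digit token right after a brand is usually a model number.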

2: If not, then you need to extract (attribute, value) pairs much as a parser would (dependency parsing or full parsing). I am not sure this is feasible, though, since so little is known about the attributes beforehand. Another possibility is to use external information (such as reviews and product descriptions) to discover the attributes first and then extract the pairs from the titles. For example, if "brand" and "canon" co-occur often in reviews, and "canon" then appears in a title that also mentions a camera, you can infer that it is a value for "brand".
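
The co-occurrence idea can be sketched as a simple counting pass over external text. Everything here is an assumption for illustration: the attribute vocabulary, the window size, the threshold, and the function names. A real system would also need stopword filtering and far more data.

```python
from collections import Counter, defaultdict

# Hypothetical attribute vocabulary; in case 2 this would itself be discovered.
ATTRIBUTES = {"brand", "zoom", "display"}

def cooccurrence_counts(sentences, window=3):
    """Count how often each word appears within `window` words of an
    attribute name. Returns {word: Counter({attribute: count})}."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            if word not in ATTRIBUTES:
                continue
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i and words[j] not in ATTRIBUTES:
                    counts[words[j]][word] += 1
    return counts

def guess_attribute(token, counts, min_count=2):
    """Return the attribute most often seen near `token`, or None when
    the evidence is below `min_count`."""
    ranked = counts.get(token.lower())
    if not ranked:
        return None
    attribute, n = ranked.most_common(1)[0]
    return attribute if n >= min_count else None

reviews = [
    "the brand canon makes great cameras",
    "canon is a brand i trust",
    "the display is lcd and bright",
]
counts = cooccurrence_counts(reviews)
```

With these three toy reviews, "canon" is seen near "brand" twice and so passes the threshold, while "lcd" is seen near "display" only once and does not.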

Answer by Mike Lischke for How to parse product titles (unstructured) into structured data?


You might have more success with a neural net to parse such free text, but you will fail with just plain text parsing, because many of the words need a context you don't have.

However, depending on the level of precision you want to achieve, you can come up with a partial solution (which then requires human post-processing). Alternatively, enforce at least a minimum structure on the input (for example, product names must always follow a certain pattern). That gives you a much better start, because you can identify the product more reliably, which should provide enough context to understand the remaining input.

There's definitely no 100% solution possible (not even with a neural net), I guess.

Answer by George-Bogdan Ivanov for How to parse product titles (unstructured) into structured data?


I agree there is no 100% success method. A possible approach would be to train a custom NER (Named Entity Recognition) model on some manually annotated data. The labels would be BRAND, MODEL, and TYPE. A common way to filter out model names and brands is to use a dictionary, since brands and models are usually non-dictionary words.
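
A minimal sketch of the dictionary filter: tokens that are not found in a list of ordinary words are kept as brand/model candidates. The word set below is a tiny stand-in invented for this example; in practice you would load a full word list.

```python
# Hypothetical stand-in for a real dictionary of ordinary words.
COMMON_WORDS = {"camera", "zoom", "digital", "with", "black", "lens", "new"}

def candidate_brand_model_tokens(title, vocabulary=COMMON_WORDS):
    """Keep the tokens that are NOT ordinary dictionary words."""
    return [tok for tok in title.split() if tok.lower() not in vocabulary]

candidates = candidate_brand_model_tokens("Canon D1000 Digital Camera with Zoom")
```

One caveat: some brand names are also ordinary words ("canon" is a common noun in English), so a full dictionary would filter them out. The heuristic works best combined with a list of known brands.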

Answer by Jirka-x1 for How to parse product titles (unstructured) into structured data?


Since you have a lot of training data (I assume you have many pairs of title + structured JSON specification), I would try to train a Named Entity Recognizer.

For example, you can train the Stanford NER. See this FAQ entry explaining how to do it. Obviously, you will have to fiddle with the parameters as product titles are not exactly sentences.

You will need to prepare the training data, but that should not be hard. You need two columns, word and answer, and you can add the tag column (though I am not sure how accurate a standard POS tagger would be, as this is rather atypical text). I would simply extract the value of the answer column from the associated JSON specification. There will be some ambiguity, but I think it will be rare enough that you can ignore it.
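
A rough sketch of that data preparation step, assuming each title comes with a flat JSON specification. The function names and the sample specification are made up for illustration. The output is one token per line with a tab-separated label, and "O" is used for tokens that match no attribute value.

```python
import json

def title_to_rows(title, spec, default_label="O"):
    """Label each title token with the attribute whose value contains it.
    On ambiguity the first matching attribute wins."""
    value_to_label = {}
    for attribute, value in spec.items():
        label = attribute.upper().replace(" ", "_")
        for word in str(value).lower().split():
            value_to_label.setdefault(word, label)
    return [(tok, value_to_label.get(tok.lower(), default_label))
            for tok in title.split()]

def rows_to_tsv(rows):
    """Two columns per line: word, then answer."""
    return "\n".join(f"{word}\t{label}" for word, label in rows)

# Hypothetical specification paired with the title from the question.
spec = json.loads('{"brand": "canon", "model number": "d1000", '
                  '"zoom": "2X", "display type": "LCD"}')
rows = title_to_rows("Canon D1000 4MP Camera 2X Zoom LCD", spec)
training_text = rows_to_tsv(rows)
```

Titles would be written one after another into the training file, typically separated by a blank line so that each title is treated as its own sequence.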

Answer by Alex Nevidomsky for How to parse product titles (unstructured) into structured data?


Having developed a commercial analyzer of this kind, I can tell you that there is no easy solution for this problem. But there are multiple shortcuts, especially if your domain is limited to cameras/electronics.

Firstly, you should look at more sites. Many have the product brand annotated in the page (proper HTML annotations, bold font, all caps at the beginning of the name). Some sites have entire pages with brand selectors for search purposes. This way you can create a pretty good starter dictionary of brand names. The same goes for product line names and even models. Alphanumeric models can be extracted in bulk by regular expressions and filtered pretty quickly.
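
As an illustration of the dictionary-plus-regex shortcut, here is a small sketch. The brand set, the list of unit suffixes, and the function name are assumptions invented for this example, not a description of any commercial analyzer.

```python
import re

# Hypothetical starter dictionary; a real one would hold thousands of brands.
BRANDS = {"canon", "nikon", "sony"}

# A token that contains at least one letter and at least one digit.
ALNUM_RE = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")

# Filter: a number followed by a unit looks like a spec, not a model.
UNIT_RE = re.compile(r"^\d+(\.\d+)?(mp|x|gb|mm|inch)$", re.IGNORECASE)

def extract_brand_and_models(title):
    """Look up the brand in the dictionary and pull out alphanumeric
    tokens that do not look like measurements."""
    brand = next((tok for tok in title.split() if tok.lower() in BRANDS), None)
    models = [m for m in ALNUM_RE.findall(title) if not UNIT_RE.match(m)]
    return {"brand": brand, "models": models}

result = extract_brand_and_models("Canon D1000 4MP Camera 2X Zoom LCD")
```

For this title the regular expression first picks up D1000, 4MP, and 2X; the unit filter then discards 4MP and 2X, leaving D1000 as the only model candidate.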

There are plenty of other tricks, but I'll try to be brief. Just a piece of advice here: there is always a trade-off between manual work and algorithms. Always keep in mind that both approaches can be mixed and that both have return-on-invested-time curves, which people tend to forget. If your goal is not to create an automatic algorithm to extract product brands and models, this problem should have a limited time budget in your plan. You can realistically create a dictionary of 1000 brands in a day, and for decent performance on a known data source of electronic goods (we are not talking Amazon here, or are we?) a dictionary of 4000 brands may be all you need for your work. So do the math before you invest weeks into the latest neural network named entity recognizer.

