Cluster words with SOM

Posted: Fri Jun 17, 2011 2:02 am
by ftatarli
Hello,

I'm looking for a way to cluster words. I tried the following, but the results aren't very good. I imagine the problem is the way I represent the words for the Kohonen network.

For example, say there are 10 words. I convert these words to their ASCII representation. The problem is that the words have different lengths, so I take the length of the longest word and pad the other words with 0 up to that length. But this seems to influence the SOM, and the clustering isn't very good.

What better approach could I use for this?

Sorry for my poor English.

Thanks,

Filipe

Re: Cluster words with SOM

Posted: Fri Jun 17, 2011 9:36 am
by andrew.kirillov
Hello,

ftatarli wrote: But this seems to influence the SOM, and the clustering isn't very good.

Yes, I can imagine. If you set the maximum number of characters to 10, for example, and give the SOM words like "bad" and "man", then from the network's standpoint these 2 words are 80% identical: they share the character "a" and 7 padding spaces.
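To make that arithmetic concrete, here is a minimal Python sketch of the padded encoding. The names encode and shared_positions are made up for illustration, and padding with 0 (as in the original post) instead of spaces is an assumption; the fraction comes out the same either way.

```python
def encode(word, max_len=10, pad=0):
    """Encode a word as a fixed-length list of ASCII codes, padded on the right."""
    codes = [ord(ch) for ch in word[:max_len]]
    return codes + [pad] * (max_len - len(codes))


def shared_positions(a, b, max_len=10):
    """Fraction of vector positions where the two padded encodings are equal."""
    va, vb = encode(a, max_len), encode(b, max_len)
    return sum(x == y for x, y in zip(va, vb)) / max_len


print(encode("bad"))                   # [98, 97, 100, 0, 0, 0, 0, 0, 0, 0]
print(shared_positions("bad", "man"))  # 0.8
```

Only 3 of the 10 positions carry any information about these words, so the padding dominates whatever distance the network computes.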

I am not sure what you are trying to solve with the SOM, but maybe you should try some other approaches. Start with Levenshtein distance, which is a metric for measuring the amount of difference between two words. Then you may look into semantics, for example.
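For reference, a minimal sketch of Levenshtein distance in Python, using the standard dynamic-programming recurrence and keeping only two rows in memory. The function name is just illustrative; it is not taken from any particular library.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter word in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("bad", "man"))         # 2
```

Unlike the padded encoding, this does not depend on a fixed vector length: "bad" and "man" come out 2 edits apart on 3 characters, with no padding involved.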

Re: Cluster words with SOM

Posted: Fri Jun 17, 2011 1:01 pm
by ftatarli
Hello,

andrew.kirillov wrote: I am not sure what you are trying to solve with the SOM.


I have roughly 40k different words, and I'm looking for a way to cluster them in order to reduce the dimensionality of this bag of words.

I believe a semantic approach would probably give the best results, but the difficulty is high too, so I intend to go step by step and learn different ways of solving my problem.

I will research Levenshtein distance, thanks for the help.

But I'll still ask: does anyone know a way to "normalize" the length of the words so they can be used with a SOM?

So thanks for the help; I will try the Levenshtein approach and post the results.

Filipe