AForge.NET

Cluster words with SOM

Postby ftatarli » Fri Jun 17, 2011 2:02 am

Hello,

I'm looking for a way to cluster words. I tried the following, but the results aren't very good. I suspect the problem is the way I represent the words for the Kohonen network.

For example, say there are 10 words. I convert each word to its ASCII representation. The problem is that the words have different lengths, so I take the length of the longest word and pad the other words with 0 up to that length. But this seems to influence the SOM, and the clustering isn't very good.
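
A minimal sketch of the encoding described above, assuming Python for illustration. The function name `encode_word` and the sample words are hypothetical and not taken from the original post; the thread does not show the poster's actual code.

```python
def encode_word(word, length):
    """Return the ASCII codes of `word`, right-padded with zeros to `length`."""
    codes = [ord(ch) for ch in word]
    return codes + [0] * (length - len(codes))


# Illustrative words only (an assumption, not the poster's data).
words = ["cat", "house", "network"]

# Pad every word to the length of the longest one.
max_len = max(len(w) for w in words)
vectors = [encode_word(w, max_len) for w in words]

for word, vector in zip(words, vectors):
    print(word, vector)
```

Every vector ends up with the same length, so it can be fed to a network with a fixed number of inputs, but short words consist mostly of padding zeros.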

What better approach could I use for this?

Sorry for my poor English.

Thanks,

Filipe
ftatarli
 
Posts: 2
Joined: Fri Jun 17, 2011 1:42 am

Re: Cluster words with SOM

Postby andrew.kirillov » Fri Jun 17, 2011 9:36 am

Hello,

ftatarli wrote: But this seems to influence the SOM, and the clustering isn't very good.

Yes, I can imagine. If the maximum length is set to 10 characters, for example, and you provide words like "bad" and "man" to the SOM, then from the network's standpoint these 2 words have 80% identical characters: the character "a" and 7 padding characters.
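
The 80% figure can be checked with a short sketch, again assuming Python. The helper `shared_fraction` is a hypothetical name, and the NUL character stands in for the zero padding; neither comes from the thread.

```python
def shared_fraction(a, b, length, pad="\0"):
    """Fraction of positions at which two padded words hold the same character."""
    a_padded = a.ljust(length, pad)
    b_padded = b.ljust(length, pad)
    same = sum(1 for x, y in zip(a_padded, b_padded) if x == y)
    return same / length


# "bad" vs "man" padded to 10 characters: "a" matches, plus 7 padding positions.
fraction = shared_fraction("bad", "man", 10)
print(fraction)  # 0.8
```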

I am not sure what you are trying to solve with SOM, but maybe you should try some other approaches. Start with Levenshtein distance, which is a metric for measuring the amount of difference between two words. Then you may look into semantics, for example.
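
For illustration, here is a minimal sketch of Levenshtein distance in Python, using the standard dynamic-programming formulation with two rows. It is not code from AForge.NET, and the function name `levenshtein` is an assumption made for this example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string `a` into string `b`."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("bad", "man"))         # 2
print(levenshtein("kitten", "sitting"))  # 3
```

Unlike the padded encoding, this measure does not depend on a fixed word length: "bad" and "man" differ by two substitutions regardless of how long the longest word in the set is.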
With best regards,
Andrew


andrew.kirillov
Site Admin, AForge.NET Developer
 
Posts: 3451
Joined: Fri Jan 23, 2009 9:12 am
Location: UK

Re: Cluster words with SOM

Postby ftatarli » Fri Jun 17, 2011 1:01 pm

Hello,

andrew.kirillov wrote: I am not sure what you are trying to solve with SOM.


I have roughly 40k different words, and I'm looking for a way to cluster these words in order to reduce the dimensionality of this bag of words.

I believe a semantic approach would probably give the best result, but the difficulty is high too, so I intend to proceed in steps and learn different ways to solve my problem.

I will research Levenshtein distance; thanks for the help.

But I would still like to know if anyone has a way to "normalize" the length of the words for use with SOM.

So thanks for the help. I will try the Levenshtein approach and post the results.

Filipe
ftatarli
 
Posts: 2
Joined: Fri Jun 17, 2011 1:42 am



