I have an improvement to the Tokenizer

Topics: Developer Forum, Project Management Forum
Mar 16, 2012 at 8:42 PM

I realize this project is not currently active, but I found it and I use it.

I noticed CleverHuman tried to revive it but doesn't seem to have made much progress. I tried using the IKVM Java tool (very nice), but it didn't address my issue. (The same tokenizing issue appears in the latest OpenNLP code.)

So, I have two things to offer. First, I have updated the code to .NET 4 using StyleCop rules.

Second, I have added a function to the span tokenizer that, given an input array of chars, separates those characters out of the input stream and creates spans for them. This makes sure the tokenizer handles grouping characters (parentheses, brackets, and curly braces) correctly and makes them their own tokens.
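To illustrate the idea, here is a minimal sketch of that kind of span splitting. It is not the actual change to the tokenizer, and it is written in Java only because the same issue shows up in OpenNLP. The names GroupingSplitter, Span, and split are made up for this example, and spans are assumed to be half-open [start, end) character ranges into the input text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of SharpNLP or OpenNLP.
final class GroupingSplitter {

    /** Half-open character range [start, end) into the input text. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ".." + end + ")";
        }
    }

    /** Returns true if c is one of the caller-supplied separator characters. */
    private static boolean isSeparator(char c, char[] separators) {
        for (char s : separators) {
            if (s == c) {
                return true;
            }
        }
        return false;
    }

    /**
     * Splits every input span so that each separator character gets a
     * one-character span of its own. Text between separators keeps its
     * original offsets.
     */
    static List<Span> split(String text, List<Span> spans, char[] separators) {
        List<Span> out = new ArrayList<>();
        for (Span s : spans) {
            int tokenStart = s.start;
            for (int i = s.start; i < s.end; i++) {
                if (isSeparator(text.charAt(i), separators)) {
                    if (i > tokenStart) {
                        out.add(new Span(tokenStart, i)); // text before the separator
                    }
                    out.add(new Span(i, i + 1));          // the separator itself
                    tokenStart = i + 1;
                }
            }
            if (tokenStart < s.end) {
                out.add(new Span(tokenStart, s.end));     // trailing text, if any
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "f(x) [ok]";
        List<Span> input = new ArrayList<>();
        input.add(new Span(0, 4)); // "f(x)"
        input.add(new Span(5, 9)); // "[ok]"
        for (Span s : split(text, input, "()[]{}".toCharArray())) {
            System.out.println(s + " " + text.substring(s.start, s.end));
        }
    }
}
```

Running this as a post-processing pass over the spans the tokenizer already produces would leave the model itself untouched, which is one reason to do it that way rather than retrain.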

If anyone is interested in this functionality, I will update the MaximumEntropyTokenizer with the addition.

Please let me know.

Jul 30, 2012 at 3:47 PM

I know this reply is coming a little late, but I would most definitely be interested in any updates that bring this bad boy up to snuff.