I realize this project is not currently active, but I found it and use it.
I noticed CleverHuman tried to revive it but doesn't seem to have made much progress. I tried using the IKVM Java tool (very nice), but it didn't address my issue. (The same tokenizing issue appears in the latest OpenNLP code.)
So, I have two things to offer. I have updated the code to .NET 4 using StyleCop rules.
I have also added a function to the span tokenizer that, given an input array of chars, separates those characters out of the input stream and creates spans for them. This makes sure that the tokenizer handles grouping characters (parentheses, brackets, and curly braces) correctly and makes them their own tokens.
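
For illustration only, here is a rough sketch of the kind of splitting described above, written in Java (the language of OpenNLP). It is not the actual SharpNLP code: the GroupingSpanSplitter class, its Span type, and all method names are hypothetical, and a simple whitespace pass stands in for the real span tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: gives each separator character its own span. */
public class GroupingSpanSplitter {

    /** Half-open character range [start, end) into the input text. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Stand-in for the real tokenizer: one span per whitespace-delimited chunk. */
    static List<Span> whitespaceSpans(String text) {
        List<Span> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a chunk"
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (start >= 0) {
                    spans.add(new Span(start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        if (start >= 0) {
            spans.add(new Span(start, text.length()));
        }
        return spans;
    }

    /** Splits each span so that every separator character becomes its own span. */
    static List<Span> split(String text, List<Span> spans, char[] separators) {
        List<Span> result = new ArrayList<>();
        for (Span span : spans) {
            int tokenStart = span.start;
            for (int i = span.start; i < span.end; i++) {
                if (isSeparator(text.charAt(i), separators)) {
                    if (i > tokenStart) {
                        result.add(new Span(tokenStart, i)); // text before the separator
                    }
                    result.add(new Span(i, i + 1)); // the separator itself
                    tokenStart = i + 1;
                }
            }
            if (tokenStart < span.end) {
                result.add(new Span(tokenStart, span.end)); // trailing text
            }
        }
        return result;
    }

    private static boolean isSeparator(char c, char[] separators) {
        for (char s : separators) {
            if (s == c) {
                return true;
            }
        }
        return false;
    }

    /** Convenience wrapper: returns the text covered by each final span. */
    static List<String> tokenize(String text, char[] separators) {
        List<String> tokens = new ArrayList<>();
        for (Span span : split(text, whitespaceSpans(text), separators)) {
            tokens.add(text.substring(span.start, span.end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[] grouping = {'(', ')', '[', ']', '{', '}'};
        // prints [foo, (, bar, [, baz, ], ), {, x, }]
        System.out.println(tokenize("foo(bar [baz]) {x}", grouping));
    }
}
```

Taking the separator characters as a parameter, rather than hard-coding them, means the same pass could be reused for other characters that should always stand alone.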
If anyone is interested in such functionality, I will update the MaximumEntropyTokenizer with the addition.
Please let me know.