How can I create an .nbin file from data?

Topics: Developer Forum, Project Management Forum
Feb 3, 2011 at 9:37 PM

Hi all,

I have found SharpNLP very useful, and it has helped me a lot. I'm working on statistical machine translation based on a maximum entropy model. When I use tokenization, for example, I call:

EnglishMaximumEntropyTokenizer tokenizer = new EnglishMaximumEntropyTokenizer(mModelPath + "EnglishTok.nbin");

The file "EnglishTok.nbin" contains training data.

My question is: what is the content of this file, and how can I create a new one? With SharpEntropy? If so, how?

Thank you for reading.

Feb 3, 2011 at 10:18 PM

The *.nbin files are trained models for the various tasks that SharpNLP can perform (such as tokenizing, or detecting parts of speech.) The files are in a custom format created by the BinaryGisModelWriter class (located in \SharpNLP\SharpEntropy\IO.)

You can read the content of this file using (you guessed it!) the BinaryGisModelReader.
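A minimal sketch of loading a model with that reader. It assumes the BinaryGisModelReader(string) and GisModel(reader) constructors behave as shown in the SharpEntropy article on CodeProject; the NbinLoadSketch class and its Load method are hypothetical names used only for illustration, so check the signatures against your copy of the source:

```csharp
public static class NbinLoadSketch
{
    // Loads a trained GIS model from a SharpEntropy .nbin file.
    // Assumed API: BinaryGisModelReader takes the model file name,
    // and GisModel can be constructed from that reader.
    public static SharpEntropy.GisModel Load(string modelFile)
    {
        SharpEntropy.IO.BinaryGisModelReader reader =
            new SharpEntropy.IO.BinaryGisModelReader(modelFile);
        return new SharpEntropy.GisModel(reader);
    }
}
```

The returned model holds the outcomes and weights learned during training; it is not human-readable text, which is why the file looks opaque in an editor.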

As for updating these files and training new data, I was about to say you can just grab the newer files from the OpenNLP project (the models are here), but sadly the two file formats are not compatible. Here is Richard's summary comment from the BinaryGisModelWriter class:

	/// A writer for GIS models that saves models in a binary format. This format is not the one
	/// used by the <see cref="SharpEntropy.IO.JavaBinaryGisModelWriter">java version of MaxEnt</see>.
	/// It has two main differences, designed for performance when loading the data
	/// from file: first, it uses big endian data values, which is native for C#, and secondly it
	/// encodes the outcome patterns and values in a more efficient manner.

There is a JavaBinaryGisModelReader class in \SharpNLP\SharpEntropy\IO, so the new files from the OpenNLP project referenced above MAY work, but I am not sure. If OpenNLP's file format has not changed, then I imagine it will work.

As for modifying a GIS model, or training your own, the OpenNLP documentation is probably the best place to start: http://maxent.sourceforge.net/howto.html

I think the biggest issue here will be getting good training data, though it may suffice just to use the new OpenNLP models.

Feb 4, 2011 at 8:51 AM

Thank you, Patrick, for the answer; it's very helpful.

I read the tutorial on CodeProject and created a new project in Visual Studio to understand how it works. I found that the training data contains strings and numbers (calculated from the maximum entropy formula).

If we take the example of umbrella, the data file (plain text) contains:

Warm Dry No_Umbrella
Cold Dry No_Umbrella
Cold Rainy Umbrella
Cold Dry Umbrella
Warm Dry No_Umbrella

The code to create a model is:

System.IO.StreamReader trainingStreamReader = new System.IO.StreamReader(trainingDataFile);
SharpEntropy.ITrainingEventReader eventReader = new SharpEntropy.BasicEventReader(new SharpEntropy.PlainTextByLineDataReader(trainingStreamReader));
SharpEntropy.GisTrainer trainer = new SharpEntropy.GisTrainer();
trainer.TrainModel(eventReader);
mModel = new SharpEntropy.GisModel(trainer);
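To turn a model trained this way into an .nbin file, a sketch along the following lines may work. It assumes BinaryGisModelWriter has a parameterless constructor and a Persist(model, fileName) method, and that Evaluate and GetBestOutcome behave as in the CodeProject article; NbinWriteSketch and its method names are hypothetical, so verify everything against the SharpEntropy source:

```csharp
using System.IO;

public static class NbinWriteSketch
{
    // Trains a GIS model from a plain-text event file (one event per line,
    // context words first, outcome last) and saves it as an .nbin file.
    public static SharpEntropy.GisModel TrainAndSave(string trainingDataFile, string modelFile)
    {
        SharpEntropy.GisModel model;
        using (StreamReader trainingStreamReader = new StreamReader(trainingDataFile))
        {
            SharpEntropy.ITrainingEventReader eventReader =
                new SharpEntropy.BasicEventReader(
                    new SharpEntropy.PlainTextByLineDataReader(trainingStreamReader));
            SharpEntropy.GisTrainer trainer = new SharpEntropy.GisTrainer();
            trainer.TrainModel(eventReader);
            model = new SharpEntropy.GisModel(trainer);
        }

        // Assumed API: Persist(model, fileName) writes the binary model file.
        SharpEntropy.IO.BinaryGisModelWriter writer =
            new SharpEntropy.IO.BinaryGisModelWriter();
        writer.Persist(model, modelFile);
        return model;
    }

    // Returns the most probable outcome for the given context words.
    public static string BestOutcome(SharpEntropy.GisModel model, params string[] context)
    {
        double[] probabilities = model.Evaluate(context);
        return model.GetBestOutcome(probabilities);
    }
}
```

For the umbrella data, NbinWriteSketch.TrainAndSave(trainingDataFile, "umbrella.nbin") followed by NbinWriteSketch.BestOutcome(model, "Cold", "Rainy") should return one of the outcomes seen in training, that is, either Umbrella or No_Umbrella.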

My question: is it the same for a machine translation model file, where the source file is, for example:

My_name_is_Jack Mon_nom_est_Jack
My mon
My ma
name nom
is est
Kate Kate
... 

Here the first field on each line is the English text and the second (the outcome) is its French translation.

If so, can I say that the best outcome (from the GetBestOutcome method) for "My name is Kate" is "Mon nom est Kate"?

Thank you.

Feb 4, 2011 at 3:03 PM

I'm not sure what a model for statistical machine translation would look like; I've not tried to do that.

It may be worth asking the question on the OpenNLP users list (http://incubator.apache.org/opennlp/mail-lists.html).

Feb 4, 2011 at 5:39 PM

Thank you for the answer, Patrick. I will try it.

I will send my question to the OpenNLP list too.