What is the best way to classify the following words in POS tagging?
I am doing POS tagging. Given the following tokens in the training set, is it better to treat each token as two tagged words, word1/postag and word2/postag, or as a single word, word1/word2/postag?
Examples (the POS tag is not required to be included):
Those tokens above are from the training set. For example, words like Interstate/Johnson might seem weird, but when I googled for "Interstate/Johnson", the first page had more "Interstate/Johnson" than "Interstate" "Johnson" as 2 separate words. On the other hand, words like "Polo/Ralph" have "Polo" "Ralph" as 2 separate words more often than they appear together as one. I am trying to build a language model, and you are right that my language model is biased toward the training set I have. What I would want to know is, with such ambiguous word1/word2 appearing in my training set. To be continued
Oct 14 ’12 at 15:52
The examples don't seem to fall into one category with respect to the use of the slash: one is a phrase acronym, 1/2 is a number, another indicates something in between the two words, etc.
I feel there is no single treatment of the component words that would work for all the cases in question, so the better option is to handle them as unique words. At the decoding stage, when the tagger will probably be presented with more previously unseen examples of such words, the decision can often be made based on the context rather than the word itself.
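To make this concrete, here is a minimal sketch in Python of keeping each slash token as one unit and giving the tagger context to decide with. The names `tokenize`, `token_features`, and the feature keys are hypothetical, chosen for illustration, and not part of any particular tagger; the feature dictionary is just the kind of input a feature-based tagger might consume.

```python
import re

# Assumption: a "slash token" is two non-space chunks joined by one "/".
SLASH_TOKEN = re.compile(r"^[^\s/]+/[^\s/]+$")


def tokenize(text):
    """Split on whitespace only, so word1/word2 stays a single token."""
    return text.split()


def token_features(tokens, i):
    """Describe tokens[i] by its own form and by its neighbours.

    The neighbouring words let a tagger decide from context when the
    slash token itself was never seen in training.
    """
    word = tokens[i]
    return {
        "word": word.lower(),
        "has_slash": bool(SLASH_TOKEN.match(word)),
        "is_numeric_slash": bool(re.fullmatch(r"\d+/\d+", word)),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }


tokens = tokenize("Add 1/2 cup of sugar")
features = token_features(tokens, 1)  # features for the token "1/2"
```

Splitting on whitespace only is the design choice that matters here: the slash token is never broken up, and the `prev`/`next` features carry the context that the decision can be based on for unseen cases.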