```python
# Step 2: Build the dictionary and replace rare words with UNK token.
import collections

vocabulary_size = 50000

def build_dataset(words, n_words):
    """Process raw inputs into a dataset."""
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary

# `vocabulary` is the list of words produced in Step 1 of the tutorial.
data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                            vocabulary_size)
```

I am working through the elementary example of vector representation of words (word2vec) in TensorFlow.

Step 2 is titled "Build the dictionary and replace rare words with UNK token", but there is no prior definition of what "UNK" refers to.

To specify the question:

0) What does UNK generally refer to in NLP?

1) What does `count = [['UNK', -1]]` mean? I know the brackets `[]` denote a list in Python, but why is `'UNK'` paired with `-1`?


Best Answer


As already mentioned in the comments, in tokenization and NLP the UNK token usually indicates an unknown word, i.e. a word that is not in the model's vocabulary.

For example, suppose you want to predict a missing word in a sentence. How would you feed your data to the model? You definitely need a token to mark where the missing word is. So if "house" is the missing word, after tokenizing the sentence will look like:

'my house is big' -> ['my', 'UNK', 'is', 'big']
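A minimal sketch of that substitution (the tiny vocabulary here is made up for illustration): any word not found in the vocabulary is replaced with the `'UNK'` token.

```python
# Hypothetical tiny vocabulary; any word outside it becomes 'UNK'.
vocab = {'my', 'is', 'big'}

def tokenize(sentence, vocab):
    """Split a sentence and replace out-of-vocabulary words with 'UNK'."""
    return [w if w in vocab else 'UNK' for w in sentence.split()]

print(tokenize('my house is big', vocab))  # ['my', 'UNK', 'is', 'big']
```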

PS: that `count = [['UNK', -1]]` is just initializing the count; each entry ends up in the form `['word', number_of_occurrences]`, as Ivan Aksamentov has already said. The `-1` is a placeholder that gets overwritten with the real UNK count later (`count[0][1] = unk_count`).
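To make that concrete, here is a condensed sketch of what the tutorial's `build_dataset` does to `count` on a made-up toy corpus: the list starts with the `['UNK', -1]` placeholder, is extended with the most common words, and the `-1` is then replaced by the actual number of out-of-vocabulary occurrences.

```python
import collections

words = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'cat']  # toy corpus
n_words = 4  # keep the 3 most common words plus 'UNK'

count = [['UNK', -1]]  # placeholder; -1 will be overwritten below
count.extend(collections.Counter(words).most_common(n_words - 1))

# Word -> integer id, with 'UNK' fixed at id 0.
dictionary = {word: i for i, (word, _) in enumerate(count)}

# Every word outside the kept vocabulary counts as 'UNK'.
unk_count = sum(1 for w in words if w not in dictionary)
count[0][1] = unk_count  # replace the -1 placeholder

print(count[0])  # ['UNK', 2]  ('on' and 'mat' fell out of the vocabulary)
```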