As far as I understand, the word_tokenize function in nltk takes a sentence represented as a string and returns a list of all its words:

>>> from nltk import word_tokenize, wordpunct_tokenize
>>> s = ("Good muffins cost $3.88\nin New York. Please buy me\n"
...      "two of them.\n\nThanks.")
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

However, in my program it's important to keep the spaces for further computation, so I would rather have word_tokenize return something like this:

['Good', ' ', 'muffins', ' ', 'cost', ' ', '$', '3.88', ' ', 'in', ' ', 'New', ' ', 'York.', ' ', 'Please', ' ', 'buy', ' ', 'me', ' ', 'two', ' ', 'of', ' ', 'them', '.', 'Thanks', '.' ]

How can I change/replace/tweak word_tokenize to accomplish this?


Best Answer


You can break this task into two steps:

Step 1: Take the string and split it on whitespace

Step 2: Tokenize each piece (as split in step 1) using word_tokenize, and re-insert a ' ' between the pieces' token lists

>>> import itertools
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\n"
>>> ll = [[word_tokenize(w), ' '] for w in s.split()]
>>> list(itertools.chain(*list(itertools.chain(*ll))))
['Good', ' ', 'muffins', ' ', 'cost', ' ', '$', '3.88', ' ', 'in', ' ',
'New', ' ', 'York', '.', ' ', 'Please', ' ', 'buy', ' ', 'me', ' ']
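Note that this also appends a ' ' after the last chunk (the trailing ' ' after 'me' above), and, like any approach based on str.split(), it collapses runs of whitespace (including newlines) into a single ' ' token. If the trailing space is a problem, you can wrap the same idea in a small helper; below is a minimal sketch, assuming only word_tokenize from nltk (tokenize_with_spaces is a hypothetical name, not an nltk function):

from nltk import word_tokenize

def tokenize_with_spaces(text):
    # Tokenize each whitespace-separated chunk, then interleave a single
    # ' ' token between the chunks' token lists (no trailing space).
    chunks = [word_tokenize(w) for w in text.split()]
    out = []
    for i, tokens in enumerate(chunks):
        out.extend(tokens)
        if i < len(chunks) - 1:  # skip the space after the last chunk
            out.append(' ')
    return out

>>> tokenize_with_spaces("Good muffins cost $3.88\nin New York.")
['Good', ' ', 'muffins', ' ', 'cost', ' ', '$', '3.88', ' ', 'in', ' ', 'New', ' ', 'York', '.']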