I want to clean a list of more than 1 million words.
The base data was ratings given in the form of sentences. I broke these down into words with subfield(), using ' ' (a single space) as the separator, but I get words with commas, question marks, or other signs attached. I need just the words, because I need to count their frequency.
is considered as 3 different words, but it is only one. How can I remove all these signs around the words?
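
One possible way to picture the cleanup being asked for, sketched in Python rather than in the tool that provides subfield(): split each sentence on a single space, strip punctuation from both ends of every token, then count. The names `clean_word` and `word_frequencies` and the sample ratings are hypothetical, and lowercasing the words is an extra assumption that the question does not ask for.

```python
import string
from collections import Counter


def clean_word(token: str) -> str:
    """Strip ASCII punctuation from both ends of a token.

    Lowercasing is an assumption, added so that "Good" and "good"
    are counted as the same word.
    """
    return token.strip(string.punctuation).lower()


def word_frequencies(sentences):
    """Count cleaned words across an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        # Split on a single space, mirroring the ' ' separator in the question.
        for token in sentence.split(" "):
            word = clean_word(token)
            if word:  # skip tokens that were empty or only punctuation
                counts[word] += 1
    return counts


# Hypothetical sample data, not taken from the original ratings.
ratings = ["Good product, good price!", "Is it good?"]
frequencies = word_frequencies(ratings)
print(frequencies["good"])
```

Only characters at the edges of a token are removed, so an apostrophe inside a word such as "don't" is kept. `string.punctuation` covers ASCII signs only, so typographic quotes or other non-ASCII marks would need to be added to the set of stripped characters.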