That is what my script is based on. I'm looking at how I can modify the script so that it stops collecting things such as date timestamps and stop words into the WordTuple. I'd also like to use trigrams to identify recurring terms, so that I can use this data for topic modeling later.
Due to the sensitivity of my data I'm unable to share it here.
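
To make the goal concrete, here is a minimal sketch of the filtering and trigram counting described above, using made-up sample text. Everything in it is an assumption for illustration: the names `STOP_WORDS`, `TIMESTAMP_RE`, `clean_tokens` and `trigram_counts` are hypothetical and not taken from the script, the stop-word list is deliberately tiny, and the timestamp patterns only cover a few common shapes (`2021-03-04`, `04/03/2021`, `12:30` or `12:30:59`).

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list (assumption: a real list
# would be much larger, e.g. taken from an NLP library).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "on", "at", "for", "was"}

# Assumed timestamp shapes: ISO-style dates, slash dates, and clock times.
TIMESTAMP_RE = re.compile(
    r"\b\d{4}-\d{2}-\d{2}\b"          # 2021-03-04
    r"|\b\d{1,2}/\d{1,2}/\d{2,4}\b"   # 04/03/2021
    r"|\b\d{1,2}:\d{2}(?::\d{2})?\b"  # 12:30 or 12:30:59
)

# Only alphabetic tokens are kept, so leftover numbers are dropped too.
WORD_RE = re.compile(r"[a-z]+")


def clean_tokens(text):
    """Lowercase the text, blank out timestamps, and drop stop words."""
    text = TIMESTAMP_RE.sub(" ", text.lower())
    return [tok for tok in WORD_RE.findall(text) if tok not in STOP_WORDS]


def trigram_counts(tokens):
    """Count every run of three consecutive tokens."""
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


if __name__ == "__main__":
    sample = ("2021-03-04 12:30:59 the server restart failed "
              "and the server restart failed again")
    counts = trigram_counts(clean_tokens(sample))
    # Show the most frequent trigrams first.
    for trigram, n in counts.most_common(3):
        print(n, " ".join(trigram))
```

Note that the trigrams here are built after stop words are removed, so a trigram can span words that were not adjacent in the original text. If that is not wanted, the trigrams could be built first and then filtered.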