Computer Science Department
TxtRP
Jacob Dickens
TxtRP (pronounced "Texter-P") is a predictive text generator built on a machine learning model trained with the Word2Vec framework to finish sentences and generate new strings. It is an interactive system that uses both public-domain training data and the user's own data to predict new phrases. The tool brainstorms for the user, lessens the mental strain of writing literary prose, and helps with worldbuilding. TxtRP is a useful tool for any writer, and its simple GUI provides plenty of information to get started.
This project was a wonderful adventure into two of my hobbies: programming and story crafting. The research behind TxtRP was a journey, to say the least. While scouring the many Python library APIs, I focused on ease of use for the writer; as a result, the program is tuned to work with simple plaintext, which lets users train their own models.
The program consists of several scripts that work in tandem to deliver a finished result to the user. TxtRP reads text from plaintext files and tries to understand it through rigorous calculations. It does not attempt to recognize special characters or whitespace; it focuses only on the main body of text, the corpus. From the corpus, words are tokenized, meaning they are split into individual words and given part-of-speech tags. These tokens are then vectorized, that is, assigned vector values that, in practice, pin words closer to each other when they frequently appear together in the text. The tokens are then given to the Word2Vec framework, a shallow neural network with a large setup overhead, to learn from. After setup and a bit of training, the model outputs one word at a time, and these outputs can be chained to create full sentences, all dependent on the user's input. A sketch of this pipeline is shown below.
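The following is a minimal sketch of the pipeline described above, not TxtRP's actual code. It assumes the gensim implementation of Word2Vec and NLTK for tokenization and part-of-speech tagging; the file name, training parameters, seed word, and the most_similar-based word chaining are illustrative assumptions about how such a generator could be wired together.

    # Hypothetical sketch: plaintext corpus -> tokens -> Word2Vec -> chained words.
    import nltk
    from gensim.models import Word2Vec

    # Tokenizer and tagger resources used by NLTK.
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    # Read the corpus from a plaintext file (hypothetical path).
    with open("corpus.txt", encoding="utf-8") as f:
        raw = f.read()

    # Tokenize: split the corpus into sentences, then into words,
    # and attach a part-of-speech tag to each token (the tagging step
    # is shown here only to mirror the description above).
    sentences = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(raw)]
    tagged = [nltk.pos_tag(s) for s in sentences]

    # Train a shallow Word2Vec network on the tokenized sentences;
    # vector_size and window control how words that appear together
    # are pinned near each other in the embedding space.
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=20)

    # Chain predictions one word at a time: repeatedly take the word
    # most similar to the current one and append it to the output.
    def finish_sentence(seed, length=8):
        words = [seed]
        for _ in range(length):
            nearest = model.wv.most_similar(words[-1], topn=1)[0][0]
            words.append(nearest)
        return " ".join(words)

    # The seed must be a word that appears in the corpus.
    print(finish_sentence("dragon"))

Note that Word2Vec itself only learns word embeddings; chaining nearest neighbours as shown here is just one simple way to turn those embeddings into a word-by-word prediction loop.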