Analytics Super Specialization Project


Detecting the author of a speech using NLP (Natural Language Processing)



In politics, “One news story can change a vote, one vote can change the result, and one result can impact a nation’s future to any extreme”. In the era of media-driven politics, fake stories can adversely change people’s opinions. During the super specialization, our team was assigned to build a model to tackle one such problem. We had sample speeches by Mr. Obama and Mr. Trump, and our task was to identify which of the two was the actual author of a given speech. We started by generating a word cloud for each author to observe visual patterns in their vocabulary.
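A word cloud sizes each word by how often it occurs, so the underlying data is a per-author word frequency table. The snippet below is a minimal sketch of that counting step using only the Python standard library. The `top_words` helper and the two sample strings are hypothetical stand-ins, not the project's actual code or speeches.

```python
import re
from collections import Counter


def top_words(text, n=3):
    """Return the n most frequent lowercase word tokens in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(n)


# Hypothetical stand-in snippets, not the project's actual speeches.
samples = {
    "author_a": "we can do this and we will do this together",
    "author_b": "it will be great and it will be huge",
}

# Map each author to their most frequent words and counts.
frequencies = {author: top_words(text) for author, text in samples.items()}
```

A table like `frequencies` could then be passed to a plotting library of choice to draw the actual cloud.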


We built the model in three phases (Naive Bayes – LSTM – LSTM with dropout). During pre-processing we kept stop words, as they had a large impact on author detection, and performed lower-case conversion, punctuation removal, tokenization, tagging and lemmatization. For Naive Bayes we used BernoulliNB(), but the model had very low recall for Mr. Trump because the samples were clearly imbalanced between the two authors. To overcome this we moved to an LSTM model and tuned several hyperparameters. From there we added dropout to the LSTM to reduce overfitting.
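The first pre-processing steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's actual pipeline: the `preprocess` and `binary_features` helpers and the small vocabulary are hypothetical, tagging and lemmatization are left out because they need an NLP library, and `string.punctuation` only covers ASCII punctuation. The binary presence vector is the kind of input a Bernoulli Naive Bayes model works with.

```python
import string


def preprocess(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace.

    Stop words are deliberately kept, as described in the text.
    """
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def binary_features(tokens, vocabulary):
    """Mark each vocabulary word as present (1) or absent (0)."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical vocabulary and sentence, for illustration only.
vocabulary = ["the", "we", "great", "together"]
tokens = preprocess("We will do this, together!")
features = binary_features(tokens, vocabulary)
```

With these inputs, `tokens` is `['we', 'will', 'do', 'this', 'together']` and `features` is `[0, 1, 0, 1]`.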



Finally, we created a model that detects the author of a speech with an F-score of 0.97 and 96.72% accuracy. We also tested the model on a real-world document, and it passed the test. The key takeaways for me were how strongly hyperparameter tuning affects model quality and the importance of time management in completing a task. So, “No more fake stories on our model’s watch”.
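To show why accuracy alone can hide low recall on imbalanced samples, and why an F-score is worth reporting next to it, here is a small sketch that computes these metrics by hand. The labels are made up for illustration and are not the project's data or results. The function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels for illustration only: 8 speeches by one author, 2 by the other.
y_true = ["obama"] * 8 + ["trump"] * 2
y_pred = ["obama"] * 9 + ["trump"] * 1

overall_accuracy = accuracy(y_true, y_pred)
trump_precision, trump_recall, trump_f1 = precision_recall_f1(y_true, y_pred, "trump")
```

In this toy case the accuracy is 0.9, yet recall for the minority class is only 0.5 and its F1 score is about 0.67, so the per-class numbers tell a more honest story than accuracy does.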


Piravin Kanth Raman