Imagine if you could predict the movement of stocks by analyzing people’s sentiments. That is precisely what we attempted in this project. We had two datasets: an Excel file with the seven numeric factors used to predict the alpha signals, and a JSON file containing tweets along with a sentiment score for each tweet.
The JSON file was huge, containing more than a million tweets, and processing it was a real challenge. For the first few days the team concentrated on the numeric data to classify alpha signals. We tried several methods: Logistic Regression, Decision Tree, Random Forest, KNN, and Naïve Bayes. In the end we chose the Random Forest model. It overfitted at first, but after tuning it with grid-search cross-validation we reached an F1 score of 0.62, which, by the way, was our evaluation metric.
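The tuning step can be sketched with scikit-learn as follows. This is a minimal sketch, not our actual pipeline: the data here is synthetic (standing in for the seven numeric factors and the alpha-signal label), and the parameter grid is illustrative, not the one we searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the seven numeric factors and binary alpha signal.
X, y = make_classification(n_samples=500, n_features=7, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Illustrative grid: limiting tree depth and raising the minimum leaf size
# are common ways to rein in an overfitting Random Forest.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [4, 8],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",  # F1 was our evaluation metric
    cv=5,
)
search.fit(X_train, y_train)

preds = search.best_estimator_.predict(X_test)
print("held-out F1:", round(f1_score(y_test, preds), 2))
```

Scoring the cross-validation on F1 rather than accuracy matters here, since alpha-signal classes can be imbalanced and accuracy would reward the majority-class guess.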
The next step was processing the tweets, and believe me, my friend, it was no cakewalk. The team was new to working with the JSON format, and as already mentioned, the data was huge. The pre-processing step consisted of removing URLs, hashtags, mentions, retweet markers, numbers, and emojis from each tweet. The cleaned text was then tokenized, and a stacked LSTM model with a dropout of 0.4 was used to predict the sentiments.
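The cleaning rules can be sketched with regular expressions. This is a simplified illustration, not our exact code: the emoji pattern covers only the common Unicode blocks, and the real rules we applied may have differed in detail.

```python
import re

def clean_tweet(text: str) -> str:
    """Strip URLs, retweet markers, mentions, hashtags, numbers and emojis."""
    text = re.sub(r"http\S+|www\.\S+", "", text)    # URLs
    text = re.sub(r"\bRT\b", "", text)              # retweet marker
    text = re.sub(r"[@#]\w+", "", text)             # mentions and hashtags
    text = re.sub(r"\d+", "", text)                 # numbers
    # Common emoji blocks only; a full emoji regex is considerably longer.
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)
    return " ".join(text.split())                   # collapse whitespace

print(clean_tweet("RT @trader: $AAPL up 3% today!! 🚀 see https://t.co/xyz #stocks"))
```

The order of the substitutions matters a little: URLs are removed before hashtags so that fragments of a link are not mistaken for tokens, and whitespace is collapsed last.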
The predicted sentiments were merged with the numeric factors, based on ticker and date, as a new factor. A Random Forest model was then built again on the combined data to predict the alpha signals.
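The merge step can be sketched with pandas. This is a minimal sketch under assumed names: the columns `ticker`, `date`, `factor1`, and `sentiment`, and the choice to average sentiment per ticker per day, are placeholders rather than our actual schema.

```python
import pandas as pd

# Hypothetical daily numeric factors (one of the seven shown for brevity).
factors = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT"],
    "date": ["2020-01-02", "2020-01-03", "2020-01-02"],
    "factor1": [0.12, -0.05, 0.08],
})

# Hypothetical per-tweet predicted sentiment scores.
tweets = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT", "AAPL"],
    "date": ["2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03"],
    "sentiment": [0.9, 0.5, -0.2, 0.1],
})

# Aggregate tweet-level scores to one value per ticker per day.
daily_sentiment = tweets.groupby(["ticker", "date"], as_index=False)["sentiment"].mean()

# Left join keeps every factor row; sentiment becomes the new feature column.
merged = factors.merge(daily_sentiment, on=["ticker", "date"], how="left")
print(merged)
```

A left join is the safe default here: days with factor data but no tweets survive the merge (with a missing sentiment to impute later) instead of silently dropping out of the training set.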
Personally, this project taught me how to handle large amounts of data. It also broke stereotypes I had, such as the idea that Naïve Bayes is the only model that works for text, or that LSTMs are used only for time series. If I had to sum up our project in one word it would be ‘challenging’, but what we learned from it was well worth the effort.
PGDM Analytics Student at IFIM Business School