Machine Learning: Staying on Topic

Matt Schwartz
June 30, 2020

I started SocialSentiment.io with a somewhat simplistic machine learning algorithm. I defined a recurrent neural network to perform sentiment analysis of short texts from social media. Its purpose, and therefore its training set, is focused on the topic of stocks, companies, and their products. It returned only one floating point value representing a single prediction for each string. I quickly fell into a common ML natural language processing trap: new text which is off-topic returns unpredictable and unhelpful results. Worse yet, the results don't directly indicate that the analyzed text is off-topic.

Examples of the Problem

Search a social network such as Twitter for references to the Intel Corporation. Many posts, probably most, refer to the company as "Intel" and not "Intel Corporation" or "Intel Corp". Therefore you're going to search for the word "Intel".

Here are some posts that recently came back which are on-topic:

    What does Apple dumping Intel mean for Mac users?

    So turns out, I have been running all the games from the Intel gpu instead of the nvidia...

    Amazon buys self-driving car company run by former Intel Oregon exec

Along with them come posts that aren't related to the company or its stock at all:

    It is common for different intel agencies to attach different degrees of confidence based on the manner on underlying intel...

    Goddammit man what action are YOU gonna take? You’re the chairman of the intel committee!!!

Another example is Google / Alphabet. YouTube is owned by Alphabet, so it is included in our social media searches. Search social media for YouTube and the most popular posts are about music and music videos on the site.

    [Song name] Officially Sets YouTube Record For Most Views In First 24 Hours

    [Band name] Smashes YouTube Record As [song name] Soars Past 100 Million Views

While these refer to YouTube, they aren't on-topic for the kinds of posts we're interested in analyzing.

Since our NN is trained on posts involving general business and stock opinions, plus specific industry sentiment like computing, it naturally returns widely varying results for these off-topic texts. These posts aren't useful to our analysis at all, so how do we ignore them?

Garbage In, Garbage Out

An old coworker of mine used to respond to bug reports in his software with "Garbage in, garbage out!"

Ideally we would filter these posts out before our RNN processes them. We started with this approach by adding a negative filter to social media searches: ignore "house intel" and "senate intel", for example. This helped, of course.

But there are more difficult filters. "Intel community", for example, may refer to the company or the government. "Intel chairman" might be the board chairman or a member of the US Congress. We don't want to ignore these posts and lose valuable information.
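The negative filter described above can be sketched roughly as follows. This is a minimal illustration, not the code we run: the function name, the regular expression approach, and the second sample post are assumptions made for the example. Only the two excluded phrases come from the text above.

```python
import re

# The two phrases named above; a real list would be longer (assumption).
NEGATIVE_PHRASES = ["house intel", "senate intel"]

# One case-insensitive pattern, with word boundaries around each phrase.
_NEGATIVE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in NEGATIVE_PHRASES) + r")\b",
    re.IGNORECASE,
)


def passes_negative_filter(post):
    """Return True if the post contains none of the excluded phrases."""
    return _NEGATIVE_RE.search(post) is None


posts = [
    "What does Apple dumping Intel mean for Mac users?",
    "The House Intel committee meets today",  # made-up post for illustration
    "You're the chairman of the intel committee!!!",
]

# Keep only the posts that survive the filter.
kept = [p for p in posts if passes_negative_filter(p)]
```

Note that the third post survives the filter even though it is off-topic. That is the weakness of phrase lists: ambiguous wording slips through unless we also accept throwing away relevant posts.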

Multi-Label Classification & Off-Topic Training

We added another approach to solve this problem. We changed our ML algorithm to perform multi-label text classification. Instead of a binary label classification, returning a floating point number between 0 and 1, we redesigned and retrained it to label things as positive, negative, neutral, and off-topic.

Our original binary classification took the typical approach of its last dense layer having a unit size of 1 with a sigmoid activation to bound the result between 0 and 1. The redesigned model with multi-label classification ended with a dense layer the size of the number of labels. By keeping the sigmoid activation we now get a prediction of each individual label.
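To make the effect of keeping the sigmoid concrete, here is a small sketch in plain Python of what the final layer's activation does. The label order and the raw output values are assumptions chosen for illustration; in the real model the raw values come from the dense layer.

```python
import math

# Label order is an assumption made for this example.
LABELS = ["positive", "negative", "neutral", "off-topic"]


def sigmoid(x):
    """Squash a raw output into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def label_scores(raw_outputs):
    """Apply sigmoid to each output unit independently, one score per label."""
    if len(raw_outputs) != len(LABELS):
        raise ValueError("expected one raw output per label")
    return {label: sigmoid(z) for label, z in zip(LABELS, raw_outputs)}


# Hypothetical raw outputs of the last dense layer for a single text.
scores = label_scores([2.0, -1.5, -0.5, -3.0])
```

Because each unit is squashed on its own, the scores are not forced to sum to 1 the way they would be with a softmax. Every label can score low at the same time, which is the property the rest of this approach depends on.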

If the prediction for every label falls below some threshold we choose to rely on, then we know the model is not well trained for that particular text. We can choose to ignore it or hang onto it for later training.
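A minimal sketch of that decision, assuming the model's output has already been turned into a dictionary of per-label scores. The 0.5 threshold, the function names, and the sample scores are all hypothetical; the right threshold depends on the model and the data.

```python
def best_label(scores, threshold=0.5):
    """Return the highest-scoring label, or None if no label reaches the threshold."""
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score >= threshold else None


def route(scores, threshold=0.5):
    """Decide what to do with a text based on its per-label scores."""
    label = best_label(scores, threshold)
    if label is None:
        return "untrained"  # model is unsure; keep the text for later training
    if label == "off-topic":
        return "ignore"  # confidently off-topic; leave it out of the analysis
    return label  # positive, negative, or neutral


# Hypothetical scores for three different texts.
unsure = {"positive": 0.2, "negative": 0.1, "neutral": 0.3, "off-topic": 0.05}
off_topic = {"positive": 0.1, "negative": 0.2, "neutral": 0.1, "off-topic": 0.9}
bullish = {"positive": 0.85, "negative": 0.05, "neutral": 0.2, "off-topic": 0.02}

results = [route(unsure), route(off_topic), route(bullish)]
```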

We can also proactively train it on off-topic texts that it would otherwise classify as positive, negative, or neutral. If the prediction for the off-topic label is high, we must have previously trained it on something similar.
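As a rough sketch of what adding off-topic training examples could look like, the snippet below pairs two of the off-topic posts quoted earlier with a target vector that has one slot per label. The label order and the helper name are assumptions for illustration, and tokenization and the training call itself are left out.

```python
# Label order is an assumption; it must match the order of the model's output units.
LABELS = ["positive", "negative", "neutral", "off-topic"]


def encode_target(label):
    """Build a target vector with 1.0 in the slot for the given label."""
    if label not in LABELS:
        raise ValueError("unknown label: " + label)
    return [1.0 if name == label else 0.0 for name in LABELS]


# Off-topic posts quoted earlier in this article.
off_topic_posts = [
    "It is common for different intel agencies to attach different degrees of confidence...",
    "[Song name] Officially Sets YouTube Record For Most Views In First 24 Hours",
]

# (text, target) pairs ready to be appended to the training set.
training_examples = [(post, encode_target("off-topic")) for post in off_topic_posts]
```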

Conclusion

Since switching to this multi-label text classification model for our machine learning algorithm, we have much more accurate results. We still catch and predict the sentiment of too many off-topic posts, but with more training and fine-tuning the model should improve over time.