Sentiment Analysis Accuracy Explained by a Data Scientist: Part Two

In this two-part series, we interviewed NetBase Quid™ Data Scientist, Michael Dukes, to help us break down precisely what sentiment analysis is, how it works, and the technological processes that differentiate “accurate” from “okay” analyses.


In the first piece, we laid the foundation of what sentiment analysis is and why accuracy is a differentiator amongst the tools available today. And in this second part of the series, we’ll elaborate further, showing what deep learning tools offer brands – and the variety of back-end capabilities to watch for, as they make a huge difference when it comes to deriving actionable, accurate insight.

As Michael suggested previously, deep learning is not a panacea for sentiment analysis – far from it. In fact, there are a number of problems with the applicability of such systems to real world data.


Q: What are the problems with deep learning?

I’ll detail two of the main problems with it: inexplicability and oversimplified foundational assumptions.

Inexplicability

The results from deep learning systems – even for English – are often ‘inexplicable’ in the sense that we as humans cannot figure out why the system behaves the way it does, nor how to fix an observed problem.

For example, Distilbert, the deep learning system I showed you last time, tells me that the sentence “I want a full length mirror” is 98.2% negative. I cannot even begin to imagine why we get that result. There doesn’t seem to be anything negative in this sentence.


Modify the sentence to “I want a full length mirror from Ikea” and we get a score of 99.6% negative.


Why? Who knows? And if we were running an analysis about the topic of Ikea furniture, what on earth would we be able to do to generate some sort of sensible result for cases like this?
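If you want to check numbers like these for yourself, they can be reproduced in a few lines of Python, assuming the open-source Hugging Face transformers library and its default English DistilBERT sentiment checkpoint (the specific model name below is an assumption about the checkpoint used for these screenshots):

```python
from transformers import pipeline

# The default "sentiment-analysis" pipeline returns a single POSITIVE or
# NEGATIVE label with a confidence score for the whole sentence.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

sentences = [
    "I want a full length mirror",
    "I want a full length mirror from Ikea",
]

for text in sentences:
    result = classifier(text)[0]
    print(f"{text!r}: {result['label']} ({result['score']:.1%})")
```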

Here is a related set of examples showing how minimal changes in wording affect the sentiment scores Distilbert provides:

Example                                     Distilbert Sentiment Score
Just back from a successful trip to Ikea!   64.7% negative
Just back from a successful trip to Ikea.   98.3% negative
Successful trip to Ikea!                    100% positive
Was it a successful trip to Ikea?           99.2% negative

To break down the errors in the analyses:

  • The presence of the innocuous expression “just back from” seems to cause two sentences expressing clear positive sentiment to be treated as negative, and merely removing the exclamation mark increases the negativity by more than 30 percentage points (from 64.7% to 98.3%).
  • A question which appears to be entirely neutral is scored as the most negative of all the examples.
  • And only the simplest example is analyzed correctly.

In the NetBase Quid™ system, we derive positive sentiment from the first three examples and neutral sentiment in the case of the question. And this seems to align with what a human would judge. It is difficult to understand how the Distilbert system produces the observed results and even harder to understand how they would be corrected in a production system.

Next, we’ll explore the danger of oversimplified assumptions.

Oversimplified assumptions about the nature of ‘sentiment’ lead to incorrect analyses

Historically, sentiment analysis tools have relied on some fairly shaky assumptions – and frankly, many still do. Even the idea, for example, that a document can in totality be summed up as “positive”, “negative” or “neutral” is quite suspect. One of the advantages of the NetBase Quid™ sentiment system is our focus on linguistically detailed “aspect-based” sentiment.

Many of our competitors, as well as open-source sentiment systems, are unable to provide this level of detail because they are focused on labelling entire sentences or documents.

In the first part of this series, we shared the following fictitious steakhouse tweet example, along with a detailed breakdown of how it should be analyzed, versus how it actually is analyzed by some sentiment tools:

We had a meal at Steakhouse X yesterday. The fries were delicious but the steak was awful.

As noted there, Distilbert strangely analyzed the tweet as 97.2% negative. And for Distilbert, the wording of the text seemed to have much more importance than the meaning, which is the exact opposite of how a sentiment analysis tool should work. Distilbert doesn’t really know much about meaning.
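To make the contrast concrete, here is a minimal, purely illustrative sketch of what aspect-based output for that tweet might look like next to a single document-level label. The names and structures are invented for this example; this is not the NetBase Quid™ API:

```python
from dataclasses import dataclass

@dataclass
class AspectSentiment:
    aspect: str    # the thing being evaluated, e.g. "fries"
    opinion: str   # the word or phrase expressing the evaluation
    polarity: str  # "positive", "negative" or "neutral"

# A sentence-level system collapses the whole tweet into one label...
document_level = {
    "text": "The fries were delicious but the steak was awful.",
    "polarity": "negative",  # roughly what Distilbert reports
}

# ...while aspect-based analysis keeps the two opinions separate.
aspect_based = [
    AspectSentiment(aspect="fries", opinion="delicious", polarity="positive"),
    AspectSentiment(aspect="steak", opinion="awful", polarity="negative"),
]
```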

Oversimplified assumptions about how to classify data lead to incorrect analyses

These oversimplifications of linguistic complexity lead to problems in many ways. Here’s a related case where sentence- or post-level machine learning systems are also going to get you into trouble:

The Rams beat the Bengals in Superbowl LVI.

This is a simple statement of fact, which conveyed negative news for Bengals fans and positive news for Rams fans in 2022. But according to Distilbert this sentence is 99.9% positive.


How can this possibly be correct? Is there some universally shared assumption amongst football fans that the Bengals getting beaten is always a good thing? Or is it the case that we all believe that the Rams should always be the winners? Neither of these justifications is true of course. In fact, these biases simply arise from the way that training data is often collated for machine learning systems.

Human annotators, typically put under time pressure to pick one of three overly general sentiment labels (positive, negative or perhaps neutral), have no opportunity to think through the possible perspectives of a sentence and simply pick the one which comes to mind first. Typically, they’ll pick the perspective of the subject of the sentence (in this case “the Rams”).

And these human biases are reinforced over many thousands of cases in the training data until the system itself is “fixed” to that biased perspective.

In isolation, however, this sentence obviously defies any positive or negative summarization. Any sentiment it might convey depends entirely on the observer’s perspective. And a huge amount of content on social media is the same way. This creates a serious problem for machine learning systems of any type.

The NetBase Quid™ application, in contrast, extracts both positive attributes (for the Rams) and negative attributes (for the Bengals) from this sentence. Our linguistic analysis recognizes that such verbs as ‘beat’, ‘defeat’, ‘overcome’, ‘dump’, etc., entail semantic roles in which one participant typically benefits and the other suffers. The choice of primary search term – “Bengals”, “Rams” or perhaps just “Superbowl” – determines which sentiment you see in the results. This is a nuanced understanding of sentiment that is key to extracting accurate results.
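As a rough illustration of that idea (and only an illustration; our production linguistic analysis is far richer than this toy rule), a sketch for ‘beat’-type verbs might assign opposite polarities to the winner and loser roles and report whichever one matches the chosen search term:

```python
# Toy sketch: verbs like "beat" or "defeat" entail a role that benefits
# (the winner) and a role that suffers (the loser). Which polarity gets
# reported depends on the search term the analysis is run for.
COMPETITIVE_VERBS = {"beat", "defeat", "overcome", "dump"}

def perspective_sentiment(winner: str, verb: str, loser: str, search_term: str) -> str:
    """Return the sentiment relevant to the entity being searched for."""
    if verb not in COMPETITIVE_VERBS:
        return "neutral"
    if search_term.lower() == winner.lower():
        return "positive"   # the search term names the beneficiary
    if search_term.lower() == loser.lower():
        return "negative"   # the search term names the sufferer
    return "neutral"        # e.g. searching for "Superbowl" itself

# "The Rams beat the Bengals in Superbowl LVI."
print(perspective_sentiment("Rams", "beat", "Bengals", search_term="Rams"))      # positive
print(perspective_sentiment("Rams", "beat", "Bengals", search_term="Bengals"))   # negative
print(perspective_sentiment("Rams", "beat", "Bengals", search_term="Superbowl")) # neutral
```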

And this brings us to our next question – around struggles that all sentiment analysis tools typically face.

Q: Where else do sentiment analysis tools typically struggle – and why?

Language is complicated, slippery and hard to pin down. Some languages lack verb agreement or ‘drop’ pronouns freely (for example, Japanese and Indonesian). This makes it harder to track which entity a sentiment is about.

Similarly, dealing with anaphora in longer pieces of text is an ongoing issue. In some languages it is also quite difficult to distinguish statements from questions in written form. And then social media conventions can also make it difficult to track what a sentiment is about, or even whether it is positive or negative (e.g. with strings of hashtags).

We’ve developed techniques for handling some of these cases but there are still limits to what can be achieved given the complexity of the data we’re dealing with. It is also often difficult to get the correct interpretation from complex syntactic structure involving conjunction or relative clauses. Similarly, many words are ambiguous depending on their context, so disambiguation poses a challenge.

And finally, a major problem with handling data from the internet is that the text we receive may be malformed in various ways: data may be unexpectedly truncated or alternatively it may be missing line breaks. Sentence boundaries are often missing, and internet users aren’t too interested in proper spelling or formatting!
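As a trivial example of the defensive preprocessing this forces on any pipeline (a simplified sketch, not how our production system actually does it), even recovering sentence boundaries usually comes down to heuristics that fail exactly when the punctuation is missing:

```python
import re

def rough_sentence_split(text: str) -> list[str]:
    """Naive splitter: breaks after ., ! or ? followed by whitespace.
    It relies on punctuation that internet text often omits, so real
    pipelines need stronger heuristics or statistical models."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in pieces if p]

print(rough_sentence_split("Loved the fries. Hated the steak!"))
# -> ['Loved the fries.', 'Hated the steak!']
print(rough_sentence_split("loved the fries hated the steak lol"))
# -> ['loved the fries hated the steak lol']  (no boundaries to recover)
```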

But the takeaway here is that we’re always working on accommodating these challenges – and our aggressive platform update schedule speaks to that.

Q: How often is NetBase Quid™’s platform updated – and based on what? New technology, user needs, market shifts – a combination of that and more?

In short, all of the above. We don’t expect our customers to have to retrain our system, so if errors are noticed by our users we try to handle them as quickly as possible. Typically, we update the NLP system at least once or twice per quarter. Sometimes this involves deploying new products that have been requested by product management, and other times we’re deploying bug fixes or upgrading particular analyses – for example, sentiment or named entity recognition.

We’re always trying to add new features that help our customers and we’re constantly making improvements to the platform to increase accuracy and recall. It’s really something you need to experience to appreciate, and we’re always available for a demo geared toward your company’s specific needs. Reach out for your demo today!
