Brands across the globe want to know about their audiences – which is why the NetBase platform understands 42 languages, including slang, sarcasm, and emojis. We regularly analyze data for audiences that speak each of these languages – but we hadn’t attempted to analyze a bilingual audience until recently. It proved to be an interesting challenge.

Our marketing department had been following the election and wanted to discover the hot topics being discussed by Spanish speakers in the U.S. who had also been talking about the presidential election. Typically, this analysis involves a comparison against another audience with similar characteristics, including the language they speak. We have experience building comparison audiences that are monolingual, but Spanish-speaking Twitter users in the U.S. often also tweet in English. This makes it more complicated to build an audience that can function as a direct comparison.

And comparing audiences needs to be an “apples to apples” scenario to mean anything.

Analyzing monolingual audiences

To explain what I mean, let’s look at how we create and compare audiences generally. For example, we might see what’s especially interesting to Fiat car enthusiasts as compared to the average Twitter user. In this case, the audience of people who talk about Fiats is the target audience, and a general audience of Twitter users functions as the comparison.

We know the Fiat talkers will be mentioning Fiats more than the general population. That’s okay, because we want their content to be different – that’s how we find interesting things about what they’re saying.

What we want to be the same is the language they use. Why? Suppose our Fiat talkers are speaking mainly in Italian, but our comparison audience speaks mainly English. In that case, there won’t be many interesting insights.

The reason is that when we run a comparison for insights, it doesn’t equate two words that mean the same thing in different languages. For example, when someone in the Fiat topic uses the word “vacanza,” and someone in the comparison topic uses the word “vacation,” the system sees these as two different words and may conclude, “The Fiat talkers say ‘vacanza’ more than they do in the general population.” This wouldn’t tell us if the Fiat talkers actually discuss their vacations more frequently.
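To make the “vacanza” problem concrete, here is a minimal sketch of a token-level comparison. It is not the NetBase implementation; the function names and the tiny sample tweets are hypothetical, and the comparison is reduced to a simple share-of-tokens check.

```python
from collections import Counter

def relative_freq(tweets):
    """Return each token's share of all tokens in a list of tweets."""
    counts = Counter(word for t in tweets for word in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def overindexed(target, comparison):
    """Tokens whose share in the target exceeds their share in the comparison."""
    t, c = relative_freq(target), relative_freq(comparison)
    return {w for w in t if t[w] > c.get(w, 0.0)}

# Hypothetical sample data: same subject matter, different languages.
fiat_tweets = ["vacanza in fiat", "fiat vacanza"]
general_tweets = ["vacation in car", "car vacation"]

flagged = overindexed(fiat_tweets, general_tweets)
```

Here “vacanza” is flagged as over-indexed only because “vacation” is counted as a different token, even though both audiences mention the idea equally often.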

To avoid this problem, we make sure that both audiences are speaking the same language. Our practice is to make sure that each audience member has at least 90% of their tweets in the language of interest.
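As a rough sketch of that 90% rule, assuming each tweet already carries a language code from some upstream detector (the data layout and function names here are hypothetical):

```python
def language_share(tweet_langs, lang):
    """Fraction of an author's tweets written in the given language."""
    if not tweet_langs:
        return 0.0
    return sum(1 for code in tweet_langs if code == lang) / len(tweet_langs)

def monolingual_authors(authors, lang, threshold=0.9):
    """Keep authors with at least `threshold` of their tweets in `lang`."""
    return [
        name for name, langs in authors.items()
        if language_share(langs, lang) >= threshold
    ]

# Hypothetical authors mapped to the language code of each of their tweets.
authors = {
    "author_a": ["it"] * 19 + ["en"],      # 95% Italian
    "author_b": ["it"] * 5 + ["en"] * 5,   # 50% Italian
}

kept = monolingual_authors(authors, "it")
```

Only `author_a` clears the threshold, so only that author would stay in an Italian-language audience.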

This is a common use case for us, so we have it down. But in the case of the Spanish speakers talking about the U.S. election, the audience tends to be bilingual, and we’re interested in both the Spanish and English conversation. Defining what it means to be bilingual is a new use case, so we had to take a new approach.

Tuning the parameters for distribution of language

First, we had to define what it means to be bilingual. For the target audience of Spanish speakers referencing the election in Spanish, we experimented and found that if an author has one original tweet in Spanish, it’s an indication that they’re competent in Spanish and use it to communicate.

The next step was to define the comparison audience with respect to the target audience in terms of their language usage. It’s not as simple as requiring that the comparison audience has one original Spanish tweet. This is because the comparison audience is not anchored on a topic the way the target audience is, and this ends up resulting in a different distribution of languages used.

In order to align the comparison audience, we modeled it based on three features: English tweet frequency, Spanish tweet frequency, and the frequency of tweets in all other languages. For each feature, we calculated a range of one standard deviation from the target audience mean, and only selected authors in the comparison audience whose language usage fell within this window.
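The windowing step can be sketched as follows. This is an illustration under stated assumptions, not our production code: it treats each “frequency” as the share of an author’s tweets in that language, uses the population standard deviation, and all names and sample values are hypothetical.

```python
from statistics import mean, pstdev

FEATURES = ("en", "es", "other")  # share of tweets per language bucket

def feature_windows(target):
    """Per feature, the range mean +/- one standard deviation over the target audience."""
    windows = {}
    for f in FEATURES:
        values = [author[f] for author in target.values()]
        m, s = mean(values), pstdev(values)  # assumption: population std dev
        windows[f] = (m - s, m + s)
    return windows

def select_comparison(candidates, windows):
    """Keep candidates whose usage falls inside the window on every feature."""
    return [
        name for name, author in candidates.items()
        if all(windows[f][0] <= author[f] <= windows[f][1] for f in FEATURES)
    ]

# Hypothetical per-author language shares.
target = {
    "t1": {"en": 0.60, "es": 0.30, "other": 0.10},
    "t2": {"en": 0.40, "es": 0.60, "other": 0.00},
    "t3": {"en": 0.50, "es": 0.45, "other": 0.05},
}
candidates = {
    "c1": {"en": 0.50, "es": 0.45, "other": 0.05},  # close to the target mix
    "c2": {"en": 0.95, "es": 0.00, "other": 0.05},  # almost entirely English
}

selected = select_comparison(candidates, feature_windows(target))
```

Requiring all three features to fall inside their windows is what keeps a nearly all-English author like `c2` out of the comparison audience.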

This method ensures parity between the two audiences in terms of their language usage.

Rising to the bilingual challenge

This is the “apples to apples” concept I mentioned earlier. Without creating an even playing field, you only pick up “these people tweet more in English” and “these people tweet more in Spanish” – which isn’t the info you’re looking for.

The breakthrough we saw, in approaching this marketing challenge, was in being able to create an audience whose language choice on social media mimicked the language being used in the election topic – though their content was more general.

This is something we couldn’t have done if we hadn’t defined the languages of interest for the topic we were analyzing. For the election topic it was definitely Spanish and English – but we had to specify both and “all other languages” to get the results we were looking for. This is something to keep in mind if you run your own analysis.

For all the trial and error, the data that comes through is much cleaner in the end – without the noise and random tweets that turn out to be irrelevant. It’s much better when you can be confident as you tell decision makers you’ve uncovered a meaningful connection in the way people are talking about a topic on social in a cultural context.

So now you know how to examine bilingual audiences – and, from Sean’s recent post, you even have insights into what sets Latin American consumers apart.

If you want to know more, you can try these techniques yourself – or reach out and we’ll walk you through it.

Image from Diaper
