Eurovision Song Contest: A Data Science point of view. Part 1

This article is part of an ongoing series about the Eurovision Song Contest.

First of all, let me introduce you to one of my passions in life: the Eurovision Song Contest (ESC). For those of you who don’t know it, the ESC is a contest organized yearly by the European Broadcasting Union (EBU). Its mission was to unite European countries after WWII.

Eurovision Song Contest logo. Source — EBU.

The key points of the contest are:

The event has taken place annually since 1956, except in 2020 due to Covid-19.
Every participating broadcaster selects an artist or band to represent its country each year.
The voting system is divided into jury votes and, more recently, televoting. The jury is a group of experts who usually belong to the music industry. Televoting is the vote cast by the public via SMS, phone call or the Eurovision app.
A country (jury or public) cannot vote for itself.
Every country awards points to 10 of the participants. The top choice receives 12 points and the tenth receives 1 point.
Only the top 3 are announced during the final.
For more information, please check eurovision.tv.
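To make the scoring rule concrete, here is a minimal sketch of how one country's ranking could translate into points. The scale used (12, 10, 8, then 7 down to 1) is the standard Eurovision scale; the function name `award_points` and the sample ranking are purely illustrative.

```python
# Standard Eurovision scale: 12 for the favourite, then 10, then 8 down to 1
POINTS_SCALE = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]

def award_points(voter, ranking):
    """Map a voter's ranking (best first) to points; names are illustrative."""
    # A country cannot vote for itself, so drop it from the ranking first
    eligible = [country for country in ranking if country != voter]
    # zip stops at the shorter sequence, so only the top 10 receive points
    return dict(zip(eligible, POINTS_SCALE))

# Made-up ranking, not a real result
ranking = ["Sweden", "Italy", "Spain", "Ukraine", "France", "Norway",
           "Greece", "Cyprus", "Poland", "Portugal", "Malta", "Israel"]
points = award_points("Spain", ranking)
```

Here Spain is skipped because it is the voter, so the ten countries ranked highest among the rest receive points and the remaining one receives nothing.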

Although the contest’s main goal is to unite Europe through music, many events have shaped a very delicate political, social and economic landscape around it. The following is a list of reasons to study the relationships between European countries in this music festival.

Political issues:

Germany was reunited.
Yugoslavia and the USSR split into many countries.
There were many conflicts all over Europe (North Cyprus, Yugoslavia, Crimea, Kosovo, Nagorno-Karabakh, etc).
Greece, Spain and Portugal were affected by the adoption of the euro and fell into an economic crisis.
Migration patterns have changed. For example, Polish people living in Germany can vote for Poland as if they were Germans; this is known as diaspora voting.
Israel has strong ties with many European countries.
Many Arab countries are also EBU members but have chosen not to take part due to Israel’s presence.
Jamala’s 2016 winning song was seen as a protest over Crimea.
Russia still holds strong influence over the former USSR countries.
Turkey and Azerbaijan have had, and still have, serious and bloody conflicts with Armenia (Armenian Genocide).

Cultural issues:

Many clusters emerge naturally from language families: the Romance languages in the Iberian Peninsula, the Slavic languages in the Balkans and Russia, and the Germanic languages in the north.
Religions: agnostics, Muslims, Catholic Christians, Orthodox Christians, etc.
LGBT support: Turkey said it stopped taking part in Eurovision due to Conchita’s victory in the 2014 edition. Something similar happened in Hungary.
Some countries always give their 12 points to their best friends (e.g. Greece ↔ Cyprus).

So, with all of this information in mind, how can Data Science help us understand the ESC, and what kind of questions can we answer? As a eurofan (that’s what ESC followers are called), questions like these quickly flood my mind:

Is there any bias in the voting?
Which are the most influential countries in the competition?
Are the countries organized in communities?
Are the countries organized by cultural affinity, such as language?
Is the geographical distance important?

I will try to answer all of those questions and more throughout this series of articles. But first things first, let’s start doing Data Science.

Being a eurofan and a nerd about Eurovision-related topics, I found collecting the data easier than it might seem at first. I already had a bunch of Excel files with the ESC scoreboards for each year; I only had to combine them into a single file. I have uploaded this dataset to Kaggle so that everyone can access it (feel free to upvote it 😉).

The data is structured as follows:
· Year: The edition of the contest.
· Type of show: There are semifinals and finals.
· Type of vote: Jury or televote.
· Origin of votes: The country that votes.
· Destination of votes: The country that receives the points.
· Amount of votes: Number of points given.
· Duplicates check: A tag to verify that a country does not vote for itself.

In the end we have around 50K rows of data.
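As a rough illustration of that layout, the sketch below builds a few made-up rows. The column names `year`, `final`, `countryfrom`, `countryto`, `points` and `duplicate` are the ones used by the cleaning code in this article; `votetype` is an assumed name for the jury/televote column, the values are invented, and the sketch assumes the tag marks rows where a country is paired with itself.

```python
import pandas as pd

# Invented rows that mimic the layout described above (not real results)
sample = pd.DataFrame({
    "year": [2019, 2019, 2019],
    "final": ["f", "f", "sf1"],                 # type of show
    "votetype": ["jury", "televote", "jury"],   # assumed column name
    "countryfrom": ["Spain", "Spain", "Greece"],
    "countryto": ["Italy", "Spain", "Cyprus"],
    "points": [12, 0, 12],
    "duplicate": [None, "x", None],             # 'x' tags a self-pairing
})

# Under the assumption above, the tag should match the rows where
# voter and receiver are the same country
self_votes = sample["countryfrom"] == sample["countryto"]
flagged = sample["duplicate"].eq("x")
```

Comparing `self_votes` with `flagged` is a cheap sanity check that could be run on the real file before trusting the tag.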

First of all, we create a copy of our dataset. This step is important so that you don’t lose your original dataset in case you have to step back. We keep only the rows where points are given, which simplifies our task and removes empty edges.

df2 = df.copy().query('points > 0')

Next, duplicates are removed from the dataset. Remember the check column we introduced in our dataset?

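# Normalize the flag to a boolean: the tag is either the string 'x' or True
df2['duplicate'] = df2['duplicate'].apply(lambda x: True if x == 'x' or x==True else False)
# Keep only the non-duplicated rows and drop the helper column
df2 = df2.query('duplicate == False').drop(columns=['duplicate'])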
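The same clean-up can be tried on a toy table. This is a minimal sketch with invented rows, using vectorized comparisons instead of `apply`; it is an alternative way to express the step, not the code used for the real dataset.

```python
import pandas as pd

# Invented flag column mixing the encodings the cleaning step handles
votes = pd.DataFrame({
    "countryfrom": ["Spain", "Spain", "Italy", "Italy"],
    "countryto": ["Italy", "Spain", "Italy", "Spain"],
    "duplicate": [None, "x", True, ""],
})

# A row is a duplicate if the flag is the string 'x' or the boolean True
is_duplicate = votes["duplicate"].eq("x") | votes["duplicate"].eq(True)

# Keep the non-duplicates and drop the helper column
cleaned = votes[~is_duplicate].drop(columns=["duplicate"])
```

Only the two rows where a country is paired with itself are removed; empty or missing flags count as "not a duplicate".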

Some countries have been named differently throughout the history of the ESC, so we will standardize them.

def applyRename(x):
    # Map historical or misspelled country names to a single canonical name
    renamings = {
        'North Macedonia': 'Macedonia',
        'F.Y.R. Macedonia': 'Macedonia',
        'The Netherands': 'Netherlands',
        'The Netherlands': 'Netherlands',
        'Bosnia & Herzegovina': 'Bosnia',
    }
    return renamings[x] if x in renamings else x

df2['countryfrom'] = df2['countryfrom'].apply(applyRename)
df2['countryto']   = df2['countryto'].apply(applyRename)
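For readers who prefer to avoid a helper function, the same standardization can be sketched with `Series.replace`, which swaps exact matches found in a dictionary and leaves every other value untouched. The rows below are invented; only the mapping comes from the article.

```python
import pandas as pd

renamings = {
    "North Macedonia": "Macedonia",
    "F.Y.R. Macedonia": "Macedonia",
    "The Netherands": "Netherlands",   # misspelling kept on purpose, as in the mapping above
    "The Netherlands": "Netherlands",
    "Bosnia & Herzegovina": "Bosnia",
}

# Invented rows, only to exercise the mapping
votes = pd.DataFrame({
    "countryfrom": ["North Macedonia", "The Netherands", "Spain"],
    "countryto": ["Bosnia & Herzegovina", "F.Y.R. Macedonia", "The Netherlands"],
})

# Replace exact matches in both country columns
for col in ["countryfrom", "countryto"]:
    votes[col] = votes[col].replace(renamings)
```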

Also, we have to take into account that Yugoslavia took part in the contest before splitting into several countries. We will attribute to each successor country the votes that Yugoslavia got as a whole before splitting. The split between Serbia and Montenegro will also be taken into account.

# Former states and the present-day countries they are mapped to
division = {
    'Yugoslavia': ['Macedonia', 'Serbia', 'Montenegro', 'Slovenia', 'Bosnia', 'Croatia'],
    'Serbia & Montenegro': ['Serbia', 'Montenegro'],
}
# Replace a former state by its list of successors, then emit one row per successor
df2['countryfrom'] = df2['countryfrom'].apply(lambda x: division[x] if x in division else x)
df2['countryto']   = df2['countryto'].apply(lambda x: division[x] if x in division else x)
df2 = df2.explode('countryfrom').explode('countryto')
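To see what this step does to individual rows, here is a small sketch on two invented votes. It follows the same idea (swap the old state for a list, then `explode` the list into rows); the years and point values are made up.

```python
import pandas as pd

division = {
    "Yugoslavia": ["Macedonia", "Serbia", "Montenegro", "Slovenia", "Bosnia", "Croatia"],
    "Serbia & Montenegro": ["Serbia", "Montenegro"],
}

# Two invented votes involving former states
votes = pd.DataFrame({
    "year": [1990, 2004],
    "countryfrom": ["Spain", "Serbia & Montenegro"],
    "countryto": ["Yugoslavia", "Greece"],
    "points": [12, 10],
})

# Swap each former state for its list of successors (other names stay as they are)
for col in ["countryfrom", "countryto"]:
    votes[col] = votes[col].apply(lambda x: division.get(x, x))

# One output row per successor; scalar cells pass through explode unchanged
expanded = votes.explode("countryfrom").explode("countryto").reset_index(drop=True)
```

The vote to Yugoslavia becomes six rows and the vote from Serbia & Montenegro becomes two, each carrying the original points. Total points are therefore inflated for those years, which is worth keeping in mind when summing scores later.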

We will also remove from the dataset the countries that no longer take part. The cutoff is set at 5 editions before the latest one, so all countries that did not take part in the last 5 years will be removed.

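import pandas as pd
from IPython.display import display, HTML

# minYears and last_participation are thresholds defined outside this snippet;
# per the text above, last_participation is 5
toKeep = (
    df2.groupby('countryfrom')
    .apply(lambda x: pd.Series({
        'years': x['year'].nunique(),
        'last_participation': df2['year'].max() - x['year'].max(),
    }))
    .query(f'years >= {minYears} and last_participation <= {last_participation}')
    .reset_index()['countryfrom']
)

# Show which countries are being dropped
display(HTML("<p>ignored countries: %s</p>" % ', '.join(df2[df2['countryfrom'].isin(toKeep)==False]['countryfrom'].unique())))

# Keep a vote only if both the voter and the receiver are retained
df2 = df2[df2['countryfrom'].isin(toKeep)]
df2 = df2[df2['countryto'].isin(toKeep)]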
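The filtering logic can be checked on a toy table. In this sketch the rows are invented, `min_years` is a hypothetical value (the article does not state the one it uses), and `max_gap` stands for the 5-edition limit mentioned in the text. It uses named aggregation rather than `apply`, so it is an equivalent formulation, not the original code.

```python
import pandas as pd

# Invented participation records (not real results)
votes = pd.DataFrame({
    "year":        [2010, 2015, 2021, 2022, 2005, 2006, 2021, 2022],
    "countryfrom": ["Spain", "Spain", "Spain", "Spain", "Monaco", "Monaco", "Italy", "Italy"],
    "countryto":   ["Italy", "Italy", "Italy", "Italy", "Spain", "Spain", "Spain", "Spain"],
})

min_years = 2   # hypothetical minimum number of editions
max_gap = 5     # editions allowed since the last participation, as in the text

# One row per voting country: editions attended and most recent edition
stats = votes.groupby("countryfrom")["year"].agg(years="nunique", last_year="max")
stats["gap"] = votes["year"].max() - stats["last_year"]

# Countries that pass both thresholds
keep = stats[(stats["years"] >= min_years) & (stats["gap"] <= max_gap)].index

# Keep a vote only if both the voter and the receiver are retained
filtered = votes[votes["countryfrom"].isin(keep) & votes["countryto"].isin(keep)]
```

In this toy data Monaco last appears 16 editions before the latest year, so it is dropped, while Spain and Italy stay.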

Our last cleaning step is to only take into account the points given during the final when a country has qualified for the final. Since 2004, due to the growing number of participants, the show has included semifinals.

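# Rank the shows so that the final (1) takes priority over any semifinal (2)
df2['finalcode'] = df2.final.map({'f': 1, 'sf': 2, 'sf1': 2, 'sf2': 2})
# Lowest code per receiving country and year: the final if the country reached it
temp1 = df2.groupby(['countryto', 'year']).agg({'finalcode': 'min'})
df2 = pd.merge(df2, temp1, on=['countryto', 'year', 'finalcode'], how='inner')
# Sanity check: each (voter, receiver, year) triple now comes from a single show
assert len(df2.groupby(['countryfrom', 'countryto', 'year']).agg({'final': 'nunique'}).query('final >1')) == 0
df2.drop(columns=['finalcode', 'edition'], inplace=True)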
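As a quick check of this rule, the sketch below applies it to four invented votes. It swaps the merge for a `groupby(...).transform('min')`, which is an alternative way to keep, for each receiving country and year, only the rows from its best show; the country names "A" and "B" are placeholders.

```python
import pandas as pd

# Invented votes: country A reached the final, country B only sang in a semifinal
votes = pd.DataFrame({
    "year": [2019, 2019, 2019, 2019],
    "final": ["sf1", "f", "sf2", "f"],
    "countryfrom": ["France", "France", "Spain", "Spain"],
    "countryto": ["A", "A", "B", "A"],
    "points": [8, 10, 5, 12],
})

# Final gets code 1, any semifinal gets code 2
votes["finalcode"] = votes["final"].map({"f": 1, "sf": 2, "sf1": 2, "sf2": 2})

# Best (lowest) code reached by each receiving country in each year
best = votes.groupby(["countryto", "year"])["finalcode"].transform("min")

# Keep only the votes cast in that best show
kept = votes[votes["finalcode"] == best].drop(columns="finalcode")
```

Country A keeps only its two votes from the final, while country B, which never reached the final, keeps its semifinal vote.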

With this, the dataset is ready for the next steps. In part two we will do Exploratory Data Analysis (EDA) and try to get some insights and first answers to the questions raised about the Eurovision Song Contest.
