Exploratory Data Analysis:English Premier League

Suhaib Ali Kamal
Nerd For Tech
Published in
6 min readMay 15, 2021

--

An in-depth analysis into the history of the English Premium league plus some cool functions

The EPL is one of the most popular and comptitive football leagues in the world with millions of fans watching from all around the world. In this article, we will go through some exploratory data analysis of the EPL data. We will also go through some cool functions to derive table standing and relegations escapes.

The dataset that we have in hand can be downloaded from the following link.https://www.kaggle.com/saife245/english-premier-league

For this analysis we are goind to download the dataset for 18 seasons starting from 2000/01 to 2017/18. The dataset is in different excel files(18 files) so we are going to need a for loop to download the list of files

cwd = os.path.abspath('') 
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
if file.endswith('.csv'):
df = df.append(pd.read_csv(file), ignore_index=True)

Using the above code, we were able to download 18 different files and were also able to merge into one complete file.After conducting some data preprocessing dropping missing values we had a set of 6460 rows with the following columns.

Div = League Division
Date = Match Date (dd/mm/yy)
HomeTeam = Home Team
AwayTeam = Away Team
FTHG = Full Time Home Team Goals
FTAG = Full Time Away Team Goals
FTR = Full Time Result (H=Home Win, D=Draw, A=Away Win)
HTHG = Half Time Home Team Goals
HTAG = Half Time Away Team Goals
HTR = Half Time Result (H=Home Win, D=Draw, A=Away Win)

Match Statistics (where available)
Attendance = Crowd Attendance
Referee = Match Referee
HS = Home Team Shots
AS = Away Team Shots
HST = Home Team Shots on Target
AST = Away Team Shots on Target
HHW = Home Team Hit Woodwork
AHW = Away Team Hit Woodwork
HC = Home Team Corners
AC = Away Team Corners
HF = Home Team Fouls Committed
AF = Away Team Fouls Committed
HO = Home Team Offsides
AO = Away Team Offsides
HY = Home Team Yellow Cards
AY = Away Team Yellow Cards
HR = Home Team Red Cards
AR = Away Team Red Cards

After reducing the dataset to these columns, I felt there were some columns that needed to be added. To begin, the FTR result column only stated whether the result was in favour of the home team or the away team .To change this I added two more columns:(Winning Team & Losing team). The code for the data transformation is shown below

conditions=[df['FTR']=='H',df['FTR']=='A',df['FTR']=='D']
choices=[df.AwayTeam,df.HomeTeam,'Match Drawn']
choices2=[df.HomeTeam,df.AwayTeam,'Match Drawn']
df['Losing Team']=np.select(conditions,choices,default=np.nan)
df['Winning Team']=np.select(conditions,choices2,default=np.nan)

Furthermore, the dataset also does not have the season. To add season we could sort the data in an ascending format use a for loop to add season. The EPL typically has 380 games so the season will have have to change after 380 rows of data.

Now that we have gone through some data transformations let us try to get some basic data analysis and answer some question.

What is the average number of goals scored trend for the last 18 years?

To answer this question we can use the groupby function in pandas to plot a lineplot. The code is show below.

df.groupby('Season')['Total Goals'].mean().plot(kind='line')
Avg goals/match scored by season

The above line chart shows that the average goals scored per match has been around 2.5 to 2.8 with 2012 to 2015 seasons being the most prolific in terms of goals scored.

Q2) Compare the teams home season records in the last 18 years

For this we will add another column that will calculate points per match for hme and away teams. We will use the same format as we did previously with the winning team column.

conditions=[df['FTR']=='H',df['FTR']=='A',df['FTR']=='D']
choices=[3,0,1]
choices2=[0,3,1]
df['HomeTeamPoints']=np.select(conditions,choices,default=np.nan)
df['AwayTeamPoints']=np.select(conditions,choices2,default=np.nan)

After the transformation we can add a groupby column and plot a bar chart to see how many points on average do teams earn on their home turfs.

df.groupby("HomeTeam")['HomeTeamPoints'].mean().sort_values(ascending=False).head(5).plot(kind='barh')
Avg points/home game

As can be seen from the above table, the top4 of the 2000’s have the best home record.

Using this dataset we will try to draw a season table for any year. We will later develop this into a user function where the user will be able to input the season and get the season table.

def table():
x=input("Please enter the season table that you want?")
df_season=df[df['Season']==x]
group1=df_season.groupby('HomeTeam')['HomeTeamPoints'].sum()
group2=df_season.groupby("AwayTeam")['AwayTeamPoints'].sum()
final=group1+group2
final=final.sort_values(ascending=False)
return final

The above function will be able to derive a season table for any year that we want. This table can then be used for further analysis. For example using the same code we can derive the average minimum points required to escape relegation.

sum1=0
for i in season:
df_season=df[df['Season']==i]
group1=df_season.groupby('HomeTeam')['HomeTeamPoints'].sum()
group2=df_season.groupby("AwayTeam")['AwayTeamPoints'].sum()
final=group1+group2
final=final.sort_values(ascending=False)
sum1=sum1+final.iloc[16]
print("Average points required to escape relegation is ", round(sum1/18,0)

The above code’s result is 38 which implies that an average points is enough to escape relegation and stay in the EPL.Similiarly we can use it to see the minimum points required to get a Top4 place.

sum1=0
for i in season:
df_season=df[df['Season']==i]
group1=df_season.groupby('HomeTeam')['HomeTeamPoints'].sum()
group2=df_season.groupby("AwayTeam")['AwayTeamPoints'].sum()
final=group1+group2
final=final.sort_values(ascending=False)
sum1=sum1+final.iloc[3]
print("Average points required to achieve Top4 place is ", round(sum1/18,0)

The result for the above code is 70 implying that an average of 70 points is enough to secure a Champions League place. We can now movefurther and assess how important the referees are in determining match results.

For this we are going to conduct a hypothesis test to see whether there is any significant difference between the referees attitudes to home teams and awat teams. For this analysis, we will add two more columns( Home Team booking points)&(Away Team Booking Points). The code will assign 15 booking points to every yellow card and 25 booking points to every red card.

df['HBP']=df['HY']*15+df['HR']*25
df['ABP']=df['AY']*15+df['AR']*25

Next, we are going to conduct a t-test to determine whether there is a significant difference between the referees attitudes to home teams and away teams.

stat, p = ttest_ind(df.HBP, df.ABP)
print('Statistics=%.3f, p=%.3f' % (stat, p))

The p-value is around 0 indicating that there is a significant difference between the referees attitudes to home teams and away teams.

One of the things that EPL is renowned for is close matches and exciting turnarounds. We can analyse this data to see how easily is it to change half time results for losing teams both home and away.

sns.heatmap(pd.crosstab(df.FTR,df.HTR,normalize='columns'),cmap='viridis',annot=True)

For example according to the above heatmap, in only 5.1% of the matches in the EPL have the away team been able to overturn a half time losing result.However, if the away team is winning at half time they have a 69% chance to end up being the winner. Moreover if the home team is winning at half time 81% times the home team ends up being the winner.

Lastly, we are going to analyse head to head results for some major teams.I have created a function where the user can enter teams according to his own will and the code will prepare a table listing out the head-to head record.

def win_loss():
team1=input("Please enter your first team")
team2=input("Please enter your second team")
new_df=df[(df['HomeTeam']==team1) | (df['HomeTeam']==team2)]
new_df=new_df[(new_df['AwayTeam']==team1) | (new_df['AwayTeam']==team2)]
return new_df['Winning Team'].value_counts()

For example using this code we can see the head to head record between Arsenal and Chelsea.

Chelsea        14
Match Drawn 13
Arsenal 9

I hope you liked it. To connect please visit my LinkedIn profile through the following link.

https://www.linkedin.com/in/suhaib-kamal/

--

--