---
title: "Exploring World Cup Statistics Geographically By Country"
output:
  html_document:
    toc: TRUE
    toc_float: TRUE
---
<style type="text/css">
h1.title {
text-align: center;
}
</style>
```{r setup, include=FALSE}
# Load packages
library(sf)
library(tidyverse)
library(tmap)
library(tmaptools)
library(viridis)
library(plotly)

# Set default knitr chunk options
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  fig.width = 6,
  fig.asp = .6,
  out.width = "90%"
)

# Set ggplot theme and use viridis for the default colour and fill scales
theme_set(theme_minimal() + theme(legend.position = "bottom"))
options(
  ggplot2.continuous.colour = "viridis",
  ggplot2.continuous.fill = "viridis"
)
scale_colour_discrete <- scale_colour_viridis_d
scale_fill_discrete <- scale_fill_viridis_d

# Set tmap mode to interactive
tmap_mode("view")
```
```{r import df, include=FALSE, warning=FALSE, message=FALSE}
# Save a copy of the raw dataset under a clearer file name
# (row.names = FALSE avoids writing an extra index column)
data <- read.csv("./data/12_4_dataset.csv")
write.csv(data, "./data/worldcup_final.csv", row.names = FALSE)

# Read in the csv data file and convert values to numeric
wc_df <- read_csv("./data/worldcup_final.csv") %>%
  janitor::clean_names() %>%
  mutate(
    # Strip the "+" prefix and replace the Unicode minus sign with an ASCII hyphen
    gd = gsub("+", "", gd, fixed = TRUE),
    gd = gsub("−", "-", gd, fixed = TRUE),
    part = as.numeric(part),
    pld = as.numeric(pld),
    w = as.numeric(w),
    d = as.numeric(d),
    l = as.numeric(l),
    gf = as.numeric(gf),
    ga = as.numeric(ga),
    gd = as.numeric(gd),
    pts = as.numeric(pts),
    rank = as.numeric(rank),
    goals = as.numeric(goals),
    land_area = as.numeric(land_area_km))
```
```{r import shapefile, include=FALSE, warning=FALSE, message=FALSE}
# Read in shapefile and convert values to numeric
wc_countries <- st_read("data/geofiles/worldcup_countries.shp") %>%
  janitor::clean_names() %>%
  mutate(
    # Strip the "+" prefix and replace the Unicode minus sign with an ASCII hyphen
    gd = gsub("+", "", gd, fixed = TRUE),
    gd = gsub("−", "-", gd, fixed = TRUE),
    part = as.numeric(part),
    pld = as.numeric(pld),
    w = as.numeric(w),
    d = as.numeric(d),
    l = as.numeric(l),
    gf = as.numeric(gf),
    ga = as.numeric(ga),
    gd = as.numeric(gd),
    pts = as.numeric(pts),
    rank = as.numeric(rank),
    goals = as.numeric(goals),
    land_area = as.numeric(land_area))
```
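The two import chunks clean the goal-difference column before converting it to a number. The block below is a minimal sketch of that cleaning step on its own. The input strings are hypothetical and only mimic the format implied by the code (a leading `+` and a Unicode minus sign, assumed here to be U+2212).

```r
# Hypothetical raw goal-difference strings (not taken from the dataset)
gd_raw <- c("+129", "\u221212", "0", "+5")

# Drop the leading plus sign, then swap the Unicode minus for an ASCII hyphen
gd_clean <- gsub("+", "", gd_raw, fixed = TRUE)
gd_clean <- gsub("\u2212", "-", gd_clean, fixed = TRUE)

# The strings now parse as ordinary numbers
gd_num <- as.numeric(gd_clean)
gd_num
```

Without the second replacement, `as.numeric()` would return `NA` for any value written with the Unicode minus sign.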
<br>
<br>
## Data by Country
<br>
### Number of Participations in the World Cup by Country
```{r echo=FALSE, warning=FALSE, message=FALSE}
tm_shape(wc_countries) +
  tm_polygons(
    col = "part",
    style = "cont",
    palette = "-viridis",
    id = "country",
    alpha = .7,
    border.col = "white",
    lwd = .5,
    breaks = seq(0, 22, by = 2),
    title = "Participations (part)")
```
Brazil is an international soccer powerhouse and the only country to have appeared at every World Cup, from 1930 through 2022. Its national team has played in all 22 tournaments to date, followed by Germany with 20 participations, Italy and Argentina with 18 each, and Mexico with 17.
<br>
### FIFA Ranking
FIFA rankings measure how national teams compare with one another and can prove important in tournaments such as the World Cup. FIFA ranks national teams based on their match results, with the most successful teams ranked highest (the top team is ranked #1).
```{r echo=FALSE, warning=FALSE, message=FALSE}
tm_shape(wc_countries) +
  tm_polygons(
    col = "rank",
    style = "cont",
    palette = "viridis",
    id = "country",
    alpha = .7,
    border.col = "white",
    lwd = .5,
    breaks = seq(0, 170, by = 20),
    title = "FIFA Rank (rank)")
```
<br>
```{r echo=FALSE, warning=FALSE, message=FALSE}
wc_df %>%
  mutate(
    country = fct_reorder(country, -rank),
    text_label = str_c("Country: ", country, "\nConfederation: ", confederation, "\nFIFA Rank: ", rank)) %>%
  plot_ly(
    x = ~country,
    y = ~rank,
    color = ~confederation,
    text = ~text_label,
    alpha = 0.7,
    legendgroup = ~confederation,
    colors = viridis_pal(option = "D")(3)) %>%
  layout(
    title = "FIFA Rank By Country",
    xaxis = list(title = "Country"),
    yaxis = list(title = "FIFA Rank"))
```
<br>
```{r echo=FALSE, warning=FALSE, message=FALSE}
wc_df %>%
  ggplot(aes(x = confederation, y = rank, fill = confederation)) +
  geom_boxplot(alpha = .7) +
  labs(
    title = "FIFA Rank for Each FIFA Confederation",
    x = "Confederation",
    y = "FIFA Rank") +
  scale_fill_viridis_d(name = "Confederation") +
  theme(legend.position = "none")
```
<br>
As of the 2022 FIFA World Cup, Brazil is the highest-ranked national team, followed by Belgium, Argentina, France, and England. <br>
The highest-ranked national teams appear to be spatially clustered in the CONMEBOL (South America) and UEFA (Europe) confederations.
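
The clustering described above can be checked with a simple tally. The block below is a rough sketch on a hand-built toy table: the five countries are those named in the paragraph, and their rank positions (1 to 5) are assumed from the order in which they are listed rather than read from the dataset.

```r
# Toy table: rank positions are assumed from the order given in the text above
ranks <- data.frame(
  country = c("Brazil", "Belgium", "Argentina", "France", "England"),
  confederation = c("CONMEBOL", "UEFA", "CONMEBOL", "UEFA", "UEFA"),
  rank = 1:5
)

# Sort ascending, because rank 1 is the best team
top <- ranks[order(ranks$rank), ]

# Count how many of the top teams belong to each confederation
top_by_conf <- table(top$confederation)
top_by_conf
```

On the full dataset, something like `wc_df %>% slice_min(rank, n = 5) %>% count(confederation)` should produce the same kind of tally, assuming `rank` and `confederation` are present as in the chunks above.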
<br>
### Goals Scored by Top Player
The plot below shows the number of goals scored by each country's record goal scorer, including both currently active and inactive players.
```{r echo=FALSE, warning=FALSE, message=FALSE}
tm_shape(wc_countries) +
  tm_polygons(
    col = "goals",
    style = "cont",
    palette = "viridis",
    id = "country",
    alpha = .7,
    border.col = "white",
    lwd = .5,
    breaks = seq(0, 120, by = 20),
    title = "Top Player Goals Scored (goals)")
```
<br>
```{r echo=FALSE, warning=FALSE, message=FALSE}
wc_df %>%
  mutate(
    country = fct_reorder(country, goals),
    text_label = str_c("Country: ", country, "\nGoals Scored: ", goals, "\nPlayer: ", player, "\nConfederation: ", confederation)) %>%
  plot_ly(
    x = ~country,
    y = ~goals,
    color = ~confederation,
    text = ~text_label,
    alpha = 0.7,
    legendgroup = ~confederation,
    colors = viridis_pal(option = "D")(3)) %>%
  layout(
    title = "Number of Goals Scored by Top Player in Each Country",
    xaxis = list(title = "Country"),
    yaxis = list(title = "Goals"))
```
<br>
```{r echo=FALSE, warning=FALSE, message=FALSE}
wc_df %>%
  ggplot(aes(x = confederation, y = goals, fill = confederation)) +
  geom_boxplot(alpha = .7) +
  labs(
    title = "Distribution of Top Goals Scored for Each FIFA Confederation",
    x = "Confederation",
    y = "Goals") +
  scale_fill_viridis_d(name = "Confederation") +
  theme(legend.position = "none")
```
<br>
Interestingly, even though Brazil ranks highest on most of the predictor variables in the dataset, it does not hold one of the top spots for goals scored by a country's record goal scorer. Portugal, Iran, and Argentina hold the top three positions and represent three different confederations. The countries with the highest-scoring record goal scorers appear to be spatially clustered in the UEFA (Europe), AFC (Asia and Australia), and CONMEBOL (South America) confederations.
<br>
### Goal Difference
The plot below shows each country's goal difference: the total number of goals it has scored minus the total number of goals scored against it across all World Cup tournaments.
```{r echo=FALSE, warning=FALSE, message=FALSE}
tm_shape(wc_countries) +
  tm_polygons(
    col = "gd",
    palette = "viridis",
    style = "cont",
    id = "country",
    alpha = .7,
    border.col = "white",
    lwd = .5,
    breaks = seq(-40, 140, by = 10),
    title = "Goal Difference (gd)")
```
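
The goal difference mapped above is goals for minus goals against. The block below is a small sketch of that arithmetic, using the dataset's column names (`gf`, `ga`, `gd`) on made-up teams and totals that are purely illustrative.

```r
# Hypothetical goals-for (gf) and goals-against (ga) totals for three teams
toy <- data.frame(
  country = c("Team A", "Team B", "Team C"),
  gf = c(30, 12, 8),
  ga = c(18, 12, 15)
)

# Goal difference: positive when a team has scored more than it has conceded
toy$gd <- toy$gf - toy$ga

# List teams from the largest to the smallest goal difference
toy[order(-toy$gd), ]
```

A negative value, as for the hypothetical Team C, means the team has conceded more goals than it has scored.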
<br>
```{r echo=FALSE, warning=FALSE, message=FALSE}
wc_df %>%
  mutate(
    country = fct_reorder(country, gd),
    text_label = str_c("Country: ", country, "\nGoal Difference: ", gd, "\nConfederation: ", confederation)) %>%
  plot_ly(
    x = ~country,
    y = ~gd,
    color = ~confederation,
    text = ~text_label,
    alpha = 0.7,
    legendgroup = ~confederation,
    colors = viridis_pal(option = "D")(3)) %>%
  layout(
    title = "Goal Difference In Each Country",
    xaxis = list(title = "Country"),
    yaxis = list(title = "Goal Difference"))
```
<br>
```{r echo=FALSE, warning=FALSE, message=FALSE}
wc_df %>%
  ggplot(aes(x = confederation, y = gd, fill = confederation)) +
  geom_boxplot(alpha = .7) +
  labs(
    title = "Distribution of Goal Difference for Each FIFA Confederation",
    x = "Confederation",
    y = "Goal Difference") +
  scale_fill_viridis_d(name = "Confederation") +
  theme(legend.position = "none")
```
<br>
Brazil has the highest positive goal difference, followed by Germany, Italy, France, and Argentina. Mexico, South Korea, and Bulgaria have the most negative goal differences. The national teams with the highest positive goal difference appear to be spatially clustered in the UEFA (Europe) and CONMEBOL (South America) Confederations.