R Programming and Fight Metrics

Implementing R

With the basics covered in the previous section, it was a matter of using the skills on a different dataset. I found one containing fight metrics for all UFC (Ultimate Fighting Championship) events from UFC 1 in 1993 up to UFC Fight Night 83 in 2016, available at:

https://www.reddit.com/r/datasets/comments/47a7wh/ufc_fights_and_fighter_data/

The first dataset, describing the fighters, contained the following variables:

Variable Name   Description
fid             Fighter ID
name            Fighter’s name
nick            Nickname
birth_date      Fighter’s date of birth
height          Fighter’s height
weight          Fighter’s weight
association     Training camp
class           Weight class
locality        City/County/State
Country         Country of origin

Using this dataset, the following graph was created to show the number of fighters who have fought in the UFC by weight class. The most active weight classes were Welterweight, Middleweight and Lightweight. The least active weight classes were Atomweight, Super Heavyweight and Strawweight.

weightclass count
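The count behind a chart like this can be sketched in a few lines of R. The data frame below is a small made-up stand-in for the fighter dataset (only the `class` column from the table above is assumed), and the plot is drawn only if the ggplot2 package is installed.

```r
# Hypothetical stand-in for the fighter dataset; the rows are made up.
fighters <- data.frame(
  fid   = 1:8,
  class = c("Welterweight", "Welterweight", "Welterweight",
            "Middleweight", "Middleweight",
            "Lightweight", "Lightweight",
            "Atomweight"),
  stringsAsFactors = FALSE
)

# Count fighters per weight class, then sort from most to least common.
class_counts <- as.data.frame(table(class = fighters$class),
                              responseName = "count",
                              stringsAsFactors = FALSE)
class_counts <- class_counts[order(-class_counts$count), ]
print(class_counts)

# Draw the bar chart only when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(class_counts, aes(x = reorder(class, -count), y = count)) +
    geom_bar(stat = "identity") +
    labs(x = "Weight class", y = "Number of fighters")
  print(p)
}
```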

The other dataset, describing the fights, contained the following variables:

Variable Name   Description
pageurl         Event URL
eid             Event ID
mid             Match ID
event_name      Name of event
event_org       Organisation name
event_date      Event date
event_place     Location of event
f1pageurl       Fighter 1 URL
f2pageurl       Fighter 2 URL
f1name          Fighter 1 name
f2name          Fighter 2 name
f1result        Fighter 1 result
f2result        Fighter 2 result
f1id            Fighter 1 ID
f2id            Fighter 2 ID
method          Method of victory
method_d        Method description
round           Round the fight finished in
time            Time the fight finished
ref             Referee of the fight

Using this data, the graph below was created showing the number of times fights ended in each of the possible outcomes.

results

What we can see from this is that most fights ended in a judges’ decision. About half as many ended by TKO (technical knockout), closely followed by submission victories, while KO (knockout) finishes came to under a third of the number of decisions.

The following graphs show the percentage of fights finished by decision, KO and submission from 1993 to 2016. (Decisions were not an outcome until later UFC events.)

decision

This shows that the share of fights going to decision has increased since the first events, reaching almost 50% of fights from 2010 onwards.

Ko

This graph shows that the share of fights ending by KO has always been low, staying under 25% since the beginning.

subs

This graph shows that submissions accounted for about 75% of finishes in the earliest events. This was because of fighters such as Royce Gracie, a Brazilian jiu-jitsu fighter who faced opponents looking to stand up and throw punches and kicks, and who won by taking the fight to the ground and neutralizing them. After this early success, as more and more fighters learned submissions and submission defences, the share of wins by submission fell below 25%.
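The yearly percentages behind the three graphs above can be sketched in base R as follows. The rows are made up for illustration, and the sketch assumes that `event_date` can be parsed as a date and that `method` has already been tidied into a few labels such as "Decision", "KO" and "Submission"; the real columns may need cleaning first.

```r
# Hypothetical fights; only event_date and method are used.
fights <- data.frame(
  event_date = as.Date(c("1993-11-12", "1993-11-12", "1994-03-11",
                         "2015-07-11", "2015-07-11", "2015-12-12")),
  method = c("Submission", "Submission", "KO",
             "Decision", "Submission", "Decision"),
  stringsAsFactors = FALSE
)

# Extract the year of each event.
fights$year <- as.integer(format(fights$event_date, "%Y"))

# Cross-tabulate year by method, then turn each row into percentages.
counts <- table(fights$year, fights$method)
share  <- round(100 * prop.table(counts, margin = 1), 1)
print(share)
```

Each row of `share` adds up to 100, so a single column (for example "Decision") plotted against the year gives one line per finishing method.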

With these two datasets loaded, it was then a matter of joining them together by fighter ID, counting the number of victories each fighter had, calculating a win percentage from that (shown as win_perc_adj) and outputting the results.
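A minimal sketch of that join in base R is shown below. The rows are invented, the result values "win" and "loss" are an assumption about how `f1result` and `f2result` are coded, and the adjustment is not defined in this post: the sketch simply shrinks each record towards 50% by adding a hypothetical `k` pseudo-wins and `k` pseudo-losses, so that fighters with very few fights do not top the table.

```r
# Hypothetical stand-ins for the two datasets.
fighters <- data.frame(
  fid   = c(1, 2, 3),
  name  = c("Fighter A", "Fighter B", "Fighter C"),
  class = c("Welterweight", "Welterweight", "Lightweight"),
  stringsAsFactors = FALSE
)
fights <- data.frame(
  f1id     = c(1, 1, 2, 3),
  f2id     = c(2, 3, 3, 1),
  f1result = c("win", "win", "loss", "loss"),
  f2result = c("loss", "loss", "win", "win"),
  stringsAsFactors = FALSE
)

# Stack both corners so that every row is one fighter in one fight.
long <- rbind(
  data.frame(fid = fights$f1id, result = fights$f1result),
  data.frame(fid = fights$f2id, result = fights$f2result)
)
long$win <- as.integer(long$result == "win")

# Fights and wins per fighter.
fights_n <- tapply(long$win, long$fid, length)
wins_n   <- tapply(long$win, long$fid, sum)
record <- data.frame(
  fid    = as.numeric(names(fights_n)),
  fights = as.vector(fights_n),
  wins   = as.vector(wins_n)
)

# Join on fighter ID and compute the adjusted win percentage.
k <- 3  # assumed smoothing weight, chosen only for illustration
ranking <- merge(fighters, record, by = "fid")
ranking$win_perc_adj <- (ranking$wins + k) / (ranking$fights + 2 * k)
ranking <- ranking[order(-ranking$win_perc_adj), ]
print(ranking[, c("name", "class", "fights", "wins", "win_perc_adj")])
```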

     name                 class              fights  wins  win_perc_adj
 1:  Jon Jones            Light Heavyweight  16      15    0.8142700
 2:  Georges St. Pierre   Welterweight       21      19    0.8116540
 3:  Conor McGregor       Featherweight      7       7     0.7636766
 4:  Yoel Romero          Middleweight       7       7     0.7636766
 5:  Tony Ferguson        Lightweight        11      10    0.7605096
 6:  Anderson Silva       Middleweight       19      16    0.7571829
 7:  Don Frye             Heavyweight        10      9     0.7457933
 8:  Chris Weidman        Middleweight       10      9     0.7457933
 9:  Khabib Nurmagomedov  Lightweight        6       6     0.7444223
10:  Royce Gracie         Middleweight       13      11    0.7334771

This gave the fighters with the best overall win percentage across all weight classes. Broken down by weight class, the top three in each are:

    Class              Name                 Fights  Wins  Win_perc_adj
 1  Light Heavyweight  Jon Jones            16      15    0.81427
 2  Light Heavyweight  Daniel Cormier       7       6     0.68834
 3  Light Heavyweight  Rashad Evans         19      14    0.67805
 4  Welterweight       Georges St. Pierre   21      19    0.811654
 5  Welterweight       Stephen Thompson     8       7     0.710175
 6  Welterweight       Warlley Alves        4       4     0.694669
 7  Featherweight      Conor McGregor       7       7     0.763677
 8  Featherweight      Jose Aldo            8       7     0.710175
 9  Featherweight      Max Holloway         14      11    0.697299
10  Middleweight       Yoel Romero          7       7     0.763677
11  Middleweight       Anderson Silva       19      16    0.757183
12  Middleweight       Chris Weidman        10      9     0.745793
13  Lightweight        Tony Ferguson        11      10    0.76051
14  Lightweight        Khabib Nurmagomedov  6       6     0.744422
15  Lightweight        Donald Cerrone       20      16    0.728364
16  Heavyweight        Don Frye             10      9     0.745793
17  Heavyweight        Cain Velasquez       13      11    0.733477
18  Heavyweight        Junior dos Santos    14      11    0.697299
19  Flyweight          Joseph Benavidez     13      11    0.733477
20  Flyweight          Demetrious Johnson   13      11    0.733477
21  Flyweight          Henry Cejudo         4       4     0.694669
22  Strawweight        Joanna Jedrzejczyk   5       5     0.721752
23  Strawweight        Tecia Torres         3       3     0.661745
24  Strawweight        Valerie Letourneau   4       3     0.597335
25  Bantamweight       Raphael Assuncao     8       7     0.710175
26  Bantamweight       Dominick Cruz        4       4     0.694669
27  Bantamweight       Aljamain Sterling    4       4     0.694669
28  Super Heavyweight  Jon Hess             1       1     0.56874
29  Super Heavyweight  Andre Roberts        3       2     0.553915
30  Super Heavyweight  Scott Ferrozzo       5       3     0.544351
31  Atomweight         Michelle Waterson    1       1     0.56874
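One way to produce a per-class breakdown like the table above is to split the ranked fighters by weight class and keep the first few rows of each group. This is a sketch on made-up scores; `top_n_per_class` is a hypothetical helper name, not something from the original analysis.

```r
# Hypothetical ranked fighters with an adjusted win percentage.
ranking <- data.frame(
  name  = c("A", "B", "C", "D", "E", "F"),
  class = c("Lightweight", "Lightweight", "Lightweight", "Lightweight",
            "Welterweight", "Welterweight"),
  win_perc_adj = c(0.60, 0.70, 0.50, 0.80, 0.65, 0.75),
  stringsAsFactors = FALSE
)

# Keep the n best fighters in every weight class.
top_n_per_class <- function(df, n = 3) {
  parts <- lapply(split(df, df$class), function(d) {
    head(d[order(-d$win_perc_adj), ], n)
  })
  out <- do.call(rbind, parts)
  rownames(out) <- NULL
  out
}

top <- top_n_per_class(ranking, n = 3)
print(top)
```

Classes with fewer than `n` fighters simply return everyone they have, which matches the single Atomweight row in the table.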

Further development of these datasets could provide both statistical and predictive information, such as:

  • Statistics
    • Number of events each year
    • Average fight times
  • Prediction
    • Likely results of fights, taking into consideration fighters’ win-loss records
    • How a fighter might win, be it by decision, KO or submission
    • Success rate of submission moves in ending fights
    • Spotting if there is any correlation between fighters and judges
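The two statistical items in the list could be sketched as below. The rows are made up, and the sketch rests on two assumptions that would need checking against the real data: that `time` is stored as "M:SS" within the final round, and that every round lasts five minutes (which is not true of the earliest events).

```r
# Hypothetical fights; eid, event_date, round and time follow the table above.
fights <- data.frame(
  eid        = c(1, 1, 2, 3),
  event_date = as.Date(c("2014-02-01", "2014-02-01", "2014-06-14", "2015-01-03")),
  round      = c(1, 3, 2, 5),
  time       = c("2:30", "5:00", "0:45", "5:00"),
  stringsAsFactors = FALSE
)

# Number of distinct events in each year.
fights$year <- as.integer(format(fights$event_date, "%Y"))
events_per_year <- tapply(fights$eid, fights$year,
                          function(x) length(unique(x)))
print(events_per_year)

# Convert "M:SS" in the final round into seconds.
parts <- strsplit(fights$time, ":", fixed = TRUE)
secs_in_round <- vapply(parts, function(p) {
  as.numeric(p[1]) * 60 + as.numeric(p[2])
}, numeric(1))

# Total fight time, assuming every earlier round ran the full 300 seconds.
fights$total_secs <- (fights$round - 1) * 300 + secs_in_round
mean_fight_secs <- mean(fights$total_secs)
print(mean_fight_secs)
```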

***Interesting Fact***

Since the mid-1960s, firefighters have added a wetting agent to make water "wetter", resulting in less friction in the hose and causing the water to pass through it more quickly.

Introduction to R and Should Leo Have Survived the Titanic

When dealing with large amounts of data, there are a number of languages that can be used to sort, analyse and graph that information. One of the most popular is R, a statistical programming language used throughout the world. It can be downloaded from http://www.r-project.org/

To begin using this language, it can be useful to try the beginner courses that are available throughout the internet. One of the most popular sites is Code School, which has a useful course called ‘Try R’, a step-by-step guide to the basics of R covering areas such as:

  1. Using R
  2. Vectors
  3. Matrices
  4. Summary Statistics
  5. Factors
  6. Data Frames
  7. Real-World Data

You receive a badge for each section you complete, for a total of 8 badges, with the eighth awarded for having completed the course.

completed codeschool

Having covered the basics of using R with data, I wanted to look at more uses of graphs in R. This was done using the ggplot2 package and the Cookbook for R website http://www.cookbook-r.com/Graphs/

The sample data set describes a restaurant and contains the time and total bill of different transactions, for example:

#> time total_bill
#> 1 Lunch 14.89
#> 2 Dinner 17.23
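For reference, a data frame matching the two rows printed above can be built as follows. The name `dat` is the one used in the plotting code in this section; storing `time` as a factor with "Lunch" first is an assumption made here so that the bars appear in that order rather than alphabetically.

```r
# Two-row sample data: mealtime and total bill.
dat <- data.frame(
  time = factor(c("Lunch", "Dinner"), levels = c("Lunch", "Dinner")),
  total_bill = c(14.89, 17.23)
)
print(dat)
```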

Using this data, different versions of a bar chart were created to display the totals using ggplot2.

For a basic bar graph showing time broken down into Lunch and Dinner on the X axis and the total amount on the Y axis the code used was:

library(ggplot2)
ggplot(data=dat, aes(x=time, y=total_bill)) +
    geom_bar(stat="identity")

This resulted in the graph below:

Rplot01

To add a bit more style to the chart, the time of day can be mapped to different fill colours. Adding 'fill=time' to the code results in the Lunch bar being filled with a pink colour and the Dinner bar with a blue colour:

ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) +
    geom_bar(stat="identity")

The graph produced makes the data stand out much better and look more interesting:

Rplot02

A black outline can be drawn around each of the time groups by adding 'colour="black"' to the code, making the edges of the bars more defined, as in the example below:

ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) +
    geom_bar(colour="black", stat="identity")

Rplot03

And finally, as there are only two values and they are already labelled on the X axis, the legend on the side of the chart is unnecessary. It is easily removed by using 'guides(fill="none")' (written as 'guides(fill=FALSE)' in older versions of ggplot2), meaning that the legend will not be displayed to the right of the chart.

ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) +
    geom_bar(colour="black", stat="identity") +
    guides(fill="none")

Rplot04


And now onto the Leo question

Having used these skills on example datasets, the next idea was to apply them on a larger scale with other data. While looking for information I came across a data set on Kaggle.com (https://www.kaggle.com/c/titanic) containing information on Titanic passengers such as age, sex, departure location, ticket price and whether they survived or not. In theory, this information could be used with machine learning to predict, given certain details about a person, whether they would have survived a trip on the Titanic. Referencing an article by Megan Risdal, the results and steps were as follows.

The datasets used were the train.csv and test.csv downloadable from the Get the Data section on Kaggle.

The datasets were made up of the following variables:


Variable Name   Description
Survived        Survived (1) or died (0)
Pclass          Passenger’s class
Name            Passenger’s name
Sex             Passenger’s sex
Age             Passenger’s age
SibSp           Number of siblings/spouses aboard
Parch           Number of parents/children aboard
Ticket          Ticket number
Fare            Fare
Cabin           Cabin
Embarked        Port of embarkation

Using this information it was possible to create graphs illustrating the correlations between characteristics of the passengers, such as family size, gender and age, and whether they survived or not.

Rplot

The first two graphs shown above illustrate the survival of passengers who traveled alone or in a group. What we can see is that people traveling alone were more likely than not to perish, as were people who traveled in a family of more than 4 members, but those in a family group of 2, 3 or 4 had a better chance of survival.
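The family-size grouping described above can be sketched from the `SibSp` and `Parch` columns. The passengers below are invented to mirror the pattern in the text rather than taken from train.csv, and the group labels and cut points (1, 2 to 4, more than 4) are assumptions based on that description.

```r
# Made-up passengers; only Survived, SibSp and Parch are needed.
passengers <- data.frame(
  Survived = c(0, 0, 0, 1, 1, 1, 0, 1, 0, 0),
  SibSp    = c(0, 0, 0, 0, 1, 1, 1, 2, 4, 3),
  Parch    = c(0, 0, 0, 0, 0, 1, 2, 1, 2, 2)
)

# Family size = siblings/spouses + parents/children + the passenger.
passengers$Fsize <- passengers$SibSp + passengers$Parch + 1

# Group into travelling alone, families of 2 to 4, and larger families.
passengers$FsizeD <- cut(passengers$Fsize,
                         breaks = c(0, 1, 4, Inf),
                         labels = c("singleton", "small", "large"))

# Survival rate within each group.
rate <- tapply(passengers$Survived, passengers$FsizeD, mean)
print(rate)
```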

The next graph broke down age and survival rate for both male and female passengers. What can be taken from this is that being a child under 18 years of age gave you a better chance of survival, but it was not a guarantee: even if you survived the sinking, the cold and harsh weather conditions would still have been difficult for the passengers. The data also shows that you had a much better chance of surviving if you were female, regardless of age, than your male counterpart. And finally, if you were an 81-year-old male you were assured to survive, which is good to know.
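The comparison by sex and age group can be sketched with `aggregate()`. The rows are again invented, the `Child` column and its cut-off of 18 follow the wording above, and passengers with a missing age are simply dropped here, which is a simplification.

```r
# Made-up passengers with one missing age.
passengers <- data.frame(
  Survived = c(1, 1, 0, 1, 0, 0, 1, 0),
  Sex      = c("female", "female", "female", "male",
               "male", "male", "female", "male"),
  Age      = c(29, 4, 40, 10, 35, 22, NA, 16),
  stringsAsFactors = FALSE
)

# Flag children (under 18); a missing age gives a missing flag.
passengers$Child <- ifelse(passengers$Age < 18, "Child", "Adult")

# Mean of Survived = survival rate for each Sex/Child combination.
# Rows with a missing Child value are omitted by aggregate().
rates <- aggregate(Survived ~ Sex + Child, data = passengers, FUN = mean)
print(rates)
```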

Further analysis of the data could provide a graph to illustrate the importance of each of the supplied fields to the chance of surviving the Titanic sinking.
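As a rough illustration of how such a model could be fitted and queried, the sketch below uses a plain logistic regression (`glm`) in place of whatever method the referenced article uses, so it should not be read as that article's approach. The training data is simulated with an invented relationship purely so that the code runs; the coefficients and the predicted probability say nothing about the real Titanic data.

```r
set.seed(42)
n <- 400

# Simulated passengers (NOT real Titanic data).
train <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE),
                  levels = c("female", "male")),
  Age    = round(runif(n, 1, 70)),
  Fsize  = sample(1:6, n, replace = TRUE)
)

# Invented relationship so the model has something to learn.
lin <- 2.5 - 0.8 * train$Pclass - 2.0 * (train$Sex == "male") - 0.02 * train$Age
train$Survived <- rbinom(n, 1, plogis(lin))

# Fit a logistic regression; the coefficient table gives a rough sense
# of which fields matter most in this toy data.
model <- glm(Survived ~ Pclass + Sex + Age + Fsize,
             data = train, family = binomial)
print(summary(model)$coefficients)

# Predicted survival probability for a hypothetical passenger:
# a 20-year-old male in third class travelling alone.
passenger <- data.frame(
  Pclass = 3,
  Sex    = factor("male", levels = c("female", "male")),
  Age    = 20,
  Fsize  = 1
)
p_survive <- predict(model, newdata = passenger, type = "response")
print(p_survive)
```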

So, in conclusion, to answer the original question of whether Leo should have survived. From a quick trip to the Wikipedia page of the 1997 film https://en.wikipedia.org/wiki/Titanic_(1997_film) we can glean that Leonardo DiCaprio’s character, Jack Dawson, was a 20-year-old male (obviously) who was traveling 3rd class and departed alone. Filling this information into the model we already have, it is likely that Leo’s character would have perished aboard the sinking ship regardless of being the star of the film and dying for the plot and sad ending... even though he could have fit on the wreckage as shown below.

***Bonus random Titanic Fact***

There were 7,000 heads of lettuce aboard the Titanic.

Fusion Tables

Fusion Table of Irish Population

Above is a heat map made from a fusion table of the Irish population by county, including how each county is made up of male and female population, according to the 2011 Irish census. The data was obtained from the Central Statistics Office and is displayed using Google Maps and its geocoding to give a visualization of the data.

What we can observe from the data is that the larger population centers are based around the coastal areas, and that the largest populations are centered in counties with major ports and airports. These traditionally coincide with the Irish cities (Dublin, Galway and Cork), which have allowed for trade and commerce throughout the ages. The development and availability of public transport, the construction and upgrading of roads, and technological development have had knock-on effects beyond these counties, which can be seen in the increased population of the surrounding areas.

This information can be used for planning roads and transport routes between larger population centers, which might encourage growth in counties with lower figures. It can also be used for looking at house and rental prices in relation to population.

Creating the Fusion Table

The first step in creating the above heat map was to install Google Fusion Tables in Chrome and import the files that were to be merged, including one with a geographical map.

To begin, population data for Ireland from the 2011 census was downloaded from the Central Statistics Office website http://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2011/

The next table uploaded contained the geometry information of Ireland and the names of the counties. This information was taken from the Irish Independent website:

http://www.independent.ie/editorial/test/map_lead.kml

The information then had to be cleaned, removing errors and correcting the locations so that they would match the statistics table. The main issues to be corrected were the county names and the breakdown of some counties into north and south areas, which had to be combined into one.
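The cleaning here was done by hand, but the same kind of tidy-up can be sketched in R for a table that carries a figure per area. The names, populations and the `clean_name` helper below are hypothetical; the sketch only shows the idea of normalising county names and combining north and south areas into one row, and it does not merge map geometries.

```r
# Hypothetical raw table with inconsistent names and a split county.
raw <- data.frame(
  county     = c("Tipperary North", "Tipperary South", " cork ", "GALWAY"),
  population = c(70000, 88000, 519000, 250000),  # placeholder figures
  stringsAsFactors = FALSE
)

# Trim spaces, drop a trailing North/South and standardise the case.
clean_name <- function(x) {
  x <- trimws(x)
  x <- sub("[[:space:]]+(north|south)$", "", x, ignore.case = TRUE)
  x <- tolower(x)
  paste0(toupper(substring(x, 1, 1)), substring(x, 2))
}
raw$county <- clean_name(raw$county)

# Combine the areas that now share a county name.
combined <- aggregate(population ~ county, data = raw, FUN = sum)
print(combined)
```

Once the names agree on both sides, the statistics table and the geometry table can be matched on the county column.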

The two files were then merged together, creating a new document named ‘Irish Population by County’. Once the tables were merged, the map was styled and edited.

To colour the counties according to their population, go to the configure map menu in the map view and click the change feature styles button. Select the fill color option, open the buckets tab and choose 8 buckets. For this map, each bucket was allocated a different colour for each of the population ranges. The colours and ranges assigned were as follows:

  • Grey for counties with a population between 31,798 and 75,000
  • Light purple for counties with a population between 75,000 and 100,000
  • Light blue for counties with a population between 100,000 and 125,000
  • Dark blue for counties with a population between 125,000 and 150,000
  • Green for counties with a population between 150,000 and 200,000
  • Yellow for counties with a population between 200,000 and 400,000
  • Orange for counties with a population between 400,000 and 1,000,000
  • Red for counties with a population between 1,000,000 and 1,274,000

This distribution was chosen to clearly highlight the highly populated counties against the less populated counties, using contrasting colours for the different ranges.
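The same bucketing can be reproduced outside Fusion Tables with R's `cut()`, which is handy for checking which colour a county should receive. The break points and colours are the ones listed above; the population values are illustrative placeholders, not census figures.

```r
# Bucket boundaries and colours as described for the map.
breaks  <- c(31798, 75000, 100000, 125000, 150000,
             200000, 400000, 1000000, 1274000)
colours <- c("grey", "light purple", "light blue", "dark blue",
             "green", "yellow", "orange", "red")

# Placeholder populations, one per bucket.
population <- c(31798, 90000, 110000, 140000,
                180000, 250000, 519000, 1273000)

# include.lowest = TRUE keeps the smallest value (31,798) in the first bucket.
bucket <- cut(population, breaks = breaks, labels = colours,
              include.lowest = TRUE)
print(data.frame(population, bucket))
```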

Population Gender Breakdown

Below are charts showing the population of the whole of Ireland from 1911 to 2011.

This graph shows the sum of the population through the ages and then with it broken down by male and female.

This graph shows the male and female population in the Republic of Ireland from 1911 to 2011. It shows that there was a large drop in the population until 1961, at which point the population grew until 1986. It then stayed steady for the female population and dropped for males until 1991, after which we see an increase, with the female population increasing more than the male.

This graph shows Northern Ireland’s gender breakdown from 1911 to 2011. It shows that there has always been a larger number of females than males, that the birth rates have increased year on year since 1936, and that this gender balance has continued. Possible reasons for this could be geographical, environmental or economic. With Northern Ireland being so small, opportunities for employment and livelihood were limited, and historically younger men would travel to areas where employment was more likely, be that in England, Scotland, Wales, the Republic of Ireland or further afield. With this in mind, it is understandable that the ratio of males to females would be different.

Using this information, it was a matter of looking at why the population was so low in the 1960s and what happened at that point to allow such an increase in population. One possible reason is that the world in general was going through a population increase after World War 2, but where this had been happening in the rest of the world and Europe for a number of years, the increase in Ireland took a number of years longer to take effect. This was coupled with the end of a long period of economic isolation and attempted self-reliance, and the introduction of a more open market. This resulted in better employment opportunities, reducing the number of people emigrating to other countries, with younger people staying in the country and starting a family at home, which was also helped by the expansion of the welfare system.