Introduction to R, and Should Leo Have Survived the Titanic?

When dealing with large amounts of data, there are a number of languages that can be used to sort, analyse and graph that information. One of the most popular is R, a statistical programming language used throughout the world. It can be downloaded from http://www.r-project.org/

To begin using the language, it can be useful to try one of the beginner courses available on the internet. One of the most popular sites is Code School, which has a useful course called ‘Try R’, a step-by-step guide to the basics of R covering areas such as

  1. Using R
  2. Vectors
  3. Matrices
  4. Summary Statistics
  5. Factors
  6. Data Frames
  7. Real-World Data

You receive a badge for completing each section, for a total of 8 badges, with the 8th badge awarded for completing the whole course.

completed codeschool

With the basics of working with data in R covered, I wanted to look at more of the graphing options in R. This was done using the ggplot2 package and the Cookbook for R website http://www.cookbook-r.com/Graphs/

The sample data set is from a restaurant and contains the time and total bill of different transactions, i.e.

#>     time total_bill
#> 1  Lunch      14.89
#> 2 Dinner      17.23
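
The plotting code in this post assumes a data frame called dat holding these two rows. A minimal sketch of how it could be built is shown here. Setting the factor levels explicitly so that Lunch is drawn before Dinner is an assumption based on the order of the printed output.

```r
# Sample restaurant data: one row per meal time.
# The explicit levels keep Lunch ahead of Dinner on the X axis
# (without them R would order the levels alphabetically).
dat <- data.frame(
  time = factor(c("Lunch", "Dinner"), levels = c("Lunch", "Dinner")),
  total_bill = c(14.89, 17.23)
)

print(dat)
```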

Using this data, different versions of a bar chart were created to display the totals using the ggplot2 tool set.

For a basic bar graph showing time broken down into Lunch and Dinner on the X axis and the total amount on the Y axis, the code used was:

# Load ggplot2, then draw one bar per meal time
library(ggplot2)
ggplot(data=dat, aes(x=time, y=total_bill)) +
  geom_bar(stat="identity")

This resulted in the graph below:

Rplot01

To add a bit more style to the chart, the time of day can be mapped to different fill colours. Adding ‘fill=time’ to the code results in the Lunch bar being filled with a pink colour and the Dinner bar with a blue colour:

ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) +
  geom_bar(stat="identity")

The graph produced makes the data stand out much better and look more interesting

Rplot02

A black outline can be added around each of the time groups by adding ‘colour="black"’ to the code, making the edges more defined in the graph, as in the code example below:

ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) +
  geom_bar(colour="black", stat="identity")

Rplot03

And finally, as there are only two values and they are already labelled on the X axis, a legend on the side of the chart is unnecessary. The legend is easily removed with ‘guides(fill="none")’, meaning that it will not be displayed to the right of the chart. (Older versions of ggplot2 were written with ‘guides(fill=FALSE)’, which recent versions flag as deprecated.)

ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) +
  geom_bar(colour="black", stat="identity") +
  guides(fill="none")

Rplot04


And now onto the Leo question

Having used these skills on example data sets, the idea was next to try them on a larger scale with other data. While looking for information I came across a data set on Kaggle.com (https://www.kaggle.com/c/titanic) containing information on Titanic passengers such as age, sex, departure location, ticket price and whether they survived or not. In theory, we could use machine learning on this information to predict whether a person with given characteristics would survive a trip on the Titanic. Referencing an article by Megan Risdal, the results and steps were as follows.

The datasets used were the train.csv and test.csv downloadable from the Get the Data section on Kaggle.
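
As a rough sketch, the two files could be read into R as shown here. The helper name load_titanic and the variable names train_data and test_data are made up for illustration, and the code assumes both CSV files have been saved in the current working directory.

```r
# Read one of the Kaggle CSV files if it is present; return NULL otherwise.
# (load_titanic is a hypothetical helper, not part of any package.)
load_titanic <- function(path) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(NULL)
  }
  # Keep text columns such as Name and Ticket as plain character strings
  read.csv(path, stringsAsFactors = FALSE)
}

train_data <- load_titanic("train.csv")
test_data  <- load_titanic("test.csv")

# Show the column names and types once the training file is available
if (!is.null(train_data)) str(train_data)
```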

The datasets were made up of the following variables

Variable Name   Description
Survived        Survived (1) or died (0)
Pclass          Passenger’s class
Name            Passenger’s name
Sex             Passenger’s sex
Age             Passenger’s age
SibSp           Number of siblings/spouses aboard
Parch           Number of parents/children aboard
Ticket          Ticket number
Fare            Fare
Cabin           Cabin
Embarked        Port of embarkation

Using this information it was possible to create graphs illustrating the correlations between different passenger attributes, such as family size, gender and age, and whether the passengers survived or not.

Rplot

The first two graphs shown above illustrate the survival of passengers who traveled alone or in a group of more than one. What we can see is that people who were traveling alone were more likely than not to perish, as were people who traveled in a family of more than 4 members, but if you were in a family group of 2, 3 or 4 you had a better chance of survival.
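
The family-size grouping behind those graphs can be sketched roughly as follows. It assumes family size is counted as SibSp + Parch + 1 (the passenger themselves), which is my understanding of how the referenced analysis derived it. The column name Fsize, the helper names and the four example rows are made up for illustration and are not real Titanic records.

```r
# Family size = siblings/spouses + parents/children + the passenger.
add_family_size <- function(passengers) {
  passengers$Fsize <- passengers$SibSp + passengers$Parch + 1
  passengers
}

# Share of survivors within each family size
# (the mean of a 0/1 column is the survival rate).
survival_by_family_size <- function(passengers) {
  aggregate(Survived ~ Fsize, data = passengers, FUN = mean)
}

# Made-up rows, only to show the shape of the input and output.
toy_passengers <- data.frame(
  Survived = c(0, 1, 1, 0),
  SibSp    = c(0, 1, 1, 0),
  Parch    = c(0, 0, 1, 0)
)

family_rates <- survival_by_family_size(add_family_size(toy_passengers))
print(family_rates)
```

With train.csv loaded into a data frame, the same two calls could be applied to it in place of the made-up rows.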

The next graph broke down age and survival for both male and female passengers. What can be taken from this is that being a child under 18 years of age gave you a better chance of survival but was not a guarantee, as even if you survived the sinking itself, the cold and harsh weather conditions would have been difficult to endure. The data also shows that you had a much better chance of surviving if you were female, regardless of age, than your male counterpart. And finally, if you were an 81 year old male you were assured of survival, which is good to know.
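
A comparable breakdown by sex and age group can be sketched in a few lines. The cut-off of 18 follows the paragraph above. The AgeGroup column, the helper name and the example rows are made up for illustration, and rows with a missing age are simply dropped here, which is an assumption rather than what the referenced article did.

```r
# Survival rate for each combination of sex and age group.
survival_by_sex_and_age <- function(passengers) {
  # Drop passengers whose age was not recorded
  known_age <- passengers[!is.na(passengers$Age), ]
  # Label everyone under 18 as a child
  known_age$AgeGroup <- ifelse(known_age$Age < 18, "Child", "Adult")
  aggregate(Survived ~ Sex + AgeGroup, data = known_age, FUN = mean)
}

# Made-up rows, only to show the shape of the input and output.
toy_ages <- data.frame(
  Survived = c(1, 0, 1, 1, 0),
  Sex      = c("female", "male", "male", "female", "male"),
  Age      = c(30, 40, 10, NA, 25)
)

rates_by_group <- survival_by_sex_and_age(toy_ages)
print(rates_by_group)
```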

Further analysis of the data could provide a graph to illustrate the importance of each of the supplied fields to the chance of surviving the Titanic sinking.

So, in conclusion, to answer the original question: should Leo have survived? From a quick trip to the Wikipedia page of the 1997 film (https://en.wikipedia.org/wiki/Titanic_(1997_film)) we can glean that Leonardo DiCaprio’s character, Jack Dawson, was a 20 year old male (obviously) who was traveling 3rd class and departed alone. Filling this information into the model we already have, it is likely that Leo’s character would have perished aboard the sinking ship, regardless of being the star of the film and dying for the plot and sad ending... even though he could have fit on the wreckage as shown below.
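
For anyone who wants to try the prediction themselves, here is a minimal sketch. It swaps in a plain logistic regression (base R's glm) rather than whatever model the referenced article used, it assumes train.csv is in the working directory, and the choice of predictor columns is mine. No result is shown because the number depends on the fitted model.

```r
# Jack Dawson as described above: 20 year old male, 3rd class, traveling alone.
jack <- data.frame(Pclass = 3, Sex = "male", Age = 20, SibSp = 0, Parch = 0)

if (file.exists("train.csv")) {
  train <- read.csv("train.csv", stringsAsFactors = FALSE)
  # Logistic regression on a handful of columns; rows with a missing Age
  # are dropped by glm's default NA handling.
  fit <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch,
             data = train, family = binomial)
  # Estimated probability of survival for the single row above
  print(predict(fit, newdata = jack, type = "response"))
} else {
  message("train.csv not found; skipping the model fit")
}
```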

***Bonus random Titanic Fact***

There were 7,000 heads of lettuce aboard the Titanic.
