Data Management & Data Governance


Data management is the term used to describe how companies and organisations around the world handle the data they record every day, with more data being generated each year than the last. Having this information to hand is proving increasingly important, yet maintaining uniformity across machines in different countries, time zones and organisations can be difficult. This is why more and more companies are turning to Master Data Management (MDM) to gain the maximum benefit from the available data and to use it in the most effective way possible.

According to a Gartner article from 2013:

“Master data management (MDM) is a technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise’s official shared master data assets.  

Master data is the consistent and uniform set of identifiers and extended attributes that describes the core entities of the enterprise including customers, prospects, citizens, suppliers, sites, hierarchies and chart of accounts.”

What this implies is that for a company to really benefit from MDM, the right manner and methods must be used when dealing with data and data entry. The correct handling of MDM is important for companies because:

  • Master data is used to make decisions on all company levels.
  • Business processes throughout the entire company rely on master data.
  • Higher quality master data helps to improve the operational efficiency of a company.
  • With high-quality master data, costs can be reduced.

The goal of an MDM initiative is to provide processes for collecting, aggregating, matching, consolidating, assuring quality and distributing critical data throughout an organisation, to ensure consistency and control in the ongoing maintenance and application use of this information.

In order to utilise Master Data Management there are different processes that can be followed:

  • ETL: Extract, Transform, Load. A database process responsible for extracting data from different sources, transforming it and loading it into a consolidated location (a rough sketch follows this list)
  • EAI: Enterprise Application Integration, the term for the plans, methods and tools used to modernise, consolidate and coordinate data for use across a company's systems
  • EII: Enterprise Information Integration, the ability to support a unified view of data and information for an entire organisation
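As a rough illustration of the ETL idea in R (the file names and the customer_id/name columns here are made up for the example, not taken from any real system):

library(dplyr)

# Extract: read customer records from two hypothetical source systems
sales_customers <- read.csv("sales_customers.csv", stringsAsFactors = FALSE)
crm_customers   <- read.csv("crm_customers.csv", stringsAsFactors = FALSE)

# Transform: standardise names and de-duplicate on a shared customer_id
master_customers <- bind_rows(sales_customers, crm_customers) %>%
  mutate(name = trimws(toupper(name))) %>%
  distinct(customer_id, .keep_all = TRUE)

# Load: write the consolidated master records to a single location
write.csv(master_customers, "master_customers.csv", row.names = FALSE)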

Data Governance



Data Governance is used by organisations to pursue shared goals around company/corporate policies for data definition and enforcement, and for communicating ideas and principles.

Because most company data is held in databases and on computers, it is often assumed that responsibility for the data should fall within the IT department, but this is not always the case, and some people do not see the need to look after data governance in a company on a continuing basis.

Data governance initiatives can improve data quality by having an assigned team responsible for the data's accuracy, accessibility, consistency and completeness, among other metrics. The team commonly consists of project managers, business managers and data stewards.

These are the people who drive the strategy and vision for the data, decide what data is stored, assign and manage the data stewards, and set "best practices" for the company to follow.

Some of the most popular tools for Data Governance are:

  • Onesoft Connect
  • A.K.A
  • Collibra
  • Acaveo
  • BigData
  • Fusion Platform

References


Gartner (2013) 'Master Data Management (MDM)'. IT Glossary. [Online]. Available from: http://www.gartner.com/it-glossary/master-data-management-mdm [Accessed 16th September 2016]

Baum, D. (n.d.) 'Masters of the Data'. Oracle. [Online]. Available from: http://www.oracle.com/us/c-central/cio-solutions/information-matters/importance-of-data/index.html [Accessed 17th September 2016]

Couture, N. (n.d.) 'Implementing an Enterprise Data Quality Strategy'. Business Intelligence Journal, vol. 18, no. 4.

David L. (n.d.) 'Data Governance for Master Data Management and Beyond'. SAS. [Online]. Available from: http://www.sas.com/content/dam/SAS/en_us/doc/whitepaper1/data-governance-for-MDM-and-beyond-105979.pdf [Accessed 15th September 2016]

Big Data and the 3ish V’s


Data can be made up of facts, figures and bits of information, but data is not in itself information. When data are processed, interpreted, organised, structured or presented in a way that makes them meaningful or useful, the result is called information, and information provides context for data.

Companies around the world are now realising that access to large amounts of raw data is a valuable commodity, and that how that data is handled is just as important. Raw data is not valuable on its own, but once it is processed and analysed it can be turned into information that can guide companies through specific market targeting, product development, organisational changes and so on.

This data can come from different sources, but it is classified as big data when the information being gathered is collected in high volume, at great velocity/speed and in a wide variety of forms. It is often quoted as:

“‘Big Data’ is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”

– Gartner (Gartner, Inc. (NYSE: IT) is the world’s leading information technology research and advisory company)

What this breaks down to is what many people refer to as the 3 V's of Big Data:

  • Volume – The amount of data available in the world takes huge leaps every year, with the common theory being that "the amount of data in the world doubles every two years". This can be seen in the evolution of web services such as Facebook, Netflix, Amazon and Google, which keep offering new services, adapting existing ones and adding new ways for people to upload and share their data with the world (and also with the hosting site).
  • Velocity – The speed at which new data is created, stored and analysed can greatly affect data management, since having the most current and relevant information can have drastic effects on how a company operates and how quickly it can adapt to changing trends or shifts in the market.
  • Variety – This refers to the wide array of data being collected from different sources: social websites such as Facebook and Twitter; smartphones supplying SMS, GPS locations, photos, audio and video through the apps used daily by a large share of the world's population; and office documents such as Word files, spreadsheets, and relational and unstructured databases, all used in business every day. With all this information being created across so many different mediums, most of it can fairly be called unstructured, and how to make this varied data useful to organisations, and how the pieces relate to each other, is something that continues to be explored and utilised.

These 3 V's are used to identify the key dimensions of Big Data, but they are certainly not the only V's that have been associated with it: some sources add two or three items to the list, others as many as eleven more, each adapting and building on the three core elements.

Some of the most popular additions are:

  • Value – How valuable is the information being gathered, and does the data being collected have a value at all? Having access to data does not mean that the data has any value.
  • Viability – In an article in Wired, Neil Biehn stated that "we want to carefully select attributes and factors that are most likely to predict outcomes that matter most to businesses". In short, with so many varieties of data and variables to consider when building an effective predictive model, the viability of the gathered information needs to be assessed.
  • Veracity – How confident can you be in the data; is it reliable and accurate?
  • Variability – How likely is the data to change? In the case of sales, is the information you are receiving a seasonal trend, or could the changes be part of a temporary flux (such as the popularity of Ugg boots; see references below)?
  • Visualization – Can the data be viewed in a manner that is easy to understand and comprehend?
  • Virality – How easily is the data used by others, and at what rate?
  • Volatility – How stable is the data? Does the information have a deadline or a best-before date after which it becomes useless or irrelevant?

Looking at these, and adding my own opinion, it can be said that along with the 3 core V's some of the additional V's do have value and should be considered useful in breaking down Big Data, while others could be considered unnecessary, serving only to break existing factors down further.


References

Biehn, N. (2013) 'The Missing V's in Big Data: Viability and Value'. Wired. Available at: https://www.wired.com/insights/2013/05/the-missing-vs-in-big-data-viability-and-value/ (Accessed: 08 September 2016).

Blueshift Ideas (n.d.) 'Downtrend in Deckers UGG Boots Will Only Worsen'. Available at: http://blueshiftideas.com/reports/051504DownTrendinDeckersUGGBootsWillOnlyWorsen.pdf (Accessed: 12 September 2016).

MapR (2014) 'Top 10 Big Data Challenges – A Serious Look at 10 Big Data V's'. Available at: https://www.mapr.com/blog/top-10-big-data-challenges-%E2%80%93-serious-look-10-big-data-v%E2%80%99s (Accessed: 10 September 2016).

Diffen (n.d.) 'Data vs Information – Difference and Comparison'. Available at: http://www.diffen.com/difference/Data_vs_Information (Accessed: 11 September 2016).


***Interesting Fact***
During World War II, the crew of the British submarine HMS Trident kept a fully grown reindeer called Pollyanna aboard their vessel for six weeks (it was a gift from the Russians).

R Programming and Fight Metrics

Implementing R

With the basics covered in the previous section, it was a matter of using those skills on a different dataset. I found one containing fight metrics for all UFC (Ultimate Fighting Championship) events from UFC 1 in 1993 up to UFC Fight Night 83 in 2016, available at:

https://www.reddit.com/r/datasets/comments/47a7wh/ufc_fights_and_fighter_data/

The data contained in the dataset was held in the following variables:

Variable – Description
fid – Fighter ID
name – Fighter's name
nick – Nickname
birth_date – Fighter's date of birth
height – Fighter's height
weight – Fighter's weight
association – Training camp
class – Weight class
locality – City/County/State
country – Country of origin

Using this dataset, the following graph was created giving a visual breakdown of the number of fighters that have fought in the UFC by weight class. What we can see is that the most active weight classes were Welterweight, Middleweight and Lightweight, while the least active were Atomweight, Super Heavyweight and Strawweight.
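A minimal sketch of how a count like this can be produced with ggplot2; the file name fighters.csv is an assumption, while the class column comes from the table above:

library(ggplot2)

# Assumed file name for the fighter table described above
fighters <- read.csv("fighters.csv", stringsAsFactors = FALSE)

# Count the number of fighters in each weight class
ggplot(fighters, aes(x = class)) +
  geom_bar() +
  coord_flip() +   # horizontal bars keep the class labels readable
  labs(x = "Weight class", y = "Number of fighters")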

[Figure: number of UFC fighters by weight class]

The other dataset used contained the following variables:

Variable – Description
pageurl – Event URL
eid – Event ID
mid – Match ID
event_name – Name of event
event_org – Organisation name
event_date – Event date
event_place – Location of event
f1pageurl – Fighter 1 URL
f2pageurl – Fighter 2 URL
f1name – Fighter 1 name
f2name – Fighter 2 name
f1result – Fighter 1 result
f2result – Fighter 2 result
f1id – Fighter 1 ID
f2id – Fighter 2 ID
method – Method of victory
method_d – Method description
round – Round the fight finished
time – Time the fight finished
ref – Referee of the fight

Using this data, the graph below was created showing the number of times fights ended in each of the different possible outcomes.
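A similar sketch for the outcome counts, again assuming the fight data has been saved as fights.csv and using the method column from the table above:

library(ggplot2)

# Assumed file name for the fight table described above
fights <- read.csv("fights.csv", stringsAsFactors = FALSE)

# Count how often each finish method (decision, TKO, submission, KO, ...) occurs
ggplot(fights, aes(x = method)) +
  geom_bar() +
  coord_flip() +
  labs(x = "Method of victory", y = "Number of fights")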

[Figure: number of fights by finish method]

What we can see from this is that the majority of fights went to the judges' decision, followed by about half as many ending by TKO (technical knockout), closely followed by submission victories, with KO (knockout) finishes at under a third of the number of decisions.

The following graphs show the percentage of fights finished by decision, KO and submission from 1993 to 2016 (decisions were not an outcome until later UFC events).
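A rough sketch of how one of these yearly percentages could be calculated, assuming event_date parses with as.Date and that decision outcomes contain the word "decision" in the method column:

library(dplyr)
library(ggplot2)

fights <- read.csv("fights.csv", stringsAsFactors = FALSE)

# Share of fights ending in a decision per year; the date format and the
# wording of the method values are assumptions about this particular dataset
fights %>%
  mutate(year = as.numeric(format(as.Date(event_date), "%Y")),
         is_decision = grepl("decision", method, ignore.case = TRUE)) %>%
  group_by(year) %>%
  summarise(pct_decision = 100 * mean(is_decision)) %>%
  ggplot(aes(x = year, y = pct_decision)) +
  geom_line() +
  labs(x = "Year", y = "% of fights ending in a decision")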

[Figure: percentage of fights ending in a decision, 1993–2016]

This shows that the number of fights going to a decision has increased since the beginning of the events, reaching almost 50% of fights from 2010 onwards.

[Figure: percentage of fights ending in a KO, 1993–2016]

This graph shows that the proportion of fights ending by KO has always been comparatively low, staying under 25% since the beginning.

[Figure: percentage of fights ending in a submission, 1993–2016]

This graph shows that the proportion of submissions at the beginning was about 75%. This was largely because of fighters such as Royce Gracie, a Brazilian jiu-jitsu fighter, who would face opponents looking to stand up and throw punches and kicks, and would win by taking the fight to the ground and neutralising them. After this early success, as more and more fighters learned submissions and submission defences, the number of wins by submission fell below 25%.

With these two datasets loaded, it was then a matter of joining them together by fighter ID, looking at the number of victories each fighter had, calculating the win percentage and outputting the results.
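A sketch of this join and win-rate calculation with dplyr; note that it only computes a raw win percentage and assumes the result columns use the value "win", so it does not reproduce the adjustment behind win_perc_adj in the table below:

library(dplyr)

fighters <- read.csv("fighters.csv", stringsAsFactors = FALSE)
fights   <- read.csv("fights.csv",   stringsAsFactors = FALSE)

# Look at every fight from each fighter's side, then join the fighter details
per_fighter <- bind_rows(
  fights %>% transmute(fid = f1id, result = f1result),
  fights %>% transmute(fid = f2id, result = f2result)
)

win_table <- per_fighter %>%
  group_by(fid) %>%
  summarise(fights = n(),
            wins   = sum(result == "win", na.rm = TRUE)) %>%
  mutate(win_perc = wins / fights) %>%
  inner_join(fighters, by = "fid") %>%
  arrange(desc(win_perc)) %>%
  select(name, class, fights, wins, win_perc)

head(win_table, 10)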

name class fights wins win_perc_adj
 1: Jon Jones  Light Heavyweight 16 15 0.8142700
 2: Georges St. Pierre Welterweight 21 19 0.8116540
 3: Conor McGregor Featherweight 7 7 0.7636766
 4: Yoel Romero Middleweight 7 7 0.7636766
 5: Tony Ferguson Lightweight 11 10 0.7605096
 6: Anderson Silva Middleweight 19 16 0.7571829
 7: Don Frye Heavyweight 10 9 0.7457933
 8: Chris Weidman Middleweight 10 9 0.7457933
 9: Khabib Nurmagomedov Lightweight 6 6 0.7444223
10: Royce Gracie Middleweight 13 11 0.7334771

This gave the best overall fighters' win percentages across the weight classes, but if we wanted a breakdown for each class the result would be as follows (a dplyr sketch of this per-class breakdown comes after the table):

Class Name Fights Wins Win_perc_adj
1 Light Heavyweight Jon Jones 16 15 0.81427
2 Light Heavyweight Daniel Cormier 7 6 0.68834
3 Light Heavyweight Rashad Evans 19 14 0.67805
4 Welterweight Georges St. Pierre 21 19 0.811654
5 Welterweight Stephen Thompson 8 7 0.710175
6 Welterweight Warlley Alves 4 4 0.694669
7 Featherweight Conor McGregor 7 7 0.763677
8 Featherweight Jose Aldo 8 7 0.710175
9 Featherweight Max Holloway 14 11 0.697299
10 Middleweight Yoel Romero 7 7 0.763677
11 Middleweight Anderson Silva 19 16 0.757183
12 Middleweight Chris Weidman 10 9 0.745793
13 Lightweight Tony Ferguson 11 10 0.76051
14 Lightweight Khabib Nurmagomedov 6 6 0.744422
15 Lightweight Donald Cerrone 20 16 0.728364
16 Heavyweight Don Frye 10 9 0.745793
17 Heavyweight Cain Velasquez 13 11 0.733477
18 Heavyweight Junior dos Santos 14 11 0.697299
19 Flyweight Joseph Benavidez 13 11 0.733477
20 Flyweight Demetrious Johnson 13 11 0.733477
21 Flyweight Henry Cejudo 4 4 0.694669
22 Strawweight Joanna Jedrzejczyk 5 5 0.721752
23 Strawweight Tecia Torres 3 3 0.661745
24 Strawweight Valerie Letourneau 4 3 0.597335
25 Bantamweight Raphael Assuncao 8 7 0.710175
26 Bantamweight Dominick Cruz 4 4 0.694669
27 Bantamweight Aljamain Sterling 4 4 0.694669
28 Super Heavyweight Jon Hess 1 1 0.56874
29 Super Heavyweight Andre Roberts 3 2 0.553915
30 Super Heavyweight Scott Ferrozzo 5 3 0.544351
31 Atomweight Michelle Waterson 1 1 0.56874
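Building on the win_table from the earlier sketch, a per-class top three could be pulled out along these lines (again without the win_perc_adj adjustment):

library(dplyr)

# win_table as computed in the earlier sketch: one row per fighter with
# name, class, fights, wins and a raw win percentage
top_by_class <- win_table %>%
  group_by(class) %>%
  slice_max(win_perc, n = 3, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(class, desc(win_perc))

top_by_class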

Further development of these datasets could provide both statistical and predictive information, such as:

  • Statistical
    • Number of events each year
    • Average fight times
  • Predictive
    • Likely results of fights, taking into consideration fighters' win/loss records
    • How a fighter might win, be it by decision, KO or submission
    • Success rate of submission moves in ending fights
    • Spotting whether there is any collusion between fighters and judges

***Interesting Fact***

Since the mid-1960s, firefighters have added a wetting agent to make water "wetter", resulting in less friction in the hose and causing the water to pass through it more quickly.

Introduction to R and Should Leo have survived the Titanic

When dealing with large amounts of data there are a number of languages that can be used to sort, analyse and graph that information. One of the most popular is R, a statistical programming language used throughout the world. It can be downloaded from http://www.r-project.org/

To begin using this language it can be useful to try the beginner courses available throughout the internet. One of the most popular sites is Code School, which has a useful course called 'Try R', a step-by-step guide to learning the basics of R covering areas such as:

  1. Using R
  2. Vectors
  3. Matrices
  4. Summary Statistics
  5. Factors
  6. Data Frames
  7. Real-World Data

When you complete each section you receive a badge, totalling 8 badges, with the 8th badge awarded for having completed the course.

[Image: completed Code School 'Try R' badges]

Having covered the basics of using R with data, I wanted to look at more uses of graphs in R. This was done using the ggplot2 library package and the Cookbook for R website http://www.cookbook-r.com/Graphs/

Using a sample dataset for a restaurant containing the time and total bill of different transactions, i.e.:

#> time total_bill
#> 1 Lunch 14.89
#> 2 Dinner 17.23

Using this data, different versions of bar charts were created displaying the totals using the ggplot tool set.

For a basic bar graph showing time broken down into Lunch and Dinner on the X axis and the total amount on the Y axis the code used was:

ggplot(data=dat, aes(x=time, y=total_bill)) +
geom_bar(stat="identity")

This resulted in the below graph:

[Figure: basic bar chart of total bill by time]

To add a bit more style to the chart, the time of day can be mapped to different fill colours by adding 'fill=time' to the code, which results in Lunch being filled with a pink colour and Dinner with a blue colour:

ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) +
geom_bar(stat="identity")

The graph produced makes the data stand out that much better and look more interesting.

[Figure: bar chart with fill colour mapped to time]

A black outline can be added around each of the time groups by adding 'colour="black"' to the code, making the edges more defined in the graph, as in the code example below:

ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) +
geom_bar(colour="black", stat="identity")

[Figure: bar chart with black outlines around the bars]

And finally, as there are only two values and they are already labelled on the X axis, the legend on the side of the chart is unnecessary. Removing it is easily done by adding 'guides(fill=FALSE)', meaning that the legend will not be displayed to the right of the chart.

ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) +
geom_bar(colour="black", stat="identity") +
guides(fill=FALSE)

[Figure: bar chart with the legend removed]


And now onto the Leo question

Having used these skills on example datasets, the idea was to next try to implement them on a larger scale with other data. While looking for information I came across a dataset on Kaggle.com (https://www.kaggle.com/c/titanic) containing information on Titanic passengers such as age, sex, departure location, ticket price and whether they survived or not. In theory, given certain information about a person, we could use machine learning on this data to predict whether they would have survived a trip on the Titanic. Referencing an article by Megan Risdal, the steps and results were as follows.

The datasets used were the train.csv and test.csv downloadable from the Get the Data section on Kaggle.

The datasets were made up of the following variables


Variable – Description
Survived – Survived (1) or died (0)
Pclass – Passenger's class
Name – Passenger's name
Sex – Passenger's sex
Age – Passenger's age
SibSp – Number of siblings/spouses aboard
Parch – Number of parents/children aboard
Ticket – Ticket number
Fare – Fare
Cabin – Cabin
Embarked – Port of embarkation

Using this information it was possible to create graphs illustrating the different correlations between passenger attributes, such as family size, gender and age, and whether the passengers survived or not.
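A minimal sketch of one of these graphs, survival by family size, built from the Kaggle train.csv columns listed above:

library(dplyr)
library(ggplot2)

train <- read.csv("train.csv", stringsAsFactors = FALSE)

# Family size = siblings/spouses + parents/children + the passenger themselves
train <- train %>% mutate(FamilySize = SibSp + Parch + 1)

# Survival counts by family size (0 = died, 1 = survived)
ggplot(train, aes(x = factor(FamilySize), fill = factor(Survived))) +
  geom_bar(position = "dodge") +
  labs(x = "Family size", y = "Number of passengers", fill = "Survived")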

[Figure: Titanic survival by family size, and by age and sex]

The first two graphs shown above illustrate the survival of passengers who travelled alone or in a group of more than one. What we can see is that people travelling alone were more likely to perish, as were people travelling in a family of more than four members, but being in a family group of two, three or four gave a better chance of survival.

The next graph breaks down the age and survival rate for both male and female passengers. What can be taken from this is that being a child under 18 years of age gave you a better chance of survival, but was not a guarantee, as even those who survived the sinking then faced cold and harsh conditions. The data also shows that you had a much better chance of surviving if you were female, regardless of age, than your male counterparts. And finally, if you were an 81-year-old male you were assured to survive, which is good to know.

Further analysis of the data could provide a graph to illustrate the importance of each of the supplied fields to the chance of surviving the Titanic sinking.

So, in conclusion, to answer the original question of whether Leo should have survived: a quick trip to the Wikipedia page of the 1997 film https://en.wikipedia.org/wiki/Titanic_(1997_film) tells us that Leonardo DiCaprio's character, Jack Dawson, was a 20-year-old male (obviously) who was travelling 3rd class and departed alone. Taking this information and feeding the relevant values into the model built from the data we already have, it is likely that Leo's character would have perished aboard the sinking ship, regardless of being the star of the film and dying for the plot and sad ending... even though he could have fit on the wreckage as shown below.
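As a rough sanity check of that conclusion, a simple logistic regression on a few of the fields can stand in for the random forest used in the referenced article; the values chosen for Jack are the ones gleaned above:

# A rough stand-in for the model in the referenced article (which uses a
# random forest): a simple logistic regression on a few of the fields
train <- read.csv("train.csv", stringsAsFactors = FALSE)

model <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch,
             data = train, family = binomial)

# A hypothetical Jack Dawson: 20-year-old male, 3rd class, travelling alone
jack <- data.frame(Pclass = 3, Sex = "male", Age = 20, SibSp = 0, Parch = 0)

# Predicted probability of survival; for a passenger like this it comes out low
predict(model, newdata = jack, type = "response")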

***Bonus random Titanic Fact***

There were 7,000 heads of lettuce aboard the Titanic.

Fusion Tables

[Map: Fusion Table heat map of the Irish population by county]

Above is a heat map made from a fusion table of the Irish population by county, showing how each county's population is made up of males and females according to the 2011 Irish census from the Central Statistics Office, displayed using Google Maps and its geocoding to give a visualisation of the data.

What we can observe from the data is that the larger population centres are based around the coastal areas, and that the largest populations are centred in counties with major ports/airports, which traditionally coincide with the Irish cities (Dublin, Galway and Cork), allowing for trade and commerce throughout the ages. The development and availability of public transport, the construction and upgrading of roads, and technological development have had knock-on effects from these counties, which can be seen in the increasing population of the surrounding areas.

This information can be used for planning roads and transport routes between larger population centres, which might encourage growth in counties with lower figures. It can also be used to look at house and rental prices in relation to population.

Creating the Fusion Table

The first step in creating the above heat map was to install Google Fusion Tables on Chrome and import the files that were to be merged into Google Fusion Tables with a geographical map.

To begin, data on the population of Ireland extracted from the 2011 census was downloaded from the Central Statistics Office website http://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2011/

The next table uploaded contained the geometry information for Ireland and the names of the counties. This information was taken from the Irish Independent website:

http://www.independent.ie/editorial/test/map_lead.kml

The information then had to be cleaned, taking out errors and correcting locations so that it would match the statistics table. The main issues to be corrected were the county names and the breakdown of some counties into north and south areas, which had to be combined into one (a rough idea of this kind of fix is sketched below).
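The cleanup itself was done in the tables, but the kind of fix involved looks roughly like this in R; the file, column and county names here are only illustrative:

library(dplyr)

# Illustrative only: the actual cleanup was done in the Fusion Tables
population <- read.csv("census_2011_population.csv", stringsAsFactors = FALSE)

population <- population %>%
  # Standardise spellings so the county names match the map table
  mutate(county = recode(county,
                         "North Tipperary" = "Tipperary",
                         "South Tipperary" = "Tipperary")) %>%
  # Combine counties that the source splits into separate areas
  group_by(county) %>%
  summarise(male = sum(male), female = sum(female), total = sum(total))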

The two files were then merged together creating a new document named ‘Irish Population by County’. When the table was merged, the map was then styled and edited.

To create a colour-coded distribution of counties based on population, from the map view go to the 'Configure map' menu and click the 'Change feature styles' button, then select the 'Fill color' tab, choose the 'Buckets' option and set 8 buckets. For this map, each bucket was allocated a different colour for a population range. The colours and ranges assigned were as follows: grey for counties with a population between 31,798 and 75,000; light purple for 75,000 to 100,000; light blue for 100,000 to 125,000; dark blue for 125,000 to 150,000; green for 150,000 to 200,000; yellow for 200,000 to 400,000; orange for 400,000 to 1,000,000; and finally red for 1,000,000 to 1,274,000. This distribution was chosen to clearly highlight the highly populated counties against the sparsely populated ones, using contrasting colours to mark the different ranges.

Population Gender Breakdown

Below are charts showing the population of the whole of Ireland from 1911 to 2011.

This graph shows the total population through the years, and then the same figure broken down by male and female.

This graph shows the male and female population of the Republic of Ireland from 1911 to 2011. It shows that there was a large drop in population until 1961, at which point the population grew until 1986, where it held steady for the female population and dropped for males until 1991, after which we see an increase, with the female population growing more than the male.

This graph shows the breakdown of Northern Ireland's population by gender from 1911 to 2011. It shows that there has always been a larger number of females than males, but that birth rates have increased year on year since 1936 and this gender balance has continued. Possible reasons for this could be geographical, environmental or economic. With Northern Ireland being so small, the opportunities available to its population in terms of employment and livelihood were limited, and historically younger men would travel to areas where employment was more likely, be that England, Scotland, Wales, the Republic of Ireland or further afield. With this in mind, it is understandable that the ratio of males to females would differ.

Using this information, it was a matter of looking at why the population was so low in the 1960s and what happened at that point to allow such an increase in population. One possible reason is that the world in general was going through a population increase after World War 2, but where this had been happening in the rest of the world and in Europe for a number of years, the increase in Ireland took several years longer to take effect. This, coupled with the end of a long period of economic isolation and attempted self-reliance and the introduction of a more open market, resulted in better employment opportunities, reducing the number of people emigrating to other countries, with younger people staying in the country and starting families at home, which was also helped by the expansion of the welfare system.