2016年7月17日星期日

Kobe Bryant Shot Selection: Data Exploration & Visulization

The dataset, Kobe Bryant Shot Selection is from Kaggle website. You can find more information and details about this dataset here: https://www.kaggle.com/c/kobe-bryant-shot-selection/data

I was inspired by Alexandru Papiu's report: https://www.kaggle.com/apapiu/kobe-bryant-shot-selection/exploring-kobe-s-shots/notebook He gave very detailed description of his thoughts and many fantastic visualization using R programming. What I am thinking is, can I use tableau to do the same thing? Then, let's do it.

1. Loading in the data:

The data I downloaded directly from the Kaggle website is data.csv, a kind of text file. So, when connecting the data, choose Text File.
You can keep this connection live or extract. Here, I won't change the original data, so I will choose Extract. 
The data are shown here:

2. Shot_made_flag

According to the dataset and Alexandru Papiu's report, if shot_made_flag = 1, it means Kobe made this shot, 0 means missed, observations with null are what we should predict. 
Therefore, we should move the shot_made_flag from Measures to Dimensions for future use. 
Using tables in tableau:
In this dataset, Kobe made 11,465 and missed 14,232. 5,000 nulls are what we will predict. 

3. Plot

In Alexandru Papiu's report, he created two plot functions:
* pplot() is a function to plot to see accuracy by feature. (bar chart, x=feature, show the counts)
* courtplot() is a function to plot to see position by feature (point graph, x=lon,y=lat)
And we can do similar things in Tableau
1). The locations for the various shot_types.
Move loc_x and loc_y to Dimension. Scatter Plot, row = loc_x, column = loc_y, use filter shot_made_flage = 1 or 0, use combined_shot_type for color.

2) As we can see from this scatter plot, there are many red points (Jump Shot). Then, highlight other shot type.

3) Shot from up close
Move shot_distance to Dimension. Two filters: shot_made_flag = 0 or 1, shot_distance < 5. Color: 0-made-green, 1-missed-orange.
4) Shot Made vs. Shot Missed

Distance really matters when shot.
5) Histogram of Shot & Accuracy by x_bins
Create a bin using suggest bin size.


Column is x_bins, row is count of Number of Records, and edit table calculation. Filter is shot_made flag = 0 and 1, also the color.

6) Histogram & Accuracy by action_type

Find action_type that count < 20

Then, create a set that excludes these action types. Column is count of Number of Records, and edit table calculation. Row is set of action type > 20.
Kobe likes Jump Shot. However, he was good at Dunk Shot.
7) Shot zone area
Column: loc_x. Row: loc_y. Scatter Plot. Shot_Zone_Area as color.
8) Shot Zone Basic
Column: loc_x. Row: loc_y. Scatter Plot. Shot_Zone_Basic as color.
9) Shot Zone Range
Column: loc_x. Row: loc_y. Scatter Plot. Shot_Zone_Range as color.
10) Histogram & Accuracy by minutes_remaining

Move minutes_remaining from Measure to Dimension. Column: Minutes remaining. Row: Count of Number of records with calculation (Percent of total, table down)
11) Histogram & Accuracy by Period

Move period from Measure to Dimension. Column: period. Row: Count of Number of records with calculation (Percent of total, table down).
The team, which Kobe was in, was pretty powerful, they needn't to get to the 7th periods. And, the accuracy of Kobe shot is stable enough. 
12) Histogram & Accuracy by Seconds Remaining
Move seconds_remaining from Measure to Dimension. Column: seconds_remaining. Row: Count of Number of records with calculation (Percent of total, table down).
From this histogram, we can see that Kobe really tried himself best at the end of time. But the accuracy is only around 30%, the lowest.
13) Histogram & Accuracy by Season

Kobe shot most in 2005~2006, and his accuracy of shot is highest in 2007~2008.
14) Histogram & Accuracy by shot distance
Everyone likes shot near the basket. When far away, he did really a good job from around 30 to 45. 
15) Histogram & Accuracy by Combined shot type

4. Tableau Dashboards

I decide to make dashboards to clarify "Where?" "When?"
https://public.tableau.com/views/KobeBryantShots/KobeBryantShots?:embed=y&:display_count=yes&:showTabs=y




2016年6月22日星期三

World University Rankings Dataset Exploration and Visualization


I downloaded this dataset from Kaggle. There are 6 csv files: timesData.csv, school_and_country_table.csv, cwurData.csv, educational_attainment_supplementary_data.csv, education_expenditure_supplementary_data.csv, shanghaiData.csv. What I will focus on is the timesData.
For more information about the dataset, please check: https://www.kaggle.com/mylesoneill/world-university-rankings
I will use R to do some exploration and use Tableau to make some visualizations.

1. Load the data

timesData <- read.csv("timesData.csv", header = TRUE)

2. View the structure of data


> dim(timesData)
[1] 2603   14
> timesData[1:3,]
  world_rank                       university_name                  country teaching international research citations income total_score num_students
1          1                    Harvard University United States of America     99.7          72.4     98.7      98.8   34.5        96.1       20,152
2          2    California Institute of Technology United States of America     97.7          54.6     98.0      99.9   83.7        96.0        2,243
3          3 Massachusetts Institute of Technology United States of America     97.8          82.3     91.4      99.9   87.5        95.6       11,074
  student_staff_ratio international_students female_male_ratio year
1                 8.9                    25%                   2011
2                 6.9                    27%           33 : 67 2011
3                 9.0                    33%           37 : 63 2011

In timesData.csv, there are 2,603 observations and 14 variables:
world _rank: world rank for the university. Contains rank ranges and equal ranks (eg. =94 and 201-250)
university_name:name of university
country:country of each university.
teaching:university score for teaching (the learning environment).
international:university score international outlook (staff, students, research).
research:university score for research (volume, income and reputation).
citations:university score for citations (research influence).
income:university score for industry income (knowledge transfer).
total_score:total score for university, used to determine rank.
num_students:number of students at the university.
student_staff ratio:Number of students divided by number of staff.
international_students:Percentage of students who are international.
female_male_ratio:Female student to Male student ratio.
year: year of the ranking (2011 to 2016 included).

3. Ask Questions and Visuliaze

Now, it's time to think about what I want to get from this data.

Q1: Where are the first 100 rank university come from?
For this question, we can use a map to show.
Use country as a geo-information to generate a map.
Use world_rank as a filter to extract the university in the first 100.
Use the number of records as color.
Use year as a filter.

From this map, we can see that most top 100 universities come from America.

Q2: The total score determines the rank, but how it comes? What score matters most? Teaching? International? Research? Citation? Or Income?
For this question, we need to do regression in R first.

> dataset <- cbind.data.frame(timesData$teaching, timesData$international, timesData$research,
+              timesData$citations, timesData$income, timesData$total_score)
> colnames(dataset) <- c("teaching", "international", "research", "citation", "income", "total")
> summary(dataset)
    teaching    international     research        citation          income         total     
 Min.   : 9.9   20.7   :  10   Min.   : 2.90   Min.   :  1.20   -      : 218   -      :1402  
 1st Qu.:24.7   29.6   :  10   1st Qu.:19.60   1st Qu.: 45.50   100.0  :  68   49.0   :  13  
 Median :33.9   -      :   9   Median :30.50   Median : 62.50   28.0   :  26   51.1   :  12  
 Mean   :37.8   34.3   :   9   Mean   :35.91   Mean   : 60.92   31.1   :  20   46.6   :  11  
 3rd Qu.:46.4   46.8   :   9   3rd Qu.:47.25   3rd Qu.: 79.05   28.8   :  19   46.9   :  10  
 Max.   :99.7   48.4   :   9   Max.   :99.40   Max.   :100.00   28.5   :  18   50.1   :  10  
                (Other):2547                                    (Other):2234   (Other):1145
As we can see from the summary, there are 9 empty value in international score, 218 empty value in income score and 1402 empty value in total score. It is not meaningful to give a value for those empty. The easiest way is to delete all the records with those value.
> library(sqldf)
> dataset <- sqldf("SELECT * 
+                   FROM dataset
+                   WHERE total != '-'
+                   AND international != '-'
+                   AND income != '-'")
> dataset$total <- as.numeric(as.character(dataset$total))
> dataset$income <- as.numeric(as.character(dataset$income))
> dataset$international <- as.numeric(as.character(dataset$international))
> summary(dataset)
    teaching     international      research        citation          income           total      
 Min.   :15.90   Min.   : 14.8   Min.   :13.10   Min.   :  8.60   Min.   : 24.20   Min.   :41.40  
 1st Qu.:37.70   1st Qu.: 45.5   1st Qu.:36.90   1st Qu.: 66.50   1st Qu.: 36.40   1st Qu.:50.20  
 Median :46.45   Median : 61.6   Median :48.10   Median : 78.60   Median : 45.80   Median :55.80  
 Mean   :50.04   Mean   : 61.1   Mean   :51.97   Mean   : 76.81   Mean   : 54.24   Mean   :59.63  
 3rd Qu.:59.55   3rd Qu.: 79.1   3rd Qu.:64.70   3rd Qu.: 89.10   3rd Qu.: 68.12   3rd Qu.:65.90  
 Max.   :99.70   Max.   :100.0   Max.   :99.40   Max.   :100.00   Max.   :100.00   Max.   :96.10  
> dim(dataset)
[1] 1056    6
What I will use for regression analysis is the dataset with 1056 rows and 6 columns.
The most most basic regression is linear regression. We can get some information from scatterplot:
library(ggplot2)
ggplot(dataset, aes(x=teaching, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE)
ggplot(dataset, aes(x=international, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE)
ggplot(dataset, aes(x=research, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE)
ggplot(dataset, aes(x=citation, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE)
ggplot(dataset, aes(x=income, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE)





From the scatterplots, we can see only teaching and research score looks like linear regression with the total score. 
Standard Linear Regression: 
> lm_total = lm(total ~ ., data = dataset)
> summary(lm_total)
 
Call:
lm(formula = total ~ ., data = dataset)
 
Residuals:
     Min       1Q   Median       3Q      Max 
-1.12476 -0.12664 -0.02884  0.05652  1.78528 
 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.1176804  0.0539644   2.181   0.0294 *  
teaching      0.2997580  0.0011151 268.813   <2e-16 ***
international 0.0703900  0.0003961 177.728   <2e-16 ***
research      0.3009812  0.0009668 311.322   <2e-16 ***
citation      0.3023552  0.0005430 556.857   <2e-16 ***
income        0.0248092  0.0003820  64.954   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.2619 on 1050 degrees of freedom
Multiple R-squared:  0.9996, Adjusted R-squared:  0.9996 
F-statistic: 4.972e+05 on 5 and 1050 DF,  p-value: < 2.2e-16
 
> rmse = sqrt(mean(residuals(lm_total)^2))
> rmse
[1] 0.2611183
> n = length(dataset$total)
> error = dim(n)
> for (k in 1:n) {
+   train1 = c(1:n)
+   train = train1[train1 != k]
+   m1 = lm(total ~ ., data = dataset[train, ])
+   pred = predict(m1, newdata = dataset[-train, ])
+   obs = dataset$total[-train]
+   error[k] = obs - pred
+ }
> lm1_me = mean(error)
> lm1_rmse = sqrt(mean(error^2))
> lm1_me
[1] 2.806287e-05
> lm1_rmse
[1] 0.2629693
From the coefficients table, citation, research and teaching scores are much more important than other two.  And the RMSE for this model is 0.2629693
Non Linear Regression Model:

> poly1 = lm(total ~ teaching + research + poly(citation, degree = 2) + poly(international, degree = 2) + poly(income, degree = 2) , data = dataset)
> n = length(dataset$total)
> error = dim(n)
> for (k in 1:n) {
+   train1 = c(1:n)
+   train = train1[train1 != k]
+   m3 = lm(total ~ teaching + research + poly(citation, degree = 2) + poly(international, degree = 2) + poly(income, degree = 2) , data = dataset[train,])
+   pred = predict(m3, newdat = dataset[-train, ])
+   obs = dataset$total[-train]
+   error[k] = obs - pred
+ }
> poly1_me = mean(error)
> poly1_rmse = sqrt(mean(error^2))
> poly1_rmse
[1] 0.2590252
With poly degree 2 on citation, international and income hit a better RMSE = 0.2590252. Continue trying:

> poly2 = lm(total ~ teaching + research + poly(citation, degree = 3) + poly(international, degree = 3) + poly(income, degree = 3) , data = dataset)
> n = length(dataset$total)
> error = dim(n)
> poly2 = lm(total ~ teaching + research + poly(citation, degree = 3) + poly(international, degree = 3) + poly(income, degree = 3) , data = dataset) 
> n = length(dataset$total)
> error = dim(n)
> for (k in 1:n) {
+   train1 = c(1:n)
+   train = train1[train1 != k]
+   m4 = lm(total ~ teaching + research + poly(citation, degree = 3) + poly(international, degree = 3) + poly(income, degree = 3) , data = dataset[train,])
+   pred = predict(m4, newdat = dataset[-train, ])
+   obs = dataset$total[-train]
+   error[k] = obs - pred
+ }
> poly2_me = mean(error)
> poly2_rmse = sqrt(mean(error^2))
> poly2_rmse
[1] 0.2544657
With poly degree 3 on citation, international and income hit a better RMSE = 0.2544657. Stop trying, because it may comes overfitting somewhere.
Then, try to visualize in Tableau:

Q3: How to visualize the number of students and the ratio of student:staff?


The darker the color is, the more international students there are. The bigger the square is, the higher the ratio of student to staff is.

Here is the dashboard:



Also , you can check this link for a better interactive experience:
https://public.tableau.com/views/UniversitiesWorldRank/1?:embed=y&:display_count=yes&:showTabs=y

If you have any good idea, please comment below.