2016年6月22日星期三

World University Rankings Dataset Exploration and Visualization


I downloaded this dataset from Kaggle. There are 6 csv files: timesData.csv, school_and_country_table.csv, cwurData.csv, educational_attainment_supplementary_data.csv, education_expenditure_supplementary_data.csv, shanghaiData.csv. What I will focus on is the timesData.
For more information about the dataset, please check: https://www.kaggle.com/mylesoneill/world-university-rankings
I will use R to do some exploration and use Tableau to make some visualizations.

1. Load the data

timesData <- read.csv("timesData.csv", header = TRUE)

2. View the structure of data


> dim(timesData)
[1] 2603   14
> timesData[1:3,]
  world_rank                       university_name                  country teaching international research citations income total_score num_students
1          1                    Harvard University United States of America     99.7          72.4     98.7      98.8   34.5        96.1       20,152
2          2    California Institute of Technology United States of America     97.7          54.6     98.0      99.9   83.7        96.0        2,243
3          3 Massachusetts Institute of Technology United States of America     97.8          82.3     91.4      99.9   87.5        95.6       11,074
  student_staff_ratio international_students female_male_ratio year
1                 8.9                    25%                   2011
2                 6.9                    27%           33 : 67 2011
3                 9.0                    33%           37 : 63 2011

In timesData.csv, there are 2,603 observations and 14 variables:
world _rank: world rank for the university. Contains rank ranges and equal ranks (eg. =94 and 201-250)
university_name:name of university
country:country of each university.
teaching:university score for teaching (the learning environment).
international:university score international outlook (staff, students, research).
research:university score for research (volume, income and reputation).
citations:university score for citations (research influence).
income:university score for industry income (knowledge transfer).
total_score:total score for university, used to determine rank.
num_students:number of students at the university.
student_staff ratio:Number of students divided by number of staff.
international_students:Percentage of students who are international.
female_male_ratio:Female student to Male student ratio.
year: year of the ranking (2011 to 2016 included).

3. Ask Questions and Visuliaze

Now, it's time to think about what I want to get from this data.

Q1: Where are the first 100 rank university come from?
For this question, we can use a map to show.
Use country as a geo-information to generate a map.
Use world_rank as a filter to extract the university in the first 100.
Use the number of records as color.
Use year as a filter.

From this map, we can see that most top 100 universities come from America.

Q2: The total score determines the rank, but how it comes? What score matters most? Teaching? International? Research? Citation? Or Income?
For this question, we need to do regression in R first.

> dataset <- cbind.data.frame(timesData$teaching, timesData$international, timesData$research,
+              timesData$citations, timesData$income, timesData$total_score)
> colnames(dataset) <- c("teaching", "international", "research", "citation", "income", "total")
> summary(dataset)
    teaching    international     research        citation          income         total     
 Min.   : 9.9   20.7   :  10   Min.   : 2.90   Min.   :  1.20   -      : 218   -      :1402  
 1st Qu.:24.7   29.6   :  10   1st Qu.:19.60   1st Qu.: 45.50   100.0  :  68   49.0   :  13  
 Median :33.9   -      :   9   Median :30.50   Median : 62.50   28.0   :  26   51.1   :  12  
 Mean   :37.8   34.3   :   9   Mean   :35.91   Mean   : 60.92   31.1   :  20   46.6   :  11  
 3rd Qu.:46.4   46.8   :   9   3rd Qu.:47.25   3rd Qu.: 79.05   28.8   :  19   46.9   :  10  
 Max.   :99.7   48.4   :   9   Max.   :99.40   Max.   :100.00   28.5   :  18   50.1   :  10  
                (Other):2547                                    (Other):2234   (Other):1145
As we can see from the summary, there are 9 empty value in international score, 218 empty value in income score and 1402 empty value in total score. It is not meaningful to give a value for those empty. The easiest way is to delete all the records with those value.
> library(sqldf)
> dataset <- sqldf("SELECT * 
+                   FROM dataset
+                   WHERE total != '-'
+                   AND international != '-'
+                   AND income != '-'")
> dataset$total <- as.numeric(as.character(dataset$total))
> dataset$income <- as.numeric(as.character(dataset$income))
> dataset$international <- as.numeric(as.character(dataset$international))
> summary(dataset)
    teaching     international      research        citation          income           total      
 Min.   :15.90   Min.   : 14.8   Min.   :13.10   Min.   :  8.60   Min.   : 24.20   Min.   :41.40  
 1st Qu.:37.70   1st Qu.: 45.5   1st Qu.:36.90   1st Qu.: 66.50   1st Qu.: 36.40   1st Qu.:50.20  
 Median :46.45   Median : 61.6   Median :48.10   Median : 78.60   Median : 45.80   Median :55.80  
 Mean   :50.04   Mean   : 61.1   Mean   :51.97   Mean   : 76.81   Mean   : 54.24   Mean   :59.63  
 3rd Qu.:59.55   3rd Qu.: 79.1   3rd Qu.:64.70   3rd Qu.: 89.10   3rd Qu.: 68.12   3rd Qu.:65.90  
 Max.   :99.70   Max.   :100.0   Max.   :99.40   Max.   :100.00   Max.   :100.00   Max.   :96.10  
> dim(dataset)
[1] 1056    6
What I will use for regression analysis is the dataset with 1056 rows and 6 columns.
The most most basic regression is linear regression. We can get some information from scatterplot:
library(ggplot2)
ggplot(dataset, aes(x=teaching, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE)
ggplot(dataset, aes(x=international, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE)
ggplot(dataset, aes(x=research, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE)
ggplot(dataset, aes(x=citation, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE)
ggplot(dataset, aes(x=income, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE)





From the scatterplots, we can see only teaching and research score looks like linear regression with the total score. 
Standard Linear Regression: 
> lm_total = lm(total ~ ., data = dataset)
> summary(lm_total)
 
Call:
lm(formula = total ~ ., data = dataset)
 
Residuals:
     Min       1Q   Median       3Q      Max 
-1.12476 -0.12664 -0.02884  0.05652  1.78528 
 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.1176804  0.0539644   2.181   0.0294 *  
teaching      0.2997580  0.0011151 268.813   <2e-16 ***
international 0.0703900  0.0003961 177.728   <2e-16 ***
research      0.3009812  0.0009668 311.322   <2e-16 ***
citation      0.3023552  0.0005430 556.857   <2e-16 ***
income        0.0248092  0.0003820  64.954   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.2619 on 1050 degrees of freedom
Multiple R-squared:  0.9996, Adjusted R-squared:  0.9996 
F-statistic: 4.972e+05 on 5 and 1050 DF,  p-value: < 2.2e-16
 
> rmse = sqrt(mean(residuals(lm_total)^2))
> rmse
[1] 0.2611183
> n = length(dataset$total)
> error = dim(n)
> for (k in 1:n) {
+   train1 = c(1:n)
+   train = train1[train1 != k]
+   m1 = lm(total ~ ., data = dataset[train, ])
+   pred = predict(m1, newdata = dataset[-train, ])
+   obs = dataset$total[-train]
+   error[k] = obs - pred
+ }
> lm1_me = mean(error)
> lm1_rmse = sqrt(mean(error^2))
> lm1_me
[1] 2.806287e-05
> lm1_rmse
[1] 0.2629693
From the coefficients table, citation, research and teaching scores are much more important than other two.  And the RMSE for this model is 0.2629693
Non Linear Regression Model:

> poly1 = lm(total ~ teaching + research + poly(citation, degree = 2) + poly(international, degree = 2) + poly(income, degree = 2) , data = dataset)
> n = length(dataset$total)
> error = dim(n)
> for (k in 1:n) {
+   train1 = c(1:n)
+   train = train1[train1 != k]
+   m3 = lm(total ~ teaching + research + poly(citation, degree = 2) + poly(international, degree = 2) + poly(income, degree = 2) , data = dataset[train,])
+   pred = predict(m3, newdat = dataset[-train, ])
+   obs = dataset$total[-train]
+   error[k] = obs - pred
+ }
> poly1_me = mean(error)
> poly1_rmse = sqrt(mean(error^2))
> poly1_rmse
[1] 0.2590252
With poly degree 2 on citation, international and income hit a better RMSE = 0.2590252. Continue trying:

> poly2 = lm(total ~ teaching + research + poly(citation, degree = 3) + poly(international, degree = 3) + poly(income, degree = 3) , data = dataset)
> n = length(dataset$total)
> error = dim(n)
> poly2 = lm(total ~ teaching + research + poly(citation, degree = 3) + poly(international, degree = 3) + poly(income, degree = 3) , data = dataset) 
> n = length(dataset$total)
> error = dim(n)
> for (k in 1:n) {
+   train1 = c(1:n)
+   train = train1[train1 != k]
+   m4 = lm(total ~ teaching + research + poly(citation, degree = 3) + poly(international, degree = 3) + poly(income, degree = 3) , data = dataset[train,])
+   pred = predict(m4, newdat = dataset[-train, ])
+   obs = dataset$total[-train]
+   error[k] = obs - pred
+ }
> poly2_me = mean(error)
> poly2_rmse = sqrt(mean(error^2))
> poly2_rmse
[1] 0.2544657
With poly degree 3 on citation, international and income hit a better RMSE = 0.2544657. Stop trying, because it may comes overfitting somewhere.
Then, try to visualize in Tableau:

Q3: How to visualize the number of students and the ratio of student:staff?


The darker the color is, the more international students there are. The bigger the square is, the higher the ratio of student to staff is.

Here is the dashboard:



Also , you can check this link for a better interactive experience:
https://public.tableau.com/views/UniversitiesWorldRank/1?:embed=y&:display_count=yes&:showTabs=y

If you have any good idea, please comment below.