I downloaded this dataset from
Kaggle. There are 6 csv files: timesData.csv, school_and_country_table.csv, cwurData.csv, educational_attainment_supplementary_data.csv, education_expenditure_supplementary_data.csv, shanghaiData.csv. What I will focus on is the timesData.
For more information about the dataset, please check:
https://www.kaggle.com/mylesoneill/world-university-rankings
I will use R to do some exploration and use Tableau to make some visualizations.
1. Load the data
timesData <- read.csv("timesData.csv", header = TRUE)
2. View the structure of data
> dim(timesData)
[1] 2603 14
> timesData[1:3,]
world_rank university_name country teaching international research citations income total_score num_students
1 1 Harvard University United States of America 99.7 72.4 98.7 98.8 34.5 96.1 20,152
2 2 California Institute of Technology United States of America 97.7 54.6 98.0 99.9 83.7 96.0 2,243
3 3 Massachusetts Institute of Technology United States of America 97.8 82.3 91.4 99.9 87.5 95.6 11,074
student_staff_ratio international_students female_male_ratio year
1 8.9 25% 2011
2 6.9 27% 33 : 67 2011
3 9.0 33% 37 : 63 2011
In timesData.csv, there are 2,603 observations and 14 variables:
world _rank: world rank for the university. Contains rank ranges and equal ranks (eg. =94 and 201-250)
university_name:name of university
country:country of each university.
teaching:university score for teaching (the learning environment).
international:university score international outlook (staff, students, research).
research:university score for research (volume, income and reputation).
citations:university score for citations (research influence).
income:university score for industry income (knowledge transfer).
total_score:total score for university, used to determine rank.
num_students:number of students at the university.
student_staff ratio:Number of students divided by number of staff.
international_students:Percentage of students who are international.
female_male_ratio:Female student to Male student ratio.
year: year of the ranking (2011 to 2016 included).
3. Ask Questions and Visuliaze
Now, it's time to think about what I want to get from this data.
Q1: Where are the first 100 rank university come from?
For this question, we can use a map to show.
Use country as a geo-information to generate a map.
Use world_rank as a filter to extract the university in the first 100.
Use the number of records as color.
Use year as a filter.
From this map, we can see that most top 100 universities come from America.
Q2: The total score determines the rank, but how it comes? What score matters most? Teaching? International? Research? Citation? Or Income?
For this question, we need to do regression in R first.
> dataset <- cbind.data.frame(timesData$teaching, timesData$international, timesData$research,
+ timesData$citations, timesData$income, timesData$total_score)
> colnames(dataset) <- c("teaching", "international", "research", "citation", "income", "total")
> summary(dataset)
teaching international research citation income total
Min. : 9.9 20.7 : 10 Min. : 2.90 Min. : 1.20 - : 218 - :1402
1st Qu.:24.7 29.6 : 10 1st Qu.:19.60 1st Qu.: 45.50 100.0 : 68 49.0 : 13
Median :33.9 - : 9 Median :30.50 Median : 62.50 28.0 : 26 51.1 : 12
Mean :37.8 34.3 : 9 Mean :35.91 Mean : 60.92 31.1 : 20 46.6 : 11
3rd Qu.:46.4 46.8 : 9 3rd Qu.:47.25 3rd Qu.: 79.05 28.8 : 19 46.9 : 10
Max. :99.7 48.4 : 9 Max. :99.40 Max. :100.00 28.5 : 18 50.1 : 10
(Other):2547 (Other):2234 (Other):1145
As we can see from the summary, there are 9 empty value in international score, 218 empty value in income score and 1402 empty value in total score. It is not meaningful to give a value for those empty. The easiest way is to delete all the records with those value.
> library(sqldf)
> dataset <- sqldf("SELECT *
+ FROM dataset
+ WHERE total != '-'
+ AND international != '-'
+ AND income != '-'")
> dataset$total <- as.numeric(as.character(dataset$total))
> dataset$income <- as.numeric(as.character(dataset$income))
> dataset$international <- as.numeric(as.character(dataset$international))
> summary(dataset)
teaching international research citation income total
Min. :15.90 Min. : 14.8 Min. :13.10 Min. : 8.60 Min. : 24.20 Min. :41.40
1st Qu.:37.70 1st Qu.: 45.5 1st Qu.:36.90 1st Qu.: 66.50 1st Qu.: 36.40 1st Qu.:50.20
Median :46.45 Median : 61.6 Median :48.10 Median : 78.60 Median : 45.80 Median :55.80
Mean :50.04 Mean : 61.1 Mean :51.97 Mean : 76.81 Mean : 54.24 Mean :59.63
3rd Qu.:59.55 3rd Qu.: 79.1 3rd Qu.:64.70 3rd Qu.: 89.10 3rd Qu.: 68.12 3rd Qu.:65.90
Max. :99.70 Max. :100.0 Max. :99.40 Max. :100.00 Max. :100.00 Max. :96.10
> dim(dataset)
[1] 1056 6
What I will use for regression analysis is the dataset with 1056 rows and 6 columns.
The most most basic regression is linear regression. We can get some information from scatterplot:
library(ggplot2)
ggplot(dataset, aes(x=teaching, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE)
ggplot(dataset, aes(x=international, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE)
ggplot(dataset, aes(x=research, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE)
ggplot(dataset, aes(x=citation, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE)
ggplot(dataset, aes(x=income, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE)
From the scatterplots, we can see only teaching and research score looks like linear regression with the total score.
Standard Linear Regression:
> lm_total = lm(total ~ ., data = dataset)
> summary(lm_total)
Call:
lm(formula = total ~ ., data = dataset)
Residuals:
Min 1Q Median 3Q Max
-1.12476 -0.12664 -0.02884 0.05652 1.78528
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1176804 0.0539644 2.181 0.0294 *
teaching 0.2997580 0.0011151 268.813 <2e-16 ***
international 0.0703900 0.0003961 177.728 <2e-16 ***
research 0.3009812 0.0009668 311.322 <2e-16 ***
citation 0.3023552 0.0005430 556.857 <2e-16 ***
income 0.0248092 0.0003820 64.954 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2619 on 1050 degrees of freedom
Multiple R-squared: 0.9996, Adjusted R-squared: 0.9996
F-statistic: 4.972e+05 on 5 and 1050 DF, p-value: < 2.2e-16
> rmse = sqrt(mean(residuals(lm_total)^2))
> rmse
[1] 0.2611183
> n = length(dataset$total)
> error = dim(n)
> for (k in 1:n) {
+ train1 = c(1:n)
+ train = train1[train1 != k]
+ m1 = lm(total ~ ., data = dataset[train, ])
+ pred = predict(m1, newdata = dataset[-train, ])
+ obs = dataset$total[-train]
+ error[k] = obs - pred
+ }
> lm1_me = mean(error)
> lm1_rmse = sqrt(mean(error^2))
> lm1_me
[1] 2.806287e-05
> lm1_rmse
[1] 0.2629693
From the coefficients table, citation, research and teaching scores are much more important than other two. And the RMSE for this model is 0.2629693
Non Linear Regression Model:
> poly1 = lm(total ~ teaching + research + poly(citation, degree = 2) + poly(international, degree = 2) + poly(income, degree = 2) , data = dataset)
> n = length(dataset$total)
> error = dim(n)
> for (k in 1:n) {
+ train1 = c(1:n)
+ train = train1[train1 != k]
+ m3 = lm(total ~ teaching + research + poly(citation, degree = 2) + poly(international, degree = 2) + poly(income, degree = 2) , data = dataset[train,])
+ pred = predict(m3, newdat = dataset[-train, ])
+ obs = dataset$total[-train]
+ error[k] = obs - pred
+ }
> poly1_me = mean(error)
> poly1_rmse = sqrt(mean(error^2))
> poly1_rmse
[1] 0.2590252
With poly degree 2 on citation, international and income hit a better RMSE = 0.2590252. Continue trying:
> poly2 = lm(total ~ teaching + research + poly(citation, degree = 3) + poly(international, degree = 3) + poly(income, degree = 3) , data = dataset)
> n = length(dataset$total)
> error = dim(n)
> poly2 = lm(total ~ teaching + research + poly(citation, degree = 3) + poly(international, degree = 3) + poly(income, degree = 3) , data = dataset)
> n = length(dataset$total)
> error = dim(n)
> for (k in 1:n) {
+ train1 = c(1:n)
+ train = train1[train1 != k]
+ m4 = lm(total ~ teaching + research + poly(citation, degree = 3) + poly(international, degree = 3) + poly(income, degree = 3) , data = dataset[train,])
+ pred = predict(m4, newdat = dataset[-train, ])
+ obs = dataset$total[-train]
+ error[k] = obs - pred
+ }
> poly2_me = mean(error)
> poly2_rmse = sqrt(mean(error^2))
> poly2_rmse
[1] 0.2544657
With poly degree 3 on citation, international and income hit a better RMSE = 0.2544657. Stop trying, because it may comes overfitting somewhere.
Then, try to visualize in Tableau:
Q3: How to visualize the number of students and the ratio of student:staff?
The darker the color is, the more international students there are. The bigger the square is, the higher the ratio of student to staff is.
Here is the dashboard:
Also , you can check this link for a better interactive experience:
https://public.tableau.com/views/UniversitiesWorldRank/1?:embed=y&:display_count=yes&:showTabs=y
If you have any good idea, please comment below.