I downloaded this dataset from Kaggle. There are 6 csv files: timesData.csv, school_and_country_table.csv, cwurData.csv, educational_attainment_supplementary_data.csv, education_expenditure_supplementary_data.csv, shanghaiData.csv. What I will focus on is the timesData.
For more information about the dataset, please check: https://www.kaggle.com/mylesoneill/world-university-rankings
I will use R to do some exploration and use Tableau to make some visualizations.
1. Load the data
timesData <- read.csv("timesData.csv", header = TRUE)
2. View the structure of data
> dim(timesData) [1] 2603 14 > timesData[1:3,] world_rank university_name country teaching international research citations income total_score num_students 1 1 Harvard University United States of America 99.7 72.4 98.7 98.8 34.5 96.1 20,152 2 2 California Institute of Technology United States of America 97.7 54.6 98.0 99.9 83.7 96.0 2,243 3 3 Massachusetts Institute of Technology United States of America 97.8 82.3 91.4 99.9 87.5 95.6 11,074 student_staff_ratio international_students female_male_ratio year 1 8.9 25% 2011 2 6.9 27% 33 : 67 2011 3 9.0 33% 37 : 63 2011
In timesData.csv, there are 2,603 observations and 14 variables:
world _rank: world rank for the university. Contains rank ranges and equal ranks (eg. =94 and 201-250)
university_name:name of university
country:country of each university.
teaching:university score for teaching (the learning environment).
international:university score international outlook (staff, students, research).
research:university score for research (volume, income and reputation).
citations:university score for citations (research influence).
income:university score for industry income (knowledge transfer).
total_score:total score for university, used to determine rank.
num_students:number of students at the university.
student_staff ratio:Number of students divided by number of staff.
international_students:Percentage of students who are international.
female_male_ratio:Female student to Male student ratio.
year: year of the ranking (2011 to 2016 included).
3. Ask Questions and Visuliaze
Now, it's time to think about what I want to get from this data.Q1: Where are the first 100 rank university come from?
For this question, we can use a map to show.
Use country as a geo-information to generate a map.
Use world_rank as a filter to extract the university in the first 100.
Use the number of records as color.
Use year as a filter.
From this map, we can see that most top 100 universities come from America.
Q2: The total score determines the rank, but how it comes? What score matters most? Teaching? International? Research? Citation? Or Income?
For this question, we need to do regression in R first.
> dataset <- cbind.data.frame(timesData$teaching, timesData$international, timesData$research, + timesData$citations, timesData$income, timesData$total_score) > colnames(dataset) <- c("teaching", "international", "research", "citation", "income", "total") > summary(dataset) teaching international research citation income total Min. : 9.9 20.7 : 10 Min. : 2.90 Min. : 1.20 - : 218 - :1402 1st Qu.:24.7 29.6 : 10 1st Qu.:19.60 1st Qu.: 45.50 100.0 : 68 49.0 : 13 Median :33.9 - : 9 Median :30.50 Median : 62.50 28.0 : 26 51.1 : 12 Mean :37.8 34.3 : 9 Mean :35.91 Mean : 60.92 31.1 : 20 46.6 : 11 3rd Qu.:46.4 46.8 : 9 3rd Qu.:47.25 3rd Qu.: 79.05 28.8 : 19 46.9 : 10 Max. :99.7 48.4 : 9 Max. :99.40 Max. :100.00 28.5 : 18 50.1 : 10 (Other):2547 (Other):2234 (Other):1145
> library(sqldf) > dataset <- sqldf("SELECT * + FROM dataset + WHERE total != '-' + AND international != '-' + AND income != '-'") > dataset$total <- as.numeric(as.character(dataset$total)) > dataset$income <- as.numeric(as.character(dataset$income)) > dataset$international <- as.numeric(as.character(dataset$international)) > summary(dataset) teaching international research citation income total Min. :15.90 Min. : 14.8 Min. :13.10 Min. : 8.60 Min. : 24.20 Min. :41.40 1st Qu.:37.70 1st Qu.: 45.5 1st Qu.:36.90 1st Qu.: 66.50 1st Qu.: 36.40 1st Qu.:50.20 Median :46.45 Median : 61.6 Median :48.10 Median : 78.60 Median : 45.80 Median :55.80 Mean :50.04 Mean : 61.1 Mean :51.97 Mean : 76.81 Mean : 54.24 Mean :59.63 3rd Qu.:59.55 3rd Qu.: 79.1 3rd Qu.:64.70 3rd Qu.: 89.10 3rd Qu.: 68.12 3rd Qu.:65.90 Max. :99.70 Max. :100.0 Max. :99.40 Max. :100.00 Max. :100.00 Max. :96.10 > dim(dataset) [1] 1056 6
The most most basic regression is linear regression. We can get some information from scatterplot:
library(ggplot2) ggplot(dataset, aes(x=teaching, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE) ggplot(dataset, aes(x=international, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE) ggplot(dataset, aes(x=research, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE) ggplot(dataset, aes(x=citation, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE) ggplot(dataset, aes(x=income, y=total)) + geom_point(shape=1) +geom_smooth(method=lm, se = FALSE)
From the scatterplots, we can see only teaching and research score looks like linear regression with the total score.
Standard Linear Regression:
> lm_total = lm(total ~ ., data = dataset) > summary(lm_total) Call: lm(formula = total ~ ., data = dataset) Residuals: Min 1Q Median 3Q Max -1.12476 -0.12664 -0.02884 0.05652 1.78528 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.1176804 0.0539644 2.181 0.0294 * teaching 0.2997580 0.0011151 268.813 <2e-16 *** international 0.0703900 0.0003961 177.728 <2e-16 *** research 0.3009812 0.0009668 311.322 <2e-16 *** citation 0.3023552 0.0005430 556.857 <2e-16 *** income 0.0248092 0.0003820 64.954 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.2619 on 1050 degrees of freedom Multiple R-squared: 0.9996, Adjusted R-squared: 0.9996 F-statistic: 4.972e+05 on 5 and 1050 DF, p-value: < 2.2e-16 > rmse = sqrt(mean(residuals(lm_total)^2)) > rmse [1] 0.2611183 > n = length(dataset$total) > error = dim(n) > for (k in 1:n) { + train1 = c(1:n) + train = train1[train1 != k] + m1 = lm(total ~ ., data = dataset[train, ]) + pred = predict(m1, newdata = dataset[-train, ]) + obs = dataset$total[-train] + error[k] = obs - pred + } > lm1_me = mean(error) > lm1_rmse = sqrt(mean(error^2)) > lm1_me [1] 2.806287e-05 > lm1_rmse [1] 0.2629693
Non Linear Regression Model:
> poly1 = lm(total ~ teaching + research + poly(citation, degree = 2) + poly(international, degree = 2) + poly(income, degree = 2) , data = dataset) > n = length(dataset$total) > error = dim(n) > for (k in 1:n) { + train1 = c(1:n) + train = train1[train1 != k] + m3 = lm(total ~ teaching + research + poly(citation, degree = 2) + poly(international, degree = 2) + poly(income, degree = 2) , data = dataset[train,]) + pred = predict(m3, newdat = dataset[-train, ]) + obs = dataset$total[-train] + error[k] = obs - pred + } > poly1_me = mean(error) > poly1_rmse = sqrt(mean(error^2)) > poly1_rmse [1] 0.2590252
> poly2 = lm(total ~ teaching + research + poly(citation, degree = 3) + poly(international, degree = 3) + poly(income, degree = 3) , data = dataset) > n = length(dataset$total) > error = dim(n) > poly2 = lm(total ~ teaching + research + poly(citation, degree = 3) + poly(international, degree = 3) + poly(income, degree = 3) , data = dataset) > n = length(dataset$total) > error = dim(n) > for (k in 1:n) { + train1 = c(1:n) + train = train1[train1 != k] + m4 = lm(total ~ teaching + research + poly(citation, degree = 3) + poly(international, degree = 3) + poly(income, degree = 3) , data = dataset[train,]) + pred = predict(m4, newdat = dataset[-train, ]) + obs = dataset$total[-train] + error[k] = obs - pred + } > poly2_me = mean(error) > poly2_rmse = sqrt(mean(error^2)) > poly2_rmse [1] 0.2544657
Then, try to visualize in Tableau:
Q3: How to visualize the number of students and the ratio of student:staff?
The darker the color is, the more international students there are. The bigger the square is, the higher the ratio of student to staff is.
Also , you can check this link for a better interactive experience:
https://public.tableau.com/views/UniversitiesWorldRank/1?:embed=y&:display_count=yes&:showTabs=y
If you have any good idea, please comment below.








没有评论:
发表评论