Team members responsible for this notebook:
Yijia Mao: make ggplots for age groups and race groups, write explanations, and compare California employment rates with average employment rates.
Minghong Zheng: make employment rate difference maps, write explanations.
Yuhan Wang: make plots for general, run t-tests, run regressions and make predicts, write explanations.
Yiwen Wei: proofread codes.
%load_ext rmagic
from IPython.core.display import Image
%%bash
cd ../data1/cleaned
ls
%%R
print(getwd())
setwd('../visualizations')
Part A¶
We are first working on the 2013 data by state.
read csv into R.
%%R
gen2013=read.csv('../data1/cleaned/gen2013.csv',header=T)
print(head(gen2013))
We need to compare the general employment rate between men and women. So we first generated 2 subsets with group name "Men" and "Women" only. Then we made a scatter plot to show the employment rates in different states.
%%R -w 800
gen2013m=subset(gen2013, group=="Men")
gen2013w=subset(gen2013, group=="Women")
jpeg("Employment Rate of Men and Women by States in 2013.jpeg", width=800)
print(head(gen2013m$state))
plot(gen2013m$state, gen2013m$emp_rate, col='blue', las=2, ylim=c(min(gen2013m$emp_rate,gen2013w$emp_rate), max(gen2013m$emp_rate,gen2013w$emp_rate)),
xlab='State Name', ylab='Employment Rate (percentage)', col.lab='blue')
points(gen2013w$state, gen2013w$emp_rate, col='red')
points(gen2013m$state, gen2013m$emp_rate, col='blue')
legend('topleft', c('Employment Rate-Men','Employment Rate-Women'), col=c('blue','red'),pch=1)
title('Employment Rate of Men and Women by States in 2013')
dev.off()
Image("Employment Rate of Men and Women by States in 2013.jpeg")
We also would like to know which states have the highest employment rates, and which states have the lowest employment rates.
%%R
print(list(gen2013m[which.max(gen2013m$emp_rate),],
gen2013m[which.min(gen2013m$emp_rate),],
gen2013w[which.max(gen2013w$emp_rate),],
gen2013w[which.min(gen2013w$emp_rate),]))
And the employment rate difference between men and women.
we first made a new dataframe gen2013mw, which combined Men and Women employment rates for the same state in the same row. Two new columns were generated: emp_rate_diff, and unemp_rate_diff.
%%R
names(gen2013m)[5:6]=paste(names(gen2013m)[5:6],'_m',sep='')
names(gen2013w)[5:6]=paste(names(gen2013w)[5:6],'_w',sep='')
print(head(gen2013m))
%%R
gen2013mw=merge(gen2013m,gen2013w, by='state')
gen2013mw['emp_diff']=gen2013mw['emp_rate_m']-gen2013mw['emp_rate_w']
gen2013mw['unemp_diff']=gen2013mw['unemp_rate_m']-gen2013mw['unemp_rate_w']
gen2013mw=subset(gen2013mw, select=c(state,emp_rate_m,unemp_rate_m,emp_rate_w,unemp_rate_w,emp_diff, unemp_diff))
print(head(gen2013mw))
Make a scatter plot to see the employment difference. We can clearly see that for all states, the difference is positive, which suggests that men are employed more than women. To see this for sure, we will run t-tests to test whether the difference is statistically significant.
Similarly, we made a plot too show the unemployment rate difference. In most states, the unemployment of men is a few percentage higher then women.
%%R
jpeg("Employment Rate Difference in 2013.jpeg", width=800)
plot(gen2013mw$state, gen2013mw$emp_diff, col='blue', las=2, ylim=c(min(gen2013mw$emp_diff), max(gen2013mw$emp_diff)), xlab='State Name', ylab='Employment Rate Difference (Percentage)', col.lab='blue')
title('Employment Rate Difference in 2013')
dev.off()
Image("Employment Rate Difference in 2013.jpeg")
%%R
jpeg("Employment Rate Difference in 2013.jpeg", width=800)
plot(gen2013mw$state, gen2013mw$unemp_diff, col='blue', las=2, ylim=c(min(gen2013mw$unemp_diff), max(gen2013mw$unemp_diff)), xlab='State Name', ylab='Unmployment Rate Difference (Percentage)', col.lab='blue')
title('Unmployment Rate Difference in 2013')
dev.off()
Image("Employment Rate Difference in 2013.jpeg")
see which state has the highest and lowest employment difference. Do the same for unemployment rate.
%%R
print(list(gen2013mw[which.max(gen2013mw$emp_diff),],
gen2013mw[which.min(gen2013mw$emp_diff),],
gen2013mw[which.max(gen2013mw$unemp_diff),],
gen2013mw[which.min(gen2013mw$unemp_diff),]))
run a one-sample t-test to see whether we can reject the null hypothesis of "Men and women have the same employment rate", by comparing the employment rate difference with 0.
do the same for unemployment difference.
%%R
print(t.test(gen2013mw$emp_diff, mu=0))
print(t.test(gen2013mw$unemp_diff, mu=0))
from what we get from the t-tests, we can reject the null hypotheses, and say that men has much higher employment rate, while lower unemployment rate, than women.
Make maps to see employment rate difference between men and women in states in 2013
- Use packages "map" and "ggplot2" to make maps
%%R
install.packages("map")
install.packages("ggplot2")
%%R
require(ggplot2)
require(maps)
- Make a dataframe about an ordered list of longitude and latitude points that outlines each US state
%%R
us_state_map=map_data('state')
print(head(us_state_map))
- Name diff_rate as a dataframe with states and difference between employment rate of men and women in states in 2013
%%R
diff_rate=gen2013mw[c("emp_diff")]
B=tolower(unname(unlist(gen2013mw["state"])))#lowercase states
diff_rate["region"]=B#use name "region" instead of "state"
print(head(diff_rate))
- Merge our dataframe of diff_rate into the map data, and sort it agian
%%R
map_data1=merge(us_state_map,diff_rate,by="region",all=T)
map_data1=map_data1[order(map_data1$order), ]
print(head(map_data1))
- Generate a similar map for employment rate difference between men and women in 2013 as we have done above
%%R
setwd('../visualizations')
emp_difference=unname(unlist(map_data1["emp_diff"]))# covert the class of dataframe into class of numeric
print(qplot(long,lat , data=map_data1, geom="polygon", group=group, fill=emp_difference)
+ labs(x="", y="")+theme_bw()+ggtitle("Employment Rate Difference between Men and Women in 2013")
+theme(legend.position="bottom", legend.direction="horizontal")+scale_fill_gradient2("fill"))
- Save it as image "emp_diff_wm.jpeg" into visulizations directory
%%R
ggsave("emp_diff_wm.jpeg")
- Putting it all togther and make a function to generate this kind of map
%%R
mapusa=function(dataset_m,dataset_w,title_for_empdiff){
K=dataset_m["emp_rate"]-dataset_w["emp_rate"]
names(K)="emp_diff"
B=tolower(unname(unlist(dataset_m["state"])))#lowercase states
K["region"]=B
require(ggplot2)
require(maps)
us_state_map=map_data('state')
map_data=merge(us_state_map,K,by="region",all=T)
map_data=map_data[order(map_data$order), ]
emp_difference=unname(unlist(map_data["emp_diff"]))
setwd('../visualizations')
print(qplot(long,lat , data=map_data, geom="polygon", group=group, fill=emp_difference)
+ labs(x="", y="")+theme_bw()+ggtitle(title_for_empdiff)
+theme(legend.position="bottom", legend.direction="horizontal")+scale_fill_gradient2("fill"))
ggsave(paste(title_for_empdiff,".jpeg",sep=""))
}
Now we analyze the gender discrimination of all the states based on various age groups.¶
%%R
age2013=read.csv('../data1/cleaned/age2013.csv',header=T)
i=sapply(age2013, is.factor)
print(i)
age2013[i]=lapply(age2013[i],as.character)
age2013=age2013[complete.cases(age2013),]
A1=subset(age2013,grp_code==26|grp_code==33)
A2=subset(age2013,grp_code==27|grp_code==34)
A3=subset(age2013,grp_code==28|grp_code==35)
A4=subset(age2013,grp_code==29|grp_code==36)
A5=subset(age2013,grp_code==30|grp_code==37)
A6=subset(age2013,grp_code==31|grp_code==38)
A7=subset(age2013,grp_code==32|grp_code==39)
A=list(A1,A2,A3,A4,A5,A6,A7)
print(A1[1,"grp_code"])
print(head(A1))
clean=function(x,diff){ #make a function called"clean" to clean non-matching data
AK=NULL
a=1
for (i in (1:((nrow(x)-1))))
{if (abs(x[i+1,"grp_code"]-x[i,"grp_code"])!=diff)
{AK[a]=i
a=a+1}}
if (is.null(AK)) {return(x=x)}
else{return(x=x[-AK,])}}
for( i in (1: length(A)))
{A[[i]]=clean(A[[i]],7)} # clean non- matching data
%%R -h 1000 -w 1000
bigA=do.call(rbind, A)
bigA$Age=gsub(".*, ", "", bigA$group)
bigA$Gender=gsub(", .*", "", bigA$group)
p <- ggplot(bigA, aes(x = factor(Age), y = emp_rate, color = Gender)) + geom_boxplot() + geom_jitter() + ggtitle("Employment Rate by Age Groups") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Age Groups") + ylab("Employment Rate")
print(p)
ggsave("Employment Rate by Age Groups.jpeg")
%%R -h 1000 -w 1000
p2 <- ggplot(bigA, aes(x = factor(Age), y = unemp_rate, color = Gender)) + geom_boxplot() + geom_jitter() + ggtitle("Unemployment Rate by Age Groups") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Age Groups") + ylab("Unemployment Rate")
print(p2)
ggsave("Unemployment Rate by Age Groups.jpeg")
Use the function"mapusa" to make a map for employment rate difference of different age groups according to gender¶
Here is an example for making a map of empolyment rate difference from 16 to 19 years according to gender¶
%%R
Age2013m=subset(A[[1]],grp_code==A[[1]][1,"grp_code"])# the dataset of employment and unemployment rate among men from 16 to 19 years
Age2013w=subset(A[[1]],grp_code==A[[1]][2,"grp_code"])# the dataset of employment and unemployment rate among women from 16 to 19 years
mapusa(Age2013m,Age2013w,"Employment Rate Difference from 16 to 19 years")
- Put the codes above together and make a function called "make_age_group" to make maps for different age groups
%%R
make_age_group=function(i){
Agegroup=c("16 to 19 years","20 to 24 years","25 to 34 years","35 to 44 years","45 to 54 years","55 to 64 years","65 years and over")
Age2013m=subset(A[[i]],grp_code==A[[i]][1,"grp_code"])
Age2013w=subset(A[[i]],grp_code==A[[i]][2,"grp_code"])
mapusa(Age2013m,Age2013w,paste("Employment Rate Difference from ",Agegroup[i],sep=""))
}
- Make a map for empolyment rate difference from 20 to 24 years according to gender
%%R
make_age_group(2)
- Make a map for empolyment rate difference from 25 to 34 years according to gender
%%R
make_age_group(3)
- Make a map for empolyment rate difference from 35 to 44 years according to gender
%%R
make_age_group(4)
- Make a map for empolyment rate difference from 45 to 54 years according to gender
%%R
make_age_group(5)
- Make a map for empolyment rate difference from 55 to 64 years according to gender
%%R
make_age_group(6)
- Make a map for empolyment rate difference from 65 years and over according to gender
%%R
make_age_group(7)
%%R
race2013=read.csv('../data1/cleaned/race2013.csv',header=T)
i=sapply(race2013, is.factor)
race2013[i]=lapply(race2013[i],as.character)
race2013=race2013[complete.cases(race2013),]
R1=subset(race2013,grp_code==5|grp_code==6)
R2=subset(race2013,grp_code==8|grp_code==9)
R3=subset(race2013,grp_code==14|grp_code==15)
R=list(R1,R2,R3)
for( i in (1: length(R)))
{R[[i]]=clean(R[[i]],1)} # use the function"clean" to clean non- matching data
print(R[[2]])
%%R -h 1000 -w 1000
bigR=do.call(rbind, R)
bigR$Gender=gsub(".*, ", "", bigR$group)
bigR$Race=gsub(", .*", "", bigR$group)
print(head(bigR))
q <- ggplot(bigR, aes(x = factor(Race), y = emp_rate, color = Gender)) + geom_boxplot() + geom_jitter() + ggtitle("Employment Rate by Race Groups") + xlab("Race Groups") + ylab("Employment Rate")
print(q)
ggsave("Employment Rate by Race Groups.jpeg")
%%R -h 1000 -w 1000
q2 <- ggplot(bigR, aes(x = factor(Race), y = unemp_rate, color = Gender)) + geom_boxplot() + geom_jitter() + ggtitle("Unemployment Rate by Race Groups") + xlab("Race Groups") + ylab("Unemployment Rate")
print(q2)
ggsave("Unemployment Rate by Race Groups.jpeg")
want to compare California Employment rate with the rates in other states & predict its trend according to the data in the past 10 years
We take the average of all males and females in all states and compare them with the average of males and females in California:
%%R -h 1000 -w 1000
print(head(gen2013))
gen2013m=subset(gen2013, group=="Men")
gen2013w=subset(gen2013, group=="Women")
MenAvg=mean(gen2013m[ ,"emp_rate"], na.rm = TRUE)
WomenAvg=mean(gen2013w[ ,"emp_rate"], na.rm=TRUE)
print(MenAvg)
print(gen2013$emp_rate)
gen2013m$difference=gen2013m$emp_rate-MenAvg
gen2013w$differencen=gen2013w$emp_rate-WomenAvg
print(head(gen2013m))
jpeg("Male: Difference between California Employment Rate and Average Employment Rate.jpeg",width=800)
plot(gen2013m$state,gen2013m$difference, las=2, xlab="state", main="Male: Difference between California Employment Rate and Average Employment Rate")
dev.off()
Image("Male: Difference between California Employment Rate and Average Employment Rate.jpeg")
%%R
jpeg("Female: Difference between California Employment Rate and Average Employment Rate.jpeg",width=800)
plot(gen2013w$state,gen2013w$difference, las=2, xlab="state", main="Female: Difference between California Employment Rate and Average Employment Rate")
dev.off()
Image("Female: Difference between California Employment Rate and Average Employment Rate.jpeg")
%%R
gencal=read.csv('../data1/cleaned/gencal.csv',header=T)
print(head(gencal))
Use the function"mapusa" to make maps for employment rate difference in different race groups according to gender¶
- Create a similar function called "make_race_group" as the function "make_age_group" to generate maps for different race groups
%%R
make_race_group=function(i){
Racegroup=c("Black or African American","Hispanic or Latino ethnicity","White")
Age2013m=subset(R[[i]],grp_code==R[[i]][1,"grp_code"])
Age2013w=subset(R[[i]],grp_code==R[[i]][2,"grp_code"])
mapusa(Age2013m,Age2013w,paste("Employment Rate Difference for ",Racegroup[i],sep=""))
}
%%R
A=list()
- Make a map for empolyment rate difference for Black or African American according to gender
%%R
make_race_group(1)
- Make a map for empolyment rate difference for Hispanic or Latino ethnicity according to gender
%%R
make_race_group(2)
- Make a map for empolyment rate difference for White according to gender
%%R
make_race_group(3)
See the change trend, and make a prediction
Use 2013 California data, a linear regression is done using emp_diff (the employment rate difference between men and women) as dependent variable, and year as independent variable. And do the same for unemp_diff.
Using the regression result. We were able to predict the employment rate difference and unemployment difference in 2014.
First, calculate the employment rate difference, and clean the dataframe.
%%R
gencalm=subset(gencal, group=='Men')
gencalw=subset(gencal, group=="Women")
gencalmw=merge(gencalm,gencalw, by='year')
gencalmw['emp_diff']=gencalmw['emp_rate.x']-gencalmw['emp_rate.y']
gencalmw['unemp_diff']=gencalmw['unemp_rate.x']-gencalmw['unemp_rate.y']
gencalmw=subset(gencalmw, select=c('year','emp_diff', 'unemp_diff'))
print(gencalmw)
Run a linear regression to see the intercept and the coefficient.
%%R
year=gencalmw$year
diff=gencalmw$emp_diff
undiff=gencalmw$unemp_diff
model=lm(diff~year) #model of employment
modelun=lm(undiff~year) #model of unemployment
print(model)
print(modelun)
Predict the employment rate difference in 2014.
Combine the data together with data in the past years, and make a plot with regression line.
Do the same for unemployment rate.
%%R
print(getwd())
setwd('../visualizations')
%%R
emp2014=data.frame(year=2014)
pred=predict(model, newdata=emp2014)
predun=predict(modelun, newdata=emp2014)
print(c(pred,predun))
data2014=data.frame(2014, pred, predun)
names(data2014)[1:3]=c('year','emp_diff', 'unemp_diff')
gencal14=rbind(gencalmw,data2014)
jpeg("prediction emp.jpeg")
plot(gencal14$year,gencal14$emp_diff, xlab="year", ylab="Employment Difference")
abline(model)
title(main="Employment Rate Linear Regression in California")
dev.off()
jpeg("prediction unemp.jpeg")
plot(gencal14$year,gencal14$unemp_diff, xlab="year",
ylab="Unemployment Difference")
abline(modelun)
title(main="Unemployment Rate Linear Regression in California")
dev.off()
The regression results show that if our linear model is accurate, then the employment rate difference in 2014 is likely to be 11.85, and the unemployment rate difference in 2014 is likely to be 0.98.
Image('prediction emp.jpeg')
Image('prediction unemp.jpeg')