Part 2: Data Cleaning
Team members responsible for this notebook:
Yuhan Wang: clean data and save them, make dataframes for each group, make boxplots to see the data summary, make trend graphs for California, write explanation.
Minghong Zheng, Yiwen Wei, Yijia Mao: proofread codes.
%load_ext rmagic
Import "Image" to show the graphs. Set working directory to ../visualizations to save our graphs¶
from IPython.core.display import Image
%%R
print(getwd())
setwd('../visualizations')
See where our raw data files are.
%%bash
cd ../data1/raw
ls
Data cleaning Part A, using data in 2013 only, for all states.
Read the csv file that contains our raw data.
Find the variables with mode of "factor", and change them to "character".
Delete unnecessary variables, observations with NA values.
Check the data frame we get.
%%R
A13=read.csv('../data1/raw/2013_to_be_cleaned.csv',header=T)
i=sapply(A13, is.factor)
A13[i]=lapply(A13[i],as.character)
A13=subset(A13, select=c(grp_code, state, group, emp_rate, unemp_rate))
A13=A13[complete.cases(A13),]
print(head(A13))
Check the mode of our variables.
%%R
type=data.frame(lapply(A13,class))
print(type)
Since we have 3 different questions to answer: general gender discrimination, gender discrimination in different age groups, gender discrimination in different races, 3 dataframes will be created, 1 for each question.
%%R
A13gen=subset(A13, grp_code==2 | grp_code==3) #general
A13age=subset(A13, grp_code>=26 & grp_code<=39) #age
A13race=subset(A13,grp_code==5|grp_code==6|grp_code==8|grp_code==9
|grp_code==11|grp_code==12|grp_code==14|grp_code==15) #race
print(head(A13race))
print(class(A13age))
write cleaned data into csv files, and save them in the proper directory.
%%R
write.csv(A13gen, '../data1/cleaned/gen2013.csv')
write.csv(A13age, '../data1/cleaned/age2013.csv')
write.csv(A13race, '../data1/cleaned/race2013.csv')
Data Visualization for 2013
Now we show the summary of relevant variables. And each make a boxplot to visualize.
general employment rate by gender
general unemployment rate by gender
employment rate in different age groups by gender
unemployment rate in different age groups by gender
employment rate in race groups by gender
unemployment rate in race groups by gender
%%R
install.packages('psych')
library(psych)
Summary of data by group. We can see the number of observations, mean, standard deviation, median, min, max, etc.
%%R
print(describe.by(A13gen$emp_rate, A13gen$group))
Make a boxplot to visualize the data distribution, and save it.
%%R
jpeg('summary general.jpeg', width=600, height=600)
par(mfrow=c(2,1))
boxplot(A13gen$emp_rate~A13gen$group, ylab="employment rate",
main="Boxplot-general employment rate")
boxplot(A13gen$unemp_rate~A13gen$group, ylab="unemployment rate",
main="Boxplot-general unemployment rate")
dev.off()
Image('summary general.jpeg')
As we can see from the boxplot, the employment rates of Men is in general higher than the employment rates of Women.
%%R
print(describe.by(A13age$emp_rate, A13age$group))
%%R
jpeg('summary by age.jpeg', width=600, height=1000)
par(mfrow=c(2,1))
par(mar=c(14.1,4.1,4.1,2.1))
boxplot(A13age$emp_rate~A13age$group, las=2, ylab="employment rate",
main="Boxplot-employment rate by age group")
boxplot(A13age$unemp_rate~A13age$group, las=2, ylab="unemployment rate",
main="Boxplot-unemployment rate by age group")
dev.off()
Image('summary by age.jpeg')
As shown in the boxplot above, the employment rate increases as age increases until the 35 to 44 years age group where it reaches the highest value, and then it decreases as age gets larger.
%%R
print(describe.by(A13race$emp_rate, A13race$group))
%%R
jpeg('summary by race.jpeg', width=600, height=1100)
par(mfrow=c(2,1))
par(mar=c(14.1,4.1,4.1,2.1))
boxplot(A13race$emp_rate~A13race$group, las=2, ylab="employment rate",
main="Boxplot-employment rate by race group")
boxplot(A13race$unemp_rate~A13race$group, las=2, ylab="unemployment rate",
main="Boxplot-unemployment rate by race group")
dev.off()
Image('summary by race.jpeg')
Data Cleaning Part B, using data in 2004-2013, focusing on California to see the trend.
Read csv file into R, and combine them together. Add a new variable "year" to distinguish data for each year.
%%R
A=read.csv('../data1/raw/2004_to_be_cleaned.csv',header=T)
A['year']=2004
for (i in 2005:2013){
B=read.csv(paste('../data1/raw/',i,'_to_be_cleaned.csv',sep=''), header=T)
B['year']=i
A=rbind(A,B)}
i=sapply(A, is.factor)
A[i]=lapply(A[i],as.character)
A['unemp_rate']=lapply(A['unemp_rate'],as.numeric)
print(head(A))
%%R
type=data.frame(lapply(A,class))
print(type)
Keep only the variables we need. we will keep year, grp_code, group,emp_rate and unemp_rate. keep only observations with state name "California".
%%R
A=subset(A, select=c(year, grp_code, state, group, emp_rate, unemp_rate))
A=A[A$state=='California',]
print(tail(A))
Delete any individual with N.A values.
%%R
A=A[complete.cases(A),]
Since we have 3 different questions to answer: general gender discrimination, gender discrimination in different age groups, gender discrimination in different races, 3 dataframe will be created, one for each question.
%%R
Agen=subset(A, grp_code==2 | grp_code==3) #general
Aage=subset(A, grp_code>=26 & grp_code<=39) #age
Arace=subset(A,grp_code==5|grp_code==6|grp_code==8|grp_code==9
|grp_code==11|grp_code==12|grp_code==14|grp_code==15) #race
print(head(Agen))
print(class(Aage))
%%R
write.csv(Agen, '../data1/cleaned/gencal.csv')
write.csv(Aage, '../data1/cleaned/agecal.csv')
write.csv(Arace, '../data1/cleaned/racecal.csv')
Make boxplots to visualize the employment rate and unemployment rate in CA in the past 10 years.
%%R
jpeg('summary CA general.jpeg', width=600, height=600)
par(mfrow=c(2,1))
boxplot(Agen$emp_rate~Agen$group, ylab="employment rate",
main="general employment rate in CA, 2004-2013")
boxplot(Agen$unemp_rate~Agen$group, ylab="unemployment rate",
main="general unemployment rate in CA, 2004-2013")
dev.off()
Image('summary CA general.jpeg')
%%R
jpeg('summary CA age.jpeg', width=600, height=1000)
par(mfrow=c(2,1))
par(mar=c(14.1,4.1,4.1,2.1))
boxplot(Aage$emp_rate~Aage$group, las=2, ylab="employment rate",
main="employment rate by age group in CA, 2004-2013")
boxplot(Aage$unemp_rate~Aage$group, las=2, ylab="unemployment rate",
main="unemployment rate by age group in CA, 2004-2013")
dev.off()
Image('summary CA age.jpeg')
%%R
jpeg('summary CA by race.jpeg', width=600, height=1100)
par(mfrow=c(2,1))
par(mar=c(14.1,4.1,4.1,2.1))
boxplot(Arace$emp_rate~Arace$group, las=2, ylab="employment rate",
main="employment rate by race in CA, 2004-2013")
boxplot(Arace$unemp_rate~Arace$group, las=2, ylab="unemployment rate",
main="unemployment rate by race in CA, 2004-2013")
dev.off()
Image('summary CA by race.jpeg')
Employment rate change trend in the past 10 years in California
We make a graph to see the employment rate change in the past 10 years (from 2004 to 2013) in California. This will also allow us to visualize the employment rate difference in men and women.
%%R
gencal=read.csv('../data1/cleaned/gencal.csv',header=T)
print(head(gencal))
Create subsets for men and women. Make plot with lines connecting the dots.
%%R
gencalm=subset(gencal, group=='Men')
gencalw=subset(gencal, group=='Women')
print(head(gencalm))
jpeg('general trend.jpeg', width=600, height=400)
plot(gencalm$year,gencalm$emp_rate,
ylim=c(min(gencalw$emp_rate, gencalm$emp_rate),
max(gencalw$emp_rate, gencalm$emp_rate)),
xlab="year", ylab="Employment Rate")
lines(gencalm$year,gencalm$emp_rate)
points(gencalw$year, gencalw$emp_rate, col="red")
lines(gencalw$year,gencalw$emp_rate, col="red")
legend('topright', c("Employment rate - Men","Employment rate - Women"),
col=c("black","red"), pch=1)
title(main="Employment Rate Change in the last 10 years in California")
dev.off()
The graph below is made to show the employment rate changes from 2004 to 2013 in California. The black line is for Men, while the red line is for Women.
Image('general trend.jpeg')
Trend by age group in the last 10 years
Create a list called listage: "agecal26",...,"agecal32".
Assign a subset of dataframe "agecal" to each one in the list with matching group code.
Make a plot, and add points and lines to the plot. Now each age group has a different color. (see legend). Solid lines are for men, while dotted lines are for women.
%%R
agecal=read.csv('../data1/cleaned/agecal.csv',header=T)
print(head(agecal))
%%R
jpeg('trend by age group.jpeg', width=600, height=600)
listage=list(paste('agecal',26:39, sep=""))
for (i in 1:14){
listage[[i]]=subset(agecal, grp_code==as.character(i+25))}
par(mai=c(0.82,0.82,0.82,1.22),xpd=T)
plot(emp_rate ~ year, data=agecal, type="n", xlab="year",
ylab="employment rate")
title(main="Employment rate in CA by age group in the last 10 years")
colors=rep(c('red','yellow','orange','pink','green','blue','purple'),2)
linestyle=c(rep(1,7),rep(2,7))
for (i in 1:14){
points(emp_rate ~ year, data=listage[[i]],
col=as.character(colors[i]))
lines(emp_rate ~ year, data=listage[[i]],
col=as.character(colors[i]), lty=linestyle[i])
}
legend("topright", inset=c(-0.2,0),c("16-19","20-24",'25-34','35-44','45-54','55-64','65+'),
col=c('red','yellow','orange','pink','green','blue','purple'), pch=1)
dev.off()
Image('trend by age group.jpeg')
interesting findings from the graph above
In age group "16-19 years", women has higher employment rate than men. This is the only age group that women get employed more. This is possible because more guys than girls are going to high schools or colleges at that age instead of working.
Also, in 2008, most of the groups' employment rates are decreasing due to the financial crisis. However, the age group of "65 years old and above" has an increasing trend.
Trend graph by race group
%%R
racecal=read.csv('../data1/cleaned/racecal.csv',header=T)
print(head(racecal))
Create a list called listrace: "racecal1",...,"racecal8".
Assign a subset of dataframe "racecal" to each one in the list with matching group code: 5 for White Men, 6 for White Women, 8 for Black Men, 9 for Black Women, 11 for Asian Men, 12 for Asian Women, 14 for Latino Men, and 15 for Latino Women.
Add points and lines to the plot. So now each race group has a different color. (see legend). Solid lines represent men, while dotted lines represent women.
%%R
jpeg('trend by race group.jpeg', width=600, height=600)
listrace=list(paste('racecal',1:8, sep=""))
for (i in 1:2){
listrace[[i]]=subset(racecal, grp_code==as.character(i+4))}
for (i in 3:4){
listrace[[i]]=subset(racecal, grp_code==as.character(i+5))}
for (i in 5:6){
listrace[[i]]=subset(racecal, grp_code==as.character(i+6))}
for (i in 7:8){
listrace[[i]]=subset(racecal, grp_code==as.character(i+7))}
par(mai=c(0.82,0.82,0.82,1.22),xpd=T)
plot(emp_rate ~ year, data=racecal, type="n", xlab="year",
ylab="employment rate")
title(main="Employment rate in CA by race group in the last 10 years")
colors=c(rep('red',2),rep('yellow',2),rep('green',2), rep('blue',2))
linestyle=rep(c(1,2),4)
for (i in 1:8){
points(emp_rate ~ year, data=listrace[[i]],
col=as.character(colors[i]))
lines(emp_rate ~ year, data=listrace[[i]],
col=as.character(colors[i]), lty=linestyle[i])
}
legend("topright", inset=c(-0.2,0),c('white','black','asian','latino'),
col=c('red','yellow','green','blue'), pch=1)
dev.off()
Image('trend by race group.jpeg')
*** Some Interesting Findings from the Graph Above
In all four race groups, men have higher employment rate than women in the same race group.
The employment rate difference is largest in Latinos, possible due to the fact that many Latino women are busy raising their children. The second largest difference is found in Whites, while the employment rate difference in Black people is the smallest.
Review the cleaned data, and Visualizations
Go to the folder where we saved cleaned data, and list the files inside.
6 cleaned data frames are generated and stored in this folder, 3 of them are for 2013 analysis, and the other 3 are for California trend analysis.
%%bash
cd ../data1/cleaned
ls
Go to the folder where we saved visualizations, and list the images inside.
6 images with names starting with "summary" are boxplots made to visualize the general dataset features, such as max, min, median, and outliers.
The other 3 images are made to see the trend of employment rate changes in the past 10 years in California. All of them are discussed briefly above, and will be explored in detail in Notebook 4.
%%bash
cd ../visualizations
ls