Date

Team members responsible for this notebook:

Yiwen Wei: write analysis results and conclusions.

Yijia Mao: save images and put them in the notebook.

Minghong Zheng: make PDF files, revise analysis results.

Yuhan Wang: write results for t-tests, make PowerPoint slides, publish project website.

Project Objective

Our final project explores whether gender affects the employment rate and the unemployment rate in different states over the most recent 10 years, and whether gender discrimination exists in the job market. We address five questions:

  1. Is there gender discrimination in the job market? Do employers favor men over women?
  2. Which states have the most severe discrimination and which states have the least?
  3. Do other factors, such as age and race, affect the discrimination situation?
  4. Focusing on California, how does it compare with the average?
  5. Focusing on California, what is the trend over the past 10 years?
In [1]:
%load_ext rpy2.ipython
In [2]:
from IPython.display import Image

Data Gathering

The State Employment and Unemployment statistics from 2004 to 2013 were downloaded from the Bureau of Labor Statistics website as xls files. We saved them in the subdirectory "raw" and loaded the data into a dataframe. We then renamed the variables (Group Code, State, Group, Population, Total number of labor, Percentage of labor, Total number of employment, Employment rate, Total number of unemployment, Unemployment rate) and dropped the unnecessary rows. Finally, we used the script file to load the other xls files into dataframes and save them as csv files.
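The renaming and trimming step can be sketched roughly as follows. This is only an illustration in plain Python, not the script the team used: the short column names are assumptions (only `state`, `group`, `grp_code`, `emp_rate` and `unemp_rate` appear later in this notebook), the helper names are hypothetical, and the sample rows are made up rather than taken from the BLS files.

```python
import csv
import io

# Short names are assumptions, except state, group, grp_code, emp_rate and
# unemp_rate, which appear in the cleaned data used later in this notebook.
SHORT_NAMES = [
    "grp_code", "state", "group", "population", "labor_total",
    "labor_pct", "emp_total", "emp_rate", "unemp_total", "unemp_rate",
]

def rename_and_trim(rows, n_header_rows=1):
    """Drop the leading header rows and any row that does not have one
    value per variable, then label each remaining row with SHORT_NAMES."""
    data_rows = rows[n_header_rows:]
    return [dict(zip(SHORT_NAMES, row)) for row in data_rows
            if len(row) == len(SHORT_NAMES)]

def to_csv_text(records):
    """Serialize the renamed records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=SHORT_NAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Made-up rows standing in for one parsed spreadsheet (not real BLS figures).
raw = [
    ["Group Code", "State", "Group", "Population", "Total number of labor",
     "Percentage of labor", "Total number of employment", "Employment rate",
     "Total number of unemployment", "Unemployment rate"],
    ["1", "Example State", "Men", "1000", "700", "70.0", "650", "65.0", "50", "7.1"],
    ["placeholder footnote row"],
]
records = rename_and_trim(raw)
csv_text = to_csv_text(records)
print(csv_text)
```

Filtering on the row length is one simple way to discard title and footnote rows; the real spreadsheets may need a different rule.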

Data Cleaning

Part A, 2013 data for all states

We first read the csv file that contains our raw data, find the variables with mode "factor", and change them to "character". We then delete the unnecessary variables and the observations with NA values.

Since we have 3 different questions to answer (general gender discrimination, gender discrimination in different age groups, and gender discrimination in different races), we create 3 dataframes, one for each question.
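A minimal sketch of this cleaning and splitting logic, written in plain Python for illustration. The classification rule (by the text of the group label) is an assumption based on the labels printed in the summaries below; the original cleaning may have used the group code instead. All function names and sample records are hypothetical.

```python
def is_missing(value):
    """Treat None, empty strings and the literal 'NA' as missing."""
    return value is None or str(value).strip() in ("", "NA")

def drop_incomplete(records, required):
    """Keep only records where every required field is present."""
    return [r for r in records
            if not any(is_missing(r.get(key)) for key in required)]

def classify(group):
    """Assumed rule: map a group label to the question it belongs to."""
    if group in ("Men", "Women"):
        return "general"
    if "years" in group:
        return "age"
    if group.endswith(", men") or group.endswith(", women"):
        return "race"
    return None  # labels such as totals belong to none of the three questions

def split_by_question(records):
    """Build one list of records per question."""
    frames = {"general": [], "age": [], "race": []}
    for record in records:
        question = classify(record["group"])
        if question is not None:
            frames[question].append(record)
    return frames

# Made-up records (placeholder values, not the project data).
sample = [
    {"state": "Example State", "group": "Men", "emp_rate": "65.0"},
    {"state": "Example State", "group": "Women", "emp_rate": "NA"},
    {"state": "Example State", "group": "Women, 20 to 24 years", "emp_rate": "60.0"},
    {"state": "Example State", "group": "White, men", "emp_rate": "66.0"},
    {"state": "Example State", "group": "Total", "emp_rate": "60.0"},
]
frames = split_by_question(drop_incomplete(sample, ["emp_rate"]))
print({name: len(rows) for name, rows in frames.items()})
```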

Below is a general description of the cleaned data:

2013 General

In [3]:
%%R
library(psych)
A13gen=read.csv('../data1/cleaned/gen2013.csv',header=T)
setwd("../visualizations")
In [4]:
%%R
print(describeBy(A13gen$emp_rate, A13gen$group))
group: Men
  vars  n  mean   sd median trimmed mad min  max range skew kurtosis   se
1    1 51 65.04 4.63   64.2   64.86 4.6  55 75.5  20.5 0.32    -0.26 0.65
------------------------------------------------------------ 
group: Women
  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis  se
1    1 51 54.65 4.98   54.1   54.47 5.04 45.9 66.2  20.3 0.34    -0.82 0.7

2013 By Age Group

In [5]:
%%R
A13age=read.csv('../data1/cleaned/age2013.csv',header=T)

print(describeBy(A13age$emp_rate, A13age$group))
group: Men, 16 to 19 years
  vars  n  mean   sd median trimmed  mad  min max range skew kurtosis   se
1    1 49 28.79 8.57   26.8   28.09 7.56 11.8  52  40.2 0.78      0.2 1.22
------------------------------------------------------------ 
group: Men, 20 to 24 years
  vars  n  mean   sd median trimmed  mad  min  max range  skew kurtosis   se
1    1 51 66.16 6.67   65.2   66.13 6.08 46.7 79.9  33.2 -0.08     0.22 0.93
------------------------------------------------------------ 
group: Men, 25 to 34 years
  vars  n  mean   sd median trimmed  mad  min max range skew kurtosis   se
1    1 51 83.08 3.83   83.1   83.01 3.26 73.4  91  17.6 0.04     -0.1 0.54
------------------------------------------------------------ 
group: Men, 35 to 44 years
  vars  n mean   sd median trimmed  mad min  max range skew kurtosis   se
1    1 51 85.7 3.37   85.6   85.83 3.26  78 91.6  13.6 -0.3    -0.39 0.47
------------------------------------------------------------ 
group: Men, 45 to 54 years
  vars  n mean   sd median trimmed  mad  min  max range  skew kurtosis   se
1    1 51 80.8 4.85   80.5   80.87 6.23 70.6 89.9  19.3 -0.08    -0.91 0.68
------------------------------------------------------------ 
group: Men, 55 to 64 years
  vars  n  mean   sd median trimmed  mad  min  max range  skew kurtosis   se
1    1 51 66.77 6.37   67.1   66.87 5.93 51.8 81.3  29.5 -0.12    -0.54 0.89
------------------------------------------------------------ 
group: Men, 65 years and over
  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 50 23.08 3.96  22.25   22.93 4.52 14.6 33.1  18.5 0.34     -0.1 0.56
------------------------------------------------------------ 
group: Women, 16 to 19 years
  vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 46 31.1 9.18   27.9   30.69 9.12 16.7 48.7    32 0.45    -1.06 1.35
------------------------------------------------------------ 
group: Women, 20 to 24 years
  vars  n  mean  sd median trimmed  mad min  max range skew kurtosis   se
1    1 51 62.46 7.2   61.7   62.21 7.26  47 76.8  29.8 0.23    -0.81 1.01
------------------------------------------------------------ 
group: Women, 25 to 34 years
  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 51 69.43 5.49   68.8   69.33 5.19 54.4 81.7  27.3 0.06    -0.01 0.77
------------------------------------------------------------ 
group: Women, 35 to 44 years
  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis  se
1    1 51 71.74 5.02   70.7   71.57 5.04 62.6 81.4  18.8 0.26    -0.96 0.7
------------------------------------------------------------ 
group: Women, 45 to 54 years
  vars  n mean  sd median trimmed  mad  min  max range skew kurtosis   se
1    1 51 71.5 5.3   70.8   71.28 5.04 60.5 83.5    23  0.3    -0.65 0.74
------------------------------------------------------------ 
group: Women, 55 to 64 years
  vars  n  mean  sd median trimmed  mad  min  max range skew kurtosis   se
1    1 51 57.91 7.1     58   57.81 7.71 43.3 73.1  29.8 0.07    -0.77 0.99
------------------------------------------------------------ 
group: Women, 65 years and over
  vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 50 14.9 2.89   14.4   14.63 3.04 10.2 21.6  11.4 0.64    -0.39 0.41

2013 By Race Group

In [6]:
%%R
A13race=read.csv('../data1/cleaned/race2013.csv',header=T)

print(describeBy(A13race$emp_rate, A13race$group))
group: Black or African American, men
  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 38 55.49 6.09   55.1   55.38 6.82 43.5 72.2  28.7 0.29    -0.19 0.99
------------------------------------------------------------ 
group: Black or African American, women
  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 38 52.07 4.85   52.4   51.97 5.11 43.6 62.7  19.1 0.12    -0.87 0.79
------------------------------------------------------------ 
group: Hispanic or Latino ethnicity, men
  vars  n  mean   sd median trimmed  mad  min  max range  skew kurtosis   se
1    1 46 71.69 7.27  72.65   71.99 8.38 55.4 82.4    27 -0.37    -1.01 1.07
------------------------------------------------------------ 
group: Hispanic or Latino ethnicity, women
  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 42 53.61 6.11  53.15   52.96 5.78 45.7 72.2  26.5 0.99     0.58 0.94
------------------------------------------------------------ 
group: White, men
  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 51 66.48 4.89   65.3   66.13 4.74 54.6 81.6    27 0.64     0.88 0.68
------------------------------------------------------------ 
group: White, women
  vars  n  mean   sd median trimmed  mad  min max range skew kurtosis   se
1    1 51 55.09 5.94   54.2   54.59 5.78 45.7  75  29.3 0.91      0.8 0.83

Part B: California

We cleaned the data for California for the last 10 years.

We first read the csv files from 2004 to 2013 into R and combined them. We then added a new variable "year" to distinguish the data for each year. We kept only the variables we need (year, grp_code, group, emp_rate and unemp_rate) and only the observations with state name "California".

As before, we have 3 different questions to answer (general gender discrimination, gender discrimination in different age groups, and gender discrimination in different races), so we create 3 dataframes, one for each question.
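The combination step might look like the sketch below, shown in plain Python rather than the R used for the actual cleaning. The function name `california_panel` and the sample records are hypothetical; only the variable names come from the text above.

```python
KEEP = ["year", "grp_code", "group", "emp_rate", "unemp_rate"]

def california_panel(records_by_year, state="California"):
    """Stack the yearly records, tag each with its year, keep one state
    and keep only the variables listed in KEEP."""
    panel = []
    for year in sorted(records_by_year):
        for record in records_by_year[year]:
            if record["state"] != state:
                continue
            row = dict(record, year=year)
            panel.append({key: row[key] for key in KEEP})
    return panel

# Made-up records for two years (placeholder values, not the project data).
records_by_year = {
    2005: [
        {"state": "California", "grp_code": "2", "group": "Men",
         "emp_rate": 66.0, "unemp_rate": 5.0},
        {"state": "Example State", "grp_code": "2", "group": "Men",
         "emp_rate": 64.0, "unemp_rate": 4.0},
    ],
    2004: [
        {"state": "California", "grp_code": "3", "group": "Women",
         "emp_rate": 53.0, "unemp_rate": 6.0},
    ],
}
panel = california_panel(records_by_year)
for row in panel:
    print(row)
```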

Then we made boxplots to visualize the employment rate and unemployment rate in CA in the past 10 years.

Summary of California data, general

In [7]:
Image('summary CA general.jpeg')
Out[7]:

The average employment rate of California men is 65% and that of California women is 52%; the distribution for California men is more dispersed. Although the difference is still large, it is almost the same as the general difference across all states, so the discrimination situation is no worse in California. On the other hand, the distributions of the unemployment rate do not differ much.

Summary of CA data by age group

In [8]:
Image('summary CA age.jpeg')
Out[8]:

The age group situation in California is basically the same as that of the year 2013. For both men and women, the employment rate increases with age until the 35 to 44 years age group, where it reaches its highest value, and then decreases as age increases further. The unemployment rate is highest in the 16 to 19 years age group and decreases with age after that.

Summary of CA data by race group

In [9]:
Image('summary CA by race.jpeg')
Out[9]:

The race group situation in California is also basically the same as that of the year 2013.

Employment rate change in California

We also made graphs to show the employment rate trends in California over the past 10 years. The graphs are explained in detail in the Data Analysis part.

General trend in the last 10 years

The graph below was made to show the employment rate changes from 2004 to 2013 in California. The black line is for Men, while the red line is for Women.

In [10]:
%%R
print(getwd())
setwd('../visualizations')
[1] "/home/oski/Team_Four.0/visualizations"

In [11]:
Image('general trend.jpeg')
Out[11]:

Trend by age group

Each age group has a different color (see legend). Solid lines are for men, while dotted lines are for women.

In [12]:
Image('trend by age group.jpeg')
Out[12]:

Trend by race group

Each race group has a different color (see legend). Solid lines represent men, while dotted lines represent women.

In [13]:
Image('trend by race group.jpeg')
Out[13]:

In the trend by age group, the "16-19 years" age group is the only one in which women have a higher employment rate than men. This is possibly because more young men than young women are attending high school or college at that age instead of working.

Also, in 2008, the employment rates of most groups decrease because of the financial crisis. However, the "65 years old and above" age group shows an increasing trend.

In all four race groups, men have a higher employment rate than women in the same race group. The employment rate difference is largest for Latinos, possibly because many Latino women are busy raising their children. The second largest difference is found for Whites, while the employment rate difference for Black people is the smallest.

Data Analysis

Part A T-Test

In [14]:
%%R
# Load the cleaned 2013 data and split it by gender
gen2013=read.csv('../data1/cleaned/gen2013.csv',header=T)
gen2013m=subset(gen2013, group=="Men")
gen2013w=subset(gen2013, group=="Women")
# Suffix columns 5 and 6 (the rate columns) so they stay distinct after the merge
names(gen2013m)[5:6]=paste(names(gen2013m)[5:6],'_m',sep='')
names(gen2013w)[5:6]=paste(names(gen2013w)[5:6],'_w',sep='')
# Pair men and women by state and compute the men-minus-women differences
gen2013mw=merge(gen2013m,gen2013w, by='state')
gen2013mw['emp_diff']=gen2013mw['emp_rate_m']-gen2013mw['emp_rate_w']
gen2013mw['unemp_diff']=gen2013mw['unemp_rate_m']-gen2013mw['unemp_rate_w']
gen2013mw=subset(gen2013mw, select=c(state,emp_rate_m,unemp_rate_m,emp_rate_w,unemp_rate_w,emp_diff, unemp_diff))
# One-sample t-tests of each mean difference against 0
print(t.test(gen2013mw$emp_diff, mu=0))
print(t.test(gen2013mw$unemp_diff, mu=0))

	One Sample t-test

data:  gen2013mw$emp_diff
t = 28.4945, df = 50, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  9.65233 11.11630
sample estimates:
mean of x 
 10.38431 


	One Sample t-test

data:  gen2013mw$unemp_diff
t = 5.1914, df = 50, p-value = 3.826e-06
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.3738681 0.8457398
sample estimates:
mean of x 
0.6098039 


The null hypothesis for this question is that "the employment rate difference between men and women is zero", which would mean there is no gender discrimination.

We ran a one-sample t-test comparing the employment rate difference (men minus women) with 0. Using the 2013 data, the mean difference is 10.38 percentage points, which is significantly different from 0 at the 0.05 significance level. The 95% confidence interval is 9.65 to 11.12 percentage points, and the p-value is almost zero. Hence we can safely reject the null hypothesis and say that there is an employment rate difference between men and women, with the employment rate of men much higher than that of women.

We ran a similar one-sample t-test to compare the unemployment rate difference with 0. The result also indicates a significant difference. The mean difference is 0.61 percentage points, with a 95% confidence interval of 0.37 to 0.85 percentage points, and the p-value is also small (3.8e-06). We can conclude that the unemployment rate of men is also higher than that of women at the 0.05 significance level.

In short, the t-tests support our finding that the differences between men and women in both the employment rate and the unemployment rate are statistically significant.
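The arithmetic behind the one-sample t-test can be sketched in plain Python as a cross-check of the R output above. The function names are hypothetical, and the critical value 2.0086 is the standard two-sided 95% t-table value for 50 degrees of freedom, hard-coded here because the standard library has no t distribution.

```python
import math
import statistics

def one_sample_t(values, mu=0.0):
    """Return the mean, standard error, t statistic and degrees of freedom."""
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)  # sample sd divided by sqrt(n)
    return {"mean": mean, "se": se, "t": (mean - mu) / se, "df": n - 1}

def confidence_interval(mean, se, t_crit):
    """Two-sided interval: mean plus or minus t_crit standard errors."""
    return (mean - t_crit * se, mean + t_crit * se)

# Small made-up sample, only to exercise the function.
toy = one_sample_t([1, 2, 3, 4, 5])

# Cross-check of the employment rate output above: recover the standard
# error from the reported mean and t, then rebuild the 95% interval.
reported_mean, reported_t = 10.38431, 28.4945
se = reported_mean / reported_t
T_CRIT_DF50 = 2.0086  # two-sided 95% critical value for df = 50 (t table)
low, high = confidence_interval(reported_mean, se, T_CRIT_DF50)
print(round(low, 2), round(high, 2))  # prints 9.65 11.12
```

The rebuilt interval matches the one printed by `t.test`, which confirms that the reported mean, t statistic and confidence interval are mutually consistent.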

Part B Analysis on the discrimination over genders

General: (We use boxplot & ggplot to visualize)

In [15]:
%%R
print(getwd())
setwd('../visualizations')
[1] "/home/oski/Team_Four.0/visualizations"

In [20]:
Image("Employment Rate Difference in 2013.jpeg")
Out[20]:
In [21]:
Image("emp_diff_wm.jpeg")
Out[21]:

As we can see from the boxplot, there is a big difference between the employment rates of men and women. The employment rate of men is generally about 10 percentage points higher than that of women, in line with the mean difference of 10.38 from the t-test above.

As we can see from the map, the northeastern part of the US has the least severe gender discrimination, and the situation in the middle west part is the most severe. Generally, the less developed a region is, the more severe the gender discrimination appears to be.
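Ranking the states by their employment rate difference can be sketched as follows. This is an illustration only: the function names are hypothetical and the state names and rates are made-up placeholders, not values from the 2013 data.

```python
def paired_gaps(men_rates, women_rates):
    """Men-minus-women difference for every state present in both inputs."""
    return {state: men_rates[state] - women_rates[state]
            for state in men_rates if state in women_rates}

def rank_by_gap(gaps):
    """Return (state, gap) pairs sorted from the smallest to the largest gap."""
    return sorted(gaps.items(), key=lambda item: item[1])

# Placeholder rates in percent (NOT the project data).
men = {"State A": 70.0, "State B": 60.0, "State C": 65.0, "State D": 68.0}
women = {"State A": 58.0, "State B": 54.0, "State C": 55.0}

ranked = rank_by_gap(paired_gaps(men, women))
least, most = ranked[0], ranked[-1]
print("smallest gap:", least)
print("largest gap:", most)
```

States missing either rate (here "State D") are left out, which mirrors what merging the men and women tables by state does.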

Age Group:

In [22]:
Image("Employment Rate by Age Groups.jpeg")
Out[22]:
In [4]:
Image("Employment Rate Difference from 16 to 19 years.jpeg")
Out[4]:
In [5]:
Image("Employment Rate Difference from 20 to 24 years.jpeg")
Out[5]:
In [6]:
Image("Employment Rate Difference from 25 to 34 years.jpeg")
Out[6]:
In [7]:
Image("Employment Rate Difference from 35 to 44 years.jpeg")
Out[7]:
In [8]:
Image("Employment Rate Difference from 45 to 54 years.jpeg")
Out[8]:
In [27]:
Image("Employment Rate Difference from 55 to 64 years.jpeg")
Out[27]:
In [28]:
Image("Employment Rate Difference from 65 years and over.jpeg")
Out[28]:

As shown in the boxplot above, for both men and women, the employment rate increases as age increases until the 35 to 44 years age group where it reaches the highest value, and then it decreases as age gets larger. The trend is not linear, and it is more like a parabola opening down.
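The rise-then-fall pattern can be checked directly against the group means printed in the 2013 age summary above. The sketch below copies those means by hand; the helper names are hypothetical.

```python
AGE_GROUPS = ["16 to 19", "20 to 24", "25 to 34", "35 to 44",
              "45 to 54", "55 to 64", "65 and over"]
# Mean employment rates (%) copied from the 2013 age group summary above.
MEN = [28.79, 66.16, 83.08, 85.70, 80.80, 66.77, 23.08]
WOMEN = [31.10, 62.46, 69.43, 71.74, 71.50, 57.91, 14.90]

def peak_group(means):
    """Age group with the highest mean employment rate."""
    return AGE_GROUPS[means.index(max(means))]

def rises_then_falls(means):
    """True if the means strictly increase up to the peak and strictly
    decrease after it."""
    peak = means.index(max(means))
    rising = all(a < b for a, b in zip(means[:peak], means[1:peak + 1]))
    falling = all(a > b for a, b in zip(means[peak:], means[peak + 1:]))
    return rising and falling

# Groups where the mean for women exceeds the mean for men.
women_higher = [group for group, m, w in zip(AGE_GROUPS, MEN, WOMEN) if w > m]

print(peak_group(MEN), peak_group(WOMEN))
print(rises_then_falls(MEN), rises_then_falls(WOMEN))
print(women_higher)
```

With these means, both series peak in the 35 to 44 years group, and 16 to 19 years is the only group where the mean for women is higher.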

These maps show the difference in employment rates between women and men in different age groups. Pink means that the employment rate of women is higher than that of men. White means that the employment rates of men and women are equal. Purple means that the employment rate of men is higher than that of women.

As we can see from the maps, more and more areas are colored purple, and the purple becomes darker, as age increases up to the 35 to 44 years group, which supports the conclusion we drew from the boxplot. Also, the general situation follows that of the gender groups: the less developed a region is, the more severe the difference is. The 16 to 19 years age group is the exception.

Race Group:

In [29]:
Image("Employment Rate by Race Groups.jpeg")
Out[29]:
In [30]:
Image("Employment Rate Difference for Black or African American.jpeg")
Out[30]:
In [32]:
Image("Employment Rate Difference for Hispanic or Latino ethnicity.jpeg")
Out[32]:
In [33]:
Image("Employment Rate Difference for White.jpeg")
Out[33]:

From the boxplot, we can see that Hispanic or Latino men have the highest employment rate, followed by White men. Among women, Black or African American women have the lowest employment rate, followed by Hispanic or Latino women, and White women have the highest.
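The gap within each race group can also be computed from the group means printed in the 2013 race summary above. This is a rough sketch: it takes the difference of group means, which is not the same as the state-by-state differences drawn on the maps, because some groups have data for different numbers of states.

```python
# (men, women) mean employment rates (%) copied from the 2013 race summary.
MEANS = {
    "Black or African American": (55.49, 52.07),
    "Hispanic or Latino ethnicity": (71.69, 53.61),
    "White": (66.48, 55.09),
}

# Men-minus-women gap in percentage points for each race group.
gaps = {race: round(men - women, 2) for race, (men, women) in MEANS.items()}

# Race groups ordered from the largest gap to the smallest.
ordered = sorted(gaps, key=gaps.get, reverse=True)

for race in ordered:
    print(race, gaps[race])
```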

The maps show the employment rate difference in each state. States with larger differences are colored darker purple, and states with smaller differences are colored lighter purple or white. States with a negative employment rate difference, where women have a higher employment rate than men, are colored pink. The gray areas are states with missing data.
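The color coding described above can be written as a small function. This is a sketch under assumptions: the maps use a continuous purple scale, whereas the cutoff `dark_threshold` between lighter and darker purple is a hypothetical simplification.

```python
def map_color(diff, dark_threshold=10.0):
    """Color for a state's men-minus-women employment rate difference.

    diff is in percentage points; None stands for missing data.
    dark_threshold is an assumed cutoff, not a value from the maps.
    """
    if diff is None:
        return "gray"          # state with missing data
    if diff < 0:
        return "pink"          # women have the higher employment rate
    if diff == 0:
        return "white"         # equal employment rates
    if diff >= dark_threshold:
        return "dark purple"   # men higher, large difference
    return "light purple"      # men higher, small difference

for value in (None, -2.3, 0, 3.4, 18.1):
    print(value, map_color(value))
```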

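From the maps, we read the employment rate difference as largest for Black or African American people and smallest for Latinos. There is also gender discrimination for White people, but it is not as large as for Black people. This reading should be treated with caution, because the group means in the 2013 race summary above suggest the opposite ordering: a gap of about 18 percentage points for Hispanic or Latino (71.69% vs 53.61%), about 11 for White (66.48% vs 55.09%), and about 3 for Black or African American (55.49% vs 52.07%).

Each race group has its own situation. The employment rate difference for Black or African American people is greater in the North than in the South, regardless of how developed the states are. This is probably because of the history between the Southern states and African Americans. At the same time, Latino women in the eastern US are hired more than Latino men.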

Part C using data in 2004-2013, focusing on California to see the trend

General trend in the last 10 years

In [34]:
Image('general trend.jpeg')
Out[34]:

The graph above shows the employment rate changes from 2004 to 2013 in California. The black line is for men, while the red line is for women. The general patterns for men and women are basically the same. The employment rate is relatively stable from 2004 to 2008. It then decreases rapidly until 2010, probably because of the financial crisis that started in 2008, and finally starts to increase slowly after 2010. Throughout, the employment rate of men is about 15 percentage points higher than that of women.

Trend by age group

In [10]:
Image('trend by age group.jpeg')
Out[10]:

The "16-19 years" age group is the only one in which women have a higher employment rate than men. This is possibly because more young men than young women are attending high school or college at that age instead of working.

Also, in 2008, the employment rates of most groups decrease because of the financial crisis. However, the "65 years old and above" age group shows an increasing trend.

Trend by race group

In [9]:
Image('trend by race group.jpeg')
Out[9]:

In all four race groups, men have a higher employment rate than women in the same race group.

The employment rate difference is largest for Latinos, possibly because many Latino women are busy raising their children. The second largest difference is found for Whites, while the employment rate difference for Black people is the smallest.

Part D Linear Regression to make the prediction

In [37]:
%%R
print(getwd())
setwd('../visualizations')
[1] "/home/oski/Team_Four.0/visualizations"

Using the employment rate difference in California over the past 10 years, we fit a linear regression to see how it changes over time, and then predicted the employment rate difference in 2014.
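A least-squares fit and one-year-ahead prediction of this kind can be sketched in plain Python. The function names are hypothetical, and the yearly differences below are placeholders lying on an exact line; they are not the California data, so the forecast printed here is not the project's result.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Value of the fitted line at x."""
    return intercept + slope * x

years = list(range(2004, 2014))
# Placeholder differences on an exact line (NOT the California data).
diffs = [15.0 - 0.25 * (year - 2004) for year in years]

slope, intercept = fit_line(years, diffs)
forecast_2014 = predict(slope, intercept, 2014)
print(round(slope, 3), round(forecast_2014, 2))
```

A negative slope corresponds to a shrinking difference and a positive slope to a growing one; extrapolating a single year ahead assumes the linear trend continues.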

In [38]:
Image('prediction emp.jpeg')
Out[38]:

The regression line in this graph slopes downward, showing that the employment rate difference is getting smaller. The prediction, which is also included in the graph, suggests that the employment rate difference in 2014 will be about 10.85%.

In [39]:
Image('prediction unemp.jpeg')
Out[39]:

The regression line in this graph slopes upward, suggesting that the unemployment rate difference is getting larger. We are not sure whether the change is linear. If the change does follow a linear trend, the graph suggests that the unemployment rate difference in 2014 will be about 0.98%.