Date

Part 2: Data Cleaning

Team members responsible for this notebook:

Yuhan Wang: cleaned and saved the data, built data frames for each group, made boxplots to summarize the data, made trend graphs for California, and wrote the explanations.

Minghong Zheng, Yiwen Wei, Yijia Mao: proofread the code.

In [1]:
%load_ext rmagic

Import "Image" to display the graphs, and set the working directory to ../visualizations so that our graphs are saved there.

In [2]:
from IPython.core.display import Image
In [3]:
%%R
print(getwd())
setwd('../visualizations')
[1] "/home/oski/Team_Four.0/notebooks"

See where our raw data files are.

In [4]:
%%bash
cd ../data1/raw
ls
2004_to_be_cleaned.csv
2004.xls
2004.xlsx
2005_to_be_cleaned.csv
2005.xls
2005.xlsx
2006_to_be_cleaned.csv
2006.xls
2006.xlsx
2007_to_be_cleaned.csv
2007.xls
2007.xlsx
2008_to_be_cleaned.csv
2008.xls
2008.xlsx
2009_to_be_cleaned.csv
2009.xls
2009.xlsx
2010_to_be_cleaned.csv
2010.xls
2010.xlsx
2011_to_be_cleaned.csv
2011.xls
2011.xlsx
2012_to_be_cleaned.csv
2012.xls
2012.xlsx
2013_to_be_cleaned.csv
2013.xls
2013.xlsx

Data Cleaning Part A, using data in 2013 only, for all states.

Read the csv file that contains our raw data.

Find the variables of class "factor", and convert them to "character".

Delete unnecessary variables and any observations with NA values.

Check the data frame we get.

In [5]:
%%R
A13=read.csv('../data1/raw/2013_to_be_cleaned.csv',header=T)
i=sapply(A13, is.factor)
A13[i]=lapply(A13[i],as.character)
A13=subset(A13, select=c(grp_code, state, group, emp_rate, unemp_rate))
A13=A13[complete.cases(A13),]
print(head(A13))
  grp_code   state        group emp_rate unemp_rate
1        1 Alabama        Total     54.0        6.9
2        2 Alabama          Men     59.2        7.0
3        3 Alabama        Women     49.3        6.8
4        4 Alabama        White     56.8        5.3
5        5 Alabama   White, men     63.9        5.4
6        6 Alabama White, women     50.1        5.1
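
As a hedged alternative to the factor conversion in the cell above, `read.csv` can be asked to keep text columns as character from the start with `stringsAsFactors=FALSE`. The sketch below uses a small made-up CSV (`toy_csv`, with a hypothetical `extra` column and a hypothetical row with missing rates) in place of the real raw file, so the values are for illustration only.

```r
# Toy stand-in for the raw file; all rows are illustrative, not real data.
toy_csv <- "grp_code,state,group,extra,emp_rate,unemp_rate
1,Alabama,Total,10,54.0,6.9
2,Alabama,Men,20,59.2,7.0
16,Alabama,Example group,30,NA,NA"

# Keep text columns as character while reading, so no factor conversion is needed.
toy <- read.csv(text = toy_csv, header = TRUE, stringsAsFactors = FALSE)

# Keep only the variables of interest, then drop rows containing any NA.
keep <- c("grp_code", "state", "group", "emp_rate", "unemp_rate")
toy <- toy[, keep]
toy <- toy[complete.cases(toy), ]

print(sapply(toy, class))
print(nrow(toy))
```

The row with missing rates is dropped by `complete.cases`, and the `extra` column is gone after the column selection.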

Check the class of each variable.

In [6]:
%%R
type=data.frame(lapply(A13,class))
print(type)
  grp_code     state     group emp_rate unemp_rate
1  integer character character  numeric    numeric

We have 3 questions to answer: gender discrimination in general, gender discrimination in different age groups, and gender discrimination in different races. We therefore create 3 data frames, one for each question.

In [7]:
%%R
A13gen=subset(A13, grp_code==2 | grp_code==3) #general
A13age=subset(A13, grp_code>=26 & grp_code<=39) #age
A13race=subset(A13,grp_code==5|grp_code==6|grp_code==8|grp_code==9
               |grp_code==11|grp_code==12|grp_code==14|grp_code==15) #race
print(head(A13race))
print(class(A13age))
   grp_code   state                             group emp_rate unemp_rate
5         5 Alabama                        White, men     63.9        5.4
6         6 Alabama                      White, women     50.1        5.1
8         8 Alabama    Black or African American, men     43.5       14.0
9         9 Alabama  Black or African American, women     47.4       10.7
11       14 Alabama Hispanic or Latino ethnicity, men     79.3        2.3
37        5  Alaska                        White, men     67.9        6.5
[1] "data.frame"
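
The long chain of `==` comparisons used for the race subset can also be written with `%in%`. This is a minimal sketch on a made-up data frame (`toy`, with hypothetical rates); only the `grp_code` values matter here.

```r
# Hypothetical toy rows: one row per group code 1..16.
toy <- data.frame(grp_code = 1:16, emp_rate = seq(50, 65, by = 1))

# Group codes used for the race question in the notebook.
race_codes <- c(5, 6, 8, 9, 11, 12, 14, 15)

# Membership test with %in% ...
toy_race <- subset(toy, grp_code %in% race_codes)

# ... selects the same rows as the chained comparisons.
toy_race_long <- subset(toy, grp_code==5|grp_code==6|grp_code==8|grp_code==9
                        |grp_code==11|grp_code==12|grp_code==14|grp_code==15)

print(toy_race$grp_code)
print(identical(toy_race, toy_race_long))
```

Keeping the codes in one vector also makes it easier to reuse the same selection for other years.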

Write the cleaned data to csv files and save them in the proper directory.

In [8]:
%%R
write.csv(A13gen, '../data1/cleaned/gen2013.csv')
write.csv(A13age, '../data1/cleaned/age2013.csv')
write.csv(A13race, '../data1/cleaned/race2013.csv')
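
By default `write.csv` also writes the row names, which come back as an extra column named `X` when the file is read again. The sketch below, which uses temporary files and a two-row toy data frame rather than our real data, shows how `row.names=FALSE` avoids that column.

```r
# Two-row toy data frame; row names mimic those left over after subsetting.
toy <- data.frame(grp_code = c(2L, 3L),
                  group = c("Men", "Women"),
                  emp_rate = c(59.2, 49.3))
row.names(toy) <- c(2L, 3L)

# Temporary files stand in for the files in ../data1/cleaned.
with_names <- tempfile(fileext = ".csv")
without_names <- tempfile(fileext = ".csv")

write.csv(toy, with_names)                        # default: row names are written
write.csv(toy, without_names, row.names = FALSE)  # row names are omitted

print(names(read.csv(with_names)))     # includes an extra "X" column
print(names(read.csv(without_names)))  # only the original columns
```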

Data Visualization for 2013

Now we summarize the relevant variables and make a boxplot to visualize each of them:

  1. general employment rate by gender

  2. general unemployment rate by gender

  3. employment rate in different age groups by gender

  4. unemployment rate in different age groups by gender

  5. employment rate in race groups by gender

  6. unemployment rate in race groups by gender

In [11]:
%%R
install.packages('psych')
library(psych)
Installing package into ‘/home/oski/R/i686-pc-linux-gnu-library/3.0’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
trying URL 'http://cran.cnr.Berkeley.edu/src/contrib/psych_1.4.4.tar.gz'
Content type 'application/x-gzip' length 2252419 bytes (2.1 Mb)
opened URL
==================================================
downloaded 2.1 Mb


The downloaded source packages are in
	‘/tmp/RtmpCkMcXn/downloaded_packages’

Summary of data by group. We can see the number of observations, mean, standard deviation, median, min, max, etc.

In [12]:
%%R
print(describeBy(A13gen$emp_rate, A13gen$group))
group: Men
  vars  n  mean   sd median trimmed mad min  max range skew kurtosis   se
1    1 51 65.04 4.63   64.2   64.86 4.6  55 75.5  20.5 0.32    -0.26 0.65
------------------------------------------------------------ 
group: Women
  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis  se
1    1 51 54.65 4.98   54.1   54.47 5.04 45.9 66.2  20.3 0.34    -0.82 0.7
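
A few of the statistics in this table can also be computed with base R alone. The sketch below is not the psych implementation; it only reproduces n, mean, sd and median per group on a small toy data frame whose values are for illustration.

```r
# Toy data: three illustrative rates per group.
toy <- data.frame(
  group = c("Men", "Men", "Men", "Women", "Women", "Women"),
  emp_rate = c(59.2, 63.9, 67.9, 49.3, 50.1, 47.4)
)

# Split the rates by group, then compute the same statistics for each piece.
groups <- split(toy$emp_rate, toy$group)
stats <- t(sapply(groups, function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x), median = median(x))
}))

print(round(stats, 2))
```

The result is a matrix with one row per group, which is convenient for comparing groups side by side.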

Make a boxplot to visualize the data distribution, and save it.

In [13]:
%%R
jpeg('summary general.jpeg', width=600, height=600)
par(mfrow=c(2,1))
boxplot(A13gen$emp_rate~A13gen$group, ylab="employment rate", 
        main="Boxplot-general employment rate")
boxplot(A13gen$unemp_rate~A13gen$group, ylab="unemployment rate", 
        main="Boxplot-general unemployment rate")
dev.off()
In [14]:
Image('summary general.jpeg')
Out[14]:

As the boxplot shows, the employment rate of men is generally higher than that of women.

In [15]:
%%R
print(describeBy(A13age$emp_rate, A13age$group))
group: Men, 16 to 19 years
  vars  n  mean   sd median trimmed  mad  min max range skew kurtosis   se
1    1 49 28.79 8.57   26.8   28.09 7.56 11.8  52  40.2 0.78      0.2 1.22
------------------------------------------------------------ 
group: Men, 20 to 24 years
  vars  n  mean   sd median trimmed  mad  min  max range  skew kurtosis   se
1    1 51 66.16 6.67   65.2   66.13 6.08 46.7 79.9  33.2 -0.08     0.22 0.93
------------------------------------------------------------ 
group: Men, 25 to 34 years
  vars  n  mean   sd median trimmed  mad  min max range skew kurtosis   se
1    1 51 83.08 3.83   83.1   83.01 3.26 73.4  91  17.6 0.04     -0.1 0.54
------------------------------------------------------------ 
group: Men, 35 to 44 years
  vars  n mean   sd median trimmed  mad min  max range skew kurtosis   se
1    1 51 85.7 3.37   85.6   85.83 3.26  78 91.6  13.6 -0.3    -0.39 0.47
------------------------------------------------------------ 
group: Men, 45 to 54 years
  vars  n mean   sd median trimmed  mad  min  max range  skew kurtosis   se
1    1 51 80.8 4.85   80.5   80.87 6.23 70.6 89.9  19.3 -0.08    -0.91 0.68
------------------------------------------------------------ 
group: Men, 55 to 64 years
  vars  n  mean   sd median trimmed  mad  min  max range  skew kurtosis   se
1    1 51 66.77 6.37   67.1   66.87 5.93 51.8 81.3  29.5 -0.12    -0.54 0.89
------------------------------------------------------------ 
group: Men, 65 years and over
  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 50 23.08 3.96  22.25   22.93 4.52 14.6 33.1  18.5 0.34     -0.1 0.56
------------------------------------------------------------ 
group: Women, 16 to 19 years
  vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 46 31.1 9.18   27.9   30.69 9.12 16.7 48.7    32 0.45    -1.06 1.35
------------------------------------------------------------ 
group: Women, 20 to 24 years
  vars  n  mean  sd median trimmed  mad min  max range skew kurtosis   se
1    1 51 62.46 7.2   61.7   62.21 7.26  47 76.8  29.8 0.23    -0.81 1.01
------------------------------------------------------------ 
group: Women, 25 to 34 years
  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 51 69.43 5.49   68.8   69.33 5.19 54.4 81.7  27.3 0.06    -0.01 0.77
------------------------------------------------------------ 
group: Women, 35 to 44 years
  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis  se
1    1 51 71.74 5.02   70.7   71.57 5.04 62.6 81.4  18.8 0.26    -0.96 0.7
------------------------------------------------------------ 
group: Women, 45 to 54 years
  vars  n mean  sd median trimmed  mad  min  max range skew kurtosis   se
1    1 51 71.5 5.3   70.8   71.28 5.04 60.5 83.5    23  0.3    -0.65 0.74
------------------------------------------------------------ 
group: Women, 55 to 64 years
  vars  n  mean  sd median trimmed  mad  min  max range skew kurtosis   se
1    1 51 57.91 7.1     58   57.81 7.71 43.3 73.1  29.8 0.07    -0.77 0.99
------------------------------------------------------------ 
group: Women, 65 years and over
  vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 50 14.9 2.89   14.4   14.63 3.04 10.2 21.6  11.4 0.64    -0.39 0.41

In [16]:
%%R
jpeg('summary by age.jpeg', width=600, height=1000)
par(mfrow=c(2,1))
par(mar=c(14.1,4.1,4.1,2.1))

boxplot(A13age$emp_rate~A13age$group, las=2, ylab="employment rate", 
        main="Boxplot-employment rate by age group")
boxplot(A13age$unemp_rate~A13age$group, las=2, ylab="unemployment rate", 
        main="Boxplot-unemployment rate by age group")
dev.off()
In [17]:
Image('summary by age.jpeg')
Out[17]:

As shown in the boxplot above, the employment rate rises with age until it peaks in the 35 to 44 years age group, and then falls in the older age groups.

In [18]:
%%R
print(describeBy(A13race$emp_rate, A13race$group))
group: Black or African American, men
  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 38 55.49 6.09   55.1   55.38 6.82 43.5 72.2  28.7 0.29    -0.19 0.99
------------------------------------------------------------ 
group: Black or African American, women
  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 38 52.07 4.85   52.4   51.97 5.11 43.6 62.7  19.1 0.12    -0.87 0.79
------------------------------------------------------------ 
group: Hispanic or Latino ethnicity, men
  vars  n  mean   sd median trimmed  mad  min  max range  skew kurtosis   se
1    1 46 71.69 7.27  72.65   71.99 8.38 55.4 82.4    27 -0.37    -1.01 1.07
------------------------------------------------------------ 
group: Hispanic or Latino ethnicity, women
  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 42 53.61 6.11  53.15   52.96 5.78 45.7 72.2  26.5 0.99     0.58 0.94
------------------------------------------------------------ 
group: White, men
  vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
1    1 51 66.48 4.89   65.3   66.13 4.74 54.6 81.6    27 0.64     0.88 0.68
------------------------------------------------------------ 
group: White, women
  vars  n  mean   sd median trimmed  mad  min max range skew kurtosis   se
1    1 51 55.09 5.94   54.2   54.59 5.78 45.7  75  29.3 0.91      0.8 0.83

In [19]:
%%R
jpeg('summary by race.jpeg', width=600, height=1100)
par(mfrow=c(2,1))
par(mar=c(14.1,4.1,4.1,2.1))
boxplot(A13race$emp_rate~A13race$group, las=2, ylab="employment rate", 
        main="Boxplot-employment rate by race group")
boxplot(A13race$unemp_rate~A13race$group, las=2, ylab="unemployment rate", 
        main="Boxplot-unemployment rate by race group")
dev.off()
In [20]:
Image('summary by race.jpeg')
Out[20]:

Data Cleaning Part B, using data in 2004-2013, focusing on California to see the trend.

Read the csv file for each year into R, and combine them. Add a new variable "year" to distinguish the data for each year.

In [21]:
%%R
A=read.csv('../data1/raw/2004_to_be_cleaned.csv',header=T)
A['year']=2004
for (i in 2005:2013){
    B=read.csv(paste('../data1/raw/',i,'_to_be_cleaned.csv',sep=''), header=T)
    B['year']=i
    A=rbind(A,B)
}
i=sapply(A, is.factor)
A[i]=lapply(A[i],as.character)
A['unemp_rate']=lapply(A['unemp_rate'],as.numeric)
print(head(A))
  X grp_code   state        group  pop ttl_labor per_labor ttl_emp emp_rate
1 1        1 Alabama        Total 3484      2179      62.5    2053     58.9
2 2        2 Alabama          Men 1651      1156      70.0    1096     66.3
3 3        3 Alabama        Women 1833      1023      55.8     957     52.2
4 4        4 Alabama        White 2518      1595      63.4    1535     61.0
5 5        5 Alabama   White, men 1220       883      72.4     850     69.7
6 6        6 Alabama White, women 1298       713      54.9     685     52.8
  ttl_unemp unemp_rate year
1       126        5.8 2004
2        61        5.3 2004
3        66        6.4 2004
4        61        3.8 2004
5        33        3.7 2004
6        28        3.9 2004
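
The read-and-combine loop can also be expressed with `lapply` and a single `rbind`. The sketch below runs on made-up files in a temporary directory; `raw_dir`, `read_year` and the toy values are assumptions for illustration and do not touch the real raw data.

```r
# Create toy yearly files in a temporary directory (stand-in for ../data1/raw).
raw_dir <- tempfile("raw")
dir.create(raw_dir)
years <- 2004:2006
for (y in years) {
  toy <- data.frame(grp_code = c(2L, 3L),
                    group = c("Men", "Women"),
                    emp_rate = c(60, 50))
  write.csv(toy, file.path(raw_dir, paste0(y, "_to_be_cleaned.csv")),
            row.names = FALSE)
}

# Hypothetical helper: read one year's file and tag its rows with that year.
read_year <- function(y) {
  d <- read.csv(file.path(raw_dir, paste0(y, "_to_be_cleaned.csv")),
                header = TRUE, stringsAsFactors = FALSE)
  d$year <- y
  d
}

# Read every year, then stack all data frames in one call.
combined <- do.call(rbind, lapply(years, read_year))
print(table(combined$year))
```

Stacking once at the end avoids growing the data frame inside a loop, which matters more as the number of files increases.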

In [22]:
%%R
type=data.frame(lapply(A,class))
print(type)
        X grp_code     state     group     pop ttl_labor per_labor ttl_emp
1 integer  integer character character integer   integer   numeric integer
  emp_rate ttl_unemp unemp_rate    year
1  numeric character    numeric numeric

Keep only the variables we need: year, grp_code, state, group, emp_rate and unemp_rate. Then keep only the observations whose state is "California".

In [23]:
%%R
A=subset(A, select=c(year, grp_code, state, group, emp_rate, unemp_rate))
A=A[A$state=='California',]
print(tail(A))
      year grp_code      state                    group emp_rate unemp_rate
22797 2013       34 California    Women, 20 to 24 years     54.4       13.8
22798 2013       35 California    Women, 25 to 34 years     64.6        8.0
22799 2013       36 California    Women, 35 to 44 years     65.0        7.4
22800 2013       37 California    Women, 45 to 54 years     65.9        7.7
22801 2013       38 California    Women, 55 to 64 years     54.1        6.3
22802 2013       39 California Women, 65 years and over     14.3        5.0

Delete any observations with NA values.

In [24]:
%%R
A=A[complete.cases(A),]

As in Part A, we have 3 questions to answer: gender discrimination in general, gender discrimination in different age groups, and gender discrimination in different races. We therefore create 3 data frames, one for each question.

In [25]:
%%R
Agen=subset(A, grp_code==2 | grp_code==3) #general
Aage=subset(A, grp_code>=26 & grp_code<=39) #age
Arace=subset(A,grp_code==5|grp_code==6|grp_code==8|grp_code==9
             |grp_code==11|grp_code==12|grp_code==14|grp_code==15) #race

print(head(Agen))
print(class(Aage))
     year grp_code      state group emp_rate unemp_rate
189  2004        2 California   Men     69.2        6.3
190  2004        3 California Women     54.1        6.0
2608 2005        2 California   Men     70.4        5.2
2609 2005        3 California Women     54.0        5.4
5016 2006        2 California   Men     70.4        4.7
5017 2006        3 California Women     53.9        5.0
[1] "data.frame"

In [26]:
%%R
write.csv(Agen, '../data1/cleaned/gencal.csv')
write.csv(Aage, '../data1/cleaned/agecal.csv')
write.csv(Arace, '../data1/cleaned/racecal.csv')

Make boxplots to visualize the employment rate and unemployment rate in CA in the past 10 years.

In [27]:
%%R
jpeg('summary CA general.jpeg', width=600, height=600)
par(mfrow=c(2,1))
boxplot(Agen$emp_rate~Agen$group, ylab="employment rate", 
        main="general employment rate in CA, 2004-2013")
boxplot(Agen$unemp_rate~Agen$group, ylab="unemployment rate", 
        main="general unemployment rate in CA, 2004-2013")
dev.off()
In [28]:
Image('summary CA general.jpeg')
Out[28]:
In [29]:
%%R
jpeg('summary CA age.jpeg', width=600, height=1000)
par(mfrow=c(2,1))
par(mar=c(14.1,4.1,4.1,2.1))

boxplot(Aage$emp_rate~Aage$group, las=2, ylab="employment rate", 
        main="employment rate by age group in CA, 2004-2013")
boxplot(Aage$unemp_rate~Aage$group, las=2, ylab="unemployment rate", 
        main="unemployment rate by age group in CA, 2004-2013")
dev.off()
In [30]:
Image('summary CA age.jpeg')
Out[30]:
In [31]:
%%R
jpeg('summary CA by race.jpeg', width=600, height=1100)
par(mfrow=c(2,1))
par(mar=c(14.1,4.1,4.1,2.1))

boxplot(Arace$emp_rate~Arace$group, las=2, ylab="employment rate", 
        main="employment rate by race in CA, 2004-2013")
boxplot(Arace$unemp_rate~Arace$group, las=2, ylab="unemployment rate", 
        main="unemployment rate by race in CA, 2004-2013")
dev.off()
In [32]:
Image('summary CA by race.jpeg')
Out[32]:

Employment rate change trend in the past 10 years in California

We make a graph to see how the employment rate changed over the past 10 years (from 2004 to 2013) in California. This also allows us to visualize the difference in employment rate between men and women.

In [33]:
%%R
gencal=read.csv('../data1/cleaned/gencal.csv',header=T)
print(head(gencal))
     X year grp_code      state group emp_rate unemp_rate
1  189 2004        2 California   Men     69.2        6.3
2  190 2004        3 California Women     54.1        6.0
3 2608 2005        2 California   Men     70.4        5.2
4 2609 2005        3 California Women     54.0        5.4
5 5016 2006        2 California   Men     70.4        4.7
6 5017 2006        3 California Women     53.9        5.0

Create subsets for men and women. Make a plot with lines connecting the points.

In [34]:
%%R
gencalm=subset(gencal, group=='Men')
gencalw=subset(gencal, group=='Women')
print(head(gencalm))

jpeg('general trend.jpeg', width=600, height=400)
plot(gencalm$year,gencalm$emp_rate, 
     ylim=c(min(gencalw$emp_rate, gencalm$emp_rate),
            max(gencalw$emp_rate, gencalm$emp_rate)), 
     xlab="year", ylab="Employment Rate")
lines(gencalm$year,gencalm$emp_rate)
points(gencalw$year, gencalw$emp_rate, col="red")
lines(gencalw$year,gencalw$emp_rate, col="red")
legend('topright', c("Employment rate - Men","Employment rate - Women"), 
       col=c("black","red"), pch=1)
title(main="Employment Rate Change in the last 10 years in California")
dev.off()
       X year grp_code      state group emp_rate unemp_rate
1    189 2004        2 California   Men     69.2        6.3
3   2608 2005        2 California   Men     70.4        5.2
5   5016 2006        2 California   Men     70.4        4.7
7   7404 2007        2 California   Men     70.2        5.5
9   9792 2008        2 California   Men     68.7        7.4
11 12196 2009        2 California   Men     63.8       12.3

The graph below is made to show the employment rate changes from 2004 to 2013 in California. The black line is for Men, while the red line is for Women.

In [35]:
Image('general trend.jpeg')
Out[35]:
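
The gap between the two lines can also be computed directly. This sketch uses only the first three years of California values printed earlier (2004 to 2006); `gencal_toy` is a hand-typed stand-in for the full `gencal` data frame.

```r
# First three years of the California general data, typed in by hand.
gencal_toy <- data.frame(
  year = c(2004, 2004, 2005, 2005, 2006, 2006),
  group = c("Men", "Women", "Men", "Women", "Men", "Women"),
  emp_rate = c(69.2, 54.1, 70.4, 54.0, 70.4, 53.9)
)

# One data frame per gender, then join them on year.
men <- subset(gencal_toy, group == "Men", select = c(year, emp_rate))
women <- subset(gencal_toy, group == "Women", select = c(year, emp_rate))
gap <- merge(men, women, by = "year", suffixes = c("_men", "_women"))

# Difference in employment rate (men minus women), in percentage points.
gap$diff <- gap$emp_rate_men - gap$emp_rate_women
print(gap)
```

For these three years the difference is about 15 to 16.5 percentage points.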

Trend by age group in the last 10 years

Create a list called listage: "agecal26",...,"agecal39".

Assign a subset of dataframe "agecal" to each one in the list with matching group code.

Make a plot, and add points and lines to the plot. Each age group has a different color (see legend). Solid lines are for men, while dotted lines are for women.

In [36]:
%%R
agecal=read.csv('../data1/cleaned/agecal.csv',header=T)
print(head(agecal))
    X year grp_code      state               group emp_rate unemp_rate
1 213 2004       26 California Men, 16 to 19 years     30.2       22.4
2 214 2004       27 California Men, 20 to 24 years     67.7       11.3
3 215 2004       28 California Men, 25 to 34 years     85.9        5.8
4 216 2004       29 California Men, 35 to 44 years     88.2        4.5
5 217 2004       30 California Men, 45 to 54 years     83.1        4.5
6 218 2004       31 California Men, 55 to 64 years     65.9        4.5

In [37]:
%%R
jpeg('trend by age group.jpeg', width=600, height=600)

# One data frame per age/gender group (group codes 26 to 39), named agecal26 ... agecal39
listage=lapply(26:39, function(code) subset(agecal, grp_code==code))
names(listage)=paste('agecal',26:39, sep="")

par(mai=c(0.82,0.82,0.82,1.22),xpd=T)

plot(emp_rate ~ year, data=agecal, type="n", xlab="year", 
     ylab="employment rate")
title(main="Employment rate in CA by age group in the last 10 years")
colors=rep(c('red','yellow','orange','pink','green','blue','purple'),2)
linestyle=c(rep(1,7),rep(2,7))

for (i in 1:14){
    points(emp_rate ~ year, data=listage[[i]], 
           col=as.character(colors[i]))
    lines(emp_rate ~ year, data=listage[[i]], 
          col=as.character(colors[i]), lty=linestyle[i])
}

legend("topright", inset=c(-0.2,0),c("16-19","20-24",'25-34','35-44','45-54','55-64','65+'), 
       col=c('red','yellow','orange','pink','green','blue','purple'), pch=1)
dev.off()
In [38]:
Image('trend by age group.jpeg')
Out[38]:
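
Another way to get one data frame per group code is base R's `split`. The sketch below uses a small toy data frame: the two 2004 values for codes 26 and 27 come from the output above, and the remaining values are made up.

```r
# Toy data: three group codes over two years (most values hypothetical).
agecal_toy <- data.frame(
  year = rep(2004:2005, each = 3),
  grp_code = rep(c(26L, 27L, 33L), times = 2),
  emp_rate = c(30.2, 67.7, 31.0, 29.0, 66.0, 30.5)
)

# split() returns a named list with one data frame per distinct group code.
by_code <- split(agecal_toy, agecal_toy$grp_code)

print(names(by_code))
print(by_code[["26"]])
```

Because the list is named by group code, a group can be looked up by its code instead of by its position in the list.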

Some Interesting Findings from the Graph Above

In the "16-19 years" age group, women have a higher employment rate than men. This is the only age group in which women are employed at a higher rate. One possible explanation is that more men than women of that age attend high school or college instead of working.

Also, around 2008, the employment rates of most groups decrease, which coincides with the financial crisis. However, the "65 years and over" age group shows an increasing trend.

Trend graph by race group

In [39]:
%%R
racecal=read.csv('../data1/cleaned/racecal.csv',header=T)
print(head(racecal))
    X year grp_code      state                            group emp_rate
1 192 2004        5 California                       White, men     70.4
2 193 2004        6 California                     White, women     54.1
3 195 2004        8 California   Black or African American, men     57.3
4 196 2004        9 California Black or African American, women     53.8
5 198 2004       11 California                       Asian, men     67.6
6 199 2004       12 California                     Asian, women     54.5
  unemp_rate
1        6.0
2        5.8
3       11.6
4        9.6
5        5.3
6        4.9

Create a list called listrace: "racecal1",...,"racecal8".

Assign a subset of dataframe "racecal" to each one in the list with matching group code: 5 for White Men, 6 for White Women, 8 for Black Men, 9 for Black Women, 11 for Asian Men, 12 for Asian Women, 14 for Latino Men, and 15 for Latino Women.

Add points and lines to the plot. Each race group now has a different color (see legend). Solid lines represent men, while dotted lines represent women.

In [40]:
%%R
jpeg('trend by race group.jpeg', width=600, height=600)

# Group codes in plotting order: White, Black, Asian, Latino (men first, then women)
racecodes=c(5,6,8,9,11,12,14,15)

# One data frame per race/gender group, named racecal1 ... racecal8
listrace=lapply(racecodes, function(code) subset(racecal, grp_code==code))
names(listrace)=paste('racecal',1:8, sep="")

par(mai=c(0.82,0.82,0.82,1.22),xpd=T)

plot(emp_rate ~ year, data=racecal, type="n", xlab="year", 
     ylab="employment rate")
title(main="Employment rate in CA by race group in the last 10 years")
colors=c(rep('red',2),rep('yellow',2),rep('green',2), rep('blue',2))
linestyle=rep(c(1,2),4)

for (i in 1:8){
    points(emp_rate ~ year, data=listrace[[i]], 
           col=as.character(colors[i]))
    lines(emp_rate ~ year, data=listrace[[i]], 
          col=as.character(colors[i]), lty=linestyle[i])
}

legend("topright", inset=c(-0.2,0),c('white','black','asian','latino'), 
       col=c('red','yellow','green','blue'), pch=1)
dev.off()
In [41]:
Image('trend by race group.jpeg')
Out[41]:
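
The mapping from group code to race, gender, color and line type used for this plot can be kept in one lookup table. This is only a sketch: `race_styles` and `style_for` are hypothetical names, and the colors and line types simply mirror those chosen in the plotting cell.

```r
# Lookup table: one row per group code, in the order used for plotting.
race_styles <- data.frame(
  grp_code = c(5, 6, 8, 9, 11, 12, 14, 15),
  race = rep(c("white", "black", "asian", "latino"), each = 2),
  gender = rep(c("men", "women"), times = 4),
  color = rep(c("red", "yellow", "green", "blue"), each = 2),
  lty = rep(c(1, 2), times = 4),  # 1 = solid (men), 2 = dashed (women)
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the style row for a given group code.
style_for <- function(code) race_styles[race_styles$grp_code == code, ]

print(style_for(12))
```

With such a table, the same lookup can drive both the `col` and `lty` arguments and the legend labels, so they cannot drift apart.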

Some Interesting Findings from the Graph Above

In all four race groups, men have a higher employment rate than women in the same race group.

The employment rate difference is largest among Latinos. One possible explanation, which this dataset cannot confirm, is that many Latino women are raising children. The second largest difference is found among Whites, while the difference among Black people is the smallest.

Review the Cleaned Data and Visualizations

Go to the folder where we saved cleaned data, and list the files inside.

6 cleaned data frames are generated and stored in this folder: 3 for the 2013 analysis and 3 for the California trend analysis.

In [42]:
%%bash
cd ../data1/cleaned
ls
age2013.csv
agecal.csv
gen2013.csv
gencal.csv
race2013.csv
racecal.csv

Go to the folder where we saved visualizations, and list the images inside.

The 6 images with names starting with "summary" are boxplots made to visualize general features of the dataset, such as max, min, median, and outliers.

The 3 images with "trend" in their names show how employment rates changed over the past 10 years in California. All of them are discussed briefly above, and will be explored in detail in Notebook 4. The 3 "prediction" images in the listing are not generated in this notebook.

In [43]:
%%bash
cd ../visualizations
ls
general trend.jpeg
prediction emp.jpeg
prediction.jpeg
prediction unemp.jpeg
summary by age.jpeg
summary by race.jpeg
summary CA age.jpeg
summary CA by race.jpeg
summary CA general.jpeg
summary general.jpeg
trend by age group.jpeg
trend by race group.jpeg