Date

Team Four.0 Final Project

Part 1 Data Gathering

Team members responsible for this notebook:

Minghong Zheng: collect raw data from website, write functions and scripts to transform raw data into data frames, write explanations.

Yuhan Wang, Yijia Mao, Yiwen Wei : proofread codes.

Data Resources:

First, the data of State Employment and Unemployment statistics from 2004 to 2013 is downloaded from the Bureau of Labor Statistics website. Their file formats are xls.

Here are the URLs of the data sources:
http://www.bls.gov/lau/table14full04.xlsx
http://www.bls.gov/lau/table14full05.xlsx
http://www.bls.gov/lau/table14full06.xlsx
http://www.bls.gov/lau/table14full07.xlsx
http://www.bls.gov/lau/table14full08.xlsx
http://www.bls.gov/lau/table14full09.xlsx
http://www.bls.gov/lau/table14full10.xlsx
http://www.bls.gov/lau/table14full11.xlsx
http://www.bls.gov/lau/table14full12.xlsx
http://www.bls.gov/lau/ptable14full2013.xlsx

In [3]:
%load_ext rmagic
The rmagic extension is already loaded. To reload it, use:
  %reload_ext rmagic

Directories that we need to create:

  • a directory 'script' to run the process of data gathering
  • a directory 'data1' to save our data
  • a subdirectory 'raw' to store our raw data collected from the web
  • a subdirectory 'cleaned' to store our cleaned data
  • a subdirectory 'simulated' to store simulated data
  • a directory 'visulazations' to store our plots
In [4]:
%%bash
mkdir ../script ../data1 ../data1/raw ../data1/cleaned ../data1/simulated ../visualizations

Check whether these directories are created.

In [5]:
%%bash
cd ..
ls -r
website
visualizations
script
README.md
notebooks
LICENSE
data1

In [6]:
%%bash
cd ../data1
ls -r
simulated
raw
cleaned

Example:

Here's how we download the data in 2004 from the website; save them in subdirectory "raw"; load data in 2004.xls file into a data frame

  • Download the data from the website
In [7]:
%%R
url="http://www.bls.gov/lau/table14full04.xlsx"
file='../data1/raw/2004.xlsx'
download.file(url,destfile=file,method='curl')
  
  • Check whether the file is successfully downolad to our raw directory
In [8]:
%%bash
ls ../data1/raw
2004.xlsx

Load raw data into data Fram

  • Download the pacakage "gdata" to read our excel files
In [9]:
%%bash
sudo apt-get install r-cran-gplots -y
Reading package lists...
Building dependency tree...
Reading state information...
r-cran-gplots is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

In [10]:
ls -l ../data1/raw
total 224
-rw-rw-r-- 1 oski oski 228193 May  4 17:16 2004.xlsx

In [11]:
%%R
library(gdata)
location="../data1/raw/2004.xlsx"
A=read.xls(location)# translate these data into dataFrame and save it in the variable A
gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.

gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.

Attaching package: ‘gdata’

The following object is masked from ‘package:stats’:

    nobs

The following object is masked from ‘package:utils’:

    object.size


  • Number of rows, number of columns, variable names in the data frame
In [12]:
%%R
print(nrow(A))
print(ncol(A))
[1] 2430
[1] 14

  • Drop the unnecessary columns
In [13]:
%%R
A=A[ ,c(-1,-12:-14)]
print(names(A))
 [1] "X"   "X.1" "X.2" "X.3" "X.4" "X.5" "X.6" "X.7" "X.8" "X.9"

  • Rename every variable: Group Code,State,Group,Population,Total number of labor,Percentage of labor, Totoal number of employment,Employment rate,Total number of unemloyment,Unemployment rate
In [12]:
%%R
names(A)=c("grp_code","state","group","pop","ttl_labor","per_labor","ttl_emp","emp_rate","ttl_unemp","unemp_rate")
print(names(A))
 [1] "grp_code"   "state"      "group"      "pop"        "ttl_labor" 
 [6] "per_labor"  "ttl_emp"    "emp_rate"   "ttl_unemp"  "unemp_rate"

  • Drop the unnecessary rows
In [14]:
%%R
A=A[-1:-5, ]
row.names(A)=1:nrow(A) #number each row
  • Use the data frame we just created, we save the A data into a csv file into the raw data directory:
In [15]:
%%R
write.csv(A, '../data1/raw/2004_to_be_cleaned.csv')
  • Check whether this file is successfully created
In [16]:
%%bash
cd ../data1/raw
ls
2004_to_be_cleaned.csv
2004.xlsx

Package the gathering code into reusable functions

In [17]:
%%file ../script/dataframe-R
creat_dataframe=function(year){library(gdata)
a=year
if((year-2000)<13){
year=substr(as.character(year),3,4)
url=paste('http://www.bls.gov/lau/table14full',year,'.xlsx',sep='')
}
else{url="http://www.bls.gov/lau/ptable14full2013.xlsx"}

file=paste('../data1/raw/',a,'.xlsx',sep='')
download.file(url,destfile=file,method='curl')
filename=paste(a,'.xlsx',sep='')
A=read.xls(paste("../data1/raw/",filename,sep=''))
A=A[ ,c(-1,-12:-14)]
names(A)=c("grp_code","state","group","pop","ttl_labor","per_labor","ttl_emp","emp_rate","ttl_unemp","unemp_rate")
A=A[-1:-5, ] 
row.names(A)=1:nrow(A)
return(A)}
Writing ../script/dataframe-R

  • Now, our function dowloading and converting a raw plant XLS file into a R data frame can be used in other notebooks, using the source command:
In [18]:
%%R
source('../script/dataframe-R')# open the function script
A=creat_dataframe(2005)#create a dataframe for year 2005
print(head(A))
write.csv(A, '../data1/raw/2005_to_be_cleaned.csv')
  grp_code   state        group  pop ttl_labor          per_labor ttl_emp
1       01 Alabama        Total 3521      2179 61.899999999999999    2086
2       02 Alabama          Men 1672      1172 70.099999999999994    1127
3       03 Alabama        Women 1850      1007 54.500000000000000     960
4       04 Alabama        White 2554      1612 63.100000000000001    1563
5       05 Alabama   White, men 1243       902 72.599999999999994     878
6       06 Alabama White, women 1311       710 54.200000000000003     684
            emp_rate ttl_unemp        unemp_rate
1 59.200000000000003        92 4.200000000000000
2 67.400000000000006        45 3.800000000000000
3 51.899999999999999        48 4.700000000000000
4 61.200000000000003        50 3.100000000000000
5 70.700000000000003        24 2.600000000000000
6 52.200000000000003        26 3.700000000000000

Use the script file to load other xls files into dataframe and save them as csv files

In [19]:
%%R
source('../script/dataframe-R')
for(i in 2006:2013){
A=creat_dataframe(i)
write.csv(A, paste('../data1/raw/',i,'_to_be_cleaned.csv',sep=""))}
    

Check whether there cvs files are successfully created

In [20]:
%%bash
cd ../data1/raw
ls
2004_to_be_cleaned.csv
2004.xlsx
2005_to_be_cleaned.csv
2005.xlsx
2006_to_be_cleaned.csv
2006.xlsx
2007_to_be_cleaned.csv
2007.xlsx
2008_to_be_cleaned.csv
2008.xlsx
2009_to_be_cleaned.csv
2009.xlsx
2010_to_be_cleaned.csv
2010.xlsx
2011_to_be_cleaned.csv
2011.xlsx
2012_to_be_cleaned.csv
2012.xlsx
2013_to_be_cleaned.csv
2013.xlsx

In []: