Team Four.0 Final Project¶
Part 1 Data Gathering¶
Team members responsible for this notebook:
Minghong Zheng: collect raw data from website, write functions and scripts to transform raw data into data frames, write explanations.
Yuhan Wang, Yijia Mao, Yiwen Wei : proofread codes.
Data Resources:¶
First, the data of State Employment and Unemployment statistics from 2004 to 2013 is downloaded from the Bureau of Labor Statistics website. Their file formats are xls.
Here are the URLs of the data sources:
http://www.bls.gov/lau/table14full04.xlsx
http://www.bls.gov/lau/table14full05.xlsx
http://www.bls.gov/lau/table14full06.xlsx
http://www.bls.gov/lau/table14full07.xlsx
http://www.bls.gov/lau/table14full08.xlsx
http://www.bls.gov/lau/table14full09.xlsx
http://www.bls.gov/lau/table14full10.xlsx
http://www.bls.gov/lau/table14full11.xlsx
http://www.bls.gov/lau/table14full12.xlsx
http://www.bls.gov/lau/ptable14full2013.xlsx
%load_ext rmagic
Directories that we need to create:
- a directory
'script'
to run the process of data gathering - a directory
'data1'
to save our data - a subdirectory
'raw'
to store our raw data collected from the web - a subdirectory
'cleaned'
to store our cleaned data - a subdirectory
'simulated'
to store simulated data - a directory
'visulazations'
to store our plots
%%bash
mkdir ../script ../data1 ../data1/raw ../data1/cleaned ../data1/simulated ../visualizations
Check whether these directories are created.
%%bash
cd ..
ls -r
%%bash
cd ../data1
ls -r
Example:¶
Here's how we download the data in 2004 from the website; save them in subdirectory "raw"; load data in 2004.xls file into a data frame¶
- Download the data from the website
%%R
url="http://www.bls.gov/lau/table14full04.xlsx"
file='../data1/raw/2004.xlsx'
download.file(url,destfile=file,method='curl')
- Check whether the file is successfully downolad to our raw directory
%%bash
ls ../data1/raw
Load raw data into data Fram¶
- Download the pacakage "gdata" to read our excel files
%%bash
sudo apt-get install r-cran-gplots -y
ls -l ../data1/raw
%%R
library(gdata)
location="../data1/raw/2004.xlsx"
A=read.xls(location)# translate these data into dataFrame and save it in the variable A
- Number of rows, number of columns, variable names in the data frame
%%R
print(nrow(A))
print(ncol(A))
- Drop the unnecessary columns
%%R
A=A[ ,c(-1,-12:-14)]
print(names(A))
- Rename every variable: Group Code,State,Group,Population,Total number of labor,Percentage of labor, Totoal number of employment,Employment rate,Total number of unemloyment,Unemployment rate
%%R
names(A)=c("grp_code","state","group","pop","ttl_labor","per_labor","ttl_emp","emp_rate","ttl_unemp","unemp_rate")
print(names(A))
- Drop the unnecessary rows
%%R
A=A[-1:-5, ]
row.names(A)=1:nrow(A) #number each row
- Use the data frame we just created, we save the A data into a csv file into the raw data directory:
%%R
write.csv(A, '../data1/raw/2004_to_be_cleaned.csv')
- Check whether this file is successfully created
%%bash
cd ../data1/raw
ls
Package the gathering code into reusable functions¶
%%file ../script/dataframe-R
creat_dataframe=function(year){library(gdata)
a=year
if((year-2000)<13){
year=substr(as.character(year),3,4)
url=paste('http://www.bls.gov/lau/table14full',year,'.xlsx',sep='')
}
else{url="http://www.bls.gov/lau/ptable14full2013.xlsx"}
file=paste('../data1/raw/',a,'.xlsx',sep='')
download.file(url,destfile=file,method='curl')
filename=paste(a,'.xlsx',sep='')
A=read.xls(paste("../data1/raw/",filename,sep=''))
A=A[ ,c(-1,-12:-14)]
names(A)=c("grp_code","state","group","pop","ttl_labor","per_labor","ttl_emp","emp_rate","ttl_unemp","unemp_rate")
A=A[-1:-5, ]
row.names(A)=1:nrow(A)
return(A)}
- Now, our function dowloading and converting a raw plant XLS file into a R data frame can be used in other notebooks, using the source command:
%%R
source('../script/dataframe-R')# open the function script
A=creat_dataframe(2005)#create a dataframe for year 2005
print(head(A))
write.csv(A, '../data1/raw/2005_to_be_cleaned.csv')
Use the script file to load other xls files into dataframe and save them as csv files¶
%%R
source('../script/dataframe-R')
for(i in 2006:2013){
A=creat_dataframe(i)
write.csv(A, paste('../data1/raw/',i,'_to_be_cleaned.csv',sep=""))}
Check whether there cvs files are successfully created
%%bash
cd ../data1/raw
ls