Data Wrangling with Economics Data in R
In this post, I experiment with some data wrangling techniques, using the Penn World Table version 9.1 (PWT 9.1) and the Cross-National Time-Series (CNTS) data set to practice them.
Data Importation
library(tidyverse)  # readr for read_csv(), dplyr and tidyr for wrangling

pwt91 <- read_csv("pwt91.csv")
Rows: 12376 Columns: 52
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): countrycode, country, currency_unit, i_cig, i_xm, i_xr, i_outlier,...
dbl (44): year, rgdpe, rgdpo, pop, emp, avh, hc, ccon, cda, cgdpe, cgdpo, cn...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
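As the message notes, we can quiet it by specifying column types up front or by setting show_col_types = FALSE (available in readr 2.0 and later), for example:

pwt91 <- read_csv("pwt91.csv", show_col_types = FALSE)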
Wrangling
We first need to "mutate" some variables. The total employment numbers offer less context than the employment–population ratio, so we create a new variable by dividing total employment by population. Next, we "select" the variables that are of interest to us. Suppose we wish to keep the country code, country name, year, and the employment–population ratio:
pwt91 <- pwt91 %>%
  mutate(emp_pop_ratio = emp / pop) %>%
  select(countrycode, country, year, emp_pop_ratio)
pwt91
# A tibble: 12,376 × 4
countrycode country year emp_pop_ratio
<chr> <chr> <dbl> <dbl>
1 ABW Aruba 1950 NA
2 ABW Aruba 1951 NA
3 ABW Aruba 1952 NA
4 ABW Aruba 1953 NA
5 ABW Aruba 1954 NA
6 ABW Aruba 1955 NA
7 ABW Aruba 1956 NA
8 ABW Aruba 1957 NA
9 ABW Aruba 1958 NA
10 ABW Aruba 1959 NA
# ℹ 12,366 more rows
Let’s do some simple counting to get a good sense of our data:
# Number of countries
cat(length(unique(pwt91$country)))
182
# Number of years covered in this data set
cat(length(unique(pwt91$year)))
68
# Make sure that country and country-code are consistent
cat(length(unique(pwt91$countrycode)))
182
# First and last year
cat(min(pwt91$year))
1950
cat(max(pwt91$year))
2017
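The same checks can be collected in a single pass with summarise() and n_distinct(), an equivalent but more compact formulation:

pwt91 %>%
  summarise(
    n_countries = n_distinct(country),
    n_codes     = n_distinct(countrycode),
    first_year  = min(year),
    last_year   = max(year)
  )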
Notice that there are lots of NAs: for some combinations of country and year, there are no data. I expect that the further back we go in time, the more missing values we will observe, and that some countries will have more missing values than others. Before trimming the sample down, we examine the number of missing values by year and by country to see whether this is indeed a widespread pattern:
pwt91 %>%
  group_by(year) %>%
  count(is.na(emp_pop_ratio))
# A tibble: 136 × 3
# Groups: year [68]
year `is.na(emp_pop_ratio)` n
<dbl> <lgl> <int>
1 1950 FALSE 50
2 1950 TRUE 132
3 1951 FALSE 54
4 1951 TRUE 128
5 1952 FALSE 55
6 1952 TRUE 127
7 1953 FALSE 57
8 1953 TRUE 125
9 1954 FALSE 61
10 1954 TRUE 121
# ℹ 126 more rows
The table above confirms my thinking: TRUE counts the missing values and FALSE counts the non-missing ones. In the first two decades of this 68-year period (1950-2017), more than 100 of the 182 countries in our sample have missing values. By the 1990s, that number drops sharply to fewer than 10 countries.
pwt91 %>%
  group_by(country) %>%
  count(is.na(emp_pop_ratio))
# A tibble: 312 × 3
# Groups: country [182]
country `is.na(emp_pop_ratio)` n
<chr> <lgl> <int>
1 Albania FALSE 48
2 Albania TRUE 20
3 Algeria FALSE 58
4 Algeria TRUE 10
5 Angola FALSE 48
6 Angola TRUE 20
7 Anguilla FALSE 29
8 Anguilla TRUE 39
9 Antigua and Barbuda FALSE 9
10 Antigua and Barbuda TRUE 59
# ℹ 302 more rows
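For a quicker view of which countries are most affected, we can sum the missing values per country and sort, a compact alternative to count():

pwt91 %>%
  group_by(country) %>%
  summarise(n_missing = sum(is.na(emp_pop_ratio))) %>%
  arrange(desc(n_missing))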
OECD members such as the US, UK, Australia, Canada, France, Finland, and Germany have no missing values over this period. Sub-Saharan African countries have relatively more missing values, and some Latin American countries have even more. This is to be expected with socio-economic data; it confirms that the sampling quality introduces some regional bias. Deleting observations with missing values can bias parameter estimates and reduce the statistical power of the analysis. Nevertheless, I will use listwise deletion here, dropping every observation that contains a missing value. This means that, were I to continue with the analysis, the models would be trained only on countries with relatively complete data. There may be better ways to handle a biased sample, but for this exercise I simply use 1990 as a cutoff, since a reasonable number of countries have values from 1990 to 2017:
pwt91 <- pwt91 %>%
  filter(year >= 1990) %>%
  drop_na()
We count the number of rows for each country to see which countries do not have all 28 years (1990-2017) worth of data, and we drop those countries:
pwt91 <- pwt91 %>%
  group_by(countrycode) %>%
  filter(n() == 28)
Let’s have a sanity check to make sure we are on track:
pwt91 %>%
  group_by(countrycode) %>%
  count()
# A tibble: 169 × 2
# Groups: countrycode [169]
countrycode n
<chr> <int>
1 AGO 28
2 ALB 28
3 ARE 28
4 ARG 28
5 ARM 28
6 AUS 28
7 AUT 28
8 AZE 28
9 BDI 28
10 BEL 28
# ℹ 159 more rows
# Number of countries
cat(length(unique(pwt91$countrycode)))
169
As can be seen, all remaining countries have values for the 28-year period; we now have a sample of 169 countries:
pwt91
# A tibble: 4,732 × 4
# Groups: countrycode [169]
countrycode country year emp_pop_ratio
<chr> <chr> <dbl> <dbl>
1 AGO Angola 1990 0.454
2 AGO Angola 1991 0.451
3 AGO Angola 1992 0.449
4 AGO Angola 1993 0.448
5 AGO Angola 1994 0.445
6 AGO Angola 1995 0.442
7 AGO Angola 1996 0.439
8 AGO Angola 1997 0.438
9 AGO Angola 1998 0.437
10 AGO Angola 1999 0.437
# ℹ 4,722 more rows
Merging Data Sets
Suppose that after we completed our analysis, we found another data set containing some other relevant variables. We would like to "join" the two data sets by a common variable; in my case, that common variable is the country code. We start by loading the second data set, which I've converted to a .csv file for convenience:
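# the file name below is illustrative; point read_csv() at wherever you saved
# the converted CNTS extract
cnts19 <- read_csv("cnts19.csv")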
First, we need to make sure that the "key" variables share the same names across the two data sets. Then, we "select" the variables of interest to us. The domestic8 variable is the number of anti-government demonstrations, which, according to the CNTS user manual, includes labor strikes and demonstrations.
cnts19 <- cnts19 %>%
  rename(countrycode = Wbcode) %>%
  select(countrycode, year, domestic8)
We want to use left_join() because it preserves the original observations even when there is no match. Notice that leaving the argument by = NULL makes R merge on all variables that appear in both data sets; this is the so-called natural join. (An explicit-key alternative is sketched after the output below.)
new_data <- pwt91 %>%
  left_join(cnts19, by = NULL) %>%
  drop_na()
Joining with `by = join_by(countrycode, year)`
new_data
# A tibble: 4,455 × 5
# Groups: countrycode [162]
countrycode country year emp_pop_ratio domestic8
<chr> <chr> <dbl> <dbl> <dbl>
1 AGO Angola 1990 0.454 0
2 AGO Angola 1991 0.451 0
3 AGO Angola 1992 0.449 0
4 AGO Angola 1993 0.448 0
5 AGO Angola 1994 0.445 0
6 AGO Angola 1995 0.442 0
7 AGO Angola 1996 0.439 0
8 AGO Angola 1997 0.438 0
9 AGO Angola 1998 0.437 0
10 AGO Angola 1999 0.437 0
# ℹ 4,445 more rows
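Natural joins are convenient but fragile: if the two data sets ever share another column name, the merge keys change silently. The join above could equivalently be written with explicit keys, using join_by() from dplyr 1.1 or the older by = c("countrycode", "year") form:

new_data <- pwt91 %>%
  left_join(cnts19, by = join_by(countrycode, year)) %>%
  drop_na()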
Let’s check:
new_data %>%
  group_by(countrycode) %>%
  count()
# A tibble: 162 × 2
# Groups: countrycode [162]
countrycode n
<chr> <int>
1 AGO 28
2 ALB 28
3 ARE 28
4 ARG 28
5 ARM 26
6 AUS 28
7 AUT 28
8 AZE 26
9 BDI 28
10 BEL 28
# ℹ 152 more rows
According to the table above, some countries come out of the merge with fewer than the full 28 years (Armenia and Azerbaijan have 26, for example), because their domestic8 values were NA in some years and drop_na() removed those rows; seven countries dropped out entirely (169 countries before the merge, 162 after). Since this exercise is about data wrangling techniques rather than data analysis, we will trim the sample further for practice, excluding countries with incomplete series. Keep in mind that, as far as analysis is concerned, deleting too much data leads to bias; just how much bias we can tolerate is a whole other discussion. For now, we proceed:
new_data <- new_data %>%
  group_by(countrycode) %>%
  filter(n() == 28)
Now, let’s check:
new_data %>%
  group_by(countrycode) %>%
  count()
# A tibble: 139 × 2
# Groups: countrycode [139]
countrycode n
<chr> <int>
1 AGO 28
2 ALB 28
3 ARE 28
4 ARG 28
5 AUS 28
6 AUT 28
7 BDI 28
8 BEL 28
9 BEN 28
10 BFA 28
# ℹ 129 more rows
We have ensured that all remaining countries have values for all variables:
# Number of countries
cat(length(unique(new_data$countrycode)))
139
new_data
# A tibble: 3,892 × 5
# Groups: countrycode [139]
countrycode country year emp_pop_ratio domestic8
<chr> <chr> <dbl> <dbl> <dbl>
1 AGO Angola 1990 0.454 0
2 AGO Angola 1991 0.451 0
3 AGO Angola 1992 0.449 0
4 AGO Angola 1993 0.448 0
5 AGO Angola 1994 0.445 0
6 AGO Angola 1995 0.442 0
7 AGO Angola 1996 0.439 0
8 AGO Angola 1997 0.438 0
9 AGO Angola 1998 0.437 0
10 AGO Angola 1999 0.437 0
# ℹ 3,882 more rows
Here’s glimpse at which country had the single most yearly anti-government demonstrations over this 28-year period:
new_data %>%
  arrange(desc(domestic8))
# A tibble: 3,892 × 5
# Groups: countrycode [139]
countrycode country year emp_pop_ratio domestic8
<chr> <chr> <dbl> <dbl> <dbl>
1 IND India 2016 0.401 149
2 IND India 2017 0.402 146
3 IND India 2015 0.399 110
4 USA United States 2015 0.470 81
5 SYR Syrian Arab Republic 2011 0.237 74
6 PAK Pakistan 2016 0.307 55
7 YEM Yemen 2011 0.185 55
8 USA United States 2014 0.466 50
9 USA United States 2011 0.457 49
10 USA United States 2016 0.474 49
# ℹ 3,882 more rows
Nested Data
Lastly, we may also present our merged data as a nested data frame, a new structure. A nested data frame has one row per group (per country code and country in my case); its third column, data, is a list of data frames:
nested_data <- new_data %>%
  group_by(countrycode, country) %>%
  nest()
nested_data
# A tibble: 139 × 3
# Groups: countrycode, country [139]
countrycode country data
<chr> <chr> <list>
1 AGO Angola <tibble [28 × 3]>
2 ALB Albania <tibble [28 × 3]>
3 ARE United Arab Emirates <tibble [28 × 3]>
4 ARG Argentina <tibble [28 × 3]>
5 AUS Australia <tibble [28 × 3]>
6 AUT Austria <tibble [28 × 3]>
7 BDI Burundi <tibble [28 × 3]>
8 BEL Belgium <tibble [28 × 3]>
9 BEN Benin <tibble [28 × 3]>
10 BFA Burkina Faso <tibble [28 × 3]>
# ℹ 129 more rows
Let us examine the structure of the object:
glimpse(nested_data)
Rows: 139
Columns: 3
Groups: countrycode, country [139]
$ countrycode <chr> "AGO", "ALB", "ARE", "ARG", "AUS", "AUT", "BDI", "BEL", "B…
$ country <chr> "Angola", "Albania", "United Arab Emirates", "Argentina", …
$ data <list> [<tbl_df[28 x 3]>], [<tbl_df[28 x 3]>], [<tbl_df[28 x 3]>…
The row count makes sense, since we indeed have a sample of 139 countries. If we look at the first element of "data," we see that it contains all of the data for that country (in my case, Angola):
nested_data$data[[1]]
# A tibble: 28 × 3
year emp_pop_ratio domestic8
<dbl> <dbl> <dbl>
1 1990 0.454 0
2 1991 0.451 0
3 1992 0.449 0
4 1993 0.448 0
5 1994 0.445 0
6 1995 0.442 0
7 1996 0.439 0
8 1997 0.438 0
9 1998 0.437 0
10 1999 0.437 0
# ℹ 18 more rows
The fifth element (Australia, per the table above):
nested_data$data[[5]]
# A tibble: 28 × 3
year emp_pop_ratio domestic8
<dbl> <dbl> <dbl>
1 1990 0.463 0
2 1991 0.446 0
3 1992 0.439 0
4 1993 0.436 0
5 1994 0.445 0
6 1995 0.457 0
7 1996 0.457 1
8 1997 0.456 0
9 1998 0.459 0
10 1999 0.462 0
# ℹ 18 more rows
In other words, there is one data frame per country. Presenting data in this structure can be helpful, especially with cross-sectional time-series (panel) data like ours, where observational units are numerous and we often need to run transformations and fit models on subsets of the full data set.
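For example, here is a minimal sketch of the usual list-column workflow: fit one linear trend of the employment–population ratio per country with purrr's map(), then tidy the results with broom (an extra package; the column names model and coefs are my own):

library(purrr)
library(broom)

country_trends <- nested_data %>%
  mutate(
    # one lm() per country: regress the ratio on year
    model = map(data, ~ lm(emp_pop_ratio ~ year, data = .x)),
    # turn each fitted model into a small coefficient table
    coefs = map(model, tidy)
  )

# the fitted intercept and slope for the first country (Angola)
country_trends$coefs[[1]]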
Is data deletion best practice?
To close the post, it is worth reflecting on data deletion. We began with a sample of 182 countries and trimmed it down to 139. Judging by the numbers alone, the deletion could have been worse. We must note, however, that these values are not missing completely at random (MCAR). This is especially true of cross-national socio-economic data, where sampling quality reflects inequalities that are hard to capture and correct for. Beyond this exercise, it would be interesting to explore other ways of handling missing data, as well as the imputation methods that have been developed over the years.
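As a taste of one such alternative, here is a hedged sketch (not part of the analysis above) that linearly interpolates interior gaps within each country using zoo::na.approx(). Here pwt91_full is a hypothetical name for the untrimmed data with its NAs still present, and the approach assumes the ratio moves smoothly between observed years, which is itself a strong assumption:

library(zoo)

pwt91_interp <- pwt91_full %>%  # pwt91_full: hypothetical untrimmed data
  group_by(countrycode) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(
    # fill interior NAs by linear interpolation; leading/trailing NAs remain
    emp_pop_ratio = na.approx(emp_pop_ratio, x = year, na.rm = FALSE)
  ) %>%
  ungroup()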