Data Wrangling with Economics Data in R
In this post, I experiment with some data wrangling techniques, using the Penn World Table version 9.1 (PWT 9.1) and the Cross-National Time-Series (CNTS) data set to practice them.
Data Importation
library(tidyverse)  # readr for read_csv(), dplyr and tidyr for wrangling

pwt91 <- read_csv("pwt91.csv")
Rows: 12376 Columns: 52
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): countrycode, country, currency_unit, i_cig, i_xm, i_xr, i_outlier,...
dbl (44): year, rgdpe, rgdpo, pop, emp, avh, hc, ccon, cda, cgdpe, cgdpo, cn...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
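As the message notes, we can quiet it by specifying column types up front or by setting show_col_types = FALSE (available in readr 2.0 and later), for example:

pwt91 <- read_csv("pwt91.csv", show_col_types = FALSE)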
Wrangling
We first need to "mutate" some variables. The total employment numbers offer less context than the employment–population ratio, so we create a new variable by dividing total employment by population. Next, we "select" the variables that are of interest to us. Suppose we wish to keep the country code, country name, year, and the employment–population ratio:
pwt91 <- pwt91 %>%
  mutate(emp_pop_ratio = emp / pop) %>%
  select(countrycode, country, year, emp_pop_ratio)
pwt91
# A tibble: 12,376 × 4
countrycode country year emp_pop_ratio
<chr> <chr> <dbl> <dbl>
1 ABW Aruba 1950 NA
2 ABW Aruba 1951 NA
3 ABW Aruba 1952 NA
4 ABW Aruba 1953 NA
5 ABW Aruba 1954 NA
6 ABW Aruba 1955 NA
7 ABW Aruba 1956 NA
8 ABW Aruba 1957 NA
9 ABW Aruba 1958 NA
10 ABW Aruba 1959 NA
# ℹ 12,366 more rows
Let’s do some simple counting to get a good sense of our data:
# Number of countries
cat(length(unique(pwt91$country)))
182
# Number of years covered in this data set
cat(length(unique(pwt91$year)))
68
# Make sure that country and country-code are consistent
cat(length(unique(pwt91$countrycode)))
182
# First and last year
cat(min(pwt91$year))
1950
cat(max(pwt91$year))
2017
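The same checks can be collected in a single pass with summarise() and n_distinct(), an equivalent but more compact formulation:

pwt91 %>%
  summarise(
    n_countries = n_distinct(country),
    n_codes     = n_distinct(countrycode),
    first_year  = min(year),
    last_year   = max(year)
  )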
Notice that there are lots of NAs: for some combinations of country and year, there are no data. I expect that the further back we go in time, the more missing values we will observe, and that some countries will have more missing values than others. Before trimming the sample down, we examine the number of missing values by year and by country to see whether this is indeed a widespread pattern:
pwt91 %>%
  group_by(year) %>%
  count(is.na(emp_pop_ratio))
# A tibble: 136 × 3
# Groups: year [68]
year `is.na(emp_pop_ratio)` n
<dbl> <lgl> <int>
1 1950 FALSE 50
2 1950 TRUE 132
3 1951 FALSE 54
4 1951 TRUE 128
5 1952 FALSE 55
6 1952 TRUE 127
7 1953 FALSE 57
8 1953 TRUE 125
9 1954 FALSE 61
10 1954 TRUE 121
# ℹ 126 more rows
The table above confirms my thinking: TRUE counts the missing values and FALSE counts the non-missing ones. In the first two decades of this 68-year period (1950-2017), more than 100 of the 182 countries in our sample have missing values. By the 1990s, that number drops sharply to fewer than 10 countries.
pwt91 %>%
  group_by(country) %>%
  count(is.na(emp_pop_ratio))
# A tibble: 312 × 3
# Groups: country [182]
country `is.na(emp_pop_ratio)` n
<chr> <lgl> <int>
1 Albania FALSE 48
2 Albania TRUE 20
3 Algeria FALSE 58
4 Algeria TRUE 10
5 Angola FALSE 48
6 Angola TRUE 20
7 Anguilla FALSE 29
8 Anguilla TRUE 39
9 Antigua and Barbuda FALSE 9
10 Antigua and Barbuda TRUE 59
# ℹ 302 more rows
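For a quicker view of which countries are most affected, we can sum the missing values per country and sort, a compact alternative to count():

pwt91 %>%
  group_by(country) %>%
  summarise(n_missing = sum(is.na(emp_pop_ratio))) %>%
  arrange(desc(n_missing))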
OECD members such as the US, UK, Australia, Canada, France, Finland, and Germany have no missing values over this period. Sub-Saharan African countries have relatively more missing values, and some Latin American countries have even more. This is to be expected with socio-economic data; it confirms that the sampling quality introduces some regional bias. Deleting observations with missing values can bias parameter estimates and reduce the statistical power of the analysis. Nevertheless, I will use listwise deletion here, dropping every observation that contains a missing value. This means that, were I to continue with the analysis, the models would be trained only on countries with relatively complete data. There may be better ways to handle a biased sample, but for this exercise I simply use 1990 as a cutoff, since a reasonable number of countries have values from 1990 to 2017:
pwt91 <- pwt91 %>%
  filter(year >= 1990) %>%
  drop_na()
We count the number of rows for each country to see which countries do not have all 28 years (1990-2017) worth of data, and we drop those countries:
pwt91 <- pwt91 %>%
  group_by(countrycode) %>%
  filter(n() == 28)
Let’s have a sanity check to make sure we are on track:
pwt91 %>%
  group_by(countrycode) %>%
  count()
# A tibble: 169 × 2
# Groups: countrycode [169]
countrycode n
<chr> <int>
1 AGO 28
2 ALB 28
3 ARE 28
4 ARG 28
5 ARM 28
6 AUS 28
7 AUT 28
8 AZE 28
9 BDI 28
10 BEL 28
# ℹ 159 more rows
# Number of countries
cat(length(unique(pwt91$countrycode)))
169
As can be seen, all remaining countries have values for the 28-year period; we now have a sample of 169 countries:
pwt91
# A tibble: 4,732 × 4
# Groups: countrycode [169]
countrycode country year emp_pop_ratio
<chr> <chr> <dbl> <dbl>
1 AGO Angola 1990 0.454
2 AGO Angola 1991 0.451
3 AGO Angola 1992 0.449
4 AGO Angola 1993 0.448
5 AGO Angola 1994 0.445
6 AGO Angola 1995 0.442
7 AGO Angola 1996 0.439
8 AGO Angola 1997 0.438
9 AGO Angola 1998 0.437
10 AGO Angola 1999 0.437
# ℹ 4,722 more rows
Merging Data Sets
Suppose that after we completed our analysis, we found another data set containing some other relevant variables. We would like to "join" the two data sets by a common variable; in my case, that common variable is the country code. We start by loading the second data set, which I've converted to a .csv file for convenience:
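# the file name below is illustrative; point read_csv() at wherever you saved
# the converted CNTS extract
cnts19 <- read_csv("cnts19.csv")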
First, we need to make sure that the "key" variables share the same names across the two data sets. Then, we "select" the variables of interest to us. The domestic8 variable is the number of anti-government demonstrations, which, according to the CNTS user manual, includes labor strikes and demonstrations.
cnts19 <- cnts19 %>%
  rename(countrycode = Wbcode) %>%
  select(countrycode, year, domestic8)
We want to use left_join() because it preserves the original observations even when there is no match. Notice that leaving the argument by = NULL makes R merge on all variables that appear in both data sets; this is the so-called natural join. (An explicit-key alternative is sketched after the output below.)
new_data <- pwt91 %>%
  left_join(cnts19, by = NULL) %>%
  drop_na()
Joining with `by = join_by(countrycode, year)`
new_data
# A tibble: 4,455 × 5
# Groups: countrycode [162]
countrycode country year emp_pop_ratio domestic8
<chr> <chr> <dbl> <dbl> <dbl>
1 AGO Angola 1990 0.454 0
2 AGO Angola 1991 0.451 0
3 AGO Angola 1992 0.449 0
4 AGO Angola 1993 0.448 0
5 AGO Angola 1994 0.445 0
6 AGO Angola 1995 0.442 0
7 AGO Angola 1996 0.439 0
8 AGO Angola 1997 0.438 0
9 AGO Angola 1998 0.437 0
10 AGO Angola 1999 0.437 0
# ℹ 4,445 more rows
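Natural joins are convenient but fragile: if the two data sets ever share another column name, the merge keys change silently. The join above could equivalently be written with explicit keys, using join_by() from dplyr 1.1 or the older by = c("countrycode", "year") form:

new_data <- pwt91 %>%
  left_join(cnts19, by = join_by(countrycode, year)) %>%
  drop_na()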
Let’s check:
new_data %>%
  group_by(countrycode) %>%
  count()
# A tibble: 162 × 2
# Groups: countrycode [162]
countrycode n
<chr> <int>
1 AGO 28
2 ALB 28
3 ARE 28
4 ARG 28
5 ARM 26
6 AUS 28
7 AUT 28
8 AZE 26
9 BDI 28
10 BEL 28
# ℹ 152 more rows
According to the table above, some countries come out of the merge with fewer than the full 28 years (Armenia and Azerbaijan have 26, for example), because their domestic8 values were NA in some years and drop_na() removed those rows; seven countries dropped out entirely (169 countries before the merge, 162 after). Since this exercise is about data wrangling techniques rather than data analysis, we will trim the sample further for practice, excluding countries with incomplete series. Keep in mind that, as far as analysis is concerned, deleting too much data leads to bias; just how much bias we can tolerate is a whole other discussion. For now, we proceed:
new_data <- new_data %>%
  group_by(countrycode) %>%
  filter(n() == 28)
Now, let’s check:
new_data %>%
  group_by(countrycode) %>%
  count()
# A tibble: 139 × 2
# Groups: countrycode [139]
countrycode n
<chr> <int>
1 AGO 28
2 ALB 28
3 ARE 28
4 ARG 28
5 AUS 28
6 AUT 28
7 BDI 28
8 BEL 28
9 BEN 28
10 BFA 28
# ℹ 129 more rows
We have ensured that all remaining countries have values for all variables:
# Number of countries
cat(length(unique(new_data$countrycode)))
139
new_data
# A tibble: 3,892 × 5
# Groups: countrycode [139]
countrycode country year emp_pop_ratio domestic8
<chr> <chr> <dbl> <dbl> <dbl>
1 AGO Angola 1990 0.454 0
2 AGO Angola 1991 0.451 0
3 AGO Angola 1992 0.449 0
4 AGO Angola 1993 0.448 0
5 AGO Angola 1994 0.445 0
6 AGO Angola 1995 0.442 0
7 AGO Angola 1996 0.439 0
8 AGO Angola 1997 0.438 0
9 AGO Angola 1998 0.437 0
10 AGO Angola 1999 0.437 0
# ℹ 3,882 more rows
Here’s glimpse at which country had the single most yearly anti-government demonstrations over this 28-year period:
new_data %>%
  arrange(desc(domestic8))
# A tibble: 3,892 × 5
# Groups: countrycode [139]
countrycode country year emp_pop_ratio domestic8
<chr> <chr> <dbl> <dbl> <dbl>
1 IND India 2016 0.401 149
2 IND India 2017 0.402 146
3 IND India 2015 0.399 110
4 USA United States 2015 0.470 81
5 SYR Syrian Arab Republic 2011 0.237 74
6 PAK Pakistan 2016 0.307 55
7 YEM Yemen 2011 0.185 55
8 USA United States 2014 0.466 50
9 USA United States 2011 0.457 49
10 USA United States 2016 0.474 49
# ℹ 3,882 more rows
Nested Data
Lastly, we may also present our merged data as a nested data frame, a new structure. A nested data frame has one row per group (per country code and country in my case); its third column, data, is a list of data frames:
nested_data <- new_data %>%
  group_by(countrycode, country) %>%
  nest()
nested_data
# A tibble: 139 × 3
# Groups: countrycode, country [139]
countrycode country data
<chr> <chr> <list>
1 AGO Angola <tibble [28 × 3]>
2 ALB Albania <tibble [28 × 3]>
3 ARE United Arab Emirates <tibble [28 × 3]>
4 ARG Argentina <tibble [28 × 3]>
5 AUS Australia <tibble [28 × 3]>
6 AUT Austria <tibble [28 × 3]>
7 BDI Burundi <tibble [28 × 3]>
8 BEL Belgium <tibble [28 × 3]>
9 BEN Benin <tibble [28 × 3]>
10 BFA Burkina Faso <tibble [28 × 3]>
# ℹ 129 more rows
Let us examine the structure of the object:
glimpse(nested_data)
Rows: 139
Columns: 3
Groups: countrycode, country [139]
$ countrycode <chr> "AGO", "ALB", "ARE", "ARG", "AUS", "AUT", "BDI", "BEL", "B…
$ country <chr> "Angola", "Albania", "United Arab Emirates", "Argentina", …
$ data <list> [<tbl_df[28 x 3]>], [<tbl_df[28 x 3]>], [<tbl_df[28 x 3]>…
The row count makes sense, since we indeed have a sample of 139 countries. If we look at the first element of "data," we see that it contains all of the data for that country (in my case, Angola):
nested_data$data[[1]]
# A tibble: 28 × 3
year emp_pop_ratio domestic8
<dbl> <dbl> <dbl>
1 1990 0.454 0
2 1991 0.451 0
3 1992 0.449 0
4 1993 0.448 0
5 1994 0.445 0
6 1995 0.442 0
7 1996 0.439 0
8 1997 0.438 0
9 1998 0.437 0
10 1999 0.437 0
# ℹ 18 more rows
The fifth element (Australia, per the table above):
nested_data$data[[5]]
# A tibble: 28 × 3
year emp_pop_ratio domestic8
<dbl> <dbl> <dbl>
1 1990 0.463 0
2 1991 0.446 0
3 1992 0.439 0
4 1993 0.436 0
5 1994 0.445 0
6 1995 0.457 0
7 1996 0.457 1
8 1997 0.456 0
9 1998 0.459 0
10 1999 0.462 0
# ℹ 18 more rows
In other words, there is one data frame per country. Presenting data in this structure can be helpful, especially with cross-sectional time-series (panel) data like ours, where observational units are numerous and we often need to run transformations and fit models on subsets of the full data set.
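For example, here is a minimal sketch of the usual list-column workflow: fit one linear trend of the employment–population ratio per country with purrr's map(), then tidy the results with broom (an extra package; the column names model and coefs are my own):

library(purrr)
library(broom)

country_trends <- nested_data %>%
  mutate(
    # one lm() per country: regress the ratio on year
    model = map(data, ~ lm(emp_pop_ratio ~ year, data = .x)),
    # turn each fitted model into a small coefficient table
    coefs = map(model, tidy)
  )

# the fitted intercept and slope for the first country (Angola)
country_trends$coefs[[1]]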
Is data deletion best practice?
To close the post, it is worth reflecting on data deletion. We began with a sample of 182 countries and trimmed it down to 139. Judging by the numbers alone, the deletion could have been worse. We must note, however, that these values are not missing completely at random (MCAR). This is especially true of cross-national socio-economic data, where sampling quality reflects inequalities that are hard to capture and correct for. Beyond this exercise, it would be interesting to explore other ways of handling missing data, as well as the imputation methods that have been developed over the years.
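As a taste of one such alternative, here is a hedged sketch (not part of the analysis above) that linearly interpolates interior gaps within each country using zoo::na.approx(). Here pwt91_full is a hypothetical name for the untrimmed data with its NAs still present, and the approach assumes the ratio moves smoothly between observed years, which is itself a strong assumption:

library(zoo)

pwt91_interp <- pwt91_full %>%  # pwt91_full: hypothetical untrimmed data
  group_by(countrycode) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(
    # fill interior NAs by linear interpolation; leading/trailing NAs remain
    emp_pop_ratio = na.approx(emp_pop_ratio, x = year, na.rm = FALSE)
  ) %>%
  ungroup()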