4 Useful Data Structures

R has many useful data structures. A data structure is, in simple terms, just a collection of data. Different data structures store data in different ways and therefore place different restrictions on what kind of data they can hold. Vectors and matrices are two such data structures and fall under the broader category of arrays. A vector is nothing more than a one-dimensional array, while a matrix is a two-dimensional array.

Arrays are a rather restrictive data structure in the sense that R requires every element of an array to be of the same type. A type can be, for example, a double (a real number), an integer, a boolean, or a string. Because of this restriction, a vector cannot hold both a number and a string. This does not mean, however, that R will throw an error for the code below. Instead, R uses so-called type promotion, also known as type coercion, to convert one of the types into the other. A number can always be represented as a string (as text), so the code below converts the 1 to the string "1".

c(1, "a")

[1] "1" "a"
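The conversion always goes towards the more general type: logical values can become numbers, and numbers can become strings. The following is a minimal sketch of this ordering; the variable names are purely illustrative and do not appear elsewhere in the text.

```r
# Mixing types in c() converts every element to the most general type present.
mixed_lgl_num <- c(TRUE, 2.5)   # logical + double   -> double
mixed_int_dbl <- c(1L, 2.5)     # integer + double   -> double
mixed_num_chr <- c(1, "a")      # double + character -> character

typeof(mixed_lgl_num)
typeof(mixed_int_dbl)
typeof(mixed_num_chr)

# TRUE becomes 1 when it is promoted to a number.
mixed_lgl_num
```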

Lists are a more flexible data structure than arrays. Lists come in two types: unnamed and named. We will first focus on the unnamed versions. A list can be created simply by calling list and providing it with elements.

l <- list(1, 2, 3, 4)
l

[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

Compare the output of the code above to the output you would get for the vector c(1, 2, 3, 4). Clearly, lists and vectors are not the same: lists are much more flexible. You can imagine a list as a collection of boxes, where each element of the comma-separated sequence inside the parentheses above is put into its own box. A specific box can then be accessed by indexing the list just like the arrays before.

l[1]

[[1]]
[1] 1

Looking at the output, it still looks different from what we have seen before. R provides the handy typeof function to check the type of an object. Calling typeof on l[1] shows that the returned object is still a list. Thus, if we index a list the way we index vectors or matrices, we get a list back instead of the element inside it.

typeof(l[1])

[1] "list"

To obtain the element inside the list, we need to use double square brackets. The returned element is now a double (a real number) as expected.

l[[1]]

[1] 1

typeof(l[[1]])

[1] "double"
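The difference between the two kinds of brackets matters as soon as we want to compute with the content of a list. The sketch below rebuilds the same list and contrasts the two; the names sub and inner are illustrative only.

```r
l <- list(1, 2, 3, 4)

sub <- l[2:3]      # single brackets: a shorter list with the second and third boxes
typeof(sub)        # still a list
length(sub)        # contains two boxes

inner <- l[[2]]    # double brackets: the content of the second box
typeof(inner)      # a double

# Arithmetic works on the content of a box, not on the box itself.
inner + 1
# l[2] + 1 would throw an error, because l[2] is still a list.
```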

One advantage of lists over arrays is the ability to store different types of data without the types being converted. Thus, while we could not store the vector c(1, "a") without the double 1 being changed to the string "1", we can store the following list.

l <- list(1, "a", TRUE)
l

[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

The reason why lists can store different kinds of elements goes back to the analogy above. While an array is like a single box in which we store all elements, a list is like a collection of boxes. Each box keeps the type of whatever is put into it, and because we can use different boxes, we can store data of different types side by side.
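A box is not limited to a single value. As a small sketch, with the name boxes chosen only for illustration, a box can hold a whole vector or even another list.

```r
# Each element (box) can hold a whole vector, or even another list.
boxes <- list(c(1.5, 2.5, 3.5), c("a", "b"), TRUE, list(1, "x"))

# Report the type stored in each box.
sapply(boxes, typeof)

# Chained double brackets reach into a nested list:
# the fourth box, then the second box inside it.
boxes[[4]][[2]]
```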

Lists can also be named. We will use this later on to package the return values of a function. To name a list, we simply write list(name = element), with name and element replaced by the actual names and elements. For example, we can store a number, a letter, and a boolean in a named list in the following way.

l <- list(number = 1, letter = "a", boolean = TRUE)
l

$number
[1] 1

$letter
[1] "a"

$boolean
[1] TRUE

This list can still be indexed like an unnamed list, but it can also be indexed using the names. To index a list using its names, we write list$name or list[["name"]].

l[[2]]

[1] "a"

l$letter

[1] "a"

l[["boolean"]]

[1] TRUE
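To illustrate how a named list can package several return values, here is a minimal sketch. The function describe is a hypothetical helper invented for this example and is not part of R.

```r
# Hypothetical helper: returns several summary values in one named list.
describe <- function(x) {
  list(mean = mean(x), sd = sd(x), n = length(x))
}

res <- describe(c(2, 4, 6, 8))

names(res)     # the names of the boxes
res$mean       # access by name with $
res[["n"]]     # access by name with double brackets

# New named elements can be added by assignment.
res$label <- "example"
names(res)
```

The caller receives a single object and can pick out whichever value is needed by name, without having to remember the order of the elements.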

While lists are much more flexible than arrays, they are still not optimal for econometrics. Econometrics is an empirical discipline and thus relies on data. Storing data in lists is possible, but manipulating data in lists is tedious. Econometricians therefore far more often use so-called data frames. A data frame is nothing but a table of data in which each column can have a different data type.

Data frames are created with data.frame, using a syntax similar to that of named lists. Compared to a named list, the printed output is much cleaner and the data is easier to manipulate.

df <- data.frame(id = sample(letters, 10), grade = 1:10)
df

   id grade
1   e     1
2   o     2
3   w     3
4   k     4
5   n     5
6   m     6
7   v     7
8   b     8
9   f     9
10  c    10
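Under the hood, a data frame is stored as a list of equally long columns, so much of what we learned about named lists carries over. The sketch below uses a separate frame called grades (an illustrative name) and adds set.seed only to make the random ids reproducible; the ids will therefore differ from those printed above.

```r
# set.seed() is added here only so that the random ids are reproducible.
set.seed(1)
grades <- data.frame(id = sample(letters, 10), grade = 1:10)

nrow(grades)          # number of rows
ncol(grades)          # number of columns
names(grades)         # column names
typeof(grades)        # a data frame is stored as a list of columns

grades$grade          # a column can be extracted like a named list element
grades[["id"]]        # the same works with double brackets
grades[3, ]           # row 3, all columns
grades[3, "grade"]    # row 3 of the grade column
str(grades)           # compact overview of the column types
```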

A typical data frame that you might encounter in your concurrent macroeconomics class contains a column for the year, a column for GDP growth, a column for CPI growth, and a column for unemployment. Such a data frame might look like the following, here filled with randomly generated numbers.

df <- data.frame(
  year = 1960:2024,
  GDP_growth = rnorm(65, mean = 0.5),
  CPI_growth = rnorm(65),
  urate = runif(65, min = 3, max = 10)
)
df

   year  GDP_growth  CPI_growth    urate
1  1960 -0.08892621  0.16463587 5.065491
2  1961  1.24872553  0.78591116 4.266830
3  1962  0.06668438 -0.29222459 9.594971
4  1963  1.46695965  0.39828452 5.794365
5  1964  1.67476822  0.39566550 6.287162
6  1965 -0.33164329  0.21982581 7.415040
7  1966  1.32843147  0.05546870 8.851888
8  1967  2.33941362  0.86304083 9.618320
9  1968  0.82998192  0.29445480 5.880486
10 1969  0.34202845 -1.16402566 3.800347
11 1970  1.93714221  1.00697613 4.162238
12 1971 -0.61093546 -0.43794197 4.928424
13 1972 -0.32269076 -0.28197791 6.333750
14 1973  2.12371545  1.56259218 8.063485
15 1974 -0.41661919  0.44616816 8.357204
16 1975  0.92002998  1.10120342 7.693329
17 1976  0.23640530 -1.15584169 7.442181
18 1977 -0.35992292 -0.67781385 4.568331
19 1978  1.05412815  0.71357399 7.538689
20 1979  1.61667903 -1.11192905 6.670166
21 1980 -1.00211219  0.37982735 9.459323
22 1981  1.39579213 -1.16674061 4.452969
23 1982 -2.21678108 -0.74216246 5.838885
24 1983 -1.71197846  0.90218395 8.534942
25 1984  1.39350696  0.51763579 9.185164
26 1985 -0.35227946 -0.74193505 7.865650
27 1986  1.32189484 -0.70810337 9.655562
28 1987  1.81583622 -1.43456787 9.258499
29 1988  0.91614636  1.09233044 5.055258
30 1989 -0.37515893  2.61133629 5.942377
31 1990  1.19678903 -0.51074292 4.880914
32 1991  0.65947683 -1.77134081 8.708443
33 1992 -0.29349408  0.19860990 7.152598
34 1993 -0.43407339 -1.46311341 9.428337
35 1994  0.15249644  0.80053843 9.808177
36 1995  0.94966567  1.17657243 5.414148
37 1996  1.13073676  0.73802986 3.607648
38 1997  1.14460218 -0.50579893 4.171454
39 1998  2.34118599  0.36405933 5.652493
40 1999  0.26529119  0.08888244 5.794934
41 2000 -0.47247537 -1.13064966 6.741402
42 2001  0.74049731 -1.48091782 7.396782
43 2002  1.52611265 -0.39012195 9.438776
44 2003  1.71800152  0.29406199 9.715607
45 2004  0.55768243 -0.68213394 5.365145
46 2005  0.84263463 -1.33776471 5.024965
47 2006  1.37526383  0.05918326 4.475322
48 2007  0.36317796 -1.13859225 8.061715
49 2008  1.87992367  0.60606532 5.870958
50 2009 -0.36660904  0.40915355 4.271145
51 2010  2.99956215 -1.31301024 6.169534
52 2011  0.35001905  0.78546156 9.180057
53 2012  0.95686730  0.04243047 4.527058
54 2013 -1.34402135 -1.90776670 6.241158
55 2014  0.83078387  1.16082595 3.965152
56 2015 -0.28912279 -0.50285570 4.759643
57 2016  0.04145113 -0.58810780 4.338493
58 2017  0.53682475  0.48371657 9.358597
59 2018 -0.23261364 -0.84110757 3.786608
60 2019 -0.36700289 -0.05692854 6.574672
61 2020 -0.70147221 -0.93221559 5.978248
62 2021  0.43394570 -0.29860725 3.458615
63 2022 -0.87265212  0.17857421 8.926980
64 2023 -1.09866101  0.39197797 4.220703
65 2024  0.53162703  0.04931760 7.966602

While this data frame is technically still considered small, it already becomes clear that the more rows a data frame has, the more difficult it is to get an overview of the data. Functions that print only some of the rows to the console are therefore useful. R provides the head and tail functions: head prints the first few rows (six by default), and tail prints the last few.

head(df)

  year  GDP_growth CPI_growth    urate
1 1960 -0.08892621  0.1646359 5.065491
2 1961  1.24872553  0.7859112 4.266830
3 1962  0.06668438 -0.2922246 9.594971
4 1963  1.46695965  0.3982845 5.794365
5 1964  1.67476822  0.3956655 6.287162
6 1965 -0.33164329  0.2198258 7.415040

tail(df)

   year GDP_growth  CPI_growth    urate
60 2019 -0.3670029 -0.05692854 6.574672
61 2020 -0.7014722 -0.93221559 5.978248
62 2021  0.4339457 -0.29860725 3.458615
63 2022 -0.8726521  0.17857421 8.926980
64 2023 -1.0986610  0.39197797 4.220703
65 2024  0.5316270  0.04931760 7.966602
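Both functions accept an argument n that controls how many rows are shown. The following sketch uses a small frame called demo with made-up values (not the data printed above), so that it can be run on its own.

```r
# Illustrative frame with made-up values, one row per year.
demo <- data.frame(year = 1960:2024, value = (1:65) / 2)

head(demo, n = 3)      # first three rows
tail(demo, n = 2)      # last two rows
head(demo, n = -60)    # everything except the last 60 rows
dim(demo)              # number of rows and columns
summary(demo$value)    # minimum, quartiles, mean, and maximum of one column
```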

Many more useful functions exist for working with data frames, and we recommend a brief search online. Data frames will be your trusty companion throughout the remainder of your studies, and over time you will learn many useful manipulation techniques.
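As a first taste of such techniques, the sketch below filters rows, adds a column, and sorts a data frame using only base R. The frame macro and all of its values are made up for illustration.

```r
# Made-up values for illustration only.
macro <- data.frame(
  year = 2020:2024,
  GDP_growth = c(-2.0, 5.5, 2.0, 2.5, 2.8),
  urate = c(8.0, 5.5, 3.5, 3.5, 4.0)
)

# Keep only the rows in which growth was positive.
positive <- macro[macro$GDP_growth > 0, ]

# Add a new column computed from an existing one.
macro$negative_growth <- macro$GDP_growth < 0

# Sort the rows by the unemployment rate, highest first.
sorted <- macro[order(macro$urate, decreasing = TRUE), ]

# Average of a single column.
mean(macro$urate)
```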