{censobr} is an R package to download data from Brazil’s Population Census. It provides a simple and efficient way to download and read the data sets and documentation of all the population censuses taken in the country since 1960. The {censobr} package is built on top of the Arrow platform, which allows users to work with larger-than-memory census data using familiar {dplyr} functions.
The package currently includes 6 main functions to download census data:
read_population()
read_households()
read_mortality()
read_families()
read_emigration()
read_tracts()
Function | Origin | Unit | Description | 1960 | 1970 | 1980 | 1991 | 2000 | 2010 | 2022
---|---|---|---|---|---|---|---|---|---|---
read_population() | Sample | Microdata | Reads person-level microdata. | | X | X | X | X | X | soon
read_households() | Sample | Microdata | Reads household microdata. | X | X | X | X | X | X | soon
read_families() | Sample | Microdata | Reads family microdata from the 2000 census. | | | | | X | |
read_emigration() | Sample | Microdata | Reads emigration microdata. | | | | | | X | soon
read_mortality() | Sample | Microdata | Reads mortality microdata. | | | | | | X | soon
read_tracts() | Universe | Census tract | Reads Universe data aggregated by census tract. | | | | | soon | X | soon
{censobr} also includes a few support functions to help users navigate the documentation of Brazilian censuses, providing convenient information on data variables and methodology:
data_dictionary()
questionnaire()
interview_manual()
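As a rough illustration of how these support functions might be called, here is a minimal sketch. The argument names and values (`year`, `dataset = "population"`, `type = "long"`) are assumptions and should be checked against each function's help page.

```r
library(censobr)

# Data dictionary of the 2010 person-level microdata
# (argument names are assumptions; see ?data_dictionary)
data_dictionary(year = 2010, dataset = "population")

# Long-form (sample) questionnaire used in 2010 (see ?questionnaire)
questionnaire(year = 2010, type = "long")

# Enumerator manual for the 2010 census (see ?interview_manual)
interview_manual(year = 2010)
```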
Finally, the package includes a function to help users to manage the data cached locally.
censobr_cache()
All {censobr} functions that read data share the same syntax, so any data set can be downloaded with a single line of code. Like this:
read_households(
  year,          # year of reference
  columns,       # select columns to read
  add_labels,    # add labels to categorical variables
  as_data_frame, # return an Arrow Dataset or a data.frame
  showProgress,  # show download progress bar
  cache          # cache data for faster access later
)
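To make the template concrete, here is a minimal sketch of a filled-in call. The object name `hs_small` is hypothetical, the two columns are ones used later in this vignette, and `add_labels = NULL` is assumed to mean "no labels".

```r
library(censobr)

# Hypothetical example: 2010 household microdata, two columns only
hs_small <- read_households(
  year = 2010,                        # census year
  columns = c("code_muni", "V0010"),  # municipality code and sample weight
  add_labels = NULL,                  # assumed: no labels added
  as_data_frame = FALSE,              # keep the result as an Arrow object
  showProgress = FALSE,               # hide the download progress bar
  cache = TRUE                        # reuse the local copy on later calls
)
```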
Note: all data sets in {censobr} are enriched with geography columns that follow the naming standards of the {geobr} package, which helps with data manipulation and integration with spatial data from {geobr}. The added columns are: c('code_muni', 'code_state', 'abbrev_state', 'name_state', 'code_region', 'name_region', 'code_weighting').
Data Cache:
The first time the user runs a function, {censobr} will download the file and store it locally. This way, the data only needs to be downloaded once. More info in the Data cache section below.
Data from Brazilian censuses are often too big to fit in a user’s RAM. To avoid this problem, {censobr} returns an Arrow table by default, which can be analyzed like a regular data.frame using the {dplyr} package without loading the full data into memory.
Let’s see how {censobr} works in a couple of examples:
First, let’s load the libraries we’ll be using in this vignette.
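A minimal sketch of that setup, assuming the packages whose functions appear in the examples below ({censobr}, {arrow}, {dplyr}, {geobr} and {ggplot2}) are already installed:

```r
library(censobr)  # census data
library(arrow)    # larger-than-memory tables
library(dplyr)    # data manipulation verbs
library(geobr)    # spatial data for Brazil
library(ggplot2)  # plots and maps
```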
In this example we’ll calculate the proportion of people with higher education in different racial groups in the state of Rio de Janeiro. First, we need to use the read_population() function to download the population data set.
Since we don’t need to load all columns of the data into memory, we can pass a vector with the names of the columns we’re going to use. This might be necessary in more constrained computing environments. Note that by setting add_labels = 'pt', the function returns labeled values for categorical variables.
pop <- read_population(year = 2010,
                       columns = c('abbrev_state', 'V0606', 'V0010', 'V6400'),
                       add_labels = 'pt',
                       showProgress = FALSE)

class(pop)
By default, the output of the function is an "arrow_dplyr_query". This makes it possible to work with the census data quickly and efficiently, even when the data set is too big for your computer’s memory. By setting the parameter as_data_frame = TRUE, the read functions load the entire output into memory as a data.frame. Warning: this can cause the R session to crash in computationally constrained environments.
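To illustrate the difference between a lazy Arrow query and an in-memory data.frame without downloading anything, here is a minimal sketch that uses the built-in mtcars data as a stand-in for census data. The toy table and its columns are not part of {censobr}.

```r
library(arrow)
library(dplyr)

# Stand-in for a census table: an Arrow Table built from mtcars
toy <- arrow_table(mtcars)

# dplyr verbs on an Arrow object only build a query; nothing is computed yet
query <- toy |>
  filter(cyl == 4) |>
  group_by(gear) |>
  summarize(mean_mpg = mean(mpg))

class(query)   # includes "arrow_dplyr_query"

# collect() runs the query and returns a regular data.frame (tibble)
result <- collect(query)
class(result)
```

The same pattern applies to the census data: the query stays lazy until collect() is called, so only the summarized result needs to fit in memory.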
The output of the read functions in {censobr} can be analyzed like a regular data.frame using the {dplyr} package. For example, one can take a quick peek into the data set with glimpse().
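A minimal sketch of that call, assuming the `pop` object created above is still in the session:

```r
library(dplyr)

# Print column names, types and the first few values of each column
glimpse(pop)
```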
In the example below, we use the dplyr syntax to (a) filter observations for the state of Rio de Janeiro, (b) group observations by racial group, and (c) summarize the data by calculating the proportion of individuals with higher education. Note that we need to add a collect() call at the end of our query, so that the query is executed and its result is brought into memory.
df <- pop |>
  filter(abbrev_state == "RJ") |>   # (a)
  compute() |>
  group_by(V0606) |>                # (b)
  summarize(higher_edu = sum(V0010[which(V6400 == "Superior completo")]) / sum(V0010), # (c)
            pop = sum(V0010)) |>
  collect()

head(df)
Now we only need to plot the results.
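One possible way to draw the figure is sketched below. It assumes the `df` table collected in the previous step; the bar chart layout and axis labels are illustrative choices.

```r
library(ggplot2)

# `df` has one row per racial group (V0606) with its share of higher education
p <- ggplot(df, aes(x = V0606, y = higher_edu)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Race/color (V0606)", y = "Share with higher education") +
  theme_classic()

p
```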
In this example, we are going to map the proportion of households connected to a sewage network in Brazilian municipalities. First, we can easily download the households data set with the read_households() function.
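A minimal sketch of this step, assuming the 2010 census (the year used for the spatial data later in this example) and the object name `hs` used in the code that follows:

```r
library(censobr)

# Household microdata, returned as a lazy Arrow object
# (year = 2010 is an assumption based on the rest of this example)
hs <- read_households(year = 2010, showProgress = FALSE)
```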
Now we’re going to (a) group observations by municipality, (b) get the number of households connected to a sewage network, (c) calculate the proportion of households connected, and (d) collect the results.
esg <- hs |>
  compute() |>
  group_by(code_muni) |>                            # (a)
  summarize(rede = sum(V0010[which(V0207 == '1')]), # (b)
            total = sum(V0010)) |>                  # (b)
  mutate(cobertura = rede / total) |>               # (c)
  collect()                                         # (d)

head(esg)
In order to create a map with these values, we are going to use the {geobr} package to download the geometries of Brazilian municipalities.
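A minimal sketch of that download, assuming 2010 boundaries and the object name `muni_sf` used in the merge below:

```r
library(geobr)

# Municipality polygons as an sf object (year = 2010 is an assumption)
muni_sf <- read_municipality(year = 2010, showProgress = FALSE)

head(muni_sf)
```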
Now we only need to merge the spatial data with our estimates and map the results.
esg_sf <- left_join(muni_sf, esg, by = 'code_muni')

ggplot() +
  geom_sf(data = esg_sf, aes(fill = cobertura), color = NA) +
  labs(title = "Share of households connected to a sewage network") +
  scale_fill_distiller(palette = "Greens", direction = 1,
                       name = 'Share of\nhouseholds',
                       labels = scales::percent) +
  theme_void()
In this final example, we’re going to visualize how the amount of money people spend on rent varies spatially across the metropolitan area of São Paulo.
First, let’s download the municipalities of the metro area of São Paulo.
metro_muni <- geobr::read_metro_area(year = 2010,
                                     showProgress = FALSE) |>
  subset(name_metro == "RM São Paulo")
We also need the polygons of the weighting areas (áreas de ponderação). With the code below, we download all weighting areas in the state of São Paulo, and then keep only the ones in the metropolitan region of São Paulo.
wt_areas <- geobr::read_weighting_area(code_weighting = "SP",
                                       showProgress = FALSE,
                                       year = 2010)

wt_areas <- subset(wt_areas, code_muni %in% metro_muni$code_muni)

head(wt_areas)
Now we need to calculate the average rent paid in each weighting area. Using the national household data set, we’re going to (a) filter only observations in our municipalities of interest, (b) group observations by weighting area, (c) calculate the average rent, and (d) collect the results.
rent <- hs |>
  filter(code_muni %in% metro_muni$code_muni) |>                          # (a)
  compute() |>
  group_by(code_weighting) |>                                             # (b)
  summarize(avgrent = weighted.mean(x = V2011, w = V0010, na.rm = TRUE)) |> # (c)
  collect()                                                               # (d)

head(rent)
Finally, we can merge the spatial data with our rent estimates and map the results.
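One way to do this is sketched below. It assumes the `wt_areas`, `rent` and `metro_muni` objects created above, and that code_weighting has the same type in both tables; the palette and labels are illustrative choices.

```r
library(dplyr)
library(ggplot2)

# Attach the rent estimates to the weighting-area polygons
rent_sf <- left_join(wt_areas, rent, by = "code_weighting")

# Choropleth of average rent, with municipality borders drawn on top
ggplot() +
  geom_sf(data = rent_sf, aes(fill = avgrent), color = NA) +
  geom_sf(data = metro_muni, color = "gray", fill = NA) +
  labs(title = "Average rent per weighting area",
       subtitle = "Metropolitan area of São Paulo") +
  scale_fill_distiller(palette = "Blues", direction = 1,
                       name = "Average\nrent (R$)") +
  theme_void()
```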
Data cache:
The first time the user runs a function, {censobr} will download the file and store it locally. This way, the data only needs to be downloaded once. When the cache parameter is set to TRUE (the default), the function will read the cached data, which is much faster.
Users can manage the cached data sets using the censobr_cache() function. For example, users can list cached files, delete a particular file, or delete all files.
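The three operations could look like the sketch below. The `list_files` and `delete_file` arguments reflect the package documentation as best recalled and should be confirmed with ?censobr_cache; the file name "2010_emigration" is only an illustrative pattern.

```r
library(censobr)

# List cached files
censobr_cache(list_files = TRUE)

# Delete a particular file (illustrative file name)
censobr_cache(delete_file = "2010_emigration")

# Delete all cached files
censobr_cache(delete_file = "all")
```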
By default, {censobr} files are saved in the ‘User’ directory. However, users can run the function set_censobr_cache_dir() to set a custom cache directory.
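A minimal sketch, assuming the function takes the target folder through a `path` argument; the temporary folder is a hypothetical location chosen only for illustration:

```r
library(censobr)

# Hypothetical cache location inside the session's temporary folder
custom_dir <- file.path(tempdir(), "censobr_cache")
dir.create(custom_dir, showWarnings = FALSE)

# Point {censobr} to the custom cache directory
set_censobr_cache_dir(path = custom_dir)
```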