| Title: | Identify Reference Periods in Brazil's PNADC Survey Data |
|---|---|
| Description: | Identifies reference periods (months, fortnights, and weeks) in Brazil's quarterly PNADC (Pesquisa Nacional por Amostra de Domicilios Continua) survey data and computes calibrated weights for sub-quarterly analysis. The core algorithm uses IBGE (Instituto Brasileiro de Geografia e Estatistica) 'Parada Tecnica' (technical break) rules combined with respondent birthdates to determine which temporal period each survey observation refers to. Period identification follows a nested hierarchy enforced by construction: fortnights require months, weeks require fortnights. Achieves approximately 97% monthly determination rate with the full series (2012-2025). Strict fortnight and week rates are approximately 9% and 3% respectively, as they cannot leverage cross-quarter panel aggregation. Experimental strategies (probabilistic assignment and UPA (Primary Sampling Unit) aggregation) further improve these determination rates. The package provides adaptive hierarchical weight calibration (4/2/1 cell levels for month/fortnight/week) with period-specific smoothing to produce survey weights calibrated to SIDRA (Sistema IBGE de Recuperacao Automatica) population totals. Also includes a SIDRA mensalization module that converts 86+ official rolling quarter series from the IBGE SIDRA API (Application Programming Interface) into exact monthly estimates, without requiring access to microdata. Hecksher and Barbosa (2026) <https://osf.io/preprints/socarxiv/fra5u_v1>. |
| Authors: | Rogerio Barbosa [aut, cre] (ORCID: <https://orcid.org/0000-0002-6796-4547>, contribution: R package, dashboard, and website), Marcos Hecksher [aut] (ORCID: <https://orcid.org/0000-0003-2992-1252>, contribution: Mensalization methodology) |
| Maintainer: | Rogerio Barbosa <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.2 |
| Built: | 2026-06-03 06:41:26 UTC |
| Source: | https://github.com/antrologos/pnadcperiods |
Clears all cached SIDRA data (both rolling quarter series and population). Use this if you need to force a fresh download, for example after IBGE updates their data.
clear_sidra_cache()
Invisibly returns TRUE if any cache was cleared, or FALSE if all caches were already empty.
clear_sidra_cache()
For advanced users who want to compute custom starting points using their own calibrated microdata estimates.
compute_series_starting_points(
  monthly_estimates,
  rolling_quarters,
  calibration_start = NULL,
  calibration_end = NULL,
  scale_factor = 1000,
  use_series_specific_periods = TRUE,
  verbose = TRUE
)
monthly_estimates |
data.table with columns:
|
rolling_quarters |
data.table from |
calibration_start |
Integer. Start of calibration period (YYYYMM). Default NULL uses .PNADC_DATES$DEFAULT_CALIB_START (201301). Note: CNPJ series automatically use CNPJ_CALIB_START (201601) regardless. |
calibration_end |
Integer. End of calibration period (YYYYMM). Default NULL uses .PNADC_DATES$DEFAULT_CALIB_END (201912). |
scale_factor |
Numeric. Scale factor for z_ values (usually 1000). Default 1000. |
use_series_specific_periods |
Logical. If TRUE (default), use series-specific calibration periods for CNPJ series (201601-201912) and cumsum starting dates (201510). Set to FALSE to use uniform calibration for all series. |
verbose |
Logical. Print progress? Default TRUE. |
The starting points (y0) are computed by:
Calculating cumulative variations from SIDRA rolling quarters
Computing backprojection: e0 = z / scale_factor - cum
Averaging e0 by mesnotrim over the calibration period
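The three steps above can be illustrated with a minimal base R sketch. All numbers are made up, and the vectors `z` and `cum` stand in for one series of microdata aggregates and SIDRA cumulative variations; this is not the package's internal code.

```r
# Toy inputs for a single series (hypothetical values).
scale_factor <- 1000
months    <- c(201301, 201302, 201303, 201304, 201305, 201306)
mesnotrim <- ((months %% 100 - 1) %% 3) + 1         # 1, 2, 3, 1, 2, 3
z   <- c(90500, 91200, 90900, 90800, 91600, 91100)  # microdata aggregate
cum <- c(0.4, 0.9, 0.5, 0.7, 1.3, 0.7)              # cumulative variation (thousands)

# Backprojection: the starting level implied by each month.
e0 <- z / scale_factor - cum

# Average the implied levels by month position over the calibration window.
y0 <- tapply(e0, mesnotrim, mean)
y0
```

Averaging by month position yields one starting point per position (1, 2, 3), which matches the three-row-per-series shape of the returned table.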
data.table with columns:
Character. Series name
Integer. Month position (1, 2, or 3)
Numeric. Starting point value
When use_series_specific_periods = TRUE, the following series receive
special handling for series-specific data availability:
empregadorcomcnpj, empregadorsemcnpj, contapropriacomcnpj, contapropriasemcnpj use calibration period 201601-201912 and cumsum starts from 201510 (when V4019 became available)
## Not run: 
rq <- fetch_sidra_rolling_quarters()
z_agg <- compute_z_aggregates(calibrated_data)
y0 <- compute_series_starting_points(z_agg, rq)
monthly <- mensalize_sidra_series(rq, starting_points = y0)
## End(Not run)
Complete workflow to compute y0 starting points from raw PNADC microdata. This is a convenience wrapper that combines period identification, weight calibration, z_ aggregation, and starting point computation.
compute_starting_points_from_microdata(
  data,
  calibration_start = NULL,
  calibration_end = NULL,
  verbose = TRUE
)
data |
Stacked PNADC microdata (multiple quarters). Must contain variables
for period identification (see |
calibration_start |
Integer. Start of calibration period (YYYYMM). Default NULL uses .PNADC_DATES$DEFAULT_CALIB_START (201301). |
calibration_end |
Integer. End of calibration period (YYYYMM). Default NULL uses .PNADC_DATES$DEFAULT_CALIB_END (201912). |
verbose |
Print progress messages. |
This function performs the complete workflow:
Build crosswalk via pnadc_identify_periods()
Calibrate weights via pnadc_apply_periods()
Compute z_ aggregates via compute_z_aggregates()
Fetch SIDRA rolling quarters
Compute starting points via compute_series_starting_points()
data.table with columns:
Character. Series name
Integer. Month position (1, 2, or 3)
Numeric. Starting point value
All months are scaled uniformly to SIDRA monthly population totals.
pnadc_apply_periods for the weight calibration step
compute_z_aggregates for the z_ aggregation step
compute_series_starting_points for the y0 computation
pnadc_identify_periods for period identification
## Not run: 
stacked <- fst::read_fst("pnadc_stacked.fst", as.data.table = TRUE)
y0 <- compute_starting_points_from_microdata(stacked)
bundled <- pnadc_series_starting_points
comparison <- merge(y0, bundled, by = c("series_name", "mesnotrim"))
## End(Not run)
Computes monthly z_ aggregates from PNADC microdata using calibrated monthly weights.
compute_z_aggregates(calibrated_data, verbose = TRUE)
calibrated_data |
PNADC microdata output from |
verbose |
Print progress messages. |
This function creates z_ indicator variables and aggregates them using the
calibrated weight_monthly from pnadc_apply_periods().
The pnadc_apply_periods() function implements the calibration
methodology as follows:
All months are scaled uniformly to SIDRA monthly population totals.
This function simply aggregates the indicators using the already-calibrated weights.
data.table with columns:
Integer YYYYMM month
Numeric weighted aggregates for each series (one column per series)
pnadc_apply_periods for the calibration step
compute_series_starting_points for the y0 computation
## Not run: 
crosswalk <- pnadc_identify_periods(stacked_data)
calibrated <- pnadc_apply_periods(stacked_data, crosswalk,
                                  weight_var = "V1028",
                                  anchor = "quarter",
                                  calibration_unit = "month")
z_agg <- compute_z_aggregates(calibrated)
rq <- fetch_sidra_rolling_quarters()
y0 <- compute_series_starting_points(z_agg, rq)
## End(Not run)
Downloads population estimates from IBGE SIDRA API (table 6022) and transforms from moving-quarter to exact monthly values.
fetch_monthly_population(
  start_yyyymm = NULL,
  end_yyyymm = NULL,
  verbose = TRUE,
  use_cache = FALSE,
  cache_max_age_hours = 24
)
start_yyyymm |
Integer. First month to include (YYYYMM format). If NULL, returns all available months. |
end_yyyymm |
Integer. Last month to include (YYYYMM format). If NULL, returns all available months. |
verbose |
Logical. Print progress messages? Default TRUE. |
use_cache |
Logical. If TRUE, uses cached data if available and not expired. Default FALSE (always fetch fresh data for consistency). Set to TRUE for faster repeated calls during development. |
cache_max_age_hours |
Numeric. Maximum cache age in hours before automatic expiration when use_cache=TRUE. Default 24 hours. |
SIDRA table 6022 provides moving-quarter population estimates. Each value represents the 3-month average centered on the middle month. For example, the value for code 201203 (quarter ending March 2012) represents the population for February 2012.
This function:
Fetches raw moving-quarter data from SIDRA
Transforms to exact monthly values by aligning with middle months
Extrapolates boundary months (first and last) using quadratic regression
The extrapolation uses quadratic regression on population differences to estimate the first month (Jan 2012) and the most recent month.
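A rough base R sketch of the two ideas described here: shifting each moving-quarter code back one month to its middle month, and extrapolating one step with a quadratic fit on first differences. The helper `shift_yyyymm()` and the toy population values are hypothetical, and the exact regression the package uses may differ from this form.

```r
# Hypothetical helper: shift a YYYYMM code by a number of months.
shift_yyyymm <- function(yyyymm, by) {
  idx <- (yyyymm %/% 100) * 12 + (yyyymm %% 100 - 1) + by
  (idx %/% 12) * 100 + idx %% 12 + 1
}

# A moving quarter labeled by its end month refers to the middle month.
rq_code   <- c(201203, 201204, 201301)
ref_month <- shift_yyyymm(rq_code, -1)   # 201202, 201203, 201212

# Toy monthly population (millions); extrapolate the next month by fitting
# a quadratic in time to the first differences (assumed form).
pop <- c(197.00, 197.14, 197.29, 197.45, 197.62, 197.80)
d   <- diff(pop)
t   <- seq_along(d)
fit <- lm(d ~ t + I(t^2))
next_d   <- predict(fit, newdata = data.frame(t = length(d) + 1))
pop_next <- pop[length(pop)] + unname(next_d)
```

Fitting the differences rather than the levels keeps the extrapolated month consistent with the recent growth pattern; the same idea applies backward for the first month.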
A data.table with columns:
ref_month_yyyymm: Integer in YYYYMM format
m_populacao: Monthly population in thousands
Returns NULL invisibly with an informative message if the SIDRA
API is unreachable (per CRAN policy on Internet resources).
This function requires the sidrar package for API access.
Install with: install.packages("sidrar")
pnadc_apply_periods which uses this function when
calibrate = TRUE
pop <- fetch_monthly_population()
pop <- fetch_monthly_population(201301, 201912)
Downloads PNADC labor market indicators from IBGE's SIDRA API. These series are published as rolling quarterly averages (trimestre movel), with 12 observations per year.
fetch_sidra_rolling_quarters(
  series = "all",
  theme = NULL,
  theme_category = NULL,
  subcategory = NULL,
  exclude_derived = FALSE,
  use_cache = FALSE,
  verbose = TRUE,
  retry_failed = TRUE,
  max_retries = 3
)
series |
Character vector of series names to fetch, or "all" (default)
for all available series. Use |
theme |
Character vector of themes to filter by. Valid options: "labor_market", "earnings", "demographics", "social_protection", "prices". Use NULL for no filter. |
theme_category |
Character vector of theme categories to filter by. Use NULL for no filter. |
subcategory |
Character vector of subcategories to filter by. Use NULL for no filter. |
exclude_derived |
Logical. If TRUE, exclude series marked as derived (is_derived = TRUE in metadata). Default FALSE for backward compatibility. Derived series (rates) are computed from other series during mensalization, so excluding them saves API calls when fetching for mensalization. |
use_cache |
Logical. Use cached data if available? Default FALSE.
When TRUE, shows the date when data was cached (may be outdated).
Use |
verbose |
Logical. Print progress messages? Default TRUE. |
retry_failed |
Logical. Retry failed series downloads? Default TRUE. |
max_retries |
Integer. Maximum retry attempts per series. Default 3. |
Rolling quarters are labeled by their ending month:
201201 = Nov 2011 - Jan 2012 (mesnotrim = 1)
201202 = Dec 2011 - Feb 2012 (mesnotrim = 2)
201203 = Jan - Mar 2012 (mesnotrim = 3)
201204 = Feb - Apr 2012 (mesnotrim = 1)
etc.
The mesnotrim column gives the position (1, 2, or 3) of the rolling quarter's
ending month within its calendar quarter, which is essential for the mensalization algorithm.
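The labeling convention listed above can be reproduced with simple modular arithmetic. This is an illustrative sketch (the variable names are not taken from the package):

```r
# Rolling quarter codes are YYYYMM of the ending month.
rq_code   <- c(201201L, 201202L, 201203L, 201204L)
end_month <- rq_code %% 100L

# Position of the ending month within its calendar quarter.
mesnotrim <- (end_month - 1L) %% 3L + 1L
mesnotrim
```

For the four codes shown this gives 1, 2, 3, 1, matching the list above.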
A data.table with columns:
Integer. YYYYMM of rolling quarter end month
Integer. Month position in quarter (1, 2, or 3)
Numeric. One column per requested series
SIDRA API may have rate limits. The function includes automatic retry logic with exponential backoff for failed requests.
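As an illustration of retrying with exponential backoff in general, here is a minimal base R sketch. The function `with_retry()` and its `base_delay` argument are hypothetical and are not part of this package; the actual retry logic may be organized differently.

```r
# Hypothetical helper: call fun(), retrying on error with delays of
# base_delay * 2^attempt seconds between attempts.
with_retry <- function(fun, max_retries = 3, base_delay = 1) {
  for (attempt in 0:max_retries) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt < max_retries) Sys.sleep(base_delay * 2^attempt)
  }
  # Fail gracefully: message instead of error, NULL returned invisibly.
  message("All attempts failed: ", conditionMessage(result))
  invisible(NULL)
}

# Toy request that fails twice and then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "ok"
}
out <- with_retry(flaky, max_retries = 3, base_delay = 0)
```

Doubling the delay after each failure gives a rate-limited server progressively more time to recover while keeping the first retry fast.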
Per CRAN policy, this function fails gracefully when the SIDRA API is
unreachable: it emits an informative message() (no warning, no
error) and returns NULL invisibly when no series could be
fetched. Partial failures (some series succeeded) are also reported
via message(), and the result includes only the successful
series.
get_sidra_series_metadata for available series names and metadata
mensalize_sidra_series to convert to exact months
rq <- fetch_sidra_rolling_quarters(
  series = c("taxadesocup", "popocup", "popdesocup")
)
head(rq)
rq_labor <- fetch_sidra_rolling_quarters(theme = "labor_market")
Returns a data.table with metadata for all PNADC rolling quarter series available from IBGE's SIDRA API.
get_sidra_series_metadata(
  series = "all",
  theme = NULL,
  theme_category = NULL,
  subcategory = NULL,
  lang = "pt"
)
series |
Character vector of series names to retrieve, or "all" (default) for all series. |
theme |
Character vector of themes to filter by. Valid themes: "labor_market", "earnings", "demographics", "social_protection", "prices". Use NULL (default) for no filtering. |
theme_category |
Character vector of theme categories to filter by. Use NULL (default) for no filtering. |
subcategory |
Character vector of subcategories to filter by. Use NULL (default) for no filtering. |
lang |
Character. Language for descriptions: "pt" (Portuguese, default)
or "en" (English). When "en", the |
A data.table with columns:
Character. Internal name used in the package
Character. SIDRA API path for get_sidra()
Integer. SIDRA table number
Integer. SIDRA variable code
Character. Top-level theme
Character. Middle-level category within theme
Character. Optional subcategory for filtering
Character. Portuguese description
Character. English description
Character. Description in the requested language
Character. Unit of measurement
Character. Unit label in Portuguese
Character. Unit label in English
Logical. TRUE if computed from other series
Logical. TRUE if needs IPCA deflation
meta <- get_sidra_series_metadata()
head(meta)
labor <- get_sidra_series_metadata(theme = "labor_market")
unemp <- get_sidra_series_metadata(theme = "labor_market",
                                   theme_category = "unemployment")
meta_en <- get_sidra_series_metadata(series = c("taxadesocup", "popocup"),
                                     lang = "en")
Transforms SIDRA rolling quarterly averages into exact monthly values using the mathematical relationship between consecutive rolling quarters.
mensalize_sidra_series(
  rolling_quarters,
  starting_points = NULL,
  series = "all",
  compute_derived = TRUE,
  verbose = TRUE
)
rolling_quarters |
data.table from |
starting_points |
Optional data.table with precomputed starting points
(y0 values). If NULL (default), uses bundled |
series |
Character vector of series names to mensalize, or "all" (default) for all series in the input data (except price indices). |
compute_derived |
Logical. Compute derived series (rates, aggregates)? Default TRUE. |
verbose |
Logical. Print progress messages? Default TRUE. |
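The algorithm exploits a mathematical property of rolling quarterly averages. Each rolling quarter is the mean of three consecutive months, RQ_t = (y_t + y_(t-1) + y_(t-2)) / 3, so consecutive rolling quarters satisfy 3 * (RQ_t - RQ_(t-1)) = y_t - y_(t-3).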
This means exact 3-month variations can be extracted from consecutive rolling quarters. By accumulating these variations separately for each month-position (1, 2, or 3), we build cumulative variation series. The only unknown is the starting level for Jan, Feb, and Mar 2012.
Starting points are estimated by:
Computing monthly estimates from calibrated microdata (z_ variables)
Calculating cumulative variations from SIDRA (cum_ variables)
Backprojecting: e0 = z_ - cum_ over calibration period (2013-2019)
Averaging e0 by month position to get y0_ for each position
Final adjustment ensures the average of 3 consecutive mensalized values equals the original rolling quarter value.
A data.table with columns:
Integer. YYYYMM exact month
Numeric. Mensalized value for each series (one column per series)
If providing custom starting points, the data.table must have columns:
series_name: Character. Series name matching rolling_quarters columns
mesnotrim: Integer (1, 2, or 3). Month position in quarter
y0: Numeric. Starting point value
The mensalization algorithm proceeds in steps:
Calculate d3 = 3 * (RQ_t - RQ_t-1)
Separate d3 by month position: d3m1, d3m2, d3m3
Cumulate separately: cum1, cum2, cum3
Apply starting points: y = y0 + cum
Final adjustment for rolling quarter consistency
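Steps 1 to 4 can be checked on synthetic data with a short base R sketch. It assumes the true starting levels are known, so the final adjustment step is not needed and is not shown; the variable names are illustrative, not the package's internals.

```r
set.seed(1)
n <- 24
y_true <- 100 + cumsum(rnorm(n))   # hypothetical exact monthly series

# Rolling quarter: mean of the current and the two previous months.
rq <- rep(NA_real_, n)
for (t in 3:n) rq[t] <- mean(y_true[(t - 2):t])

# Step 1: d3_t = 3 * (RQ_t - RQ_(t-1)), which equals y_t - y_(t-3).
d3 <- c(NA, 3 * diff(rq))
d3[is.na(d3)] <- 0                 # no variation available for months 1 to 3

# Steps 2-3: split by month position and cumulate within each position.
pos <- (seq_len(n) - 1) %% 3 + 1
cum <- ave(d3, pos, FUN = cumsum)

# Step 4: add the starting level of each position (known exactly here).
y0    <- y_true[1:3]
y_hat <- y0[pos] + cum

max(abs(y_hat - y_true))           # numerically zero
```

With exact starting points the monthly series is recovered up to floating-point error; with estimated starting points, a residual remains, which is what the final consistency adjustment addresses.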
fetch_sidra_rolling_quarters to obtain input data
compute_series_starting_points for custom calibration
rq <- fetch_sidra_rolling_quarters(
  series = c("taxadesocup", "popocup", "popdesocup")
)
monthly <- mensalize_sidra_series(rq)
head(monthly)
This function takes a crosswalk from pnadc_identify_periods and
applies it to any PNADC dataset (quarterly or annual). It can optionally
calibrate the survey weights to match external population totals at the
chosen temporal granularity (month, fortnight, or week).
pnadc_apply_periods(
  data,
  crosswalk,
  weight_var,
  anchor,
  calibrate = TRUE,
  calibration_unit = c("month", "fortnight", "week"),
  calibration_min_cell_size = 1,
  target_totals = NULL,
  smooth = FALSE,
  keep_all = TRUE,
  verbose = TRUE
)
data |
A data.frame or data.table with PNADC microdata. Must contain
join keys |
crosswalk |
A data.table crosswalk from |
weight_var |
Character. Name of the survey weight column. Must be specified:
|
anchor |
Character. How to anchor the weight redistribution. Must be specified:
|
calibrate |
Logical. If TRUE (default), calibrate weights to external population totals. If FALSE, only merge the crosswalk without calibration. |
calibration_unit |
Character. Temporal unit for weight calibration.
One of |
calibration_min_cell_size |
Integer. Minimum sample size required in a cell for it to be used in hierarchical raking. Cells smaller than this threshold are collapsed to coarser levels. Default: 1 (use all cells). |
target_totals |
Optional data.table with population targets. If NULL (default), fetches monthly population from SIDRA and derives targets for fortnight/week. Each time period (month, fortnight, or week) is calibrated to the FULL Brazilian population from SIDRA. If providing custom targets, the population column ( |
smooth |
Logical. If TRUE, smooth calibrated weights to remove quarterly artifacts. Smoothing is adapted per time period: monthly (3-period window), fortnight (7-period window), weekly (no smoothing). Default: FALSE. |
keep_all |
Logical. If TRUE (default), keep all observations including those with undetermined reference periods. If FALSE, drop undetermined rows. |
verbose |
Logical. If TRUE (default), print progress messages. |
Merges a reference period crosswalk with PNADC microdata and optionally calibrates survey weights for sub-quarterly analysis.
When calibrate = TRUE, the function performs hierarchical rake weighting:
Groups observations by nested demographic/geographic cells
Iteratively adjusts weights so sub-period totals match anchor-period totals
Calibrates final weights against external population totals (FULL Brazilian population)
Optionally smooths weights to remove quarterly artifacts
All time periods (months, fortnights, and weeks) are calibrated to the FULL Brazilian population from SIDRA. This means:
Monthly weights sum to the Brazilian population for that month
Fortnight weights sum to the Brazilian population for the containing month
Weekly weights sum to the Brazilian population for the containing month
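The final scaling to population totals can be pictured with a toy base R example. This sketch shows only a single proportional adjustment per month with made-up numbers; it leaves out the hierarchical raking over demographic and geographic cells that the function performs first.

```r
# Toy weights and identified reference months (hypothetical values).
w     <- c(1200, 800, 1500, 500, 1000)
month <- c(201301, 201301, 201302, 201302, 201302)

# Toy external population targets per month.
target <- c("201301" = 4000, "201302" = 6000)

# Scale weights so each month's total matches its target.
current <- tapply(w, month, sum)
key     <- as.character(month)
adj     <- as.numeric(target[key] / current[key])
w_cal   <- w * adj

tapply(w_cal, month, sum)   # 4000 and 6000
```

After scaling, every month's weights sum to the full target, which is the property the list above states for months, fortnights, and weeks.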
The number of hierarchical cell levels is automatically adjusted based on the calibration unit to avoid sparse cell issues:
"month": 4 levels (age, region, state, post-stratum) - full hierarchy
"fortnight": 2 levels (age, region) - simplified for lower sample size
"week": 1 level (age groups only) - minimal hierarchy for sparse data
The anchor parameter determines how weights are redistributed:
"quarter": Quarterly totals are preserved and redistributed to months/fortnights/weeks
"year": Yearly totals are preserved and redistributed to months/fortnights/weeks
Use anchor = "quarter" with quarterly V1028 weights, and
anchor = "year" with annual V1032 weights.
A data.table with the input data plus crosswalk columns:
Month position (1-3 in quarter, 1-12 in year)
Fortnight position (1-2 in month, 1-6 in quarter)
Week position (1-4 in month, 1-12 in quarter)
Integer period codes
Logical determination flags
Calibrated weights (if calibrate=TRUE)
pnadc_identify_periods to build the crosswalk
## Not run: 
crosswalk <- pnadc_identify_periods(pnadc_stacked)
result <- pnadc_apply_periods(
  pnadc_2023, crosswalk,
  weight_var = "V1028",
  anchor = "quarter"
)
result <- pnadc_apply_periods(
  pnadc_annual, crosswalk,
  weight_var = "V1032",
  anchor = "year"
)
result <- pnadc_apply_periods(
  pnadc_2023, crosswalk,
  weight_var = "V1028",
  anchor = "quarter",
  calibration_unit = "week"
)
result <- pnadc_apply_periods(
  pnadc_2023, crosswalk,
  weight_var = "V1028",
  anchor = "quarter",
  calibrate = FALSE
)
## End(Not run)
Three experimental strategies are available, all properly nested by period:
probabilistic: For narrow ranges (2 possible periods), classifies based on where most of the date interval falls. Assigns only when confidence exceeds threshold.
upa_aggregation: Extends strictly identified periods to other observations in the same UPA-V1014 within the quarter, if a sufficient proportion already have strict identification.
both: Sequentially applies probabilistic strategy first, then UPA aggregation on top. Guarantees identification rate >= max of individual strategies.
pnadc_experimental_periods(
  crosswalk,
  strategy = c("probabilistic", "upa_aggregation", "both"),
  confidence_threshold = 0.9,
  upa_proportion_threshold = 0.5,
  verbose = TRUE
)
crosswalk |
A crosswalk data.table from |
strategy |
Character specifying which strategy to apply. Options: "probabilistic", "upa_aggregation", "both" |
confidence_threshold |
Numeric (0-1). Minimum confidence required to assign a probabilistic period. Used by probabilistic and combined strategies. Default 0.9. |
upa_proportion_threshold |
Numeric (0-1). Minimum proportion of UPA observations (within quarter) that must have strict identification with consensus for extending to unidentified observations. Default 0.5. |
verbose |
Logical. If TRUE, print progress information. |
Provides experimental strategies for improving period identification rates beyond the standard deterministic algorithm. All strategies respect the nested identification hierarchy: weeks require fortnights, fortnights require months.
All strategies enforce proper nesting:
Fortnights can only be assigned if month is identified (strictly OR experimentally)
Weeks can only be assigned if fortnight is identified (strictly OR experimentally)
For each period type (processed in order: months, then fortnights, then weeks):
Check that the required parent period is identified
If bounds are narrowed to exactly 2 sequential periods, calculate which period contains most of the date interval
Calculate confidence based on the proportion of interval in the likely period (0-1)
Only assign if confidence >= confidence_threshold
For months: aggregates at UPA-V1014 level across all quarters (like strict algorithm)
For fortnights and weeks: works at household level within quarter
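The confidence rule for the probabilistic strategy can be sketched in base R as the share of the possible date interval that falls in the more likely of two adjacent periods. The function `classify_two_periods()` and the boundary date are hypothetical; the package's actual computation may count days differently.

```r
# Hypothetical helper: given the range of possible dates and the first day
# of the second period, return the confidence and the assigned period.
classify_two_periods <- function(date_min, date_max, boundary, threshold = 0.9) {
  days        <- seq(date_min, date_max, by = "day")
  share_first <- mean(days < boundary)             # share in the first period
  confidence  <- max(share_first, 1 - share_first)
  likely      <- if (share_first >= 0.5) 1L else 2L
  list(confidence = confidence,
       period = if (confidence >= threshold) likely else NA_integer_)
}

# Interval split almost evenly: not assigned.
a <- classify_two_periods(as.Date("2023-01-20"), as.Date("2023-02-08"),
                          boundary = as.Date("2023-01-29"))
# Interval almost entirely in the first period: assigned.
b <- classify_two_periods(as.Date("2023-01-10"), as.Date("2023-01-29"),
                          boundary = as.Date("2023-01-29"))
```

A higher `threshold` trades coverage for fewer misclassified observations, which is the role of `confidence_threshold` described above.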
Extends strictly identified periods based on consensus within geographic groups:
Months: Uses UPA level within quarter
Fortnights/Weeks: Uses UPA level within quarter (all households in same UPA are interviewed in same fortnight/week within a quarter)
Calculate proportion of observations with strictly identified period
If proportion >= upa_proportion_threshold AND consensus exists, extend
Apply in nested order: months first, then fortnights, then weeks
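The consensus rule can be illustrated with a small base R sketch for a single period type. The function `extend_by_consensus()` and the toy data are hypothetical; the grouping key here is a plain UPA label within one quarter.

```r
# Hypothetical helper: within one group, fill undetermined values when
# enough observations are strictly identified and they all agree.
extend_by_consensus <- function(ref, threshold = 0.5) {
  known     <- ref[!is.na(ref)]
  prop      <- length(known) / length(ref)
  consensus <- length(unique(known)) == 1
  if (prop >= threshold && consensus) ref[is.na(ref)] <- known[1]
  ref
}

upa       <- c("A", "A", "A", "A", "B", "B", "B", "C", "C")
ref_month <- c(2L, 2L, NA, 2L, 1L, 3L, NA, NA, NA)

# Apply the rule group by group.
extended <- ave(ref_month, upa, FUN = extend_by_consensus)
extended
```

Group A is extended (75% identified, all agreeing on month 2), group B is left untouched because its identified values disagree, and group C has nothing to extend from.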
Sequentially applies both strategies to maximize identification:
First, apply the probabilistic strategy (captures observations with narrow date ranges and high confidence)
Then, apply UPA aggregation (extends based on strict consensus within UPA/UPA-V1014 groups)
This guarantees that "both" identifies at least as many observations as either individual strategy alone. The strategies operate independently (UPA aggregation considers only strict identifications), so the result is the union of both strategies.
The output can be passed directly to pnadc_apply_periods() for weight calibration.
The derived columns combine strict and experimental assignments, with strict taking priority. Use the
probabilistic_assignment flag to filter if you only want strict determinations.
A modified crosswalk with additional columns. Output is directly compatible
with pnadc_apply_periods():
ref_month_in_quarter, ref_month_in_year, ref_month_yyyymm:
Month position (combined strict + experimental, strict takes priority)
ref_fortnight_in_month, ref_fortnight_in_quarter, ref_fortnight_yyyyff:
Fortnight position (combined strict + experimental)
ref_week_in_month, ref_week_in_quarter, ref_week_yyyyww:
Week position (combined strict + experimental)
determined_month, determined_fortnight, determined_week:
TRUE if period is assigned (strictly or experimentally)
determined_probable_month, determined_probable_fortnight,
determined_probable_week: TRUE if period was assigned by probabilistic strategy
probabilistic_assignment: TRUE if any period was assigned experimentally
(vs strictly deterministic)
week_1_start, week_1_end, ..., week_4_start, week_4_end:
IBGE week boundaries for the assigned month
These strategies produce "experimental" assignments, not strict determinations.
The standard pnadc_identify_periods() function should be used for
rigorous analysis. Experimental outputs are useful for:
Sensitivity analysis
Robustness checks
Research into identification algorithm improvements
pnadc_identify_periods to build the crosswalk that this function modifies.
pnadc_apply_periods to apply period crosswalk and calibrate weights.
## Not run: 
crosswalk <- pnadc_identify_periods(pnadc_data)
crosswalk_exp <- pnadc_experimental_periods(
  crosswalk,
  strategy = "probabilistic",
  confidence_threshold = 0.9
)
crosswalk_exp[, .(
  strict = sum(!is.na(ref_month_in_quarter) & !probabilistic_assignment),
  experimental = sum(probabilistic_assignment, na.rm = TRUE),
  total = sum(determined_month)
)]
result <- pnadc_apply_periods(pnadc_data, crosswalk_exp,
                              weight_var = "V1028", anchor = "quarter")
strict_only <- crosswalk_exp[
  probabilistic_assignment == FALSE | is.na(probabilistic_assignment)
]
## End(Not run)
PNADC is a quarterly survey, but each interview actually refers to a specific temporal period within the quarter. This function identifies which month, fortnight (quinzena), and week each observation belongs to, enabling sub-quarterly time series analysis.
The algorithm uses a nested identification approach:
Phase 1: Identify MONTHS for all observations using:
IBGE's reference week timing rules (first reference week – ending in a Saturday – with sufficient days)
Respondent birthdates to constrain possible interview dates
UPA-panel level aggregation across ALL quarters (panel design)
Dynamic exception detection (identifies quarters needing relaxed rules)
Phase 2: Identify FORTNIGHTS for month-determined observations:
Search space constrained to 2 fortnights within determined month
Household-level aggregation within each quarter
Phase 3: Identify WEEKS for fortnight-determined observations:
Search space constrained to ~2 weeks within determined fortnight
Household-level aggregation within each quarter
pnadc_identify_periods(data, verbose = TRUE, store_date_bounds = FALSE)
data |
A data.frame or data.table with PNADC microdata. Required columns:
Optional but recommended:
|
verbose |
Logical. If TRUE (default), display progress information. |
store_date_bounds |
Logical. If TRUE, stores date bounds and exception
flags in the crosswalk for optimization when calling
|
Builds a crosswalk containing reference periods (month, fortnight, and week) for PNADC survey data based on IBGE's interview timing rules.
The crosswalk contains three levels of temporal granularity:
Month: 3 per quarter, ~97% determination rate
Fortnight (quinzena): 6 per quarter, ~9% determination rate
Week: 12 per quarter, ~3% determination rate
For optimal month determination rates, input data should be stacked across multiple quarters (ideally 4+ years). The algorithm leverages PNADC's rotating panel design where the same UPA-V1014 is interviewed in the same relative position across quarterly visits.
Fortnights are numbered 1-6 per quarter (2 per month), based on the IBGE reference week calendar (not calendar days). Each IBGE "month" consists of exactly 4 reference weeks (28 days), starting on a Sunday:
Fortnight 1 in month: IBGE weeks 1-2 (days 1-14 of the IBGE month)
Fortnight 2 in month: IBGE weeks 3-4 (days 15-28 of the IBGE month)
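Because every IBGE month has exactly 4 reference weeks, the position columns are related by simple arithmetic. The base R sketch below shows one way to derive them from a month and week position; it illustrates the numbering scheme and is an assumption about how the positions relate, not the package's code.

```r
# Toy positions for four observations.
month_in_quarter <- c(1L, 2L, 3L, 3L)
week_in_month    <- c(1L, 3L, 2L, 4L)

# Weeks 1-2 form fortnight 1, weeks 3-4 form fortnight 2.
fortnight_in_month <- (week_in_month + 1L) %/% 2L

# Positions within the quarter: 2 fortnights and 4 weeks per month.
fortnight_in_quarter <- (month_in_quarter - 1L) * 2L + fortnight_in_month
week_in_quarter      <- (month_in_quarter - 1L) * 4L + week_in_month
```

The results stay within the documented ranges: 1 to 6 for fortnights in a quarter and 1 to 12 for weeks in a quarter.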
A data.table crosswalk with columns:
Join keys (year, quarter, UPA, household, panel)
Integer. Month position in quarter (1, 2, 3) or NA
Integer. Month position in year (1-12) or NA
Integer. Fortnight position in month (1 or 2) or NA
Integer. Fortnight position in quarter (1-6) or NA
Integer. Week position in month (1-4) or NA
Integer. Week position in quarter (1-12) or NA
Date. Lower bound of the interview reference date for the individual. Only returned if store_date_bounds = TRUE
Date. Upper bound of the interview reference date for the individual. Only returned if store_date_bounds = TRUE
Date. Sunday of the IBGE first reference week of the month. Only returned if store_date_bounds = TRUE
Date. Saturday of the IBGE first reference week of the month. Only returned if store_date_bounds = TRUE
Date. Sunday of the IBGE second reference week of the month. Only returned if store_date_bounds = TRUE
Date. Saturday of the IBGE second reference week of the month. Only returned if store_date_bounds = TRUE
Date. Sunday of the IBGE third reference week of the month. Only returned if store_date_bounds = TRUE
Date. Saturday of the IBGE third reference week of the month. Only returned if store_date_bounds = TRUE
Date. Sunday of the IBGE fourth reference week of the month. Only returned if store_date_bounds = TRUE
Date. Saturday of the IBGE fourth reference week of the month. Only returned if store_date_bounds = TRUE
Integer. Maximum month position across UPA-V1014 group (for debugging). Only returned if store_date_bounds = TRUE
Integer. Minimum month position across UPA-V1014 group (for debugging). Only returned if store_date_bounds = TRUE
Integer. Maximum fortnight position within household (for debugging). Only returned if store_date_bounds = TRUE
Integer. Minimum fortnight position within household (for debugging). Only returned if store_date_bounds = TRUE
Integer. Minimum week position within household (for debugging). Only returned if store_date_bounds = TRUE
Integer. Maximum week position within household (for debugging). Only returned if store_date_bounds = TRUE
Integer. Identified reference month in the format YYYYMM, where MM follows the IBGE calendar. 1 <= MM <= 12
Integer. Identified reference fortnight in the format YYYYFF, where FF follows the IBGE calendar. 1 <= FF <= 24
Integer. Identified reference week in the format YYYYWW, where WW follows the IBGE calendar. 1 <= WW <= 48
Logical. Flags if the month was determined.
Logical. Flags if the fortnight was determined.
Logical. Flags if the week was determined.
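As a rough illustration of the YYYYMM, YYYYFF, and YYYYWW codes, the sketch below composes them from a year, a quarter, and the within-quarter and within-month positions. The composition rule is an assumption chosen to match the stated ranges (12 months, 24 fortnights, 48 weeks per year); `encode_period` is a hypothetical helper, not a package function.

```r
# Hypothetical helper: builds period codes from position components.
# Assumes FF = 2 fortnights per IBGE month and WW = 4 weeks per IBGE month.
encode_period <- function(year, quarter, month_in_quarter,
                          fortnight_in_month = NA, week_in_month = NA) {
  month_in_year <- (quarter - 1L) * 3L + month_in_quarter  # 1-12
  list(
    yyyymm = year * 100L + month_in_year,
    yyyyff = year * 100L + (month_in_year - 1L) * 2L + fortnight_in_month,
    yyyyww = year * 100L + (month_in_year - 1L) * 4L + week_in_month
  )
}

# First month of Q2 2023, second fortnight, third week of the month
codes <- encode_period(2023L, 2L, 1L, fortnight_in_month = 2L, week_in_month = 3L)
str(codes)
```

Leaving `fortnight_in_month` or `week_in_month` as NA propagates NA to the corresponding code, which mirrors how undetermined periods are reported as NA.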
The algorithm enforces strict nesting by construction:
Fortnights can ONLY be identified for observations with determined months
Weeks can ONLY be identified for observations with determined fortnights
This guarantees: determined_week => determined_fortnight => determined_month
The crosswalk aggregates at different levels:
Months: UPA-V1014 level across ALL quarters (PNADC panel design ensures same month position)
Fortnights: Household level within quarter only
Weeks: Household level within quarter only
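One way to picture the month-level aggregation is as an intersection of candidate ranges. The sketch below assumes (this is not spelled out in the text above) that each interview yields a minimum and a maximum feasible month position, and that a UPA-V1014 group's month is determined when the intersection across all of its quarters collapses to a single position. Column names and toy values are hypothetical.

```r
# Toy candidate ranges: one row per interview of a UPA-V1014 group in a quarter
obs <- data.frame(
  group     = c("A", "A", "A", "B", "B"),
  quarter   = c("2022Q1", "2022Q2", "2022Q3", "2022Q1", "2022Q2"),
  month_min = c(1L, 2L, 1L, 1L, 1L),
  month_max = c(3L, 2L, 2L, 2L, 3L)
)

# Intersect ranges across ALL quarters of each group:
# the tightest lower bound is the largest minimum,
# the tightest upper bound is the smallest maximum.
lower <- sapply(split(obs$month_min, obs$group), max)
upper <- sapply(split(obs$month_max, obs$group), min)

determined       <- lower == upper
month_in_quarter <- ifelse(determined, lower, NA_integer_)

print(data.frame(lower, upper, determined, month_in_quarter))
```

In this toy data, group A is pinned down only because its second quarter narrows the range; a single quarter alone would leave it undetermined, which is why stacking many quarters raises the monthly determination rate.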
pnadc_apply_periods to apply the crosswalk and
calibrate weights
## Not run:
crosswalk <- pnadc_identify_periods(pnadc_stacked)

# Determination rate at each level
crosswalk[, .(
  month_rate = mean(determined_month),
  fortnight_rate = mean(determined_fortnight),
  week_rate = mean(determined_week)
)]

# Nesting checks
crosswalk[determined_fortnight, all(determined_month)]
crosswalk[determined_week, all(determined_fortnight)]

result <- pnadc_apply_periods(pnadc_2023, crosswalk, weight_var = "V1028", anchor = "quarter")
## End(Not run)
Pre-computed starting point values (y0) for mensalizing IBGE's rolling quarterly series into exact monthly estimates.
data(pnadc_series_starting_points)
A data.table with 159 rows and 3 columns:
Character. Name of the SIDRA series (53 series)
Integer. Month position in quarter (1, 2, or 3)
Numeric. Starting point value for this series and position
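To see why one starting point is needed per series and per month position, consider the identity behind rolling-quarter averages: if R_t is the mean of months t-2, t-1, and t, then 3 * (R_t - R_{t-1}) = m_t - m_{t-3}. Months that share a position in the quarter are therefore linked by cumulative sums, and each of the three chains needs its own initial value. The base R sketch below demonstrates the identity on made-up numbers; it is not the package's calibration procedure, and the true first three months simply stand in for y0.

```r
# Toy monthly series (made-up values, for illustration only)
monthly <- c(100, 102, 101, 105, 107, 104, 110, 111, 108)
n <- length(monthly)

# Rolling quarter ending in month t = mean of months t-2, t-1, t (t = 3..n)
rolling <- sapply(3:n, function(t) mean(monthly[(t - 2):t]))

# 3 * (R_t - R_{t-1}) = m_t - m_{t-3}, available for t = 4..n
d3 <- 3 * diff(rolling)

# Rebuild the monthly series from three starting values plus the differences
recovered <- numeric(n)
recovered[1:3] <- monthly[1:3]  # stand-in for the three starting points (y0)
for (t in 4:n) {
  recovered[t] <- recovered[t - 3] + d3[t - 3]
}

print(round(recovered, 6))
```

The rolling series alone fixes only the changes within each chain, so any error in a starting value shifts every later month at that position by the same amount.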
These starting points were computed from PNADC microdata using the full
R package pipeline, ensuring consistency with
compute_starting_points_from_microdata:
Weight calibration via pnadc_apply_periods: all months
scaled to SIDRA monthly population totals
z_ aggregates computed via compute_z_aggregates using
calibrated weight_monthly
Starting points computed via compute_series_starting_points
with CNPJ-aware calibration periods
The calibration period (2013-2019) was chosen because:
It includes stable pre-pandemic data
IBGE methodology was consistent during this period
Sufficient observations for reliable estimates
CNPJ series (empregadorcomcnpj, empregadorsemcnpj, contapropriacomcnpj, contapropriasemcnpj) use calibration period 2016-2019 with cumulative sum starting from October 2015 due to V4019 variable availability.
The bundled starting points are generated using the same pipeline as
compute_starting_points_from_microdata, ensuring that users
who compute custom starting points will get consistent results.
The bundled starting points are suitable for most users. Consider computing
custom starting points with compute_starting_points_from_microdata if:
IBGE makes major methodological changes to the PNADC
You need series not included in the bundled data
You want to use a different calibration period
You are working with updated or different microdata
Computed using data-raw/regenerate_starting_points_from_microdata.R
mensalize_sidra_series which uses this data by default
compute_series_starting_points for custom calibration
data(pnadc_series_starting_points)
head(pnadc_series_starting_points)
unique(pnadc_series_starting_points$series_name)
Checks that input data has required columns for the specified processing.
validate_pnadc(data, check_weights = FALSE, stop_on_error = TRUE)
data |
A data.frame or data.table with PNADC microdata |
check_weights |
Logical. If TRUE, also check for weight-related variables. |
stop_on_error |
Logical. If TRUE, stops with an error when validation fails. If FALSE, returns a validation report list instead. |
The function performs the following validations:
Checks for required columns for reference period identification:
Ano, Trimestre, UPA, V1008, V1014,
V2008, V20081, V20082, V2009
Validates year range (2012-2100 for PNADC coverage)
Validates quarter values (must be 1-4)
Validates birth day values (must be 1-31 or 99 for unknown)
Validates birth month values (must be 1-12 or 99 for unknown)
Warns about unusual ages (outside 0-130 range)
If check_weights = TRUE, also validates weight-related columns:
V1028, UF, posest, posest_sxi
If stop_on_error = TRUE, returns invisibly when the data are valid and stops with an error otherwise.
If stop_on_error = FALSE, returns a list with:
valid: Logical indicating if data passed all validations
issues: Named list of validation issues found (empty if none)
n_rows: Number of rows in input data
n_cols: Number of columns in input data
join_keys_available: Character vector of available join key columns
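The checks and the report structure described above can be approximated in a few lines of base R. This is a simplified sketch, not the package's implementation: `sketch_validate` is a hypothetical function, the issue names are invented for illustration, and the age warning, weight checks, and `join_keys_available` element are omitted.

```r
# Hypothetical stand-in for the non-stop validation mode
sketch_validate <- function(data) {
  required <- c("Ano", "Trimestre", "UPA", "V1008", "V1014",
                "V2008", "V20081", "V20082", "V2009")
  issues <- list()

  # 1. Required columns
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0) {
    issues$missing_columns <- missing_cols
  }

  # 2. Range checks, run only for columns that are present
  allowed <- list(
    Ano       = 2012:2100,
    Trimestre = 1:4,
    V2008     = c(1:31, 99),  # birth day, 99 = unknown
    V20081    = c(1:12, 99)   # birth month, 99 = unknown
  )
  for (col in intersect(names(allowed), names(data))) {
    bad <- !(data[[col]] %in% allowed[[col]])  # NA counts as invalid here
    if (any(bad)) {
      issues[[paste0("invalid_", col)]] <- sum(bad)
    }
  }

  list(valid = length(issues) == 0, issues = issues,
       n_rows = nrow(data), n_cols = ncol(data))
}

incomplete <- data.frame(Ano = 2023L, Trimestre = 5L)
report <- sketch_validate(incomplete)
report$valid
names(report$issues)
```

Collecting issues in a named list rather than stopping at the first failure lets a caller see every problem in one pass, which is the behaviour the stop_on_error = FALSE mode is described as providing.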
pnadc_identify_periods which calls this function
internally to validate input data.
# Minimal valid data (all 9 required columns)
sample_data <- data.frame(
  Ano = 2023L, Trimestre = 1L, UPA = 110000001L,
  V1008 = 1L, V1014 = 1L,
  V2008 = 15L, V20081 = 3L, V20082 = 1990L, V2009 = 33L
)
validate_pnadc(sample_data)

# Data with missing columns returns issues (non-stop mode)
incomplete_data <- data.frame(Ano = 2023L, Trimestre = 1L)
result <- validate_pnadc(incomplete_data, stop_on_error = FALSE)
result$valid   # FALSE
result$issues  # lists missing columns