dateutils is an R package for conveniently working with time series data in tabular
format, that is, without using one of the many R time series formats such
as ts or xts. dateutils is built on top of data.table.
The main functions of dateutils are for aggregating and formatting time series data, including mixed frequency data. This latter functionality is particularly useful as it allows the incorporation of mixed frequency data into standard econometric and machine learning models, without the hassle of explicit bridge equations.
To give a few examples of using dateutils, we will
use the built-in data.table named fred,
containing daily (t bill spreads), weekly (initial jobless claims),
monthly (advance retail sales), and quarterly (GDP) data in long format.
To view the data, use View(fred). We can begin by aggregating these data
to the lowest frequency, quarterly in this case.
library(dateutils)
library(data.table)
fred_quarterly <- agg_to_freq(fred, frq = "quarter")
print(fred_quarterly[1:5])
## ref_date series_name value n_obs
## <Date> <char> <num> <int>
## 1: 2000-03-31 advance retail sales 246798.0 3
## 2: 2000-06-30 advance retail sales 246730.7 3
## 3: 2000-09-30 advance retail sales 248863.0 3
## 4: 2000-12-31 advance retail sales 250736.7 3
## 5: 2001-03-31 advance retail sales 251895.3 3
Note that dates are indexed to the end of the quarter.
agg_to_freq() also includes the column n_obs,
giving the number of observations used to calculate mean
values for the quarter. We could, alternatively, create a mixed
frequency data set by aggregating to monthly, as sketched below. If we
wanted the data in wide format, we could then use the data.table function
dcast(), or simply call agg_to_freq_wide().
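As a minimal sketch (assuming agg_to_freq() also accepts frq = "month"), the monthly wide data shown below could be produced with:

fred_monthly <- agg_to_freq(fred, frq = "month")
fred_monthly_wide <- dcast(fred_monthly, ref_date ~ series_name, value.var = "value")
print(fred_monthly_wide[1:6])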
## Key: <ref_date>
## ref_date advance retail sales gdp constant prices initial jobless claims
## <Date> <num> <num> <num>
## 1: 2000-01-31 243436 NA 289000
## 2: 2000-02-29 247133 NA 293750
## 3: 2000-03-31 249825 12924.18 273000
## 4: 2000-04-30 245831 NA 273000
## 5: 2000-05-31 246201 NA 282250
## 6: 2000-06-30 248160 13160.84 289000
## t bill spread 10y 3m
## <num>
## 1: 1.1620000
## 2: 0.7925000
## 3: 0.3926087
## 4: 0.1689474
## 5: 0.4459091
## 6: 0.2354545
dateutils works with the package
seasonal to seasonally adjust monthly and quarterly
data in data.table format. For example, suppose we want
to seasonally adjust GDP and advance retail sales in our quarterly
dataset. We can do so as follows:
fred_sa <- seas_df_long(fred_quarterly,
                        sa_names = c("gdp constant prices", "advance retail sales"),
                        transfunc = 'auto')
## Model used in SEATS is different: (0 2 2)(0 0 1)
gdp <- rbind(fred_quarterly[series_name == "gdp constant prices", .(ref_date, series_name, value)],
fred_sa$values_sa[series_name == "gdp constant prices sa"])
gdp <- dcast(gdp, ref_date ~ series_name, value.var = "value")
matplot(gdp$ref_date, gdp[,-1,with=FALSE], type = 'l')
Of course, these data are already seasonally adjusted, so adjusting again makes little difference.
dateutils is built around handling mixed frequency
data. In particular, the package allows you to easily build mixed
frequency models incorporating all available information to date, without
worrying about bridge equations. We accomplish this by identifying the
pattern of missing observations in the tail of the data and replicating
that pattern in the historical data, to generate contemporaneous right
hand side (RHS) variables as well as the usual lagged RHS variables.
The resulting data set is uniform frequency, allowing the use of any of
the standard statistical models available in R. We can illustrate this
process again using the built-in data fred. We will
construct a nowcast of GDP, the lowest frequency variable in the data,
using the other three series: monthly, weekly, and daily. To
begin, we can look at the tail of each of the four series.
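The inspection chunks themselves are not shown; as a sketch, the most recent rows of each series can be pulled with data.table's tail idiom, for example

fred[series_name == "gdp constant prices", tail(.SD, 2)]

and analogously for the monthly, weekly, and daily series.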
## series_name ref_date value country pub_date pub_lag
## <char> <Date> <num> <char> <Date> <int>
## 1: gdp constant prices 2020-09-30 18596.52 united states 2020-10-30 30
## 2: gdp constant prices 2020-12-31 18794.43 united states 2021-01-29 29
The last observation we have for GDP is Q4 2020, so our nowcast
will be for Q1 2021 (the as_of date for these data is
February 26, 2021, which was a Friday). Looking next at the monthly
data:
## series_name ref_date value country pub_date pub_lag
## <char> <Date> <num> <char> <Date> <int>
## 1: advance retail sales 2020-10-31 493991 united states 2020-11-18 18
## 2: advance retail sales 2020-11-30 488652 united states 2020-12-17 17
## 3: advance retail sales 2020-12-31 484782 united states 2021-01-16 16
## 4: advance retail sales 2021-01-31 520162 united states 2021-02-18 18
In this case, our contemporaneous data will include the first month of the quarter. Lagged data (i.e., Q4 2020) are complete. Looking next at the weekly data:
## series_name ref_date value country pub_date pub_lag
## <char> <Date> <num> <char> <Date> <int>
## 1: initial jobless claims 2021-01-01 781000 united states 2021-01-21 20
## 2: initial jobless claims 2021-01-08 904000 united states 2021-01-28 20
## 3: initial jobless claims 2021-01-15 886000 united states 2021-02-04 20
## 4: initial jobless claims 2021-01-22 836000 united states 2021-02-11 20
## 5: initial jobless claims 2021-01-29 837000 united states 2021-02-18 20
## 6: initial jobless claims 2021-02-05 863000 united states 2021-02-25 20
Here we observe the first six weeks of the quarter for Q1 2021. And finally, for the daily data:
## series_name ref_date value country pub_date pub_lag
## <char> <Date> <num> <char> <Date> <int>
## 1: t bill spread 10y 3m 2021-02-23 1.33 united states 2021-02-24 1
## 2: t bill spread 10y 3m 2021-02-24 1.35 united states 2021-02-25 1
## 3: t bill spread 10y 3m 2021-02-25 1.50 united states 2021-02-26 1
We observe through February 25th, which in this case means 37
observations (values for weekends are not observed). The function
process_MF() operates by calculating how many observations
we have for the contemporaneous period (the nowcast, or one step ahead
of the last observation of left hand side (LHS) data) for each series in
the data. In this case, since we observe the first 37 days of the quarter
for t bill spreads, the contemporaneous variable for t bill spreads will
aggregate the first 37 observations in the historical data as well. If an
early observation were missing, say the 22nd, then that observation would
also be omitted when calculating historical aggregates. This
contemporaneous variable is appended with a 0 in the resulting data; the
variable at one lag is appended with a 1, and so on. If lagged periods
are missing in the tail of the data, the pattern of missing observations
is replicated in the same way.
We can verify this approach by running the function.
MF <- process_MF(fred[series_name == "gdp constant prices"],
                 fred[series_name != "gdp constant prices"],
                 LHS_lags = 3, RHS_lags = 3)
Here we are using three lags of LHS variables and three lags of RHS
variables. Note that data must be entered in long format. Aside from
being more efficient for mixed frequency data (i.e., avoiding a lot of
NA entries for low frequency data), this allows for
backtesting by censoring data using the as_of argument.
Required columns are ref_date, series_name,
and value. If your data has different column names, you
will have to inform the function using the arguments
date_name, id_name, and value_name. If you wish to backtest your data using
as_of, you must also include the column
pub_date, or inform the function which column contains
publication dates using pub_date_name, as in the sketch below.
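Since fred already includes a pub_date column, a backtest only needs the as_of cutoff; a hedged sketch (the cutoff date here is arbitrary):

MF_backtest <- process_MF(fred[series_name == "gdp constant prices"],
                          fred[series_name != "gdp constant prices"],
                          LHS_lags = 3, RHS_lags = 3,
                          as_of = as.Date("2020-06-30"))

Data published after the as_of date are censored, reproducing what would have been observable at that time.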
By default, process_MF() returns data in long format. If
you wish to have the data in matrix format, you can set
return_dt = FALSE. Since dateutils is built on
data.table, we can also transform the output to wide format,
as follows.
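A minimal sketch of the cast, using the long output MF from above:

MF_wide <- dcast(MF, ref_date ~ series_name, value.var = "value")
print(MF_wide[1:6])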
## Key: <ref_date>
## ref_date gdp constant prices 0 gdp constant prices 1 gdp constant prices 2
## <Date> <num> <num> <num>
## 1: 2020-09-30 18596.52 17302.51 19010.85
## 2: 2020-12-31 18794.43 18596.52 17302.51
## 3: 2021-03-31 NA 18794.43 18596.52
## 4: 2021-06-30 NA NA 18794.43
## 5: 2021-09-30 NA NA NA
## 6: 2021-12-31 NA NA NA
## gdp constant prices 3 advance retail sales 0 initial jobless claims 0
## <num> <num> <num>
## 1: 19253.96 481627 1243833.3
## 2: 19010.85 493991 779000.0
## 3: 17302.51 520162 851166.7
## 4: 18596.52 NA NA
## 5: 18794.43 NA NA
## 6: NA NA NA
## t bill spread 10y 3m 0 advance retail sales 1 initial jobless claims 1
## <num> <num> <num>
## 1: 0.5086486 433592.3 2716615.4
## 2: 0.7275676 486223.3 1041076.9
## 3: 1.0983784 489141.7 783000.0
## 4: NA 520162.0 851166.7
## 5: NA NA NA
## 6: NA NA NA
## t bill spread 10y 3m 1 advance retail sales 2 initial jobless claims 2
## <num> <num> <num>
## 1: 0.5453968 451492.3 867846.2
## 2: 0.5371875 433592.3 2716615.4
## 3: 0.7716129 486223.3 1041076.9
## 4: 1.0983784 489141.7 783000.0
## 5: NA 520162.0 851166.7
## 6: NA NA NA
## t bill spread 10y 3m 2 advance retail sales 3 initial jobless claims 3
## <num> <num> <num>
## 1: 0.2609677 457066.3 218923.1
## 2: 0.5453968 451492.3 867846.2
## 3: 0.5371875 433592.3 2716615.4
## 4: 0.7716129 486223.3 1041076.9
## 5: 1.0983784 489141.7 783000.0
## 6: NA 520162.0 851166.7
## t bill spread 10y 3m 3
## <num>
## 1: 0.1843548
## 2: 0.2609677
## 3: 0.5453968
## 4: 0.5371875
## 5: 0.7716129
## 6: 1.0983784
Checking the result for advance retail sales is simple, as we are only
using the first observation in the quarter. Thus for Q1 2021 it is
520162; for the first month of Q4 2020 it is 493991.
process_MF() allows for multiple LHS variables, but all LHS
variables must be of the same frequency (i.e., quarterly, monthly,
etc.).
So far we have transformed our mixed frequency data into meaningful,
up-to-date uniform frequency data. However, several series are still
non-stationary. Before modeling, we will have to process these data to
ensure stationarity and thereby consistent results. We can do this using
the function process(). Instructions on how to process the
data are stored in a library file, in this case the built-in
fredlib.
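fredlib ships with the package as a data.table, so viewing it is a one-liner:

print(fredlib)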
## Index: <series_name>
## country series_name needs_SA frequency type unit
## <char> <char> <int> <char> <char> <char>
## 1: united states advance retail sales 0 monthly business USD Million
## 2: united states gdp constant prices 0 quarterly gdp USD Billion
## 3: united states initial jobless claims 0 weekly labour Thousand
## 4: united states t bill spread 10y 3m 0 daily markets percent
## source take_logs take_diffs
## <char> <int> <int>
## 1: FRED 1 1
## 2: FRED 1 1
## 3: FRED 1 0
## 4: FRED 0 0
Required columns are series_name,
take_logs, and take_diffs. Optional columns
are country, needs_SA, detrend,
center, and scale. Other columns, such as
type above, are ignored. Unique identification of each
series can come from series_name alone or from
series_name and country pairs. Note that when
we transformed the mixed frequency data using process_MF(),
we appended series names with 0 for contemporaneous data, 1 for one
lag, and so on. By default, the argument
ignore_numeric_names = TRUE, so that process()
correctly identifies each series from the library
fredlib.
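The processing call below is a sketch under the assumption that process() takes the long-format data as its first argument and the library as its second (see ?process for the exact signature):

dt_processed <- process(MF, fredlib)
print(dt_processed[1:4])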
## ref_date series_name value level_value low_frequency_trend
## <Date> <fctr> <num> <num> <num>
## 1: 2000-03-31 gdp constant prices 0 NA 9.466855 0.006552842
## 2: 2000-06-30 gdp constant prices 0 0.87484058 9.485001 0.006507054
## 3: 2000-09-30 gdp constant prices 0 -0.27773103 9.486336 0.006452911
## 4: 2000-12-31 gdp constant prices 0 0.06114135 9.492545 0.006400971
## standardize_center standardize_scale
## <num> <num>
## 1: -0.001080327 0.01453897
## 2: -0.001080327 0.01453897
## 3: -0.001080327 0.01453897
## 4: -0.001080327 0.01453897
The output dt_processed includes all of the information one would
need to convert the data back to levels. If we had seasonally adjusted any
of the series, the output would also include the column
seasonal_factor.
Now that we have stationary, uniform frequency data, we can use any of the standard econometric models. For example,
X <- dcast(dt_processed, ref_date ~ series_name, value.var = "value")
out <- lm(`gdp constant prices 0` ~ `advance retail sales 0` +
`initial jobless claims 0` + `t bill spread 10y 3m 0`, data = X)
summary(out)
##
## Call:
## lm(formula = `gdp constant prices 0` ~ `advance retail sales 0` +
## `initial jobless claims 0` + `t bill spread 10y 3m 0`, data = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.22908 -0.22884 -0.05588 0.27090 1.30454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001004 0.047091 0.021 0.983
## `advance retail sales 0` 0.834936 0.047951 17.412 < 2e-16 ***
## `initial jobless claims 0` -0.264142 0.052399 -5.041 2.88e-06 ***
## `t bill spread 10y 3m 0` 0.075944 0.048682 1.560 0.123
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4287 on 79 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.823, Adjusted R-squared: 0.8162
## F-statistic: 122.4 on 3 and 79 DF, p-value: < 2.2e-16
dateutils additionally exports a large number of small utility functions whose names are hopefully self-explanatory. We can see the complete list as follows:
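One standard way to produce this listing (a sketch; the original chunk is not shown):

ls("package:dateutils")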
## [1] "Diff" "add_forecast_dates"
## [3] "agg_to_freq" "agg_to_freq_wide"
## [5] "allNA" "all_finite"
## [7] "any_finite" "can_seasonal"
## [9] "col_to_list" "comp_form"
## [11] "count_obs" "day"
## [13] "end_of_period" "end_of_year"
## [15] "extract_basic_character" "extract_character"
## [17] "extract_numeric" "fill_forward"
## [19] "first_of_month" "first_of_quarter"
## [21] "first_previous_quarter" "fred"
## [23] "fredlib" "get_data_frq"
## [25] "get_from_list" "index_by_friday"
## [27] "is_in" "last_in_month"
## [29] "last_in_quarter" "last_in_week"
## [31] "last_in_year" "last_obs"
## [33] "limit_character" "long_run_var"
## [35] "match_index" "match_ts_dates"
## [37] "mean_na" "month_days"
## [39] "month_to_date" "number_finite"
## [41] "numdum" "pct_chng"
## [43] "pct_response" "process"
## [45] "process_MF" "process_wide"
## [47] "rollmax" "rollmean"
## [49] "rollmin" "row_to_list"
## [51] "run_sa" "sd_na"
## [53] "seas_df_long" "seas_df_wide"
## [55] "spline_fill" "spline_fill_trend"
## [57] "stack_obs" "sum_na"
## [59] "to_ts" "total_response"
## [61] "try_detrend" "try_rm_outlier"
## [63] "try_sa" "try_trend"
## [65] "ts_to_df"
As a few more examples, we can find the last date of a time period using one of the following functions:
dates <- seq.Date(from = as.Date("2021-01-01"), to = as.Date("2021-06-30"), by = "day")
weekvals <- end_of_period(dates, period = "week")
unique(weekdays(weekvals))
## [1] "Friday"
or, to shift dates forward one month, keeping them at month end (see the sketch below):
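The original helper call is not shown; a base R sketch that reproduces the output below, assuming end_of_period() also accepts period = "month":

month_ends <- unique(end_of_period(dates, period = "month"))
# jump to the first day of the month two months ahead, then step back one day,
# which lands on the last day of the following month
yr <- as.integer(format(month_ends, "%Y"))
mo <- as.integer(format(month_ends, "%m")) + 2
shifted <- as.Date(sprintf("%d-%02d-01", yr + (mo - 1) %/% 12, (mo - 1) %% 12 + 1)) - 1
print(shifted)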
## [1] "2021-02-28" "2021-03-31" "2021-04-30" "2021-05-31" "2021-06-30"
## [6] "2021-07-31"