In the last post we learned to download the FF data. In this post we will learn to clean the data so we can use it for our analysis.

First lets load the libraries and the data.

library(tidyquant)
library(timetk)
ff_url <- "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_CSV.zip"

temp_file <- tempfile()

download.file(ff_url, temp_file)

ff_data_raw <- read_csv(unzip(temp_file), skip = 3)
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   `Mkt-RF` = col_double(),
##   SMB = col_double(),
##   HML = col_double(),
##   RF = col_double()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 8 parsing failures.
## row # A tibble: 5 x 5 col     row col    expected   actual                  file                     expected   <int> <chr>  <chr>      <chr>                   <chr>                    actual 1  1115 X1     an integer Annual Factors: Januar~ './F-F_Research_Data_Fa~ file 2  1115 <NA>   5 columns  1 columns               './F-F_Research_Data_Fa~ row 3  1116 Mkt-RF a double   Mkt-RF                  './F-F_Research_Data_Fa~ col 4  1116 SMB    a double   SMB                     './F-F_Research_Data_Fa~ expected 5  1116 HML    a double   HML                     './F-F_Research_Data_Fa~
## ... ................. ... .......................................................................... ........ .......................................................................... ...... .......................................................................... .... .......................................................................... ... .......................................................................... ... .......................................................................... ........ ..........................................................................
## See problems(...) for more details.
head(ff_data_raw)
## # A tibble: 6 x 5
##       X1 `Mkt-RF`   SMB   HML    RF
##    <int>    <dbl> <dbl> <dbl> <dbl>
## 1 192607     2.96 -2.3  -2.87  0.22
## 2 192608     2.64 -1.4   4.19  0.25
## 3 192609     0.36 -1.32  0.01  0.23
## 4 192610    -3.24  0.04  0.51  0.32
## 5 192611     2.53 -0.2  -0.35  0.31
## 6 192612     2.62 -0.04 -0.02  0.28

Looking at the warning we can see that there was some issue with parsing row 1116. So lets check and delete unwanted data.

ff_data_raw[c(1114:1118),]
## # A tibble: 5 x 5
##       X1 `Mkt-RF`    SMB    HML    RF
##    <int>    <dbl>  <dbl>  <dbl> <dbl>
## 1 201904     3.96  -1.69   1.99  0.21
## 2     NA    NA     NA     NA    NA   
## 3     NA    NA     NA     NA    NA   
## 4   1927    29.5   -2.46  -3.75  3.12
## 5   1928    35.4    4.2   -6.15  3.56

The rows after, row 1114 is not needed, so we will delete that data.

ff_data_raw <- ff_data_raw[c(1:1114),]

Now lets look at the data again.

head(ff_data_raw)
## # A tibble: 6 x 5
##       X1 `Mkt-RF`   SMB   HML    RF
##    <int>    <dbl> <dbl> <dbl> <dbl>
## 1 192607     2.96 -2.3  -2.87  0.22
## 2 192608     2.64 -1.4   4.19  0.25
## 3 192609     0.36 -1.32  0.01  0.23
## 4 192610    -3.24  0.04  0.51  0.32
## 5 192611     2.53 -0.2  -0.35  0.31
## 6 192612     2.62 -0.04 -0.02  0.28

Lets change the column names.

colnames(ff_data_raw) <- paste(c('date', "mkt_excess", "smb", "hml", "rf"))

Now lets format the dates. Currently they are in the Year/month format and parsed as int. We want the format to be a full date, recognized by R for time series analysis.

ff_data_raw %>%
  mutate(date = ymd(parse_date(date, format = "%Y%m")))
## # A tibble: 1,114 x 5
##    date       mkt_excess   smb   hml    rf
##    <date>          <dbl> <dbl> <dbl> <dbl>
##  1 1926-07-01       2.96 -2.3  -2.87  0.22
##  2 1926-08-01       2.64 -1.4   4.19  0.25
##  3 1926-09-01       0.36 -1.32  0.01  0.23
##  4 1926-10-01      -3.24  0.04  0.51  0.32
##  5 1926-11-01       2.53 -0.2  -0.35  0.31
##  6 1926-12-01       2.62 -0.04 -0.02  0.28
##  7 1927-01-01      -0.06 -0.56  4.83  0.25
##  8 1927-02-01       4.18 -0.1   3.17  0.26
##  9 1927-03-01       0.13 -1.6  -2.67  0.3 
## 10 1927-04-01       0.46  0.43  0.6   0.25
## # ... with 1,104 more rows

But since FF release their data at the end of the month, we will change the date format to last day of each month.

ff_data_raw %>%
  mutate(date = ymd(parse_date(date, format = "%Y%m"))) %>%
  mutate(date = rollback(date)) 
## # A tibble: 1,114 x 5
##    date       mkt_excess   smb   hml    rf
##    <date>          <dbl> <dbl> <dbl> <dbl>
##  1 1926-06-30       2.96 -2.3  -2.87  0.22
##  2 1926-07-31       2.64 -1.4   4.19  0.25
##  3 1926-08-31       0.36 -1.32  0.01  0.23
##  4 1926-09-30      -3.24  0.04  0.51  0.32
##  5 1926-10-31       2.53 -0.2  -0.35  0.31
##  6 1926-11-30       2.62 -0.04 -0.02  0.28
##  7 1926-12-31      -0.06 -0.56  4.83  0.25
##  8 1927-01-31       4.18 -0.1   3.17  0.26
##  9 1927-02-28       0.13 -1.6  -2.67  0.3 
## 10 1927-03-31       0.46  0.43  0.6   0.25
## # ... with 1,104 more rows

When we use the rollback() function, all dates get shift up 1 month. The big downside is, that our data is not aligned correctly. If we check the tail we lost April 2019 data. TO rectify this we can add one month to our data, before rolling it back.

ff_data_raw <- ff_data_raw %>%
  mutate(date = ymd(parse_date(date, format = "%Y%m"))) %>%
  mutate(date = date + months(1)) %>% # Add one month
  mutate(date = rollback(date)) 

ff_data_raw
## # A tibble: 1,114 x 5
##    date       mkt_excess   smb   hml    rf
##    <date>          <dbl> <dbl> <dbl> <dbl>
##  1 1926-07-31       2.96 -2.3  -2.87  0.22
##  2 1926-08-31       2.64 -1.4   4.19  0.25
##  3 1926-09-30       0.36 -1.32  0.01  0.23
##  4 1926-10-31      -3.24  0.04  0.51  0.32
##  5 1926-11-30       2.53 -0.2  -0.35  0.31
##  6 1926-12-31       2.62 -0.04 -0.02  0.28
##  7 1927-01-31      -0.06 -0.56  4.83  0.25
##  8 1927-02-28       4.18 -0.1   3.17  0.26
##  9 1927-03-31       0.13 -1.6  -2.67  0.3 
## 10 1927-04-30       0.46  0.43  0.6   0.25
## # ... with 1,104 more rows

We are almost done cleaning, we need to convert the data into decimal form. We will use a quick apply function to do that. We will want to skip the date column for that.

ff_data_raw <- ff_data_raw %>%
  mutate_at(vars(-date), function(x) x/100)

ff_data_raw
## # A tibble: 1,114 x 5
##    date       mkt_excess     smb      hml     rf
##    <date>          <dbl>   <dbl>    <dbl>  <dbl>
##  1 1926-07-31   0.0296   -0.023  -0.0287  0.0022
##  2 1926-08-31   0.0264   -0.0140  0.0419  0.0025
##  3 1926-09-30   0.0036   -0.0132  0.0001  0.0023
##  4 1926-10-31  -0.0324    0.0004  0.0051  0.0032
##  5 1926-11-30   0.0253   -0.002  -0.00350 0.0031
##  6 1926-12-31   0.0262   -0.0004 -0.0002  0.0028
##  7 1927-01-31  -0.000600 -0.0056  0.0483  0.0025
##  8 1927-02-28   0.0418   -0.001   0.0317  0.0026
##  9 1927-03-31   0.0013   -0.016  -0.0267  0.003 
## 10 1927-04-30   0.0046    0.0043  0.006   0.0025
## # ... with 1,104 more rows

That’s it, our Fama French factor data is ready for analysis. Next we will load our portfolio data and do some analysis. We will repeat this process again in Python next.