forked from timtrice/backtesting-strategies
-
Notifications
You must be signed in to change notification settings - Fork 1
/
data-quality.Rmd
171 lines (129 loc) · 9.9 KB
/
data-quality.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
# Data Quality
Before doing any analysis you must always check the data to ensure quality. Do not assume that because you are getting it from a source such as Yahoo! or Google that it is clean. I'll show you why.
## Yahoo! vs. Google
I'll use **dplyr 0.4.3**, **ggplot2 2.0.0** and **tidyr 0.4.1** to help with analysis.
```{r data-quality-a-yahoo-getsymbols}
getSymbols("SPY",
src = "yahoo",
index.class = c("POSIXt", "POSIXct"),
from = "2010-01-01",
to = "2011-01-01",
adjust = TRUE)
yahoo.SPY <- SPY
summary(yahoo.SPY)
```
Above is a summary for the **SPY** data we received from Yahoo!. Examining each of the variables does not show anything out of the ordinary.
```{r data-quality-a-google-getsymbols}
rm(SPY)
getSymbols("SPY",
src = "google",
index.class = c("POSIXt", "POSIXct"),
from = "2010-01-01",
to = "2011-01-01",
adjust = TRUE)
google.SPY <- SPY
summary(google.SPY)
```
Now we have a dataset from Google. it's for the same symbol and same time frame. But now we have NA values - 8, in fact. In addition, our percentiles do not match up for any of the variables (with the exception of `Date`).
```{r data-quality-a-boxplot}
bind_rows(as.data.frame(yahoo.SPY) %>%
mutate(Src = "Yahoo"),
as.data.frame(google.SPY) %>%
mutate(Src = "Google")) %>%
gather(key, value, 1:4, na.rm = TRUE) %>%
ggplot(aes(x = key, y = value, fill = Src)) +
geom_boxplot() +
theme_bw() +
theme(legend.title = element_blank(), legend.position = "bottom") +
ggtitle("Google vs. Yahoo! (non-NA)")
```
We can see above clearly we have a mismatch of data between Google and Yahoo!. For one reason, Google does not supply a full day of data for holidays and early sessions. Let's look at the NA values:
```{r data-quality-a-holidays}
as.data.frame(google.SPY) %>%
mutate(Date = index(google.SPY)) %>%
select(Date, starts_with("SPY"), -SPY.Volume) %>%
filter(is.na(SPY.Open))
```
We can see many of these dates correspond closely to national holidays; *2010-11-24* would be Thanksgiving, *2010-12-23* would be Christimas.
So where Yahoo! does give OHLC values for these dates, Google just provides the Close. This won't affect most indicators that typically use closing data (moving averages, Bollinger Bands, etc.). However, if you are working on a strategy that triggers a day prior to one of these holidays, and you issue a buy order for the next morning, this may cause some integrity loss.
This doesn't mean you should only use Yahoo!. At this point we don't know the quality of Yahoo!'s data - we only know it *seems* complete. And this may be enough depending on what you want to do.
However, it's up to you to ensure your data is top quality.
> Garbage in, garbage out
## Examining Trades
It's not just data that we want to QA against but also our trades. After all, how disappointing would it be to think you have a winning strategy only to learn you were buying on tomorrow's close instead of today (look-ahead bias). Or that you wrote your rules incorrectly?
Every backtest must be picked apart from beginning to end. Checking our data was the first step. Checking our trades is next.
We'll reload our **Luxor** strategy and examine some of the trades for **SPY**.
```{r data-quality-b-load-strategy, results = "hide"}
rm.strat(portfolio.st)
rm.strat(account.st)
symbols <- basic_symbols()
getSymbols(Symbols = symbols, src = "yahoo", index.class = "POSIXct",
from = start_date, to = end_date, adjust = adjustment)
initPortf(name = portfolio.st, symbols = symbols, initDate = init_date)
initAcct(name = account.st, portfolios = portfolio.st, initDate = init_date,
initEq = init_equity)
initOrders(portfolio = portfolio.st, symbols = symbols, initDate = init_date)
applyStrategy(strategy.st, portfolios = portfolio.st)
checkBlotterUpdate(portfolio.st, account.st, verbose = TRUE)
updatePortf(portfolio.st)
updateAcct(account.st)
updateEndEq(account.st)
```
```{r data-quality-b-chart-posn-1, fig.cap = "SPY Trades for Jan 1, 2008 to July 1, 2008"}
chart.Posn(portfolio.st, Symbol = "SPY", Dates="2008-01-01::2008-07-01",
TA="add_SMA(n = 10, col = 2); add_SMA(n = 30, col = 4)")
```
Our strategy called for a long entry when SMA(10) was greater than or equal to SMA(30). It seems we got a cross on February 25 but the trade didn't trigger until two days later. Let's take a look.
```{r data-quality-b-mktdata}
le <- as.data.frame(mktdata["2008-02-25::2008-03-07", c(1:4, 7:10)])
DT::datatable(le,
rownames = TRUE,
extensions = c("Scroller", "FixedColumns"),
options = list(pageLength = 5,
autoWidth = TRUE,
deferRender = TRUE,
scrollX = 200,
scroller = TRUE,
fixedColumns = TRUE),
caption = htmltools::tags$caption(
"Table 6.1: mktdata object for Feb. 25, 2008 to Mar. 7, 2008"))
```
The **2008-02-25T00:00:00Z** bar shows `nFast` just fractions of a penny lower than `nSlow`. We get the cross on **2008-02-26T00:00:00Z** which gives a TRUE `long` signal. Our high on that bar is $132.61 which would be our stoplimit. On the **2008-02-27T00:00:00Z** bar we get a higher high which means our stoplimit order gets filled at $132.61. This is reflected by the faint green arrow at the top of the candles upper shadow.
```{r data-quality-b-order-book}
ob <- as.data.table(getOrderBook(portfolio.st)$Quantstrat$SPY)
DT::datatable(ob,
rownames = FALSE,
filter = "top",
extensions = c("Scroller", "FixedColumns"),
options = list(pageLength = 5,
autoWidth = TRUE,
deferRender = TRUE,
scrollX = 200,
scroller = TRUE,
fixedColumns = TRUE),
caption = htmltools::tags$caption(
"Table 6.2: Order book for SPY"))
```
When we look at the order book (Table 6.2) we get confirmation of our order. `index` reflects the date the order was submitted. `Order.StatusTime` reflects when the order was filled.
(Regarding the time stamp, ignore it. No time was provided so by default it falls to midnight Zulu time which is four to five hours ahead of EST/EDT (depending on time of year) which technically would be the previous day. To avoid confusion, just note the dates.)
If we look at `Rule` we see the value of *EnterLONG*. These are the `labels` of the rules we set up in our strategy. Now you can see how all these labels we assigned earlier start coming together.
On **2008-03-06T00:00:00Z** we get a market order to vacate all long positions and take a short positions. We see this charted in Fig. 6.1 identified with a red arrow on the same candle one bar after the cross. We stay in that position until **2008-04-01T00:00:00Z** when we flip back long.
If you flip to page 5 of Table 6.2, on **2009-11-03T00:00:00Z** you will see we had an order replaced (`Order.Status`). Let's plot this time frame and see what was going on.
```{r data-quality-b-chart-posn-2, fig.cap = "SPY Trades for Jan 1, 2008 to July 1, 2008"}
chart.Posn(portfolio.st, Symbol = "SPY", Dates="2009-08-01::2009-12-31",
TA="add_SMA(n = 10, col = 2); add_SMA(n = 30, col = 4)")
```
We got a bearish SMA cross on November 2 which submitted the short order. However, our stoplimit was with a preference of the Low and a threshold of $0.0005 or $102.98. So the order would only fill if we broke below that price. As you see, that never happened. The order stayed open until we got the bullish SMA cross on Nov. 11. At that point our short order was replaced with our long order to buy; a stoplimit at $109.50. Nov. 12 saw an inside day; the high wasn't breached therefore the order wasn't filled. However, on Nov. 13 we rallied past the high triggering the long order (green arrow). This is the last position taken in our order book.
So it seems the orders are triggering as expected.
On a side note, when I was originally writing the code I realized my short order was for +100 shares instead of -100 shares; actually, `orderqty = 100` which meant I wasn't really taking short positions.
This is why you really need to examine your strategies as soon as you create them. Before noticing the error the profit to drawdown ratio was poor. After correcting the error, it was worse. It only takes a minor barely recognizable typo to ruin results.
Finally, we'll get to the `chart.Posn()` function later in the analysis chapters. For now I want to point out one flaw (in my opinion) with the function. You may have noticed our indicators and positions didn't show up immediately on the chart. Our indicators didn't appear until the 10-bar and 30-bar periods had passed. And our positions didn't show up until a new trade was made.
You may also notice our CumPL and Drawdown graphs started at 0 on the last chart posted.
`chart.Posn()` doesn't "zoom in" as you may think. Rather it just operates on a subset of data when using the `Dates` parameter. Effectively, it's adjusting your strategy to the `Dates` parameter that is passed.
```{r data-quality-b-chart-posn-3, fig.cap = "SPY Trades for Jan 1, 2008 to July 1, 2008"}
chart.Posn(portfolio.st, Symbol = "SPY",
TA="add_SMA(n = 10, col = 2); add_SMA(n = 30, col = 4)")
```
Also, note the CumPL value of $2251.20741 and Drawdown value of -$1231.29476 are the *final* values. It does not show max profit or max drawdown. Notice the values are different from figure 6.3 and figure 6.2.
Going by that alone it may seem the strategy overall is profitable. But when you realize max drawdown was near -$3,000 and max profit was up to $3,000, it doesn't seem so promising.
Again, we'll get into all of this later. Just something to keep in mind when you start doing analysis.