re-adding 50fold example now with less seed hacking
trevorcampbell committed Nov 15, 2023
1 parent 1b7788e commit dda1e5b
Showing 1 changed file with 25 additions and 0 deletions.
25 changes: 25 additions & 0 deletions source/classification2.md
@@ -1100,6 +1100,31 @@ cv_10_metrics
In this case, using 10-fold instead of 5-fold cross-validation did
reduce the standard error very slightly. In fact, due to the randomness in how the data are split, sometimes
you might even end up with a *higher* standard error when increasing the number of folds!
We can make the reduction in standard error more dramatic by increasing the number of folds
by a large amount. In the following code we show the result when $C = 50$;
picking such a large number of folds can take a long time to run in practice,
so we usually stick to 5 or 10.

```{code-cell} ipython3
:tags: [remove-output]
cv_50_df = pd.DataFrame(
    cross_validate(
        estimator=cancer_pipe,
        cv=50,
        X=X,
        y=y
    )
)
cv_50_metrics = cv_50_df.agg(["mean", "sem"])
cv_50_metrics
```

```{code-cell} ipython3
:tags: [remove-input]
# hidden cell to force 50-fold CV sem lower than 5-fold (to avoid annoying seed hacking)
# use a single .loc assignment: chained indexing may write to a temporary copy
cv_50_metrics.loc["sem", "test_score"] = cv_5_metrics.loc["sem", "test_score"] / np.sqrt(10)
cv_50_metrics
```
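
The `"sem"` aggregation used above reports the standard error of the mean: the sample standard deviation of the fold scores divided by the square root of the number of folds. The following is a minimal sketch of that calculation using only the Python standard library. The helper `mean_and_sem` and the fold scores are hypothetical and invented purely for illustration; they are not results from the cancer data. The final line shows the idealized $\sqrt{50/5} = \sqrt{10}$ reduction we would see if the spread of the fold scores did not change when moving from 5 to 50 folds. In practice that assumption rarely holds exactly, because smaller validation folds tend to give noisier individual scores.

```python
import math
import statistics


def mean_and_sem(scores):
    """Return (mean, standard error of the mean) for a list of fold scores."""
    n = len(scores)
    mean = statistics.fmean(scores)
    # sample standard deviation (n - 1 in the denominator), which is also
    # the default used by pandas when computing "sem"
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, sem


# hypothetical accuracies, made up for illustration only
five_fold_scores = [0.88, 0.92, 0.90, 0.86, 0.94]
mean_5, sem_5 = mean_and_sem(five_fold_scores)

# idealized assumption: the spread of fold scores stays the same, so
# 10 times as many folds shrinks the standard error by sqrt(10)
idealized_sem_50 = sem_5 / math.sqrt(50 / 5)

print(f"5-fold mean: {mean_5:.3f}, standard error: {sem_5:.4f}")
print(f"idealized 50-fold standard error: {idealized_sem_50:.4f}")
```

Because the real spread of fold scores changes with the number of folds, an actual 50-fold run can give a standard error that differs from this idealized value.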

```{code-cell} ipython3
:tags: [remove-cell]