-
Notifications
You must be signed in to change notification settings - Fork 90
/
15_encryption.Rmd
287 lines (206 loc) · 12.9 KB
/
15_encryption.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
# Encryption
\index{encryption@\textbf{encryption}}
<!-- > Arguing that you don't care about privacy because you have nothing to hide is no different than saying you don't care about free speech because you have nothing to say. -->
<!-- > Edward Snowden -->
> Encryption matters, and it is not just for spies and philanderers.
> Glenn Greenwald
Health data is precious and often sensitive.
Datasets may contain patient identifiable information.
Information may be clearly disclosive, such as a patient's date of birth, post/zip code, or social security number.
Other datasets may have been processed to remove the most obviously confidential information.
These still require great care, as the data is usually only 'pseudoanonymised'.
This may mean that the data of an individual patient is disclosive when considered as a whole - perhaps the patient had a particularly rare diagnosis.
Or it may mean that the data can be combined with other datasets and in combination, individual patients can be identified.
The governance around safe data handling is one of the greatest challenges facing health data scientists today.
It needs to be taken very seriously and robust practices must be developed to ensure public confidence.
## Safe practice
Storing sensitive information as raw values leaves the data vulnerable to confidentiality breaches.
This is true even when you are working in a 'safe' environment, such as a secure server.
It is best to simply remove as much confidential information from records whenever possible.
If the data is not present, then it cannot be compromised.
This might not be a good idea if the data might need to be linked back to an individual at some unspecified point in the future.
This may be a problem if, for example, auditors of a clinical trial need to re-identify an individual from the trial data.
A study ID can be used, but that still requires the confidential data to be stored and available in a lookup table in another file.
This chapter is not a replacement for an information governance course.
These are essential and the reader should follow their institution's guidelines on this.
The chapter does introduce a useful R package and encryption functions that you may need to incorporate into your data analysis workflow.
## **encryptr** package
\index{encryption@\textbf{encryption}!encryptr}
The **encryptr** package is our own and allows users to store confidential data in a pseudoanonymised form, which is far less likely to result in re-identification.
Either columns in data frames/tibbles or whole files can be directly encrypted from R using strong RSA encryption.
The basis of RSA encryption is a public/private key pair and is the method used of many modern encryption applications.
The public key can be shared and is used to encrypt the information.
The private key is sensitive and should not be shared.
The private key requires a password to be set, which should follow modern rules on password complexity.
You know what you should do!
If the password is lost, it cannot be recovered.
## Get the package
The **encryptr** package can be installed in the standard manner or the development version can be obtained from GitHub.
Full documentation is maintained separately at [encrypt-r.org](https://encrypt-r.org).
```{r eval=FALSE}
install.packages("encryptr")
# Or the development version from Github
remotes::install_github("surgicalinformatics/encryptr")
```
## Get the data
An example dataset containing the addresses of general practitioners (family doctors) in Scotland is included in the package.
```{r eval=FALSE}
library(encryptr)
gp
#> A tibble: 1,212 x 12
#> organisation_code name address1 address2 address3 city postcode
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 S10002 MUIRHE… LIFF RO… MUIRHEAD NA DUND… DD2 5NH
#> 2 S10017 THE BL… CRIEFF … KING ST… NA CRIE… PH7 3SA
```
## Generate private/public keys
\index{encryption@\textbf{encryption}!public/private keys}
The `genkeys()` function generates a public and private key pair.
A password is required to be set in the dialogue box for the private key.
Two files are written to the active directory.
The default name for the private key is:
* `id_rsa`
And for the public key name is generated by default:
* `id_rsa.pub`
If the private key file is lost, nothing encrypted with the public key can be recovered.
Keep this safe and secure.
Do not share it without a lot of thought on the implications.
```{r eval=FALSE}
genkeys()
#> Private key written with name 'id_rsa'
#> Public key written with name 'id_rsa.pub'
```
## Encrypt columns of data
\index{encryption@\textbf{encryption}!columns, encrypt}
Once the keys are created, it is possible to encrypt one or more columns of data in a data frame/tibble using the public key.
Every time RSA encryption is used it will generate a unique output.
Even if the same information is encrypted more than once, the output will always be different.
It is therefore not possible to match two encrypted values.
These outputs are also secure from decryption without the private key.
This may allow sharing of data within or between research teams without sharing confidential data.
Encrypting columns to a ciphertext is straightforward.
However, as stated above, an important principle is dropping sensitive data which is never going to be required.
Do not hoard more data than you need to answer your question.
```{r eval=FALSE}
library(dplyr)
gp_encrypt = gp %>%
select(-c(name, address1, address2, address3)) %>%
encrypt(postcode)
gp_encrypt
#> A tibble: 1,212 x 8
#> organisation_code city county postcode
#> <chr> <chr> <chr> <chr>
#> 1 S10002 DUNDEE ANGUS 796284eb46ca…
#> 2 S10017 CRIEFF PERTHSHIRE 639dfc076ae3…
```
## Decrypt specific information only
\index{encryption@\textbf{encryption}!columns, decrypt}
Decryption requires the private key generated using `genkeys()` and the password set at the time.
The password and file are not replaceable so need to be kept safe and secure.
It is important to only decrypt the specific pieces of information that are required.
The beauty of this system is that when decrypting a specific cell, the rest of the data remain secure.
```{r eval=FALSE}
gp_encrypt %>%
slice(1:2) %>% # Only decrypt the rows and columns necessary
decrypt(postcode)
#> A tibble: 1,212 x 8
#> organisation_code city county postcode
#> <chr> <chr> <chr> <chr>
#> 1 S10002 DUNDEE ANGUS DD2 5NH
#> 2 S10017 CRIEFF PERTHSHIRE PH7 3SA
```
## Using a lookup table
\index{encryption@\textbf{encryption}!lookup table}
Rather than storing the ciphertext in the working data frame, a lookup table can be used as an alternative.
Using `lookup = TRUE` has the following effects:
* returns the data frame / tibble with encrypted columns removed and a `key` column included;
* returns the lookup table as an object in the R environment;
* creates a lookup table `.csv` file in the active directory.
```{r eval=FALSE}
gp_encrypt = gp %>%
select(-c(name, address1, address2, address3)) %>%
encrypt(postcode, telephone, lookup = TRUE)
#> Lookup table object created with name 'lookup'
#> Lookup table written to file with name 'lookup.csv'
gp_encrypt
#> A tibble: 1,212 x 7
#> key organisation_code city county opendate
#> <int> <chr> <chr> <chr> <date>
#> 1 1 S10002 DUNDEE ANGUS 1995-05-01
#> 2 2 S10017 CRIEFF PERTHSHIRE 1996-04-06
```
The file creation can be turned off with `write_lookup = FALSE` and the name of the lookup can be changed with `lookup_name = "anyNameHere"`.
The created lookup file should be itself encrypted using the method below.
Decryption is performed by passing the lookup object or file to the `decrypt()` function.
```{r eval=FALSE}
gp_encrypt %>%
decrypt(postcode, telephone, lookup_object = lookup)
# Or
gp_encrypt %>%
decrypt(postcode, telephone, lookup_path = "lookup.csv")
#> A tibble: 1,212 x 8
#> postcode telephone organisation_code city county opendate
#> <chr> <chr> <chr> <chr> <chr> <date>
#> 1 DD2 5NH 01382 580264 S10002 DUNDEE ANGUS 1995-05-01
#> 2 PH7 3SA 01764 652283 S10017 CRIEFF PERTHSHIRE 1996-04-06
```
## Encrypting a file
\index{encryption@\textbf{encryption}!file, encrypt}
Encrypting the object within R has little point if a file with the disclosive information is still present on the system.
Files can be encrypted and decrypted using the same set of keys.
To demonstrate, the included dataset is written as a .csv file.
```{r eval=FALSE}
write_csv(gp, "gp.csv")
encrypt_file("gp.csv")
#> Encrypted file written with name 'gp.csv.encryptr.bin'
```
Check that the file can be decrypted prior to removing the original file from your system.
Warning: it is strongly suggested that the original unencrypted data file backed up in a secure system in case de-encryption is not possible, e.g., the private key file or password is lost.
## Decrypting a file
\index{encryption@\textbf{encryption}!file, decrypt}
The `decrypt_file` function will not allow the original file to be overwritten, therefore use the option to specify a new name for the unencrypted file.
```{r eval=FALSE}
decrypt_file("gp.csv.encryptr.bin", file_name = "gp2.csv")
#> Decrypted file written with name 'gp2.csv'
```
## Ciphertexts are not matchable
The ciphertext produced for a given input will change with each encryption.
This is a feature of the RSA algorithm.
Ciphertexts should not therefore be attempted to be matched between datasets encrypted using the same public key.
This is a conscious decision given the risks associated with sharing the necessary details.
## Providing a public key
\index{encryption@\textbf{encryption}!public key sharing}
In collaborative projects where data may be pooled, a public key can be made available by you via a link to enable collaborators to encrypt sensitive data.
This provides a robust method for sharing potentially disclosive data points.
```{r eval=FALSE}
gp_encrypt = gp %>%
select(-c(name, address1, address2, address3)) %>%
encrypt(postcode, telephone, public_key_path =
"https://argonaut.is.ed.ac.uk/public/id_rsa.pub")
```
## Use cases
\index{encryption@\textbf{encryption}!use cases}
### Blinding in trials
A potential application is maintaining blinding / allocation concealment in randomised controlled clinical trials.
Using the same method of encryption, it is possible to encrypt the participant allocation group, allowing the sharing of data without compromising blinding.
If other members of the trial team are permitted to see treatment allocation (unblinded), then the decryption process can be followed to reveal the group allocation.
The benefit of this approach is that each ciphertext is unique.
This prevents researchers identifying patterns of outcomes or adverse events within a named group such as "Group A".
Instead, each participant appears to have a truly unique allocation group which can only be revealed by the decryption process.
In situations such as block randomisation, where the trial enrolment personnel are blinded to the allocation, this unique ciphertext further limits the impact of selection bias.
### Re-contacting participants
Clinical research often requires further contact of participants for either planned follow-up or sometimes in cases of early cessation of trials due to harm.
**encryptr** allows the storage of contact details in pseudoanonymised format that can be decrypted only when necessary.
For example, investigators running a randomised clinical trial of a novel therapeutic agent may decide that all enrolled participants taking another medication should withdraw due to a major drug interaction.
Using a basic filter, patients taking this medication could be identified and the telephone numbers decrypted for these participants.
The remaining telephone numbers would remain encrypted preventing unnecessary re-identification of participants.
### Long-term follow-up of participants
Researchers with approved projects may one day receive approval to carry out additional follow-up through tracking of outcomes through electronic healthcare records or re-contact of patients.
Should a follow-up study be approved, patient identifiers stored as ciphertexts could then be decrypted to allow matching of the participant to their own health records.
## Summary
All confidential information must be treated with the utmost care.
Data should never be carried on removable devices or portable computers.
Data should never be sent by open email.
Encrypting data provides some protection against disclosure.
But particularly in healthcare, data often remains potentially disclosive (or only pseudoanonymised) even after encryption of identifiable variables.
Treat it with great care and respect.