\section{File System Device}\label{sec:Device Types / File System Device}
The virtio file system device provides file system access. The device either
directly manages a file system or it acts as a gateway to a remote file system.
The details of how the device implementation accesses files are hidden by the
device interface, allowing for a range of use cases.
Unlike block-level storage devices such as virtio block and SCSI, the virtio
file system device provides file-level access to data. The device interface is
based on the Linux Filesystem in Userspace (FUSE) protocol. The protocol
consists of requests for traversing the file system and accessing the files and
directories within it. The protocol details are defined by \hyperref[intro:FUSE]{FUSE}.
The device acts as the FUSE file system daemon and the driver acts as the FUSE
client mounting the file system. The virtio file system device provides the
mechanism for transporting FUSE requests, much like /dev/fuse in a traditional
FUSE application.
This section relies on definitions from \hyperref[intro:FUSE]{FUSE}.
\subsection{Device ID}\label{sec:Device Types / File System Device / Device ID}
26
\subsection{Virtqueues}\label{sec:Device Types / File System Device / Virtqueues}
\begin{description}
\item[0] hiprio
\item[1] notification queue
\item[2\ldots n] request queues
\end{description}
The notification queue only exists if VIRTIO_FS_F_NOTIFICATION is set.
\subsection{Feature bits}\label{sec:Device Types / File System Device / Feature bits}
\begin{description}
\item[VIRTIO_FS_F_NOTIFICATION (0)] Device has support for FUSE notify
messages. The notification queue is virtqueue 1.
\end{description}
\subsection{Device configuration layout}\label{sec:Device Types / File System Device / Device configuration layout}
\begin{lstlisting}
struct virtio_fs_config {
char tag[36];
le32 num_request_queues;
le32 notify_buf_size;
};
\end{lstlisting}
The \field{tag} and \field{num_request_queues} fields are always available.
The \field{notify_buf_size} field is only available when
VIRTIO_FS_F_NOTIFICATION is set.
\begin{description}
\item[\field{tag}] is the name associated with this file system. The tag is
encoded in UTF-8 and padded with NUL bytes if shorter than the
available space. This field is not NUL-terminated if the encoded bytes
take up the entire field.
\item[\field{num_request_queues}] is the total number of request virtqueues
exposed by the device. Each virtqueue offers identical functionality and
there are no ordering guarantees between requests made available on
different queues. Use of multiple queues is intended to increase
performance.
\item[\field{notify_buf_size}] is the minimum number of bytes required for each
buffer in the notification queue.
\end{description}
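Because \field{tag} is not NUL-terminated when it fills the whole field, a
driver cannot treat it as a C string directly. The following is a minimal
sketch of one way to handle this; the helper name
\texttt{virtio\_fs\_tag\_to\_cstr} and the constant
\texttt{VIRTIO\_FS\_TAG\_LEN} are illustrative and not defined by this
specification.
\begin{lstlisting}
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VIRTIO_FS_TAG_LEN 36 /* size of virtio_fs_config.tag */

/*
 * Copy the tag field into a NUL-terminated string.
 * dst must have room for VIRTIO_FS_TAG_LEN + 1 bytes.
 * Returns the tag length in bytes, excluding the terminator.
 */
static size_t virtio_fs_tag_to_cstr(const char tag[VIRTIO_FS_TAG_LEN],
                                    char dst[VIRTIO_FS_TAG_LEN + 1])
{
        size_t len = 0;

        /* The tag ends at the first NUL byte or at the end of the field. */
        while (len < VIRTIO_FS_TAG_LEN && tag[len] != '\0')
                len++;

        memcpy(dst, tag, len);
        dst[len] = '\0';
        return len;
}
\end{lstlisting}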
\drivernormative{\subsubsection}{Device configuration layout}{Device Types / File System Device / Device configuration layout}
The driver MUST NOT write to device configuration fields.
The driver MAY use from one up to \field{num_request_queues} request virtqueues.
\devicenormative{\subsubsection}{Device configuration layout}{Device Types / File System Device / Device configuration layout}
The device MUST set \field{num_request_queues} to 1 or greater.
The device MUST set \field{notify_buf_size} to be large enough to hold any of
the FUSE notify messages that this device emits.
\subsection{Device Initialization}\label{Device Types / File System Device / Device Initialization}
On initialization the driver first discovers the device's virtqueues.
The driver populates the notification queue with buffers for receiving FUSE
notify messages if VIRTIO_FS_F_NOTIFICATION is set.
The FUSE session is started by sending a FUSE\_INIT request as defined by the
FUSE protocol on one request virtqueue. All virtqueues provide access to the
same FUSE session and therefore only one FUSE\_INIT request is required
regardless of the number of available virtqueues.
\subsection{Device Operation}\label{sec:Device Types / File System Device / Device Operation}
Device operation consists of operating the virtqueues to facilitate file system
access.
The FUSE request types are as follows:
\begin{itemize}
\item Normal requests are made available by the driver on request queues and
are used by the device.
\item High priority requests (FUSE\_INTERRUPT, FUSE\_FORGET, and
FUSE\_BATCH\_FORGET) are made available by the driver on the hiprio queue
so the device is able to process them even if the request queues are
full.
\end{itemize}
FUSE notify messages are received on the notification queue if
VIRTIO_FS_F_NOTIFICATION is set.
\subsubsection{Device Operation: Request Queues}\label{sec:Device Types / File System Device / Device Operation / Device Operation: Request Queues}
The driver enqueues normal requests on an arbitrary request queue. High
priority requests are not placed on request queues. The device processes
requests in any order. The driver is responsible for ensuring that ordering
constraints are met by making available a dependent request only after its
prerequisite request has been used.
Requests have the following format with endianness chosen by the driver in the
FUSE\_INIT request used to initiate the session as detailed below:
\begin{lstlisting}
struct virtio_fs_req {
// Device-readable part
struct fuse_in_header in;
u8 datain[];
// Device-writable part
struct fuse_out_header out;
u8 dataout[];
};
\end{lstlisting}
Note that the words ``in'' and ``out'' follow the FUSE meaning and do not indicate
the direction of data transfer under VIRTIO. ``In'' means input to a request and
``out'' means output from processing a request.
\field{in} is the common header for all types of FUSE requests.
\field{datain} consists of request-specific data, if any. This is identical to
the data read from the /dev/fuse device by a FUSE daemon.
\field{out} is the completion header common to all types of FUSE requests.
\field{dataout} consists of request-specific data, if any. This is identical
to the data written to the /dev/fuse device by a FUSE daemon.
For example, the full layout of a FUSE\_READ request is as follows:
\begin{lstlisting}
struct virtio_fs_read_req {
// Device-readable part
struct fuse_in_header in;
union {
struct fuse_read_in readin;
u8 datain[sizeof(struct fuse_read_in)];
};
// Device-writable part
struct fuse_out_header out;
u8 dataout[out.len - sizeof(struct fuse_out_header)];
};
\end{lstlisting}
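As an illustration of the FUSE\_READ layout above, the sketch below computes
the sizes of the device-readable and device-writable parts and recovers the
payload length from \field{out.len}. The structure definitions are assumptions
that mirror recent versions of the Linux FUSE header; the authoritative layout
is given by \hyperref[intro:FUSE]{FUSE}. The helper names are hypothetical.
\begin{lstlisting}
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layouts, mirroring recent Linux FUSE protocol versions. */
struct fuse_in_header {
        uint32_t len, opcode;
        uint64_t unique, nodeid;
        uint32_t uid, gid, pid, padding;
};
struct fuse_read_in {
        uint64_t fh, offset;
        uint32_t size, read_flags;
        uint64_t lock_owner;
        uint32_t flags, padding;
};
struct fuse_out_header {
        uint32_t len;
        int32_t error;
        uint64_t unique;
};

static_assert(sizeof(struct fuse_in_header) == 40, "unexpected padding");
static_assert(sizeof(struct fuse_read_in) == 40, "unexpected padding");
static_assert(sizeof(struct fuse_out_header) == 16, "unexpected padding");

/* Bytes the driver makes device-readable for a FUSE_READ request. */
static size_t virtio_fs_read_readable_len(void)
{
        return sizeof(struct fuse_in_header) + sizeof(struct fuse_read_in);
}

/* Bytes the driver makes device-writable to receive up to max_read bytes. */
static size_t virtio_fs_read_writable_len(uint32_t max_read)
{
        return sizeof(struct fuse_out_header) + (size_t)max_read;
}

/*
 * Derive the dataout length from out.len (already converted to host
 * byte order). Returns 0 on success, -1 if out.len is too small.
 */
static int virtio_fs_read_payload_len(uint32_t out_len, size_t *payload_len)
{
        if (out_len < sizeof(struct fuse_out_header))
                return -1;
        *payload_len = out_len - sizeof(struct fuse_out_header);
        return 0;
}
\end{lstlisting}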
The FUSE protocol documented in \hyperref[intro:FUSE]{FUSE} specifies the set
of request types and their contents.
The endianness of the FUSE protocol session is detectable by inspecting the
uint32\_t \field{in.opcode} field of the FUSE\_INIT request sent by the driver
to the device. This allows the device to determine whether the session is
little-endian or big-endian. The next FUSE\_INIT message terminates the
current session and starts a new session with the possibility of changing
endianness.
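The endianness check described above can be sketched as follows. This is one
possible device-side approach, not a required implementation. It assumes that
\field{in.opcode} is located at byte offset 4 of the request and that
FUSE\_INIT has the opcode value 26, as in the Linux FUSE header; the function
and enum names are hypothetical.
\begin{lstlisting}
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FUSE_INIT_OPCODE   26u /* assumed value of FUSE_INIT */
#define FUSE_OPCODE_OFFSET 4u  /* opcode follows the 32-bit len field */

enum fuse_endian {
        FUSE_ENDIAN_UNKNOWN,
        FUSE_ENDIAN_LITTLE,
        FUSE_ENDIAN_BIG,
};

/* Inspect the raw bytes of a request expected to be FUSE_INIT. */
static enum fuse_endian fuse_detect_endian(const uint8_t *req, size_t req_len)
{
        const uint8_t *p;
        uint32_t le, be;

        if (req == NULL || req_len < FUSE_OPCODE_OFFSET + 4)
                return FUSE_ENDIAN_UNKNOWN;

        p = req + FUSE_OPCODE_OFFSET;
        le = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
             ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
        be = (uint32_t)p[3] | ((uint32_t)p[2] << 8) |
             ((uint32_t)p[1] << 16) | ((uint32_t)p[0] << 24);

        if (le == FUSE_INIT_OPCODE)
                return FUSE_ENDIAN_LITTLE;
        if (be == FUSE_INIT_OPCODE)
                return FUSE_ENDIAN_BIG;
        return FUSE_ENDIAN_UNKNOWN; /* not a FUSE_INIT request */
}
\end{lstlisting}
Reading the field byte by byte keeps the check independent of the byte order
of the machine on which the device implementation runs.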
\subsubsection{Device Operation: High Priority Queue}\label{sec:Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
The hiprio queue follows the same request format as the request queues. This
queue only contains FUSE\_INTERRUPT, FUSE\_FORGET, and FUSE\_BATCH\_FORGET
requests.
Interrupt and forget requests have a higher priority than normal requests. The
separate hiprio queue is used for these requests to ensure they can be
delivered even when all request queues are full.
\devicenormative{\paragraph}{Device Operation: High Priority Queue}{Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
The device MUST NOT pause processing of the hiprio queue due to activity on a
normal request queue.
The device MAY process request queues concurrently with the hiprio queue.
\drivernormative{\paragraph}{Device Operation: High Priority Queue}{Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
The driver MUST submit FUSE\_INTERRUPT, FUSE\_FORGET, and FUSE\_BATCH\_FORGET requests solely on the hiprio queue.
The driver MUST NOT submit normal requests on the hiprio queue.
The driver MUST anticipate that request queues are processed concurrently with the hiprio queue.
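The queue selection rules above can be sketched as a small driver-side helper.
The opcode values are assumptions taken from the Linux FUSE header. The
parameters \texttt{first\_request\_queue} (the index of the first request
queue) and \texttt{hint} (any value used to spread load, such as a CPU number)
are hypothetical and only serve the illustration.
\begin{lstlisting}
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed opcode values from the Linux FUSE protocol. */
#define FUSE_FORGET        2u
#define FUSE_INTERRUPT    36u
#define FUSE_BATCH_FORGET 42u

#define VIRTIO_FS_HIPRIO_QUEUE 0u

static bool fuse_opcode_is_hiprio(uint32_t opcode)
{
        return opcode == FUSE_FORGET ||
               opcode == FUSE_INTERRUPT ||
               opcode == FUSE_BATCH_FORGET;
}

/*
 * Return the virtqueue index on which a request is submitted.
 * num_request_queues comes from the device configuration and is
 * at least 1, so the modulo below is well defined.
 */
static uint32_t virtio_fs_pick_queue(uint32_t opcode,
                                     uint32_t first_request_queue,
                                     uint32_t num_request_queues,
                                     uint32_t hint)
{
        assert(num_request_queues >= 1);

        if (fuse_opcode_is_hiprio(opcode))
                return VIRTIO_FS_HIPRIO_QUEUE;

        /* Any request queue is valid; spread requests using the hint. */
        return first_request_queue + hint % num_request_queues;
}
\end{lstlisting}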
\subsubsection{Device Operation: Notification Queue}\label{sec:Device Types / File System Device / Device Operation / Device Operation: Notification Queue}
The notification queue is populated with buffers by the driver and these
buffers are used by the device to emit FUSE notify messages. Notification
queue buffer layout is as follows:
\begin{lstlisting}
struct virtio_fs_notify {
// Device-writable part
struct fuse_out_header out_hdr;
char outarg[notify_buf_size - sizeof(struct fuse_out_header)];
};
\end{lstlisting}
\field{outarg} contains the FUSE notify message payload that depends on the
type of notification being emitted.
If the driver provides notification queue buffers at a slower rate than the
device emits FUSE notify messages then the virtqueue will eventually become
empty. The behavior in response to an empty virtqueue depends on the FUSE
notify message type. The following FUSE notify message types are supported:
\begin{itemize}
\item FUSE\_NOTIFY\_LOCK messages are delivered when buffers become available again. The device has resources for a certain number of lock requests. If the device runs out of resources new lock requests fail with ENOLCK.
\end{itemize}
\drivernormative{\paragraph}{Device Operation: Notification Queue}{Device Types / File System Device / Device Operation / Device Operation: Notification Queue}
The driver MUST provide buffers of at least \field{notify_buf_size} bytes.
The driver SHOULD replenish notification queue buffers sufficiently quickly so
that there is always at least one available buffer.
\subsubsection{Device Operation: DAX Window}\label{sec:Device Types / File System Device / Device Operation / Device Operation: DAX Window}
FUSE\_READ and FUSE\_WRITE requests transfer file contents between the
driver-provided buffer and the device. In cases where data transfer is
undesirable, the device can map file contents into the DAX window shared memory
region. The driver then accesses file contents directly in device-owned memory
without a data transfer.
The DAX Window is an alternative mechanism for accessing file contents.
FUSE\_READ/FUSE\_WRITE requests and DAX Window accesses are possible at the
same time. Providing the DAX Window is optional for devices. Using the DAX
Window is optional for drivers.
Shared memory region ID 0 is called the DAX window. Drivers map this shared
memory region with writeback caching as if it were regular RAM. The contents
of the DAX window are undefined unless a mapping exists for that range.
The driver maps a file range into the DAX window using the FUSE\_SETUPMAPPING
request. Alignment constraints for FUSE\_SETUPMAPPING and FUSE\_REMOVEMAPPING
requests are communicated during FUSE\_INIT negotiation.
When a FUSE\_SETUPMAPPING request perfectly overlaps a previous mapping, the
previous mapping is replaced. When a mapping partially overlaps a previous
mapping, the previous mapping is split into one or two smaller mappings. When
a mapping is partially unmapped it is also split into one or two smaller
mappings.
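The splitting behavior can be illustrated with half-open ranges inside the DAX
window. The sketch below computes what remains of an existing mapping after a
new mapping or an unmapping covers part of it. The type
\texttt{struct dax\_range} and the function \texttt{dax\_range\_subtract} are
hypothetical, and both ranges are assumed to be non-empty.
\begin{lstlisting}
#include <assert.h>
#include <stdint.h>

/* Half-open byte range [start, end) within the DAX window. */
struct dax_range {
        uint64_t start;
        uint64_t end;
};

/*
 * Compute the parts of old that survive after cut is mapped over it
 * or unmapped from it. Stores up to two ranges in out and returns
 * how many were stored:
 *   0 - old is entirely covered and is replaced or removed
 *   1 - old is trimmed on one side, or does not overlap cut at all
 *   2 - cut lies strictly inside old, which is split in two
 */
static int dax_range_subtract(struct dax_range old, struct dax_range cut,
                              struct dax_range out[2])
{
        int n = 0;

        if (cut.end <= old.start || cut.start >= old.end) {
                out[n++] = old; /* no overlap: old is unchanged */
                return n;
        }
        if (cut.start > old.start)
                out[n++] = (struct dax_range){ old.start, cut.start };
        if (cut.end < old.end)
                out[n++] = (struct dax_range){ cut.end, old.end };
        return n;
}
\end{lstlisting}
For example, cutting the range from 0x2000 to 0x3000 out of a mapping that
spans 0x1000 to 0x5000 leaves two smaller mappings, one from 0x1000 to 0x2000
and one from 0x3000 to 0x5000.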
Establishing new mappings or splitting existing mappings consumes resources.
If the device runs out of resources the FUSE\_SETUPMAPPING request fails until
resources are available again following FUSE\_REMOVEMAPPING.
After FUSE\_SETUPMAPPING has completed successfully the file range is
accessible from the DAX window at the offset provided by the driver in the
request. A mapping is removed using the FUSE\_REMOVEMAPPING request.
Data is only guaranteed to be persistent when a FUSE\_FSYNC request is used by
the device after having been made available by the driver following the write.
\devicenormative{\paragraph}{Device Operation: DAX Window}{Device Types / File System Device / Device Operation / Device Operation: DAX Window}
The device MAY provide the DAX Window for memory-mapped access to file contents. If present, the DAX Window MUST be shared memory region ID 0.
The device MUST support FUSE\_READ and FUSE\_WRITE requests regardless of whether the DAX Window is being used or not.
The device MUST allow mappings that completely or partially overlap existing mappings within the DAX window.
The device MUST reject mappings that would go beyond the end of the DAX window.
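A device implementation might check the last requirement as sketched below.
The function name and its parameters are hypothetical: \texttt{moffset} stands
for the offset into the DAX window requested by the driver, \texttt{len} for
the mapping length, and \texttt{window\_size} for the size of shared memory
region ID 0. The comparison is arranged so that it cannot overflow.
\begin{lstlisting}
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if [moffset, moffset + len) lies within the DAX window.
 * Computing moffset + len directly could wrap around for large values,
 * so the length is compared against the remaining space instead.
 */
static bool dax_mapping_fits(uint64_t moffset, uint64_t len,
                             uint64_t window_size)
{
        if (moffset > window_size)
                return false;
        return len <= window_size - moffset;
}
\end{lstlisting}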
\drivernormative{\paragraph}{Device Operation: DAX Window}{Device Types / File System Device / Device Operation / Device Operation: DAX Window}
The driver SHOULD be prepared to find shared memory region ID 0 absent and fall back to FUSE\_READ and FUSE\_WRITE requests.
The driver MAY use both FUSE\_READ/FUSE\_WRITE requests and the DAX Window to access file contents.
The driver MUST NOT access DAX window areas that have not been mapped.
\subsubsection{Security Considerations}\label{sec:Device Types / File System Device / Security Considerations}
The device provides access to a file system containing files owned by one or
more POSIX user ids and group ids. The device has no secure way of
differentiating between users originating requests via the driver. Therefore
the device accepts the POSIX user ids and group ids provided by the driver and
security is enforced by the driver rather than the device. It is nevertheless
possible for devices to implement POSIX user id and group id mapping or
whitelisting to control the ownership and access available to the driver.
File systems containing special files including device nodes and setuid
executable files pose a security concern. These properties are defined by the
file type and mode, which are set by the driver when creating new files or by
changes at a later time. These special files present a security risk when the
file system is shared with another machine. A setuid executable or a device
node placed by a malicious machine makes it possible for unprivileged users on
other machines to elevate their privileges through the shared file system.
This issue can be solved on some operating systems using mount options that
ignore special files. It is also possible for devices to implement
restrictions on special files by refusing their creation.
When the device provides shared access to a file system between multiple
machines, symlink race conditions, exhausting file system capacity, and
overwriting or deleting files used by others are factors to consider. These
issues have a long history in multi-user operating systems and also apply to
virtio-fs. They are typically managed at the file system administration level
by providing shared access only to mutually trusted users.
Multiple machines sharing access to a file system are susceptible to timing
side-channel attacks. By measuring the latency of accesses to file contents or
file system metadata it is possible to infer whether other machines also
accessed the same information. Short latencies indicate that the information
was cached due to a previous access. This can reveal sensitive information,
such as whether certain code paths were taken. The DAX Window provides direct
access to file contents and is therefore a likely target of such attacks.
These attacks are also possible with traditional FUSE requests. The safest
approach is to avoid sharing file systems between untrusted machines.
\subsubsection{Live migration considerations}\label{sec:Device Types / File System Device / Live Migration Considerations}
When a driver is migrated to a new device it is necessary to consider the FUSE
session and its state. The continuity of FUSE inode numbers (also known as
nodeids) and fh values is necessary so the driver can continue operation
without disruption.
It is possible to maintain the FUSE session across live migration either by
transferring the state or by redirecting requests from the new device to the
old device where the state resides. The details of how to achieve this are
implementation-dependent and are not visible at the device interface level.
Maintaining version and feature information negotiated by FUSE\_INIT is
necessary so that no FUSE protocol feature changes are visible to the driver
across live migration. The FUSE\_INIT information forms part of the FUSE
session state that needs to be transferred during live migration.