<!DOCTYPE html>
<meta charset=utf-8>
<title>Files - Dive Into Python 3</title>
<!--[if IE]><script src=j/html5.js></script><![endif]-->
<link rel=stylesheet href=dip3.css>
<style>
body{counter-reset:h1 11}
mark{display:inline}
</style>
<link rel=stylesheet media='only screen and (max-device-width: 480px)' href=mobile.css>
<link rel=stylesheet media=print href=print.css>
<meta name=viewport content='initial-scale=1.0'>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input type=search name=q size=25 placeholder="powered by Google™"> <input type=submit name=sa value=Search></div></form>
<p>You are here: <a href=index.html>Home</a> <span class=u>‣</span> <a href=table-of-contents.html#files>Dive Into Python 3</a> <span class=u>‣</span>
<p id=level>Difficulty level: <span class=u title=intermediate>♦♦♦♢♢</span>
<h1>Files</h1>
<blockquote class=q>
<p><span class=u>❝</span> A nine mile walk is no joke, especially in the rain. <span class=u>❞</span><br>— Harry Kemelman, <cite>The Nine Mile Walk</cite>
</blockquote>
<p id=toc>
<h2 id=divingin>Diving In</h2>
<p class=f>My Windows laptop had 38,493 files before I installed a single application. Installing Python 3 added almost 3,000 files to that total. Files are the primary storage paradigm of every major operating system; the concept is so ingrained that most people would have trouble <a href=http://en.wikipedia.org/wiki/Computer_file#History>imagining an alternative</a>. Your computer is, metaphorically speaking, drowning in files.
<h2 id=reading>Reading From Text Files</h2>
<p>Before you can read from a file, you need to open it. Opening a file in Python couldn’t be easier:
<pre class='nd pp'><code>a_file = open('examples/chinese.txt', encoding='utf-8')</code></pre>
<p>Python has a built-in <code>open()</code> function, which takes a filename as an argument. Here the filename is <code class=pp>'examples/chinese.txt'</code>. There are five interesting things about this filename:
<ol>
<li>It’s not just the name of a file; it’s a combination of a directory path and a filename. A hypothetical file-opening function could have taken two arguments — a directory path and a filename — but the <code>open()</code> function only takes one. In Python, whenever you need a “filename,” you can include some or all of a directory path as well.
<li>The directory path uses a forward slash, but I didn’t say what operating system I was using. Windows uses backward slashes to denote subdirectories, while Mac OS X and Linux use forward slashes. But in Python, forward slashes always Just Work, even on Windows.
<li>The directory path does not begin with a slash or a drive letter, so it is called a <i>relative path</i>. Relative to what, you might ask? Patience, grasshopper.
<li>It’s a string. All modern operating systems (even Windows!) use Unicode to store the names of files and directories. Python 3 fully supports non-<abbr>ASCII</abbr> pathnames.
<li>It doesn’t need to be on your local disk. You might have a network drive mounted. That “file” might be a figment of <a href=http://en.wikipedia.org/wiki/Filesystem_in_Userspace>an entirely virtual filesystem</a>. If your computer considers it a file and can access it as a file, Python can open it.
</ol>
<p>But that call to the <code>open()</code> function didn’t stop at the filename. There’s another argument, called <code>encoding</code>. Oh dear, <a href=strings.html#boring-stuff>that sounds dreadfully familiar</a>.
<h3 id=encoding>Character Encoding Rears Its Ugly Head</h3>
<p>Bytes are bytes; <a href=strings.html#byte-arrays>characters are an abstraction</a>. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters (otherwise known as a string).
<pre>
# This example was created on Windows. Other platforms may
# behave differently, for reasons outlined below.
<samp class=p>>>> </samp><kbd class=pp>file = open('examples/chinese.txt')</kbd>
<samp class=p>>>> </samp><kbd class=pp>a_string = file.read()</kbd>
<samp class=traceback>Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: character maps to <undefined></samp>
<samp class=p>>>> </samp></pre>
<aside>The default encoding is platform-dependent.</aside>
<p>What just happened? You didn’t specify a character encoding, so Python is forced to use the default encoding. What’s the default encoding? If you look closely at the traceback, you can see that it’s dying in <code>cp1252.py</code>, meaning that Python is using CP-1252 as the default encoding here. (CP-1252 is a common encoding on computers running Microsoft Windows.) The CP-1252 character set doesn’t support the characters that are in this file, so the read fails with an ugly <code>UnicodeDecodeError</code>.
<p>But wait, it’s worse than that! The default encoding is <em>platform-dependent</em>, so this code <em>might</em> work on your computer (if your default encoding is <abbr>UTF-8</abbr>), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252).
<blockquote class=note>
<p><span class=u>☞</span>If you need to get the default character encoding, import the <code>locale</code> module and call <code>locale.getpreferredencoding()</code>. On my Windows laptop, it returns <code>'cp1252'</code>, but on my Linux box upstairs, it returns <code>'UTF8'</code>. I can’t even maintain consistency in my own house! Your results may be different (even on Windows) depending on which version of your operating system you have installed and how your regional/language settings are configured. This is why it’s so important to specify the encoding every time you open a file.
</blockquote>
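<p>Here is a minimal sketch of why the default matters. It takes the <abbr>UTF-8</abbr> bytes of a few Chinese characters and tries to decode them two ways. The helper name <code>try_decode</code> is hypothetical, made up for this illustration; everything else is standard library.

```python
import locale

def try_decode(raw, encoding):
    """Return the decoded string, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

text = '程序员'
raw = text.encode('utf-8')            # the bytes that would sit on disk

# Platform-dependent: this value varies from machine to machine.
print(locale.getpreferredencoding())

as_utf8 = try_decode(raw, 'utf-8')    # the right encoding round-trips
as_cp1252 = try_decode(raw, 'cp1252') # byte 0x8f has no meaning in CP-1252

print(as_utf8 == text)
print(as_cp1252 is None)
```

<p>The bytes never changed; only the encoding used to interpret them did.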
<h3 id=file-objects>Stream Objects</h3>
<p>So far, all we know is that Python has a built-in function called <code>open()</code>. The <code>open()</code> function returns a <i>stream object</i>, which has methods and attributes for getting information about and manipulating a stream of characters.
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>a_file = open('examples/chinese.txt', encoding='utf-8')</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.name</kbd> <span class=u>①</span></a>
<samp class=pp>'examples/chinese.txt'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.encoding</kbd> <span class=u>②</span></a>
<samp class=pp>'utf-8'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.mode</kbd> <span class=u>③</span></a>
<samp class=pp>'r'</samp></pre>
<ol>
<li>The <code>name</code> attribute reflects the name you passed in to the <code>open()</code> function when you opened the file. It is not normalized to an absolute pathname.
<li>Likewise, the <code>encoding</code> attribute reflects the encoding you passed in to the <code>open()</code> function. If you didn’t specify the encoding when you opened the file (bad developer!) then the <code>encoding</code> attribute will reflect <code>locale.getpreferredencoding()</code>.
<li>The <code>mode</code> attribute tells you in which mode the file was opened. You can pass an optional <var>mode</var> parameter to the <code>open()</code> function. You didn’t specify a mode when you opened this file, so Python defaults to <code>'r'</code>, which means “open for reading only, in text mode.” As you’ll see later in this chapter, the file mode serves several purposes; different modes let you write to a file, append to a file, or open a file in binary mode (in which you deal with bytes instead of strings).
</ol>
<blockquote class=note>
<p><span class=u>☞</span>The <a href=http://docs.python.org/3.1/library/io.html#module-interface>documentation for the <code>open()</code> function</a> lists all the possible file modes.
</blockquote>
<h3 id=read>Reading Data From A Text File</h3>
<p>After you open a file for reading, you’ll probably want to read from it at some point.
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>a_file = open('examples/chinese.txt', encoding='utf-8')</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>①</span></a>
<samp class=pp>'Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>②</span></a>
<samp class=pp>''</samp></pre>
<ol>
<li>Once you open a file (with the correct encoding), reading from it is just a matter of calling the stream object’s <code>read()</code> method. The result is a string.
<li>Perhaps somewhat surprisingly, reading the file again does not raise an exception. Python does not consider reading past end-of-file to be an error; it simply returns an empty string.
</ol>
<aside>Always specify an <code>encoding</code> parameter when you open a file.</aside>
<p>What if you want to re-read a file?
<pre class=screen>
# continued from the previous example
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>①</span></a>
<samp class=pp>''</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.seek(0)</kbd> <span class=u>②</span></a>
<samp class=pp>0</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read(16)</kbd> <span class=u>③</span></a>
<samp class=pp>'Dive Into Python'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read(1)</kbd> <span class=u>④</span></a>
<samp class=pp>' '</samp>
<samp class=p>>>> </samp><kbd class=pp>a_file.read(1)</kbd>
<samp class=pp>'是'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.tell()</kbd> <span class=u>⑤</span></a>
<samp class=pp>20</samp></pre>
<ol>
<li>Since you’re still at the end of the file, further calls to the stream object’s <code>read()</code> method simply return an empty string.
<li>The <code>seek()</code> method moves to a specific byte position in a file.
<li>The <code>read()</code> method can take an optional parameter, the number of characters to read.
<li>If you like, you can even read one character at a time.
<li>16 + 1 + 1 = … 20?
</ol>
<p>Let’s try that again.
<pre class=screen>
# continued from the previous example
<a><samp class=p>>>> </samp><kbd class=pp>a_file.seek(17)</kbd> <span class=u>①</span></a>
<samp class=pp>17</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read(1)</kbd> <span class=u>②</span></a>
<samp class=pp>'是'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.tell()</kbd> <span class=u>③</span></a>
<samp class=pp>20</samp></pre>
<ol>
<li>Move to the 17<sup>th</sup> byte.
<li>Read one character.
<li>Now you’re on the 20<sup>th</sup> byte.
</ol>
<p>Do you see it yet? The <code>seek()</code> and <code>tell()</code> methods always count <em>bytes</em>, but since you opened this file as text, the <code>read()</code> method counts <em>characters</em>. Chinese characters <a href=strings.html#boring-stuff>require multiple bytes to encode in <abbr>UTF-8</abbr></a>. The English characters in the file only require one byte each, so you might be misled into thinking that the <code>seek()</code> and <code>read()</code> methods are counting the same thing. But that’s only true for some characters.
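<p>One way to see the two kinds of counting side by side is to compare the number of characters you read with the size of the file on disk. This is only a sketch: the file name <code>sample.txt</code> and the temporary directory are assumptions for illustration, not part of the book’s examples.

```python
import os
import tempfile

text = 'Dive Into Python 是'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')   # hypothetical scratch file

    with open(path, mode='w', encoding='utf-8') as a_file:
        a_file.write(text)

    with open(path, encoding='utf-8') as a_file:
        char_count = len(a_file.read())   # read() hands back characters

    byte_count = os.path.getsize(path)    # the file system counts bytes

print(char_count)   # 18
print(byte_count)   # 20
```

<p>Seventeen one-byte characters plus one three-byte character: 18 characters, 20 bytes.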
<p>But wait, it gets worse!
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.seek(18)</kbd> <span class=u>①</span></a>
<samp class=pp>18</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read(1)</kbd> <span class=u>②</span></a>
<samp class=traceback>Traceback (most recent call last):
File "<pyshell#12>", line 1, in <module>
a_file.read(1)
File "C:\Python31\lib\codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x98 in position 0: unexpected code byte</samp></pre>
<ol>
<li>Move to the 18<sup>th</sup> byte and try to read one character.
<li>Why does this fail? Because there isn’t a character at the 18<sup>th</sup> byte. The nearest character starts at the 17<sup>th</sup> byte (and goes for three bytes). Trying to read a character from the middle will fail with a <code>UnicodeDecodeError</code>.
</ol>
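<p>You can reproduce the same failure without a file at all. The sketch below encodes a single Chinese character by hand and then tries to decode it starting from its second byte, which is roughly what happens when you seek into the middle of a character.

```python
encoded = '是'.encode('utf-8')     # three bytes: b'\xe6\x98\xaf'

whole = encoded.decode('utf-8')    # decoding from the first byte works

try:
    encoded[1:].decode('utf-8')    # start from the middle of the character
    mid_character_error = None
except UnicodeDecodeError as error:
    mid_character_error = error

print(len(encoded))
print(whole == '是')
print(type(mid_character_error).__name__)
```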
<h3 id=close>Closing Files</h3>
<p>Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It’s important to close files as soon as you’re finished with them.
<pre class='nd screen'>
# continued from the previous example
<samp class=p>>>> </samp><kbd class=pp>a_file.close()</kbd></pre>
<p>Well <em>that</em> was anticlimactic.
<p>The stream object <var>a_file</var> still exists; calling its <code>close()</code> method doesn’t destroy the object itself. But it’s not terribly useful.
<pre class=screen>
# continued from the previous example
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>①</span></a>
<samp class=traceback>Traceback (most recent call last):
File "<pyshell#24>", line 1, in <module>
a_file.read()
ValueError: I/O operation on closed file.</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.seek(0)</kbd> <span class=u>②</span></a>
<samp class=traceback>Traceback (most recent call last):
File "<pyshell#25>", line 1, in <module>
a_file.seek(0)
ValueError: I/O operation on closed file.</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.tell()</kbd> <span class=u>③</span></a>
<samp class=traceback>Traceback (most recent call last):
File "<pyshell#26>", line 1, in <module>
a_file.tell()
ValueError: I/O operation on closed file.</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.close()</kbd> <span class=u>④</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.closed</kbd> <span class=u>⑤</span></a>
<samp class=pp>True</samp></pre>
<ol>
<li>You can’t read from a closed file; that raises a <code>ValueError</code> exception.
<li>You can’t seek in a closed file either.
<li>There’s no current position in a closed file, so the <code>tell()</code> method also fails.
<li>Perhaps surprisingly, calling the <code>close()</code> method on a stream object whose file has been closed does <em>not</em> raise an exception. It’s just a no-op.
<li>Closed stream objects do have one useful attribute: the <code>closed</code> attribute will confirm that the file is closed.
</ol>
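<p>If you’d rather check for a closed file in a script than read tracebacks, you can catch the exception yourself. A rough sketch, using a scratch file in a temporary directory (both are assumptions for illustration):

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.txt')   # hypothetical scratch file

# Create something to read (write mode is covered later in this chapter).
setup = open(path, mode='w', encoding='utf-8')
setup.write('PapayaWhip')
setup.close()

a_file = open(path, encoding='utf-8')
first_read = a_file.read()
a_file.close()
a_file.close()                 # closing twice is a harmless no-op

try:
    a_file.read()              # any I/O on a closed stream object fails
    error_name = None
except ValueError as error:
    error_name = type(error).__name__

is_closed = a_file.closed
shutil.rmtree(tmp_dir)         # clean up the scratch directory

print(error_name)
print(is_closed)
```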
<h3 id=with>Closing Files Automatically</h3>
<aside><code>try..finally</code> is good. <code>with</code> is better.</aside>
<p>Stream objects have an explicit <code>close()</code> method, but what happens if your code has a bug and crashes before you call <code>close()</code>? That file could theoretically stay open for much longer than necessary. While you’re debugging on your local computer, that’s not a big deal. On a production server, maybe it is.
<p>Python 2 had a solution for this: the <code>try..finally</code> block. That still works in Python 3, and you may see it in other people’s code or in older code that was <a href=case-study-porting-chardet-to-python-3.html>ported to Python 3</a>. But Python 2.6 introduced a cleaner solution, which is now the preferred solution in Python 3: the <code>with</code> statement.
<pre class='nd pp'><code>with open('examples/chinese.txt', encoding='utf-8') as a_file:
a_file.seek(17)
a_character = a_file.read(1)
print(a_character)</code></pre>
<p>This code calls <code>open()</code>, but it never calls <code>a_file.close()</code>. The <code>with</code> statement starts a code block, like an <code>if</code> statement or a <code>for</code> loop. Inside this code block, you can use the variable <var>a_file</var> as the stream object returned from the call to <code>open()</code>. All the regular stream object methods are available — <code>seek()</code>, <code>read()</code>, whatever you need. When the <code>with</code> block ends, <em>Python calls <code>a_file.close()</code> automatically</em>.
<p>Here’s the kicker: no matter how or when you exit the <code>with</code> block, Python will close that file… even if you “exit” it via an unhandled exception. That’s right, even if your code raises an exception and your entire program comes to a screeching halt, that file will get closed. Guaranteed.
<blockquote class=note>
<p><span class=u>☞</span>In technical terms, the <code>with</code> statement creates a <dfn>runtime context</dfn>. In these examples, the stream object acts as a <dfn>context manager</dfn>. Python creates the stream object <var>a_file</var> and tells it that it is entering a runtime context. When the <code>with</code> code block is completed, Python tells the stream object that it is exiting the runtime context, and the stream object calls its own <code>close()</code> method. See <a href=special-method-names.html#context-managers>Appendix B, “Classes That Can Be Used in a <code>with</code> Block”</a> for details.
</blockquote>
<p>There’s nothing file-specific about the <code>with</code> statement; it’s just a generic framework for creating runtime contexts and telling objects that they’re entering and exiting a runtime context. If the object in question is a stream object, then it does useful file-like things (like closing the file automatically). But that behavior is defined in the stream object, not in the <code>with</code> statement. There are lots of other ways to use context managers that have nothing to do with files. You can even create your own, as you’ll see later in this chapter.
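<p>To make the guarantee concrete, here is a small sketch that puts the two patterns side by side and then checks the <code>closed</code> attribute afterwards. The <code>RuntimeError</code> stands in for whatever bug your real code might have; the scratch file is an assumption for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')   # hypothetical scratch file
    with open(path, mode='w', encoding='utf-8') as a_file:
        a_file.write('PapayaWhip')

    # The older pattern: close the file by hand in a finally clause.
    old_style = open(path, encoding='utf-8')
    try:
        old_style.read()
    finally:
        old_style.close()

    # The with statement: the file is closed even though the block raises.
    try:
        with open(path, encoding='utf-8') as a_file:
            raise RuntimeError('simulated bug')
    except RuntimeError:
        pass

    closed_by_finally = old_style.closed
    closed_by_with = a_file.closed

print(closed_by_finally)
print(closed_by_with)
```

<p>Both approaches close the file; the <code>with</code> version just has less to get wrong.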
<h3 id=for>Reading Data One Line At A Time</h3>
<p>A “line” of a text file is just what you think it is — you type a few words and press <kbd>ENTER</kbd>, and now you’re on a new line. A line of text is a sequence of characters delimited by… what exactly? Well, it’s complicated, because text files can use several different characters to mark the end of a line. Every operating system has its own convention. Some use a carriage return character, others use a line feed character, and some use both characters at the end of every line.
<p>Now breathe a sigh of relief, because <em>Python handles line endings automatically</em> by default. If you say, “I want to read this text file one line at a time,” Python will figure out which kind of line ending the text file uses and it will all Just Work.
<blockquote class=note>
<p><span class=u>☞</span>If you need fine-grained control over what’s considered a line ending, you can pass the optional <code>newline</code> parameter to the <code>open()</code> function. See <a href=http://docs.python.org/3.1/library/io.html#module-interface>the <code>open()</code> function documentation</a> for all the gory details.
</blockquote>
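<p>As a rough illustration, the sketch below writes a file that mixes all three line-ending conventions, then reads it back twice: once with the defaults, and once with <code>newline=''</code>, which tells Python to leave the line endings alone. The file is written in binary mode (covered later in this chapter) so that the exact bytes are under our control; the file name is made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mixed.txt')   # hypothetical scratch file

    # Write raw bytes so each line ending is exactly what we say it is.
    with open(path, mode='wb') as a_file:
        a_file.write(b'unix\nwindows\r\nold mac\rend')

    # Default behavior: every kind of line ending comes back as '\n'.
    with open(path, encoding='utf-8') as a_file:
        translated = list(a_file)

    # newline='' still splits on every kind, but returns them untranslated.
    with open(path, encoding='utf-8', newline='') as a_file:
        untranslated = list(a_file)

print(translated)
print(untranslated)
```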
<p>So, how do you actually do it? Read a file one line at a time, that is. It’s so simple, it’s beautiful.
<p class=d>[<a href=examples/oneline.py>download <code>oneline.py</code></a>]
<pre class=pp><code>line_number = 0
<a>with open('examples/favorite-people.txt', encoding='utf-8') as a_file: <span class=u>①</span></a>
<a> for a_line in a_file: <span class=u>②</span></a>
line_number += 1
<a> print('{:>4} {}'.format(line_number, a_line.rstrip())) <span class=u>③</span></a></code></pre>
<ol>
<li>Using <a href=#with>the <code>with</code> pattern</a>, you safely open the file and let Python close it for you.
<li>To read a file one line at a time, use a <code>for</code> loop. That’s it. Besides having explicit methods like <code>read()</code>, <em>the stream object is also an <a href=iterators.html>iterator</a></em> which spits out a single line every time you ask for a value.
<li>Using <a href=strings.html#formatting-strings>the <code>format()</code> string method</a>, you can print out the line number and the line itself. The format specifier <code>{:>4}</code> means “print this argument right-justified within 4 spaces.” The <var>a_line</var> variable contains the complete line, line ending and all. The <code>rstrip()</code> string method removes the trailing whitespace, including the line-ending characters.
</ol>
<pre class='screen cmdline'>
<samp class=p>you@localhost:~/diveintopython3$ </samp><kbd class=pp>python3 examples/oneline.py</kbd>
<samp> 1 Dora
2 Ethan
3 Wesley
4 John
5 Anne
6 Mike
7 Chris
8 Sarah
9 Alex
10 Lizzie</samp></pre>
<blockquote class=pf>
<p>Did you get this error?
<pre class='nd screen'>
<samp class=p>you@localhost:~/diveintopython3$ </samp><kbd class=pp>python3 examples/oneline.py</kbd>
<samp class=traceback>Traceback (most recent call last):
File "examples/oneline.py", line 4, in <module>
print('{:>4} {}'.format(line_number, a_line.rstrip()))
ValueError: zero length field name in format</samp></pre>
<p>If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1.
<p>Python 3.0 supported string formatting, but only with <a href=strings.html#formatting-strings>explicitly numbered format specifiers</a>. Python 3.1 allows you to omit the argument indexes in your format specifiers. Here is the Python 3.0-compatible version for comparison:
<pre class='pp nd'><code>print('{<mark>0</mark>:>4} {<mark>1</mark>}'.format(line_number, a_line.rstrip()))</code></pre>
</blockquote>
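<p>If keeping your own counter feels like busywork, the built-in <code>enumerate()</code> function can do the counting for you. This is an alternative sketch, not the contents of <code>oneline.py</code>; it uses a three-line scratch file instead of <code>favorite-people.txt</code> so that it can run anywhere.

```python
import os
import tempfile

numbered = []

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'people.txt')   # hypothetical scratch file
    with open(path, mode='w', encoding='utf-8') as a_file:
        a_file.write('Dora\nEthan\nWesley\n')

    with open(path, encoding='utf-8') as a_file:
        # enumerate() pairs each line with a counter, starting at 1.
        for line_number, a_line in enumerate(a_file, start=1):
            numbered.append('{:>4} {}'.format(line_number, a_line.rstrip()))

for entry in numbered:
    print(entry)
```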
<p class=a>⁂
<h2 id=writing>Writing to Text Files</h2>
<aside>Just open a file and start writing.</aside>
<p>You can write to files in much the same way that you read from them. First you open a file and get a stream object, then you use methods on the stream object to write data to the file, then you close the file.
<p>To open a file for writing, use the <code>open()</code> function and specify the write mode. There are two file modes for writing:
<ul>
<li>“Write” mode will overwrite the file. Pass <code>mode='w'</code> to the <code>open()</code> function.
<li>“Append” mode will add data to the end of the file. Pass <code>mode='a'</code> to the <code>open()</code> function.
</ul>
<p>Either mode will create the file automatically if it doesn’t already exist, so there’s never a need for any sort of fiddly “if the file doesn’t exist yet, create a new empty file just so you can open it for the first time” function. Just open a file and start writing.
<p>You should always close a file as soon as you’re done writing to it, to release the file handle and ensure that the data is actually written to disk. As with reading data from a file, you can call the stream object’s <code>close()</code> method, or you can use the <code>with</code> statement and let Python close the file for you. I bet you can guess which technique I recommend.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>with open('test.log', mode='w', encoding='utf-8') as a_file:</kbd> <span class=u>①</span></a>
<a><samp class=p>... </samp><kbd class=pp> a_file.write('test succeeded')</kbd> <span class=u>②</span></a>
<samp class=p>>>> </samp><kbd class=pp>with open('test.log', encoding='utf-8') as a_file:</kbd>
<samp class=p>... </samp><kbd class=pp> print(a_file.read())</kbd>
<samp class=pp>test succeeded</samp>
<a><samp class=p>>>> </samp><kbd class=pp>with open('test.log', mode='a', encoding='utf-8') as a_file:</kbd> <span class=u>③</span></a>
<samp class=p>... </samp><kbd class=pp> a_file.write('and again')</kbd>
<samp class=p>>>> </samp><kbd class=pp>with open('test.log', encoding='utf-8') as a_file:</kbd>
<samp class=p>... </samp><kbd class=pp> print(a_file.read())</kbd>
<a><samp class=pp>test succeededand again</samp> <span class=u>④</span></a></pre>
<ol>
<li>You start boldly by creating the new file <code>test.log</code> (or overwriting the existing file), and opening the file for writing. The <code>mode='w'</code> parameter means open the file for writing. Yes, that’s all as dangerous as it sounds. I hope you didn’t care about the previous contents of that file (if any), because that data is gone now.
<li>You can add data to the newly opened file with the <code>write()</code> method of the stream object returned by the <code>open()</code> function. After the <code>with</code> block ends, Python automatically closes the file.
<li>That was so fun, let’s do it again. But this time, with <code>mode='a'</code> to append to the file instead of overwriting it. Appending will <em>never</em> harm the existing contents of the file.
<li>Both the original line you wrote and the second line you appended are now in the file <code>test.log</code>. Also note that neither carriage returns nor line feeds are included. Since you didn’t write them explicitly to the file either time, the file doesn’t include them. You can write a carriage return with the <code>'\r'</code> character, and/or a line feed with the <code>'\n'</code> character. Since you didn’t do either, everything you wrote to the file ended up on one line.
</ol>
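<p>If you do want each message on its own line, write the line feed yourself. A minimal sketch, again using a scratch file in a temporary directory rather than a real <code>test.log</code>:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test.log')   # scratch copy, not a real log

    # Write mode creates (or overwrites) the file.
    with open(path, mode='w', encoding='utf-8') as a_file:
        a_file.write('test succeeded\n')

    # Append mode adds to the end and leaves the existing contents alone.
    with open(path, mode='a', encoding='utf-8') as a_file:
        a_file.write('and again\n')

    with open(path, encoding='utf-8') as a_file:
        lines = a_file.read().splitlines()

print(lines)
```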
<h3 id=encoding-again>Character Encoding Again</h3>
<p>Did you notice the <code>encoding</code> parameter that got passed in to the <code>open()</code> function while you were <a href=#writing>opening a file for writing</a>? It’s important; don’t ever leave it out! As you saw in the beginning of this chapter, files don’t contain <i>strings</i>, they contain <i>bytes</i>. Reading a “string” from a text file only works because you told Python what encoding to use to read a stream of bytes and convert it to a string. Writing text to a file presents the same problem in reverse. You can’t write characters to a file; <a href=strings.html#byte-arrays>characters are an abstraction</a>. In order to write to the file, Python needs to know how to convert your string into a sequence of bytes. The only way to be sure it’s performing the correct conversion is to specify the <code>encoding</code> parameter when you open the file for writing.
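<p>To see that the encoding really does change what lands on disk, the sketch below writes the same one-character string twice, with two different encodings, and compares the file sizes. The choice of <code>utf-16-le</code> as the second encoding is arbitrary, picked only because it produces a different number of bytes.

```python
import os
import tempfile

text = '是'
sizes = {}

with tempfile.TemporaryDirectory() as tmp_dir:
    for encoding in ('utf-8', 'utf-16-le'):
        path = os.path.join(tmp_dir, encoding + '.txt')   # hypothetical scratch file
        with open(path, mode='w', encoding=encoding) as a_file:
            a_file.write(text)
        sizes[encoding] = os.path.getsize(path)   # bytes actually written

print(sizes)
```

<p>Same string, different bytes. Anyone reading the file back needs to know which encoding you chose.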
<p class=a>⁂
<h2 id=binary>Binary Files</h2>
<p class=ss><img src=examples/beauregard.jpg alt='my dog Beauregard' width=100 height=100>
<p>Not all files contain text. Some of them contain pictures of my dog.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>an_image = open('examples/beauregard.jpg', mode='rb')</kbd> <span class=u>①</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>an_image.mode</kbd> <span class=u>②</span></a>
<samp class=pp>'rb'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>an_image.name</kbd> <span class=u>③</span></a>
<samp class=pp>'examples/beauregard.jpg'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>an_image.encoding</kbd> <span class=u>④</span></a>
<samp class=traceback>Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: '_io.BufferedReader' object has no attribute 'encoding'</samp></pre>
<ol>
<li>Opening a file in binary mode is simple but subtle. The only difference from opening it in text mode is that the <code>mode</code> parameter contains a <code>'b'</code> character.
<li>The stream object you get from opening a file in binary mode has many of the same attributes, including <code>mode</code>, which reflects the <code>mode</code> parameter you passed into the <code>open()</code> function.
<li>Binary stream objects also have a <code>name</code> attribute, just like text stream objects.
<li>Here’s one difference, though: a binary stream object has no <code>encoding</code> attribute. That makes sense, right? You’re reading (or writing) bytes, not strings, so there’s no conversion for Python to do. What you get out of a binary file is exactly what you put into it, no conversion necessary.
</ol>
<p>Did I mention you’re reading bytes? Oh yes you are.
<pre class=screen>
# continued from the previous example
<samp class=p>>>> </samp><kbd class=pp>an_image.tell()</kbd>
<samp class=pp>0</samp>
<a><samp class=p>>>> </samp><kbd class=pp>data = an_image.read(3)</kbd> <span class=u>①</span></a>
<samp class=p>>>> </samp><kbd class=pp>data</kbd>
<samp class=pp>b'\xff\xd8\xff'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>type(data)</kbd> <span class=u>②</span></a>
<samp class=pp><class 'bytes'></samp>
<a><samp class=p>>>> </samp><kbd class=pp>an_image.tell()</kbd> <span class=u>③</span></a>
<samp class=pp>3</samp>
<samp class=p>>>> </samp><kbd class=pp>an_image.seek(0)</kbd>
<samp class=pp>0</samp>
<samp class=p>>>> </samp><kbd class=pp>data = an_image.read()</kbd>
<samp class=p>>>> </samp><kbd class=pp>len(data)</kbd>
<samp class=pp>3150</samp></pre>
<ol>
<li>Like text files, you can read binary files a little bit at a time. But there’s a crucial difference…
<li>…you’re reading bytes, not strings. Since you opened the file in binary mode, the <code>read()</code> method takes <em>the number of bytes to read</em>, not the number of characters.
<li>That means that there’s never <a href=#read>an unexpected mismatch</a> between the number you passed into the <code>read()</code> method and the position index you get out of the <code>tell()</code> method. The <code>read()</code> method reads bytes, and the <code>seek()</code> and <code>tell()</code> methods track the number of bytes read. For binary files, they’ll always agree.
</ol>
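<p>You don’t need a picture of a dog to try this. The sketch below writes a handful of bytes to a scratch file (the contents are made up, and are not a valid <abbr>JPEG</abbr>) and reads them back in binary mode.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'fake.jpg')   # hypothetical scratch file

    with open(path, mode='wb') as an_image:
        an_image.write(b'\xff\xd8\xff\xe0 not really a JPEG')

    with open(path, mode='rb') as an_image:
        data = an_image.read(3)        # three bytes, not three characters
        position = an_image.tell()     # so the position is exactly 3

print(data)
print(type(data).__name__)
print(position)
```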
<p class=a>⁂
<h2 id=file-like-objects>Stream Objects From Non-File Sources</h2>
<aside>To read from a fake file, just call <code>read()</code>.</aside>
<p>Imagine you’re writing a library, and one of your library functions is going to read some data from a file. The function could simply take a filename as a string, go open the file for reading, read it, and close it before exiting. But you shouldn’t do that. Instead, your <abbr>API</abbr> should take <em>an arbitrary stream object</em>.
<p>In the simplest case, a stream object is anything with a <code>read()</code> method which takes an optional <var>size</var> parameter and returns a string. When called with no <var>size</var> parameter, the <code>read()</code> method should read everything there is to read from the input source and return all the data as a single value. When called with a <var>size</var> parameter, it reads that much from the input source and returns that much data. When called again, it picks up where it left off and returns the next chunk of data.
<p>That sounds exactly like the stream object you get from opening a real file. The difference is that <em>you’re not limiting yourself to real files</em>. The input source that’s being “read” could be anything: a web page, a string in memory, even the output of another program. As long as your functions take a stream object and simply call the object’s <code>read()</code> method, you can handle any input source that acts like a file, without specific code to handle each kind of input.
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>a_string = 'PapayaWhip is the new black.'</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>import io</kbd> <span class=u>①</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>a_file = io.StringIO(a_string)</kbd> <span class=u>②</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>③</span></a>
<samp class=pp>'PapayaWhip is the new black.'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd> <span class=u>④</span></a>
<samp class=pp>''</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.seek(0)</kbd> <span class=u>⑤</span></a>
<samp class=pp>0</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_file.read(10)</kbd> <span class=u>⑥</span></a>
<samp class=pp>'PapayaWhip'</samp>
<samp class=p>>>> </samp><kbd class=pp>a_file.tell()</kbd>
<samp class=pp>10</samp>
<samp class=p>>>> </samp><kbd class=pp>a_file.seek(18)</kbd>
<samp class=pp>18</samp>
<samp class=p>>>> </samp><kbd class=pp>a_file.read()</kbd>
<samp class=pp>'new black.'</samp></pre>
<ol>
<li>The <code>io</code> module defines the <code>StringIO</code> class that you can use to treat a string in memory as a file.
<li>To create a stream object out of a string, create an instance of the <code>io.StringIO()</code> class and pass it the string you want to use as your “file” data. Now you have a stream object, and you can do all sorts of stream-like things with it.
<li>Calling the <code>read()</code> method “reads” the entire “file,” which in the case of a <code>StringIO</code> object simply returns the original string.
<li>Just like a real file, calling the <code>read()</code> method again returns an empty string.
<li>You can explicitly seek to the beginning of the string, just like seeking through a real file, by using the <code>seek()</code> method of the <code>StringIO</code> object.
<li>You can also read the string in chunks, by passing a <var>size</var> parameter to the <code>read()</code> method.
</ol>
<blockquote class=note>
<p><span class=u>☞</span><code>io.StringIO</code> lets you treat a string as a text file. There’s also an <code>io.BytesIO</code> class, which lets you treat a byte array as a binary file.
</blockquote>
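<p>Here is a rough sketch of what such a library function might look like. The function name <code>count_words()</code> is hypothetical; the only thing it assumes about its argument is a <code>read()</code> method, so it works with a real file, an <code>io.StringIO</code>, or an <code>io.BytesIO</code>.

```python
import io

def count_words(stream):
    """Count whitespace-separated words in any object with a read() method.

    (Hypothetical helper, for illustration only.)
    """
    data = stream.read()
    # Binary streams return bytes; decode so both kinds are handled alike.
    # UTF-8 is an assumption made for this sketch.
    if isinstance(data, bytes):
        data = data.decode('utf-8')
    return len(data.split())

text_stream = io.StringIO('PapayaWhip is the new black.')
byte_stream = io.BytesIO('PapayaWhip is the new black.'.encode('utf-8'))

text_count = count_words(text_stream)   # reads from a string in memory
byte_count = count_words(byte_stream)   # reads from bytes in memory
```

<p>The caller decides where the data comes from; the function never needs to know.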
<h3 id=gzip>Handling Compressed Files</h3>
<p>The Python standard library contains modules that support reading and writing compressed files. There are a number of different compression schemes; the two most popular on non-Windows systems are <a href=http://docs.python.org/3.1/library/gzip.html>gzip</a> and <a href=http://docs.python.org/3.1/library/bz2.html>bzip2</a>. (You may have also encountered <a href=http://docs.python.org/3.1/library/zipfile.html>PKZIP archives</a> and <a href=http://docs.python.org/3.1/library/tarfile.html>GNU Tar archives</a>. Python has modules for those, too.)
<p>The <code>gzip</code> module lets you create a stream object for reading or writing a gzip-compressed file. The stream object it gives you supports the <code>read()</code> method (if you opened it for reading) or the <code>write()</code> method (if you opened it for writing). That means you can use the methods you’ve already learned for regular files to <em>directly read or write a gzip-compressed file</em>, without creating a temporary file to store the decompressed data.
<p>As an added bonus, it supports the <code>with</code> statement too, so you can let Python automatically close your gzip-compressed file when you’re done with it.
<pre class='nd screen cmdline'>
<samp class=p>you@localhost:~$ </samp><kbd>python3</kbd>
<samp class=p>>>> </samp><kbd class=pp>import gzip</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>with gzip.open('out.log.gz', mode='wb') as z_file:</kbd> <span class=u>①</span></a>
<samp class=p>... </samp><kbd class=pp>    z_file.write('A nine mile walk is no joke, especially in the rain.'.encode('utf-8'))</kbd>
<samp class=p>... </samp>
<samp class=p>>>> </samp><kbd class=pp>exit()</kbd>
<a><samp class=p>you@localhost:~$ </samp><kbd>ls -l out.log.gz</kbd> <span class=u>②</span></a>
<samp>-rw-r--r-- 1 mark mark 79 2009-07-19 14:29 out.log.gz</samp>
<a><samp class=p>you@localhost:~$ </samp><kbd>gunzip out.log.gz</kbd> <span class=u>③</span></a>
<a><samp class=p>you@localhost:~$ </samp><kbd>cat out.log</kbd> <span class=u>④</span></a>
<samp>A nine mile walk is no joke, especially in the rain.</samp></pre>
<ol>
<li>You should always open gzipped files in binary mode. (Note the <code>'b'</code> character in the <code>mode</code> argument.)
<li>I constructed this example on Linux. If you’re not familiar with the command line, this command is showing the “long listing” of the gzip-compressed file you just created in the Python Shell. This listing shows that the file exists (good), and that it is 79 bytes long. That’s actually larger than the string you started with! The gzip file format wraps the compressed data in a header and a trailer that hold metadata about the file, so it’s inefficient for extremely small files.
<li>The <code>gunzip</code> command (pronounced “gee-unzip”) decompresses the file and stores the contents in a new file named the same as the compressed file but without the <code>.gz</code> file extension.
<li>The <code>cat</code> command displays the contents of a file. This file contains the string you originally wrote directly to the compressed file <code>out.log.gz</code> from within the Python Shell.
</ol>
<blockquote class=pf>
<p>Did you get this error?
<pre class='nd screen'>
<samp class=p>>>> </samp><kbd class=pp>with gzip.open('out.log.gz', mode='wb') as z_file:</kbd>
<samp class=p>... </samp><kbd class=pp>    z_file.write('A nine mile walk is no joke, especially in the rain.'.encode('utf-8'))</kbd>
<samp class=p>... </samp>
<samp class=traceback>Traceback (most recent call last):
  File "&lt;stdin>", line 1, in &lt;module>
AttributeError: 'GzipFile' object has no attribute '__exit__'</samp></pre>
<p>If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1.
<p>Python 3.0 had a <code>gzip</code> module, but it did not support using a gzipped-file object as a context manager. Python 3.1 added the ability to use gzipped-file objects in a <code>with</code> statement.
</blockquote>
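<p>Reading works the same way as writing. The following sketch writes the same sentence and then reads it straight back with <code>gzip.open()</code>, without ever running <code>gunzip</code>. It assumes a temporary directory so it won’t clobber a real <code>out.log.gz</code>; the variable names are made up for illustration.

```python
import gzip
import os
import tempfile

message = 'A nine mile walk is no joke, especially in the rain.'

# Use a throwaway directory so the sketch does not touch your real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'out.log.gz')

    # Write: encode the string, because a binary-mode stream takes bytes.
    with gzip.open(path, mode='wb') as z_file:
        z_file.write(message.encode('utf-8'))

    # Read: read() hands back the decompressed bytes.
    with gzip.open(path, mode='rb') as z_file:
        raw = z_file.read()

round_trip = raw.decode('utf-8')   # back to the original string
```

<p>Because the stream is opened in binary mode, you get bytes back and have to decode them yourself.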
<p class=a>⁂
<h2 id=stdio>Standard Input, Output, and Error</h2>
<aside><code>sys.stdin</code>, <code>sys.stdout</code>, <code>sys.stderr</code>.</aside>
<p>Command-line gurus are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you.
<p>Standard output and standard error (commonly abbreviated <code>stdout</code> and <code>stderr</code>) are pipes that are built into every <abbr>UNIX</abbr>-like system, including Mac OS X and Linux. When you call the <code>print()</code> function, the thing you’re printing is sent to the <code>stdout</code> pipe. When your program crashes and prints out a traceback, it goes to the <code>stderr</code> pipe. By default, both of these pipes are just connected to the terminal window where you are working; when your program prints something, you see the output in your terminal window, and when a program crashes, you see the traceback in your terminal window too. In the graphical Python Shell, the <code>stdout</code> and <code>stderr</code> pipes default to your “Interactive Window”.
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>for i in range(3):</kbd>
<a><samp class=p>... </samp><kbd class=pp>    print('PapayaWhip')</kbd>              <span class=u>①</span></a>
<samp>PapayaWhip
PapayaWhip
PapayaWhip</samp>
<samp class=p>>>> </samp><kbd class=pp>import sys</kbd>
<samp class=p>>>> </samp><kbd class=pp>for i in range(3):</kbd>
<a><samp class=p>... </samp><kbd class=pp>    l = sys.stdout.write('is the')</kbd>   <span class=u>②</span></a>
<samp>is theis theis the</samp>
<samp class=p>>>> </samp><kbd class=pp>for i in range(3):</kbd>
<a><samp class=p>... </samp><kbd class=pp>    l = sys.stderr.write('new black')</kbd> <span class=u>③</span></a>
<samp>new blacknew blacknew black</samp></pre>
<ol>
<li>The <code>print()</code> function, in a loop. Nothing surprising here.
<li><code>stdout</code> is defined in the <code>sys</code> module, and it is a <a href=#file-like-objects>stream object</a>. Calling its <code>write()</code> method will print out whatever string you give it, then return the length of the output. In fact, this is what the <code>print()</code> function really does; it adds a newline to the end of the string you’re printing, and calls <code>sys.stdout.write</code>.
<li>In the simplest case, <code>sys.stdout</code> and <code>sys.stderr</code> send their output to the same place: the Python <abbr>IDE</abbr> (if you’re in one), or the terminal (if you’re running Python from the command line). Like standard output, standard error does not add newlines for you. If you want newlines, you’ll need to write newline characters yourself.
</ol>
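<p>You can check the relationship between <code>print()</code> and <code>write()</code> for yourself. This sketch uses two <code>io.StringIO</code> objects as stand-ins for standard output, so the result can be inspected afterwards; the buffer names are made up for illustration.

```python
import io

buffer_a = io.StringIO()
buffer_b = io.StringIO()

# print() appends its `end` argument, which defaults to '\n'.
print('PapayaWhip', file=buffer_a)

# write() sends exactly what it is given and returns the character count.
written = buffer_b.write('PapayaWhip\n')

# Both buffers now hold the same text.
same_output = buffer_a.getvalue() == buffer_b.getvalue()
```

<p>The <code>file</code> argument of <code>print()</code> accepts any stream object with a <code>write()</code> method; it defaults to <code>sys.stdout</code>.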
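<p><code>sys.stdout</code> and <code>sys.stderr</code> are stream objects, but they are write-only. Attempting to call their <code>read()</code> method will always raise an <code>IOError</code>. (Newer versions of Python report it as <code>io.UnsupportedOperation</code>, a subclass of <code>OSError</code>, of which <code>IOError</code> is now an alias.)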
<pre class='nd screen'>
<samp class=p>>>> </samp><kbd class=pp>import sys</kbd>
<samp class=p>>>> </samp><kbd class=pp>sys.stdout.read()</kbd>
<samp class=traceback>Traceback (most recent call last):
  File "&lt;stdin>", line 1, in &lt;module>
IOError: not readable</samp></pre>
<h3 id=redirect>Redirecting Standard Output</h3>
<p><code>sys.stdout</code> and <code>sys.stderr</code> are stream objects, albeit ones that only support writing. But they’re not constants; they’re variables. That means you can assign them a new value — any other stream object — to redirect their output.
<p class=d>[<a href=examples/stdout.py>download <code>stdout.py</code></a>]
<pre class=pp><code>import sys

class RedirectStdoutTo:
    def __init__(self, out_new):
        self.out_new = out_new

    def __enter__(self):
        self.out_old = sys.stdout
        sys.stdout = self.out_new

    def __exit__(self, *args):
        sys.stdout = self.out_old

print('A')
with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
    print('B')
print('C')</code></pre>
<p>Check this out:
<pre class='nd screen cmdline'>
<samp class=p>you@localhost:~/diveintopython3/examples$ </samp><kbd>python3 stdout.py</kbd>
<samp>A
C</samp>
<samp class=p>you@localhost:~/diveintopython3/examples$ </samp><kbd>cat out.log</kbd>
<samp>B</samp></pre>
<blockquote class=pf>
<p>Did you get this error?
<pre class='nd screen'>
<samp class=p>you@localhost:~/diveintopython3/examples$ </samp><kbd class=pp>python3 stdout.py</kbd>
<samp class=traceback> File "stdout.py", line 15
with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
^
SyntaxError: invalid syntax</samp></pre>
<p>If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1.
<p>Python 3.0 supported the <code>with</code> statement, but each statement could only use one context manager. Python 3.1 allows you to chain multiple context managers in a single <code>with</code> statement.
</blockquote>
<p>Let’s take the last part first.
<pre class=pp><code>print('A')
with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
    print('B')
print('C')</code></pre>
<p>That’s a complicated <code>with</code> statement. Let me rewrite it as something more recognizable.
<pre class=pp><code>with open('out.log', mode='w', encoding='utf-8') as a_file:
    with RedirectStdoutTo(a_file):
        print('B')</code></pre>
<p>As the rewrite shows, you actually have <em>two</em> <code>with</code> statements, one nested within the scope of the other. The “outer” <code>with</code> statement should be familiar by now: it opens a <abbr>UTF-8</abbr>-encoded text file named <code>out.log</code> for writing and assigns the stream object to a variable named <var>a_file</var>. The “inner” <code>with</code> statement is the odd one.
<pre class='nd pp'><code>with RedirectStdoutTo(a_file):</code></pre>
<p>Where’s the <code>as</code> clause? The <code>with</code> statement doesn’t actually require one. Just like you can call a function and ignore its return value, you can have a <code>with</code> statement that doesn’t assign the <code>with</code> context to a variable. In this case, you’re only interested in the side effects of the <code>RedirectStdoutTo</code> context.
<p>What are those side effects? Take a look inside the <code>RedirectStdoutTo</code> class. This class is a custom <a href=special-method-names.html#context-managers>context manager</a>. Any class can be a context manager by defining two <a href=iterators.html#a-fibonacci-iterator>special methods</a>: <code>__enter__()</code> and <code>__exit__()</code>.
<pre class=pp><code>class RedirectStdoutTo:
<a>    def __init__(self, out_new):    <span class=u>①</span></a>
        self.out_new = out_new

<a>    def __enter__(self):            <span class=u>②</span></a>
        self.out_old = sys.stdout
        sys.stdout = self.out_new

<a>    def __exit__(self, *args):      <span class=u>③</span></a>
        sys.stdout = self.out_old</code></pre>
<ol>
<li>The <code>__init__()</code> method is called immediately after an instance is created. It takes one parameter, the stream object that you want to use as standard output for the life of the context. This method just saves the stream object in an instance variable so other methods can use it later.
<li>The <code>__enter__()</code> method is a <a href=iterators.html#a-fibonacci-iterator>special method</a>; Python calls it when entering a context (<i>i.e.</i> at the beginning of the <code>with</code> statement). This method saves the current value of <code>sys.stdout</code> in <var>self.out_old</var>, then redirects standard output by assigning <var>self.out_new</var> to <var>sys.stdout</var>.
<li>The <code>__exit__()</code> method is another special method; Python calls it when exiting the context (<i>i.e.</i> at the end of the <code>with</code> statement). This method restores standard output to its original value by assigning the saved <var>self.out_old</var> value to <var>sys.stdout</var>.
</ol>
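<p>One consequence worth seeing in action: Python calls <code>__exit__()</code> even when the code inside the <code>with</code> block raises an exception, so standard output gets restored either way. The sketch below repeats the class and redirects into an <code>io.StringIO</code> instead of a file; the <code>ValueError</code> and the variable names are made up for illustration.

```python
import io
import sys

class RedirectStdoutTo:
    def __init__(self, out_new):
        self.out_new = out_new

    def __enter__(self):
        self.out_old = sys.stdout
        sys.stdout = self.out_new

    def __exit__(self, *args):
        # Runs on normal exit and on exceptions alike. Returning None
        # (a false value) lets any exception keep propagating.
        sys.stdout = self.out_old

original = sys.stdout
captured = io.StringIO()

try:
    with RedirectStdoutTo(captured):
        print('B')                                # goes into `captured`
        raise ValueError('something went wrong')  # illustrative failure
except ValueError:
    pass

restored = sys.stdout is original   # True: __exit__() put it back
```

<p>The <code>*args</code> in <code>__exit__()</code> receive the exception type, value, and traceback (or three <code>None</code> values if nothing went wrong); this class simply ignores them.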
<p>Putting it all together:
<pre class=pp><code>
<a>print('A') <span class=u>①</span></a>
<a>with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file): <span class=u>②</span></a>
<a>    print('B')                                                                         <span class=u>③</span></a>
<a>print('C') <span class=u>④</span></a></code></pre>
<ol>
<li>This will print to the <abbr>IDE</abbr> “Interactive Window” (or the terminal, if running the script from the command line).
<li>This <a href=#with><code>with</code> statement</a> takes <em>a comma-separated list of contexts</em>. The comma-separated list acts like a series of nested <code>with</code> blocks. The first context listed is the “outer” block; the last one listed is the “inner” block. The first context opens a file; the second context redirects <code>sys.stdout</code> to the stream object that was created in the first context.
<li>Because this <code>print()</code> function is executed within the context created by the <code>with</code> statement, it will not print to the screen; it will write to the file <code>out.log</code>.
<li>The <code>with</code> code block is over. Python has told each context manager to do whatever it is they do upon exiting a context. The context managers form a last-in-first-out stack. Upon exiting, the second context changed <code>sys.stdout</code> back to its original value, then the first context closed the file named <code>out.log</code>. Since standard output has been restored to its original value, calling the <code>print()</code> function will once again print to the screen.
</ol>
<p>Redirecting standard error works exactly the same way, using <code>sys.stderr</code> instead of <code>sys.stdout</code>.
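<p>As a sketch, here is the same context manager adapted for standard error. The class name <code>RedirectStderrTo</code> is hypothetical; it is just <code>RedirectStdoutTo</code> with the names changed. Later versions of Python also ship ready-made equivalents in the <code>contextlib</code> module (<code>redirect_stdout()</code> since Python 3.4, <code>redirect_stderr()</code> since 3.5), shown at the end for comparison.

```python
import contextlib
import io
import sys

class RedirectStderrTo:
    """Same idea as RedirectStdoutTo, applied to sys.stderr.

    (Hypothetical name, for illustration only.)
    """
    def __init__(self, err_new):
        self.err_new = err_new

    def __enter__(self):
        self.err_old = sys.stderr
        sys.stderr = self.err_new

    def __exit__(self, *args):
        sys.stderr = self.err_old

errors = io.StringIO()
with RedirectStderrTo(errors):
    # sys.stderr is looked up when print() is called, so this
    # lands in the `errors` buffer rather than on the terminal.
    print('new black', file=sys.stderr)

# The standard-library helper does the same job for sys.stdout.
output = io.StringIO()
with contextlib.redirect_stdout(output):
    print('B')
```

<p>Writing the class by hand is still a good way to see what a context manager does; in everyday code on a recent Python, the <code>contextlib</code> helpers save you the trouble.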
<p class=a>⁂
<h2 id=furtherreading>Further Reading</h2>
<ul>
<li><a href=http://docs.python.org/py3k/tutorial/inputoutput.html#reading-and-writing-files>Reading and writing files</a> in the Python.org tutorial
<li><a href=http://docs.python.org/3.1/library/io.html><code>io</code> module</a>
<li><a href=http://docs.python.org/3.1/library/stdtypes.html#file-objects>Stream objects</a>
<li><a href=http://docs.python.org/3.1/library/stdtypes.html#context-manager-types>Context manager types</a>
<li><a href=http://docs.python.org/3.1/library/sys.html#sys.stdout><code>sys.stdout</code> and <code>sys.stderr</code></a>
<li><a href=http://en.wikipedia.org/wiki/Filesystem_in_Userspace><abbr>FUSE</abbr> on Wikipedia</a>
</ul>
<p class=v><a href=refactoring.html rel=prev title='back to “Refactoring”'><span class=u>☜</span></a> <a href=xml.html rel=next title='onward to “XML”'><span class=u>☞</span></a>
<p class=c>© 2001–11 <a href=about.html>Mark Pilgrim</a>
<script src=j/jquery.js></script>
<script src=j/prettify.js></script>
<script src=j/dip3.js></script>