When using different dumpers with yaml.dump, disk flushing behavior is inconsistent. #831

mark007 · 2024-09-10T12:38:15Z

I've noticed that when a snippet of code writes an especially small piece of data to disk using dump or dump_all and then reads that file immediately afterwards, the contents are sometimes not there yet unless I call file.flush() first.

I can reproduce this issue only when specifying a Dumper, for example yaml.CSafeDumper. If I don't specify a Dumper, the contents of the file are always there immediately after the dump call completes.

For example, this snippet results in the file contents showing as empty when the file is read back:

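import yaml

output_file = "output.yaml"
with open(output_file, "w") as file:
    data = [{"x": 1}]  # dump_all expects an iterable of documents
    yaml.dump_all(data, stream=file, Dumper=yaml.CSafeDumper)
    # Read back through a second handle while the writer is still open.
    with open(file.name, "r") as temp_file_read:
        contents = temp_file_read.read()
        print(f"File Contents:\n{contents}")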

This snippet of code, where the Dumper is not given, results in the contents of the file being there as expected.

import yaml

output_file = "output.yaml"
with open(output_file, "w") as file:
    data = [{"x": 1}]  # dump_all expects an iterable of documents
    yaml.dump_all(data, stream=file)  # default (pure-Python) Dumper
    with open(file.name, "r") as temp_file_read:
        contents = temp_file_read.read()
        print(f"File Contents:\n{contents}")

With this snippet, where I use both the Dumper and a file.flush(), the data is also there as expected.

import yaml

output_file = "output.yaml"
with open(output_file, "w") as file:
    data = [{"x": 1}]  # dump_all expects an iterable of documents
    yaml.dump_all(data, stream=file, Dumper=yaml.CSafeDumper)
    file.flush()  # explicit flush before reading back
    with open(file.name, "r") as temp_file_read:
        contents = temp_file_read.read()
        print(f"File Contents:\n{contents}")

If the file read happens after the with block (context manager) exits, it's fine; the file must get flushed in that case. However, there are many cases, especially when dealing with tempfiles, where we want to write data to disk and then read it back within the same context manager, so consistent flushing behavior across pyyaml dump calls is important.

I can't find this documented anywhere. Is it expected, or is it a bug that can be resolved?
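For context, the underlying effect does not depend on PyYAML: a small write to a file opened in buffered text mode stays in Python's in-process buffer until the stream is flushed or closed, so a second handle on the same path sees nothing yet. A minimal stdlib-only sketch of that behavior follows; the helper names `read_back` and `demo_buffered_write` and the temporary file are illustrative and not part of the report.

```python
import os
import tempfile


def read_back(path):
    """Read the file through a second, independent handle."""
    with open(path, "r") as reader:
        return reader.read()


def demo_buffered_write():
    """Return what a second handle sees before and after flush()."""
    fd, path = tempfile.mkstemp(suffix=".yaml")
    os.close(fd)
    try:
        with open(path, "w") as writer:
            writer.write("x: 1\n")    # small write: held in the buffer
            before = read_back(path)
            writer.flush()            # push buffered data to the OS
            after = read_back(path)
        return before, after
    finally:
        os.remove(path)


if __name__ == "__main__":
    before, after = demo_buffered_write()
    print(f"before flush: {before!r}")
    print(f"after flush:  {after!r}")
```

Whether a dump call appears to "flush" therefore comes down to whether the emitter calls flush() on the stream it was given.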


nitzmahone · 2024-09-10T23:49:23Z

Yeah, I'm noting a distinct lack of flush() in the CEmitter output handler or anywhere else in CEmitter, where the pure-Python Emitter flushes after writing the stream end.

Digging around in libyaml's flush code, it looks like it's managing its own layer of internal buffering in its "flush" impl, but in neither place do I see any explicit flushing of the underlying stream handle.

Assuming that's accurate, it might be reasonable to add an explicit flush to at least StreamEndEvent for the buffered + streamed cases in the Cython wrapper, but that code is currently blissfully unaware of the buffering (since libyaml's driving the stream interactions through its write_handler callback). Adding an unconditional flush in the write handler would be easy, but waaaay overkill.
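One way to check the observation about flushing against an installed PyYAML is to hand each dumper a stream that counts flush() calls. This is only a sketch: `FlushCountingStream` and `count_flushes` are hypothetical names, and the count reported for the C dumper may differ between PyYAML versions.

```python
import io

import yaml


class FlushCountingStream(io.StringIO):
    """In-memory text stream that records how often flush() is called."""

    def __init__(self):
        super().__init__()
        self.flush_calls = 0

    def flush(self):
        self.flush_calls += 1
        super().flush()


def count_flushes(dumper_cls):
    """Dump one small document and return (flush count, emitted text)."""
    stream = FlushCountingStream()
    yaml.dump({"x": 1}, stream=stream, Dumper=dumper_cls)
    return stream.flush_calls, stream.getvalue()


if __name__ == "__main__":
    print("SafeDumper:", count_flushes(yaml.SafeDumper))
    # CSafeDumper exists only when PyYAML was built with libyaml.
    if hasattr(yaml, "CSafeDumper"):
        print("CSafeDumper:", count_flushes(yaml.CSafeDumper))
```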

I'll mark this as something to look into for the next release, but in the meantime, if you really need unbuffered IO behavior, I'd suggest requesting an unbuffered binary stream when you call open().
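A minimal sketch of that workaround, assuming a small document and a throwaway temp file. Python only allows buffering=0 in binary mode, so the stream receives bytes and an explicit encoding is passed to the dump call. The helper name `dump_unbuffered` is hypothetical, and the fallback to yaml.SafeDumper is only there for installs built without libyaml.

```python
import os
import tempfile

import yaml

# CSafeDumper is only available when PyYAML was built against libyaml.
Dumper = getattr(yaml, "CSafeDumper", yaml.SafeDumper)


def dump_unbuffered(data, path):
    """Dump to an unbuffered binary stream, then read back while it is open."""
    with open(path, "wb", buffering=0) as raw:
        yaml.dump(data, stream=raw, Dumper=Dumper, encoding="utf-8")
        # No flush(): each write on an unbuffered stream goes straight to the OS.
        with open(path, "rb") as reader:
            return reader.read().decode("utf-8")


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".yaml")
    os.close(fd)
    try:
        print(dump_unbuffered({"x": 1}, path))
    finally:
        os.remove(path)
```

The trade-off is that every write the emitter issues becomes its own system call, so for larger documents an explicit file.flush() after the dump is likely the cheaper option.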
