
Feature/file reporter #78

Draft
wants to merge 4 commits into main
Conversation

@Hrsj123 Hrsj123 commented Feb 12, 2024

Closes #74

Here, I have written a generic FileReporter according to the recommendations you mentioned in PR #74.
To specialize reading from and writing to files in any particular format, we can use the Parser class to define read and write methods.
As samples, I have defined JSONParser and YAMLFileParser in file.py for handling read and write operations on JSON and YAML files.
Since the FileReporter read and write functions use the Parser class, the reporter automatically selects the appropriate read and write methods based on the extension of the file we are writing to.

Registering a Parser

class MyCustomParser(Parser):
    def parse_file(self, records):
        # Implement your custom parsing logic here
        ...

    def write_records(self, records):
        # Implement your custom file writing logic here
        ...


# Register your custom parser with a distinct file type
MyCustomParser.register("my_custom_format")

Once a parser is registered for a particular extension, it is used implicitly based on the extension of the file_name we want to write to.
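For context, the implicit selection described above could be backed by a small extension-to-class registry on the base class. A minimal sketch, assuming a `_registry` attribute and a `for_file` lookup helper (both illustrative, not the PR's actual code):

```python
class Parser:
    # illustrative registry: file extension -> parser class
    _registry: dict[str, type] = {}

    @classmethod
    def register(cls, file_type: str) -> None:
        # associate this parser subclass with a file extension
        Parser._registry[file_type] = cls

    @classmethod
    def for_file(cls, file_name: str) -> "Parser":
        # select a parser instance based on the file's extension
        ext = file_name.rsplit(".", 1)[-1]
        return Parser._registry[ext]()


class MyCustomParser(Parser):
    def parse_file(self, records):
        ...

    def write_records(self, records):
        ...


MyCustomParser.register("my_custom_format")
```

With this, a call like `Parser.for_file("results.my_custom_format")` would yield a `MyCustomParser` instance without the caller naming the class.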

Do let me know your views and recommendations on these changes.

@Hrsj123 Hrsj123 marked this pull request as draft February 12, 2024 15:06
Collaborator

@nicholasjng nicholasjng left a comment

Thanks - this looks promising, but we'll have to tweak it slightly to make it as effective as possible.

If anything is unclear, ping me here, or, if you prefer, I can push the changes I have in mind to your branch.

from nnbench.types import BenchmarkRecord


class Parser:
Collaborator

Just for clarification: we're not actually parsing anything when we read files; that is the job of the respective modules (json|yaml|toml).

Comment on lines 117 to 118
# Register custom parsers here
parsers = {"json": JsonParser, "yaml": YamlParser}
Collaborator
I like the idea of making it a module-wide default, but we should take care that

a) the variable is private (i.e. has a leading underscore) to prevent accidental export, and
b) we should make the value structure of the map as easy as possible.

To address b), I would start by making it a tuple[ser, de], where ser is a Callable[[IO, dict[str, Any]], None], i.e. a function taking a file descriptor in write mode and a record and writing the record to the file, and de is a Callable[[IO], dict[str, Any]], a function taking a file descriptor in read mode and returning the loaded record.

You can then register the SerDe factories based on whether the necessary packages are installed (json is available out of the box, yaml and toml are not).

Author

Should I use a class-based approach to register the SerDe factories (like the one I used in the Parser class)?
Another option is a simple register method that takes three arguments: the ser function, the de function, and a file_type.
I also think the class-based approach is more concise for defining the ser and de methods on the user side.

Collaborator

I'm happy with either, though the optional import part (i.e. erroring if toml/yaml is not installed) will be a bit easier in the class case.

For now, though, I think the quickest way is a functional approach. Maybe like this:

from typing import IO, Any

from nnbench.types import BenchmarkRecord

_file_loaders: dict[str, tuple[Any, Any]] = {}


def yaml_load(fp: IO, options=None) -> BenchmarkRecord:
    try:
        import yaml
    except ImportError:
        raise ModuleNotFoundError("`pyyaml` is not installed")

    # takes no options, but the slot is useful for passing options to file loaders.
    obj = yaml.safe_load(fp)
    return BenchmarkRecord(context=obj["context"], benchmarks=obj["benchmarks"])


def yaml_save(record: BenchmarkRecord, fp: IO, options=None) -> None:
    try:
        import yaml
    except ImportError:
        raise ModuleNotFoundError("`pyyaml` is not installed")

    yaml.safe_dump(record, fp, **(options or {}))


_file_loaders["yaml"] = (yaml_save, yaml_load)

With the option of defining e.g. a register_file_io(ser, de) later to do the dict insertion, if we want.

src/nnbench/reporter/file.py: two outdated review threads (resolved)
Comment on lines 175 to 185
        if not self.dir:
            raise BaseException("Directory is not initialized")
        file_path = os.path.join(self.dir, file_name)
        file_type = file_name.split(".")[1]
        try:
            with open(file_path) as file:
                data = file.read()
                parsed_data = parse_records(data, file_type)
                return parsed_data
        except FileNotFoundError:
            raise ValueError(f"Could not read the file: {file_path}")
Collaborator

This needs the restructured file loading dict I talked about earlier, but in essence, all that should happen here is the file being opened with open() (like you did), and then the deserializer being called on the opened file.

In particular, no error handling around open() should be necessary, since its own errors are informative enough for the user.
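In that spirit, the read path could shrink to something like the sketch below. The `_file_loaders` registry and the `FileReporter` surroundings are assumptions here; the json pair stands in for whatever is registered:

```python
import json
import os

# assumed module-level registry of (ser, de) pairs; json is stdlib
_file_loaders = {"json": (json.dump, json.load)}


class FileReporter:
    def __init__(self, dir: str) -> None:
        self.dir = dir

    def read(self, file_name: str):
        # look up the deserializer by file extension and call it on the
        # opened file; open() raises informative errors on its own, so no
        # extra error handling is needed
        _, de = _file_loaders[file_name.rsplit(".", 1)[-1]]
        with open(os.path.join(self.dir, file_name)) as fp:
            return de(fp)
```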

Comment on lines 188 to 202
        if not self.dir:
            raise BaseException("Directory is not initialized")

        file_path = os.path.join(self.dir, file_name)
        if not os.path.exists(file_path):  # Create the file
            with open(file_path, "w") as file:
                file.write("")
        try:
            parsed_records = self.read(file_name)
            file_type = file_name.split(".")[1]
            new_records = append_record_to_records(parsed_records, record, file_type)
            with open(file_path, "w") as file:
                file.write(new_records)
        except FileNotFoundError:
            raise ValueError(f"Could not read the file: {file_path}")
Collaborator

Same here, just load the serializer from the dict and call it on the opened file.
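Following that suggestion, the write path could reduce to the sketch below. As before, the registry and class surroundings are assumptions, with json standing in for any registered format:

```python
import json
import os

# assumed registry; json.dump already has the (record, fp) shape a ser needs
_file_loaders = {"json": (json.dump, json.load)}


class FileReporter:
    def __init__(self, dir: str) -> None:
        self.dir = dir

    def write(self, record: dict, file_name: str) -> None:
        # look up the serializer by file extension and call it on the opened file
        ser, _ = _file_loaders[file_name.rsplit(".", 1)[-1]]
        with open(os.path.join(self.dir, file_name), "w") as fp:
            ser(record, fp)
```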

Comment on lines 204 to 205
    def finalize(self) -> None:
        del self.dir
Collaborator

This does not do what you think it does - it just removes the attribute reference, staging the value of self.dir for garbage collection; the directory itself stays on disk.
To safely remove the directory (which you might not want to do anyway - we could add a flag in the constructor for that?), you should call shutil.rmtree(self.dir, ignore_errors=True). (Though you might want to check existence first and set ignore_errors to False.)

…ded changes to read and write methods of base `BenchmarkReporter`.
Author

Hrsj123 commented Feb 13, 2024

I've changed the value structure of the _file_loaders registry and made the changes you recommended. Do review them and let me know if anything else can be improved.

@nicholasjng
Collaborator

Thanks!

To get you up to speed on the current situation: we (I) have found that a file reporter is necessary as a building block for the DuckDB reporter implementation that I started in #75. For that reason, I implemented a file reporter myself in that PR, which we will go ahead with ASAP.

That is not meant as a knock on you - you did a great job here, and a lot of the logic you brought in also made it into #75; we just had to speed things up a bit to consolidate the interface. If you are interested, I'd welcome your help improving on it (especially the file driver registration/deregistration hooks you showed here come to mind)!

I'm going to open some tickets related to improvement of that file IO solution, which I can assign to you if you want, just let me know by commenting on the respective issue.
