Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add __getitem__ and __getattr__ 🧙magic🧙‍♀️ methods for Job #450

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

Andrew-S-Rosen
Copy link
Member

@Andrew-S-Rosen Andrew-S-Rosen commented Oct 10, 2023

Summary

Following in @janosh's footsteps, I propose a __getitem__ magic method for Job, such that if you index a Job object by a key (as if it were a dictionary) or an index (as if it were a list), the .output is implicitly called for the user and the OutputReference is returned.

Motivation

Consider these two toy functions:

@job
def make_str(s):
    return {"hello": s}

@job
def capitalize(s):
    return s.upper()

If you want to make a Flow of them, you have to currently do:

from jobflow import Flow

job1 = make_str("world")
job2 = capitalize(job1.output["hello"])
flow = Flow([job1, job2])

However, new users to Jobflow often stumble on this in my experience because they expect to be able to do:

from jobflow import Flow

job1 = make_str("world")
job2 = capitalize(job1["hello"])
flow = Flow([job1, job2])

They can now do that without any issues 😄

@codecov
Copy link

codecov bot commented Oct 10, 2023

Codecov Report

Merging #450 (e29fb7d) into main (cdbcbd0) will not change coverage.
The diff coverage is 100.00%.

❗ Current head e29fb7d differs from pull request most recent head 57402ee. Consider uploading reports for the commit 57402ee to get more accurate results

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #450   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           20        20           
  Lines         1521      1523    +2     
  Branches       419       419           
=========================================
+ Hits          1521      1523    +2     
Files Coverage Δ
src/jobflow/core/job.py 100.00% <100.00%> (ø)

@Andrew-S-Rosen Andrew-S-Rosen changed the title Add __getitem__ magic method for Job Add __getitem__ 🧙magic🧙 method for Job Oct 10, 2023
@Andrew-S-Rosen Andrew-S-Rosen changed the title Add __getitem__ 🧙magic🧙 method for Job Add __getitem__ 🧙magic🧙‍♀️ method for Job Oct 10, 2023
@utf
Copy link
Member

utf commented Oct 10, 2023

This is a very interesting idea, thanks! I have a couple of comments:

Incompatibility with output attributes

One potential confusion is that you aren't able to do something like:

job2 = capitalize(job1.hello)

We could add that functionality in, but it then raises an issue with job1.output e.g., if your output document has an attribute called output this will return the OutputReference and not OutputReference.output.

A potential fix would be to rename all of the Job attributes and functions to attribute_, E.g., job.output_, job.graph_, job.update_maker_kwargs_ etc. This would be a breaking change but not too onerous to fix.

Incompatibility with jobs without a dictionary output

Another issue is with jobs that output a python primitive directly. E.g., your capitalize function. Consider these jobs.

@job
def capitalize(s):
    return s.upper()

@job
def decapitalize(s):
    return s.lower()

This change might think users than can do:

job1 = capitalize("hello")
job2 = decapitalize(job1)
flow = Flow([job1, job2])

But that will fail since jobs can only accept OutputReferences (or other python objects) as inputs.

To enable this behaviour we could make it so that if a Job is given as an input, then we use the whole OutputReference by default.

I think supporting both attributes and whole input jobs would actually be a really nice feature and make jobflow easier to use.

Flows

Finally, just a note that we could also do something similar with Flows. E.g., this would enable you to write:

@job
def add(a, b):
    return a + b

add_first = add(1, 5)
add_second = add(add_first, 5)

flow = Flow([add_first, add_second], output=add_second)

add_third = add(flow, 5)

Together with #416 we would have a very nice API, e.g.

@job
def add(a, b):
    return a + b

@flow
def multi_add(a, b):
    add_first = add(1, 5)
    return add(add_first, 5)

result = add(multi_add(1, 5), 5)

@Andrew-S-Rosen
Copy link
Member Author

@utf. Thanks for expanding on my initial idea to transform this into a great discussion about the API as a whole. I think these are all fantastic. In particular

"To enable this behaviour we could make it so that if a Job is given as an input, then we use the whole OutputReference by default."

This was a feature I was actually hoping already existed a few weeks back --- I tried it out, and of course it wasn't the case. But I think this would be elegant because I can't really see many use cases for people passing Job objects between jobs and wanting the behavior as it stands right now. As an aside that's not too crucial for this discussion, it would also help with interconversion between several workflow tools since that kind of pattern is pretty common.

On a practical note, I am happy to get help on this from anyone reading this message! I probably can't tackle your suggestions imminently, but it's important enough in my opinion that it will be high up on my to-do list.

@janosh
Copy link
Member

janosh commented Oct 10, 2023

@Andrew-S-Rosen Thanks for tagging me. These sound like great UX improvements!

@Andrew-S-Rosen
Copy link
Member Author

FWIW: Going to try to tackle more of @utf's ideas next week!

@Andrew-S-Rosen Andrew-S-Rosen changed the title Add __getitem__ 🧙magic🧙‍♀️ method for Job Add __getitem__ and __getattr__ 🧙magic🧙‍♀️ method for Job Oct 12, 2023
@Andrew-S-Rosen Andrew-S-Rosen changed the title Add __getitem__ and __getattr__ 🧙magic🧙‍♀️ method for Job Add __getitem__ and __getattr__ 🧙magic🧙‍♀️ method2 for Job Oct 12, 2023
@Andrew-S-Rosen Andrew-S-Rosen changed the title Add __getitem__ and __getattr__ 🧙magic🧙‍♀️ method2 for Job Add __getitem__ and __getattr__ 🧙magic🧙‍♀️ methods for Job Oct 12, 2023
@Andrew-S-Rosen
Copy link
Member Author

Andrew-S-Rosen commented Oct 12, 2023

@utf: Alright, we are getting there! First, a summary and then a question for you.

Summary

I have now added a __getattr__ method, in addition to __getitem__, that allows for (implicit) parsing of attributes from the Job output. Tests have been added.

I have also made it so that if a Job is passed to another Job, the .output is called. This is a breaking change because before it would raise a UserWarning, but I think it is more logical. The UserWarning tests have been removed, and a test for the job-in-job behavior was added.

Questions

Question 1

I have the following test:

from jobflow import Flow, job, run_locally
from dataclasses import dataclass

@job
def make_str(s):

    @dataclass
    class MyClass:
        hello: str = s

    return MyClass

@job
def capitalize(s):
    return s.upper()

job1 = make_str("world")
job2 = capitalize(job1.hello)

flow = Flow([job1, job2])

responses = run_locally(flow, ensure_success=True)
assert responses[job2.uuid][1].output == "WORLD"

The traceback is as follows:

Traceback (most recent call last):
  File "C:\Users\asros\github\jobflow\src\jobflow\managers\local.py", line 101, in _run_job
    response = job.run(store=store)
  File "C:\Users\asros\github\jobflow\src\jobflow\core\job.py", line 605, in run
    # if Job was created using the job decorator, then access the original function
  File "C:\Users\asros\github\jobflow\src\jobflow\core\job.py", line 711, in resolve_args
    store,
  File "C:\Users\asros\github\jobflow\src\jobflow\core\reference.py", line 451, in find_and_resolve_references
    resolved_references = resolve_references(
  File "C:\Users\asros\github\jobflow\src\jobflow\core\reference.py", line 346, in resolve_references
    resolved_references[ref] = ref.resolve(
  File "C:\Users\asros\github\jobflow\src\jobflow\core\reference.py", line 178, in resolve
    data = data[attr] if attr_type == "i" else getattr(data, attr)
AttributeError: 'dict' object has no attribute 'hello'

2023-10-12 14:45:24,767 INFO Finished executing jobs locally

To my untrained eye, it seems that the output is being serialized into a dictionary format automatically (MSONable-related?) rather than staying a class. Do you have any suggestions?

Question 2

There is a new E RecursionError: maximum recursion depth exceeded in comparison introduced in a few tests now. I'm tackling one issue at a time, but if you know the answer on how to fix it, do let me know!

Comment

You mentioned earlier that this could result in clashes with "first-class" Job attributes, like output. While true, I just want to note the __getattr__ will only be activated if the attribute doesn't exist in the first place. That means if the user called .output, it will return the OutputReference (even if the returned class has an attribute .output). We can either rename the native attributes like you suggested or be clear in the documentation somewhere about what is not allowed.
 

Todo

  • Apply the same logic to Flow (perhaps in a new PR?).

@utf
Copy link
Member

utf commented Oct 13, 2023

Awesome, thanks @Andrew-S-Rosen.

For question 1, I think this is because the class is defined in the scope of make_str, so it won't be accessible outside of that function. This means monty decoder can't deserialise it as it can't find the definition to import. If adjust your test to the follow, it should work:

from jobflow import Flow, job, run_locally
from dataclasses import dataclass


@dataclass
class MyClass:
    hello: str 

@job
def make_str(s):
    return MyClass(hello=s)

@job
def capitalize(s):
    return s.upper()

job1 = make_str("world")
job2 = capitalize(job1.hello)

flow = Flow([job1, job2])

responses = run_locally(flow, ensure_success=True)
assert responses[job2.uuid][1].output == "WORLD"

@davidwaroquiers
Copy link
Contributor

davidwaroquiers commented Oct 13, 2023

Hello everyone,

I think this is definitely a very nice addition! I have one question. This is related to the "real" attributes of the Job object, also raised by @utf with a potential solution with the renaming to _ or something, which hopefully does not clash... (someone could still define an output with e.g. output_ as a key, even it would be rather weird). I am just wondering if anyone thought of subclassing the Job (and if it makes sense ? maybe it wouldn't even work ?). If that is the case, also the attributes of these subclasses should be appended with _ (or any other modification to avoid clash with the __getattr__).

Anyway, very nice thinking to improve user friendliness!

@Andrew-S-Rosen
Copy link
Member Author

@utf: Thank you! That fixes it and makes a lot of sense.

@janosh: It seems this change caused some of your magic method tests to fail with a recursion error, such as flow = flow + job. Do you happen to know why this might be?

@janosh
Copy link
Member

janosh commented Oct 13, 2023

Looks like the issue here is __getattr__ calls getattr on self.output which itself calls __getattr__ if the attribute doesn't exist in self.output, leading to a loop. Safer to raise AttributeError if name can't be found in self.output:

def __getattr__(self, name: str) -> OutputReference:
    if attr := getattr(self.output, name, None):
        return attr
    raise AttributeError(f"{type(self).__name__} has no attribute {name!r}")

@davidwaroquiers
Copy link
Contributor

davidwaroquiers commented Oct 13, 2023

Flows

Finally, just a note that we could also do something similar with Flows. E.g., this would enable you to write:

@job
def add(a, b):
    return a + b

add_first = add(1, 5)
add_second = add(add_first, 5)

flow = Flow([add_first, add_second], output=add_second)

add_third = add(flow, 5)

Together with #416 we would have a very nice API, e.g.

@job
def add(a, b):
    return a + b

@flow
def multi_add(a, b):
    add_first = add(1, 5)
    return add(add_first, 5)

result = add(multi_add(1, 5), 5)

I was gonna start writing something about this and then I read your comment @utf . Indeed, it would really be nice to be able to do that as well, otherwise we are stuck with a difference of behavior between jobs and flows, i.e. you can use job directly without the .output to access a job output but for flows is different. And this difference is I believe worse than the one with @job and no @flow because its on the user, while for jobs having @job but flows not having @flow, at least it "only" concerns developers (or expert users).

I will try to get back a bit on the @flow decorator proof of concept implementation I did to see if I manage to get it to work properly when it is inside jobflow.

@Andrew-S-Rosen
Copy link
Member Author

@janosh: Thanks! For your example though, we don't know if name is in the output until the OutputReference is resolved at runtime, right? That seems like that would cause a problem with your proposed change. But I'll look into this.

@davidwaroquiers: Yes, I agree. I view the Flow analogue as being a requirement. Thanks for taking a look at @flow again! I'm really excited about this prospect.

@Andrew-S-Rosen
Copy link
Member Author

Andrew-S-Rosen commented Nov 2, 2023

In full disclosure, I am going to need to put this PR on hold. Anyone willing to can feel free to expand on it with a follow-up PR if they wish!

If we start seeing some movement with the @flow decorator though, I'd be substantially inspired to revisit this ASAP 😉

@davidwaroquiers davidwaroquiers mentioned this pull request Feb 19, 2024
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants