Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT: add conversion functions for TES and WES to WRROC #5

Closed
wants to merge 7 commits into from
Closed

Conversation

Karanjot786
Copy link
Member

@Karanjot786 Karanjot786 commented Jun 15, 2024

Description

This PR adds the implementation of the convert_tes_to_wrroc and convert_wes_to_wrroc functions. These functions are responsible for converting TES (Task Execution Service) and WES (Workflow Execution Service) tasks and runs into WRROC (Workflow Run RO-Crate) format.

Changes Included

  • convert_tes_to_wrroc function: Converts TES tasks to WRROC format by mapping TES properties to WRROC properties.
  • convert_wes_to_wrroc function: Converts WES runs to WRROC format by mapping WES properties to WRROC properties.
  • Helper functions:
    • convert_to_iso8601: Converts timestamps from RFC 3339 format to ISO 8601 format.
    • convert_status: Maps TES and WES task statuses to appropriate CreateAction statuses in WRROC.

Summary by Sourcery

This pull request adds conversion functions for transforming TES tasks and WES runs into WRROC format, along with helper functions for timestamp conversion and status mapping.

  • New Features:
    • Introduced convert_tes_to_wrroc function to convert TES tasks to WRROC format.
    • Introduced convert_wes_to_wrroc function to convert WES runs to WRROC format.
  • Enhancements:
    • Added helper function convert_to_iso8601 to convert timestamps from RFC 3339 format to ISO 8601 format.
    • Added helper function convert_status to map TES and WES task statuses to appropriate CreateAction statuses in WRROC.

Copy link

sourcery-ai bot commented Jun 15, 2024

Reviewer's Guide by Sourcery

This pull request introduces two new functions, convert_tes_to_wrroc and convert_wes_to_wrroc, which convert TES (Task Execution Service) tasks and WES (Workflow Execution Service) runs into WRROC (Workflow Run RO-Crate) format. Additionally, helper functions convert_to_iso8601 and convert_status are implemented to assist with timestamp and status conversions, respectively.

File-Level Changes

Files Changes
src/converter.py Introduced functions for converting TES and WES data to WRROC format, along with necessary helper functions for timestamp and status conversions.

Tips
  • Trigger a new Sourcery review by commenting @sourcery-ai review on the pull request.
  • You can change your review settings at any time by accessing your dashboard:
    • Enable or disable the Sourcery-generated pull request summary or reviewer's guide;
    • Change the review language;
  • You can always contact us if you have any questions or feedback.

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Karanjot786 - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 7 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.

src/converter.py Outdated
"@id": tes_data["id"],
"name": tes_data.get("name", ""),
"description": tes_data.get("description", ""),
"instrument": tes_data["executors"][0]["image"],
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Potential IndexError if 'executors' list is empty

Consider adding a check to ensure 'executors' list is not empty before accessing the first element to avoid potential IndexError.

src/converter.py Outdated
"name": tes_data.get("name", ""),
"description": tes_data.get("description", ""),
"instrument": tes_data["executors"][0]["image"],
"object": [{"@id": input["url"], "name": input["path"]} for input in tes_data["inputs"]],
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Potential KeyError if 'url' or 'path' keys are missing in 'inputs'

Consider using the .get() method or adding a check to ensure 'url' and 'path' keys exist in each input dictionary to avoid potential KeyError.

src/converter.py Outdated
"description": tes_data.get("description", ""),
"instrument": tes_data["executors"][0]["image"],
"object": [{"@id": input["url"], "name": input["path"]} for input in tes_data["inputs"]],
"result": [{"@id": output["url"], "name": output["path"]} for output in tes_data["outputs"]],
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Potential KeyError if 'url' or 'path' keys are missing in 'outputs'

Consider using the .get() method or adding a check to ensure 'url' and 'path' keys exist in each output dictionary to avoid potential KeyError.

src/converter.py Outdated
"@id": wes_data["run_id"],
"name": wes_data["run_log"].get("name", ""),
"description": wes_data["run_log"].get("description", ""),
"object": [{"@id": input["location"], "name": input["name"]} for input in wes_data["request"]["workflow_params"]],
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Potential KeyError if 'location' or 'name' keys are missing in 'workflow_params'

Consider using the .get() method or adding a check to ensure 'location' and 'name' keys exist in each workflow_params dictionary to avoid potential KeyError.

src/converter.py Outdated
"name": wes_data["run_log"].get("name", ""),
"description": wes_data["run_log"].get("description", ""),
"object": [{"@id": input["location"], "name": input["name"]} for input in wes_data["request"]["workflow_params"]],
"result": [{"@id": output["location"], "name": output["name"]} for output in wes_data["run_log"]["outputs"]],
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Potential KeyError if 'location' or 'name' keys are missing in 'outputs'

Consider using the .get() method or adding a check to ensure 'location' and 'name' keys exist in each output dictionary to avoid potential KeyError.

src/converter.py Outdated

def convert_to_iso8601(timestamp):
# Assuming timestamp is in RFC 3339 format, convert to ISO 8601
return datetime.datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").isoformat() + "Z"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: Assumption about timestamp format

The function assumes the timestamp is always in RFC 3339 format. Consider adding error handling for cases where the timestamp format is different.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Karanjot786, please check this out, verify if the format is the WES and TES standard, if yes, add error handling and if not, add support for other formats, and error handling.

From the example I tested here, it doesn't seem to be the standard format

src/converter.py Outdated
"end_time": "2024-01-01T01:00:00Z"
}
}
wrroc = convert_wes_to_wrroc(wes_run_example)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Example usage in production code

Consider moving the example usage to a separate script or a test file to keep the production code clean.

Suggested change
wrroc = convert_wes_to_wrroc(wes_run_example)
# src/converter.py
if __name__ == "__main__":
wrroc = convert_wes_to_wrroc(wes_run_example)
print(wrroc)

@Karanjot786 Karanjot786 enabled auto-merge (rebase) June 15, 2024 15:35
src/converter.py Outdated

def convert_to_iso8601(timestamp):
# Assuming timestamp is in RFC 3339 format, convert to ISO 8601
return datetime.datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").isoformat() + "Z"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Karanjot786, please check this out, verify if the format is the WES and TES standard, if yes, add error handling and if not, add support for other formats, and error handling.

From the example I tested here, it doesn't seem to be the standard format

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to add some basic error handling to the conversion, and it seems you're checking for the wes output in the wrong object, it's not supposed to be in the run_logs.

In the mapping document, createAction.instrument is supposed to be a 1 to 1 match for the executors.image field but it only takes one identifier and the executor image can have multiple inputs. @uniqueg, was this done on purpose?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I understand what you mean. There is only one image per executor in TES, but there may be multiple executors. We need to consider this in the mapping.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@uniqueg, yeah while there can be multiple executors per task, it seems the createAction.instrument field only takes one identifier as explained here. So not sure how they would be a 1 to 1 match between the executors.image (since there can be multiple of these in one task) and the createAction.instruments field. I might be wrong though, just wanted to see if there might be some context that I am missing.

@Karanjot786
Copy link
Member Author

Hi @SalihuDickson ,

Thank you for the feedback. I'll verify the WES and TES formats against their standards and update the conversion functions accordingly. I'll also ensure to add error handling and support for other formats if needed. Regarding the WES output, I'll correct the object reference to avoid looking in the wrong place. Additionally, I'll review the mapping for CreateAction.instrument to ensure it correctly handles multiple executor images.

I'll make the necessary adjustments and resubmit the PR for review.

@Karanjot786
Copy link
Member Author

Hi @SalihuDickson ,

I have made the following updates based on your feedback:

  1. Verified and Corrected Format Handling:

    • Ensured that the WES and TES data formats are correctly handled according to their respective standards.
    • Added checks to ensure the executors list in TES data is not empty before accessing its first element.
  2. Added Error Handling:

    • Implemented error handling for missing keys, invalid formats, and other potential exceptions in both convert_tes_to_wrroc and convert_wes_to_wrroc functions.
    • Used the .get() method to safely access keys in inputs and outputs for both TES and WES data, preventing potential KeyError issues.
  3. Corrected Mapping for CreateAction.instrument:

    • Ensured that the CreateAction.instrument field correctly handles multiple executor images as needed.
  4. Enhanced Timestamp Conversion:

    • Added error handling in the convert_to_iso8601 function to manage different timestamp formats, assuming RFC 3339 format and converting to ISO 8601.

Please review the changes and let me know if there are any further adjustments required.

@SalihuDickson
Copy link

@Karanjot786, thank you for getting to this so quickly. However these corrections may be making this PR unnecessarily complex. So I think you should revert the most recent commit, so we can merge and work on these changes on the next one. I think you should also separate the WES and TES conversions into different PRs so we can add a little more specificity.

What I want you to consider for the next PR:

First I think we have 3 types of data when working with error handling, when you're dealing with user defined data,

  1. data that must be present and must be of a certain type (or range of types) if not, an error should be thrown
  2. data is optional but should be of a certain type (or range of types) if present or an error should be thrown
  3. And data that is recommended but not mandatory (could have a default value), this should show a warning

I think you need to figure out which applies to each field and write your error handling accordingly, so that we can avoid cryptic errors and gracefully handle these simple error scenarios. Something you can also do is collect all the errors in a list and then throw them all at once to ensure the user sees everything that is wrong with their data without having to rerun the program over and over.

We will also need to consider some best practices for error handling like having one function that all errors are passed into, to ensure uniform error handling and to constants for similar errors. This is quite a lot to consider, please feel free to ask me about anything you may be confused about : )

@SalihuDickson
Copy link

And you are currently looking for the output field in the run_logs but according to the WES specification, the output field isn't supposed to be in the run_logs. Please look at it here

@Karanjot786
Copy link
Member Author

Hi @SalihuDickson ,

Considering the feedback and to simplify the review process, may I cancel the current PR and create two separate PRs? One for convert_tes_to_wrroc and another for convert_wes_to_wrroc, each in different files. This approach will allow us to address any issues with more specificity and clarity.

Please let me know if this works for you.

@SalihuDickson
Copy link

Yes @Karanjot786, that is fine.

Copy link
Member

@uniqueg uniqueg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @Karanjot786, but I think we are getting ahead of ourselves here. There are a couple of issues that I think need to be addressed before coding:

  1. Make sure to properly set up the repo first, including packaging, dependency management, linting, static code/type checks, (API) documentation and a CI pipeline before adding any code. Have a look at how @JaeAeich and @Rahuljagwani are doing this.
  2. For the actual code, please prepare an issue to discuss the design of the software. We want to write well organized code that can be extended and easily read, understood and maintained - not just a single script with a bunch of functions. So, please open an issue and think of an OOP-based architecture to organize your code (again, please have a look at what others are doing for inspiration) - then describe it in that issue. Think about abstractions. Think how you want to represent TES and WES request payloads and responses, validate them etc. Same thing for ROC.
  3. Once we have the code design, break down the work into small work packages - perhaps you want to start with your abstraction layer, then do the TES to WRROC, then the WRROC to TES, then the WES to WRROC, then the WRROC to WES etc.
  4. Make sure that each work package (1 work package = 1 PR, roughly) does not become too big. But make sure that all code you add is tested (provide unit tests covering all your code! - think about integration tests as well) and documented.

You can leave this PR open as a draft PR to copy some code over from later on, perhaps, but I don't think this will be merged.

@@ -0,0 +1,173 @@
import datetime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow the contributor guidelines carefully. For Python these include:

  • Include Google style docstrings on all levels (package, module, class, method and function)
  • Provide type hints for all function/method signatures (arguments and return value) and global vars, and ideally for local vars as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I understand what you mean. There is only one image per executor in TES, but there may be multiple executors. We need to consider this in the mapping.

import json

def convert_tes_to_wrroc(tes_data):
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should wrap only the code that you expect to raise particular errors in try/except - not just a bunch of commands and catch some generic errors. There is no added benefit of catching, say, an indexing error and then just say that there was an indexing error - the regular error and traceback will have all of that info already - and more.

Comment on lines +6 to +10
if not tes_data or not isinstance(tes_data, dict):
raise ValueError("Invalid TES data provided")

if not tes_data.get("executors"):
raise ValueError("TES data missing 'executors' list or 'executors' list is empty")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is insufficient to validate the TES response data. It's not like any dict with a executors key would be a valid TES response. You need to define models for TES and WES, e.g., with Pydantic. Please discuss this with @JaeAeich - he is already writing models for TES (you could reuse them and add the models for WES).

Comment on lines +12 to +37
wrroc = {
"@context": "https://w3id.org/ro/crate/1.1/context",
"@graph": [
{
"@type": "Dataset",
"conformsTo": "https://w3id.org/ro/wfrun/process/0.4"
},
{
"@type": "CreateAction",
"@id": tes_data.get("id"),
"name": tes_data.get("name", ""),
"description": tes_data.get("description", ""),
"instrument": tes_data["executors"][0].get("image"),
"object": [{"@id": input.get("url"), "name": input.get("path")} for input in tes_data.get("inputs", []) if input.get("url") and input.get("path")],
"result": [{"@id": output.get("url"), "name": output.get("path")} for output in tes_data.get("outputs", []) if output.get("url") and output.get("path")],
"startTime": convert_to_iso8601(tes_data["logs"][0].get("start_time")),
"endTime": convert_to_iso8601(tes_data["logs"][0].get("end_time")),
"actionStatus": convert_status(tes_data.get("state")),
},
{
"@type": "SoftwareApplication",
"@id": tes_data["executors"][0].get("image"),
"name": tes_data.get("name", "")
}
]
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See if you can model that with Pydantic. The members whose names start with @ might give you problems, but probably you could get around it with the use of aliases (https://docs.pydantic.dev/latest/api/fields/).

Comment on lines +40 to +45
except KeyError as e:
raise ValueError(f"Missing expected key: {e}")
except IndexError as e:
raise ValueError(f"Missing expected index: {e}")
except Exception as e:
raise ValueError(f"An error occurred during TES to WRROC conversion: {e}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of these are helpful... Think of a proper error handling strategy. Look at other code, read some articles etc.

@JaeAeich
Copy link

JaeAeich commented Jun 20, 2024

  1. Make sure to properly set up the repo first, including packaging, dependency management, linting, static code/type checks, (API) documentation and a CI pipeline before adding any code.

Hey please consider linting and checks for non code files, since this this is a new proj we wouldn't end up with the prob I have with implementing it with TESK (ie huge changes). Consider looking at this PR, you don't necessarily need to implement pre-commit hook as that can hinder with DX, but the .pre-commit-config.yaml in that PR is has pretty good set of rules which you can put in CI.

@Karanjot786 Karanjot786 closed this by deleting the head repository Jun 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants