Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Included ingstion page works but throws errors #2

Open
canon-cmre-kym-eden opened this issue Jul 16, 2024 · 5 comments
Open

Included ingstion page works but throws errors #2

canon-cmre-kym-eden opened this issue Jul 16, 2024 · 5 comments

Comments

@canon-cmre-kym-eden
Copy link

canon-cmre-kym-eden commented Jul 16, 2024

Preamble

While trying to figure out implementing my own ingesters I came across some issues, and updated with fixes.

Initial problem

result = tk.get_action("ingest_import_records")({}, data)
for id_ in result:

The result returned by ingest_import_records({}, data) is {'fail': 0, 'success': 1} not a list of ids, consequently the unpack on line 44 returns the keys fail and success and the call to package_show({}, {"id": "fail"}) crashes out with "package not found".

Looking back at ae63b57, the function that became ingest_import_records, did return a list of ids in the past.

Edit

The success/fail count is recorded in

class StatArtifacts(Artifacts):
but the function calls are passed data (with ids) which is discarded. But switching
artifacts = make_artifacts(data_dict["report"])
to "details" means the rest could be unpacked in the view as expected.

@canon-cmre-kym-eden
Copy link
Author

canon-cmre-kym-eden commented Jul 16, 2024

To solve the default behaviour
replace

for id_ in result:
pkg = tk.get_action("package_show")({}, {"id": id_})
tk.h.flash_success(
"Success: <a href='{url}'>{title}</a>".format(
title=pkg["title"],
url=tk.h.url_for(pkg["type"] + ".read", id=pkg["name"]),
),
True,
)

with

if data['report'] == 'details':
  for ingested in result:
    pkg = tk.get_action("package_show")({}, ingested['result']['result'])
    tk.h.flash_success(
      "Success: <a href='{url}'>{title}</a>".format(
      title=pkg["title"],
      url=tk.h.url_for(pkg["type"] + ".read", id=pkg["name"]),
    ),
    True,
  )
else:
  tk.h.flash_success("Success: {success}</br>Failed: {fail}".format(**result), True)

But a more elegant solution would be to pass the Artifacts object out of ingest_import_records and switch on type, or better yet, let each implementation of it return a string representation for consumption by flash_success.

@canon-cmre-kym-eden
Copy link
Author

canon-cmre-kym-eden commented Jul 16, 2024

Out of date documentation?

ckanext-ingest/README.md

Lines 128 to 132 in 81dd3a0

return {
"success": True,
"result": dataset,
"details": {}
}

Should this return IngestionResult instead of dict?

@canon-cmre-kym-eden canon-cmre-kym-eden changed the title Upload page works but throws errors Included ingstion page works but throws errors Jul 18, 2024
@canon-cmre-kym-eden
Copy link
Author

canon-cmre-kym-eden commented Jul 22, 2024

data = parse_params(tk.request.form)

CKAN parse_params function is not compatible with the hierarchical StrategyOptions class.
class RecordOptions(TypedDict, total=False):
"""Options for Record extracted by Strategy."""
# trigger an update if the data produced by the record already exists
update_existing: bool
# return more details after producing the data
verbose: bool
# custom options that are not stabilized yet
extras: dict[str, Any]
class StrategyOptions(TypedDict, total=False):
"""Options for Strategy."""
# options passed into every record produced by the strategy
record_options: RecordOptions
# function that returns a file from source using its name. Can be used to
# refer files in archive when creating a resource record with uploaded
# file, for example.
locator: Callable[..., Any]
# strategy that should be used for nested sources. For example, for files
# inside an archivecd
nested_strategy: str
# custom options that are not stabilized yet
extras: dict[str, Any]

It's not possible to pass the nested update_existing option this way, let alone the the form structure.

Edit

I've managed to get it working fine by replacing parse_params with dict, and packing the form parameters correctly (e.g. data['options'] = {'record_options': {'update_existing': data.get('update_existing', False)}})

@smotornyuk
Copy link
Member

Thanks, good catch.

This UI page is experimental and will change in the future. I'd recommend to make a copy and adapt it to your needs.

But it shouldn't be so broken, indeed. I adopted your suggestions.

  1. Passing artifacts out of the action is against CKAN's ideas of API actions. But we can initialize artifacts inside the view and pass them into action. Now it's possible to pass initialized artifacts object instead of string stats|details|tmp into ingest_import_records. It also means that you can even implement your version of Artifacts and use it instead of ones, available out of the box.

  2. I did not add a method that renders content for flash messages. I'm thinking about formalizing and documenting the Artifacts structure and documenting it so that something for flash messages may be added in the future. For now, I'll keep this ugly section in the view. BTW, this is still far from ideal from the safety point of view: ingestion process can create something other from datasets, so blindly using package_show is pretty dangerous. But, as I mentioned, this view is experimental and undocumented, so we can afford this risk for now.

  3. Should this return IngestionResult instead of dict?

    IngestionResult IS a dict, so it makes no difference. But I'm thinking about switching to dataclasses in v2 of the extension, so it has sense to call IngestionResult(...) to reduce amount of changes in the future.

  4. Yes, assigning update_existing manually totally makes sense.

@canon-cmre-kym-eden
Copy link
Author

Thanks for picking this up. Sorry it's a bit of a stream of consciousness. I wanted to progress but also come back and fix it later. I figured the page was an example from which to build ones own ingestion pipeline, and it was very helpful to have, even in this state.

I like your solution to not being able to pass the types out of the action. Much easier to synchronise the result data with the view. I hadn't considered the risks of rendering the package details. I mainly wanted the details for debugging and wonder if it's worth adding a check for debug/dev mode, so it only renders the details in non-production deployments?

The reason I asked about using explicit type rather than dict was that I seemed to be getting multiple nested levels of {'success': True, 'result': {'success': True, 'result': {...}}, and I thought switching to explicit type, meant it unpacked rather than add the dict into the result of the next level. I'm now pretty sure this isn't the case so, sorry about that, but it would be great switch to types in the future.

Thanks for the feedback and glad you're keeping this plug-in alive. So many useful ones are 9 years abandoned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants