Enumerate unzipped files #10842

ridoo · 2023-03-23T16:05:39Z

The last base_file candidate overrides previously ones. Also, different files with the same extension (file1.csv, file2.csv) are not be accessible from file_paths as they get overridden, too.

An enumerated list of unzipped files should resolve this.

Checklist

Reviewing is a process done by project maintainers, mostly on a volunteer basis. We try to keep the overhead as small as possible and appreciate if you help us to do so by completing the following items. Feel free to ask in a comment if you have troubles with any of them.

For all pull requests:

Confirm you have read the contribution guidelines
You have sent a Contribution Licence Agreement (CLA) as necessary (not required for small changes, e.g., fixing typos in the documentation)
Make sure the first PR targets the master branch, eventual backports will be managed later. This can be ignored if the PR is fixing an issue that only happens in a specific branch, but not in newer ones.

The following are required only for core and extension modules (they are welcomed, but not required, for contrib modules):

There is a ticket in https://github.com/GeoNode/geonode/issues describing the issue/improvement/feature (a notable exemption is, changes not visible to end-users)
The issue connected to the PR must have Labels and Milestone assigned
PR for bug fixes and small new features are presented as a single commit
Commit message must be in the form "[Fixes #<issue_number>] Title of the Issue"
New unit tests have been added covering the changes, unless there is an explanation on why the tests are not necessary/implemented
This PR passes all existing unit tests (test results will be reported by travis-ci after opening this PR)
This PR passes the QA checks: black geonode && flake8 geonode
Commits changing the settings, UI, existing user workflows, or adding new functionality, need to include documentation updates
Commits adding new texts do use gettext and have updated .po / .mo files (without location infos)

Submitting the PR does not require you to check all items, but by the time it gets merged, they should be either satisfied or inapplicable.

cla-bot · 2023-03-23T16:05:43Z

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the users @github-actions[bot] on file. In order for us to review and merge your code, please contact the project maintainers to get yourself added.

gannebamm · 2023-03-23T16:14:44Z

@ridoo, please take a look at your commit history.

ridoo · 2023-03-24T11:31:32Z

Kontext:

iterdir() seems to be platform dependent. It happened to me that the order of the returned items differed on different platforms. In cases where a zip file contains multiple base_file candidates (in my case json and two csv files) the last one wins. In combination with the arbitrary order of the return value of iterdir(), this becomes non-deterministic.

Also, different files with the same extension (some_data.csv, more_data.csv) will not be accessible from file_paths as they get overridden, too.

The fix enumerates all files to make them accessible from file_paths.

ridoo · 2023-03-24T15:23:43Z

Seems that one test fails (test_geoserver_proxy_strip_paths (geonode.geoserver.tests.test_helpers.HelperTest) ... FAIL) according to restrictions on /geoserver/rest/layers. Not sure right now, if this relates to the recent adjustments made in geoservers rest.properties ...

t-book · 2023-03-25T15:46:57Z

A little improvement might be to use a context manager (with ...)

geonode/geonode/storage/data_retriever.py

Line 208 in d10741f

the_zip = zipfile.ZipFile(zip_file, allowZip64=True)

gannebamm · 2023-03-30T07:33:30Z

@afabiani I see the same tests failing for recent PRs e.g. the dependabot bumping. Can you confirm @ridoo idea? Is this due to the new rest security rules?

gannebamm · 2023-04-06T07:22:00Z

@afabiani I see the same tests failing for recent PRs e.g. the dependabot bumping. Can you confirm @ridoo idea? Is this due to the new rest security rules?

@ridoo please see #10887. If the issue gets resolved restart the tests to see if this will run successfully

giohappy · 2023-04-11T09:57:05Z

@ridoo the issue with the failing test has been fixed in fa2c6da

codecov · 2023-04-11T15:32:38Z

Codecov Report

Merging #10842 (356cf3c) into master (43188af) will increase coverage by 0.05%.
Report is 3 commits behind head on master.
The diff coverage is 56.25%.

❗ Current head 356cf3c differs from pull request most recent head 9623ff9. Consider uploading reports for the commit 9623ff9 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10842      +/-   ##
==========================================
+ Coverage   62.16%   62.21%   +0.05%     
==========================================
  Files         871      828      -43     
  Lines       52020    51262     -758     
  Branches     6495     6569      +74     
==========================================
- Hits        32338    31895     -443     
+ Misses      18133    17671     -462     
- Partials     1549     1696     +147

gannebamm · 2023-04-13T10:51:47Z

I see it passes tests now and only shows codecov fails, which I would ignore.
@ridoo, please incorporate the context manager idea of @t-book (#10842 (comment)), and then I think this can get merged. @giohappy Do you agree?

giohappy · 2023-04-13T11:27:49Z

@gannebamm I've asked for a quick internal review, but it looks good to me.

mattiagiupponi

Hi @ridoo
I tried the PR and overall the idea looks good but is not working as expected.
The consistency between the files must be preserved.
I tried to upload a zip with multiple and mixed supported types

But after running the code, I can see that the payload the key needed for an shp (for example) refers to different files:

If we start to have the possibility to refer to multiple files, I guess is better to keep the consistency between them. For example in this case I expect:

the key without index should refer to the same shp/gpkg
the key with the same index should refer to the same shp/gkp

We have also to consider eventually the XML file and SLD file that can belong to the same layer

ridoo · 2023-04-18T17:45:01Z

@mattiagiupponi thanks for having a look at this.

It seems to be unclear to me, what is actually acceptable content of an uploaded zip file. However, what about a simple sort before doing the whole magic (see my commit 573c3dc)

The zipped test collection could contain more mixed content and the test could be improved as well. But it may give you the idea, hopefully :)

mattiagiupponi · 2023-04-19T09:40:17Z

It seems to be unclear to me, what is actually acceptable content of an uploaded zip file. However, what about a simple sort before doing the whole magic (see my commit 573c3dc)

In GeoNode one zip contains just 1 layer (relation 1-1), but here since we are introducing the idea to have possible multiple base_file is important to keep the consistency with the previous behavior, anyway I don't think that a sorted could be enough I guess it would be safer to have a code which can group directly by file name instead of relying on the sorting function.

(IMO) A good way would be to:

scan the zip file
group the file available in the zip by name
based on the file extension, verify that the needed file for that resource is available (this information is available via the get_supported_datasets_file_types
Return the first completed set found

@giohappy what do you think?

ridoo · 2023-04-19T10:36:09Z

(IMO) A good way would be to:

scan the zip file

group the file available in the zip by name

based on the file extension, verify that the needed file for that resource is available (this information is available via >the get_supported_datasets_file_types

Return the first completed set found

Sorting before running the refactored loop would give you named groups, no?

However, for a datapackage.json which references just other files, a named group won't be helpful as it would for shapefiles. Here the actual handler (from the new geonode-importer) actually decides what to do with the input data. Making too strict decisions or verifications during a simple unzip step would go too far in my opinion. What do you think?

ridoo · 2023-05-15T11:39:25Z

Hey @mattiagiupponi @giohappy do you see any chance to see this PR merged into an upcoming release (4.2.x or even 4.1.x)?

If you need further input from our side, please let me know! 🌞

cc @gannebamm

giohappy · 2023-05-24T07:35:16Z

@ridoo sorry for the late reply. We can plan to have it in 4.2.0, but I think we need more thinking to implement a clean solution for the grouping of files.

ridoo · 2023-05-26T06:51:28Z

@giohappy no worries .. are you going to share the discussion on this ticket or elsewhere?

Test example is available still from the PR GeoNode#10842

ridoo · 2023-10-11T13:29:24Z

@giohappy do you have any updates on this one? If you need further clarifications or work on my side please let me know

last statement from @mattiagiupponi was to discuss, what is acceptable content of an uploaded zip file (e.g. containing mixed content). See here: #10842 (comment)

giohappy · 2023-10-12T09:46:50Z

@ridoo from what I see your proposal doesn't work in several ways:

1. We don't support multiple base_files

Every single (multi)file to be handled by the importer must have a base_file key. Outside the issue of sorting on grouping, you end up with a data structure where you have 1 base_file and multiple shp_file, shx_file, etc.
The last "group" of files that are processed overwrite the single available base_file value. I suspect this is what happens in the screenshot from @mattiagiupponi:

I suspect that your test is a false positive because you only take one group of files (the ones that generate the shp_file, etc. keys without postfixes. If you run the test on all the other files in _files you would get an assertion error since it wouldn't find a matched base_file

2. The importer can only handle one multi(file) at a time

The importer creates one (multi)format at a time. The create method triggers only one handler. If the zip file you send contains multiple files, it will get the handler only for the single base_file inside the retrieved paths.
So, in the end, the additional files would be discarded.
For the problem at point 1, you might end up with a base_file file that doesn't match with the other unprefixed keys.

Multifile support must be rethought IMHO, both at the StorageManager and the ImporterViewSet level

ridoo · 2023-10-12T11:56:47Z

@ridoo from what I see your proposal doesn't work in several ways:

1. We don't support multiple base_files

Every single (multi)file to be handled by the importer must have a base_file key. Outside the issue of sorting on grouping, you end up with a data structure where you have 1 base_file and multiple shp_file, shx_file, etc. The last "group" of files that are processed overwrite the single available base_file value.

Well, the multi-mixed content example was one of @mattiagiupponi 's tests. That is why I was asking on the criteria, what actually is acceptable content of a zip file. If I understand you correctly, only one file assemble (e.g. shp, shx, prj, dbf) is allowed. This would be ok to me, and does not interfere with the issue I tried to describe.

I suspect that your test is a false positive because you only take one group of files (the ones that generate the shp_file, etc. keys without postfixes. If you run the test on all the other files in _files you would get an assertion error since it wouldn't find a matched base_file

Well, I do not mind (for this issue) if one file assemble overrides the other. What I tried to describe is the fact, that different handlers may decide on different base_files. In my case, my upload handler has a json file as base_file and multiple csv files as related files. See my comment: #10842 (comment). Sorting was just a quick hack to group each file assembly so that they get processed sequentially.

Multifile support must be rethought IMHO, both at the StorageManager and the ImporterViewSet level

I agree here, but as said, this was not the intent of this issue.

giohappy · 2023-10-13T10:16:42Z

@ridoo I admit I lost track of this issue by reading only the comments. I've read the initial description again and now things are much clearer 😀.

Okay, so the goal was to not break the upload in case multiple files are put inside the uploaded zip file and let the handlers manage the enumerated paths.

BTW, even taking into account only specific cases, your patch would support only a very specific case: you have a custom handler registered at the top of the handlers registry.

Example:
You have a zip containing 1 json file + many csv files. Both json and csv are good candidates for the base_file. The base_file selection would depend on the (lexicographical) order, am I wrong?

So, when the ImporterViewSet selects the right handler, it do it by asking each handler if it can handle the fileset, and so the first one in the list will be selected, and with the factory handlers, it could either the geojson or the csv handler.

If your custom handler is at the top of the handlers registry, it might return immediately because you know how to handle that specific fileset.

If this is your expected scenario that's fine, although I don't think we can control the ordering of the registry at the moment.

iterdir() is platform dependent, that is the order of the returned items may be different on different platforms. In cases where a zip file contains multiple base_file candidates it will be overridden by the last one found (which varies on different platforms). Also, different files with the same extension (file1.csv, file2.csv) will not be accessible from file_paths as they get overridden, too. The fix enumerates all files to make them accessible from file_paths.

Ensures that unpacked content is sorted before getting handled

ridoo · 2023-10-13T12:09:29Z

Okay, so the goal was to not break the upload in case multiple files are put inside the uploaded zip file and let the handlers manage the enumerated paths.

That's right.

Example: You have a zip containing 1 json file + many csv files. Both json and csv are good candidates for the base_file. The base_file selection would depend on the (lexicographical) order, am I wrong?

To my understanding right now, the unpack process is just making a best guess on the base_file, depending on what geonode.utils.get_supported_datasets_file_types() returned first item is. This is ok, as I am free to re-set the base_file in my handler.

So, when the ImporterViewSet selects the right handler, it do it by asking each handler if it can handle the fileset, and so the first one in the list will be selected, and with the factory handlers, it could either the geojson or the csv handler.

If your custom handler is at the top of the handlers registry, it might return immediately because you know how to handle that specific fileset.

This is the way I have it configured and it works.

If this is your expected scenario that's fine, although I don't think we can control the ordering of the registry at the moment.

Ok, perhaps this issue targets multiple things here. The main issue, actually, is that unpacking files overrides files having the same extensions. This is why the PR is calle "Enumerate unzipped files", ie. two csv files in a zip would be unpacked to files[csv_1] and files[csv_2]

giohappy · 2023-10-13T12:26:02Z

This is the way I have it configured and it works.

@ridoo ok, the ordering of the handlers in the settings is respected when they're registered

@mattiagiupponi to me this PR is harmless. It doesn't break the current behavior (which is undefined behavior in case the zip contains multiple filesets) and it offers support to handlers that can handle multiple file sets.

mattiagiupponi · 2023-10-13T12:35:57Z

@mattiagiupponi to me this PR is harmless. It doesn't break the current behavior (which is undefined behavior in case the zip contains multiple filesets) and it offers support to handlers that can handle multiple file sets.

i agree

@ridoo can you please fix the flake8 and black issues? So we can have also the greenline from CircleCI.
Flake8 is complaining about this:

^@^@All done! ✨ 🍰 ✨
503 files would be left unchanged.
geonode/storage/data_retriever.py:224:39: E721 do not compare types, for exact checks use `is` / `is not`, for instance checks use `isinstance()`
geonode/storage/data_retriever.py:231:16: E721 do not compare types, for exact checks use `is` / `is not`, for instance checks use `isinstance()`

I guess replacing the type(existing) == list with isinstance(existing, list) should be enough

ridoo · 2023-10-13T13:52:09Z

Oh yes, will do this next.

One quick note: I removed the test which was based on wrong assumptions. If you would like to see tests on this part, I would have to do this some time next week.

gitguardian · 2023-10-16T10:39:57Z

️✅ There are no secrets present in this pull request anymore.

If these secrets were true positive and are still valid, we highly recommend you to revoke them.
Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.
Find here more information about risks.

^{_{🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

Our GitHub checks need improvements? Share your feedbacks!}}

mattiagiupponi · 2023-10-16T10:54:49Z

Hi @ridoo, a small test to be sure that works as expected it would be nice

ridoo · 2023-10-16T16:56:23Z

@mattiagiupponi please have a look at my (really simple) test on enumeration/indexing multiple files with the same extension.

I just added a copy of zipped example.csv file. As ordering is undefined, I changed an existing test as well and commented where appropriate.

mattiagiupponi

@ridoo it looks good to me

See GeoNode/geonode#10842

Test example is available still from the PR GeoNode#10842

ridoo force-pushed the keep-unzipped-files-accessible branch from 28ab262 to 9ea06df Compare March 23, 2023 16:25

cla-bot bot added the cla-signed CLA Bot: community license agreement signed label Mar 23, 2023

ridoo force-pushed the keep-unzipped-files-accessible branch 2 times, most recently from 3761d1a to d10741f Compare March 24, 2023 08:17

giohappy requested a review from mattiagiupponi April 13, 2023 12:12

mattiagiupponi requested changes Apr 17, 2023

View reviewed changes

ridoo mentioned this pull request May 9, 2023

GNIP 92: non-spatial structured data as (pre)viewable FAIR datasets #8714

Open

5 tasks

ridoo added a commit to Thuenen-GeoNode-Development/geonode that referenced this pull request Sep 20, 2023

Remove example test

6ea4ba3

Test example is available still from the PR GeoNode#10842

ridoo added a commit to Thuenen-GeoNode-Development/geonode that referenced this pull request Oct 9, 2023

Remove example test

104270d

Test example is available still from the PR GeoNode#10842

ridoo added 2 commits October 13, 2023 12:44

Sorts files during unzip

15e4d45

Ensures that unpacked content is sorted before getting handled

ridoo force-pushed the keep-unzipped-files-accessible branch 2 times, most recently from 4d3040d to 15e4d45 Compare October 13, 2023 11:17

Resolve minor issues

37be03a

ridoo force-pushed the keep-unzipped-files-accessible branch from 8bd8c0d to 37be03a Compare October 13, 2023 14:04

ridoo force-pushed the keep-unzipped-files-accessible branch from d06323d to b75d185 Compare October 16, 2023 12:43

Ensure index on extensions found multiple times

9623ff9

ridoo force-pushed the keep-unzipped-files-accessible branch from b75d185 to 9623ff9 Compare October 16, 2023 12:45

mattiagiupponi approved these changes Oct 17, 2023

View reviewed changes

giohappy merged commit 9ca2f13 into GeoNode:master Oct 17, 2023
16 checks passed

ridoo added a commit to GeoNodeUserGroup-DE/contrib_datapackage that referenced this pull request Nov 2, 2023

Update README after merged GeoNode PR

553dae9

See GeoNode/geonode#10842

ridoo added a commit to Thuenen-GeoNode-Development/geonode that referenced this pull request Feb 22, 2024

Remove example test

10a0c36

Test example is available still from the PR GeoNode#10842

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enumerate unzipped files #10842

Enumerate unzipped files #10842

ridoo commented Mar 23, 2023

cla-bot bot commented Mar 23, 2023

gannebamm commented Mar 23, 2023

ridoo commented Mar 24, 2023

ridoo commented Mar 24, 2023 •

edited

Loading

t-book commented Mar 25, 2023

gannebamm commented Mar 30, 2023

gannebamm commented Apr 6, 2023

giohappy commented Apr 11, 2023

codecov bot commented Apr 11, 2023 •

edited

Loading

gannebamm commented Apr 13, 2023

giohappy commented Apr 13, 2023

mattiagiupponi left a comment

ridoo commented Apr 18, 2023 •

edited

Loading

mattiagiupponi commented Apr 19, 2023

ridoo commented Apr 19, 2023 •

edited

Loading

ridoo commented May 15, 2023 •

edited

Loading

giohappy commented May 24, 2023

ridoo commented May 26, 2023

ridoo commented Oct 11, 2023 •

edited

Loading

giohappy commented Oct 12, 2023

ridoo commented Oct 12, 2023 •

edited

Loading

1. We don't support multiple base_files

giohappy commented Oct 13, 2023

ridoo commented Oct 13, 2023

giohappy commented Oct 13, 2023

mattiagiupponi commented Oct 13, 2023

ridoo commented Oct 13, 2023 •

edited

Loading

gitguardian bot commented Oct 16, 2023 •

edited

Loading

mattiagiupponi commented Oct 16, 2023

ridoo commented Oct 16, 2023

mattiagiupponi left a comment

Enumerate unzipped files #10842

Enumerate unzipped files #10842

Conversation

ridoo commented Mar 23, 2023

Checklist

cla-bot bot commented Mar 23, 2023

gannebamm commented Mar 23, 2023

ridoo commented Mar 24, 2023

ridoo commented Mar 24, 2023 • edited Loading

t-book commented Mar 25, 2023

gannebamm commented Mar 30, 2023

gannebamm commented Apr 6, 2023

giohappy commented Apr 11, 2023

codecov bot commented Apr 11, 2023 • edited Loading

Codecov Report

gannebamm commented Apr 13, 2023

giohappy commented Apr 13, 2023

mattiagiupponi left a comment

Choose a reason for hiding this comment

ridoo commented Apr 18, 2023 • edited Loading

mattiagiupponi commented Apr 19, 2023

ridoo commented Apr 19, 2023 • edited Loading

ridoo commented May 15, 2023 • edited Loading

giohappy commented May 24, 2023

ridoo commented May 26, 2023

ridoo commented Oct 11, 2023 • edited Loading

giohappy commented Oct 12, 2023

1. We don't support multiple base_files

2. The importer can only handle one multi(file) at a time

ridoo commented Oct 12, 2023 • edited Loading

1. We don't support multiple base_files

giohappy commented Oct 13, 2023

ridoo commented Oct 13, 2023

giohappy commented Oct 13, 2023

mattiagiupponi commented Oct 13, 2023

ridoo commented Oct 13, 2023 • edited Loading

gitguardian bot commented Oct 16, 2023 • edited Loading

️✅ There are no secrets present in this pull request anymore.

mattiagiupponi commented Oct 16, 2023

ridoo commented Oct 16, 2023

mattiagiupponi left a comment

Choose a reason for hiding this comment

ridoo commented Mar 24, 2023 •

edited

Loading

codecov bot commented Apr 11, 2023 •

edited

Loading

ridoo commented Apr 18, 2023 •

edited

Loading

ridoo commented Apr 19, 2023 •

edited

Loading

ridoo commented May 15, 2023 •

edited

Loading

ridoo commented Oct 11, 2023 •

edited

Loading

ridoo commented Oct 12, 2023 •

edited

Loading

ridoo commented Oct 13, 2023 •

edited

Loading

gitguardian bot commented Oct 16, 2023 •

edited

Loading