Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor file management system #1543

Open
lucassousaf opened this issue Aug 7, 2023 · 1 comment
Open

Refactor file management system #1543

lucassousaf opened this issue Aug 7, 2023 · 1 comment
Assignees
Labels
Backend Backend related issue

Comments

@lucassousaf
Copy link
Contributor

The current management of files (images, CVs, etc) have a few problems that can be a problem from both maintenance and database performance in the future:

  • Some images for certain entities (stakeholder, event and organisation) store their images in base64 format in the database.
    Other entities store their images directly in Google Cloud Storage (GCS).
  • In this case we store the full URL including the bucket-name in the database. That is not a good idea if you want to change buckets or storage without affecting the existing images. Ideally the bucket-name should be a system configuration that is used at runtime to upload and serve the images and not be stored in the database.
  • GPML is using a single GCS bucket to store the images and files for both test and production environment.

The idea is to centralize the images management in a way that:

  • All files are stored in GCS.
  • Each environment has its own bucket.
  • All images/files metadata will be stored in a generic entity named File . This entity will hold metadata such as object_key, extension, filename, created_at and last_modified_at.
@lucassousaf lucassousaf self-assigned this Aug 7, 2023
@lucassousaf lucassousaf added the Backend Backend related issue label Aug 7, 2023
lucassousaf added a commit that referenced this issue Aug 9, 2023
[Re #1543]

* Visibility will determine in which bucket the file will be stored.
lucassousaf added a commit that referenced this issue Aug 9, 2023
[Re #1543]

* This only take into consideration those files that are already
stored in GCS. Other files stored in the database in base64 format or
public URLs pointing to other sources are going to be migrated by
code. We'll develop an admin API for that use case.
lucassousaf added a commit that referenced this issue Aug 9, 2023
[Re #1543]

* We'll need to sign the URLs for files stored in private buckets so
the FE can download the files without the need ask the user to login
into GC.
lucassousaf added a commit that referenced this issue Aug 9, 2023
[Re #1543]

* With this we can specify a vector containing N fns to execute in
sequence. Each fn needs to signal either success or failure using a
map like so:

`{:success? true|false}`

Extra details in the map are allowed and recommended as they are passed
as the context for the next function in the sequence. Each fn node can
have a rollback function associated with it in case the next fn in the
sequence fails. Rollback functions are not mandatory and you can use
them when needed. In the following commits there will be a full
example of this mechanism in the new file service.
lucassousaf added a commit that referenced this issue Aug 9, 2023
[Re #1543]

* Important to note the with-constraint-violation-check. This macro
basically allows to execute a JDBC call specifying which
constraint violation checks to look for in case it happens and return the
desired error's reason. If it's not a PSQLException it catches
anything else with Throwable.
lucassousaf added a commit that referenced this issue Aug 9, 2023
lucassousaf added a commit that referenced this issue Aug 9, 2023
[Re #1543]

* Also added a generic function in the gpml.domain.types namespace to
generate the Malli schema for a given type. Which comes in handy as we
always endup writing the same code for enums.
lucassousaf added a commit that referenced this issue Aug 9, 2023
[Re #1543]

Notable functions:
- create-file
- delete-file
- get-file
- get-file-url
lucassousaf added a commit that referenced this issue Aug 9, 2023
[Re #1543]

* From now on GPML will store files in different buckets depending on
its visibility.
- stakeholders and organisations entities will save their files in a
private bucket.
- the rest of entities will save their files in a public bucket.
lucassousaf added a commit that referenced this issue Aug 9, 2023
lucassousaf added a commit that referenced this issue Aug 9, 2023
[Re #1543]

* Misc: fixed clj-kondo warnings.
lucassousaf added a commit that referenced this issue Aug 9, 2023
[Re #1543]

* Fix get-public-file-url in file service to use the correct
configuration setting.
lucassousaf added a commit that referenced this issue Aug 9, 2023
[Re #1543]

* Using another random UUID can be a lot of noise. If the file doesn't
have a filename which is the case for raw base64 strings.
* Perphas an improvement for this would be for the client to send more
information about the image file. But for this specific use case the
filenames don't matter much since we'll not be listing them in the
client side.
* The filename will be useful for future developments where we'll have
to store other files other than files such as workspace documents.
lucassousaf added a commit that referenced this issue Aug 9, 2023
lucassousaf added a commit that referenced this issue Aug 9, 2023
lucassousaf added a commit that referenced this issue Aug 9, 2023
lucassousaf added a commit that referenced this issue Aug 9, 2023
lucassousaf added a commit that referenced this issue Aug 9, 2023
[Re #1543]

* Useful if we want to convert a file to a domain when pulled from the DB using join
and JSON aggregations.
lucassousaf added a commit that referenced this issue Aug 9, 2023
lucassousaf added a commit that referenced this issue Aug 9, 2023
lucassousaf added a commit that referenced this issue Aug 9, 2023
lucassousaf added a commit that referenced this issue Aug 9, 2023
lucassousaf added a commit that referenced this issue Aug 10, 2023
[Re #1543]

* To be used at the API level since due to historic reasons the API
handler's run inside a transaction and to signal failure and abort the
transaction we have to throw an exception. This should be changed in
the future and use a more ideal approach like we do in the
gpml.service.file namespace.
lucassousaf added a commit that referenced this issue Aug 10, 2023
lucassousaf added a commit that referenced this issue Aug 15, 2023
[Re #1543]

* Wrong destructuring when getting the old CV file id.
* Also remove :cv and :photo keys from response as they are not
needed anymore.
lucassousaf added a commit that referenced this issue Aug 15, 2023
[Re #1543]

* Correct instances where we use image uploads.
lucassousaf added a commit that referenced this issue Aug 15, 2023
[Re #1543]

* Mocks an object storage in the container's local file system.
lucassousaf added a commit that referenced this issue Aug 16, 2023
lucassousaf added a commit that referenced this issue Aug 16, 2023
[Re #1543]

* We should always use the same name for properties in both BE and
FE. In this case BE and DB uses picture while FE photo in some
ocasions. So we should stick to a single name.
* Frontend will do the necessary changes to update this as well.
lucassousaf added a commit that referenced this issue Aug 16, 2023
lucassousaf added a commit that referenced this issue Aug 16, 2023
[Re #1543]

* The conditional was wrong. It should be a cond instead of cond->. We
can only apply a single update not many in this case.
joseAyudarte91 pushed a commit that referenced this issue Aug 21, 2023
* Add migration to setup File model

[Re #1543]

* Visibility will determine in which bucket the file will be stored.

* Add migration to adapt old files data to new file model

[Re #1543]

* This only take into consideration those files that are already
stored in GCS. Other files stored in the database in base64 format or
public URLs pointing to other sources are going to be migrated by
code. We'll develop an admin API for that use case.

* Refactor StorageClient protocol extension to support signing URLs

[Re #1543]

* We'll need to sign the URLs for files stored in private buckets so
the FE can download the files without the need ask the user to login
into GC.

* Add thread transactions mechanism for executing seq of actions

[Re #1543]

* With this we can specify a vector containing N fns to execute in
sequence. Each fn needs to signal either success or failure using a
map like so:

`{:success? true|false}`

Extra details in the map are allowed and recommended as they are passed
as the context for the next function in the sequence. Each fn node can
have a rollback function associated with it in case the next fn in the
sequence fails. Rollback functions are not mandatory and you can use
them when needed. In the following commits there will be a full
example of this mechanism in the new file service.

* Add db/jdbc_util.clj namespace with JDBC utilities

[Re #1543]

* Important to note the with-constraint-violation-check. This macro
basically allows to execute a JDBC call specifying which
constraint violation checks to look for in case it happens and return the
desired error's reason. If it's not a PSQLException it catches
anything else with Throwable.

* Impl. file persistence layer

[Re #1543]

* Impl. file domain layer

[Re #1543]

* Also added a generic function in the gpml.domain.types namespace to
generate the Malli schema for a given type. Which comes in handy as we
always endup writing the same code for enums.

* Impl. file service for file management

[Re #1543]

Notable functions:
- create-file
- delete-file
- get-file
- get-file-url

* Add private and public GCS bucket environment variables

[Re #1543]

* From now on GPML will store files in different buckets depending on
its visibility.
- stakeholders and organisations entities will save their files in a
private bucket.
- the rest of entities will save their files in a public bucket.

* Add missing caml-snake-kebab dependency

[Re #1543]

* Improve get-file to also include the file URL

[Re #1543]

* Misc: fixed clj-kondo warnings.

* Remove silly docstring on get-files* hugsql function

[Re #1543]

* Add storage API host name environment variable

[Re #1543]

* Fix get-public-file-url in file service to use the correct
configuration setting.

* Use file-id as part of filename

[Re #1543]

* Using another random UUID can be a lot of noise. If the file doesn't
have a filename which is the case for raw base64 strings.
* Perphas an improvement for this would be for the client to send more
information about the image file. But for this specific use case the
filenames don't matter much since we'll not be listing them in the
client side.
* The filename will be useful for future developments where we'll have
to store other files other than files such as workspace documents.

* Fix get-public-file-url missing bucket-name

[Re #1543]

* Fix get-file result map

[Re #1543]

* Instrument submission API to build image URLs

[Re #1543]

* Instrument event API to create and upload files

[Re #1543]

* Instrument browse API to build file URLs

[Re #1543]

* Add utility function to add URLs to files

[Re #1543]

* Refactor browse API to use file utility fn to add urls

[Re #1543]

* Add file domain utility function to decode obj with schema

[Re #1543]

* Useful if we want to convert a file to a domain when pulled from the DB using join
and JSON aggregations.

* Instrument detail GET API to add files URLs

[Re #1543]

* Instrument cummunity API to add files URLs

[Re #1543]

* Add missing environment variable

[Re #1543]

* Fix clj-kondo warnings on browse API

[Re #1543]

* Add file handler fn to create files

[Re #1543]

* To be used at the API level since due to historic reasons the API
handler's run inside a transaction and to signal failure and abort the
transaction we have to throw an exception. This should be changed in
the future and use a more ideal approach like we do in the
gpml.service.file namespace.

* Update policy API to create files

[Re #1543]

* Update resource API to create files

[Re #1543]

* Update case_study API to create files

[Re #1543]

* Update initiative API to create files

[Re #1543]

* Update technology API to create files

[Re #1543]

* Update event API to create files

[Re #1543]

* Add get-files service function

[Re #1543]

* Clean up unnused stakeholder hugsql functions

[Re #1543]

* Update stakeholder hugsql function to support picture_id/cv_id

[Re #1543]

* Add delete-stakeholder* hugsql function

[Re #1543]

* We'll need this to rollback stakeholder creation in case of an error.

* Add persistence layer delete-stakeholder and get-stakeholder fns

[Re #1543]

* Also declared the other existing function as it helps track which
are used and which are not.
* The new functions `delete-stakeholder` and `get-stakeholder` are
intermediare functions to handle errors. In fact all other HugSQL
functions should work this way. But this change can be done organically.

* Extract download-image to gpml.util.image namespace

[Re #1543]

* We'll use it from other places too.

* Impl stakeholder service

[Re #1543]

* In order to refactor the files creation and retrieval it needs a
more controlled way to handle errors and rollbacks. So I decided to
refactor the code to create, update and get stakeholder profile to
better handle errors and rollbacks.

* Refactor check-approved to get-user-info in auth-middleware

[Re #1543]

* In some endpoints we need the user information and we can inject it
here when running the auth-middleware. User approval check is now done
per endpoint using RBAC to identify if the users has the necessary
permissions to consume the endpoint. This also take into consideration
user approval.

* Refactor stakeholder handler to use new stakeholder service

[Re #1543]

* Fix clj-kondo warnings

[Re #1543]

* Add mapping from mime-types to common-extensions

[Re #1543]

* We weren't storing the right extension for the file in some
cases. The mime-type's suffix is not always the same as the extension
name. for the most common mime-types such as `application/pdf`,
`image/jpeg` and `image/png`, the extension of the file can be easily
inferred by take the suffix of the mime-type (the string after the
slash), but for other like docx files it's completely different. So,
we are mapping those that are used in GPML to make sure we store the
right extension name.

* Add missing images files ids to initiative topic query

[Re #1543]

* We need this information later to process the files correctly.

* Fix detail GET API using wrong resource type name

[Re #1543]

* Refactor detail PUT API to use new file management system

[Re #1543]

* Fix community API organisation picture key

[Re #1543]

* Fix stakeholder GET API using the wrong user-id

[Re #1543]

* Fix stakeholder PUT restricted API not adding user-id to body

[Re #1543]

* Fix community API not handling files URLs correctly

[Re #1543]

* Now we bring the files using the query and the add the URLs for each
of them.
* Since the query is of UNION type the image is placed in a common key
that for historical reasons is `picture`.

* Update organisation API to use new file manager

[Re #1543]

* Some functions didn't have the necessary information to do the file
handling so I had to modify the references as well.

* Update export API to add files URLs to exported entities

[Re #1543]

* Fix invitation API ig keys to use global common config

[Re #1543]

* Fix 192 migration to make organisation images private

[Re #1543]

* Impl case_study hugsql functions for file migration

[Re #1543]

* Impl stakeholder hugsql function for file migration

[Re #1543]

* Impl organisation hugsql function for file migration

[Re #1543]

* Impl event hugsql function for file migration

[Re #1543]

* Impl files-migrator API to programmatically migrate files

[Re #1543]

* This API endpoint will serve to trigger file migration for certain
entities. These entities are `event`, `case_study`, `stakeholder` and
`organisation. These entities' files can't be migrated using a
database migration because some of records store files in a
different. Some store files as base64 or public URLs (not pointing to
GCS). So we have to go row by row and handle each case and finally
upload the file to GCS.
* This API will lose its purpose once the migration is done as well as
the associated hugsql function. In which case we can safely delete
everything.

* Update brs_api_importer to use new file manager

[Re #1543]

* We'll have to test this properly when we resume BRS integration.

* Fix wrong param destructuring in files-migrator API

[Re #1543]

* Add get-blob-signed-url method to file-system storage mock

[Re #1543]

* Remove deprecated handler/image_test.clj namespace

[Re #1543]

* We are going to remove its associatiated namespace in the following
commits as well.

* Remove deprecated handler/image usage in handler/resource_test

[Re #1543]

* Remove deprecated handler/image related code

[Re #1543]

* Fix stakeholder CV file creation in service layer

[Re #1543]

* Wrong destructuring when getting the old CV file id.
* Also remove :cv and :photo keys from response as they are not
needed anymore.

* Update tests affected by new file management system

[Re #1543]

* Correct instances where we use image uploads.

* Add storage-client mock adapter for tests

[Re #1543]

* Mocks an object storage in the container's local file system.

* Bump google-cloud-storage version to 2.26.0

[Re #1543]

* Update stakeholder API to use picture key instead of photo

[Re #1543]

* We should always use the same name for properties in both BE and
FE. In this case BE and DB uses picture while FE photo in some
ocasions. So we should stick to a single name.
* Frontend will do the necessary changes to update this as well.

* Fix clj-kondo warnings

[Re #1543]

* Fix conditional to add files URLs in detail API

[Re #1543]

* The conditional was wrong. It should be a cond instead of cond->. We
can only apply a single update not many in this case.

* Add exclusion for org.checkerframework/checker-qual

[Re #1543]
joseAyudarte91 added a commit that referenced this issue Aug 22, 2023
[Re #1543]

We needed to take into account dash characters among the
allowed/expected file names when building the temporal image tables for
the "image" and "thumbnail" properties during the DB migration related
to already stored GCS images for the different resource types.
Otherwise, the file name string generation ends up with an empty value,
which violates the NOT NULL constraint in the File table's Name
column.

After further checking the code and both Prod and Test Env DBs, we are
confident that these are the only changes needed to fix the issue and
avoid any other surprises (since we know how the code is building the
image urls when uploading them to GCS).
joseAyudarte91 added a commit that referenced this issue Aug 22, 2023
* Fix file management refactor DB migration

[Re #1543]

We needed to take into account dash characters among the
allowed/expected file names when building the temporal image tables for
the "image" and "thumbnail" properties during the DB migration related
to already stored GCS images for the different resource types.
Otherwise, the file name string generation ends up with an empty value,
which violates the NOT NULL constraint in the File table's Name
column.

After further checking the code and both Prod and Test Env DBs, we are
confident that these are the only changes needed to fix the issue and
avoid any other surprises (since we know how the code is building the
image urls when uploading them to GCS).

* Fix formatting of misc. files
joseAyudarte91 added a commit that referenced this issue Aug 22, 2023
[Re #1543]

The DB query used to update the case studies with the migrated images
was missing the actual `update table set` prefiex before the dynamic
HugSQL-related part. After this the API should also work for migrating
images that were not stored previously in GCS.
joseAyudarte91 added a commit that referenced this issue Aug 22, 2023
[Re #1543]

The DB query used to update the case studies with the migrated images
was missing the actual `update table set` prefiex before the dynamic
HugSQL-related part. After this the API should also work for migrating
images that were not stored previously in GCS.
joseAyudarte91 added a commit that referenced this issue Aug 24, 2023
[Re #1543]

We were not including the namespace that extends the GCSStorageClient
protocol with the methods to get blob signed URL and delete blob-relate
functionality, that was preventing the reading of private image files
from GCS. This was not reproduced locally, since there we load always
all the namespaces, which is not the case with the Uberjar. So now we
require the mentioned file as part of the File service, that is
required already, fixing the issue.
joseAyudarte91 added a commit that referenced this issue Aug 24, 2023
[Re #1543]

We were not including the namespace that extends the GCSStorageClient
protocol with the methods to get blob signed URL and delete blob-relate
functionality, that was preventing the reading of private image files
from GCS. This was not reproduced locally, since there we load always
all the namespaces, which is not the case with the Uberjar. So now we
require the mentioned file as part of the File service, that is
required already, fixing the issue.
joseAyudarte91 added a commit that referenced this issue Aug 24, 2023
[Re #1543]

Removed forgotten print statement that shouldn't be there.
joseAyudarte91 added a commit that referenced this issue Aug 24, 2023
[Re #1543]

Removed forgotten print statement that shouldn't be there.
joseAyudarte91 added a commit that referenced this issue Aug 28, 2023
[Re #1543]

We were using the inverse logic to handle the result of deleting the
image of a resource that was going to be replaced by another image as
part of the update process. The main problem is that we were throwing
an exception to rollback the DB-related updates on the resource data
when the old image delete process was successful. However, we want the
opposite and throw the exception only when the process goes wrong and
continue with the DB-related updates otherwise. Now we have fixed that
logic in all the affected places.

Tests have been updated as well to avoid passing explicit nil `image`
property when updating resources, since in that case we are expecting a
previous old image when we do not have it for those cases.
joseAyudarte91 added a commit that referenced this issue Aug 28, 2023
[Re #1543]

We were not checking if we had a previous image before trying to delete
it in the case of replacing the image of a resource by a nil value. Now
we do it.
joseAyudarte91 added a commit that referenced this issue Aug 28, 2023
[Re #1543]

In the case of failing the process of updating a resource, when updating
images, we could end-up keeping deleted image referemces from Object
Storage (GCS), in the DB, as a result of rolling back the DB-related
operation. Hence, if we end-up in that case, we try to delete those
files from the DB, so at least it's clearer for the user and no broken
links are displayed. This is the simples approach, since rollbacking
fully the changes to re-create the files would imply more complexity
and resources consumption for the server and it is not worth it, since
the user wanted to replace those files anyway, so he can just try it
again by uploading the new files in newer updates.

This process has to be improved since we are doing the mentioned
operations as part of a Catch exception block, so this code is not
catched in another block.
@joseAyudarte91
Copy link
Contributor

Leaving this issue opened to perform some post-migration clean-up tasks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Backend Backend related issue
Projects
None yet
Development

No branches or pull requests

2 participants