
Feat: Adding method to generate an example config based on the connector spec. #80

Merged
merged 10 commits from feature/example_config into airbytehq:main on Mar 22, 2024

Conversation

tinomerl
Contributor

Proposed fix for #79. Adds a new method that makes generating a config a bit more straightforward.

All the tests were successful except for the Snowflake ones; I currently don't have access to a Snowflake instance.

@aaronsteers
Contributor

@tinomerl - Thanks for creating this PR and the related issue!

I'll be reviewing this afternoon and will try to have a review posted tonight or tomorrow.

@tinomerl
Contributor Author

tinomerl commented Mar 1, 2024

@aaronsteers , thanks for the response!

It's just a quick wrapper that will hopefully lower the entry barrier for generating a config. The method could potentially be more sophisticated and take the user a bit more by the hand, but that's just an idea for the moment. Let me know if there's anything to improve or do in a better way; I'm eager to help! :)
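For illustration, a minimal sketch of that quick-wrapper idea, assuming the spec's config properties are already available as a dict; the function name and signature below are hypothetical and not the code in this PR:

```python
import json
from pathlib import Path
from typing import Optional


def write_example_config(config_properties: dict, output_path: Optional[str] = None) -> str:
    """Pretty-print the spec's config properties, optionally writing them to a file.

    `config_properties` is assumed to be the "properties" block of the
    connector's JSON Schema spec (see the example further down this thread).
    """
    text = json.dumps(config_properties, indent=4)
    if output_path:
        # Write the example config to the requested location.
        Path(output_path).write_text(text)
    else:
        # Otherwise, just show it in the terminal.
        print(text)
    return text
```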

@tinomerl tinomerl changed the title from "Adding method to generate an example config based on the connector spec." to "Feat: Adding method to generate an example config based on the connector spec." on Mar 1, 2024
@aaronsteers
Contributor

aaronsteers commented Mar 2, 2024

Hi, @tinomerl. Thanks for your patience while I get back to you.

I've looked at the code, and I'm not sure whether this would generate an example config or return the JSON Schema spec that is expected. Do you have a sample output from running it?

Either way, I think both are potentially helpful:

  1. A means to retrieve the JSON schema of the expected configuration object.
  2. A means to obtain a sample config, as a JSON or Yaml blob.

Related to this, yesterday I was speaking with Paul, the creator of Airbyte Serverless, and he was leveraging Octavia CLI code to generate a sample config in the form of a YAML document. I haven't looked deeply into that approach, but it seems like another potentially good path forward. (Whether it's rendered as YAML or JSON is less of a concern, but it would be good to understand the prior art used there.)

If you have thoughts and especially if you can provide sample outputs, that would be appreciated. I'll plan to spend a bit more time on this and will report back this coming week.

Cc @marcosmarxm

@tinomerl
Contributor Author

tinomerl commented Mar 4, 2024

Hey @aaronsteers,

No worries. Currently it generates a config by fetching the JSON spec, which is then returned to the terminal or written to a file, so it's more like your first point. Here's an example JSON from the Faker source:

{
    "count": {
        "title": "Count",
        "description": "How many users should be generated in total.  This setting does not apply to the purchases or products stream.",
        "type": "integer",
        "minimum": 1,
        "default": 1000,
        "order": 0
    },
    "seed": {
        "title": "Seed",
        "description": "Manually control the faker random seed to return the same values on subsequent runs (leave -1 for random)",
        "type": "integer",
        "default": -1,
        "order": 1
    },
    "records_per_slice": {
        "title": "Records Per Stream Slice",
        "description": "How many fake records will be in each page (stream slice), before a state message is emitted?",
        "type": "integer",
        "minimum": 1,
        "default": 1000,
        "order": 2
    },
    "always_updated": {
        "title": "Always Updated",
        "description": "Should the updated_at values for every record be new each sync?  Setting this to false will case the source to stop emitting records after COUNT records have been emitted.",
        "type": "boolean",
        "default": true
    },
    "parallelism": {
        "title": "Parallelism",
        "description": "How many parallel workers should we use to generate fake data?  Choose a value equal to the number of CPUs you will allocate to this source.",
        "type": "integer",
        "minimum": 1,
        "default": 4,
        "order": 4
    }
}
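As a rough illustration of the "sample config" idea (option 2 above), the `default` values in a spec like this could be folded into a ready-to-edit config. This is only a sketch under that assumption, not code from this PR:

```python
import json


def sample_config_from_properties(properties: dict) -> dict:
    """Build a best-effort example config from a JSON Schema "properties" dict."""
    sample = {}
    for name, field_schema in properties.items():
        if "default" in field_schema:
            # Use the spec's default value as the example value.
            sample[name] = field_schema["default"]
        else:
            # No default available: leave a placeholder for the user to fill in.
            sample[name] = f"<{field_schema.get('type', 'value')}>"
    return sample


# Using a trimmed version of the Faker spec shown above:
faker_properties = {
    "count": {"type": "integer", "default": 1000},
    "seed": {"type": "integer", "default": -1},
    "records_per_slice": {"type": "integer", "default": 1000},
}
print(json.dumps(sample_config_from_properties(faker_properties)))
# {"count": 1000, "seed": -1, "records_per_slice": 1000}
```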

I looked into the repo for airbyte_serverless, if it's this one. At its core, Jinja templates are used to generate a config, which is then rendered and saved as a YAML file with a lot of comments showing which parts are required/optional or part of a oneOf. (A rough sketch of this rendering idea follows after the examples below.)

Example config for Faker source

source:
  docker_image: "airbyte/source-faker:0.1.4" # GENERATED | string | A Public Docker Airbyte Source. Example: `airbyte/source-faker:0.1.4`. (see connectors list at: "https://hub.docker.com/search?q=airbyte%2Fsource-" )
  config: # PREGENERATED | object | PLEASE UPDATE this pre-generated config by following the documentation https://docs.airbyte.com/integrations/sources/faker
    count: 1000 # REQUIRED | integer | How many users should be generated in total.  This setting does not apply to the purchases or products stream.
    seed: -1 # OPTIONAL | integer | Manually control the faker random seed to return the same values on subsequent runs (leave -1 for random)
    records_per_sync: 500 # OPTIONAL | integer | How many fake records will be returned for each sync, for each stream?  By default, it will take 2 syncs to create the requested 1000 records.
    records_per_slice: 100 # OPTIONAL | integer | How many fake records will be in each page (stream slice), before a state message is emitted?
  streams: # OPTIONAL | string | Comma-separated list of streams to retrieve. If missing, all streams are retrieved from source.

destination:
  connector: "bigquery" # GENERATED | string | An AirbyteServerless Destination Connector. Must be one of ['print', 'bigquery']
  config: # PREGENERATED | object | PLEASE UPDATE this pre-generated config
    buffer_size_max: 10000 # OPTIONAL | integer | maximum number of records in buffer before writing to destination (defaults to 10000 when not specified)
    dataset: "" # REQUIRED | string | Destination dataset. Must be fully qualified with project like `PROJECT.DATASET`

remote_runner:
  type: "cloud_run_job" # GENERATED | string | Runner Type. Must be one of ['direct', 'cloud_run_job']
  config: # PREGENERATED | object | PLEASE UPDATE this pre-generated config
    project:  # REQUIRED | string | GCP Project where cloud run job will be deployed
    region: "europe-west1" # REQUIRED | string | Region where cloud run job will be deployed
    service_account: "" # OPTIONAL | string | Service account email used bu Cloud Run Job. If empty default compute service account will be used
    env_vars:  # OPTIONAL | dict | Environements Variables

Example config for file source

source:
  docker_image: "airbyte/source-file:0.4.0" # GENERATED | string | A Public Docker Airbyte Source. Example: `airbyte/source-faker:0.1.4`. (see connectors list at: "https://hub.docker.com/search?q=airbyte%2Fsource-" )
  config: # PREGENERATED | object | PLEASE UPDATE this pre-generated config by following the documentation https://docs.airbyte.com/integrations/sources/file
    dataset_name: # REQUIRED | string | The Name of the final table to replicate this file into (should include letters, numbers dash and underscores only).
    format: "csv" # REQUIRED | string | The Format of the file which should be replicated (Warning: some formats may be experimental, please refer to the docs).
    reader_options: # OPTIONAL | string | This should be a string in JSON format. It depends on the chosen file format to provide additional options and tune its behavior. | Examples: {},] }
    url: # REQUIRED | string | The URL path to access the file which should be replicated. | Examples: https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv, gs://my-google-bucket/data.csv, s3://gdelt-open-data/events/20190914.export.csv
    provider:
      ## -------- Pick one valid structure among the examples below: --------
      storage: "HTTPS" # REQUIRED | string
      user_agent: "" # OPTIONAL | boolean | Add User-Agent to request
      ## -------- Another valid structure for provider: --------
      # storage: "GCS" # REQUIRED | string
      # service_account_json: ${SERVICE_ACCOUNT_JSON} # SECRET (please store in environment variables) | OPTIONAL | string | In order to access private Buckets stored on Google Cloud, this connector would need a service account json credentials with the proper permissions as described <a href="https://cloud.google.com/iam/docs/service-accounts" target="_blank">here</a>. Please generate the credentials.json file and copy/paste its content to this field (expecting JSON formats). If accessing publicly available data, this field is not necessary.
      ## -------- Another valid structure for provider: --------
      # storage: "S3" # REQUIRED | string
      # aws_access_key_id: # OPTIONAL | string | In order to access private Buckets stored on AWS S3, this connector would need credentials with the proper permissions. If accessing publicly available data, this field is not necessary.
      # aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY} # SECRET (please store in environment variables) | OPTIONAL | string | In order to access private Buckets stored on AWS S3, this connector would need credentials with the proper permissions. If accessing publicly available data, this field is not necessary.
      ## -------- Another valid structure for provider: --------
      # storage: "AzBlob" # REQUIRED | string
      # storage_account: # REQUIRED | string | The globally unique name of the storage account that the desired blob sits within. See <a href="https://docs.microsoft.com/en-us/azure/storage/common/storage-account-overview" target="_blank">here</a> for more details.
      # sas_token: ${SAS_TOKEN} # SECRET (please store in environment variables) | OPTIONAL | string | To access Azure Blob Storage, this connector would need credentials with the proper permissions. One option is a SAS (Shared Access Signature) token. If accessing publicly available data, this field is not necessary.
      # shared_key: ${SHARED_KEY} # SECRET (please store in environment variables) | OPTIONAL | string | To access Azure Blob Storage, this connector would need credentials with the proper permissions. One option is a storage account shared key (aka account key or access key). If accessing publicly available data, this field is not necessary.
      ## -------- Another valid structure for provider: --------
      # storage: "SSH" # REQUIRED | string
      # user: # REQUIRED | string
      # password: ${PASSWORD} # SECRET (please store in environment variables) | OPTIONAL | string
      # host: # REQUIRED | string
      # port: "22" # OPTIONAL | string
      ## -------- Another valid structure for provider: --------
      # storage: "SCP" # REQUIRED | string
      # user: # REQUIRED | string
      # password: ${PASSWORD} # SECRET (please store in environment variables) | OPTIONAL | string
      # host: # REQUIRED | string
      # port: "22" # OPTIONAL | string
      ## -------- Another valid structure for provider: --------
      # storage: "SFTP" # REQUIRED | string
      # user: # REQUIRED | string
      # password: ${PASSWORD} # SECRET (please store in environment variables) | OPTIONAL | string
      # host: # REQUIRED | string
      # port: "22" # OPTIONAL | string
      ## -------- Another valid structure for provider: --------
      # storage: "local" # REQUIRED | string | WARNING: Note that the local storage URL available for reading must start with the local mount "/local/" at the moment until we implement more advanced docker mounting options.
  streams: # OPTIONAL | string | Comma-separated list of streams to retrieve. If missing, all streams are retrieved from source.

destination:
  connector: "bigquery" # GENERATED | string | An AirbyteServerless Destination Connector. Must be one of ['print', 'bigquery']
  config: # PREGENERATED | object | PLEASE UPDATE this pre-generated config
    buffer_size_max: 10000 # OPTIONAL | integer | maximum number of records in buffer before writing to destination (defaults to 10000 when not specified)
    dataset: "" # REQUIRED | string | Destination dataset. Must be fully qualified with project like PROJECT.DATASET

remote_runner:
  type: "cloud_run_job" # GENERATED | string | Runner Type. Must be one of ['direct', 'cloud_run_job']
  config: # PREGENERATED | object | PLEASE UPDATE this pre-generated config
    project: # REQUIRED | string | GCP Project where cloud run job will be deployed
    region: "europe-west1" # REQUIRED | string | Region where cloud run job will be deployed
    service_account: "" # OPTIONAL | string | Service account email used by Cloud Run Job. If empty, the default compute service account will be used
    env_vars: # OPTIONAL | dict | Environment Variables
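For reference, here is a rough sketch of that Jinja-rendering idea, i.e. turning a spec's properties into a comment-annotated YAML config. The template, field layout, and labels below are assumptions for illustration, not airbyte_serverless's actual code:

```python
from jinja2 import Template

# Render one commented line per config field, marking it REQUIRED or OPTIONAL.
CONFIG_TEMPLATE = Template(
    """config:
{%- for name, field in properties.items() %}
  {{ name }}: {{ field.get("default", "") }}  # {{ "REQUIRED" if name in required else "OPTIONAL" }} | {{ field.get("type", "") }} | {{ field.get("description", "") }}
{%- endfor %}
"""
)


def render_example_yaml(spec: dict) -> str:
    """Render a commented example config from a JSON Schema spec dict."""
    return CONFIG_TEMPLATE.render(
        properties=spec.get("properties", {}),
        required=set(spec.get("required", [])),
    )


# Trimmed-down version of the Faker spec shown earlier in this thread:
faker_spec = {
    "required": ["count"],
    "properties": {
        "count": {"type": "integer", "default": 1000, "description": "How many users should be generated in total."},
        "seed": {"type": "integer", "default": -1, "description": "Manually control the faker random seed."},
    },
}
print(render_example_yaml(faker_spec))
# config:
#   count: 1000  # REQUIRED | integer | How many users should be generated in total.
#   seed: -1  # OPTIONAL | integer | Manually control the faker random seed.
```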

Since the way I currently implemented this method only returns the spec and doesn't explain which parts are required or optional, I like the way Paul did it much more. It's smoother and gives the user more information.

I could look into it more and try something closer to the examples provided. What do you think?

@aaronsteers aaronsteers linked an issue on Mar 4, 2024 that may be closed by this pull request: Making it easier to create a config
@aaronsteers
Contributor

@tinomerl - Thank you for your patience. We've discussed internally how to represent this to the users, and I think we would want to expose this as a simple dictionary @property of the source, which would provide the JSON Schema, perhaps called config_json_schema or config_jsonschema.

We are planning to pick this up this week.
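A minimal sketch of what such a @property could look like, using one of the candidate names mentioned above; the `_get_spec()` helper and the use of the Airbyte protocol's `connectionSpecification` field are assumptions for illustration, not the merged implementation:

```python
class Source:
    """Trimmed-down stand-in for the PyAirbyte Source class (illustration only)."""

    def _get_spec(self) -> dict:
        """Hypothetical helper that returns the connector's full spec message."""
        raise NotImplementedError

    @property
    def config_json_schema(self) -> dict:
        """The JSON Schema describing this source's expected configuration."""
        # Assumes the spec message carries the schema under "connectionSpecification".
        return self._get_spec().get("connectionSpecification", {})
```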

@tinomerl
Contributor Author

@aaronsteers, thanks for the message and the info. No worries about the silence.
Let me know if I can be of any help or what changes I should make in my code.

@aaronsteers
Contributor

aaronsteers commented Mar 19, 2024

@tinomerl - Thanks for this feedback. Can you take a look at the updates in this PR here:

I've applied the changes mentioned above: renaming the method to make clear that it generates the spec (not exactly a perfect example config) and decoupling the print behavior from the property output itself. I've also kept the capability proposed here to optionally specify an output file.
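In other words, usage along these lines; the exact property, method, and parameter names here are assumptions based on the description above and may differ from the merged PR:

```python
import airbyte as ab

source = ab.get_source("source-faker", install_if_missing=True)

# Programmatic access to the spec (decoupled from printing):
spec = source.config_spec                                 # assumed property name; returns a dict

# Human-friendly output, decoupled from the property itself:
source.print_config_spec()                                # assumed helper; prints to the terminal
source.print_config_spec(output_file="faker_spec.json")   # assumed optional output-file parameter
```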

If this looks good to you, we can continue from that PR. Either way, I'll make sure to call out your valued contribution in our release notes.

@tinomerl
Contributor Author

@aaronsteers, I just looked at the PR and I've gotta say I like it. Including the filepath as a parameter and the general naming are a bit more straightforward. I'm taking notes. We can continue in the new PR then. Do you need anything else from me? :)

@aaronsteers
Contributor

aaronsteers commented Mar 20, 2024

@tinomerl - Thanks for the review and feedback!

To preserve your name as the original author and contributor on our main branch, I'll actually go ahead and bring those changes into this PR and then merge it. I'll try to complete that by EOD today or tomorrow. 🚀

@aaronsteers aaronsteers self-requested a review March 22, 2024 02:03
Contributor

@aaronsteers aaronsteers left a comment


Approving for merge. Thanks again for contributing to PyAirbyte. 🚀

@aaronsteers aaronsteers merged commit c8ceafb into airbytehq:main Mar 22, 2024
6 of 12 checks passed
@tinomerl tinomerl deleted the feature/example_config branch March 28, 2024 10:40