
Feat: Adding method to generate an example config based on the connector spec. #80

Merged
merged 10 commits from feature/example_config into airbytehq:main on Mar 22, 2024

Conversation

tinomerl
Contributor

Proposed fix for #79. Adds a new method that makes generating a config a bit more straightforward.

All the tests were successful except for the Snowflake ones; I currently don't have access to a Snowflake instance.

@aaronsteers
Contributor

@tinomerl - Thanks for creating this PR and the related issue!

I'll be reviewing this afternoon and will try to have a review posted tonight or tomorrow.

@tinomerl
Contributor Author

tinomerl commented Mar 1, 2024

@aaronsteers , thanks for the response!

It's just a quick wrapper that will hopefully lower the entry barrier for generating a config. The method could potentially be more sophisticated and take the user a bit more by the hand, but that's just an idea for the moment. Let me know if there's anything to improve or do in a better way; I'm eager to help! :)
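For illustration, a minimal sketch of that quick-wrapper idea, assuming the spec's config properties are already available as a dict; the function name and signature below are hypothetical and not the code in this PR:

```python
import json
from pathlib import Path
from typing import Optional


def write_example_config(config_properties: dict, output_path: Optional[str] = None) -> str:
    """Pretty-print the spec's config properties, optionally writing them to a file.

    `config_properties` is assumed to be the "properties" block of the
    connector's JSON Schema spec (see the example further down this thread).
    """
    text = json.dumps(config_properties, indent=4)
    if output_path:
        # Write the example config to the requested location.
        Path(output_path).write_text(text)
    else:
        # Otherwise, just show it in the terminal.
        print(text)
    return text
```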

@tinomerl tinomerl changed the title from "Adding method to generate an example config based on the connector spec." to "Feat: Adding method to generate an example config based on the connector spec." on Mar 1, 2024
@aaronsteers
Contributor

aaronsteers commented Mar 2, 2024

Hi, @tinomerl. Thanks for your patience while I get back to you.

I've looked at the code, and I'm not sure whether this would generate an example config or return the JSON Schema spec that is expected. Do you have a sample output from running it?

Either way, I think both are potentially helpful:

  1. A means to retrieve the JSON schema of the expected configuration object.
  2. A means to obtain a sample config, as a JSON or Yaml blob.

Related to this, yesterday I was speaking with Paul, the creator of Airbyte Serverless, and he was leveraging Octavia CLI code to generate a sample config in the form of a YAML document. I haven't looked deeply into that approach, but it seems like another potentially good path forward. (Whether it's rendered as YAML or JSON is less of a concern, but it would be good to understand the prior art used there.)

If you have thoughts and especially if you can provide sample outputs, that would be appreciated. I'll plan to spend a bit more time on this and will report back this coming week.

Cc @marcosmarxm

@tinomerl
Contributor Author

tinomerl commented Mar 4, 2024

Hey @aaronsteers,

No worries. Currently it generates a config by fetching the JSON spec, which is then returned to the terminal or written to a file, so it's more like your first point. Here's an example JSON from the Faker source:

{
    "count": {
        "title": "Count",
        "description": "How many users should be generated in total.  This setting does not apply to the purchases or products stream.",
        "type": "integer",
        "minimum": 1,
        "default": 1000,
        "order": 0
    },
    "seed": {
        "title": "Seed",
        "description": "Manually control the faker random seed to return the same values on subsequent runs (leave -1 for random)",
        "type": "integer",
        "default": -1,
        "order": 1
    },
    "records_per_slice": {
        "title": "Records Per Stream Slice",
        "description": "How many fake records will be in each page (stream slice), before a state message is emitted?",
        "type": "integer",
        "minimum": 1,
        "default": 1000,
        "order": 2
    },
    "always_updated": {
        "title": "Always Updated",
        "description": "Should the updated_at values for every record be new each sync?  Setting this to false will case the source to stop emitting records after COUNT records have been emitted.",
        "type": "boolean",
        "default": true
    },
    "parallelism": {
        "title": "Parallelism",
        "description": "How many parallel workers should we use to generate fake data?  Choose a value equal to the number of CPUs you will allocate to this source.",
        "type": "integer",
        "minimum": 1,
        "default": 4,
        "order": 4
    }
}
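As a rough illustration of the "sample config" idea (option 2 above), the `default` values in a spec like this could be folded into a ready-to-edit config. This is only a sketch under that assumption, not code from this PR:

```python
import json


def sample_config_from_properties(properties: dict) -> dict:
    """Build a best-effort example config from a JSON Schema "properties" dict."""
    sample = {}
    for name, field_schema in properties.items():
        if "default" in field_schema:
            # Use the spec's default value as the example value.
            sample[name] = field_schema["default"]
        else:
            # No default available: leave a placeholder for the user to fill in.
            sample[name] = f"<{field_schema.get('type', 'value')}>"
    return sample


# Using a trimmed version of the Faker spec shown above:
faker_properties = {
    "count": {"type": "integer", "default": 1000},
    "seed": {"type": "integer", "default": -1},
    "records_per_slice": {"type": "integer", "default": 1000},
}
print(json.dumps(sample_config_from_properties(faker_properties)))
# {"count": 1000, "seed": -1, "records_per_slice": 1000}
```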

I looked into the repo for airbyte_serverless, if it's this one. At its core, Jinja templates are used to generate a config, which is then rendered and saved as a YAML file with a lot of comments showing which parts are required/optional or part of a oneOf. (A rough sketch of this rendering idea follows after the examples below.)

Example config for Faker source

source:
  docker_image: "airbyte/source-faker:0.1.4" # GENERATED | string | A Public Docker Airbyte Source. Example: `airbyte/source-faker:0.1.4`. (see connectors list at: "https://hub.docker.com/search?q=airbyte%2Fsource-" )
  config: # PREGENERATED | object | PLEASE UPDATE this pre-generated config by following the documentation https://docs.airbyte.com/integrations/sources/faker
    count: 1000 # REQUIRED | integer | How many users should be generated in total.  This setting does not apply to the purchases or products stream.
    seed: -1 # OPTIONAL | integer | Manually control the faker random seed to return the same values on subsequent runs (leave -1 for random)
    records_per_sync: 500 # OPTIONAL | integer | How many fake records will be returned for each sync, for each stream?  By default, it will take 2 syncs to create the requested 1000 records.
    records_per_slice: 100 # OPTIONAL | integer | How many fake records will be in each page (stream slice), before a state message is emitted?
  streams: # OPTIONAL | string | Comma-separated list of streams to retrieve. If missing, all streams are retrieved from source.

destination:
  connector: "bigquery" # GENERATED | string | An AirbyteServerless Destination Connector. Must be one of ['print', 'bigquery']
  config: # PREGENERATED | object | PLEASE UPDATE this pre-generated config
    buffer_size_max: 10000 # OPTIONAL | integer | maximum number of records in buffer before writing to destination (defaults to 10000 when not specified)
    dataset: "" # REQUIRED | string | Destination dataset. Must be fully qualified with project like `PROJECT.DATASET`

remote_runner:
  type: "cloud_run_job" # GENERATED | string | Runner Type. Must be one of ['direct', 'cloud_run_job']
  config: # PREGENERATED | object | PLEASE UPDATE this pre-generated config
    project:  # REQUIRED | string | GCP Project where cloud run job will be deployed
    region: "europe-west1" # REQUIRED | string | Region where cloud run job will be deployed
    service_account: "" # OPTIONAL | string | Service account email used bu Cloud Run Job. If empty default compute service account will be used
    env_vars:  # OPTIONAL | dict | Environements Variables

Example config for file source

source:
  docker_image: "airbyte/source-file:0.4.0" # GENERATED | string | A Public Docker Airbyte Source. Example: `airbyte/source-faker:0.1.4`. (see connectors list at: "https://hub.docker.com/search?q=airbyte%2Fsource-" )
  config: # PREGENERATED | object | PLEASE UPDATE this pre-generated config by following the documentation https://docs.airbyte.com/integrations/sources/file
    dataset_name: # REQUIRED | string | The Name of the final table to replicate this file into (should include letters, numbers dash and underscores only).
    format: "csv" # REQUIRED | string | The Format of the file which should be replicated (Warning: some formats may be experimental, please refer to the docs).
    reader_options: # OPTIONAL | string | This should be a string in JSON format. It depends on the chosen file format to provide additional options and tune its behavior. | Examples: {},] }
    url: # REQUIRED | string | The URL path to access the file which should be replicated. | Examples: https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv, gs://my-google-bucket/data.csv, s3://gdelt-open-data/events/20190914.export.csv
    provider:
      ## -------- Pick one valid structure among the examples below: --------
      storage: "HTTPS" # REQUIRED | string
      user_agent: "" # OPTIONAL | boolean | Add User-Agent to request
      ## -------- Another valid structure for provider: --------
      # storage: "GCS" # REQUIRED | string
      # service_account_json: ${SERVICE_ACCOUNT_JSON} # SECRET (please store in environment variables) | OPTIONAL | string | In order to access private Buckets stored on Google Cloud, this connector would need a service account json credentials with the proper permissions as described <a href="https://cloud.google.com/iam/docs/service-accounts" target="_blank">here</a>. Please generate the credentials.json file and copy/paste its content to this field (expecting JSON formats). If accessing publicly available data, this field is not necessary.
      ## -------- Another valid structure for provider: --------
      # storage: "S3" # REQUIRED | string
      # aws_access_key_id: # OPTIONAL | string | In order to access private Buckets stored on AWS S3, this connector would need credentials with the proper permissions. If accessing publicly available data, this field is not necessary.
      # aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY} # SECRET (please store in environment variables) | OPTIONAL | string | In order to access private Buckets stored on AWS S3, this connector would need credentials with the proper permissions. If accessing publicly available data, this field is not necessary.
      ## -------- Another valid structure for provider: --------
      # storage: "AzBlob" # REQUIRED | string
      # storage_account: # REQUIRED | string | The globally unique name of the storage account that the desired blob sits within. See <a href="https://docs.microsoft.com/en-us/azure/storage/common/storage-account-overview" target="_blank">here</a> for more details.
      # sas_token: ${SAS_TOKEN} # SECRET (please store in environment variables) | OPTIONAL | string | To access Azure Blob Storage, this connector would need credentials with the proper permissions. One option is a SAS (Shared Access Signature) token. If accessing publicly available data, this field is not necessary.
      # shared_key: ${SHARED_KEY} # SECRET (please store in environment variables) | OPTIONAL | string | To access Azure Blob Storage, this connector would need credentials with the proper permissions. One option is a storage account shared key (aka account key or access key). If accessing publicly available data, this field is not necessary.
      ## -------- Another valid structure for provider: --------
      # storage: "SSH" # REQUIRED | string
      # user: # REQUIRED | string
      # password: ${PASSWORD} # SECRET (please store in environment variables) | OPTIONAL | string
      # host: # REQUIRED | string
      # port: "22" # OPTIONAL | string
      ## -------- Another valid structure for provider: --------
      # storage: "SCP" # REQUIRED | string
      # user: # REQUIRED | string
      # password: ${PASSWORD} # SECRET (please store in environment variables) | OPTIONAL | string
      # host: # REQUIRED | string
      # port: "22" # OPTIONAL | string
      ## -------- Another valid structure for provider: --------
      # storage: "SFTP" # REQUIRED | string
      # user: # REQUIRED | string
      # password: ${PASSWORD} # SECRET (please store in environment variables) | OPTIONAL | string
      # host: # REQUIRED | string
      # port: "22" # OPTIONAL | string
      ## -------- Another valid structure for provider: --------
      # storage: "local" # REQUIRED | string | WARNING: Note that the local storage URL available for reading must start with the local mount "/local/" at the moment until we implement more advanced docker mounting options.
  streams: # OPTIONAL | string | Comma-separated list of streams to retrieve. If missing, all streams are retrieved from source.

destination:
  connector: "bigquery" # GENERATED | string | An AirbyteServerless Destination Connector. Must be one of ['print', 'bigquery']
  config: # PREGENERATED | object | PLEASE UPDATE this pre-generated config
    buffer_size_max: 10000 # OPTIONAL | integer | maximum number of records in buffer before writing to destination (defaults to 10000 when not specified)
    dataset: "" # REQUIRED | string | Destination dataset. Must be fully qualified with project like PROJECT.DATASET

remote_runner:
  type: "cloud_run_job" # GENERATED | string | Runner Type. Must be one of ['direct', 'cloud_run_job']
  config: # PREGENERATED | object | PLEASE UPDATE this pre-generated config
    project: # REQUIRED | string | GCP Project where cloud run job will be deployed
    region: "europe-west1" # REQUIRED | string | Region where cloud run job will be deployed
    service_account: "" # OPTIONAL | string | Service account email used by Cloud Run Job. If empty, the default compute service account will be used
    env_vars: # OPTIONAL | dict | Environment Variables
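For reference, here is a rough sketch of that Jinja-rendering idea, i.e. turning a spec's properties into a comment-annotated YAML config. The template, field layout, and labels below are assumptions for illustration, not airbyte_serverless's actual code:

```python
from jinja2 import Template

# Render one commented line per config field, marking it REQUIRED or OPTIONAL.
CONFIG_TEMPLATE = Template(
    """config:
{%- for name, field in properties.items() %}
  {{ name }}: {{ field.get("default", "") }}  # {{ "REQUIRED" if name in required else "OPTIONAL" }} | {{ field.get("type", "") }} | {{ field.get("description", "") }}
{%- endfor %}
"""
)


def render_example_yaml(spec: dict) -> str:
    """Render a commented example config from a JSON Schema spec dict."""
    return CONFIG_TEMPLATE.render(
        properties=spec.get("properties", {}),
        required=set(spec.get("required", [])),
    )


# Trimmed-down version of the Faker spec shown earlier in this thread:
faker_spec = {
    "required": ["count"],
    "properties": {
        "count": {"type": "integer", "default": 1000, "description": "How many users should be generated in total."},
        "seed": {"type": "integer", "default": -1, "description": "Manually control the faker random seed."},
    },
}
print(render_example_yaml(faker_spec))
# config:
#   count: 1000  # REQUIRED | integer | How many users should be generated in total.
#   seed: -1  # OPTIONAL | integer | Manually control the faker random seed.
```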

Since the way I currently implemented this method only returns the spec and doesn't explain which parts are required or optional, I like the way Paul did it much more. It's smoother and gives the user more information.

I could look into it more and try something closer to the examples provided. What do you think?

@aaronsteers aaronsteers linked an issue on Mar 4, 2024 that may be closed by this pull request: Making it easier to create a config
@aaronsteers
Contributor

@tinomerl - Thank you for your patience. We've discussed internally how to represent this to the users, and I think we would want to expose this as a simple dictionary @property of the source, which would provide the JSON Schema, perhaps called config_json_schema or config_jsonschema.

We are planning to pick this up this week.
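A minimal sketch of what such a @property could look like, using one of the candidate names mentioned above; the `_get_spec()` helper and the use of the Airbyte protocol's `connectionSpecification` field are assumptions for illustration, not the merged implementation:

```python
class Source:
    """Trimmed-down stand-in for the PyAirbyte Source class (illustration only)."""

    def _get_spec(self) -> dict:
        """Hypothetical helper that returns the connector's full spec message."""
        raise NotImplementedError

    @property
    def config_json_schema(self) -> dict:
        """The JSON Schema describing this source's expected configuration."""
        # Assumes the spec message carries the schema under "connectionSpecification".
        return self._get_spec().get("connectionSpecification", {})
```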

@tinomerl
Contributor Author

@aaronsteers, thanks for the message and the info. No worries about the silence.
Let me know if I can be of any help or what changes I should make in my code.

@aaronsteers
Contributor

aaronsteers commented Mar 19, 2024

@tinomerl - Thanks for this feedback. Can you take a look at the updates in this PR here:

I've applied the changes mentioned above: renaming the method to make clear that it generates the spec (not exactly a perfect example config) and decoupling the print behavior from the property output itself. I've also kept the capability proposed here to optionally specify an output file.
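In other words, usage along these lines; the exact property, method, and parameter names here are assumptions based on the description above and may differ from the merged PR:

```python
import airbyte as ab

source = ab.get_source("source-faker", install_if_missing=True)

# Programmatic access to the spec (decoupled from printing):
spec = source.config_spec                                 # assumed property name; returns a dict

# Human-friendly output, decoupled from the property itself:
source.print_config_spec()                                # assumed helper; prints to the terminal
source.print_config_spec(output_file="faker_spec.json")   # assumed optional output-file parameter
```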

If this looks good to you, we can continue from that PR. Either way, I'll make sure to call out your valued contribution in our release notes.

@tinomerl
Contributor Author

@aaronsteers, I just looked at the PR and I've gotta say I like it. Including the filepath as a parameter and the general naming are a bit more straightforward. I'm taking notes. We can continue in the new PR then. Do you need anything else from me? :)

@aaronsteers
Contributor

aaronsteers commented Mar 20, 2024

@tinomerl - Thanks for the review and feedback!

To preserve your name as the original author and contributor on our main branch, I'll actually go ahead and bring those changes into this PR and then merge it. I'll try to complete that by EOD today or tomorrow. 🚀

@aaronsteers aaronsteers self-requested a review March 22, 2024 02:03
Contributor

@aaronsteers aaronsteers left a comment


Approving for merge. Thanks again for contributing to PyAirbyte. 🚀

@aaronsteers aaronsteers merged commit c8ceafb into airbytehq:main Mar 22, 2024
6 of 12 checks passed
@tinomerl tinomerl deleted the feature/example_config branch March 28, 2024 10:40