Hubspot Data Ingestion into the Data Platform #1324

Open
3 tasks
quazi-h opened this issue Oct 9, 2024 · 3 comments
Labels: Data Engineering, product:data-platform (Issues related to the Data Platform product)

Comments

quazi-h commented Oct 9, 2024

User Story

As a user of Hubspot, I want the data in our Hubspot instance ingested into the Data Platform so I can use it for queries. I want to be able to preserve that data if we ever decide to close our Hubspot accounts.

Description/Context

Set up a new data source for each Hubspot property so that we have a single data source / inbound connection per property.
The unique CRM properties we want to preserve from Hubspot are Contacts, Companies, Deals, and Tickets.
Get the data into the appropriate raw schemas.
Prioritize Bootcamps data first since they are planning to shut down their Hubspot account by the end of October.

Acceptance Criteria

  • Add 3 new data sources in Airbyte, one for each Hubspot property
  • Run syncs and verify that the data has been ingested into our raw dbt layer on Starburst
  • Ensure all of the Bootcamps Hubspot data is available in our Data Lake
quazi-h added the Data Engineering and product:data-platform (Issues related to the Data Platform product) labels Oct 9, 2024
quazi-h self-assigned this Oct 9, 2024

quazi-h commented Oct 17, 2024

Following the HubSpot documentation, I created a new "private app" named Airbyte-Integration in our HubSpot instance. HubSpot provided an access token, which I used to set up a new data source in our production instance of Airbyte Open Source.
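Before wiring a private app token into Airbyte, it can help to confirm the token works against HubSpot directly. The following is a minimal sketch, assuming HubSpot's CRM v3 objects endpoint and Bearer-token authentication for private apps; the function names and the placeholder token are hypothetical and not part of our setup.

```python
import json
import urllib.request

# Assumption: HubSpot's public API host and the CRM v3 contacts endpoint.
HUBSPOT_API_BASE = "https://api.hubapi.com"


def build_contacts_request(access_token: str, limit: int = 1) -> urllib.request.Request:
    """Build (but do not send) a request for one small page of contacts."""
    url = f"{HUBSPOT_API_BASE}/crm/v3/objects/contacts?limit={limit}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )


def fetch_contacts(access_token: str, limit: int = 1) -> dict:
    """Send the request and decode the JSON body.

    urllib raises HTTPError for non-2xx responses, for example when the
    token is invalid or the private app lacks the contacts read scope.
    """
    with urllib.request.urlopen(build_contacts_request(access_token, limit)) as resp:
        return json.load(resp)
```

Calling `fetch_contacts("<private-app-token>")` with a real token should return a JSON object if the token and scopes are valid; an error here would point at the token or scopes rather than at Airbyte.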

I was unable to use the access token to set up a source/connection in our QA instance of Airbyte, and I'm not sure exactly why. I kept getting this generic error message from the Airbyte UI when trying to test and set up the source:

io.temporal.failure.ActivityFailure: Activity with activityType='RunWithWorkload' failed: 'Activity task failed'. scheduledEventId=5, startedEventId=6, activityId=d33b90b6-0a79-3081-8690-1fc70d2ca0ea, identity='1@4b451b0e4d2e', retryState=RETRY_STATE_MAXIMUM_ATTEMPTS_REACHED.

Once I saw that it worked right away in our production instance, I stopped trying to get it set up in QA.
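The error quoted above packs several key=value fields into one line. As a rough sketch (the helper name is hypothetical), the fields can be pulled apart to make it clearer that the message only reports that retries were exhausted and does not include the underlying cause:

```python
import re

ERROR = (
    "io.temporal.failure.ActivityFailure: Activity with "
    "activityType='RunWithWorkload' failed: 'Activity task failed'. "
    "scheduledEventId=5, startedEventId=6, "
    "activityId=d33b90b6-0a79-3081-8690-1fc70d2ca0ea, "
    "identity='1@4b451b0e4d2e', "
    "retryState=RETRY_STATE_MAXIMUM_ATTEMPTS_REACHED."
)


def parse_failure_fields(message: str) -> dict[str, str]:
    """Extract key=value pairs; values may be quoted or bare tokens."""
    pairs = re.findall(r"(\w+)=('[^']*'|[^\s,]+)", message)
    # Drop surrounding quotes and any trailing sentence period.
    return {key: value.strip("'").rstrip(".") for key, value in pairs}


fields = parse_failure_fields(ERROR)
print(fields["activityType"], fields["retryState"])
```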

quazi-h commented Oct 17, 2024

A number of scopes can be granted when configuring the private app in HubSpot. These include all properties under the "cms" and "crm" object hierarchies, as well as several other categories: automation, communication preferences, conversations, marketing, settings, and other. I am unsure whether there is any usable or important data in those other categories. Navigating through the HubSpot UI, I could see that we only have data for "Contacts" and "Deals" in the CRM data. I prioritized those two, created Airbyte connections (Hubspot Contacts and Hubspot Deals) to sync that data to our data warehouse, and confirmed that all records exist in the new raw tables.

All of the deals data (2,495 records) appears to be from MITXONLINE:

SELECT split(properties_dealname, '-')[1], count(*)
FROM "ol_data_lake_production"."ol_warehouse_production_raw"."deals"
GROUP BY split(properties_dealname, '-')[1];
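For anyone checking the same grouping outside of Starburst, here is a small Python sketch of what the query does. The deal names in the sample are made up for illustration; the real format of properties_dealname is only assumed to be a prefix followed by a hyphen.

```python
from collections import Counter


def count_by_dealname_prefix(deal_names):
    """Count deals by the text before the first hyphen in the deal name.

    Trino arrays are 1-indexed, so split(...)[1] in the query is the first
    segment; Python's [0] is the equivalent. None names are grouped under
    None, mirroring how a NULL deal name would form its own group.
    """
    return Counter(
        name.split("-")[0] if name is not None else None
        for name in deal_names
    )


# Hypothetical deal names, not real records.
sample = ["MITXONLINE-1001", "MITXONLINE-1002", "MITXONLINE-1003"]
print(count_by_dealname_prefix(sample))
```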

There don't seem to be any fields I can use to filter by product in the contact data. I also found a field called "properties" that contains nested JSON.
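If the nested "properties" field needs to be searched for a product indicator, flattening it into top-level keys makes it easier to scan. This is a sketch only: the example keys (email, address, city) are hypothetical, and the properties_ prefix simply follows the style of the existing properties_dealname column.

```python
import json


def flatten(obj: dict, prefix: str = "") -> dict:
    """Recursively flatten nested dicts, joining key paths with underscores."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def flatten_properties(raw) -> dict:
    """Accept the field as a JSON string or an already-parsed dict."""
    parsed = json.loads(raw) if isinstance(raw, str) else raw
    return flatten(parsed, prefix="properties")


# Hypothetical record contents for illustration.
example = '{"email": "a@example.com", "address": {"city": "Cambridge"}}'
print(flatten_properties(example))
```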

I created a third Airbyte sync labeled "Airbyte Other" so I could enable additional scopes in the private app and see which data streams those scopes expose. Those syncs are just starting, and so far I have found a few additional streams: contacts_property_history and owners. It may be worth going back to the private app scopes and enabling every available property in HubSpot to make sure we are getting everything, especially since Bootcamps is closing their Hubspot account by the end of the month.

quazi-h commented Oct 17, 2024

Some questions I had:

  1. Should I run a full refresh or append historical changes? The syncs are currently set to refresh daily.
  2. What naming convention or stream prefix should the new tables use? ("raw__hubspot__"?)
  3. The plan calls for one source in Airbyte and one connection per Hubspot property. Does "property" here mean CRM properties such as Contacts and Deals?
    (I'm currently in the process of discovering additional streams available after enabling additional data scopes in HubSpot.)
    There is a sync running that has so far pulled 1,532,978 records from the "contacts_property_history" stream.
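To make the first question concrete, here is a toy model of the trade-off between overwriting and appending on each sync. It is not how Airbyte implements its sync modes; the function names and the sample rows are hypothetical and only illustrate what ends up in the destination table after two daily runs.

```python
def full_refresh_overwrite(destination: list, source_rows: list) -> list:
    """Replace the destination with the current source snapshot."""
    return list(source_rows)


def full_refresh_append(destination: list, source_rows: list) -> list:
    """Keep earlier rows and add the current snapshot after them."""
    return destination + list(source_rows)


# Hypothetical snapshots of the same deal on two consecutive days.
day1 = [{"id": 1, "stage": "open"}]
day2 = [{"id": 1, "stage": "closed"}]

overwrite_table = full_refresh_overwrite(full_refresh_overwrite([], day1), day2)
append_table = full_refresh_append(full_refresh_append([], day1), day2)

print(len(overwrite_table))  # only the latest state of the deal
print(len(append_table))     # both states, so history is preserved
```

Overwriting keeps the tables small and simple to query, while appending retains earlier states at the cost of duplicate rows that downstream models would need to deduplicate.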
