datastaxdevs/workshop-beam


Workshop Apache Beam and Google DataFlow

Gitpod ready-to-code License Apache2 Discord

📋 Table of contents

HouseKeeping

LAB

WalkThrough


HouseKeeping

Objectives

  • Introduce AstraDB and Vector Search capability
  • Give you a first understanding of Apache Beam and Google DataFlow
  • Discover NoSQL distributed databases, and especially Apache Cassandra™
  • Get familiar with a few Google Cloud Platform services

Frequently asked questions

1️⃣ Can I run this workshop on my computer?

There is nothing preventing you from running the workshop on your own machine. If you do so, you will need the following:

  1. git installed on your local system
  2. Java installed on your local system
  3. Maven installed on your local system

In this readme, we try to provide instructions for local development as well. Keep in mind, however, that the main focus is development on Gitpod, so we can't guarantee live support for local development if we want to stay on schedule. We will still do our best to give you the info you need to succeed.

2️⃣ What other prerequisites are required?
  • You will need enough *real estate* on screen: we will ask you to open a few windows, and it does not fit on mobile phones (tablets should be OK)
  • You will need a GitHub account and possibly a Google account for Google authentication (optional)
  • You will need an Astra account: don't worry, we'll work through that in the following steps
  • As this is an intermediate-level workshop, we expect you to know what Java and Maven are

3️⃣ Do I need to pay for anything for this workshop?
No. All tools and services we provide here are FREE. FREE not only during the session but also after.

4️⃣ Will I get a certificate if I attend this workshop?
Attending the session is not enough. You need to complete the homework detailed below, and you will then get a nice badge that you can share on LinkedIn or anywhere else *(open api badge)*

Materials for the Session

Whether you join our workshop live or prefer to work at your own pace, we have you covered. In this repository, you'll find everything you need for this workshop.


LAB

1 - Create your DataStax Astra account

ℹ️ Account creation tutorial is available in awesome astra

Click the image below or go to https://astra.datastax.com


2 - Create an Astra Token

ℹ️ Token creation tutorial is available in awesome astra

  • Locate Settings (#1) in the menu on the left, then Token Management (#2)

  • Select the role Organization Administrator before clicking [Generate Token]

The Token is in fact three separate strings: a Client ID, a Client Secret and the token proper. You will need some of these strings to access the database, depending on the type of access you plan. Although the Client ID, strictly speaking, is not a secret, you should regard this whole object as a secret and make sure not to share it inadvertently (e.g. committing it to a Git repository) as it grants access to your databases.

{
  "ClientId": "ROkiiDZdvPOvHRSgoZtyAapp",
  "ClientSecret": "fakedfaked",
  "Token":"AstraCS:fake"
}
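
If you saved the generated JSON to a file, the `Token` string can be pulled out of it rather than copied by hand. The snippet below is a minimal sketch under that assumption: the helper name `extract_astra_token` and the temporary file are hypothetical, and the `sed` expression only handles the flat three-field layout shown above (a real JSON parser such as `jq` would be more robust).

```shell
# Hypothetical helper: print the "Token" field of an Astra token JSON file.
# Only handles the flat layout shown in this README.
extract_astra_token() {
  sed -n 's/.*"Token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# Demo with the fake values from this README, written to a temporary file.
token_file="$(mktemp)"
cat > "$token_file" <<'EOF'
{
  "ClientId": "ROkiiDZdvPOvHRSgoZtyAapp",
  "ClientSecret": "fakedfaked",
  "Token":"AstraCS:fake"
}
EOF

ASTRA_TOKEN_VALUE="$(extract_astra_token "$token_file")"
rm -f "$token_file"

# Print only the prefix, never the full secret.
echo "Token starts with: ${ASTRA_TOKEN_VALUE%%:*}"
```

Keeping the value in a shell variable, and printing only its prefix, avoids pasting the secret into files that could end up in Git.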

3 - Copy the token value to your clipboard

You can also leave the window open and copy the value a bit later.

4 - Open Gitpod

↗️ Right-click and select "Open in new tab"...

Open in Gitpod

5 - Set up the CLI with your token

In Gitpod, in a terminal window:

  • Login
astra login --token AstraCS:fake
  • Validate your setup
astra org

Output

gitpod /workspace/workshop-beam (main) $ astra org
+----------------+-----------------------------------------+
| Attribute      | Value                                   |
+----------------+-----------------------------------------+
| Name           | [email protected]             |
| id             | f9460f14-9879-4ebe-83f2-48d3f3dce13c    |
+----------------+-----------------------------------------+

6 - Create destination Database and a keyspace

ℹ️ Note that we enable the Vector Search capability (the --vector flag)

  • Create db workshop_beam and wait for the DB to become active
astra db create workshop_beam -k beam --vector --if-not-exists


  • List databases
astra db list


  • Describe your db
astra db describe workshop_beam


7 - Create Destination table

  • Create Table:
astra db cqlsh workshop_beam -k beam \
  -e  "CREATE TABLE IF NOT EXISTS fable(document_id TEXT PRIMARY KEY, title TEXT, document TEXT)"
  • Show Table:
astra db cqlsh workshop_beam -k beam -e "SELECT * FROM  fable"

8 - Setup env variables

  • Create .env file with variables
astra db create-dotenv workshop_beam 
  • Display the file
cat .env
  • Load env variables
set -a
source .env
set +a
env | grep ASTRA
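
Before moving on, it can help to fail fast if a variable used by the later commands is missing. The following is a minimal sketch, not part of the workshop tooling: `require_vars` is a hypothetical helper, and the three variable names are the ones referenced by the Maven commands further down in this README.

```shell
# Hypothetical helper: succeed only if every named variable is set and non-empty.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Check the variables that the import flow reads.
if require_vars ASTRA_DB_APPLICATION_TOKEN ASTRA_DB_SECURE_BUNDLE_PATH ASTRA_DB_KEYSPACE; then
  echo "Astra variables are loaded"
else
  echo "Load the .env file first (set -a; source .env; set +a)"
fi
```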

9 - Setup project

This command validates that Java, Maven and Lombok are working as expected.

mvn clean compile

10 - Run Importing flow

  • Open the CSV. It is very short and simple, for demo purposes (and to keep the OpenAI costs low later :) ).
/workspace/workshop-beam/samples-beam/src/main/resources/fables_of_fontaine.csv
  • Open the Java file with the code
gp open /workspace/workshop-beam/samples-beam/src/main/java/com/datastax/astra/beam/genai/GenAI_01_ImportData.java

  • Run the Flow
cd samples-beam
mvn clean compile exec:java \
 -Dexec.mainClass=com.datastax.astra.beam.genai.GenAI_01_ImportData \
 -Dexec.args="\
 --astraToken=${ASTRA_DB_APPLICATION_TOKEN} \
 --astraSecureConnectBundle=${ASTRA_DB_SECURE_BUNDLE_PATH} \
 --astraKeyspace=${ASTRA_DB_KEYSPACE} \
 --csvInput=`pwd`/src/main/resources/fables_of_fontaine.csv"

11 - Validate Data

astra db cqlsh workshop_beam -k beam -e "SELECT * FROM  fable"

WalkThrough

We will now compute the embeddings using the OpenAI API. It is not free: you need to provide a credit card to access the API. This part is a walkthrough. If you have an OpenAI key, follow along with me!

1 Run Flow Compute

  • Set up OpenAI
export OPENAI_API_KEY="<your_api_key>"
  • Open the Java file with the code
gp open /workspace/workshop-beam/samples-beam/src/main/java/com/datastax/astra/beam/genai/GenAI_02_CreateEmbeddings.java
  • Run the flow
mvn clean compile exec:java \
 -Dexec.mainClass=com.datastax.astra.beam.genai.GenAI_02_CreateEmbeddings \
 -Dexec.args="\
 --astraToken=${ASTRA_DB_APPLICATION_TOKEN} \
 --astraSecureConnectBundle=${ASTRA_DB_SECURE_BUNDLE_PATH} \
 --astraKeyspace=${ASTRA_DB_KEYSPACE} \
 --openAiKey=${OPENAI_API_KEY} \
 --table=fable"

2 Validate Output

astra db cqlsh workshop_beam -k beam -e "SELECT * FROM  fable"

3 Create Google Project

  • Create GCP Project

Note: If you don't plan to keep the resources that you create in this guide, create a new project instead of selecting an existing one. After you finish these steps, you can delete the project, which removes all resources associated with it.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project

4 Enable Billing

Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project

5 Save project ID:

The project identifier is available in the column ID. We will need it so let's save it as an environment variable

export GCP_PROJECT_ID=integrations-379317
export GCP_PROJECT_CODE=747469159044
export GCP_USER=<your_google_account_email>
export GCP_COMPUTE_ENGINE=747469159044-compute@developer.gserviceaccount.com
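
The last value follows a fixed pattern: the Compute Engine default service account is the numeric project number followed by `-compute@developer.gserviceaccount.com`. As a sketch, it can be derived instead of typed. The helper name `compute_engine_sa` is hypothetical, and the project number is the sample value from this README.

```shell
# Hypothetical helper: build the default Compute Engine service account
# email from a numeric project number.
compute_engine_sa() {
  case "$1" in
    ''|*[!0-9]*)
      echo "project number must be numeric: '$1'" >&2
      return 1
      ;;
  esac
  printf '%s-compute@developer.gserviceaccount.com\n' "$1"
}

# Sample project number from this README; replace it with your own.
GCP_PROJECT_CODE=747469159044
GCP_COMPUTE_ENGINE="$(compute_engine_sa "$GCP_PROJECT_CODE")"
echo "$GCP_COMPUTE_ENGINE"
```

Once the gcloud CLI is installed and authenticated (next steps), the project number itself can usually be read with `gcloud projects describe ${GCP_PROJECT_ID} --format="value(projectNumber)"`.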

6 Download and install gcloud CLI

curl https://sdk.cloud.google.com | bash

Do not forget to open a new terminal tab afterwards.

7 Authenticate with Google Cloud

Run the following command to authenticate with Google Cloud:

  • Execute:
gcloud auth login
  • Authenticate with your Google account

8 Set your project

If you haven't set your project yet, use the following command to set your project ID:

gcloud config set project ${GCP_PROJECT_ID}
gcloud projects describe ${GCP_PROJECT_ID}

9 Enable the required APIs

gcloud services enable dataflow compute_component \
   logging storage_component storage_api \
   bigquery pubsub datastore.googleapis.com \
   cloudresourcemanager.googleapis.com

10 Add Roles to dataflow users

To complete the steps, your user account and the Compute Engine default service account need a few IAM roles. The commands below use the gcloud CLI to grant the Service Account User role to your user account, and the Dataflow Admin, Dataflow Worker and Storage Object Admin roles to the Compute Engine default service account:

gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID} \
    --member="user:${GCP_USER}" \
    --role=roles/iam.serviceAccountUser
gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID}  \
    --member="serviceAccount:${GCP_COMPUTE_ENGINE}" \
    --role=roles/dataflow.admin
gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID}  \
    --member="serviceAccount:${GCP_COMPUTE_ENGINE}" \
    --role=roles/dataflow.worker
gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID}  \
    --member="serviceAccount:${GCP_COMPUTE_ENGINE}" \
    --role=roles/storage.objectAdmin
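
The three service-account bindings differ only by role, so they can be generated from a list. The sketch below only prints the commands so you can review them before running anything. `print_role_bindings` is a hypothetical helper, and the fallback project and account values are placeholders.

```shell
# Hypothetical helper: print (not run) one add-iam-policy-binding command per role.
# Arguments: project id, member, then any number of roles.
print_role_bindings() {
  project="$1"
  member="$2"
  shift 2
  for role in "$@"; do
    printf 'gcloud projects add-iam-policy-binding %s --member="%s" --role=%s\n' \
      "$project" "$member" "$role"
  done
}

# Falls back to placeholder values when the variables are not exported yet.
print_role_bindings "${GCP_PROJECT_ID:-my-project-id}" \
  "serviceAccount:${GCP_COMPUTE_ENGINE:-000000000000-compute@developer.gserviceaccount.com}" \
  roles/dataflow.admin roles/dataflow.worker roles/storage.objectAdmin
```

Printing first keeps the step reviewable: IAM changes apply to the whole project, so it is worth reading the generated commands before pasting them into the terminal.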

11 Create secrets

To connect to AstraDB you need a token (credentials) and a zip file (the secure connect bundle) used to secure the transport. Both inputs should be stored as secrets.

```
gcloud secrets create astra-token \
   --data-file <(echo -n "${ASTRA_TOKEN}") \
   --replication-policy="automatic"

gcloud secrets create cedrick-demo-scb \
   --data-file ${ASTRA_SCB_PATH} \
   --replication-policy="automatic"

gcloud secrets add-iam-policy-binding cedrick-demo-scb \
    --member="serviceAccount:${GCP_COMPUTE_ENGINE}" \
    --role='roles/secretmanager.secretAccessor'

gcloud secrets add-iam-policy-binding astra-token \
    --member="serviceAccount:${GCP_COMPUTE_ENGINE}" \
    --role='roles/secretmanager.secretAccessor'
    
gcloud secrets list
```

12 Make sure you are in the samples-dataflow folder

cd samples-dataflow
pwd

13 ✅ Make sure you have those variables initialized

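We assume the table fable exists and has been populated in the LAB section. The commands below also read ASTRA_KEYSPACE and ASTRA_TABLE, so export them as well if they are not already set (beam and fable in this workshop).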

export ASTRA_SECRET_TOKEN=projects/747469159044/secrets/astra-token/versions/2
export ASTRA_SECRET_SECURE_BUNDLE=projects/747469159044/secrets/secure-connect-bundle-demo/versions/1
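
Both values follow the Secret Manager resource-name pattern `projects/<project number>/secrets/<secret name>/versions/<version>`. As a sketch, they can be assembled from their parts so the project number and secret names are not retyped. `secret_version_path` is a hypothetical helper, and the `latest` default is the Secret Manager alias for the newest version (whether you prefer a pinned version number is up to you).

```shell
# Hypothetical helper: build a Secret Manager version resource name.
# Arguments: project number, secret name, optional version (defaults to "latest").
secret_version_path() {
  printf 'projects/%s/secrets/%s/versions/%s\n' "$1" "$2" "${3:-latest}"
}

# Sample values from this README; replace them with your own project number and secret names.
ASTRA_SECRET_TOKEN="$(secret_version_path 747469159044 astra-token 2)"
ASTRA_SECRET_SECURE_BUNDLE="$(secret_version_path 747469159044 secure-connect-bundle-demo 1)"
echo "$ASTRA_SECRET_TOKEN"
echo "$ASTRA_SECRET_SECURE_BUNDLE"
```

The secret names passed here need to match the ones you created with `gcloud secrets create`; `gcloud secrets list` shows what exists in the project.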

14 - ✅ Run the pipeline

mvn compile exec:java \
 -Dexec.mainClass=com.datastax.astra.dataflow.AstraDb_To_BigQuery_Dynamic \
 -Dexec.args="\
 --astraToken=${ASTRA_SECRET_TOKEN} \
 --astraSecureConnectBundle=${ASTRA_SECRET_SECURE_BUNDLE} \
 --keyspace=${ASTRA_KEYSPACE} \
 --table=fable \
 --runner=DataflowRunner \
 --project=${GCP_PROJECT_ID} \
 --region=us-central1"

15 - ✅ Show the Content of the Table

A dataset with the keyspace name and a table with the table name have been created in BigQuery.

bq head -n 10 ${ASTRA_KEYSPACE}.${ASTRA_TABLE}

The END
