Hello, I'm new to Flyte, and at the moment I'm trying to figure out whether Flyte is the right tool for my use-case. I would like to describe my use-case, and I hope to get enough information to make a decision. My environment is…
Use-Case
posted by @flashpixx in Slack
@flashpixx, firstly, thank you for considering Flyte and for such a detailed question. I will try to answer it by going through your subquestions and then diving into details for anything else.
This is perfectly valid and what Flyte is designed for. FlyteConsole uses the Flyte control-plane APIs to visualize an execution. Everything in Flyte is API-first and controlled by an API: all registrations are done through the API, all executions can be invoked from the API, and all visualization of execution status is driven by the API. These APIs are pretty stable, and we do not make backwards-incompatible changes, barring a major version release. Even when we do release…
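As a rough illustration, a thin client built directly on the control-plane API could look like the sketch below. It uses flytekit's FlyteRemote; the project and domain names are placeholders, and it assumes your Flyte config (endpoint, auth) is discoverable via the standard config file.

```python
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

# Connect to the Flyte control plane using standard config discovery
# (e.g. ~/.flyte/config.yaml or FLYTECTL_CONFIG).
remote = FlyteRemote(
    config=Config.auto(),
    default_project="flytesnacks",   # placeholder project
    default_domain="development",    # placeholder domain
)

# List recent executions and print their status - the same data FlyteConsole renders.
for execution in remote.recent_executions(limit=5):
    print(execution.id.name, execution.closure.phase)
```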
You should be able to build a UI on top of the API; in fact, some of our users, like Striveworks and Latch.bio, have already done that.
You can always customize the amount of CPU/memory required for a task.
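For example, a sketch of per-task resource requests and limits (the values are placeholders to tune for your cluster):

```python
from flytekit import Resources, task

# Request 2 CPUs / 4Gi for scheduling, cap at 4 CPUs / 16Gi.
@task(requests=Resources(cpu="2", mem="4Gi"), limits=Resources(cpu="4", mem="16Gi"))
def heavy_transform(n: int) -> int:
    # CPU/memory-intensive work goes here
    return n * 2
```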
Flyte has a type system that is well suited to unstructured datasets.
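For instance, unstructured data can flow between tasks as FlyteFile/FlyteDirectory values; a minimal sketch (the file-picking logic is just illustrative):

```python
import os

from flytekit import task
from flytekit.types.directory import FlyteDirectory
from flytekit.types.file import FlyteFile

# Flyte offloads the bytes to blob storage between tasks and hands the next
# task a reference that can be downloaded on demand.
@task
def pick_largest(raw: FlyteDirectory) -> FlyteFile:
    local_dir = raw.download()  # materialize the directory locally
    files = [os.path.join(local_dir, f) for f in os.listdir(local_dir)]
    return FlyteFile(max(files, key=os.path.getsize))
```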
Absolutely; we actually have no-code workflows created on top of Flyte. You can create a set of tasks that can be easily stitched together into a workflow. The only thing your client needs to do is…
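One way to do that stitching programmatically is flytekit's imperative workflow API. A minimal sketch (t1 stands in for any task your client wants to wire up):

```python
from flytekit import Workflow, task

@task
def t1(a: str) -> str:
    return a.upper()

# Build the workflow programmatically instead of with the @workflow decorator,
# which is how a UI or API client can stitch tasks together on the fly.
wf = Workflow(name="my.imperative.workflow")
wf.add_workflow_input("in1", str)
node = wf.add_entity(t1, a=wf.inputs["in1"])
wf.add_workflow_output("out1", node.outputs["o0"])
```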
Once registered, a workflow is immutable; you can change it, but the change is only registered as a new version. A workflow+version combination is what gets launched, and we perform type checking etc. at the API entry point.
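So a client launches one specific, immutable version, roughly like this (a sketch; the names, version, and inputs are placeholders):

```python
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    config=Config.auto(),
    default_project="flytesnacks",   # placeholder project
    default_domain="development",    # placeholder domain
)

# Fetch an immutable workflow+version combination and launch it;
# the inputs are type-checked against the registered interface.
wf = remote.fetch_workflow(name="my.imperative.workflow", version="v1")
execution = remote.execute(wf, inputs={"in1": "hello"})
```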
This is what most tasks in Flyte do. Depending on the size, you can run it as a single-node job; depending on the size of your cluster and the maximum machine configuration, you can have one pod run with almost 1 TB of RAM and 60+ cores (check the large machine sizes on AWS/GCP). Flytekit (the Python SDK) offers a simplified API for data handling, even for Schema types (backed by Parquet).
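For example, a single large task returning a Schema-typed table might look like the sketch below (the resource numbers are placeholders; it assumes pandas is available in the task image):

```python
import pandas as pd
from flytekit import Resources, task
from flytekit.types.schema import FlyteSchema

# A single-node task with a large memory request; the returned DataFrame is
# serialized to Parquet and offloaded to blob storage as a Schema type.
@task(requests=Resources(cpu="16", mem="128Gi"))
def build_table(rows: int) -> FlyteSchema:
    return pd.DataFrame({"id": range(rows), "value": [r * 0.5 for r in range(rows)]})
```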
This is what Flyte is designed for. Every execution is completely isolated; in fact, every task can use a different version of Spark and a different version of Sedona, because executions are containerized. If you use Spark on Kubernetes, this is even more seamless and Flyte will manage everything for you. These are called ephemeral Spark clusters, and the version of Spark is actually defined by the user, since the user's container is converted into a Spark runner.
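A sketch of an ephemeral Spark cluster per task execution via the Spark plugin (assumes flytekitplugins-spark is installed and the Spark backend plugin is enabled; the conf values are placeholders):

```python
import flytekit
from flytekit import task
from flytekitplugins.spark import Spark

# Each execution of this task gets its own ephemeral Spark cluster on Kubernetes,
# running whatever Spark version is baked into this task's container image.
@task(
    task_config=Spark(
        spark_conf={
            "spark.executor.instances": "4",
            "spark.executor.memory": "4g",
            "spark.driver.memory": "2g",
        }
    )
)
def count_rows(path: str) -> int:
    sess = flytekit.current_context().spark_session
    return sess.read.parquet(path).count()
```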
By default, Flyte starts in multi-namespace mode, where each project-domain combination gets its own Kubernetes namespace.
Absolutely
Raw container tasks manage the data marshalling etc. for you, with no need to install Python, Java, etc. in your container.
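A sketch of a raw container task (the image and command are illustrative placeholders; Flyte exchanges inputs/outputs through the mounted data directories):

```python
from flytekit import ContainerTask, kwtypes

# The container only needs the script it runs - no Python/Java/Flyte SDK inside.
ellipse_area = ContainerTask(
    name="ellipse-area",
    image="ghcr.io/flyteorg/rawcontainers-shell:v2",  # illustrative image
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(a=float, b=float),
    outputs=kwtypes(area=float),
    command=["./calculate-ellipse-area.sh", "/var/inputs", "/var/outputs"],
)
```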