ML Data-Pipeline Ingestion - Optimisation for scaling #32

Shakthieshwari · 2022-09-22T15:06:01Z

Shakthieshwari
Sep 22, 2022

Hello Team,

We are planning a few ML data-pipeline optimisations for scaling.
Jira ticket link :- https://project-sunbird.atlassian.net/browse/OB-57

Problem Statement :- Avoid deleting the projects Druid datasource. The Program Dashboard CSV uses this datasource.

Reason for Deletion of Datasource :- The status of a project changes over time and Druid does not support updating a record in place, so every day we delete the entire datasource and re-ingest all of the data into Druid to pick up the latest status of each submission.

Concern :- The daily full re-ingestion means handling a huge volume of data.

Approach (Solution) :- We have detailed the design in this Confluence doc: https://project-sunbird.atlassian.net/l/cp/P7nq918u

Similar to OnDemandDruidExhaustJob, we need to create the OnDemandCassandraExhaustJob Data Product.

@SanthoshVasabhaktula @sowmya-dixit @anandp504 Please share your approval and suggestions so that we can go ahead with this.

Cc- @aishwaryashikshalokam @Ashwiniev95 @Prateek-slokam @aks30 @kiranharidas187 @vijiurs @snehangsude

Please review this at the earliest, as it is a top priority for the program launch.

Thanks

SanthoshVasabhaktula · 2022-09-23T06:55:48Z

SanthoshVasabhaktula
Sep 23, 2022
Maintainer

@Shakthieshwari

Druid is never supposed to be used as the source of truth. There are two approaches here:

1. If the source database is either Mongo or Cassandra and the data is stored in Druid only as a snapshot for ease of querying, then apply mutations to Druid instead of doing a full restore (Druid does support updates). We can design a mutation framework to apply mutations to datasources in Druid. Do note that this is not a one-release activity.
2. Alternatively, write custom data products that query the source database itself. In the above case, what is the source database?
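
As a rough illustration of approach 1, here is a minimal sketch of what "applying mutations" to a snapshot means, as opposed to dropping and rebuilding it. It is plain Python over in-memory records and does not call any Druid API; the `ProjectRecord` fields and the `apply_mutations` helper are hypothetical names chosen for the example, not part of any Sunbird code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ProjectRecord:
    project_id: str   # hypothetical key field
    status: str
    updated_at: int   # time of the last change in the source database


def apply_mutations(snapshot: Dict[str, ProjectRecord],
                    mutations: Iterable[ProjectRecord]) -> Dict[str, ProjectRecord]:
    """Return a new snapshot in which each project keeps its most recent record."""
    merged = dict(snapshot)
    for record in mutations:
        current = merged.get(record.project_id)
        # Only overwrite when the incoming record is newer than the one held.
        if current is None or record.updated_at > current.updated_at:
            merged[record.project_id] = record
    return merged


snapshot = {
    "p1": ProjectRecord("p1", "started", 100),
    "p2": ProjectRecord("p2", "started", 100),
}
mutations = [
    ProjectRecord("p1", "submitted", 200),   # status change for an existing project
    ProjectRecord("p3", "started", 150),     # project not yet in the snapshot
    ProjectRecord("p2", "inProgress", 50),   # older than the snapshot, so ignored
]
updated = apply_mutations(snapshot, mutations)
```

Only the records that changed are touched, so the work scales with the number of changes per day rather than with the total size of the datasource.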

In addition, don't generalize an exhaust job on Cassandra. Cassandra query patterns differ from table to table and need to be fine-tuned for the specific tables involved, as in ProgressExhaust or ResponseExhaust. You can just create a ProjectExhaust data product similar to ProgressExhaust.
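
To make the shape of a table-specific exhaust concrete, here is a minimal sketch, assuming the job receives rows already read from the projects table and produces one CSV per program. The column names, the `project_exhaust` function, and the sample rows are all hypothetical; the real ProgressExhaust job is a Scala data product and is not reproduced here. In a real job the program filter would presumably be pushed down into the Cassandra query rather than applied in memory.

```python
import csv
import io
from typing import Dict, Iterable, List

# Columns the report exposes; hypothetical, chosen for the example.
REPORT_COLUMNS = ["program_id", "project_id", "status"]


def project_exhaust(rows: Iterable[Dict[str, str]], program_id: str) -> str:
    """Build the CSV text for one program from rows of the source table."""
    selected: List[Dict[str, str]] = [r for r in rows if r["program_id"] == program_id]
    # Sort so repeated runs over the same data produce identical files.
    selected.sort(key=lambda r: r["project_id"])
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=REPORT_COLUMNS,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(selected)
    return buffer.getvalue()


rows = [
    {"program_id": "prog-1", "project_id": "p2", "status": "started", "owner": "u9"},
    {"program_id": "prog-2", "project_id": "p7", "status": "submitted", "owner": "u3"},
    {"program_id": "prog-1", "project_id": "p1", "status": "submitted", "owner": "u4"},
]
report = project_exhaust(rows, "prog-1")
```

Because the report is built from the source rows on demand, the current status of each project is read at request time and no Druid datasource has to be deleted and re-ingested.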

3 replies

Shakthieshwari Sep 23, 2022
Author

Thanks @SanthoshVasabhaktula .

Yes, we agree. In the approach we suggested, we are not using Druid at all.

The source database will be Cassandra.

Sure, we will create a custom data product for our resource. Can you please confirm whether https://github.com/Sunbird-Lern/data-products/blob/release-5.0.0/lern-data-products/src/main/scala/org/sunbird/lms/exhaust/collection/ProgressExhaustJobV2.scala is the job we should refer to, and whether we should push our new data product to this repo?

Thanks

SanthoshVasabhaktula Sep 23, 2022
Maintainer

Yes, that is the right job, but not the right building block. Which building block do projects and observations fall into? Please create the job in that specific building block. If you are unsure, please discuss with the Sunbird PMs to identify the right building block.

Shakthieshwari Sep 23, 2022
Author

Thanks @SanthoshVasabhaktula

@sowmya-dixit @anandp504 @reshmi-nair @kameshbhr Please help us identify the building block in which we should create our custom data product. If required, we can connect for 30 minutes.

Thanks

rhwarrier · 2022-09-26T06:16:17Z

rhwarrier
Sep 26, 2022
Maintainer

@Shakthieshwari Which block do Projects and Observations sit under currently? Is it SB Ed?

6 replies

rhwarrier Sep 27, 2022
Maintainer

Obsrv is primarily about the underlying infra - the data pipeline, denorm jobs, and the reporting layer. The only data product inclusion here is the summariser, given its generic nature.
All data products that are written for specific initiatives should stay with the instance code - for example, data products being used by Diksha are being moved out to Diksha for the implementation team to manage. It is meant for a specific use case, and is not part of the building block.
@SanthoshVasabhaktula @anandp504 please add.

Shakthieshwari Sep 27, 2022
Author

@rhwarrier @sowmya-dixit @anandp504 @reshmi-nair Can you please let me know where I should create our custom data product for ML projects?

Until now we have worked only on the generic data product in the sunbird-obsrv building block. If https://github.com/Sunbird-Lern/data-products/blob/release-5.0.0/lern-data-products/src/main/scala/org/sunbird/lms/exhaust/collection/ProgressExhaustJobV2.scala is not the right place, we are unsure which repo to use for the custom data product.

Thanks

SanthoshVasabhaktula Sep 27, 2022
Maintainer

@Shakthieshwari - Please create the data product in whichever BB projects and observations are located. If it is Sunbird-Ed, then create the data product in Sunbird-Ed. My assumption was that the Manage Learn services would already have been separated into an independent BB. If not, can you please work with Vijayashree and create a BB for it?

@rhwarrier @alok-os

alok-os Sep 27, 2022

A couple of points:

1. There is no such thing as ML services in Sunbird. ML (Manage Learn) is a construct in the context of use cases that an adopter might enable using Sunbird BBs. @Shakthieshwari - can you please add Vijayshree to this thread? I am not able to find her username.
2. @Shakthieshwari - please request Vijayshree and Khushboo to initiate a call to discuss and finalize what the "new components" are that SL has been contributing and which BB these components should be in.

Shakthieshwari Sep 27, 2022
Author

@vijiurs Can you please help here?

Thanks

vijiurs · 2022-09-27T06:36:38Z

vijiurs
Sep 27, 2022

@alok Gupta, sure, I will schedule a call in the next couple of days to discuss the SL capabilities and their alignment to the Sunbird BBs.

Regards
Vijayashree


0 replies
