Notes from Data Prep Kit workshop at IBM techXchange conference in Las Vegas (Oct 21, 2024)
This was an in-person workshop that I ran at the IBM techXchange conference in Las Vegas (Oct 21, 2024).
The workshop was in 2 parts (2 hrs total):
Part 1 - DPK Intro - showcasing core features of DPK
Part 2 - RAG application.
I used the newly released IBM-GRANITE-3.0 model for RAG. It worked really well.
The intro notebooks (part 1) run on Google Colab. A lot of attendees ran these notebooks on Colab along with me.
This was great, as it gave them a pretty good idea of what DPK can be used for, without having to set up their laptops.
The RAG application is designed to run in a local Python env. Some attendees started setting up their local Python env during the workshop, but pip install over conference wifi was slow (to be expected).
Afterwards, a few came up to me and chatted about their use cases.
Notes:
There is a good amount of interest in what DPK can do for them.
We need to keep getting the word out.
We need to make as many example notebooks as possible Colab-friendly (ready to run on Colab with a single click). I have been doing this and will continue to advocate for it.
One of them tried using their own PDFs and remarked that the pdf2parquet conversion seems to be slow. This is a known issue, and we may want to prioritize investigating it.
[Bug] improve performance of pdf2parquet #573
There was also interest in extracting tables from PDFs and processing OCR forms.
I think docling can do these already; we need to create some tutorials (on my radar).
There was interest in the PII remover. We need to have a good tutorial for it.
Native Windows support also came up. I think we are pretty close to achieving this now.
[Bug] unable to install release 0.2.1 on windows (native) #644
There was a question about integration with InstructLab. This is something worth highlighting (tutorial, workshop, etc.).
There was interest in processing HTML and Excel spreadsheets.
We can process HTML now, but the Excel request was interesting.
There was a question about how we can keep external metadata about documents, and how we can use it for vector search / RAG.