fixing a typo ! #293

Merged: 1 commit, May 2, 2024
4 changes: 2 additions & 2 deletions chapters/en/unit4/multimodal-models/tasks-models-part1.mdx
@@ -5,7 +5,7 @@ In this section, we will briefly look at the different multimodal tasks involving
## Examples of Tasks
Before looking into specific models, it's crucial to understand the diverse range of tasks involving image and text. These tasks include but are not limited to:

- - **Visual Question Anwering (VQA) and Visual Reasoning:** Imagine a machine that looks at a picture and understands your questions about it. Visual Question Answering (VQA) is just that! It trains computers to extract meaning from images and answer questions like "Who's driving the car?" while Visual Reasoning is the secret sauce, enabling the machine to go beyond simple recognition and infer relationships, compare objects, and understand scene context to give accurate answers. It's like asking a detective to read the clues in a picture, only much faster and better!
+ - **Visual Question Answering (VQA) and Visual Reasoning:** Imagine a machine that looks at a picture and understands your questions about it. Visual Question Answering (VQA) is just that! It trains computers to extract meaning from images and answer questions like "Who's driving the car?" while Visual Reasoning is the secret sauce, enabling the machine to go beyond simple recognition and infer relationships, compare objects, and understand scene context to give accurate answers. It's like asking a detective to read the clues in a picture, only much faster and better!

- **Document Visual Question Answering (DocVQA):** Imagine a computer understanding both the text and layout of a document, like a map or contract, and then answering questions about it directly from the image. That's Document Visual Question Answering (DocVQA) in a nutshell. It combines computer vision for processing image elements and natural language processing to interpret text, allowing machines to "read" and answer questions about documents just like humans do. Think of it as supercharging document search with AI to unlock all the information trapped within those images.
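Both tasks quoted in the diff context above are exposed as ready-made pipelines in 🤗 Transformers. The snippet below is a minimal sketch, not part of the diffed file: it assumes `transformers` is installed (plus `pytesseract` for the OCR step of the document pipeline), that `photo.jpg` and `invoice.png` are placeholder local files, and that the named checkpoints are the commonly used public ones.

```python
from transformers import pipeline

# Visual Question Answering: answer a free-form question about an image.
# Checkpoint is an assumption; any VQA model on the Hub can be substituted.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
print(vqa(image="photo.jpg", question="Who's driving the car?"))

# Document VQA: the pipeline OCRs the page (requires pytesseract) and combines
# the extracted text with layout information to answer questions about it.
docvqa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
print(docvqa(image="invoice.png", question="What is the total amount?"))
```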

@@ -392,4 +392,4 @@ Congratulations! you made it till the end. Now on to the next section for more o
## References
1. [Vision-Language Pre-training: Basics, Recent Advances, and Future Trends](https://arxiv.org/abs/2210.09263)<a id="pretraining-paper"></a>
2. [Document Collection Visual Question Answering](https://arxiv.org/abs/2104.14336)<a id="doc-vqa-paper"></a>
- 3. [Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection](https://arxiv.org/abs/2303.05499)<a id="grounding-dino"></a>
+ 3. [Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection](https://arxiv.org/abs/2303.05499)<a id="grounding-dino"></a>