Add borders to slide images for consistency
Add borders to slide images for consistency
ljvmiranda921 committed Jul 7, 2024
1 parent 3416f15 commit c789e7c
Showing 1 changed file with 57 additions and 43 deletions: notebook/_posts/2024-02-21-talk-unc-charlotte.md
There are several efforts to automate the fact-checking process.
A common approach is to treat it as an NLP pipeline composed of different tasks ([Guo et al., 2022](https://aclanthology.org/2022.tacl-1.11/)).
Today, we will only focus on **claim detection**, the first step in an automated fact-checking pipeline.

<div style="text-align: center;">
<img src="/assets/png/talk-unc-charlotte/slide05.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
<img src="/assets/png/talk-unc-charlotte/slide06.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
<img src="/assets/png/talk-unc-charlotte/slide07.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
<img src="/assets/png/talk-unc-charlotte/slide08.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
</div>

Detecting claims is usually a dual problem: you'd also want to find the premises that support it.
Together, the claim and its premises make up an **argument.**
Applying NLP to this domain is often called **argument mining.**
For this talk, I want to introduce two argument mining sub-tasks: (1) first, we want to highlight the claim and premise given a text (*claim & premise extraction*), and then, (2) we want to determine if a text supports, opposes, or is neutral to a certain topic (*stance detection*).

![](/assets/png/talk-unc-charlotte/slide11.jpg){:width="360px"}
![](/assets/png/talk-unc-charlotte/slide13.jpg){:width="360px"}
{: style="text-align: center;"}
<div style="text-align: center;">
<img src="/assets/png/talk-unc-charlotte/slide11.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
<img src="/assets/png/talk-unc-charlotte/slide13.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
</div>

So, our general approach is to reframe these two sub-tasks as NLP tasks.
First, we treat claim & premise extraction as a span labeling problem.
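To make this concrete, here is a minimal sketch of what that reframing could look like, assuming spaCy as the modeling library (an assumption on my part; the talk doesn't prescribe a specific framework): claim & premise extraction becomes a span categorizer, and stance detection becomes a text categorizer.

```python
import spacy

# A sketch only: the framework choice (spaCy) and the label names are assumptions.
nlp = spacy.blank("en")

# (1) Claim & premise extraction as span labeling.
spancat = nlp.add_pipe("spancat")
spancat.add_label("CLAIM")
spancat.add_label("PREMISE")

# (2) Stance detection as text categorization over three classes.
textcat = nlp.add_pipe("textcat")
for label in ("SUPPORT", "OPPOSE", "NEUTRAL"):
    textcat.add_label(label)

# Both components still need annotated data to be trained, which is exactly
# the bottleneck the rest of this talk is about.
```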
Notice how we've decomposed this general problem of disinformation into tractable sub-tasks.
And it is an important muscle to train.
In computer science, we often learn about the *divide and conquer* strategy, and this is a good application of that approach to a fuzzier and, admittedly, more complex problem.

<div style="text-align: center;">
<img src="/assets/png/talk-unc-charlotte/slide15.jpg" style="border: 1px solid black; padding: 10px; width: 700px">
</div>

As we already know, training NLP models such as a span or text categorizer requires a lot of data.
I want to talk about different methods of collecting this dataset and emphasize how LLMs can fit into this workflow.
And then, on the right, we have more automated methods that rely heavily on a reference model.
LLMs, as advanced as they are, still fall in between.
They're not fully manual but also not fully automated because writing a prompt still requires tuning and domain expertise.

<div style="text-align: center;">
<img src="/assets/png/talk-unc-charlotte/traditional.png" style="border: 1px solid black; padding: 2px; width: 360px">
<img src="/assets/png/talk-unc-charlotte/slide18.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
</div>

But why are we still interested in LLMs?
It's because LLMs provide something that most semi-automated methods can't: a model pretrained on web-scale data, and a highly flexible zero-shot capability.
Let me put this in a Venn diagram&mdash; and for each space in this diagram, I'll talk about how LLMs can specifically help in our annotation workflows.

<div style="text-align: center;">
<img src="/assets/png/talk-unc-charlotte/slide21.jpg" style="border: 1px solid black; padding: 10px; width: 700px">
</div>

### Bootstrapping in a human-in-the-loop workflow

Here, an LLM is a drop-in replacement for a base model that you'd usually train.
LLMs differ because they were pretrained on web-scale data, giving them enough capacity even for your domain-specific task.
So, how good is an LLM annotator?

<div style="text-align: center;">
<img src="/assets/png/talk-unc-charlotte/slide24.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
<img src="/assets/png/talk-unc-charlotte/slide25.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
</div>

To test this question, I worked on a portion of the UKP Sentential Argument Mining corpus ([Stab et al., 2018](https://aclanthology.org/D18-1402/)).
It contains several statements across various topics, and the task is to determine whether the statement supports, opposes, or is neutral to the topic&mdash; a text categorization problem.
My findings show that LLMs, when prompted in a zero-shot manner, are competitive.
I also found myself annotating faster (and more accurately) when correcting LLM annotations than when annotating from scratch.
The latter finding is important because correcting annotations induces less cognitive load and human effort ([Li et al., 2023](https://aclanthology.org/2023.emnlp-main.92/), [Zhang et al., 2023](https://arxiv.org/abs/2311.04345)).

<div style="text-align: center;">
<img src="/assets/png/talk-unc-charlotte/slide27.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
<img src="/assets/png/talk-unc-charlotte/slide28.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
</div>
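For reference, here is roughly what this zero-shot setup looks like in practice. The prompt wording, label names, and parsing logic below are my own sketch rather than the exact prompt from the experiment:

```python
# A rough sketch of a zero-shot stance prompt; wording and parsing are assumptions.
PROMPT = """You are annotating statements for an argument mining dataset.
Topic: {topic}
Statement: {statement}

Does the statement support, oppose, or stay neutral towards the topic?
Answer with exactly one word: SUPPORT, OPPOSE, or NEUTRAL."""


def parse_stance(llm_output: str) -> str:
    """Map the raw LLM completion onto one of the three labels."""
    text = llm_output.strip().upper()
    for label in ("SUPPORT", "OPPOSE", "NEUTRAL"):
        if label in text:
            return label
    return "NEUTRAL"  # fallback when the completion doesn't contain a label


prompt = PROMPT.format(
    topic="nuclear energy",
    statement="Nuclear plants emit far less CO2 than coal plants.",
)
# `prompt` would then be sent to whichever LLM is doing the pre-annotation,
# and `parse_stance` turns its reply into a label we can store.
```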

So, if LLMs can already provide competitive annotations, is our problem solved?
We don't have to annotate anymore?
LLMs have zero-shot capabilities, i.e., we can always frame structured prediction tasks as question-answer pairs.
Back then, you'd need to train separate supervised models to achieve multi-task skills.
I want to use an LLM's flexibility to enhance the annotation experience.

<div style="text-align: center;">
<img src="/assets/png/talk-unc-charlotte/slide30.jpg" style="border: 1px solid black; padding: 10px; width: 700px">
</div>

This time, I want to introduce two workflows.
The first one is still a text categorization problem, but I want to ask an LLM to pre-highlight the claims and premises so I can reference them during annotation.
For the second one, I want to ask the LLM to do the reasoning for me.
I'll let it identify the claims and premises, then pre-annotate an answer, and then give me a reason for choosing that answer.
This exercise aims to explore creative ways we can harness LLMs.

<div style="text-align: center;">
<img src="/assets/png/talk-unc-charlotte/slide31.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
<img src="/assets/png/talk-unc-charlotte/slide34.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
</div>

The process is similar to the first section, but I prompt for auxiliary information instead of prompting for the direct labels.
LLMs make this possible because we can formulate each task as a question-answer pair.
The prompt on the left is a straightforward span labeling prompt, where we ask the LLM to highlight the claims and premises in the text.
On the other hand, the prompt on the right is a chain-of-thought prompt ([Wei et al., 2023](https://arxiv.org/abs/2201.11903)).
Here, we induce an LLM to perform a series of reasoning tasks to arrive at a final answer.


<div style="text-align: center;">
<img src="/assets/png/talk-unc-charlotte/slide32.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
<img src="/assets/png/talk-unc-charlotte/slide35.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
<img src="/assets/png/talk-unc-charlotte/slide33.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
<img src="/assets/png/talk-unc-charlotte/slide37.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
</div>
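To illustrate the difference, here is a sketch of what the chain-of-thought variant could look like. This is an illustrative prompt written for this post, not the exact one shown on the slides:

```python
# An illustrative chain-of-thought prompt: the LLM is asked to extract the
# argument structure and explain itself before committing to a label.
COT_PROMPT = """Topic: {topic}
Statement: {statement}

Work through the following steps:
1. Quote the main claim made in the statement.
2. Quote the premises that support the claim, if any.
3. Decide whether the statement supports, opposes, or is neutral
   towards the topic.
4. Give a one-sentence reason for your decision.

Return a JSON object with the keys "claim", "premises", "stance", and "reason"."""
```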

The good thing about [Prodigy](https://prodigy.ai) is that you can easily incorporate this extra information in your annotation UI.
On the bottom left, you'll see that it highlights the claims and premises for each statement, allowing you to focus on the relevant details when labeling.
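As a sketch of how that wiring could work (the file name and the shape of the LLM output below are hypothetical), the LLM's suggestions can be written out as a JSONL stream where each task carries the text, the pre-highlighted spans, and the reasoning as metadata:

```python
import json

# Hypothetical LLM output for one statement: the claim and premise texts plus
# a suggested stance and reason. The exact shape is an assumption.
text = (
    "Nuclear plants emit far less CO2 than coal plants, "
    "so we should build more of them."
)
claim = "we should build more of them"
premise = "Nuclear plants emit far less CO2 than coal plants"

task = {
    "text": text,
    "spans": [
        {"start": text.index(claim), "end": text.index(claim) + len(claim), "label": "CLAIM"},
        {"start": text.index(premise), "end": text.index(premise) + len(premise), "label": "PREMISE"},
    ],
    # Prodigy renders "meta" fields on the annotation card, which is one way
    # to surface the LLM's suggested label and its reasoning.
    "meta": {"suggested_stance": "SUPPORT", "reason": "The emissions premise backs a pro-nuclear claim."},
}

with open("stance_stream.jsonl", "w") as f:
    f.write(json.dumps(task) + "\n")

# The stream could then be loaded with a recipe along the lines of:
#   prodigy textcat.manual stance_data stance_stream.jsonl --label SUPPORT,OPPOSE,NEUTRAL
```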
In most annotation projects, researchers write an **annotation guideline** to set the standards for how the data should be labeled.
These guidelines aim to reduce uncertainty about the phenomenon we are annotating.
We can even think of these as prompts for humans!

<div style="text-align: center;">
<img src="/assets/png/talk-unc-charlotte/slide39.jpg" style="border: 1px solid black; padding: 10px; width: 700px">
</div>

This time, I want to focus on a simple task: determine whether a statement is an argument.
It sounds easy because it's "just" a binary classification task.
However, after looking through various argument mining papers and their annotation guidelines, I realized that each has its own definition of what makes an argument!

<div style="text-align: center;">
<img src="/assets/png/talk-unc-charlotte/slide40.jpg" style="border: 1px solid black; padding: 10px; width: 700px">
</div>

So, this got me thinking: what if we include the annotation guideline in the prompt?
You can check my entire experiment in this [blog post](/notebook/2023/03/25/langchain-annotation/).
Back then, you could not fit a whole document into an LLM's limited context length, so I used a continuous prompting strategy that showed chunks of the document and let the LLM update its answer based on new information.
LangChain calls this a ["refine chain"](https://js.langchain.com/docs/modules/chains/document/refine) in its docs.
As an aside, I've opted to use [minichain](https://github.com/srush/MiniChain) in my recent projects as it is more lightweight and sufficient for my needs.

<div style="text-align: center;">
<img src="/assets/png/talk-unc-charlotte/slide41.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
<img src="/assets/png/talk-unc-charlotte/slide42.jpg" style="border: 1px solid black; padding: 2px; width: 360px">
</div>
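For context, here is a schematic of that refine strategy, written against a generic `complete(prompt) -> str` function rather than LangChain's or minichain's actual APIs (those details are glossed over here):

```python
# A schematic refine loop: feed the guideline chunk by chunk and let the LLM
# revise its answer as it sees more of the document. `complete` is assumed to
# be any function that sends a prompt to an LLM and returns its reply.
def refine_answer(guideline_chunks: list[str], statement: str, complete) -> str:
    answer = "No answer yet."
    for chunk in guideline_chunks:
        prompt = (
            "You are deciding whether a statement is an argument.\n"
            f"Statement: {statement}\n\n"
            f"Excerpt from the annotation guideline:\n{chunk}\n\n"
            f"Your current answer: {answer}\n"
            "Update your answer (ARGUMENT or NOT_ARGUMENT) if this excerpt "
            "changes your judgment; otherwise keep it."
        )
        answer = complete(prompt)
    return answer
```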

Surprisingly, including the annotation guideline in the prompt led to worse results.
I couldn't dig deeper into why, but I hypothesize that prompts for LLMs have a particular "dialect" that differs vastly from how we talk as humans.
There are many confounding factors, of course.
Maybe the refine strategy is not the best, or maybe I should've processed the text much better.
An LLM's prompt sensitivity is still an open problem.

<div style="text-align: center;">
<img src="/assets/png/talk-unc-charlotte/slide44.jpg" style="border: 1px solid black; padding: 10px; width: 700px">
</div>

But I learned one thing: we can use LLMs as a "first pass" when iterating over our annotation guidelines.
Typically, you'd start with a pilot annotation with a small group of annotators as you write the guidelines, but there's an opportunity to incorporate LLMs into the mix.
Here, you already have a function in mind and need to collect enough data to train a model.
On the other hand, **descriptive** annotation aims to capture the whole diversity of human judgment.
You'd usually find this in subjective tasks like hate speech detection or human preference collection.

<div style="text-align: center;">
<img src="/assets/png/talk-unc-charlotte/slide48.jpg" style="border: 1px solid black; padding: 10px; width: 700px">
</div>

LLMs are pretty good at prescriptive annotation tasks.
There is empirical evidence that supports this ([Ashok et al., 2023](https://arxiv.org/pdf/2305.15444.pdf); [Chen et al., 2023](https://arxiv.org/pdf/2311.08723.pdf); [Sun et al., 2023](https://arxiv.org/pdf/2305.08377.pdf)), and using LLMs this way lets us tap into the web-scale data they were pretrained on.
