Benchmarking Legal Knowledge of Large Language Models

🤗 Hugging Face • ⏬ Data

📖 Chinese | English

Large language models (LLMs) have demonstrated strong capabilities in many areas. However, when applying them to the highly specialized, safety-critical legal domain, it is unclear how much legal knowledge they possess and whether they can reliably perform law-related tasks. To address this gap, we propose LawBench, a comprehensive evaluation benchmark.

Tasks in LawBench are based on the Chinese legal system. A similar benchmark based on the American legal system is available here.

✨ Introduction

LawBench has been meticulously crafted to provide a precise assessment of LLMs' legal capabilities. In designing the test tasks, we simulated three dimensions of judicial cognition and selected 20 tasks to evaluate the abilities of large models. Compared to existing benchmarks that contain only multiple-choice questions, we include more diverse task types closely related to real-world applications, such as legal entity recognition, reading comprehension, criminal damages calculation and consultation. We recognize that the safety policies of current large models may lead them to decline to answer certain legal queries, and that some models may fail to understand the instructions and produce no usable response. We therefore introduce a separate evaluation metric, the "abstention rate", which measures how often a model refuses to provide an answer or fails to follow the instructions. We report the performance of 51 large language models on LawBench, including 20 multilingual LLMs, 22 Chinese-oriented LLMs and 9 legal-specific LLMs.
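As a rough sketch of how this metric can be computed (illustrative only, not the repository's exact implementation), the abstention rate is simply the fraction of examples whose model output yields no parseable answer:

def abstention_rate(extracted_answers):
    # `extracted_answers` holds one entry per example: the parsed answer,
    # or None when the model refused or no answer could be extracted.
    # Illustrative sketch only, not LawBench's exact implementation.
    if not extracted_answers:
        return 0.0
    missing = sum(1 for answer in extracted_answers if answer is None)
    return missing / len(extracted_answers)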

📖 Dataset

Our dataset includes 20 diverse tasks covering 3 cognitive levels:

  • Legal knowledge memorization: whether large language models can memorize the necessary legal concepts, terminology, articles and facts in their parameters.
  • Legal knowledge understanding: whether large language models can comprehend entities, events and relationships within legal texts, so as to grasp the meaning and connotations of legal text.
  • Legal knowledge applying: whether large language models can properly utilize their legal knowledge and reason over it to solve realistic legal tasks in downstream applications.

Task List

The tasks included in LawBench are listed below. Every task has 500 examples.

| Cognitive Level | ID | Task | Data Source | Metric | Type |
| --- | --- | --- | --- | --- | --- |
| Legal Knowledge Memorization | 1-1 | Article Recitation | FLK | ROUGE-L | Generation |
| | 1-2 | Knowledge Question Answering | JEC_QA | Accuracy | SLC |
| Legal Knowledge Understanding | 2-1 | Document Proofreading | CAIL2022 | F0.5 | Generation |
| | 2-2 | Dispute Focus Identification | LAIC2021 | F1 | MLC |
| | 2-3 | Marital Disputes Identification | AIStudio | F1 | MLC |
| | 2-4 | Issue Topic Identification | CrimeKgAssitant | Accuracy | SLC |
| | 2-5 | Reading Comprehension | CAIL2019 | rc-F1 | Extraction |
| | 2-6 | Named Entity Recognition | CAIL2021 | soft-F1 | Extraction |
| | 2-7 | Opinion Summarization | CAIL2022 | ROUGE-L | Generation |
| | 2-8 | Argument Mining | CAIL2022 | Accuracy | SLC |
| | 2-9 | Event Detection | LEVEN | F1 | MLC |
| | 2-10 | Trigger Word Extraction | LEVEN | soft-F1 | Extraction |
| Legal Knowledge Applying | 3-1 | Fact-based Article Prediction | CAIL2018 | F1 | MLC |
| | 3-2 | Scene-based Article Prediction | LawGPT_zh Project | ROUGE-L | Generation |
| | 3-3 | Charge Prediction | CAIL2018 | F1 | MLC |
| | 3-4 | Prison Term Prediction w/o Article | CAIL2018 | Normalized log-distance | Regression |
| | 3-5 | Prison Term Prediction w/ Article | CAIL2018 | Normalized log-distance | Regression |
| | 3-6 | Case Analysis | JEC_QA | Accuracy | SLC |
| | 3-7 | Criminal Damages Calculation | LAIC2021 | Accuracy | Regression |
| | 3-8 | Consultation | hualv.com | ROUGE-L | Generation |

SLC: single-label classification; MLC: multi-label classification.

Data Format

The data is stored under the data folder. Every task is stored in a separate <task_id>.json file. The json file can be loaded via json.load as a list of dictionaries. The data format is as follows (using task 3-2 as an example):

[
  {
    "instruction": "请根据具体场景与问题给出法律依据,只需要给出具体法条内容,每个场景仅涉及一个法条。",
    "question": "场景:某个地区的三个以上专业农民合作社想要出资设立农民专业合作社联合社,以提高其在市场中的竞争力和规模效应。根据哪条法律,三个以上的农民专业合作社可以出资设立农民专业合作社联合社?",
    "answer": "根据《农民专业合作社法》第五十六条,三个以上的农民专业合作社在自愿的基础上,可以出资设立农民专业合作社联合社。该联合社应当有自己的名称、组织机构和住所,由联合社全体成员制定并承认的章程,以及符合章程规定的成员出资。"
  },
]
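A minimal loading sketch in Python (the relative path and task id are illustrative; adjust them to where you cloned the repository):

import json

# Load one LawBench task file as a list of {"instruction", "question", "answer"} dicts.
with open("data/3-2.json", encoding="utf-8") as f:
    examples = json.load(f)

print(len(examples))                 # 500 examples per task
print(examples[0]["instruction"])    # task instruction
print(examples[0]["question"])       # input scenario / question
print(examples[0]["answer"])         # reference answer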

Model Output Format

The model outputs are stored under the predictions/zero_shot and predictions/one_shot folders. Every system has its own subfolder. Within each subfolder, the predictions for each task are stored in a <task_id>.json file. The json file can be loaded via json.load as a dictionary. The data format is as follows (using task 3-2 from the GPT-4 zero-shot predictions as an example):

{
    "0": {
        "origin_prompt": [
            {
                "role": "HUMAN",
                "prompt": "请根据具体场景与问题给出法律依据,只需要给出具体法条内容,每个场景仅涉及一个法条。\n场景:某个地区的三个以上专业农民合作社想要出资设立农民专业合作社联合社,以提高其在市场中的竞争力和规模效应。根据哪条法律,三个以上的农民专业合作社可以出资设立农民专业合作社联合社?"
            }
        ],
        "prediction": "根据《中华人民共和国农民专业合作社法》第十七条:“三个以上的农民专业合作社可以出资设立农民专业合作社联合社。”",
        "refr": "根据《农民专业合作社法》第五十六条,三个以上的农民专业合作社在自愿的基础上,可以出资设立农民专业合作社联合社。该联合社应当有自己的名称、组织机构和住所,由联合社全体成员制定并承认的章程,以及符合章程规定的成员出资。"
    },
    ...
}
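As an illustration of how such a prediction file can be inspected, the sketch below scores one prediction against its reference with ROUGE-L. It is not the benchmark's official scoring pipeline; the subfolder name is illustrative, and jieba is assumed for tokenization even though it is not part of the listed requirements.

import json

import jieba                     # assumed tokenizer, not in the listed requirements
from rouge_chinese import Rouge  # rouge_chinese is listed under Requirements

# Load the zero-shot predictions of one system for task 3-2 (path is illustrative).
with open("predictions/zero_shot/gpt-4/3-2.json", encoding="utf-8") as f:
    predictions = json.load(f)

entry = predictions["0"]
hypothesis = " ".join(jieba.cut(entry["prediction"]))  # model output, tokenized
reference = " ".join(jieba.cut(entry["refr"]))         # reference answer, tokenized
score = Rouge().get_scores(hypothesis, reference)[0]["rouge-l"]["f"]
print(f"ROUGE-L F1 for example 0: {score:.4f}")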

📖 Model List

We test 51 popular large language models. We group them as in the following table:

| Model | Parameters | Access | Base Model |
| --- | --- | --- | --- |
| **Multilingual LLMs** | | | |
| MPT | 7B | Weights | - |
| MPT-Instruct | 7B | Weights | MPT-7B |
| LLaMA | 7/13/30/65B | Weights | - |
| LLaMA-2 | 7/13/70B | Weights | - |
| LLaMA-2-Chat | 7/13/70B | Weights | LLaMA-2-7/13/70B |
| Alpaca-v1.0 | 7B | Weights | LLaMA-7B |
| Vicuna-v1.3 | 7/13/33B | Weights | LLaMA-7/13/33B |
| WizardLM | 7B | Weights | LLaMA-7B |
| StableBeluga2 | 70B | Weights | LLaMA-2-70B |
| ChatGPT | N/A | API | - |
| GPT-4 | N/A | API | - |
| **Chinese-oriented LLMs** | | | |
| MOSS-Moon | 16B | Weights | - |
| MOSS-Moon-SFT | 16B | Weights | MOSS-Moon |
| TigerBot-Base | 7B | Weights | - |
| TigerBot-SFT | 7B | Weights | TigerBot-Base |
| GoGPT | 7B | Weights | LLaMA-7B |
| ChatGLM2 | 6B | Weights | ChatGLM |
| Ziya-LLaMA | 13B | Weights | LLaMA-13B |
| Baichuan | 7/13B | Weights | - |
| Baichuan-13B-Chat | 13B | Weights | Baichuan-13B |
| XVERSE | 13B | Weights | - |
| InternLM | 7B | Weights | - |
| InternLM-Chat | 7B | Weights | InternLM-7B |
| InternLM-Chat-7B-8K | 7B | Weights | InternLM-7B |
| Qwen | 7B | Weights | - |
| Qwen-7B-Chat | 7B | Weights | Qwen-7B |
| Yulan-Chat-2 | 13B | Weights | LLaMA-2-13B |
| BELLE-LLaMA-2 | 13B | Weights | LLaMA-2-13B |
| Chinese-LLaMA-2 | 7B | Weights | LLaMA-2-7B |
| Chinese-Alpaca-2 | 7B | Weights | LLaMA-2-7B |
| LLaMA-2-Chinese | 7/13B | Weights | LLaMA-2-7/13B |
| **Legal-specific LLMs** | | | |
| HanFei | 7B | Weights | HanFei |
| LaWGPT-7B-beta1.0 | 7B | Weights | Chinese-LLaMA |
| LaWGPT-7B-beta1.1 | 7B | Weights | Chinese-alpaca-plus-7B |
| LexiLaw | 6B | Weights | ChatGLM-6B |
| Wisdom-Interrogatory | 7B | Weights | Baichuan-7B |
| Fuzi-Mingcha | 6B | Weights | ChatGLM-6B |
| Lawyer-LLaMA | 13B | Weights | LLaMA |
| ChatLaw | 13/33B | Weights | Ziya-LLaMA-13B/Anima-33B |

📊 Model Performance

We test model performance under two settings: (1) zero-shot, where only the instruction is provided in the prompt, and (2) one-shot, where the instruction and a one-shot example are concatenated in the prompt.
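The sketch below shows roughly how the two prompt styles can be assembled from a task file; it is illustrative only, and the exact templates fed to each model may differ.

def build_prompt(instruction, question, example=None):
    # Assemble a zero-shot prompt (instruction + question) or, when `example`
    # is a (question, answer) pair, a one-shot prompt with one solved example.
    # Illustrative sketch only; the exact templates per model may differ.
    parts = [instruction]
    if example is not None:
        demo_question, demo_answer = example
        parts.append(f"{demo_question}\n{demo_answer}")
    parts.append(question)
    return "\n".join(parts)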

Zero-shot Performance

Average performance (zero-shot) of 51 LLMs evaluated on LawBench.

We show the performance of the top-5 models with the highest average scores.

Note: gpt-3.5-turbo refers to the 2023.6.13 version; all gpt-3.5-turbo results below are for this version.

| Task ID | GPT-4 | GPT-3.5-turbo | StableBeluga2 | qwen-7b-chat | internlm-chat-7b-8k |
| --- | --- | --- | --- | --- | --- |
| AVG | 52.35 | 42.15 | 39.23 | 37.00 | 35.73 |
| 1-1 | 15.38 | 15.86 | 14.58 | 18.54 | 15.45 |
| 1-2 | 55.20 | 36.00 | 34.60 | 34.00 | 40.40 |
| 2-1 | 12.53 | 9.10 | 7.70 | 22.56 | 22.64 |
| 2-2 | 41.65 | 32.37 | 25.57 | 27.42 | 35.46 |
| 2-3 | 69.79 | 51.73 | 44.20 | 31.42 | 28.96 |
| 2-4 | 44.00 | 41.20 | 39.00 | 35.00 | 35.60 |
| 2-5 | 56.50 | 53.75 | 52.03 | 48.48 | 54.13 |
| 2-6 | 76.60 | 69.55 | 65.54 | 37.88 | 17.95 |
| 2-7 | 37.92 | 33.49 | 39.07 | 36.04 | 27.11 |
| 2-8 | 61.20 | 36.40 | 45.80 | 24.00 | 36.20 |
| 2-9 | 78.82 | 66.48 | 65.27 | 44.88 | 62.93 |
| 2-10 | 65.09 | 39.05 | 41.64 | 18.90 | 20.94 |
| 3-1 | 52.47 | 29.50 | 16.41 | 44.62 | 34.86 |
| 3-2 | 27.54 | 31.30 | 24.52 | 33.50 | 19.11 |
| 3-3 | 41.99 | 35.52 | 22.82 | 40.67 | 41.05 |
| 3-4 | 82.62 | 78.75 | 76.06 | 76.74 | 63.21 |
| 3-5 | 81.91 | 76.84 | 65.35 | 77.19 | 67.20 |
| 3-6 | 48.60 | 27.40 | 34.40 | 26.80 | 34.20 |
| 3-7 | 77.60 | 61.20 | 56.60 | 42.00 | 43.80 |
| 3-8 | 19.65 | 17.45 | 13.39 | 19.32 | 13.37 |

One-Shot Performance

Average performance (one-shot) of 51 LLMs evaluated on LawBench.

We show the performance of the top-5 models with the highest average scores.

| Task ID | GPT-4 | GPT-3.5-turbo | qwen-7b-chat | StableBeluga2 | internlm-chat-7b-8k |
| --- | --- | --- | --- | --- | --- |
| AVG | 53.85 | 44.52 | 38.99 | 38.97 | 37.28 |
| 1-1 | 17.21 | 16.15 | 17.73 | 15.03 | 15.16 |
| 1-2 | 54.80 | 37.20 | 28.60 | 36.00 | 40.60 |
| 2-1 | 18.31 | 13.50 | 25.16 | 8.93 | 21.64 |
| 2-2 | 46.00 | 40.60 | 27.40 | 15.00 | 36.60 |
| 2-3 | 69.99 | 54.01 | 32.96 | 41.76 | 30.91 |
| 2-4 | 44.40 | 41.40 | 31.20 | 38.00 | 33.20 |
| 2-5 | 64.80 | 61.98 | 46.71 | 53.55 | 54.35 |
| 2-6 | 79.96 | 74.04 | 57.34 | 64.99 | 26.86 |
| 2-7 | 40.52 | 40.68 | 42.58 | 45.06 | 30.56 |
| 2-8 | 59.00 | 37.40 | 26.80 | 37.60 | 30.60 |
| 2-9 | 76.55 | 67.59 | 50.63 | 65.89 | 63.42 |
| 2-10 | 65.26 | 40.04 | 21.27 | 40.54 | 20.69 |
| 3-1 | 53.20 | 30.81 | 52.86 | 16.87 | 38.88 |
| 3-2 | 33.15 | 34.49 | 34.49 | 32.44 | 28.70 |
| 3-3 | 41.30 | 34.55 | 39.91 | 23.07 | 42.25 |
| 3-4 | 83.21 | 77.12 | 78.47 | 75.80 | 67.74 |
| 3-5 | 82.74 | 73.72 | 73.92 | 63.59 | 71.10 |
| 3-6 | 49.60 | 31.60 | 26.80 | 33.00 | 36.20 |
| 3-7 | 77.00 | 66.40 | 44.60 | 56.00 | 44.00 |
| 3-8 | 19.90 | 17.17 | 20.39 | 16.24 | 12.11 |

🛠️ How to Evaluate Models

We design task-specific rule-based parsers to extract answers from the model predictions. The evaluation script for every task is in evaluation/evaluation_functions.
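As a rough illustration of such rule-based parsing (not the actual code under evaluation/evaluation_functions), a single-label multiple-choice answer might be extracted like this; an output with no match counts toward the abstention rate:

import re

def extract_choice(prediction):
    # Pull a single multiple-choice answer (A-D) out of free-form model output.
    # Illustrative sketch only; the real parsers are task-specific.
    # Returns the choice letter, or None (counted as an abstention).
    match = re.search(r"[ABCD]", prediction.upper())
    return match.group(0) if match else None

print(extract_choice("答案是 B。"))          # -> "B"
print(extract_choice("抱歉,我无法回答。"))   # -> None (abstention)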

Steps

The steps to evaluate the model predictions are as below:

  1. Put the prediction results from all systems under a folder F. Every system has its own subfolder.
  2. Under each system's subfolder, every task has a prediction file named after its task id.
  3. Enter the evaluation folder and run "python main.py -i F -o <metric_result>".

The expected folder layout is as below:

data/
├── system-1
│   ├── 1-1.json
│   ├── 1-2.json
│   ├── ...
├── system-2
│   ├── 1-1.json
│   ├── 1-2.json
│   ├── ...
├── ...

The output result will be saved in <metric_result>.

For example, the zero-shot predictions from the 51 tested models are saved in predictions/zero_shot. You can run

cd evaluation
python main.py -i ../predictions/zero_shot -o ../predictions/zero_shot/results.csv

to get their evaluation results stored as ../predictions/zero_shot/results.csv.

Result Format

The result file is a CSV file with four columns: task, model_name, score and abstention_rate:

| Column | Description |
| --- | --- |
| task | Task name, set to the name of the prediction file. |
| model_name | Model name, set to the name of the folder storing the prediction files. |
| score | Model score for the corresponding task. |
| abstention_rate | Abstention rate for the corresponding task, i.e. how often an answer could not be extracted from the model prediction. |
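A small sketch for aggregating the result CSV into per-model averages with the standard library (the path matches the example command above; column names follow the table):

import csv
from collections import defaultdict

# Group scores as scores[model_name][task] from the evaluation output.
scores = defaultdict(dict)
with open("../predictions/zero_shot/results.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        scores[row["model_name"]][row["task"]] = float(row["score"])

for model, per_task in sorted(scores.items()):
    average = sum(per_task.values()) / len(per_task)
    print(f"{model}: average score {average:.2f} over {len(per_task)} tasks")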

Requirements

rouge_chinese==1.0.3
cn2an==0.5.22
ltp==4.2.13
OpenCC==1.1.6
python-Levenshtein==0.21.1
pypinyin==0.49.0
tqdm==4.64.1
timeout_decorator==0.5.0

📌 Licenses

LawBench is a mix of created and transformed datasets. We ask that you follow the license of the dataset creator. Please see the task list for the original source of each task.

🔜 Future Plan

  • ROUGE-L is not a good metric for evaluating long-form generation results. We will explore large language model-based evaluation metrics dedicated to legal tasks.
  • We will keep updating the task list included in LawBench. We welcome external contributors to collaborate with us.

If you have law datasets that you would like to include, or if you want to evaluate your own models on LawBench, feel free to contact us.
