Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ValueError: Column name input_col not in the dataset. Current columns in the dataset: [] #291

Open
bf-yang opened this issue Aug 27, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@bf-yang
Copy link

bf-yang commented Aug 27, 2023

When I run python cli_demo.py, it reports errors:

Generating examples: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 26273.52it/s]
The generated dataset is ready.
The model has not been trained.
Processing datasets.
Traceback (most recent call last):
File "/home/bufang/prompt2model/cli_demo.py", line 435, in
main()
File "/home/bufang/prompt2model/cli_demo.py", line 321, in main
t5_modified_dataset_dicts = t5_processor.process_dataset_dict(
File "/home/bufang/prompt2model/prompt2model/dataset_processor/base.py", line 100, in process_dataset_dict
dataset_dict[dataset_split]
File "/home/bufang/yes/envs/pt2model/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/bufang/yes/envs/pt2model/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/bufang/yes/envs/pt2model/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2901, in map
return self.remove_columns(remove_columns)
File "/home/bufang/yes/envs/pt2model/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/bufang/yes/envs/pt2model/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/bufang/yes/envs/pt2model/lib/python3.10/site-packages/datasets/fingerprint.py", line 511, in wrapper
out = func(dataset, *args, **kwargs)
File "/home/bufang/yes/envs/pt2model/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2064, in remove_columns
raise ValueError(
ValueError: Column name input_col not in the dataset. Current columns in the dataset: []

How to fix this error? Thanks

@zhaochenyang20
Copy link
Collaborator

zhaochenyang20 commented Aug 27, 2023

Thanks for your bug report!

ValueError: Column name input_col not in the dataset. Current columns in the dataset: []

This means your dataset is an empty dataset, right? 🤔

Could you check if your two datasets, [dataset_dict, retrieved_dataset_dict], which have some potential error? I guess is the problem of retrieved_dataset_dict, so may you send us your prompt, chosen retrieved dataset, and chosen columns?

@bf-yang

@bf-yang
Copy link
Author

bf-yang commented Aug 27, 2023

@zhaochenyang20
Sure. In fact, I just refer to the prompt examples provided by this repo.
The prompt is:
"""Your task is to generate an answer to a natural question. In this task, the input is a string that consists of both a question and a context passage. The context is a descriptive passage related to the question and contains the answer. And the question can range from Math, Cultural, Social, Geometry, Biology, History, Sports, Technology, Science, and so on.

Here are examples with input questions and context passages, along with their expected outputs:

input="Question: What city did Super Bowl 50 take place in? Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50."
output="Santa Clara"

input="Question: What river runs through Warsaw? Context: Warsaw (Polish: Warszawa [varˈʂava] ( listen); see also other names) is the capital and largest city of Poland. It stands on the Vistula River in east-central Poland, roughly 260 kilometres (160 mi) from the Baltic Sea and 300 kilometres (190 mi) from the Carpathian Mountains. Its population is estimated at 1.740 million residents within a greater metropolitan area of 2.666 million residents, which makes Warsaw the 9th most-populous capital city in the European Union. The city limits cover 516.9 square kilometres (199.6 sq mi), while the metropolitan area covers 6,100.43 square kilometres (2,355.39 sq mi)."
output="Vistula River"

input="Question: The Ottoman empire controlled territory on three continents, Africa, Asia and which other? Context: The Ottoman Empire was an imperial state that lasted from 1299 to 1923. During the 16th and 17th centuries, in particular at the height of its power under the reign of Suleiman the Magnificent, the Ottoman Empire was a powerful multinational, multilingual empire controlling much of Southeast Europe, Western Asia, the Caucasus, North Africa, and the Horn of Africa. At the beginning of the 17th century the empire contained 32 provinces and numerous vassal states. Some of these were later absorbed into the empire, while others were granted various types of autonomy during the course of centuries."
output="Europe"
"""

Then it shows:
1): yulongmannlp/dev_para Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
2): yulongmannlp/dev_orig Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
3): yulongmannlp/adv_para Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
4): yulongmannlp/adv_ori Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
5): squad Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
6): lhoestq/squad Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
7): lhoestq/custom_squad Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
8): hapandya/sqnnr Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
9): Yulong-W/squadpararobustness Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
10): Yulong-W/squadpara Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
11): Yulong-W/squadorirobustness Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
12): Yulong-W/squadori Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
13): MajdTannous/Test3 Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
14): MajdTannous/Test2 Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
15): MajdTannous/Dataset1 Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
16): BerMaker/test Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
17): TenzinGayche/Demo-datasets Stanford Question Answering Dataset (DemoDatasets) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
18): SajjadAyoubi/persian_qa \\\\Persian Question Answering (PersianQA) Dataset is a reading comprehension dataset on Persian Wikipedia. The crowd-sourced dataset consists of more than 9,000 entries. Each entry can be either an impossible to answer or a question with one or more answers spanning in the passage (the context) from which the questioner proposed the question. Much like the SQuAD2.0 dataset, the impossible or unanswerable questions can be utilized to create a system which "knows that it doesn't know the answer".
19): web_questions This dataset consists of 6,642 question/answer pairs. The questions are supposed to be answerable by Freebase, a large knowledge graph. The questions are mostly centered around a single named entity. The questions are popular ones asked on the web (at least in 2013).
20): race Race is a large-scale reading comprehension dataset with more than 28,000 passages and nearly 100,000 questions. The dataset is collected from English examinations in China, which are designed for middle school and high school students. The dataset can be served as the training and test sets for machine comprehension.
21): EleutherAI/race Race is a large-scale reading comprehension dataset with more than 28,000 passages and nearly 100,000 questions. The dataset is collected from English examinations in China, which are designed for middle school and high school students. The dataset can be served as the training and test sets for machine comprehension.
22): the-coorporation/the_squad_qg A preprocessed version of the Standford Question Answering Dataset (SQuAD) version 2.0 consisting of contexts and questions only. Duplicate contexts have been removed and corresponding questions have been merged into an array per context. Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
23): imagenet-1k ILSVRC 2012, commonly known as 'ImageNet' is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, majority of them are nouns (80,000+). ImageNet aims to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. In its completion, ImageNet hopes to offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy. ImageNet 2012 is the most commonly used subset of ImageNet. This dataset spans 1000 object classes and contains 1,281,167 training images, 50,000 validation images and 100,000 test images
24): AlexFierro9/imagenet-1k_test ILSVRC 2012, commonly known as 'ImageNet' is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, majority of them are nouns (80,000+). ImageNet aims to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. In its completion, ImageNet hopes to offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy. ImageNet 2012 is the most commonly used subset of ImageNet. This dataset spans 1000 object classes and contains 1,281,167 training images, 50,000 validation images and 100,000 test images
25): dbpedia_14 The DBpedia ontology classification dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014. They are listed in classes.txt. From each of thse 14 ontology classes, we randomly choose 40,000 training samples and 5,000 testing samples. Therefore, the total size of the training dataset is 560,000 and testing dataset 70,000. There are 3 columns in the dataset (same for train and test splits), corresponding to class index (1 to 14), title and content. The title and content are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). There are no new lines in title or content.

Then I select this dataset.
image

Chosen columns are:
image

Chosen model is:
image

However, it finally reports error:
image

After that, it report error like this.

@zhaochenyang20
Copy link
Collaborator

Oops. Let me see!

@zhaochenyang20
Copy link
Collaborator

I rerun your process but get the right result of the retrieved dataset. 🤔

image

Would you please run this script to save your retrieved_dataset separately?

from prompt2model.prompt_parser import OpenAIInstructionParser, TaskType

prompt = """Your task is to generate an answer to a natural question. In this task, the input is a string that consists of both a question and a context passage. The context is a descriptive passage related to the question and contains the answer. And the question can range from Math, Cultural, Social, Geometry, Biology, History, Sports, Technology, Science, and so on.

Here are examples with input questions and context passages, along with their expected outputs:

input="Question: What city did Super Bowl 50 take place in? Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50."
output="Santa Clara"

input="Question: What river runs through Warsaw? Context: Warsaw (Polish: Warszawa [varˈʂava] ( listen); see also other names) is the capital and largest city of Poland. It stands on the Vistula River in east-central Poland, roughly 260 kilometres (160 mi) from the Baltic Sea and 300 kilometres (190 mi) from the Carpathian Mountains. Its population is estimated at 1.740 million residents within a greater metropolitan area of 2.666 million residents, which makes Warsaw the 9th most-populous capital city in the European Union. The city limits cover 516.9 square kilometres (199.6 sq mi), while the metropolitan area covers 6,100.43 square kilometres (2,355.39 sq mi)."
output="Vistula River"

input="Question: The Ottoman empire controlled territory on three continents, Africa, Asia and which other? Context: The Ottoman Empire was an imperial state that lasted from 1299 to 1923. During the 16th and 17th centuries, in particular at the height of its power under the reign of Suleiman the Magnificent, the Ottoman Empire was a powerful multinational, multilingual empire controlling much of Southeast Europe, Western Asia, the Caucasus, North Africa, and the Horn of Africa. At the beginning of the 17th century the empire contained 32 provinces and numerous vassal states. Some of these were later absorbed into the empire, while others were granted various types of autonomy during the course of centuries."
output="Europe"
"""

prompt_spec = OpenAIInstructionParser(task_type=TaskType.TEXT_GENERATION)
prompt_spec.parse_from_prompt(prompt)
print(f"Instruction: {prompt_spec.instruction}")
print(f"exmaples: {prompt_spec.examples}")

from prompt2model.dataset_retriever import DescriptionDatasetRetriever

retriever = DescriptionDatasetRetriever()
retrieved_dataset_dict = retriever.retrieve_dataset_dict(prompt_spec)
retrieved_dataset_dict.save_to_disk("retrieved_dataset")

@zhaochenyang20
Copy link
Collaborator

BTW, we are developing our colab demo. Maybe it would be easier to use.

@zhaochenyang20
Copy link
Collaborator

@zhaochenyang20 zhaochenyang20 added the bug Something isn't working label Aug 27, 2023
@neubig neubig reopened this Aug 29, 2023
@neubig
Copy link
Collaborator

neubig commented Aug 29, 2023

Hi, I don't think this is fixed. I just ran the cli_demo.py on the main branch and encountered the same error.

Traceback (most recent call last):
  File "/Users/gneubig/work/prompt2model/cli_demo.py", line 435, in <module>
    main()
  File "/Users/gneubig/work/prompt2model/cli_demo.py", line 321, in main
    t5_modified_dataset_dicts = t5_processor.process_dataset_dict(
  File "/Users/gneubig/work/prompt2model/prompt2model/dataset_processor/base.py", line 100, in process_dataset_dict
    dataset_dict[dataset_split]
  File "/Users/gneubig/anaconda3/envs/promp2model2/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/Users/gneubig/anaconda3/envs/promp2model2/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/Users/gneubig/anaconda3/envs/promp2model2/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2995, in map
    return self.remove_columns(remove_columns)
  File "/Users/gneubig/anaconda3/envs/promp2model2/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/Users/gneubig/anaconda3/envs/promp2model2/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/Users/gneubig/anaconda3/envs/promp2model2/lib/python3.10/site-packages/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/Users/gneubig/anaconda3/envs/promp2model2/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2155, in remove_columns
    raise ValueError(
ValueError: Column name input_col not in the dataset. Current columns in the dataset: []

@zhaochenyang20
Copy link
Collaborator

Which dataset? generated or retrieved?

@neubig
Copy link
Collaborator

neubig commented Aug 29, 2023

This line:

t5_modified_dataset_dicts = t5_processor.process_dataset_dict(

@zhaochenyang20
Copy link
Collaborator

DATASET_DICTS = [dataset_dict, retrieved_dataset_dict]

I am pretty familiar with generated_dataset, so I guess the problem lies in retrieved_dataset_dict. 🤔

Have you checked it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants