AttributeError: module 'evaluate' has no attribute 'load' #609

Open
Adesoji1 opened this issue Jul 31, 2024 · 7 comments

Adesoji1 commented Jul 31, 2024

[screenshot: traceback ending in "AttributeError: module 'evaluate' has no attribute 'load'"]

I am running evaluate==0.4.2 on Python 3.11 and I get this error. Please help.

raghavm1 commented Aug 3, 2024

Can you check your import statement? I'm able to use evaluate 0.4.2 on Python 3.11. Feel free to share more of your code here so that there's more clarity. Also, double-check that evaluate is installed properly and that it's the version you expect.
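
If it helps, a quick check along these lines (a minimal sketch, nothing project-specific) shows both the installed version and, more importantly, which file Python is actually importing:

```python
# Sanity check: confirm that "evaluate" resolves to the installed library.
import evaluate

print(evaluate.__version__)  # expected: 0.4.2
print(evaluate.__file__)     # should point into site-packages, not into your project folder
```

If `__file__` points somewhere unexpected, that usually explains an AttributeError like this one.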

Adesoji1 commented Aug 3, 2024

@raghavm1
```python
import json
import torch
from transformers import BartTokenizer, BartForConditionalGeneration, Trainer, TrainingArguments, DataCollatorForSeq2Seq, GenerationConfig
from datasets import Dataset, load_metric
from sqlalchemy.orm import Session
from app.database import SessionLocal
from app import models
import re
import spacy
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import evaluate
import numpy as np
from transformers import EvalPrediction

# Load the SpaCy model

nlp = spacy.load("en_core_web_sm")

# Load Preprocessed Data from Database

def load_books_from_db():
    db: Session = SessionLocal()
    books = db.query(models.Book).limit(3).all()  # Increase limit if needed
    db.close()
    return books

# Preprocess text with SpaCy

def preprocess_text(text):
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    cleaned_text = ' '.join(tokens)
    return cleaned_text

# Additional text cleaning

def clean_text(text):
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    text = re.sub(r'[^\w\s]', '', text)  # Remove non-alphanumeric characters
    text = text.strip()
    text = text.lower()
    return text

def prepare_dataset(books):
    data = []
    for book in tqdm(books, desc="Processing books"):
        text = preprocess_text(book.text)
        text = clean_text(text)
        data.append({"text": text, "summary": book.title})  # Adjust based on actual fields

    # Split the data into training, validation, and test sets
    train_data, test_data = train_test_split(data, test_size=0.3)
    train_data, val_data = train_test_split(train_data, test_size=0.1)

    train_dataset = Dataset.from_list(train_data)
    val_dataset = Dataset.from_list(val_data)
    test_dataset = Dataset.from_list(test_data)

    return {"train": train_dataset, "validation": val_dataset, "test": test_dataset}

def tokenize_function(example):
    inputs = tokenizer(example["text"], max_length=1024, truncation=True, padding="max_length")
    targets = tokenizer(example["summary"], max_length=128, truncation=True, padding="max_length")
    inputs["labels"] = targets["input_ids"]
    return inputs

def tokenize_datasets(datasets):
    tokenized_datasets = {}
    for split, dataset in datasets.items():
        tokenized_datasets[split] = dataset.map(tokenize_function, batched=True)
    return tokenized_datasets

def compute_metrics(pred: EvalPrediction):
    rouge = evaluate.load("rouge")
    pred_ids = pred.predictions[0] if isinstance(pred.predictions, tuple) else pred.predictions
    pred_ids = np.argmax(pred_ids, axis=-1)

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, use_stemmer=True)
    return {
        "rouge1": rouge_output["rouge1"].mid.fmeasure,
        "rouge2": rouge_output["rouge2"].mid.fmeasure,
        "rougeL": rouge_output["rougeL"].mid.fmeasure,
    }

# Define GenerationConfig

generation_config = GenerationConfig(
    early_stopping=True,
    num_beams=4,
    no_repeat_ngram_size=3,
    forced_bos_token_id=BartTokenizer.bos_token_id,
    forced_eos_token_id=BartTokenizer.eos_token_id,
    decoder_start_token_id=BartTokenizer.bos_token_id  # Add decoder_start_token_id
)

def train_model(tokenized_datasets):
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)

    training_args = TrainingArguments(
        output_dir="./results",
        eval_strategy="epoch",  # Use eval_strategy instead of evaluation_strategy
        learning_rate=2e-5,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        num_train_epochs=10,
        weight_decay=0.01,
        save_strategy="epoch",
        save_total_limit=2,
        fp16=True,
        logging_dir='./logs',
        logging_steps=10,
        report_to="tensorboard",
        load_best_model_at_end=True,
        metric_for_best_model="rougeL",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )

    trainer.train()

if __name__ == "__main__":
    model_name = "facebook/bart-base"
    tokenizer = BartTokenizer.from_pretrained(model_name)
    model = BartForConditionalGeneration.from_pretrained(model_name)

    books = load_books_from_db()
    datasets = prepare_dataset(books)
    tokenized_datasets = tokenize_datasets(datasets)

    # Print some tokenized samples for inspection
    print("Sample tokenized data:")
    for i in range(min(5, len(tokenized_datasets["train"]))):  # Ensure we don't access out of bounds
        print(tokenized_datasets["train"][i])

    # Check for CUDA
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)  # Move the model to the device

    train_model(tokenized_datasets)

    # Define a function to generate summary using the GenerationConfig
    def generate_summary(input_text):
        inputs = tokenizer(input_text, return_tensors="pt").to(device)
        summary_ids = model.generate(inputs["input_ids"], **generation_config.to_dict())
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        return summary

    # Test the summary generation
    test_text = "Your test text here"
    print(generate_summary(test_text))

```

raghavm1 commented Aug 3, 2024

Ah, wait, I just realised: your script is called evaluate.py.
That's what causes the confusion: Python resolves `import evaluate` to your own evaluate.py rather than to the installed evaluate library, so the module it imports has no `load` function, even though the library you actually want does provide one.

Try renaming your script from evaluate.py to something else and try again.

Your imports seem fine.
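
If you want to confirm the diagnosis before renaming anything, a small check like this (just a sketch, not tied to your project) makes the shadowing visible:

```python
# If a local evaluate.py shadows the Hugging Face `evaluate` package,
# the imported module has no `load` attribute and __file__ points at your own script.
import evaluate

if not hasattr(evaluate, "load"):
    raise ImportError(
        f"'evaluate' was imported from {evaluate.__file__}; "
        "a local evaluate.py is probably shadowing the installed library."
    )

rouge = evaluate.load("rouge")  # works once the real library is the one being imported
```

Once the script has a different name, `import evaluate` resolves to the installed package and `evaluate.load("rouge")` works as expected.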

Adesoji1 commented Aug 3, 2024

@raghavm1 I used train.py, not evaluate.py.

raghavm1 commented Aug 3, 2024

> [screenshot from the original post, showing evaluate.py being run]
>
> I am running evaluate==0.4.2 on Python 3.11 and I get this error. Please help.

Referring to your screenshot, you're running evaluate.py here.

Adesoji1 commented Aug 3, 2024

@raghavm1 Yes, you are correct.

raghavm1 commented Aug 3, 2024

Great! For questions and doubts about the usage of evaluate, you can also check out discuss.huggingface.co in the future, where there's an active community as well. I think this space is mostly for reporting bugs and concerns about the functionality of the library itself, not so much its usage.

Feel free to close this issue if it's solved.
