
Finetuning Mistral with deepspeed #101

Open
achangtv opened this issue Jan 29, 2024 · 2 comments

Comments

@achangtv

In fine_tune_deepspeed.py, the first part of the load_training_dataset function looks like this:

def load_training_dataset(
    tokenizer,
    path_or_dataset: str = DEFAULT_TRAINING_DATASET,
    max_seq_len: int = 256,
    seed: int = DEFAULT_SEED,
) -> Dataset:
    logger.info(f"Loading dataset from {path_or_dataset}")
    dataset = load_dataset(path_or_dataset)
    logger.info(f"Training: found {dataset['train'].num_rows} rows")
    logger.info(f"Eval: found {dataset['test'].num_rows} rows")

The way this function is written, it seems like I have to pass in a path to a Hugging Face dataset. Because this is in Databricks, I would like to pass in a Spark DataFrame, but load_dataset doesn't accept PySpark DataFrames, so I edited the line to read dataset = Dataset.from_spark(path_or_dataset). That gave me the error pyspark.errors.exceptions.base.PySparkRuntimeError: [MASTER_URL_NOT_SET] A master URL must be set in your configuration. You also cannot pass an already-created Dataset object to load_dataset(). Should I just change the code to dataset = path_or_dataset? Or should I keep the code as-is and pass in a DBFS path to a dataset object?
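One way to keep both call styles working is to branch on the input type before loading. This is only a sketch: the helper name resolve_loader and the duck-typed sparkSession check are my own, not from the repo. The branch logic itself is plain Python, so it runs without Spark installed:

```python
def resolve_loader(path_or_dataset):
    """Pick a loading strategy for load_training_dataset's input.

    Hypothetical helper: Spark DataFrames are detected by duck typing
    on the `sparkSession` attribute (present on pyspark.sql.DataFrame
    in recent Spark versions), so pyspark need not be importable here.
    """
    if hasattr(path_or_dataset, "sparkSession"):
        return "from_spark"    # datasets.Dataset.from_spark(path_or_dataset)
    if isinstance(path_or_dataset, str):
        return "load_dataset"  # datasets.load_dataset(path_or_dataset)
    return "passthrough"       # assume it is already a datasets object
```

Inside load_training_dataset, this dispatch would replace the unconditional load_dataset(path_or_dataset) call. Note that Dataset.from_spark returns a single Dataset rather than a DatasetDict, so the dataset['train'] / dataset['test'] lookups that follow would also need a train_test_split step.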

@es94129
Contributor

es94129 commented Jan 31, 2024

If you would like to pass in a Spark dataframe, dataset = Dataset.from_spark(df) looks good to me.

Regarding the PySparkRuntimeError, are you running the code in Databricks? Databricks should set the Spark master for you.

@achangtv
Author

achangtv commented Jan 31, 2024

I am running the code in Databricks, although I cloned the repo, so I am running it from Repos rather than the Workspace. Should I copy the whole folder into the Workspace? Or maybe the problem is the type of compute? I was using multi-GPU compute with an ML runtime; I can try again with a single-GPU setup.
