In fine_tune_deepspeed.py, the first part of the load_training_dataset function looks like this:
```python
def load_training_dataset(
    tokenizer,
    path_or_dataset: str = DEFAULT_TRAINING_DATASET,
    max_seq_len: int = 256,
    seed: int = DEFAULT_SEED,
) -> Dataset:
    logger.info(f"Loading dataset from {path_or_dataset}")
    dataset = load_dataset(path_or_dataset)
    logger.info(f"Training: found {dataset['train'].num_rows} rows")
    logger.info(f"Eval: found {dataset['test'].num_rows} rows")
```
The way this function is written, it seems like I have to pass in a path to a Hugging Face dataset. Because this is in Databricks, I would like to pass in a Spark DataFrame, but `load_dataset` doesn't accept PySpark DataFrames, so I edited the line to read `dataset = Dataset.from_spark(path_or_dataset)`, but this gave me the error `pyspark.errors.exceptions.base.PySparkRuntimeError: [MASTER_URL_NOT_SET] A master URL must be set in your configuration.` You also cannot pass an already created dataset object to `load_dataset()`. Should I just change the code to `dataset = path_or_dataset`? Or should I keep the code as-is and pass in a DBFS path to a dataset object?
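For concreteness, the `dataset = path_or_dataset` option I'm asking about could be sketched like this (a minimal, hypothetical dispatch; `resolve_dataset` and `load_fn` are illustrative names I made up, not from the repo):

```python
# Hypothetical sketch: let the training entry point accept either a
# Hugging Face hub path (str) or an already-built dataset object,
# calling load_dataset only in the str case.

def resolve_dataset(path_or_dataset, load_fn):
    """Load from the hub when given a path; pass a prebuilt dataset through."""
    if isinstance(path_or_dataset, str):
        # Original behavior: load by name/path, e.g. datasets.load_dataset
        return load_fn(path_or_dataset)
    # Already a Dataset/DatasetDict (e.g. built with Dataset.from_spark
    # in the driver notebook, where a SparkSession exists) -- use as-is.
    return path_or_dataset
```

With something like this, `Dataset.from_spark` would run once in the notebook where Spark is configured, and the resulting object would be handed to the training function, which might sidestep the `MASTER_URL_NOT_SET` error raised when no Spark master is set in the training process.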
I am running the code in Databricks, although I did clone the repo so I am running it within Repos and not Workspace. Should I just copy over the whole folder into workspace? Or maybe the problem is the type of compute? I was using a multi GPU compute with an ML runtime, I can try again with a single GPU set up.