[DOCS] Embeddings tutorial: Temporarily remove full dataset (#1039)
Co-authored-by: Xiayue Charles Lin <[email protected]>
xcharleslin and Xiayue Charles Lin authored Jun 21, 2023
1 parent 1907413 commit 2bcaae4
Showing 1 changed file with 2 additions and 1 deletion.
@@ -52,6 +52,8 @@
 "\n",
 "We will use the **StackExchange crawl from the [RedPajamas dataset](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)**. It is 75GB of `jsonl` files. \n",
 "\n",
+"*EDIT (June 2023): Our hosted version of the full dataset is temporarily unavailable. Please enjoy the demo with the sample dataset for now.*\n",
+"\n",
 "**Note:** This demo runs best on a cluster with many GPUs available. Information on how to connect Daft to a cluster is available [here](https://www.getdaft.io/projects/docs/en/stable/learn/user_guides/scaling-up.html). \n",
 "\n",
 "If running on a single node, you can use the provided subsample of the data, which is 75MB in size. If you like, you can also truncate either dataset to a desired number of rows using `df.limit`."
@@ -87,7 +89,6 @@
 "source": [
 "import daft\n",
 "\n",
-"FULL_DATA_PATH = \"s3://daft-public-data/redpajama-1t/stackexchange/*\"\n",
 "SAMPLE_DATA_PATH = \"s3://daft-public-data/redpajama-1t-sample/stackexchange_sample.jsonl\"\n",
 "\n",
 "df = daft.read_json(SAMPLE_DATA_PATH)\n",
