[DOCS] Embeddings tutorial: Temporarily remove full dataset (#1039)
Co-authored-by: Xiayue Charles Lin <[email protected]>
xcharleslin and Xiayue Charles Lin authored Jun 21, 2023
1 parent 1907413 commit 2bcaae4
Showing 1 changed file with 2 additions and 1 deletion.
@@ -52,6 +52,8 @@
 "\n",
 "We will use the **StackExchange crawl from the [RedPajamas dataset](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)**. It is 75GB of `jsonl` files. \n",
 "\n",
+"*EDIT (June 2023): Our hosted version of the full dataset is temporarily unavailable. Please enjoy the demo with the sample dataset for now.*\n",
+"\n",
 "**Note:** This demo runs best on a cluster with many GPUs available. Information on how to connect Daft to a cluster is available [here](https://www.getdaft.io/projects/docs/en/stable/learn/user_guides/scaling-up.html). \n",
 "\n",
 "If running on a single node, you can use the provided subsample of the data, which is 75MB in size. If you like, you can also truncate either dataset to a desired number of rows using `df.limit`."
@@ -87,7 +89,6 @@
 "source": [
 "import daft\n",
 "\n",
-"FULL_DATA_PATH = \"s3://daft-public-data/redpajama-1t/stackexchange/*\"\n",
 "SAMPLE_DATA_PATH = \"s3://daft-public-data/redpajama-1t-sample/stackexchange_sample.jsonl\"\n",
 "\n",
 "df = daft.read_json(SAMPLE_DATA_PATH)\n",
