Merge pull request #80 from stanford-oval/dev-python-pkg
Wrap the project as a python package and support pip install

Showing 27 changed files with 346 additions and 243 deletions.

<p align="center">
| <a href="http://storm.genie.stanford.edu"><b>Research preview</b></a> | <a href="https://arxiv.org/abs/2402.14207"><b>Paper</b></a> | <a href="https://storm-project.stanford.edu/"><b>Website</b></a> |
</p>

**Latest News** 🔥

- [2024/07] You can now install our package with `pip install knowledge-storm`!
- [2024/07] We add `VectorRM` to support grounding on user-provided documents, complementing the existing support of search engines (`YouRM`, `BingSearch`). (check out [#58](https://github.com/stanford-oval/storm/pull/58))
- [2024/07] We release demo light for developers, a minimal user interface built with the Streamlit framework in Python, handy for local development and demo hosting (check out [#54](https://github.com/stanford-oval/storm/pull/54)).
- [2024/06] We will present STORM at NAACL 2024! Find us at Poster Session 2 on June 17 or check out our [presentation material](assets/storm_naacl2024_slides.pdf).
- [2024/05] We add Bing Search support in [rm.py](knowledge_storm/rm.py). Test STORM with `GPT-4o` - we now configure the article generation part in our demo using the `GPT-4o` model.
- [2024/04] We release a refactored version of the STORM codebase! We define an [interface](knowledge_storm/interface.py) for the STORM pipeline and reimplement STORM-wiki (check out [`knowledge_storm/storm_wiki`](knowledge_storm/storm_wiki)) to demonstrate how to instantiate the pipeline. We provide an API to support customization of different language models and retrieval/search integration.

## Overview [(Try STORM now!)](https://storm.genie.stanford.edu/)

Based on the separation of the two stages, STORM is implemented in a highly modular way.

## Installation

To install the knowledge storm library, use `pip install knowledge-storm`.
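
For example:

```shell
pip install knowledge-storm
```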

You could also install from the source code, which allows you to modify the behavior of the STORM engine directly.

1. Clone the git repository.
   ```shell
   git clone https://github.com/stanford-oval/storm.git
   cd storm
   ```
2. Install the required packages.
   ```shell
   conda create -n storm python=3.11
   conda activate storm
   pip install -r requirements.txt
   ```

## API

The STORM knowledge curation engine is defined as a simple Python `STORMWikiRunner` class.

Since STORM works at the information curation layer, you need to set up the information retrieval module and language model module to create a `STORMWikiRunner` instance. Here is an example of using the You.com search engine and OpenAI models.

```python
import os
from knowledge_storm import STORMWikiRunnerArguments, STORMWikiRunner, STORMWikiLMConfigs
from knowledge_storm.lm import OpenAIModel
from knowledge_storm.rm import YouRM

lm_configs = STORMWikiLMConfigs()
openai_kwargs = {
    'api_key': os.getenv("OPENAI_API_KEY"),
    'temperature': 1.0,
    'top_p': 0.9,
}
# STORM is an LM system, so different components can be powered by different models
# to reach a good balance between cost and quality.
# As good practice, choose a cheaper/faster model for `conv_simulator_lm`, which is used
# to split queries and synthesize answers in the conversation.
# Choose a more powerful model for `article_gen_lm` to generate verifiable text with citations.
gpt_35 = OpenAIModel(model='gpt-3.5-turbo', max_tokens=500, **openai_kwargs)
gpt_4 = OpenAIModel(model='gpt-4o', max_tokens=3000, **openai_kwargs)
lm_configs.set_conv_simulator_lm(gpt_35)
lm_configs.set_question_asker_lm(gpt_35)
lm_configs.set_outline_gen_lm(gpt_4)
lm_configs.set_article_gen_lm(gpt_4)
lm_configs.set_article_polish_lm(gpt_4)

# Check out the STORMWikiRunnerArguments class for more configurations.
engine_args = STORMWikiRunnerArguments(...)
rm = YouRM(ydc_api_key=os.getenv('YDC_API_KEY'), k=engine_args.search_top_k)
runner = STORMWikiRunner(engine_args, lm_configs, rm)
```
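
For the elided arguments above, a minimal sketch could look like the following; `output_dir` is an assumed field name for illustration (check the `STORMWikiRunnerArguments` class for the authoritative set of fields, of which `search_top_k`, used above, is one):

```python
# A sketch, not the authoritative signature: `output_dir` is an assumption;
# see the STORMWikiRunnerArguments definition for the full set of fields.
engine_args = STORMWikiRunnerArguments(
    output_dir='./results/gpt',  # where per-topic artifacts are written
)
```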

Currently, our package supports:

- `OpenAIModel`, `AzureOpenAIModel`, `ClaudeModel`, `VLLMClient`, `TGIClient`, `TogetherClient`, `OllamaClient` as language model components
- `YouRM`, `BingSearch`, `VectorRM` as retrieval module components

:star2: **PRs for integrating more language models into [knowledge_storm/lm.py](knowledge_storm/lm.py) and search engines/retrievers into [knowledge_storm/rm.py](knowledge_storm/rm.py) are highly appreciated!**
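
For example, grounding on Bing instead of You.com should only require swapping the retriever passed to the runner. A sketch, assuming the `BingSearch` constructor takes its API key as `bing_search_api_key` (check `knowledge_storm/rm.py` for the actual arguments):

```python
from knowledge_storm.rm import BingSearch

# Sketch: the keyword `bing_search_api_key` is an assumption; see
# knowledge_storm/rm.py for the actual constructor arguments.
rm = BingSearch(bing_search_api_key=os.getenv('BING_SEARCH_API_KEY'), k=engine_args.search_top_k)
runner = STORMWikiRunner(engine_args, lm_configs, rm)
```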

The `STORMWikiRunner` instance can be invoked with the simple `run` method:
```python
topic = input('Topic: ')

runner.run(
    topic=topic,
    do_research=True,
    do_generate_outline=True,
    do_generate_article=True,
    do_polish_article=True,
)
runner.post_run()
runner.summary()
```
- `do_research`: If True, simulate conversations with different perspectives to collect information about the topic; otherwise, load the results.
- `do_generate_outline`: If True, generate an outline for the topic; otherwise, load the results.
- `do_generate_article`: If True, generate an article for the topic based on the outline and the collected information; otherwise, load the results.
- `do_polish_article`: If True, polish the article by adding a summarization section and (optionally) removing duplicate content; otherwise, load the results.
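
Because each stage can load previously saved results, you can rerun later stages without repeating earlier ones. For example, to reuse saved research and outline results and redo only the writing stages:

```python
# Reuse saved research and outline results; regenerate and polish the article only.
runner.run(
    topic=topic,
    do_research=False,          # load saved conversation results
    do_generate_outline=False,  # load saved outline
    do_generate_article=True,
    do_polish_article=True,
)
runner.post_run()
```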

## Quick Start with Example Scripts

We provide scripts in our [examples folder](examples) as a quick start to run STORM with different configurations.

**To run STORM with `gpt` family models with default configurations:**

1. We suggest using `secrets.toml` to set up the API keys. Create a file `secrets.toml` under the root directory and add the following content:
   ```shell
   # Set up OpenAI API key.
   OPENAI_API_KEY="your_openai_api_key"
   # Set up You.com search API key.
   YDC_API_KEY="your_youcom_api_key"
   ```
2. Run the following command.
   ```
   python examples/run_storm_wiki_gpt.py \
       --output-dir $OUTPUT_DIR \
       --retriever you \
       --do-research \
       --do-generate-outline \
       --do-generate-article \
       --do-polish-article
   ```
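
   After the run finishes, per-topic artifacts are written under `$OUTPUT_DIR`. As a rough sketch of what to expect (these file names are taken from the NAACL replication notes below and may vary by version):

   ```
   $OUTPUT_DIR/<topic>/
   ├── storm_gen_article.txt           # draft article with citation indices
   ├── url_to_info.json                # references for each citation index
   └── storm_gen_article_polished.txt  # polished article (with --do-polish-article)
   ```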

**To run STORM using your favorite language models or grounding on your own corpus:** Check out [examples/README.md](examples/README.md).

## Customize STORM

### Customization of the Pipeline

If you have installed the source code, you can customize STORM based on your own use case. The STORM engine consists of 4 modules:

1. Knowledge Curation Module: Collects a broad coverage of information about the given topic.
2. Outline Generation Module: Organizes the collected information by generating a hierarchical outline for the curated knowledge.
3. Article Generation Module: Populates the generated outline with the collected information.
4. Article Polishing Module: Refines and enhances the written article for better presentation.

The interface for each module is defined in `knowledge_storm/interface.py`, while their implementations are instantiated in `knowledge_storm/storm_wiki/modules/*`. These modules can be customized according to your specific requirements (e.g., generating sections in bullet point format instead of full paragraphs).

:star2: **You can share your customization of `Engine` by making PRs to this repo!**
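
As a rough sketch of what such a customization could look like (the base class name and method signature below are illustrative assumptions, not the actual interface; consult `knowledge_storm/interface.py` for the real definitions):

```python
# Hypothetical sketch only: the base class name and method signature are
# assumed for illustration; see knowledge_storm/interface.py for the
# actual interface definitions.
from knowledge_storm.interface import ArticlePolishingModule

class BulletPointPolishingModule(ArticlePolishingModule):
    """Polishes sections into bullet points instead of full paragraphs."""

    def polish_article(self, topic, draft_article):
        # e.g., reformat each section of the draft into bullet points here
        ...
```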

## Replicate NAACL2024 result

Please switch to the branch `NAACL-2024-code-backup`.

The FreshWiki dataset used in our experiments can be found in [./FreshWiki](FreshWiki).

Run the following commands under [./src](knowledge_storm).

#### Pre-writing Stage

For batch experiment on FreshWiki dataset:

#### Writing Stage

```
python -m scripts.run_writing --input-source console --engine gpt-4 --do-polish-article
```

The generated article will be saved in `{output_dir}/{topic}/storm_gen_article.txt`, and the references corresponding to each citation index will be saved in `{output_dir}/{topic}/url_to_info.json`. If `--do-polish-article` is set, the polished article will be saved in `{output_dir}/{topic}/storm_gen_article_polished.txt`.

### Customize the STORM Configurations

We set up the default LLM configuration in `LLMConfigs` in [src/modules/utils.py](knowledge_storm/modules/utils.py). You can use `set_conv_simulator_lm()`, `set_question_asker_lm()`, `set_outline_gen_lm()`, `set_article_gen_lm()`, `set_article_polish_lm()` to override the default configuration. These functions take in an instance from `dspy.dsp.LM` or `dspy.dsp.HFModel`.
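
As an illustrative sketch (assuming you run from `./src` on the `NAACL-2024-code-backup` branch, so that `modules.utils` is importable; the `dspy.OpenAI` usage is also an assumption about the dspy version in use):

```python
import dspy  # provides dspy.dsp.LM implementations
from modules.utils import LLMConfigs  # assumed import path under ./src

llm_configs = LLMConfigs()
# Any dspy.dsp.LM or dspy.dsp.HFModel instance can be passed to the setters.
gpt_4 = dspy.OpenAI(model='gpt-4', max_tokens=500)
llm_configs.set_article_gen_lm(gpt_4)
llm_configs.set_article_polish_lm(gpt_4)
```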

### Automatic Evaluation

For rubric grading, we use the prometheus-13b-v1.0 model.

## Roadmap & Contributions

Our team is actively working on:

1. Human-in-the-Loop Functionalities: Supporting user participation in the knowledge curation process.
2. Information Abstraction: Developing abstractions for curated information to support presentation formats beyond the Wikipedia-style report.

If you have any questions or suggestions, please feel free to open an issue or pull request. We welcome contributions to improve the system and the codebase!

Contact persons: [Yijia Shao](mailto:[email protected]) and [Yucheng Jiang](mailto:[email protected])