Awesome lists about all kinds of LLM-related datasets
- MATH Dataset: A collection of 12,500 challenging competition mathematics problems with step-by-step solutions, useful for training models in answer derivation and explanation generation
- GSM8k Dataset: A collection of 8,500 grade school math problems. This dataset tests the multi-step reasoning abilities of models, highlighting their limitations despite the simplicity of the problems
- MathQA: A large-scale dataset of math word problems
- AQUA-RAT: An algebraic word problem dataset, with multiple-choice questions annotated with rationales
- Magicoder
- Salesforce/xlam-function-calling-60k: APIGen Function-Calling Datasets
- ImageInWords: Unlocking Hyper-Detailed Image Descriptions
- Mendeley digital knee X-ray images
- PAD-UFES-20
- UltraMedical: Building Specialized Generalists in Biomedicine
- MS MARCO Web Search: A large-scale information-rich web dataset, featuring millions of real clicked query-document labels