Here are some high-quality synthetic datasets following ideas from:
Textbooks are all you need [1][2]:
1. Filtering for "high educational value" sources.
2. Rephrasing as "Textbooks" (explanations, examples, exercises, etc.).
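The two steps above can be sketched roughly as follows. This is only an illustration, not the papers' pipeline: the `Doc` class, the `edu_score` field, the 0.7 threshold, and the prompt wording are all hypothetical, and a real setup would get the score from a trained quality classifier and send the prompt to a generator model.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    edu_score: float  # hypothetical score in [0, 1] from some quality classifier


def filter_educational(docs, threshold=0.7):
    """Step 1: keep only documents scored as having high educational value."""
    return [d for d in docs if d.edu_score >= threshold]


def textbook_prompt(doc):
    """Step 2: build a (hypothetical) prompt asking a model to rephrase as a textbook."""
    return (
        "Rewrite the following passage as a short textbook section with "
        "an explanation, a worked example, and an exercise.\n\n" + doc.text
    )


docs = [
    Doc("How binary search halves the search range.", 0.9),
    Doc("Click here to win a prize!", 0.1),
]
kept = filter_educational(docs)
prompts = [textbook_prompt(d) for d in kept]
print(prompts[0])
```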
Tiny Stories [3]:
1. Limit vocabulary diversity to simpler words ("understandable by children").
2. Measure grammar, creativity, and consistency for each training sample.
3. Promote diversity by prompting the generator model to produce texts that must include specific words (randomly sampled from a list each time).
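A minimal sketch of step 3, assuming a small hand-written word list: each prompt gets a fresh random sample of required words, so repeated calls push the generator toward different stories. `VOCAB`, `build_prompt`, and the prompt wording are hypothetical, not taken from the paper.

```python
import random

# Hypothetical word list; a real run would use a much larger child-level vocabulary.
VOCAB = ["dog", "rain", "happy", "jump", "blue", "cake", "tree", "sing"]


def build_prompt(rng: random.Random, k: int = 3) -> tuple[str, list[str]]:
    """Sample k distinct required words and embed them in a generation prompt."""
    words = rng.sample(VOCAB, k)
    prompt = (
        "Write a short story using only words a young child would understand. "
        f"The story must include these words: {', '.join(words)}."
    )
    return prompt, words


rng = random.Random(0)  # seeded so a run is reproducible
for _ in range(3):
    prompt, _words = build_prompt(rng)
    print(prompt)
```

Sampling without replacement within a prompt keeps the required words distinct, while re-sampling per prompt is what spreads the generated texts across the vocabulary.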
The magic of IF [4]:
1. Programming structures (if/else conditional statements) promote reasoning skills in LLMs.
---
The Datasets
Recently, these ideas were reproduced.
The resulting datasets are available on Hugging Face.
> Huge shout out to Nam Pham, the author of these datasets and the person behind this massive effort!
Hugging Face profile:
nampdn-ai (Nam Pham): https://huggingface.co/nampdn-ai
---
The Tiny Series
Tiny Textbooks:
420k "things of internet" synthetic textbooks.
- https://huggingface.co/datasets/nampdn-ai/tiny-textbooks…
Tiny Webtext:
6 GB (4.5M records) of diverse webtext, "enriched with critical thinking methods to make unbiased English dataset."
- https://huggingface.co/datasets/nampdn-ai/tiny-webtext…
Tiny Lessons:
Various lessons about "things of internet" augmented in a bite-sized textbook Markdown format.
- https://huggingface.co/datasets/nampdn-ai/tiny-lessons…
Tiny Codes:
1.6 million short and clear code snippets to help LLMs learn how to reason.
- https://huggingface.co/datasets/nampdn-ai/tiny-codes…
Multilingual (Tiny-bridgedict):
A dataset that links and transfers knowledge between English, Vietnamese, and Chinese.
- https://huggingface.co/datasets/nampdn-ai/tiny-bridgedict…
---
[1] Textbooks Are All You Need: https://arxiv.org/abs/2306.11644
[2] Textbooks Are All You Need II: phi-1.5 technical report: https://arxiv.org/abs/2309.05463
[3] TinyStories: How Small Can Language Models Be and Still Speak Coherent English?: https://arxiv.org/abs/2305.07759
[4] The magic of IF: https://aclanthology.org/2023.findings-acl.574.pdf…