Implements the BART: Denoising Sequence-to-Sequence Pre-training paper from scratch in PyTorch, focusing on the abstractive summarization task in Arabic.
The objective is not to create something novel but to gain a deeper understanding of transformer architectures. By applying the concepts from the paper, I aim to grasp both the theoretical and practical aspects in depth.
I used the BBC Arabic dataset for training and evaluation. It contains text-summary pairs, with 32,473 records for training and 4,689 for validation. The dataset is quite small relative to the size of the model.
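For context, a minimal sketch of how such text-summary pairs could be wrapped in a PyTorch `Dataset` is shown below. The CSV layout, the `text`/`summary` column names, and the tokenizer's `encode(...).ids` interface are assumptions for illustration, not the exact pipeline used in this repo.

```python
import csv

import torch
from torch.utils.data import Dataset


class SummaryPairs(Dataset):
    """Hypothetical wrapper around a CSV file of text-summary pairs."""

    def __init__(self, csv_path, tokenizer, max_src_len=512, max_tgt_len=128):
        self.tokenizer = tokenizer
        self.max_src_len = max_src_len
        self.max_tgt_len = max_tgt_len
        with open(csv_path, newline="", encoding="utf-8") as f:
            # Assumed column names; the real dataset may use different ones.
            self.pairs = [(r["text"], r["summary"]) for r in csv.DictReader(f)]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        text, summary = self.pairs[idx]
        src = self.tokenizer.encode(text).ids[: self.max_src_len]
        tgt = self.tokenizer.encode(summary).ids[: self.max_tgt_len]
        return torch.tensor(src, dtype=torch.long), torch.tensor(tgt, dtype=torch.long)
```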
- Paper: BART: Denoising Sequence-to-Sequence Pre-training.
- Type: Transformer.
- Architecture: Encoder-Decoder.
- Size: 174M parameters.
- Language: Arabic.
- Framework: PyTorch.
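To make the details above concrete, here is a minimal sketch of a BART-style encoder-decoder built on `torch.nn.Transformer`. The layer count, hidden size, and vocabulary size are BART-base-like placeholders, not the exact configuration that yields the 174M parameters of this model.

```python
import torch
import torch.nn as nn


class Seq2SeqTransformer(nn.Module):
    """Minimal BART-style model: learned token and position embeddings,
    a standard Transformer encoder-decoder, and a language-model head."""

    def __init__(self, vocab_size, d_model=768, nhead=12, num_layers=6,
                 dim_ff=3072, max_len=1024, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=dim_ff, dropout=dropout, batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def _embed(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        return self.tok_emb(ids) + self.pos_emb(pos)

    def forward(self, src_ids, tgt_ids):
        # Causal mask so each decoder position only attends to earlier ones.
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            tgt_ids.size(1)
        ).to(src_ids.device)
        hidden = self.transformer(self._embed(src_ids), self._embed(tgt_ids),
                                  tgt_mask=tgt_mask)
        return self.lm_head(hidden)


if __name__ == "__main__":
    model = Seq2SeqTransformer(vocab_size=50000)  # placeholder vocabulary size
    print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```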
The model's performance is subpar, mainly due to insufficient data. However, with larger, more suitable datasets, I am confident the model would improve significantly.
| Epoch | Train Loss | Validation Loss | Epoch Time (hours) | Total Training Time (hours) | Device |
|---|---|---|---|---|---|
| 1 | 10.03 | 9.72 | 0.23 | 1.1 | 1 x L40S |
| 2 | 9.61 | 9.44 | 0.22 | 1.1 | 1 x L40S |
| 3 | 9.36 | 9.22 | 0.22 | 1.1 | 1 x L40S |
| 4 | 9.16 | 9.05 | 0.22 | 1.1 | 1 x L40S |
| 5 | 9.01 | 8.92 | 0.22 | 1.1 | 1 x L40S |
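The losses above come from standard teacher-forcing cross-entropy training. The sketch below shows roughly what one such epoch could look like, assuming model and dataloader objects like the earlier sketches and a hypothetical pad token id; it is not the exact training script used for these numbers.

```python
import torch
import torch.nn as nn


def run_epoch(model, loader, optimizer=None, pad_id=1, device="cuda"):
    """One pass over the data; returns the mean cross-entropy.
    Trains when an optimizer is passed, otherwise only evaluates."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
    model.train(optimizer is not None)
    total, steps = 0.0, 0
    with torch.set_grad_enabled(optimizer is not None):
        for src_ids, tgt_ids in loader:
            src_ids, tgt_ids = src_ids.to(device), tgt_ids.to(device)
            # Teacher forcing: decoder input is tgt[:-1], target is tgt[1:].
            logits = model(src_ids, tgt_ids[:, :-1])
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             tgt_ids[:, 1:].reshape(-1))
            if optimizer is not None:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total, steps = total + loss.item(), steps + 1
    return total / max(steps, 1)
```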
The paper used a Byte-Pair Encoding (BPE) tokenizer, but no Arabic-only BPE tokenizer was available, so I built one and uploaded it to Hugging Face as arabic-bpe-tokenizer. The model itself is also available as arab-bart-base-174M, with detailed documentation.
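For reference, training a BPE tokenizer on raw Arabic text with the Hugging Face `tokenizers` library looks roughly like the sketch below. The corpus file, vocabulary size, and special tokens are illustrative placeholders, not the exact settings behind arabic-bpe-tokenizer.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=32000,  # placeholder vocabulary size
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],
)
# "arabic_corpus.txt" is a hypothetical plain-text corpus file.
tokenizer.train(files=["arabic_corpus.txt"], trainer=trainer)
tokenizer.save("arabic-bpe-tokenizer.json")

# The trained tokenizer can later be loaded back with:
# tokenizer = Tokenizer.from_file("arabic-bpe-tokenizer.json")
```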
- Fine-tune the model with a larger dataset.
- Create an inference API and integrate the model with the Hugging Face Transformers library for easier use (a bare-bones decoding sketch is shown after this list).
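Until that integration exists, a simple greedy decoding loop is one way to run the model for summarization. The sketch below assumes a model with the `(src_ids, tgt_ids) -> logits` interface from the earlier sketch and hypothetical BOS/EOS token ids; it is an illustration, not the project's inference API.

```python
import torch


@torch.no_grad()
def greedy_summarize(model, tokenizer, text, bos_id=0, eos_id=2,
                     max_new_tokens=128, device="cuda"):
    """Greedy decoding: repeatedly append the most likely next token
    until EOS is produced or the length limit is reached."""
    model.eval()
    src = torch.tensor([tokenizer.encode(text).ids], device=device)
    out = torch.tensor([[bos_id]], device=device)
    for _ in range(max_new_tokens):
        logits = model(src, out)                       # (1, tgt_len, vocab)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        out = torch.cat([out, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return tokenizer.decode(out[0, 1:].tolist())
```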
This project follows the architecture and configurations from the BART: Denoising Sequence-to-Sequence Pre-training paper by Meta AI, and I am grateful to Lightning.AI for providing free hardware resources for training.