BigScience is a one-year research workshop involving over 800 researchers from 60 countries to build and study very large multilingual language models and datasets. It was granted 5 million GPU hours on the Jean Zay supercomputer in France. The workshop aims to advance AI/NLP research by creating shared models and data as well as tools for researchers. Several working groups are studying issues like bias, scaling, and engineering challenges of training such large models. The first model, T0, showed strong zero-shot performance. Upcoming work includes further model training and papers.
TDS — BigScience, December 2021
1. BigScience
A one-year research workshop on large multilingual
datasets and large language models
— original slides by Suzana Ilić from HuggingFace @suzatweet —
2. BigScience
A one-year research workshop on large multilingual
datasets and large language models
— original slides by Suzana Ilić from HuggingFace @suzatweet —
Gérard DUPONT
Research scientist/engineer
working on NLP, IR, ML, RL and
large scale data processing
@ggdupont
3. Many recent developments in NLP stem from
training larger language models on larger
datasets with compute resources typically
only available in industry.
Brown et al. (2020): Language Models are Few-Shot Learners
https://arxiv.org/abs/2005.14165
https://hellofuture.orange.com/en/the-gpt-3-language-model-revolution-or-evolution/
5. Issues and questions
● Research:
○ Models not designed as general research tools (no access to training data, private
models, research questions asked only after the model is trained, Anglo-centric models)
○ Difficult for academic researchers to get involved
○ Lack of disciplinary diversity in the research teams building them (limited team sizes)
● Environmental:
○ Training similar models in parallel in private settings => duplicated energy requirements
○ Carbon footprint not documented/taken into account
● Ethical and societal:
○ Shortcomings in the text corpora used to train these models, ranging from
non-representativeness of populations to a predominance of potentially harmful
stereotypes or the inclusion of personally-identifying information
○ Ethical/bias/usage questions are usually asked a posteriori
6. The BigScience approach
The Large Hadron Collider is a particle physics research tool which
- has involved 10,000 researchers
- from 100 countries
- led to the discovery of 59 hadrons
- and the publication of more than 2,800 papers (😱)
In many scientific fields (epidemiology, space, fusion…), large-scale and worldwide research
collaborations create tools useful for the entire research community, like the LHC, ITER,
ISS…
Isn’t it time to build similar large, diverse, open research collaborations in AI/NLP as well?
7. Large scale public compute infrastructure exists
Jean Zay supercomputer at IDRIS (South of Paris, France)
● Cumulative peak performance of 28 Pflop/s with a total of 2,696 Nvidia V100 GPUs
● Omni-Path interconnection network at 100 Gb/s: 4 links per converged node
● Parallel storage device with a capacity of 2.2 PB SSD disks (GridScaler GS18K SSD)
8. Short history
- 🐣 Early 2021: Discussions between Thomas Wolf (HuggingFace), Stéphane
Requena (GENCI) and Pierre-François Lavallée (IDRIS)
- Very quickly: HF + the French academic and industrial AI and NLP
research communities joined the discussion
- 📝 February 2021: Grant application for 5 million GPU hours
- 🌐 Following the grant submission
- open/extend to international research community
- organization of the project with the structure of a research workshop
- 🚀 19/04 & 28/04/2021: Grant accepted; kickoff event; project officially started
9. Concept
- Gather a large research community:
- consider in advance the research questions that would be interesting to answer
- ask as many questions as possible 'a priori' rather than 'a posteriori'
- reflect on and prepare the tools needed to answer these questions
- Create and share research artifacts with the scientific community:
- a very large multilingual corpus constituted in a way that is responsible, diverse, and mindful of
ethical and legal issues
- a very large multilingual language model exhibiting non-trivial zero-shot behaviors, shared
in a way that makes it accessible to researchers
- code tools associated with these artifacts for simple use
- Find and share processes, documents and infrastructures favoring the
replication of such scientific collaborative efforts in the future
10. Where are we now?
The largest AI research collaboration to date. More than 800 researchers from 60
different countries and more than 250 institutions have joined BigScience.
11. A mega-collaboration
Building and investigating the model from all angles: bias, social impact, ethics,
capabilities, limitations and potential improvements, domain-specific performance,
carbon impact, and the general AI/cognitive research landscape
12. 🌕🚀 Data Working Group
A Large Multilingual Dataset for a Large Multilingual Model
● Data Governance and Archival Strategies
○ Defining a management and ownership structure for the dataset
○ Scoping out legal concerns and societal impact of data choices
● Privacy
● Ethical and Legal Scholarship
● Data Sourcing and Representativeness
○ Defining a set of languages and text sources, as well as frameworks for
representativeness / diversity
○ Exploring different modes of data collection, from web crawling to participatory
methods and collaboration with existing data orgs
● Data Tooling
○ Developing tools to gather and process text from the identified sources, to be both
easy to use at training time and respectful of the data subjects' rights
13. 🛠 Data Tooling
(Diagram: data sources → 📥 Ingest → 🔖 Index, Interconnect & Persist → ⚙ Augment & Transform → 📤 Filter & Export, alongside Visualize & Explore and Govern.)
● Data sources: spoken text (OpenSubtitles, Europarl, INA collection), books (Project Gutenberg)
and web-crawled data (OSCAR). We will train the final model on many distinct data sources;
for the proof-of-concept we include spoken text, books and web-crawled data.
● 📥 Ingest / 🔖 Index, Interconnect & Persist: we build a specific connector between
Hugging Face datasets and Elasticsearch, SQL and Memmap backends to simplify indexing and usage.
● ⚙ Augment & Transform: we need the ability to dynamically run classifiers (e.g. a document
classifier) on the corpus and add features/columns. This information may be used when exporting
a subset for training, and permits deduplication, language detection, PII masking and metadata
detection.
● 📤 Filter & Export: we export a dataset subsample in the corresponding JSONL format, with one
file per document. This output is used for the final training on the Jean Zay supercomputer.
(A minimal code sketch of these augment/filter/export steps follows below.)
● Visualize & Explore (dashboards): allow the exploration of a dataset and its augmented
features to better understand the samples and biases in the data.
● Govern (OAuth & logs): allow us to fulfill our ethical and legal duties.
https://github.com/bigscience-workshop/data-tooling
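To make the augment/filter/export steps above concrete, here is a minimal sketch using the Hugging Face `datasets` library. It is not the actual data-tooling code (that lives in the repository linked above); the in-memory toy corpus, the length-based "classifier" and the output file name are illustrative assumptions.

```python
from datasets import Dataset  # pip install datasets

# Toy in-memory corpus standing in for the ingested sources (OSCAR web crawl,
# Gutenberg books, OpenSubtitles...), which the real pipeline loads via
# datasets.load_dataset(...) or the project's own ingestion tools.
ds = Dataset.from_dict({
    "id": [0, 1, 2],
    "text": [
        "A very short document.",
        "A longer web-crawled document. " * 20,
        "A longer book excerpt. " * 20,
    ],
})

# "Augment & Transform": dynamically run a classifier and add features/columns.
# Here a toy length check; the real pipeline adds language ID, PII flags,
# dedup hashes, document-classifier scores, etc.
def add_features(example):
    example["n_chars"] = len(example["text"])
    example["too_short"] = example["n_chars"] < 100
    return example

ds = ds.map(add_features)

# "Filter & Export": select a subset using the added columns and write it out
# as JSON lines for downstream training.
subset = ds.filter(lambda ex: not ex["too_short"])
subset.to_json("subset.jsonl")  # writes one JSON record per line
```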
14. 🌕🚀 Working Group on “Engineering/scaling”
A working group discussing
● the technical challenges of training at scale on several hundred GPUs, and
● how to make the best use of the (very large) compute budget we have
The compute budget is given in hours of GPU usage (5 million GPU hours).
Depending on the scaling efficiency (how much idle time each GPU incurs), both (1) the overall
duration of the training and (2) the actual FLOPS achieved can vary very significantly
(see the back-of-the-envelope sketch after this slide).
This working group will collaborate with the modeling team on the one hand, and with the scaling
teams from NVIDIA/Microsoft/Facebook on the other, to ensure that the model is implemented in the
most efficient way.
Note that participating in this working group does not imply that you will have direct access to the supercomputer,
since there are additional (quite strong) national restrictions on access to this machine (see some details in the
section on access to compute here). It does mean, however, that you will participate in the discussions on these aspects.
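As a rough illustration of why scaling efficiency matters for such a budget, here is a back-of-the-envelope sketch. All numbers (per-GPU peak throughput, efficiency levels) are illustrative assumptions, not BigScience or Jean Zay figures.

```python
# Back-of-the-envelope: total useful compute delivered by a 5M GPU-hour budget
# at different scaling efficiencies. The peak throughput is an assumed round
# number for a modern data-center GPU in mixed precision, not a measured value.
GPU_HOURS = 5_000_000
PEAK_FLOPS_PER_GPU = 100e12   # assumed ~100 TFLOP/s peak per GPU
SECONDS_PER_HOUR = 3600

for efficiency in (0.2, 0.35, 0.5):  # fraction of peak actually sustained (idle time, communication, ...)
    total_flops = GPU_HOURS * SECONDS_PER_HOUR * PEAK_FLOPS_PER_GPU * efficiency
    print(f"efficiency {efficiency:.0%}: ~{total_flops:.1e} FLOPs of useful compute")

# Going from 20% to 50% efficiency multiplies the usable compute by 2.5x, which
# directly changes how large a model (and how many tokens) fit in the budget.
```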
16. 🌕🚀 First Modeling paper
Multitask Prompted Training Enables Zero-Shot Task
Generalization by Sanh et al. (2021)
T0 shows zero-shot task generalization on English natural
language prompts, outperforming GPT-3 on many tasks, while
being 16x smaller!
To create T0, we fine-tuned T5 on a multi-task mixture of
prompted datasets from Promptsource. When evaluated on
zero-shot tasks, we found that it matched or exceeded GPT-3's
performance on 9 of 11 datasets.
Model: https://huggingface.co/bigscience/T0pp
Repo: https://github.com/bigscience-workshop/promptsource
Paper: https://arxiv.org/abs/2110.08207
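For readers who want to try the released checkpoint, a minimal usage sketch with the 🤗 Transformers seq2seq API is shown below. The prompt is just an illustrative example, and loading the ~11B-parameter model requires substantial GPU or CPU memory.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# T0pp is a T5-style encoder-decoder model, hence the seq2seq classes.
tokenizer = AutoTokenizer.from_pretrained("bigscience/T0pp")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp")  # ~11B parameters

# Zero-shot: the task is expressed as a natural-language prompt, no fine-tuning.
prompt = ("Is this review positive or negative? "
          "Review: this is the best cast iron skillet you will ever buy")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(inputs.input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```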
18. What’s coming up next?
● Finished training the first test model: a 13B-parameter English
decoder-only model trained to investigate instabilities at large scale;
a second test model is currently training, and the first large-scale
multilingual model is being planned
● Several papers submitted
● Several hackathons (ongoing and upcoming)
● Working towards the main model training
19. To learn more about the effort and join or follow:
● Website: bigscience.huggingface.co
● Twitter: @BigScienceW
● YouTube: BigScienceResearchWorkshop
Thank you!