BigScience
A one-year research workshop on large multilingual
datasets and large language models
— original slides by Suzana Ilić from HuggingFace @suzatweet —
Gérard DUPONT
Research scientist/engineer
working on NLP, IR, ML, RL and
large scale data processing
@ggdupont
Many recent developments in NLP stem from
training larger language models on larger
datasets with compute resources typically
only available in industry.
Brown et al. (2020): Language Models are Few-Shot Learners
https://arxiv.org/abs/2005.14165
https://hellofuture.orange.com/en/the-gpt-3-language-model-revolution-or-evolution/
https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/
https://lair.lighton.ai/akronomicon/
Issues and questions
● Research:
○ Models are not designed as general research tools (no access to training data, private
models, research questions asked only after the model is trained, anglo-centric models)
○ Academic researchers find it difficult to get involved
○ The research teams building them are small and lack diversity of fields
● Environmental:
○ Similar models are trained in parallel in private settings => duplicated energy consumption
○ Carbon footprint not documented/taken into account
● Ethical and societal:
○ Shortcomings in the text corpora used to train these models, ranging from
non-representativeness of populations to a predominance of potentially harmful
stereotypes or the inclusion of personally-identifying information
○ Ethical/bias/usage questions are usually asked a posteriori
The BigScience approach
The Large Hadron Collider is a particle physics research tool which
- has involved 10,000 researchers
- from 100 countries
- led to the discovery of 59 hadrons
- led to the publication of more than 2,800 papers (😱)
In many scientific fields (epidemiology, space, fusion…), large-scale and worldwide research
collaborations create tools useful for the entire research community, like the LHC, ITER,
ISS…
Isn’t it time to build similar large, diverse, open research collaborations in AI/NLP as well?
Large scale public compute infrastructure exists
Jean Zay supercomputer at IDRIS (South of Paris, France)
● Cumulative peak performance of 28 Pflop/s, with a total of 2,696 Nvidia V100 GPUs
● Omni-Path interconnection network at 100 Gb/s: 4 links per converged node
● Parallel storage with a capacity of 2.2 PB on SSDs (GridScaler GS18K SSD)
Short history
- 🐣 Early 2021: Discussions between Thomas Wolf (HuggingFace), Stéphane
Requena (GENCI) and Pierre-François Lavallée (IDRIS)
- Very quickly: HF + the French academic and industrial AI and NLP
research communities joined the discussion
- 📝 February 2021: Grant application for 5 million GPU hours
- 🌐 Following the grant submission
- open/extend to international research community
- organization of the project with the structure of a research workshop
- 🚀 19/04 & 28/04: Grant accepted - Kickoff event - officially started
Concept
- Gather a large research community:
- consider in advance the research questions that would be interesting to answer
- ask questions as much as possible ‘a priori’ rather than ‘a posteriori’
- reflect on and prepare the tools needed to answer these questions
- Create and share research artifacts with the scientific community:
- a very large multilingual corpus constituted in a way that is responsible, diverse, and mindful of
ethical and legal issues
- a very large multilingual language model exhibiting non-trivial zero-shot behaviors, released in a
way that makes it accessible to researchers
- code tools associated with these artifacts for simple use
- Find and share processes, documents and infrastructures favoring the
replication of such scientific collaborative efforts in the future
Where are we now?
The largest AI research collaboration to date. More than 800 researchers from 60
different countries and more than 250 institutions have joined BigScience.
A mega-collaboration
Building and investigating the
model from all angles: bias,
social impact, ethics,
capabilities, limitations and
potential improvements,
specific domain performances,
carbon impact, general
AI/cognitive research
landscape
🌕🚀 Data Working Group
A Large Multilingual Dataset for a Large Multilingual Model
● Data Governance and Archival Strategies
● Defining a management and ownership structure for the dataset
● Scoping out legal concerns and societal impact of data choices
● Privacy
● Ethical and Legal Scholarship
● Data Sourcing and Representativeness
● Defining a set of languages and text sources, as well as
frameworks for representativeness / diversity
● Exploring different modes of data collection from web crawling to
participatory methods and collaboration with existing data orgs
● Data Tooling
● Developing tools to gather and process text from the identified
sources to be both easy to use at training time and respectful of
the data subjects’ rights
🛠 Data Tooling
https://github.com/bigscience-workshop/data-tooling
Data sources
- Books: Gutenberg
- Web-crawled data: OSCAR
- Spoken text: OpenSubtitles, Europarl, INA collection
Pipeline stages
- 📥 Ingest: We will train the final model on many distinct data sources. For the
proof of concept we include spoken text, books and web-crawled data.
- 🔖 Index, Interconnect & Persist: We build a specific connector between Hugging Face
datasets and Elasticsearch, SQL and memmap backends to simplify indexing and usage.
- ⚙ Augment & Transform (e.g. a document classifier): We need the ability to dynamically
run classifiers on the corpus and add features/columns. This information may be used when
exporting a subset for training. It permits deduplication, language detection, PII masking
and metadata detection.
- 📤 Filter & Export: We export a dataset subsample in the corresponding jsonl format, with
one file per document. This output is used for the final training on the Jean Zay supercomputer.
- Visualize & Explore (dashboards): Allows the exploration of a dataset and of the augmented
features, to better understand the samples and biases in the data.
- Govern (OAuth & logs): Allows us to fulfill our ethical and legal duties.
Icons made by Smashicons, Kiranshastry, Pixel perfect, Freepik from Flaticon
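As a rough illustration of the augment and export steps described above, here is a minimal sketch using only the Python standard library. It is not the data-tooling repository's code: the function names, the `sha1` and `has_pii` columns, the email-only PII rule and the sample records are all assumptions made for the example.

```python
import hashlib
import json
import re

# Assumption: PII masking is reduced here to email addresses only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def augment(record):
    """Return a copy of the record with extra feature columns (hypothetical names)."""
    text = record["text"]
    out = dict(record)
    # Fingerprint of the normalized text, used later for exact deduplication.
    out["sha1"] = hashlib.sha1(text.strip().lower().encode("utf-8")).hexdigest()
    # Flag and mask email addresses.
    out["has_pii"] = bool(EMAIL_RE.search(text))
    out["text"] = EMAIL_RE.sub("<EMAIL>", text)
    return out


def filter_and_export(records):
    """Drop duplicates (same fingerprint) and yield one JSON line per kept record."""
    seen = set()
    for rec in map(augment, records):
        if rec["sha1"] in seen:
            continue
        seen.add(rec["sha1"])
        yield json.dumps(rec, ensure_ascii=False)


# Made-up sample records standing in for the books and web sources.
corpus = [
    {"source": "books", "text": "Call me Ishmael."},
    {"source": "web", "text": "call me ishmael. "},
    {"source": "web", "text": "Write to jane.doe@example.org today."},
]
lines = list(filter_and_export(corpus))
```

Keeping the added features as columns next to the text, rather than filtering immediately, is what lets a later export step choose its own subset.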
🌕🚀 Working Group on “Engineering/scaling”
A working group discussing
● the technical challenges of training at scale on several hundred GPUs, and
● how to make the best use of the (very large) compute budget we have
The compute budget is given in hours of GPU usage (5 million GPU hours).
Depending on the scaling efficiency (how much idle time each GPU has), (1) the overall duration of the
training and (2) the FLOPs actually performed can vary very significantly.
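The effect of scaling efficiency on a fixed GPU-hour budget can be worked through with a few lines of arithmetic. The sketch below uses the 5 million GPU hours quoted above; the GPU count of 400, the utilization values, and the per-GPU peak (obtained by dividing the quoted 28 Pflop/s by 2,696 GPUs, which may overstate or understate a single V100) are assumptions chosen only for illustration.

```python
def wall_clock_days(budget_gpu_hours, n_gpus):
    """Days needed to spend the whole budget when n_gpus run continuously."""
    return budget_gpu_hours / n_gpus / 24


def useful_flop(budget_gpu_hours, peak_flops_per_gpu, utilization):
    """Floating-point operations actually performed over the whole budget.

    utilization is the fraction of peak throughput really achieved (0 to 1);
    idle time spent waiting on communication or I/O lowers it.
    """
    seconds = budget_gpu_hours * 3600
    return seconds * peak_flops_per_gpu * utilization


BUDGET_GPU_HOURS = 5_000_000     # from the slide
N_GPUS = 400                     # assumption: "several hundred GPUs"
PEAK_PER_GPU = 28e15 / 2696      # assumption: quoted machine peak / GPU count

days = wall_clock_days(BUDGET_GPU_HOURS, N_GPUS)
low = useful_flop(BUDGET_GPU_HOURS, PEAK_PER_GPU, utilization=0.25)
high = useful_flop(BUDGET_GPU_HOURS, PEAK_PER_GPU, utilization=0.50)

print(f"wall-clock time on {N_GPUS} GPUs: {days:.0f} days")
print(f"useful compute at 25% utilization: {low:.2e} FLOP")
print(f"useful compute at 50% utilization: {high:.2e} FLOP")
```

With the budget fixed, doubling utilization doubles the compute that reaches the model, which is why the group treats efficiency as a first-order concern.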
This Working Group will collaborate with the modeling team on the one hand and with the scaling teams
from NVIDIA/Microsoft/Facebook on the other, to ensure that the model is implemented in the most efficient way.
Note that participating in this working group does not imply that you will have direct access to the supercomputer, since
there are additional (quite strong) national restrictions on access to this machine (see the section on access to
compute for details). It does mean, however, that you will participate in the discussion on these aspects.
🌕🚀 First Modeling paper
Multitask Prompted Training Enables Zero-Shot Task
Generalization by Sanh et al. (2021)
T0 shows zero-shot task generalization on English natural
language prompts, outperforming GPT-3 on many tasks, while
being 16x smaller!
To create T0, we fine-tuned T5 on a multi-task mixture of
prompted datasets from Promptsource. When evaluated on
zero-shot tasks, we found that it matched or exceeded GPT-3's
performance on 9 of 11 datasets.
Model: https://huggingface.co/bigscience/T0pp
Repo: https://github.com/bigscience-workshop/promptsource
Paper: https://arxiv.org/abs/2110.08207
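To make the idea of a "prompted dataset" concrete, here is a minimal sketch of how one labeled example can be turned into several natural-language input/target pairs. It does not use the Promptsource API: the templates, the answer choices, the function name and the sample example are all hypothetical and written in plain Python.

```python
# Hypothetical templates for a natural language inference task.
NLI_TEMPLATES = [
    'Suppose "{premise}" Can we infer that "{hypothesis}"? Yes, no, or maybe?',
    '{premise}\nQuestion: does this imply that "{hypothesis}"? Yes, no, or maybe?',
]
# Hypothetical mapping from integer label to the text the model should produce.
NLI_CHOICES = ["Yes", "Maybe", "No"]


def to_prompted_pair(template, example, answer_choices):
    """Render one (input text, target text) pair from a labeled example."""
    prompt = template.format(**example)
    target = answer_choices[example["label"]]
    return prompt, target


# Made-up example; label 0 means the hypothesis follows from the premise.
example = {
    "premise": "A dog is running in the park.",
    "hypothesis": "An animal is outside.",
    "label": 0,
}

# The same example yields one training pair per template.
pairs = [to_prompted_pair(t, example, NLI_CHOICES) for t in NLI_TEMPLATES]
for prompt, target in pairs:
    print(prompt, "->", target)
```

A multi-task mixture is then the union of such pairs over many datasets and templates, which a text-to-text model can be fine-tuned on directly.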
What’s coming up next?
● Finished training the first test model, a 13B English
decoder-only model trained to investigate instabilities at
large scale; currently training a second model and planning
the first large-scale multilingual model
● Several papers submitted
● Several hackathons (ongoing and upcoming)
● Working towards the main model training
To learn more about the effort and join or follow:
● Website: bigscience.huggingface.co
● Twitter: @BigScienceW
● YouTube: BigScienceResearchWorkshop
Thank you!

 

Tds — big science dec 2021

  • 1. BigScience A one-year research workshop on large multilingual datasets and large language models — original slides by Suzana Ilić from HuggingFace @suzatweet —
  • 2. BigScience A one-year research workshop on large multilingual datasets and large language models — original slides by Suzana Ilić from HuggingFace @suzatweet — Gérard DUPONT Research scientist/engineer working on NLP, IR, ML, RL and large scale data processing @ggdupont
  • 3. Many recent developments in NLP stem from training larger language models on larger datasets with compute resources typically only available in industry. Brown (2020): Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165 https://hellofuture.orange.com/en/the-gpt-3-language-model-revolution-or-evolution/
  • 5. Issues and questions ● Research: ○ Models not designed as general research tools (lack of access to training data, private models, research questions asked after the model is trained, anglo-centric models) ○ Difficult involvement of academic researchers ○ Lack of field diversity in the research teams building them (limited size of the teams) ● Environmental: ○ Training parallel models in private settings => duplication of energy requirements ○ Carbon footprint not documented/taken into account ● Ethical and societal: ○ Shortcomings in the text corpora used to train these models, ranging from non-representativeness of populations to a predominance of potentially harmful stereotypes or the inclusion of personally-identifying information ○ Ethical/bias/usage questions are usually asked a-posteriori
  • 6. The BigScience approach The Large Hadron Collider is a particle physics research tool which - has involved 10,000 researchers - from 100 countries - led to the discovery of 59 hadrons - led to the publication of more than 2,800 papers (😱) In many scientific fields (epidemiology, space, fusion…), large-scale and worldwide research collaborations create tools useful for the entire research community, like the LHC, ITER, ISS… Isn’t it time to build similar large, diverse, open research collaborations in AI/NLP as well?
  • 7. Large-scale public compute infrastructure exists Jean Zay supercomputer at IDRIS (south of Paris, France) ● Cumulative peak performance of 28 Pflop/s with a total of 2696 Nvidia V100 GPUs ● Omni-Path interconnection network at 100 Gb/s: 4 links per converged node ● Parallel storage device with a capacity of 2.2 PB on SSD disks (GridScaler GS18K SSD)
  • 8. Short history - 🐣 Early 2021: Discussions between Thomas Wolf (HuggingFace), Stéphane Requena (GENCI) and Pierre-François Lavallée (IDRIS) - Very quickly: HF + the French academic and industrial AI and NLP research communities joined the discussion - 📝 February 2021: Grant application for 5 million GPU hours - 🌐 Following the grant submission - open/extend to international research community - organization of the project with the structure of a research workshop - 🚀 19/04 & 28/04: Grant accepted - Kickoff event - officially started
  • 9. Concept - Gather a large research community: - consider in advance the research questions that would be interesting to answer - ask questions as much as possible ‘a-priori’ rather than ‘a-posteriori’ - reflect on and prepare the tools needed to answer these questions - Create and share research artifacts with the scientific community: - a very large multilingual corpus constituted in a way that is responsible, diverse, and mindful of ethical and legal issues - a very large multilingual language model exhibiting non-trivial zero-shot behaviors, shared in a way that makes it accessible to researchers - code tools associated with these artifacts for simple use - Find and share processes, documents and infrastructures favoring the replication of such scientific collaborative efforts in the future
  • 10. Where are we now? The largest AI research collaboration to date. More than 800 researchers from 60 different countries and more than 250 institutions have joined BigScience.
  • 11. A mega-collaboration Building and investigating the model from all angles: bias, social impact, ethics, capabilities, limitations and potential improvements, specific domain performances, carbon impact, general AI/cognitive research landscape
  • 12. 🌕🚀 Data Working Group A Large Multilingual Dataset for a Large Multilingual Model ● Data Governance and Archival Strategies ● Defining a management and ownership structure for the dataset ● Scoping out legal concerns and societal impact of data choices ● Privacy ● Ethical and Legal Scholarship ● Data Sourcing and Representativeness ● Defining a set of languages and text sources, as well as frameworks for representativeness / diversity ● Exploring different modes of data collection from web crawling to participatory methods and collaboration with existing data orgs ● Data Tooling ● Developing tools to gather and process text from the identified sources to be both easy to use at training time and respectful of the data subjects’ rights
  • 13. 🛠 Data Tooling https://github.com/bigscience-workshop/data-tooling Sources: Books (Gutenberg); Web crawled data (OSCAR); Spoken text (OpenSubtitles, Europarl, INA collection). 📥 Ingest: We will train the final model on many distinct data sources. For the proof-of-concept we include spoken text, books and web crawled data. 🔖 Index, Interconnect & Persist: We build a specific connector between Hugging Face datasets and Elasticsearch, SQL, and memmap backends to simplify indexing and usage. ⚙ Augment & Transform (document classifier): We need the ability to dynamically run classifiers on the corpus and add features/columns. This information may be used when exporting a subset for training. Permits deduping, detecting language, masking PII, and detecting metadata. 📤 Filter & Export: We export a dataset subsample in the corresponding jsonl format, with one file per document. This output is used for final training on the Jean Zay supercomputer. Visualize & Explore (dashboards): Allows the exploration of a dataset and the augmented features to better understand the samples and biases in the data. Govern (OAuth & logs): Allows us to fulfill ethical and legal duties. Icons made by Smashicons, Kiranshastry, Pixel perfect, Freepik from Flaticon
  • 14. 🌕🚀 Working Group on “Engineering/scaling” A working group discussing ● the technical challenges of training at scale on several hundred GPUs, and ● how to make the best use of the (very large) compute budget we have The compute budget is given in hours of GPU usage (5 million GPU hours). Depending on the scaling efficiency (how much idle time each GPU has), the overall (1) duration of the training and (2) actual FLOPS can vary very significantly. This Working Group will collaborate with the modeling team on the one hand and with the scaling teams from NVIDIA/Microsoft/Facebook on the other to ensure that the model is implemented in the most efficient way. Note that participating in this working group does not imply that you will have direct access to the supercomputer, since there are additional (quite strong) national restrictions on access to this machine (see some details in the section on access to compute here). It does mean, however, that you will participate in the discussion on these aspects.
  • 15. 🌕🚀 Working Group on “Engineering/scaling”
  • 16. 🌕🚀 First Modeling paper Multitask Prompted Training Enables Zero-Shot Task Generalization by Sanh et al. (2021) T0 shows zero-shot task generalization on English natural language prompts, outperforming GPT-3 on many tasks, while being 16x smaller! To create T0, we fine-tuned T5 on a multi-task mixture of prompted datasets from Promptsource. When evaluated on zero-shot tasks, we found that it matched or exceeded GPT-3's performance on 9 of 11 datasets. Model: https://huggingface.co/bigscience/T0pp Repo: https://github.com/bigscience-workshop/promptsource Paper: https://arxiv.org/abs/2110.08207
  • 17.
  • 18. What’s coming up next? ● Finished training the first test model: a 13B English decoder-only model trained to investigate instabilities at large scale, currently training a second model and planning the first large-scale multilingual model ● Several papers submitted ● Several hackathons (ongoing and upcoming) ● Working towards the main model training
  • 19. To learn more about the effort and join or follow: ● Website: bigscience.huggingface.co ● Twitter: @BigScienceW ● YouTube: BigScienceResearchWorkshop Thank you!
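
Slide 13 mentions deduping, masking PII and exporting a subset in jsonl format. The sketch below illustrates those three steps in plain Python. It is not the code from the data-tooling repository: the function names, the email-only PII pattern and the `<EMAIL>` placeholder are all assumptions made for illustration. The export writes one JSON object per line, without trying to reproduce how the project splits files.

```python
import hashlib
import json
import re

# Hypothetical, deliberately simple pattern: only plain email addresses count as PII here.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_pii(text):
    """Replace every email address with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)


def dedupe(docs):
    """Yield documents whose normalized text has not been seen before (exact dedup)."""
    seen = set()
    for doc in docs:
        # Normalize case and whitespace so trivial variants hash to the same key.
        normalized = " ".join(doc["text"].lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        yield doc


def to_jsonl(docs):
    """Serialize documents as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(doc, ensure_ascii=False) for doc in docs)


def process(docs):
    """Dedupe first, then mask PII, then export."""
    cleaned = ({**doc, "text": mask_pii(doc["text"])} for doc in dedupe(docs))
    return to_jsonl(cleaned)


if __name__ == "__main__":
    sample = [
        {"id": 1, "text": "Contact me at jane@example.org"},
        {"id": 2, "text": "contact  me at JANE@example.org"},
        {"id": 3, "text": "Bonjour"},
    ]
    print(process(sample))
```

Deduplication runs before masking so that the hash is computed on the original text. A real pipeline would likely need near-duplicate detection and a much broader notion of PII than this sketch covers.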
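
Slide 14 states that the budget is expressed in GPU hours and that scaling efficiency changes how much useful compute comes out of it. The arithmetic can be sketched as follows. Only the 5 million GPU hours comes from the slides. The GPU count, the per-GPU peak throughput and the utilization values are illustrative assumptions, not hardware specifications or project figures.

```python
SECONDS_PER_HOUR = 3600


def wall_clock_days(budget_gpu_hours, num_gpus):
    """Days needed to spend the whole budget if num_gpus run continuously."""
    return budget_gpu_hours / num_gpus / 24


def useful_flop(budget_gpu_hours, peak_flops_per_gpu, utilization):
    """Total floating-point operations actually performed within the budget.

    utilization is the fraction of peak throughput achieved (0 < utilization <= 1);
    idle GPUs and communication stalls lower it.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return budget_gpu_hours * SECONDS_PER_HOUR * peak_flops_per_gpu * utilization


def gpu_hours_for_target(target_flop, peak_flops_per_gpu, utilization):
    """GPU hours required to reach a target amount of compute."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return target_flop / (peak_flops_per_gpu * utilization * SECONDS_PER_HOUR)


if __name__ == "__main__":
    budget = 5_000_000   # GPU hours, from the grant described in the slides
    gpus = 400           # assumed: "several hundred GPUs"
    peak = 100e12        # assumed illustrative peak, in FLOP/s per GPU
    print(f"{wall_clock_days(budget, gpus):.1f} days to spend the budget")
    for util in (0.25, 0.5):
        print(f"utilization {util:.0%}: {useful_flop(budget, peak, util):.2e} FLOP")
```

Under these assumptions, halving utilization halves the useful compute for the same budget, or equivalently doubles the GPU hours needed to reach a fixed compute target.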