#TechSEOBoost | @CatalystSEM
THANK YOU TO OUR SPONSORS
Generating Qualitative Content with GPT-2
in All Languages
Vincent Terrasi, OnCrawl
Vincent Terrasi | @vincentterrasi | #TechSEOBoost
SEO Use-cases
• Image captioning with Pythia
• Visual question & Answering
• Abstractive Summarization with BERTsum
• Full Article generation with GPT-2
Text Spinners are bad
Google, What is bad generated content in 2016?
• Text translated by an automated tool without human review or curation before
publishing
• Text generated through automated processes, such as Markov chains
• Text generated using automated synonymizing or obfuscation techniques
• Text generated from scraping Atom/RSS feeds or search results
• Stitching or combining content from different web pages without adding sufficient value
https://web.archive.org/web/20160222004700/https://support.google.com/webmasters/answer/2721306?hl=en
Google, What is bad generated content in 2019?
• Text that makes no sense to the reader but which may contain search keywords.
• Text translated by an automated tool without human review or curation before
publishing
• Text generated through automated processes, such as Markov chains
• Text generated using automated synonymizing or obfuscation techniques
• Text generated from scraping Atom/RSS feeds or search results
• Stitching or combining content from different web pages without adding sufficient value
https://support.google.com/webmasters/answer/2721306?hl=en
Surprise!
2019, the best year for
using AI for text
generation
GPT-2, BERT, ELMo, ULMFiT (J. Howard)
Transformer and Attention Model
Patterns for Attention Model
Pattern 1: Attention to next word
Pattern 2: Attention to previous word
Pattern 3: Attention to identical/related words
Pattern 4: Attention to identical/related words in other sentence
Pattern 5: Attention to other words predictive (next word) of word
Pattern 6: Attention to delimiter tokens
State of the Art
⚫ All models exist for English
⚫ Documentation is good
⚫ So we just need to translate
There are a lot of biases:
◦ Small Talk
◦ Idioms
◦ Local Named Entities
◦ Rare Verbs
◦ Uncommon Tenses
◦ Gender rules
How to scale?
Create your own model
in your language
Objectives
Use only qualitative methods to improve
the quality of content created by humans
Extract the knowledge learned by the deep
learning model.
Why have other attempts
failed?
Quantitative:
You need a lot of data: more than 100,000
texts of at least 500 words each
Qualitative:
You need high-quality texts
GPT-2
Recipe
Step 1: Training the model
Training from scratch, without a pretrained model, requires significant computing power.
You need GPUs! It took 3 days to get my first result with one GPU.
Step 2: Generating the compressed training dataset - 1/2
GPT-2 learns from text in Byte Pair Encoding (BPE) format, which is a simple form of
data compression.
Why?
- Predicting the next character is too imprecise.
- Predicting the next word is too precise and takes a lot of computing power.
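To make the BPE idea concrete, here is a minimal toy sketch of how merges are learned: start from characters and repeatedly merge the most frequent adjacent pair. The function name `learn_bpe` and the sample words are illustrative assumptions, not taken from the talk, and real tokenizers add details (byte-level handling, end-of-word markers) that are left out here.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE: learn `num_merges` merges from a list of words."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)       # learned merges, in order
print(dict(vocab))  # words re-segmented into sub-word units
```

After two merges the frequent stem "low" has become a single token while the rarer endings stay split into characters, which is the middle ground between character-level and word-level prediction.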
Step 2: Generating the compressed training dataset - 2/2
I used SentencePiece to generate the BPE files.
Why?
- Unsupervised text tokenizer and detokenizer
- Purely end-to-end system that does not depend on language-specific
pre/postprocessing.
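As a rough sketch of this step, the SentencePiece Python package can train a BPE model from a raw text file and then encode text with it. The file names, the helper names `spm_train_args` and `train_and_encode`, and the choice to reuse 50257 as the vocabulary size are assumptions for illustration; the talk does not show the exact commands used. Note that SentencePiece rejects a vocabulary size that is too large for a small corpus.

```python
def spm_train_args(corpus_path, model_prefix, vocab_size=50257):
    """Keyword arguments for SentencePieceTrainer.train in BPE mode."""
    return {
        "input": corpus_path,          # one sentence or paragraph per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,      # assumed here to match n_vocab
        "model_type": "bpe",
    }

def train_and_encode(corpus_path, model_prefix, sample, vocab_size=50257):
    """Train a BPE model, then return the sub-word pieces of `sample`."""
    # Third-party dependency, imported lazily: pip install sentencepiece
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        **spm_train_args(corpus_path, model_prefix, vocab_size)
    )
    sp = spm.SentencePieceProcessor(model_file=model_prefix + ".model")
    return sp.encode(sample, out_type=str)

# Hypothetical usage (requires a large enough corpus file):
# pieces = train_and_encode("corpus_fr.txt", "fr_bpe", "Bonjour tout le monde")
```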
Step 3: Fine-tuning the model
Vocabulary size: depends on the language
- n_vocab: 50257
Embedding size: default value recommended by the OpenAI team
- n_embd: 768
Number of attention heads: no greater accuracy if you increase this value
- n_head: 12
Number of layers: no greater accuracy if you increase this value
- n_layer: 12
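To give a feel for what these four values imply, the sketch below estimates the size of a GPT-2 style model from them. The context length `n_ctx = 1024` is an assumption (it is the usual GPT-2 default and does not appear on the slide), and the layer layout assumed is the standard GPT-2 block with tied input and output embeddings.

```python
# Hyperparameters from the slide, plus an assumed context length.
hparams = {
    "n_vocab": 50257,
    "n_ctx": 1024,   # assumption: standard GPT-2 context length
    "n_embd": 768,
    "n_head": 12,
    "n_layer": 12,
}

def head_dim(hp):
    """Each attention head works on n_embd / n_head dimensions."""
    assert hp["n_embd"] % hp["n_head"] == 0, "n_embd must be divisible by n_head"
    return hp["n_embd"] // hp["n_head"]

def count_params(hp):
    """Approximate parameter count of a GPT-2 style transformer."""
    d, v, c, n = hp["n_embd"], hp["n_vocab"], hp["n_ctx"], hp["n_layer"]
    embeddings = v * d + c * d                      # token + position embeddings
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)     # two feed-forward layers
    norms = 2 * (2 * d)                             # two layer norms per block
    final_norm = 2 * d
    return embeddings + n * (attention + mlp + norms) + final_norm

print(head_dim(hparams))      # dimensions per attention head
print(count_params(hparams))  # total trainable parameters
```

With these values the estimate is roughly 124 million parameters, and about a third of them sit in the token embedding table, which is why the vocabulary size matters when adapting the model to another language.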
Step 4: Generating article text
Once the model has been trained, the gpt-2-gen command is used to generate text.
The first parameter is the path to the model.
The second is the beginning of the sentence.
Then there are two optional parameters:
- --tokens-to-generate: number of tokens to generate (default: 42)
- --top-k: number of candidate tokens considered at each step (default: 8)
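The effect of the top-k parameter can be illustrated with a small sampling sketch: at each step only the k most likely tokens are kept, and the next token is drawn among them. This is a generic illustration of top-k sampling, not the code behind gpt-2-gen; the function name `top_k_sample` and the example logits are made up.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample one token index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits (subtract the max for numerical stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 2.0, -1.0, 1.5]  # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)
greedy = top_k_sample(logits, k=1, rng=rng)                       # always the best token
samples = {top_k_sample(logits, k=2, rng=rng) for _ in range(200)}
print(greedy, samples)
```

A small k keeps the output close to the most probable continuation, while a large k lets less likely tokens through and makes the text more varied.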
Results & Quality
Evaluated subjectively by a native reader.
The pyLanguagetool API was used to check the
quality of the results quantitatively; it did not
find any errors in the generated text.
https://github.com/Findus23/pyLanguagetool
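One way to turn such a check into a number is to count the issues reported per 100 words. The sketch below calls the LanguageTool HTTP API directly with the standard library instead of going through pyLanguagetool; the endpoint URL, the helper names `check_text` and `errors_per_100_words`, and the metric itself are assumptions for illustration, not the author's setup.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public LanguageTool endpoint; a self-hosted server also works.
API_URL = "https://api.languagetool.org/v2/check"

def check_text(text, language="fr", api_url=API_URL):
    """Send text to a LanguageTool server and return the parsed JSON response."""
    data = urllib.parse.urlencode({"text": text, "language": language}).encode("utf-8")
    request = urllib.request.Request(api_url, data=data)  # POST because data is set
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)

def errors_per_100_words(response, text):
    """Number of reported issues per 100 whitespace-separated words."""
    n_words = max(len(text.split()), 1)
    return 100.0 * len(response.get("matches", [])) / n_words

# Hypothetical usage (needs network access):
# generated = "Texte généré par le modèle."
# print(errors_per_100_words(check_text(generated), generated))
```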
You can find my Google Colab notebook
for French here:
https://colab.research.google.com/drive/13Lbk1TYmTjoQFO6qbw_f1TJgoD5ulJwV
Warning: it is just an example using limited
data.
Now it is your turn.
Going further
Parameters: top-k < 10, tokens < 10
Objective: high performance. Very high-quality content closely related to your original training content.
Use cases: anchors for internal linking, title variants, meta variants.

Parameters: top-k > 50, tokens > 400
Objective: low performance. Low-quality content because the model is weak, but it successfully extracts all the concepts that GPT-2 learned about your dataset.
Use cases: guides to help you write for a given query, with the stated purpose of saving you time.
Thank You
vincent@oncrawl.com
Catalyst | @CatalystSEM | #TechSEOBoost
Thanks for Viewing the Slideshare!
–
Watch the Recording: https://youtube.com/session-example
Or
Contact us today to discover how Catalyst can deliver unparalleled SEO
results for your business. https://www.catalystdigital.com/
