Alexander Borzunov
How to do research
at a large IT company
2
Who am I?
Alexander Borzunov
• Researcher at Yandex
• NEERC ICPC 2017 prize winner
• Bachelor’s at Ural FU
• Master’s at HSE University +
Yandex School of Data Analysis
3
Plan
• Why do companies need research?
• What do researchers do?
• How to get there?
4
Why do companies need research?
Product development:
• Developers address user feedback/business needs
• No time to dive deeply into a problem (e.g. to invent a new algorithm)
Research:
• Experts work on problems from a particular area full-time
• Necessary to get innovations in the long term
5
How is it different from universities?
Research in companies:
• More funding
• Access to more compute
• Interaction with product teams
6
Many breakthroughs in modern computer science
are made by companies
7
What do researchers do?
• Follow the latest findings and results (e.g. on Twitter)
• Choose promising research directions
• Collaborate with each other
• Conduct experiments (you need to write code quickly to evaluate many ideas)
• Design rigorous proofs
13
What do researchers do?
If the method works:
• Write a paper for an (international) conference
• Defend it in a discussion with reviewers
• If accepted:
  • Travel to a conference ✈️
  • Tell the world about it on Twitter, Reddit, Habr, etc. 🌎
  • Your results may be adopted by product teams
14
Yandex Research
• Focus: machine learning and related algorithms
  • Computer vision, image generation
  • Language processing
  • Program synthesis with neural nets (e.g. trained on Codeforces solutions)
  • Systems for distributed training
  • Theory, e.g. continuous optimization
• Publications in top venues such as NeurIPS, ICML, CVPR, ACL
15
Collaboration with product teams
• Self-driving and Robotics
• Voice assistants
16
Yandex Research
• Joint labs with paid programs for Master’s/PhD students:
• Collaborations:
17
How did I get there?
2014 – 2018 Bachelor’s at Ural FU, participated in ICPC
▎ “What’s next?”
▎ “Machine learning – a growing field”
18
Machine learning on “Cats vs. Dogs”
• 2007: No methods known to reach 60% accuracy (random guessing gives 50%)
• 2014: Solved with 98% accuracy
• 2019: Neural nets can draw cats and dogs themselves (“this cat does not exist”)
21
Machine learning in 2021
• 2019: Neural nets can draw cats and dogs themselves
• 2021: Neural nets draw pictures matching any text description
22
How did I get there?
2018 – 2020 Master’s at HSE University + Yandex School of Data Analysis
▎ “Self-driving – a product that may change everyday life”
2019 – 2021 Research Engineer at Yandex Self-Driving
▎ “Research – a place where people invent new things”
2021 – Now Yandex Research
23
What do I do?
• Compute needed to train the latest neural nets grows quickly
• Popular training methods are designed for high-performance clusters
  • A cluster to train GPT-3 costs over $250 million
  • Hard to get if you are in a university or a startup
• Solution: distributed training over the Internet (like BitTorrent)
24
First use case: Language models
• Training one large neural net lets you solve many tasks:
  • Understanding intents, tone, and logical relations in a sentence
  • Answering questions
  • Extracting entities (locations, persons, etc.)
• Once trained, it is easy to use for your business/research
First use case: Language model for Bengali
• Top-6 language by number of native speakers
• No good model yet
• We invited people to train one together!
Together with:
• Got a competitive model, state-of-the-art on some tasks
Roadblock to scaling: Security
• To train a neural net, you need to average computations performed by peers on different data samples
• A troll or competitor may destroy the model by sending wrong values just once
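To make the roadblock concrete, here is a minimal sketch of plain averaging, assuming each peer sends a small gradient vector as a list of floats. The function name average_gradients and all the numbers are hypothetical, chosen only for illustration; this is not the code used in the project.

```python
from typing import List


def average_gradients(peer_gradients: List[List[float]]) -> List[float]:
    """Coordinate-wise mean of the vectors sent by all peers."""
    n_peers = len(peer_gradients)
    return [sum(column) / n_peers for column in zip(*peer_gradients)]


# Three honest peers computed similar values on different data samples.
honest = [[0.9, -1.1], [1.0, -1.0], [1.1, -0.9]]

# Hypothetical malicious contribution: one peer sends huge values once.
attacker = [1e9, 1e9]

clean_average = average_gradients(honest)
poisoned_average = average_gradients(honest + [attacker])

print("honest peers only:", clean_average)
print("with one attacker:", poisoned_average)
```

Because the mean weights every peer equally, a single extreme vector dominates the result no matter how many honest peers take part.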
28
Secure distributed training
Idea #1: Clip outliers among computations (it does not hurt training if done right)
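One way to clip outliers can be sketched as follows. This is an illustrative variant of centered clipping, not necessarily the procedure used in the talk: each peer's deviation from the current estimate is shortened to at most clip_radius before averaging. The names clipped_average, clip_radius, and n_iters, and the choice to start from the coordinate-wise median, are assumptions made for this example.

```python
import math
import statistics
from typing import List


def clipped_average(
    peer_gradients: List[List[float]],
    clip_radius: float = 1.0,
    n_iters: int = 20,
) -> List[float]:
    """Average peers' vectors while limiting how far any one peer can pull."""
    n_peers = len(peer_gradients)
    # Start from the coordinate-wise median as a rough robust estimate.
    center = [statistics.median(column) for column in zip(*peer_gradients)]
    for _ in range(n_iters):
        update = [0.0] * len(center)
        for grad in peer_gradients:
            diff = [g - c for g, c in zip(grad, center)]
            norm = math.sqrt(sum(d * d for d in diff))
            # Deviations longer than clip_radius are scaled down to that length.
            scale = min(1.0, clip_radius / norm) if norm > 0 else 1.0
            for i, d in enumerate(diff):
                update[i] += scale * d
        center = [c + u / n_peers for c, u in zip(center, update)]
    return center


honest = [[0.9, -1.1], [1.0, -1.0], [1.1, -0.9]]
attacker = [1e9, 1e9]  # hypothetical malicious contribution

print("clipped, honest only: ", clipped_average(honest))
print("clipped, with attacker:", clipped_average(honest + [attacker]))
```

With honest peers only, no deviation exceeds the radius, so the result matches the ordinary mean. With the attacker present, the estimate stays close to the honest values instead of jumping to around 1e9.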
29
Idea #2:
• Peers broadcast hashes of their calculations
• Then, the system selects “policemen” to validate the results of some peers
• If a policeman accuses someone, we can learn who is right from the hashes
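The hash-based check in Idea #2 can be sketched as below. The sketch assumes that a peer's calculation is deterministic, so anyone holding the same inputs can recompute it and compare hashes. The functions compute_gradient, commitment, and find_offender are hypothetical stand-ins, and SHA-256 is an arbitrary choice of hash for the example.

```python
import hashlib
import struct
from typing import List, Optional


def compute_gradient(weights: List[float], batch: List[float]) -> List[float]:
    """Hypothetical deterministic calculation that every peer can repeat."""
    return [w * x for w, x in zip(weights, batch)]


def commitment(values: List[float]) -> str:
    """Hash of a result; peers broadcast this before sending the values."""
    payload = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(payload).hexdigest()


def find_offender(
    weights: List[float],
    batch: List[float],
    peer_hash: str,
    validator_hash: str,
) -> Optional[str]:
    """Settle an accusation by recomputing and comparing with both hashes."""
    reference = commitment(compute_gradient(weights, batch))
    if peer_hash != reference:
        return "peer"
    if validator_hash != reference:
        return "validator"
    return None


weights, batch = [0.5, -2.0], [4.0, 1.0]
honest_hash = commitment(compute_gradient(weights, batch))

# Case 1: the peer broadcast a hash of wrong values, the validator was honest.
cheater_hash = commitment([1e9, 1e9])
print(find_offender(weights, batch, cheater_hash, honest_hash))

# Case 2: the peer was honest and the accusation itself was false.
false_accusation_hash = commitment([0.0, 0.0])
print(find_offender(weights, batch, honest_hash, false_accusation_hash))
```

Since both sides have committed to a hash in advance, neither can change its story after an accusation; whoever disagrees with the recomputed result is the one to ban.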
Result: We ban offenders and quickly recover training progress
31
Thank you!
Check out our publications and
available positions on
research.yandex.com
I am available for a chat or questions
at the Yandex area
on the 3rd floor terrace until 7 pm 🙂

How to do science in a large IT company (ICPC World Finals 2021, Moscow)
