SlideShare una empresa de Scribd logo
1 de 77
Descargar para leer sin conexión
source{d} live demos
Francesc Campoy
Francesc Campoy
VP of Developer Relations
francesc@sourced.tech
@francesc
github.com/campoy
Agenda
● source{d}
● Code as Data: source{d} Engine
● ML on Code: source{d} Lookout
● Q&A
source{d}
“Software is eating the world”
128k LoC
4-5M LoC
9M LoC
18M LoC
45M LoC
150M LoC
BigQuery Public Data
SELECT
SUM(ARRAY_LENGTH(SPLIT(t.content, 'n'))) as lines
FROM
`bigquery-public-data.github_repos.sample_contents` as t;
55,667,569,252 ~ 55xGoogle?
Code as Data
Code As Data
● Data Mining over trillions of lines of code
● Is that enough?
https://medium.com/google-cloud/analyzing-go-code-with-bigquery-485c70c3b451
Long long time ago
Names of the functions defined in these files
● func main … fn main … function main … public static void main?
● Language Classification
● Language Parsing
● Language Agnostic Token Extraction
bblf.sh
http://xkcd.com/208/
Code As Data
● Data Mining over trillions of lines of code
● Language Classification and Parsing and Token Extraction
LANGUAGE(path, content):
Returns the language of a file given its path and contents.
Powered by github.com/src-d/enry.
Some custom functions
Lines of code per language
# total lines of code per language in the Go repo
SELECT lang, SUM(lines) as total_lines
FROM (
SELECT
t.tree_entry_name as name,
LANGUAGE(t.tree_entry_name, b.blob_content) AS lang,
ARRAY_LENGTH(SPLIT(b.blob_content, 'n')) as lines
FROM refs r
NATURAL JOIN commits c
NATURAL JOIN commit_trees ct
NATURAL JOIN tree_entries t
NATURAL JOIN blobs b
WHERE r.ref_name = 'HEAD'
) AS lines
WHERE lang is not null
GROUP BY lang
ORDER BY total_lines DESC;
Lines of code per language
# total lines of code per language in the Go repo
SELECT lang, SUM(lines) as total_lines
FROM (
SELECT
t.tree_entry_name as name,
LANGUAGE(t.tree_entry_name, b.blob_content) AS lang,
ARRAY_LENGTH(SPLIT(b.blob_content, 'n')) as lines
FROM refs r
NATURAL JOIN commits c
NATURAL JOIN commit_trees ct
NATURAL JOIN tree_entries t
NATURAL JOIN blobs b
WHERE r.ref_name = 'HEAD'
) AS lines
WHERE lang is not null
GROUP BY lang
ORDER BY total_lines DESC;
Some custom functions
UAST(content, language, [filter]):
Returns the Universal Abstract Syntax Tree resulting of parsing the
given content in the given language.
Powered by github.com/bblfsh/bblfshd.
SELECT files.repository_id, files.file_path,
ARRAY_LENGTH(UAST(
files.blob_content,
LANGUAGE(files.file_path, files.blob_content),
'//*[@roleFunction and @roleDeclaration]')
) as functions
FROM files
NATURAL JOIN refs
WHERE
LANGUAGE(files.file_path,files.blob_content) = 'Go'
AND refs.ref_name = 'HEAD'
Number of functions per Go file
SELECT files.repository_id, files.file_path,
ARRAY_LENGTH(UAST(
files.blob_content,
LANGUAGE(files.file_path, files.blob_content),
'//*[@roleFunction and @roleDeclaration]')
) as functions
FROM files
NATURAL JOIN refs
WHERE
LANGUAGE(files.file_path,files.blob_content) = 'Go'
AND refs.ref_name = 'HEAD'
Number of functions per Go file
Code As Data
● Data Mining over trillions of lines of code
● Language Classification and Parsing and Token Extraction
● Is that enough?
Evolution matters
Code As Data
● Data Mining over trillions of lines of code
● Language Classification and Parsing
● All revisions must be available
● Is that enough?
slideshare.net/JeffKunkle/understanding-git
https://xkcd.com/1597/
Code As Data
● Data Mining over trillions of lines of code
● Language Classification and Parsing
● All revisions must be available
● Git to SQL mapping
Demo Time!
ML on Code
Challenge #1
Data Retrieval
The datasets of ML on Code
● GH Archive: https://www.gharchive.org
● Public Git Archive https://pga.sourced.tech
Tasks
● Language Classification
● File Parsing
● Token Extraction
● Reference Resolution
● History Analysis
Retrieving data for ML on Code
Tools
● enry, linguist, etc
● Babelfish, ad-hoc parsers
● XPath / CSS selectors
● Kythe
● go-git
srcd sql
# total lines of code per language in the Go repo
SELECT lang, SUM(lines) as total_lines
FROM (
SELECT
LANGUAGE(t.tree_entry_name, b.blob_content) AS lang,
ARRAY_LENGTH(SPLIT(b.blob_content, 'n')) as lines
FROM refs r
NATURAL JOIN commits c
NATURAL JOIN commit_trees ct
NATURAL JOIN tree_entries t
NATURAL JOIN blobs b
WHERE r.ref_name = 'HEAD' and r.repository_id = 'go'
) AS lines
WHERE lang is not null
GROUP BY lang
ORDER BY total_lines DESC;
SELECT files.repository_id, files.file_path,
ARRAY_LENGTH(UAST(
files.blob_content,
LANGUAGE(files.file_path, files.blob_content),
'//*[@roleFunction and @roleDeclaration]')
) as functions
FROM files
NATURAL JOIN refs
WHERE
LANGUAGE(files.file_path,files.blob_content) = 'Go'
AND refs.ref_name = 'HEAD'
srcd sql
source{d} engine
github.com/src-d/engine
Challenge #2
Data Analysis
'112', '97', '99', '107', '97', '103',
'101', '32', '109', '97', '105', '110',
'10', '10', '105', '109', '112', '111',
'114', '116', '32', '40', '10', '9',
'34', '102', '109', '116', '34', '10',
'41', '10', '10', '102', '117', '110',
'99', '32', '109', '97', '105', '110',
'40', '41', '32', '123', '10', '9',
'102', '109', '116', '46', '80', '114',
'105', '110', '116', '108', '110', '40',
'34', '72', '101', '108', '108', '111',
'44', '32', '112', '108', '97', '121',
'103', '114', '111', '117', '110', '100',
'34', '41', '10', '125', '10'
package main
import “fmt”
func main() {
fmt.Println(“Hello, Denver”)
}
What is Source Code
package package
IDENT main
;
import import
STRING "fmt"
;
func func
IDENT main
(
)
What is Source Code
{
IDENT fmt
.
IDENT Println
(
STRING "Hello, Denver"
)
;
}
;
package main
import “fmt”
func main() {
fmt.Println(“Hello, Denver”)
}
What is Source Code
package main
import “fmt”
func main() {
fmt.Println(“Hello, Denver”)
}
What is Source Code
package main
import “fmt”
func main() {
fmt.Println(“Hello, Denver”)
}
What is Source Code
● A sequence of bytes
● A sequence of tokens
● An abstract syntax tree
● A Graph (e.g. Control Flow Graph)
Challenge #3
Learning from Source Code
Neural Networks
Basically fancy linear regression machines
Given an input of a constant length,
they predict an output of constant length.
Example:
MNIST:
Input: images with 28x28 px
Output: a digit from zero to 9
MNIST
~0
~0
~0
~0
~0
~0
~0
~0
~1
~0
MLonCode: Predict the next token
for
i
:=
0
;
i
<
10
;
i
++
Recurrent Neural Networks
Can process sequences of variable length.
Uses its own output as a new input.
Example:
Natural Language Translation:
Input: “bonjour, les gauffres”
Output: “hi, waffles”
MLonCode: Code Generation
charRNN: Given n characters, predict the next one
Trained over the Go standard library
Achieved 61% accuracy on predictions.
Before training
r t,
kp0t@pp kpktp 0p000 xS%%%?ttk?^@p0rk^@%ppp@ta#p^@ #pp}}%p^@?P%^@@k#%@P}}ta S?@}^@t%@% %%aNt i
^@SSt@@ pyikkp?%y ?t k L P0L t% ^@i%yy ^@p i? %L%LL tyLP?a ?L@Ly?tkk^@ @^@ykk^@i#P^@iL@??@%1tt%^@tPTta L
^@LL%% %i1::yyy^@^@t tP @?@a#Patt 1^@@ k^@k ? yt%L1^@tP%k1?k? % ^@i ^@ta1?1taktt1P?a^@^@Pkt?#^@t^@##1?##
#^@t11#:^@%??t%1^@a 1?a at1P ^@^@Pt #%^@^@ ^@aaak^@#a#?P1Pa^@tt%?^@kt?#akP ?#^@i%%aa ^@1%t tt?a?%
t^@k^@^@k^@ a : ^@1 P# % ^@^@#t% :% kkP ^@#?P: t^@a
?%##?kkPaP^@ #a k?t?? ai?i%PPk taP% P^@ k^@iiai#?^@# #t ?# P?P^@ i^@ttPt #
1%11 ti a^@k P^@k ^@kt %^@%y?#a a#% @? kt ^@t%k? ^@PtttkL tkLa1 ^@iaay?p1% Pta tt ik?ty
k^@kpt%^@tktpkryyp^@?pP# %kt?ki? i @t^@k^@%#P} ?at}akP##Pa11%^@i% ^@?ia ia%##%tki %
}i%%%}} a ay^@%yt }%t ^@tU%a% t}yi^@ ^@ @t yt%? aP @% ^@??^@%? ^@??k#%
kk#%t?a: P}^@t :#^@#1t^@#: w^@P#%w:Pt t # t%aa%i@ak@@^@ka@^@a # y}^@# ^@? % tP i?
?tk ktPPt a tpprrpt? a^@ pP pt %p ? k? ^@^@ kP^@%%?tk a Pt^@#
tP? P kkP1L1tP a%? t1P%PPti^@?%ytk %#%%t?@?^@ty^@iyk%1#^@@^@1#t a t@P^@^@ P@^@1P^@%%#@P:^@%^@ t
1:#P#@LtL#@L L1 %%dt??^@L ^@iBt yTk%p ^@i
After one epoch (dataset seen once)
if testingValuesIntering() {
t.SetCaterCleen(time.SewsallSetrive(true)
if weq := nil {
t.Errorf("eshould: wont %v", touts anverals, prc.Strnared, error
}
t, err := ntr.Soare(cueper(err, err)
if err != nil {
t.Errorf("preveth dime il resetests:%d; want %#',", tr.test.Into
}
if err != nil {
return
}
if err == nel {
t.Errorf("LoconserrSathe foot %q::%q: %s;%want %d", d, err)
},
defarenContateFule(temt.Canses)
}
if err != nil {
return err
}
// Treters and restives of the sesconse stmpeletatareservet
// This no to the result digares wheckader. Constate bytes alleal
After two epochs
if !ok {
t.Errorf("%d: %v not %v", i, err)
}
if !ot.Close()
if enr != nil {
t.Fatal(err)
}
if !ers != nil {
t.Fatal(err)
}
if err != nil {
t.Fatal(err)
}
if err != nil {
t.Errorf("error %q: %s not %v", i, err)
}
return nil
}
if got := t.struct(); !ok {
t.Fatalf("Got %q: %q, %v, want %q", test, true
}
if !strings.Connig(t) {
t.Fatalf("Got %q: %q", want %q", t, err)
}
if !ot {
t.Errorf("%s < %v", x, y)
}
if !ok {
t.Errorf("%d <= %d", err)
}
if !stricgs(); !ot {
t.Errorf("!(%d <= %v", x, e)
}
}
if !ot != nil {
return ""
}
After many epochs
Learning to Represent Programs with Graphs
from, err := os.Open("a.txt")
if err != nil {
log.Fatal(err)
}
defer from.Close()
to, err := os.Open("b.txt")
if err != nil {
log.Fatal(err)
}
defer ???.Close()
io.Copy(to, from)
Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi
https://arxiv.org/abs/1711.00740
The VARMISUSE Task:
Given a program and a gap in it,
predict what variable is missing.
code2vec: Learning Distributed Representations of Code
Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav
https://arxiv.org/abs/1803.09473 | https://code2vec.org/
Much more research
github.com/src-d/awesome-machine-learning-on-source-code
Challenge #4
What can we build?
Predictable vs Predicted
~0
~0
~0
~0
~0
~0
~0
~0
~1
~0
A
G
o
PR
An attention model for code reviews.
Can you see the mistake?
VARMISUSE
from, err := os.Open("a.txt")
if err != nil {
log.Fatal(err)
}
defer from.Close()
to, err := os.Open("b.txt")
if err != nil {
log.Fatal(err)
}
defer from.Close()
io.Copy(to, from)
Can you see the mistake?
VARMISUSE
from, err := os.Open("a.txt")
if err != nil {
log.Fatal(err)
}
defer from.Close()
to, err := os.Open("b.txt")
if err != nil {
log.Fatal(err)
}
defer from.Close() ← s/from/to/
io.Copy(to, from)
Is this a good name?
func XXX(list []string, text string) bool {
for _, s := range list {
if s == text {
return true
}
}
return false
}
Suggestions:
● Contains
● Has
func XXX(list []string, text string) int {
for i, s := range list {
if s == text {
return i
}
}
return -1
}
Suggestions:
● Find
● Index
code2vec: Learning Distributed Representations of Code
source: WOCinTech
Assisted code review. src-d/lookout
And so much more
Coming soon:
● Automated Style Guide Enforcing
● Bug Prediction
● Automated Code Review
● Education
Coming … later:
● Code Generation: from unit tests, specification, natural language description.
● Natural Analysis: code description and conversational analysis.
Q&A

Más contenido relacionado

La actualidad más candente

Use PEG to Write a Programming Language Parser
Use PEG to Write a Programming Language ParserUse PEG to Write a Programming Language Parser
Use PEG to Write a Programming Language ParserYodalee
 
Twitter Author Prediction from Tweets using Bayesian Network
Twitter Author Prediction from Tweets using Bayesian NetworkTwitter Author Prediction from Tweets using Bayesian Network
Twitter Author Prediction from Tweets using Bayesian NetworkHendy Irawan
 
An Overview of Hadoop
An Overview of HadoopAn Overview of Hadoop
An Overview of HadoopAsif Ali
 
Reversing the dropbox client on windows
Reversing the dropbox client on windowsReversing the dropbox client on windows
Reversing the dropbox client on windowsextremecoders
 
Natural Language Processing and Python
Natural Language Processing and PythonNatural Language Processing and Python
Natural Language Processing and Pythonanntp
 
Python于Web 2.0网站的应用 - QCon Beijing 2010
Python于Web 2.0网站的应用 - QCon Beijing 2010Python于Web 2.0网站的应用 - QCon Beijing 2010
Python于Web 2.0网站的应用 - QCon Beijing 2010Qiangning Hong
 
Debugging of (C)Python applications
Debugging of (C)Python applicationsDebugging of (C)Python applications
Debugging of (C)Python applicationsRoman Podoliaka
 
What is the best full text search engine for Python?
What is the best full text search engine for Python?What is the best full text search engine for Python?
What is the best full text search engine for Python?Andrii Soldatenko
 
05 pig user defined functions (udfs)
05 pig user defined functions (udfs)05 pig user defined functions (udfs)
05 pig user defined functions (udfs)Subhas Kumar Ghosh
 
pa-pe-pi-po-pure Python Text Processing
pa-pe-pi-po-pure Python Text Processingpa-pe-pi-po-pure Python Text Processing
pa-pe-pi-po-pure Python Text ProcessingRodrigo Senra
 
Lean & Mean Tokyo Cabinet Recipes (with Lua) - FutureRuby '09
Lean & Mean Tokyo Cabinet Recipes (with Lua) - FutureRuby '09Lean & Mean Tokyo Cabinet Recipes (with Lua) - FutureRuby '09
Lean & Mean Tokyo Cabinet Recipes (with Lua) - FutureRuby '09Ilya Grigorik
 
using python module: doctest
using python module: doctestusing python module: doctest
using python module: doctestmitnk
 
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary dataKernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary dataAnne Nicolas
 
Reproducible Computational Research in R
Reproducible Computational Research in RReproducible Computational Research in R
Reproducible Computational Research in RSamuel Bosch
 
The Ring programming language version 1.5.1 book - Part 38 of 180
The Ring programming language version 1.5.1 book - Part 38 of 180The Ring programming language version 1.5.1 book - Part 38 of 180
The Ring programming language version 1.5.1 book - Part 38 of 180Mahmoud Samir Fayed
 
Stream or not to Stream?

Stream or not to Stream?
Stream or not to Stream?

Stream or not to Stream?
Lukasz Byczynski
 
Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Pe...
Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Pe...Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Pe...
Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Pe...Maurice Naftalin
 
PeNeLoop: Parallelizing Federated SPARQL queries in presence of replicated fr...
PeNeLoop: Parallelizing Federated SPARQL queries in presence of replicated fr...PeNeLoop: Parallelizing Federated SPARQL queries in presence of replicated fr...
PeNeLoop: Parallelizing Federated SPARQL queries in presence of replicated fr...Thomas Minier
 
Overloading Perl OPs using XS
Overloading Perl OPs using XSOverloading Perl OPs using XS
Overloading Perl OPs using XSℕicolas ℝ.
 

La actualidad más candente (20)

Use PEG to Write a Programming Language Parser
Use PEG to Write a Programming Language ParserUse PEG to Write a Programming Language Parser
Use PEG to Write a Programming Language Parser
 
Twitter Author Prediction from Tweets using Bayesian Network
Twitter Author Prediction from Tweets using Bayesian NetworkTwitter Author Prediction from Tweets using Bayesian Network
Twitter Author Prediction from Tweets using Bayesian Network
 
IO Streams, Files and Directories
IO Streams, Files and DirectoriesIO Streams, Files and Directories
IO Streams, Files and Directories
 
An Overview of Hadoop
An Overview of HadoopAn Overview of Hadoop
An Overview of Hadoop
 
Reversing the dropbox client on windows
Reversing the dropbox client on windowsReversing the dropbox client on windows
Reversing the dropbox client on windows
 
Natural Language Processing and Python
Natural Language Processing and PythonNatural Language Processing and Python
Natural Language Processing and Python
 
Python于Web 2.0网站的应用 - QCon Beijing 2010
Python于Web 2.0网站的应用 - QCon Beijing 2010Python于Web 2.0网站的应用 - QCon Beijing 2010
Python于Web 2.0网站的应用 - QCon Beijing 2010
 
Debugging of (C)Python applications
Debugging of (C)Python applicationsDebugging of (C)Python applications
Debugging of (C)Python applications
 
What is the best full text search engine for Python?
What is the best full text search engine for Python?What is the best full text search engine for Python?
What is the best full text search engine for Python?
 
05 pig user defined functions (udfs)
05 pig user defined functions (udfs)05 pig user defined functions (udfs)
05 pig user defined functions (udfs)
 
pa-pe-pi-po-pure Python Text Processing
pa-pe-pi-po-pure Python Text Processingpa-pe-pi-po-pure Python Text Processing
pa-pe-pi-po-pure Python Text Processing
 
Lean & Mean Tokyo Cabinet Recipes (with Lua) - FutureRuby '09
Lean & Mean Tokyo Cabinet Recipes (with Lua) - FutureRuby '09Lean & Mean Tokyo Cabinet Recipes (with Lua) - FutureRuby '09
Lean & Mean Tokyo Cabinet Recipes (with Lua) - FutureRuby '09
 
using python module: doctest
using python module: doctestusing python module: doctest
using python module: doctest
 
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary dataKernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
 
Reproducible Computational Research in R
Reproducible Computational Research in RReproducible Computational Research in R
Reproducible Computational Research in R
 
The Ring programming language version 1.5.1 book - Part 38 of 180
The Ring programming language version 1.5.1 book - Part 38 of 180The Ring programming language version 1.5.1 book - Part 38 of 180
The Ring programming language version 1.5.1 book - Part 38 of 180
 
Stream or not to Stream?

Stream or not to Stream?
Stream or not to Stream?

Stream or not to Stream?

 
Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Pe...
Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Pe...Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Pe...
Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Pe...
 
PeNeLoop: Parallelizing Federated SPARQL queries in presence of replicated fr...
PeNeLoop: Parallelizing Federated SPARQL queries in presence of replicated fr...PeNeLoop: Parallelizing Federated SPARQL queries in presence of replicated fr...
PeNeLoop: Parallelizing Federated SPARQL queries in presence of replicated fr...
 
Overloading Perl OPs using XS
Overloading Perl OPs using XSOverloading Perl OPs using XS
Overloading Perl OPs using XS
 

Similar a Introduction to source{d} Engine and source{d} Lookout

Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetupsource{d}
 
Go 1.10 Release Party - PDX Go
Go 1.10 Release Party - PDX GoGo 1.10 Release Party - PDX Go
Go 1.10 Release Party - PDX GoRodolfo Carvalho
 
Jsonsaga 100605143125-phpapp02
Jsonsaga 100605143125-phpapp02Jsonsaga 100605143125-phpapp02
Jsonsaga 100605143125-phpapp02Ramamohan Chokkam
 
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Data Con LA
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packagesAjay Ohri
 
Introduction to Assembly Language
Introduction to Assembly LanguageIntroduction to Assembly Language
Introduction to Assembly LanguageMotaz Saad
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineeringjtdudley
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and MonoidsHugo Gävert
 
Groovy Introduction - JAX Germany - 2008
Groovy Introduction - JAX Germany - 2008Groovy Introduction - JAX Germany - 2008
Groovy Introduction - JAX Germany - 2008Guillaume Laforge
 
Kamil witecki asynchronous, yet readable, code
Kamil witecki asynchronous, yet readable, codeKamil witecki asynchronous, yet readable, code
Kamil witecki asynchronous, yet readable, codeKamil Witecki
 
Building Go Web Apps
Building Go Web AppsBuilding Go Web Apps
Building Go Web AppsMark
 
NOSQL and Cassandra
NOSQL and CassandraNOSQL and Cassandra
NOSQL and Cassandrarantav
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchAndrew Lowe
 
AWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewAWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewDan Morrill
 
Machine learning on Go Code
Machine learning on Go CodeMachine learning on Go Code
Machine learning on Go Codesource{d}
 
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...NoSQLmatters
 

Similar a Introduction to source{d} Engine and source{d} Lookout (20)

Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetup
 
Easy R
Easy REasy R
Easy R
 
Go 1.10 Release Party - PDX Go
Go 1.10 Release Party - PDX GoGo 1.10 Release Party - PDX Go
Go 1.10 Release Party - PDX Go
 
Jsonsaga 100605143125-phpapp02
Jsonsaga 100605143125-phpapp02Jsonsaga 100605143125-phpapp02
Jsonsaga 100605143125-phpapp02
 
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packages
 
Introduction to Assembly Language
Introduction to Assembly LanguageIntroduction to Assembly Language
Introduction to Assembly Language
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineering
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
Groovy Introduction - JAX Germany - 2008
Groovy Introduction - JAX Germany - 2008Groovy Introduction - JAX Germany - 2008
Groovy Introduction - JAX Germany - 2008
 
Kamil witecki asynchronous, yet readable, code
Kamil witecki asynchronous, yet readable, codeKamil witecki asynchronous, yet readable, code
Kamil witecki asynchronous, yet readable, code
 
Building Go Web Apps
Building Go Web AppsBuilding Go Web Apps
Building Go Web Apps
 
NOSQL and Cassandra
NOSQL and CassandraNOSQL and Cassandra
NOSQL and Cassandra
 
Perl tutorial final
Perl tutorial finalPerl tutorial final
Perl tutorial final
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
 
AWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewAWS Hadoop and PIG and overview
AWS Hadoop and PIG and overview
 
Machine learning on Go Code
Machine learning on Go CodeMachine learning on Go Code
Machine learning on Go Code
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Odp
OdpOdp
Odp
 
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
 

Más de source{d}

Overton, Apple Flavored ML
Overton, Apple Flavored MLOverton, Apple Flavored ML
Overton, Apple Flavored MLsource{d}
 
Unlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analyticsUnlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analyticssource{d}
 
What's new in the latest source{d} releases!
What's new in the latest source{d} releases!What's new in the latest source{d} releases!
What's new in the latest source{d} releases!source{d}
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...source{d}
 
Gitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositoriesGitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositoriessource{d}
 
Deduplication on large amounts of code
Deduplication on large amounts of codeDeduplication on large amounts of code
Deduplication on large amounts of codesource{d}
 
Assisted code review with source{d} lookout
Assisted code review with source{d} lookoutAssisted code review with source{d} lookout
Assisted code review with source{d} lookoutsource{d}
 
Inextricably linked reproducibility and productivity in data science and ai ...
Inextricably linked reproducibility and productivity in data science and ai  ...Inextricably linked reproducibility and productivity in data science and ai  ...
Inextricably linked reproducibility and productivity in data science and ai ...source{d}
 
source{d} Engine - your code as data
source{d} Engine - your code as datasource{d} Engine - your code as data
source{d} Engine - your code as datasource{d}
 
Introduction to the source{d} Stack
Introduction to the source{d} Stack Introduction to the source{d} Stack
Introduction to the source{d} Stack source{d}
 
source{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQLsource{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQLsource{d}
 
Improving go-git performance
Improving go-git performanceImproving go-git performance
Improving go-git performancesource{d}
 
Machine learning on source code
Machine learning on source codeMachine learning on source code
Machine learning on source codesource{d}
 

Más de source{d} (13)

Overton, Apple Flavored ML
Overton, Apple Flavored MLOverton, Apple Flavored ML
Overton, Apple Flavored ML
 
Unlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analyticsUnlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analytics
 
What's new in the latest source{d} releases!
What's new in the latest source{d} releases!What's new in the latest source{d} releases!
What's new in the latest source{d} releases!
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...
 
Gitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositoriesGitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositories
 
Deduplication on large amounts of code
Deduplication on large amounts of codeDeduplication on large amounts of code
Deduplication on large amounts of code
 
Assisted code review with source{d} lookout
Assisted code review with source{d} lookoutAssisted code review with source{d} lookout
Assisted code review with source{d} lookout
 
Inextricably linked reproducibility and productivity in data science and ai ...
Inextricably linked reproducibility and productivity in data science and ai  ...Inextricably linked reproducibility and productivity in data science and ai  ...
Inextricably linked reproducibility and productivity in data science and ai ...
 
source{d} Engine - your code as data
source{d} Engine - your code as datasource{d} Engine - your code as data
source{d} Engine - your code as data
 
Introduction to the source{d} Stack
Introduction to the source{d} Stack Introduction to the source{d} Stack
Introduction to the source{d} Stack
 
source{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQLsource{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQL
 
Improving go-git performance
Improving go-git performanceImproving go-git performance
Improving go-git performance
 
Machine learning on source code
Machine learning on source codeMachine learning on source code
Machine learning on source code
 

Último

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 

Último (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

Introduction to source{d} Engine and source{d} Lookout

  • 2. Francesc Campoy VP of Developer Relations francesc@sourced.tech @francesc github.com/campoy
  • 3. Agenda ● source{d} ● Code as Data: source{d} Engine ● ML on Code: source{d} Lookout ● Q&A
  • 5. “Software is eating the world”
  • 12.
  • 13.
  • 14.
  • 15. BigQuery Public Data SELECT SUM(ARRAY_LENGTH(SPLIT(t.content, 'n'))) as lines FROM `bigquery-public-data.github_repos.sample_contents` as t; 55,667,569,252 ~ 55xGoogle?
  • 16.
  • 17.
  • 18.
  • 20. Code As Data ● Data Mining over trillions of lines of code ● Is that enough?
  • 22. Names of the functions defined in these files ● func main … fn main … function main … public static void main? ● Language Classification ● Language Parsing ● Language Agnostic Token Extraction
  • 25. Code As Data ● Data Mining over trillions of lines of code ● Language Classification and Parsing and Token Extraction
  • 26. LANGUAGE(path, content): Returns the language of a file given its path and contents. Powered by github.com/src-d/enry. Some custom functions
  • 27. Lines of code per language # total lines of code per language in the Go repo SELECT lang, SUM(lines) as total_lines FROM ( SELECT t.tree_entry_name as name, LANGUAGE(t.tree_entry_name, b.blob_content) AS lang, ARRAY_LENGTH(SPLIT(b.blob_content, 'n')) as lines FROM refs r NATURAL JOIN commits c NATURAL JOIN commit_trees ct NATURAL JOIN tree_entries t NATURAL JOIN blobs b WHERE r.ref_name = 'HEAD' ) AS lines WHERE lang is not null GROUP BY lang ORDER BY total_lines DESC;
  • 28. Lines of code per language # total lines of code per language in the Go repo SELECT lang, SUM(lines) as total_lines FROM ( SELECT t.tree_entry_name as name, LANGUAGE(t.tree_entry_name, b.blob_content) AS lang, ARRAY_LENGTH(SPLIT(b.blob_content, 'n')) as lines FROM refs r NATURAL JOIN commits c NATURAL JOIN commit_trees ct NATURAL JOIN tree_entries t NATURAL JOIN blobs b WHERE r.ref_name = 'HEAD' ) AS lines WHERE lang is not null GROUP BY lang ORDER BY total_lines DESC;
  • 29. Some custom functions UAST(content, language, [filter]): Returns the Universal Abstract Syntax Tree resulting of parsing the given content in the given language. Powered by github.com/bblfsh/bblfshd.
  • 30. SELECT files.repository_id, files.file_path, ARRAY_LENGTH(UAST( files.blob_content, LANGUAGE(files.file_path, files.blob_content), '//*[@roleFunction and @roleDeclaration]') ) as functions FROM files NATURAL JOIN refs WHERE LANGUAGE(files.file_path,files.blob_content) = 'Go' AND refs.ref_name = 'HEAD' Number of functions per Go file
  • 31. SELECT files.repository_id, files.file_path, ARRAY_LENGTH(UAST( files.blob_content, LANGUAGE(files.file_path, files.blob_content), '//*[@roleFunction and @roleDeclaration]') ) as functions FROM files NATURAL JOIN refs WHERE LANGUAGE(files.file_path,files.blob_content) = 'Go' AND refs.ref_name = 'HEAD' Number of functions per Go file
  • 32. Code As Data ● Data Mining over trillions of lines of code ● Language Classification and Parsing and Token Extraction ● Is that enough?
  • 34. Code As Data ● Data Mining over trillions of lines of code ● Language Classification and Parsing ● All revisions must be available ● Is that enough?
  • 37. Code As Data ● Data Mining over trillions of lines of code ● Language Classification and Parsing ● All revisions must be available ● Git to SQL mapping
  • 38.
  • 42. The datasets of ML on Code ● GH Archive: https://www.gharchive.org ● Public Git Archive https://pga.sourced.tech
  • 43. Tasks ● Language Classification ● File Parsing ● Token Extraction ● Reference Resolution ● History Analysis Retrieving data for ML on Code Tools ● enry, linguist, etc ● Babelfish, ad-hoc parsers ● XPath / CSS selectors ● Kythe ● go-git
  • 44. srcd sql # total lines of code per language in the Go repo SELECT lang, SUM(lines) as total_lines FROM ( SELECT LANGUAGE(t.tree_entry_name, b.blob_content) AS lang, ARRAY_LENGTH(SPLIT(b.blob_content, 'n')) as lines FROM refs r NATURAL JOIN commits c NATURAL JOIN commit_trees ct NATURAL JOIN tree_entries t NATURAL JOIN blobs b WHERE r.ref_name = 'HEAD' and r.repository_id = 'go' ) AS lines WHERE lang is not null GROUP BY lang ORDER BY total_lines DESC;
  • 45. SELECT files.repository_id, files.file_path, ARRAY_LENGTH(UAST( files.blob_content, LANGUAGE(files.file_path, files.blob_content), '//*[@roleFunction and @roleDeclaration]') ) as functions FROM files NATURAL JOIN refs WHERE LANGUAGE(files.file_path,files.blob_content) = 'Go' AND refs.ref_name = 'HEAD' srcd sql
  • 46.
  • 49. '112', '97', '99', '107', '97', '103', '101', '32', '109', '97', '105', '110', '10', '10', '105', '109', '112', '111', '114', '116', '32', '40', '10', '9', '34', '102', '109', '116', '34', '10', '41', '10', '10', '102', '117', '110', '99', '32', '109', '97', '105', '110', '40', '41', '32', '123', '10', '9', '102', '109', '116', '46', '80', '114', '105', '110', '116', '108', '110', '40', '34', '72', '101', '108', '108', '111', '44', '32', '112', '108', '97', '121', '103', '114', '111', '117', '110', '100', '34', '41', '10', '125', '10' package main import “fmt” func main() { fmt.Println(“Hello, Denver”) } What is Source Code
  • 50. package package IDENT main ; import import STRING "fmt" ; func func IDENT main ( ) What is Source Code { IDENT fmt . IDENT Println ( STRING "Hello, Denver" ) ; } ; package main import “fmt” func main() { fmt.Println(“Hello, Denver”) }
  • 51. What is Source Code package main import “fmt” func main() { fmt.Println(“Hello, Denver”) }
  • 52. What is Source Code package main import “fmt” func main() { fmt.Println(“Hello, Denver”) }
  • 53. What is Source Code ● A sequence of bytes ● A sequence of tokens ● An abstract syntax tree ● A Graph (e.g. Control Flow Graph)
  • 55. Neural Networks Basically fancy linear regression machines Given an input of a constant length, they predict an output of constant length. Example: MNIST: Input: images with 28x28 px Output: a digit from zero to 9
  • 57. MLonCode: Predict the next token for i := 0 ; i < 10 ; i ++
  • 58. Recurrent Neural Networks Can process sequences of variable length. Uses its own output as a new input. Example: Natural Language Translation: Input: “bonjour, les gauffres” Output: “hi, waffles”
  • 59. MLonCode: Code Generation charRNN: Given n characters, predict the next one Trained over the Go standard library Achieved 61% accuracy on predictions.
  • 60. Before training r t, kp0t@pp kpktp 0p000 xS%%%?ttk?^@p0rk^@%ppp@ta#p^@ #pp}}%p^@?P%^@@k#%@P}}ta S?@}^@t%@% %%aNt i ^@SSt@@ pyikkp?%y ?t k L P0L t% ^@i%yy ^@p i? %L%LL tyLP?a ?L@Ly?tkk^@ @^@ykk^@i#P^@iL@??@%1tt%^@tPTta L ^@LL%% %i1::yyy^@^@t tP @?@a#Patt 1^@@ k^@k ? yt%L1^@tP%k1?k? % ^@i ^@ta1?1taktt1P?a^@^@Pkt?#^@t^@##1?## #^@t11#:^@%??t%1^@a 1?a at1P ^@^@Pt #%^@^@ ^@aaak^@#a#?P1Pa^@tt%?^@kt?#akP ?#^@i%%aa ^@1%t tt?a?% t^@k^@^@k^@ a : ^@1 P# % ^@^@#t% :% kkP ^@#?P: t^@a ?%##?kkPaP^@ #a k?t?? ai?i%PPk taP% P^@ k^@iiai#?^@# #t ?# P?P^@ i^@ttPt # 1%11 ti a^@k P^@k ^@kt %^@%y?#a a#% @? kt ^@t%k? ^@PtttkL tkLa1 ^@iaay?p1% Pta tt ik?ty k^@kpt%^@tktpkryyp^@?pP# %kt?ki? i @t^@k^@%#P} ?at}akP##Pa11%^@i% ^@?ia ia%##%tki % }i%%%}} a ay^@%yt }%t ^@tU%a% t}yi^@ ^@ @t yt%? aP @% ^@??^@%? ^@??k#% kk#%t?a: P}^@t :#^@#1t^@#: w^@P#%w:Pt t # t%aa%i@ak@@^@ka@^@a # y}^@# ^@? % tP i? ?tk ktPPt a tpprrpt? a^@ pP pt %p ? k? ^@^@ kP^@%%?tk a Pt^@# tP? P kkP1L1tP a%? t1P%PPti^@?%ytk %#%%t?@?^@ty^@iyk%1#^@@^@1#t a t@P^@^@ P@^@1P^@%%#@P:^@%^@ t 1:#P#@LtL#@L L1 %%dt??^@L ^@iBt yTk%p ^@i
  • 61. After one epoch (dataset seen once) if testingValuesIntering() { t.SetCaterCleen(time.SewsallSetrive(true) if weq := nil { t.Errorf("eshould: wont %v", touts anverals, prc.Strnared, error } t, err := ntr.Soare(cueper(err, err) if err != nil { t.Errorf("preveth dime il resetests:%d; want %#',", tr.test.Into } if err != nil { return } if err == nel { t.Errorf("LoconserrSathe foot %q::%q: %s;%want %d", d, err) }, defarenContateFule(temt.Canses) } if err != nil { return err } // Treters and restives of the sesconse stmpeletatareservet // This no to the result digares wheckader. Constate bytes alleal
  • 62. After two epochs if !ok { t.Errorf("%d: %v not %v", i, err) } if !ot.Close() if enr != nil { t.Fatal(err) } if !ers != nil { t.Fatal(err) } if err != nil { t.Fatal(err) } if err != nil { t.Errorf("error %q: %s not %v", i, err) } return nil }
  • 63. if got := t.struct(); !ok { t.Fatalf("Got %q: %q, %v, want %q", test, true } if !strings.Connig(t) { t.Fatalf("Got %q: %q", want %q", t, err) } if !ot { t.Errorf("%s < %v", x, y) } if !ok { t.Errorf("%d <= %d", err) } if !stricgs(); !ot { t.Errorf("!(%d <= %v", x, e) } } if !ot != nil { return "" } After many epochs
  • 64. Learning to Represent Programs with Graphs from, err := os.Open("a.txt") if err != nil { log.Fatal(err) } defer from.Close() to, err := os.Open("b.txt") if err != nil { log.Fatal(err) } defer ???.Close() io.Copy(to, from) Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi https://arxiv.org/abs/1711.00740 The VARMISUSE Task: Given a program and a gap in it, predict what variable is missing.
  • 65. code2vec: Learning Distributed Representations of Code Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav https://arxiv.org/abs/1803.09473 | https://code2vec.org/
  • 69. A G o PR An attention model for code reviews.
  • 70.
  • 71.
  • 72. Can you see the mistake? VARMISUSE from, err := os.Open("a.txt") if err != nil { log.Fatal(err) } defer from.Close() to, err := os.Open("b.txt") if err != nil { log.Fatal(err) } defer from.Close() io.Copy(to, from)
  • 73. Can you see the mistake? VARMISUSE from, err := os.Open("a.txt") if err != nil { log.Fatal(err) } defer from.Close() to, err := os.Open("b.txt") if err != nil { log.Fatal(err) } defer from.Close() ← s/from/to/ io.Copy(to, from)
  • 74. Is this a good name? func XXX(list []string, text string) bool { for _, s := range list { if s == text { return true } } return false } Suggestions: ● Contains ● Has func XXX(list []string, text string) int { for i, s := range list { if s == text { return i } } return -1 } Suggestions: ● Find ● Index code2vec: Learning Distributed Representations of Code
  • 75. source: WOCinTech Assisted code review. src-d/lookout
  • 76. And so much more Coming soon: ● Automated Style Guide Enforcing ● Bug Prediction ● Automated Code Review ● Education Coming … later: ● Code Generation: from unit tests, specification, natural language description. ● Natural Analysis: code description and conversational analysis.
  • 77. Q&A