Publicidad

Introduction to source{d} Engine and source{d} Lookout

source{d}
source{d}
1 de Oct de 2018
Publicidad

Más contenido relacionado

Presentaciones para ti(20)

Publicidad
Publicidad

Introduction to source{d} Engine and source{d} Lookout

  1. source{d} live demos Francesc Campoy
  2. Francesc Campoy VP of Developer Relations francesc@sourced.tech @francesc github.com/campoy
  3. Agenda ● source{d} ● Code as Data: source{d} Engine ● ML on Code: source{d} Lookout ● Q&A
  4. source{d}
  5. “Software is eating the world”
  6. 128k LoC
  7. 4-5M LoC
  8. 9M LoC
  9. 18M LoC
  10. 45M LoC
  11. 150M LoC
  12. BigQuery Public Data SELECT SUM(ARRAY_LENGTH(SPLIT(t.content, 'n'))) as lines FROM `bigquery-public-data.github_repos.sample_contents` as t; 55,667,569,252 ~ 55xGoogle?
  13. Code as Data
  14. Code As Data ● Data Mining over trillions of lines of code ● Is that enough?
  15. https://medium.com/google-cloud/analyzing-go-code-with-bigquery-485c70c3b451 Long long time ago
  16. Names of the functions defined in these files ● func main … fn main … function main … public static void main? ● Language Classification ● Language Parsing ● Language Agnostic Token Extraction
  17. bblf.sh
  18. http://xkcd.com/208/
  19. Code As Data ● Data Mining over trillions of lines of code ● Language Classification and Parsing and Token Extraction
  20. LANGUAGE(path, content): Returns the language of a file given its path and contents. Powered by github.com/src-d/enry. Some custom functions
  21. Lines of code per language # total lines of code per language in the Go repo SELECT lang, SUM(lines) as total_lines FROM ( SELECT t.tree_entry_name as name, LANGUAGE(t.tree_entry_name, b.blob_content) AS lang, ARRAY_LENGTH(SPLIT(b.blob_content, 'n')) as lines FROM refs r NATURAL JOIN commits c NATURAL JOIN commit_trees ct NATURAL JOIN tree_entries t NATURAL JOIN blobs b WHERE r.ref_name = 'HEAD' ) AS lines WHERE lang is not null GROUP BY lang ORDER BY total_lines DESC;
  22. Lines of code per language # total lines of code per language in the Go repo SELECT lang, SUM(lines) as total_lines FROM ( SELECT t.tree_entry_name as name, LANGUAGE(t.tree_entry_name, b.blob_content) AS lang, ARRAY_LENGTH(SPLIT(b.blob_content, 'n')) as lines FROM refs r NATURAL JOIN commits c NATURAL JOIN commit_trees ct NATURAL JOIN tree_entries t NATURAL JOIN blobs b WHERE r.ref_name = 'HEAD' ) AS lines WHERE lang is not null GROUP BY lang ORDER BY total_lines DESC;
  23. Some custom functions UAST(content, language, [filter]): Returns the Universal Abstract Syntax Tree resulting of parsing the given content in the given language. Powered by github.com/bblfsh/bblfshd.
  24. SELECT files.repository_id, files.file_path, ARRAY_LENGTH(UAST( files.blob_content, LANGUAGE(files.file_path, files.blob_content), '//*[@roleFunction and @roleDeclaration]') ) as functions FROM files NATURAL JOIN refs WHERE LANGUAGE(files.file_path,files.blob_content) = 'Go' AND refs.ref_name = 'HEAD' Number of functions per Go file
  25. SELECT files.repository_id, files.file_path, ARRAY_LENGTH(UAST( files.blob_content, LANGUAGE(files.file_path, files.blob_content), '//*[@roleFunction and @roleDeclaration]') ) as functions FROM files NATURAL JOIN refs WHERE LANGUAGE(files.file_path,files.blob_content) = 'Go' AND refs.ref_name = 'HEAD' Number of functions per Go file
  26. Code As Data ● Data Mining over trillions of lines of code ● Language Classification and Parsing and Token Extraction ● Is that enough?
  27. Evolution matters
  28. Code As Data ● Data Mining over trillions of lines of code ● Language Classification and Parsing ● All revisions must be available ● Is that enough?
  29. slideshare.net/JeffKunkle/understanding-git
  30. https://xkcd.com/1597/
  31. Code As Data ● Data Mining over trillions of lines of code ● Language Classification and Parsing ● All revisions must be available ● Git to SQL mapping
  32. Demo Time!
  33. ML on Code
  34. Challenge #1 Data Retrieval
  35. The datasets of ML on Code ● GH Archive: https://www.gharchive.org ● Public Git Archive https://pga.sourced.tech
  36. Tasks ● Language Classification ● File Parsing ● Token Extraction ● Reference Resolution ● History Analysis Retrieving data for ML on Code Tools ● enry, linguist, etc ● Babelfish, ad-hoc parsers ● XPath / CSS selectors ● Kythe ● go-git
  37. srcd sql # total lines of code per language in the Go repo SELECT lang, SUM(lines) as total_lines FROM ( SELECT LANGUAGE(t.tree_entry_name, b.blob_content) AS lang, ARRAY_LENGTH(SPLIT(b.blob_content, 'n')) as lines FROM refs r NATURAL JOIN commits c NATURAL JOIN commit_trees ct NATURAL JOIN tree_entries t NATURAL JOIN blobs b WHERE r.ref_name = 'HEAD' and r.repository_id = 'go' ) AS lines WHERE lang is not null GROUP BY lang ORDER BY total_lines DESC;
  38. SELECT files.repository_id, files.file_path, ARRAY_LENGTH(UAST( files.blob_content, LANGUAGE(files.file_path, files.blob_content), '//*[@roleFunction and @roleDeclaration]') ) as functions FROM files NATURAL JOIN refs WHERE LANGUAGE(files.file_path,files.blob_content) = 'Go' AND refs.ref_name = 'HEAD' srcd sql
  39. source{d} engine github.com/src-d/engine
  40. Challenge #2 Data Analysis
  41. '112', '97', '99', '107', '97', '103', '101', '32', '109', '97', '105', '110', '10', '10', '105', '109', '112', '111', '114', '116', '32', '40', '10', '9', '34', '102', '109', '116', '34', '10', '41', '10', '10', '102', '117', '110', '99', '32', '109', '97', '105', '110', '40', '41', '32', '123', '10', '9', '102', '109', '116', '46', '80', '114', '105', '110', '116', '108', '110', '40', '34', '72', '101', '108', '108', '111', '44', '32', '112', '108', '97', '121', '103', '114', '111', '117', '110', '100', '34', '41', '10', '125', '10' package main import “fmt” func main() { fmt.Println(“Hello, Denver”) } What is Source Code
  42. package package IDENT main ; import import STRING "fmt" ; func func IDENT main ( ) What is Source Code { IDENT fmt . IDENT Println ( STRING "Hello, Denver" ) ; } ; package main import “fmt” func main() { fmt.Println(“Hello, Denver”) }
  43. What is Source Code package main import “fmt” func main() { fmt.Println(“Hello, Denver”) }
  44. What is Source Code package main import “fmt” func main() { fmt.Println(“Hello, Denver”) }
  45. What is Source Code ● A sequence of bytes ● A sequence of tokens ● An abstract syntax tree ● A Graph (e.g. Control Flow Graph)
  46. Challenge #3 Learning from Source Code
  47. Neural Networks Basically fancy linear regression machines Given an input of a constant length, they predict an output of constant length. Example: MNIST: Input: images with 28x28 px Output: a digit from zero to 9
  48. MNIST ~0 ~0 ~0 ~0 ~0 ~0 ~0 ~0 ~1 ~0
  49. MLonCode: Predict the next token for i := 0 ; i < 10 ; i ++
  50. Recurrent Neural Networks Can process sequences of variable length. Uses its own output as a new input. Example: Natural Language Translation: Input: “bonjour, les gauffres” Output: “hi, waffles”
  51. MLonCode: Code Generation charRNN: Given n characters, predict the next one Trained over the Go standard library Achieved 61% accuracy on predictions.
  52. Before training r t, kp0t@pp kpktp 0p000 xS%%%?ttk?^@p0rk^@%ppp@ta#p^@ #pp}}%p^@?P%^@@k#%@P}}ta S?@}^@t%@% %%aNt i ^@SSt@@ pyikkp?%y ?t k L P0L t% ^@i%yy ^@p i? %L%LL tyLP?a ?L@Ly?tkk^@ @^@ykk^@i#P^@iL@??@%1tt%^@tPTta L ^@LL%% %i1::yyy^@^@t tP @?@a#Patt 1^@@ k^@k ? yt%L1^@tP%k1?k? % ^@i ^@ta1?1taktt1P?a^@^@Pkt?#^@t^@##1?## #^@t11#:^@%??t%1^@a 1?a at1P ^@^@Pt #%^@^@ ^@aaak^@#a#?P1Pa^@tt%?^@kt?#akP ?#^@i%%aa ^@1%t tt?a?% t^@k^@^@k^@ a : ^@1 P# % ^@^@#t% :% kkP ^@#?P: t^@a ?%##?kkPaP^@ #a k?t?? ai?i%PPk taP% P^@ k^@iiai#?^@# #t ?# P?P^@ i^@ttPt # 1%11 ti a^@k P^@k ^@kt %^@%y?#a a#% @? kt ^@t%k? ^@PtttkL tkLa1 ^@iaay?p1% Pta tt ik?ty k^@kpt%^@tktpkryyp^@?pP# %kt?ki? i @t^@k^@%#P} ?at}akP##Pa11%^@i% ^@?ia ia%##%tki % }i%%%}} a ay^@%yt }%t ^@tU%a% t}yi^@ ^@ @t yt%? aP @% ^@??^@%? ^@??k#% kk#%t?a: P}^@t :#^@#1t^@#: w^@P#%w:Pt t # t%aa%i@ak@@^@ka@^@a # y}^@# ^@? % tP i? ?tk ktPPt a tpprrpt? a^@ pP pt %p ? k? ^@^@ kP^@%%?tk a Pt^@# tP? P kkP1L1tP a%? t1P%PPti^@?%ytk %#%%t?@?^@ty^@iyk%1#^@@^@1#t a t@P^@^@ P@^@1P^@%%#@P:^@%^@ t 1:#P#@LtL#@L L1 %%dt??^@L ^@iBt yTk%p ^@i
  53. After one epoch (dataset seen once) if testingValuesIntering() { t.SetCaterCleen(time.SewsallSetrive(true) if weq := nil { t.Errorf("eshould: wont %v", touts anverals, prc.Strnared, error } t, err := ntr.Soare(cueper(err, err) if err != nil { t.Errorf("preveth dime il resetests:%d; want %#',", tr.test.Into } if err != nil { return } if err == nel { t.Errorf("LoconserrSathe foot %q::%q: %s;%want %d", d, err) }, defarenContateFule(temt.Canses) } if err != nil { return err } // Treters and restives of the sesconse stmpeletatareservet // This no to the result digares wheckader. Constate bytes alleal
  54. After two epochs if !ok { t.Errorf("%d: %v not %v", i, err) } if !ot.Close() if enr != nil { t.Fatal(err) } if !ers != nil { t.Fatal(err) } if err != nil { t.Fatal(err) } if err != nil { t.Errorf("error %q: %s not %v", i, err) } return nil }
  55. if got := t.struct(); !ok { t.Fatalf("Got %q: %q, %v, want %q", test, true } if !strings.Connig(t) { t.Fatalf("Got %q: %q", want %q", t, err) } if !ot { t.Errorf("%s < %v", x, y) } if !ok { t.Errorf("%d <= %d", err) } if !stricgs(); !ot { t.Errorf("!(%d <= %v", x, e) } } if !ot != nil { return "" } After many epochs
  56. Learning to Represent Programs with Graphs from, err := os.Open("a.txt") if err != nil { log.Fatal(err) } defer from.Close() to, err := os.Open("b.txt") if err != nil { log.Fatal(err) } defer ???.Close() io.Copy(to, from) Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi https://arxiv.org/abs/1711.00740 The VARMISUSE Task: Given a program and a gap in it, predict what variable is missing.
  57. code2vec: Learning Distributed Representations of Code Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav https://arxiv.org/abs/1803.09473 | https://code2vec.org/
  58. Much more research github.com/src-d/awesome-machine-learning-on-source-code
  59. Challenge #4 What can we build?
  60. Predictable vs Predicted ~0 ~0 ~0 ~0 ~0 ~0 ~0 ~0 ~1 ~0
  61. A G o PR An attention model for code reviews.
  62. Can you see the mistake? VARMISUSE from, err := os.Open("a.txt") if err != nil { log.Fatal(err) } defer from.Close() to, err := os.Open("b.txt") if err != nil { log.Fatal(err) } defer from.Close() io.Copy(to, from)
  63. Can you see the mistake? VARMISUSE from, err := os.Open("a.txt") if err != nil { log.Fatal(err) } defer from.Close() to, err := os.Open("b.txt") if err != nil { log.Fatal(err) } defer from.Close() ← s/from/to/ io.Copy(to, from)
  64. Is this a good name? func XXX(list []string, text string) bool { for _, s := range list { if s == text { return true } } return false } Suggestions: ● Contains ● Has func XXX(list []string, text string) int { for i, s := range list { if s == text { return i } } return -1 } Suggestions: ● Find ● Index code2vec: Learning Distributed Representations of Code
  65. source: WOCinTech Assisted code review. src-d/lookout
  66. And so much more Coming soon: ● Automated Style Guide Enforcing ● Bug Prediction ● Automated Code Review ● Education Coming … later: ● Code Generation: from unit tests, specification, natural language description. ● Natural Analysis: code description and conversational analysis.
  67. Q&A
Publicidad