2. Why Code Search?
Large amounts of source code are added online all the time.
Documentation is generally not attached to the source code.
Code is unorganized and distributed among different sources.
Many versions of software systems exist and are needed for similarity analysis.
3. Cont..
The volume of source code is enormous (GitHub alone has approximately 10 million projects).
The code is very large and complex.
Most query systems available online use either keyword-based or meta-information-based search.
The code is not easy to search and analyze.
4. What did CodeSearch do for programmers?
The CodeSearch service was a unique tool as it indexed open source code in the wild.
CodeSearch was one of the most valuable tools in existence for all software developers, specifically:
When an API was poorly documented, you could find sample bits of code that used the API.
When an API's error codes were poorly documented, you could find sample bits of code that handled them.
When an API was difficult to use (and the world is packed with those), you could find sample bits of code that used it.
When you quickly wanted to learn a language, you knew you could find quality code with simple searches.
When you wanted to find different solutions to everyday problems dealing with protocols, new specifications, evolving standards and trends, you could turn to CodeSearch.
5. Cont..
When you were faced with an obscure error message, an obscure token, an obscure return value or other forms of poor coding, you would find sample bits of code that solved this problem.
When dealing with proprietary protocols or just poorly documented protocols, you could find how they worked in minutes.
When you were trying to debug yet another broken standard or yet another poorly specified standard, you knew you could turn quickly to CodeSearch to find the answers to your problems (memories of OAuth and IMAP flash in my head).
When learning a new programming language or trying to improve your skills on a new programming language, you could use CodeSearch to learn the idioms and the best (and worst) practices.
When building a new version of a library, either in a new language, making a fluent version, making an open source version, or building a more complete version, you would just go to CodeSearch to find answers to how other people did things.
6. Google Code Search
Developer(s): Google
Initial release: October 5, 2006
Development status: Discontinued
Operating system: Any (web-based application)
Type: Code search engine
Website: http://www.google.com/codesearch (archived version from 2010)
Features included the ability to search using operators, namely lang:, package:, license: and file:.
The code available for searching was in various formats, including tar.gz, .tar.bz2, .tar, and .zip archives and CVS, Subversion, Git and Mercurial repositories.
8. How Google Code Search Worked
Introduction
Code Search was Google's first and only search engine to accept regular expression queries, which was geekily great but a very small niche. When we started Code Search, a Google search for “regular expression search engine” turned up sites where you typed “phone number” and got back “\(\d{3}\) \d{3}-\d{4}”.
Google open sourced the regular expression engine I wrote for Code Search, RE2, in March 2010. Code Search and RE2 have been a great vehicle for educating people about how to do regular expression search safely.
9. Regular expression
In theoretical computer science, a regular expression is a sequence of characters that defines a search pattern. Usually this pattern is then used by string searching algorithms for "find" or "find and replace" operations on strings.
Basic concepts
A regular expression, often called a pattern, is an expression used to specify a set of strings required for a particular purpose.
For example, the set containing the three strings "Handel", "Händel", and "Haendel" can be specified by the pattern H(ä|ae?)ndel; we say that this pattern matches each of the three strings.
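The Handel example can be tried directly with Python's standard re module. This is only a small sketch; the fourth name, "Hendel", is an added counterexample that is not on the slide.

```python
import re

# Pattern from the slide: "H", then "ä" or "a" with an optional "e", then "ndel".
pattern = re.compile(r"H(ä|ae?)ndel")

# "Hendel" is an extra name added here to show a string outside the set.
names = ["Handel", "Händel", "Haendel", "Hendel"]
matches = [name for name in names if pattern.fullmatch(name)]
print(matches)  # ['Handel', 'Händel', 'Haendel']
```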
10. Indexed Word Search
o The key data structure is called a posting list or inverted index, which lists, for every possible search term, the documents that contain that term.
o Consider these three very short documents:
1) Google Code Search
2) Google Code Project Hosting
3) Google Web Search
o The inverted index for these three documents looks like:
Code: {1, 2}
Google: {1, 2, 3}
Hosting: {2}
Project: {2}
Search: {1, 3}
Web: {3}
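A minimal sketch of how such an inverted index could be built and queried, assuming words are separated by whitespace. The function names build_inverted_index and search_all are illustrative only.

```python
from collections import defaultdict

docs = {
    1: "Google Code Search",
    2: "Google Code Project Hosting",
    3: "Google Web Search",
}

def build_inverted_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def search_all(index, words):
    """AND query: ids of the documents that contain every given word."""
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings) if postings else set()

index = build_inverted_index(docs)
print(sorted(index["Code"]))                           # [1, 2]
print(sorted(search_all(index, ["Google", "Search"])))  # [1, 3]
```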
11. Cont..
o To support phrases, full-text search implementations usually record each occurrence of a word in the posting list, along with its position:
Code: {(1, 2), (2, 2)}
Google: {(1, 1), (2, 1), (3, 1)}
Hosting: {(2, 4)}
Project: {(2, 3)}
Search: {(1, 3), (3, 3)}
Web: {(3, 2)}
o An alternate way to support phrases is to treat them as AND queries to identify a set of candidate documents and then filter out non-matching documents after loading the document bodies from disk. In practice, phrases built out of common words like “to be or not to be” make this approach unattractive. Storing the position information in the index entries makes the index bigger but avoids loading a document from disk unless it is guaranteed to be a match.
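A positional index like the one above can answer a phrase query without reading any document body. The sketch below assumes whitespace-separated words and positions counted from 1; build_positional_index and phrase_search are illustrative names.

```python
from collections import defaultdict

docs = {
    1: "Google Code Search",
    2: "Google Code Project Hosting",
    3: "Google Web Search",
}

def build_positional_index(docs):
    """Map every word to a list of (document id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.split(), start=1):
            index[word].append((doc_id, position))
    return index

def phrase_search(index, phrase):
    """Ids of documents in which the words of the phrase appear consecutively."""
    words = phrase.split()
    if not words:
        return set()
    # Start from every occurrence of the first word, then keep only the
    # starts where each later word sits at the expected offset.
    starts = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        occurrences = set(index.get(word, []))
        starts = {(d, p) for (d, p) in starts if (d, p + offset) in occurrences}
    return {doc_id for doc_id, _ in starts}

index = build_positional_index(docs)
print(index["Code"])                                  # [(1, 2), (2, 2)]
print(sorted(phrase_search(index, "Google Code")))    # [1, 2]
print(sorted(phrase_search(index, "Google Search")))  # []
```

"Google Search" finds nothing because no document has the two words next to each other, even though documents 1 and 3 contain both.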
12. Indexed Regular Expression Search
We can use an old information retrieval trick and build an index of n-grams, substrings of length n.
o The document set:
(1) Google Code Search
(2) Google Code Project Hosting
(3) Google Web Search
o has this trigram index (the _ character stands for a space):
_Co: {1, 2}     Sea: {1, 3}     e_W: {3}        ogl: {1, 2, 3}
_Ho: {2}        Web: {3}        ear: {1, 3}     oje: {2}
_Pr: {2}        arc: {1, 3}     eb_: {3}        oog: {1, 2, 3}
_Se: {1, 3}     b_S: {3}        ect: {2}        ost: {2}
_We: {3}        ct_: {2}        gle: {1, 2, 3}  rch: {1, 3}
Cod: {1, 2}     de_: {1, 2}     ing: {2}        roj: {2}
Goo: {1, 2, 3}  e_C: {1, 2}     jec: {2}        sti: {2}
Hos: {2}        e_P: {2}        le_: {1, 2, 3}  t_H: {2}
Pro: {2}        e_S: {1}        ode: {1, 2}     tin: {2}
13. Cont..
o Given a regular expression such as /Google.*Search/, we can build a query of ANDs and ORs that gives the trigrams that must be present in any text matching the regular expression. In this case, the query is
Goo AND oog AND ogl AND gle AND Sea AND ear AND arc AND rch
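The trigram index and the AND query above can be sketched in a few lines. The code keeps the real space character where the listing prints an underscore, and the final filtering step with Python's re module is an assumption about how candidates would be confirmed, not a description of Code Search itself.

```python
import re
from collections import defaultdict

docs = {
    1: "Google Code Search",
    2: "Google Code Project Hosting",
    3: "Google Web Search",
}

def trigrams_of(text):
    """All substrings of length 3 in the text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_trigram_index(docs):
    """Map every trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for trigram in trigrams_of(text):
            index[trigram].add(doc_id)
    return index

index = build_trigram_index(docs)

# Trigram query for /Google.*Search/ as given on the slide.
required = ["Goo", "oog", "ogl", "gle", "Sea", "ear", "arc", "rch"]
candidates = set.intersection(*(index.get(t, set()) for t in required))

# Only the candidate documents are checked with the full regular expression.
hits = sorted(d for d in candidates if re.search(r"Google.*Search", docs[d]))
print(sorted(candidates))  # [1, 3]
print(hits)                # [1, 3]
```

Document 2 is ruled out by the index alone, so the regular expression never has to run on it.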
14. Cont..
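o Five results are computed for each regular expression e: emptyable(e), whether the empty string matches; exact(e), the exact set of matching strings, or unknown; prefix(e), a set of prefixes of all matching strings; suffix(e), a set of suffixes of all matching strings; and match(e), a trigram query that any matching text must satisfy. The rules follow from the meaning of the regular expressions: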
‘’ (empty string)
    emptyable(‘’) = true
    exact(‘’) = {‘’}
    prefix(‘’) = {‘’}
    suffix(‘’) = {‘’}
    match(‘’) = ANY (special query: match all documents)
c (single character)
    emptyable(c) = false
    exact(c) = {c}
    prefix(c) = {c}
    suffix(c) = {c}
    match(c) = ANY
e? (zero or one)
    emptyable(e?) = true
    exact(e?) = exact(e) ∪ {‘’}
    prefix(e?) = {‘’}
    suffix(e?) = {‘’}
    match(e?) = ANY
e* (zero or more)
    emptyable(e*) = true
    exact(e*) = unknown
    prefix(e*) = {‘’}
    suffix(e*) = {‘’}
    match(e*) = ANY
15. Cont..
e+ (one or more)
    emptyable(e+) = emptyable(e)
    exact(e+) = unknown
    prefix(e+) = prefix(e)
    suffix(e+) = suffix(e)
    match(e+) = match(e)
e1 | e2 (alternation)
    emptyable(e1 | e2) = emptyable(e1) or emptyable(e2)
    exact(e1 | e2) = exact(e1) ∪ exact(e2)
    prefix(e1 | e2) = prefix(e1) ∪ prefix(e2)
    suffix(e1 | e2) = suffix(e1) ∪ suffix(e2)
    match(e1 | e2) = match(e1) OR match(e2)
e1 e2 (concatenation)
    emptyable(e1e2) = emptyable(e1) and emptyable(e2)
    exact(e1e2) = exact(e1) × exact(e2), if both are known
               or unknown, otherwise
    prefix(e1e2) = exact(e1) × prefix(e2), if exact(e1) is known
                or prefix(e1) ∪ prefix(e2), if emptyable(e1)
                or prefix(e1), otherwise
    suffix(e1e2) = suffix(e1) × exact(e2), if exact(e2) is known
                or suffix(e2) ∪ suffix(e1), if emptyable(e2)
                or suffix(e2), otherwise
    match(e1e2) = match(e1) AND match(e2)
16. Cont..
Single string
•trigrams(ab) = ANY
•trigrams(abc) = abc
•trigrams(abcd) = abc AND bcd
Set of strings
•trigrams({ab}) = trigrams(ab) = ANY
•trigrams({abcd}) = trigrams(abcd) = abc AND bcd
•trigrams({ab, abcd}) = trigrams(ab) OR trigrams(abcd) = ANY
At any time, set match(e) = match(e) AND trigrams(prefix(e)).
At any time, set match(e) = match(e) AND trigrams(suffix(e)).
At any time, set match(e) = match(e) AND trigrams(exact(e)).
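The rules from the preceding slides, the trigrams function and the three "at any time" steps can be combined into a small analyzer. This is a sketch under several assumptions that are not on the slides: queries are kept in disjunctive normal form, the "at any time" steps are applied after every alternation and concatenation, an unknown exact set stays unknown under e? and e1 | e2, and a dot() rule for "." is added. All function names are illustrative, and expressions are built by calling constructors rather than by parsing.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Optional

# Assumption: a query is stored in disjunctive normal form, a frozenset of
# conjunctions, each conjunction being a frozenset of trigrams.
ANY = frozenset({frozenset()})  # one empty conjunction, satisfied by any text
EMPTY = frozenset({""})         # the set holding only the empty string

def simplify(q):
    # Absorption: drop a conjunction that strictly contains another one.
    return frozenset(c for c in q if not any(o < c for o in q))

def q_and(a, b):
    return simplify(frozenset(x | y for x in a for y in b))

def q_or(a, b):
    return simplify(a | b)

def trigrams(strings):
    # OR over the strings of the AND of each string's trigrams.
    return simplify(frozenset(
        frozenset(s[i:i + 3] for i in range(len(s) - 2)) for s in strings))

@dataclass(frozen=True)
class Info:
    emptyable: bool
    exact: Optional[frozenset]  # None stands for "unknown"
    prefix: frozenset
    suffix: frozenset
    match: frozenset

def cross(a, b):
    return frozenset(x + y for x in a for y in b)

def save(e):
    # The three "at any time" steps: fold the string sets into match.
    m = q_and(e.match, trigrams(e.prefix))
    m = q_and(m, trigrams(e.suffix))
    if e.exact is not None:
        m = q_and(m, trigrams(e.exact))
    return replace(e, match=m)

def empty():
    return Info(True, EMPTY, EMPTY, EMPTY, ANY)

def char(c):
    s = frozenset({c})
    return Info(False, s, s, s, ANY)

def dot():
    # Assumption: "." is not among the listed rules; nothing is known about it.
    return Info(False, None, EMPTY, EMPTY, ANY)

def opt(e):
    exact = None if e.exact is None else e.exact | EMPTY
    return Info(True, exact, EMPTY, EMPTY, ANY)

def star(e):
    return Info(True, None, EMPTY, EMPTY, ANY)

def plus(e):
    return Info(e.emptyable, None, e.prefix, e.suffix, e.match)

def alt(a, b):
    known = a.exact is not None and b.exact is not None
    return save(Info(a.emptyable or b.emptyable,
                     a.exact | b.exact if known else None,
                     a.prefix | b.prefix, a.suffix | b.suffix,
                     q_or(a.match, b.match)))

def cat(a, b):
    known = a.exact is not None and b.exact is not None
    if a.exact is not None:
        prefix = cross(a.exact, b.prefix)
    else:
        prefix = a.prefix | b.prefix if a.emptyable else a.prefix
    if b.exact is not None:
        suffix = cross(a.suffix, b.exact)
    else:
        suffix = b.suffix | a.suffix if b.emptyable else b.suffix
    return save(Info(a.emptyable and b.emptyable,
                     cross(a.exact, b.exact) if known else None,
                     prefix, suffix, q_and(a.match, b.match)))

def lit(s):
    # A literal string is the concatenation of its characters.
    return reduce(cat, map(char, s), empty())

# /Google.*Search/ built by hand from the constructors above.
info = cat(cat(lit("Google"), star(dot())), lit("Search"))
print(sorted(next(iter(info.match))))
# ['Goo', 'Sea', 'arc', 'ear', 'gle', 'ogl', 'oog', 'rch']
```

The resulting query is a single conjunction of the same eight trigrams as the query given earlier for /Google.*Search/. Disjunctive normal form keeps the sketch short but can grow quickly for expressions with many alternatives, so it is a convenience here rather than a recommendation.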
19. Discontinuation
In October 2011, Google announced that Code Search was to be shut down along with the
Code Search API. The service remained online until March 2013, and it now returns a 404.
20. The Best Alternatives to Google Code for Your Programming Projects
GitHub is the juggernaut in this arena, obviously, and the web's most popular code repository.
Well known to nearly everyone who deals in the world of code, GitHub looks to help developers build software through collaboration. As the “world’s largest open source community,” GitHub allows users to share their projects “with the world, get feedback, and contribute to millions of repositories.” What some developers may not know is that GitHub also offers private repositories with upgraded plans.
21. GitHub
Key Features:
Review changes, comment on lines of code, report issues, and plan with discussion tools
Use organization accounts to communicate easily with teams
Integration with several applications and tools
Field-tested tools for any project, public or private
Integrated issue tracking
Use your go-to SVN tools to check out, branch, and commit to GitHub repositories
24. CodePlex
CodePlex is Microsoft’s free open source project hosting site. With CodePlex, users can create, share, and collaborate on projects and download the software they produce.
Key Features:
Source code control
Project discussions
Wiki pages
Feature/issue tracking
Cost: FREE
26. Bitbucket
Bitbucket, from Atlassian, offers unlimited private code repositories for Git or Mercurial. Offering lightweight code review, Bitbucket is one of the most popular source code repository hosts out there.
Key Features:
Built with small teams in mind, so you can consolidate user management, invite members, and share repositories
Review changes on a fork or branch easily with pull requests
In-line comments allow users to have discussions within the source code
Track every commit to an issue in JIRA
28. General information
Name | Manager | Established | Server side: all free software | Client side: all-free JS code | Developed and/or used CDE | Require free software on registration | Ad-free | Notes
Bitbucket | Atlassian | 2008 | No | No | Unknown | No | Yes | Denies service to Cuba, Iran, North Korea, Sudan, Syria
GitHub | GitHub, Inc | 2008-04 | No | No | Unknown | No | Yes | List of government takedown requests
CodePlex | Microsoft | 2006-05 | No | Unknown | Unknown | No | Yes | Project must be OSS licensed
29. Features
Name | Code Review | Bug Tracking | Web Hosting | Wiki | Translation System | Shell server | Mailing List | Forum | Personal Branch | Private Branch | Announce | Build System | Team | Release Binaries | Self-hosting
Bitbucket | Yes | Yes | Yes | Yes | No | No | No | No | Yes | Yes | No | No | Yes | Yes | Commercially (Stash)
GitHub | Yes | Yes | Yes | Yes | No | No | No | No | Yes | Yes | Yes | 3rd-party (e.g. Travis CI, Appveyor and others) | Yes | Yes | Commercially (GitHub Enterprise)
CodePlex | No | Yes | No | Yes | No | No | Yes | Yes | No | No | No | No | No | Yes | No
30. Popularity
Name | Users | Projects | Alexa rank
Bitbucket | Unknown | Unknown | 834 as of 22 June 2016
CodePlex | Unknown | 107,712 | 2,689 as of 22 June 2016
GitHub | 15,000,000 | 38,000,000 | 53 as of 19 August 2016
Google Code | Unknown | 250,000+ | N/A (subdomain not tracked)
31. Available version control systems
Name | CVS | Git | Mercurial | SVN | Bazaar | TFS | Arch | Perforce | Fossil
Bitbucket | No | Yes | Yes | No | No | No | No | No | No
CodePlex | No | Yes | Yes | Yes | No | Yes | No | No | No
GitHub | No | Yes | No | Partial | No | No | No | No | No