SlideShare una empresa de Scribd logo
1 de 35
Descargar para leer sin conexión
Software Heritage
the Universal Archive of our Software Commons
Stefano Zacchiroli
Software Heritage
zack@upsilon.cc
26 November 2016
Codemotion
Milan, Italy
THE GREAT LIBRARY OF SOURCE CODE
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 18
code by Margaret Hamilton and her NASA team, http://www.ibiblio.org/apollo/
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 2 / 18
/ ∗ This r o u t i n e e x p l o i t s a f i x e d 512 byte i n p u t b u f f e r in a
∗ VAX running the BSD 4 . 3 f i n g e r d binary . I t send 536
∗ b y t e s ( p lu s a newline ) to o v e r w r i t e s i x e x t r a words in
∗ the s t a c k frame , i n c l u d i n g the r e t u r n PC , to p o i n t i n t o
∗ the middle of the s t r i n g s e n t over . The i n s t r u c t i o n s in
∗ the s t r i n g do the d i r e c t system c a l l v e r s i o n of
∗ e xe c v e ( " / bin / sh " ) . ∗ /
s t a t i c t r y _ f i n g e r ( host , fd1 , fd2 ) / ∗ 0 x49ec , < j u s t _ r e t u r n +378
struct hst ∗ host ;
int ∗ fd1 , ∗ fd2 ;
{
/* ... */
for ( i = 0 ; i < 5 3 6 ; i ++) / ∗ 628 ,654 ∗ /
buf [ i ] = ’ 0 ’ ;
for ( i = 0 ; i < 4 0 0 ; i ++)
buf [ i ] = 1 ;
for ( j = 0 ; j < 2 8 ; j ++)
buf [ i + j ] = "  3 3 5  2 1 7 / sh  0  3 3 5  2 1 7 / bin 320^Z  3 3 5  0  3 3 5  0
https://github.com/arialdomartini/morris-worm
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 3 / 18
Source code is knowledge
“Programs must be written for people to read, and only
incidentally for machines to execute.” — Harold Abelson
Distinguishing features
executable and human readable knowledge (an all time new)
naturally evolves over time
development history is key to its understanding
complex: large web of dependencies, millions of SLOCs
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 4 / 18
The Software Commons
Definition (Commons)
The commons is the cultural and natural resources accessible to all
members of a society, including natural materials such as air, water,
and a habitable earth. These resources are held in common, not
owned privately. https://en.wikipedia.org/wiki/Commons
Definition (Software Commons)
The software commons consists of all computer software which is
available at little or no cost and which can be altered and reused
with few restrictions. Thus all open source software and all free
software are part of the [software] commons. [...]
https://en.wikipedia.org/wiki/Software_Commons
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 5 / 18
Software is spread all around
Fashion victims
many disparate development platforms
a myriad places where distribution may happen
projects tend to migrate from one place to the other over time
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 6 / 18
Software is spread all around
Fashion victims
many disparate development platforms
a myriad places where distribution may happen
projects tend to migrate from one place to the other over time
One place...
... where can we find, track and search all source code?
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 6 / 18
Software is fragile
Like all digital information, FOSS is fragile
inconsiderate and/or malicious code loss (e.g., Code Spaces)
business-driven code loss (e.g., Gitorious, Google Code)
for obsolete code: physical media decay (data rot)
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 7 / 18
Software is fragile
Like all digital information, FOSS is fragile
inconsiderate and/or malicious code loss (e.g., Code Spaces)
business-driven code loss (e.g., Gitorious, Google Code)
for obsolete code: physical media decay (data rot)
If a website disappears you go to the Internet Archive...
... where do you go if (a repository on) GitHub goes away?
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 7 / 18
The Software Heritage project
THE GREAT LIBRARY OF SOURCE CODE
Our mission
Collect, preserve and share the source code of all the software that
lies at the heart of our culture and our society.
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 8 / 18
Our principles
Open approach
100% FOSS
transparency
In for the long haul
replication
non profit
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 9 / 18
Archiving goals
Targets: VCS repositories & source code releases (e.g., tarballs)
We DO archive
file content (= blobs)
revisions (= commits), with full metadata
releases (= tags), ditto
(project metadata)
where & when we found any of the above
... in a VCS-/archive-agnostic canonical data model
We DON’T archive (UNIX philosophy)
homepages, wikis → collaboration with the Internet Archive
BTS/issues/code reviews/etc.
mailing lists
Long term vision: play our part in a "semantic wikipedia of software"
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 10 / 18
Data flow
dsc
dsc
hg
hg
hg
git
git
git git
svn
svn
svn
tar
zip
software
origins
Package
repos
Software Heritage
Archive
Forges
GitHub
lister
GitLab
lister
Debian
lister
Git
loader
Mercurial
loader
Debian source
package loader
PyPi
lister
tar loader
Merkle DAG
+
blob storage
.
.
.
.
.
.
Distros
...
Scheduling
Listing
(full/incremental)
Loading
& deduplication
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 11 / 18
Merkle trees
Merkle tree (R. C. Merkle, Crypto 1979)
Combination of
tree
hash function
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 12 / 18
Merkle trees
Merkle tree (R. C. Merkle, Crypto 1979)
Combination of
tree
hash function
Classical cryptographic construction
fast, parallel signature of large data structures
widely used (e.g., Git, Bitcoin, IPFS, ...)
built-in deduplication
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 12 / 18
The archive: a (giant) Merkle DAG
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 13 / 18
SHA1 collisions considered harmful
create domain sha1 as bytea
check ( length ( value ) = 2 0 ) ;
create domain sha1_git as bytea
check ( length ( value ) = 2 0 ) ;
create domain sha256 as bytea
check ( length ( value ) = 3 2 ) ;
create table content (
sha1 sha1 primary key ,
sha1_git sha1_git not null ,
sha256 sha256 not null ,
length b i g i n t not null ,
ctime timestamptz not null default now ( ) ,
s t a t u s c o n t e n t _ s t a t u s not null default ’ v i s i b l e ’ ,
o b j e c t _ i d b i g s e r i a l
) ;
create unique index on content ( sha1_git ) ;
create unique index on content ( sha256 ) ;
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 14 / 18
Archive coverage
Our sources
GitHub — full, up-to-date mirror
Debian — daily snapshots of all suites since 2005–2015
GNU — all releases as of August 2015
Gitorious, Google Code — local copy (Archive Team & Google)
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 15 / 18
Archive coverage
Our sources
GitHub — full, up-to-date mirror
Debian — daily snapshots of all suites since 2005–2015
GNU — all releases as of August 2015
Gitorious, Google Code — local copy (Archive Team & Google)
Some numbers
150 TB blobs, 5 TB database (as a graph: 4 B nodes + 40 B edges)
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 15 / 18
Archive coverage
Our sources
GitHub — full, up-to-date mirror
Debian — daily snapshots of all suites since 2005–2015
GNU — all releases as of August 2015
Gitorious, Google Code — local copy (Archive Team & Google)
Some numbers
150 TB blobs, 5 TB database (as a graph: 4 B nodes + 40 B edges)
The richest source code archive already, ... and growing daily!
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 15 / 18
The road ahead
Planned features...
lookup by content hash (done)
download: wget and git clone from Software Heritage
provenance information for all archived code and metadata
browsing: wayback machine for archived code and its history
full-text search on all archived source code files
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 16 / 18
The road ahead
Planned features...
lookup by content hash (done)
download: wget and git clone from Software Heritage
provenance information for all archived code and metadata
browsing: wayback machine for archived code and its history
full-text search on all archived source code files
... and much more than one could possibly imagine
all the world’s software development history in a single graph!
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 16 / 18
An ambitious, worldwide initiative
Inria as initiator
.fr national CS research institution
strong FOSS culture
founding partner of the W3C
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 17 / 18
An ambitious, worldwide initiative
Inria as initiator
.fr national CS research institution
strong FOSS culture
founding partner of the W3C
Supporters and early partners
ACM, Nokia Bell Labs, Creative Commons, DANS, Eclipse,
Engineering, FSF, OSI, GitHub, GitLab, IEEE, Informatics Europe,
Microsoft, OIN, OW2, SIF, SFC, SFLC, The Document Foundation,
The Linux Foundation, ...
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 17 / 18
An ambitious, worldwide initiative
Inria as initiator
.fr national CS research institution
strong FOSS culture
founding partner of the W3C
Supporters and early partners
ACM, Nokia Bell Labs, Creative Commons, DANS, Eclipse,
Engineering, FSF, OSI, GitHub, GitLab, IEEE, Informatics Europe,
Microsoft, OIN, OW2, SIF, SFC, SFLC, The Document Foundation,
The Linux Foundation, ...
Going global
building an open, multistakeholder, nonprofit global organisation
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 17 / 18
Conclusion
Software Heritage is
a revolutionary reference archive of all FOSS ever written
a unique complement for development platforms
an international, open, nonprofit, mutualized infrastructure
at the service of our community, at the service of society!
Come in, we’re open!
www.softwareheritage.org — sponsoring, job openings
wiki.softwareheritage.org — internships, leads
forge.softwareheritage.org — our own code
Questions?
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 18 / 18
A giant (extended) Merkle DAG
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1
A giant (extended) Merkle DAG
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1
A giant (extended) Merkle DAG
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1
A giant (extended) Merkle DAG
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1
A giant (extended) Merkle DAG
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1
A giant (extended) Merkle DAG
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1
A giant (extended) Merkle DAG
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1
A giant (extended) Merkle DAG
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1
A giant (extended) Merkle DAG
Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1

Más contenido relacionado

La actualidad más candente

La actualidad más candente (8)

Learn how to build decentralized and serverless html5 applications with embar...
Learn how to build decentralized and serverless html5 applications with embar...Learn how to build decentralized and serverless html5 applications with embar...
Learn how to build decentralized and serverless html5 applications with embar...
 
Python - basics
Python - basicsPython - basics
Python - basics
 
The Ring programming language version 1.6 book - Part 17 of 189
The Ring programming language version 1.6 book - Part 17 of 189The Ring programming language version 1.6 book - Part 17 of 189
The Ring programming language version 1.6 book - Part 17 of 189
 
WebRTC と Native とそれから、それから。
WebRTC と Native とそれから、それから。 WebRTC と Native とそれから、それから。
WebRTC と Native とそれから、それから。
 
Tizen store-z1-20150228rzr
Tizen store-z1-20150228rzrTizen store-z1-20150228rzr
Tizen store-z1-20150228rzr
 
Sheila Ayelen Berta - The Art of Persistence: "Mr. Windows… I don’t wanna go ...
Sheila Ayelen Berta - The Art of Persistence: "Mr. Windows… I don’t wanna go ...Sheila Ayelen Berta - The Art of Persistence: "Mr. Windows… I don’t wanna go ...
Sheila Ayelen Berta - The Art of Persistence: "Mr. Windows… I don’t wanna go ...
 
Mikhail Belopuhov: OpenBSD: Where is crypto headed?
Mikhail Belopuhov: OpenBSD: Where is crypto headed?Mikhail Belopuhov: OpenBSD: Where is crypto headed?
Mikhail Belopuhov: OpenBSD: Where is crypto headed?
 
Golang and Domain Specific Languages - Lorenzo Fontana - Codemotion Rome 2017
Golang and Domain Specific Languages - Lorenzo Fontana - Codemotion Rome 2017Golang and Domain Specific Languages - Lorenzo Fontana - Codemotion Rome 2017
Golang and Domain Specific Languages - Lorenzo Fontana - Codemotion Rome 2017
 

Destacado

Cv312 g formation-db2-for-z-os-new-features-in-version-10-workshop
Cv312 g formation-db2-for-z-os-new-features-in-version-10-workshopCv312 g formation-db2-for-z-os-new-features-in-version-10-workshop
Cv312 g formation-db2-for-z-os-new-features-in-version-10-workshop
CERTyou Formation
 
Discussion on the legitimacy source of Court Judge in Taiwan
Discussion on the legitimacy source of Court Judge in TaiwanDiscussion on the legitimacy source of Court Judge in Taiwan
Discussion on the legitimacy source of Court Judge in Taiwan
Dennis Chen
 
Opinión personal sobre la elavoración del blog
Opinión personal sobre la elavoración del blogOpinión personal sobre la elavoración del blog
Opinión personal sobre la elavoración del blog
laycar
 
Changes In Education
Changes In EducationChanges In Education
Changes In Education
britte2204
 
La Vasija Agrietada Diapositivas
La Vasija Agrietada DiapositivasLa Vasija Agrietada Diapositivas
La Vasija Agrietada Diapositivas
avapla81
 
Pembubaran persekutuan
Pembubaran persekutuanPembubaran persekutuan
Pembubaran persekutuan
adelaa09
 
Hombre, cultura y sociedad.Unidad I
Hombre, cultura y sociedad.Unidad IHombre, cultura y sociedad.Unidad I
Hombre, cultura y sociedad.Unidad I
martaarmas28
 
Final semestral alice ciencia
Final semestral alice cienciaFinal semestral alice ciencia
Final semestral alice ciencia
SUSY84
 

Destacado (20)

Culbe
CulbeCulbe
Culbe
 
Cv312 g formation-db2-for-z-os-new-features-in-version-10-workshop
Cv312 g formation-db2-for-z-os-new-features-in-version-10-workshopCv312 g formation-db2-for-z-os-new-features-in-version-10-workshop
Cv312 g formation-db2-for-z-os-new-features-in-version-10-workshop
 
Modulo1.T2.Clasificacion de blogs
Modulo1.T2.Clasificacion de blogsModulo1.T2.Clasificacion de blogs
Modulo1.T2.Clasificacion de blogs
 
Floppy disks
Floppy disksFloppy disks
Floppy disks
 
Discussion on the legitimacy source of Court Judge in Taiwan
Discussion on the legitimacy source of Court Judge in TaiwanDiscussion on the legitimacy source of Court Judge in Taiwan
Discussion on the legitimacy source of Court Judge in Taiwan
 
Foro computacion
Foro computacionForo computacion
Foro computacion
 
BB Achievements
BB Achievements BB Achievements
BB Achievements
 
Opinión personal sobre la elavoración del blog
Opinión personal sobre la elavoración del blogOpinión personal sobre la elavoración del blog
Opinión personal sobre la elavoración del blog
 
Changes In Education
Changes In EducationChanges In Education
Changes In Education
 
Informe de la SD Eibar
Informe de la SD EibarInforme de la SD Eibar
Informe de la SD Eibar
 
Política de gestión de los medios sociales (GEYC)
Política de gestión de los medios sociales (GEYC)Política de gestión de los medios sociales (GEYC)
Política de gestión de los medios sociales (GEYC)
 
Formulario Mate
Formulario MateFormulario Mate
Formulario Mate
 
La Vasija Agrietada Diapositivas
La Vasija Agrietada DiapositivasLa Vasija Agrietada Diapositivas
La Vasija Agrietada Diapositivas
 
Pembubaran persekutuan
Pembubaran persekutuanPembubaran persekutuan
Pembubaran persekutuan
 
Hombre, cultura y sociedad.Unidad I
Hombre, cultura y sociedad.Unidad IHombre, cultura y sociedad.Unidad I
Hombre, cultura y sociedad.Unidad I
 
Final semestral alice ciencia
Final semestral alice cienciaFinal semestral alice ciencia
Final semestral alice ciencia
 
Departamento de Cordillera
Departamento de CordilleraDepartamento de Cordillera
Departamento de Cordillera
 
NTR Birthday Deisigns
NTR Birthday Deisigns  NTR Birthday Deisigns
NTR Birthday Deisigns
 
Web 2.0
Web 2.0Web 2.0
Web 2.0
 
PIA NEWSLETTER
PIA NEWSLETTERPIA NEWSLETTER
PIA NEWSLETTER
 

Similar a Software Heritage: let's build together the universal archive of our software commons - Stefano Zacchiroli - Codemotion Milan 2016

Software Heritage: Archiving the Free Software Commons for Fun & Profit
Software Heritage: Archiving the Free Software Commons for Fun & ProfitSoftware Heritage: Archiving the Free Software Commons for Fun & Profit
Software Heritage: Archiving the Free Software Commons for Fun & Profit
Speck&Tech
 

Similar a Software Heritage: let's build together the universal archive of our software commons - Stefano Zacchiroli - Codemotion Milan 2016 (20)

Software Heritage: Building the Universal Software Archive, OW2con'16, Paris.
Software Heritage: Building the Universal Software Archive, OW2con'16, Paris.Software Heritage: Building the Universal Software Archive, OW2con'16, Paris.
Software Heritage: Building the Universal Software Archive, OW2con'16, Paris.
 
Software Heritage: Archiving the Free Software Commons for Fun & Profit
Software Heritage: Archiving the Free Software Commons for Fun & ProfitSoftware Heritage: Archiving the Free Software Commons for Fun & Profit
Software Heritage: Archiving the Free Software Commons for Fun & Profit
 
Software Heritage, a revolutionary infrastructure for software source code, O...
Software Heritage, a revolutionary infrastructure for software source code, O...Software Heritage, a revolutionary infrastructure for software source code, O...
Software Heritage, a revolutionary infrastructure for software source code, O...
 
R. Di Cosmo - Software Heritage
R. Di Cosmo - Software HeritageR. Di Cosmo - Software Heritage
R. Di Cosmo - Software Heritage
 
Cape Cod Web Technology Meetup - 3
Cape Cod Web Technology Meetup - 3Cape Cod Web Technology Meetup - 3
Cape Cod Web Technology Meetup - 3
 
OpenChain Webinar #5: Software Heritage
OpenChain Webinar #5: Software HeritageOpenChain Webinar #5: Software Heritage
OpenChain Webinar #5: Software Heritage
 
Collaborations in the Extreme: 
The rise of open code development in the scie...
Collaborations in the Extreme: 
The rise of open code development in the scie...Collaborations in the Extreme: 
The rise of open code development in the scie...
Collaborations in the Extreme: 
The rise of open code development in the scie...
 
IoTivity: From Devices to the Cloud
IoTivity: From Devices to the CloudIoTivity: From Devices to the Cloud
IoTivity: From Devices to the Cloud
 
A Tour of Open Source on the Mainframe
A Tour of Open Source on the MainframeA Tour of Open Source on the Mainframe
A Tour of Open Source on the Mainframe
 
#OSSPARIS17 - Logiciel libre pour une science reproductible, par ROBERTO DI C...
#OSSPARIS17 - Logiciel libre pour une science reproductible, par ROBERTO DI C...#OSSPARIS17 - Logiciel libre pour une science reproductible, par ROBERTO DI C...
#OSSPARIS17 - Logiciel libre pour une science reproductible, par ROBERTO DI C...
 
Hacktoberfest 2020 - Open source for beginners
Hacktoberfest 2020 - Open source for beginnersHacktoberfest 2020 - Open source for beginners
Hacktoberfest 2020 - Open source for beginners
 
Managing Code in Repositories
Managing Code in RepositoriesManaging Code in Repositories
Managing Code in Repositories
 
Automate your iOS deployment a bit
Automate your iOS deployment a bitAutomate your iOS deployment a bit
Automate your iOS deployment a bit
 
Osis18_Cloud : Software-heritage
Osis18_Cloud : Software-heritageOsis18_Cloud : Software-heritage
Osis18_Cloud : Software-heritage
 
IoTivity for Automotive: meta-ocf-automotive tutorial
IoTivity for Automotive: meta-ocf-automotive tutorialIoTivity for Automotive: meta-ocf-automotive tutorial
IoTivity for Automotive: meta-ocf-automotive tutorial
 
Using oss at an internet company and hacker culture; Linux Enterprise Users M...
Using oss at an internet company and hacker culture; Linux Enterprise Users M...Using oss at an internet company and hacker culture; Linux Enterprise Users M...
Using oss at an internet company and hacker culture; Linux Enterprise Users M...
 
Software development made serious
Software development made seriousSoftware development made serious
Software development made serious
 
Open Design Communities - MAKlab Glasgow (UK) 16/09/2011
Open Design Communities - MAKlab Glasgow (UK) 16/09/2011Open Design Communities - MAKlab Glasgow (UK) 16/09/2011
Open Design Communities - MAKlab Glasgow (UK) 16/09/2011
 
NTU Workshop: 01 What Is Open Design
NTU Workshop: 01 What Is Open DesignNTU Workshop: 01 What Is Open Design
NTU Workshop: 01 What Is Open Design
 
IDAS Workshop: 01 What Is Open Design
IDAS Workshop: 01 What Is Open DesignIDAS Workshop: 01 What Is Open Design
IDAS Workshop: 01 What Is Open Design
 

Más de Codemotion

Más de Codemotion (20)

Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...
Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...
Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...
 
Pompili - From hero to_zero: The FatalNoise neverending story
Pompili - From hero to_zero: The FatalNoise neverending storyPompili - From hero to_zero: The FatalNoise neverending story
Pompili - From hero to_zero: The FatalNoise neverending story
 
Pastore - Commodore 65 - La storia
Pastore - Commodore 65 - La storiaPastore - Commodore 65 - La storia
Pastore - Commodore 65 - La storia
 
Pennisi - Essere Richard Altwasser
Pennisi - Essere Richard AltwasserPennisi - Essere Richard Altwasser
Pennisi - Essere Richard Altwasser
 
Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...
Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...
Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...
 
Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019
Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019
Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019
 
Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019
Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019
Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019
 
Francesco Baldassarri - Deliver Data at Scale - Codemotion Amsterdam 2019 -
Francesco Baldassarri  - Deliver Data at Scale - Codemotion Amsterdam 2019 - Francesco Baldassarri  - Deliver Data at Scale - Codemotion Amsterdam 2019 -
Francesco Baldassarri - Deliver Data at Scale - Codemotion Amsterdam 2019 -
 
Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...
Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...
Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...
 
Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...
Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...
Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...
 
Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...
Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...
Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...
 
Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...
Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...
Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...
 
Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019
Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019
Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019
 
Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019
Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019
Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019
 
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019
 
James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...
James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...
James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...
 
Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...
Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...
Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...
 
Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019
Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019
Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019
 
Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019
Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019
Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019
 
Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019
Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019
Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019
 

Último

Último (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 

Software Heritage: let's build together the universal archive of our software commons - Stefano Zacchiroli - Codemotion Milan 2016

  • 1. Software Heritage the Universal Archive of our Software Commons Stefano Zacchiroli Software Heritage zack@upsilon.cc 26 November 2016 Codemotion Milan, Italy THE GREAT LIBRARY OF SOURCE CODE Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 18
  • 2. code by Margaret Hamilton and her NASA team, http://www.ibiblio.org/apollo/ Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 2 / 18
  • 3. / ∗ This r o u t i n e e x p l o i t s a f i x e d 512 byte i n p u t b u f f e r in a ∗ VAX running the BSD 4 . 3 f i n g e r d binary . I t send 536 ∗ b y t e s ( p lu s a newline ) to o v e r w r i t e s i x e x t r a words in ∗ the s t a c k frame , i n c l u d i n g the r e t u r n PC , to p o i n t i n t o ∗ the middle of the s t r i n g s e n t over . The i n s t r u c t i o n s in ∗ the s t r i n g do the d i r e c t system c a l l v e r s i o n of ∗ e xe c v e ( " / bin / sh " ) . ∗ / s t a t i c t r y _ f i n g e r ( host , fd1 , fd2 ) / ∗ 0 x49ec , < j u s t _ r e t u r n +378 struct hst ∗ host ; int ∗ fd1 , ∗ fd2 ; { /* ... */ for ( i = 0 ; i < 5 3 6 ; i ++) / ∗ 628 ,654 ∗ / buf [ i ] = ’ 0 ’ ; for ( i = 0 ; i < 4 0 0 ; i ++) buf [ i ] = 1 ; for ( j = 0 ; j < 2 8 ; j ++) buf [ i + j ] = " 3 3 5 2 1 7 / sh 0 3 3 5 2 1 7 / bin 320^Z 3 3 5 0 3 3 5 0 https://github.com/arialdomartini/morris-worm Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 3 / 18
  • 4. Source code is knowledge “Programs must be written for people to read, and only incidentally for machines to execute.” — Harold Abelson Distinguishing features executable and human readable knowledge (an all time new) naturally evolves over time development history is key to its understanding complex: large web of dependencies, millions of SLOCs Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 4 / 18
  • 5. The Software Commons Definition (Commons) The commons is the cultural and natural resources accessible to all members of a society, including natural materials such as air, water, and a habitable earth. These resources are held in common, not owned privately. https://en.wikipedia.org/wiki/Commons Definition (Software Commons) The software commons consists of all computer software which is available at little or no cost and which can be altered and reused with few restrictions. Thus all open source software and all free software are part of the [software] commons. [...] https://en.wikipedia.org/wiki/Software_Commons Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 5 / 18
  • 6. Software is spread all around Fashion victims many disparate development platforms a myriad places where distribution may happen projects tend to migrate from one place to the other over time Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 6 / 18
  • 7. Software is spread all around Fashion victims many disparate development platforms a myriad places where distribution may happen projects tend to migrate from one place to the other over time One place... ... where can we find, track and search all source code? Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 6 / 18
  • 8. Software is fragile Like all digital information, FOSS is fragile inconsiderate and/or malicious code loss (e.g., Code Spaces) business-driven code loss (e.g., Gitorious, Google Code) for obsolete code: physical media decay (data rot) Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 7 / 18
  • 9. Software is fragile Like all digital information, FOSS is fragile inconsiderate and/or malicious code loss (e.g., Code Spaces) business-driven code loss (e.g., Gitorious, Google Code) for obsolete code: physical media decay (data rot) If a website disappears you go to the Internet Archive... ... where do you go if (a repository on) GitHub goes away? Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 7 / 18
  • 10. The Software Heritage project THE GREAT LIBRARY OF SOURCE CODE Our mission Collect, preserve and share the source code of all the software that lies at the heart of our culture and our society. Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 8 / 18
  • 11. Our principles Open approach 100% FOSS transparency In for the long haul replication non profit Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 9 / 18
  • 12. Archiving goals Targets: VCS repositories & source code releases (e.g., tarballs) We DO archive file content (= blobs) revisions (= commits), with full metadata releases (= tags), ditto (project metadata) where & when we found any of the above ... in a VCS-/archive-agnostic canonical data model We DON’T archive (UNIX philosophy) homepages, wikis → collaboration with the Internet Archive BTS/issues/code reviews/etc. mailing lists Long term vision: play our part in a "semantic wikipedia of software" Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 10 / 18
  • 13. Data flow dsc dsc hg hg hg git git git git svn svn svn tar zip software origins Package repos Software Heritage Archive Forges GitHub lister GitLab lister Debian lister Git loader Mercurial loader Debian source package loader PyPi lister tar loader Merkle DAG + blob storage . . . . . . Distros ... Scheduling Listing (full/incremental) Loading & deduplication Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 11 / 18
  • 14. Merkle trees Merkle tree (R. C. Merkle, Crypto 1979) Combination of tree hash function Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 12 / 18
  • 15. Merkle trees Merkle tree (R. C. Merkle, Crypto 1979) Combination of tree hash function Classical cryptographic construction fast, parallel signature of large data structures widely used (e.g., Git, Bitcoin, IPFS, ...) built-in deduplication Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 12 / 18
  • 16. The archive: a (giant) Merkle DAG Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 13 / 18
  • 17. SHA1 collisions considered harmful create domain sha1 as bytea check ( length ( value ) = 2 0 ) ; create domain sha1_git as bytea check ( length ( value ) = 2 0 ) ; create domain sha256 as bytea check ( length ( value ) = 3 2 ) ; create table content ( sha1 sha1 primary key , sha1_git sha1_git not null , sha256 sha256 not null , length b i g i n t not null , ctime timestamptz not null default now ( ) , s t a t u s c o n t e n t _ s t a t u s not null default ’ v i s i b l e ’ , o b j e c t _ i d b i g s e r i a l ) ; create unique index on content ( sha1_git ) ; create unique index on content ( sha256 ) ; Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 14 / 18
  • 18. Archive coverage Our sources GitHub — full, up-to-date mirror Debian — daily snapshots of all suites since 2005–2015 GNU — all releases as of August 2015 Gitorious, Google Code — local copy (Archive Team & Google) Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 15 / 18
  • 19. Archive coverage Our sources GitHub — full, up-to-date mirror Debian — daily snapshots of all suites since 2005–2015 GNU — all releases as of August 2015 Gitorious, Google Code — local copy (Archive Team & Google) Some numbers 150 TB blobs, 5 TB database (as a graph: 4 B nodes + 40 B edges) Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 15 / 18
  • 20. Archive coverage Our sources GitHub — full, up-to-date mirror Debian — daily snapshots of all suites since 2005–2015 GNU — all releases as of August 2015 Gitorious, Google Code — local copy (Archive Team & Google) Some numbers 150 TB blobs, 5 TB database (as a graph: 4 B nodes + 40 B edges) The richest source code archive already, ... and growing daily! Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 15 / 18
  • 21. The road ahead Planned features... lookup by content hash (done) download: wget and git clone from Software Heritage provenance information for all archived code and metadata browsing: wayback machine for archived code and its history full-text search on all archived source code files Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 16 / 18
  • 22. The road ahead Planned features... lookup by content hash (done) download: wget and git clone from Software Heritage provenance information for all archived code and metadata browsing: wayback machine for archived code and its history full-text search on all archived source code files ... and much more than one could possibly imagine all the world’s software development history in a single graph! Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 16 / 18
  • 23. An ambitious, worldwide initiative Inria as initiator .fr national CS research institution strong FOSS culture founding partner of the W3C Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 17 / 18
  • 24. An ambitious, worldwide initiative Inria as initiator .fr national CS research institution strong FOSS culture founding partner of the W3C Supporters and early partners ACM, Nokia Bell Labs, Creative Commons, DANS, Eclipse, Engineering, FSF, OSI, GitHub, GitLab, IEEE, Informatics Europe, Microsoft, OIN, OW2, SIF, SFC, SFLC, The Document Foundation, The Linux Foundation, ... Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 17 / 18
  • 25. An ambitious, worldwide initiative Inria as initiator .fr national CS research institution strong FOSS culture founding partner of the W3C Supporters and early partners ACM, Nokia Bell Labs, Creative Commons, DANS, Eclipse, Engineering, FSF, OSI, GitHub, GitLab, IEEE, Informatics Europe, Microsoft, OIN, OW2, SIF, SFC, SFLC, The Document Foundation, The Linux Foundation, ... Going global building an open, multistakeholder, nonprofit global organisation Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 17 / 18
  • 26. Conclusion Software Heritage is a revolutionary reference archive of all FOSS ever written a unique complement for development platforms an international, open, nonprofit, mutualized infrastructure at the service of our community, at the service of society! Come in, we’re open! www.softwareheritage.org — sponsoring, job openings wiki.softwareheritage.org — internships, leads forge.softwareheritage.org — our own code Questions? Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 18 / 18
  • 27. A giant (extended) Merkle DAG Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1
  • 28. A giant (extended) Merkle DAG Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1
  • 29. A giant (extended) Merkle DAG Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1
  • 30. A giant (extended) Merkle DAG Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1
  • 31. A giant (extended) Merkle DAG Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1
  • 32. A giant (extended) Merkle DAG Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1
  • 33. A giant (extended) Merkle DAG Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1
  • 34. A giant (extended) Merkle DAG Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1
  • 35. A giant (extended) Merkle DAG Stefano Zacchiroli Software Heritage 26/11/2016, Codemotion 1 / 1