The IBM Center for Open-Source Data and AI Technologies (CODAIT, https://developer.ibm.com/code/open/centers/codait/) works on multiple open-source data and AI projects. This section introduces those projects, which cover Jupyter Notebooks, reusable model and data assets, and Trusted AI, among others.
28. The IBM Data Asset eXchange
Also known as DAX.
A place to find curated free and open datasets under open data licenses.
Part of developer.ibm.com.
29. The MAX Named Entity Tagger
A model that identifies mentions of named entities, such as persons and organizations, in English-language text.
Trained by Nick Pentreath on the CODAIT team.
Most difficult part: finding usable training data.
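To illustrate the kind of output a named entity tagger produces, here is a minimal sketch that groups per-token tags into entity mentions. The IOB-style tag names (`B-PER`, `I-PER`, `B-ORG`, `O`), the sample sentence, and the `group_entities` helper are assumptions made for illustration; they are not taken from the MAX model's API.

```python
from typing import List, Optional, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse IOB-style token tags into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Emit the entity collected so far, if any, and reset the buffer.
        nonlocal current_tokens, current_type
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A "B-" tag always starts a new entity.
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An "I-" tag continues the entity of the same type.
            current_tokens.append(token)
        else:
            # "O" tags (and stray "I-" tags) end the current entity.
            flush()
    flush()
    return entities


# Hypothetical tagger output for one sentence.
tokens = ["Nick", "Pentreath", "works", "at", "IBM", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O"]
mentions = group_entities(tokens, tags)
```

In this sketch a stray `I-` tag with no preceding `B-` tag of the same type is dropped; a real tagger's post-processing may handle that case differently.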
30. Groningen Meaning Bank
A project at the University of Groningen to create an open data set for training linguistic models like named entity taggers.
Public domain data with public domain annotations, assembled by a 10-person team with help from online volunteers.
We needed to make further modifications to pass IBM’s own controls.
31. Contracts Proposition Bank
A collection of annotated sentences drawn from IBM’s public contracts.
Created by IBM Research.
Used by IBM researchers to train better semantic role labeling (SRL) parsers for the legal documents domain.
Available on DAX.
32. IBM’s Open Data
IBM Research has produced dozens, perhaps hundreds, of open data sets.
The data is not kept in one place.
IBM is working to improve this.
– Initiatives within IBM Research
– DAX
– The Community Data License Agreement
33. The Community Data License Agreement
http://cdla.io
Linux Foundation initiative to create a new legal framework that meets the needs of AI data sets.
IBM is a major supporter.
34. The Community Data License Agreement
http://cdla.io
Two licenses written specifically for AI data:
• CDLA-Sharing: “Copyleft” license analogous to the GPL
• CDLA-Permissive: Similar to the BSD license
Both licenses distinguish clearly between use (analysis, modeling) and modification of the data set.
35. IBM Data Asset eXchange (DAX)
• Curated free and open datasets under open data licenses
• Standardized dataset formats and metadata
• Ready for use in enterprise AI applications
• Complement to the Model Asset eXchange (MAX)
Data Asset eXchange: ibm.biz/data-asset-exchange
Model Asset eXchange: ibm.biz/model-exchange
49. Jupyter Notebook
Simple, but Powerful
As simple as opening a web page, with the capabilities of a powerful, multilingual development environment.
Interactive widgets
Code can produce rich outputs such as images, videos, Markdown, LaTeX and JavaScript. Interactive widgets can be used to manipulate and visualize data in real time.
Language of choice
Jupyter Notebooks support over 50 programming languages, including those popular in Data Science, Data Engineering, and AI, such as Python, R, Julia and Scala.
Big Data Integration
Leverage Big Data platforms such as Apache Spark from Python, R and Scala. Explore the same data with pandas, scikit-learn, ggplot2, dplyr, etc.
Share Notebooks
Notebooks can be shared with others using e-mail, Dropbox, Google Drive, GitHub, etc.
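Sharing is straightforward largely because a notebook is a single JSON file (`.ipynb`). As a rough sketch, the following builds a minimal notebook document with only the Python standard library. The field layout is meant to follow the nbformat version 4 schema; the `make_notebook` helper and the cell contents are hypothetical.

```python
import json


def make_notebook(markdown_text: str, code_text: str) -> dict:
    """Build a two-cell notebook document (one Markdown cell, one code cell)."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            {
                "cell_type": "markdown",
                "metadata": {},
                "source": markdown_text,
            },
            {
                "cell_type": "code",
                "metadata": {},
                "execution_count": None,  # the cell has not been run yet
                "outputs": [],            # so there is no stored output
                "source": code_text,
            },
        ],
    }


notebook = make_notebook("# Hello", "print(1 + 1)")
serialized = json.dumps(notebook, indent=1)
```

Writing `serialized` to a file ending in `.ipynb` should produce something Jupyter can open; that one file is what gets e-mailed, uploaded, or pushed to GitHub.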
62. Kubeflow
End-to-end ML platform on Kubernetes.
Originated at Google.
Key Projects
– Model Training and Hyperparameter Optimization
– Model Serving
– Model Management
– Pipelines:
• Combine components into complex workflows
– Metadata
• Collect data from multiple components
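The idea of combining components into a workflow and collecting metadata from each of them can be sketched as a dependency graph that is executed in topological order. This is a plain-Python illustration, not the Kubeflow Pipelines SDK; the component names, the metadata values, and the `run_pipeline` helper are made up for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List


def run_pipeline(
    components: Dict[str, Callable[[], dict]],
    depends_on: Dict[str, List[str]],
) -> Dict[str, dict]:
    """Run every component after its dependencies and collect its metadata."""
    # Map each component to the components that must finish before it.
    graph = {name: depends_on.get(name, []) for name in components}
    metadata: Dict[str, dict] = {}
    for name in TopologicalSorter(graph).static_order():
        metadata[name] = components[name]()
    return metadata


# Hypothetical three-step workflow: preprocess -> train -> serve.
components = {
    "serve": lambda: {"endpoint": "/predict"},
    "train": lambda: {"epochs": 3},
    "preprocess": lambda: {"rows": 100},
}
depends_on = {"train": ["preprocess"], "serve": ["train"]}
metadata = run_pipeline(components, depends_on)
```

The returned dictionary lists the components in the order they ran, so it doubles as a simple record of the workflow. In a real pipeline system each step would typically run in its own container rather than as an in-process function.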
63. Overall community, and IBM’s presence in Kubeflow
• Commits in Kubeflow compared with other companies
• IBM is the 2nd or 3rd largest contributor in the past 12 months
• IBM maintainers (approvers/reviewers) in Katib (HPO+Training), Kubeflow Serving, Manifests, Pipelines, etc.
https://www.stackalytics.com/unaffiliated?project_type=kubeflow-group
64. IBMers contributing to:
• 590+ Commits
• 924K Lines of Code
https://www.stackalytics.com/unaffiliated?project_type=kubeflow-group&company=ibm