This document discusses building a graph database and domain-specific language (DSL) for analyzing Slack data. It defines entities like messages, users, and channels as graph nodes and their relationships as edges. A REST API is created to ingest and query the graph using TinkerPop and remote traversals. Custom traversal sources and classes define shorthand traversals and business logic to build the DSL, adding structure and meaning to queries over the Slack data graph.
3. About
• Automating the process of data storytelling
• Connecting to business stacks
• Visualisation: custom-built infographics
• Natural-language generated insights
• Export & share stories: email, PowerPoint, TV, web, embedded SDK
• For more information, visit www.nugit.co
4. Agenda
• Use Cases
• The Slack APIs
• Defining the Entities
• Graph Design and Considerations
• Making the Graph RESTful
• Building a DSL
• Testing the Application
• Scaling the Graph
5. Use Cases - Communities
• View contribution to communication
• Participation across channels
• Identify collaborative groups
• Users connected by mentions and reactions
• Identify influential users per channel
6. Use Cases – Top Posts
• Highlight engaging conversations
• Top videos, GIFs, links
• Get insights across channels
10. Defining the Entities
• Narrows down data required for the use case
• Helps “whiteboarding” process for graph design
• Allows defining schema for payloads
• Requires understanding the nuances of the platform
11. Graph Design and Considerations
• Team node acts as root node
• Allows maintaining separate graphs for different organisations
12. Graph Design and Considerations
• Top posts and notable messages are both message nodes
• Differentiated using edge labels
• Edge traversals favoured over property lookup
13. Graph Design and Considerations
• Any user can comment on, react to or be mentioned in any message
• Reaction type modelled as edge property
• Efficient as the use case does not need filtering by reaction type
14. Graph Design and Considerations
• Same file shared across channels shares a common pool of reactions
• Schema respects Slack-specific behaviour
• Handles idempotency based on the unique ID maintained by Slack
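The design points on slides 11 to 14 can be sketched without a graph database. The snippet below is a minimal in-memory stand-in, not the implementation behind these slides: `TinyGraph`, `upsert_node`, `upsert_edge`, the edge labels and all IDs are hypothetical. It shows nodes keyed by Slack-style unique IDs, so that replaying a payload creates no duplicates, and the reaction type stored as an edge property.

```python
class TinyGraph:
    """Hypothetical in-memory stand-in for the Slack graph (illustration only)."""

    def __init__(self):
        self.nodes = {}  # node uid -> {"label": ..., "props": {...}}
        self.edges = {}  # edge uid -> {"label", "out", "in", "props"}

    def upsert_node(self, uid, label, **props):
        # Keyed by the Slack-provided unique ID, so repeated ingestion
        # updates the same node instead of creating a duplicate.
        node = self.nodes.setdefault(uid, {"label": label, "props": {}})
        node["props"].update(props)
        return node

    def upsert_edge(self, edge_uid, label, out_uid, in_uid, **props):
        # Edges get a deterministic ID as well (assumed scheme).
        edge = self.edges.setdefault(
            edge_uid, {"label": label, "out": out_uid, "in": in_uid, "props": {}})
        edge["props"].update(props)
        return edge

    def in_edges(self, uid, label):
        return [e for e in self.edges.values()
                if e["in"] == uid and e["label"] == label]


graph = TinyGraph()
graph.upsert_node("T1", "team")
graph.upsert_node("F100", "file", title="roadmap.pdf")

# The same file is shared in two channels but stays a single node ...
for channel_uid in ("C1", "C2"):
    graph.upsert_node(channel_uid, "channel")
    graph.upsert_edge("F100|{}|shared_in".format(channel_uid),
                      "shared_in", "F100", channel_uid)

# ... so reactions made in either channel land in one common pool.
# The third entry replays the first one and must not add an edge.
for user_uid, name in [("U1", "joy"), ("U2", "tada"), ("U1", "joy")]:
    graph.upsert_node(user_uid, "user")
    graph.upsert_edge("{}|F100|reacted_to|{}".format(user_uid, name),
                      "reacted_to", user_uid, "F100", name=name)

# Counting reactions needs no filtering by reaction type.
reaction_count = len(graph.in_edges("F100", "reacted_to"))
print(reaction_count)
```

Counting the incoming `reacted_to` edges is enough for the top-posts use case; the reaction name is only read when the result object is built.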
19. The Journey So Far
• Defining entities and modelling them into Graph
• Iterative feedback-driven process
• Understanding the data available from the API
• Identifying unique IDs
• Filtering out required fields
20. Data Ingestion and Extraction
• Apache Flink cluster retrieves, parses and filters Slack data
• GraphQL service requests data for visualization
• Flask REST service ingests/queries data to/from TinkerPop
[Diagram labels: POST, PUT, GET; Gremlin-Python; Gremlin Bytecode]
21. Why TinkerPop?
• Abstraction that lets us avoid vendor lock-in
• Reduces rework when switching data stores
• Gremlin query language
• Hadoop and SparkGraphComputer
22. Making the Graph RESTful
• Defining REST Endpoints
• Defining the Resources
• Remote Traversals
23. Defining REST Endpoints
• Write endpoints for seeding
• POST /teams/<team_uid>/channels
• POST /teams/<team_uid>/channels/<channel_uid>/messages
• Handling Idempotency
• Replace default strategy with "ElementIdStrategy"
• Enables creation of nodes with Slack-specific unique IDs
// scripts/empty-sample.groovy
globals << [g : graph.traversal(), sg: graph.traversal().withStrategies(ElementIdStrategy.build().create())]
• Read endpoints for queries
• GET /teams/<team_uid>/top_posts
24. Making the Graph RESTful
• Setting up REST Endpoints
• Defining the Resources
• Remote Traversals
25. Defining the Resources
from marshmallow import Schema, fields, pre_load, pre_dump, post_load, validates_schema
from marshmallow.exceptions import ValidationError
...
class MessageSchema(Schema):
    """ Holds all the required fields for a message object."""
    ts = fields.Float(required=True)
    text = fields.Str()
    comment = fields.Str()
    subtype = fields.Str()
    bot_id = fields.Str(validate=is_bot_uid)
    user = fields.Str(validate=is_user_uid)
    thread_ts = fields.Str()
    file_share = fields.Nested(FileShareSchema, load_from="file")
    attachments = fields.Nested(AttachmentSchema, many=True)
    reactions = fields.Nested(ReactionSchema, many=True)
    comments = fields.Nested(CommentSchema, many=True, load_from="replies")
    mentions = fields.List(fields.Str(validate=is_user_uid))
class AttachmentSchema(Schema):
    """ Holds all the required fields for an Attachment object."""
class ReactionSchema(Schema):
    """ Holds all the required fields for a reaction object."""
class CommentSchema(Schema):
    """ Holds all the required fields for a comment object."""
...
• Organized code with single point of reference
26. Defining the Resources
from marshmallow import Schema, fields, pre_load, pre_dump, post_load, validates_schema
from marshmallow.exceptions import ValidationError
...
class MessageSchema(Schema):
    """ Holds all the required fields for a message object."""
    @validates_schema
    def validate_message(self, data):
        """ Validate if the message contains any of comments, mentions or reactions. """
        if not any([f(data) for f in (has_comments, has_mentions, has_reactions)]):
            raise ValidationError("The message must contain comments, mentions or reactions")
    ts = fields.Float(required=True)
    text = fields.Str()
    comment = fields.Str()
    subtype = fields.Str()
    bot_id = fields.Str(validate=is_bot_uid)
    user = fields.Str(validate=is_user_uid)
    thread_ts = fields.Str()
    file_share = fields.Nested(FileShareSchema, load_from="file")
    attachments = fields.Nested(AttachmentSchema, many=True)
    reactions = fields.Nested(ReactionSchema, many=True)
    comments = fields.Nested(CommentSchema, many=True, load_from="replies")
    mentions = fields.List(fields.Str(validate=is_user_uid))
class AttachmentSchema(Schema):
    """ Holds all the required fields for an Attachment object."""
class ReactionSchema(Schema):
    """ Holds all the required fields for a reaction object."""
class CommentSchema(Schema):
    """ Holds all the required fields for a comment object."""
...
• Organized code with single point of reference
• Validate data before ingestion
• Enforce types and required fields
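The helpers `has_comments`, `has_mentions` and `has_reactions` are referenced on this slide but not shown. A minimal plain-Python sketch of what such a schema-level check could look like follows. The helper bodies, the payload keys and the use of `ValueError` in place of marshmallow's `ValidationError` are assumptions made for illustration.

```python
def has_comments(data):
    """True if the payload carries at least one comment (assumed key: 'comments')."""
    return bool(data.get("comments"))


def has_mentions(data):
    """True if the payload carries at least one mention (assumed key: 'mentions')."""
    return bool(data.get("mentions"))


def has_reactions(data):
    """True if the payload carries at least one reaction (assumed key: 'reactions')."""
    return bool(data.get("reactions"))


def validate_message(data):
    """Reject messages that have no comments, mentions or reactions."""
    if not any(check(data) for check in (has_comments, has_mentions, has_reactions)):
        raise ValueError("The message must contain comments, mentions or reactions")
    return data


# A message with a reaction passes the check.
accepted = validate_message({"ts": 1465895473.00005, "reactions": [{"name": "joy"}]})

# A plain message without interactions is rejected before ingestion.
try:
    validate_message({"ts": 1465895473.00005, "text": "hello"})
    rejected = False
except ValueError:
    rejected = True
```

Rejecting such messages at the schema boundary keeps nodes that would never contribute to the use cases out of the graph.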
27. Defining the Resources
from marshmallow import Schema, fields, pre_load, pre_dump, post_load, validates_schema
from marshmallow.exceptions import ValidationError
...
class MessageSchema(Schema):
    """ Holds all the required fields for a message object."""
class AttachmentSchema(Schema):
    """ Holds all the required fields for an Attachment object."""
    title = fields.Str()
    fallback = fields.Str()
    text = fields.Str()
    thumb_url = fields.Str()
    image_url = fields.Str()
    title_link = fields.Str()
    @post_load
    def reshape_attachment(self, data):
        """ Apply required transformations on the Attachment object. """
        # Create a post_title field
        collapse_keys(data, "post_title", *("fallback", "title", "text"))
        # Create a post_thumbnail field
        collapse_keys(data, "post_thumbnail", *("thumb_url", "image_url", "title_link"))
        # Set post_type to URL
        data["post_type"] = "URL"
class ReactionSchema(Schema):
    """ Holds all the required fields for a reaction object."""
class CommentSchema(Schema):
    """ Holds all the required fields for a comment object."""
class FileShareSchema(Schema):
    """ Holds all the required fields for a File Share object."""
class UserSchema(Schema):
    """ Holds all the required fields for a User object."""
...
• Organized code with single point of reference
• Validate data before ingestion
• Enforce types and required fields
• Normalize fields with post-processing
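The `collapse_keys` helper is called in `reshape_attachment` but its body is not part of the slides. One plausible reading, sketched below, is that it stores the first non-empty candidate value under the target key and drops the candidate keys. That behaviour, the sample attachment and the example URL are assumptions, not the original helper.

```python
def collapse_keys(data, target, *candidates):
    """Store the first non-empty candidate value under `target`.

    Assumed behaviour: candidate keys are removed from `data`, and `target`
    is only set when at least one candidate had a value.
    """
    values = [data.pop(key, None) for key in candidates]
    chosen = next((value for value in values if value), None)
    if chosen is not None:
        data[target] = chosen
    return data


# Hypothetical attachment payload after field validation.
attachment = {
    "title": "Demo video",
    "text": "Watch this",
    "image_url": "https://example.com/a.png",
}

collapse_keys(attachment, "post_title", "fallback", "title", "text")
collapse_keys(attachment, "post_thumbnail", "thumb_url", "image_url", "title_link")
attachment["post_type"] = "URL"
```

With this reading, every attachment ends up with the same three normalized fields regardless of which of the Slack fields were populated.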
28. Making the Graph RESTful
• Schema enforcement and validation
• Handling Idempotency of endpoints
• Custom Traversal Source
29. Remote Traversals
• Bytecode sent over network instead of string
• Allows using custom traversal source for a Domain Specific Language (DSL)
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.structure.graph import Graph
...
conn = DriverRemoteConnection(GREMLIN_SERVER_HOST, 'sg')
slack = Graph().traversal(SlackTraversalSource).withRemote(conn)
31. Building a DSL - Motivations
• Custom traversal source can also specify useful shorthands
• E.g. Traversing to all the Channel nodes
class SlackTraversalSource(BaseTraversalSource):
    """ Module to initialise a Graph with the methods listed under SlackTraversal. """
    def __init__(self, *args, **kwargs):
        super(SlackTraversalSource, self).__init__(*args, **kwargs)
        self.graph_traversal = SlackTraversal
    def channels(self, *channel_ids):
        """ Shorthand to identify all channel nodes"""
        traversal = self.get_graph_traversal()
        traversal.bytecode.add_step("V")
        traversal.bytecode.add_step("hasLabel", NODES.channel)
        if channel_ids:
            traversal.bytecode.add_step("has", "__id", P.within(channel_ids))
        return traversal
32. Building a DSL - Motivations
• Custom traversal class specifies the business logic behind traversals
• E.g. Connecting a User node to a Channel node
class SlackTraversal(BaseTraversal):
    def addPartOfChannelEdges(self, channel_uid, *user_uids, **kwargs):
        """ Add an edge to a channel from the users who were/are a part of the channel. """
        for user_uid in user_uids:
            # Deterministic edge ID, so re-seeding the same membership is idempotent
            edge_uid = construct_uid(user_uid, channel_uid, EDGES.part_of.name, delim="|")
            self.getOrAddEdgeFrom(edge_label=EDGES.part_of, edge_uid=edge_uid,
                                  node_label=NODES.user, node_uid=user_uid) \
                .upsertProperties(kwargs.get("properties")).inV()
        return self
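`construct_uid` is not shown in the slides. Judging by the call above, it joins its parts with a delimiter to form a deterministic edge ID. The sketch below assumes exactly that; the body and the sample IDs are hypothetical.

```python
def construct_uid(*parts, delim="|"):
    """Join the parts into one deterministic ID (assumed behaviour of the helper)."""
    return delim.join(str(part) for part in parts)


# Seeding the same membership twice yields the same edge ID,
# which is what lets the graph detect an existing edge.
edge_uids = [construct_uid(user_uid, "C024BE91L", "part_of")
             for user_uid in ("U1", "U2", "U1")]
unique_edge_uids = sorted(set(edge_uids))
```

Because the ID is derived only from the two endpoints and the edge label, a replayed payload maps onto the edge that already exists.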
33. Building a DSL - Motivations
from gremlin_python.process.graph_traversal import GraphTraversal
from gremlin_python.process.graph_traversal import GraphTraversalSource, __
from gremlin_python.process.traversal import T
class BaseTraversal(GraphTraversal):
    def getOrAddEdgeFrom(self, edge_label, edge_uid, node_label, node_uid):
        """
        Adds an edge from the node with the given label and uid only if the edge doesn't exist.
        """
        return self.coalesce(
            # First branch: look for an existing edge with this ID and source node
            __.inE(edge_label).hasId(edge_uid).and_(
                __.outV().hasId(node_uid), __.outV().hasLabel(node_label)),
            __.addE(edge_label).property(T.id, edge_uid).from_(
                __.V().getNode(node_label, node_uid)))
• BaseTraversal handles creation of nodes and edges
• These methods should guarantee idempotency
• E.g. Creation of edges between two nodes…
• ...checks for an existing edge
34. Building a DSL - Motivations
from gremlin_python.process.graph_traversal import GraphTraversal
from gremlin_python.process.graph_traversal import GraphTraversalSource, __
from gremlin_python.process.traversal import T
class BaseTraversal(GraphTraversal):
    def getOrAddEdgeFrom(self, edge_label, edge_uid, node_label, node_uid):
        """
        Adds an edge from the node with the given label and uid only if the edge doesn't exist.
        """
        return self.coalesce(
            __.inE(edge_label).hasId(edge_uid).and_(
                __.outV().hasId(node_uid), __.outV().hasLabel(node_label)),
            # Second branch: runs only when the first branch finds nothing
            __.addE(edge_label).property(T.id, edge_uid).from_(
                __.V().getNode(node_label, node_uid)))
• The edge is created only if it doesn't already exist
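The get-or-create behaviour of `coalesce` can be illustrated in plain Python. The sketch below only mimics the semantics (the first branch that yields a result wins) over a dictionary. The function names and the edge store are hypothetical, and no Gremlin is involved.

```python
def coalesce(*branches):
    """Return the result of the first branch that yields something."""
    for branch in branches:
        result = branch()
        if result:
            return result
    return []


edges = {}  # edge uid -> edge record; stands in for the graph's edge store


def get_or_add_edge_from(edge_label, edge_uid, out_uid, in_uid):
    def existing():
        # Look up the edge by ID and confirm label and source node match.
        edge = edges.get(edge_uid)
        if edge and edge["label"] == edge_label and edge["out"] == out_uid:
            return [edge]
        return []

    def create():
        # Only reached when `existing` found nothing.
        edges[edge_uid] = {"id": edge_uid, "label": edge_label,
                           "out": out_uid, "in": in_uid}
        return [edges[edge_uid]]

    return coalesce(existing, create)[0]


first = get_or_add_edge_from("part_of", "U1|C1|part_of", "U1", "C1")
second = get_or_add_edge_from("part_of", "U1|C1|part_of", "U1", "C1")
```

The second call returns the edge created by the first one, so calling the method any number of times leaves exactly one edge behind.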
35. Building a DSL – Custom Workflows
• Standardized steps for generating a visualization are defined in the BaseTraversal
• Custom maps define traversal paths for fields that vary across visualizations
def build_visualization(self, traversal_source, **kwargs):
    """ The below are standardized steps that are required to generate data for any visualization."""
    return (self.start(traversal_source)
            .filterByDate(self.date_dimension,
                          kwargs.get("start_time"),
                          kwargs.get("end_time"))
            .filterByFields(self.filters_map,
                            kwargs.get("filters"))
            .sortByFields(self.sorting_map,
                          kwargs.get("sort_field"),
                          kwargs.get("sort_direction"))
            .buildObject(self.object_map).toList())
36. Building a DSL – Custom Workflows
# Sample filter from frontend
filter_obj = {'_and': [{"field": 'reactions', '_gte': 100},
                       {"field": 'post_creator', '_in': ['bob', 'chloe']}]}
filter_map = {
    "post_creator": lambda pred: __.in_(EDGES.created_post).has(USER.display_name, pred),
    "reactions": lambda pred: __.inE(EDGES.reacted_to).count().is_(pred)
}
object_map = {
    "post_creator": {"uid": [__.in_(EDGES.created_post).values("__id"), __.constant("")],
                     "image": ...  # define similar path here
                     },
    "reactions": __.inE(EDGES.reacted_to).groupCount().by(__.values(REACTION.name))
}
start = lambda traversal_source: traversal_source.posts()
# Inject maps into DSL methods
(start(slack)
    .filterByFields(self.filters_map, kwargs.get("filters"))
    .buildObject(self.object_map)
    .toList())
# DSL generates the required lower level base traversals
slack.posts().where(
    __.and_(
        __.inE(EDGES.reacted_to).count().is_(P.gte(100)),
        __.in_(EDGES.created_post).has(USER.display_name, P.within(['bob', 'chloe'])))
).project("post_creator", "reactions").by(
    __.project("image", "display_name", "uid")
        .by(__.in_(EDGES.created_post).values(USER.image))
        .by(__.in_(EDGES.created_post).values(USER.display_name))
        .by(__.in_(EDGES.created_post).values("__id"))
).by(
    __.inE(EDGES.reacted_to).groupCount().by(__.values(REACTION.name))
).toList()
• The DSL takes in functions/paths that map fields to their traversals
• Maps customized based on the visualization that is needed
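How a nested filter object might be turned into per-field checks can be sketched in plain Python. This is an illustration under stated assumptions, not the DSL's `filterByFields`: `compile_filter`, `field_getters` and the sample posts are hypothetical, and ordinary comparison functions stand in for Gremlin predicates such as `P.gte` and `P.within`.

```python
import operator

# Plain-Python stand-ins for Gremlin predicates (assumption):
# "_gte" plays the role of P.gte, "_in" the role of P.within.
OPERATORS = {
    "_gte": operator.ge,
    "_in": lambda value, options: value in options,
}


def compile_filter(filter_obj, field_getters):
    """Turn a nested filter object into a function post -> bool."""
    if "_and" in filter_obj:
        checks = [compile_filter(child, field_getters) for child in filter_obj["_and"]]
        return lambda post: all(check(post) for check in checks)
    # Leaf: one field name plus one operator/value pair.
    getter = field_getters[filter_obj["field"]]
    op_name, expected = next((k, v) for k, v in filter_obj.items() if k != "field")
    compare = OPERATORS[op_name]
    return lambda post: compare(getter(post), expected)


# Maps each field to the way its value is obtained, mirroring filter_map.
field_getters = {
    "post_creator": lambda post: post["creator"],
    "reactions": lambda post: sum(post["reactions"].values()),
}

filter_obj = {"_and": [{"field": "reactions", "_gte": 100},
                       {"field": "post_creator", "_in": ["bob", "chloe"]}]}

posts = [
    {"creator": "chloe", "reactions": {"palm_tree": 82, "robot_face": 18}},
    {"creator": "dave", "reactions": {"joy": 500}},
    {"creator": "bob", "reactions": {"joy": 3}},
]

matches_filter = compile_filter(filter_obj, field_getters)
matches = [post for post in posts if matches_filter(post)]
```

Keeping the field-to-path map separate from the filter parsing is what allows the same parsing code to serve every visualization; only the map changes.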
37. Building a DSL – Custom Workflows
{
    "reactions": {
        "palm_tree": 82,
        "robot_face": 18
    },
    "post_creator": {
        "image": "https://url_of_image.jpg",
        "display_name": "chloe",
        "uid": "U024ZH7HL"
    }
}
• The traversals generated churn out the final response objects
• Objects rendered into visualizations by the client
39. Testing Our Application – Unit Testing
[Diagram: test cycle with the steps Write a failing test, Write code to make the test pass, Start Gremlin Server, Use Fixtures, Check if test passes]
class TestNodeMethods(object):
    """ Test methods that help in retrieval and creation of Nodes. """
    def test_node_retrieval(self, graph):
        """ Test if getNode retrieves an existing node. """
        assert graph.V().getNode(label="person", uid=100).count().next() == 1
        assert graph.V().getNode(label="person", uid=101).count().next() == 1
40. Testing Our Application – Unit Testing
def getNode(self, label, uid):
    """
    Returns the node with the given label and uid.
    Args:
        label (string): The label of the node to return
        uid (string): Unique ID of the node
    Raises:
        StopIteration: Node with the given label and uid does not exist
    """
    return self.and_(__.hasLabel(label), __.has(T.id, uid))
41. Testing Our Application – Unit Testing
$ bin/gremlin-server.sh conf/gremlin-server-neo4j-python.yaml
class TestBasicTraversal(object):
    """
    Tests for methods that help create edges or nodes
    and methods that help populate the properties of these objects.
    """
    @pytest.fixture(scope="module")
    def graph(self):
        """ Graph with two nodes and one edge connecting them. """
        graph = Graph().traversal(CerebroTraversalSource).withRemote(
            DriverRemoteConnection(GREMLIN_SERVER_HOST,
                                   GREMLIN_SERVER_TRAVERSER))
        graph.V().clear()
        from_node = graph.addV("person").property(T.id, 100).next()
        to_node = graph.addV("person").property(T.id, 101).next()
        graph.addE("knows").from_(from_node).to(to_node).property("__id", "1").next()
        yield graph
        graph.V().clear()
42. Testing Our Application – Unit Testing
• Check if test passes
class TestNodeMethods(object):
    """ Test methods that help in retrieval and creation of Nodes. """
    def test_node_retrieval(self, graph):
        """ Test if getNode retrieves an existing node. """
        assert graph.V().getNode(label="person", uid=100).count().next() == 1
        assert graph.V().getNode(label="person", uid=101).count().next() == 1
43. Testing Our Application – Unit Testing
• Fixture used to test if the MessageSchema class is implemented correctly
[
    {
        "reactions": [
            {
                "name": "joy",
                "users": [
                    "U5K7JUATE"
                ]
            }
        ],
        "attachments": [
            {
                ...
            }
        ],
        "text": "<https://www.youtube.com/watch?v=4iEh1ykb13w>",
        "ts": "1465895473.000050",
        "user": "U37BF9457",
        "type": "message"
    }
]
class MessageSchema(Schema):
    """ Holds all the required fields for a message object."""
    . . .
44. Testing Our Application – Unit Testing
• MessageSchema needs to include mentions
• Update the fixture to be able to test that the schema includes mentions
• Need to validate if traversals pick up mentions
[
    {
        "reactions": [
            {
                "name": "joy",
                "users": [
                    "U5K7JUATE"
                ]
            }
        ],
        "attachments": [
            {...}
        ],
        "text": "<@U123456> <https://www.youtube.com/watch?v=4iEh1ykb13w>",
        "mentions": [
            "U123456"
        ],
        "ts": ...,
        "type": "message"
    }
]
class MessageSchema(Schema):
    """ Holds all the required fields for a message object."""
    mentions = fields.List(fields.Str(validate=is_user_uid))
45. Testing Our Application – Unit Testing
[Diagram: the test cycle gains the step Update JSON & Generate GraphSON]
[
    {
        "reactions": [
            {
                "name": "joy",
                "users": [
                    "U5K7JUATE"
                ]
            }
        ],
        "attachments": [
            {...}
        ],
        "text": "<@U123456> <https://www.youtube.com/watch?v=4iEh1ykb13w>",
        "mentions": [
            "U123456"
        ],
        "ts": ...,
        "type": "message"
    }
]
gremlin> graph.io(graphson()).writeGraph("graph_name.json")
46. Testing Our Application – Unit Testing
@pytest.fixture(scope="module")
def slack_graph():
    """ Open a subgraph on localhost for testing. """
    slack.V().clear()
    slack_client = Client(GREMLIN_SERVER_HOST, SLACK_TRAVERSER)
    path_to_fixture = str(Path.cwd().joinpath(
        "tests/fixtures/slack_graph.json"))
    # Load the GraphSON fixture on the server side
    graphson_statement = 'graph.io(graphson()).readGraph("{}")'.format(path_to_fixture)
    slack_client.submit(graphson_statement).all().result()
    yield slack
    slack.V().clear()
47. Testing the Application – CI/CD
• Automated tests using CircleCI
• Custom Configuration for Gremlin Server
• Caching Dependencies for Faster Tests
48. Testing the Application – CI/CD
steps: # CircleCI 2.0
  ...
  - run:
      command: |
        if [ ! -d ./apache-tinkerpop-gremlin-server-3.3.3 ]; then
          curl -O https://archive.apache.org/dist/tinkerpop/3.3.3/apache-tinkerpop-gremlin-server-3.3.3-bin.zip
          unzip -q apache-tinkerpop-gremlin-server-3.3.3-bin.zip
          # Install gremlin-python
          cd ./apache-tinkerpop-gremlin-server-3.3.3 &&
          ./bin/gremlin-server.sh install org.apache.tinkerpop gremlin-python 3.3.3
          # Change max content length and traversal strategy
          sed -i -- 's/.*maxContentLength:.*/maxContentLength: 2621440/g' conf/gremlin-server.yaml
          sed -i -- 's/graph.traversal()]/graph.traversal(),sg: graph.traversal().withStrategies(ElementIdStrategy.build().create())]/g' ./scripts/empty-sample.groovy
        fi
  ...
52. Async Traversals
• Seed individual entities using "next"
• Each call to "next" is blocking
def seed_channels(data, team_uid):
    for channel_data in data:
        channel_uid, creator, members = (channel_data.pop(key) for
                                         key in ["uid", "creator", "members"])
        slack.V().addChannel(channel_uid, properties=channel_data).next()
        slack.teams(team_uid).addTeamHasChannelEdge(team_uid, channel_uid).next()
        slack.users(creator).addCreatedChannelEdge(creator, channel_uid).next()
        slack.channels(channel_uid).addPartOfChannelEdges(channel_uid, *members).next()
• Seed subgraph using "next"
• Reduce number of blocking calls to one per channel
def seed_channels(data, team_uid):
    for channel_data in data:
        channel_uid, creator, members = (channel_data.pop(key) for
                                         key in ["uid", "creator", "members"])
        (slack.V().addChannel(channel_uid, properties=channel_data)
            .addTeamHasChannelEdge(team_uid, channel_uid).inV()
            .addCreatedChannelEdge(creator, channel_uid).inV()
            .addPartOfChannelEdges(channel_uid, *members).next())
• Seed subgraph using "promise"
• Make seeding asynchronous, no blocking calls
• Verify that the returned futures were successful
def seed_channels(data, team_uid):
    for channel_data in data:
        channel_uid, creator, members = (channel_data.pop(key) for
                                         key in ["uid", "creator", "members"])
        (slack.V().addChannel(channel_uid, properties=channel_data)
            .addTeamHasChannelEdge(team_uid, channel_uid).inV()
            .addCreatedChannelEdge(creator, channel_uid).inV()
            .addPartOfChannelEdges(channel_uid, *members).promise())
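The difference between blocking on every result and collecting futures can be sketched with the standard library. The snippet uses `concurrent.futures` as a stand-in for the future-like objects returned by `promise()`; `seed_channel`, the sleep and the channel IDs are hypothetical and no Gremlin Server is contacted.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def seed_channel(channel_uid):
    """Stand-in for one chained seeding traversal; sleeps to mimic a round trip."""
    time.sleep(0.05)
    return channel_uid


channel_uids = ["C1", "C2", "C3", "C4"]

# Blocking style: each call waits for its own result, like next().
blocking_results = [seed_channel(uid) for uid in channel_uids]

# Promise style: submit everything first, then check the futures afterwards.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(seed_channel, uid) for uid in channel_uids]
    # result() re-raises any exception from the worker,
    # so a failed seed is not silently lost.
    async_results = [future.result() for future in futures]

failed = [future for future in futures if future.exception() is not None]
```

Checking the collected futures at the end is the step the last bullet refers to: without it, a failed write would go unnoticed because nothing ever waits on it.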
53. HA Cluster and Load Balancing
• Preparing for high availability with Neo4j and Gremlin
• Configuring Gremlin Server and Neo4j
• Understanding the Neo4j HA Architecture
• Advantages
• Data replication
• Spread writes across instances
• Handle greater read loads
• HA cluster is fronted by a load balancer like HAProxy
• Reference:
• https://neo4j.com/docs/operations-manual/current/ha-cluster/architecture/
• http://tinkerpop.apache.org/docs/3.3.3/reference/#_high_availability_configuration
54. HA Cluster and Load Balancing
• Tuning parameters for the cluster
• Frequency of pulling updates from other members of the cluster
• gremlin.neo4j.conf.ha.pull_interval
• Number of slaves a transaction should be committed to
• gremlin.neo4j.conf.ha.tx_push_factor
• Tuning parameters for the Load Balancer
• Routing requests across the cluster
• balance
• Checking if the members in the cluster are responsive
• option httpchk
// gremlin-server-neo4j-ha-{1..3}.yaml
channelizer: org.apache.tinkerpop.gremlin.server.channel.WsAndHttpChannelizer
> curl "http://localhost:8182?gremlin=100-1"