3. The original product
• Social data append
– PeopleGraph: match email addresses
to public demographics and social
profiles
– BrandGraph: match company URLs to
public firmographics and social
profiles
• Requirements
– Integrate a large (and expanding)
number of web data sources (REST,
SOAP, flat files)
– Realtime processing of large volumes
of contacts (60 queries/s)
4. The original technology stack
• Scala
– Best of both worlds
• Concise functional syntax
• Java libraries and deployment architecture
• Scala-specific libraries (Dispatch, Lift Web Framework)
• Twitter (soon to be Apache) Storm
– Streaming intake and normalization of large amounts of data
• MongoDB
– Expanding data sources = constantly updating schema
– Most sophisticated query syntax of NoSQL options
• AWS and Azure
– Well, duh
5. The new product
• Moving up the application stack
– Focus on the most compelling single-use case for our data
– Fliptop SpendScore
• Predictive analytics for sales and marketing teams
• “Machine learning for Salesforce”
6. The updated technology stack
• Still need to wrangle large amounts of data, so no changes
there
• New requirement: fast, scalable machine learning
7. Why not Scala (Java) native?
• The options
– Apache Mahout
• Only skeleton implementations for the most sophisticated machine
learning techniques (e.g. Random Forest, AdaBoost)
• Customer-specific models – don’t need Big Data
– Weka – GPL-licensed
– Scala-native libraries – Too early to use in production
8. Why Python?
• scikit-learn
– Mature – around since 2007
– Actively developed – last stable release Aug 2013
– Sophisticated – Random Forest and AdaBoost classifiers show
performance comparable to R
• Why not R? Not really production grade.
9. Requirements
• APIs to exploit Python’s modeling power
– Train, predict, model info query, etc.
• Scalability
– On-demand Python serving nodes
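A minimal sketch of what a train / predict / model-info API could look like on the Python side, assuming one model per customer. The names `ModelService`, `model_info`, and `make_estimator` are hypothetical, and `NearestCentroid` is a pure-Python stand-in written for this sketch so that it runs without scikit-learn; it only mimics the `fit`/`predict` shape of a scikit-learn estimator.

```python
from collections import defaultdict


class NearestCentroid:
    """Pure-Python stand-in for a scikit-learn estimator (same fit/predict shape)."""

    def fit(self, X, y):
        # Accumulate per-label feature sums, then average them into centroids.
        sums, counts = {}, defaultdict(int)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
            counts[label] += 1
        self.centroids_ = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        # Assign each row to the label whose centroid is closest (squared distance).
        def closest(row):
            return min(
                self.centroids_,
                key=lambda label: sum(
                    (v - c) ** 2 for v, c in zip(row, self.centroids_[label])
                ),
            )

        return [closest(row) for row in X]


class ModelService:
    """Hypothetical train / predict / info API, one model per customer."""

    def __init__(self, make_estimator=NearestCentroid):
        self._make_estimator = make_estimator
        self._models = {}

    def train(self, customer_id, X, y):
        self._models[customer_id] = self._make_estimator().fit(X, y)

    def predict(self, customer_id, X):
        return self._models[customer_id].predict(X)

    def model_info(self, customer_id):
        model = self._models[customer_id]
        return {"customer": customer_id, "estimator": type(model).__name__}
```

Because scikit-learn classifiers also return `self` from `fit` and expose `predict`, passing a classifier factory as `make_estimator` should slot in without changing the service methods. Each method here maps naturally onto one remote call, whichever integration tool carries it.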
10. Tools for Scala-Python Integration
• Reimplementation of Python
– Jython (JPython)
• Communication through JNI
– Jepp
• Communication through IPC
– Apache Thrift
• Communication through REST API calls
– Bottle
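To illustrate the REST option, here is a rough sketch of an "add" endpoint and a client call. It uses Python's standard-library `http.server` instead of Bottle so that it runs without extra packages; the `/add` path, the JSON field names, and the helper names `rest_add` and `demo_rest` are assumptions made for this sketch, not part of the original system.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class AddHandler(BaseHTTPRequestHandler):
    """Handles POST /add with a JSON body like {"a": 3, "b": 5}."""

    def do_POST(self):
        if self.path != "/add":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        args = json.loads(self.rfile.read(length))
        body = json.dumps({"sum": args["a"] + args["b"]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def rest_add(base_url, a, b):
    """Client side: the same request a Scala caller would send over HTTP."""
    req = urllib.request.Request(
        base_url + "/add",
        data=json.dumps({"a": a, "b": b}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # An empty ProxyHandler keeps the local call from going through a proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req) as resp:
        return json.loads(resp.read())["sum"]


def demo_rest():
    server = HTTPServer(("127.0.0.1", 0), AddHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return rest_add("http://127.0.0.1:%d" % server.server_port, 3, 5)
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo_rest()` starts the server on a free local port, posts `{"a": 3, "b": 5}`, and returns the sum from the JSON reply. The caller only needs an HTTP client and a JSON parser, which is what makes this route language-independent.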
11. Jython
• Reimplementation of Python in Java
• Can import and use any Java class.
• Includes almost all of the modules in the standard Python
distribution
– Except some of the modules implemented originally in C.
• Compiles to Java bytecode
– either on demand or statically.
13. Jython
• Lacks support for many scientific computing extensions
– NumPy, SciPy, etc.
• JyNI (Jython Native Interface) to the rescue?
– Specifically designed to support CPython extensions like
NumPy, SciPy
– Still in alpha
14. Communication through JNI
• Jepp (Java Embedded Python)
– Embeds CPython in Java
– Runs Python code in CPython
– Leverages both JNI and the Python/C API for integration
16. Jepp
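TestJepp.scala
    import jep.Jep

    object TestJepp extends App {
      val jep = new Jep()
      // Load the Python definitions into the embedded CPython interpreter
      jep.runScript("python_util.py")
      // Box the Scala Ints so they can be passed as Java Objects
      val a = (2).asInstanceOf[AnyRef]
      val b = (3).asInstanceOf[AnyRef]
      val sumByPython = jep.invoke("python_add", a, b)
      println(sumByPython.asInstanceOf[Int])
      jep.close()
    }
python_util.py
    def python_add(a, b):
        return a + b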
17. Communication through IPC
• Apache Thrift
– Developed & open-sourced by Facebook
– More community support than Protobuf, Avro
– IDL-based (Interface Definition Language)
– Generates server/client code in specified languages
– Takes care of protocol and transport layer details
– Comes with generators for Java, Python, C++, etc.
• No Scala generator
• Scrooge (Twitter) to the rescue!
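To make "protocol and transport layer details" concrete, the sketch below hand-rolls a tiny RPC over a local socket pair: message framing, argument encoding, and dispatch are all written by hand. The wire format and the names `call_add`, `serve_one_add`, and `demo_framing` are invented for this illustration and are not Thrift's binary protocol; the point is only to show the kind of plumbing that Thrift-generated code takes over.

```python
import socket
import struct
import threading

# Hypothetical wire format for this sketch only (NOT Thrift's binary protocol):
#   request  = 4-byte big-endian length + two signed 32-bit ints
#   response = 4-byte big-endian length + one signed 32-bit int


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def send_frame(sock, payload):
    """Transport detail: prefix every message with its length."""
    sock.sendall(struct.pack(">I", len(payload)) + payload)


def recv_frame(sock):
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def serve_one_add(sock):
    """Server side: decode one request, add, encode the reply."""
    a, b = struct.unpack(">ii", recv_frame(sock))
    send_frame(sock, struct.pack(">i", a + b))


def call_add(sock, a, b):
    """Client side: encode the arguments, wait for the reply, decode it."""
    send_frame(sock, struct.pack(">ii", a, b))
    (result,) = struct.unpack(">i", recv_frame(sock))
    return result


def demo_framing():
    client_sock, server_sock = socket.socketpair()
    worker = threading.Thread(target=serve_one_add, args=(server_sock,))
    worker.start()
    try:
        return call_add(client_sock, 3, 5)
    finally:
        worker.join()
        client_sock.close()
        server_sock.close()
```

Even for a single two-argument function this needs framing, byte-order decisions, and matching encode/decode code on both sides. With an IDL, that code is generated per language from one service definition.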
19. Thrift – Python Server
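PythonAddService.py
    # Interface class from the Thrift-generated module
    class Iface:
        def pythonAdd(self, a, b):
            pass
PythonAddServer.py
    from thrift.protocol import TBinaryProtocol
    from thrift.server import TServer
    from thrift.transport import TSocket, TTransport

    from python_service_test import PythonAddService

    # Implement the generated interface
    class ExampleHandler(PythonAddService.Iface):
        def pythonAdd(self, a, b):
            return a + b

    handler = ExampleHandler()
    processor = PythonAddService.Processor(handler)
    # Pass the port by keyword; in newer Thrift versions the first positional argument is the host
    transport = TSocket.TServerSocket(port=9090)
    tfactory = TTransport.TBufferedTransportFactory()
    pfactory = TBinaryProtocol.TBinaryProtocolFactory()
    server = TServer.TThreadedServer(processor, transport, tfactory, pfactory)
    server.serve()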
20. Thrift – Scala Client
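PythonAddClient.scala
    import org.apache.thrift.protocol.{TBinaryProtocol, TProtocol}
    import org.apache.thrift.transport.{TSocket, TTransport}

    // PythonAddService is the Thrift-generated client code
    object PythonAddClient extends App {
      val transport: TTransport = new TSocket("localhost", 9090)
      val protocol: TProtocol = new TBinaryProtocol(transport)
      val client = new PythonAddService.Client(protocol)
      transport.open()
      // Method name must match the service definition (pythonAdd)
      val sumByPython = client.pythonAdd(3, 5)
      println("3 + 5 = " + sumByPython)
      transport.close()
    }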
25. Summary
• Jython
• (✓) Tight integration with Scala/Java
• (✗) Lacks support for C extensions (JyNI might help in the future)
• Jepp
• (✓) Access high quality Python extensions with CPython speed
• (✗) Two runtime environments
• Thrift, REST
• (✓) Language-independent development
• (✗) Bigger communication overhead