3. What is Mahout?
• Java library of pre-built implementations of
various machine learning tasks
• Recommenders: collaborative filtering
• Clustering: grouping things by similarity
• Classification: assigning items to categories learned from a labeled corpus
• Intended to run against Hadoop‐based data sets
• http://mahout.apache.org/
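To make "collaborative filtering" concrete, here is a minimal sketch in plain python of a user-based recommender: find users whose tastes correlate with yours, then suggest items they liked that you have not rated. This is only an illustration of the idea, not Mahout's implementation; the toy ratings, the destination names, and the function names `pearson` and `recommend` are all made up for this example.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: preference}
ratings = {
    "alice": {"paris": 5.0, "rome": 3.0, "tokyo": 4.0},
    "bob": {"paris": 4.0, "rome": 2.0, "tokyo": 3.5, "lima": 5.0},
    "carol": {"paris": 1.0, "rome": 5.0, "lima": 2.0},
}


def pearson(a, b):
    """Pearson correlation over the items both users have rated."""
    common = [item for item in a if item in b]
    n = len(common)
    if n < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / n
    mean_b = sum(b[i] for i in common) / n
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = sqrt(sum((a[i] - mean_a) ** 2 for i in common)
               * sum((b[i] - mean_b) ** 2 for i in common))
    if den == 0:
        return 0.0
    return num / den


def recommend(user, ratings, how_many=1):
    """Rank unseen items by the similarity-weighted average of other users' ratings."""
    mine = ratings[user]
    scores = {}
    weights = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = pearson(mine, theirs)
        if sim <= 0:
            continue  # ignore users with unrelated or opposite taste
        for item, pref in theirs.items():
            if item not in mine:
                scores[item] = scores.get(item, 0.0) + sim * pref
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [(item, estimate) for estimate, item in ranked[:how_many]]
```

Calling `recommend("alice", ratings)` only considers bob's ratings, because carol's taste correlates negatively with alice's on the items they share.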
4. What is jython?
• Implementation of python that runs against
the jvm
• Has full access to any well‐behaved java library
• Started in 1997 by Jim Hugunin, who also later
did IronPython for the .Net CLR
• Version 2.5.2 mirrors python 2.5
• http://www.jython.org/
5. Why Do This?
• I needed to evaluate Mahout’s suitability as
the toolkit for our travel recommender system
• I am not primarily a java dev (yet?), and I don’t
know how to create a maven project
• But I do know python
• Fastest way between 2 points is a straight line
• Step 1: adapt sample code from “Mahout In
Action” to jython
6. How Do I Do This?
import glob
import os
import sys

# Add Mahout jars to jython’s path
sys.path.append(os.environ["MAHOUT_CORE"])
for jar in glob.glob(os.environ["MAHOUT_JAR_DIR"] + "/*.jar"):
    sys.path.append(jar)

# import classes from Mahout jar…
from java.io import File
from org.apache.mahout.cf.taste.impl.model.file import *
# Bunch of imports deleted

def main():
    # and we are using the imported FileDataModel
    model = FileDataModel(File(sys.argv[1]))
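The jar loop above can be wrapped in a small reusable helper. This is a sketch, not part of the original script: the name `add_jars_to_path` and its optional `path` argument are invented here, and the only behavior added is sorting the jars and skipping ones already on the path.

```python
import glob
import os
import sys


def add_jars_to_path(jar_dir, path=None):
    """Append every .jar in jar_dir to path (sys.path by default).

    Returns the list of jars that were newly added, in sorted order.
    """
    if path is None:
        path = sys.path
    added = []
    for jar in sorted(glob.glob(os.path.join(jar_dir, "*.jar"))):
        if jar not in path:  # avoid duplicate entries on repeated calls
            path.append(jar)
            added.append(jar)
    return added
```

In a jython script this would be called as `add_jars_to_path(os.environ["MAHOUT_JAR_DIR"])` before any `from org.apache.mahout...` import.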
7. What Did We Learn?
• About 3 hours to port first “Mahout In Action”
example to jython
• 3 minutes to port the second
• Includes learning how to import jars into python
• And building a nice loop to punt on jar
dependency management :-)
• Increases ability to experiment with ideas in
Mahout by reducing ceremony
8. Want Some Extra Stuff?
• Python IDEs that work with jython:
– PyCharm (JetBrains)
– PyDev (Eclipse add‐on)
– WingIDE (no debugger)
• Ported GroupLens 100k data set example from
section 2.5 of “Mahout In Action” is at
https://gist.github.com/1041033
Editor's notes
First we built a travel booking tool.
Then we integrated it with expense and built reporting.
Then we went back and built the trip data storage subsystem to handle increased volumes of data.
Now we are trying to put the combined travel and expense data into Hadoop to do analysis and leverage the knowledge of our customers for their benefit.
So Mahout looked like it might be a good way to bootstrap our efforts around building recommendations. If nothing else, it might be a fast path to v1 while we write more specialized algorithms tuned to our specific data sets as a v2.
It’s very cool: Jim H started both projects as tests. jython was to see if the jvm would be faster than python’s vm; IronPython was to “prove” the CLR was slow compared to e.g. the JVM (it wasn’t).
Yeah, jython’s definitely on the cutting edge with python 2.5 support.
Mahout appears to be a good system for doing recommendation engines. We need to find out how good, and what its strengths and limitations are.

I do know some java; enough to do some light recreational Android programming. But not only do I know python, the data scientist who will actually determine the optimal factors to build our recommendation engine on knows python. She also doesn’t know java (yet?). So I have a tool that the team is familiar with.

Just building Mahout so I could test it out was painful enough. It requires maven2 to build, but since this is an existing project it was all configured for me to just build after downloading. But I still find it painful to watch maven work.

I shuddered at the thought of having to actually do the maven setup for a new project that would have to be built.

Most importantly here, what you end up with when you make Mahout accessible via jython is a rapid prototyping/testing/experimentation tool for building out Mahout code. We’ve taken out the ceremony. That’s all.

When you’re done figuring out what you need to do, you could then move to compiled java for speed.

But, for many/most applications, you can probably stop there. The actual Mahout processing is the serious limiting factor here, not the jython code. My suspicion is that there’s far more performance to be gained optimizing the actual Mahout implementation than moving the jython code (which is native jvm by the time it runs) to java/scala/clojure.
The single largest chunk of my time was actually spent trying to decide what jars I had to append to my jython path, followed by really grokking the jython path/import stuff.

As you can see, after enough time I just punted on the jar dependencies. Every single jar is on the path, although I only import from the ones I need. Worth some research into jython to see if I’m adding any overhead other than search path, like opening/inspecting the jars. I suspect not.

Now, if you knew maven, it might take less time to start a new project and get it up than I would take, but once *I* was done, every subsequent jython script takes almost no time to set up, and the project is ready to run as soon as you’ve saved your source code.

We can work without having to either build a new app for every experiment, or build in some way to control which experiment runs in some ever-growing app.
I haven’t really tested either PyCharm or PyDev to do these things. Someone else can do *that* lightning talk at a later meetup.