Next DSS MIA Event - https://datascience.salon/miami/
For most data scientists, building models is hard work, but deploying them into production and impacting business processes can be even harder. In fact, research shows that only about 10% of data science models make it into production, and those that do can take six to nine months to deploy. This session will highlight the challenges that data scientists and organizations alike face when trying to deploy machine learning models, and how to overcome them. It will examine several use cases where models built in R and Python have delivered impactful results across several industries.
Data Science for the Masses:
TurboTax-like interview interface
We have just announced the acquisition of Yhat and will be bringing you a new solution that helps you operationalize models through a REST API.
Another area of focus is performance. The Alteryx Engine was built to handle gigabytes of data; the new engine is built for terabytes. When you have petabytes, we need to leverage the power of Spark and process data where it lives.
We are adding additional algorithmic support. We already support R, and are adding a brand new Python SDK that will ultimately enable a Python tool for Alteryx Designer.
So we have always believed that data prep and blending is a core competency of any analytic system, because it is the foundation for any type of analytics.
For the past few years, the industry was perhaps a bit over-focused on the new interactive visualization paradigm, which is awesome because it increases engagement and insight, but it didn't really move us beyond descriptive analytics.
But now, folks are setting their sights higher. Organizations are starting to understand that business value is going to come from the more sophisticated analytics that are just now making their way into the mainstream.
And so the question becomes: how, or more specifically, who? The jump to these higher-value analytics is throwing the analytic talent shortage into sharp focus.
We talk to a lot of organizations and one thing we find is that there is a lot of consistency in their challenges.
There is an inordinate amount of difficulty in extracting value from advanced analytics in particular, but, as you can probably attest, it applies across the board, from business intelligence all the way through to PhD-level methodologies and teams.
We boil all of the complexities and challenges that data science teams face down into two big categories.
One is human. Organizations tend to have a really difficult time understanding what's possible and what the appropriate questions are to be asking of the data in the first place.
And then the second is technical. This stuff is very complicated. The data prep and blending problem is impossibly nuanced and incredibly complex. There will never be a solution that can magically clean data. You guys all know this. The same sort of complexity and situational nuance applies to the machine learning problems that data scientists deal with every day.
So if we just focus on the latter, the technology problem, it really comes down to extracting value from analytics. Building machine learning models is one way to do so, but to truly extract that value, it comes down to model deployment.
The two methods that companies tend to think about most when they talk about investing in predictive analytics are both offline.
One is reports. Take a PhD in astrophysics who is now working in industry: this individual may deliver, on behalf of the executive team, charts and prose that describe a business phenomenon and why it happens. The end deliverable is academic in spirit; there is no application, just academic-style research applied to whatever the business phenomenon is.
The second is interactive dashboards, which require that the model be fit and that scores be generated for new records, usually on a nightly basis, though it can be at other intervals. Estimating which leads to send direct mail to, and which are poor-quality leads you should throw away, would be another batch, or offline, application.
Where Alteryx is now focusing is on near-real-time applications. Alteryx has been very capable for quite some time at batch predictive analytics, but there was no real way to take a model that you've built in Designer and incorporate the business logic it's designed to operate under into, say, an iPhone app. We feel we are filling that gap and helping data scientists overcome the challenges they face.
Why do we even care about this? Well, table stakes today is that pretty much every software application any of us uses on our phones or on the web is hyper-personalized. This is certainly the case in the consumer world, but it's growing more and more common even internally, with products that companies build for their frontline employees.
Not every team is going to have a very large or robust data engineering team. It's not uncommon for the LinkedIns and Netflixes of the world, the great machine learning companies, to have one engineer for every data scientist. That is not a scalable team structure for the vast majority of companies.
When we think about the model deployment problem in particular, we always think about when data, and data science projects in general, become valuable. Consider raw data versus clean data: the notoriously difficult work of taking messy, unclean data and converting it into clean data suitable for a Tableau dashboard, or for building models in the first place, doesn't create a great deal of value, certainly not in comparison to the great expense that goes into doing it.
On the flip side, the higher you get up the data science value chain, the more value is realized, the archetypal example being super-sticky, machine-learning-powered consumer apps like Netflix that rely entirely on machine learning and that all of our mothers can use without ever knowing anything about the support vector regression or clustering algorithms underneath. The rub in building a data science team is that even though the majority of the work may take place in the data prep and blending phase, the value may not be created until much farther downstream.
Along the way there's a disastrous list of inefficiencies that data teams go through. You guys already know this.
You have to jockey and broker for IT resources. In some cases you have no idea that a project you've been working hard on has already been completed, perhaps even several times, by other people in other offices.
A lot of the time, compliance and regulatory constraints cost you weeks of lag between when you've made a request to IT for access to a data asset and when you actually get it. This becomes a very expensive process for virtually every company.
Most organisations today have a real disconnect between their data science teams and their developers or engineers. Data scientists will typically be working with a variety of statistical libraries in either R or Python. They’ll be creating and evaluating their models in those languages, and once finished they’ll pretty much hand over that code ‘as-is’ to a bunch of developers whose job is to translate those scripts into working code across the enterprise, and then get that code in a shape where it can be made available to your customers or users.
And here’s the problem. The enterprise doesn’t always speak R or Python. It’s talking Java, or .NET, or PHP, or Javascript. In short, we have a major language barrier problem. Rewriting an optimised statistical model from R for the Java Virtual Machine? That’s a lot of costly rework. What if Java doesn’t have a neat way to handle all of the clever statistical tricks your data scientists have incorporated into the model? Then your developers have really got their work cut out for them.
We could try a halfway house and get the data scientists to convert the model into PMML (Predictive Model Markup Language) and throw THAT over the wall. Fine for some situations, but PMML doesn’t cover all the rich capabilities of the R and Python libraries your data scientists are using. It’s a partial solution, at best.
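To make that rework concrete, here is a small sketch, in Python with synthetic data, of what a hand port to Java or PHP really amounts to: re-deriving the scoring math from a fitted model's coefficients. Everything here (the data, the helper function) is illustrative, not any customer's actual model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A model the data science team would hand over (synthetic data).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# What a manual port amounts to: re-implementing the scoring math
# from the fitted coefficients, by hand, in the target language.
def handported_score(row, coef, intercept):
    z = float(np.dot(row, coef) + intercept)
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid, re-implemented

row = X[0]
ported = handported_score(row, model.coef_[0], model.intercept_[0])
native = model.predict_proba([row])[0, 1]

# The two agree only as long as someone keeps the port in sync with
# every retrain, new predictor, or preprocessing change.
print(abs(ported - native))
```

Even in this trivial case the port has to be redone after every model change, which is exactly the lag and typo risk described above.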
In fact, according to TDWI, model deployments can take around six to nine months.
What’s even scarier is that one customer we worked with said the typical cost to deploy a single model was more than $250K.
In fact, because of situations like these, Rexer Analytics ran a survey in which only 13% of data scientists said their models actually get deployed to production. That language barrier, that re-engineering effort, prevents the quality work your data science teams are producing from making an impact on your business.
So we thought about a better way. Instead of taking code that was implemented in R by a data science team, or implemented as a workflow in Designer, and painstakingly rewriting it line by line in Java or .NET or whatever the production, user-facing app is implemented in, there should be a cheaper way to take the models you've built and integrate them into an iPhone app, a Salesforce plug-in, or a website. That's precisely what we have built.
From a user's perspective, the way it works is that you'll have a new deploy tool, which can take as input any predictive tool (currently the vast majority are R-based, but we are working aggressively on Python as well) and deploy any arbitrary business logic you've built in Designer across a number of servers in a cluster. These are then exposed as standards-compliant, highly scalable web services: APIs that any developer in the world will immediately understand how to use.
So instead of saying to your development team, "Look, I built business logic that can estimate the likelihood that somebody is going to abandon their shopping cart, and I'd really like to display a pop-up that says checkout now if the probability of abandonment exceeds 90%," and instead of having that engineer receive your spec, open up the guts of the website, and write code corresponding to the workflow you've already built in Designer, you just give them one line of code. With that one line they can hit your API and invoke the business logic you've constructed in an Alteryx workflow. This also gives you back the ability to make changes dynamically.
For example, let's say you come up with a new feature, a new user behaviour that is highly indicative of whether someone is likely to abandon their shopping cart. You wouldn't even have to talk to your developer. You just make the change to the model, click rerun in Designer, and the model will update itself.
So, let’s take a look at how Alteryx Promote fits into this vision. Your code-free or code-friendly models are developed in the language of choice for your analysts and data science teams, deployed through a simple process into the Promote cluster, and instantly available through a clean API endpoint.
This API now becomes the easiest possible way to get your model into the hands of your end-users, your marketing teams, your customers, and your applications. No rewriting, no performance bottlenecks: simply your analytic model available, in real time, wherever you need it.
Using the Alteryx platform, we’ve got years of experience helping customers deploy analytical and modelling workflows to production environments without the use of code. When analysts use our predictive suite, it’s easy to drop a Score tool into a workflow to receive the predicted insight from a deployed model, again with no code required. We’re also introducing enhancements to the platform that are going to make it much, much easier for data scientists working in R or Python to join this party. With a bare minimum of code, your data teams will be able to deploy their native-language models straight into the Alteryx platform, ready for use!
Where we had blockers around managing updates to production models in legacy software, we now find ourselves working with a platform that makes it positively EASY to deploy models many times every hour. Build the deployment process into your existing Alteryx scheduling process or DevOps scripts to start providing continuous integration for your data science teams. When it’s this easy to deploy, it completely transforms your ability to test new models before they enter production. This low cost of experimentation allows you to innovate with data science in ways that just weren’t possible before.
And how about managing your models from a simple web portal? Our platform can be configured so that model predictions are available for testing and review at any time. We can set up models on a champion/challenger basis so that new models can be gracefully promoted while older ones are retired. It’s all about making it easier for you to get more value out of the analytic effort you’re putting in. We’ve regularly seen customers rapidly shift from 1 or 2 production models to literally hundreds: all under version control, all under a review-based deployment workflow for governance, and integrated and embedded throughout the customer’s business processes.
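The champion/challenger idea mentioned above is simple to sketch. The following is an illustrative Python routing function, not the portal's actual mechanism: a small, deterministic slice of traffic goes to the challenger model, and the same request always lands on the same model so results stay auditable.

```python
import random

# Stand-in models: plain functions returning a dummy score.
def champion(features):
    return 0.30  # current production model (placeholder)

def challenger(features):
    return 0.35  # candidate model (placeholder)

def route(request_id, features, challenger_share=0.10, seed=42):
    # Hash-based split: deterministic per request_id, so a given
    # request is always scored by the same model.
    rng = random.Random(hash((request_id, seed)))
    model = challenger if rng.random() < challenger_share else champion
    return {"model": model.__name__, "score": model(features)}

results = [route(i, [1.0, 2.0]) for i in range(1000)]
share = sum(r["model"] == "challenger" for r in results) / len(results)
print(f"challenger traffic share: {share:.1%}")
```

Promoting the challenger is then just swapping which function sits behind the champion slot, with the traffic split providing live evidence before the switch.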
That's the concept, and the benefits are pretty staggering. It typically cuts down the development time for implementing a real-time predictive model dramatically: 10x or more in terms of cost and time savings.
This is a company in Boulder, Colorado. They help energy providers, such as utility companies and power plants, target homes for clean-tech products. For example, they look at the energy usage in a certain geography and then call on the right leads at the right time with the right product at the right price. They also provide heads-up analytics in the form of web and mobile apps, both for the utility companies and for the consumers those utilities serve. The numbers and the time savings are there.
Data scientists at energy software company Tendril opted for a hybrid approach that combines both collaborative and content-based filtering. Tendril provides analytics and consumer solutions to energy suppliers, including predictions of which energy products consumers would most likely consider.
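A hybrid of collaborative and content-based filtering can be sketched as a weighted blend of two scores. This is a minimal illustration on made-up data; Tendril's actual features and weighting are not public.

```python
import numpy as np

rng = np.random.default_rng(3)

# Collaborative signal: users x products interaction matrix (made up).
interactions = rng.integers(0, 2, size=(50, 6)).astype(float)

# Content signal: product attribute vectors (made up).
product_features = rng.random((6, 4))

def collaborative_scores(user_idx):
    # Score products by what similar users interacted with.
    sims = interactions @ interactions[user_idx]  # user-user similarity
    sims[user_idx] = 0.0
    return sims @ interactions / (sims.sum() + 1e-9)

def content_scores(user_idx):
    # Score products by similarity to those the user already has.
    owned = interactions[user_idx] > 0
    if not owned.any():
        return np.zeros(len(product_features))
    profile = product_features[owned].mean(axis=0)
    norms = np.linalg.norm(product_features, axis=1) * np.linalg.norm(profile)
    return product_features @ profile / (norms + 1e-9)

def hybrid_scores(user_idx, alpha=0.6):
    # Weighted blend of the collaborative and content signals.
    return (alpha * collaborative_scores(user_idx)
            + (1 - alpha) * content_scores(user_idx))

ranking = np.argsort(hybrid_scores(0))[::-1]
print("recommended product order:", ranking)
```

The blend weight `alpha` is the usual tuning knob: collaborative scores dominate for well-observed users, while the content term keeps recommendations sensible for users with little history.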
“We use Support Vector Regression models to predict household energy consumption to provide our clients with in-depth, personalized information about their customers,” explains Mark Gately.
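For flavour, here is what a Support Vector Regression consumption model can look like in scikit-learn. The household features and coefficients below are entirely synthetic stand-ins; Tendril's real data and configuration are not public.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic household data (illustrative only).
rng = np.random.default_rng(0)
n = 300
sqft = rng.uniform(800, 3500, n)
occupants = rng.integers(1, 6, n)
avg_temp = rng.uniform(-5, 30, n)
# Daily kWh rises with home size, occupants, and heating/cooling need.
kwh = (0.4 * sqft / 100 + 2.5 * occupants
       + 0.3 * np.abs(avg_temp - 18) + rng.normal(0, 1.5, n))

X = np.column_stack([sqft, occupants, avg_temp])
# SVR is scale-sensitive, so standardize features first.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X, kwh)

pred = model.predict([[2000, 3, 5]])[0]
print(f"predicted daily consumption: {pred:.1f} kWh")
```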
Another customer is a European lender called Ferratum Bank. They operate in around 30 countries, and the decision-making tasks that go into making a direct-to-consumer loan through a UI like this are numerous. I think they have, on average, a dozen predictive models per country that have to get invoked, and all of those models are designed in R or within Alteryx workflows using the R tools. Each day, millions and millions of euros' worth of loans are being assessed and delivered, all in real time.
The first general-purpose credit scoring algorithm, now known as the FICO score, was introduced in 1989. The FICO score is still one of the most widely used models in the United States today, though peer-to-peer and direct lending organizations have focused on developing new techniques over the past few years. These new machine learning models and algorithms capture innovative factors and relationships that traditional loan scorecards couldn’t, like how applicants manage monthly cash flow or whether friends or community members would endorse the applicant.
One such company is Ferratum Bank, a pioneer in financial technology and mobile consumer lending since 2005. “We developed complex statistical and machine learning models to enable smarter lending decisions,” explains Scott Donnelly, Director of Business Lending at Ferratum Bank. “By getting creative with our approach and adopting innovative technologies, we’ve been able to reinvent how both consumers and businesses obtain loans. This has allowed us to reach prospective customers that in the past may have been overlooked by traditional banking institutions.”
Turo offers car owners and travelers a service to share vehicles by renting them.
Use Case
They built a model that could suggest a reasonable price based on all the data at their disposal. “The algorithm evaluates a few straightforward variables like car make, model, and year, as well as some less obvious factors like demand and competition, both intra- and extra-Turo.”
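A dynamic pricing model of this shape can be sketched with a standard regressor. The data and feature set below are invented for illustration, using only the kinds of variables named in the quote; Turo's actual model is not public.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic listings (illustrative only).
rng = np.random.default_rng(7)
n = 400
year = rng.integers(2005, 2018, n)
demand = rng.uniform(0, 1, n)          # local rental-demand index
competition = rng.integers(0, 50, n)   # similar cars listed nearby
base = 20 + 2.0 * (year - 2005)        # newer cars command more
price = (base * (1 + 0.5 * demand) - 0.3 * competition
         + rng.normal(0, 3, n))

X = np.column_stack([year, demand, competition])
model = GradientBoostingRegressor(random_state=0).fit(X, price)

# Suggest a daily price for a 2015 car in a high-demand area.
suggested = model.predict([[2015, 0.8, 10]])[0]
print(f"suggested daily price: ${suggested:.0f}")
```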
“Our data science team works predominantly in Python, but our engineering team develops in Java and Javascript, so there was no clear path to production. I actually came across Yhat’s platform at a data science meetup in San Francisco. I thought the idea was really interesting. What stood out to me was the vision of separating the development tracks, uncoupling data science modeling from the user-facing product.”
Turo uses ScienceOps to deploy the data science team’s dynamic pricing model and recommender engine into their web and mobile app.
Results
Yhat removed Turo’s data science team’s dependence on other engineering teams, so that they can realize the value of their work almost immediately.
Founded in 2008, VIA SMS Group developed an entirely new approach to issuing quick loans using predictive analytics and machine learning. VIA SMS Group uses advanced algorithms to assess whether an applicant is fraudulent and, if not, what his or her credit profile looks like in order to determine whether or not to underwrite a loan.
VIA SMS Group writes the decision algorithms in the R programming language. Before using ScienceOps, VIA SMS had to rewrite these algorithms from R to the server-side language of PHP in order to deploy models into their web and mobile apps.
“We were basically limited to linear or logistic regression models and decision trees, since anything we wrote had to be implemented in PHP. It also took a lot of time to make even minor modifications, like adding new predictors, because we were handing off models to the dev team. Updates could easily take a few days per model, since they were dependent on coordination between teams and manual input. Rewrites were prone to typos and syntactical errors, since the languages handle data types differently.”