Data Science and Machine Learning for eCommerce and Retail

Software and Hardware


  1. Data Science and Machine Learning for eCommerce and Retail • Dr. Andrei Lopatenko, Director of Engineering, Recruit Institute of Technology, Recruit Holdings • Formerly Walmart Labs, Google (twice), Apple (twice)
  2. ML for eCommerce • Search and browse for commerce sites and applications • Help users find and discover items they will purchase • Maximize revenue/profit per user session
  3. Search
  4. Search - ranking
  5. Search - LHN (Left Hand Navigation)
  6. Search - spell correction
  7. Search - type-ahead
  8. Browse
  9. Search data size • Catalog items: 8M items now, compared with ~400M at Amazon / eBay • 10x growth expected in the near future • ~2 KB of text description per item, plus images • Several hundred structured attributes per catalog
  10. Search – user searches • Tens of millions per day • Tens of billions of sessions per year • Online sales: 13.2B per year (http:// ecommerce/) • Offline store sales: 500B per year (8% of the USA economy) in ~11K stores • Number of transactions: ~10B (public data)
  11. ML addressable problems • Learning to rank • Given a query, which list of items has the highest probability of conversion (purchase), ATC (add to cart), or page view?
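
A minimal sketch of the ranking idea on this slide: score each candidate item with a predicted conversion probability and sort. The logistic model, the feature names, the weights, and the item ids are all illustrative assumptions, not taken from the talk; real weights would be learned from click and purchase logs.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical feature weights and bias (assumed for illustration only).
WEIGHTS: Dict[str, float] = {
    "text_match": 2.0,
    "past_conversion_rate": 3.0,
    "in_stock": 1.0,
}
BIAS = -3.0


def conversion_probability(features: Dict[str, float]) -> float:
    """Logistic model: sigmoid of a weighted sum of the item's features."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank(candidates: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (item_id, probability) pairs, highest probability first."""
    scored = [(item_id, conversion_probability(f)) for item_id, f in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up candidates retrieved for one query.
candidates = {
    "tv-a": {"text_match": 0.9, "past_conversion_rate": 0.10, "in_stock": 1.0},
    "tv-b": {"text_match": 0.9, "past_conversion_rate": 0.30, "in_stock": 1.0},
    "tv-c": {"text_match": 0.4, "past_conversion_rate": 0.30, "in_stock": 0.0},
}
ranking = rank(candidates)
print([item_id for item_id, _ in ranking])  # ['tv-b', 'tv-a', 'tv-c']
```

The same structure applies if the target is add-to-cart or page view instead of purchase; only the training label changes.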
  12. ML addressable problems • Typeahead • Given a sequence of characters typed by the user, what are the most probable completions, and which items does the user most probably want to buy?
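
One simple baseline for the completion half of this problem, assuming a log of past queries is available: return the most frequent logged queries that start with the typed prefix. The function names and the tiny query log are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def build_counts(query_log: Iterable[str]) -> Counter:
    """Count how often each normalized query appears in the log."""
    return Counter(q.strip().lower() for q in query_log)


def complete(prefix: str, counts: Counter, k: int = 3) -> List[str]:
    """Top-k logged queries starting with the prefix, most frequent first."""
    prefix = prefix.lower()
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    # Sort by descending frequency, then alphabetically for stable ties.
    matches.sort(key=lambda qn: (-qn[1], qn[0]))
    return [q for q, _ in matches[:k]]


# Made-up query log.
log = [
    "samsung tv", "samsung tv", "samsung tv",
    "samsung phone", "samsung phone",
    "sandals", "sams club card",
]
counts = build_counts(log)
print(complete("sam", counts))  # ['samsung tv', 'samsung phone', 'sams club card']
```

A production system would likely use a trie or prefix index rather than a linear scan, and would rank by predicted purchase probability rather than raw frequency.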
  13. ML addressable problems • Spell correction • Given a user query, which query did the user actually intend to type?
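
A hedged sketch of one classic baseline (not necessarily what the speaker used): generate every string within one edit of each unknown token and keep the candidate that is most frequent in a term-frequency table. The table `TERM_FREQ` and its counts are made up for illustration.

```python
import string
from typing import Dict, Set


def edits1(word: str) -> Set[str]:
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct_token(token: str, freq: Dict[str, int]) -> str:
    """Keep known tokens; otherwise pick the most frequent known neighbour."""
    if token in freq:
        return token
    candidates = [w for w in edits1(token) if w in freq]
    if not candidates:
        return token  # nothing better known, leave the token unchanged
    return max(candidates, key=lambda w: (freq[w], w))


def correct_query(query: str, freq: Dict[str, int]) -> str:
    return " ".join(correct_token(t, freq) for t in query.lower().split())


# Hypothetical term frequencies, e.g. from catalog titles and past queries.
TERM_FREQ = {"samsung": 900, "television": 400, "queen": 300, "bed": 500, "bad": 20}
print(correct_query("samsng televsion", TERM_FREQ))  # samsung television
```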
  14. ML addressable problems • Cold start • Given a new item with its set of attributes and no history of sales or exposure on the site, predict the item's sales and its sales per query
  15. ML addressable problems • Prediction of LHN • Given a user query, what is the best set of facets and facet values, i.e. the set with the highest probability that users interact with it and ultimately buy an item?
  16. ML addressable problems • Query understanding • Given a query, build a semantic parse of the query and tag its tokens with attributes: blue tshirts for teenagers -> blue:color tshirts:type for:opt teenagers:agerestriction10-20 • Classification: blue tshirts for teenagers -> type:apparel, price preference: 10-30, releaseyearpreference: 2014-2016
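
To make the token-tagging output concrete, here is a toy dictionary lookup that reproduces the slide's example. The lexicon is hypothetical and hand-written; a real tagger would presumably be a learned sequence model, which this sketch does not attempt.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon; the labels are copied from the slide's example.
LEXICON: Dict[str, str] = {
    "blue": "color",
    "tshirts": "type",
    "for": "opt",
    "teenagers": "agerestriction10-20",
}


def tag_query(query: str, lexicon: Dict[str, str] = LEXICON) -> List[Tuple[str, str]]:
    """Tag each whitespace-separated token; unseen tokens get 'unknown'."""
    return [(tok, lexicon.get(tok, "unknown")) for tok in query.lower().split()]


def format_tags(tags: List[Tuple[str, str]]) -> str:
    """Render tags in the token:label notation used on the slide."""
    return " ".join(f"{tok}:{label}" for tok, label in tags)


print(format_tags(tag_query("blue tshirts for teenagers")))
# blue:color tshirts:type for:opt teenagers:agerestriction10-20
```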
  17. ML addressable problems • Related searches • Given a query, which queries are either semantically close to it or represent coinciding user interests? • Nike shoes -> adidas shoes, sport shoes • Coffee mugs -> travel mugs, photo coffee mugs, cappuccino cups
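
One way to approximate "coinciding user interests", assuming search sessions are logged: count how often two queries appear in the same session and return the most frequent partners. The session data and the function name are made up; semantic closeness would need a separate signal such as embeddings.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, List


def related_queries(sessions: List[List[str]], query: str, k: int = 3) -> List[str]:
    """Top-k queries that co-occur with `query` in the same session."""
    co: Dict[str, Counter] = defaultdict(Counter)
    for session in sessions:
        # Count each unordered pair of distinct queries once per session.
        for a, b in combinations(sorted(set(session)), 2):
            co[a][b] += 1
            co[b][a] += 1
    ranked = sorted(co[query].items(), key=lambda qn: (-qn[1], qn[0]))
    return [q for q, _ in ranked[:k]]


# Made-up sessions.
sessions = [
    ["nike shoes", "adidas shoes"],
    ["nike shoes", "adidas shoes", "sport shoes"],
    ["nike shoes", "sport shoes"],
    ["nike shoes", "adidas shoes"],
    ["coffee mugs", "travel mugs"],
]
print(related_queries(sessions, "nike shoes"))  # ['adidas shoes', 'sport shoes']
```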
  18. ML addressable problems • Product discovery • Help users explore the product assortment • Drive users to diverse products • Reduce the risk of selecting irrelevant items • Help find alternatives by price, quality, brand, etc. • Reduce pigeonhole risk • Provide relevant data to make a decision
  19. ML addressable problems • Image similarity • Given images of an item, return other items whose images are visually appealing to users who like the original item (appealing by shape? color? texture?) -> leading to high conversion in recommendations
  20. ML addressable problems • Voice search • Given voice input, reply with a list of the best items • “What are the cheapest Samsung TVs in the store?” • “What is the best deal on a queen bed today?”
  21. ML addressable problems • Extraction of item attributes • Given an item, what are its attributes: brand, color, size (wheel, screen, height, S/M/XL, Queen/Twin/King/Full), gender, pattern, shape, features?
  22. ML addressable problems • Representations of users: actions on websites/apps -> searches, clicks, browsing behaviour; product -> purchase preferences, reviews, ratings, return rates
  23. ML addressable problems • Title generation: how to generate the title that yields the maximum conversion rate • Which product attributes should be selected for the title?
  24. What makes a good title?
  25. What makes a good title?
  26. Limits • Most models must be served in production • 50 ms per prediction • Part of a big system; memory limit ~10 GB
  27. Retail
  28. Retail • Key directions that require machine learning: • Discounting tools • Coupons and rewards • Loyalty • Inventory management
  29. Inventory management • Customers want to buy products • Customers have diverse needs • Products should be in stock, ideally in warehouses close to customers • But storing products is expensive • Problem: how many products of each type should be stored, and when should the product supply be refilled?
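
As one textbook answer to the "when to refill" question (not a method stated in the talk), the reorder point can be set to the expected demand over the supplier lead time plus a safety stock. The sketch assumes independent, normally distributed daily demand and a fixed lead time; all parameter names and numbers are illustrative.

```python
import math
from statistics import NormalDist


def reorder_point(
    mean_daily_demand: float,
    std_daily_demand: float,
    lead_time_days: float,
    service_level: float = 0.95,
) -> float:
    """Stock level at which to reorder, for a target in-stock probability."""
    # z-score for the desired probability of not running out during lead time.
    z = NormalDist().inv_cdf(service_level)
    safety_stock = z * std_daily_demand * math.sqrt(lead_time_days)
    return mean_daily_demand * lead_time_days + safety_stock


# Made-up item: 40 units/day on average, std 10, 4-day replenishment.
rp = reorder_point(mean_daily_demand=40, std_daily_demand=10, lead_time_days=4)
print(round(rp, 1))  # roughly 192.9 units
```

Raising the service level increases safety stock and therefore storage cost, which is the trade-off the slide describes.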
  30. Customer intelligence • Retail • Analyze sales data, find anomalies, explain them • Example: low sales of umbrellas during the last month in Northern California stores • No rain? (integration with external data about weather conditions) • Seasonal / the same as last year / time series • Competitors
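
A minimal sketch of the "find anomalies" step for the umbrella example, assuming the comparison is against the same calendar month in earlier years so that seasonality is controlled for. The sales figures and the threshold of 3 standard deviations are made up.

```python
from statistics import mean, stdev
from typing import List


def seasonal_z_score(history: List[float], current: float) -> float:
    """Standard deviations between `current` and the historical mean."""
    sigma = stdev(history)
    if sigma == 0:
        return 0.0
    return (current - mean(history)) / sigma


def is_anomaly(history: List[float], current: float, threshold: float = 3.0) -> bool:
    return abs(seasonal_z_score(history, current)) >= threshold


# Hypothetical umbrella unit sales for the same month in four earlier years.
same_month_history = [1200, 1150, 1300, 1250]
print(is_anomaly(same_month_history, 400))   # True: far below the usual level
print(is_anomaly(same_month_history, 1240))  # False: within normal variation
```

Explaining a flagged anomaly would then mean joining in external data such as rainfall or competitor prices, as the slide suggests.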
  31. Fraud detection • Identify fraudulent transactions online • Hundreds of fraud schemes detected daily • Global retail shrinkage was $119 billion in 2011, an average of 1.45% of retail sales • Ranges from stolen credit cards to replaced price tags and price discounts granted by high-level managers to achieve personal goals
  32. Propensity modeling for marketing campaigns • Build effective email/Facebook/Google Ads campaigns addressing the proper customer at the proper time at the proper cost • Behavior-based customer segmentation and clustering with demographic, lifestyle, and attitudinal information
  33. Online grocery • Which items can be replaced by other items, and by which ones? • Data: individual purchases in chain grocery stores, drug stores, and online grocery shopping • The problem: find which items can substitute for an item that is not in the store, in order to fulfill the order
  34. Dynamic pricing • Define the best price • Continuously scrape competitors' prices, predict demand by price, know the expenses • Online commerce sites change prices every 10 minutes
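
The three inputs on this slide (competitor prices, demand as a function of price, and costs) can be combined in a simple profit search. The linear demand curve, its coefficients, and the rule of never exceeding the competitor's price are all assumptions made for this sketch, not the speaker's method.

```python
from typing import Iterable, Optional


def expected_demand(price: float, intercept: float = 1000.0, slope: float = 40.0) -> float:
    """Hypothetical linear demand model: units sold fall as price rises."""
    return max(0.0, intercept - slope * price)


def best_price(
    candidate_prices: Iterable[float],
    unit_cost: float,
    competitor_price: Optional[float] = None,
) -> float:
    """Candidate price with the highest expected profit."""
    # Assumed business rule: do not price above the scraped competitor price.
    prices = [p for p in candidate_prices
              if competitor_price is None or p <= competitor_price]
    if not prices:
        raise ValueError("no candidate price satisfies the competitor cap")
    return max(prices, key=lambda p: (p - unit_cost) * expected_demand(p))


# Candidate grid from 10.00 to 20.00 in steps of 0.50.
grid = [10.0 + 0.5 * i for i in range(21)]
print(best_price(grid, unit_cost=8.0))                         # 16.5
print(best_price(grid, unit_cost=8.0, competitor_price=15.0))  # 15.0
```

Re-running this search whenever a fresh competitor crawl arrives is what would make the pricing "dynamic" at the 10-minute cadence mentioned above.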
  35. Challenges • Data volumes: transactions: Walmart: 10 million per day • Computations: complicated modeling techniques
  36. Hardware platform • Needs: • Data storage • Data processing • Serving online
  37. Data storage • Volumes of data: • 10M transactions per day over 5 years = ~18 billion transactions -> ~1 TB • Catalog: 500M items * 2 KB each -> 1 TB
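
The arithmetic behind these two estimates can be checked directly. The sketch assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes) and 365-day years; the per-transaction size is not given on the slide, so it is derived here from the 1 TB figure rather than assumed.

```python
TB = 10**12  # decimal terabyte, an assumption about the slide's units

# Transactions: 10 million per day for 5 years.
transactions = 10_000_000 * 365 * 5
print(transactions)  # 18250000000, i.e. about 18 billion

# Record size implied by fitting those transactions into about 1 TB.
bytes_per_transaction = TB / transactions
print(round(bytes_per_transaction, 1))  # about 54.8 bytes

# Catalog: 500 million items at 2 KB of text each.
catalog_bytes = 500_000_000 * 2_000
print(catalog_bytes / TB)  # 1.0
```

Roughly 55 bytes per transaction is only enough for a compact record (ids, amount, timestamp), so the 1 TB figure presumably excludes line items and behavioural data.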
  38. Data storage • But with video: petabytes of data; RetailNext 75 PB per year from 30,000+ sensors • Walmart 500 PB • eBay 40 PB in 2013 (transactions + online behaviours)
  39. Data processing • Rebuild the model over fresh data: • Typically daily: add the day's data (millions of transactions, hundreds of millions of behavior units) to the yearly data store (billions of transactions, hundreds of billions to trillions of behavior units) • Build a model to serve in production the next day
  40. Data processing • Some models, such as fraud detection and dynamic pricing, should be almost online (10-15 minutes) • Built over data such as daily transactions or web crawls of competitors' sites
  41. Serving online • Online commerce WML: thousands to tens of thousands of queries per second at peak times • Complicated ranking and recommendation algorithms • 50 ms limit
  42. Serving online • Price, in-store availability: millions of requests per second at peak times • Item information: millions of requests per second • Serving online: Solr/Lucene/Elasticsearch, Cassandra, MongoDB, Oracle, CouchDB, Node.js/Java solutions, etc.
  43. Data processing • Hadoop / Spark clusters • A lot of I/O • HDFS handles redundancy, so RAID is not necessary; RAID is slow to write and Hadoop writes a lot • SAN and NAS are not good either • So: bare metal with DAS (Direct-Attached Storage)
  44. Data processing • More servers, cheaper servers • More small disks are better than fewer large disks • Allocate the cluster 100% to Hadoop
  45. Data processing • Hadoop masters vs workers • Large clusters: masters with > 64 GB RAM, dual Ethernet NICs, dual quad-core CPUs • Workers: 64 GB+ memory, SAS 6 Gb/s disk controllers, 2 Ethernet cards, 2 x 6-core processors, 15 MB cache; Intel Hyper-Threading and QPI are good to have
  46. Data processing • Big models, deep learning • Nvidia DGX-1 and the like • Pascal GPUs, NVLink interconnect • Tesla K40 and K80 work pretty well too • May require a lot of tuning http:// hardware-guide/ • Hard to buy: big data solutions are considered profit generators, HPC servers are not
  47. Serving online • Typically large memory, but not necessarily (for example, Elasticsearch/Solr degrades over 64 GB) • CPUs: more cores rather than faster cores • Disks: SSD, RAID 0, no NAS; in many conditions it is common to optimize for how easy it is to replace drives rather than for SSD endurance
  48. eCommerce example • Database servers • Unified hardware platform from HP • HP DL line: • 4 CPU sockets • 256 GB RAM • Network interfaces • Not much HDD; data is in NAS
  49. eCommerce example • Cloud servers: • Purchased by racks: 40 in a rack • 2 CPU sockets • 198 GB • 18-core CPUs • SSD
  50. Network requirements • 1 network card per server: a big mistake; 1 switch per rack • 3 cards per server for the three typical data flows: • Production • “Administrative” (Docker etc.) • Analytics
  51. Example • Application servers vs big data servers • Application servers (Java, Node.js apps): • 1 TB SSD, RAID 5 • Big data servers: • 5 TB SAS
  52. Questions? • Dr. Andrei Lopatenko, Director of Engineering, Recruit Institute of Technology, Recruit Holdings