Big Data for Library Services (2017)

Mar 6, 2017


  1. Big Data & DS Analytics for PAARL Albert Anthony D. Gavino, MBA Data Scientist / DS Evangelist
  2. About the speaker: Albert Anthony D. Gavino
  3. Project profile
  4. Program Objectives / Program Goals: Participants will be able to relate Big Data and Data Science applications to library services.
  5. 1. What is Big Data? Extremely large data sets that may be analyzed to reveal patterns, trends and associations
  6. The BIG 3 V’s •  Variety: different types of data (Facebook, Twitter, CCTV feeds) •  Velocity: the speed at which data arrives (in batches, or streaming every second) •  Volume: the sheer size of the data (1 GB, 1 TB, 1 PB, 1 ZB)
  7. Library Data Resources What resources does the library have (budget, staff, premises, media, opening hours etc.) and how is the library performing against traditional parameters, like lending figures, visitors and social media activity? This library data can also be combined with environmental information like community education levels, geographical distances, age and so on.
  8. DATA Analytics Challenges and Pitfalls: The challenges to creating a robust institutional data analytics program include culture, talent, cost, and data. We have deliberately mentioned culture first because it is very easy to jump to data challenges. In fact, most of the literature surrounding data analytics starts with challenges surrounding the data itself. However, we are convinced that institutional culture is the most important factor in determining the success of any given data analytics program, including the politics and process around questions of talent, cost, and data itself. Reference: The Journal of Academic Librarianship, Libraries and Institutional Data Libraries: Challenges and Opportunities. 63% of researchers and administrators expressed unhappiness with the use of metrics in higher education (Abbott et al., 2010).
  9. What about New Tasks like streamlining for the Librarian? If librarians take on new tasks, it is very important to track the amount of time and level of staff required when undertaking analytics projects. For example, collecting citation data for a researcher with a common name often requires manual and painstaking record-by-record searching in order to disambiguate that individual's research from others that share his/her name. This type of work requires a librarian with a deep and intimate knowledge of the bibliometric databases that are being used to harvest the bibliometric data. Reference: The Journal of Academic Librarianship, Libraries and Institutional Data Libraries: Challenges and Opportunities
  10. What is the Cost? •  Data analytics should be thought of as a strategic investment, not a cost-saving technique. •  The real cost is the time spent on cultural change and on developing and educating a staff with the analytical skills that we need in our discipline. •  A visionary analytics plan invests in people, in hiring and training, over data tools and platforms.
  11. Pitfalls of Data Sharing: Challenges on Institutional Data Analytics (Pitfall / Possible solution) •  Ownership: who owns the data? It could be the registrar, the library or IT services. Possible solution: an assigned office (e.g. the Office of the President or the Compliance Office) releases the official reports. •  Quality: deciding when data is accurate, good and reliable. Possible solution: a Data Governance Unit assures the quality of the data. •  Standards: what kinds of data variables are in use (string, numeric). Possible solution: this can be addressed by Data Management through data warehousing. •  Access: who has access to the data. Possible solution: user roles can be defined to control who has access.
  12. Getting Started on Institutional Data •  Creating an inventory of institutional data •  Developing a data dictionary •  Designing an unambiguous process for cleaning up those data •  Creating an open data set that answers the most commonly asked data questions across campus.
  13. Opportunities for Libraries on Big Data •  Libraries know metadata •  Libraries know strategy •  Libraries know assessment •  Libraries are neutral •  Libraries know the vendors •  Libraries are part of larger bodies like PAARL •  Libraries have influence over campuses •  Libraries know metrics •  Libraries have user-centered culture •  Libraries know the politics and policy issues with commercial parties •  Libraries collaborate with both academic and academic support
  14. 2. Building a BIG DATA culture •  Openness and acceptance to technology: Upper Management •  Willingness to invest in the Big Data Platform: which entails cost •  Training Staff and making sure of job security: Skills upgrade •  Make data sharing acceptable: Trust in the data quality and people •  Create Data Quality Assurance Team/s •  Foster collaboration among departments •  Continuous improvement of models
  15. DATA Governance and DATA Management are different roles: Data governance is the designation of decision-rights and policy-making surrounding institutional data, while data management is the implementation of those decisions and policies. Institutions need both, and both require investment, but the senior leadership of our institutions need to design the former. [Diagram labels: Data Governance Council; Data Management; policies; metrics; Data Quality Dept; Data Warehouse / Data Lake]
  16. Machine Learning: a type of artificial intelligence that gives computers the ability to learn without being explicitly programmed.
  17. Market Basket Analysis on Book Recommendations (Association Rule Algorithm)
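
To make the association-rule idea concrete, here is a minimal sketch in plain Python. It is not the speaker's implementation: the loan records, the book titles and the `pair_rules` helper are all hypothetical, and it only mines rules between pairs of books rather than running a full Apriori search. For each pair it reports support (how often the two books are borrowed together), confidence (how often the second book is borrowed given the first) and lift (confidence relative to the second book's baseline popularity).

```python
from collections import Counter
from itertools import combinations

# Hypothetical loan records: each set holds the books one patron borrowed together.
loans = [
    {"Python Basics", "Statistics 101", "Data Mining"},
    {"Python Basics", "Data Mining"},
    {"Statistics 101", "Data Mining"},
    {"Python Basics", "Statistics 101"},
    {"Cooking at Home"},
]


def pair_rules(transactions, min_support=0.2):
    """Return 'if A then B' rules for book pairs, sorted by lift (highest first)."""
    n = len(transactions)
    # How many transactions contain each single book.
    item_counts = Counter(item for t in transactions for item in t)
    # How many transactions contain each unordered pair of books.
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to recommend on
        # Each pair yields two directed rules: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            lift = confidence / (item_counts[rhs] / n)
            rules.append(
                {"if": lhs, "then": rhs, "support": support,
                 "confidence": confidence, "lift": lift}
            )
    return sorted(rules, key=lambda r: r["lift"], reverse=True)


rules = pair_rules(loans)
for r in rules:
    print(f'{r["if"]} -> {r["then"]}: support={r["support"]:.2f}, '
          f'confidence={r["confidence"]:.2f}, lift={r["lift"]:.2f}')
```

A lift above 1 suggests the two books are borrowed together more often than chance alone would predict, which is the signal a recommendation shelf or catalogue widget could use.
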
  18. Weather related information and reading a book (use of hash tags and location and weather data) Pic from Marco Rasos
  19. Social Listening: the process of monitoring digital conversations to understand what customers are saying about a brand or service.
  20. Online Research Journals and Click-through Rates. Click-through rate (CTR): the ratio of users who click on a specific link, ad or button to the total number of users who view the page on which it appears.
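
As a small worked example, click-through rate can be computed as clicks divided by views (impressions). The function name and the figures below are illustrative assumptions, not data from the talk.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return clicks / impressions as a fraction between 0 and 1 in typical use."""
    if impressions <= 0:
        raise ValueError("impressions must be a positive number")
    if clicks < 0:
        raise ValueError("clicks cannot be negative")
    return clicks / impressions


# Hypothetical figures: a journal link shown 1,500 times and clicked 45 times.
ctr = click_through_rate(clicks=45, impressions=1500)
print(f"CTR: {ctr:.1%}")
```

Tracking this ratio per journal or per landing page lets a library compare which links actually draw readers, independent of how much traffic each page receives.
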
  21. OpenCV (Open Source Computer Vision Library)
  22. Modern Day Data Scientists •  Dr. Reina Reyes, Astrophysicist •  Andrew Ng of Baidu, Coursera •  Amy Smith, Uber Singapore Data Science Conference 2016 •  Isaac Reyes, Data Scientist •  Talas Data Scientists •  YOU as the next Doctor Strange (entering the world of Data Science)
  23. CRISP-DM Methodology (Cross-Industry Standard Process for Data Mining): The project was led by five companies: SPSS, Teradata, Daimler AG, NCR Corporation and OHRA, an insurance company.
  24. CRISP-DM Tasks
  25. From regular data to BIG data, from stat to AI: Regular data → BIG data; Statistical modeling → Machine Learning → Deep Learning / A.I. (Traditional → Modern)
  26. Trends in Data Science Domains (Data Science Domain: Current Status) •  Natural Language Processing (NLP): Entered the market •  Predictive Analytics / Machine Learning: Entered the market •  Visualization / Dashboards: Entered the market •  Image Processing (OpenCV): Exploration •  Internet of Things (IoT): Exploration •  Artificial Intelligence: Exploration
  27. DS/Big Data Applications to the Field of Study (Sample Field of Study: Specific Applications) •  Agriculture: climate forecast modeling to help farmers manage plantations (e.g. corn yields) •  Medical field: image processing for chest X-rays, retina images for diabetic patients •  Linguistics: Natural Language Processing (NLP) for dialects and Sentiment Analysis applications •  Economics/Finance: predicting a stock price based on certain indicators (e.g. noise, competitor price) •  Engineering: Internet of Things (IoT) application to Big Data
  28. Building a Data Science Team: Data Science Team Composition (DS role: programming languages and tools; sample output) •  Data Scientist: R, Python, Spark ML; Neural Nets, Random Forest •  Data Engineer / DevOps: Hadoop, Spark Core, Spark Streaming; RDD, dataframes, SQLContext •  Statistician: SAS, SPSS, R, Matlab; Linear Regression, K-means clustering •  Viz Expert: Tableau, Cognos, D3, JavaScript; visualization, GIS maps
  29. Trends in Programming Languages [chart labels: Scala, R, Python, Spark, RapidMiner, EMC, Java]
  30. TOOLS: OPEN SOURCE vs PROPRIETARY SOFTWARE •  Pros: open source has no software cost and packages become available faster; proprietary software is easy to deploy. •  Cons: open source takes some time to create and integrate with other software; proprietary software is expensive and you have to buy it in modules. •  Tools: open source: Python, R, Apache Spark; proprietary: SAS, IBM-SPSS, AWS, Google.
  31. Small Data vs Big Data (in comparison) •  Sampling: small data can be sampled (e.g. a survey); big data uses all of the data in storage. •  Computing: small data needs no in-memory computing and can run on a regular PC/Mac; big data eats up memory and needs distributed computing. •  Statistics: with small data the usual statistical assumptions (normality, homoskedasticity, independence) hold true; with big data, tools such as p-values stop being informative because the data is so large (an effect that is not significant in a small set will become significant, so be careful when relying on these assumptions).
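
The warning about p-values on very large data sets can be illustrated with a short sketch. It assumes a one-sample z-test with known standard deviation and a tiny, practically negligible effect of 0.01 standard deviations; the function name and the numbers are illustrative assumptions, not taken from the talk.

```python
import math


def two_sided_p_value(effect_size: float, n: int) -> float:
    """Two-sided p-value of a one-sample z-test.

    effect_size is the observed mean shift measured in standard deviations,
    so the test statistic is z = effect_size * sqrt(n).
    """
    z = effect_size * math.sqrt(n)
    # For a standard normal Z, P(|Z| > z) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


tiny_effect = 0.01  # a shift of 1% of a standard deviation
p_small = two_sided_p_value(tiny_effect, n=100)        # survey-sized sample
p_large = two_sided_p_value(tiny_effect, n=1_000_000)  # big-data-sized sample

print(f"n=100:       p = {p_small:.3f}")
print(f"n=1,000,000: p = {p_large:.3e}")
```

The effect is identical in both cases; only the sample size changes. With a million rows the p-value collapses far below any conventional threshold, which is why effect sizes and practical relevance matter more than significance tests at this scale.
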
  32. Simple DS Cheat Sheet •  Classifiers: Neural Nets, Random Forest •  Clustering: K-means, Hierarchical Clustering •  Association: Association Rules •  Predicting: Linear Regression, Logistic Regression (binary) •  Medical: Cox Regression (survival), SVM (cancer cells)
  33. Visualization TOOLS
  34. Color Hues and Functionality
  35. Local Implications: Data Privacy Act of 2012 (Republic Act No. 10173). Sensitive personal information refers to personal information: 1. About an individual’s race, ethnic origin, marital status, age, color, and religious, philosophical or political affiliations; 2. About an individual’s health, education, genetic or sexual life of a person, or to any proceeding for any offense committed or alleged to have been committed by such individual, the disposal of such proceedings, or the sentence of any court in such proceedings; 3. Issued by government agencies peculiar to an individual which includes, but is not limited to, social security numbers, previous or current health records, licenses or its denials, suspension or revocation, and tax returns; and 4. Specifically established by an executive order or an act of Congress to be kept classified.
  36. Solutions to the Data Privacy Act: Policies. Make sure you have the following in place: •  Opt-in for customers •  Opt-out for customers •  Update your customer policy accordingly •  Make your policy publicly available, e.g. on your website