The document discusses unstructured data and its importance for business intelligence. It notes that 80% of organizational data is typically unstructured and resides in various documents and sources, both internal and external to the organization. Environmental scanning involves systematically analyzing unstructured external data to produce market forecasts and intelligence reports. Text mining can help untangle unstructured data through content analytics and indexing content from sources like emails, websites and social media. This can provide insights for applications like brand, competitor and organizational intelligence. However, challenges include ensuring accurate content tagging and addressing scalability issues for large volumes of unstructured data.
2. Unstructured data: Does not reside in relational database tables. Has no predefined structure or format. Is not arranged in any order. Is difficult to categorise for use in BI. Resides in several documents across multiple sources: internal (data within an organisation) and external (data outside the organisation). Environmental scanning: scanning for information about events, trends and relationships in a company’s outside environment. (Sabherwal & Becerra-Fernandez 2011:85)
3. Environmental scanning (Sabherwal & Becerra-Fernandez 2011:85): Shows how changes in the external environment may impact a company’s decision making. Is a predictor of improved organisational performance through monitoring of external events. Includes seeking, searching for and using information.
4. A two-dimensional model proposed by Daft & Weick (1984) (Sabherwal & Becerra-Fernandez 2011:86): Environmental analysability (EA). Organisational intrusiveness (OI).
5. Environmental scanning cont’d. Undirected viewing mode: Satisfied with limited information. Does not seek comprehensive data. Relies on irregular contacts and information. Conditioned viewing mode: Makes use of standard procedures. Relies on significant data from external reports that are widely used in industry.
6. Environmental scanning cont’d. Searching mode: Systematically analyses data to produce market forecasts, trend analysis and intelligence reports. Willing to revise and update existing knowledge. Enacting mode: Constructs its own environment. Gathers information by trying new behaviour and observing what happens. Experiments, tests and stimulates. Ignores precedent, rules and traditional expectations.
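The four scanning modes can be read as the cells of the two-dimensional model (EA x OI). The sketch below is one way to write that lookup down. The placement of each mode in the grid follows the usual reading of Daft & Weick (1984) and is not spelled out on the slides, so treat it as an assumption; the enum and function names are hypothetical.

```python
from enum import Enum


class Analysability(Enum):
    """Environmental analysability (EA): can the environment be analysed?"""
    ANALYSABLE = "analysable"
    UNANALYSABLE = "unanalysable"


class Intrusiveness(Enum):
    """Organisational intrusiveness (OI): how actively does the organisation probe?"""
    PASSIVE = "passive"
    ACTIVE = "active"


# Assumed placement of the four modes in the EA x OI grid.
SCANNING_MODES = {
    (Analysability.UNANALYSABLE, Intrusiveness.PASSIVE): "undirected viewing",
    (Analysability.ANALYSABLE, Intrusiveness.PASSIVE): "conditioned viewing",
    (Analysability.ANALYSABLE, Intrusiveness.ACTIVE): "searching",
    (Analysability.UNANALYSABLE, Intrusiveness.ACTIVE): "enacting",
}


def scanning_mode(ea: Analysability, oi: Intrusiveness) -> str:
    """Return the scanning mode for a given combination of EA and OI."""
    return SCANNING_MODES[(ea, oi)]


print(scanning_mode(Analysability.ANALYSABLE, Intrusiveness.ACTIVE))
```

An organisation that sees its environment as analysable and actively probes it would land in the searching mode under this reading.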
7. Types of unstructured content (Ferguson 2011:6; McCallum 2005:49; SPSS 2003:3): HTML content (e.g. web chat, blogs and web pages). Documents (e.g. memos, research papers and articles). Forms (e.g. patent applications). Emails. SMS content. Multimedia content (audio, video, images).
8. Examples of data sources (Ferguson 2011:6): Email archives. Call center transcripts. Customer feedback databases. Enterprise intranets. Enterprise content management systems. File systems. Document management systems. Social networking sites. RSS news feeds.
9. Wittles (n.d.) asserts that: 20% of an organisation’s data is structured and ready for use in BI data analysis. The remaining 80% is unstructured data. The significance of unstructured data is underestimated.
10. The social media effect The current main driver in the upsurge of online content is social networks. Facebook statistics are used as an example.
16. Untangling unstructured data. Content analytics (text mining & web mining): The process of analysing semi-structured or unstructured content from one or more sources to derive insight that will be of business benefit. (Ferguson 2011:4)
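A minimal sketch of the idea, assuming that "insight" here is as simple as finding the terms that recur across sources. The sample documents, the stopword list and the function names are hypothetical; real content analytics tools do far more (entity extraction, sentiment, classification).

```python
import re
from collections import Counter

# Hypothetical content standing in for an email, a blog post and a call transcript.
DOCUMENTS = {
    "email-1": "The new billing portal is slow and the billing page times out.",
    "blog-1": "Customers praise the mobile app but complain about billing delays.",
    "call-1": "Caller reported that the mobile app crashed during checkout.",
}

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "is", "and", "but", "about", "that", "during", "new", "out"}


def tokenize(text):
    """Lowercase the text, keep alphabetic words, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def top_terms(documents, n=3):
    """Count terms across all documents and return the n most frequent."""
    counts = Counter()
    for text in documents.values():
        counts.update(tokenize(text))
    return counts.most_common(n)


print(top_terms(DOCUMENTS))
```

On this sample the most frequent term is "billing", which appears in two of the three sources; that kind of cross-source signal is what the analysis is meant to surface.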
17. Data acquisition: Uses crawlers, search and indexing technologies to identify, tag and index relevant content. Multiple crawlers can be set to crawl in parallel. Crawled content can be: Indexed, with the index made available for analysis. Stored in a file system or data store (e.g. Hadoop DFS, MongoDB).
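The crawl-then-index step can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular product: the "sources" are an in-memory dictionary rather than real websites or mailboxes, `crawl` is a hypothetical stand-in for one crawler, and the index is a plain inverted index mapping each term to the sources that contain it.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory sources; a real crawler would read from a network or file system.
SOURCES = {
    "intranet/page1": "Quarterly sales report for the retail division",
    "email/42": "Customer complaint about late delivery",
    "feed/7": "Competitor launches retail delivery service",
}


def crawl(source_id):
    """Stand-in for one crawler: return the source id and its raw text."""
    return source_id, SOURCES[source_id]


def build_index(source_ids, workers=3):
    """Crawl the sources in parallel and build a term -> set of source ids index."""
    index = defaultdict(set)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Crawling runs in worker threads; indexing happens here in the main thread.
        for source_id, text in pool.map(crawl, source_ids):
            for term in re.findall(r"[a-z]+", text.lower()):
                index[term].add(source_id)
    return index


index = build_index(list(SOURCES))
print(sorted(index["retail"]))
```

Keeping the index updates in the main thread avoids concurrent writes to the shared dictionary; only the fetching is parallelised, which is where a real crawler spends most of its time.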
20. Pros & cons. Pros: Provides deep insight for BI. Quick detection of trends. Cons: Analytics are industry dependent, because each industry has unique content to utilise. Indexing large content volumes may bog down search engine performance. Content tagging may not be accurate. Crawlers may not detect some content.
21. Future considerations: Ensure that user content is accurately tagged. Ensure that content is up to date and relevant. Validate content sources. Identify business drivers to get the best solution. To address scalability issues, allocate adequate processing power to analytics.
22. Possible research opportunities Patent violation detection system. Questionnaire/interview analysis system. CRM content analytics. Contextual comparison and assessment. Multimedia content detection.
23. References: Feldman, R. & Sanger, J. 2007. The text mining handbook: Advanced approaches in analyzing unstructured data. New York: Cambridge University Press. Ferguson, M. 2011. Integrating and analysing unstructured data. Info360 BI Conference. Washington DC. McCallum, A. 2005. Information extraction. (http://www.cs.umass.edu/~mccallum/papers/acm-queue-ie.pdf) Retrieved 17 February 2011. Sabherwal, R. & Becerra-Fernandez, I. 2011. Business intelligence: Practices, technologies, and management. New Jersey: John Wiley & Sons, Inc. SPSS. 2003. Meeting the challenge for text: Making text ready for predictive analysis. Chicago. Wittles, G. n.d. Unstructured data offers a vast store of untapped BI value. (http://www.themanager.org/strategy/Unstructured_data.htm) Retrieved 19 February 2011.