Data quality presentation oct 2006 23092006
1. Evaluating the impact and effect of analytics on your data quality: the NT Community Health way. ARK data quality conference 2006. Anastasia Govan, Director, Whitehorse Strategic Group. October 2006
2.
3. 1. What is Community Health Well Womens Cancer Screening Community & Primary Care Hearing Planning & Development Child Youth & Family Director Customer Service Assistant Secretary CEO Quality and Best Practice Dept Health & Comm Srvcs (DHCS) Health Services Division (HSD) Community Health Branch (CHB) CHB Work units
4. 2. Comm Health Knowledge requirements OPERATIONAL MANAGEMENT EXECUTIVE Administration Nursing Planning & Development Hearing/Cancer Screening MINISTER
10. Vendrzyk study 7. Which tools should we use? 5.56 5.71 Data Mining Capability 5.78 6.27 Ad hoc Query Capability 6.22 6.19 Decision Support Capability 5.94 6.06 Query Performance 5.73 6.07 Average for All 18 Quality Assurance Entities Model Integrity 4.67 Incremental Update Capability 5.19 Lowest Consistency of Data between Source and Warehouse 6.61 Consistency of Data between Source and Warehouse 6.69 Highest IT Accountants 36 respondents – 5.9 mean
13. Rudra & Yeo study 9. Analysis and retrieval limitations
14. Analysis and retrieval limitations – the case of the incorrect venues CCIS SHILO Data marts Business objects Intranet Chief Information Officer Business Analyst – Management Reporting Team CCIS Manager Data Warehouse Manager Director Community Health
16. References
Gilhooly, K. (2005). Dirty Data Blights the Bottom Line. Computerworld, v 39, pp 23-4.
Rudra, A & Yeo, E. (1999). Key Issues in Achieving Data Quality and Consistency in Data Warehousing among Large Organisations in Australia. Proceedings of the 32nd Hawaii International Conference on System Sciences, IEEE.
Sen, R; Sen, T; Vendrzyk, V. (2000). An Instrument for Assessing Quality of Data Warehouses. Journal of Data Warehousing, Summer 2000, pp 31-41.
Theodoratos, D & Bouzeghoub. (2001). Data currency quality satisfaction in the design of a data warehouse. International Journal of Cooperative Information Systems, Vol. 10, No. 3, pp 299-326.
Vendrzyk, V; Rymysen, D; Sen, A. (2001). How management accountants assess the quality of data warehouses. Management Accounting Quarterly, Spring.
Wang, R; Kon, H; Madnick, S. (1993). Data Quality Requirements Analysis and Modeling.
Wang, R; Reddy, M; Henry, K. (1992). Toward Quality Data: An Attribute-Based Approach.
17. Whitehorse Strategic Group Ltd. PO Box 2096, Darwin, Northern Territory Australia 3000. Level 3, 45 William Street, Melbourne, Victoria Australia 3000. [email_address] Whitehorse Strategic Group Ltd. is a management consulting practice with a well established reputation in Government and industry. It is a proud Australian company with significant international experience. Whitehorse has a broad client base, predominantly from major private companies and the public sector, especially those elements of the public sector undergoing commercialization or other business change processes. Whitehorse was founded in 1987 by a group of creative individuals who came together with a shared vision to create a new style of strategic consulting. The principals of Whitehorse come from diverse backgrounds and disciplines, and all have extensive management experience. Anastasia specialises in Information Architecture and process mapping across Australia and Asia
Editor's notes
This slide indicates the topics I will cover today.
The Department of Health and Community Services (DHCS) is the largest Northern Territory government agency, employing more than 4000 staff, and is the only agency with a complete data warehouse structure that has been in place for some time. Some other agencies have implemented discrete data marts because of the cost of implementing large scale data warehouse systems. The Community Health Branch (CH) is a disparate conglomerate of services, some mutually exclusive and some overlapping (Community and Primary Care and Child, Youth and Family services liaise daily to provide a holistic primary health care service to families). It works with a Full Time Equivalent cap of 147 staff made up of allied professionals, administration and management, but mostly nursing staff.
There are several layers of reporting requirements, coordinated by the Planning and Development unit for all other work units in Community Health. Four discrete functions input data into very separate systems: Nursing, Hearing, and Well Women's Cancer Screening (Cervical and Breast). Reporting is required on an ad hoc, monthly, quarterly and annual basis to Management, back to the operational staff, to the DHCS executive and to the Minister.
This slide indicates the Information Systems environment of DHCS, with all ‘corporate’ defined business systems feeding into the data warehouse. The Data Warehouse team has 4 full time equivalents and manages the periodic extraction, validation, transformation and loading of information from corporate operational systems into the Data Warehouse (SHILO). Data marts (subject specific data views of the data warehouse) are systematically produced and distributed for analysis. End user analysis and reporting is done either by end users creating reports in Business Objects or by business analysts publishing reports on the intranet that identify performance against predefined management indicators. Community Health has several information systems, of which only one is currently deemed a ‘corporate’ database and therefore feeds into the data warehouse and is managed consistently with other databases on the network. This is CCIS, a client management system used by the Community Health, Mental Health and Aged and Disability units. All operational reporting from ‘corporate’ databases for Community Health, except finance, is currently undertaken by Business Analysts in the Information Services Division, as a Community Health Universe is not available. Management staff use Business Objects to access monthly financial reports that are already in a set form.
Community Health has several different levels of reporting requirements and a conglomerate of disparate hardcopy and electronic information systems. Hearing uses Hearsoft as a client management system, but at each location it runs on standalone PCs that are not networked to a central database. Because Hearing receives federal government funding for some remote area positions, reports are manually taken out of this system once a month by the Manager of Hearing Services and provided to Planning & Development. Well Women's Cancer Screening uses separate databases for Pap Smears and Breast Screening. These databases are on the corporate network but are not linked to the data warehouse, and they are currently being replaced as they are not compatible with Active Directory. As this is a federal government program provided by the Northern Territory government, the databases have extensive reporting. Due to the criticality of reporting and severe deficiencies with the current database structure (including the inability to receive data electronically from the source labs straight into the Pap Smear database), it takes 1.5 FTEs to manage the inputting of smear test result data from 7 laboratories into the current Pap Smear system. The nursing and administration staff use CCIS (the client management system). Management access Business Objects or the intranet and sometimes CCIS. CCIS has an underdeveloped reporting module, so all retrieval and analysis of data is done through the data warehouse. Because a Community Health Universe is not available, all retrieval of data into Excel for analysis is completed by Information Services Division Business Analysts.
This slide indicates some of the general data issues we have across all programs in the Community Health Branch. Most of our data quality issues through the data warehouse arise from different interpretations, at each Northern Territory Community Health Centre location (Alice Springs, Nhulunbuy, Katherine, Tennant Creek, Palmerston, Darwin, Nightcliff, Karama), of what each data element is. Different interpretations of what is a case as opposed to a casual service event, the lack of sub-definitions of the types of services provided, and differences in which service is recorded under which type have arisen because the Department was, until 3 years ago, managed and providing services along regional lines. Standard definitions have been difficult to implement as there is currently no nationally endorsed ontology or data set for Community Health. CCIS is currently very loosely based on CATCH and METEOR. Without a nationally endorsed code set it is difficult to implement a mandated set of business rules, so all nursing staff need to be involved and agree on a consistent set of business rules. Analysing trend data for one location is successful, but comparing trends across locations or reporting on Community Health service events as a whole is fraught with quality issues. Differences in business rule interpretation and their effects on reporting have occurred since the Bansemer Review recommended that services at Community Health Centres (such as immunisations, wound dressings, post natal care) and reporting lines be Territory wide. Resolving differences in the interpretation of business rules, and having staff see the importance of consistent and correct data entry, has been facilitated by the Community Health Branch Quality Accreditation framework, which provides a PDCA cycle for improving business processes. It has also been enhanced by the branch adopting an Evidence Based approach to decision making.
Some ‘sacred cows’ held by nursing staff have been dispelled. For example, the belief that most mothers and babies attend drop-in clinics for immunisations was found to be incorrect; they present more often for baby weighs and developmental assessments. Dispelling these beliefs has allowed an easier transition of staff between the two Darwin centres and minimised the involvement of unions in changes to the way services are delivered.
DHCS currently does not have sophisticated tools other than Business Objects and SQL. We also do not have automated tools for data cleansing, but data is validated on a regular basis to detect incompleteness and inconsistency. Where discrepancies are found, these are collated and sent to service providers for clarification. Business analysts who work with the data warehouse extract monthly reports from each data set; these reports identify where errors may be, based on a set of validation rules. Once validation is completed the data sets are merged to form one data set, which is then accessed via Business Objects for reporting purposes.
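The validate, collate and merge steps just described can be sketched roughly as follows. This is a minimal illustration in Python, not the DHCS process itself: the field names (`provider`, `venue`, `event_date`, `service_type`, `attendees`), the two rules and the sample records are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical required fields; the real validation rules are not listed in the notes.
REQUIRED_FIELDS = ("provider", "venue", "event_date", "service_type")

def check_complete(record):
    """Incompleteness rule: every required field must be present and non-empty."""
    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    return "missing: " + ", ".join(missing) if missing else None

def check_consistent(record):
    """Inconsistency rule (assumed): a group session needs at least 2 attendees."""
    if record.get("service_type") == "group session" and record.get("attendees", 0) < 2:
        return "group session with fewer than 2 attendees"
    return None

RULES = (check_complete, check_consistent)

def validate(data_set):
    """Split one data set into clean records and discrepancies collated per provider."""
    clean, discrepancies = [], defaultdict(list)
    for record in data_set:
        errors = [msg for msg in (rule(record) for rule in RULES) if msg]
        if errors:
            provider = record.get("provider") or "unknown"
            discrepancies[provider].append((record, errors))
        else:
            clean.append(record)
    return clean, dict(discrepancies)

def merge(*clean_sets):
    """Merge validated data sets into the single data set used for reporting."""
    return [record for clean in clean_sets for record in clean]

# Made-up sample records for two source data sets.
nursing = [
    {"provider": "Darwin", "venue": "Venue A", "event_date": "2006-07-03",
     "service_type": "immunisation"},
    {"provider": "Darwin", "venue": "", "event_date": "2006-07-04",
     "service_type": "group session", "attendees": 5},
]
hearing = [
    {"provider": "Katherine", "venue": "Venue B", "event_date": "2006-07-05",
     "service_type": "hearing test"},
]

nursing_clean, nursing_issues = validate(nursing)
hearing_clean, hearing_issues = validate(hearing)
merged = merge(nursing_clean, hearing_clean)
```

Here `nursing_issues` holds the records that would be sent back to the Darwin service provider for clarification, and `merged` holds only the records that passed every rule.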
To identify which tools should be used, one can start from reports identifying which commercially available applications are currently leading the market. The vendors can then be invited to respond to defined questions that reflect your needs and to demonstrate their packages, with a working group of users rating the applications against the predefined set of needs.
There are so many applications available for data tagging, mining (SPSS), validation etc, but how do you identify which ones are the most important for your needs, and at which stage of the data process? This study by Vendrzyk, Rymysen and Sen (2001) found that a team of Management Accountants and IT professionals is best placed to review and implement data warehouses and to indicate which tools you should prioritise for implementation. The average score was 5.9, so any score above this can be considered important, and those are the items I have placed on the slide. The highest area of concern for accountants was backups (score of 6.59). The area of highest concern for technologists was data source interpretation (6.39). Another tool which DHCS does not have, and which would automate data validation, is tagging data with quality parameters to indicate quality as the data moves through its lifecycle of creation, storage, retrieval and processing. An Entity Relationship model presented by Wang, Reddy, Madnick and Kon (1993) is a data model that facilitates cell-level tagging of data. It includes a mathematical model description that extends the relational model, a set of quality integrity rules, and a quality indicator algebra which can be used to process SQL queries that are augmented with quality indicators. They suggest establishing a set of premises, terms and definitions for data quality management, and developing a step-by-step methodology for defining and documenting the data quality parameters important to users. These quality parameters are then used to determine quality indicators about the data manufacturing process, such as data source, creation time and collection method, which are tagged to data items.
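To make the idea of cell-level tagging concrete, here is a minimal sketch in Python. It is not the model or the quality indicator algebra from the cited work: the `Tagged` class, the `select` helper and the sample values are hypothetical, and only the three indicators named above (data source, creation time, collection method) are carried.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Tagged:
    """One cell value plus the quality indicators tagged to it."""
    value: object
    source: str    # data source, e.g. the operational system it came from
    created: date  # creation time
    method: str    # collection method

def select(rows, column, allowed_sources):
    """Return a column's values, keeping only cells whose source indicator is allowed.

    This plays the role of a query augmented with a quality indicator.
    """
    return [row[column].value for row in rows if row[column].source in allowed_sources]

# Made-up rows: each cell carries its own tags.
rows = [
    {"venue": Tagged("Darwin", "CCIS", date(2006, 7, 3), "entered by nurse")},
    {"venue": Tagged("Katherine", "manual extract", date(2006, 7, 5), "monthly spreadsheet")},
]

trusted = select(rows, "venue", allowed_sources={"CCIS"})
```

Because the tags travel with each cell, a report can state which sources it relied on, or exclude values collected by a method the user does not trust.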
One tool that is missing for DHCS is a performance management dashboard. This one is used by Power Water Corporation in the Northern Territory. It shows asset management information on a regular basis to senior managers from their data mart. It has been well received by senior management, and the Director of Community Health has requested that one be delivered. As DHCS already uses Business Objects, which is regarded as a good tool for dashboards, we will be documenting requirements and putting the request through our benefits realisation process to rank it in the Division's Information Management Group priorities list. A study by The Data Warehousing Institute (TDWI) in 2004 showed that most organisations (51%) already use a dashboard or scorecard and that another 17% are currently developing one. The same study showed that almost one-third of organisations that already have a dashboard or scorecard use it as their primary application for reporting and analysis of data.
This slide shows an example of the end intranet Management reporting output, updated mid-month by the business analyst team from Business Objects. This is currently being reviewed by Community Health managers to change the reporting to financial year (not calendar year), to give quarterly totals (not just end of year) and to identify what input into CCIS is actually related to Community Health, as the work units that make up ‘Community Health’ have changed considerably in the last three years as a result of the Bansemer review of the Department.
This slide shows a survey of Australia’s top 50 businesses with over 500 employees and indicates the top reasons for lack of data quality in data warehouses. The results are consistent with Community Health’s experience in entering data into the CCIS client management system and what comes out in reporting through the data warehouse, Business Objects and into Excel or intranet management reports. These figures are also consistent with Gartner’s finding that through 2007 at least 25% of critical data within Fortune 1,000 companies will continue to be inaccurate, and with a 2004 PricewaterhouseCoopers survey in which only 34% of responding executives said they were very confident in the quality of their corporate data (Gilhooly, 2005).
This year the Community Health Director’s trust in the process of data retrieval and analysis has diminished. During the annual government estimates reporting to Treasury she requested a report from the data warehouse on Community Health service events by venue for financial year 2005/06 for the Community & Primary Care and Child, Youth & Family work units. Event details originate from nursing staff entering data into CCIS. The first report was received on 5th July, with the Business Analyst pointing out to the Director that there was an unusually large number of ‘unknown’ venues. The Director requested Managers to instruct nursing staff to enter data into CCIS accurately. The Managers emailed the request to the Clinical Nurse Managers, who forwarded it to nursing staff on 6th July. The operational staff then informed the Clinical Nurse Manager that the venue field is automatically filled in if the user preferences are set to a venue. The Clinical Nurse Manager forwarded this to the Business Analyst, who discussed it with the CCIS team to confirm whether the venue field is mandatory. This was confirmed. He then indicated he would investigate the data transfer from CCIS to the data warehouse by reviewing a subset of the data. On 26th July the Director requested an update on the error finding from the Business Analyst. He confused it with another data error being experienced and told the Director that Community Health data was fine and that the problem was in extracting data from the Centre for Disease Control database to the Data Warehouse. The Director, very confused as she is not responsible for the Centre for Disease Control, told the Business Analyst his response made no sense. He then apologised and confirmed that the problem lay with the extract program from CCIS to the data warehouse and that it was being corrected by the CCIS team. She then talked to the Manager of the CCIS team about rectification. The CCIS team did not identify any issues and left it at that.
The CCIS team and the Business Analyst’s team had been split during a restructure of the Information Services Division, and access approvals were waiting on the appointment of a new CIO, so the Business Analyst did not have his access to the CCIS generic data mart reinstated until September, when he could investigate the problem further himself. He found the problem and told the CCIS Manager that his ‘quality assurance’ investigations into the generic event records with an ‘unknown’ venue revealed that the event venue for persons in a Group Session event is not captured in the Generic extract. He provided sample records from the extract file that have a null venue and requested the CCIS Manager to approve and lodge a service request with the outsourced programmer to update the extract program. This again required the intervention of the CIO to have the request lodged. On 15th September the correct venues data was provided to the Director with the explanation that there were in fact two problems: (1) a data warehouse mapping and loading problem, which had been fixed; and (2) a CCIS data extract problem, which had not been fixed for future data extractions and would require $5000 worth of changes by the vendor to the CCIS database. The CCIS Manager was not keen to proceed through a lengthy change control process. The Business Analyst further explained that the data was not 100% accurate, as approximately 20% of the Unknown venue records are for Group Session events that do have a venue (affecting approximately 2,000 of 100,000 records annually), because the venue is not available in the extract files provided to the data warehouse from CCIS. The Director does not have a statistical background but is concerned that 20% of the data being incorrect is statistically significant. She has requested the Manager of Planning & Development to explain the problem to her again and to pursue the Business Analyst and CCIS team to insist an automated fix is implemented.
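The two figures quoted by the Business Analyst are easy to conflate, so a short worked calculation may help. The Python sketch below takes the quoted numbers at face value (20% of ‘unknown’ venue records, about 2,000 of 100,000 records a year); the implied count of ‘unknown’ records is derived from them and is not stated in the case study.

```python
# Figures as quoted in the case study (approximate).
total_records = 100_000            # service event records per year
affected_records = 2_000           # Group Session events that do have a venue
share_of_unknown_pct = 20          # affected records as a percentage of 'unknown' records

# Derived, not stated in the source: how many records have an 'unknown' venue at all.
implied_unknown_records = affected_records * 100 // share_of_unknown_pct

# Affected records as a percentage of ALL records, which is the figure that
# matters when judging the accuracy of the whole report.
share_of_all_pct = affected_records * 100 / total_records

print(implied_unknown_records)  # 10000
print(share_of_all_pct)         # 2.0
```

Read this way, the 20% describes the share of ‘unknown’ venue records that could be corrected, while the share of all service event records affected is about 2%.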
From this case study it is obvious that further work to minimise integrity errors and improve data quality control measures is required to rebuild trust in the department’s automated reporting processes. In analysing and retrieving the information, it is a process issue, not a tool issue. Theodoratos and Bouzeghoub (2001) have identified a data warehouse architecture that supports data consistency quality goals by specifying a number of detailed currency constraints at the query level and availability constraints at the data source level; such an architecture may also have alleviated the issues in this case study.
Spending the $5000 to rectify the operational database problem behind the monthly data extraction issues in the example just now, implementing data quality tools such as Dataflux, Billerica or Firstlogic, or implementing the dashboard that I spoke of earlier will all mean lobbying management on the benefits of such new or enhanced tools and beating others to the post to spend limited funds in the IT budget. One way to do this is to implement a benefits realisation or benefits harvesting process in your IT project governance model. It’s a few extra headings to fill in on the business case once the needs analysis is completed. This slide indicates the process that DHCS has proposed to help identify whether a course of action to rectify information system problems is worth taking: a proposal must score higher than a median point on the graph, and the score gives the priority in which Information and Technology related projects should be tackled by the Health Services Division Information Management Group and the Information Services Division. It also has benefits for intangible information related projects, such as improving reporting data quality, as it allocates points to intangible benefits and not just to bottom line costs. The Management Reporting Information System rating process involves both Information Division and the Divisional Information Management Group using an evaluation system for rating project proposals according to their feasibility factors and strategic factors. For a project proposal to be considered for development it must achieve high feasibility and strategic factor scores.
Project Proposals from branches are assessed by Information Division and HSIMG using the System Project Proposal form, which: provides a concise project proposal statement; records the Information Division feasibility rating (preliminary analysis); and records the HSIMG strategic rating. Feasibility factors (Technical, Economic, Legal, Operational and Schedule, or TELOS) and strategic factors (Productivity, Differentiation and Management, or PDM) are the two key elements of a proposal that are used in evaluating the potential of the MRIS proposal during the HSIMG planning process.
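As an illustration only, the rating step could be sketched like this in Python. The TELOS and PDM factor names come from the process described above; the 0 to 10 scale, the unweighted averages, the use of the scale midpoint as the cut-off and the sample proposals with their scores are all assumptions, since the actual DHCS scoring scheme is not given here.

```python
from statistics import mean

FEASIBILITY = ("technical", "economic", "legal", "operational", "schedule")  # TELOS
STRATEGIC = ("productivity", "differentiation", "management")                # PDM

def rate(proposal, scale_max=10):
    """Average the factor scores and flag whether both clear the midpoint (assumed cut-off)."""
    feasibility = mean(proposal["feasibility"][f] for f in FEASIBILITY)
    strategic = mean(proposal["strategic"][s] for s in STRATEGIC)
    midpoint = scale_max / 2
    return {
        "name": proposal["name"],
        "feasibility": feasibility,
        "strategic": strategic,
        "consider": feasibility > midpoint and strategic > midpoint,
    }

def prioritise(proposals, scale_max=10):
    """Keep proposals that clear both cut-offs, highest combined score first."""
    rated = [rate(p, scale_max) for p in proposals]
    shortlisted = [r for r in rated if r["consider"]]
    return sorted(shortlisted, key=lambda r: r["feasibility"] + r["strategic"], reverse=True)

# Hypothetical proposals with made-up scores.
proposals = [
    {"name": "Dashboard",
     "feasibility": {"technical": 6, "economic": 5, "legal": 9, "operational": 6, "schedule": 5},
     "strategic": {"productivity": 6, "differentiation": 6, "management": 8}},
    {"name": "CCIS extract fix",
     "feasibility": {"technical": 8, "economic": 7, "legal": 9, "operational": 8, "schedule": 6},
     "strategic": {"productivity": 7, "differentiation": 5, "management": 8}},
    {"name": "New cleansing tool",
     "feasibility": {"technical": 4, "economic": 3, "legal": 6, "operational": 5, "schedule": 4},
     "strategic": {"productivity": 8, "differentiation": 7, "management": 7}},
]

ranking = [r["name"] for r in prioritise(proposals)]
```

With these made-up scores the third proposal drops out because its feasibility average falls below the midpoint, even though its strategic score is high, which mirrors the rule that a proposal needs both scores to be high.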
In conclusion, if you embrace quality systems based processes and tools, focusing on the data being ‘clean’ at its source of origination (i.e. the operational database), and sample different levels and layers of the process, you will be well on your way to an effective and trustworthy analysis and retrieval model.