DataMeet 4: Data cleaning & census data

  1. DATA CLEANING & PROFILING. UNDERSTANDING INDIA … CENSUS 2011. Datameet 4, Bhavin Dalal
  2. What is Data Quality?
     • Data quality is a perception or an assessment of data's fitness to serve its purpose in a given context.
     • Aspects of data quality include:
       • Accuracy: How accurate is the data?
       • Completeness: Is all the data present?
       • Update status: How old is the data?
       • Relevance: Is the data relevant to the purpose at hand?
       • Consistency: Is the data consistent across different sources?
       • Reliability: How much can we rely on the data?
       • Appropriate presentation: Is the data presented in a way that makes it usable?
       • Accessibility: Is the data accessible to all those who require it?
  3. Data Quality Problems
     • Referential integrity
     • Use of NULL
     • Value checking for reasonableness
       • Date values, for example
     • Values constrained to a pre-defined domain (e.g. salutation)
  4. Before doing data quality
     • Profiling of data
     • Conformity check
     • Standardization
       • Gender -> M/F or Male/Female or Unknown or Null?
     • Duplicate values
     • Survivorship
       • Best-quality set from different records
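The profiling and standardization steps on the slide above can be sketched in a few lines of Python. This is only an illustration: the raw spellings in `GENDER_MAP`, the canonical codes `M`/`F`/`Unknown`, and the function names are assumptions, not something the deck specifies.

```python
from collections import Counter

# Hypothetical lookup table: raw spellings and canonical codes are assumptions.
GENDER_MAP = {
    "m": "M", "male": "M",
    "f": "F", "female": "F",
}

def profile(values):
    """Profiling: count how often each raw value occurs."""
    return Counter(values)

def standardize_gender(raw):
    """Standardization: map a raw value to 'M', 'F' or 'Unknown'."""
    if raw is None:
        return "Unknown"
    key = str(raw).strip().lower()
    return GENDER_MAP.get(key, "Unknown")

raw_values = ["M", "male", " Female", None, "F", "n/a"]
print(profile(raw_values))                          # shows the variety of raw values
print([standardize_gender(v) for v in raw_values])  # one canonical code per record
```

Profiling first shows which spellings actually occur, so the lookup table can be extended before anything is overwritten.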
  5. Basic Data Cleaning Steps
     • Removing spaces and nonprinting characters
     • Fixing numbers and number signs
     • Fixing dates and times
     • Merging and splitting columns
       • E.g. names (First Name + Last Name / Full Name)
     • Need for transformation
     • Checking data quality through joining and matching
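A minimal sketch of the first few steps on the slide above, using only the Python standard library. The helper names, the accepted number notations (thousands separators, trailing minus, parentheses) and the two date formats are assumptions chosen for illustration.

```python
import unicodedata
from datetime import datetime

def clean_text(value):
    """Replace nonprinting characters with spaces and collapse whitespace."""
    value = value.replace("\u00a0", " ")  # non-breaking space -> ordinary space
    # Unicode categories starting with "C" cover control and format characters
    printable = "".join(
        " " if unicodedata.category(ch).startswith("C") else ch for ch in value
    )
    return " ".join(printable.split())

def parse_number(value):
    """Parse '1,234', '12.5-' (trailing sign) or '(300)' into a float."""
    text = clean_text(value).replace(",", "")
    negative = False
    if text.startswith("(") and text.endswith(")"):
        negative, text = True, text[1:-1]
    elif text.endswith("-"):
        negative, text = True, text[:-1]
    number = float(text)
    return -number if negative else number

def parse_date(value, formats=("%d/%m/%Y", "%Y-%m-%d")):
    """Try each assumed format in turn; return None if none fits."""
    text = clean_text(value)
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

def split_full_name(full_name):
    """Naive split of a full name into (first name, last name)."""
    parts = clean_text(full_name).split()
    if not parts:
        return ("", "")
    return (parts[0], parts[-1] if len(parts) > 1 else "")
```

Values that fail to parse come back as `None` (dates) or raise `ValueError` (numbers), so they can be routed to manual review instead of being silently guessed.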
  6. Finding duplicate values
     • The following string-distance and phonetic algorithms can be used to find near-duplicates:
       • Hamming()
       • Jaro-Winkler()
       • Levenshtein()
       • Damerau-Levenshtein() (advanced version)
       • Q-gram()
       • Cosine()
       • Soundex()
  7. Hamming
     • Number of positions at which the two strings have different symbols. Only defined for strings of equal length.
     • distance('abcdd', 'abbcd') = 2
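As a small sketch, the Hamming distance in its standard definition counts the positions where two equal-length strings disagree. For the pair on the slide, the strings differ at the third and fourth positions, so the distance is 2. The function name is illustrative.

```python
def hamming(a, b):
    """Count positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming("abcdd", "abbcd"))  # 2
```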
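  8. Jaro-Winkler
     • This distance is a formula of 5 parameters determined by the two compared strings (A, B, m, t, l) and a prefix weight p chosen from [0, 0.25].
     • A and B are the string lengths, m is the number of matching characters, t is the number of transpositions, and l is the length of the common prefix (usually capped at 4).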
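The following sketch implements the usual textbook form of Jaro-Winkler, in which matches are searched inside a window of half the longer string's length and the common prefix is capped at 4 characters. Those two conventions, the default `p=0.1` and the function names are assumptions; libraries differ in small details.

```python
def jaro_similarity(s1, s2):
    """Jaro similarity in [0, 1]; 1 means identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order
    seq1 = [s1[i] for i in range(len1) if matched1[i]]
    seq2 = [s2[j] for j in range(len2) if matched2[j]]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler_distance(s1, s2, p=0.1):
    """1 minus the Jaro-Winkler similarity; p should lie in [0, 0.25]."""
    sim = jaro_similarity(s1, s2)
    l = 0  # length of the common prefix, capped at 4
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return 1 - (sim + l * p * (1 - sim))
```

Because of the prefix bonus, two strings that agree at the start score as closer than the plain Jaro measure would suggest, which suits names that are usually typed correctly at the beginning.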
  9. Levenshtein
     • Minimal number of insertions, deletions and replacements needed for transforming string a into string b.
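A compact sketch of the Levenshtein distance using the classic dynamic-programming recurrence, keeping only one previous row in memory. The function name is illustrative.

```python
def levenshtein(a, b):
    """Minimal number of insertions, deletions and replacements to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # replace ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]

print(levenshtein("Ahmedabad", "Amdavad"))  # 3
```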
  10. N-gram / Q-gram
      • Sum of absolute differences between the N-gram vectors of both strings.
  11. Cosine
      • 1 minus the cosine similarity of both N-gram vectors.
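The q-gram and cosine distances from the two slides above can be sketched with `collections.Counter` as the N-gram vector. The default `q=2` and the function names are assumptions for illustration.

```python
import math
from collections import Counter

def qgrams(text, q=2):
    """Count every substring of length q (the N-gram vector)."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))

def qgram_distance(a, b, q=2):
    """Sum of absolute differences between the two q-gram vectors."""
    va, vb = qgrams(a, q), qgrams(b, q)
    return sum(abs(va[g] - vb[g]) for g in set(va) | set(vb))

def cosine_distance(a, b, q=2):
    """1 minus the cosine similarity of the two q-gram vectors."""
    va, vb = qgrams(a, q), qgrams(b, q)
    dot = sum(va[g] * vb[g] for g in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("both strings need at least q characters")
    return 1 - dot / (norm_a * norm_b)
```

Both measures ignore where in the string a q-gram occurs, so they tolerate reordered words better than edit distances do.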
  12. Soundex
      • SOUNDEX converts an alphanumeric string to a four-character code that is based on how the string sounds when spoken.
      • The letters A, E, I, O, U, H, W, and Y are ignored unless they are the first letter of the string.
      • Zeroes are added at the end if necessary to produce a four-character code.
      • SOUNDEX('Ahmedabad') = A531
      • SOUNDEX('Amdavad') = A531
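A sketch of the common American Soundex variant, which reproduces the two codes on the slide above. Database implementations of SOUNDEX may differ in edge cases (for example in how H and W are treated), so the exact rules here are an assumption.

```python
# Digit groups of American Soundex; vowels, H, W and Y have no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Return the four-character Soundex code of a name ('' if no letters)."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "HW":  # H and W do not separate repeated digits
            previous = digit
    return (code + "000")[:4]  # pad with zeroes, keep four characters

print(soundex("Ahmedabad"), soundex("Amdavad"))  # A531 A531
```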
  13. Steps to Data Cleaning
  14. Components of Address
      • Original record: Sujit Joshi; 88 Ashoka Appts; Juhu; Bombay; Tel: 6201670; Cell: 998054046; Email: Sujit.Joshi@gml.com
      • Problems in the original record:
        • Missing salutation
        • Abbreviated house name
        • Missing postcode
        • Old telephone number
        • Incorrect email id
      • Cleaned record: Mr. Sujit Joshi; 88 Ashoka Apartments; Gandhigram Road; Juhu; Mumbai – 400 049; India; Tel: (22) 26201670; Cell: 998054046X; Email: Sujit.Joshi@gmail.com
      • Corrections applied:
        • Salutation added
        • House name standardised
        • Postcode and country added
        • Telephone number corrected for known changes (add 2 to 7-digit numbers; include the STD code for the city)
        • Cell number tagged as being of invalid format
        • Email id typo corrected
  15. Steps in Data Cleansing
      • Parsing
      • Correcting
      • Standardizing
      • Matching
      • Consolidating
  16. Parsing
      • Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
  17. Parsing
      • Input data from source file: Beth Christine Parker, SLS MGR Regional Port Authority Federal Building 12800 Lake Calumet Hedgewisch, IL
      • Parsed data in target file: First Name: Beth; Middle Name: Christine; Last Name: Parker; Title: SLS MGR; Firm: Regional Port Authority; Location: Federal Building; Number: 12800; Street: Lake Calumet; City: Hedgewisch; State: IL
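To make the parsing step concrete, here is a toy parser for the record on the slide above. It assumes the source arrives as exactly five lines (name and title, firm, location, street address, city and state). That layout and the function name are assumptions; real parsers rely on far richer rules and reference data.

```python
def parse_contact(block):
    """Split a five-line contact block into named fields (toy rules only)."""
    lines = [line.strip() for line in block.strip().splitlines() if line.strip()]
    if len(lines) != 5:
        raise ValueError("expected exactly five non-empty lines")
    name_part, _, title = lines[0].partition(",")
    names = name_part.split()
    number, _, street = lines[3].partition(" ")
    city, _, state = lines[4].partition(",")
    return {
        "First Name": names[0],
        "Middle Name": " ".join(names[1:-1]),
        "Last Name": names[-1],
        "Title": title.strip(),
        "Firm": lines[1],
        "Location": lines[2],
        "Number": number,
        "Street": street.strip(),
        "City": city.strip(),
        "State": state.strip(),
    }

SOURCE = """Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL"""

for field, value in parse_contact(SOURCE).items():
    print(f"{field}: {value}")
```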
  18. Correcting
      • Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.
  19. Correcting
      • Parsed data: First Name: Beth; Middle Name: Christine; Last Name: Parker; Title: SLS MGR; Firm: Regional Port Authority; Location: Federal Building; Number: 12800; Street: Lake Calumet; City: Hedgewisch; State: IL
      • Corrected data: First Name: Beth; Middle Name: Christine; Last Name: Parker; Title: SLS MGR; Firm: Regional Port Authority; Location: Federal Building; Number: 12800; Street: South Butler Drive; City: Chicago; State: IL; Zip: 60633; Zip+Four: 2398
  20. Standardizing
      • Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
  21. Standardizing
      • Corrected data: First Name: Beth; Middle Name: Christine; Last Name: Parker; Title: SLS MGR; Firm: Regional Port Authority; Location: Federal Building; Number: 12800; Street: South Butler Drive; City: Chicago; State: IL; Zip: 60633; Zip+Four: 2398
      • Standardized data: Pre-name: Ms.; First Name: Beth; 1st Name Match Standards: Elizabeth, Bethany, Bethel; Middle Name: Christine; Last Name: Parker; Title: Sales Mgr.; Firm: Regional Port Authority; Location: Federal Building; Number: 12800; Street: S. Butler Dr.; City: Chicago; State: IL; Zip: 60633; Zip+Four: 2398
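Standardizing can be sketched as a set of lookup tables applied field by field. The tables below contain only the rules needed for the record on the slide above; their contents and the function names are assumptions, and a production rule set would be much larger.

```python
# Hypothetical business rules, just enough for the example record.
STREET_ABBREVIATIONS = {"south": "S.", "north": "N.", "drive": "Dr.", "street": "St."}
TITLE_STANDARDS = {"SLS MGR": "Sales Mgr."}
NAME_MATCH_STANDARDS = {"Beth": ["Elizabeth", "Bethany", "Bethel"]}

def standardize_street(street):
    """Replace known street words with their preferred abbreviations."""
    return " ".join(STREET_ABBREVIATIONS.get(word.lower(), word)
                    for word in street.split())

def standardize_record(record):
    """Return a copy of the record with title, street and name standards applied."""
    out = dict(record)
    title = record.get("Title", "")
    out["Title"] = TITLE_STANDARDS.get(title, title)
    out["Street"] = standardize_street(record.get("Street", ""))
    out["1st Name Match Standards"] = NAME_MATCH_STANDARDS.get(
        record.get("First Name", ""), [])
    return out

corrected = {"First Name": "Beth", "Title": "SLS MGR", "Street": "South Butler Drive"}
print(standardize_record(corrected))
```

Keeping the rules in data rather than in code makes it easy to add custom business rules without touching the routine itself.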
  22. Matching
      • Searching and matching records within and across the parsed, corrected and standardized data, based on predefined business rules, to eliminate duplications.
  23. Match Patterns
      • Table columns: Business Name; Street; Branch Type; Customer #/Tax ID; City; Vendor Code; Pattern; Pattern I.D.
      • Match levels used in the table cells: Exact, VClose, Close, Blanks
      • Patterns (as extracted): AAAAAA ABAAA-ABA- AA ABCCAA BBACAA
      • Pattern I.D.s: P110, P115, P120, S300, S310
  24. Matching
      • Corrected Data (Data Source #1): Pre-name: Ms.; First Name: Beth; 1st Name Match Standards: Elizabeth, Bethany, Bethel; Middle Name: Christine; Last Name: Parker; Title: Sales Mgr.; Firm: Regional Port Authority; Location: Federal Building; Number: 12800; Street: S. Butler Dr.; City: Chicago; State: IL; Zip: 60633; Zip+Four: 2398
      • Corrected Data (Data Source #2): Pre-name: Ms.; First Name: Elizabeth; 1st Name Match Standards: Beth, Bethany, Bethel; Middle Name: Christine; Last Name: Parker-Lewis; Title: (blank); Firm: Regional Port Authority; Location: Federal Building; Number: 12800; Street: S. Butler Dr., Suite 2; City: Chicago; State: IL; Zip: 60633; Zip+Four: 2398; Phone: 708-555-1234; Fax: 708-555-5678
  25. Consolidating
      • Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
  26. Consolidating
      • Inputs: Corrected Data (Data Source #1) and Corrected Data (Data Source #2)
      • Consolidated Data: Name: Ms. Beth (Elizabeth) Christine Parker-Lewis; Title: Sales Mgr.; Firm: Regional Port Authority; Location: Federal Building; Address: 12800 S. Butler Dr., Suite 2, Chicago, IL 60633-2398; Phone: 708-555-1234; Fax: 708-555-5678
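One simple way to consolidate matched records is a survivorship rule applied per field. The sketch below keeps the longest non-empty value for each field and prefers the earlier source on ties. That rule is an assumption for illustration and is not the rule used on the slide, which keeps both first names as "Beth (Elizabeth)".

```python
def consolidate(*records):
    """Merge matched records, keeping the longest non-empty value per field."""
    merged = {}
    for record in records:
        for field, value in record.items():
            value = (value or "").strip()
            best = merged.setdefault(field, "")
            if len(value) > len(best):  # ties keep the earlier source
                merged[field] = value
    return merged

source_1 = {"First Name": "Beth", "Last Name": "Parker", "Title": "Sales Mgr.",
            "Street": "S. Butler Dr."}
source_2 = {"First Name": "Elizabeth", "Last Name": "Parker-Lewis", "Title": "",
            "Street": "S. Butler Dr., Suite 2", "Phone": "708-555-1234"}

print(consolidate(source_1, source_2))
```

Other survivorship rules, such as "most recent wins" or "most trusted source wins", fit the same structure by changing the comparison.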
  27. Sometimes such algorithms don't work!
      • So we do manual cleaning
  28. Example of Manual Cleaning (Car Name -> Correct Name)
      • Waganer -> Wagon R
      • Sujhuki -> Suzuki
      • Benj -> Mercedes Benz
      • Faurtuner -> Fortuner
      • Scopeio -> Scorpio
      • Sevrole -> Chevrolet
      • Furrarree -> Ferrari
      • Landcrusher -> Land Cruiser
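Manual cleaning usually ends up as a hand-maintained lookup table. The sketch below stores the corrections from the slide above in a dictionary and flags any value that is not yet in the table for human review. The function name and the review flag are assumptions.

```python
# Hand-curated corrections, taken from the slide; keys are lower-cased.
MANUAL_CORRECTIONS = {
    "waganer": "Wagon R",
    "sujhuki": "Suzuki",
    "benj": "Mercedes Benz",
    "faurtuner": "Fortuner",
    "scopeio": "Scorpio",
    "sevrole": "Chevrolet",
    "furrarree": "Ferrari",
    "landcrusher": "Land Cruiser",
}

def correct_car_name(raw):
    """Return (name, needs_review); unknown values are passed through and flagged."""
    key = " ".join(raw.split()).lower()
    if key in MANUAL_CORRECTIONS:
        return MANUAL_CORRECTIONS[key], False
    return raw.strip(), True

print(correct_car_name("Sujhuki"))   # ('Suzuki', False)
print(correct_car_name("Inova"))     # ('Inova', True)
```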
  29. Car Cleaning Approach
  30. Other data that we have cleaned
      • Occupation
      • Marital status
      • Gender
      • And many other fields …
  31. Data Capture Tips
  32. Top Ten Data Capture Tips
      • Every contact is a data capture opportunity
      • Make it easy for the end user to give you information
      • Incentivise your end user to part with their details
      • Collect data in line with privacy regulations
      • Decide what data you need and prioritise
  33. Top Ten Data Capture Tips (continued)
      • Don't ask for everything at once; build it up over time
      • Set targets for breadth, depth and quality
      • Collect data in a standardized format
      • Streamline the data from point of capture to storage
      • If you can't collect it, buy it!
  34. End of Part 1
  35. Understanding India … Census 2011
  36. Census in India
      • The first census in India in modern times was conducted in 1872. A population census has been carried out every 10 years.
      • The census is carried out by the Office of the Registrar General and Census Commissioner of India, Delhi, an office in the Ministry of Home Affairs, Government of India, under the 1948 Census of India Act.
  37. Census
      • The 15th Indian national census was conducted in two phases:
        • House listing
        • Population enumeration
      • The census covered:
        • 640 districts
        • 5767 tehsils
        • 7742 towns
        • More than 6 lakh villages
      • 2.7 million officials visited households in 7,742 towns and 6,40,867 villages, classifying the population into different segments.
  38. Population Comparison (chart labels as extracted: 2021, 2011, 2001)
      • The population of India increased by more than 181 million during the decade 2001-2011. This addition is slightly lower than the population of Brazil, the fifth most populous country in the world.
  39. India as compared to the world
      • The gap between India, the country with the second-largest population in the world, and China, the country with the largest, narrowed from 238 million in 2001 to nearly 131 million in 2011.
      • On the other hand, the gap between India and the United States of America, which has the third-largest population, has widened to about 902 million from 741 million in 2001.
  40. State-wise population, 2001
  41. State-wise population, 2011
  42. Census report of 2011
  43. Sex Ratio
      • The sex ratio of India is 940 females per 1000 males. At the national level it has risen by seven points since the last census in 2001. This is the highest since 1971.
  44. Sex Ratio Trend in India
      • The sex ratio in India has historically been negative, in other words unfavourable to females. It reached its lowest point in 1991 but has kept rising since then.
  45. State-wise Sex Ratios
  46. Census Facts 2011
      • Thane district of Maharashtra is the most populous district of India.
      • Dibang Valley of Arunachal Pradesh is the least populous.
      • Kurung Kumey of Arunachal Pradesh registered the highest population growth rate, 111.01 percent.
      • Longleng district of Nagaland registered a negative population growth rate of -58.39 percent.
      • Mahe district of Puducherry has the highest sex ratio, 1176 females per 1000 males.
      • Daman district has the lowest sex ratio, 533 females per 1000 males.
      • Serchhip district of Mizoram has the highest literacy rate, 98.76 percent.
      • Alirajpur of MP is the least literate district of India, at only 37.22 percent.
      • North East Delhi has the highest density, 37346 persons per sq. km.
      • Dibang Valley has the lowest density, 1 person per sq. km.
  47. States having highest population
      • Uttar Pradesh (19.96 crore): increased by 20% from 2001.
      • Maharashtra (11.24 crore): increased by 15% since the last census.
      • Delhi is the most densely populated, with a density of 11297 per sq. km (an increase of 21% from 2001).
      • Bihar is the most densely populated state, with a density of 1102 per sq. km (an increase of 25% from 2001).
  48. States with highest literacy
  49. Interesting Facts
  50. Interesting facts: Telecom
      • "More phones than toilets": Census 2011 sheds light on changing India.
      • 63.2 per cent of households in India now have a telephone/mobile facility (82 per cent in urban and 54 per cent in rural areas).
      • The penetration of mobile phones is 59 per cent and of landlines 10 per cent.
      • More than half of Indian households (some 53.1 per cent) do not have access to something as basic as a toilet.
  51. Facts: Communication
      • The penetration of computers and laptops in India is only 9.4 per cent, or less than one out of 10 households, with only 3 per cent having an internet facility.
      • The penetration of internet is 8 per cent in urban areas, compared to less than 1 per cent in rural areas.
      • Maharashtra is the biggest Indian internet market, with 18%.
      • 47.2% of Indians own a television.
      • 19.9% of Indians own a radio/transistor.
      • 13.42 million broadband connections (homes and offices combined).
  52. Facts: Literacy and Population
      • Uttar Pradesh is the most populous state, and the combined population of Uttar Pradesh and Maharashtra is more than that of the USA.
      • Ten states and union territories have attained a literacy rate of above 85 per cent.
      • According to the Census report, India's population is now bigger than the combined population of the USA, Indonesia, Brazil, Pakistan and Bangladesh.
      • 74% of Indians can now read, write and do basic maths (like adding and subtracting); that means 3 out of every 4 Indians are literate.
  53. Facts: General
      • Females outnumber males in Goa.
      • Population:
        • 50% <= 25 yrs of age
        • 65% <= 35 yrs of age
      • It is anticipated that the median age of an Indian citizen will be 29 years in 2020, in comparison to 48 for Japan and 37 for China.
      • India covers 2.4% of the land territory of the world and represents more than 17.5% of the population of the world.
  54. Facts: General
      • Total expenditure and materials used:
        • Cost: Rs. 2200 crore
        • Cost per person: Rs. 18.33
        • No. of census functionaries: 2.7 million
        • No. of languages in which schedules were canvassed: 16
        • No. of languages in which training manuals were prepared: 18
        • No. of schedules printed: 340 million
        • No. of training manuals printed: 5.4 million
        • Paper utilised: 12,000 MTs
        • Material moved: 10,500 MTs
  55. What do we do with Census?
      • Census is more than population, literacy and sex ratio.
      • Census can provide insights about various dimensions.
      • The data is available in the xls format.
      • The data is available free of cost.
      • The data is clean.
      • It has a proper database architecture with codes in place.
  56. Two stages of Census
      • Houselisting
      • Population Enumeration
  57. Houselisting questionnaire
  58. Population enumeration
  59. References
      • http://www.census2011.co.in/
      • http://articles.timesofindia.indiatimes.com/2011-03-31/
      • http://censusindia.gov.in/
      • http://en.wikipedia.org/wiki/2011_census_of_India
      • http://www.mapsofindia.com/census2011/
  60. THANK YOU

Editor's notes

  • http://www.computerweekly.com/feature/How-clean-is-your-data
  • https://support.office.com/en-ie/article/Top-ten-ways-to-clean-your-data-2844b620-677c-47a7-ac3e-c2e157d1db19
  • http://stackoverflow.com/questions/6683380/techniques-for-finding-near-duplicate-records
    http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
  • http://www.adma.com.au/connect/articles/top-ten-data-capture-tips/
  • Population in 2001
  • Population in 2011: the populations of Maharashtra, UP and Bihar increase the most. Other states are also on an upward trend. Nagaland's population reduces. Arrows show the growth rate.
