Acquiring, cleaning, and
storing data
Digital Humanities, Lesson Two
Josef Šlerka, New Media Studies, 15 October 2012
ETL (light version)
Extracting data from outside sources
Transforming it to fit operational needs (which can
include quality levels)
Loading it into the end target (database, more
specifically, operational data store, data mart or data
warehouse)
(see Wikipedia)
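
The three steps can be sketched in a few lines of Python. This is only a minimal illustration, assuming a small inline CSV string as the "outside source" and an in-memory SQLite database as the end target; the names RAW_CSV, extract, transform, load and the people table are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw data standing in for an "outside source".
RAW_CSV = """name,year
 Karel Čapek ,1890
Božena Němcová,1820
,1900
"""

def extract(text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace, convert types, drop rows without a name."""
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        if not name:
            continue  # a simple quality rule: a record must have a name
        cleaned.append((name, int(row["year"])))
    return cleaned

def load(records, connection):
    """Load: write the cleaned records into the target table."""
    connection.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, year INTEGER)")
    connection.executemany("INSERT INTO people VALUES (?, ?)", records)
    connection.commit()

connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
stored = connection.execute("SELECT name, year FROM people ORDER BY year").fetchall()
print(stored)
```

Keeping each step in its own function makes it easy to swap the source or the target later without touching the cleaning rules.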
Real life, according to Wikipedia
1. Cycle initiation
2. Build reference data
3. Extract (from sources)
4. Validate
5. Transform (clean, apply business rules, check for
data integrity, create aggregates or disaggregates)
6. Stage (load into staging tables, if used)
Real life, according to Wikipedia (continued)

7. Audit reports (for example, on compliance with
business rules. Also, in case of failure, helps to
diagnose/repair)
8. Publish (to target tables)
9. Archive
10. Clean up
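
A few of the steps above (validate, transform, stage, audit, publish) might look like this in a toy Python pipeline. It is a sketch under stated assumptions: the records, the field names id and year, and the two validation rules are all made up for illustration.

```python
from collections import Counter

# Hypothetical extracted records; the field names are illustrative only.
records = [
    {"id": "1", "year": "1890"},
    {"id": "2", "year": "unknown"},
    {"id": "", "year": "1820"},
]

def validate(record):
    """Validate: return the list of rule violations (empty if the record is fine)."""
    problems = []
    if not record["id"]:
        problems.append("missing id")
    if not record["year"].isdigit():
        problems.append("year is not a number")
    return problems

def transform(record):
    """Transform: convert the text fields to integers."""
    return {"id": int(record["id"]), "year": int(record["year"])}

staged = []        # Stage: transformed records waiting to be published
audit = Counter()  # Audit report: how often each rule was violated

for record in records:
    problems = validate(record)
    if problems:
        audit.update(problems)
    else:
        staged.append(transform(record))

published = list(staged)  # Publish: copy the staged records to the target
print(published)
print(dict(audit))
```

The audit counter is what helps diagnose a failed run: it shows which rule rejected how many records.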
Extracting
what will come in handy...
Extract
structured vs. unstructured data
for DH most often: databases vs. the web
web API vs. scraping
you can get by with only a little knowledge
static vs. real-time data can be tricky, but it can be
handled
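
The difference between a web API and scraping can be shown without any network access. The sketch below assumes a hypothetical JSON response and a hypothetical HTML page carrying the same two titles; the class name TitleScraper and the markup are invented for the example. It uses only the Python standard library.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response: the structure is already explicit.
api_response = '{"posts": [{"title": "First post"}, {"title": "Second post"}]}'
titles_from_api = [post["title"] for post in json.loads(api_response)["posts"]]

# Hypothetical web page with the same information buried in markup.
page = (
    "<html><body>"
    "<h2 class='title'>First post</h2><p>text</p>"
    "<h2 class='title'>Second post</h2>"
    "</body></html>"
)

class TitleScraper(HTMLParser):
    """Collect the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

scraper = TitleScraper()
scraper.feed(page)
titles_from_scraping = scraper.titles

print(titles_from_api)
print(titles_from_scraping)
```

Both routes end with the same list, but the scraper depends on the page layout and breaks when the markup changes, which is why an API is usually preferable when one exists.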
XPATH

XPath, the XML Path Language, is a query language
for selecting nodes from an XML document. In
addition, XPath may be used to compute values (e.g.,
strings, numbers, or Boolean values) from the content
of an XML document. XPath was defined by the World
Wide Web Consortium (W3C).
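
A short Python example of selecting nodes with XPath. It assumes a made-up XML document and uses xml.etree.ElementTree from the standard library, which supports only a limited subset of XPath; functions such as count() are not available there, so the value is computed in Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document used only for this example.
xml_text = """
<library>
  <book lang="cs"><title>R.U.R.</title><year>1920</year></book>
  <book lang="en"><title>Ulysses</title><year>1922</year></book>
  <book lang="cs"><title>Babička</title><year>1855</year></book>
</library>
""".strip()

root = ET.fromstring(xml_text)

# Select nodes: every <title> inside a <book> whose lang attribute is "cs".
czech_titles = [node.text for node in root.findall("./book[@lang='cs']/title")]

# ElementTree cannot evaluate XPath functions, so the count is taken
# in Python from the nodes that the path expression selected.
book_count = len(root.findall("./book"))

print(czech_titles)
print(book_count)
```

The same path expressions work in spreadsheet import functions and scraping tools that accept XPath, so it pays to learn the basic syntax once.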
Simple tools
Google Docs (mainly static data)
http://drive.google.com
YQL (mainly static data)
http://developer.yahoo.com/yql/console/
Yahoo Pipes (mainly dynamic data)
http://pipes.yahoo.com/pipes/
IFTTT (mainly dynamic data)
https://ifttt.com/
But powerful...
Twitter Archiving Google Spreadsheet TAGS v3
http://mashe.hawksey.info/2012/01/twitter-archive-tagsv3/
Transforming
Mainly about cleaning and unifying data...
Google Refine
http://code.google.com/p/google-refine/downloads/list?can=1
Google Refine is a standalone desktop application
provided by Google for data cleanup and
transformation to other formats. It is similar to
spreadsheet applications (and can work with
spreadsheet file formats), but it behaves more like a
database.
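
What "cleaning and unifying" means in practice can be sketched in Python. The example groups spelling variants of the same name under a normalised key, loosely in the spirit of the key-based clustering that cleanup tools offer. It is not Google Refine's actual implementation; the function fingerprint and the sample names are assumptions made for illustration.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Build a normalised key: lowercase, no accents or punctuation, sorted unique tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))

# Hypothetical messy column with three spellings of one person.
names = ["Čapek, Karel", "Karel Čapek", "karel capek ", "Božena Němcová"]

clusters = defaultdict(list)
for name in names:
    clusters[fingerprint(name)].append(name)

for key, variants in clusters.items():
    print(key, "->", variants)
```

Each cluster with more than one variant is a candidate for merging into a single canonical spelling; the decision which spelling to keep stays with the researcher.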
Loading
where to put the data, if not into a traditional database...
Google Fusion Tables
a simple solution for ordinary users
http://www.google.com/fusiontables/Home/
Web service provided by Google for data
management. Data is stored in multiple tables that
Internet users can view and download. The Web
service provides means for visualizing data with pie
charts, bar charts, line plots, scatter plots, timelines as
well as geographical maps. Data is exported in a
comma-separated values file format.
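
Since comma-separated values are the common exchange format here, a cleaned table can be prepared for upload with Python's csv module. A minimal sketch, assuming hypothetical rows with place names and coordinates; it writes to an in-memory buffer rather than a file and then reads the text back to check the round trip.

```python
import csv
import io

# Hypothetical cleaned rows ready to be loaded somewhere else.
rows = [
    {"place": "Praha", "lat": 50.0755, "lon": 14.4378},
    {"place": "Brno", "lat": 49.1951, "lon": 16.6068},
]

# Write the rows as CSV text with a header line.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["place", "lat", "lon"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back; note that every value returns as a string.
read_back = list(csv.DictReader(io.StringIO(csv_text)))
print(read_back[0]["place"], read_back[0]["lat"])
```

CSV carries no type information, so numbers and dates have to be converted again after every import.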
And now one more
demo...