Munging Solo: the Joy of Small Data

This document discusses how small data can be effectively analyzed using simple command line tools and scripting languages like Ruby. The key points made are: 1) Most data is small in size and does not require large Hadoop clusters for processing - command line tools and scripting are often much faster for small data workloads. 2) The Unix shell is a powerful programming environment that allows stringing together simple commands into powerful pipelines for analyzing and transforming data in flexible ways. 3) Ruby is a great fit for scripting small data tasks and integrating with Unix tools due to its clean syntax, large standard library, and ability to be used for one-liners or full scripts.

Mini Munging: 
the Joy of Small Data
Rob Miller
https://robm.me.uk/ | @robmil

mung (mʌŋ)
verb
To transform data from
one form into another,
unrecognisable one.

My Toolkit
• Hadoop
• Elasticsearch
• Cassandra

The Small
Data Toolkit
• The command line
• Pipelines
• Ruby!

$ head -1 log.csv 
fred@example.com,login,2015-07-20 13:10:11 
$ cat log.csv | cut -d, -f1 | sort | uniq -c
25 fred@example.com 
107 bob@example.net

$ cat log.csv | grep '^bob@example.net,' | 
cut -d, -f3 | cut -d' ' -f1 | sort | uniq -c
61 2015-01-20 
42 2015-06-18 
4 2015-07-20
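
Why the sort before uniq -c? uniq only collapses adjacent duplicate lines, so unsorted input gives one count per run rather than one per address. A minimal Ruby sketch of those two stages, assuming made-up sample rows in the same shape as log.csv; first_field and uniq_c are hypothetical helper names for illustration, not part of any library:

```ruby
# Made-up sample rows in the same shape as log.csv (an assumption for illustration).
LOG = <<~CSV
  fred@example.com,login,2015-07-20 13:10:11
  bob@example.net,login,2015-01-20 09:00:00
  fred@example.com,logout,2015-07-20 13:15:42
  bob@example.net,login,2015-06-18 17:30:05
CSV

# Stands in for `cut -d, -f1`: keep the first comma-separated field of each line.
def first_field(lines)
  lines.map { |line| line.chomp.split(",").first }
end

# Stands in for `uniq -c`: collapse runs of *adjacent* equal lines, counting each run.
def uniq_c(lines)
  lines.chunk_while { |a, b| a == b }.map { |run| [run.size, run.first] }
end

emails = first_field(LOG.lines)

# Without sorting, every address change starts a new run: four runs of 1.
uniq_c(emails).each { |count, email| puts "#{count} #{email}" }

puts "--"

# With sorting, equal addresses are adjacent, so each is counted once.
uniq_c(emails.sort).each { |count, email| puts "#{count} #{email}" }
```

The second listing is the one the pipeline produces: one line per address with its total.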

Free functionality, 
free parallelism,
composable & modular

“For the same amount of data 
I was able to use my laptop 
to get the results in about 
12 seconds (270MB/sec), while 
the Hadoop cluster took about 
26 minutes (1.14MB/sec)”
Adam Drake, “Command-line tools can be 235x faster 
than your Hadoop cluster”, http://bit.ly/1sS01aP

$ ruby -e
$ ruby -ne
$ ruby -pe
$ ruby -F -ane
$ ruby -r
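
Roughly what those switches do, sketched as plain Ruby. This is an approximation for illustration, not how the interpreter implements them; like_n, like_p and like_a are hypothetical names. In a real one-liner the current line lives in $_ and the split fields in $F, while -e supplies the code to run and -r requires a library first.

```ruby
require "stringio"

# -n: run the block once per input line; nothing is printed automatically.
def like_n(io)
  io.each_line { |line| yield line }
end

# -p: like -n, but print each line after the block has run,
# so in-place changes to the line show up in the output.
def like_p(io, out = $stdout)
  io.each_line do |line|
    yield line
    out.print line
  end
end

# -a (with -F setting the separator): like -n, but hand the block the
# line split into fields. Like the real switch, this does not chomp,
# so the last field keeps its trailing newline.
def like_a(io, sep)
  io.each_line { |line| yield line.split(sep) }
end

# Made-up input for illustration.
log = "fred@example.com,login\nbob@example.net,logout\n"

# Comparable to: ruby -ne 'puts $_ if $_ =~ /bob/'
like_n(StringIO.new(log)) { |line| puts line if line =~ /bob/ }

# Comparable to: ruby -pe '$_.upcase!'
like_p(StringIO.new(log)) { |line| line.upcase! }

# Comparable to: ruby -F, -ane 'puts $F[0]'
like_a(StringIO.new(log), ",") { |fields| puts fields[0] }
```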

$ head -1 log.csv 
fred@example.com,login,2015-07-20 13:10:11 
$ cat log.csv | cut -d, -f1 | cut -d@ -f2 |  
ruby -rresolv -ne 'puts Resolv.getaddress(chomp)' | 
sort | uniq -c | sort -rn
24 10.0.42.1 
3 10.27.100.8

Start with coreutils, 
then throw in Ruby

$ ruby -rcsv -rbitly -e 'b = Bitly.new("user", "foo"); CSV.filter { |r| r.each { |f| f.replace b.shorten(f).short_url if f =~ /^https?:/ } }' urls.csv > urls-shortened.csv

% nokogiri -e 'puts @doc.css("img").map { |i|
"https:" + i["src"] }' https://en.wikipedia.org/wiki/Unix |
xargs -n1 -P4 wget

Your shell is a
programming 
environment

The most useful 
bits of coreutils
cat
grep
head
tail
split
wc
sort
shuf
uniq
comm
cut
paste
join
tr
column

Text Processing 
with Ruby
• Published by 
Pragmatic Bookshelf
• Currently in beta
• https://pragprog.com/book/rmtpruby/text-processing-with-ruby

Munging Solo: the Joy of Small Data

1. Mini Munging: the Joy of Small Data Rob Miller https://robm.me.uk/ | @robmil

2. Who are you?

4. What do you do?

5. I mung data

6. mung (mʌŋ) verb To transform data from one form into another, unrecognisable one.

7. My Toolkit • Hadoop • Elasticsearch • Cassandra

9. My data is small

10. Your data is (probably) small too

11. The Small Data Toolkit • The command line • Pipelines • Ruby!

12. The shell is a programming environment

13. Text is a universal interface

14. Pipelines are incredibly powerful

15. $ head -1 log.csv fred@example.com,login,2015-07-20 13:10:11 $ cat log.csv | cut -d, -f1 | sort | uniq -c 25 fred@example.com 107 bob@example.net

17. Free functionality, free parallelism, composable & modular

18. “For the same amount of data I was able to use my laptop to get the results in about 12 seconds (270MB/sec), while the Hadoop cluster took about 26 minutes (1.14MB/sec)” Adam Drake, “Command-line tools can be 235x faster than your Hadoop cluster”, http://bit.ly/1sS01aP

19. And Ruby fits here too!

20. $ ruby -e $ ruby -ne $ ruby -pe $ ruby -F -ane $ ruby -r

22. Start with coreutils, then throw in Ruby

23. $ ruby -rcsv -rbitly -e 'b = Bitly.new("user", "foo"); CSV.filter { |r| r.each { |f| f.replace b.shorten(f).short_url if f =~ /^https?:/ } }' urls.csv > urls-shortened.csv

24. % nokogiri -e 'puts @doc.css("img").map { |i| "https:" + i["src"] }' https://en.wikipedia.org/wiki/Unix | xargs -n1 -P4 wget

25. Your shell is a programming environment

26. Ruby fits into this world perfectly

27. Most data is small

28. Go and mung stuff!

29. The most useful bits of coreutils cat grep head tail split wc sort shuf uniq comm cut paste join tr column

30. Text Processing with Ruby • Published by Pragmatic Bookshelf • Currently in beta • https://pragprog.com/book/rmtpruby/text-processing-with-ruby
