SlideShare a Scribd company logo
1 of 66
Download to read offline
Programming for Evolutionary Biology
            April 3rd – 19th 2013
             Leipzig, Germany




  Introduction to Unix systems
Part 3: grep, and Unix philosophy
          Giovanni Marco Dall'Olio
          Universitat Pompeu Fabra
              Barcelona (Spain)
Schedule
   9.30 – 11.00: “What is Unix?” 
   11.30 – 12.30: Introducing the terminal
   14:30 – 16:30: grep & Unix philosophy
   17:00 – 18:00: awk, make, & question time
Grep: search files for a
            pattern
   The command grep allows to search for patterns 
    in a file
   Basic usage:
       ●   grep “searchword” filename
   Example:
       ●   grep “ATG” myfasta.fa
Let's go to the exercises
          folder
   Remember:
      ●   cd (to go back to the home folder)
      ●   cd /homes/evopserver/evopadmin/unix_intro
      ●   cd exercises
      ●   ls
Let's look at the fastq
           examples
   Go to the folder “unix_intro/exercises/fastq”
       ●   This time is fastq, not fasta!
   Fastq is a format to represent sequences and 
    quality scores
   To see what is in the files:
          less sample_1000genomes_fastq.fastq 
Let's look at the fastq
           examples
   Go to the folder “unix_intro/exercises/fastq”
       ●   This time is fastq, not fasta!
   Fastq is a format to represent sequences and 
    quality scores
   To see what is in the files:
          less sample_1000genomes_fastq.fastq 
   Tip: try the less ­S option
          less ­S sample_1000genomes_fastq.fastq 
The Fastq format

           1st line:
      Sequence name
       and description

@SRR062634.321 HWI-EAS110_103327062:6:1:1446:951/2
TGATCATTTGATTAATACTGACATGTAGACAA         2nd line: Sequence
+
B5=BD5DAD?:CBDD-DDDDDCDDB+-B:;?A?C        4th line: Quality
                                                  Scores
Let's “grep”
   Let's use grep to see if any of the sequences in the 
    fastq folder contains the word “ACTG”
       ●   grep “ACTG” *
●   Note that grep is case sensitive. “ACTG” is 
    different than “actg”
“grep ACTG” *
Grep options
   Let's have a look at the man page for grep
       ●   man grep
The grep man page
   screenshot
Case sensitivity
   Most command line tools are case sensitives
   This means that “ACTG” and “actg” are not 
    considered as the same sequence
   We can use the ­i option to ignore the case:
       ●   grep ­i “actg” *
grep -B
   We have used grep to get all the sequences that 
    have “ACTG” in our files
   However, the name of the sequences are stored in 
    the line above the sequence
   We can use the ­B option to print the line above 
    each match
Retrieve all the sequences
   that do <not> match
   The grep ­v option prints only the lines that do 
    <not> match the query
   grep ­v “ACTG”  → returns the exact opposite 
    results than the previous query
Short exercise
   Which parameter can be used to count the number 
    of matching lines, instead of printing them? (hint: 
    see the man page)
   Which parameter can be used to print the lines 
    below? (the opposite of the ­B option)
   How many sequences match the pattern 
    “AAATTTC” in the sample vcf file?
Hint: saving results to a file
    It may be useful to save the results of a grep 
     search to a file
    To do this, you can use the “>” operator
    Example:
        ●   grep ACGTG sample_vcf.vcf > results.txt
Regular Expressions
   Regular Expressions are a way to specify a string 
    that matches different set of characters
   They will be explained better in the Perl course
   For now, let's see some
Simple grep regular
         expressions
   The symbol “.” matches any character
   Example:
       ●   grep 'AAA..GGG' matches all the sequences that 
           start with a “AAA”, have any two other characters, 
           and end with a “GGG”
Simple grep regular
          expressions
   We can use the brackets to specify a set of 
    characters
   For example, “[ACG]” will match any A, C or G
   Let's try it:
       ●   grep AAA[ACG].GGG sample_1000genomes_fastq.fastq
Exercises
   Search the following sequences in the file 
    sample_1000genomes_fastq.fastq :
       ●   ACTGAAT 
       ●   ACC followed by any 3 characters, then CC
       ●   AAA followed by any C or G
More grep exercises
   grep is an important command, so let's see other 
    examples
Let's look at genbank files
   Genbank is a format for storing annotation on 
    sequences, used by the Genbank database
   You should be in the fastq folder. To go to the 
    genbank folder:
       ●   cd ..
       ●   ls
       ●   cd genbank
The genbank format
   Seems a nice format for grepping!
Grepping genbank
   Let's see how many sequences there are in the 
    files contained in the examples folder:
       ●   grep 'LOCUS' *
How many PLoS articles are
      in the files?
    grep 'PLoS' *
Exercises
   Let's check together:
       ●   How many articles by “Powers” are cited?
       ●   How many CDS sequences are in the files
Other commands: cut and
         sort
Another command: “cut”
   cut allows to extract columns from a tab/comma 
    separated file
   We will play with it on some vcf files
Go to the vcf folder
   You should be in the vcf folder. To go to the 
    genbank folder:
       ●   cd ..
       ●   ls
       ●   cd vcf
The VCF format
●VCF is a tab-separated format used to store
SNP genotypes
“cut” on a VCF file
   Basic cut usage:
       ●   cut ­f <column_number> <filename>
   The sample VCF file in the exercise folder has a 
    lot of columns
   Let's use cut to extract only the second column of 
    the file: 
       ●   cut ­f 2 sample_vcf.vcf
Specifying ranges in cut
   You can also specify ranges of columns
   For example, to print the first columns:
       ●   cut ­f 1­5 sample_vcf.vcf 
Let's get the genotype of
         HG00099
   Let's say that we want to extract the genotype of 
    the individual HG00099
   HG00099 is on the 12th column of the vcf file
       ●   cut ­f 12 sample_vcf.vcf
Let's look at other “cut”
         options
“cut -c”
   Use cut ­c to define a range of characters instead 
    of a range of columns
   This is useful to deal with files that are not 
    comma­separated
Exercise
   Which parameter can be used to define a different 
    field separator? (for example: a comma­separated 
    file instead of a tab­separated file)
The “sort” command
   Let's look at another command: sort
   The sort command can be used to sort a csv file 
    by a column
Let's sort
   Basic sort usage:
       ●   sort ­k <column_number> filename
   Let's sort the sample vcf file by chromosome 
    position:
       ●   sort ­k 2 sample_vcf.vcf
“sort -k 2 sample_vcf.vcf”
   Note: when sorting numerical columns, the “­n” 
    option should be used
Let's sort by genotype state
         of HG00099
    sort ­k 12 sample_vcf.vcf
Unix philosophy & Piping
The history of Unix systems
    Let's go back in time to when the original Unix 
     system was developed:
        ●   Graphical interfaces didn't exist
        ●   No facebook or gmail
        ●   Computers were used mainly by programmers, and 
            for data analysis / document formatting
The history of Unix systems
             (2)
    When Unix was designed, it was used mostly for 
     data analysis and document formatting
    Unix was important for the diffusion of the first 
     best practices on data analysis and document 
     formatting
    By studying Unix's history and philosophy, we 
     can learn how the first programmers approached 
     these problems
The Unix Philosophy
   The Unix Philosophy can be summarized as:
       ●   Store everything on text files
       ●   Write small programs that carry out a single task 
       ●   Combine the small programs together to perform 
           more complex tasks
The Unix Philosophy
   “Write programs that do one thing and do it well. 
    Write programs to work together. Write programs 
    to handle text streams, because that is a universal 
    interface.”
       ●   (Doug McIlroy, inventor of the Unix pipes system)
Each Unix tool is specialized
         on a task
    Did you notice that each of the tools we have seen 
     today is specialized on one single task?
        ●   grep searches for a word in a file
        ●   cut extracts columns
        ●   sort orders a file 
Each Unix tool is specialized
         on a task
    Did you notice that each of the tools we have seen 
     today is specialized on one single task?
        ●   grep searches for a word in a file
        ●   cut extracts columns
        ●   sort orders a file 
    Almost all Unix command line tools are 
     specialized on a task:
        ●   diff, find, echo, wait...
Chaining commands
            together
   So, the Unix approach to programming is to write 
    small programs specialized on single tasks, and 
    organize them together
   To chain commands together, we will use the 
    “pipe” command (“|”)
The “pipe” system
   The “|” symbol can be used to “chain” commands 
    together
   To type the “|” symbol, use AltGr+> on your 
    keyboard:
Basic Pipe usage
   The pipe command redirects the output of a 
    command into another one
   Let's try it:
       ●   man ls | head → prints the first lines of the “ls” 
           manual
Piping to “less”
   The command “less” allows to scroll the contents 
    of a file (we have seen it this morning)
   It is often useful to redirect the output of a 
    command to less
   Example: 
       ●   grep ACTGCC sample_vcf.vcf | less
Piping exercise on vcf files
   You should be in the exercises/vcf folder
   Let's get the first 5 columns of the first 10 lines of 
    the vcf file:
       ●   cut ­f 1­5 sample_vcf.vcs | head ­n 10
   Explanation:
       ●   cut ­f 1­5 sample_vcf.vcf → extracts columns 1 to 
           5
       ●   head ­n 10 → prints first 10 lines
Piping exercise on vcf files
   Let's repeat the previous example, but removing 
    all the lines starting with “#” first:
       ●   grep ­v '#' sample_vcf.vcf | cut ­f 1­5 | head ­n 10
   Explanation:
       ●   grep  ­v '#' sample_vcf.vcf → extract all the lines 
           that do <not> contain a “#” (note the ­v flag)
       ●   cut ­f 1­5 → extracts columns 1 to 5
       ●   head ­n 10 → prints first 10 lines
Another command: wc
   wc stands for “word count”
   It counts the number of words, lines and 
    characters in a file
        Number
          of
         words




    Number        Number
       of            of
     lines       characters
“wc” can be very useful
      combined with grep
   How many sequences containing “AACCC” are 
    in the sample_vcf.vcf file?
      ●   grep AACCC sample_vcf.vcf | wc
“wc -l”
   An useful option in wc is the ­l, which only 
    shows the number of lines
   This is useful when applied to any command that 
    produces an output organized in lines
   ls ­l | wc ­l → will show the number of files in the 
    current folder
“uniq”
   uniq removes all duplicated lines from a file
   Useful to elaborate the results of cut or grep
   Example: 
       ●   cut ­f 5 sample_vcf.vcf | uniq
“uniq”
   uniq removes all duplicated lines from a file
   Useful to elaborate the results of cut or grep
   Example: 
       ●   cut ­f 5 sample_vcf.vcf | uniq
   Note: uniq expected the input to be already sorted
       ●   cut ­f 5 sample_vcf.vcf | sort | uniq
Unix philosophy applied to
      Programming
   The Unix philosophy is a general approach to 
    data analysis
   When you will have learned programming, you 
    can apply it to any data analysis task
       ●   Split your analysis in small segments
       ●   Write a different script for each segment
       ●   Use piping to put everything together
Example of a bioinfo
           pipeline
   Imagine that we need to plot the CG content of a 
    genome:
       ●   Approach 1 (not Unix): write a script that 
           downloads the sequence of the genome, calculate 
           the CG content, and draw a plot
       ●   Approach 2 (Unix): write three separate scripts: 
           one  to download the sequence, one to calculate 
           CG content, and another to draw the plot. Then, 
           pipe them together.
Questions
   Which of the commands seen today (grep, cut, 
    sort) would be useful for your work?
   What do you find more difficult about Unix 
    systems?
   Which kind of analysis workflows do you need to 
    do for your research?
Unix Approach to
          Bioinformatics
           Conclusions
   The Unix approach is more flexible and adaptable 
    to other situations
   You don't need to use the Unix approach always, 
    but you should be aware of it
Advanced Commands
   tr → replace a character in a text file
   comm → compare two files
   join → join files by a common column
   paste → merge lines of files
More advanced commands
   sed → stream editor
   awk → spreadsheet­like programming language
   make → define simple pipelines
Resume of the session:
   grep: search files for a text
   cut: extract columns from a csv file
   sort: sort a csv line 
   The pipe symbol (|) can be used to combine Unix 
    tools together
Time for a break!




Next session starts at 17:00

More Related Content

What's hot

Talk Unix Shell Script 1
Talk Unix Shell Script 1Talk Unix Shell Script 1
Talk Unix Shell Script 1
Dr.Ravi
 
15 practical grep command examples in linux : unix
15 practical grep command examples in linux : unix15 practical grep command examples in linux : unix
15 practical grep command examples in linux : unix
chinkshady
 
Unix Basics
Unix BasicsUnix Basics
Unix Basics
Dr.Ravi
 
Unix Basics 04sp
Unix Basics 04spUnix Basics 04sp
Unix Basics 04sp
Dr.Ravi
 

What's hot (20)

Linux shell scripting
Linux shell scriptingLinux shell scripting
Linux shell scripting
 
Talk Unix Shell Script 1
Talk Unix Shell Script 1Talk Unix Shell Script 1
Talk Unix Shell Script 1
 
15 practical grep command examples in linux : unix
15 practical grep command examples in linux : unix15 practical grep command examples in linux : unix
15 practical grep command examples in linux : unix
 
Unix Basics
Unix BasicsUnix Basics
Unix Basics
 
Mysql
MysqlMysql
Mysql
 
Unix Basics 04sp
Unix Basics 04spUnix Basics 04sp
Unix Basics 04sp
 
Unix - Filters/Editors
Unix - Filters/EditorsUnix - Filters/Editors
Unix - Filters/Editors
 
Shellscripting
ShellscriptingShellscripting
Shellscripting
 
Unix Tutorial
Unix TutorialUnix Tutorial
Unix Tutorial
 
Unix - Shell Scripts
Unix - Shell ScriptsUnix - Shell Scripts
Unix - Shell Scripts
 
Unix shell scripts
Unix shell scriptsUnix shell scripts
Unix shell scripts
 
Shell programming
Shell programmingShell programming
Shell programming
 
Quick start bash script
Quick start   bash scriptQuick start   bash script
Quick start bash script
 
system management -shell programming by gaurav raikar
system management -shell programming by gaurav raikarsystem management -shell programming by gaurav raikar
system management -shell programming by gaurav raikar
 
50 Most Frequently Used UNIX Linux Commands -hmftj
50 Most Frequently Used UNIX  Linux Commands -hmftj50 Most Frequently Used UNIX  Linux Commands -hmftj
50 Most Frequently Used UNIX Linux Commands -hmftj
 
Unix And Shell Scripting
Unix And Shell ScriptingUnix And Shell Scripting
Unix And Shell Scripting
 
Linux com
Linux comLinux com
Linux com
 
Shell scripting
Shell scriptingShell scripting
Shell scripting
 
Basic linux commands
Basic linux commands Basic linux commands
Basic linux commands
 
Chap06
Chap06Chap06
Chap06
 

Viewers also liked

Unix command-line tools
Unix command-line toolsUnix command-line tools
Unix command-line tools
Eric Wilson
 
Regular Expressions grep and egrep
Regular Expressions grep and egrepRegular Expressions grep and egrep
Regular Expressions grep and egrep
Tri Truong
 
Ras pioverview
Ras pioverviewRas pioverview
Ras pioverview
Alec Clews
 

Viewers also liked (20)

Unix command-line tools
Unix command-line toolsUnix command-line tools
Unix command-line tools
 
Grep - A powerful search utility
Grep - A powerful search utilityGrep - A powerful search utility
Grep - A powerful search utility
 
Regular Expressions grep and egrep
Regular Expressions grep and egrepRegular Expressions grep and egrep
Regular Expressions grep and egrep
 
Learning sed and awk
Learning sed and awkLearning sed and awk
Learning sed and awk
 
QSpiders - Unix Operating Systems and Commands
QSpiders - Unix Operating Systems  and CommandsQSpiders - Unix Operating Systems  and Commands
QSpiders - Unix Operating Systems and Commands
 
Linux intro 1 definitions
Linux intro 1  definitionsLinux intro 1  definitions
Linux intro 1 definitions
 
UNIX/Linux training
UNIX/Linux trainingUNIX/Linux training
UNIX/Linux training
 
Unix Operating System
Unix Operating SystemUnix Operating System
Unix Operating System
 
Linux pipe & redirection
Linux pipe & redirectionLinux pipe & redirection
Linux pipe & redirection
 
Chap 17 advfs
Chap 17 advfsChap 17 advfs
Chap 17 advfs
 
Unix tutorial-08
Unix tutorial-08Unix tutorial-08
Unix tutorial-08
 
Linux Commands
Linux CommandsLinux Commands
Linux Commands
 
Chap 18 net
Chap 18 netChap 18 net
Chap 18 net
 
Ras pioverview
Ras pioverviewRas pioverview
Ras pioverview
 
Linux fundamental - Chap 16 System Rescue
Linux fundamental - Chap 16 System RescueLinux fundamental - Chap 16 System Rescue
Linux fundamental - Chap 16 System Rescue
 
History of L0phtCrack
History of L0phtCrackHistory of L0phtCrack
History of L0phtCrack
 
Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureMachine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and Future
 
Nigerian design and digital marketing agency
Nigerian design and digital marketing agencyNigerian design and digital marketing agency
Nigerian design and digital marketing agency
 
Intro to linux performance analysis
Intro to linux performance analysisIntro to linux performance analysis
Intro to linux performance analysis
 
VideoLan VLC Player App Artifact Report
VideoLan VLC Player App Artifact ReportVideoLan VLC Player App Artifact Report
VideoLan VLC Player App Artifact Report
 

Similar to Linux intro 3 grep + Unix piping

Python advanced 3.the python std lib by example – application building blocks
Python advanced 3.the python std lib by example – application building blocksPython advanced 3.the python std lib by example – application building blocks
Python advanced 3.the python std lib by example – application building blocks
John(Qiang) Zhang
 
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docxCOMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
cargillfilberto
 
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docxCOMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
drandy1
 
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docxCOMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
monicafrancis71118
 

Similar to Linux intro 3 grep + Unix piping (20)

Gráficas en python
Gráficas en python Gráficas en python
Gráficas en python
 
Tips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyTips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development Efficiency
 
Server Logs: After Excel Fails
Server Logs: After Excel FailsServer Logs: After Excel Fails
Server Logs: After Excel Fails
 
dotor.pdf
dotor.pdfdotor.pdf
dotor.pdf
 
Python advanced 3.the python std lib by example – application building blocks
Python advanced 3.the python std lib by example – application building blocksPython advanced 3.the python std lib by example – application building blocks
Python advanced 3.the python std lib by example – application building blocks
 
Automation tools: making things go... (March 2019)
Automation tools: making things go... (March 2019)Automation tools: making things go... (March 2019)
Automation tools: making things go... (March 2019)
 
MCLS 45 Lab Manual
MCLS 45 Lab ManualMCLS 45 Lab Manual
MCLS 45 Lab Manual
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docxCOMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
 
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docxCOMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
 
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docxCOMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
 
R programming for data science
R programming for data scienceR programming for data science
R programming for data science
 
Not Your Fathers C - C Application Development In 2016
Not Your Fathers C - C Application Development In 2016Not Your Fathers C - C Application Development In 2016
Not Your Fathers C - C Application Development In 2016
 
Managing PostgreSQL with Ansible - FOSDEM PGDay 2016
Managing PostgreSQL with Ansible - FOSDEM PGDay 2016Managing PostgreSQL with Ansible - FOSDEM PGDay 2016
Managing PostgreSQL with Ansible - FOSDEM PGDay 2016
 
Nzitf Velociraptor Workshop
Nzitf Velociraptor WorkshopNzitf Velociraptor Workshop
Nzitf Velociraptor Workshop
 
Nithi
NithiNithi
Nithi
 
The GO Language : From Beginners to Gophers
The GO Language : From Beginners to GophersThe GO Language : From Beginners to Gophers
The GO Language : From Beginners to Gophers
 
Unix 2 en
Unix 2 enUnix 2 en
Unix 2 en
 
Linux Server Deep Dives (DrupalCon Amsterdam)
Linux Server Deep Dives (DrupalCon Amsterdam)Linux Server Deep Dives (DrupalCon Amsterdam)
Linux Server Deep Dives (DrupalCon Amsterdam)
 
Chapter 1: Introduction to Command Line
Chapter 1: Introduction to  Command LineChapter 1: Introduction to  Command Line
Chapter 1: Introduction to Command Line
 

More from Giovanni Marco Dall'Olio

The true story behind the annotation of a pathway
The true story behind the annotation of a pathwayThe true story behind the annotation of a pathway
The true story behind the annotation of a pathway
Giovanni Marco Dall'Olio
 
(draft) perl e bioinformatica - presentazione per ipw2008
(draft) perl e bioinformatica - presentazione per ipw2008(draft) perl e bioinformatica - presentazione per ipw2008
(draft) perl e bioinformatica - presentazione per ipw2008
Giovanni Marco Dall'Olio
 

More from Giovanni Marco Dall'Olio (19)

Fehrman Nat Gen 2014 - Journal Club
Fehrman Nat Gen 2014 - Journal ClubFehrman Nat Gen 2014 - Journal Club
Fehrman Nat Gen 2014 - Journal Club
 
Thesis defence of Dall'Olio Giovanni Marco. Applications of network theory to...
Thesis defence of Dall'Olio Giovanni Marco. Applications of network theory to...Thesis defence of Dall'Olio Giovanni Marco. Applications of network theory to...
Thesis defence of Dall'Olio Giovanni Marco. Applications of network theory to...
 
Agile bioinf
Agile bioinfAgile bioinf
Agile bioinf
 
Version control
Version controlVersion control
Version control
 
Wagner chapter 5
Wagner chapter 5Wagner chapter 5
Wagner chapter 5
 
Wagner chapter 4
Wagner chapter 4Wagner chapter 4
Wagner chapter 4
 
Wagner chapter 3
Wagner chapter 3Wagner chapter 3
Wagner chapter 3
 
Wagner chapter 2
Wagner chapter 2Wagner chapter 2
Wagner chapter 2
 
Wagner chapter 1
Wagner chapter 1Wagner chapter 1
Wagner chapter 1
 
Hg for bioinformatics, second part
Hg for bioinformatics, second partHg for bioinformatics, second part
Hg for bioinformatics, second part
 
Hg version control bioinformaticians
Hg version control bioinformaticiansHg version control bioinformaticians
Hg version control bioinformaticians
 
The true story behind the annotation of a pathway
The true story behind the annotation of a pathwayThe true story behind the annotation of a pathway
The true story behind the annotation of a pathway
 
Plotting data with python and pylab
Plotting data with python and pylabPlotting data with python and pylab
Plotting data with python and pylab
 
Pycon
PyconPycon
Pycon
 
Makefiles Bioinfo
Makefiles BioinfoMakefiles Bioinfo
Makefiles Bioinfo
 
biopython, doctest and makefiles
biopython, doctest and makefilesbiopython, doctest and makefiles
biopython, doctest and makefiles
 
Web 2.0 e ricerca scientifica - Web 2.0 and scientific research
Web 2.0 e ricerca scientifica - Web 2.0 and scientific researchWeb 2.0 e ricerca scientifica - Web 2.0 and scientific research
Web 2.0 e ricerca scientifica - Web 2.0 and scientific research
 
Perl Bioinfo
Perl BioinfoPerl Bioinfo
Perl Bioinfo
 
(draft) perl e bioinformatica - presentazione per ipw2008
(draft) perl e bioinformatica - presentazione per ipw2008(draft) perl e bioinformatica - presentazione per ipw2008
(draft) perl e bioinformatica - presentazione per ipw2008
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Linux intro 3 grep + Unix piping

  • 1. Programming for Evolutionary Biology April 3rd – 19th 2013 Leipzig, Germany Introduction to Unix systems Part 3: grep, and Unix philosophy Giovanni Marco Dall'Olio Universitat Pompeu Fabra Barcelona (Spain)
  • 2. Schedule  9.30 – 11.00: “What is Unix?”   11.30 – 12.30: Introducing the terminal  14:30 – 16:30: grep & Unix philosophy  17:00 – 18:00: awk, make, & question time
  • 3. Grep: search files for a pattern  The command grep allows to search for patterns  in a file  Basic usage: ● grep “searchword” filename  Example: ● grep “ATG” myfasta.fa
  • 4. Let's go to the exercises folder  Remember: ● cd (to go back to the home folder) ● cd /homes/evopserver/evopadmin/unix_intro ● cd exercises ● ls
  • 5. Let's look at the fastq examples  Go to the folder “unix_intro/exercises/fastq” ● This time is fastq, not fasta!  Fastq is a format to represent sequences and  quality scores  To see what is in the files:  less sample_1000genomes_fastq.fastq 
  • 6. Let's look at the fastq examples  Go to the folder “unix_intro/exercises/fastq” ● This time is fastq, not fasta!  Fastq is a format to represent sequences and  quality scores  To see what is in the files:  less sample_1000genomes_fastq.fastq   Tip: try the less ­S option  less ­S sample_1000genomes_fastq.fastq 
  • 7. The Fastq format 1st line: Sequence name and description @SRR062634.321 HWI-EAS110_103327062:6:1:1446:951/2 TGATCATTTGATTAATACTGACATGTAGACAA 2nd line: Sequence + B5=BD5DAD?:CBDD-DDDDDCDDB+-B:;?A?C 4th line: Quality Scores
  • 8. Let's “grep”  Let's use grep to see if any of the sequences in the  fastq folder contains the word “ACTG” ● grep “ACTG” * ● Note that grep is case sensitive. “ACTG” is  different than “actg”
  • 10. Grep options  Let's have a look at the man page for grep ● man grep
  • 11. The grep man page  screenshot
  • 12. Case sensitivity  Most command line tools are case sensitives  This means that “ACTG” and “actg” are not  considered as the same sequence  We can use the ­i option to ignore the case: ● grep ­i “actg” *
  • 13. grep -B  We have used grep to get all the sequences that  have “ACTG” in our files  However, the name of the sequences are stored in  the line above the sequence  We can use the ­B option to print the line above  each match
  • 14. Retrieve all the sequences that do <not> match  The grep ­v option prints only the lines that do  <not> match the query  grep ­v “ACTG”  → returns the exact opposite  results than the previous query
  • 15. Short exercise  Which parameter can be used to count the number  of matching lines, instead of printing them? (hint:  see the man page)  Which parameter can be used to print the lines  below? (the opposite of the ­B option)  How many sequences match the pattern  “AAATTTC” in the sample vcf file?
  • 16. Hint: saving results to a file  It may be useful to save the results of a grep  search to a file  To do this, you can use the “>” operator  Example: ● grep ACGTG sample_vcf.vcf > results.txt
  • 17. Regular Expressions  Regular Expressions are a way to specify a string  that matches different set of characters  They will be explained better in the Perl course  For now, let's see some
  • 18. Simple grep regular expressions  The symbol “.” matches any character  Example: ● grep 'AAA..GGG' matches all the sequences that  start with a “AAA”, have any two other characters,  and end with a “GGG”
  • 19. Simple grep regular expressions  We can use the brackets to specify a set of  characters  For example, “[ACG]” will match any A, C or G  Let's try it: ● grep AAA[ACG].GGG sample_1000genomes_fastq.fastq
  • 20. Exercises  Search the following sequences in the file  sample_1000genomes_fastq.fastq : ● ACTGAAT  ● ACC followed by any 3 characters, then CC ● AAA followed by any C or G
  • 21. More grep exercises  grep is an important command, so let's see other  examples
  • 22. Let's look at genbank files  Genbank is a format for storing annotation on  sequences, used by the Genbank database  You should be in the fastq folder. To go to the  genbank folder: ● cd .. ● ls ● cd genbank
  • 23. The genbank format  Seems a nice format for grepping!
  • 24. Grepping genbank  Let's see how many sequences there are in the  files contained in the examples folder: ● grep 'LOCUS' *
  • 25. How many PLoS articles are in the files?  grep 'PLoS' *
  • 26. Exercises  Let's check together: ● How many articles by “Powers” are cited? ● How many CDS sequences are in the files
  • 28. Another command: “cut”  cut allows to extract columns from a tab/comma  separated file  We will play with it on some vcf files
  • 29. Go to the vcf folder  You should be in the vcf folder. To go to the  genbank folder: ● cd .. ● ls ● cd vcf
  • 30. The VCF format ●VCF is a tab-separated format used to store SNP genotypes
  • 31. “cut” on a VCF file  Basic cut usage: ● cut ­f <column_number> <filename>  The sample VCF file in the exercise folder has a  lot of columns  Let's use cut to extract only the second column of  the file:  ● cut ­f 2 sample_vcf.vcf
  • 32. Specifying ranges in cut  You can also specify ranges of columns  For example, to print the first columns: ● cut ­f 1­5 sample_vcf.vcf 
  • 33. Let's get the genotype of HG00099  Let's say that we want to extract the genotype of  the individual HG00099  HG00099 is on the 12th column of the vcf file ● cut ­f 12 sample_vcf.vcf
  • 34. Let's look at other “cut” options
  • 35. “cut -c”  Use cut ­c to define a range of characters instead  of a range of columns  This is useful to deal with files that are not  comma­separated
  • 36. Exercise  Which parameter can be used to define a different  field separator? (for example: a comma­separated  file instead of a tab­separated file)
  • 37. The “sort” command  Let's look at another command: sort  The sort command can be used to sort a csv file  by a column
  • 38. Let's sort  Basic sort usage: ● sort ­k <column_number> filename  Let's sort the sample vcf file by chromosome  position: ● sort ­k 2 sample_vcf.vcf
  • 39. “sort -k 2 sample_vcf.vcf”  Note: when sorting numerical columns, the “­n”  option should be used
  • 40. Let's sort by genotype state of HG00099  sort ­k 12 sample_vcf.vcf
  • 42. The history of Unix systems  Let's go back in time to when the original Unix  system was developed: ● Graphical interfaces didn't exist ● No facebook or gmail ● Computers were used mainly by programmers, and  for data analysis / document formatting
  • 43. The history of Unix systems (2)  When Unix was designed, it was used mostly for  data analysis and document formatting  Unix was important for the diffusion of the first  best practices on data analysis and document  formatting  By studying Unix's history and philosophy, we  can learn how the first programmers approached  these problems
  • 44. The Unix Philosophy  The Unix Philosophy can be summarized as: ● Store everything on text files ● Write small programs that carry out a single task  ● Combine the small programs together to perform  more complex tasks
  • 45. The Unix Philosophy  “Write programs that do one thing and do it well.  Write programs to work together. Write programs  to handle text streams, because that is a universal  interface.” ● (Doug McIlroy, inventor of the Unix pipes system)
  • 46. Each Unix tool is specialized on a task  Did you notice that each of the tools we have seen  today is specialized on one single task? ● grep searches for a word in a file ● cut extracts columns ● sort orders a file 
  • 47. Each Unix tool is specialized on a task  Did you notice that each of the tools we have seen  today is specialized on one single task? ● grep searches for a word in a file ● cut extracts columns ● sort orders a file   Almost all Unix command line tools are  specialized on a task: ● diff, find, echo, wait...
  • 48. Chaining commands together  So, the Unix approach to programming is to write  small programs specialized on single tasks, and  organize them together  To chain commands together, we will use the  “pipe” command (“|”)
  • 49. The “pipe” system  The “|” symbol can be used to “chain” commands  together  To type the “|” symbol, use AltGr+> on your  keyboard:
  • 50. Basic Pipe usage  The pipe command redirects the output of a  command into another one  Let's try it: ● man ls | head → prints the first lines of the “ls”  manual
  • 51. Piping to “less”  The command “less” allows to scroll the contents  of a file (we have seen it this morning)  It is often useful to redirect the output of a  command to less  Example:  ● grep ACTGCC sample_vcf.vcf | less
  • 52. Piping exercise on vcf files  You should be in the exercises/vcf folder  Let's get the first 5 columns of the first 10 lines of  the vcf file: ● cut ­f 1­5 sample_vcf.vcs | head ­n 10  Explanation: ● cut ­f 1­5 sample_vcf.vcf → extracts columns 1 to  5 ● head ­n 10 → prints first 10 lines
  • 53. Piping exercise on vcf files  Let's repeat the previous example, but removing  all the lines starting with “#” first: ● grep ­v '#' sample_vcf.vcf | cut ­f 1­5 | head ­n 10  Explanation: ● grep  ­v '#' sample_vcf.vcf → extract all the lines  that do <not> contain a “#” (note the ­v flag) ● cut ­f 1­5 → extracts columns 1 to 5 ● head ­n 10 → prints first 10 lines
  • 54. Another command: wc  wc stands for “word count”  It counts the number of words, lines and  characters in a file Number of words Number Number of of lines characters
  • 55. “wc” can be very useful combined with grep  How many sequences containing “AACCC” are  in the sample_vcf.vcf file? ● grep AACCC sample_vcf.vcf | wc
  • 56. “wc -l”  An useful option in wc is the ­l, which only  shows the number of lines  This is useful when applied to any command that  produces an output organized in lines  ls ­l | wc ­l → will show the number of files in the  current folder
  • 57. “uniq”  uniq removes all duplicated lines from a file  Useful to elaborate the results of cut or grep  Example:  ● cut ­f 5 sample_vcf.vcf | uniq
  • 58. “uniq”  uniq removes all duplicated lines from a file  Useful to elaborate the results of cut or grep  Example:  ● cut ­f 5 sample_vcf.vcf | uniq  Note: uniq expected the input to be already sorted ● cut ­f 5 sample_vcf.vcf | sort | uniq
  • 59. Unix philosophy applied to Programming  The Unix philosophy is a general approach to  data analysis  When you will have learned programming, you  can apply it to any data analysis task ● Split your analysis in small segments ● Write a different script for each segment ● Use piping to put everything together
  • 60. Example of a bioinfo pipeline  Imagine that we need to plot the CG content of a  genome: ● Approach 1 (not Unix): write a script that  downloads the sequence of the genome, calculate  the CG content, and draw a plot ● Approach 2 (Unix): write three separate scripts:  one  to download the sequence, one to calculate  CG content, and another to draw the plot. Then,  pipe them together.
  • 61. Questions  Which of the commands seen today (grep, cut,  sort) would be useful for your work?  What do you find more difficult about Unix  systems?  Which kind of analysis workflows do you need to  do for your research?
  • 62. Unix Approach to Bioinformatics Conclusions  The Unix approach is more flexible and adaptable  to other situations  You don't need to use the Unix approach always,  but you should be aware of it
  • 63. Advanced Commands  tr → replace a character in a text file  comm → compare two files  join → join files by a common column  paste → merge lines of files
  • 64. More advanced commands  sed → stream editor  awk → spreadsheet­like programming language  make → define simple pipelines
  • 65. Resume of the session:  grep: search files for a text  cut: extract columns from a csv file  sort: sort a csv line   The pipe symbol (|) can be used to combine Unix  tools together
  • 66. Time for a break! Next session starts at 17:00