Introduction to Python for Biologists Katerina Taškova 1 Jean-Fred Fontaine 1,2 1 Faculty of Biology, Johannes Gutenberg-Universität Mainz, Mainz, Germany 2 Genomics and Computational Biology, Kernel Press, Mainz, Germany https://cbdm.uni-mainz.de/mb17 March 21, 2017
Introduction to Python for Biologists Table of Contents Introduction Running code Literals and variables Numeric types Strings Exercise Lists, tuples and ranges Sets and dictionaries Convert and copy Loops Functions Branching Regular Expressions Annexes March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 2
Introduction to Python for Biologists Introduction Introduction Running code Literals and variables Numeric types Strings Exercise Lists, tuples and ranges Sets and dictionaries Convert and copy Loops Functions Branching Regular Expressions Annexes March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 3
Introduction to Python for Biologists Introduction What is Python? Python is a general-purpose programming language created by Guido van Rossum (1991) high-level (abstraction from the details of the computer) interpreted (needs an interpreter software) Python design philosophy code readability syntax brevity Python is widely used for Biology rich built-in features powerful scientific extensions plotting capabilities March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 4
Introduction to Python for Biologists Introduction Structured programming I Instructions are executed sequentially, one per line Conditional statements allow selective execution of code blocks Loops allow repeated execution of code blocks Functions allow on-demand execution of code blocks March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 5
Introduction to Python for Biologists Introduction Structured programming II 1 i n s t r u c t i o n 1 # 1 s t i n s t r u c t i o n ( hashtag # s t a r t s comments ) 2 # blank l i n e 3 repeat 20 times # 2nd i n s t r u c t i o n ( loop s t a r t s a block ) 4 i n s t r u c t i o n a # block defined by i n d e n t a t i o n ( spaces or tabs ) 5 i n s t r u c t i o n b # 2nd i n s t r u c t i o n i n block 6 # blank l i n e 7 i f n>10 # 3 rd i n s t r u c t i o n ( C o n d i t i o n a l statement ) 8 i n s t r u c t i o n a # 1 s t i n s t r u c t i o n i n block 9 i n s t r u c t i o n b # 2nd i n s t r u c t i o n i n block 10 # blank l i n e 11 # blank l i n e 12 # backslashs j o i n l i n e s 13 i n s t r u c t i o n 3 \ # 3 rd i n s t r u c t i o n, p a r t 1 14 i n s t r u c t i o n 3 # 3 rd i n s t r u c t i o n, p a r t 2 15 # blank l i n e 16 # Expressions i n ( ), {}, or [ ] can span m u l t i p l e l i n e s 17 i n s t r u c t i o n 4 ( 1, 2, 3 # 4 th i n s t r u c t i o n, p a r t 1 18 4, 5, 6) # 4 th i n s t r u c t i o n, p a r t 2 March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 6
Introduction to Python for Biologists Introduction Namespace Variables are names associated with data e.g. a=2 assigns value 2 to variable a Functions are names associated to specific code blocks built-in functions are available (see list on slide 100) e.g. print(a) will display 2 on the screen The user namespace is the set of names available to the user users can define new names of variables and functions in their namespace imported modules can add names of variables and functions in the user namespace March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 7
Introduction to Python for Biologists Introduction Object-oriented programming Data is organized in classes and objects a class is a template defining what objects can store and do an object is an instance of a class objects have attributes to store data and methods to do actions object namespaces are different from user namespace Example class Human is defined as: has a name (an attribute name ) has an age (an attribute age ) can introduce itself (a method who ) example with 1 existing Human object P1: 1 P1. name = Mary # assigns value to a t t r i b u t e name 2 P1. age = 26 # assigns value to a t t r i b u t e age 3 P1. who ( ) # d i s p l a y s My name i s Mary I am 2 6! 4 who ( ) # e r r o r! not i n the user namespace March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 8
Introduction to Python for Biologists Introduction Modules Modules can add functionalities to Python e.g. classes and functions Example of available modules: NumPy for scientific computing Matplotlib for plotting BioPython for Biology Modules have to be imported into the code 1 # import datetime module i n i t s own namespace 2 import datetime 3 datetime. date. today ( ) # 2017 03 16 4 today ( ) # e r r o r! 5 6 # import f u n c t i o n s log2 and log10 from module math 7 # i n c u r r e n t namespace 8 from math import log2, log10 9 log10 ( 1 ) # equal 0 March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 9
Introduction to Python for Biologists Running code Introduction Running code Literals and variables Numeric types Strings Exercise Lists, tuples and ranges Sets and dictionaries Convert and copy Loops Functions Branching Regular Expressions Annexes March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 10
Introduction to Python for Biologists Running code Running code I From a terminal by using the interactive Python shell 1 $ python3 # opens Python s h e l l 2 a=2 # assigns 2 to a 3 b=3 # assigns 3 to b 4 e x i t ( ) # closes Python s h e l l From a terminal by running a script file e.g. let say myscript.py is a script file (simple text file) and it contains: print( hello world! ) 1 $ python3 myscript. py # runs python3 and the s c r i p t 2 h e l l o world! # r e s u l t of the s c r i p t on the t e r m i n a l March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 11
Introduction to Python for Biologists Running code Running code II From Jupyter Notebook web-based graphical interface manage cells of code or text see execution results on the same notebook save/open notebooks March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 12
Introduction to Python for Biologists Running code Documentation and messages I Documentation and help: https://docs.python.org/3 use the built-in help() function e.g. help(print) to display help for function print() see help menu or Google it Examples of error messages 1 # F o r g e t t i n g quotes 2 p r i n t ( Hello world ) 3 # F i l e < s t d i n >, l i n e 2 4 # p r i n t ( Hello world ) 5 # ˆ 6 # SyntaxError : i n v a l i d syntax March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 13
Introduction to Python for Biologists Running code Documentation and messages II 1 # S p e l l i n g mistakes 2 p r i n ( Hello world ) 3 # Traceback ( most recent c a l l l a s t ) : 4 # F i l e < s t d i n >, l i n e 2, i n <module> 5 # NameError : name p r i n i s not defined 1 # Wrong l i n e break w i t h i n a s t r i n g 2 p r i n t ( Hello 3 World ) 4 # F i l e < s t d i n >, l i n e 2 5 # p r i n t ( Hello 6 # ˆ 7 # SyntaxError : EOL while scanning s t r i n g l i t e r a l March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 14
Introduction to Python for Biologists Literals and variables Introduction Running code Literals and variables Numeric types Strings Exercise Lists, tuples and ranges Sets and dictionaries Convert and copy Loops Functions Branching Regular Expressions Annexes March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 15
Introduction to Python for Biologists Literals and variables Numeric and strings literals I 1 # Numeric l i t e r a l s 2 12 3 123 4 1.6E3 # means 1600 5 6 # S t r i n g s l i t e r a l s 7 A s t r i n g # A s t r i n g 8 A s t r i n g # A s t r i n g 9 A s t r i n g # A s t r i n g 10 Three s i n g l e quotes # Three s i n g l e quotes 11 Three double quotes # Three double quotes 12 A \ s t r i n g \ # A s t r i n g ( backslash escape sequence ) 13 r A \ s t r i n g \ # A \ s t r i n g \ ( raw s t r i n g ) Python stores literals in objects of corresponding classes (class int for integers, float for floatting point, and str for strings) March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 16
Introduction to Python for Biologists Literals and variables Numeric and strings literals II Printing numeric and strings literals 1 p r i n t (12) # 12 2 p r i n t (1+2) # 3 3 4 p r i n t ( Hello World ) # Hello World 5 6 p r i n t ( Hello World, 1+2) # Hello World 3 7 p r i n t ( Hello World, 1+2, sep= ) # Hello World 3 8 p r i n t ( Hello World, 1+2, sep= \ t ) # Hello World 3 9 # (\ t : tab, \n : newline ) 10 11 p r i n t ( AB, end= ) # AB ( avoid newline at the end ) 12 p r i n t ( CD ) # ABCD 13 14 p r i n t ( Max i s, 12, and Min i s, 3) # Max i s 12 and Min i s 3 March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 17
Introduction to Python for Biologists Literals and variables Variables I Variables are names used to access objects first letter is a character (not a digit) no space characters allowed case-sensitive (variable name var is not Var) prefer alphanumeric characters (e.g. abc123) avoid accents, non-alphanumeric, non English underscores may be used (e.g. abc 123) The following keywords can not be used as variable names and, assert, break, class, continue def, del, elif, else, except, exec, finally, for, from global, if, import, in, is, lambda, not, or, pass print, raise, return, try, while, yield March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 18
Introduction to Python for Biologists Literals and variables Variables II 1 # Numeric types 2 a=2 # a i s assigned an i n t o b j e c t of value 2 3 p r i n t ( a ) # p r i n t s the o b j e c t assigned to a ( 2 ) 4 b=a # b i s assigned the same o b j e c t as a ( 2 ) 5 p r i n t ( b ) # 2 6 a=5 # a i s assigned a new o b j e c t of value 5 7 p r i n t ( a ) # 5 8 p r i n t ( b ) # 2 ( b i s s t i l l assigned to o b j e c t of value 2) 9 10 # S t r i n g s 11 c1= a 12 p r i n t ( c1 ) # a 13 myname125 = abc March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 19
Introduction to Python for Biologists Numeric types Introduction Running code Literals and variables Numeric types Strings Exercise Lists, tuples and ranges Sets and dictionaries Convert and copy Loops Functions Branching Regular Expressions Annexes March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 20
Introduction to Python for Biologists Numeric types Numeric types I 1 type ( 7 ) # <class i n t > ( i n t e g e r number ) 2 type ( 8. 2 5 ) # <class f l o a t > ( f l o a t i n g p o i n t ) 3 type (4.52 e 3) # <class f l o a t > ( f l o a t i n g p o i n t ) 4 5 # Operators ( s p e c i a l b u i l t i n f u n c t i o n s ) 6 1 + 3 # 4 ( a d d i t i o n ) 7 4 1 # 3 ( s u b s t r a c t i o n ) 8 3 2 # 6 ( m u l t i p l i c a t i o n ) 9 9 / 2 # 4.5 ( d i v i s i o n ) 10 9 / / 2 # 4 ( i n t e g e r d i v i s i o n ) 11 9 % 2 # 1 ( i n t e g e r d i v i s i o n remainder ) 12 2 3 # 8 ( exponent ) 13 14 # Lowest to highest operators precedence ( equal i f on same l i n e ) 15 +, # Addition, S u b t r a c t i o n 16, /, / /, % # M u l t i p l i c a t i o n, D i v i s i o n s, Remainder 17 +x, x # P o s i t i v e, Negative 18 # Exponentiation March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 21
Introduction to Python for Biologists Numeric types Numeric types II 1 # B u i l t i n f u n c t i o n s 2 abs ( 2.58) # 2.58 ( absolute value of x ) 3 round ( 2. 5 ) # 2 ( round to c l o s e s t i n t e g e r ) 4 5 # With v a r i a b l e s 6 a = 1 # 1 7 b = 1 + 1 # 2 8 c = a + b # 3 9 d = a+c b # 7 ( precedence of over +) 10 d = ( a+c ) b # 8 ( use parentheses to break precedence ) 11 12 # Short n o t a t i o n s ( v a l i d f o r +,,, /,... ) 13 a += 1 # a = a + 1 14 a = 5 # a = a 5 15 16 # Special f l o a t values 17 f l o a t ( NaN ) # nan ( Not a Number ) 18 f l o a t ( I n f ) # i n f : I n f i n i t e p o s i t i v e ; i n f : I n f i n i t e negative March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 22
Introduction to Python for Biologists Strings Introduction Running code Literals and variables Numeric types Strings Exercise Lists, tuples and ranges Sets and dictionaries Convert and copy Loops Functions Branching Regular Expressions Annexes March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 23
Introduction to Python for Biologists Strings Sequence types Text sequence type: Strings: immutable sequences of characters Basic sequence types: Lists: mutable sequences Tuples: immutable sequences Ranges: immutable sequence of numbers Sequence operations: All sequence types support common sequence operations (slide 98) Mutable sequence types support specific operations (slide 99) March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 24
Introduction to Python for Biologists Strings Strings I 1 # Quotes 2 A s t r i n g # A s t r i n g 3 A s t r i n g # A s t r i n g 4 A s t r i n g # A s t r i n g 5 Three s i n g l e quotes # Three s i n g l e quotes 6 Three double quotes # Three double quotes 7 8 # Escape sequences ( see annexes ) 9 A s i n g l e quote # A s i n g l e quote 10 A s i n g l e quote \ # A s i n g l e quote 11 A t a b u l a t i o n \ t 12 A newline \n See other escape sequences in slide 97 Triple quoted strings may span multiple lines - all associated whitespace will be included in the string literal March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 25
Introduction to Python for Biologists Strings Strings II 1 # Operators 2 pipe + t t e # = p i p e t t e ( concatenation ) 3 A 7 # = AAAAAAA ( r e p l i c a t i o n ) 4 A 3 + C 2 # = AAACC 5 A + s t r ( 2. 0 ) # = A2. 0 ( convert number then concatenate ) 6 7 # B u i l t i n f u n c t i o n s 8 len ( A s t r i n g of characters ) # 22 ( l e n g t h i n characters ) 9 type ( a ) # <class s t r > ( s t r i n g ) 10 11 # S l i c e s [ s t a r t : end : step ] (0 i s index of f i r s t character ) 12 ABCDEFG [ 2 : 5 ] # CDE ( F at index 5 excluded ) 13 ABCDEFG [ : 5 ] # ABCDE ( from begining ) 14 ABCDEFG [ 5 : ] # FG ( to the end ) 15 ABCDEFG [ 2:] # FG ( 2 from the end : to the end ) 16 ABCDEFG [ 0 : 5 : 2 ] # ACE ( every second l e t t e r w ith step =2) March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 26
Introduction to Python for Biologists Strings Strings methods I Strings are immutable: new objects are created for changes 1 seq = ACGtCCAgTnAGaaGT 2 3 # Case 4 seq. c a p i t a l i z e ( ) # Acgtccagtnagaagt 5 seq. casefold ( ) # acgtccagtnagaagt ( e s z e t t => ss ) 6 seq. lower ( ) # acgtccagtnagaagt ( e s z e t t => e s z e t t ) 7 seq. swapcase ( ) # acgtccagtnagaagt 8 seq. upper ( ) # ACGTCCAGTNAGAAGT 9 10 # Search and replace 11 seq. count ( a ) # 2 ( case s e n s i t i v e ) 12 seq. count ( G,0, 4) # 1 ( s l i c e s t a r t and end indexes ) 13 seq. endswith ( GT ) # True 14 seq. endswith ( G,0, 4) # False ( s l i c e s t a r t and end indexes ) 15 seq. f i n d ( GtC ) # 2 (1 s t h i t index, 1 otherwise ) 16 seq. replace ( aa, t t ) # ACGtCCAgTnAGttGT ( case s e n s i t i v e ) 17 seq. replace ( A, x, 2) # xcgtccxgtnagaagt (2 f i r s t h i t s only ) March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 27
Introduction to Python for Biologists Strings Strings methods II 1 seq = ACGtCCAgTnAGaaGT 2 3 # I s f u n c t i o n s 4 seq. isalnum ( ) # True ( Are a l l characters alphanumeric?) 5 seq. i s a l p h a ( ) # True ( Are a l l characters a l p h a b e t i c?) 6 seq. i s l o w e r ( ) # False ( Are a l l characters lowercase?) 7 seq. isnumeric ( ) # False ( Are a l l numeric characters?) 8 seq. isspace ( ) # False ( Are a l l whitespace characters?) 9 seq. isupper ( ) # False ( Are a l l characters uppercase?) 10 11 # Join and s p l i t 12. j o i n ( [ A, B ] ) # A B 13. j o i n ( seq ) # A C G t C C A g T n A G a a G T 14 seq. p a r t i t i o n ( aa ) # ( ACGtCCAgTnAG, aa, GT ) : a t u p l e 15 seq. s p l i t ( aa ) # [ ACGtCCAgTnAG, GT ] : a l i s t 16 1\n2. s p l i t l i n e s ( ) # [ 1, 2 ] ( s p l i t at l i n e boundaries \r, \n ) March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 28
Introduction to Python for Biologists Strings Strings methods III 1 seq = ACGtCCAgTnAGaaGT 2 3 # D e l e t i n g 4 seq. l s t r i p ( ) # remove leading whitespace characters 5 seq. r s t r i p ( ) # remove t r a i l i n g whitespace characters 6 seq. s t r i p ( ) # remove whitespace characters from both ends 7 8 seq. l s t r i p ( AC ) # GtCCAgTnAGaaGT ( remove C s or A s ) 9 seq. l s t r i p ( CA ) # GtCCAgTnAGaaGT ( remove C s or A s ) 10 seq. l s t r i p ( C ) # ACGtCCAgTnAGaaGT ( no impact ) 11 # same f o r r s t r i p but from the r i g h t and s t r i p from both ends 12 13 # Simple parsing of t e x t l i n e s from CSV f i l e s 14 l i n e. s t r i p ( ). s p l i t (, ) # remove newline and s p l i t CSV (\ t i f TSV) 15 16 # t r a n s l a t e ( case s e n s i t i v e ) 17 t a b l e = seq. maketrans ( atcg, tagc ) # map characters by index 18 seq. lower ( ). t r a n s l a t e ( t a b l e ) # t g c a g g t c a n t c t t c a March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 29
Introduction to Python for Biologists Exercise Introduction Running code Literals and variables Numeric types Strings Exercise Lists, tuples and ranges Sets and dictionaries Convert and copy Loops Functions Branching Regular Expressions Annexes March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 30
Introduction to Python for Biologists Exercise Exercise Create the following directory structure Dokumente python notebooks data Jupyter Notebook File: Literals.ipynb URL: https://cbdm.uni-mainz.de/mb17 Download the file into the notebooks folder March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 31
Introduction to Python for Biologists Lists, tuples and ranges Introduction Running code Literals and variables Numeric types Strings Exercise Lists, tuples and ranges Sets and dictionaries Convert and copy Loops Functions Branching Regular Expressions Annexes March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 32
Introduction to Python for Biologists Lists, tuples and ranges Sequence types Text sequence type: Strings: immutable sequences of characters Basic sequence types: Lists: mutable sequences Tuples: immutable sequences Ranges: immutable sequence of numbers Sequence operations: All sequence types support common sequence operations (slide 98) Mutable sequence types support specific operations (slide 99) March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 33
Introduction to Python for Biologists Lists, tuples and ranges Lists I A List is an ordered collection of objects 1 L i s t 1 = [ ] # an empty l i s t 2 3 L i s t 1 = [ b, a, 1, cat, K, dog, F ] 4 L i s t 1 [ 0 ] # b ( access item of index 0) 5 L i s t 1 [ 1 ] # a ( access item of index 1) 6 L i s t 1 [ 1] # F ( access the l a s t item ) 7 L i s t 1 [ 2] # dog ( access the second l a s t item ) 8 9 # S l i c e s [ s t a r t : end : step ] 10 L i s t 1 [ 2 : 5 ] # [ 1, cat, K ] ( index 5 excluded ) 11 L i s t 1 [ : 5 ] # [ b, a, 1, cat, K ] 12 L i s t 1 [ 5 : ] # [ dog, F ] 13 L i s t 1 [ 2:] # [ dog, F ] 14 L i s t 1 [ 0 : 5 : 2 ] # [ b, 1, K ] March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 34
Introduction to Python for Biologists Lists, tuples and ranges Lists II 1 # B u i l t i n f u n c t i o n s 2 L i s t 2 = [ 1, 2, 3, 4, 5] 3 len ( L i s t 1 ) # 5 ( l e n g t h = 7 items ) 4 max( L i s t 2 ) # 5 5 min ( L i s t 2 ) # 1 6 sum( L i s t 2 ) # 15 7 8 # L i s t methods 9 L i s t 2 = [ ] # empty l i s t 10 L i s t 2. append ( 1 ) # [ 1 ] 11 L i s t 2. append ( A ) # [ 1, A ] 12 L i s t 2. extend ( [ B, 2 ] ) # [ 1, A, B, 2] 13 L i s t 2. pop ( 2 ) # [ 1, A, 2] 14 L i s t 2. i n s e r t ( 3, A ) # [ 1, A, 2, A ] ( i n s e r t A at index 3) 15 L i s t 2. index ( A ) # 1 ( index of the 1 s t A ) 16 L i s t 2. count ( A ) # 2 ( number of A ) 17 L i s t 2. reverse ( ) # [ A, 2, A, 1] March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 35
Introduction to Python for Biologists Lists, tuples and ranges Lists III 1 # s o r t i n g 2 L i s t 3 = [ 5, 3, 4, 1, 2] 3 sorted ( L i s t 3 ) # [ 1, 2, 3, 4, 5] ( b u i l d a new sorted l i s t ) 4 L i s t 3 # [ 5, 3, 4, 1, 2] ( L i s t 3 not changed ) 5 L i s t 3. s o r t ( ) # modifies the l i s t in place 6 L i s t 3 # [ 1, 2, 3, 4, 5] (. s o r t ( ) did modify L i s t 3! ) 7 8 # nested l i s t / 2D l i s t s / t a b l e s 9 mylist = [ [ b, a ], 10 [ 1, cat ] ] # a l i s t of 2 l i s t s 11 mylist [ 0 ] # r e t u r n s the f i r s t l i s t [ b, a ] 12 mylist [ 0 ] [ 0 ] # b (1 s t item of the 1 s t l i s t ) 13 mylist [ 0 ] [ 1 ] # a (2 nd item of the 1 s t l i s t ) 14 mylist [ 1 ] # r e t u r n s the 2nd l i s t [ 1, cat ] 15 mylist [ 1 ] [ 0 ] = 1 0 # [ [ b, a ], [10, cat ] ] March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 36
Introduction to Python for Biologists Lists, tuples and ranges Lists IV 1 mylist = [ [ b, a ], 2 [ 1, cat ] ] 3 4 f o r s u b l i s t i n mylist : # loop over s u b l i s t s 5 f o r value i n s u b l i s t : # loop over values 6 p r i n t ( value ) # p r i n t 1 value per l i n e 7 # b 8 # a 9 # 10 10 # cat 11 12 f o r s u b l i s t i n mylist : # loop over s u b l i s t s 13 n e w s u b l i s t = map( s t r, s u b l i s t ) # convert each item to s t r i n g 14 p r i n t ( \ t. j o i n ( n e w s u b l i s t ) ) # p r i n t as TSV t a b l e 15 # b a 16 # 10 cat March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 37
Introduction to Python for Biologists Lists, tuples and ranges Tuples and ranges A Tuple is an ordered collection of objects 1 Tuple1 = ( ) # empty t u p l e 2 Tuple1 = ( b, a, 1, cat, K, dog, F ) # defined t u p l e 3 4 Tuple1 [ 0 ] # b 5 Tuple1 [ 1 : 3 ] # ( a, 1) ( index 3 excluded ) Ranges 1 # Range ( s t a r t, stop [, step ] ) 2 range (10) # range ( 0, 10) => no nice p r i n t method 3 l i s t ( range (10) ) # [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 4 l i s t ( range ( 0, 30, 5) ) # [ 0, 5, 10, 15, 20, 25] 5 l i s t ( range ( 0, 5, 1) ) # [ 0, 1, 2, 3, 4] 6 l i s t ( range ( 0 ) ) # [ ] 7 l i s t ( range ( 1, 0) ) # [ ] March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 38
Introduction to Python for Biologists Sets and dictionaries Introduction Running code Literals and variables Numeric types Strings Exercise Lists, tuples and ranges Sets and dictionaries Convert and copy Loops Functions Branching Regular Expressions Annexes March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 39
Introduction to Python for Biologists Sets and dictionaries Sets I A Set is a mutable unordered collection of objects 1 S0 = set ( ) # an empty set 2 S0 = { a, 1} # a new set of 2 items 3 S1 = { a, 1, b, R } # a new set of 4 items 4 S2 = { a, 1, b, S } # a new set of 4 items 5 len ( S0 ) # 2 6 7 # Operators 8 R i n S1 # True 9 R not i n S2 # True 10 S1 S2 # i n S1 but not i n S2 => { R } 11 S1 S2 # i n S1 or i n S2 => {1, a, S, R, b } 12 S1 & S2 # i n S1 and i n S2 => {1, b, a } 13 S1 ˆ S2 # i n S1 or i n S2 but not i n both => { R, S } 14 S0 <= S1 # S0 i s subset of S2 => True 15 S1 >= S2 # S1 i s superset of S2 => False 16 S1 >= S0 # True 17 S0. i s d i s j o i n t ( S1 ) # False March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 40
Introduction to Python for Biologists Sets and dictionaries Sets II 1 # Methods 2 S0. copy ( ) # r e t u r n a new set w ith a shallow copy of S0 3 S0. add ( item ) # add element item to the set 4 S0. remove ( item ) # remove element item from the set 5 S0. discard ( item ) # remove element item from the set i f present 6 S0. pop ( ) # remove and r e t u r n an a r b i t r a r y element 7 S0. c l e a r ( ) # remove a l l elements from the set March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 41
Introduction to Python for Biologists Sets and dictionaries Dictionaries I A Dictionary is a mutable indexed collection of objects (indexed by unique keys) 1 d = {} # empty d i c t i o n a r y 2 d = { A : ALA, C : CYS } # d i c t i o n a r y w ith 2 items 3 d [ A ] # ALA 4 d [ C ] # CYS 5 d [ H ]= HIS # add new item 6 d # { H : HIS, C : CYS, A : ALA } 7 del d [ A ] # { C : CYS, H : HIS } 8 9 C i n d # True ( key C i s i n d ) 10 A not i n d # True ( key A i s not i n d anymore ) March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 42
Introduction to Python for Biologists Sets and dictionaries Dictionaries II d[key] get value by key d[key] = val set value by key del d[key] delete item by key d.clear() delete all items len(d) number of items d.copy() make a shallow copy d.keys() return a view of all keys d.values() return a view of all values d.items() return a view of all items (key,value) d.update(d2) add all items from dictionary d2 d.get(key [, val]) get value by key if exists, otherwise val d.setdefaults(key [, val]) like d.get(k,val), also set d[k]=val if k not in d pop(key[, default]) remove key and return its value, return default otherwise. d.popitem() remove a random item and returns it as tuple Table: Functions for dictionaries March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 43
Introduction to Python for Biologists Convert and copy Introduction Running code Literals and variables Numeric types Strings Exercise Lists, tuples and ranges Sets and dictionaries Convert and copy Loops Functions Branching Regular Expressions Annexes March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 44
Introduction to Python for Biologists Convert and copy Converting types I Many Python functions are sensitive to the type of data. For example, you cannot concatenate a string with an integer: 1 sign = You are + 21 + years old # e r r o r!! 2 sign = You are + s t r ( 21) + years old # OK 3 sign # You are 21 years old 4 5 # convert to i n t ( from s t r or f l o a t ) 6 i n t ( 2014 ) # from a s t r i n g 7 i n t (3.141592) # from a f l o a t 8 9 # convert to f l o a t ( from s t r or i n t ) 10 f l o a t ( 1.99 ) # from a s t r i n g 11 f l o a t ( 5 ) # from an i n t e g e r March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 45
Introduction to Python for Biologists Convert and copy Converting types II 1 # convert to s t r ( from i n t, f l o a t, l i s t, tuple, d i c t and set ) 2 s t r (3.141592) # 3.141592 3 s t r ( [ 1, 2, 3, 4 ] ) # [ 1, 2, 3, 4 ] 4 5 # convert a sequence type to another 6 # ( s t r, l i s t, tuple, and set f u n c t i o n s ) 7 new set = set ( o l d l i s t ) # l i s t to set 8 new tuple = t u p l e ( o l d l i s t ) # l i s t to t u p l e 9 new set = set ( Hello ) # s t r i n g to set { H, o, e, l } 10 n e w l i s t = l i s t ( Hello ) # s t r i n g to l i s t [ H, e, l, l, o ] March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 46
Introduction to Python for Biologists Convert and copy Copy I Assignments (=) do not copy objects, they create bindings between a target and an object. 1 # Numeric types ( immutable ) 2 a = 1 # a binds the o b j e c t 1 3 b = a # b binds the o b j e c t 1 4 b = b + 1 # b binds a new o b j e c t created by the sum 5 a # 1 6 b # 2 7 8 # S t r i n g s ( immutable ) 9 a = Hello # a binds the o b j e c t Hello 10 b = a # b binds the o b j e c t Hello 11 a = a. replace ( o, o World! ) # a binds a new o b j e c t 12 a # Hello World! 13 b # Hello March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 47
Introduction to Python for Biologists Convert and copy Copy II For collections that are mutable or contain mutable items, a shallow copy is sometimes needed so one can change one copy without changing the other. 1 # D i c t i o n a r y ( mutable ) 2 d1 = { A : ALA, C : CYS } # d1 binds the o b j e c t 3 d2 = d1 # d2 binds the o b j e c t 4 d2 [ H ]= HIS # add item to the o b j e c t 5 d1 # { A : ALA, H : HIS, C : CYS } 6 d2 # { A : ALA, H : HIS, C : CYS } 7 8 d2 = d1. copy ( ) # d2 binds a shallow copy of the o b j e c t 9 d2 [ P ]= PRO # add item to the copied o b j e c t 10 d1 # { A : ALA, H : HIS, C : CYS } 11 d2 # { A : ALA, H : HIS, P : PRO, C : CYS } March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 48
Introduction to Python for Biologists Convert and copy Copy III 1 # L i s t ( mutable ) 2 l 1 = [ A, H, C ] 3 l 2 = l 1 4 l 2. append ( P ) 5 l 1 # [ A, H, C, P ] 6 l 2 # [ A, H, C, P ] 7 8 l 2 = l 1 [ : ] # shallow copy by assigning a s l i c e of the a l l l i s t 9 l 2. append ( V ) 10 l 1 # [ A, H, C, P ] 11 l 2 # [ A, H, C, P, V ] March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 49
Introduction to Python for Biologists Convert and copy Copy IV Convert types to get copies 1 n e w l i s t = l i s t ( o l d l i s t ) # shallow copy 2 new dict = d i c t ( o l d d i c t ) # shallow copy 3 new set = set ( o l d l i s t ) # copy l i s t as a set 4 new tuple = t u p l e ( o l d l i s t ) # copy l i s t a t u p l e The copy module 1 import copy 2 x. copy ( ) # shallow copy of x 3 x. deepcopy ( ) # deep copy of x, i n c l u d i n g embedded o b j e c t s March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 50
Introduction to Python for Biologists Loops Introduction Running code Literals and variables Numeric types Strings Exercise Lists, tuples and ranges Sets and dictionaries Convert and copy Loops Functions Branching Regular Expressions Annexes March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 51
Introduction to Python for Biologists Loops For loop I 1 # For items i n a l i s t 2 f o r person i n [ I s a b e l, Kate, Michael ] : 3 p r i n t ( Hi, person ) 4 # Hi I s a b e l 5 # Hi Kate 6 # Hi Michael 7 8 # For items i n a d i c t i o n a r y 9 seq = # an empty s t r i n g 10 d = { A : ALA, C : CYS } # a d i c t i o n a r y w ith 2 keys 11 f o r k i n d. keys ( ) : # loop over the keys 12 seq += d [ k ] # append value to seq 13 p r i n t ( seq ) # CYSALA March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 52
Introduction to Python for Biologists Loops For loop II 1 # For items i n a s t r i n g 2 f o r c i n abc : 3 p r i n t ( c ) 4 # a 5 # b 6 # c 7 8 # For items i n a range 9 f o r n i n range ( 3 ) : 10 p r i n t ( n ) 11 # 0 12 # 1 13 # 2 14 15 # For items from any i t e r a t o r 16 f o r n i n i t e r a t o r : 17 p r i n t ( n ) March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 53
Introduction to Python for Biologists Loops Enumerate 1 # loop g e t t i n g index and value 2 RNAs = [ mirna, trna, mrna ] 3 f o r i, rna i n enumerate (RNAs) : 4 p r i n t ( i, rna ) 5 # 0 mirna 6 # 1 trna 7 # 2 mrna 8 9 # loop over 2 l i s t s 10 RNAtypes = [ micro, t r a n s f e r, messenger ] 11 f o r i, t i n enumerate ( RNAtypes ) : 12 r = RNAs[ i ] 13 p r i n t ( i, t, r ) 14 # 0 micro mirna 15 # 1 t r a n s f e r trna 16 # 2 messenger mrna March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 54
Introduction to Python for Biologists Loops While loop 1 i =0 2 value=1 3 while value <200: 4 i +=1 5 value = i 6 p r i n t ( i, value ) 7 # 1 1 8 # 2 2 9 # 3 6 10 # 4 24 11 # 5 120 12 # 6 720 March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 55
Introduction to Python for Biologists Introduction Running code Literals and variables Numeric types Strings Exercise Lists, tuples and ranges Sets and dictionaries Convert and copy Loops Functions Branching Regular Expressions Annexes March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 56
Introduction to Python for Biologists Exercise URL https://cbdm.uni-mainz.de/mb17 Jupyter Notebook File: Sequences.ipynb Download the file into the notebooks folder Data file File: shrub dimensions.csv Download the file into the data folder March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 57
Introduction to Python for Biologists Functions Introduction Running code Literals and variables Numeric types Strings Exercise Lists, tuples and ranges Sets and dictionaries Convert and copy Loops Functions Branching Regular Expressions Annexes March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 58
Introduction to Python for Biologists Functions Functions I 1 from random import choice # import f u n c t i o n choice 2 3 # Simple f u n c t i o n 4 def kmerfixed ( ) : # d e f i n e f u n c t i o n kmerfixed 5 p r i n t ( ACGTAGACGC ) # p r i n t predefined s t r i n g 6 7 kmerfixed ( ) # d i s p l a y ACGTAGACGC 8 9 # Returning a value 10 def kmer10 ( ) : # d e f i n e f u n c t i o n kmer10 11 seq= # d e f i n e an empty s t r i n g 12 f o r count i n range ( 10) : # repeat 10 times 13 seq += choice ( CGTA ) # add 1 random nt to s t r i n g 14 r e t u r n ( seq ) # r e t u r n s t r i n g 15 16 newkmer = kmer10 ( ) # get r e s u l t of f u n c t i o n i n t o v a r i a b l e 17 p r i n t ( newkmer ) # c a l l the f u n c t i o n e. g. ACGGATACGC March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 59
Introduction to Python for Biologists Functions Functions II 1 # One parameter 2 def kmer ( k ) : # defin e kmer with 1 param. k 3 seq= 4 f o r count i n range ( k ) : # k i s used to d e f i n e the range 5 seq+=choice ( CGTA ) 6 r e t u r n ( seq ) 7 8 p r i n t ( kmer ( k =4) ) # e. g. TACC 9 p r i n t ( kmer (20) ) # e. g. CACAATGGGTACCCCGGACC 10 p r i n t ( kmer ( 0 ) ) # 11 p r i n t ( kmer ( ) ) # TypeError : kmer ( ) missing 1 r e q u i r e d 12 # p o s i t i o n a l argument : k March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 60
Introduction to Python for Biologists Functions Functions III 1 # Parameters with more parameters and d e f a u l t values 2 def generic kmer ( alphabet= ACGT, k=10) : 3 seq= 4 f o r count i n range ( k ) : 5 seq+=choice ( alphabet ) 6 r e t u r n ( seq ) 7 8 generic kmer ( AB12, 15) # e. g. 112AA1A12AA1121 9 generic kmer ( AB12 ) # e. g. 1AA1B1BA2A 10 generic kmer ( k=20) # e. g. GTGGGCTTGTGCCCTGCACT 11 generic kmer ( ) # e. g. CTTGCCGGGA 12 generic kmer ( k =8, alphabet= #$%& ) # e. g. $$#&%$%$ March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 61
Introduction to Python for Biologists Functions Name spaces I Variable and function names defined globally can be seen in functions: this is the global namespace 1 a = 10 # g l o b a l v a r i a b l e 2 3 def my function ( ) : 4 p r i n t ( a ) # w i l l use the g l o b a l v a r i a b l e 5 6 my function ( ) # 10 ( the g l o b a l a ) 7 p r i n t ( a ) # 10 ( the g l o b a l a ) March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 62
Introduction to Python for Biologists Functions Name spaces II Names defined within a function can not be seen outside: the function has its own namespace. 1 a = 10 # g l o b a l v a r i a b l e 2 3 def my function ( ) : 4 a = 1 # l o c a l v a r i a b l e defined by assignment 5 b = 2 # l o c a l v a r i a b l e defined by assignment 6 p r i n t ( a ) 7 8 my function ( ) # 1 ( the l o c a l a ) 9 p r i n t ( a ) # 10 ( the g l o b a l a ) 10 p r i n t ( b ) # NameError : name b i s not defined March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 63
Introduction to Python for Biologists Functions Name spaces III Use parameters and returned values to get and set variables outside the name space 1 a = 10 # g l o b a l v a r i a b l e 2 3 def my function ( v a l ) : # l o c a l v a r i a b l e v a l 4 b = 2 5 v a l = v a l + b 6 r e t u r n ( v a l ) 7 p r i n t ( a ) # 10 ( the g l o b a l a ) 8 p r i n t ( my function ( a ) ) # 12 9 p r i n t ( a ) # 10 ( the g l o b a l a unchanged ) 10 11 c = my function ( a ) # set v a l to 10 and assign 10+2 to c 12 p r i n t ( c ) # 12 ( g l o b a l a was changed ) 13 p r i n t ( a ) # 10 ( g l o b a l a was unchanged ) 14 15 a = my function ( a ) # change g l o b a l a w ith value 10+2 March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 64
Introduction to Python for Biologists Branching Introduction Running code Literals and variables Numeric types Strings Exercise Lists, tuples and ranges Sets and dictionaries Convert and copy Loops Functions Branching Regular Expressions Annexes March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 65
Introduction to Python for Biologists Branching Truth Value Testing I Any object can be tested for truth value. The following values are considered false (other values are considered True): None False zero value: e.g. 0 or 0.0 an empty sequence or mapping: e.g., (), [ ], { }. Operations and built-in functions that have a Boolean result always return 0 for False and 1 for True March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 66
Introduction to Python for Biologists Branching Boolean Operations I A Boolean is equal to True or False a and b (true if a and b are true, false otherwise) a or b (true if a or b is true (1 alone or both), false otherwise) a ˆ b (true if either a or b is true (not both), false otherwise) not b (true if b is false, false otherwise) March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 67
Introduction to Python for Biologists Branching Boolean Operations II All example code for tests below return True unless otherwise specified 1 # l e t set values of 3 v a r i a b l e s ( s i n g l e = symbol ) 2 a = True 3 b = False 4 c = True 5 6 7 # simple t e s t s using two = symbols (==) 8 a == True 9 b == False 10 c == True March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 68
Introduction to Python for Biologists Branching Boolean Operations III 1 # l e t set values of 3 v a r i a b l e s ( one = symbol ) 2 a = True 3 b = False 4 c = True 5 6 # order i s i r r e l e v a n t 7 ( a or b ) == ( b or a ) 8 ( a and b ) == ( b and a ) 9 10 # n e u t r a l ( whatever value of a ) 11 ( a or False ) == a 12 ( a and True ) == a 13 14 # always the same ( whatever value of a ) 15 ( a and False ) == False 16 ( a or True ) == True March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 69
Introduction to Python for Biologists Branching Boolean Operations IV 1 # l e t set values of 3 v a r i a b l e s ( one = symbol ) 2 a = True 3 b = False 4 c = True 5 6 # precedence == > not > and > or 7 ( a and b or c ) == ( ( a and b ) or c ) 8 ( not a == b ) == ( not ( a == b ) ) 9 10 # e q u i v a l e n t expressions 11 ( ( a or b ) or c ) == ( a or ( b or c ) ) == ( a or b or c ) 12 ( a or a or a ) == a 13 ( b and b and b ) == b 14 15 b and b and b == b # False and False and True => False!! 16 17 a and ( b or c ) == ( a and b ) or ( a and c ) 18 a or ( b and c ) == ( a or b ) and ( a or c ) March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 70
Introduction to Python for Biologists Branching Comparisons 1 Operations 2 < # s t r i c t l y l ess than 3 <= # less than or equal 4 > # s t r i c t l y g r e a t e r than 5 >= # g r e a t e r than or equal 6 == # equal ( two symbols =) 7 math. i s c l o s e ( a, b ) # equal f o r f l o a t i n g p o i n t s a and b 8! = # not equal 9 i s # o b j e c t i d e n t i t y 10 i s not # negated o b j e c t i d e n t i t y 11 x < y <= z # i s e q u i v a l e n t to x < y and y <= z Comparisons between objects of same class are supported if operator defined for the class. Different numerical types can be compared: e.g. 2<4.56 Floating points can not be compared exactly due to the limited precision to represent infinite numbers such as 1/3 = 0.33333... March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 71
Introduction to Python for Biologists Branching Conditionals IF-ELIF-ELSE 1 seq = ATGAnnATG 2 i f n i n seq : 3 p r i n t ( sequence contains undefined bases ( n ) ) 4 e l i f x i n seq : 5 p r i n t ( sequence contains unknown bases x but not n ) 6 else : 7 p r i n t ( no undefined bases i n sequence ) 8 9 # 10 # sequence contains undefined bases ELIF and ELSE are optional multiple ELIF are possible March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 72
Introduction to Python for Biologists Introduction Running code Literals and variables Numeric types Strings Exercise Lists, tuples and ranges Sets and dictionaries Convert and copy Loops Functions Branching Regular Expressions Annexes March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 73
Introduction to Python for Biologists Exercise URL https://cbdm.uni-mainz.de/mb17 Jupyter Notebook File: Conditionals.ipynb Download the file into the notebooks folder March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 74
Introduction to Python for Biologists Regular Expressions Introduction Running code Literals and variables Numeric types Strings Exercise Lists, tuples and ranges Sets and dictionaries Convert and copy Loops Functions Branching Regular Expressions Annexes March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 75
Introduction to Python for Biologists Regular Expressions RE: Regular Expressions I Regular expressions (called REs, or regexes, or regex patterns) are a powerful language for matching text patterns (re module) In Python a regular expression search is typically written as: 1 match = re. search ( expression, s t r i n g ) The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, re.search() returns a Match object (actually class sre.sre Match ) or None otherwise. March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 76
Introduction to Python for Biologists Regular Expressions RE: Regular Expressions II 1 import re # import re module 2 s t r = an example word : cat!! # Example s t r i n g 3 match = re. search ( r word : \w\w\w, s t r ) # Search a p a t t e r n 4 i f match : 5 p r i n t ( found, match. group ( ) ) # found word : cat 6 else : 7 p r i n t ( did not f i n d ) In the pattern string, \w codes a character (letter, digit or underscore) The r at the start of the pattern string designates a python raw string which passes through backslashes without change. March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 77
Introduction to Python for Biologists Regular Expressions RE: Basic Patterns Pattern Match a, X, 9, < ordinary characters match themselves exactly. a period matches any single character except newline \w matches a word character: a letter or digit or underbar [a-za-z0-9 ] \W matches any non-word character \b boundary between word and non-word \s a single whitespace character space, newline, return, tab, form [\n \r \t \f] \S matches any non-whitespace character \t tab \n newline \r return \d decimal digit [0-9] ˆ circumflex (top hat) matches the start of a string $ dollar matches the end of a string \ inhibits the specialness of a character. So, for example, use \. to match a period Table: Regular expressions: basic patterns March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 78
Introduction to Python for Biologists Regular Expressions RE: Basic examples I The basic rules of RE search for a pattern within a string are: The search proceeds through the string from start to end, stopping at the first match found All of the pattern must be matched, but not all of the string If match = re.search(pat, str) is successful, match is not None and in particular match.group() is the matching text March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 79
Introduction to Python for Biologists Regular Expressions RE: Basic examples II 1 match = re. search ( r i i i, p i i i g ) # found 2 match. group ( ) == i i i # True 3 4 match = re. search ( r i g s, p i i i g ) # not found 5 match == None # True 6 7 match = re. search ( r.. g, p i i i g ) # found 8 match. group ( ) == i i g # True 9 10 match = re. search ( r \d\d\d, p123g ) # found 11 match. group ( ) == 123 # True 12 13 match = re. search ( r \w\w\w, @@abcd!! ) # found 14 match. group ( ) == abc # True March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 80
Introduction to Python for Biologists Regular Expressions RE: Repetitions I Repetitions are defined using +, *,? and { } + means 1 or more occurrences of the pattern to its left e.g. i+ = one or more i s * means 0 or more occurrences of the pattern to its left? means match 0 or 1 occurrences of the pattern to its left curly brackets are used to specify exact number of repetitions e.g. A{5} for 5 A letters A{6,10} for 6 to 10 A letters Leftmost and Largest: First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible i.e. + and * go as far as possible (they are said to be greedy ). March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 81
Introduction to Python for Biologists Regular Expressions RE: Repetitions II 1 # simple r e p e t i t i o n s 2 re. search ( r p i +, p i i i g ). group ( ) # p i i i 3 re. search ( r p i?, ap ). group ( ) # p 4 re. search ( r p i?, a p i i ). group ( ) # p i 5 re. search ( r p i, ap ). group ( ) # p 6 re. search ( r p i, a p i i ). group ( ) # p i i 7 re. search ( r p i {3}, a p i i i i i ). group ( ) # p i i i 8 re. search ( r i +, p i i g i i i i ). group ( ) # i i (1 s t h i t only ) 9 10 # 3 d i g i t s p o s s i b l y separated by whitespaces (\ s ) 11 re. search ( r \d\s \d\s \d, xx1 2 3xx ). group ( ) # 1 2 3 12 re. search ( r \d\s \d\s \d, xx12 3xx ). group ( ) # 12 3 13 re. search ( r \d\s \d\s \d, xx123xx ). group ( ) # 123 March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 82
Introduction to Python for Biologists Regular Expressions RE: Sets of characters I Square brackets indicate a set of characters [ABC] matches A or B or C. The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot Dash indicate a range or itself if put at the end [a-z] for lowercase alphabetic characters [a-za-z] for alphabetic characters [AB-] for A, B or dash Circumflex (ˆ) at the start inverts the set [ˆAB] for any character except A or B. March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 83
Introduction to Python for Biologists Regular Expressions RE: Sets of characters II 1 s t r = purple a l i c e b@google. com monkey dishwasher 2 match = re. search ( r \w+@\w+, s t r ) 3 i f match : 4 p r i n t match. group ( ) ## b@google 5 6 match = re. search ( r [ \w. ]+@[ \w. ]+, s t r ) 7 i f match : 8 p r i n t match. group ( ) ## a l i c e b@google. com March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 84
Introduction to Python for Biologists Regular Expressions RE: Functions I RE module functions: re.match() returns a Match object if occurrence found at begining of string, None otherwise re.search() returns a Match object for 1st occurrence, None if not found re.findall() returns a list of matched sub strings, an empty list if not found re.finditer() returns an iterator on Match objects of the occurrences, an empty iterator if not found Match object methods: match.start() returns start index match.end() returns end index match.span() returns start and end index in a tuple match.group() returns matched string March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 85
Introduction to Python for Biologists Regular Expressions RE: Functions II 1 import re 2 seq = RPAPPDRAPDQX # A sequence 3 expr = A.{ 1, 2}D # A and D separated by 1 or 2 characters 4 5 match = re. search ( expr, seq ) 6 i f match : 7 p r i n t ( 8 match. s t a r t ( ), # s t a r t index 9 match. end ( ), # end index 10 match. span ( ), # s t a r t and end index 11 match. group ( ), # the matched s t r i n g 12 seq [ match. s t a r t ( ) : match. end ( ) ], # the matched s t r i n g 13 sep= 14 ) 15 # 2 6 ( 2, 6) APPD APPD March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 86
Introduction to Python for Biologists Regular Expressions RE: Functions III 1 import re 2 seq = RPAPPDRAPDQX # A sequence 3 expr = A.{ 1, 2}D # A and D separated by 1 or 2 characters 4 5 match = re. match ( expr, seq ) # Not found at begining 6 p r i n t ( match ) 7 # None 8 9 matches = re. f i n d a l l ( expr, seq ) # Found 2 occurrences 10 p r i n t ( matches ) 11 # [ APPD, APD ] 12 13 matches = re. f i n d i t e r ( expr, seq ) # Found 2 occurrences 14 f o r m i n matches : # I t e r a t e over Match o b j e c t s 15 p r i n t ( m. span ( ), m. group ( ) ) # Use each Match o b j e c t 16 # ( 2, 6) APPD 17 # ( 7, 10) APD March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 87
Introduction to Python for Biologists Regular Expressions RE: Group Extraction Groups are defined with parentheses On a successful search match.group(): the whole match text match.group(1): match text of 1st left parenthesis match.group(2): match text of 2nd left parenthesis... 1 import re 2 s t r = purple a l i c e b@google. com monkey dishwasher 3 match = re. search ( ( [ \w. ]+)@( [ \w. ]+), s t r ) 4 i f match : 5 p r i n t ( match. group ( ) ) ## a l i c e b@google. com 6 p r i n t ( match. group ( 1 ) ) ## a l i c e b 7 p r i n t ( match. group ( 2 ) ) ## google. com March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 88
Introduction to Python for Biologists Regular Expressions RE: Group Extraction and Findall If the pattern includes a single set of parenthesis, then findall() returns a list of strings corresponding to that single group If the pattern includes 2 or more parenthesis groups, then instead of returning a list of strings, findall() returns a list of tuples. Each tuple represents one match of the pattern, and inside the tuple is the group(1), group(2)... data. 1 s t r = alice@google. com, monkey bob@abc. com dishwasher 2 t u p l e s = re. f i n d a l l ( r ( [ \w\. ]+)@( [ \w\. ]+), s t r ) 3 p r i n t ( t u p l e s ) 4 # [ ( a l i c e, google. com ), ( bob, abc. com ) ] 5 6 f o r t i n t u p l e s : 7 p r i n t ( t [ 0 ], t [ 1 ], sep= ) 8 # a l i c e google. com 9 # bob abc. com March 21, 2017 Johannes Gutenberg-Universität Mainz Taškova & Fontaine 89