The distribution of characters, bi- and trigrams in the Uppsala 70 million words Swedish newspaper corpus


Uppsala University
Department of Linguistics

Bengt Dahlqvist

Abstract

The paper describes some characteristics of the word and single-character distributions, as well as the distribution of bi- and trigrams, in the Uppsala Newspaper Corpus. This corpus consists of more than 70 million words collected from all articles published in two major Swedish daily newspapers between the years 1995 and 1996.

1 The text material

The Uppsala newspaper corpus consists of somewhat more than 70 million words. It was collected from all the published articles in the Swedish newspapers Svenska Dagbladet (SvD) and Upsala Nya Tidning (UNT) between the years 1995 and 1996. In total it comprises over 220 000 single articles.

        Articles   Tokens       Types       Characters
SvD     159 691    47 433 729   1 282 264   311 618 803
UNT     60 395     22 810 171   770 660     149 857 738
Total   220 086    70 243 900   1 672 993   461 476 541

Table 1. Frequency information for the texts in the newspaper corpus.

The text material was delivered from the papers in the form of ANSI-coded text files, including specific format codes and other information codes used by the papers. After the extraction of the plain text from this material, the files contained texts as shown in table 1 above. A token is defined as any character string delimited by whitespace (spaces etc.). The types are the unique tokens in the corpus.

ANSI   CHAR      FREQUENCY    PERCENT
32     (space)   67 284 667   14.58
100    e         36 038 688   7.81
96     a         34 245 860   7.42
113    r         31 854 659   6.90
109    n         31 689 432   6.87
115    t         31 446 380   6.81
114    s         24 225 946   5.25
104    i         20 832 702   4.51
107    l         19 198 444   4.16
99     d         15 790 640   3.42
110    o         15 681 073   3.40
108    m         12 552 977   2.72
106    k         11 919 428   2.58
102    g         11 630 760   2.52
117    v         8 310 382    1.80
101    f         7 441 882    1.61
103    h         7 275 872    1.58
227    ä         7 159 929    1.55
149    p         6 868 752    1.49
116    u         6 599 762    1.43
228    å         5 869 976    1.27
245    ö         5 454 781    1.18
97     b         5 338 211    1.16
98     c         4 703 485    1.02
45     .         4 554 969    0.99
43     ,         2 749 080    0.60
105    j         2 630 884    0.57
93     ^         2 567 781    0.56
120    y         2 296 085    0.50
44     -         1 136 239    0.25
47     0         1 062 587    0.23
48     1         1 036 179    0.22
49     2         587 964      0.13
56     9         587 046      0.13
52     5         488 902      0.11
50     3         452 088      0.10
119    x         404 963      0.09
51     4         402 365      0.09
33     "         382 668      0.08
57     :         377 164      0.08
53     6         339 864      0.07
55     8         319 545      0.07
54     7         310 461      0.07
118    w         295 340      0.06
40     )         294 811      0.06
146    (         264 040      0.06
38     '         178 765      0.04
121    z         144 557      0.03
63     ?         109 693      0.02
232    é         108 172      0.02
46     /         67 953       0.01
112    q         47 396       0.01
33     !         43 648       0.01
59     <         33 656       0.01
61     >         33 566       0.01
58     ;         29 971       0.01
170    -         19 516       0.00
37     &         18 627       0.00
92     ]         16 815       0.00
64     [         16 813       0.00

Table 2. The 60 most frequent characters in the corpus, sorted by descending frequency.

Listed in table 2 above are the 60 most frequent characters in the corpus. This table differs from the corresponding table in [2] in its ordering and in the symbol representation, which here includes only lower case letters (i.e. upper case letters are counted together with their lower case equivalents). Thus, the line for character e represents both the character e (occurring with a frequency of 35 210 610, ANSI-code 100) and the character E (frequency 828 078, ANSI-code 69). The table demonstrates that the single most frequent character in the corpus is the space character, which comprises 14.58 per cent of the corpus text. The most frequent letter is e, comprising 7.81 per cent of the corpus.

A more detailed description of the text material and the text extraction methods employed while constructing the corpus is to be found in [2]. That report presents a list of all the 256 ANSI characters in the corpus, sorted by code value and occurrence. In total, 191 different ANSI characters are present in the corpus with a total frequency of 461 476 541, which is the number of characters in the corpus as a whole.

2 The sets of alphabetic and non-alphabetic characters

Alphabetic symbols, letters and delimiters in the corpus are defined as follows:

letters (56) = { abcdefghijklmnopqrstuvwxyzåäöàáâãæçèéêëìíîïñòóôõøùúûüýÿß }
delimiters (14) = space, CR, formfeed, tab and the set {.,?!"()/_= }

Of course, the letters also include the upper case ones. Of these, 53 upper case letters are present in the corpus (ýÿß are only to be found in the lower case). All the other characters aside from these two sets (191 - 2x56 - 14 = 61) are symbols of the type +, -, : etc. See [2] for a full description of the occurrence of these characters.

a 7.42    b 1.16    c 1.02    d 3.42    e 7.81    f 1.61
g 2.52    h 1.58    i 4.51    j 0.57    k 2.58    l 4.16
m 2.72    n 6.87    o 3.40    p 1.49    q 0.01    r 6.90
s 5.25    t 6.81    u 1.43    v 1.80    w 0.06    x 0.09
y 0.50    z 0.03    å 1.27    ä 1.55    ö 1.18

à 0.00 (1621)    á 0.00 (3018)    â 0.00 (259)     ã 0.00 (81)
æ 0.00 (1145)    ç 0.00 (1190)    è 0.00 (3982)    é 0.02
ê 0.00 (366)     ë 0.00 (863)     ì 0.00 (30)      í 0.00 (668)
î 0.00 (159)     ï 0.00 (115)     ñ 0.00 (189)     ò 0.00 (70)
ó 0.00 (1360)    ô 0.00 (475)     õ 0.00 (61)      ø 0.00 (2014)
ù 0.00 (35)      ú 0.00 (219)     û 0.00 (55)      ü 0.00 (10641)
ý 0.00 (59)      ÿ 0.00 (27)      ß 0.00 (30)

Table 3. Distribution of letters (percentage) in the corpus.
Frequencies for low-frequency characters are given in parentheses.

In table 3, part of the information in table 2 is repeated, this time given only for letters and listed in alphabetic order. The letters stated represent both the upper case and lower case occurrences. Percentages that round to 0.00 (two decimals) are paired with the corresponding frequency in parentheses. For example, the low occurrence of the letter ü equals, with five decimals, 10 641 * 100 / 461 476 541 = 0.00231.

3 Probabilities

In terms of probability, statements like the following can be made: when a character is selected from the corpus with uniform probability, the probability of getting the letter e (or E) is 0.0781. This can be deduced from table 3. The probability of getting any other character is then of course 0.9219. Likewise, the probability of getting the space character is 0.1458.

Further, one can divide the set of letters into consonants, { bcdfghjklmnpqrstvwxzçñß }, and vowels, { ieäyöuaåoâîûêôàáãæèéëìíïòóõøùúüýÿ }. The probability of finding a vowel at random is then

P(vowel) = P(i) + P(e) + ... = 0.2959

and of finding a consonant

P(consonant) = P(b) + P(c) + ... = 0.5150

while all other characters account for the remaining 0.1891.

Moreover, if one assumes that the character randomly drawn is a letter, the probability of getting a vowel is expressed as

P'(vowel) = P(vowel | letter) = 0.2959 / (0.2959 + 0.5150) = 0.3649

and, in the same way, for a consonant:

P'(consonant) = P(consonant | letter) = 0.5150 / (0.2959 + 0.5150) = 0.6351

The two probabilities above are examples of conditional probabilities. They also show that the sum of these probabilities is still one:

P'(vowel) + P'(consonant) = P(vowel | letter) + P(consonant | letter) = 1

To postulate the letter property of the characters to be studied (i.e. to regard only this subset) is a natural assumption when one wants to work with well-formed words only, suitable for lexicon work.

Further, one can look at the bigrams and compute the conditional probability that a consonant follows a vowel (e.g. the bigram ab) in a word. This can be expressed as the probability P(consonant | vowel) = 0.3745. In this way the probabilities for all combinations of vowels and consonants can be computed. See the table below for an overview.

                  Vowel follows   Consonant follows
Vowel first       1.1081 %        37.4548 %
Consonant first   37.0533 %       24.3839 %

Table 4. Conditional probabilities for pairs of vowels/consonants in words.

4 Bi- and trigrams

In many cases, it is desirable to study the occurrence of pairs and triplets of characters belonging to the letter set, {abcdefghijklmnopqrstuvwxyzåäöàáâãæçèéêëìíîïñòóôõøùúûüýÿß}. Such n-tuples are usually denoted bi- and trigrams, respectively. Examples are ll, ck, abc and att.
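As an illustration of the reasoning in section 3, the following Python sketch estimates the class probabilities from a text and then conditions them on the character being a letter. It is not the program used for the study: the function names class_probabilities and conditional_on_letter are hypothetical, and the vowel and consonant sets are simply copied from the definitions in section 3.

```python
from collections import Counter

# Vowel and consonant sets as defined in section 3 (lower case only).
VOWELS = set("ieäyöuaåoâîûêôàáãæèéëìíïòóõøùúüýÿ")
CONSONANTS = set("bcdfghjklmnpqrstvwxzçñß")


def class_probabilities(text):
    """Return (P(vowel), P(consonant), P(other)) over all characters of text.

    Upper case letters are counted together with their lower case
    equivalents, as in table 3.
    """
    counts = Counter(text.lower())
    total = sum(counts.values())
    vowels = sum(n for ch, n in counts.items() if ch in VOWELS)
    consonants = sum(n for ch, n in counts.items() if ch in CONSONANTS)
    others = total - vowels - consonants
    return vowels / total, consonants / total, others / total


def conditional_on_letter(p_vowel, p_consonant):
    """Return (P(vowel | letter), P(consonant | letter))."""
    p_letter = p_vowel + p_consonant
    return p_vowel / p_letter, p_consonant / p_letter


if __name__ == "__main__":
    # Unconditional corpus probabilities quoted in section 3.
    print(conditional_on_letter(0.2959, 0.5150))
```

With the corpus figures 0.2959 and 0.5150 as input, conditional_on_letter returns approximately 0.3649 and 0.6351, which sum to one.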

Before starting an analysis of the bi- and trigrams in our newspaper corpus, we make the same restrictions regarding the tokens as in the previous paper [2]. That is, excluded from the set of tokens to be studied are strings consisting only of 1) numerals, 2) numerals in combination with non-letters or 3) non-letters proper. Remaining for further study are the tokens described in [2] as belonging to the data file fx.fre (containing frequency-sorted and, in the sense described earlier, only well-formed words), in total 68 003 683 tokens and 1 421 950 types.

Regarding all the tokens in the data file fx.fre as our fundamental data source, some basic facts can be established. The corpus then gives a total of 18 167 920 bigrams, 4 081 of them unique and with a total frequency of 437 567 251. In the same manner we have 16 745 970 trigrams, 52 233 unique and a total frequency of 369 563 568. All this concerns only transitions from letter to letter or to a word ending (i.e. multiple whitespaces are not counted).

4.1 Bigrams

In table 5 below, the 30 most frequent bigrams (from a letter to another letter) in the corpus are listed. The first column lists the bigrams themselves, followed by a frequency column and a column (%tot) stating the percentage of occurrence based on the proportion of the bigram over all the tokens (as defined in the data file fx.fre). Finally, a column (%letter) giving the percentage regarding only transitions between letters is included. This figure gives a more reasonable estimate of the expectancy of a specific character when regarding well-formed words.

Expressed in statistical terms, every bigram can be regarded as occurring with a conditional probability. For the bigram en this means that the probability for an e to be followed by an n in a word equals 0.0281, i.e. P(n | e) = 0.0281. If the possibility that the following character is a word delimiter, i.e. a word ending, is also included (so that the bigram can also be, for example, the letter e followed by a space), the probability is modified and expressed as P'(n | e) = 0.0192.

Bigram   Frequency   %tot   %letter
en       8421673     1.92   2.81
er       7534047     1.72   2.51
de       7086745     1.62   2.37
ar       6039869     1.38   2.02
an       5628493     1.29   1.88
et       4956729     1.13   1.65
in       4928119     1.13   1.65
st       4260531     0.97   1.42
te       4235201     0.97   1.41
tt       4086039     0.93   1.36
at       3696604     0.84   1.23
ra       3680278     0.84   1.23
ll       3584644     0.82   1.20
om       3253715     0.74   1.09
re       3244788     0.74   1.08
ti       3237829     0.74   1.08
ör       3215360     0.73   1.07
ta       3052891     0.70   1.02
nd       3042116     0.70   1.02
ng       2970977     0.68   0.99
na       2906141     0.66   0.97
la       2841574     0.65   0.95
ka       2825885     0.65   0.94
sk       2719853     0.62   0.91
fö       2546506     0.58   0.85
oc       2538351     0.58   0.85
ge       2505277     0.57   0.84
är       2482250     0.57   0.83
me       2456293     0.56   0.82
li       2450202     0.56   0.82

Table 5. The 30 most frequent bigrams.

4.2 Trigrams

In the same way as table 5 shows the frequency (absolute and in per cent) of the bigram occurrences in the corpus, table 6 shows this for the trigrams. A trigram is defined as a three-letter character sequence. Here, as before for the bigrams, the first numeric column in the table shows the absolute frequency, followed by the same frequency in per cent, named %tot, over the whole set of characters in the corpus, and finally as a percentage of letters and word endings only, denoted %letter. The most frequent trigram is för, with an occurrence of 1.05%. The conditional probability of getting an r given that one already has the sequence fö and still expects a letter is then, as seen in the table, P(r | fö) = 0.0105.

Trigram   Frequency   %tot   %letter
för       2444710     0.56   1.05
att       2007974     0.46   0.86
ing       1945353     0.44   0.83
och       1926036     0.44   0.82
ter       1628759     0.37   0.70
det       1536625     0.35   0.66
and       1452293     0.33   0.62
nde       1393776     0.32   0.60
som       1349820     0.31   0.58
den       1323319     0.30   0.57
ill       1315802     0.30   0.56
gen       1244475     0.28   0.53
ska       1133369     0.26   0.49
ade       1101209     0.25   0.47
til       1081879     0.25   0.46
med       1065829     0.24   0.46
nin       1023526     0.23   0.44
var       974651      0.22   0.42
rna       969637      0.22   0.42
der       927216      0.21   0.40
lig       875909      0.20   0.37
sta       861292      0.20   0.37
nte       843181      0.19   0.36
era       810957      0.19   0.35
ver       792698      0.18   0.34
nge       783854      0.18   0.34
are       764519      0.17   0.33
ett       754293      0.17   0.32
int       749909      0.17   0.32
han       746329      0.17   0.32

Table 6. The 30 most frequent trigrams.
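The counting behind tables 5 and 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program used for the study: the function names are hypothetical, tokens are assumed to be already split on whitespace, and n-grams are only counted inside a token. The sketch keeps two quantities apart: the share of a bigram among all letter-to-letter bigrams (the kind of figure reported in the %letter column) and the transition probability from a given first letter (the kind of figure reported in table A2).

```python
from collections import Counter

# The letter set of section 2 (lower case only).
LETTERS = set("abcdefghijklmnopqrstuvwxyzåäöàáâãæçèéêëìíîïñòóôõøùúûüýÿß")


def ngram_counts(tokens, n):
    """Count letter n-grams inside each token; no n-gram spans two tokens."""
    counts = Counter()
    for token in tokens:
        word = token.lower()
        for i in range(len(word) - n + 1):
            gram = word[i:i + n]
            # Keep only n-grams that consist of letters throughout.
            if all(ch in LETTERS for ch in gram):
                counts[gram] += 1
    return counts


def share(counts, gram):
    """Relative frequency of one n-gram among all counted n-grams."""
    return counts[gram] / sum(counts.values())


def transition_probability(bigrams, first, second):
    """Estimate P(second | first) over letter-to-letter transitions."""
    from_first = sum(n for gram, n in bigrams.items() if gram[0] == first)
    return bigrams[first + second] / from_first


if __name__ == "__main__":
    bigrams = ngram_counts(["den", "en", "er"], 2)
    print(share(bigrams, "en"))                         # share among all bigrams
    print(transition_probability(bigrams, "e", "n"))    # share among bigrams starting with e
```

In the toy example the bigram en makes up half of all bigrams, whereas two out of three bigrams starting with e continue with n, which shows that the two figures generally differ.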

5 Further data

Some more interesting figures from the corpus are to be found in the appendix. Table A1 shows the distribution of letters at word beginnings and endings. Table A2 shows the transition probabilities, in per cent, between different letters. Tables A3 and A4 show the bigram word beginnings and endings in a way analogous to the previous tables 5 and 6. Tables A5 and A6 show the distribution of, and some statistics for, the word lengths in the corpus. The word length distribution is also displayed in graphical form in fig. A1.

6 Conclusion

This paper has in a descriptive manner presented a number of fundamental characteristics of the newspaper corpus from the lowest level, involving the distribution of single characters, bigrams, trigrams and single words. Further, probabilities and conditional probabilities have been discussed as a means of computing the expectancy of character sequences.

While a thorough description of a corpus is of value in itself, this knowledge also has a number of useful applications. One example is proof-reading, where knowledge of the bi- and trigram distribution can be utilised for making spelling corrections. Also, the transition probabilities between letters can be used for text generation, using stochastic models of different orders. Another classic example of the use of bi- and trigram probabilities is in cryptanalysis, to solve ciphers.
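The text generation application mentioned above can be illustrated by a first-order stochastic model in which each letter is drawn according to the transition frequencies from the preceding letter. The following Python sketch is only one possible reading of that idea and is not taken from the study: the function names, the use of "_" as a word boundary marker and the three training words are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

END = "_"  # hypothetical word boundary marker, used for both start and end


def transition_table(words):
    """Count transitions between neighbouring characters, boundaries included."""
    table = defaultdict(Counter)
    for word in words:
        padded = END + word.lower() + END
        for a, b in zip(padded, padded[1:]):
            table[a][b] += 1
    return table


def generate_word(table, rng, max_len=20):
    """Draw letters one at a time until the boundary marker or max_len is reached."""
    out = []
    current = END
    while len(out) < max_len:
        successors = table[current]
        letters = list(successors)
        weights = list(successors.values())
        # Weighted random choice according to the observed transition counts.
        current = rng.choices(letters, weights=weights)[0]
        if current == END:
            break
        out.append(current)
    return "".join(out)


if __name__ == "__main__":
    table = transition_table(["och", "att", "det"])
    rng = random.Random(0)  # fixed seed so that runs are repeatable
    print([generate_word(table, rng) for _ in range(5)])
```

A higher-order model would condition on the two or more preceding letters instead of one, at the cost of a larger table; the sketch assumes a non-empty list of training words.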

7 References

1. Dahlqvist, Bengt, Word Frequency Lists for the Uppsala Newspaper Corpus. Collection of Swedish word lists from the 70 million word Uppsala Newspaper Corpus, Dept. of Linguistics, Uppsala University, 1997.
2. Dahlqvist, Bengt, A Swedish Text Corpus for Generating Dictionaries, Project Report 3.1.3, EC-project Scarrie, 1998.
3. Gaines, H. F., Cryptanalysis, a Study of Ciphers and Their Solutions, Dover Publications, ISBN 0486200973, 1939.
4. Oakes, Michael P., Statistics for Corpus Linguistics, Edinburgh University Press, ISBN 0748608176, 1998.

Appendix

%First   Letter   %Last
6.97     a        10.88
4.41     b        0.12
0.67     c        0.09
6.77     d        3.81
4.74     e        7.90
7.14     f        0.24
2.35     g        3.58
5.08     h        2.98
5.94     i        4.08
1.18     j        0.12
4.19     k        1.37
2.69     l        3.10
6.16     m        4.23
2.52     n        15.04
5.23     o        0.53
4.28     p        0.54
0.01     q        0.00
2.13     r        15.33
11.75    s        5.52
4.81     t        14.41
2.50     u        0.51
4.32     v        1.89
0.23     w        0.05
0.01     x        0.12
0.16     y        0.34
0.04     z        0.04
0.86     å        3.03
2.11     ä        0.01
0.76     ö        0.08

Table A1. First and last letter for words, distribution in per cent.

A  preceded by: E 1, I 1, F 2, J 2, B 2, P 3, S 5, M 6, D 6, V 6, G 6, H 7, K 10, L 10, N 10, T 10, R 12
   followed by: R 23, N 21, T 14, L 8, D 7, G 5, S 5, V 5, M 5, K 2, P 1, F 1, B 1, C 1
B  preceded by: Å 1, Ö 1, V 1, Y 1, K 1, G 2, D 3, I 3, N 5, T 5, L 5, U 5, M 7, B 7, S 8, O 8, E 9, A 9, R 19
   followed by: E 26, A 14, L 13, O 10, R 9, I 8, Y 4, U 4, Ö 3, B 3, Ä 3, Å 2, J 1, S 1
C  preceded by: C 1, R 1, T 1, U 1, N 2, Ä 2, S 4, A 5, E 6, Y 7, I 9, O 60
   followed by: H 48, K 29, E 9, I 5, A 3, O 2, L 1, U 1, C 1, Y 1
D  preceded by: G 1, V 1, Y 1, S 1, Ä 1, Ö 2, D 2, U 2, Å 2, O 3, L 5, R 7, I 9, A 16, E 17, N 27
   followed by: E 54, A 13, R 5, I 5, S 4, O 3, L 2, Ä 2, D 2, N 2, U 1, Å 1, Ö 1, J 1, V 1, G 1, B 1
E  preceded by: F 1, J 1, C 1, I 2, P 3, H 3, K 4, B 4, N 4, V 5, S 5, L 6, M 7, G 8, R 10, T 13, D 22
   followed by: N 27, R 25, T 16, L 7, D 6, S 5, M 3, K 2, F 2, G 2, V 1, C 1, X 1, B 1, P 1, U 1
F  preceded by: V 1, Ö 1, G 1, K 1, Y 1, P 1, U 1, Ä 2, D 2, T 4, M 4, L 5, S 6, R 7, N 8, O 8, F 8, I 9, A 11, E 18
   followed by: Ö 35, R 13, A 8, T 8, I 8, O 7, L 4, Å 4, E 4, F 3, Ä 2, U 2, Y 1, S 1
G  preceded by: P 1, L 1, D 1, T 1, U 1, Ö 1, Y 2, G 3, O 3, E 5, Å 5, R 5, Ä 5, A 15, I 21, N 30
   followed by: E 27, A 19, R 7, S 7, T 7, Å 4, I 4, O 4, G 3, N 3, Ö 2, H 2, L 2, Ä 2, U 2, D 1, J 1, Y 1
H  preceded by: D 1, I 1, P 1, L 1, A 2, O 2, M 3, N 3, K 4, E 4, S 4, R 4, G 5, T 5, C 58
   followed by: A 37, E 20, O 11, U 6, Ä 6, Ö 6, Å 4, I 4, J 1, L 1, Y 1, R 1, N 1
I  preceded by: E 1, P 1, H 1, C 1, K 2, B 2, G 2, F 3, D 4, M 6, S 8, V 9, N 10, R 14, L 15, T 19
   followed by: N 27, L 12, G 12, S 10, T 7, K 6, D 5, O 4, V 4, E 3, C 2, A 2, R 2, F 1, M 1, P 1
J  preceded by: P 1, I 1, V 1, M 1, K 2, F 2, A 3, O 3, N 3, E 4, B 4, H 4, G 5, D 5, T 7, R 8, Ö 8, S 15, L 24
   followed by: A 25, U 17, O 15, Ä 14, E 13, Ö 6, L 3, D 2, N 1, K 1, S 1
K  preceded by: Y 1, Å 1, Ä 2, U 2, O 3, L 3, Ö 3, N 5, E 6, A 7, R 8, I 12, C 15, S 30
   followed by: A 26, O 14, E 12, T 11, R 7, L 5, U 5, S 4, N 4, I 3, V 2, Ä 2, Ö 2, H 1, Y 1
L  preceded by: Y 1, Ö 1, M 1, G 1, T 1, N 2, D 2, F 2, P 2, Å 2, R 3, U 3, K 3, S 4, B 4, Ä 5, O 7, E 11, I 12, A 13, L 21
   followed by: L 21, A 17, I 14, E 12, S 4, T 4, Ä 4, D 4, O 3, J 3, U 3, M 2, V 2, Å 1, K 1, Y 1, Ö 1, N 1
M  preceded by: N 1, T 1, Y 1, Ö 2, I 2, U 3, L 4, Ä 4, S 4, R 4, E 9, M 10, A 15, O 39
   followed by: E 25, A 17, I 10, M 9, O 6, Å 4, Ä 3, U 3, S 3, P 2, T 2, N 2, Ö 2, Y 2, B 2, L 2, F 1, H 1, R 1, G 1
N  preceded by: Ö 1, L 1, M 1, D 1, G 1, S 1, T 1, L 1, Å 3, N 4, U 4, Ä 4, R 5, O 8, I 16, A 19, E 28
   followed by: D 14, G 14, A 14, S 11, T 9, I 8, E 7, N 6, O 4, Ä 2, K 2, U 1, Y 1, Å 1, L 1, F 1, V 1, H 1, B 1
O  preceded by: C 1, V 1, D 3, G 3, J 3, F 4, P 4, L 4, B 4, H 5, M 5, I 6, N 7, T 9, R 11, K 12, S 17
   followed by: M 21, C 17, N 15, R 15, L 8, T 6, S 3, P 2, D 2, G 2, K 2, V 1, F 1, B 1, U 1, H 1
P  preceded by: K 1, D 1, Ä 1, Y 1, T 1, N 1, X 1, L 2, R 2, I 3, Ö 3, E 4, M 6, O 9, A 9, S 15, U 16, P 24
   followed by: Å 19, E 15, P 15, R 13, A 12, O 8, L 5, S 3, I 3, T 2, U 2, G 1, N 1, H 1, F 1
Q  preceded by: Ö 1, K 1, U 1, Y 1, O 1, E 4, S 4, A 5, R 6, G 6, C 7, L 8, I 9, M 12, N 13, D 19
   followed by: U 54, V 42, I 1, A 1
R  preceded by: Y 1, I 1, R 1, B 2, U 2, G 2, D 2, K 3, P 3, F 3, Å 3, T 4, O 7, Ä 8, Ö 11, A 20, E 25
   followed by: A 17, E 15, I 11, S 7, N 7, O 6, T 5, D 4, K 4, Ä 3, Å 3, U 2, G 2, L 2, B 2, M 2, R 2, Ö 1, Y 1, V 1, F 1, H 1, J 1
S  preceded by: Å 1, V 1, P 1, Y 1, Ö 2, M 2, Ä 2, K 3, O 3, D 3, G 4, U 4, L 4, T 7, S 7, A 8, R 9, E 9, I 12, N 15
   followed by: T 21, K 13, O 10, E 8, A 8, I 7, S 6, V 3, L 3, Ä 3, P 3, Å 3, M 2, J 1, U 1, N 1, Ö 1, Y 1, B 1, C 1, H 1, D 1, F 1, R 1
T  preceded by: Ö 1, Y 1, M 1, Å 1, Ä 2, F 2, G 2, L 2, O 3, R 4, K 4, U 4, I 4, N 7, A 13, T 15, S 15, E 18
   followed by: E 20, T 19, I 15, A 14, R 6, S 5, O 5, Ä 2, U 2, Y 2, V 2, N 1, Å 1, L 1, H 1, Ö 1, J 1, B 1, F 1
U  preceded by: C 1, I 1, V 2, O 2, A 2, P 2, F 3, G 3, E 3, D 4, B 4, S 5, M 6, N 6, H 7, L 9, J 9, T 9, R 11, K 11
   followed by: N 21, T 19, R 10, P 10, S 10, L 9, D 4, M 4, K 4, B 2, G 1, V 1, E 1, A 1, C 1, F 1, I 1
V  preceded by: G 1, Å 1, D 2, U 2, N 3, O 4, Ä 4, R 4, K 5, L 5, E 6, T 6, Ö 7, I 12, S 13, A 25
   followed by: A 25, I 22, E 22, Ä 11, Å 6, S 3, D 2, O 1, T 1, U 1, L 1, R 1, N 1, G 1
W  preceded by: U 1, M 1, G 1, N 1, Ö 1, L 1, K 2, Y 2, W 2, I 3, D 3, T 4, R 6, H 6, A 7, S 10, E 22, O 25
   followed by: A 29, E 23, I 22, O 8, H 4, S 4, N 1, R 1, W 1, Ä 1, T 1, L 1, B 1
X  preceded by: Y 1, O 4, I 4, U 5, A 12, Ä 13, E 59
   followed by: E 24, T 23, P 13, A 11, I 10, L 3, U 3, N 3, O 2, K 1, C 1, F 1, J 1, H 1, B 1, D 1
Y  preceded by: O 1, C 1, G 1, A 1, E 2, D 2, H 2, F 4, K 5, M 8, S 9, B 9, L 11, R 12, N 14, T 16
   followed by: C 14, R 12, S 12, G 11, T 10, N 7, D 7, A 6, L 5, M 4, K 3, P 2, F 2, O 1, H 1, E 1, B 1, V 1
Z  preceded by: K 1, Y 1, H 1, C 1, L 2, D 3, S 3, U 4, R 4, O 6, E 7, N 9, Z 10, I 13, T 14, A 21
   followed by: A 19, E 18, I 17, O 12, Z 10, U 2, Y 2, N 2, H 2, L 2, J 2, K 1, B 1, Ü 1, É 1, M 1, T 1, S 1, D 1
Å  preceded by: K 1, B 2, D 3, H 4, T 4, L 5, N 5, F 5, V 7, G 8, M 8, S 11, R 13, P 24
   followed by: R 27, N 24, G 12, T 10, L 9, D 7, S 5, K 3, V 1, E 1
Ä  preceded by: G 2, B 3, F 3, K 4, D 4, M 5, H 6, J 6, N 8, T 9, L 11, S 11, R 12, V 14
   followed by: R 35, N 19, L 12, T 7, G 7, S 4, M 4, V 3, K 3, D 2, C 1, X 1, F 1
Ö  preceded by: N 1, D 2, J 3, B 3, K 4, M 4, T 4, G 5, S 5, L 5, R 6, H 6, F 52
   followed by: R 60, V 7, K 5, S 4, D 4, N 3, T 3, M 3, G 3, J 3, P 2, L 2

Table A2. Transition probabilities (in per cent) for letters preceding and succeeding a given letter.

2gram   Frequency   %tot   %letter   %cumul
de      3180302     0.73   4.69      4.693
i_      2169161     0.50   3.20      7.894
oc      2089508     0.48   3.08      10.977
fö      2054912     0.47   3.03      14.009
at      1687019     0.39   2.49      16.498
me      1657934     0.38   2.45      18.945
ha      1644380     0.38   2.43      21.371
in      1515100     0.35   2.24      23.607
en      1492035     0.34   2.20      25.808
so      1435274     0.33   2.12      27.926
ti      1335404     0.31   1.97      29.897
på      1209099     0.28   1.78      31.681
st      1203980     0.28   1.78      33.457
av      1099244     0.25   1.62      35.080
vi      1028109     0.23   1.52      36.597
är      1021960     0.23   1.51      38.105
va      952218      0.22   1.41      39.510
sk      922854      0.21   1.36      40.871
ko      827332      0.19   1.22      42.092
fr      818435      0.19   1.21      43.300
si      794275      0.18   1.17      44.472
be      756842      0.17   1.12      45.589
an      723787      0.17   1.07      46.657
ma      714712      0.16   1.05      47.711
om      687330      0.16   1.01      48.725
ut      643016      0.15   0.95      49.674
sa      579668      0.13   0.86      50.530
ka      566701      0.13   0.84      51.366
pr      553721      0.13   0.82      52.183
et      547347      0.13   0.81      52.991
re      541623      0.12   0.80      53.790
mi      533435      0.12   0.79      54.577
se      490070      0.11   0.72      55.300
up      470683      0.11   0.69      55.995
bl      467767      0.11   0.69      56.685
al      463396      0.11   0.68      57.369
vä      462839      0.11   0.68      58.052
sv      455347      0.10   0.67      58.723
tr      419490      0.10   0.62      59.342
un      418800      0.10   0.62      59.960
sä      417503      0.10   0.62      60.576
he      415506      0.09   0.61      61.190
kr      411171      0.09   0.61      61.796
ba      392872      0.09   0.58      62.376
li      390134      0.09   0.58      62.952
fi      389281      0.09   0.57      63.526
mo      380863      0.09   0.56      64.088
nä      369187      0.08   0.54      64.633
lä      359410      0.08   0.53      65.163
ja      350900      0.08   0.52      65.681

Table A3. Bigram word beginnings, the 50 most frequent ones.

2gram   Frequency   %tot   %letter   %cumul
en      5868402     1.34   8.66      8.656
er      3622902     0.83   5.34      14.001
et      2973205     0.68   4.39      18.386
tt      2763497     0.63   4.08      22.463
ar      2529303     0.58   3.73      26.194
de      2254913     0.52   3.33      29.520
om      2186256     0.50   3.22      32.745
_i      2169161     0.50   3.20      35.945
an      2044243     0.47   3.02      38.960
ch      1954875     0.45   2.88      41.844
är      1525250     0.35   2.25      44.094
na      1399438     0.32   2.06      46.158
ll      1286261     0.29   1.90      48.055
ör      1226840     0.28   1.81      49.865
ra      1193336     0.27   1.76      51.625
på      1131705     0.26   1.67      53.295
ka      1002717     0.23   1.48      54.774
te      964951      0.22   1.42      56.197
av      946713      0.22   1.40      57.594
ed      921446      0.21   1.36      58.953
re      905887      0.21   1.34      60.289
ng      878251      0.20   1.30      61.585
ta      862421      0.20   1.27      62.857
on      810796      0.19   1.20      64.053
ns      797584      0.18   1.18      65.229
la      696792      0.16   1.03      66.257
as      688934      0.16   1.02      67.274
ga      631479      0.14   0.93      68.205
ig      594303      0.14   0.88      69.082
år      551066      0.13   0.81      69.895
or      547839      0.13   0.81      70.703
ag      539231      0.12   0.80      71.498
gt      530286      0.12   0.78      72.280
st      487937      0.11   0.72      73.000
at      482902      0.11   0.71      73.712
in      461982      0.11   0.68      74.394
ts      443671      0.10   0.65      75.048
så      422426      0.10   0.62      75.672
nd      416573      0.10   0.61      76.286
es      415725      0.10   0.61      76.899
rt      402963      0.09   0.59      77.494
da      397337      0.09   0.59      78.080
ad      383630      0.09   0.57      78.646
ån      354507      0.08   0.52      79.169
kt      338113      0.08   0.50      79.667
id      326479      0.07   0.48      80.149
nt      320452      0.07   0.47      80.622
lt      314372      0.07   0.46      81.085
el      290417      0.07   0.43      81.514
ot      289701      0.07   0.43      81.941

Table A4. Bigram word endings, the 50 most frequent ones.
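Lists of the kind shown in tables A3 and A4 can be produced along the following lines. The Python sketch is an illustration only: the function names are hypothetical, and the padding of one-letter words with an underscore ("i_" at the beginning, "_i" at the end) is an assumption read off from the entries for the word i in the two tables.

```python
from collections import Counter
from itertools import accumulate


def edge_bigrams(words, where="start"):
    """Count the first (where="start") or last (where="end") bigram of each word."""
    counts = Counter()
    for word in words:
        w = word.lower()
        if not w:
            continue  # skip empty strings
        if where == "start":
            counts[(w + "_")[:2]] += 1   # "i" becomes "i_"
        else:
            counts[("_" + w)[-2:]] += 1  # "i" becomes "_i"
    return counts


def with_cumulative(counts):
    """Return rows (bigram, frequency, per cent, cumulative per cent), most frequent first."""
    total = sum(counts.values())
    ranked = counts.most_common()
    shares = [100 * n / total for _, n in ranked]
    return [
        (gram, n, pct, cum)
        for (gram, n), pct, cum in zip(ranked, shares, accumulate(shares))
    ]


if __name__ == "__main__":
    for row in with_cumulative(edge_bigrams(["i", "det", "den", "och"])):
        print(row)
```

The percentages in this sketch are taken over the number of words, so the cumulative column reaches 100 once every word beginning (or ending) has been listed.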

Length   Frequency   Percent     Cumul.%
1        3225100     0.859387    0.859387
2        8370116     4.460744    5.320131
3        16555548    13.234590   18.55473
4        7720999     8.229611    26.78434
5        7668507     10.217080   37.00141
6        6546302     10.466300   47.46771
7        4742100     8.845341    56.31305
8        3919260     8.354874    64.66793
9        3187981     7.645468    72.31339
10       2432196     6.481035    78.79443
11       1672300     4.901767    83.69620
12       1145687     3.663474    87.35967
13       794604      2.752580    90.11225
14       614136      2.291070    92.40332
15       480165      1.919232    94.32255
16       336460      1.434496    95.75705
17       248670      1.126466    96.88351
18       196146      0.940801    97.82432
19       126847      0.642214    98.46653
20       85824       0.457388    98.92392
21       63066       0.352907    99.27682
22       37018       0.217011    99.49384
23       23924       0.146625    99.64046
24       27479       0.175735    99.81620
25       8358        0.055679    99.87187
26       4422        0.030636    99.90251
27       5082        0.036563    99.93907
28       1816        0.013549    99.95262
29       1233        0.009528    99.96215
30       641         0.005124    99.96728
31       349         0.002883    99.97016
32       283         0.002413    99.97257
33       177         0.001556    99.97413
34       111         0.001006    99.97513
35       113         0.001054    99.97619
>35      880         0.023813    100
Total    70243900    100

Table A5. Distribution of word lengths in the corpus.

Word length:
Mean = 5.3425
Median = 6.2863
Standard deviation = 3.4837

Table A6. Statistics for the word length.
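The word length figures can be recomputed from a frequency table with a few lines of Python. The sketch below is an illustration with a hypothetical function name, not the original program. It returns two kinds of percentage, because a length class can be measured either by its share of the tokens or by its share of the characters. The Percent column of table A5 appears to be of the second kind: words of length 1, for example, make up about 4.6 per cent of the 70 243 900 tokens but are listed with 0.86 per cent.

```python
def length_statistics(freq_by_length):
    """Return (mean length, share of tokens in %, share of characters in %).

    freq_by_length maps a word length to the number of tokens of that length.
    """
    tokens = sum(freq_by_length.values())
    chars = sum(length * n for length, n in freq_by_length.items())
    token_share = {
        length: 100 * n / tokens for length, n in freq_by_length.items()
    }
    char_share = {
        length: 100 * length * n / chars for length, n in freq_by_length.items()
    }
    return chars / tokens, token_share, char_share


if __name__ == "__main__":
    # Toy data: two tokens of length 1 and two tokens of length 3.
    mean, token_share, char_share = length_statistics({1: 2, 3: 2})
    print(mean, token_share, char_share)
```

For the toy data the mean length is 2.0; each length holds half of the tokens, while the longer words hold three quarters of the characters.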

[Figure: bar chart titled "Word length" (Ordlängd); horizontal axis: number of characters (Tecken); vertical axis: per cent (Procent), 0 to 14.]

Fig. A1. Word lengths (no. of characters) and percentage of occurrence.