Sequential and Parallel Algorithms and Data Structures: The Basic Toolbox
Sanders, P., Mehlhorn, K., Dietzfelbinger, M., Dementiev, R.
This textbook is a concise introduction to the basic toolbox of structures that allow efficient organization and retrieval of data, key algorithms for problems on graphs, and generic techniques for modeling, understanding, and solving algorithmic problems. The authors aim for a balance between simplicity and efficiency, between theory and practice, and between classical results and the forefront of research. Individual chapters cover arrays and linked lists, hash tables and associative arrays, sorting and selection, priority queues, sorted sequences, graph representation, graph traversal, shortest paths, minimum spanning trees, optimization, collective communication and computation, and load balancing. The authors also discuss important issues such as algorithm engineering, memory hierarchies, algorithm libraries, and certifying algorithms. Moving beyond the sequential algorithms and data structures of the earlier related title, this book takes into account the paradigm shift towards the parallel processing required to solve modern performance-critical applications and how this impacts the teaching of algorithms. The book is suitable for undergraduate and graduate students and professionals familiar with programming and basic mathematical language. Most chapters have the same basic structure: the authors discuss a problem as it occurs in a real-life situation, they illustrate the most important applications, and then they introduce simple solutions as informally as possible and as formally as necessary so the reader really understands the issues at hand. As they move to more advanced and optional issues, their approach gradually leads to a more mathematical treatment, including theorems and proofs. The book includes many examples, pictures, informal explanations, and exercises, and the implementation notes introduce clean, efficient implementations in languages such as C++ and Java.
Year:
2019
Publisher:
Springer
Language:
English
Pages:
509 / 516
ISBN 10:
3030252094
ISBN 13:
9783030252090
File:
PDF, 7.52 MB
Peter Sanders · Kurt Mehlhorn · Martin Dietzfelbinger · Roman Dementiev

Sequential and Parallel Algorithms and Data Structures: The Basic Toolbox

Peter Sanders, Karlsruhe Institute of Technology, Karlsruhe, Germany
Kurt Mehlhorn, Max Planck Institute for Informatics, Saarbrücken, Germany
Martin Dietzfelbinger, Technische Universität Ilmenau, Ilmenau, Germany
Roman Dementiev, Intel Deutschland GmbH, Feldkirchen, Germany

Intel material (including descriptions of algorithms and their implementations, text, figures, and tables describing their experimental evaluation and experimental results on Intel processors, and sample code) is reproduced by permission of Intel Corporation.

ISBN 978-3-030-25208-3    ISBN 978-3-030-25209-0 (eBook)
https://doi.org/10.1007/978-3-030-25209-0

© Springer Nature Switzerland AG 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication.
Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

To all algorithmicists

Preface

An algorithm is a precise and unambiguous recipe for solving a class of problems. Formulated as a program, an algorithm can be executed by a machine. Algorithms are at the heart of every nontrivial computer application. Therefore, every computer scientist and every professional programmer should know about the basic algorithmic toolbox: structures that allow efficient organization and retrieval of data, key algorithms for problems on graphs, and generic techniques for modeling, understanding, and solving algorithmic problems.

This book is a concise introduction to this basic toolbox, intended for students and professionals familiar with programming and basic mathematical language. Although the presentation is concise, we have included many examples, pictures, informal explanations, and exercises, and some linkage to the real world. The book covers the material typically taught in undergraduate and first-year graduate courses on algorithms.

Most chapters have the same basic structure. We begin by discussing a problem as it occurs in a real-life situation. We illustrate the most important applications and then introduce simple solutions as informally as possible and as formally as necessary to really understand the issues at hand. When we move to more advanced and optional issues, this approach gradually leads to a more mathematical treatment, including theorems and proofs. Thus, the book should work for readers with a wide range of mathematical expertise.
There are also advanced sections (marked with a *), which we recommend that readers skip on a first reading. Exercises provide additional examples and alternative approaches, as well as opportunities to think about the problems. It is highly recommended to take a look at the exercises even if there is no time to solve them during the first reading.

In order to be able to concentrate on ideas rather than programming details, we use pictures, words, and high-level pseudocode to explain our algorithms. A section "Implementation Notes" links these abstract ideas to clean, efficient implementations in real programming languages such as C++ and Java. Each chapter ends with a section on further findings that provides a glimpse at the state of the art, generalizations, and advanced solutions.

The first edition of this book treated only sequential algorithms and data structures. Today, almost every computer, be it a desktop, a notebook, or a smartphone, has multiple cores, and sequential machines have become an exotic species. The reason for this change is that sequential processors have ceased to get proportional performance improvements from increased circuit complexity. Although the number of transistors in an integrated circuit (still) doubles every two years (Moore's law), the only reasonable way to use this transistor budget is to put multiple processor cores on a chip. The consequence is that nowadays every performance-critical application has to be parallelized. Moreover, big data – the explosion of data set sizes in many applications – has produced an enormous demand for algorithms that scale to a large number of processors. This paradigm shift has profound effects on teaching algorithms. Parallel algorithms are no longer a specialized topic reserved for a small percentage of students. Rather, every student needs some exposure to parallel algorithms, and parallel solution paradigms need to be taught early on.
As a consequence, parallel algorithms should be integrated tightly and early into algorithms courses. We therefore decided to include parallel algorithms in the second edition of the book. Each chapter now has some sections on parallel algorithms. The goals remain the same as for the first edition: a careful balance between simplicity and efficiency, between theory and practice, and between classical results and the forefront of research.

We use a slightly different style for the sections on parallel computing. We include concrete programming examples because parallel programming is still more difficult than sequential programming (the programs are available at github.com/basic-toolbox-sample-code/basic-toolbox-sample-code/). We also reference original work directly in the text instead of in the section on history because the parallel part is closer to current research.

Algorithmics is a modern and active area of computer science, even at the level of the basic toolbox. We have made sure that we present algorithms in a modern way, including explicitly formulated invariants. We also discuss important further aspects, such as algorithm engineering, memory hierarchies, algorithm libraries, and certifying algorithms.

We have chosen to arrange most of the material by problem domain and not by solution technique. The chapter on optimization techniques is an exception. We find that an organization by problem domain allows a more concise presentation. However, it is also important that readers and students obtain a good grasp of the available techniques. Therefore, we have structured the optimization chapter by techniques, and an extensive index provides cross-references between different applications of the same technique. Bold page numbers in the index indicate the pages where concepts are defined.

This book can be used in multiple ways in teaching.
We have used the first edition of the book and a draft version of the second edition in undergraduate and graduate courses on algorithmics. In a first-year undergraduate course, we concentrated on the sequential part of the book. In a second algorithms course at an advanced bachelor or master's level, for students with some experience in parallel programming, we made most of the sequential part of the book a prerequisite and concentrated on the more advanced material, such as the starred sections and the parts on external memory and parallel algorithms. If a first algorithms course is taught later in the undergraduate curriculum, the book can be used to teach parallel and sequential algorithms in an integrated way. Although we have included some material about parallel programming and several concrete programming examples, the parallel part of the book works best for readers who already have some background in parallel programming. Another approach is to use the book to provide concrete algorithmic content for a parallel programming course that uses another book for the programming part. Last but not least, the book should be useful for independent study and to professionals who have a basic knowledge of algorithms but less experience with parallelism.

Follow us on an exciting tour through the world of algorithms.

Ilmenau, Saarbrücken, Karlsruhe, Heidelberg, August 2018
Martin Dietzfelbinger, Kurt Mehlhorn, Peter Sanders, Roman Dementiev

A first edition of this book was published in 2008. Since then the book has been translated into Chinese, Greek, Japanese, and German. Martin Dietzfelbinger translated the book into German. Actually, he did much more than a translation. He thoroughly revised the book and improved the presentation in many places. He also corrected a number of mistakes. Thus, the book gained through the translation, and we decided to make the German edition the reference for any future editions.
It is only natural that we asked Martin to become an author of the German edition and any future editions of the book.

Saarbrücken, Karlsruhe, March 2014
Kurt Mehlhorn, Peter Sanders

Soon after the publication of the German edition, we started working on the revised English edition. We decided to expand the book into the parallel world for the reasons indicated in the preface. Twenty years ago, parallel machines were exotic; nowadays, sequential machines are exotic. However, the parallel world is much more diverse and complex than the sequential world, and therefore algorithm-engineering issues become more important. We concluded that we had to go all the way to implementations and experimental evaluations for some of the parallel algorithms. We invited Roman Dementiev to work with us on the algorithm engineering aspects of parallel computing. Roman received his PhD in 2006 for a thesis on "Algorithm Engineering for Large Data Sets". He now works for Intel, where he is responsible for performance engineering of a major database system.

Ilmenau, Saarbrücken, Karlsruhe, November 2017
Martin Dietzfelbinger, Kurt Mehlhorn, Peter Sanders

Contents

1 Appetizer: Integer Arithmetic
  1.1 Addition
  1.2 Multiplication: The School Method
  1.3 Result Checking
  1.4 A Recursive Version of the School Method
  1.5 Karatsuba Multiplication
  1.6 Parallel Multiplication
  1.7 Algorithm Engineering
  1.8 The Programs
  1.9 Proofs of Lemma 1.7 and Theorem 1.9
  1.10 Implementation Notes
  1.11 Historical Notes and Further Findings

2 Introduction
  2.1 Asymptotic Notation
  2.2 The Sequential Machine Model
  2.3 Pseudocode
  2.4 Parallel Machine Models
  2.5 Parallel Pseudocode
  2.6 Designing Correct Algorithms and Programs
  2.7 An Example – Binary Search
  2.8 Basic Algorithm Analysis
  2.9 Average-Case Analysis
  2.10 Parallel-Algorithm Analysis
  2.11 Randomized Algorithms
  2.12 Graphs
  2.13 P and NP
  2.14 Implementation Notes
  2.15 Historical Notes and Further Findings

3 Representing Sequences by Arrays and Linked Lists
  3.1 Processing Arrays in Parallel
  3.2 Linked Lists
  3.3 Processing Linked Lists in Parallel
  3.4 Unbounded Arrays
  3.5 *Amortized Analysis
  3.6 Stacks and Queues
  3.7 Parallel Queue-Like Data Structures
  3.8 Lists versus Arrays
  3.9 Implementation Notes
  3.10 Historical Notes and Further Findings

4 Hash Tables and Associative Arrays
  4.1 Hashing with Chaining
  4.2 Universal Hashing
  4.3 Hashing with Linear Probing
  4.4 Chaining versus Linear Probing
  4.5 *Perfect Hashing
  4.6 Parallel Hashing
  4.7 Implementation Notes
  4.8 Historical Notes and Further Findings

5 Sorting and Selection
  5.1 Simple Sorters
  5.2 Simple, Fast, and Inefficient Parallel Sorting
  5.3 Mergesort – an O(n log n) Sorting Algorithm
  5.4 Parallel Mergesort
  5.5 A Lower Bound
  5.6 Quicksort
  5.7 Parallel Quicksort
  5.8 Selection
  5.9 Parallel Selection
  5.10 Breaking the Lower Bound
  5.11 *Parallel Bucket Sort and Radix Sort
  5.12 *External Sorting
  5.13 Parallel Sample Sort with Implementations
  5.14 *Parallel Multiway Mergesort
  5.15 Parallel Sorting with Logarithmic Latency
  5.16 Implementation Notes
  5.17 Historical Notes and Further Findings

6 Priority Queues
  6.1 Binary Heaps
  6.2 Addressable Priority Queues
  6.3 *External Memory
  6.4 Parallel Priority Queues
  6.5 Implementation Notes
  6.6 Historical Notes and Further Findings

7 Sorted Sequences
  7.1 Binary Search Trees
  7.2 (a, b)-Trees and Red–Black Trees
  7.3 More Operations
  7.4 Amortized Analysis of Update Operations
  7.5 Augmented Search Trees
  7.6 Parallel Sorted Sequences
  7.7 Implementation Notes
  7.8 Historical Notes and Further Findings

8 Graph Representation
  8.1 Unordered Edge Sequences
  8.2 Adjacency Arrays – Static Graphs
  8.3 Adjacency Lists – Dynamic Graphs
  8.4 The Adjacency Matrix Representation
  8.5 Implicit Representations
  8.6 Parallel Graph Representation
  8.7 Implementation Notes
  8.8 Historical Notes and Further Findings

9 Graph Traversal
  9.1 Breadth-First Search
  9.2 Parallel Breadth-First Search
  9.3 Depth-First Search
  9.4 Parallel Traversal of DAGs
  9.5 Implementation Notes
  9.6 Historical Notes and Further Findings

10 Shortest Paths
  10.1 From Basic Concepts to a Generic Algorithm
  10.2 Directed Acyclic Graphs
  10.3 Nonnegative Edge Costs (Dijkstra's Algorithm)
  10.4 *Average-Case Analysis of Dijkstra's Algorithm
  10.5 Monotone Integer Priority Queues
  10.6 Arbitrary Edge Costs (Bellman–Ford Algorithm)
  10.7 All-Pairs Shortest Paths and Node Potentials
  10.8 Shortest-Path Queries
  10.9 Parallel Shortest Paths
  10.10 Implementation Notes
  10.11 Historical Notes and Further Findings

11 Minimum Spanning Trees
  11.1 Cut and Cycle Properties
  11.2 The Jarník–Prim Algorithm
  11.3 Kruskal's Algorithm
  11.4 The Union–Find Data Structure
  11.5 *External Memory
  11.6 *Parallel Algorithms
  11.7 Applications
  11.8 Implementation Notes
  11.9 Historical Notes and Further Findings

12 Generic Approaches to Optimization
  12.1 Linear Programming – Use a Black-Box Solver
  12.2 Greedy Algorithms – Never Look Back
  12.3 Dynamic Programming – Build It Piece by Piece
  12.4 Systematic Search – When in Doubt, Use Brute Force
  12.5 Local Search – Think Globally, Act Locally
  12.6 Evolutionary Algorithms
  12.7 Implementation Notes
  12.8 Historical Notes and Further Findings

13 Collective Communication and Computation
  13.1 Broadcast
  13.2 Reduction
  13.3 Prefix Sums
  13.4 Synchronization
  13.5 (All-)Gather/Scatter
  13.6 All-to-All Message Exchange
  13.7 Asynchronous Collective Communication

14 Load Balancing
  14.1 Overview and Basic Assumptions
  14.2 Prefix Sums – Independent Tasks with Known Size
  14.3 The Master–Worker Scheme
  14.4 (Randomized) Static Load Balancing
  14.5 Work Stealing
  14.6 Handling Dependencies

A Mathematical Background
  A.1 Mathematical Symbols
  A.2 Mathematical Concepts
  A.3 Basic Probability Theory
  A.4 Useful Formulae

B Computer Architecture Aspects
  B.1 Cores and Hardware Threads
  B.2 The Memory Hierarchy
  B.3 Cache Coherence Protocols
  B.4 Atomic Operations
  B.5 Hardware Transactional Memory
  B.6 Memory Management
  B.7 The Interconnection Network
  B.8 CPU Performance Analysis
  B.9 Compiler

C Support for Parallelism in C++
  C.1 "Hello World" C++11 Program with Threads
  C.2 Locks
  C.3 Asynchronous Operations
  C.4 Atomic Operations
  C.5 Useful Tools
  C.6 Memory Management
  C.7 Thread Scheduling

D The Message Passing Interface (MPI)
  D.1 "Hello World" and What Is an MPI Program?
  D.2 Point-to-Point Communication
  D.3 Collective Communication

E List of Commercial Products, Trademarks and Software Licenses
  E.1 BSD 3-Clause License

References

Index

1 Appetizer: Integer Arithmetic

An appetizer is supposed to stimulate the appetite at the beginning of a meal. This is exactly the purpose of this chapter. We want to stimulate your interest in algorithmic¹ techniques by showing you a surprising result. Although the school method for multiplying integers has been familiar to all of us since our school days and seems to be the natural way to multiply integers, it is not the best multiplication algorithm; there are much faster ways to multiply large integers, i.e., integers with thousands or even millions of digits, and we shall teach you one of them.
Algorithms for arithmetic are not only among the oldest algorithms, but are also essential in areas such as cryptography, geometric computing, computer algebra, and computer architecture. The first three areas need software algorithms for arithmetic on long integers, i.e., numbers with up to millions of digits. We shall present the algorithms learned in school, but also an improved algorithm for multiplication. The improved multiplication algorithm is not just an intellectual gem but is also extensively used in applications. Computer architecture needs very fast algorithms for moderate-length (32- to 128-bit) integers. We shall present fast parallel algorithms suitable for hardware implementation for addition and multiplication. On the way, we shall learn basic algorithm analysis and basic algorithm engineering techniques in a simple setting. We shall also see the interplay of theory and experiment.

We assume that integers[2] are represented as digit strings. In the base B number system, where B is an integer larger than 1, there are digits 0, 1, ..., B − 1, and a digit string a_{n−1} a_{n−2} ... a_1 a_0 represents the number ∑_{0 ≤ i < n} a_i · B^i. The most important systems with a small value of B are base 2, with digits 0 and 1, base 10, with digits 0 to 9, and base 16, with digits 0 to 15 (frequently written as 0 to 9, A, B, C, D, E, and F). Larger bases, such as 2^8, 2^16, 2^32, and 2^64, are also useful.

[1] The Soviet stamp on this page shows Muhammad ibn Musa al-Khwarizmi (born approximately 780; died between 835 and 850), Persian mathematician and astronomer from the Khorasan province of present-day Uzbekistan. The word "algorithm" is derived from his name.
[2] Throughout this chapter, we use "integer" to mean a nonnegative integer.

© Springer Nature Switzerland AG 2019. P. Sanders et al., Sequential and Parallel Algorithms and Data Structures, https://doi.org/10.1007/978-3-030-25209-0_1
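The digit-string representation is easy to try out in code. The following C++ sketch (our own helper, not the book's Sect. 1.8 implementation) evaluates a digit string, stored least significant digit first, in an arbitrary base B using Horner's rule:

```cpp
#include <cassert>
#include <vector>

// Evaluate a digit string in base B. The digits d[0..n-1] are stored
// least significant first and represent the number sum of d[i] * B^i.
long long fromDigits(const std::vector<int>& d, long long B) {
    long long value = 0;
    for (int i = (int)d.size() - 1; i >= 0; --i)
        value = value * B + d[i];   // Horner's rule: one multiply-add per digit
    return value;
}
```

For instance, the digit string "10101" in base 2 (stored as {1,0,1,0,1} least significant first) evaluates to 21, and "924" in base 10 evaluates to 924.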
For example, "10101" in base 2 represents 1 · 2^4 + 0 · 2^3 + 1 · 2^2 + 0 · 2^1 + 1 · 2^0 = 21, and "924" in base 10 represents 9 · 10^2 + 2 · 10^1 + 4 · 10^0 = 924.

We assume that we have two primitive operations at our disposal: the addition of three digits with a two-digit result (this is sometimes called a full adder), and the multiplication of two digits with a two-digit result.[3] For example, in base 10, we have 3 + 5 + 5 = 13 and 6 · 7 = 42. We shall measure the efficiency of our algorithms by the number of primitive operations executed.

[3] Observe that the sum of three digits is at most 3(B − 1), that the product of two digits is at most (B − 1)^2, and that both expressions are bounded by (B − 1) · B^1 + (B − 1) · B^0 = B^2 − 1, the largest integer that can be written with two digits. This clearly holds for B = 2. Increasing B by 1 increases the first expression by 3, the second expression by 2B − 1, and the third expression by 2B + 1. Hence the claim holds for all B.

We can artificially turn any n-digit integer into an m-digit integer for any m ≥ n by adding additional leading 0's. Concretely, "425" and "000425" represent the same integer. We shall use a and b for the two operands of an addition or multiplication and assume throughout this chapter that a and b are n-digit integers. The assumption that both operands have the same length simplifies the presentation without changing the key message of the chapter. We shall come back to this remark at the end of the chapter. We refer to the digits of a as a_{n−1} to a_0, with a_{n−1} being the most significant digit (also called the leading digit) and a_0 being the least significant digit, and write a = (a_{n−1} ... a_0). The leading digit may be 0. Similarly, we use b_{n−1} to b_0 to denote the digits of b, and write b = (b_{n−1} ... b_0).

1.1 Addition

We all know how to add two integers a = (a_{n−1} ... a_0) and b = (b_{n−1} ... b_0). We simply write one under the other with the least significant digits aligned, and sum the integers digitwise, carrying a single digit from one position to the next. This digit is called a carry. The result will be an (n + 1)-digit integer s = (s_n ... s_0). Graphically,

          a_{n−1}  ...  a_1  a_0     first operand
          b_{n−1}  ...  b_1  b_0     second operand
     c_n  c_{n−1}  ...  c_1  0       carries
     s_n  s_{n−1}  ...  s_1  s_0     sum

where c_0, ..., c_n is the sequence of carries and s = (s_n ... s_0) is the sum. We have c_0 = 0, c_{i+1} · B + s_i = a_i + b_i + c_i for 0 ≤ i < n, and s_n = c_n. As a program, this is written as

    c = 0 : Digit                 // variable for the carry digit
    for i := 0 to n − 1 do
        add a_i, b_i, and c to form s_i and a new carry c
    s_n := c

We need one primitive operation for each position, and hence a total of n primitive operations.

Theorem 1.1. The addition of two n-digit integers requires exactly n primitive operations. The result is an (n + 1)-digit integer.

1.1.1 Parallel Addition

Our addition algorithm produces the result digits one after the other, from least significant to most significant. In fact, the ith carry depends on the (i − 1)th carry, which in turn depends on the (i − 2)nd carry, and so on. So it seems natural to compute the result digits sequentially. Is there a parallel algorithm for addition that produces all digits in time less than linear in the number of digits? Parallel addition is not an academic exercise but crucial for microprocessor technology, because processors need fast, hardware-implemented algorithms for arithmetic. Suppose that engineers had insisted on using serial addition algorithms for microprocessors. In that case, it is likely that we would still be using 8-bit processors wherever possible, since they would be up to eight times faster than 64-bit processors.
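The sequential addition loop above translates directly into code. The following C++ sketch (our own function name, not the book's Sect. 1.8 implementation) adds two n-digit base-B numbers stored least significant digit first, using one full-adder step per position:

```cpp
#include <cassert>
#include <vector>

// School addition of two n-digit base-B numbers a and b (digits stored
// least significant first). Returns the (n+1)-digit sum and performs
// exactly n full-adder steps, as in Theorem 1.1.
std::vector<int> schoolAdd(const std::vector<int>& a,
                           const std::vector<int>& b, int B) {
    std::size_t n = a.size();            // we assume a.size() == b.size()
    std::vector<int> s(n + 1);
    int c = 0;                           // carry, always 0 or 1 (Lemma 1.2)
    for (std::size_t i = 0; i < n; ++i) {
        int t = a[i] + b[i] + c;         // one primitive operation (full adder)
        s[i] = t % B;
        c = t / B;
    }
    s[n] = c;
    return s;
}
```

For example, adding 425 and 987 in base 10 (digits {5,2,4} and {7,8,9}) yields 1412, i.e., the digit vector {2,1,4,1}.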
Our strategy for parallel addition is simple and ambitious: we want to perform all digit additions a_i + b_i + c_i in parallel. Of course, the problem is how to compute the carries c_i in parallel. Consider the following example of the addition of two 8-digit binary numbers a and b (the digit values shown are a reconstruction consistent with the discussion below):

    position  8  7  6  5  4  3  2  1  0
    a            1  0  0  0  0  1  1  0
    b            0  1  0  0  1  0  1  1
    c         0  0  0  0  1  1  1  0  0
    x            p  p  s  s  p  p  g  p
    y            s  s  s  s  g  g  g  p

Row c indicates the carries. Why is there a carry into position 4, i.e., why is c_4 = 1? Because position 1 generates a carry, since a_1 + b_1 = 2, and positions 2 and 3 propagate it, since a_2 + b_2 = a_3 + b_3 = 1. Why is there no carry into position 8, i.e., why is c_8 = 0? Positions 6 and 7 would propagate a carry, since a_6 + b_6 = a_7 + b_7 = 1, but there is nothing to propagate, since no carry is generated in position 5, as a_5 + b_5 = 0. The general rule is: We have a carry into a certain position if a carry is generated somewhere to the right and then propagated through all intermediate positions. Rows x and y implement this rule. Row x indicates whether a position generates (g), propagates (p), or stops (s) a carry. We have

    x_i = g   if a_i + b_i = 2,
    x_i = p   if a_i + b_i = 1,
    x_i = s   if a_i + b_i = 0.

The different x_i's are independent and can be computed in parallel. Row y gives information about whether we have a carry into a certain position. For i ≥ 1, we have a carry into position i, i.e., c_i = 1, if and only if y_{i−1} = g. We never have a carry into position 0, i.e., c_0 = 0. We give two rules for determining the sequence of y's. Consider the maximal subsequences of the x-sequence consisting of a string of p's followed by either an s or a g or the right end. Within each subsequence, turn all symbols into the last symbol; leave the sequence of trailing p's unchanged. In our example, the maximal subsequences are x_7 x_6 x_5, x_4, and x_3 x_2 x_1, and the sequence of trailing p's consists of x_0. Therefore y_7 y_6 y_5 become s's, y_4 becomes s, y_3 y_2 y_1 become g's, and y_0 becomes p.
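The generate/propagate/stop classification is easy to experiment with. The sketch below (our own helper names) computes the x-sequence for two binary digit vectors, stored least significant digit first, and then derives the carries by a right-to-left scan:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Classify each position of a binary addition as generating ('g'),
// propagating ('p'), or stopping ('s') a carry. x[i] corresponds to
// position i; digits are stored least significant first.
std::string classify(const std::vector<int>& a, const std::vector<int>& b) {
    std::string x(a.size(), ' ');
    for (std::size_t i = 0; i < a.size(); ++i) {
        int t = a[i] + b[i];
        x[i] = (t == 2) ? 'g' : (t == 1) ? 'p' : 's';
    }
    return x;
}

// Derive the carries c[0..n] sequentially from the classification:
// a 'g' forces the next carry to 1, an 's' forces it to 0, and a 'p'
// passes the incoming carry on.
std::vector<int> carries(const std::string& x) {
    std::vector<int> c(x.size() + 1, 0);     // c[0] = 0 by definition
    for (std::size_t i = 0; i < x.size(); ++i)
        c[i + 1] = (x[i] == 'g') ? 1 : (x[i] == 'p') ? c[i] : 0;
    return c;
}
```

For the digit vectors a = {0,1,1,0,0,0,0,1} and b = {1,1,0,1,0,0,1,0} (positions 0 to 7), the classification is "pgppsspp", and the scan produces a carry into positions 2, 3, and 4 and none into position 8.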
It would be equally fine to turn the trailing p's into s's. However, this would make the formal treatment slightly more cumbersome. Alternatively, we may state the rule for the y_i's as follows:

    y_i = x_i        if i = 0 or x_i ∈ {s, g},
    y_i = y_{i−1}    otherwise, i.e., i > 0 and x_i = p.

Our task is now to compute the y_i's in parallel. The first step is to rewrite the definition of y_i as

    y_i = x_0             if i = 0,
    y_i = x_i ⊗ y_{i−1}   if i > 0,

where the operator ⊗[4] is given by the table

    ⊗ | s  p  g
    --+--------
    s | s  s  s
    p | s  p  g
    g | g  g  g

The operator ⊗ returns its left argument if this argument is s or g, and returns its right argument when the left argument is p. We next expand the definition of y_i and obtain

    y_i = x_i ⊗ y_{i−1} = x_i ⊗ (x_{i−1} ⊗ y_{i−2}) = x_i ⊗ (x_{i−1} ⊗ (x_{i−2} ⊗ ... (x_1 ⊗ x_0) ...)).

This formula corresponds to the sequential computation of y_i. We first compute x_1 ⊗ x_0, then left-multiply by x_2, then by x_3, and finally by x_i. The operator ⊗ is associative (we shall prove this below), and hence we can change the evaluation order without changing the result. Compare the following formulae for y_6:

    y_6 = x_6 ⊗ (x_5 ⊗ (x_4 ⊗ (x_3 ⊗ (x_2 ⊗ (x_1 ⊗ x_0)))))

and

    y_6 = (x_6 ⊗ (x_5 ⊗ x_4)) ⊗ ((x_3 ⊗ x_2) ⊗ (x_1 ⊗ x_0)).

[4] Pronounced "otimes".

[Fig. 1.1. The computation of y_7 to y_0 from x_7 to x_0 by a circuit of ⊗-gates, shown for n = 8 (with the circuit for n = 4 as an inset). The horizontal partition of the computation corresponds to the general description of the algorithm. We first form pairs. In the second row of gates, we combine pairs, and in the third row, we combine pairs of pairs. Then we start to fill in gaps. In row four, we compute y_5, and in the last row, we compute y_6, y_4, and y_2.]

The latter formula corresponds to a parallel evaluation of y_6. We compute x_5 ⊗ x_4, x_3 ⊗ x_2, and x_1 ⊗ x_0 in parallel, then x_6 ⊗ (x_5 ⊗ x_4) and (x_3 ⊗ x_2) ⊗ (x_1 ⊗ x_0), and finally y_6. Thus three rounds of computation suffice instead of the six rounds for the sequential evaluation.
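Associativity of ⊗ is what licenses this reordering, and it can be checked exhaustively in a few lines. The sketch below (our own encoding of s, p, g as characters) verifies associativity over all 27 triples:

```cpp
#include <cassert>

// The carry operator: returns its left argument unless that argument
// is 'p' (propagate), in which case the right argument wins.
char otimes(char u, char v) { return u == 'p' ? v : u; }

// Brute-force check of associativity over all 27 triples (u, v, w).
bool isAssociative() {
    const char sym[3] = {'s', 'p', 'g'};
    for (char u : sym)
        for (char v : sym)
            for (char w : sym)
                if (otimes(otimes(u, v), w) != otimes(u, otimes(v, w)))
                    return false;
    return true;
}
```

Because the check succeeds, a balanced parenthesization such as (x_6 ⊗ (x_5 ⊗ x_4)) ⊗ ((x_3 ⊗ x_2) ⊗ (x_1 ⊗ x_0)) must agree with the left-to-right fold, which is exactly what makes the tree-shaped circuit of Fig. 1.1 correct.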
Generally, we can compute y_i in ⌈log i⌉ rounds of computation. The computation of the different y_i's can be intertwined, as Fig. 1.1 shows. We next give a more formal treatment for a general base B. We start with the simple observation that the carry digit is either 0 or 1.

Lemma 1.2. c_i ∈ {0, 1} for all i.

Proof. We have c_0 = 0 by definition. Assume inductively that c_i ∈ {0, 1}. Then a_i + b_i + c_i ≤ 2(B − 1) + 1 = 2B − 1 = 1 · B + (B − 1), and hence c_{i+1} ∈ {0, 1}. ⊓⊔

Two input digits a_i and b_i will generate a carry no matter what happens to the right of them if a_i + b_i > B − 1. On the other hand, the addition involving a_i and b_i will stop any carry if a_i + b_i < B − 1; this holds because an incoming carry is at most 1. If a_i + b_i = B − 1, an incoming carry will be propagated to the left. Hence, we define

    x_i := s   if a_i + b_i < B − 1 (stop),
           p   if a_i + b_i = B − 1 (propagate),       (1.1)
           g   if a_i + b_i > B − 1 (generate).

Lemma 1.3. The operator ⊗ is associative, i.e., (u ⊗ v) ⊗ w = u ⊗ (v ⊗ w) for any u, v, and w. Let y_i = x_i ⊗ x_{i−1} ⊗ ... ⊗ x_0. Then c_i = 1 if and only if y_{i−1} = g.

Proof. If u ≠ p, then (u ⊗ v) ⊗ w = u ⊗ w = u = u ⊗ (v ⊗ w). If u = p, then (u ⊗ v) ⊗ w = v ⊗ w = u ⊗ (v ⊗ w). We have a carry into position i if and only if there is a k < i such that a carry is generated in position k and propagated by positions k + 1 to i − 1, i.e., x_k = g and x_{k+1} = ... = x_{i−1} = p. Then y_k = x_k = g. This completes the argument when i − 1 = k. Otherwise, y_{k+1} = p ⊗ y_k = g, y_{k+2} = p ⊗ y_{k+1} = g, ..., y_{i−1} = p ⊗ y_{i−2} = g. ⊓⊔

We now come to the parallel computation of the y_i's. For simplicity, we assume n to be a power of 2. If n = 1 = 2^0, we simply return x_0. Otherwise, we do the following:

(a) Combine consecutive pairs of inputs: z_0 = x_1 ⊗ x_0, z_1 = x_3 ⊗ x_2, ..., z_{n/2−1} = x_{n−1} ⊗ x_{n−2}.
(b) Apply the algorithm recursively to the sequence z_{n/2−1}, ..., z_0 to obtain its sequence of prefix sums w_{n/2−1}, ..., w_0.
(c) Assemble the output as y_{2i+1} = w_i for i ∈ 0..n/2 − 1, y_0 = x_0, and y_{2i} = x_{2i} ⊗ w_{i−1} for i ∈ 1..n/2 − 1.

Figure 1.1 illustrates the computation for n = 8. The number N(n) of gates satisfies N(1) = 0 and N(n) = n/2 + N(n/2) + n/2 − 1 ≤ N(n/2) + n. Note that we use n/2 gates in step (a), n/2 − 1 gates in step (c), and N(n/2) gates in step (b). By repeated substitution, we obtain N(n) ≤ n + N(n/2) ≤ n + n/2 + n/4 + n/8 + ... = O(n). The number of rounds of computation is 0 for n = 1, is 1 for n = 2, and grows by 2 whenever n doubles. Thus the number of rounds is 2 log n − 1. The algorithm for computing the y_i's in parallel is an example of a prefix sum computation, a frequently useful collective computation described in more detail in Sect. 13.3.

1.2 Multiplication: The School Method

We all know how to multiply two integers. In this section, we shall review the "school method". In a later section, we shall get to know a method which is significantly faster for large integers.

We shall proceed slowly. We first review how to multiply an n-digit integer a by a one-digit integer b_j. We use b_j for the one-digit integer, since this is how we need it below. For any digit a_i of a, we form the product a_i · b_j. The result is a two-digit integer (c_i d_i), i.e.,

    a_i · b_j = c_i · B + d_i.

We form two integers, c = (c_{n−1} ... c_0 0) and d = (d_{n−1} ... d_0), from the c_i's and d_i's, respectively. Since the c_i's are the higher-order digits in the products, we add a 0 digit at the end of c. We add c and d to obtain the product p_j = a · b_j. Graphically,

    (a_{n−1} ... a_i ... a_0) · b_j  →   c_{n−1} c_{n−2} ... c_i c_{i−1} ... c_0 0
                                               d_{n−1} ... d_{i+1} d_i  ... d_1 d_0
                                         -----------------------------------------
                                                       sum of c and d

Let us determine the number of primitive operations. For each i, we need one primitive operation to form the product a_i · b_j, for a total of n primitive operations. Then we add the two (n + 1)-digit numbers (c_{n−1} ... c_0 0) and (0 d_{n−1} ... d_0).
However, we may simply copy the digit d_0 to the result and hence effectively add two n-digit numbers. This requires n primitive operations. So the total number of primitive operations is 2n. The leftmost addition of c_{n−1} and the carry into this position generates carry 0, as a · b_j ≤ (B^n − 1) · (B − 1) < B^{n+1}, and hence the result can be written with n + 1 digits.

Lemma 1.4. We can multiply an n-digit number by a one-digit number with 2n primitive operations. The result is an (n + 1)-digit number.

When you multiply an n-digit number by a one-digit number, you will probably proceed slightly differently. You combine[5] the generation of the products a_i · b_j with the summation of c and d into a single phase, i.e., you create the digits of c and d when they are needed in the final addition. We have chosen to generate them in a separate phase because this simplifies the description of the algorithm.

Exercise 1.1. Give a program for the multiplication of a and b_j that operates in a single phase.

We can now turn to the multiplication of two n-digit integers. The school method for integer multiplication works as follows: We first form partial products p_j by multiplying a by the jth digit b_j of b, and then sum the suitably aligned products p_j · B^j to obtain the product of a and b. Graphically, writing p_{j,i} for digit i of the partial product p_j,

    p_{0,n}   p_{0,n−1} ...       p_{0,2}   p_{0,1}   p_{0,0}
    p_{1,n}   p_{1,n−1} p_{1,n−2} ...       p_{1,1}   p_{1,0}
    p_{2,n}   p_{2,n−1} p_{2,n−2} p_{2,n−3} ...       p_{2,0}
    ...
    p_{n−1,n} ...       p_{n−1,3} p_{n−1,2} p_{n−1,1} p_{n−1,0}
    --------------------------------------------------------
    sum of the n partial products, each row shifted one position further to the left

The description in pseudocode is more compact. We initialize the product p to 0 and then add to it the partial products a · b_j · B^j one by one:

    p = 0 : N
    for j := 0 to n − 1 do
        p := p + a · b_j · B^j

Let us analyze the number of primitive operations required by the school method. Each partial product p_j requires 2n primitive operations, and hence all partial products together require 2n^2 primitive operations.
The product a · b is a 2n-digit number, and hence all summations p + a · b_j · B^j are summations of 2n-digit integers. Each such addition requires at most 2n primitive operations, and hence all additions together require at most 2n^2 primitive operations. Thus, we need no more than 4n^2 primitive operations in total.

[5] In the literature on compiler construction and performance optimization, this transformation is known as loop fusion.

    n        T_n / sec    T_n / T_{n/2}
    8        0.00000469
    16       0.0000154    3.28
    32       0.0000567    3.67
    64       0.000222     3.91
    128      0.000860     3.87
    256      0.00347      4.03
    512      0.0138       3.98
    1 024    0.0547       3.95
    2 048    0.220        4.01
    4 096    0.880        4.00
    8 192    3.53         4.01
    16 384   14.2         4.01
    32 768   56.7         4.00
    65 536   227          4.00
    131 072  910          4.00

Fig. 1.2. The running time of the school method for the multiplication of n-digit integers. The three columns of the table give n, the running time T_n in seconds of the C++ implementation given in Sect. 1.8, and the ratio T_n / T_{n/2}. The accompanying plot shows log T_n versus log n, and we see essentially a line. Observe that if T_n = α n^β for some constants α and β, then T_n / T_{n/2} = 2^β and log T_n = β log n + log α, i.e., log T_n depends linearly on log n with slope β. In our case, the slope is 2. Please use a ruler to check.

A simple observation allows us to improve this bound. The number a · b_j · B^j has n + 1 + j digits, the last j of which are 0. We can therefore start the addition in the (j + 1)th position. Also, when we add a · b_j · B^j to p, we have p = a · (b_{j−1} ... b_0), i.e., p has n + j digits. Thus, the addition of p and a · b_j · B^j amounts to the addition of two (n + 1)-digit numbers and requires only n + 1 primitive operations. Therefore, all n − 1 additions together require no more than (n − 1)(n + 1) < n^2 primitive operations. We have thus shown the following result.

Theorem 1.5.
The school method multiplies two n-digit integers with 3n^2 primitive operations.

We have now analyzed the numbers of primitive operations required by the school methods for integer addition and integer multiplication. The number M_n of primitive operations for the school method for integer multiplication is essentially 3n^2. We say that M_n grows quadratically. Observe also that

    M_n / M_{n/2} = 3n^2 / (3(n/2)^2) = 4,

i.e., quadratic growth has the consequence of essentially quadrupling the number of primitive operations when the size of the instance is doubled.

Assume now that we actually implement the multiplication algorithm in our favorite programming language (we shall do so later in the chapter), and then time the program on our favorite machine for various n-digit integers a and b and various n. What should we expect? We want to argue that we shall see quadratic growth. The reason is that primitive operations are representative of the running time of the algorithm. Consider the addition of two n-digit integers first. What happens when the program is executed? For each position i, the digits a_i and b_i have to be moved to the processing unit, the sum a_i + b_i + c has to be formed, the digit s_i of the result needs to be stored in memory, the carry c is updated, the index i is incremented, and a test for loop exit needs to be performed. Thus, for each i, the same number of machine cycles is executed. We have counted one primitive operation for each i, and hence the number of primitive operations is representative of the number of machine cycles executed. Of course, there are additional effects, for example pipelining and the complex transport mechanism for data between memory and the processing unit, but they will have a similar effect for all i, and hence the number of primitive operations is also representative of the running time of an actual implementation on an actual machine.
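Before turning to measurements, the school method itself can be written down compactly. The following digit-vector sketch is our own code, not the Sect. 1.8 implementation; in the spirit of Exercise 1.1 it fuses the two phases of the one-digit multiplication into the addition loop:

```cpp
#include <cassert>
#include <vector>

// School multiplication of two base-B numbers given as digit vectors
// (least significant digit first). For each digit b[j] we form the
// partial product a * b[j] and add it into the result at offset j.
std::vector<int> schoolMultiply(const std::vector<int>& a,
                                const std::vector<int>& b, int B) {
    std::vector<int> p(a.size() + b.size(), 0);
    for (std::size_t j = 0; j < b.size(); ++j) {
        int carry = 0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            // digit product and addition fused into a single phase
            int t = p[i + j] + a[i] * b[j] + carry;
            p[i + j] = t % B;
            carry = t / B;
        }
        // position j + a.size() still holds 0 here, so the carry fits
        p[j + a.size()] += carry;
    }
    return p;
}
```

For example, 429 · 357 (digit vectors {9,2,4} and {7,5,3} in base 10) yields 153153, i.e., the digit vector {3,5,1,3,5,1}; this is the correct product used in the result-checking example of Sect. 1.3.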
The argument extends to multiplication, since multiplication of a number by a one-digit number is a process similar to addition, and the second phase of the school method for multiplication amounts to a series of additions.

Let us confirm the above argument by an experiment. Figure 1.2 shows execution times of a C++ implementation of the school method; the program can be found in Sect. 1.8. For each n, we performed a large number[6] of multiplications of n-digit random integers and then determined the average running time T_n; T_n is listed in the second column of the table. We also show the ratio T_n / T_{n/2}. Figure 1.2 also shows a plot of the data points[7] (log n, log T_n).

The data exhibits approximately quadratic growth, as we can deduce in various ways. The ratio T_n / T_{n/2} is always close to four, and the double logarithmic plot shows essentially a line of slope 2. The experiments are quite encouraging: Our theoretical analysis has predictive value. Our theoretical analysis showed quadratic growth of the number of primitive operations, we argued above that the running time should be related to the number of primitive operations, and the actual running time essentially grows quadratically. However, we also see systematic deviations. For small n, the growth factor from one row to the next is less than four, as linear and constant terms in the running time still play a substantial role. For larger n, the ratio is very close to four. For very large n (too large to be timed conveniently), we would probably see a factor larger than four, since the access time to memory depends on the size of the data. We shall come back to this point in Sect. 2.2.

Exercise 1.2. Write programs for the addition and multiplication of long integers. Represent integers as sequences (arrays or lists or whatever your programming language offers) of decimal digits and use the built-in arithmetic to implement the primitive operations.
Then write ADD, MULTIPLY1, and MULTIPLY functions that add integers, multiply an integer by a one-digit number, and multiply integers, respectively. Use your implementation to produce your own version of Fig. 1.2. Experiment with using a larger base than 10, say base 2^16.

[6] The internal clock that measures CPU time returns its timings in some units, say milliseconds, and hence the rounding required introduces an error of up to one-half of this unit. It is therefore important that the experiment whose duration is to be timed takes much longer than this unit, in order to reduce the effect of rounding.
[7] Throughout this book, we use log x to denote the logarithm to base 2, log_2 x.

Exercise 1.3. Describe and analyze the school method for division.

1.3 Result Checking

Our algorithms for addition and multiplication are quite simple, and hence it is fair to assume that we can implement them correctly in the programming language of our choice. However, writing software[8] is an error-prone activity, and hence we should always ask ourselves whether we can check the results of a computation. For multiplication, the authors were taught the following technique in elementary school. The method is known as Neunerprobe in German, "casting out nines" in English, and preuve par neuf in French.

Add the digits of a. If the sum is a number with more than one digit, sum its digits. Repeat until you arrive at a one-digit number, called the checksum of a. We use s_a to denote this checksum. Here is an example: 4528 → 19 → 10 → 1. Do the same for b and the result c of the computation. This gives the checksums s_b and s_c. All checksums are single-digit numbers. Compute s_a · s_b and form its checksum s. If s differs from s_c, then c is not equal to a · b. This test was described by al-Khwarizmi in his book on algebra.

Let us go through a simple example. Let a = 429, b = 357, and c = 154153. Then s_a = 6, s_b = 6, and s_c = 1. Also, s_a · s_b = 36 and hence s = 9.
So s_c ≠ s, and hence c is not the product of a and b. Indeed, the correct product is c = 153153. Its checksum is 9, and hence the correct product passes the test. The test is not foolproof, as c = 135153 also passes the test. However, the test is quite useful and detects many mistakes.

What is the mathematics behind this test? We shall explain a more general method. Let q be any positive integer; in the method described above, q = 9. Let s_a be the remainder, or residue, in the integer division of a by q, i.e., s_a = a − ⌊a/q⌋ · q. Then 0 ≤ s_a < q. In mathematical notation, s_a = a mod q.[9] Similarly, s_b = b mod q and s_c = c mod q. Finally, s = (s_a · s_b) mod q. If c = a · b, then it must be the case that s = s_c. Thus s ≠ s_c proves c ≠ a · b and uncovers a mistake in the multiplication (or a mistake in carrying out casting out nines). What do we know if s = s_c? We know that q divides the difference of c and a · b. If this difference is nonzero, the mistake will be detected by any q which does not divide the difference.

[8] The bug in the division algorithm of the floating-point unit of the original Pentium chip became infamous. It was caused by a few missing entries in a lookup table used by the algorithm.
[9] The method taught in school uses residues in the range 1 to 9 instead of 0 to 8, according to the definition s_a = a − (⌈a/q⌉ − 1) · q.

Let us continue with our example and take q = 7. Then a mod 7 = 2, b mod 7 = 0, and hence s = (2 · 0) mod 7 = 0. But 135153 mod 7 = 4, and we have uncovered the fact that 135153 ≠ 429 · 357.

Exercise 1.4. Explain why casting out nines corresponds to the case q = 9. Hint: 10^k mod 9 = 1 for all k ≥ 0.

Exercise 1.5 (Elferprobe, casting out elevens). Powers of ten have very simple remainders modulo 11, namely 10^k mod 11 = (−1)^k for all k ≥ 0, i.e., 1 mod 11 = +1, 10 mod 11 = −1, 100 mod 11 = +1, 1 000 mod 11 = −1, etc.
Describe a simple test to check the correctness of a multiplication modulo 11.

1.4 A Recursive Version of the School Method

We shall now derive a recursive version of the school method. This will be our first encounter with the divide-and-conquer paradigm, one of the fundamental paradigms in algorithm design. Let a and b be our two n-digit integers which we want to multiply. Let k = ⌊n/2⌋. We split a into two numbers a_1 and a_0; a_0 consists of the k least significant digits and a_1 consists of the n − k most significant digits.[10] We split b analogously. Then

    a = a_1 · B^k + a_0   and   b = b_1 · B^k + b_0,

and hence

    a · b = a_1 · b_1 · B^{2k} + (a_1 · b_0 + a_0 · b_1) · B^k + a_0 · b_0.

This formula suggests the following algorithm for computing a · b:

(a) Split a and b into a_1, a_0, b_1, and b_0.
(b) Compute the four products a_1 · b_1, a_1 · b_0, a_0 · b_1, and a_0 · b_0.
(c) Add the suitably aligned products to obtain a · b.

Observe that the numbers a_1, a_0, b_1, and b_0 are ⌈n/2⌉-digit numbers and hence the multiplications in step (b) are simpler than the original multiplication if ⌈n/2⌉ < n, i.e., n > 1. The complete algorithm is now as follows. To multiply one-digit numbers, use the multiplication primitive. To multiply n-digit numbers for n ≥ 2, use the three-step approach above.

It is clear why this approach is called divide-and-conquer. We reduce the problem of multiplying a and b to some number of simpler problems of the same kind. A divide-and-conquer algorithm always consists of three parts: In the first part, we split the original problem into simpler problems of the same kind (our step (a)); in the second part, we solve the simpler problems using the same method (our step (b)); and, in the third part, we obtain the solution to the original problem from the solutions to the subproblems (our step (c)). Instead of "we solve the simpler problems using

Fig. 1.3.
Visualization of the school method and its recursive variant. The rhombus-shaped area indicates the partial products in the multiplication a · b. The four subareas correspond to the partial products a_1 · b_1, a_1 · b_0, a_0 · b_1, and a_0 · b_0. In the recursive scheme, we first sum the partial products in the four subareas and then, in a second step, add the four resulting sums.

the same method", one usually says more elegantly "we solve the simpler problems recursively".

What is the connection of our recursive integer multiplication to the school method? The two methods are strongly related. Both methods compute all n^2 digit products a_i b_j and then sum the resulting two-digit results (appropriately shifted). They differ in the order in which these summations are performed. Figure 1.3 visualizes the digit products and their place values as rhombus-shaped regions. The school method adds the digit products row by row. The recursive method first computes the partial results corresponding to the four subregions, which it then combines to obtain the final result with three additions. We may almost say that our recursive integer multiplication is just the school method in disguise. In particular, the recursive algorithm also uses a quadratic number of primitive operations. We next derive the quadratic behavior without appealing to the connection to the school method. This will allow us to introduce recurrence relations, a powerful concept for the analysis of recursive algorithms.

Lemma 1.6. Let T(n) be the maximum number of primitive operations required by our recursive multiplication algorithm when applied to n-digit integers. Then

    T(n) ≤ 1                            if n = 1,
    T(n) ≤ 4 · T(⌈n/2⌉) + 2 · 2 · n     if n ≥ 2.

Proof. Multiplying two one-digit numbers requires one primitive multiplication. This justifies the case n = 1. So, assume n ≥ 2.
Splitting a and b into the four pieces a_1, a_0, b_1, and b_0 requires no primitive operations.[11] Each piece has at most ⌈n/2⌉ digits, and hence the four recursive multiplications require at most 4 · T(⌈n/2⌉) primitive operations. The products a_0 · b_0 and a_1 · b_1 · B^{2k} can be combined into a single number without any cost, as the former number is a 2k-digit number and the latter number ends with 2k many 0's. Finally, we need two additions to assemble the final result. Each addition involves two numbers of at most 2n digits and hence requires at most 2n primitive operations. This justifies the inequality for n ≥ 2. ⊓⊔

[10] Observe that we have changed notation; a_0 and a_1 now denote the two parts of a and are no longer single digits.
[11] It will require work, but it is work that we do not account for in our analysis.

In Sect. 2.8, we shall learn that such recurrences are easy to solve and yield the already conjectured quadratic execution time of the recursive algorithm.

Lemma 1.7. Let T(n) be the maximum number of primitive operations required by our recursive multiplication algorithm when applied to n-digit integers. Then T(n) ≤ 5n^2 if n is a power of 2, and T(n) ≤ 20n^2 for all n.

Proof. We refer the reader to Sect. 1.9 for a proof. ⊓⊔

1.5 Karatsuba Multiplication

In 1962, the Soviet mathematician Karatsuba [175] discovered a faster way of multiplying large integers. The running time of his algorithm grows like n^{log 3} ≈ n^{1.58}. The method is surprisingly simple. Karatsuba observed that a simple algebraic identity allows one multiplication to be eliminated in the divide-and-conquer implementation, i.e., one can multiply n-digit numbers using only three multiplications of integers half the size.

The details are as follows. Let a and b be our two n-digit integers which we want to multiply. Let k = ⌊n/2⌋.
As above, we split a into two numbers a_1 and a_0; a_0 consists of the k least significant digits and a_1 consists of the n − k most significant digits. We split b in the same way. Then

    a = a_1 · B^k + a_0   and   b = b_1 · B^k + b_0,

and hence (the magic is in the second equality)

    a · b = a_1 · b_1 · B^{2k} + (a_1 · b_0 + a_0 · b_1) · B^k + a_0 · b_0
          = a_1 · b_1 · B^{2k} + ((a_1 + a_0) · (b_1 + b_0) − (a_1 · b_1 + a_0 · b_0)) · B^k + a_0 · b_0.

At first sight, we have only made things more complicated. A second look, however, shows that the last formula can be evaluated with only three multiplications, namely, a_1 · b_1, a_0 · b_0, and (a_1 + a_0) · (b_1 + b_0). We also need six additions.[12] That is three more than in the recursive implementation of the school method. The key is that additions are cheap compared with multiplications, and hence saving a multiplication more than outweighs the additional additions. We obtain the following algorithm for computing a · b:

(a) Split a and b into a_1, a_0, b_1, and b_0.
(b) Compute the three products p_2 = a_1 · b_1, p_0 = a_0 · b_0, and p_1 = (a_1 + a_0) · (b_1 + b_0).
(c) Add the suitably aligned products to obtain a · b, i.e., compute a · b according to the formula

    a · b = (p_2 · B^{2k} + p_0) + (p_1 − (p_2 + p_0)) · B^k.

The first addition can be performed by concatenating the corresponding digit strings and requires no primitive operation. The numbers a_1, a_0, b_1, b_0, a_1 + a_0, and b_1 + b_0 are (⌈n/2⌉ + 1)-digit numbers, and hence the multiplications in step (b) are simpler than the original multiplication if ⌈n/2⌉ + 1 < n, i.e., n ≥ 4. The complete algorithm is now as follows: To multiply three-digit numbers, use the school method, and to multiply n-digit numbers for n ≥ 4, use the three-step approach above.

[12] Actually, five additions and one subtraction. We leave it to readers to convince themselves that subtractions are no harder than additions.
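To make the three-step scheme concrete, here is a sketch of Karatsuba's recursion on built-in 64-bit integers rather than on digit strings, so it only illustrates the recursion pattern, not an arbitrary-precision implementation; all names are ours. It splits at a power of the base B = 10, multiplies three half-size pairs, and recombines exactly as in step (c):

```cpp
#include <cassert>
#include <cstdint>

std::int64_t power(std::int64_t base, int e) {
    std::int64_t r = 1;
    while (e-- > 0) r *= base;
    return r;
}

// Karatsuba multiplication on 64-bit integers in base 10. Operands
// with fewer than four digits fall back to the built-in multiplication,
// mirroring the base case of the algorithm in the text.
std::int64_t karatsuba(std::int64_t a, std::int64_t b) {
    if (a < 1000 || b < 1000) return a * b;     // base case: school method
    int n = 0;                                  // digit count of the larger operand
    for (std::int64_t t = (a > b ? a : b); t > 0; t /= 10) ++n;
    std::int64_t Bk = power(10, n / 2);         // split point B^k, k = floor(n/2)
    std::int64_t a1 = a / Bk, a0 = a % Bk;      // a = a1 * B^k + a0
    std::int64_t b1 = b / Bk, b0 = b % Bk;      // b = b1 * B^k + b0
    std::int64_t p2 = karatsuba(a1, b1);        // three recursive products
    std::int64_t p0 = karatsuba(a0, b0);
    std::int64_t p1 = karatsuba(a1 + a0, b1 + b0);
    return p2 * Bk * Bk + (p1 - (p2 + p0)) * Bk + p0;
}
```

For example, karatsuba(123456, 789012) recursively computes p2 = 123 · 789, p0 = 456 · 12, and p1 = 579 · 801, and recombines them into the product 97 408 265 472. Note that the sketch is limited by the range of std::int64_t; the real payoff of the method appears only with a digit-string representation.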
[Fig. 1.4 shows log-log plots of time/sec against the number n of digits for the school method, Karatsuba4, and Karatsuba32.] Fig. 1.4. The running times of implementations of the Karatsuba and school methods for integer multiplication. The running times of two versions of Karatsuba's method are shown: Karatsuba4 switches to the school method for integers with fewer than four digits, and Karatsuba32 switches to the school method for integers with fewer than 32 digits. The slopes of the lines for the Karatsuba variants are approximately 1.58. The running time of Karatsuba32 is approximately one-third the running time of Karatsuba4.

Figure 1.4 shows the running times TS(n), TK4(n), and TK32(n) of C++ implementations of the school method and of two variants of the Karatsuba method for the multiplication of n-digit numbers. Karatsuba4 (running time TK4(n)) uses the school method for numbers with fewer than four digits and Karatsuba32 (running time TK32(n)) uses the school method for numbers with fewer than 32 digits; we discuss the rationale for this variant in Sect. 1.7. The scales on both axes are logarithmic. We see, essentially, straight lines of different slope. The running time of the school method grows like n², and hence the slope is 2 in the case of the school method. The slope is smaller in the case of the Karatsuba method, and this suggests that its running time grows like n^β with β < 2. In fact, the ratios¹³ TK4(n)/TK4(n/2) and TK32(n)/TK32(n/2) are close to three, and this suggests that β is such that 2^β = 3 or β = log 3 ≈ 1.58. Alternatively, you may determine the slope from Fig. 1.4. We shall prove below that TK(n) grows like n^(log 3). We say that the Karatsuba method has better asymptotic behavior than the school method.

¹³ TK4(1024) = 0.0455, TK4(2048) = 0.1375, and TK4(4096) = 0.41.
We also see that the inputs have to be quite big before the superior asymptotic behavior of the Karatsuba method actually results in a smaller running time. Observe that for n = 2^8, the school method is still faster, that for n = 2^9, the two methods have about the same running time, and that the Karatsuba method wins for n = 2^10. The lessons to remember are:

• Better asymptotic behavior ultimately wins.
• An asymptotically slower algorithm can be faster on small inputs.

In the next section, we shall learn how to improve the behavior of the Karatsuba method for small inputs. The resulting algorithm will always be at least as good as the school method. It is time to derive the asymptotics of the Karatsuba method.

Lemma 1.8. Let TK(n) be the maximum number of primitive operations required by the Karatsuba algorithm when applied to n-digit integers. Then

TK(n) ≤ 3n²  if n ≤ 3,  and  TK(n) ≤ 3 · TK(⌈n/2⌉ + 1) + 8n  if n ≥ 4.

Proof. Multiplying two n-digit numbers using the school method requires no more than 3n² primitive operations, according to Theorem 1.5. This justifies the first case. So, assume n ≥ 4. Splitting a and b into the four pieces a1, a0, b1, and b0 requires no primitive operations.¹⁴ Each piece and the sums a0 + a1 and b0 + b1 have at most ⌈n/2⌉ + 1 digits, and hence the three recursive multiplications require at most 3 · TK(⌈n/2⌉ + 1) primitive operations. We need two additions to form a0 + a1 and b0 + b1. The results of these additions have fewer than n digits and hence the additions need no more than n elementary operations each. Finally, we need three additions in order to compute the final result from the results of the multiplications. These are additions of numbers with at most 2n digits. Thus these additions require at most 3 · 2n primitive operations. Altogether, we obtain the bound stated in the second case of the recurrence. ⊓⊔

¹⁴ It will require work, but remember that we are counting primitive operations.

In Sect. 2.8, we shall learn some general techniques for solving recurrences of this kind.

Theorem 1.9.
Let TK(n) be the maximum number of primitive operations required by the Karatsuba algorithm when applied to n-digit integers. Then TK(n) ≤ 153 · n^(log 3) for all n.

Proof. We refer the reader to Sect. 1.9 for a proof. ⊓⊔

1.6 Parallel Multiplication

Both the recursive version of the school method and the Karatsuba algorithm are good starting points for parallel algorithms. For simplicity, we focus on the school method. Recall that the bulk of the work is done in the recursive multiplications a0 · b0, a1 · b1, a0 · b1, and a1 · b0. These four multiplications can be done independently and in parallel. Hence, in the first level of recursion, up to four processors can work in parallel. In the second level of recursion, the parallelism is already 4 · 4 = 16. In the ith level of recursion, 4^i processors can work in parallel. Fig. 1.5 shows a graphical representation of the resulting computation for multiplying two numbers a11 a10 a01 a00 and b11 b10 b01 b00.

[Fig. 1.5 shows the task graph: the 16 digit products a11·b11, ..., a00·b00 are computed in parallel and combined by a tree of shifted additions into a·b.] Fig. 1.5. Task graph for parallel recursive multiplication with two levels of parallel recursion and using the school method on two-digit numbers as the base case. A ">" stands for a shift by two digits.

What is the running time of the algorithm if an unlimited number of processors are available? This quantity is known as the span of a parallel computation; see Sect. 2.10. For simplicity, we assume a sequential addition algorithm. As before, we shall count only arithmetic operations on digits. For the span S(n) for multiplying two n-digit integers, we get

S(n) ≤ 1  if n = 1,  and  S(n) ≤ S(⌈n/2⌉) + 3 · 2 · n  if n ≥ 2.

This recurrence has the solution S(n) ≤ 12n.
Note that this is much less than the quadratic number of operations performed. Hence, we can hope for considerable speedup by parallel processing – even without parallel addition.

Exercise 1.6. Show that by using parallel addition (see Sect. 1.1.1), we can achieve a span O(log² n) for parallel school multiplication.

Parallel programming environments for multicore processors make it relatively easy to exploit parallelism in the kind of divide-and-conquer computations described above; see Sect. 2.14 for details. Roughly, they create a task for recursive calls and automatically assign cores to tasks. They ensure that enough tasks are created to keep all available cores busy, but also that tasks stay reasonably large. Section 14.5 explains the load-balancing algorithms behind this programming model.

Exercise 1.7. Describe a parallel divide-and-conquer algorithm for the Karatsuba method. Show that its span is linear in the number of input digits.

1.7 Algorithm Engineering

Karatsuba integer multiplication is superior to the school method for large inputs. In our implementation, the superiority only shows for integers with more than 1000 digits. However, a simple refinement improves the performance significantly. Since the school method is superior to the Karatsuba method for short integers, we should stop the recursion earlier and switch to the school method for numbers which have fewer than n0 digits for some yet to be determined n0. We call this approach the refined Karatsuba method. It is never worse than either the school method or the original Karatsuba algorithm as long as n0 is not chosen too large.

[Fig. 1.6 plots time/sec against the recursion threshold n0 for 2048-digit and 4096-digit inputs.] Fig. 1.6. The running time of the Karatsuba method as a function of the recursion threshold n0. The times consumed for multiplying 2048-digit and 4096-digit integers are shown. The minimum is at n0 = 32.
What is a good choice for n0? We shall answer this question both experimentally and analytically. Let us discuss the experimental approach first. We simply time the refined Karatsuba algorithm for different values of n0 and then adopt the value giving the smallest running time. For our implementation, the best results were obtained for n0 = 32 (see Fig. 1.6). The asymptotic behavior of the refined Karatsuba method is shown in Fig. 1.4. We see that the running time of the refined method still grows like n^(log 3), that the refined method is about three times faster than the basic Karatsuba method and hence the refinement is highly effective, and that the refined method is never slower than the school method.

Exercise 1.8. Derive a recurrence for the worst-case number TR(n) of primitive operations performed by the refined Karatsuba method.

We can also approach the question analytically. If we use the school method to multiply n-digit numbers, we need 3n² primitive operations. If we use one Karatsuba step and then multiply the resulting numbers of length ⌈n/2⌉ + 1 using the school method, we need about 3(3(n/2 + 1)²) + 7n primitive operations. The latter is smaller for n ≥ 23, and hence a recursive step saves primitive operations as long as the number of digits is more than 23. You should not take this as an indication that an actual implementation should switch at integers of approximately 23 digits, as the argument concentrates solely on primitive operations. You should take it as an argument that it is wise to have a nontrivial recursion threshold n0 and then determine the threshold experimentally.

Exercise 1.9. Throughout this chapter, we have assumed that both arguments of a multiplication are n-digit integers. What can you say about the complexity of multiplying n-digit and m-digit integers? (a) Show that the school method requires no more than α · nm primitive operations for some constant α.
(b) Assume n ≥ m and divide a into ⌈n/m⌉ numbers of m digits each. Multiply each of the fragments by b using Karatsuba's method and combine the results. What is the running time of this approach?

1.8 The Programs

We give C++ programs for the school and Karatsuba methods below. These programs were used for the timing experiments described in this chapter. The programs were executed on a machine with a 2 GHz dual-core Intel Core 2 Duo T7200 processor with 4 MB of cache memory and 2 GB of main memory. The programs were compiled with GNU C++ version 3.3.5 using optimization level -O2.

A digit is simply an unsigned int and an integer is a vector of digits; here, "vector" is the vector type of the standard template library. A declaration integer a(n) declares an integer with n digits, a.size() returns the size of a, and a[i] returns a reference to the ith digit of a. Digits are numbered starting at 0. The global variable B stores the base. The functions fullAdder and digitMult implement the primitive operations on digits. We sometimes need to access digits beyond the size of an integer; the function getDigit(a, i) returns a[i] if i is a legal index for a and returns 0 otherwise:

typedef unsigned int digit;
typedef vector<digit> integer;
unsigned int B = 10;                    // Base, 2 <= B <= 2^16

void fullAdder(digit a, digit b, digit c, digit& s, digit& carry)
{ unsigned int sum = a + b + c; carry = sum/B; s = sum - carry*B; }

void digitMult(digit a, digit b, digit& s, digit& carry)
{ unsigned int prod = a*b; carry = prod/B; s = prod - carry*B; }

digit getDigit(const integer& a, int i)
{ return (i < a.size() ? a[i] : 0); }

We want to run our programs on random integers: randDigit is a simple random generator for digits, and randInteger fills its argument with random digits.
unsigned int X = 542351;
digit randDigit() { X = 443143*X + 6412431; return X % B; }
void randInteger(integer& a)
{ int n = a.size(); for (int i = 0; i < n; i++) a[i] = randDigit(); }

We now come to the school method of multiplication. We start with a routine that multiplies an integer a by a digit b and returns the result in atimesb. We need a carry digit carry, which we initialize to 0. In each iteration, we compute d and c such that c ∗ B + d = a[i] ∗ b. We then add d, the c from the previous iteration, and the carry from the previous iteration, store the result in atimesb[i], and remember the carry. The school method (the function mult) multiplies a by each digit of b and then adds it at the appropriate position to the partial result (the function addAt):

void mult(const integer& a, const digit& b, integer& atimesb)
{ int n = a.size();
  assert(atimesb.size() == n+1);
  digit carry = 0, c, d, cprev = 0;
  for (int i = 0; i < n; i++)
  { digitMult(a[i], b, d, c);
    fullAdder(d, cprev, carry, atimesb[i], carry);
    cprev = c;
  }
  d = 0;
  fullAdder(d, cprev, carry, atimesb[n], carry);
  assert(carry == 0);
}

void addAt(integer& p, const integer& atimesbj, int j)
{ // p has length n+m
  digit carry = 0; int L = p.size();
  for (int i = j; i < L; i++)
    fullAdder(p[i], getDigit(atimesbj, i - j), carry, p[i], carry);
  assert(carry == 0);
}

integer mult(const integer& a, const integer& b)
{ int n = a.size(); int m = b.size();
  integer p(n + m, 0); integer atimesbj(n+1);
  for (int j = 0; j < m; j++)
  { mult(a, b[j], atimesbj); addAt(p, atimesbj, j); }
  return p;
}

For Karatsuba's method, we also need algorithms for general addition and subtraction. The subtraction method may assume that the first argument is no smaller than the second.
It computes its result in the first argument:

integer add(const integer& a, const integer& b)
{ int n = max(a.size(), b.size());
  integer s(n+1); digit carry = 0;
  for (int i = 0; i < n; i++)
    fullAdder(getDigit(a, i), getDigit(b, i), carry, s[i], carry);
  s[n] = carry;
  return s;
}

void sub(integer& a, const integer& b)   // requires a >= b
{ digit carry = 0;
  for (int i = 0; i < a.size(); i++)
    if (a[i] >= (getDigit(b, i) + carry))
    { a[i] = a[i] - getDigit(b, i) - carry; carry = 0; }
    else
    { a[i] = a[i] + B - getDigit(b, i) - carry; carry = 1; }
  assert(carry == 0);
}

The function split splits an integer into two integers of half the size:

void split(const integer& a, integer& a1, integer& a0)
{ int n = a.size(); int k = n/2;
  for (int i = 0; i < k; i++) a0[i] = a[i];
  for (int i = 0; i < n - k; i++) a1[i] = a[k + i];
}

The function Karatsuba works exactly as described in the text. If the inputs have fewer than n0 digits, the school method is employed. Otherwise, the inputs are split into numbers of half the size and the products p0, p1, and p2 are formed. Then p0 and p2 are written into the output vector and subtracted from p1. Finally, the modified p1 is added to the result:

integer Karatsuba(const integer& a, const integer& b, int n0)
{ int n = a.size(); int m = b.size();
  assert(n == m); assert(n0 >= 4);
  integer p(2*n);
  if (n < n0) return mult(a, b);
  int k = n/2;
  integer a0(k), a1(n - k), b0(k), b1(n - k);
  split(a, a1, a0); split(b, b1, b0);
  integer p2 = Karatsuba(a1, b1, n0),
          p1 = Karatsuba(add(a1, a0), add(b1, b0), n0),
          p0 = Karatsuba(a0, b0, n0);
  for (int i = 0; i < 2*k; i++) p[i] = p0[i];
  for (int i = 2*k; i < n+m; i++) p[i] = p2[i - 2*k];
  sub(p1, p0); sub(p1, p2);
  addAt(p, p1, k);
  return p;
}

The following program generated the data for Fig.
1.4:

inline double cpuTime() { return double(clock())/CLOCKS_PER_SEC; }

int main()
{ for (int n = 8; n <= 131072; n *= 2)
  { integer a(n), b(n);
    randInteger(a); randInteger(b);

    double T = cpuTime(); int k = 0;
    while (cpuTime() - T < 1) { mult(a, b); k++; }
    cout << "\n" << n << " school = " << (cpuTime() - T)/k;

    T = cpuTime(); k = 0;
    while (cpuTime() - T < 1) { Karatsuba(a, b, 4); k++; }
    cout << " Karatsuba4 = " << (cpuTime() - T)/k; cout.flush();

    T = cpuTime(); k = 0;
    while (cpuTime() - T < 1) { Karatsuba(a, b, 32); k++; }
    cout << " Karatsuba32 = " << (cpuTime() - T)/k; cout.flush();
  }
  return 0;
}

1.9 Proofs of Lemma 1.7 and Theorem 1.9

To make this chapter self-contained, we include proofs of Lemma 1.7 and Theorem 1.9. We start with an analysis of the recursive version of the school method. Recall that T(n), the maximum number of primitive operations required by our recursive multiplication algorithm when applied to n-digit integers, satisfies the recurrence relation

T(n) ≤ 1  if n = 1,  and  T(n) ≤ 4 · T(⌈n/2⌉) + 4n  if n ≥ 2.

We use induction on n to show T(n) ≤ 5n² − 4n when n is a power of 2. For n = 1, we have T(1) ≤ 1 = 5n² − 4n. For n > 1, we have

T(n) ≤ 4T(n/2) + 4n ≤ 4(5(n/2)² − 4n/2) + 4n = 5n² − 4n,

where the second inequality follows from the induction hypothesis. For general n, we observe that multiplying n-digit integers is certainly no more costly than multiplying 2^⌈log n⌉-digit integers and hence T(n) ≤ T(2^⌈log n⌉). Since 2^⌈log n⌉ ≤ 2n, we conclude T(n) ≤ 20n² for all n.

Exercise 1.10. Prove a bound on the recurrence T(1) ≤ 1 and T(n) ≤ 4T(n/2) + 9n when n is a power of 2.

How did we know that "5n² − 4n" is the bound to be shown? There is no magic here. For n = 2^k, repeated substitution yields

T(2^k) ≤ 4 · T(2^(k−1)) + 4 · 2^k
      ≤ 4 · (4 · T(2^(k−2)) + 4 · 2^(k−1)) + 4 · 2^k
      ≤ 4 · (4 · (4 · T(2^(k−3)) + 4 · 2^(k−2)) + 4 · 2^(k−1)) + 4 · 2^k
      ≤ 4³ · T(2^(k−3)) + 4 · (4² · 2^(k−2) + 4¹ · 2^(k−1) + 2^k) ≤ ...
      ≤ 4^k · T(1) + 4 · ∑_{0≤i≤k−1} 4^i · 2^(k−i)
      ≤ 4^k + 4 · 2^k · ∑_{0≤i≤k−1} 2^i
      ≤ 4^k + 4 · 2^k · (2^k − 1) = n² + 4n(n − 1) = 5n² − 4n.

We now turn to the proof of Theorem 1.9. Recall that TK satisfies the recurrence

TK(n) ≤ 3n²  if n ≤ 3,  and  TK(n) ≤ 3 · TK(⌈n/2⌉ + 1) + 8n  if n ≥ 4.

The recurrence for the school method has the nice property that if n is a power of 2, the arguments of T on the right-hand side are again powers of two. This is not true for TK. However, if n = 2^k + 2 and k ≥ 1, then ⌈n/2⌉ + 1 = 2^(k−1) + 2, and hence we should now use numbers of the form n = 2^k + 2, k ≥ 0, as the basis of the inductive argument. We shall show that

TK(2^k + 2) ≤ 51 · 3^k − 16 · 2^k − 8  for k ≥ 0.

For k = 0, we have TK(2^0 + 2) = TK(3) ≤ 3 · 3² = 27 = 51 · 3^0 − 16 · 2^0 − 8. For k ≥ 1, we have

TK(2^k + 2) ≤ 3 · TK(2^(k−1) + 2) + 8 · (2^k + 2)
           ≤ 3 · (51 · 3^(k−1) − 16 · 2^(k−1) − 8) + 8 · (2^k + 2) = 51 · 3^k − 16 · 2^k − 8.

Again, there is no magic in coming up with the right induction hypothesis. It is obtained by repeated substitution. Namely,

TK(2^k + 2) ≤ 3 · TK(2^(k−1) + 2) + 8 · (2^k + 2)
           ≤ 3^k · TK(2^0 + 2) + 8 · (3^0 · (2^k + 2) + 3^1 · (2^(k−1) + 2) + ... + 3^(k−1) · (2^1 + 2))
           ≤ 27 · 3^k + 8 · (2^k · ((3/2)^k − 1)/(3/2 − 1) + 2 · (3^k − 1)/(3 − 1))
           ≤ 51 · 3^k − 16 · 2^k − 8,

where the first inequality uses the fact that 2^k + 2 is even, the second inequality follows from repeated substitution, the third inequality uses TK(3) ≤ 27, and the last inequality follows by a simple computation.

It remains to extend the bound to all n. Let k be the minimum integer such that n ≤ 2^k + 2. Then k ≤ 1 + log n. Also, multiplying n-digit numbers is no more costly than multiplying (2^k + 2)-digit numbers, and hence

TK(n) ≤ 51 · 3^k − 16 · 2^k − 8 ≤ 153 · 3^(log n) ≤ 153 · n^(log 3),

where we have used the equality 3^(log n) = 2^((log 3)·(log n)) = n^(log 3).

Exercise 1.11. Solve the recurrence

TR(n) ≤ 3n² + 2n  if n < n0,  and  TR(n) ≤ 3 · TR(⌈n/2⌉ + 1) + 8n  if n ≥ n0,

where n0 is a positive integer. Optimize n0.
1.10 Implementation Notes

The programs given in Sect. 1.8 are not optimized. The base of the number system should be a power of 2 so that sums and carries can be extracted by bit operations. Also, the size of a digit should agree with the word size of the machine and a little more work should be invested in implementing the primitive operations on digits.

1.10.1 C++

GMP [127] and LEDA [195] offer exact arithmetic on integers and rational numbers of arbitrary size, and arbitrary-precision floating-point arithmetic. Highly optimized implementations of Karatsuba's method are used for multiplication here.

1.10.2 Java

java.math implements exact arithmetic on integers of arbitrary size and arbitrary-precision floating-point numbers.

1.11 Historical Notes and Further Findings

Is the Karatsuba method the fastest known method for integer multiplication? No, much faster methods are known. Karatsuba's method splits an integer into two parts and requires three multiplications of integers of half the length. The natural extension is to split integers into k parts of length n/k each. If the recursive step requires ℓ multiplications of numbers of length n/k, the running time of the resulting algorithm grows like n^(log_k ℓ). In this way, Toom [316] and Cook [78] reduced the running time to¹⁵ O(n^(1+ε)) for arbitrary positive ε. Asymptotically even more efficient are the algorithms of Schönhage and Strassen [286] and Schönhage [285]. The former multiplies n-bit integers with O(n log n log log n) bit operations, and can be implemented to run in this time bound on a Turing machine. The latter runs in linear time O(n) and requires the machine model discussed in Sect. 2.2. In this model, integers with log n bits can be multiplied in constant time. The former algorithm was improved by Fürer [117] and De et al. [84] to O(n log n · 2^(c·log* n)) bit operations.

¹⁵ The O(·) notation is defined in Sect. 2.1.
Here, log* n is the smallest integer k ≥ 0 such that log(log(... log(n) ...)) ≤ 1 (k-fold application of the logarithm function). The function log* n grows extremely slowly. In March 2019 an algorithm for integer multiplication was announced that requires O(n log n) bit operations [146]. This bound is widely believed to be best possible.

Modern microprocessors use a base B = 2 multiplication algorithm that does quadratic work but requires only O(log n) wire delays [149]. The basic idea is to first compute n² digit products in parallel. Bits with the same position in the output are combined together using full adders, which reduce three values at position i to one value at position i and one value at position i + 1. This way, log_{3/2} n layers of full adders suffice to reduce up to n initial bits at position i to two bits. The remaining bits can be fed into an adder with logarithmic delay as in Sect. 1.1.1.

2 Introduction

When you want to become a sculptor,¹ you have to learn some basic techniques: where to get the right stones, how to move them, how to handle the chisel, how to erect scaffolding, .... Knowing these techniques will not make you a famous artist, but even if you have an exceptional talent, it will be very difficult to develop into a successful artist without knowing them. It is not necessary to master all of the basic techniques before sculpting the first piece. But you always have to be willing to go back to improve your basic techniques.

This introductory chapter plays a similar role in this book. We introduce basic concepts that make it simpler to discuss and analyze algorithms in the subsequent chapters. There is no need for you to read this chapter from beginning to end before you proceed to later chapters. You can also skip the parts on parallel processing when you are only considering sequential algorithms. On the first reading, we recommend that you should read carefully to the end of Sect. 2.3 and skim through the remaining sections.
¹ The above illustration of Stonehenge is from [255].

© Springer Nature Switzerland AG 2019. P. Sanders et al., Sequential and Parallel Algorithms and Data Structures, https://doi.org/10.1007/978-3-030-25209-0_2

We begin in Sect. 2.1 by introducing some notation and terminology that allow us to argue about the complexity of algorithms in a concise way. We then introduce machine models in Sect. 2.2 that allow us to abstract from the highly variable complications introduced by real hardware. The models are concrete enough to have predictive value and abstract enough to permit elegant arguments. Section 2.3 then introduces a high-level pseudocode notation for algorithms that is much more convenient for expressing algorithms than the machine code of our abstract machine. Pseudocode is also more convenient than actual programming languages, since we can use high-level concepts borrowed from mathematics without having to worry about exactly how they can be compiled to run on actual hardware. We frequently annotate programs to make algorithms more readable and easier to prove correct. This is the subject of Sect. 2.6. Section 2.7 gives the first comprehensive example: binary search in a sorted array. In Sect. 2.8, we introduce mathematical techniques for analyzing the complexity of programs, in particular, for analyzing nested loops and recursive procedure calls. Additional analysis techniques are needed for average-case analysis and parallel algorithm analysis; these are covered in Sects. 2.9 and 2.10, respectively. Randomized algorithms, discussed in Sect. 2.11, use coin tosses in their execution. Section 2.12 is devoted to graphs, a concept that will play an important role throughout the book. In Sect. 2.13, we discuss the question of when an algorithm should be called efficient, and introduce the complexity classes P and NP and the concept of NP-completeness. Finally, as in most chapters of this book, we close with implementation notes (Sect.
2.14) and historical notes and further findings (Sect. 2.15).

2.1 Asymptotic Notation

The main purpose of algorithm analysis is to give performance guarantees, for example bounds on running time, that are at the same time accurate, concise, general, and easy to understand. It is difficult to meet all these criteria simultaneously. For example, the most accurate way to characterize the running time T of an algorithm is to view T as a mapping from the set I of all inputs to the set of nonnegative numbers R+. For any problem instance i, T(i) is the running time on i. This level of detail is so overwhelming that we could not possibly derive a theory about it. A useful theory needs a more global view of the performance of an algorithm.

Hence, we group the set of all inputs into classes of "similar" inputs and summarize the performance on all instances in the same class in a single number. The most useful grouping is by size. Usually, there is a natural way to assign a size to each problem instance. The size of an integer is the number of digits in its representation, and the size of a set is the number of elements in that set. The size of an instance is always a natural number. Sometimes we use more than one parameter to measure the size of an instance; for example, it is customary to measure the size of a graph by its number of nodes and its number of edges. We ignore this complication for now. We use size(i) to denote the size of instance i, and In to denote the set of instances of size n for n ∈ N. For the inputs of size n, we are interested in the maximum, minimum, and average execution times:²

worst case:   T(n) = max {T(i) : i ∈ In};
best case:    T(n) = min {T(i) : i ∈ In};
average case: T(n) = (1/|In|) · ∑_{i ∈ In} T(i).

We are most interested in the worst-case execution time, since it gives us the strongest performance guarantee. A comparison of the best and the worst case tells us how much the execution time varies for different inputs in the same class.
If the discrepancy is big, the average case may give more insight into the true performance of the algorithm. Section 2.9 gives an example.

² We shall make sure that {T(i) : i ∈ In} always has a proper minimum and maximum, and that In is finite when we consider averages.

We shall perform one more step of data reduction: We shall concentrate on growth rate or asymptotic analysis. Functions f(n) and g(n) have the same growth rate if there are positive constants c and d such that c ≤ f(n)/g(n) ≤ d for all sufficiently large n, and f(n) grows faster than g(n) if, for all positive constants c, we have f(n) ≥ c · g(n) for all sufficiently large n. For example, the functions n², n² + 7n, 5n² − 7n, and n²/10 + 10⁶ · n all have the same growth rate. Also, they grow faster than n^(3/2), which in turn grows faster than n log n. The growth rate talks about the behavior for large n. The word "asymptotic" in "asymptotic analysis" also stresses the fact that we are interested in the behavior for large n.

Why are we interested only in growth rates and the behavior for large n? We are interested in the behavior for large n because the whole purpose of designing efficient algorithms is to be able to solve large instances. For large n, an algorithm whose running time has a smaller growth rate than the running time of another algorithm will be superior. Also, our machine model is an abstraction of real machines and hence can predict actual running times only up to a constant factor. A pleasing side effect of concentrating on growth rate is that we can characterize the running times of algorithms by simple functions. However, in the sections on implementation, we shall frequently take a closer look and go beyond asymptotic analysis. Also, when using one of the algorithms described in this book, you should always ask yourself whether the asymptotic view is justified. The following definitions allow us to argue precisely about asymptotic behavior.
Let f(n) and g(n) denote functions that map nonnegative integers to nonnegative real numbers:

O(f(n)) = {g(n) : ∃c > 0 : ∃n0 ∈ N+ : ∀n ≥ n0 : g(n) ≤ c · f(n)},
Ω(f(n)) = {g(n) : ∃c > 0 : ∃n0 ∈ N+ : ∀n ≥ n0 : g(n) ≥ c · f(n)},
Θ(f(n)) = O(f(n)) ∩ Ω(f(n)),
o(f(n)) = {g(n) : ∀c > 0 : ∃n0 ∈ N+ : ∀n ≥ n0 : g(n) ≤ c · f(n)},
ω(f(n)) = {g(n) : ∀c > 0 : ∃n0 ∈ N+ : ∀n ≥ n0 : g(n) ≥ c · f(n)}.

The left-hand sides should be read as "big O of f", "big omega of f", "theta of f", "little o of f", and "little omega of f", respectively.

A remark about notation is in order here. In the definitions above, we use "f(n)" and "g(n)" with two different meanings. In "O(f(n))" and "{g(n) : ...}", they denote the functions f and g and the "n" emphasizes that these are functions of the argument n, and in "g(n) ≤ c · f(n)", they denote the values of the functions at the argument n.

Let us see some examples. O(n²) is the set of all functions that grow at most quadratically, o(n²) is the set of functions that grow less than quadratically, and o(1) is the set of functions that go to 0 as n goes to infinity. Here "1" stands for the function n ↦ 1, which is 1 everywhere, and hence f ∈ o(1) if f(n) ≤ c · 1 for any positive c and sufficiently large n, i.e., f(n) goes to zero as n goes to infinity. Generally, O(f(n)) is the set of all functions that "grow no faster than" f(n). Similarly, Ω(f(n)) is the set of all functions that "grow at least as fast as" f(n). For example, the Karatsuba algorithm for integer multiplication has a worst-case running time in O(n^1.58), whereas the school algorithm has a worst-case running time in Ω(n²), so that we can say that the Karatsuba algorithm is asymptotically faster than the school algorithm. The "little o" notation o(f(n)) denotes the set of all functions that "grow strictly more slowly than" f(n). Its twin ω(f(n)) is rarely used, and is only shown for completeness.
The growth rate of most algorithms discussed in this book is either a polynomial or a logarithmic function, or the product of a polynomial and a logarithmic function. We use polynomials to introduce our readers to some basic manipulations of asymptotic notation.

Lemma 2.1. Let p(n) = ∑_{i=0}^{k} a_i n^i denote any polynomial and assume a_k > 0. Then p(n) ∈ Θ(n^k).

Proof. It suffices to show that p(n) ∈ O(n^k) and p(n) ∈ Ω(n^k). First observe that for n ≥ 1,

p(n) ≤ ∑_{i=0}^{k} |a_i| n^i ≤ n^k ∑_{i=0}^{k} |a_i|,

and hence p(n) ≤ (∑_{i=0}^{k} |a_i|) n^k for all positive n. Thus p(n) ∈ O(n^k). Let A = ∑_{i=0}^{k−1} |a_i|. For positive n, we have

p(n) ≥ a_k n^k − A n^(k−1) = (a_k/2) n^k + n^(k−1) ((a_k/2) n − A),

and hence p(n) ≥ (a_k/2) n^k for n > 2A/a_k. We choose c = a_k/2 and n0 = 2A/a_k in the definition of Ω(n^k), and obtain p(n) ∈ Ω(n^k). ⊓⊔

Exercise 2.1. Right or wrong? (a) n² + 10⁶ · n ∈ O(n²), (b) n log n ∈ O(n), (c) n log n ∈ Ω(n), (d) log n ∈ o(n).

Asymptotic notation is used a lot in algorithm analysis, and it is convenient to stretch mathematical notation a little in order to allow sets of functions (such as O(n²)) to be treated similarly to ordinary functions. In particular, we shall always write h = O(f) instead of h ∈ O(f), and O(h) = O(f) instead of O(h) ⊆ O(f). For example, 3n² + 7n = O(n²) = O(n³). Never forget that sequences of "equalities" involving O-notation are really membership and inclusion relations and, as such, can only be read from left to right.

If h is a function, F and G are sets of functions, and ◦ is an operator such as +, ·, or /, then F ◦ G is a shorthand for {f ◦ g : f ∈ F, g ∈ G}, and h ◦ F stands for {h} ◦ F. So f(n) + o(f(n)) denotes the set of all functions f(n) + g(n) where g(n) grows strictly more slowly than f(n), i.e., the ratio (f(n) + g(n))/f(n) goes to 1 as n goes to infinity. Equivalently, we can write (1 + o(1)) f(n). We use this notation whenever we care about the constant in the leading term but want to ignore lower-order terms.
Lemma 2.2. The following rules hold for O-notation:

\[
\begin{aligned}
c f(n) &= \Theta(f(n)) \quad \text{for any positive constant } c,\\
f(n) + g(n) &= \Omega(f(n)),\\
f(n) + g(n) &= O(f(n)) \quad \text{if } g(n) = O(f(n)),\\
O(f(n)) \cdot O(g(n)) &= O(f(n) \cdot g(n)).
\end{aligned}
\]

Exercise 2.2. Prove Lemma 2.2.

Exercise 2.3. Sharpen Lemma 2.1 and show that p(n) = a_k n^k + o(n^k).

Exercise 2.4. Prove that n^k = o(c^n) for any integer k and any c > 1. How does n^{log log n} compare with n^k and c^n?

2.2 The Sequential Machine Model

In 1945, John von Neumann (Fig. 2.1) introduced a computer architecture [243] which was simple, yet powerful. The limited hardware technology of the time forced him to come up with a design that concentrated on the essentials; otherwise, realization would have been impossible. Hardware technology has developed tremendously since 1945. However, the programming model resulting from von Neumann's design is so elegant and powerful that it is still the basis for most of modern programming. Usually, programs written with von Neumann's model in mind also work well on the vastly more complex hardware of today's machines.

Fig. 2.1. John von Neumann, born Dec. 28, 1903 in Budapest, died Feb. 8, 1957 in Washington, DC.

The variant of von Neumann's model used in algorithmic analysis is called the RAM (random access machine) model. It was introduced by Shepherdson and Sturgis [294] in 1963. It is a sequential machine with uniform memory, i.e., there is a single processing unit, and all memory accesses take the same amount of time. The (main) memory, or store, consists of infinitely many cells S[0], S[1], S[2], ...; at any point in time, only a finite number of them will be in use. In addition to the main memory, there are a small number of registers R_1, ..., R_k. The memory cells store "small" integers, also called words. In our discussion of integer arithmetic in Chap. 1, we assumed that "small" meant one-digit.
It is more reasonable and convenient to assume that the interpretation of "small" depends on the size of the input. Our default assumption is that integers whose absolute value is bounded by a polynomial in the size of the input can be stored in a single cell. Such integers can be represented by a number of bits that is logarithmic in the size of the input. This assumption is reasonable because we could always spread out the contents of a single cell over logarithmically many cells with a logarithmic overhead in time and space and obtain constant-size cells. The assumption is convenient because we want to be able to store array indices in a single cell. The assumption is necessary because allowing cells to store arbitrary numbers would lead to absurdly overoptimistic algorithms. For example, by repeated squaring, we could generate a number with 2^n bits in n steps. Namely, if we start with the number 2 = 2^{2^0}, squaring it once gives 4 = 2^2 = 2^{2^1}, squaring it twice gives 16 = 2^4 = 2^{2^2}, and squaring it n times gives 2^{2^n}. Our model supports a limited form of parallelism. We can perform simple operations on a logarithmic number of bits in constant time.

A RAM can execute (machine) programs. A program is simply a sequence of machine instructions, numbered from 1 to some number ℓ. The elements of the sequence are called program lines. The program is stored in a program store. Our RAM supports the following machine instructions:

• R_i := S[R_j] loads the contents of the memory cell indexed by the contents of R_j into register R_i.
• S[R_j] := R_i stores the contents of register R_i in the memory cell indexed by the contents of R_j.
• R_i := R_j ⊙ R_h executes the binary operation ⊙ on the contents of registers R_j and R_h and stores the result in register R_i. Here, ⊙ is a placeholder for a variety of operations. The arithmetic operations are the usual +, −, and ∗; they interpret the contents of the registers as integers.
The operations div and mod stand for integer division and the remainder, respectively. The comparison operations ≤, <, >, and ≥ for integers return truth values, i.e., either true (= 1) or false (= 0). The logical operations ∧ and ∨ manipulate the truth values 0 and 1. We also have bitwise Boolean operations | (OR), & (AND), and ⊕ (exclusive OR, XOR). They interpret contents as bit strings. The shift operators >> (shift right) and << (shift left) interpret the first argument as a bit string and the second argument as a nonnegative integer. We may also assume that there are operations which interpret the bits stored in a register as a floating-point number, i.e., a finite-precision approximation of a real number.
• R_i := ⊙R_j executes the unary operation ⊙ on the contents of register R_j and stores the result in register R_i. The operators −, ¬ (logical NOT), and ~ (bitwise NOT) are available.
• R_i := C assigns the constant value C to R_i.
• JZ k, R_i continues execution at program line k if register R_i is 0, and at the next program line otherwise (conditional branch). There is also the variant JZ R_j, R_i, where the target of the jump is the program line stored in R_j.
• J k continues execution at program line k (unconditional branch). Similarly to JZ, the program line can also be specified by the content of a register.

A program is executed on a given input step by step. The input for a computation is stored in memory cells S[1] to S[R_1], and execution starts with program line 1. With the exception of the branch instructions JZ and J, the next instruction to be executed is always the instruction in the next program line. The execution of a program terminates if a program line is to be executed whose number is outside the range 1..ℓ. Recall that ℓ is the number of the last program line. We define the execution time of a program on an input in the simplest way: Each instruction takes one time step to execute.
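The instruction set and the unit-cost time measure just described can be made concrete with a toy interpreter. The sketch below is our own encoding, not the book's: the instruction names and the tuple format are invented for illustration, and only a small subset of the instructions is supported. It counts one time step per executed instruction and halts when the program counter leaves the range 1..ℓ.

```python
def run_ram(program, S, R):
    """Toy interpreter for a fragment of the RAM model. `program` is a
    list of instructions occupying program lines 1..l; execution starts
    at line 1 and terminates once the program counter leaves 1..l.
    Returns the execution time in the unit-cost model."""
    pc, steps, l = 1, 0, len(program)
    while 1 <= pc <= l:
        op, *args = program[pc - 1]
        steps += 1                       # one time step per instruction
        if op == "load":                 # Ri := S[Rj]
            i, j = args; R[i] = S[R[j]]
        elif op == "store":              # S[Rj] := Ri
            i, j = args; S[R[j]] = R[i]
        elif op == "const":              # Ri := C
            i, c = args; R[i] = c
        elif op == "add":                # Ri := Rj + Rh
            i, j, h = args; R[i] = R[j] + R[h]
        elif op == "jz":                 # JZ k, Ri
            k, i = args
            if R[i] == 0:
                pc = k
                continue
        pc += 1
    return steps

# Program: S[1] := S[1] + S[2], using registers R1..R4.
program = [
    ("const", 1, 1),    # R1 := 1
    ("const", 2, 2),    # R2 := 2
    ("load", 3, 1),     # R3 := S[R1]
    ("load", 4, 2),     # R4 := S[R2]
    ("add", 3, 3, 4),   # R3 := R3 + R4
    ("store", 3, 1),    # S[R1] := R3
]
S, R = {1: 20, 2: 22}, {}
steps = run_ram(program, S, R)
assert S[1] == 42 and steps == 6
```

After the store in line 6, the program counter advances to 7, which lies outside 1..6, so execution terminates with an execution time of 6 steps.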
The total execution time of a program is the number of instructions executed. It is important to remember that the RAM model is an abstraction. One should not confuse it with physically existing machines. In particular, real machines have a finite memory and a fixed number of bits per register (e.g., 32 or 64). In contrast, the word size and memory of a RAM scale with input size. This can be viewed as an abstraction of the historical development. Microprocessors have had words of 4, 8, 16, and 32 bits in succession, and now often have 64-bit words. Words of 64 bits can index a memory of size 2^64. Thus, at current prices, memory size is limited by cost and not by physical limitations. This statement was also true when 32-bit words were introduced. Our complexity model is a gross oversimplification: Modern processors attempt to execute many instructions in parallel. How well they succeed depends on fact