
Sequential and Parallel Algorithms and Data Structures: The Basic Toolbox

This textbook is a concise introduction to the basic toolbox of structures that allow efficient organization and retrieval of data, key algorithms for problems on graphs, and generic techniques for modeling, understanding, and solving algorithmic problems. The authors aim for a balance between simplicity and efficiency, between theory and practice, and between classical results and the forefront of research. Individual chapters cover arrays and linked lists, hash tables and associative arrays, sorting and selection, priority queues, sorted sequences, graph representation, graph traversal, shortest paths, minimum spanning trees, optimization, collective communication and computation, and load balancing. The authors also discuss important issues such as algorithm engineering, memory hierarchies, algorithm libraries, and certifying algorithms. Moving beyond the sequential algorithms and data structures of the earlier related title, this book takes into account the paradigm shift towards the parallel processing required to solve modern performance-critical applications and how this impacts on the teaching of algorithms. The book is suitable for undergraduate and graduate students and professionals familiar with programming and basic mathematical language. Most chapters have the same basic structure: the authors discuss a problem as it occurs in a real-life situation, they illustrate the most important applications, and then they introduce simple solutions as informally as possible and as formally as necessary so the reader really understands the issues at hand. As they move to more advanced and optional issues, their approach gradually leads to a more mathematical treatment, including theorems and proofs. The book includes many examples, pictures, informal explanations, and exercises, and the implementation notes introduce clean, efficient implementations in languages such as C++ and Java.
Year: 2019
Publisher: Springer
Language: English
Pages: 509 / 516
ISBN 10: 3030252094
ISBN 13: 9783030252090
Peter Sanders
Kurt Mehlhorn
Martin Dietzfelbinger
Roman Dementiev


Peter Sanders • Kurt Mehlhorn
Martin Dietzfelbinger • Roman Dementiev

Sequential and Parallel
Algorithms and Data Structures
The Basic Toolbox

Peter Sanders
Karlsruhe Institute of Technology
Karlsruhe, Germany

Kurt Mehlhorn
Max Planck Institute for Informatics
Saarbrücken, Germany

Martin Dietzfelbinger
Technische Universität Ilmenau
Ilmenau, Germany

Roman Dementiev
Intel Deutschland GmbH
Feldkirchen, Germany

Intel material (including descriptions of algorithms and their implementations, text, figures,
and tables describing their experimental evaluation and experimental results on Intel processors,
and sample code) is reproduced by permission of Intel Corporation.
ISBN 978-3-030-25208-3
ISBN 978-3-030-25209-0 (eBook)
https://doi.org/10.1007/978-3-030-25209-0
© Springer Nature Switzerland AG 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors
or the editors give a warranty, express or implied, with respect to the material contained herein or for
any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To all algorithmicists

Preface

An algorithm is a precise and unambiguous recipe for solving a class of problems.
If formulated as a program, it can be executed by a machine. Algorithms are at
the heart of every nontrivial computer application. Therefore, every computer scientist and every professional programmer should know about the basic algorithmic
toolbox: structures that allow efficient organization and retrieval of data, key algorithms for problems on graphs, and generic techniques for modeling, understanding,
and solving algorithmic problems. This book is a concise introduction to this basic
toolbox, intended for students and professionals familiar with programming and basic mathematical language. Although the presentation is concise, we have included
many examples, pictures, informal explanations, and exercises, and some linkage to
the real world. The book covers the material typically taught in undergraduate and
first-year graduate courses on algorithms.
Most chapters have the same basic structure. We begin by discussing a problem
as it occurs in a real-life situation. We illustrate the most important applications and
then introduce simple solutions as informally as possible and as formally as necessary to really understand the issues at hand. When we move to more advanced and
optional issues, this approach gradually leads to a more mathematical treatment, including theorems and proofs. Thus, the book should work for readers with a wide
range of mathematical expertise. There are also advanced sections (marked with a *), which we recommend skipping on first reading. Exercises
provide additional examples, alternative approaches and opportunities to think about
the problems. It is highly recommended to take a look at the exercises even if there
is no time to solve them during the first reading. In order to be able to concentrate on
ideas rather than programming details, we use pictures, words, and high-level pseudocode to explain our algorithms. A section “Implementation Notes” links these abstract ideas to clean, efficient implementations in real programming languages such
as C++ and Java. Each chapter ends with a section on further findings that provides
a glimpse at the state of the art, generalizations, and advanced solutions.
The first edition of this book treated only sequential algorithms and data structures. Today, almost every computer, be it a desktop, a notebook, or a smartphone,
has multiple cores, and sequential machines have become an exotic species. The


reason for this change is that sequential processors have ceased to get proportional
performance improvements from increased circuit complexity. Although the number
of transistors in an integrated circuit (still) doubles every two years (Moore’s law),
the only reasonable way to use this transistor budget is to put multiple processor
cores on a chip. The consequence is that nowadays every performance-critical application has to be parallelized. Moreover, big data – the explosion of data set sizes in
many applications – has produced an enormous demand for algorithms that scale to
a large number of processors. This paradigm shift has profound effects on teaching
algorithms. Parallel algorithms are no longer a specialized topic reserved for a small
percentage of students. Rather, every student needs some exposure to parallel algorithms, and parallel solution paradigms need to be taught early on. As a consequence,
parallel algorithms should be integrated tightly and early into algorithms courses. We
therefore decided to include parallel algorithms in the second edition of the book.
Each chapter now has some sections on parallel algorithms. The goals remain the
same as for the first edition: a careful balance between simplicity and efficiency,
between theory and practice, and between classical results and the forefront of research. We use a slightly different style for the sections on parallel computing. We
include concrete programming examples because parallel programming is still more
difficult than sequential programming (the programs are available at github.com/basic-toolbox-sample-code/basic-toolbox-sample-code/). We
also reference original work directly in the text instead of in the section on history
because the parallel part is closer to current research.
Algorithmics is a modern and active area of computer science, even at the level
of the basic toolbox. We have made sure that we present algorithms in a modern
way, including explicitly formulated invariants. We also discuss important further
aspects, such as algorithm engineering, memory hierarchies, algorithm libraries, and
certifying algorithms.
We have chosen to arrange most of the material by problem domain and not by
solution technique. The chapter on optimization techniques is an exception. We find
that an organization by problem domain allows a more concise presentation. However, it is also important that readers and students obtain a good grasp of the available
techniques. Therefore, we have structured the optimization chapter by techniques,
and an extensive index provides cross-references between different applications of
the same technique. Bold page numbers in the index indicate the pages where concepts are defined.
This book can be used in multiple ways in teaching. We have used the first edition
of the book and a draft version of the second edition in undergraduate and graduate
courses on algorithmics. In a first-year undergraduate course, we concentrated on the sequential part of the book. In a second algorithms course at an advanced bachelor or master's level, with students who have some experience in parallel programming, we made most of the sequential part of the book a prerequisite and concentrated on the more advanced material, such as the starred sections and the parts on
external memory and parallel algorithms. If a first algorithms course is taught later
in the undergraduate curriculum, the book can be used to teach parallel and sequential algorithms in an integrated way. Although we have included some material about


parallel programming and several concrete programming examples, the parallel part
of the book works best for readers who already have some background in parallel
programming. Another approach is to use the book to provide concrete algorithmic
content for a parallel programming course that uses another book for the programming part. Last but not least, the book should be useful for independent study and
to professionals who have a basic knowledge of algorithms but less experience with
parallelism.
Follow us on an exciting tour through the world of algorithms.
Ilmenau, Saarbrücken, Karlsruhe, Heidelberg
August 2018

Martin Dietzfelbinger
Kurt Mehlhorn
Peter Sanders
Roman Dementiev

A first edition of this book was published in 2008. Since then the book has been
translated into Chinese, Greek, Japanese, and German. Martin Dietzfelbinger translated the book into German. Actually, he did much more than a translation. He thoroughly revised the book and improved the presentation at many places. He also corrected a number of mistakes. Thus, the book gained through the translation, and we
decided to make the German edition the reference for any future editions. It is only
natural that we asked Martin to become an author of the German edition and any
future editions of the book.
Saarbrücken, Karlsruhe
March 2014

Kurt Mehlhorn
Peter Sanders

Soon after the publication of the German edition, we started working on the revised English edition. We decided to expand the book into the parallel world for the
reasons indicated in the preface. Twenty years ago, parallel machines were exotic; nowadays, sequential machines are exotic. However, the parallel world is much more
diverse and complex than the sequential world, and therefore algorithm-engineering
issues become more important. We concluded that we had to go all the way to implementations and experimental evaluations for some of the parallel algorithms. We
invited Roman Dementiev to work with us on the algorithm engineering aspects of
parallel computing. Roman received his PhD in 2006 for a thesis on “Algorithm Engineering for Large Data Sets”. He now works for Intel, where he is responsible for
performance engineering of a major database system.
Ilmenau, Saarbrücken, Karlsruhe
November 2017

Martin Dietzfelbinger
Kurt Mehlhorn
Peter Sanders

Contents

1 Appetizer: Integer Arithmetic
   1.1 Addition
   1.2 Multiplication: The School Method
   1.3 Result Checking
   1.4 A Recursive Version of the School Method
   1.5 Karatsuba Multiplication
   1.6 Parallel Multiplication
   1.7 Algorithm Engineering
   1.8 The Programs
   1.9 Proofs of Lemma 1.7 and Theorem 1.9
   1.10 Implementation Notes
   1.11 Historical Notes and Further Findings

2 Introduction
   2.1 Asymptotic Notation
   2.2 The Sequential Machine Model
   2.3 Pseudocode
   2.4 Parallel Machine Models
   2.5 Parallel Pseudocode
   2.6 Designing Correct Algorithms and Programs
   2.7 An Example – Binary Search
   2.8 Basic Algorithm Analysis
   2.9 Average-Case Analysis
   2.10 Parallel-Algorithm Analysis
   2.11 Randomized Algorithms
   2.12 Graphs
   2.13 P and NP
   2.14 Implementation Notes
   2.15 Historical Notes and Further Findings

3 Representing Sequences by Arrays and Linked Lists
   3.1 Processing Arrays in Parallel
   3.2 Linked Lists
   3.3 Processing Linked Lists in Parallel
   3.4 Unbounded Arrays
   3.5 *Amortized Analysis
   3.6 Stacks and Queues
   3.7 Parallel Queue-Like Data Structures
   3.8 Lists versus Arrays
   3.9 Implementation Notes
   3.10 Historical Notes and Further Findings

4 Hash Tables and Associative Arrays
   4.1 Hashing with Chaining
   4.2 Universal Hashing
   4.3 Hashing with Linear Probing
   4.4 Chaining versus Linear Probing
   4.5 *Perfect Hashing
   4.6 Parallel Hashing
   4.7 Implementation Notes
   4.8 Historical Notes and Further Findings

5 Sorting and Selection
   5.1 Simple Sorters
   5.2 Simple, Fast, and Inefficient Parallel Sorting
   5.3 Mergesort – an O(n log n) Sorting Algorithm
   5.4 Parallel Mergesort
   5.5 A Lower Bound
   5.6 Quicksort
   5.7 Parallel Quicksort
   5.8 Selection
   5.9 Parallel Selection
   5.10 Breaking the Lower Bound
   5.11 *Parallel Bucket Sort and Radix Sort
   5.12 *External Sorting
   5.13 Parallel Sample Sort with Implementations
   5.14 *Parallel Multiway Mergesort
   5.15 Parallel Sorting with Logarithmic Latency
   5.16 Implementation Notes
   5.17 Historical Notes and Further Findings

6 Priority Queues
   6.1 Binary Heaps
   6.2 Addressable Priority Queues
   6.3 *External Memory
   6.4 Parallel Priority Queues
   6.5 Implementation Notes
   6.6 Historical Notes and Further Findings

7 Sorted Sequences
   7.1 Binary Search Trees
   7.2 (a, b)-Trees and Red–Black Trees
   7.3 More Operations
   7.4 Amortized Analysis of Update Operations
   7.5 Augmented Search Trees
   7.6 Parallel Sorted Sequences
   7.7 Implementation Notes
   7.8 Historical Notes and Further Findings

8 Graph Representation
   8.1 Unordered Edge Sequences
   8.2 Adjacency Arrays – Static Graphs
   8.3 Adjacency Lists – Dynamic Graphs
   8.4 The Adjacency Matrix Representation
   8.5 Implicit Representations
   8.6 Parallel Graph Representation
   8.7 Implementation Notes
   8.8 Historical Notes and Further Findings

9 Graph Traversal
   9.1 Breadth-First Search
   9.2 Parallel Breadth-First Search
   9.3 Depth-First Search
   9.4 Parallel Traversal of DAGs
   9.5 Implementation Notes
   9.6 Historical Notes and Further Findings

10 Shortest Paths
   10.1 From Basic Concepts to a Generic Algorithm
   10.2 Directed Acyclic Graphs
   10.3 Nonnegative Edge Costs (Dijkstra’s Algorithm)
   10.4 *Average-Case Analysis of Dijkstra’s Algorithm
   10.5 Monotone Integer Priority Queues
   10.6 Arbitrary Edge Costs (Bellman–Ford Algorithm)
   10.7 All-Pairs Shortest Paths and Node Potentials
   10.8 Shortest-Path Queries
   10.9 Parallel Shortest Paths
   10.10 Implementation Notes
   10.11 Historical Notes and Further Findings

11 Minimum Spanning Trees
   11.1 Cut and Cycle Properties
   11.2 The Jarník–Prim Algorithm
   11.3 Kruskal’s Algorithm
   11.4 The Union–Find Data Structure
   11.5 *External Memory
   11.6 *Parallel Algorithms
   11.7 Applications
   11.8 Implementation Notes
   11.9 Historical Notes and Further Findings

12 Generic Approaches to Optimization
   12.1 Linear Programming – Use a Black-Box Solver
   12.2 Greedy Algorithms – Never Look Back
   12.3 Dynamic Programming – Build It Piece by Piece
   12.4 Systematic Search – When in Doubt, Use Brute Force
   12.5 Local Search – Think Globally, Act Locally
   12.6 Evolutionary Algorithms
   12.7 Implementation Notes
   12.8 Historical Notes and Further Findings

13 Collective Communication and Computation
   13.1 Broadcast
   13.2 Reduction
   13.3 Prefix Sums
   13.4 Synchronization
   13.5 (All)-Gather/Scatter
   13.6 All-to-All Message Exchange
   13.7 Asynchronous Collective Communication

14 Load Balancing
   14.1 Overview and Basic Assumptions
   14.2 Prefix Sums – Independent Tasks with Known Size
   14.3 The Master–Worker Scheme
   14.4 (Randomized) Static Load Balancing
   14.5 Work Stealing
   14.6 Handling Dependencies

A Mathematical Background
   A.1 Mathematical Symbols
   A.2 Mathematical Concepts
   A.3 Basic Probability Theory
   A.4 Useful Formulae

B Computer Architecture Aspects
   B.1 Cores and Hardware Threads
   B.2 The Memory Hierarchy
   B.3 Cache Coherence Protocols
   B.4 Atomic Operations
   B.5 Hardware Transactional Memory
   B.6 Memory Management
   B.7 The Interconnection Network
   B.8 CPU Performance Analysis
   B.9 Compiler

C Support for Parallelism in C++
   C.1 “Hello World” C++11 Program with Threads
   C.2 Locks
   C.3 Asynchronous Operations
   C.4 Atomic Operations
   C.5 Useful Tools
   C.6 Memory Management
   C.7 Thread Scheduling

D The Message Passing Interface (MPI)
   D.1 “Hello World” and What Is an MPI Program?
   D.2 Point-to-Point Communication
   D.3 Collective Communication

E List of Commercial Products, Trademarks and Software Licenses
   E.1 BSD 3-Clause License

References
Index

1
Appetizer: Integer Arithmetic

An appetizer is supposed to stimulate the appetite at the beginning of a meal. This
is exactly the purpose of this chapter. We want to stimulate your interest in algorithmic1 techniques by showing you a surprising result. Although the school method for
multiplying integers has been familiar to all of us since our school days and seems
to be the natural way to multiply integers, it is not the best multiplication algorithm;
there are much faster ways to multiply large integers, i.e., integers with thousands or
even millions of digits, and we shall teach you one of them.
Algorithms for arithmetic are not only among the oldest algorithms, but are also essential in areas such as cryptography, geometric computing, computer algebra, and
computer architecture. The first three areas need software algorithms for arithmetic on long integers, i.e., numbers with up to millions of digits. We shall present
the algorithms learned in school, but also an improved algorithm for multiplication.
The improved multiplication algorithm is not just an intellectual gem but is also extensively used in applications. Computer architecture needs very fast algorithms for
moderate-length (32 to 128 bits) integers. We shall present fast parallel algorithms
suitable for hardware implementation for addition and multiplication. On the way,
we shall learn basic algorithm analysis and basic algorithm engineering techniques
in a simple setting. We shall also see the interplay of theory and experiment.
We assume that integers2 are represented as digit strings. In the base B number system, where B is an integer larger than 1, there are digits 0, 1, . . . , B − 1, and a digit string an−1 an−2 . . . a1 a0 represents the number ∑0≤i<n ai B^i. The most important systems with a small value of B are base 2, with digits 0 and 1, base 10, with digits 0 to 9, and base 16, with digits 0 to 15 (frequently written as 0 to 9, A, B, C, D, E, and F). Larger bases, such as 2^8, 2^16, 2^32, and 2^64, are also useful. For example,

    “10101” in base 2 represents   1 · 2^4 + 0 · 2^3 + 1 · 2^2 + 0 · 2^1 + 1 · 2^0 = 21,
    “924” in base 10 represents    9 · 10^2 + 2 · 10^1 + 4 · 10^0 = 924.

1 The Soviet stamp on this page shows Muhammad ibn Musa al-Khwarizmi (born approximately 780; died between 835 and 850), Persian mathematician and astronomer from the Khorasan province of present-day Uzbekistan. The word “algorithm” is derived from his name.
2 Throughout this chapter, we use “integer” to mean a nonnegative integer.

We assume that we have two primitive operations at our disposal: the addition
of three digits with a two-digit result (this is sometimes called a full adder), and the
multiplication of two digits with a two-digit result.3 For example, in base 10, we have 3 + 5 + 5 = 13 and 6 · 7 = 42. We shall measure the efficiency of our algorithms by the number of primitive operations executed.

3 Observe that the sum of three digits is at most 3(B − 1) and the product of two digits is at most (B − 1)², and that both expressions are bounded by (B − 1) · B^1 + (B − 1) · B^0 = B² − 1, the largest integer that can be written with two digits. This clearly holds for B = 2. Increasing B by 1 increases the first expression by 3, the second expression by 2B − 1, and the third expression by 2B + 1. Hence the claim holds for all B.
We can artificially turn any n-digit integer into an m-digit integer for any m ≥ n
by adding additional leading 0’s. Concretely, “425” and “000425” represent the same
integer. We shall use a and b for the two operands of an addition or multiplication
and assume throughout this chapter that a and b are n-digit integers. The assumption
that both operands have the same length simplifies the presentation without changing
the key message of the chapter. We shall come back to this remark at the end of the
chapter. We refer to the digits of a as an−1 to a0 , with an−1 being the most significant
digit (also called leading digit) and a0 being the least significant digit, and write
a = (an−1 . . . a0 ). The leading digit may be 0. Similarly, we use bn−1 to b0 to denote
the digits of b, and write b = (bn−1 . . . b0 ).

1.1 Addition
We all know how to add two integers a = (an−1 . . . a0 ) and b = (bn−1 . . . b0 ). We
simply write one under the other with the least significant digits aligned, and sum
the integers digitwise, carrying a single digit from one position to the next. This digit
is called a carry. The result will be an (n + 1)-digit integer s = (sn . . . s0 ). Graphically,
      an−1 . . . a1 a0     first operand
      bn−1 . . . b1 b0     second operand
   cn cn−1 . . . c1  0     carries
   sn sn−1 . . . s1 s0     sum

where c0 , . . . , cn is the sequence of carries and s = (sn . . . s0 ) is the sum. We have
c0 = 0, ci+1 · B + si = ai + bi + ci for 0 ≤ i < n, and sn = cn . As a program, this is
written as
c = 0 : Digit
// Variable for the carry digit
for i := 0 to n − 1 do add ai , bi , and c to form si and a new carry c
sn := c
We need one primitive operation for each position, and hence a total of n primitive
operations.
Theorem 1.1. The addition of two n-digit integers requires exactly n primitive operations. The result is an (n + 1)-digit integer.
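Below is a minimal C++ sketch of this digit-by-digit addition (it is not the reference implementation of Sect. 1.8). It assumes that a number is stored as a std::vector of digits with the least significant digit first, and the names Digit and addSchool are our own.

#include <iostream>
#include <vector>
#include <cstdint>
#include <cstddef>

using Digit = std::uint32_t;
const Digit B = 10;                      // any base B > 1 works

// School addition of two n-digit numbers; returns the (n + 1)-digit sum.
// One primitive operation (full adder) per digit position.
std::vector<Digit> addSchool(const std::vector<Digit>& a, const std::vector<Digit>& b) {
    std::size_t n = a.size();            // we assume a.size() == b.size()
    std::vector<Digit> s(n + 1);
    Digit c = 0;                         // carry, always 0 or 1 (Lemma 1.2)
    for (std::size_t i = 0; i < n; ++i) {
        Digit t = a[i] + b[i] + c;       // add three digits, two-digit result
        s[i] = t % B;
        c = t / B;
    }
    s[n] = c;                            // s_n = c_n
    return s;
}

int main() {
    std::vector<Digit> a = {5, 2, 4}, b = {7, 1, 9};       // 425 and 917, least significant digit first
    for (Digit d : addSchool(a, b)) std::cout << d << ' '; // prints 2 4 3 1, i.e., 1342
    std::cout << '\n';
}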
1.1.1 Parallel Addition
Our addition algorithm produces the result digits one after the other, from least significant to most significant. In fact, the ith carry depends on the (i − 1)th carry, which
in turn depends on the (i − 2)nd carry, and so on. So, it seems natural to compute the
result digits sequentially. Is there a parallel algorithm for addition that produces all
digits in time less than linear in the number of digits? Parallel addition is not an academic exercise but crucial for microprocessor technology, because processors need
fast, hardware-implemented algorithms for arithmetic. Suppose that engineers had
insisted on using serial addition algorithms for microprocessors. In that case it is
likely that we would still be using 8-bit processors wherever possible, since they
would be up to eight times faster than 64-bit processors.
Our strategy for parallel addition is simple and ambitious – we want to perform
all digit additions ai + bi + ci in parallel. Of course, the problem is how to compute
the carries ci in parallel. Consider the following example of the addition of two 8-digit binary numbers a and b:
    position  8 7 6 5 4 3 2 1 0
    a           1 0 0 0 0 1 1 1
    b           0 1 0 0 1 0 1 0
    c         0 0 0 0 1 1 1 0 0
    x           p p s s p p g p
    y           s s s s g g g p

Row c indicates the carries. Why is there a carry into position 4, why is c4 = 1? Because position 1 generates a carry, since a1 + b1 = 2, and positions 2 and 3 propagate
it, since a2 + b2 = a3 + b3 = 1. Why is there no carry into position 8, why is c8 = 0?
Positions 6 and 7 would propagate a carry, since a6 + b6 = a7 + b7 = 1, but there
is nothing to propagate, since no carry is generated in position 5 since a5 + b5 = 0.
The general rule is: We have a carry into a certain position if a carry is generated
somewhere to the right and then propagated through all intermediate positions.
Rows x and y implement this rule. Row x indicates whether a position generates
(g), propagates (p), or stops (s) a carry. We have

    xi = g   if ai + bi = 2,
         p   if ai + bi = 1,
         s   if ai + bi = 0.

The different xi ’s are independent and can be computed in parallel. Row y gives
information whether we have a carry into a certain position. For i ≥ 1, we have a carry
into position i, i.e., ci = 1, if and only if yi−1 = g. We never have a carry into position
0, i.e., c0 = 0. We give two rules for determining the sequence of y’s. Consider the
maximal subsequences of the x-sequence consisting of a string of p’s followed by
either an s or a g or the right end. Within each subsequence, turn all symbols into
the last symbol; leave the sequence of trailing p’s unchanged. In our example, the
maximal subsequences are x7 x6 x5 , x4 , and x3 x2 x1 , and the sequence of trailing p’s
consists of x0 . Therefore y7 y6 y5 become s’s, y4 becomes s, y3 y2 y1 become g’s, and
y0 becomes p. It would be equally fine to turn the trailing p’s into s’s. However, this
would make the formal treatment slightly more cumbersome. Alternatively, we may
state the rule for the yi ’s as follows:
    yi = xi     if i = 0 or xi ∈ {s, g},
         yi−1   otherwise, i.e., i > 0 and xi = p.
Our task is now to compute the yi ’s in parallel. The first step is to rewrite the
definition of yi as

    yi = x0           if i = 0,
         xi ⊗ yi−1    if i > 0,

where the operator ⊗ is given by the following table (rows are indexed by the left argument, columns by the right argument):

    ⊗ | s  p  g
    --+--------
    s | s  s  s
    p | s  p  g
    g | g  g  g
The operator ⊗4 returns its left argument if this argument is s or g and returns its
right argument when the left argument is p. We next expand the definition of yi and
obtain
yi = xi ⊗ yi−1 = xi ⊗ (xi−1 ⊗ yi−2) = xi ⊗ (xi−1 ⊗ (xi−2 ⊗ · · · ⊗ (x1 ⊗ x0) · · ·)).
This formula corresponds to the sequential computation of yi . We first compute x1 ⊗
x0 , then left-multiply by x2 , then by x3 , and finally by xi . The operator ⊗ is associative
(we shall prove this below) and hence we can change the evaluation order without
changing the result. Compare the following formulae for y6 :
y6 = x6 ⊗ (x5 ⊗ (x4 ⊗ (x3 ⊗ (x2 ⊗ (x1 ⊗ x0 )))))
and
y6 = (x6 ⊗ (x5 ⊗ x4)) ⊗ ((x3 ⊗ x2) ⊗ (x1 ⊗ x0)).
4 Pronounced “otimes”.

[Fig. 1.1 (diagram): a circuit of ⊗-gates computing y7, . . . , y0 from the inputs x7, . . . , x0; the in-figure labels are “step 1 (form pairs)”, “step 3 (compute y6, y4, . . . )”, and “circuit for n = 4” for the smaller circuit shown alongside.]
Fig. 1.1. The computation of y7 to y0 from x7 to x0 . The horizontal partition of the computation
corresponds to the general description of the algorithm. We first form pairs. In the second row
of gates, we combine pairs, and in the third row, we combine pairs of pairs. Then we start to
fill in gaps. In row four, we compute y5 , and in the last row, we compute y6 , y4 , and y2 .

The latter formula corresponds to a parallel evaluation of y6 . We compute x5 ⊗ x4 ,
x3 ⊗ x2 , and x1 ⊗ x0 in parallel, then x6 ⊗ (x5 ⊗ x4 ) and (x3 ⊗ x2 ) ⊗ (x1 ⊗ x0 ), and
finally y6 . Thus three rounds of computation suffice instead of the six rounds for the
sequential evaluation. Generally, we can compute yi in ⌈log i⌉ rounds of computation.
The computation of the different yi ’s can be intertwined, as Fig. 1.1 shows.
We next give a more formal treatment for a general base B. We start with the
simple observation that the carry digit is either 0 or 1.
Lemma 1.2. ci ∈ {0, 1} for all i.
Proof. We have c0 = 0 by definition. Assume inductively that ci ∈ {0, 1}. Then ai + bi + ci ≤ 2(B − 1) + 1 = 2B − 1 = 1 · B + (B − 1), and hence ci+1 ∈ {0, 1}. ⊓⊔
Two input digits ai and bi will generate a carry no matter what happens to the right
of them if ai + bi > B − 1. On the other hand, the addition involving ai and bi will
stop any carry if ai + bi < B − 1; this holds because an incoming carry is at most 1.
If ai + bi = B − 1, an incoming carry will be propagated to the left. Hence, we define


    xi := s   if ai + bi < B − 1 (stop),
          p   if ai + bi = B − 1 (propagate),        (1.1)
          g   if ai + bi > B − 1 (generate).
Lemma 1.3. The operator ⊗ is associative, i.e., (u ⊗ v) ⊗ w = u ⊗ (v ⊗ w) for any u, v, and w. Let yi = xi ⊗ xi−1 ⊗ · · · ⊗ x0. Then ci = 1 if and only if yi−1 = g.

Proof. If u ≠ p, then (u ⊗ v) ⊗ w = u ⊗ w = u = u ⊗ (v ⊗ w). If u = p, then (u ⊗ v) ⊗ w = v ⊗ w = u ⊗ (v ⊗ w).


We have a carry into position i if and only if there is a k < i such that a carry is generated in position k and propagated by positions k + 1 to i − 1, i.e., xk = g and xk+1 = · · · = xi−1 = p. Then yk = xk = g. This completes the argument when i − 1 = k. Otherwise, yk+1 = p ⊗ yk = g, yk+2 = p ⊗ yk+1 = g, . . . , yi−1 = p ⊗ yi−2 = g. ⊓⊔
We now come to the parallel computation of the yi ’s. For simplicity, we assume
n to be a power of 2. If n = 1 = 20 , we simply return x0 . Otherwise, we do the
following:
(a) Combine consecutive pairs of inputs: z0 = x1 ⊗ x0 , z1 = x3 ⊗ x2 , . . . , zn/2−1 =
xn−1 ⊗ xn−2.
(b) Apply the algorithm recursively to the sequence zn/2−1 , . . . , z0 to obtain its sequence of prefix sums wn/2−1 , . . . , w0 .
(c) Assemble the output as y2i+1 = wi for i ∈ 0..n/2 − 1 and y0 = x0 and y2i =
x2i ⊗ wi−1 for i ∈ 1..n/2 − 1.

Figure 1.1 illustrates the computation for n = 8. The number N(n) of gates satisfies
N(1) = 0 and N(n) = n/2 + N(n/2) + n/2 − 1 ≤ N(n/2) + n. Note that we use n/2
gates in step (a), n/2 − 1 gates in step (c), and N(n/2) gates in step (b). By repeated
substitution, we obtain N(n) ≤ n + N(n/2) ≤ n + n/2 + n/4 + n/8 + . . . = O(n). The
number of rounds of computation is 0 for n = 1, is 1 for n = 2, and grows by 2
whenever n doubles. Thus the number of rounds is 2 log n − 1.
The algorithm for computing the yi ’s in parallel is an example of a prefix sum
computation – a frequently useful collective computation described in more detail in
Sect. 13.3.
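To make the recursive scheme (a)–(c) concrete, here is a small C++ sketch of the carry-status operator ⊗ and of the prefix computation. It is plain sequential code (a real implementation would evaluate each loop in parallel, one round per recursion level), and the names Status, otimes, and prefixOtimes are ours, not taken from the book’s sample code.

#include <vector>
#include <cassert>
#include <cstddef>

enum Status { S, P, G };                 // stop, propagate, generate; cf. (1.1)

// The otimes operator: returns its left argument unless that argument is P.
Status otimes(Status left, Status right) { return left == P ? right : left; }

// Computes y_i = x_i (otimes) x_{i-1} (otimes) ... (otimes) x_0 for all i, following steps (a)-(c).
// The length of x is assumed to be a power of 2.
std::vector<Status> prefixOtimes(const std::vector<Status>& x) {
    std::size_t n = x.size();
    if (n == 1) return x;
    std::vector<Status> z(n / 2);                         // (a) combine consecutive pairs
    for (std::size_t i = 0; i < n / 2; ++i) z[i] = otimes(x[2*i + 1], x[2*i]);
    std::vector<Status> w = prefixOtimes(z);              // (b) recurse on the pair sequence
    std::vector<Status> y(n);                             // (c) assemble the output
    y[0] = x[0];
    for (std::size_t i = 0; i < n / 2; ++i) {
        y[2*i + 1] = w[i];
        if (i > 0) y[2*i] = otimes(x[2*i], w[i - 1]);
    }
    return y;
}

int main() {
    // x for the 8-digit example in the text, positions 0..7: p g p p s s p p
    std::vector<Status> x = {P, G, P, P, S, S, P, P};
    std::vector<Status> y = prefixOtimes(x);
    // Expected y, positions 0..7: p g g g s s s s; there is a carry into position i+1 iff y[i] == G.
    std::vector<Status> expected = {P, G, G, G, S, S, S, S};
    assert(y == expected);
}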

1.2 Multiplication: The School Method
We all know how to multiply two integers. In this section, we shall review the “school
method”. In a later section, we shall get to know a method which is significantly
faster for large integers.
We shall proceed slowly. We first review how to multiply an n-digit integer a by
a one-digit integer b j . We use b j for the one-digit integer, since this is how we need
it below. For any digit ai of a, we form the product ai · b j . The result is a two-digit
integer (ci di ), i.e.,
ai · bj = ci · B + di.

We form two integers, c = (cn−1 . . . c0 0) and d = (dn−1 . . . d0) from the ci’s and di’s, respectively. Since the c’s are the higher-order digits in the products, we add a 0 digit at the end. We add c and d to obtain the product pj = a · bj. Graphically,

    (an−1 . . . ai . . . a0) · bj   −→    cn−1 cn−2 . . . ci ci−1 . . . c0 0
                                        +      dn−1 . . . di+1 di . . . d1 d0
                                        = sum of c and d

Let us determine the number of primitive operations. For each i, we need one primitive operation to form the product ai · b j , for a total of n primitive operations. Then

we add two (n + 1)-digit numbers (cn−1 . . . c0 0) and (0dn−1 . . . d0 ). However, we may
simply copy the digit d0 to the result and hence effectively add two n-digit numbers.
This requires n primitive operations. So the total number of primitive operations is
2n. The leftmost addition of cn−1 and the carry into this position generates carry 0, since a · bj ≤ (B^n − 1) · (B − 1) < B^{n+1}, and hence the result can be written with n + 1 digits.
Lemma 1.4. We can multiply an n-digit number by a one-digit number with 2n primitive operations. The result is an (n + 1)-digit number.
When you multiply an n-digit number by a one-digit number, you will probably
proceed slightly differently. You combine5 the generation of the products ai · b j with
the summation of c and d into a single phase, i.e., you create the digits of c and d
when they are needed in the final addition. We have chosen to generate them in a
separate phase because this simplifies the description of the algorithm.
Exercise 1.1. Give a program for the multiplication of a and b j that operates in a
single phase.
We can now turn to the multiplication of two n-digit integers. The school method
for integer multiplication works as follows: We first form partial products p j by
multiplying a by the jth digit b j of b, and then sum the suitably aligned products
p j · B j to obtain the product of a and b. Graphically,

                                p0,n  p0,n−1 . . . p0,2  p0,1  p0,0
                        p1,n  p1,n−1  p1,n−2 . . . p1,1  p1,0
                p2,n  p2,n−1  p2,n−2  p2,n−3 . . . p2,0
        ...
    pn−1,n . . . pn−1,3  pn−1,2  pn−1,1  pn−1,0
    sum of the n partial products

The description in pseudocode is more compact. We initialize the product p to 0
and then add to it the partial products a · b j · B j one by one:
p = 0 : N
for j := 0 to n − 1 do p := p + a · bj · B^j
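A compact C++ sketch along these lines is shown below (again with least-significant-digit-first vectors; mulDigit and mulSchool are our own names, not the implementation of Sect. 1.8).

#include <vector>
#include <cstdint>
#include <cassert>
#include <cstddef>

using Digit = std::uint64_t;
const Digit B = 10;                      // the text also suggests larger bases such as 2^16

// Multiply the n-digit number a by the single digit bj; the result has n + 1 digits (Lemma 1.4).
std::vector<Digit> mulDigit(const std::vector<Digit>& a, Digit bj) {
    std::vector<Digit> p(a.size() + 1);
    Digit carry = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        Digit t = a[i] * bj + carry;     // two-digit product plus carry
        p[i] = t % B;
        carry = t / B;
    }
    p[a.size()] = carry;
    return p;
}

// School method: p := p + a * b_j * B^j for j = 0, ..., n-1.
std::vector<Digit> mulSchool(const std::vector<Digit>& a, const std::vector<Digit>& b) {
    std::vector<Digit> p(a.size() + b.size(), 0);
    for (std::size_t j = 0; j < b.size(); ++j) {
        std::vector<Digit> pj = mulDigit(a, b[j]);
        Digit carry = 0;
        for (std::size_t i = 0; i < pj.size(); ++i) {    // add pj, shifted by j positions
            Digit t = p[i + j] + pj[i] + carry;
            p[i + j] = t % B;
            carry = t / B;
        }
        // the final carry is 0: the intermediate result a*(b_j ... b_0) fits into n + j + 1 digits
    }
    return p;
}

int main() {
    std::vector<Digit> a = {9, 2, 4}, b = {7, 5, 3};     // 429 and 357, least significant digit first
    std::vector<Digit> expected = {3, 5, 1, 3, 5, 1};    // 429 * 357 = 153153
    assert(mulSchool(a, b) == expected);
}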

Let us analyze the number of primitive operations required by the school method.
Each partial product pj requires 2n primitive operations, and hence all partial products together require 2n² primitive operations. The product a · b is a 2n-digit number, and hence all summations p + a · bj · B^j are summations of 2n-digit integers. Each such addition requires at most 2n primitive operations, and hence all additions together require at most 2n² primitive operations. Thus, we need no more than 4n² primitive operations in total.
5 In the literature on compiler construction and performance optimization, this transformation is known as loop fusion.

    n         Tn / sec      Tn /Tn/2
    8         0.00000469    –
    16        0.0000154     3.28
    32        0.0000567     3.67
    64        0.000222      3.91
    128       0.000860      3.87
    256       0.00347       4.03
    512       0.0138        3.98
    1 024     0.0547        3.95
    2 048     0.220         4.01
    4 096     0.880         4.00
    8 192     3.53          4.01
    16 384    14.2          4.01
    32 768    56.7          4.00
    65 536    227           4.00
    131 072   910           4.00

[Plot: log–log plot of Tn versus n for the school method (axis labels “n” and “time / sec”, x-ticks 2^2 to 2^16); the data points lie essentially on a line of slope 2.]
Fig. 1.2. The running time of the school method for the multiplication of n-digit integers. The three columns of the table give n, the running time Tn in seconds of the C++ implementation given in Sect. 1.8, and the ratio Tn /Tn/2. The plot shows log Tn versus log n, and we see essentially a line. Observe that if Tn = α · n^β for some constants α and β, then Tn /Tn/2 = 2^β and log Tn = β log n + log α, i.e., log Tn depends linearly on log n with slope β. In our case, the slope is 2. Please use a ruler to check.

A simple observation allows us to improve this bound. The number a · bj · B^j has n + 1 + j digits, the last j of which are 0. We can therefore start the addition in the (j + 1)th position. Also, when we add a · bj · B^j to p, we have p = a · (bj−1 . . . b0), i.e., p has n + j digits. Thus, the addition of p and a · bj · B^j amounts to the addition of two (n + 1)-digit numbers and requires only n + 1 primitive operations. Therefore, all n − 1 additions together require no more than (n − 1)(n + 1) < n² primitive operations. We have thus shown the following result.
Theorem 1.5. The school method multiplies two n-digit integers with 3n² primitive operations.
We have now analyzed the numbers of primitive operations required by the
school methods for integer addition and integer multiplication. The number Mn of
primitive operations for the school method for integer multiplication is essentially
3n². We say that Mn grows quadratically. Observe also that

    Mn / Mn/2 = 3n² / (3(n/2)²) = 4,
i.e., quadratic growth has the consequence of essentially quadrupling the number of
primitive operations when the size of the instance is doubled.
Assume now that we actually implement the multiplication algorithm in our favorite programming language (we shall do so later in the chapter), and then time the program on our favorite machine for various n-digit integers a and b and various n.
What should we expect? We want to argue that we shall see quadratic growth. The
reason is that primitive operations are representative of the running time of the algorithm. Consider the addition of two n-digit integers first. What happens when the
program is executed? For each position i, the digits ai and bi have to be moved to the
processing unit, the sum ai + bi + c has to be formed, the digit si of the result needs
to be stored in memory, the carry c is updated, the index i is incremented, and a test
for loop exit needs to be performed. Thus, for each i, the same number of machine
cycles is executed. We have counted one primitive operation for each i, and hence
the number of primitive operations is representative of the number of machine cycles executed. Of course, there are additional effects, for example pipelining and the
complex transport mechanism for data between memory and the processing unit, but
they will have a similar effect for all i, and hence the number of primitive operations
is also representative of the running time of an actual implementation on an actual
machine. The argument extends to multiplication, since multiplication of a number
by a one-digit number is a process similar to addition and the second phase of the
school method for multiplication amounts to a series of additions.
Let us confirm the above argument by an experiment. Figure 1.2 shows execution
times of a C++ implementation of the school method; the program can be found in
Sect. 1.8. For each n, we performed a large number6 of multiplications of n-digit
random integers and then determined the average running time Tn ; Tn is listed in the
second column of the table. We also show the ratio Tn /Tn/2 . Figure 1.2 also shows
a plot of the data points7 (log n, log Tn ). The data exhibits approximately quadratic
growth, as we can deduce in various ways. The ratio Tn /Tn/2 is always close to four,
and the double logarithmic plot shows essentially a line of slope 2. The experiments
are quite encouraging: Our theoretical analysis has predictive value. Our theoretical
analysis showed quadratic growth of the number of primitive operations, we argued
above that the running time should be related to the number of primitive operations,
and the actual running time essentially grows quadratically. However, we also see
systematic deviations. For small n, the growth factor from one row to the next is by
less than a factor of four, as linear and constant terms in the running time still play a
substantial role. For larger n, the ratio is very close to four. For very large n (too large
to be timed conveniently), we would probably see a factor larger than four, since the
access time to memory depends on the size of the data. We shall come back to this
point in Sect. 2.2.
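The granularity issue mentioned in footnote 6 is usually handled by repeating the measured computation until the total time is well above the clock resolution. The following small C++ sketch shows one way to do this; the helper timeIt is ours and is not the timing code that was used for Fig. 1.2.

#include <chrono>
#include <cstdio>

// Times a call to work(), repeating it often enough that the measured interval
// is much longer than the clock granularity; returns seconds per call.
template <class F>
double timeIt(F work) {
    using clock = std::chrono::steady_clock;
    int reps = 1;
    for (;;) {
        auto start = clock::now();
        for (int r = 0; r < reps; ++r) work();
        std::chrono::duration<double> d = clock::now() - start;
        if (d.count() > 0.2) return d.count() / reps;    // long enough to trust
        reps *= 2;                                       // otherwise repeat more often
    }
}

int main() {
    // Dummy workload; in an experiment like Fig. 1.2 one would multiply random n-digit numbers here.
    double t = timeIt([] { volatile unsigned long s = 0; for (int i = 0; i < 1000000; ++i) s += i; });
    std::printf("%g s per call\n", t);
}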
6 The internal clock that measures CPU time returns its timings in some units, say milliseconds, and hence the rounding required introduces an error of up to one-half of this unit. It is therefore important that the experiment whose duration is to be timed takes much longer than this unit, in order to reduce the effect of rounding.
7 Throughout this book, we use log x to denote the logarithm to base 2, log_2 x.

Exercise 1.2. Write programs for the addition and multiplication of long integers. Represent integers as sequences (arrays or lists or whatever your programming language offers) of decimal digits and use the built-in arithmetic to implement the primitive operations. Then write ADD, MULTIPLY1, and MULTIPLY functions that add integers, multiply an integer by a one-digit number, and multiply integers, respectively. Use your implementation to produce your own version of Fig. 1.2. Experiment with using a larger base than 10, say base 2^16.
Exercise 1.3. Describe and analyze the school method for division.

1.3 Result Checking
Our algorithms for addition and multiplication are quite simple, and hence it is fair
to assume that we can implement them correctly in the programming language of our
choice. However, writing software8 is an error-prone activity, and hence we should
always ask ourselves whether we can check the results of a computation. For multiplication, the authors were taught the following technique in elementary school. The
method is known as Neunerprobe in German, “casting out nines” in English, and
preuve par neuf in French.
Add the digits of a. If the sum is a number with more than one digit, sum its
digits. Repeat until you arrive at a one-digit number, called the checksum of a. We
use sa to denote this checksum. Here is an example:
4528 → 19 → 10 → 1.
Do the same for b and the result c of the computation. This gives the checksums
sb and sc . All checksums are single-digit numbers. Compute sa · sb and form its
checksum s. If s differs from sc , c is not equal to a · b. This test was described by
al-Khwarizmi in his book on algebra.
Let us go through a simple example. Let a = 429, b = 357, and c = 154153.
Then sa = 6, sb = 6, and sc = 1. Also, sa · sb = 36 and hence s = 9. So sc ≠ s, and hence c is not the product of a and b. Indeed, the correct product is c = 153153.
Its checksum is 9, and hence the correct product passes the test. The test is not foolproof, as c = 135153 also passes the test. However, the test is quite useful and detects
many mistakes.
What is the mathematics behind this test? We shall explain a more general
method. Let q be any positive integer; in the method described above, q = 9. Let sa
be the remainder, or residue, in the integer division of a by q, i.e., sa = a − ⌊a/q⌋ · q.
Then 0 ≤ sa < q. In mathematical notation, sa = a mod q.9 Similarly, sb = b mod q
and sc = c mod q. Finally, s = (sa · sb) mod q. If c = a · b, then it must be the case that s = sc. Thus s ≠ sc proves c ≠ a · b and uncovers a mistake in the multiplication (or a mistake in carrying out casting out nines). What do we know if s = sc? We know
that q divides the difference of c and a · b. If this difference is nonzero, the mistake
will be detected by any q which does not divide the difference.
8 The bug in the division algorithm of the floating-point unit of the original Pentium chip became infamous. It was caused by a few missing entries in a lookup table used by the algorithm.
9 The method taught in school uses residues in the range 1 to 9 instead of 0 to 8, according to the definition sa = a − (⌈a/q⌉ − 1) · q.


Let us continue with our example and take q = 7. Then a mod 7 = 2, b mod 7 = 0 and hence s = (2 · 0) mod 7 = 0. But 135153 mod 7 = 4, and we have uncovered the fact that 135153 ≠ 429 · 357.

Exercise 1.4. Explain why casting out nines corresponds to the case q = 9. Hint: 10^k mod 9 = 1 for all k ≥ 0.

Exercise 1.5 (Elferprobe, casting out elevens). Powers of ten have very simple remainders modulo 11, namely 10^k mod 11 = (−1)^k for all k ≥ 0, i.e., 1 mod 11 = +1, 10 mod 11 = −1, 100 mod 11 = +1, 1 000 mod 11 = −1, etc. Describe a simple test to check the correctness of a multiplication modulo 11.

1.4 A Recursive Version of the School Method
We shall now derive a recursive version of the school method. This will be our first
encounter with the divide-and-conquer paradigm, one of the fundamental paradigms
in algorithm design.
Let a and b be our two n-digit integers which we want to multiply. Let k = ⌊n/2⌋.
We split a into two numbers a1 and a0 ; a0 consists of the k least significant digits and
a1 consists of the n − k most significant digits.10 We split b analogously. Then
a = a1 · B^k + a0   and   b = b1 · B^k + b0,

and hence

a · b = a1 · b1 · B^{2k} + (a1 · b0 + a0 · b1) · B^k + a0 · b0.

This formula suggests the following algorithm for computing a · b:
(a) Split a and b into a1 , a0 , b1 , and b0 .
(b) Compute the four products a1 · b1 , a1 · b0, a0 · b1 , and a0 · b0 .
(c) Add the suitably aligned products to obtain a · b.
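For concreteness, consider a small example: let B = 10, a = 4729, and b = 1368, so n = 4 and k = 2. Then a1 = 47, a0 = 29, b1 = 13, and b0 = 68, and

a · b = 47 · 13 · 10^4 + (47 · 68 + 29 · 13) · 10^2 + 29 · 68 = 6 110 000 + 357 300 + 1 972 = 6 469 272,

which is indeed 4729 · 1368.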

Observe that the numbers a1 , a0 , b1 , and b0 are ⌈n/2⌉-digit numbers and hence
the multiplications in step (b) are simpler than the original multiplication if ⌈n/2⌉ <
n, i.e., n > 1. The complete algorithm is now as follows. To multiply one-digit numbers, use the multiplication primitive. To multiply n-digit numbers for n ≥ 2, use the
three-step approach above.
It is clear why this approach is called divide-and-conquer. We reduce the problem
of multiplying a and b to some number of simpler problems of the same kind. A
divide-and-conquer algorithm always consists of three parts: In the first part, we
split the original problem into simpler problems of the same kind (our step (a)); in
the second part, we solve the simpler problems using the same method (our step (b));
and, in the third part, we obtain the solution to the original problem from the solutions
to the subproblems (our step (c)). Instead of “we solve the simpler problems using


[Figure 1.3 shows a rhombus-shaped diagram whose sides are labelled a1, a0, b1, and b0 and whose four subareas are labelled a1 · b1, a1 · b0, a0 · b1, and a0 · b0.]

Fig. 1.3. Visualization of the school method and its recursive variant. The rhombus-shaped area indicates the partial products in the multiplication a · b. The four subareas correspond to the partial products a1 · b1, a1 · b0, a0 · b1, and a0 · b0. In the recursive scheme, we first sum the partial products in the four subareas and then, in a second step, add the four resulting sums.

the same method”, one usually says more elegantly “we solve the simpler problems
recursively”.
What is the connection of our recursive integer multiplication to the school
method? The two methods are strongly related. Both methods compute all n² digit products ai·bj and then sum the resulting two-digit results (appropriately shifted).
They differ in the order in which these summations are performed. Figure 1.3 visualizes the digit products and their place values as rhombus-shaped regions. The school
method adds the digit products row by row. The recursive method first computes
the partial results corresponding to the four subregions, which it then combines to
obtain the final result with three additions. We may almost say that our recursive integer multiplication is just the school method in disguise. In particular, the recursive
algorithm also uses a quadratic number of primitive operations.
We next derive the quadratic behavior without appealing to the connection to
the school method. This will allow us to introduce recurrence relations, a powerful
concept for the analysis of recursive algorithms.
Lemma 1.6. Let T (n) be the maximum number of primitive operations required by
our recursive multiplication algorithm when applied to n-digit integers. Then
T(n) ≤ 1 if n = 1,   and   T(n) ≤ 4 · T(⌈n/2⌉) + 2 · 2 · n if n ≥ 2.
Proof. Multiplying two one-digit numbers requires one primitive multiplication.
This justifies the case n = 1. So, assume n ≥ 2. Splitting a and b into the four pieces
a1 , a0 , b1 , and b0 requires no primitive operations.11 Each piece has at most ⌈n/2⌉
digits and hence the four recursive multiplications require at most 4 · T(⌈n/2⌉) primitive operations. The products a0 · b0 and a1 · b1 · B^{2k} can be combined into a single
number without any cost as the former number is a 2k-digit number and the latter
number ends with 2k many 0’s. Finally, we need two additions to assemble the final
result. Each addition involves two numbers of at most 2n digits and hence requires
at most 2n primitive operations. This justifies the inequality for n ≥ 2.
⊔
⊓
10 Observe that we have changed notation; a0 and a1 now denote the two parts of a and are no longer single digits.
11 It will require work, but it is work that we do not account for in our analysis.


In Sect. 2.8, we shall learn that such recurrences are easy to solve and yield the
already conjectured quadratic execution time of the recursive algorithm.
Lemma 1.7. Let T (n) be the maximum number of primitive operations required by
our recursive multiplication algorithm when applied to n-digit integers. Then T(n) ≤ 5n² if n is a power of 2, and T(n) ≤ 20n² for all n.
Proof. We refer the reader to Sect. 1.9 for a proof.

⊔
⊓

1.5 Karatsuba Multiplication
In 1962, the Soviet mathematician Karatsuba [175] discovered a faster way of multiplying large integers. The running time of his algorithm grows like n^{log 3} ≈ n^{1.58}. The
method is surprisingly simple. Karatsuba observed that a simple algebraic identity allows one multiplication to be eliminated in the divide-and-conquer implementation,
i.e., one can multiply n-digit numbers using only three multiplications of integers
half the size.
The details are as follows. Let a and b be our two n-digit integers which we want
to multiply. Let k = ⌊n/2⌋. As above, we split a into two numbers a1 and a0 ; a0
consists of the k least significant digits and a1 consists of the n − k most significant
digits. We split b in the same way. Then
a = a1 · B^k + a0   and   b = b1 · B^k + b0,

and hence (the magic is in the second equality)

a · b = a1 · b1 · B^{2k} + (a1 · b0 + a0 · b1) · B^k + a0 · b0
      = a1 · b1 · B^{2k} + ((a1 + a0) · (b1 + b0) − (a1 · b1 + a0 · b0)) · B^k + a0 · b0.

At first sight, we have only made things more complicated. A second look, however, shows that the last formula can be evaluated with only three multiplications,
namely, a1 · b1 , a0 · b0 , and (a1 + a0 ) · (b1 + b0 ). We also need six additions.12 That
is three more than in the recursive implementation of the school method. The key
is that additions are cheap compared with multiplications, and hence saving a multiplication more than outweighs the additional additions. We obtain the following
algorithm for computing a · b:
(a) Split a and b into a1 , a0 , b1 , and b0 .
(b) Compute the three products

p2 = a1 · b1,   p0 = a0 · b0,   p1 = (a1 + a0) · (b1 + b0).

12 Actually, five additions and one subtraction. We leave it to readers to convince themselves that subtractions are no harder than additions.


(c) Add the suitably aligned products to obtain a · b, i.e., compute a · b according to the formula

a · b = (p2 · B^{2k} + p0) + (p1 − (p2 + p0)) · B^k.

The first addition can be performed by concatenating the corresponding digit
strings and requires no primitive operation.

The numbers a1 , a0 , b1 , b0 , a1 + a0 , and b1 + b0 are (⌈n/2⌉ + 1)-digit numbers
and hence the multiplications in step (b) are simpler than the original multiplication
if ⌈n/2⌉ + 1 < n, i.e., n ≥ 4. The complete algorithm is now as follows: To multiply numbers with at most three digits, use the school method, and to multiply n-digit numbers for n ≥ 4, use the three-step approach above.
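For concreteness, we continue the small example from Sect. 1.4: let B = 10, a = 4729, and b = 1368, so k = 2, a1 = 47, a0 = 29, b1 = 13, and b0 = 68. Then

p2 = 47 · 13 = 611,   p0 = 29 · 68 = 1972,   p1 = (47 + 29) · (13 + 68) = 76 · 81 = 6156,

and p1 − (p2 + p0) = 6156 − 2583 = 3573, so

a · b = 611 · 10^4 + 3573 · 10^2 + 1972 = 6 469 272,

with only three multiplications of (roughly) half-length numbers instead of four.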

[Figure 1.4 is a log–log plot of running time (time / sec, 0.00001 to 10) against the number of digits n (2^4 to 2^14), with one curve each for the school method, Karatsuba4, and Karatsuba32.]

Fig. 1.4. The running times of implementations of the Karatsuba and school methods for integer multiplication. The running times of two versions of Karatsuba's method are shown: Karatsuba4 switches to the school method for integers with fewer than four digits, and Karatsuba32 switches to the school method for integers with fewer than 32 digits. The slopes of the lines for the Karatsuba variants are approximately 1.58. The running time of Karatsuba32 is approximately one-third the running time of Karatsuba4.

Figure 1.4 shows the running times TS (n), TK4 (n), and TK32 (n) of C++ implementations of the school method and of two variants of the Karatsuba method for
the multiplication of n-digit numbers. Karatsuba4 (running time TK4 (n)) uses the
school method for numbers with fewer than four digits and Karatsuba32 (running
time TK32 (n)) uses the school method for numbers with fewer than 32 digits; we
discuss the rationale for this variant in Sect. 1.7. The scales on both axes are logarithmic. We see, essentially, straight lines of different slope. The running time of
the school method grows like n², and hence the slope is 2 in the case of the school method. The slope is smaller in the case of the Karatsuba method, and this suggests that its running time grows like n^β with β < 2. In fact, the ratios13 TK4(n)/TK4(n/2) and TK32(n)/TK32(n/2) are close to three, and this suggests that β is such that 2^β = 3 or β = log 3 ≈ 1.58. Alternatively, you may determine the slope from Fig. 1.4. We shall prove below that TK(n) grows like n^{log 3}. We say that the Karatsuba method has better asymptotic behavior than the school method. We also see that the inputs have to be quite big before the superior asymptotic behavior of the Karatsuba method actually results in a smaller running time. Observe that for n = 2^8, the school method is still faster, that for n = 2^9, the two methods have about the same running time, and that the Karatsuba method wins for n = 2^10.

13 TK4(1024) = 0.0455, TK4(2048) = 0.1375, and TK4(4096) = 0.41.

The lessons to remember are:
• Better asymptotic behavior ultimately wins.
• An asymptotically slower algorithm can be faster on small inputs.

In the next section, we shall learn how to improve the behavior of the Karatsuba
method for small inputs. The resulting algorithm will always be at least as good as
the school method. It is time to derive the asymptotics of the Karatsuba method.
Lemma 1.8. Let TK (n) be the maximum number of primitive operations required by
the Karatsuba algorithm when applied to n-digit integers. Then
TK(n) ≤ 3n² if n ≤ 3,   and   TK(n) ≤ 3 · TK(⌈n/2⌉ + 1) + 8n if n ≥ 4.
Proof. Multiplying two n-digit numbers using the school method requires no more
than 3n² primitive operations, according to Theorem 1.5. This justifies the first line.
So, assume n ≥ 4. Splitting a and b into the four pieces a1 , a0 , b1 , and b0 requires
no primitive operations.14 Each piece and the sums a0 + a1 and b0 + b1 have at most
⌈n/2⌉ + 1 digits, and hence the three recursive multiplications require at most 3 ·
TK (⌈n/2⌉+ 1) primitive operations. We need two additions to form a0 + a1 and b0 +
b1 . The results of these additions have fewer than n digits and hence the additions
need no more than n elementary operations each. Finally, we need three additions
in order to compute the final result from the results of the multiplications. These are
additions of numbers with at most 2n digits. Thus these additions require at most
3 · 2n primitive operations. Altogether, we obtain the bound stated in the second line
of the recurrence.
⊔
⊓
In Sect. 2.8, we shall learn some general techniques for solving recurrences of this
kind.
Theorem 1.9. Let TK (n) be the maximum number of primitive operations required
by the Karatsuba algorithm when applied to n-digit integers. Then TK (n) ≤ 153nlog3
for all n.
Proof. We refer the reader to Sect. 1.9 for a proof.

⊔
⊓

1.6 Parallel Multiplication
Both the recursive version of the school method and the Karatsuba algorithm are
good starting points for parallel algorithms. For simplicity, we focus on the school
14 It will require work, but remember that we are counting primitive operations.


method. Recall that the bulk of the work is done in the recursive multiplications
a0 b0 , a1 b1 , a0 b1 , and a1 b0 . These four multiplications can be done independently
and in parallel. Hence, in the first level of recursion, up to four processors can work
in parallel. In the second level of recursion, the parallelism is already 4 · 4 = 16.
In the ith level of recursion, 4^i processors can work in parallel. Fig. 1.5 shows a
graphical representation of the resulting computation for multiplying two numbers
a11 a10 a01 a00 and b11 b10 b01 b00 .
[Figure 1.5 shows a task graph whose leaves are the 16 products aij · bkl of the two-digit pieces; additions combine them into the four partial products a1 · b1, a1 · b0, a0 · b1, and a0 · b0, which are then added (with the appropriate shifts) to give a · b.]

Fig. 1.5. Task graph for parallel recursive multiplication with two levels of parallel recursion and using the school method on two-digit numbers as the base case. A “>” stands for a shift by two digits.

What is the running time of the algorithm if an unlimited number of processors
are available? This quantity is known as the span of a parallel computation; see
Sect. 2.10. For simplicity, we assume a sequential addition algorithm. As before, we
shall count only arithmetic operations on digits. For the span S(n) for multiplying
two n-digit integers, we get
S(n) ≤ 1 if n = 1,   and   S(n) ≤ S(⌈n/2⌉) + 3 · 2 · n if n ≥ 2.
This recurrence has the solution S(n) ≤ 12n. Note that this is much less than the
quadratic number of operations performed. Hence, we can hope for considerable
speedup by parallel processing – even without parallel addition.
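Informally, ignoring the ceilings and the base case, repeated substitution explains the bound: S(n) ≤ 6n + 6 · (n/2) + 6 · (n/4) + · · · ≤ 6n · (1 + 1/2 + 1/4 + · · ·) = 12n. The geometric series is why the span is linear even though the total number of operations is quadratic.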
Exercise 1.6. Show that by using parallel addition (see Sect. 1.1.1), we can achieve a span O(log² n) for parallel school multiplication.


Parallel programming environments for multicore processors make it relatively
easy to exploit parallelism in the kind of divide-and-conquer computations described
above; see Sect. 2.14 for details. Roughly, they create a task for each recursive call and
automatically assign cores to tasks. They ensure that enough tasks are created to
keep all available cores busy, but also that tasks stay reasonably large. Section 14.5
explains the load-balancing algorithms behind this programming model.
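As a concrete (and deliberately naive) illustration of this task-based style, the following sketch parallelizes the Karatsuba function of Sect. 1.8 using std::async from the C++ standard library. It assumes the type integer and the functions mult, add, sub, split, and addAt from Sect. 1.8; it is not the implementation used for the experiments in this chapter, it spawns a task for every recursive call instead of stopping at a reasonable task size, and it is only meant to show the structure of the parallel divide-and-conquer.

#include <cassert>
#include <future>

integer parKaratsuba(const integer& a, const integer& b, int n0)
{ int n = a.size(); int m = b.size(); assert(n == m); assert(n0 >= 4);
  integer p(2*n);
  if (n < n0) return mult(a, b);                 // small inputs: school method
  int k = n/2; integer a0(k), a1(n - k), b0(k), b1(n - k);
  split(a, a1, a0); split(b, b1, b0);
  // two of the three subproblems become asynchronous tasks ...
  auto f2 = std::async(std::launch::async, parKaratsuba, std::cref(a1), std::cref(b1), n0);
  auto f1 = std::async(std::launch::async, parKaratsuba, add(a1, a0), add(b1, b0), n0);
  // ... and the third one is solved by the calling thread
  integer p0 = parKaratsuba(a0, b0, n0);
  integer p2 = f2.get();
  integer p1 = f1.get();
  for (int i = 0; i < 2*k; i++) p[i] = p0[i];
  for (int i = 2*k; i < n + m; i++) p[i] = p2[i - 2*k];
  sub(p1, p0); sub(p1, p2); addAt(p, p1, k);
  return p;
}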
Exercise 1.7. Describe a parallel divide-and-conquer algorithm for the Karatsuba
method. Show that its span is linear in the number of input digits.

1.7 Algorithm Engineering
Karatsuba integer multiplication is superior to the school method for large inputs.
In our implementation, the superiority only shows for integers with more than 1000
digits. However, a simple refinement improves the performance significantly. Since
the school method is superior to the Karatsuba method for short integers, we should
stop the recursion earlier and switch to the school method for numbers which have
fewer than n0 digits for some yet to be determined n0 . We call this approach the
refined Karatsuba method. It is never worse than either the school method or the
original Karatsuba algorithm as long as n0 is not chosen too large.

[Figure 1.6 plots time / sec (0.1 to 0.5) against the recursion threshold n0 (4 to 1024, logarithmic scale), with one curve for Karatsuba, n = 2048, and one for Karatsuba, n = 4096.]

Fig. 1.6. The running time of the Karatsuba method as a function of the recursion threshold n0. The times consumed for multiplying 2048-digit and 4096-digit integers are shown. The minimum is at n0 = 32.

What is a good choice for n0 ? We shall answer this question both experimentally
and analytically. Let us discuss the experimental approach first. We simply time the
refined Karatsuba algorithm for different values of n0 and then adopt the value giving
the smallest running time. For our implementation, the best results were obtained for
n0 = 32 (see Fig. 1.6). The asymptotic behavior of the refined Karatsuba method is
shown in Fig. 1.4. We see that the running time of the refined method still grows like n^{log 3}, that the refined method is about three times faster than the basic Karatsuba


method and hence the refinement is highly effective, and that the refined method is
never slower than the school method.
Exercise 1.8. Derive a recurrence for the worst-case number TR (n) of primitive operations performed by the refined Karatsuba method.
We can also approach the question analytically. If we use the school method to
multiply n-digit numbers, we need 3n² primitive operations. If we use one Karatsuba step and then multiply the resulting numbers of length ⌈n/2⌉ + 1 using the school method, we need about 3(3(n/2 + 1)²) + 7n primitive operations. The latter
is smaller for n ≥ 23, and hence a recursive step saves primitive operations as long
as the number of digits is more than 23. You should not take this as an indication that
an actual implementation should switch at integers of approximately 23 digits, as the
argument concentrates solely on primitive operations. You should take it as an argument that it is wise to have a nontrivial recursion threshold n0 and then determine the
threshold experimentally.
Exercise 1.9. Throughout this chapter, we have assumed that both arguments of a
multiplication are n-digit integers. What can you say about the complexity of multiplying n-digit and m-digit integers? (a) Show that the school method requires no
more than α · nm primitive operations for some constant α . (b) Assume n ≥ m and
divide a into ⌈n/m⌉ numbers of m digits each. Multiply each of the fragments by b
using Karatsuba’s method and combine the results. What is the running time of this
approach?

1.8 The Programs
We give C++ programs for the school and Karatsuba methods below. These programs were used for the timing experiments described in this chapter. The programs
were executed on a machine with a 2 GHz dual-core Intel Core 2 Duo T7200 processor with 4 MB of cache memory and 2 GB of main memory. The programs were
compiled with GNU C++ version 3.3.5 using optimization level -O2.
A digit is simply an unsigned int and an integer is a vector of digits; here, “vector”
is the vector type of the standard template library. A declaration integer a(n) declares
an integer with n digits, a.size() returns the size of a, and a[i] returns a reference to
the ith digit of a. Digits are numbered starting at 0. The global variable B stores the
base. The functions fullAdder and digitMult implement the primitive operations on
digits. We sometimes need to access digits beyond the size of an integer; the function
getDigit(a, i) returns a[i] if i is a legal index for a and returns 0 otherwise:
typedef unsigned int digit;
typedef vector<digit> integer;
unsigned int B = 10;    // Base, 2 <= B <= 2^16

void fullAdder(digit a, digit b, digit c, digit& s, digit& carry)
{ unsigned int sum = a + b + c; carry = sum/B; s = sum - carry*B; }


void digitMult(digit a, digit b, digit& s, digit& carry)
{ unsigned int prod = a*b; carry = prod/B; s = prod - carry*B; }

digit getDigit(const integer& a, int i)
{ return (i < a.size() ? a[i] : 0); }

We want to run our programs on random integers: randDigit is a simple random
generator for digits, and randInteger fills its argument with random digits.
unsigned int X = 542351;
digit randDigit() { X = 443143*X + 6412431; return X % B; }
void randInteger(integer& a)
{ int n = a.size(); for (int i = 0; i < n; i++) a[i] = randDigit(); }

We now come to the school method of multiplication. We start with a routine that
multiplies an integer a by a digit b and returns the result in atimesb. We need a carry digit carry, which we initialize to 0. In each iteration, we compute d and c such that
c ∗ B + d = a[i] ∗ b. We then add d, the c from the previous iteration, and the carry
from the previous iteration, store the result in atimesb[i], and remember the carry.
The school method (the function mult) multiplies a by each digit of b and then adds
it at the appropriate position to the partial result (the function addAt):
void mult(const integer& a, const digit& b, integer& atimesb)
{ int n = a.size(); assert(atimesb.size() == n+1);
  digit carry = 0, c, d, cprev = 0;
  for (int i = 0; i < n; i++)
  { digitMult(a[i], b, d, c);
    fullAdder(d, cprev, carry, atimesb[i], carry);
    cprev = c;
  }
  d = 0;
  fullAdder(d, cprev, carry, atimesb[n], carry); assert(carry == 0);
}

void addAt(integer& p, const integer& atimesbj, int j)
{ // p has length n+m
  digit carry = 0; int L = p.size();
  for (int i = j; i < L; i++)
    fullAdder(p[i], getDigit(atimesbj, i - j), carry, p[i], carry);
  assert(carry == 0);
}

integer mult(const integer& a, const integer& b)
{ int n = a.size(); int m = b.size();
  integer p(n + m, 0); integer atimesbj(n+1);
  for (int j = 0; j < m; j++)
  { mult(a, b[j], atimesbj); addAt(p, atimesbj, j); }
  return p;
}
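For illustration only (this is not part of the original programs), the functions above can be exercised as follows. It relies on the same includes and using-directives as the listings in this section; digits are stored with the least significant digit first, so 429 becomes the digit sequence 9, 2, 4, and the brace initialization assumes a C++11 (or newer) compiler.

int main()
{ integer a = { 9, 2, 4 };                            // 429, least significant digit first
  integer b = { 7, 5, 3 };                            // 357
  integer c = mult(a, b);                             // school method, c has 6 digits
  for (int i = c.size() - 1; i >= 0; i--) cout << c[i];  // prints 153153
  cout << "\n";
  return 0;
}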


For Karatsuba’s method, we also need algorithms for general addition and subtraction. The subtraction method may assume that the first argument is no smaller than
the second. It computes its result in the first argument:
integer add(const integer& a, const integer& b)
{ int n = max(a.size(), b.size());
  integer s(n+1); digit carry = 0;
  for (int i = 0; i < n; i++)
    fullAdder(getDigit(a, i), getDigit(b, i), carry, s[i], carry);
  s[n] = carry;
  return s;
}

void sub(integer& a, const integer& b) // requires a >= b
{ digit carry = 0;
  for (int i = 0; i < a.size(); i++)
    if (a[i] >= (getDigit(b, i) + carry))
    { a[i] = a[i] - getDigit(b, i) - carry; carry = 0; }
    else { a[i] = a[i] + B - getDigit(b, i) - carry; carry = 1; }
  assert(carry == 0);
}

The function split splits an integer into two integers of half the size:

void split(const integer& a, integer& a1, integer& a0)
{ int n = a.size(); int k = n/2;
  for (int i = 0; i < k; i++) a0[i] = a[i];
  for (int i = 0; i < n - k; i++) a1[i] = a[k + i];
}

The function Karatsuba works exactly as described in the text. If the inputs have
fewer than n0 digits, the school method is employed. Otherwise, the inputs are split
into numbers of half the size and the products p0, p1, and p2 are formed. Then p0 and
p2 are written into the output vector and subtracted from p1. Finally, the modified p1
is added to the result:
integer Karatsuba(const integer& a, const integer& b, int n0)
{ int n = a.size(); int m = b.size(); assert(n == m); assert(n0 >= 4);
  integer p(2*n);
  if (n < n0) return mult(a, b);
  int k = n/2; integer a0(k), a1(n - k), b0(k), b1(n - k);
  split(a, a1, a0); split(b, b1, b0);
  integer p2 = Karatsuba(a1, b1, n0),
          p1 = Karatsuba(add(a1, a0), add(b1, b0), n0),
          p0 = Karatsuba(a0, b0, n0);
  for (int i = 0; i < 2*k; i++) p[i] = p0[i];
  for (int i = 2*k; i < n + m; i++) p[i] = p2[i - 2*k];
  sub(p1, p0); sub(p1, p2); addAt(p, p1, k);
  return p;
}


The following program generated the data for Fig. 1.4:
inline double cpuTime() { return double(clock())/CLOCKS_PER_SEC; }

int main()
{ for (int n = 8; n <= 131072; n *= 2)
  { integer a(n), b(n); randInteger(a); randInteger(b);
    double T = cpuTime(); int k = 0;
    while (cpuTime() - T < 1) { mult(a,b); k++; }
    cout << "\n" << n << " school = " << (cpuTime() - T)/k;
    T = cpuTime(); k = 0;
    while (cpuTime() - T < 1) { Karatsuba(a,b,4); k++; }
    cout << " Karatsuba4 = " << (cpuTime() - T)/k; cout.flush();
    T = cpuTime(); k = 0;
    while (cpuTime() - T < 1) { Karatsuba(a,b,32); k++; }
    cout << " Karatsuba32 = " << (cpuTime() - T)/k; cout.flush();
  }
  return 0;
}

1.9 Proofs of Lemma 1.7 and Theorem 1.9
To make this chapter self-contained, we include proofs of Lemma 1.7 and Theorem 1.9. We start with an analysis of the recursive version of the school method.
Recall that T (n), the maximum number of primitive operations required by our recursive multiplication algorithm when applied to n-digit integers, satisfies the recurrence relation
T(n) ≤ 1 if n = 1,   and   T(n) ≤ 4 · T(⌈n/2⌉) + 4n if n ≥ 2.
We use induction on n to show T(n) ≤ 5n² − 4n when n is a power of 2. For n = 1, we have T(1) ≤ 1 = 5n² − 4n. For n > 1, we have

T(n) ≤ 4T(n/2) + 4n ≤ 4(5(n/2)² − 4n/2) + 4n = 5n² − 4n,

where the second inequality follows from the induction hypothesis. For general n, we observe that multiplying n-digit integers is certainly no more costly than multiplying 2^⌈log n⌉-digit integers and hence T(n) ≤ T(2^⌈log n⌉). Since 2^⌈log n⌉ ≤ 2n, we conclude T(n) ≤ 20n² for all n.
Exercise 1.10. Prove a bound on the recurrence T (1) ≤ 1 and T (n) ≤ 4T (n/2) + 9n
when n is a power of 2.
How did we know that “5n² − 4n” is the bound to be shown? There is no magic here. For n = 2^k, repeated substitution yields


T(2^k) ≤ 4 · T(2^{k−1}) + 4 · 2^k ≤ 4 · (4 · T(2^{k−2}) + 4 · 2^{k−1}) + 4 · 2^k
       ≤ 4 · (4 · (4 · T(2^{k−3}) + 4 · 2^{k−2}) + 4 · 2^{k−1}) + 4 · 2^k
       ≤ 4³ · T(2^{k−3}) + 4 · (4² · 2^{k−2} + 4¹ · 2^{k−1} + 2^k) ≤ . . .
       ≤ 4^k · T(1) + 4 · ∑_{0 ≤ i ≤ k−1} 4^i · 2^{k−i} ≤ 4^k + 4 · 2^k · ∑_{0 ≤ i ≤ k−1} 2^i
       ≤ 4^k + 4 · 2^k · (2^k − 1) = n² + 4n(n − 1) = 5n² − 4n.
We now turn to the proof of Theorem 1.9. Recall that TK satisfies the recurrence
TK(n) ≤ 3n² if n ≤ 3,   and   TK(n) ≤ 3 · TK(⌈n/2⌉ + 1) + 8n if n ≥ 4.
The recurrence for the school method has the nice property that if n is a power of 2,
the arguments of T on the right-hand side are again powers of two. This is not true
for TK. However, if n = 2^k + 2 and k ≥ 1, then ⌈n/2⌉ + 1 = 2^{k−1} + 2, and hence we should now use numbers of the form n = 2^k + 2, k ≥ 0, as the basis of the inductive argument. We shall show that

TK(2^k + 2) ≤ 51 · 3^k − 16 · 2^k − 8

for k ≥ 0. For k = 0, we have

TK(2^0 + 2) = TK(3) ≤ 3 · 3² = 27 = 51 · 3^0 − 16 · 2^0 − 8.
For k ≥ 1, we have
TK(2^k + 2) ≤ 3 · TK(2^{k−1} + 2) + 8 · (2^k + 2)
            ≤ 3 · (51 · 3^{k−1} − 16 · 2^{k−1} − 8) + 8 · (2^k + 2)
            = 51 · 3^k − 16 · 2^k − 8.

Again, there is no magic in coming up with the right induction hypothesis. It is
obtained by repeated substitution. Namely,
TK(2^k + 2) ≤ 3 · TK(2^{k−1} + 2) + 8 · (2^k + 2)
            ≤ 3^k · TK(2^0 + 2) + 8 · (3^0 · (2^k + 2) + 3^1 · (2^{k−1} + 2) + · · · + 3^{k−1} · (2^1 + 2))
            ≤ 27 · 3^k + 8 · (2^k · ((3/2)^k − 1)/(3/2 − 1) + 2 · (3^k − 1)/(3 − 1))
            ≤ 51 · 3^k − 16 · 2^k − 8,

where the first inequality uses the fact that 2^k + 2 is even, the second inequality follows from repeated substitution, the third inequality uses TK(3) ≤ 27, and the last inequality follows by a simple computation.


It remains to extend the bound to all n. Let k be the minimum integer such that n ≤ 2^k + 2. Then k ≤ 1 + log n. Also, multiplying n-digit numbers is no more costly than multiplying (2^k + 2)-digit numbers, and hence

TK(n) ≤ 51 · 3^k − 16 · 2^k − 8 ≤ 153 · 3^{log n} ≤ 153 · n^{log 3},

where we have used the equality 3^{log n} = 2^{(log 3)·(log n)} = n^{log 3}.
Exercise 1.11. Solve the recurrence
TR(n) ≤ 3n² + 2n if n < n0,   and   TR(n) ≤ 3 · TR(⌈n/2⌉ + 1) + 8n if n ≥ n0,
where n0 is a positive integer. Optimize n0 .

1.10 Implementation Notes
The programs given in Sect. 1.8 are not optimized. The base of the number system
should be a power of 2 so that sums and carries can be extracted by bit operations.
Also, the size of a digit should agree with the word size of the machine and a little
more work should be invested in implementing the primitive operations on digits.
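For instance (a sketch of ours, not taken from the original programs, assuming the digit type of Sect. 1.8 and that unsigned int has at least 32 bits), with B = 2^16 the division and the modulo operation in the primitive operations become a shift and a mask:

const unsigned int B = 1u << 16;                  // base 2^16, replaces the global variable B

void fullAdder(digit a, digit b, digit c, digit& s, digit& carry)
{ unsigned int sum = a + b + c; carry = sum >> 16; s = sum & (B - 1); }

void digitMult(digit a, digit b, digit& s, digit& carry)
{ unsigned int prod = a*b; carry = prod >> 16; s = prod & (B - 1); }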
1.10.1 C++
GMP [127] and LEDA [195] offer exact arithmetic on integers and rational numbers
of arbitrary size, and arbitrary-precision floating-point arithmetic. Highly optimized
implementations of Karatsuba’s method are used for multiplication here.
1.10.2 Java
java.math implements exact arithmetic on integers of arbitrary size and arbitrary-precision floating-point numbers.

1.11 Historical Notes and Further Findings
Is the Karatsuba method the fastest known method for integer multiplication? No,
much faster methods are known. Karatsuba’s method splits an integer into two parts
and requires three multiplications of integers of half the length. The natural extension is to split integers into k parts of length n/k each. If the recursive step requires
ℓ multiplications of numbers of length n/k, the running time of the resulting algorithm grows like n^{log_k ℓ}. In this way, Toom [316] and Cook [78] reduced the running time to15 O(n^{1+ε}) for arbitrary positive ε. Asymptotically even more efficient are the
15 The O(·) notation is defined in Sect. 2.1.


algorithms of Schönhage and Strassen [286] and Schönhage [285]. The former multiplies n-bit integers with O(n log n log log n) bit operations, and can be implemented to run in this time bound on a Turing machine. The latter runs in linear time O(n) and requires the machine model discussed in Sect. 2.2. In this model, integers with log n bits can be multiplied in constant time. The former algorithm was improved by Fürer [117] and De et al. [84] to O((n log n) · 2^{c·log* n}) bit operations. Here, log* n is the smallest integer k ≥ 0 such that log(log(. . . log(n) . . .)) ≤ 1 (k-fold application of the logarithm function). The function log* n grows extremely slowly. In March 2019
an algorithm for integer multiplication was announced that requires O(n log n) bit
operations [146]. This bound is widely believed to be best possible.
Modern microprocessors use a base B = 2 multiplication algorithm that does
quadratic work but requires only O(log n) wire delays [149]. The basic idea is to first
compute n² digit products in parallel. Bits with the same position in the output are
combined together using full adders, which reduce three values at position i to one
value at position i and one value at position i + 1. This way, log_{3/2} n layers of full
adders suffice to reduce up to n initial bits at position i to two bits. The remaining
bits can be fed into an adder with logarithmic delay as in Sect. 1.1.1.

2
Introduction

When you want to become a sculptor,1 you have to learn some basic techniques:
where to get the right stones, how to move them, how to handle the chisel, how to
erect scaffolding, . . . . Knowing these techniques will not make you a famous artist,
but even if you have an exceptional talent, it will be very difficult to develop into a
successful artist without knowing them. It is not necessary to master all of the basic
techniques before sculpting the first piece. But you always have to be willing to go
back to improve your basic techniques.
This introductory chapter plays a similar role in this book. We introduce basic concepts that make it simpler to discuss and analyze algorithms in the subsequent chapters. There is no need for you to read this chapter from beginning to end before you
proceed to later chapters. You can also skip the parts on parallel processing when
you are only considering sequential algorithms. On the first reading, we recommend
that you read carefully to the end of Sect. 2.3 and skim through the remaining sections. We begin in Sect. 2.1 by introducing some notation and terminology
that allow us to argue about the complexity of algorithms in a concise way. We then
introduce machine models in Sect. 2.2 that allow us to abstract from the highly variable complications introduced by real hardware. The models are concrete enough to
have predictive value and abstract enough to permit elegant arguments. Section 2.3
then introduces a high-level pseudocode notation for algorithms that is much more
convenient for expressing algorithms than the machine code of our abstract machine.
Pseudocode is also more convenient than actual programming languages, since we
can use high-level concepts borrowed from mathematics without having to worry
about exactly how they can be compiled to run on actual hardware. We frequently
annotate programs to make algorithms more readable and easier to prove correct.
This is the subject of Sect. 2.6. Section 2.7 gives the first comprehensive example:
binary search in a sorted array. In Sect. 2.8, we introduce mathematical techniques
1 The above illustration of Stonehenge is from [255].


for analyzing the complexity of programs, in particular, for analyzing nested loops
and recursive procedure calls. Additional analysis techniques are needed for averagecase analysis and parallel algorithm analysis; these are covered in Sects. 2.9 and
2.10, respectively. Randomized algorithms, discussed in Sect. 2.11, use coin tosses
in their execution. Section 2.12 is devoted to graphs, a concept that will play an important role throughout the book. In Sect. 2.13, we discuss the question of when an
algorithm should be called efficient, and introduce the complexity classes P and NP
and the concept of NP-completeness. Finally, as in most chapters of this book, we
close with implementation notes (Sect. 2.14) and historical notes and further findings
(Sect. 2.15).

2.1 Asymptotic Notation
The main purpose of algorithm analysis is to give performance guarantees, for example bounds on running time, that are at the same time accurate, concise, general,
and easy to understand. It is difficult to meet all these criteria simultaneously. For
example, the most accurate way to characterize the running time T of an algorithm is
to view T as a mapping from the set I of all inputs to the set of nonnegative numbers
R+ . For any problem instance i, T (i) is the running time on i. This level of detail is
so overwhelming that we could not possibly derive a theory about it. A useful theory
needs a more global view of the performance of an algorithm.
Hence, we group the set of all inputs into classes of “similar” inputs and summarize the performance on all instances in the same class in a single number. The most
useful grouping is by size. Usually, there is a natural way to assign a size to each
problem instance. The size of an integer is the number of digits in its representation,
and the size of a set is the number of elements in that set. The size of an instance is
always a natural number. Sometimes we use more than one parameter to measure the
size of an instance; for example, it is customary to measure the size of a graph by its
number of nodes and its number of edges. We ignore this complication for now. We
use size(i) to denote the size of instance i, and In to denote the set of instances of size
n for n ∈ N. For the inputs of size n, we are interested in the maximum, minimum,
and average execution times:2
worst case:    T(n) = max {T(i) : i ∈ In};
best case:     T(n) = min {T(i) : i ∈ In};
average case:  T(n) = (1/|In|) · ∑_{i ∈ In} T(i).

We are most interested in the worst-case execution time, since it gives us the
strongest performance guarantee. A comparison of the best and the worst case tells
us how much the execution time varies for different inputs in the same class. If the
2 We shall make sure that {T(i) : i ∈ In} always has a proper minimum and maximum, and that In is finite when we consider averages.


discrepancy is big, the average case may give more insight into the true performance
of the algorithm. Section 2.9 gives an example.
We shall perform one more step of data reduction: We shall concentrate on
growth rate or asymptotic analysis. Functions f (n) and g(n) have the same growth
rate if there are positive constants c and d such that c ≤ f (n)/g(n) ≤ d for all sufficiently large n, and f (n) grows faster than g(n) if, for all positive constants c, we
have f(n) ≥ c·g(n) for all sufficiently large n. For example, the functions n², n² + 7n, 5n² − 7n, and n²/10 + 10^6·n all have the same growth rate. Also, they grow faster than n^{3/2}, which in turn grows faster than n log n. The growth rate talks about the behavior
for large n. The word “asymptotic” in “asymptotic analysis” also stresses the fact
that we are interested in the behavior for large n.
Why are we interested only in growth rates and the behavior for large n? We are
interested in the behavior for large n because the whole purpose of designing efficient
algorithms is to be able to solve large instances. For large n, an algorithm whose
running time has a smaller growth rate than the running time of another algorithm
will be superior. Also, our machine model is an abstraction of real machines and
hence can predict actual running times only up to a constant factor. A pleasing side
effect of concentrating on growth rate is that we can characterize the running times
of algorithms by simple functions. However, in the sections on implementation, we
shall frequently take a closer look and go beyond asymptotic analysis. Also, when
using one of the algorithms described in this book, you should always ask yourself
whether the asymptotic view is justified.
The following definitions allow us to argue precisely about asymptotic behavior.
Let f (n) and g(n) denote functions that map nonnegative integers to nonnegative real
numbers:
O( f (n)) = {g(n) : ∃c > 0 : ∃n0 ∈ N+ : ∀n ≥ n0 : g(n) ≤ c · f (n)} ,
Ω( f (n)) = {g(n) : ∃c > 0 : ∃n0 ∈ N+ : ∀n ≥ n0 : g(n) ≥ c · f (n)} ,
Θ( f (n)) = O( f (n)) ∩ Ω( f (n)) ,

o( f (n)) = {g(n) : ∀c > 0 : ∃n0 ∈ N+ : ∀n ≥ n0 : g(n) ≤ c · f (n)} ,
ω ( f (n)) = {g(n) : ∀c > 0 : ∃n0 ∈ N+ : ∀n ≥ n0 : g(n) ≥ c · f (n)} .
The left-hand sides should be read as “big O of f ”, “big omega of f ”, “theta of
f ”, “little o of f ”, and “little omega of f ”, respectively. A remark about notation is
in order here. In the definitions above, we use “ f (n)” and “g(n)” with two different
meanings. In “O( f (n))” and “{g(n) : . . .}”, they denote the functions f and g and the
“n” emphasizes that these are functions of the argument n, and in “g(n) ≤ c · f (n)”,
they denote the values of the functions at the argument n.
Let us see some examples. O(n²) is the set of all functions that grow at most quadratically, o(n²) is the set of functions that grow less than quadratically, and o(1) is the set of functions that go to 0 as n goes to infinity. Here “1” stands for the function n ↦ 1, which is 1 everywhere, and hence f ∈ o(1) if f(n) ≤ c · 1 for any positive c and sufficiently large n, i.e., f(n) goes to zero as n goes to infinity. Generally, O(f(n)) is the set of all functions that “grow no faster than” f(n). Similarly,


Ω(f(n)) is the set of all functions that “grow at least as fast as” f(n). For example, the Karatsuba algorithm for integer multiplication has a worst-case running time in O(n^{1.58}), whereas the school algorithm has a worst-case running time in Ω(n²), so that we can say that the Karatsuba algorithm is asymptotically faster than the school algorithm. The “little o” notation o(f(n)) denotes the set of all functions that “grow strictly more slowly than” f(n). Its twin ω(f(n)) is rarely used, and is only shown for completeness.
The growth rate of most algorithms discussed in this book is either a polynomial
or a logarithmic function, or the product of a polynomial and a logarithmic function. We use polynomials to introduce our readers to some basic manipulations of
asymptotic notation.
Lemma 2.1. Let p(n) = ∑_{i=0}^{k} ai · n^i denote any polynomial and assume ak > 0. Then p(n) ∈ Θ(n^k).

Proof. It suffices to show that p(n) ∈ O(n^k) and p(n) ∈ Ω(n^k). First observe that for n ≥ 1,

p(n) ≤ ∑_{i=0}^{k} |ai| · n^i ≤ n^k · ∑_{i=0}^{k} |ai|,

and hence p(n) ≤ (∑_{i=0}^{k} |ai|) · n^k for all positive n. Thus p(n) ∈ O(n^k).
Let A = ∑_{i=0}^{k−1} |ai|. For positive n, we have

p(n) ≥ ak · n^k − A · n^{k−1} = (ak/2) · n^k + n^{k−1} · ((ak/2) · n − A),

and hence p(n) ≥ (ak/2) · n^k for n > 2A/ak. We choose c = ak/2 and n0 = 2A/ak in the definition of Ω(n^k), and obtain p(n) ∈ Ω(n^k). ⊓⊔

Exercise 2.1. Right or wrong? (a) n² + 10^6·n ∈ O(n²), (b) n log n ∈ O(n), (c) n log n ∈ Ω(n), (d) log n ∈ o(n).

Asymptotic notation is used a lot in algorithm analysis, and it is convenient to stretch mathematical notation a little in order to allow sets of functions (such as O(n²)) to be treated similarly to ordinary functions. In particular, we shall always write h = O(f) instead of h ∈ O(f), and O(h) = O(f) instead of O(h) ⊆ O(f). For example,

3n² + 7n = O(n²) = O(n³).

Never forget that sequences of “equalities” involving O-notation are really membership and inclusion relations and, as such, can only be read from left to right.
If h is a function, F and G are sets of functions, and ◦ is an operator such as +, ·,
or /, then F ◦ G is a shorthand for { f ◦ g : f ∈ F, g ∈ G}, and h ◦ F stands for {h} ◦ F.
So f (n) + o( f (n)) denotes the set of all functions f (n) + g(n) where g(n) grows
strictly more slowly than f (n), i.e., the ratio ( f (n) + g(n))/ f (n) goes to 1 as n goes
to infinity. Equivalently, we can write (1 + o(1)) f (n). We use this notation whenever
we care about the constant in the leading term but want to ignore lower-order terms.
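For example, n²/2 + 3n + 5 = n²/2 + o(n²) = (1 + o(1)) · n²/2: the notation records that the leading term is n²/2, with its constant 1/2, while suppressing the lower-order terms 3n + 5.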


Lemma 2.2. The following rules hold for O-notation:
c f (n) = Θ( f (n)) for any positive constant c,
f (n) + g(n) = Ω( f (n)) ,
f (n) + g(n) = O( f (n)) if g(n) = O( f (n)) ,
O( f (n)) · O(g(n)) = O( f (n) · g(n)) .
Exercise 2.2. Prove Lemma 2.2.

Exercise 2.3. Sharpen Lemma 2.1 and show that p(n) = ak·n^k + o(n^k).

Exercise 2.4. Prove that n^k = o(c^n) for any integer k and any c > 1. How does n^{log log n} compare with n^k and c^n?

2.2 The Sequential Machine Model
In 1945, John von Neumann (Fig. 2.1) introduced a computer architecture [243] which was simple, yet powerful. The limited hardware technology of the time forced him to come up with a design that concentrated on the essentials; otherwise, realization would have been impossible. Hardware technology has developed tremendously since 1945. However, the programming model resulting from von Neumann's design is so elegant and powerful that it is still the basis for most of modern programming. Usually, programs written with von Neumann's model in mind also work well on the vastly more complex hardware of today's machines.

Fig. 2.1. John von Neumann, born Dec. 28, 1903 in Budapest, died Feb. 8, 1957 in Washington, DC.
The variant of von Neumann’s model used in algorithmic analysis is called the
RAM (random access machine) model. It was introduced by Shepherdson and Sturgis [294] in 1963. It is a sequential machine with uniform memory, i.e., there is a
single processing unit, and all memory accesses take the same amount of time. The
(main) memory, or store, consists of infinitely many cells S[0], S[1], S[2], . . . ; at any
point in time, only a finite number of them will be in use. In addition to the main
memory, there are a small number of registers R1 , . . . , Rk .
The memory cells store “small” integers, also called words. In our discussion of
integer arithmetic in Chap. 1, we assumed that “small” meant one-digit. It is more
reasonable and convenient to assume that the interpretation of “small” depends on
the size of the input. Our default assumption is that integers whose absolute value
is bounded by a polynomial in the size of the input can be stored in a single cell.
Such integers can be represented by a number of bits that is logarithmic in the size
of the input. This assumption is reasonable because we could always spread out the
contents of a single cell over logarithmically many cells with a logarithmic overhead


in time and space and obtain constant-size cells. The assumption is convenient because we want to be able to store array indices in a single cell. The assumption is
necessary because allowing cells to store arbitrary numbers would lead to absurdly
overoptimistic algorithms. For example, by repeated squaring, we could generate a number with 2^n bits in n steps. Namely, if we start with the number 2 = 2^1, squaring it once gives 4 = 2² = 2^{2^1}, squaring it twice gives 16 = 2^4 = 2^{2^2}, and squaring it n times gives 2^{2^n}.
Our model supports a limited form of parallelism. We can perform simple operations on a logarithmic number of bits in constant time.
A RAM can execute (machine) programs. A program is simply a sequence of
machine instructions, numbered from 0 to some number ℓ. The elements of the sequence are called program lines. The program is stored in a program store. Our RAM
supports the following machine instructions:
• Ri := S[Rj] loads the contents of the memory cell indexed by the contents of Rj into register Ri.
• S[Rj] := Ri stores the contents of register Ri in the memory cell indexed by the contents of Rj.
• Ri := Rj ⊙ Rh executes the binary operation ⊙ on the contents of registers Rj and Rh and stores the result in register Ri. Here, ⊙ is a placeholder for a variety of operations. The arithmetic operations are the usual +, −, and ∗; they interpret the contents of the registers as integers. The operations div and mod stand for integer division and the remainder, respectively. The comparison operations ≤, <, >, and ≥ for integers return truth values, i.e., either true ( = 1) or false ( = 0). The logical operations ∧ and ∨ manipulate the truth values 0 and 1. We also have bitwise Boolean operations | (OR), & (AND), and ⊕ (exclusive OR, XOR). They interpret contents as bit strings. The shift operators >> (shift right) and << (shift left) interpret the first argument as a bit string and the second argument as a nonnegative integer. We may also assume that there are operations which interpret the bits stored in a register as a floating-point number, i.e., a finite-precision approximation of a real number.
• Ri := ⊙Rj executes the unary operation ⊙ on the contents of register Rj and stores the result in register Ri. The operators −, ¬ (logical NOT), and ~ (bitwise NOT) are available.
• Ri := C assigns the constant value C to Ri.
• JZ k, Ri continues execution at program line k, if register Ri is 0, and at the next program line otherwise (conditional branch). There is also the variant JZ Rj, Ri, where the target of the jump is the program line stored in Rj.
• J k continues execution at program line k (unconditional branch). Similarly to JZ, the program line can also be specified by the content of a register.
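As a small illustration (our own example, written in the notation just introduced), the following program sums the R1 input words S[1], . . . , S[R1] and stores the result in S[0]; it executes roughly 5n instructions for n input words.

1: R2 := 0        // the running sum
2: R3 := 1        // the constant 1
3: R5 := 0        // the index of the output cell S[0]
4: JZ 9, R1       // all inputs consumed? then go to line 9
5: R4 := S[R1]    // load the next input word
6: R2 := R2 + R4  // add it to the sum
7: R1 := R1 − R3  // one input word less to process
8: J 4            // repeat
9: S[R5] := R2    // store the sum in S[0]; line 10 does not exist, so execution terminates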

A program is executed on a given input step by step. The input for a computation
is stored in memory cells S[1] to S[R1 ] and execution starts with program line 1. With
the exception of the branch instructions JZ and J, the next instruction to be executed
is always the instruction in the next program line. The execution of a program terminates if a program line is to be executed whose number is outside the range 1..ℓ.
Recall that ℓ is the number of the last program line.
We define the execution time of a program on an input in the most simple way:
Each instruction takes one time step to execute. The total execution time of a program
is the number of instructions executed.
It is important to remember that the RAM model is an abstraction. One should
not confuse it with physically existing machines. In particular, real machines have
a finite memory and a fixed number of bits per register (e.g., 32 or 64). In contrast,
the word size and memory of a RAM scale with input size. This can be viewed as
an abstraction of the historical development. Microprocessors have had words of 4,
8, 16, and 32 bits in succession, and now often have 64-bit words. Words of 64 bits
can index a memory of size 2^64. Thus, at current prices, memory size is limited by
cost and not by physical limitations. This statement was also true when 32-bit words
were introduced.
Our complexity model is a gross oversimplification: Modern processors attempt
to execute many instructions in parallel. How well they succeed depends on fact