Numerical Analysis, Second Edition, Timothy Sauer

  • Numerical Analysis

  • Numerical Analysis, Second Edition

    Timothy Sauer, George Mason University

    Boston Columbus Indianapolis New York San Francisco Upper Saddle River
    Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montréal Toronto
    Delhi Mexico City São Paulo Sydney Hong Kong Seoul Singapore Taipei Tokyo

  • Editor in Chief: Deirdre Lynch; Senior Acquisitions Editor: William Hoffman; Sponsoring Editor: Caroline Celano; Editorial Assistant: Brandon Rawnsley; Senior Managing Editor: Karen Wernholm; Senior Production Project Manager: Beth Houston; Executive Marketing Manager: Jeff Weidenaar; Marketing Assistant: Caitlin Crane; Senior Author Support/Technology Specialist: Joe Vetere; Rights and Permissions Advisor: Michael Joyce; Manufacturing Buyer: Debbie Rossi; Design Manager: Andrea Nix; Senior Designer: Barbara Atkinson; Production Coordination and Composition: Integra Software Services Pvt. Ltd; Cover Designer: Karen Salzbach; Cover Image: Tim Tadder/Corbis

    Photo credits: Page 1 Image Source; page 24 National Advanced Driving Simulator (NADS-1 Simulator) located at the University of Iowa and owned by the National Highway Safety Administration (NHTSA); page 39 Yale Babylonian Collection; page 71 Travellinglight/iStockphoto; page 138 Rosenfeld Images Ltd./Photo Researchers, Inc; page 188 Pincasso/Shutterstock; page 243 Orhan81/Fotolia; page 281 UPPA/Photoshot; page 348 Paul Springett 04/Alamy; page 374 Bill Noll/iStockphoto; page 431 Don Emmert/AFP/Getty Images/Newscom; page 467 Picture Alliance/Photoshot; page 495 Chris Rout/Alamy; page 505 Toni Angermayer/Photo Researchers, Inc; page 531 Jinx Photography Brands/Alamy; page 565 Phil Degginger/Alamy.

    Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Pearson Education was aware of a trademark claim, the designations have been printed in initial caps or all caps.

    Library of Congress Cataloging-in-Publication Data

    Sauer, Tim.
    Numerical analysis / Timothy Sauer. 2nd ed.
    p. cm.
    Includes bibliographical references and index.
    ISBN-13: 978-0-321-78367-7
    ISBN-10: 0-321-78367-0
    1. Numerical analysis. I. Title.
    QA297.S348 2012
    518 dc23

    2011014232

    Copyright 2012, 2006 Pearson Education, Inc.

    All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America. For information on obtaining permission for use of material in this work, please submit a written request to Pearson Education, Inc., Rights and Contracts Department, 501 Boylston Street, Suite 900, Boston, MA 02116, fax your request to 617-671-3447, or e-mail at http://www.pearsoned.com/legal/permissions.htm.

    1 2 3 4 5 6 7 8 9 10 EB 15 14 13 12 11

    ISBN 10: 0-321-78367-0
    ISBN 13: 978-0-321-78367-7

  • Contents

    PREFACE xiii

    CHAPTER 0 Fundamentals 1
    0.1 Evaluating a Polynomial 1

    0.2 Binary Numbers 5
    0.2.1 Decimal to binary 6
    0.2.2 Binary to decimal 7

    0.3 Floating Point Representation of Real Numbers 8
    0.3.1 Floating point formats 8
    0.3.2 Machine representation 11
    0.3.3 Addition of floating point numbers 13

    0.4 Loss of Significance 16

    0.5 Review of Calculus 19

    Software and Further Reading 23

    CHAPTER 1 Solving Equations 24
    1.1 The Bisection Method 25
    1.1.1 Bracketing a root 25
    1.1.2 How accurate and how fast? 28

    1.2 Fixed-Point Iteration 30
    1.2.1 Fixed points of a function 31
    1.2.2 Geometry of Fixed-Point Iteration 33
    1.2.3 Linear convergence of Fixed-Point Iteration 34
    1.2.4 Stopping criteria 40

    1.3 Limits of Accuracy 43
    1.3.1 Forward and backward error 44
    1.3.2 The Wilkinson polynomial 47
    1.3.3 Sensitivity of root-finding 48

    1.4 Newton's Method 51
    1.4.1 Quadratic convergence of Newton's Method 53
    1.4.2 Linear convergence of Newton's Method 55

    1.5 Root-Finding without Derivatives 61
    1.5.1 Secant Method and variants 61
    1.5.2 Brent's Method 64

    Reality Check 1: Kinematics of the Stewart platform 67

    Software and Further Reading 69

    CHAPTER 2 Systems of Equations 71
    2.1 Gaussian Elimination 71
    2.1.1 Naive Gaussian elimination 72
    2.1.2 Operation counts 74

    2.2 The LU Factorization 79
    2.2.1 Matrix form of Gaussian elimination 79
    2.2.2 Back substitution with the LU factorization 81
    2.2.3 Complexity of the LU factorization 83

    2.3 Sources of Error 85
    2.3.1 Error magnification and condition number 86
    2.3.2 Swamping 91

    2.4 The PA = LU Factorization 95
    2.4.1 Partial pivoting 95
    2.4.2 Permutation matrices 97
    2.4.3 PA = LU factorization 98

    Reality Check 2: The Euler-Bernoulli Beam 102

    2.5 Iterative Methods 106
    2.5.1 Jacobi Method 106
    2.5.2 Gauss-Seidel Method and SOR 108
    2.5.3 Convergence of iterative methods 111
    2.5.4 Sparse matrix computations 113

    2.6 Methods for symmetric positive-definite matrices 117
    2.6.1 Symmetric positive-definite matrices 117
    2.6.2 Cholesky factorization 119
    2.6.3 Conjugate Gradient Method 121
    2.6.4 Preconditioning 126

    2.7 Nonlinear Systems of Equations 130
    2.7.1 Multivariate Newton's Method 131
    2.7.2 Broyden's Method 133

    Software and Further Reading 137

    CHAPTER 3 Interpolation 138
    3.1 Data and Interpolating Functions 139
    3.1.1 Lagrange interpolation 140
    3.1.2 Newton's divided differences 141
    3.1.3 How many degree d polynomials pass through n points? 144
    3.1.4 Code for interpolation 145
    3.1.5 Representing functions by approximating polynomials 147

    3.2 Interpolation Error 151
    3.2.1 Interpolation error formula 151
    3.2.2 Proof of Newton form and error formula 153
    3.2.3 Runge phenomenon 155

    3.3 Chebyshev Interpolation 158
    3.3.1 Chebyshev's theorem 158
    3.3.2 Chebyshev polynomials 160
    3.3.3 Change of interval 162

    3.4 Cubic Splines 166
    3.4.1 Properties of splines 167
    3.4.2 Endpoint conditions 173

    3.5 Bézier Curves 179

    Reality Check 3: Fonts from Bézier curves 183

    Software and Further Reading 187

    CHAPTER 4 Least Squares 188
    4.1 Least Squares and the Normal Equations 188
    4.1.1 Inconsistent systems of equations 189
    4.1.2 Fitting models to data 193
    4.1.3 Conditioning of least squares 197

    4.2 A Survey of Models 201
    4.2.1 Periodic data 201
    4.2.2 Data linearization 203

    4.3 QR Factorization 212
    4.3.1 Gram-Schmidt orthogonalization and least squares 212
    4.3.2 Modified Gram-Schmidt orthogonalization 218
    4.3.3 Householder reflectors 220

    4.4 Generalized Minimum Residual (GMRES) Method 225
    4.4.1 Krylov methods 226
    4.4.2 Preconditioned GMRES 228

    4.5 Nonlinear Least Squares 230
    4.5.1 Gauss-Newton Method 230
    4.5.2 Models with nonlinear parameters 233
    4.5.3 The Levenberg-Marquardt Method 235

    Reality Check 4: GPS, Conditioning, and Nonlinear Least Squares 238

    Software and Further Reading 242

    CHAPTER 5 Numerical Differentiation and Integration 243
    5.1 Numerical Differentiation 244
    5.1.1 Finite difference formulas 244
    5.1.2 Rounding error 247
    5.1.3 Extrapolation 249
    5.1.4 Symbolic differentiation and integration 250

    5.2 Newton-Cotes Formulas for Numerical Integration 254
    5.2.1 Trapezoid Rule 255
    5.2.2 Simpson's Rule 257
    5.2.3 Composite Newton-Cotes formulas 259
    5.2.4 Open Newton-Cotes Methods 262

    5.3 Romberg Integration 265

    5.4 Adaptive Quadrature 269

    5.5 Gaussian Quadrature 273

    Reality Check 5: Motion Control in Computer-Aided Modeling 278

    Software and Further Reading 280

    CHAPTER 6 Ordinary Differential Equations 281
    6.1 Initial Value Problems 282
    6.1.1 Euler's Method 283
    6.1.2 Existence, uniqueness, and continuity for solutions 287
    6.1.3 First-order linear equations 290

    6.2 Analysis of IVP Solvers 293
    6.2.1 Local and global truncation error 293
    6.2.2 The explicit Trapezoid Method 297
    6.2.3 Taylor Methods 300

    6.3 Systems of Ordinary Differential Equations 303
    6.3.1 Higher order equations 304
    6.3.2 Computer simulation: the pendulum 305
    6.3.3 Computer simulation: orbital mechanics 309

    6.4 Runge-Kutta Methods and Applications 314
    6.4.1 The Runge-Kutta family 314
    6.4.2 Computer simulation: the Hodgkin-Huxley neuron 317
    6.4.3 Computer simulation: the Lorenz equations 319

    Reality Check 6: The Tacoma Narrows Bridge 322

    6.5 Variable Step-Size Methods 325
    6.5.1 Embedded Runge-Kutta pairs 325
    6.5.2 Order 4/5 methods 328

    6.6 Implicit Methods and Stiff Equations 332

    6.7 Multistep Methods 336
    6.7.1 Generating multistep methods 336
    6.7.2 Explicit multistep methods 339
    6.7.3 Implicit multistep methods 342

    Software and Further Reading 347

    CHAPTER 7 Boundary Value Problems 348
    7.1 Shooting Method 349
    7.1.1 Solutions of boundary value problems 349
    7.1.2 Shooting Method implementation 352

    Reality Check 7: Buckling of a Circular Ring 355

    7.2 Finite Difference Methods 357
    7.2.1 Linear boundary value problems 357
    7.2.2 Nonlinear boundary value problems 359

    7.3 Collocation and the Finite Element Method 365
    7.3.1 Collocation 365
    7.3.2 Finite elements and the Galerkin Method 367

    Software and Further Reading 373

    CHAPTER 8 Partial Differential Equations 374
    8.1 Parabolic Equations 375
    8.1.1 Forward Difference Method 375
    8.1.2 Stability analysis of Forward Difference Method 379
    8.1.3 Backward Difference Method 380
    8.1.4 Crank-Nicolson Method 385

    8.2 Hyperbolic Equations 393
    8.2.1 The wave equation 393
    8.2.2 The CFL condition 395

    8.3 Elliptic Equations 398
    8.3.1 Finite Difference Method for elliptic equations 399

    Reality Check 8: Heat distribution on a cooling fin 403
    8.3.2 Finite Element Method for elliptic equations 406

    8.4 Nonlinear partial differential equations 417
    8.4.1 Implicit Newton solver 417
    8.4.2 Nonlinear equations in two space dimensions 423

    Software and Further Reading 430

    CHAPTER 9 Random Numbers and Applications 431
    9.1 Random Numbers 432
    9.1.1 Pseudo-random numbers 432
    9.1.2 Exponential and normal random numbers 437

    9.2 Monte Carlo Simulation 440
    9.2.1 Power laws for Monte Carlo estimation 440
    9.2.2 Quasi-random numbers 442

    9.3 Discrete and Continuous Brownian Motion 446
    9.3.1 Random walks 447
    9.3.2 Continuous Brownian motion 449

    9.4 Stochastic Differential Equations 452
    9.4.1 Adding noise to differential equations 452
    9.4.2 Numerical methods for SDEs 456

    Reality Check 9: The Black-Scholes Formula 464

    Software and Further Reading 465

    CHAPTER 10 Trigonometric Interpolation and the FFT 467
    10.1 The Fourier Transform 468
    10.1.1 Complex arithmetic 468
    10.1.2 Discrete Fourier Transform 470
    10.1.3 The Fast Fourier Transform 473

    10.2 Trigonometric Interpolation 476
    10.2.1 The DFT Interpolation Theorem 476
    10.2.2 Efficient evaluation of trigonometric functions 479

    10.3 The FFT and Signal Processing 483
    10.3.1 Orthogonality and interpolation 483
    10.3.2 Least squares fitting with trigonometric functions 485
    10.3.3 Sound, noise, and filtering 489

    Reality Check 10: The Wiener Filter 492

    Software and Further Reading 494

    CHAPTER 11 Compression 495
    11.1 The Discrete Cosine Transform 496
    11.1.1 One-dimensional DCT 496
    11.1.2 The DCT and least squares approximation 498

    11.2 Two-Dimensional DCT and Image Compression 501
    11.2.1 Two-dimensional DCT 501
    11.2.2 Image compression 505
    11.2.3 Quantization 508

    11.3 Huffman Coding 514
    11.3.1 Information theory and coding 514
    11.3.2 Huffman coding for the JPEG format 517

    11.4 Modified DCT and Audio Compression 519
    11.4.1 Modified Discrete Cosine Transform 520
    11.4.2 Bit quantization 525

    Reality Check 11: A Simple Audio Codec 527

    Software and Further Reading 530

    CHAPTER 12 Eigenvalues and Singular Values 531
    12.1 Power Iteration Methods 531
    12.1.1 Power Iteration 532
    12.1.2 Convergence of Power Iteration 534
    12.1.3 Inverse Power Iteration 535
    12.1.4 Rayleigh Quotient Iteration 537

    12.2 QR Algorithm 539
    12.2.1 Simultaneous iteration 539
    12.2.2 Real Schur form and the QR algorithm 542
    12.2.3 Upper Hessenberg form 544

    Reality Check 12: How Search Engines Rate Page Quality 549

    12.3 Singular Value Decomposition 552
    12.3.1 Finding the SVD in general 554
    12.3.2 Special case: symmetric matrices 555

    12.4 Applications of the SVD 557
    12.4.1 Properties of the SVD 557
    12.4.2 Dimension reduction 559
    12.4.3 Compression 560
    12.4.4 Calculating the SVD 561

    Software and Further Reading 563

    CHAPTER 13 Optimization 565
    13.1 Unconstrained Optimization without Derivatives 566
    13.1.1 Golden Section Search 566
    13.1.2 Successive parabolic interpolation 569
    13.1.3 Nelder-Mead search 571

    13.2 Unconstrained Optimization with Derivatives 575
    13.2.1 Newton's Method 576
    13.2.2 Steepest Descent 577
    13.2.3 Conjugate Gradient Search 578

    Reality Check 13: Molecular Conformation and Numerical Optimization 580

    Software and Further Reading 582

    Appendix A 583
    A.1 Matrix Fundamentals 583

    A.2 Block Multiplication 585

    A.3 Eigenvalues and Eigenvectors 586

    A.4 Symmetric Matrices 587

    A.5 Vector Calculus 588

    Appendix B 590
    B.1 Starting MATLAB 590

    B.2 Graphics 591

    B.3 Programming in MATLAB 593

    B.4 Flow Control 594

    B.5 Functions 595

    B.6 Matrix Operations 597

    B.7 Animation and Movies 597

    ANSWERS TO SELECTED EXERCISES 599

    BIBLIOGRAPHY 626

    INDEX 637


  • Preface

    Numerical Analysis is a text for students of engineering, science, mathematics, and computer science who have completed elementary calculus and matrix algebra. The primary goal is to construct and explore algorithms for solving science and engineering problems. The not-so-secret secondary mission is to help the reader locate these algorithms in a landscape of some potent and far-reaching principles. These unifying principles, taken together, constitute a dynamic field of current research and development in modern numerical and computational science.

    The discipline of numerical analysis is jam-packed with useful ideas. Textbooks run the risk of presenting the subject as a bag of neat but unrelated tricks. For a deep understanding, readers need to learn much more than how to code Newton's Method, Runge-Kutta, and the Fast Fourier Transform. They must absorb the big principles, the ones that permeate numerical analysis and integrate its competing concerns of accuracy and efficiency.

    The notions of convergence, complexity, conditioning, compression, and orthogonality are among the most important of the big ideas. Any approximation method worth its salt must converge to the correct answer as more computational resources are devoted to it, and the complexity of a method is a measure of its use of these resources. The conditioning of a problem, or susceptibility to error magnification, is fundamental to knowing how it can be attacked. Many of the newest applications of numerical analysis strive to realize data in a shorter or compressed way. Finally, orthogonality is crucial for efficiency in many algorithms, and is irreplaceable where conditioning is an issue or compression is a goal.

    In this book, the roles of the five concepts in modern numerical analysis are emphasized in short thematic elements called Spotlights. They comment on the topic at hand and make informal connections to other expressions of the same concept elsewhere in the book. We hope that highlighting the five concepts in such an explicit way functions as a Greek chorus, accentuating what is really crucial about the theory on the page.

    Although it is common knowledge that the ideas of numerical analysis are vital to the practice of modern science and engineering, it never hurts to be obvious. The Reality Checks provide concrete examples of the way numerical methods lead to solutions of important scientific and technological problems. These extended applications were chosen to be timely and close to everyday experience. Although it is impossible (and probably undesirable) to present the full details of the problems, the Reality Checks attempt to go deeply enough to show how a technique or algorithm can leverage a small amount of mathematics into a great payoff in technological design and function. The Reality Checks proved to be extremely popular as a source of student projects in the first edition, and have been extended and amplified in the second edition.

    NEW TO THIS EDITION. The second edition features a major expansion of methods for solving systems of equations. The Cholesky factorization has been added to Chapter 2 for the solution of symmetric positive-definite matrix equations. For large linear systems, discussion of the Krylov approach, including the GMRES method, has been added to Chapter 4, along with new material on the use of preconditioners for symmetric and nonsymmetric problems. Modified Gram-Schmidt orthogonalization and the Levenberg-Marquardt Method are new to this edition. The treatment of PDEs in Chapter 8 has been extended to nonlinear PDEs, including reaction-diffusion equations and pattern formation. Expository material has been revised for greater readability based on feedback from students, and new exercises and computer problems have been added throughout.

    TECHNOLOGY. The software package MATLAB is used both for exposition of algorithms and as a suggested platform for student assignments and projects. The amount of MATLAB code provided in the text is carefully modulated, due to the fact that too much tends to be counterproductive. More MATLAB code is found in the early chapters, allowing the reader to gain proficiency in a gradual manner. Where more elaborate code is provided (in the study of interpolation, and ordinary and partial differential equations, for example), the expectation is for the reader to use what is given as a jumping-off point to exploit and extend.

    It is not essential that any particular computational platform be used with this textbook, but the growing presence of MATLAB in engineering and science departments shows that a common language can smooth over many potholes. With MATLAB, all of the interface problems (data input/output, plotting, and so on) are solved in one fell swoop. Data structure issues (for example, those that arise when studying sparse matrix methods) are standardized by relying on appropriate commands. MATLAB has facilities for audio and image file input and output. Differential equations simulations are simple to realize due to the animation commands built into MATLAB. These goals can all be achieved in other ways. But it is helpful to have one package that will run on almost all operating systems and simplify the details so that students can focus on the real mathematical issues. Appendix B is a MATLAB tutorial that can be used as a first introduction to students, or as a reference for those already familiar.

    The text has a companion website, www.pearsonhighered.com/sauer, that contains the MATLAB programs taken directly from the text. In addition, new material and updates will be posted for users to download.

    SUPPLEMENTS. To provide help for students, the Student's Solutions Manual (SSM: 0-321-78392) is available, with worked-out solutions to selected exercises. The Instructor's Solutions Manual (ISM: 0-321-783689) contains detailed solutions to the odd-numbered exercises, and answers to the even-numbered exercises. The manuals also show how to use MATLAB software as an aid to solving the types of problems that are presented in the Exercises and Computer Problems.

    DESIGNING THE COURSE. Numerical Analysis is structured to move from foundational, elementary ideas at the outset to more sophisticated concepts later in the presentation. Chapter 0 provides fundamental building blocks for later use. Some instructors like to start at the beginning; others (including the author) prefer to start at Chapter 1 and fold in topics from Chapter 0 when required. Chapters 1 and 2 cover equation-solving in its various forms. Chapters 3 and 4 primarily treat the fitting of data, interpolation and least squares methods. In Chapters 5-8, we return to the classical numerical analysis areas of continuous mathematics: numerical differentiation and integration, and the solution of ordinary and partial differential equations with initial and boundary conditions.

    Chapter 9 develops random numbers in order to provide complementary methods to Chapters 5-8: the Monte Carlo alternative to the standard numerical integration schemes and the counterpoint of stochastic differential equations are necessary when uncertainty is present in the model.

    Compression is a core topic of numerical analysis, even though it often hides in plain sight in interpolation, least squares, and Fourier analysis. Modern compression techniques are featured in Chapters 10 and 11. In the former, the Fast Fourier Transform is treated as a device to carry out trigonometric interpolation, both in the exact and least squares sense. Links to audio compression are emphasized, and fully carried out in Chapter 11 on the Discrete Cosine Transform, the standard workhorse for modern audio and image compression. Chapter 12 on eigenvalues and singular values is also written to emphasize its connections to data compression, which are growing in importance in contemporary applications. Chapter 13 provides a short introduction to optimization techniques.

    Numerical Analysis can also be used for a one-semester course with judicious choice of topics. Chapters 0-3 are fundamental for any course in the area. Separate one-semester tracks can be designed as follows:


    Chapters 0-3, followed by one of the following concentrations:

    traditional calculus/differential equations concentration: Chapters 5, 6, 7, 8

    discrete mathematics, with emphasis on orthogonality and compression: Chapters 4, 10, 11, 12

    financial engineering concentration: Chapters 4, 6, 8, 9, 13

    ACKNOWLEDGMENTS

    The second edition owes a debt to many people, including the students of many classes who have read and commented on earlier versions. In addition, Paul Lorczak, Maurino Bautista, and Tom Wegleitner were essential in helping me avoid embarrassing blunders. Suggestions from Nicholas Allgaier, Regan Beckham, Paul Calamai, Mark Friedman, David Hiebeler, Ashwani Kapila, Andrew Knyazev, Bo Li, Yijang Li, Jeff Parker, Robert Sachs, Evelyn Sander, Gantumur Tsogtgerel, and Thomas Wanner were greatly appreciated. The resourceful staff at Pearson, including William Hoffman, Caroline Celano, Beth Houston, Jeff Weidenaar, and Brandon Rawnsley, as well as Shiny Rajesh at Integra-PDY, made the production of the second edition almost enjoyable. Finally, thanks are due to the helpful readers from other universities for their encouragement of this project and indispensable advice for improvement of earlier versions:

    Eugene Allgower, Colorado State University
    Constantin Bacuta, University of Delaware
    Michele Benzi, Emory University
    Jerry Bona, University of Illinois at Chicago
    George Davis, Georgia State University
    Chris Danforth, University of Vermont
    Alberto Delgado, Bradley University
    Robert Dillon, Washington State University
    Qiang Du, Pennsylvania State University
    Ahmet Duran, University of Michigan, Ann Arbor
    Gregory Goeckel, Presbyterian College
    Herman Gollwitzer, Drexel University
    Don Hardcastle, Baylor University
    David R. Hill, Temple University
    Hideaki Kaneko, Old Dominion University
    Daniel Kaplan, Macalester College
    Fritz Keinert, Iowa State University
    Akhtar A. Khan, Rochester Institute of Technology
    Lucia M. Kimball, Bentley College
    Colleen M. Kirk, California Polytechnic State University
    Seppo Korpela, Ohio State University
    William Layton, University of Pittsburgh
    Brenton LeMesurier, College of Charleston
    Melvin Leok, University of California, San Diego

    Doron Levy, Stanford University
    Shankar Mahalingam, University of California, Riverside
    Amnon Meir, Auburn University
    Peter Monk, University of Delaware
    Joseph E. Pasciak, Texas A&M University
    Jeff Parker, Harvard University
    Steven Pav, University of California, San Diego
    Jacek Polewczak, California State University
    Jorge Rebaza, Southwest Missouri State University
    Jeffrey Scroggs, North Carolina State University
    Sergei Suslov, Arizona State University
    Daniel Szyld, Temple University
    Ahlam Tannouri, Morgan State University
    Jin Wang, Old Dominion University
    Bruno Welfert, Arizona State University
    Nathaniel Whitaker, University of Massachusetts


  • CHAPTER 0 Fundamentals

    This introductory chapter provides basic building blocks necessary for the construction and understanding of the algorithms of the book. They include fundamental ideas of introductory calculus and function evaluation, the details of machine arithmetic as it is carried out on modern computers, and discussion of the loss of significant digits resulting from poorly designed calculations.

    After discussing efficient methods for evaluating polynomials, we study the binary number system, the representation of floating point numbers and the common protocols used for rounding. The effects of the small rounding errors on computations are magnified in ill-conditioned problems. The battle to limit these pernicious effects is a recurring theme throughout the rest of the chapters.

    The goal of this book is to present and discuss methods of solving mathematical problems with computers. The most fundamental operations of arithmetic are addition and multiplication. These are also the operations needed to evaluate a polynomial P(x) at a particular value x. It is no coincidence that polynomials are the basic building blocks for many computational techniques we will construct.

    Because of this, it is important to know how to evaluate a polynomial. The reader probably already knows how and may consider spending time on such an easy problem slightly ridiculous! But the more basic an operation is, the more we stand to gain by doing it right. Therefore we will think about how to implement polynomial evaluation as efficiently as possible.

    0.1 EVALUATING A POLYNOMIAL

    What is the best way to evaluate

    P(x) = 2x^4 + 3x^3 - 3x^2 + 5x - 1,

    say, at x = 1/2? Assume that the coefficients of the polynomial and the number 1/2 are stored in memory, and try to minimize the number of additions and multiplications required to get P(1/2). To simplify matters, we will not count time spent storing and fetching numbers to and from memory.

    METHOD 1  The first and most straightforward approach is

    P(1/2) = 2*(1/2)*(1/2)*(1/2)*(1/2) + 3*(1/2)*(1/2)*(1/2) - 3*(1/2)*(1/2) + 5*(1/2) - 1 = 5/4.     (0.1)

    The number of multiplications required is 10, together with 4 additions. Two of the additions are actually subtractions, but because subtraction can be viewed as adding a negative stored number, we will not worry about the difference.

    There surely is a better way than (0.1). Effort is being duplicated; operations can be saved by eliminating the repeated multiplication by the input 1/2. A better strategy is to first compute (1/2)^4, storing partial products as we go. That leads to the following method:

    METHOD 2  Find the powers of the input number x = 1/2 first, and store them for future use:

    (1/2)*(1/2) = (1/2)^2
    (1/2)^2*(1/2) = (1/2)^3
    (1/2)^3*(1/2) = (1/2)^4.

    Now we can add up the terms:

    P(1/2) = 2*(1/2)^4 + 3*(1/2)^3 - 3*(1/2)^2 + 5*(1/2) - 1 = 5/4.

    There are now 3 multiplications of 1/2, along with 4 other multiplications. Counting up, we have reduced to 7 multiplications, with the same 4 additions. Is the reduction from 14 to 11 operations a significant improvement? If there is only one evaluation to be done, then probably not. Whether Method 1 or Method 2 is used, the answer will be available before you can lift your fingers from the computer keyboard. However, suppose the polynomial needs to be evaluated at different inputs x several times per second. Then the difference may be crucial to getting the information when it is needed.

    Is this the best we can do for a degree 4 polynomial? It may be hard to imagine that we can eliminate three more operations, but we can. The best elementary method is the following one:

    METHOD 3  (Nested Multiplication) Rewrite the polynomial so that it can be evaluated from the inside out:

    P(x) = -1 + x(5 - 3x + 3x^2 + 2x^3)
         = -1 + x(5 + x(-3 + 3x + 2x^2))
         = -1 + x(5 + x(-3 + x(3 + 2x)))
         = -1 + x(5 + x(-3 + x(3 + x*2))).     (0.2)

    Here the polynomial is written backwards, and powers of x are factored out of the rest of the polynomial. Once you see how to write it this way (no computation is required to do the rewriting; the coefficients are unchanged), evaluate from the inside out:


    multiply 1/2 * 2, add +3  -->  4
    multiply 1/2 * 4, add -3  -->  -1
    multiply 1/2 * (-1), add +5  -->  9/2
    multiply 1/2 * (9/2), add -1  -->  5/4.     (0.3)

    This method, called nested multiplication or Horner's method, evaluates the polynomial in 4 multiplications and 4 additions. A general degree d polynomial can be evaluated in d multiplications and d additions. Nested multiplication is closely related to synthetic division of polynomial arithmetic.
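    For the common case with all base points equal to zero, a bare Horner loop takes only a few lines of MATLAB. The following is a minimal sketch for illustration (the function name horner_eval is chosen here; the book's more general Program 0.1, which also handles base points, appears below):

    function y = horner_eval(c,x)
    % c holds coefficients with the constant term first:
    % P(x) = c(1) + c(2)*x + ... + c(d+1)*x^d
    d = length(c) - 1;        % degree of the polynomial
    y = c(d+1);               % start with the leading coefficient
    for i = d:-1:1
        y = y.*x + c(i);      % one multiplication and one addition per step
    end
    end

    For example, horner_eval([-1 5 -3 3 2],1/2) returns 1.2500, in agreement with (0.3).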

    The example of polynomial evaluation is characteristic of the entire topic of computational methods for scientific computing. First, computers are very fast at doing very simple things. Second, it is important to do even simple tasks as efficiently as possible, since they may be executed many times. Third, the best way may not be the obvious way. Over the last half-century, the fields of numerical analysis and scientific computing, hand in hand with computer hardware technology, have developed efficient solution techniques to attack common problems.

    While the standard form for a polynomial c1 + c2*x + c3*x^2 + c4*x^3 + c5*x^4 can be written in nested form as

    c1 + x(c2 + x(c3 + x(c4 + x(c5)))),     (0.4)

    some applications require a more general form. In particular, interpolation calculations in Chapter 3 will require the form

    c1 + (x - r1)(c2 + (x - r2)(c3 + (x - r3)(c4 + (x - r4)(c5)))),     (0.5)

    where we call r1, r2, r3, and r4 the base points. Note that setting r1 = r2 = r3 = r4 = 0 in (0.5) recovers the original nested form (0.4).

    The following Matlab code implements the general form of nested multiplication (compare with (0.3)):

    %Program 0.1 Nested multiplication
    %Evaluates polynomial from nested form using Horner's Method
    %Input: degree d of polynomial,
    %       array of d+1 coefficients c (constant term first),
    %       x-coordinate x at which to evaluate, and
    %       array of d base points b, if needed
    %Output: value y of polynomial at x
    function y=nest(d,c,x,b)
    if nargin<4, b=zeros(d,1); end    % if no base points are given, set them to zero
    y=c(d+1);                         % start with the leading coefficient
    for i=d:-1:1
      y = y.*(x-b(i))+c(i);           % one multiplication and one addition per step
    end


    >> nest(4,[-1 5 -3 3 2],1/2,[0 0 0 0])

    ans =

    1.2500

    as we found earlier by hand. The file nest.m, as the rest of the Matlab code shown in this book, must be accessible from the Matlab path (or in the current directory) when executing the command.

    If the nest command is to be used with all base points 0 as in (0.2), the abbreviated form

    >> nest(4,[-1 5 -3 3 2],1/2)

    may be used with the same result. This is due to the nargin statement in nest.m. If the number of input arguments is less than 4, the base points are automatically set to zero.

    Because of Matlab's seamless treatment of vector notation, the nest command can evaluate an array of x values at once. The following code is illustrative:

    >> nest(4,[-1 5 -3 3 2],[-2 -1 0 1 2])

    ans =

    -15 -10 -1 6 53

    Finally, the degree 3 interpolating polynomial

    P(x) = 1 + x(1/2 + (x - 2)(1/2 + (x - 3)(-1/2)))

    from Chapter 3 has base points r1 = 0, r2 = 2, r3 = 3. It can beevaluated at x = 1 by

    >> nest(3,[1 1/2 1/2 -1/2],1,[0 2 3])

    ans =

         0

    EXAMPLE 0.1  Find an efficient method for evaluating the polynomial P(x) = 4x^5 + 7x^8 - 3x^11 + 2x^14.

    Some rewriting of the polynomial may help reduce the computational effort required for evaluation. The idea is to factor x^5 from each term and write it as a polynomial in the quantity x^3:

    P(x) = x^5(4 + 7x^3 - 3x^6 + 2x^9)
         = x^5(4 + x^3(7 + x^3(-3 + x^3*(2)))).

    For each input x, we need to calculate x*x = x^2, x*x^2 = x^3, and x^2*x^3 = x^5 first. These three multiplications, combined with the multiplication by x^5, and the three multiplications and three additions from the degree 3 polynomial in the quantity x^3, give the total operation count of 7 multiplies and 3 adds per evaluation.
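    As an illustration (not from the text), the operation count can be seen directly in a short MATLAB sketch of this strategy; the variable names are chosen here for the example:

    % Evaluate P(x) = 4x^5 + 7x^8 - 3x^11 + 2x^14 with 7 multiplies and 3 adds.
    x  = 1/2;                 % example input
    x2 = x*x;                 % multiply 1: x^2
    x3 = x2*x;                % multiply 2: x^3
    x5 = x2*x3;               % multiply 3: x^5
    P  = x5*(4 + x3*(7 + x3*(-3 + x3*2)));   % multiplies 4-7, adds 1-3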


    0.1 Exercises

    1. Rewrite the following polynomials in nested form. Evaluate with and without nested form at x = 1/3.
    (a) P(x) = 6x^4 + x^3 + 5x^2 + x + 1
    (b) P(x) = 3x^4 + 4x^3 + 5x^2 - 5x + 1
    (c) P(x) = 2x^4 + x^3 - x^2 + 1

    2. Rewrite the following polynomials in nested form and evaluate at x = 1/2:
    (a) P(x) = 6x^3 - 2x^2 - 3x + 7
    (b) P(x) = 8x^5 - x^4 - 3x^3 + x^2 - 3x + 1
    (c) P(x) = 4x^6 - 2x^4 - 2x + 4

    3. Evaluate P(x) = x^6 - 4x^4 + 2x^2 + 1 at x = 1/2 by considering P(x) as a polynomial in x^2 and using nested multiplication.

    4. Evaluate the nested polynomial with base points P(x) = 1 + x(1/2 + (x - 2)(1/2 + (x - 3)(1/2))) at (a) x = 5 and (b) x = 1.

    5. Evaluate the nested polynomial with base points P(x) = 4 + x(4 + (x - 1)(1 + (x - 2)(3 + (x - 3)(2)))) at (a) x = -1/2 and (b) x = 1/2.

    6. Explain how to evaluate the polynomial for a given input x, using as few operations as possible. How many multiplications and how many additions are required?
    (a) P(x) = a0 + a5*x^5 + a10*x^10 + a15*x^15
    (b) P(x) = a7*x^7 + a12*x^12 + a17*x^17 + a22*x^22 + a27*x^27

    7. How many additions and multiplications are required to evaluate a degree n polynomial with base points, using the general nested multiplication algorithm?

    0.1 Computer Problems

    1. Use the function nest to evaluate P(x) = 1 + x + ... + x^50 at x = 1.00001. (Use the Matlab ones command to save typing.) Find the error of the computation by comparing with the equivalent expression Q(x) = (x^51 - 1)/(x - 1).

    2. Use nest.m to evaluate P(x) = 1 - x + x^2 - x^3 + ... + x^98 - x^99 at x = 1.00001. Find a simpler, equivalent expression, and use it to estimate the error of the nested multiplication.

    0.2 BINARY NUMBERS

    In preparation for the detailed study of computer arithmetic in the next section, we need to understand the binary number system. Decimal numbers are converted from base 10 to base 2 in order to store numbers on a computer and to simplify computer operations like addition and multiplication. To give output in decimal notation, the process is reversed. In this section, we discuss ways to convert between decimal and binary numbers.

    Binary numbers are expressed as

    ... b2 b1 b0 . b_{-1} b_{-2} ... ,

    where each binary digit, or bit, is 0 or 1. The base 10 equivalent to the number is

    ... b2*2^2 + b1*2^1 + b0*2^0 + b_{-1}*2^{-1} + b_{-2}*2^{-2} ... .

    For example, the decimal number 4 is expressed as (100.)2 in base 2, and 3/4 is represented as (0.11)2.

    0.2.1 Decimal to binary

    The decimal number 53 will be represented as (53)10 to emphasize that it is to be interpreted as base 10. To convert to binary, it is simplest to break the number into integer and fractional parts and convert each part separately. For the number (53.7)10 = (53)10 + (0.7)10, we will convert each part to binary and combine the results.

    Integer part. Convert decimal integers to binary by dividing by 2 successively and recording the remainders. The remainders, 0 or 1, are recorded by starting at the decimal point (or more accurately, radix) and moving away (to the left). For (53)10, we would have

    53 ÷ 2 = 26 R 1
    26 ÷ 2 = 13 R 0
    13 ÷ 2 =  6 R 1
     6 ÷ 2 =  3 R 0
     3 ÷ 2 =  1 R 1
     1 ÷ 2 =  0 R 1.

    Therefore, the base 10 number 53 can be written in bits as 110101, denoted as (53)10 = (110101.)2. Checking the result, we have 110101 = 2^5 + 2^4 + 2^2 + 2^0 = 32 + 16 + 4 + 1 = 53.
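    This successive-division procedure is easy to express in code. The sketch below is an illustration, not part of the text; MATLAB's built-in dec2bin(53) performs the same conversion.

    n = 53;                                % decimal integer to convert
    bits = '';                             % binary digits, assembled right to left
    while n > 0
        bits = [num2str(mod(n,2)) bits];   % remainder is the next bit
        n = floor(n/2);                    % integer quotient for the next step
    end
    % bits is now '110101', in agreement with dec2bin(53)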

    Fractional part. Convert (0.7)10 to binary by reversing the preceding steps. Multiply by 2 successively and record the integer parts, moving away from the decimal point to the right.

    .7 × 2 = .4 + 1
    .4 × 2 = .8 + 0
    .8 × 2 = .6 + 1
    .6 × 2 = .2 + 1
    .2 × 2 = .4 + 0
    .4 × 2 = .8 + 0
    ...

    Notice that the process repeats after four steps and will repeat indefinitely in exactly the same way. Therefore,

    (0.7)10 = (.1011001100110 ...)2 = (.1 0110)2,

    where overbar notation (here over the repeating block 0110) is used to denote infinitely repeated bits. Putting the two parts together, we conclude that

    (53.7)10 = (110101.1 0110)2.


    0.2.2 Binary to decimal

    To convert a binary number to decimal, it is again best to separate into integer and fractional parts.

    Integer part. Simply add up powers of 2 as we did before. The binary number (10101)2 is simply 1*2^4 + 0*2^3 + 1*2^2 + 0*2^1 + 1*2^0 = (21)10.

    Fractional part. If the fractional part is finite (a terminating base 2 expansion), proceed the same way. For example,

    (.1011)2 = 1/2 + 1/8 + 1/16 = (11/16)10.

    The only complication arises when the fractional part is not a finite base 2 expansion. Converting an infinitely repeating binary expansion to a decimal fraction can be done in several ways. Perhaps the simplest way is to use the shift property of multiplication by 2.

    For example, suppose x = (0.1011)2, with the block 1011 repeating, is to be converted to decimal. Multiply x by 2^4, which shifts 4 places to the left in binary. Then subtract the original x:

    2^4 x = 1011.1011...
        x = 0000.1011...

    Subtracting yields

    (2^4 - 1)x = (1011)2 = (11)10.

    Then solve for x to find x = (.1011...)2 = 11/15 in base 10.

    As another example, assume that the fractional part does not immediately repeat, as in x = .10101..., where the block 101 repeats. Multiplying by 2^2 shifts to y = 2^2 x = 10.101... The fractional part of y, call it z = .101..., is calculated as before:

    2^3 z = 101.101...
        z = 000.101...

    Therefore, 7z = 5, and y = 2 + 5/7, x = 2^-2 y = 19/28 in base 10. It is a good exercise to check this result by converting 19/28 to binary and comparing to the original x.

    Binary numbers are the building blocks of machine computations, but they turn out to be long and unwieldy for humans to interpret. It is useful to use base 16 at times just to present numbers more easily. Hexadecimal numbers are represented by the 16 numerals 0, 1, 2, ..., 9, A, B, C, D, E, F. Each hex number can be represented by 4 bits. Thus (1)16 = (0001)2, (8)16 = (1000)2, and (F)16 = (1111)2 = (15)10. In the next section, Matlab's format hex for representing machine numbers will be described.
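    As a quick illustration (not part of the text), these correspondences can be checked with MATLAB's built-in conversion functions hex2dec and dec2bin:

    dec2bin(hex2dec('1'), 4)     % returns '0001'
    dec2bin(hex2dec('8'), 4)     % returns '1000'
    dec2bin(hex2dec('F'), 4)     % returns '1111'; hex2dec('F') returns 15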

    0.2 Exercises

    1. Find the binary representation of the base 10 integers. (a) 64 (b) 17 (c) 79 (d) 227

    2. Find the binary representation of the base 10 numbers. (a) 1/8 (b) 7/8 (c) 35/16 (d) 31/64

    3. Convert the following base 10 numbers to binary. Use overbar notation for nonterminating binary numbers. (a) 10.5 (b) 1/3 (c) 5/7 (d) 12.8 (e) 55.4 (f) 0.1

    4. Convert the following base 10 numbers to binary. (a) 11.25 (b) 2/3 (c) 3/5 (d) 3.2 (e) 30.6 (f) 99.9


    5. Find the first 15 bits in the binary representation of π.

    6. Find the first 15 bits in the binary representation of e.

    7. Convert the following binary numbers to base 10: (a) 1010101 (b) 1011.101 (c) 10111.01 (d) 110.10 (e) 10.110 (f) 110.1101 (g) 10.0101101 (h) 111.1

    8. Convert the following binary numbers to base 10: (a) 11011 (b) 110111.001 (c) 111.001 (d) 1010.01 (e) 10111.10101 (f) 1111.010001

    0.3 FLOATING POINT REPRESENTATION OF REAL NUMBERS

    In this section, we present a model for computer arithmetic of floating point numbers. There are several models, but to simplify matters we will choose one particular model and describe it in detail. The model we choose is the so-called IEEE 754 Floating Point Standard. The Institute of Electrical and Electronics Engineers (IEEE) takes an active interest in establishing standards for the industry. Their floating point arithmetic format has become the common standard for single-precision and double-precision arithmetic throughout the computer industry.

    Rounding errors are inevitable when finite-precision computer memory locations are used to represent real, infinite precision numbers. Although we would hope that small errors made during a long calculation have only a minor effect on the answer, this turns out to be wishful thinking in many cases. Simple algorithms, such as Gaussian elimination or methods for solving differential equations, can magnify microscopic errors to macroscopic size. In fact, a main theme of this book is to help the reader to recognize when a calculation is at risk of being unreliable due to magnification of the small errors made by digital computers and to know how to avoid or minimize the risk.

    0.3.1 Floating point formats

    The IEEE standard consists of a set of binary representations of real numbers. A floating point number consists of three parts: the sign (+ or -), a mantissa, which contains the string of significant bits, and an exponent. The three parts are stored together in a single computer word.

    There are three commonly used levels of precision for floating point numbers: single precision, double precision, and extended precision, also known as long-double precision. The number of bits allocated for each floating point number in the three formats is 32, 64, and 80, respectively. The bits are divided among the parts as follows:

    precision      sign   exponent   mantissa
    single           1        8         23
    double           1       11         52
    long double      1       15         64

    All three types of precision work essentially the same way. The form of a normalized IEEE floating point number is

    ±1.bbb...b × 2^p,     (0.6)

    where each of the N b's is 0 or 1, and p is an M-bit binary number representing the exponent. Normalization means that, as shown in (0.6), the leading (leftmost) bit must be 1.

    When a binary number is stored as a normalized floating point number, it is left-justified, meaning that the leftmost 1 is shifted just to the left of the radix point. The shift is compensated by a change in the exponent. For example, the decimal number 9, which is 1001 in binary, would be stored as

    +1.001 × 2^3,

    because a shift of 3 bits, or multiplication by 2^3, is necessary to move the leftmost 1 to the correct position.

    For concreteness, we will specialize to the double precision format for most of the discussion. Single and long-double precision are handled in the same way, with the exception of different exponent and mantissa lengths M and N. In double precision, used by many C compilers and by Matlab, M = 11 and N = 52.

    The double precision number 1 is

    +1. 0000000000000000000000000000000000000000000000000000 × 2^0,

    where we have boxed the 52 bits of the mantissa. The next floating point number greater than 1 is

    +1. 0000000000000000000000000000000000000000000000000001 × 2^0,

    or 1 + 2^-52.

    DEFINITION 0.1  The number machine epsilon, denoted ε_mach, is the distance between 1 and the smallest floating point number greater than 1. For the IEEE double precision floating point standard,

    ε_mach = 2^-52.

    The decimal number 9.4 = (1001.0110)2, with the block 0110 repeating, is left-justified as

    +1. 0010110011001100110011001100110011001100110011001100 110... × 2^3,

    where we have boxed the first 52 bits of the mantissa. A new question arises: How do we fit the infinite binary number representing 9.4 in a finite number of bits?

    We must truncate the number in some way, and in so doing we necessarily make a small error. One method, called chopping, is to simply throw away the bits that fall off the end, that is, those beyond the 52nd bit to the right of the decimal point. This protocol is simple, but it is biased in that it always moves the result toward zero.

    The alternative method is rounding. In base 10, numbers are customarily rounded up if the next digit is 5 or higher, and rounded down otherwise. In binary, this corresponds to rounding up if the bit is 1. Specifically, the important bit in the double precision format is the 53rd bit to the right of the radix point, the first one lying outside of the box. The default rounding technique, implemented by the IEEE standard, is to add 1 to bit 52 (round up) if bit 53 is 1, and to do nothing (round down) to bit 52 if bit 53 is 0, with one exception: If the bits following bit 52 are 10000..., exactly halfway between up and down, we round up or round down according to which choice makes the final bit 52 equal to 0. (Here we are dealing with the mantissa only, since the sign does not play a role.)

    Why is there the strange exceptional case? Except for this case, the rule means rounding to the normalized floating point number closest to the original number; hence its name, the Rounding to Nearest Rule. The error made in rounding will be equally likely to be up or down. Therefore, the exceptional case, the case where there are two equally distant floating point numbers to round to, should be decided in a way that doesn't prefer up or down systematically. This is to try to avoid the possibility of an unwanted slow drift in long calculations due simply to a biased rounding. The choice to make the final bit 52 equal to 0 in the case of a tie is somewhat arbitrary, but at least it does not display a preference up or down. Problem 8 sheds some light on why the arbitrary choice of 0 is made in case of a tie.


    IEEE Rounding to Nearest Rule

    For double precision, if the 53rd bit to the right of the binary point is 0, then round down (truncate after the 52nd bit). If the 53rd bit is 1, then round up (add 1 to bit 52), unless all known bits to the right of the 1 are 0's, in which case 1 is added to bit 52 if and only if bit 52 is 1.

    For the number 9.4 discussed previously, the 53rd bit to the right of the binary point is a 1 and is followed by other nonzero bits. The Rounding to Nearest Rule says to round up, or add 1 to bit 52. Therefore, the floating point number that represents 9.4 is

    +1. 0010110011001100110011001100110011001100110011001101 × 2^3.     (0.7)

    DEFINITION 0.2  Denote the IEEE double precision floating point number associated to x, using the Rounding to Nearest Rule, by fl(x).

    In computer arithmetic, the real number x is replaced with the string of bits fl(x). According to this definition, fl(9.4) is the number in the binary representation (0.7). We arrived at the floating point representation by discarding the infinite tail .1100 (repeating) × 2^-52 × 2^3 = .0110 (repeating) × 2^-51 × 2^3 = .4 × 2^-48 from the right end of the number and then adding 2^-52 × 2^3 = 2^-49 in the rounding step. Therefore,

    fl(9.4) = 9.4 + 2^-49 - 0.4 × 2^-48
            = 9.4 + (1 - 0.8) × 2^-49
            = 9.4 + 0.2 × 2^-49.     (0.8)

    In other words, a computer using double precision representation and the Rounding to Nearest Rule makes an error of 0.2 × 2^-49 when storing 9.4. We call 0.2 × 2^-49 the rounding error.

    The important message is that the floating point number representing 9.4 is not equal to 9.4, although it is very close. To quantify that closeness, we use the standard definition of error.

    DEFINITION 0.3  Let xc be a computed version of the exact quantity x. Then

    absolute error = |xc - x|,

    and

    relative error = |xc - x| / |x|,

    if the latter quantity exists.

    Relative rounding error

    In the IEEE machine arithmetic model, the relative rounding error of fl(x) is no more than one-half machine epsilon:

    |fl(x) - x| / |x| <= (1/2) ε_mach.     (0.9)

    In the case of the number x = 9.4, we worked out the rounding error in (0.8), which must satisfy (0.9):

    |fl(9.4) - 9.4| / 9.4 = (0.2 × 2^-49) / 9.4 = (8/47) × 2^-52 < (1/2) ε_mach.

    EXAMPLE 0.2  Find the double precision representation fl(x) and rounding error for x = 0.4.

    Since (0.4)10 = (.0110)2 with the block 0110 repeating, left-justifying the binary number results in

    0.4 = 1.100110... × 2^-2
        = +1. 1001100110011001100110011001100110011001100110011001 100110... × 2^-2.

    Therefore, according to the rounding rule, fl(0.4) is

    +1. 1001100110011001100110011001100110011001100110011010 × 2^-2.

    Here, 1 has been added to bit 52, which caused bit 51 also to change, due to carrying in the binary addition.

    Analyzing carefully, we discarded 2^-53 × 2^-2 + .0110 (repeating) × 2^-54 × 2^-2 in the truncation and added 2^-52 × 2^-2 by rounding up. Therefore,

    fl(0.4) = 0.4 - 2^-55 - 0.4 × 2^-56 + 2^-54
            = 0.4 + 2^-54(-1/2 - 0.1 + 1)
            = 0.4 + 2^-54(.4)
            = 0.4 + 0.1 × 2^-52.

    Notice that the relative error in rounding for 0.4 is (0.1/0.4) ε_mach = (1/4) ε_mach, obeying (0.9).

    0.3.2 Machine representation

    So far, we have described a floating point representation in the abstract. Here are a few more details about how this representation is implemented on a computer. Again, in this section we will discuss the double precision format; the other formats are very similar.

    Each double precision floating point number is assigned an 8-byte word, or 64 bits, to store its three parts. Each such word has the form

    s e1 e2 ... e11 b1 b2 ... b52,     (0.10)

    where the sign is stored, followed by 11 bits representing the exponent and the 52 bits following the decimal point, representing the mantissa. The sign bit s is 0 for a positive number and 1 for a negative number. The 11 bits representing the exponent come from the positive binary integer resulting from adding 2^10 - 1 = 1023 to the exponent, at least for exponents between -1022 and 1023. This covers values of e1...e11 from 1 to 2046, leaving 0 and 2047 for special purposes, which we will return to later.

    The number 1023 is called the exponent bias of the double precision format. It is used to convert both positive and negative exponents to positive binary numbers for storage in the exponent bits. For single and long-double precision, the exponent bias values are 127 and 16383, respectively.

    Matlab's format hex consists simply of expressing the 64 bits of the machine number (0.10) as 16 successive hexadecimal, or base 16, numbers. Thus, the first 3 hex numerals represent the sign and exponent combined, while the last 13 contain the mantissa.

    For example, the number 1, or

    1 = +1. 0000000000000000000000000000000000000000000000000000 × 2^0,


    has double precision machine number form

    0 01111111111 0000000000000000000000000000000000000000000000000000

    once the usual 1023 is added to the exponent. The first three hex digits correspond to

    001111111111 = 3FF,

    so the format hex representation of the floating point number 1 will be 3FF0000000000000. You can check this by typing format hex into Matlab and entering the number 1.

    EXAMPLE 0.3  Find the hex machine number representation of the real number 9.4.

    From (0.7), we find that the sign is s = 0, the exponent is 3, and the 52 bits of the mantissa after the decimal point are

    0010 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1101 = (2CCCCCCCCCCCD)16.

    Adding 1023 to the exponent gives 1026 = 2^10 + 2, or (10000000010)2. The sign and exponent combination is (010000000010)2 = (402)16, making the hex format 4022CCCCCCCCCCCD.
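    This can be confirmed directly in MATLAB (an illustrative check, not from the text); the built-in num2hex returns the same 16 hex digits that format hex displays:

    >> num2hex(9.4)

    ans =

    4022cccccccccccd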

    Now we return to the special exponent values 0 and 2047. The latter, 2047, is used to represent ∞ if the mantissa bit string is all zeros, and NaN, which stands for Not a Number, otherwise. Since 2047 is represented by eleven 1 bits, or e1 e2 ... e11 = (111 1111 1111)2, the first twelve bits of Inf and -Inf are 0111 1111 1111 and 1111 1111 1111, respectively, and the remaining 52 bits (the mantissa) are zero. The machine number NaN also begins 1111 1111 1111 but has a nonzero mantissa. In summary,

    machine number    example    hex format
    +Inf               1/0       7FF0000000000000
    -Inf              -1/0       FFF0000000000000
    NaN                0/0       FFFxxxxxxxxxxxxx

    where the x's denote bits that are not all zero.

    The special exponent 0, meaning e1 e2 ... e11 = (000 0000 0000)2, also denotes a departure from the standard floating point form. In this case the machine number is interpreted as the non-normalized floating point number

    ±0. b1 b2 ... b52 × 2^-1022.     (0.11)

    That is, in this case only, the leftmost bit is no longer assumed to be 1. These non-normalized numbers are called subnormal floating point numbers. They extend the range of very small numbers by a few more orders of magnitude. Therefore, 2^-52 × 2^-1022 = 2^-1074 is the smallest nonzero representable number in double precision. Its machine word is

    0 00000000000 0000000000000000000000000000000000000000000000000001.

    Be sure to understand the difference between the smallest representable number 2^-1074 and ε_mach = 2^-52. Many numbers below ε_mach are machine representable, even though adding them to 1 may have no effect. On the other hand, double precision numbers below 2^-1074 cannot be represented at all.
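    These two very different small numbers can be examined in MATLAB (an illustrative sketch, not from the text; the commands below only use built-in arithmetic and the eps constant):

    format long
    eps            % machine epsilon 2^-52 = 2.220446049250313e-16
    2^-1074        % smallest positive subnormal double, about 4.94e-324
    2^-1074/2      % half of that lies below the representable range and underflows to 0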


    The subnormal numbers include the most important number 0. In fact, the subnormal representation includes two different floating point numbers, +0 and -0, that are treated in computations as the same real number. The machine representation of +0 has sign bit s = 0, exponent bits e1 ... e11 = 00000000000, and mantissa 52 zeros; in short, all 64 bits are zero. The hex format for +0 is 0000000000000000. For the number -0, all is exactly the same, except for the sign bit s = 1. The hex format for -0 is 8000000000000000.

    0.3.3 Addition of floating point numbers

    Machine addition consists of lining up the decimal points of the two numbers to be added, adding them, and then storing the result again as a floating point number. The addition itself can be done in higher precision (with more than 52 bits) since it takes place in a register dedicated just to that purpose. Following the addition, the result must be rounded back to 52 bits beyond the binary point for storage as a machine number.

    For example, adding 1 to 2^-53 would appear as follows:

    1. 000... × 2^0 + 1. 000... × 2^-53

    = 1. 0000000000000000000000000000000000000000000000000000 × 2^0
    + 0. 0000000000000000000000000000000000000000000000000000 1 × 2^0

    = 1. 0000000000000000000000000000000000000000000000000000 1 × 2^0

    This is saved as 1.000... × 2^0 = 1, according to the rounding rule. Therefore, 1 + 2^-53 is equal to 1 in double precision IEEE arithmetic. Note that 2^-53 is the largest floating point number with this property; anything larger added to 1 would result in a sum greater than 1 under computer arithmetic.

    The fact that ε_mach = 2^-52 does not mean that numbers smaller than ε_mach are negligible in the IEEE model. As long as they are representable in the model, computations with numbers of this size are just as accurate, assuming that they are not added or subtracted to numbers of unit size.
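    These boundary cases are easy to confirm in MATLAB (an illustrative check, not from the text):

    1 + 2^-53 == 1     % true: 2^-53 is rounded away when added to 1
    1 + 2^-52 == 1     % false: adding eps = 2^-52 gives the next floating point number
    2^-53 + 2^-53      % exactly 2^-52: numbers below eps are still represented exactly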

    It is important to realize that computer arithmetic, because of the truncation and rounding that it carries out, can sometimes give surprising results. For example, if a double precision computer with IEEE rounding to nearest is asked to store 9.4, then subtract 9, and then subtract 0.4, the result will be something other than zero! What happens is the following: First, 9.4 is stored as 9.4 + 0.2 × 2^-49, as shown previously. When 9 is subtracted (note that 9 can be represented with no error), the result is 0.4 + 0.2 × 2^-49. Now, asking the computer to subtract 0.4 results in subtracting (as we found in Example 0.2) the machine number fl(0.4) = 0.4 + 0.1 × 2^-52, which will leave

    0.2 × 2^-49 - 0.1 × 2^-52 = 0.1 × 2^-52 × (2^4 - 1) = 3 × 2^-53

    instead of zero. This is a small number, on the order of ε_mach, but it is not zero. Since Matlab's basic data type is the IEEE double precision number, we can illustrate this finding in a Matlab session:

    >> format long
    >> x=9.4

    x =

    9.40000000000000

    >> y=x-9

  • 14 | CHAPTER 0 Fundamentals

    y =

    0.40000000000000

    >> z=y-0.4

    z =

    3.330669073875470e-16

    >> 3*2^(-53)

    ans =

    3.330669073875470e-16

    EXAMPLE 0.4  Find the double precision floating point sum (1 + 3 × 2^-53) - 1.

    Of course, in real arithmetic the answer is 3 × 2^-53. However, floating point arithmetic may differ. Note that 3 × 2^-53 = 2^-52 + 2^-53. The first addition is

    1. 000... × 2^0 + 1. 100... × 2^-52

    = 1. 0000000000000000000000000000000000000000000000000000 × 2^0
    + 0. 0000000000000000000000000000000000000000000000000001 1 × 2^0

    = 1. 0000000000000000000000000000000000000000000000000001 1 × 2^0.

    This is again the exceptional case for the rounding rule. Since bit 52 in the sum is 1, we must round up, which means adding 1 to bit 52. After carrying, we get

    +1. 0000000000000000000000000000000000000000000000000010 × 2^0,

    which is the representation of 1 + 2^-51. Therefore, after subtracting 1, the result will be 2^-51, which is equal to 2 ε_mach = 4 × 2^-53. Once again, note the difference between computer arithmetic and exact arithmetic. Check this result by using Matlab.
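    One way to carry out that check (an illustrative session, not from the text; the printed digits are the double precision value of 2^-51, not the exact answer 3 × 2^-53):

    >> format long
    >> (1 + 3*2^-53) - 1

    ans =

         4.440892098500626e-16

    >> 2^-51

    ans =

         4.440892098500626e-16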

    Calculations in Matlab, or in any compiler performing floating point calculation under the IEEE standard, follow the precise rules described in this section. Although floating point calculation can give surprising results because it differs from exact arithmetic, it is always predictable. The Rounding to Nearest Rule is the typical default rounding, although, if desired, it is possible to change to other rounding rules by using compiler flags. The comparison of results from different rounding protocols is sometimes useful as an informal way to assess the stability of a calculation.

    It may be surprising that small rounding errors alone, of relative size ε_mach, are capable of derailing meaningful calculations. One mechanism for this is introduced in the next section. More generally, the study of error magnification and conditioning is a recurring theme in Chapters 1, 2, and beyond.

    0.3 Exercises

    1. Convert the following base 10 numbers to binary and express each as a floating point number fl(x) by using the Rounding to Nearest Rule: (a) 1/4 (b) 1/3 (c) 2/3 (d) 0.9


    2. Convert the following base 10 numbers to binary and express each as a floating point number fl(x) by using the Rounding to Nearest Rule: (a) 9.5 (b) 9.6 (c) 100.2 (d) 44/7

    3. For which positive integers k can the number 5 + 2^-k be represented exactly (with no rounding error) in double precision floating point arithmetic?

    4. Find the largest integer k for which fl(19 + 2^-k) > fl(19) in double precision floating point arithmetic.

    5. Do the following sums by hand in IEEE double precision computer arithmetic, using the Rounding to Nearest Rule. (Check your answers, using Matlab.)
    (a) (1 + (2^-51 + 2^-53)) − 1
    (b) (1 + (2^-51 + 2^-52 + 2^-53)) − 1

    6. Do the following sums by hand in IEEE double precision computer arithmetic, using the Rounding to Nearest Rule:
    (a) (1 + (2^-51 + 2^-52 + 2^-54)) − 1
    (b) (1 + (2^-51 + 2^-52 + 2^-60)) − 1

    7. Write each of the given numbers in Matlab's format hex. Show your work. Then check your answers with Matlab. (a) 8 (b) 21 (c) 1/8 (d) fl(1/3) (e) fl(2/3) (f) fl(0.1) (g) fl(−0.1) (h) fl(−0.2)

    8. Is 1/3 + 2/3 exactly equal to 1 in double precision floating point arithmetic, using the IEEE Rounding to Nearest Rule? You will need to use fl(1/3) and fl(2/3) from Exercise 1. Does this help explain why the rule is expressed as it is? Would the sum be the same if chopping after bit 52 were used instead of IEEE rounding?

    9. (a) Explain why you can determine machine epsilon on a computer using IEEE double precision and the IEEE Rounding to Nearest Rule by calculating (7/3 − 4/3) − 1. (b) Does (4/3 − 1/3) − 1 also give ε_mach? Explain by converting to floating point numbers and carrying out the machine arithmetic.

    10. Decide whether 1 + x > 1 in double precision floating point arithmetic, with Rounding to Nearest. (a) x = 2^-53 (b) x = 2^-53 + 2^-60

    11. Does the associative law hold for IEEE computeraddition?

    12. Find the IEEE double precision representation fl(x), and find the exact difference fl(x) − x for the given real numbers. Check that the relative rounding error is no more than ε_mach/2. (a) x = 1/3 (b) x = 3.3 (c) x = 9/7

    13. There are 64 double precision floating point numbers whose 64-bit machine representations have exactly one nonzero bit. Find the (a) largest (b) second-largest (c) smallest of these numbers.

    14. Do the following operations by hand in IEEE double precision computer arithmetic, using the Rounding to Nearest Rule. (Check your answers, using Matlab.)
    (a) (4.3 − 3.3) − 1 (b) (4.4 − 3.4) − 1 (c) (4.9 − 3.9) − 1

    15. Do the following operations by hand in IEEE double precision computer arithmetic, using the Rounding to Nearest Rule.
    (a) (8.3 − 7.3) − 1 (b) (8.4 − 7.4) − 1 (c) (8.8 − 7.8) − 1


    16. Find the IEEE double precision representation fl(x), and find the exact difference fl(x) − x for the given real numbers. Check that the relative rounding error is no more than ε_mach/2. (a) x = 2.75 (b) x = 2.7 (c) x = 10/3

    0.4 LOSS OF SIGNIFICANCE

    An advantage of knowing the details of computer arithmetic is that we are therefore in a better position to understand potential pitfalls in computer calculations. One major problem that arises in many forms is the loss of significant digits that results from subtracting nearly equal numbers. In its simplest form, this is an obvious statement. Assume that through considerable effort, as part of a long calculation, we have determined two numbers correct to seven significant digits, and now need to subtract them:

      123.4567
    − 123.4566
      000.0001

    The subtraction problem began with two input numbers that we knew to seven-digit accuracy, and ended with a result that has only one-digit accuracy. Although this example is quite straightforward, there are other examples of loss of significance that are more subtle, and in many cases this can be avoided by restructuring the calculation.

    EXAMPLE 0.5 Calculate √9.01 − 3 on a three-decimal-digit computer.

    This example is still fairly simple and is presented only for illustrative purposes.

    Instead of using a computer with a 52-bit mantissa, as in double precision IEEE standard format, we assume that we are using a three-decimal-digit computer. Using a three-digit computer means that each intermediate calculation along the way is stored into a floating point number with a three-digit mantissa. The problem data (the 9.01 and 3.00) are given to three-digit accuracy. Since we are going to use a three-digit computer, being optimistic, we might hope to get an answer that is good to three digits. (Of course, we can't expect more than this because we only carry along three digits during the calculation.) Checking on a hand calculator, we see that the correct answer is approximately 0.0016662 = 1.6662 × 10^-3. How many correct digits do we get with the three-digit computer?

    None, as it turns out. Since √9.01 ≈ 3.0016662, when we store this intermediate result to three significant digits we get 3.00. Subtracting 3.00, we get a final answer of 0.00. No significant digits in our answer are correct.

    Surprisingly, there is a way to save this computation, even on a three-digit computer. What is causing the loss of significance is the fact that we are explicitly subtracting nearly equal numbers, √9.01 and 3. We can avoid this problem by using algebra to rewrite the expression:

    √9.01 − 3 = (√9.01 − 3)(√9.01 + 3) / (√9.01 + 3)
              = (9.01 − 3²) / (√9.01 + 3)
              = 0.01 / (3.00 + 3)
              = 0.01 / 6.00
              = 0.00167 ≈ 1.67 × 10^-3.

    Here, we have rounded the last digit of the mantissa up to 7 since the next digit is 6. Notice that we got all three digits correct this way, at least the three digits that the correct answer rounds to. The lesson is that it is important to find ways to avoid subtracting nearly equal numbers in calculations, if possible.
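    The three-decimal-digit computer is hypothetical, but its behavior can be imitated in Matlab by rounding every intermediate result to three significant digits. The sketch below assumes a recent Matlab release (round with the 'significant' option) and uses a helper chop3 of our own naming:

    chop3 = @(x) round(x, 3, 'significant');        % keep three significant digits
    direct = chop3(sqrt(9.01)) - 3                  % gives 0: no correct digits
    conjugate = chop3(chop3(9.01 - 3^2) / chop3(chop3(sqrt(9.01)) + 3))
                                                    % recovers the three-digit answer 0.00167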

    The method that worked in the preceding example was essentially a trick. Multiplying by the conjugate expression is one trick that can help restructure the calculation. Often, specific identities can be used, as with trigonometric expressions. For example, calculation of 1 − cos x when x is close to zero is subject to loss of significance. Let's compare the calculation of the expressions

    E1 = (1 − cos x)/sin² x    and    E2 = 1/(1 + cos x)

    for a range of input numbers x. We arrived at E2 by multiplying the numerator and denominator of E1 by 1 + cos x, and using the trig identity sin² x + cos² x = 1. In infinite precision, the two expressions are equal. Using the double precision of Matlab computations, we get the following table:

    x                   E1                  E2
    1.00000000000000    0.64922320520476    0.64922320520476
    0.10000000000000    0.50125208628858    0.50125208628857
    0.01000000000000    0.50001250020848    0.50001250020834
    0.00100000000000    0.50000012499219    0.50000012500002
    0.00010000000000    0.49999999862793    0.50000000125000
    0.00001000000000    0.50000004138685    0.50000000001250
    0.00000100000000    0.50004445029134    0.50000000000013
    0.00000010000000    0.49960036108132    0.50000000000000
    0.00000001000000    0.00000000000000    0.50000000000000
    0.00000000100000    0.00000000000000    0.50000000000000
    0.00000000010000    0.00000000000000    0.50000000000000
    0.00000000001000    0.00000000000000    0.50000000000000
    0.00000000000100    0.00000000000000    0.50000000000000

    The right column E2 is correct up to the digits shown. The E1 computation, due to the subtraction of nearly equal numbers, is having major problems below x = 10^-5 and has no correct significant digits for inputs x = 10^-8 and below.

    The expression E1 already has several incorrect digits for x = 10^-4 and gets worse as x decreases. The equivalent expression E2 does not subtract nearly equal numbers and has no such problems.
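    The table can be reproduced with a short Matlab loop; the following is a sketch (the loop bounds and print format are ours, not the book's):

    for p = 0:12
        x  = 10^(-p);
        E1 = (1 - cos(x))/sin(x)^2;   % subtracts nearly equal numbers when x is near 0
        E2 = 1/(1 + cos(x));          % algebraically equivalent, no cancellation
        fprintf('%18.14f %18.14f %18.14f\n', x, E1, E2)
    end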

    The quadratic formula is often subject to loss of significance. Again, it is easy to avoid as long as you know it is there and how to restructure the expression.

    EXAMPLE 0.6 Find both roots of the quadratic equation x² + 9^12 x = 3.

    Try this one in double precision arithmetic, for example, using Matlab. Neither one will give the right answer unless you are aware of loss of significance and know how to counteract it. The problem is to find both roots, let's say, with four-digit accuracy. So far it looks like an easy problem. The roots of a quadratic equation of form ax² + bx + c = 0 are given by the quadratic formula

    x = (−b ± √(b² − 4ac)) / (2a).    (0.12)

    For our problem, this translates to

    x = (−9^12 ± √(9^24 + 4(3))) / 2.


    Using the minus sign gives the root

    x1 = −2.824 × 10^11,

    correct to four significant digits. For the plus sign root

    x2 = (−9^12 + √(9^24 + 4(3))) / 2,

    Matlab calculates 0. Although the correct answer is close to 0, the answer has no correct significant digits, even though the numbers defining the problem were specified exactly (essentially with infinitely many correct digits) and despite the fact that Matlab computes with approximately 16 significant digits (an interpretation of the fact that the machine epsilon of Matlab is 2^-52 ≈ 2.2 × 10^-16). How do we explain the total failure to get accurate digits for x2?

    The answer is loss of significance. It is clear that 9^12 and √(9^24 + 4(3)) are nearly equal, relatively speaking. More precisely, as stored floating point numbers, their mantissas not only start off similarly, but also are actually identical. When they are subtracted, as directed by the quadratic formula, of course the result is zero.

    Can this calculation be saved? We must fix the loss of significance problem. The correct way to compute x2 is by restructuring the quadratic formula:

    x2 = (−b + √(b² − 4ac)) / (2a)
       = (−b + √(b² − 4ac))(b + √(b² − 4ac)) / (2a(b + √(b² − 4ac)))
       = −4ac / (2a(b + √(b² − 4ac)))
       = −2c / (b + √(b² − 4ac)).

    Substituting a, b, c for our example yields, according to Matlab, x2 = 1.062 × 10^-11, which is correct to four significant digits of accuracy, as required.
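    In Matlab, the two ways of computing x2 for this example can be compared directly; the following lines are a sketch of such a check:

    a = 1;  b = 9^12;  c = -3;
    x2_naive  = (-b + sqrt(b^2 - 4*a*c))/(2*a)   % loss of significance: Matlab returns 0
    x2_stable = -2*c/(b + sqrt(b^2 - 4*a*c))     % approximately 1.062e-11, four correct digits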

    This example shows us that the quadratic formula (0.12) must be used with care in cases where a and/or c are small compared with b. More precisely, if 4|ac| ≪ b², then b and √(b² − 4ac) are nearly equal in magnitude, and one of the roots is subject to loss of significance. If b is positive in this situation, then the two roots should be calculated as

    x1 = −(b + √(b² − 4ac)) / (2a)    and    x2 = −2c / (b + √(b² − 4ac)).    (0.13)

    Note that neither formula suffers from subtracting nearly equal numbers. On the other hand, if b is negative and 4|ac| ≪ b², then the two roots are best calculated as

    x1 = (−b + √(b² − 4ac)) / (2a)    and    x2 = 2c / (−b + √(b² − 4ac)).    (0.14)
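    The two cases can be folded into a single small routine that always adds quantities of the same sign before dividing. The function below, with the made-up name stablequad, is our own sketch based on formulas (0.13) and (0.14); it assumes real roots (b² ≥ 4ac) and a nonzero:

    function [x1,x2] = stablequad(a,b,c)
    %STABLEQUAD  Roots of ax^2+bx+c=0 without subtracting nearly equal numbers,
    %            in the spirit of formulas (0.13) and (0.14).
    d = sqrt(b^2 - 4*a*c);        % assumes b^2 >= 4ac
    if b >= 0
        x1 = -(b + d)/(2*a);      % formula (0.13)
        x2 = -2*c/(b + d);
    else
        x1 = (-b + d)/(2*a);      % formula (0.14)
        x2 = 2*c/(-b + d);
    end
    end

    For the quadratic of Example 0.6, stablequad(1, 9^12, -3) returns both roots with four or more correct significant digits.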


    0.4 Exercises

    1. Identify for which values of x there is subtraction of nearly equal numbers, and find an alternate form that avoids the problem.
    (a) (1 − sec x)/tan² x
    (b) (1 − (1 − x)³)/x
    (c) 1/(1 + x) − 1/(1 − x)

    2. Find the roots of the equation x² + 3x − 8^-14 = 0 with three-digit accuracy.

    3. Explain how to most accurately compute the two roots of the equation x² + bx − 10^-12 = 0, where b is a number greater than 100.

    4. Prove formula 0.14.

    0.4 Computer Problems

    1. Calculate the expressions that follow in double precision arithmetic (using Matlab, for example) for x = 10^-1, . . . , 10^-14. Then, using an alternative form of the expression that doesn't suffer from subtracting nearly equal numbers, repeat the calculation and make a table of results. Report the number of correct digits in the original expression for each x.

    (a) (1 − sec x)/tan² x
    (b) (1 − (1 − x)³)/x

    2. Find the smallest value of p for which the expression calculated in double precision arithmetic at x = 10^-p has no correct significant digits. (Hint: First find the limit of the expression as x → 0.)

    (a) (tan x − x)/x³
    (b) (e^x + cos x − sin x − 2)/x³

    3. Evaluate the quantity a + √(a² + b²) to four correct significant digits, where a = −12345678987654321 and b = 123.

    4. Evaluate the quantity √(c² + d) − c to four correct significant digits, where c = 246886422468 and d = 13579.

    5. Consider a right triangle whose legs are of length 3344556600 and 1.2222222. How much longer is the hypotenuse than the longer leg? Give your answer with at least four correct digits.

    0.5 REVIEW OF CALCULUS

    Some important basic facts from calculus will be necessary later. The Intermediate Value Theorem and the Mean Value Theorem are important for solving equations in Chapter 1. Taylor's Theorem is important for understanding interpolation in Chapter 3 and becomes of paramount importance for solving differential equations in Chapters 6, 7, and 8.

    The graph of a continuous function has no gaps. For example, if the function is positive for one x-value and negative for another, it must pass through zero somewhere. This fact is basic for getting equation solvers to work in the next chapter. The first theorem, illustrated in Figure 0.1(a), generalizes this notion.


    Figure 0.1 Three important theorems from calculus. There exist numbers c between a and b such that: (a) f(c) = y, for any given y between f(a) and f(b), by Theorem 0.4, the Intermediate Value Theorem; (b) the instantaneous slope of f at c equals (f(b) − f(a))/(b − a), by Theorem 0.6, the Mean Value Theorem; (c) the vertically shaded region is equal in area to the horizontally shaded region, by Theorem 0.9, the Mean Value Theorem for Integrals, shown in the special case g(x) = 1.

    THEOREM 0.4 (Intermediate Value Theorem) Let f be a continuous function on the interval [a,b]. Then f realizes every value between f(a) and f(b). More precisely, if y is a number between f(a) and f(b), then there exists a number c with a ≤ c ≤ b such that f(c) = y.

    EXAMPLE 0.7 Show that f(x) = x² − 3 on the interval [1,3] must take on the values 0 and 1.

    Because f(1) = −2 and f(3) = 6, all values between −2 and 6, including 0 and 1, must be taken on by f. For example, setting c = √3, note that f(c) = f(√3) = 0, and secondly, f(2) = 1.

    THEOREM 0.5 (Continuous Limits) Let f be a continuous function in a neighborhood of x0, and assume lim_{n→∞} xn = x0. Then

    lim_{n→∞} f(xn) = f( lim_{n→∞} xn ) = f(x0).

    In other words, limits may be brought inside continuousfunctions.

    THEOREM 0.6 (Mean Value Theorem) Let f be a continuously differentiable function on the interval [a,b]. Then there exists a number c between a and b such that f'(c) = (f(b) − f(a))/(b − a).

    EXAMPLE 0.8 Apply the Mean Value Theorem to f(x) = x² − 3 on the interval [1,3].

    The content of the theorem is that because f(1) = −2 and f(3) = 6, there must exist a number c in the interval (1,3) satisfying f'(c) = (6 − (−2))/(3 − 1) = 4. It is easy to find such a c. Since f'(x) = 2x, the correct c = 2.

    The next statement is a special case of the Mean ValueTheorem.

    THEOREM 0.7 (Rolle's Theorem) Let f be a continuously differentiable function on the interval [a,b], and assume that f(a) = f(b). Then there exists a number c between a and b such that f'(c) = 0.


    Figure 0.2 Taylor's Theorem with Remainder. The function f(x), denoted by the solid curve, is approximated successively better near x0 by the degree 0 Taylor polynomial (horizontal dashed line), the degree 1 Taylor polynomial (slanted dashed line), and the degree 2 Taylor polynomial (dashed parabola). The difference between f(x) and its approximation at x is the Taylor remainder.

    Taylor approximation underlies many simple computational techniques that we will study. If a function f is known well at a point x0, then a lot of information about f at nearby points can be learned. If the function is continuous, then for points x near x0, the function value f(x) will be approximated reasonably well by f(x0). However, if f'(x0) > 0, then f has greater values for nearby points to the right, and lesser values for points to the left, since the slope near x0 is approximately given by the derivative. The line through (x0, f(x0)) with slope f'(x0), shown in Figure 0.2, is the Taylor approximation of degree 1. Further small corrections can be extracted from higher derivatives, and give the higher degree Taylor approximations. Taylor's Theorem uses the entire set of derivatives at x0 to give a full accounting of the function values in a small neighborhood of x0.

    THEOREM 0.8 (Taylor's Theorem with Remainder) Let x and x0 be real numbers, and let f be k + 1 times continuously differentiable on the interval between x and x0. Then there exists a number c between x and x0 such that

    f(x) = f(x0) + f'(x0)(x − x0) + (f''(x0)/2!)(x − x0)² + (f'''(x0)/3!)(x − x0)³ + ...
           + (f^(k)(x0)/k!)(x − x0)^k + (f^(k+1)(c)/(k + 1)!)(x − x0)^(k+1).

    The polynomial part of the result, the terms up to degree k in x − x0, is called the degree k Taylor polynomial for f centered at x0. The final term is called the Taylor remainder. To the extent that the Taylor remainder term is small, Taylor's Theorem gives a way to approximate a general, smooth function with a polynomial. This is very convenient in solving problems with a computer, which, as mentioned earlier, can evaluate polynomials very efficiently.

    EXAMPLE 0.9 Find the degree 4 Taylor polynomial P4(x) for f(x) = sin x centered at the point x0 = 0. Estimate the maximum possible error when using P4(x) to estimate sin x for |x| ≤ 0.0001.


    The polynomial is easily calculated to be P4(x) = x − x³/6. Note that the degree 4 term is absent, since its coefficient is zero. The remainder term is

    (x⁵/120) cos c,

    which in absolute value cannot be larger than |x|⁵/120. For |x| ≤ 0.0001, the remainder is at most 10^-20/120 and will be invisible when, for example, x − x³/6 is used in double precision to approximate sin 0.0001. Check this by computing both in Matlab.
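    Following that suggestion, a quick Matlab comparison (a sketch) shows that sin 0.0001 and the Taylor polynomial agree to double precision:

    x = 0.0001;
    p4 = x - x^3/6;       % degree 4 Taylor polynomial (the degree 4 term is zero)
    err = sin(x) - p4     % roughly 10^-20 or smaller: the remainder is invisible here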

    Finally, the integral version of the Mean Value Theorem isillustrated in Figure 0.1(c).

    THEOREM 0.9 (Mean Value Theorem for Integrals) Let f be a continuous function on the interval [a,b], and let g be an integrable function that does not change sign on [a,b]. Then there exists a number c between a and b such that

    ∫_a^b f(x)g(x) dx = f(c) ∫_a^b g(x) dx.

    0.5 Exercises

    1. Use the Intermediate Value Theorem to prove that f(c) = 0 for some 0 < c < 1.
    (a) f(x) = x³ − 4x + 1 (b) f(x) = 5 cos x − 4 (c) f(x) = 8x⁴ − 8x² + 1

    2. Find c satisfying the Mean Value Theorem for f(x) on the interval [0,1]. (a) f(x) = e^x (b) f(x) = x² (c) f(x) = 1/(x + 1)

    3. Find c satisfying the Mean Value Theorem for Integrals with f(x), g(x) in the interval [0,1]. (a) f(x) = x, g(x) = x (b) f(x) = x², g(x) = x (c) f(x) = x, g(x) = e^x

    4. Find the Taylor polynomial of degree 2 about the point x = 0 for the following functions: (a) f(x) = e^(x²) (b) f(x) = cos 5x (c) f(x) = 1/(x + 1)

    5. Find the Taylor polynomial of degree 5 about the point x = 0 for the following functions: (a) f(x) = e^(x²) (b) f(x) = cos 2x (c) f(x) = ln(1 + x) (d) f(x) = sin² x

    6. (a) Find the Taylor polynomial of degree 4 for f(x) = x^-2 about the point x = 1.
    (b) Use the result of (a) to approximate f(0.9) and f(1.1).
    (c) Use the Taylor remainder to find an error formula for the Taylor polynomial. Give error bounds for each of the two approximations made in part (b). Which of the two approximations in part (b) do you expect to be closer to the correct value?
    (d) Use a calculator to compare the actual error in each case with your error bound from part (c).

    7. Carry out Exercise 6 (a)-(d) for f(x) = ln x.

    8. (a) Find the degree 5 Taylor polynomial P(x) centered at x = 0 for f(x) = cos x. (b) Find an upper bound for the error in approximating f(x) = cos x for x in [−π/4, π/4] by P(x).

    9. A common approximation for √(1 + x) is 1 + x/2, when x is small. Use the degree 1 Taylor polynomial of f(x) = √(1 + x) with remainder to determine a formula of the form √(1 + x) = 1 + x/2 + E. Evaluate E for the case of approximating √1.02. Use a calculator to compare the actual error to your error bound E.


    Software and Further Reading

    The IEEE standard for floating point computation is published as IEEE Standard 754 [1985]. Goldberg [1991] and Stallings [2003] discuss floating point arithmetic in great detail, and Overton [2001] emphasizes the IEEE 754 standard. The texts Wilkinson [1994] and Knuth [1981] had great influence on the development of both hardware and software.

    There are several software packages that specialize in general-purpose scientific computing, the bulk of it done in floating point arithmetic. Netlib (http://www.netlib.org) is a collection of free software maintained by AT&T Bell Laboratories, the University of Tennessee, and Oak Ridge National Laboratory. The collection consists of high-quality programs available in Fortran, C, and Java, but it comes with little support. The comments in the code are meant to be sufficiently instructive for the user to operate the program.

    The Numerical Algorithms Group (NAG) (http://www.nag.co.uk) markets a library containing over 1400 user-callable subroutines for solving general applied math problems. The programs are available in Fortran and C and are callable from Java programs. NAG includes libraries for shared memory and distributed memory computing.

    The International Mathematics and Statistics Library (IMSL) is a product of Rogue Wave Software (www.roguewave.com), and covers areas similar to those covered by the NAG library. Fortran, C, and Java programs are available. It also provides PV-WAVE, a powerful programming language with data analysis and visualization capabilities.

    The computing environments Mathematica, Maple, and Matlab have grown to encompass many of the same computational methods previously described and have built-in editing and graphical interfaces. Mathematica (http://www.wolframresearch.com) and Maple (www.maplesoft.com) came to prominence due to novel symbolic computing engines. Matlab has grown to serve many science and engineering applications through toolboxes, which leverage the basic high-quality software into diverse directions.

    In this text, we frequently illustrate basic algorithms with Matlab implementations. The Matlab code given is meant to be instructional only. Quite often, speed and reliability are sacrificed for clarity and readability. Readers who are new to Matlab should begin with the tutorial in Appendix B; they will soon be doing their own implementations.

    CHAPTER 1

    Solving Equations

    A recently excavated cuneiform tablet shows that the Babylonians calculated the square root of 2 correctly to within five decimal places. Their technique is unknown, but in this chapter we introduce iterative methods that they may have used and that are still used by modern calculators to find square roots.

    The Stewart platform, a six-degree-of-freedom robot that can be located with extreme precision, was originally developed by Eric Gough of Dunlop Tire Corporation in the 1950s to test airplane tires. Today its applications range from flight simulators, which are often of considerable mass, to medical and surgical applications, where precision is very important. Solving the forward kinematics problem requires determining the position and orientation of the platform, given the strut lengths.

    Reality Check 1 on page 67 uses the methods developed in this chapter to solve the forward kinematics of a planar version of the Stewart platform.

    Equation solving is one of the most basic problems in scientific computing. This chapter introduces a number of iterative methods for locating solutions x of the equation f(x) = 0. These methods are of great practical importance. In addition, they illustrate the central roles of convergence and complexity in scientific computing.

    Why is it necessary to know more than one method for solving equations? Often, the choice of method will depend on the cost of evaluating the function f and perhaps its derivative. If f(x) = e^x sin x, it may take less than one-millionth of a second to determine f(x), and its derivative is available if needed. If f(x) denotes the freezing temperature of an ethylene glycol solution under x atmospheres of pressure, each function evaluation may require considerable time in a well-equipped laboratory, and determining the derivative may be infeasible.

    In addition to introducing methods such as the Bisection Method, Fixed-Point Iteration, and Newton's Method, we will analyze their rates of convergence and discuss their computational complexity. Later, more sophisticated equation solvers are presented, including Brent's Method, which combines the best properties of several solvers.


    1.1 THE BISECTION METHOD

    How do you look up a name in an unfamiliar phone book? To look up Smith, you might begin by opening the book at your best guess, say, the letter Q. Next you may turn a sheaf of pages and end up at the letter U. Now you have bracketed the name Smith and need to hone in on it by using smaller and smaller brackets that eventually converge to the name. The Bisection Method represents this type of reasoning, done as efficiently as possible.

    1.1.1 Bracketing a root

    DEFINITION 1.1 The function f (x) has a root at x = r if f (r) =0.

    The first step to solving an equation is to verify that a root exists. One way to ensure this is to bracket the root: to find an interval [a,b] on the real line for which one of the pair {f(a), f(b)} is positive and the other is negative. This can be expressed as f(a)f(b) < 0. If f is a continuous function, then there will be a root: an r between a and b for which f(r) = 0. This fact is summarized in the following corollary of the Intermediate Value Theorem 0.4:

    THEOREM 1.2 Let f be a continuous function on [a,b], satisfying f(a)f(b) < 0. Then f has a root between a and b, that is, there exists a number r satisfying a < r < b and f(r) = 0.

    In Figure 1.1, f(0)f(1) = (−1)(1) < 0. There is a root just to the left of 0.7. How can we refine our first guess of the root's location to more decimal places?

    Figure 1.1 A plot of f(x) = x³ + x − 1. The function has a root between 0.6 and 0.7.
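    A figure like this one can be produced with a couple of Matlab commands; the lines below are a sketch:

    x = 0:0.01:1;                     % grid on the interval [0,1]
    plot(x, x.^3 + x - 1), grid on    % the curve crosses zero between 0.6 and 0.7
    xlabel('x'), ylabel('y')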

    We'll take a cue from the way our eye finds a solution when given a plot of a function. It is unlikely that we start at the left end of the interval and move to the right, stopping at the root. Perhaps a better model of what happens is that the eye first decides the general location, such as whether the root is toward the left or the right of the interval. It then follows that up by deciding more precisely just how far right or left the root lies and gradually improves its accuracy, just like looking up a name in the phone book. This general approach is made quite specific in the Bisection Method, shown in Figure 1.2.


    Figure 1.2 The Bisection Method. On the first step, the sign of f(c0) is checked. Since f(c0)f(b0) < 0, set a1 = c0, b1 = b0, and the interval is replaced by the right half [a1,b1]. On the second step, the subinterval is replaced by its left half [a2,b2].

    Bisection Method

    Given initial interval [a,b] such that f(a)f(b) < 0
    while (b − a)/2 > TOL
        c = (a + b)/2
        if f(c) = 0, stop, end
        if f(a)f(c) < 0
            b = c
        else
            a = c
        end
    end
    The final interval [a,b] contains a root.
    The approximate root is (a + b)/2.

    Check the value of the function at the midpoint c = (a + b)/2 of the interval. Since f(a) and f(b) have opposite signs, either f(c) = 0 (in which case we have found a root and are done), or the sign of f(c) is opposite the sign of either f(a) or f(b). If f(c)f(a) < 0, for example, we are assured a solution in the interval [a,c], whose length is half that of the original interval [a,b]. If instead f(c)f(b) < 0, we can say the same of the interval [c,b]. In either case, one step reduces the problem to finding a root on an interval of one-half the original size. This step can be repeated to locate the root more and more accurately.

    A solution is bracketed by the new interval at each step, reducing the uncertainty in the location of the solution as the interval becomes smaller. An entire plot of the function f is not needed. We have reduced the work of function evaluation to only what is necessary.

    EXAMPLE 1.1 Find a root of the function f(x) = x³ + x − 1 by using the Bisection Method on the interval [0,1].

    As noted, f(a0)f(b0) = (−1)(1) < 0, so a root exists in the interval. The interval midpoint is c0 = 1/2. The first step consists of evaluating f(1/2) = −3/8 < 0 and choosing the new interval [a1,b1] = [1/2,1], since f(1/2)f(1) < 0. The second step consists of


    evaluating f(c1) = f(3/4) = 11/64 > 0, leading to the new interval [a2,b2] = [1/2,3/4]. Continuing in this way yields the following intervals:

    i    ai       f(ai)   ci       f(ci)   bi       f(bi)
    0    0.0000   −       0.5000   −       1.0000   +
    1    0.5000   −       0.7500   +       1.0000   +
    2    0.5000   −       0.6250   −       0.7500   +
    3    0.6250   −       0.6875   +       0.7500   +
    4    0.6250   −       0.6562   −       0.6875   +
    5    0.6562   −       0.6719   −       0.6875   +
    6    0.6719   −       0.6797   −       0.6875   +
    7    0.6797   −       0.6836   +       0.6875   +
    8    0.6797   −       0.6816   −       0.6836   +
    9    0.6816   −       0.6826   +       0.6836   +

    We conclude from the table that the solution is bracketed between a9 ≈ 0.6816 and c9 ≈ 0.6826. The midpoint of that interval, c10 ≈ 0.6821, is our best guess for the root.

    Although the problem was to find a root, what we have actually found is an interval [0.6816, 0.6826] that contains a root; in other words, the root is r = 0.6821 ± 0.0005. We will have to be satisfied with an approximation. Of course, the approximation can be improved, if needed, by completing more steps of the Bisection Method.
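    The bracketing table above can be generated by a short Matlab loop; the following sketch uses our own variable names:

    f = @(x) x.^3 + x - 1;
    a = 0;  b = 1;
    for i = 0:9
        c = (a + b)/2;
        fprintf('%2d  %6.4f  %6.4f  %6.4f\n', i, a, c, b)
        if sign(f(c)) == sign(f(a))   % root lies in the right half [c,b]
            a = c;
        else                          % root lies in the left half [a,c]
            b = c;
        end
    end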

    At each step of the Bisection Method, we compute the midpoint ci = (ai + bi)/2 of the current interval [ai, bi], calculate f(ci), and compare signs. If f(ci)f(ai) < 0, we set ai+1 = ai and bi+1 = ci. If, instead, f(ci)f(ai) > 0, we set ai+1 = ci and bi+1 = bi. Each step requires one new evaluation of the function f and bisects the interval containing a root, reducing its length by a factor of 2. After n steps of calculating c and f(c), we have done n + 2 function evaluations, and our best estimate of the solution is the midpoint of the latest interval. The algorithm can be written in the following Matlab code:

    %Program 1.1 Bisection Method
    %Computes approximate solution of f(x)=0
    %Input: function handle f; a,b such that f(a)*f(b)<0,
    %       and tolerance tol
    %Output: Approximate solution xc
    function xc=bisect(f,a,b,tol)
    if sign(f(a))*sign(f(b)) >= 0
      error('f(a)f(b)<0 not satisfied!')   %ceases execution
    end
    fa=f(a); fb=f(b);
    while (b-a)/2>tol
      c=(a+b)/2;
      fc=f(c);
      if fc == 0                  %c is a solution, done
        break
      end
      if sign(fc)*sign(fa) < 0    %a and c make the new interval
        b=c; fb=fc;
      else                        %c and b make the new interval
        a=c; fa=fc;
      end
    end
    xc=(a+b)/2;                   %new midpoint is best estimate

    To use the program, first define the function to be solved by a function handle, for example

    >> f=@(x) x^3+x-1;

    This command actually defines a function handle f, which can be used as input for other Matlab functions.
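    With the function handle defined, the root of Example 1.1 can then be located to any desired tolerance, for example (a sketch using the bisect function as reconstructed above):

    xc = bisect(f,0,1,0.00005)   % returns a value within 0.00005 of the root 0.6823...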
