
Discontinuous Galerkin Method for a 1D Elliptic Equation in CUDA

G14SCD MSc Dissertation in Scientific Computation 2014/15
School of Mathematical Sciences, University of Nottingham

Ivonne Leonor Medina Lino. Supervisor: Professor Paul Houston

I have read and understood the School and University guidelines on plagiarism. I confirm that this work is my own, apart from the acknowledged references.

Contents

Abstract
Acknowledgements
Dedication
1 Introduction
2 Discontinuous Galerkin Finite Element Methods (DGFEMs) for elliptic problems
  2.1 Mathematical model of the DGFEMs
  2.2 Flux formulation
  2.3 Derivation of the primal formulation
    2.3.1 Jump and Average
  2.4 Primal formulation
3 The elliptic problem in 1D
  3.1 Mathematical model
    3.1.1 Local and global matrices
    3.1.2 Right Hand Side
  3.2 DGFEMs in CUDA
4 CUDA: The parallel computing language
  4.1 Background
  4.2 CUDA Architecture
  4.3 Programming model
    4.3.1 Memory hierarchy
  4.4 CUDA memory
  4.5 Data transfer
5 Results
  5.1 Methodology
  5.2 Numerical results of the serial curve
    5.2.1 Coefficients in serial and in parallel
  5.3 Benchmarking
  5.4 Error
  5.5 Discussion of the results
6 Conclusions
A Tutorial to install CUDA 7.5 on Ubuntu 15.04
B CUDA Code
C Matlab Code
D Error Code
References

Abstract

This work provides an analysis of the performance of the Discontinuous Galerkin Finite Element Method (DGFEMs) for a 1D elliptic problem in parallel using GPU technology. DGFEMs was chosen as the numerical method due to its stability and robustness, and mainly because the method is very suitable for parallelization thanks to its large degree of locality and its intensive arithmetic operations. It is well known that the time needed to run a code and obtain results increases as we use more elements in the DGFEMs. The aim of this work is to develop a CUDA code for an elliptic problem in 1D using DGFEMs in order to reduce this time, and to compare its performance with its serial counterpart. At the end of the work some benchmarks were performed to show that the parallel code is much faster than the serial code.

Acknowledgements

I would like to thank Professor Paul Houston, who supported me from the very beginning, when I started the MSc. He has been the most patient lecturer and always very approachable. It has been an honor and a great pleasure to be his pupil. I appreciate and thank him deeply for all his help. I also want to thank my dear family and friends, who have supported me and helped me to cope with all the emotional and economic difficulties during the whole process. Finally, I would like to thank my country and the Mexican government (through CONACYT), who trusted me and gave me the opportunity to study this Master's degree and to have a better future in my career and life.

Dedication

To my husband Benjamin, my son Eduardo Emanuel and my mom Maria del Carmen.


1 Introduction

Discontinuous Galerkin Finite Element Methods (DGFEMs) were introduced for the numerical solution of first-order hyperbolic problems in 1973 [23]. In more recent times the method has enjoyed considerable success due to its robustness, stability and local conservation properties, and because it is easy to implement and highly parallelizable [27], [17], [19]. Depending on the input parameters, we obtain several variants of DGFEMs: the Symmetric Interior Penalty Galerkin (SIPG) method, introduced in the late 1970s [3], [31]; the Global Element Method [10]; the Nonsymmetric Interior Penalty Galerkin (NIPG) method [25], [22]; the Incomplete Interior Penalty Galerkin (IIPG) method [9]; and the case of elliptic problems written as a first-order system [5]. DGFEMs for elliptic equations were proposed, with variants, by [12], [4], [31] and [2], and have been called interior penalty (IP) methods since that time. Examples of the method applied to elliptic and parabolic problems can also be seen in [4], [31] and [2], and works on the pure elliptic problem can be found in [8], [6] and [7]. In this work we will use the symmetric interior penalty (SIP) method, which is described in detail in the next section.

The main aim of this work is to study the performance of a code which solves a Discontinuous Galerkin method for an elliptic problem. The comparison was made between the serial code presented in [24] and its counterpart in parallel. CUDA, together with its libraries, was used as the language for the parallel implementation. The parallel code was developed by the author of this work, inspired by the serial code of [24], except for the Conjugate Gradient (CG) algorithm, which can be found in the Nvidia CUDA Toolkit 7.5 [1].

In Section 2, Discontinuous Galerkin Finite Element Methods (DGFEMs) for an elliptic problem are presented in general for a multidimensional problem, with the model taken from [3]. In Section 3 the particular problem of DGFEMs for the elliptic problem in 1D is studied in detail, taking the discretization from [24]. In Section 4, the parallel programming paradigm is studied and explained through the particular model of the CUDA language, which was developed for Nvidia GPU hardware. In Section 5, results of the serial and parallel codes are shown, as well as some benchmarking between them using different input parameters. Finally, Appendix A describes in detail a tutorial for installing the CUDA technology, Appendix B presents the parallel CUDA code, and Appendix C presents the serial code with some modifications from the original.

One of the most important achievements of this work is the implementation of the model in parallel using GPU technology and the verification that this code performs more efficiently than its serial counterpart.

2 Discontinuous Galerkin Finite Element Methods (DGFEMs) for elliptic problems

Discontinuous Galerkin Finite Element Methods (DGFEMs) combine features of the finite volume method and the finite element method, and can be applied to solve hyperbolic, elliptic and parabolic problems.

2.1 Mathematical model of the DGFEMs

The model problem of DGFEMs is introduced [3] as

\[
-\Delta u = f \quad \text{in } \Omega, \qquad u = 0 \quad \text{on } \partial\Omega, \tag{2.1}
\]

where $f$ is a given function in $L^2(\Omega)$ and $\Omega \subseteq \mathbb{R}^n$, $n > 1$, is a polygonal domain. The problem can be rewritten as [3]

\[
\phi = \nabla u, \qquad -\nabla\cdot\phi = f \quad \text{in } \Omega, \qquad u = 0 \quad \text{on } \partial\Omega. \tag{2.2}
\]

2.2 Flux formulation

As is usually done for the finite element method [16], we multiply (2.2) by test functions $\tau$ and $v$, respectively, and integrate over a subset $K$ of $\Omega$. Then we have [3]

\[
\int_K \phi\cdot\tau\,dx = -\int_K u\,\nabla\cdot\tau\,dx + \int_{\partial K} u\, n_K\cdot\tau\,ds, \tag{2.3}
\]
\[
\int_K \phi\cdot\nabla v\,dx = \int_K f v\,dx + \int_{\partial K} \phi\cdot n_K\, v\,ds, \tag{2.4}
\]

where $n_K$ is the outward unit normal vector to $\partial K$. Assume that the elements $K$ of the triangulation $\mathcal{T}_h = \{K\}$ of the domain $\Omega$ are shape-regular, and let the finite element spaces associated with this triangulation be [3]

\[
V_h := \{v \in L^2(\Omega) : v|_K \in P(K)\ \forall K \in \mathcal{T}_h\}, \qquad
\Sigma_h := \{\tau \in [L^2(\Omega)]^n : \tau|_K \in \Sigma(K)\ \forall K \in \mathcal{T}_h\},
\]

where the space of polynomial functions is given by $P(K) = \mathcal{P}_p(K)$, of degree at most $p \ge 1$ on $K$, and $\Sigma(K) = [\mathcal{P}_p(K)]^n$. Let us consider the general formulation given by [8]: find $u_h \in V_h$ and $\phi_h \in \Sigma_h$ such that for all $K \in \mathcal{T}_h$ we have

\[
\int_K \phi_h\cdot\tau\,dx = -\int_K u_h\,\nabla\cdot\tau\,dx + \int_{\partial K} \bar{u}_K\, n_K\cdot\tau\,ds, \tag{2.5}
\]
\[
\int_K \phi_h\cdot\nabla v\,dx = \int_K f v\,dx + \int_{\partial K} \bar{\phi}_K\cdot n_K\, v\,ds, \tag{2.6}
\]

for all $\tau \in \Sigma(K)$ and all $v \in P(K)$, where $\bar{\phi}_K$ and $\bar{u}_K$ are the numerical approximations of $\phi$ and $u$, called "numerical fluxes", on the boundary of $K$. This formulation is called the "flux formulation", as the DGFEMs must be expressed in terms of the numerical fluxes $\bar{\phi}_K$ and $\bar{u}_K$.

2.3 Derivation of the primal formulation

Now, following [3], we denote by $H^l(\mathcal{T}_h)$ the space of functions on $\Omega$ whose restriction to each element $K$ belongs to the Sobolev space $H^l(K)$. Let $\Gamma$ denote the union of the boundaries of the elements $K$ of $\mathcal{T}_h$. Functions in $T(\Gamma) := \prod_{K\in\mathcal{T}_h} L^2(\partial K)$ are evaluated twice on $\Gamma^0 := \Gamma \setminus \partial\Omega$ and once on $\partial\Omega$. Therefore, $L^2(\Gamma)$ is identified with the subspace of $T(\Gamma)$ consisting of the functions whose values coincide on all interior edges. The scalar numerical flux $\bar{u} = (\bar{u}_K)_{K\in\mathcal{T}_h}$ and the vector numerical flux $\bar{\phi} = (\bar{\phi}_K)_{K\in\mathcal{T}_h}$ are both linear functions given by

\[
\bar{u} : H^1(\mathcal{T}_h) \to T(\Gamma), \qquad
\bar{\phi} : H^2(\mathcal{T}_h) \times [H^1(\mathcal{T}_h)]^2 \to [T(\Gamma)]^2. \tag{2.7}
\]

2.3.1 Jump and Average

Let $e$ be an interior edge shared by elements $K_1$ and $K_2$, and let $n_1$ and $n_2$ be the unit normal vectors on $e$ pointing outward from $K_1$ and $K_2$, respectively. We define the average $\{v\}$ as [2]

\[
\{v\} = \tfrac{1}{2}(v_1 + v_2), \tag{2.8}
\]

and the jump $[v]$ as

\[
[v] = v_1 n_1 + v_2 n_2 \quad \text{on } e \in E_h^0, \tag{2.9}
\]

where $E_h^0$ is the set of interior edges $e$. For $\nu \in [T(\Gamma)]^2$, with $\nu_1$, $\nu_2$ defined analogously, we have

\[
\{\nu\} = \tfrac{1}{2}(\nu_1 + \nu_2), \tag{2.10}
\]
\[
[\nu] = \nu_1\cdot n_1 + \nu_2\cdot n_2 \quad \text{on } e \in E_h^0. \tag{2.11}
\]

2.4 Primal formulation

Again, following [3], summing equations (2.5) and (2.6) over all the elements, using the average and jump operators, integrating by parts and applying suitable identities, we obtain the following result (for details see [3]):

\[
B_h(u_h, v) = \int_\Omega f v\,dx \qquad \forall v \in V_h, \tag{2.12}
\]
\[
\begin{aligned}
B_h(u_h, v) :=\ & \int_\Omega \nabla_h u_h \cdot \nabla_h v\,dx
+ \int_\Gamma \big([\bar{u} - u_h]\cdot\{\nabla_h v\} - \{\bar{\phi}\}\cdot[v]\big)\,ds \\
&+ \int_{\Gamma^0} \big(\{\bar{u} - u_h\}[\nabla_h v] - [\bar{\phi}]\{v\}\big)\,ds
\qquad \forall u_h, v \in H^2(\mathcal{T}_h), \tag{2.13}
\end{aligned}
\]

where $q \in T(\Gamma)$ and $\varphi \in [T(\Gamma)]^2$, and the lifting operators $r : [L^2(\Gamma)]^2 \to \Sigma_h$ and $l : L^2(\Gamma^0) \to \Sigma_h$ are given by

\[
\int_\Omega r(\varphi)\cdot\tau\,dx = -\int_\Gamma \varphi\cdot\{\tau\}\,ds, \tag{2.14}
\]
\[
\int_\Omega l(q)\cdot\tau\,dx = -\int_{\Gamma^0} q\,[\tau]\,ds \qquad \forall \tau \in \Sigma_h. \tag{2.15}
\]

The form $B_h$ is bilinear, and the proof of (2.13) can be seen in [3]. Equation (2.12) is called the "primal formulation" and (2.13) the "primal form". Recalling the definition of $\phi_h$ given in Section 2.2, we have

\[
\phi_h = \phi_h(u_h) := \nabla_h u_h - r([\bar{u}(u_h) - u_h]) - l(\{\bar{u}(u_h) - u_h\}). \tag{2.16}
\]

All these general concepts can be applied to the one dimensional case as we will see in the next section.


3 The elliptic problem in 1D

In this section we will use and apply the concepts of the "primal formulation", defined and deduced in Section 2, to the one-dimensional case, using the deduction in [24].

3.1 Mathematical model

In this part of the text we aim to deduce the discretization of the elliptic problem in 1D and obtain the corresponding linear algebra problem, similarly to what we did in general in Section 2.1. This deduction can be seen in more detail in [24]. Let us start with some definitions [16]:

Definition 1. Let $\Omega \subset \mathbb{R}^n$, $n \ge 1$, be an open bounded set. We write $C(\Omega)$ to denote the set of all continuous (real-valued) functions defined on $\Omega$. Let $\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{N}^n$. The length of the multi-index is defined as $|\alpha| = \alpha_1 + \dots + \alpha_n$.

Definition 2.
\[
D^\alpha = \left(\frac{\partial}{\partial x_1}\right)^{\alpha_1} \cdots \left(\frac{\partial}{\partial x_n}\right)^{\alpha_n}
= \frac{\partial^{|\alpha|}}{\partial x_1^{\alpha_1} \cdots \partial x_n^{\alpha_n}}.
\]

Definition 3. Let $\Omega \subset \mathbb{R}^n$, $n \ge 1$, be an open bounded set; then for $m \ge 0$, we write
\[
C^m(\Omega) = \{v \in C(\Omega) : D^\alpha v \in C(\Omega)\ \ \forall |\alpha| \le m\}.
\]

Now, we consider the boundary value problem in the interval $(0,1)$ [24]:

\[
-(a(x)u'(x))' = g(x) \quad \forall x \in (0,1), \tag{3.1}
\]
\[
u(0) = 1, \tag{3.2}
\]
\[
u(1) = 0, \tag{3.3}
\]

where $a \in C^1(0,1)$ and $g \in C^0(0,1)$, using Definition 3. It is assumed that there are two constants $a_0$ and $a_1$ such that

\[
0 < a_0 \le a(x) \le a_1 \quad \forall x \in (0,1). \tag{3.4}
\]

$u$ is said to be the classical solution of (3.1)-(3.3) if $u \in C^2(0,1)$ and $u$ satisfies equations (3.1)-(3.3) pointwise [24]. Let $E_h$ be a partition of $(0,1)$ given by $0 = x_0 < x_1 < \dots < x_N = 1$, and let $I_n = (x_n, x_{n+1})$, $n = 0, 1, \dots, N-1$. With this definition we have

\[
h_n = x_{n+1} - x_n, \qquad h_{n-1,n} = \max(h_{n-1}, h_n), \qquad h = \max_{0\le n\le N-1} h_n.
\]

The space of piecewise discontinuous polynomials of degree $k$ is denoted by $D_k(E_h)$:

\[
D_k(E_h) = \{v : v|_{I_n} \in \mathbb{P}_k(I_n), \ \forall n = 0, \dots, N-1\}, \tag{3.5}
\]

where $\mathbb{P}_k(I_n)$ is the space of polynomials of degree $k$ on the interval $I_n$. We can define the one-sided limits

\[
v(x_n^+) = \lim_{\substack{\varepsilon\to 0 \\ \varepsilon>0}} v(x_n + \varepsilon), \tag{3.6}
\]
\[
v(x_n^-) = \lim_{\substack{\varepsilon\to 0 \\ \varepsilon>0}} v(x_n - \varepsilon), \tag{3.7}
\]

and, just as we did in Section 2.3.1, we define here the jump of $v$ at the endpoints of $I_n$ as [24]

\[
[v(x_n)] = v(x_n^-) - v(x_n^+),
\]

and the average of $v$ at the endpoints of $I_n$ as

\[
\{v(x_n)\} = \tfrac{1}{2}\big(v(x_n^-) + v(x_n^+)\big), \qquad \forall n = 1, \dots, N-1.
\]

These definitions are extended to the end points as

\[
[v(x_0)] = v(x_0^+), \qquad \{v(x_0)\} = v(x_0^+), \qquad
[v(x_N)] = v(x_N^-), \qquad \{v(x_N)\} = v(x_N^-).
\]

Given the definitions above, the penalty term is defined as [24]

\[
J_0(v, w) = \sum_{n=0}^{N} \frac{\alpha_0}{h_{n-1,n}}\, [v(x_n)][w(x_n)], \tag{3.8}
\]

where $\alpha_0$ and $\alpha_1$ are real nonnegative numbers. At this point we apply the same integration-by-parts process as for the finite element method to (3.1), which means multiplying by $v \in D_k(E_h)$ and integrating over each interval $I_n$:

\[
\int_{x_n}^{x_{n+1}} a(x)u'(x)v'(x)\,dx - a(x_{n+1})u'(x_{n+1})v(x_{n+1}^-) + a(x_n)u'(x_n)v(x_n^+)
= \int_{x_n}^{x_{n+1}} f(x)v(x)\,dx, \qquad n = 0,\dots,N-1. \tag{3.9}
\]

From this we can derive

\[
[a(x_n)u'(x_n)v(x_n)] = \{a(x_n)u'(x_n)\}[v(x_n)] + \{v(x_n)\}[a(x_n)u'(x_n)]. \tag{3.10}
\]

Then we have

\[
\begin{aligned}
&\sum_{n=0}^{N-1}\int_{x_n}^{x_{n+1}} a(x)u'(x)v'(x)\,dx
- \sum_{n=0}^{N}\{a(x_n)u'(x_n)\}[v(x_n)]
+ \epsilon\sum_{n=0}^{N}\{a(x_n)v'(x_n)\}[u(x_n)] \\
&\qquad= \int_0^1 f(x)v(x)\,dx - a(x_0)v'(x_0)u(x_0) + a(x_N)v'(x_N)u(x_N)
= \int_0^1 f(x)v(x)\,dx - a(x_0)v'(x_0). \tag{3.11}
\end{aligned}
\]

For this work $\epsilon$ will be restricted to the case $\epsilon = -1$, for which the DGFEMs bilinear form is symmetric, i.e.

\[
a_{-1}(v, w) = a_{-1}(w, v) \qquad \forall v, w, \tag{3.12}
\]
\[
a_{-1}(v, v) = \sum_{n=0}^{N-1}\int_{x_n}^{x_{n+1}} a(x)\,(v'(x))^2\,dx
- 2\sum_{n=0}^{N}\{a(x_n)v'(x_n)\}[v(x_n)] + J_0(v, v). \tag{3.13}
\]

Finally, the statement of the DGFEMs method is given by [24]: find $u_h^{DG} \in D_k(E_h)$ such that

\[
a_\epsilon(u_h, v) = L(v) \qquad \forall v \in D_k(E_h), \tag{3.14}
\]

where $L : D_k(E_h) \to \mathbb{R}$ is given by

\[
L(v) = \int_0^1 g(x)v(x)\,dx - a(x_0)v'(x_0) + \frac{\alpha_0}{h_{0,1}}\, v(x_0). \tag{3.15}
\]

Piecewise quadratic polynomials are used as the monomial basis functions, given by

\[
\phi_0^n(x) = 1, \tag{3.16}
\]
\[
\phi_1^n(x) = 2\,\frac{x - x_{n+1/2}}{x_{n+1} - x_n}, \tag{3.17}
\]
\[
\phi_2^n(x) = 4\,\frac{(x - x_{n+1/2})^2}{(x_{n+1} - x_n)^2}, \tag{3.18}
\]

with the midpoint of the interval given by $x_{n+1/2} = \frac{1}{2}(x_n + x_{n+1})$. For a uniform mesh we also have

\[
x_n = x_0 + nh, \qquad h = \frac{1}{N}, \tag{3.19}
\]

so that

\[
\phi_0^n(x) = 1, \tag{3.20}
\]
\[
\phi_1^n(x) = \frac{2}{h}\,\big(x - (n + 1/2)h\big), \tag{3.21}
\]
\[
\phi_2^n(x) = \frac{4}{h^2}\,\big(x - (n + 1/2)h\big)^2, \tag{3.22}
\]
\[
(\phi_0^n)'(x) = 0, \tag{3.23}
\]
\[
(\phi_1^n)'(x) = \frac{2}{h}, \tag{3.24}
\]
\[
(\phi_2^n)'(x) = \frac{8}{h^2}\,\big(x - (n + 1/2)h\big). \tag{3.25}
\]

Then the DGFEMs solution is given by

\[
u_h(x) = \sum_{m=0}^{N-1}\sum_{j=0}^{2} U_j^m\, \phi_j^m(x). \tag{3.26}
\]

Finally, a linear system is obtained:

\[
AU = b, \tag{3.27}
\]

where $U$ is the vector of the unknown real numbers $U_j^m$ to be solved for, $b$ is the vector with components $L(\phi_i^n)$ and $A$ is the matrix with entries $a_\epsilon(\phi_j^m, \phi_i^n)$.

3.1.1 Local and global matrices

We can also write the local contribution to the linear system as $A_n U^n$, with [24]

\[
U^n = \begin{pmatrix} u_0^n \\ u_1^n \\ u_2^n \end{pmatrix}, \qquad
(A_n)_{ij} = \int_{I_n} (\phi_i^n)'(x)\,(\phi_j^n)'(x)\,dx. \tag{3.28}
\]

Computing the corresponding coefficients of $A_n$ we get

\[
A_n = \frac{1}{h}\begin{pmatrix} 0 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & \tfrac{16}{3} \end{pmatrix}.
\]

We can also compute the contributions at the interior nodes $x_n$. Expanding the jump and average terms we get [24]

\[
\begin{aligned}
&-\{(P^{DG})'(x_n)\}[v(x_n)] + \epsilon\,\{v'(x_n)\}[P^{DG}(x_n)] + \frac{\alpha_0}{h}\,[P^{DG}(x_n)][v(x_n)] \\
&\quad= \Big(\tfrac{1}{2}(P^{DG})'(x_n^+)\,v(x_n^+) - \tfrac{\epsilon}{2}\,v'(x_n^+)\,P^{DG}(x_n^+) + \tfrac{\alpha_0}{h}\,P^{DG}(x_n^+)\,v(x_n^+)\Big) \\
&\quad+ \Big(-\tfrac{1}{2}(P^{DG})'(x_n^-)\,v(x_n^-) + \tfrac{\epsilon}{2}\,v'(x_n^-)\,P^{DG}(x_n^-) + \tfrac{\alpha_0}{h}\,P^{DG}(x_n^-)\,v(x_n^-)\Big) \\
&\quad+ \Big(-\tfrac{1}{2}(P^{DG})'(x_n^+)\,v(x_n^-) - \tfrac{\epsilon}{2}\,v'(x_n^-)\,P^{DG}(x_n^+) - \tfrac{\alpha_0}{h}\,P^{DG}(x_n^+)\,v(x_n^-)\Big) \\
&\quad+ \Big(\tfrac{1}{2}(P^{DG})'(x_n^-)\,v(x_n^+) + \tfrac{\epsilon}{2}\,v'(x_n^+)\,P^{DG}(x_n^-) - \tfrac{\alpha_0}{h}\,P^{DG}(x_n^-)\,v(x_n^+)\Big).
\end{aligned}
\]

Using definitions 3.16 to 3.25 it is possible to compute the local matrices for the interior nodes, given by

\[
B_n = \frac{1}{h}\begin{pmatrix}
\alpha_0 & 1-\alpha_0 & -2+\alpha_0 \\
-\epsilon-\alpha_0 & -1+\epsilon+\alpha_0 & 2-\epsilon-\alpha_0 \\
2\epsilon+\alpha_0 & 1-2\epsilon-\alpha_0 & -2+2\epsilon+\alpha_0
\end{pmatrix},
\qquad
C_n = \frac{1}{h}\begin{pmatrix}
\alpha_0 & -1+\alpha_0 & -2+\alpha_0 \\
\epsilon+\alpha_0 & -1+\epsilon+\alpha_0 & -2+\epsilon+\alpha_0 \\
2\epsilon+\alpha_0 & -1+2\epsilon+\alpha_0 & -2+2\epsilon+\alpha_0
\end{pmatrix},
\]
\[
D_n = \frac{1}{h}\begin{pmatrix}
-\alpha_0 & -1+\alpha_0 & 2-\alpha_0 \\
-\epsilon-\alpha_0 & -1+\epsilon+\alpha_0 & 2-\epsilon-\alpha_0 \\
-2\epsilon-\alpha_0 & -1+2\epsilon+\alpha_0 & 2-2\epsilon-\alpha_0
\end{pmatrix},
\qquad
E_n = \frac{1}{h}\begin{pmatrix}
-\alpha_0 & 1-\alpha_0 & 2-\alpha_0 \\
\epsilon+\alpha_0 & -1+\epsilon+\alpha_0 & -2+\epsilon+\alpha_0 \\
-2\epsilon-\alpha_0 & 1-2\epsilon-\alpha_0 & 2-2\epsilon-\alpha_0
\end{pmatrix}.
\]

We also need the matrices for the boundary nodes:

\[
F_0 = \frac{1}{h}\begin{pmatrix}
\alpha_0 & 2-\alpha_0 & -4+\alpha_0 \\
-2\epsilon-\alpha_0 & -2+2\epsilon+\alpha_0 & 4-2\epsilon-\alpha_0 \\
4\epsilon+\alpha_0 & 2-4\epsilon-\alpha_0 & -4+4\epsilon+\alpha_0
\end{pmatrix},
\qquad
F_N = \frac{1}{h}\begin{pmatrix}
\alpha_0 & -2+\alpha_0 & -4+\alpha_0 \\
2\epsilon+\alpha_0 & -2+2\epsilon+\alpha_0 & -4+2\epsilon+\alpha_0 \\
4\epsilon+\alpha_0 & -2+4\epsilon+\alpha_0 & -4+4\epsilon+\alpha_0
\end{pmatrix}.
\]

We define the following matrices to assemble the global matrix [24]:

\[
T = A_n + B_n + C_{n+1}, \tag{3.29}
\]
\[
T_0 = A_0 + F_0 + C_1, \tag{3.30}
\]
\[
T_N = A_{N-1} + B_{N-1} + F_N. \tag{3.31}
\]

Then the global matrix is given by

\[
A = \begin{pmatrix}
T_0 & D_1 & & & \\
E_1 & T & D_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & E_{N-2} & T & D_{N-1} \\
 & & & E_{N-1} & T_N
\end{pmatrix}.
\]
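To make the block structure above concrete, the following C sketch assembles the dense global matrix from the 3x3 local blocks on a uniform mesh (where all interior D and E blocks coincide). It is an illustrative sketch only; the names put_block, assemble_global, MZ and NEL are assumptions and are not taken from the dissertation's code, whose actual assembly routines are listed in Appendices B and C.

    /* Sketch: assembly of the block-tridiagonal global matrix from the 3x3
     * local blocks T0, T, TN (diagonal), D (super-diagonal) and E (sub-diagonal).
     * Uniform mesh assumed, so one D block and one E block suffice. */
    #include <stdio.h>
    #include <string.h>

    #define MZ  3             /* quadratic basis: 3 unknowns per element */
    #define NEL 10            /* number of elements (example value)      */
    #define GLODIM (MZ*NEL)   /* dimension of the global matrix          */

    /* Add a 3x3 block into the dense global matrix at block position (bi, bj). */
    static void put_block(float Aglob[GLODIM][GLODIM],
                          const float blk[MZ][MZ], int bi, int bj)
    {
        for (int i = 0; i < MZ; ++i)
            for (int j = 0; j < MZ; ++j)
                Aglob[MZ*bi + i][MZ*bj + j] += blk[i][j];
    }

    /* T0, T, TN, D, E are assumed precomputed as in (3.29)-(3.31). */
    void assemble_global(float Aglob[GLODIM][GLODIM],
                         const float T0[MZ][MZ], const float T[MZ][MZ],
                         const float TN[MZ][MZ], const float D[MZ][MZ],
                         const float E[MZ][MZ])
    {
        memset(Aglob, 0, sizeof(float) * GLODIM * GLODIM);
        for (int n = 0; n < NEL; ++n) {
            /* diagonal block: T0 for the first element, TN for the last, T otherwise */
            put_block(Aglob, (n == 0) ? T0 : (n == NEL - 1) ? TN : T, n, n);
            if (n < NEL - 1) put_block(Aglob, D, n, n + 1);  /* coupling with element n+1 */
            if (n > 0)       put_block(Aglob, E, n, n - 1);  /* coupling with element n-1 */
        }
    }

    int main(void)
    {
        static float Aglob[GLODIM][GLODIM];
        float T0[MZ][MZ] = {{0}}, T[MZ][MZ] = {{0}}, TN[MZ][MZ] = {{0}};
        float D[MZ][MZ]  = {{0}}, E[MZ][MZ]  = {{0}};
        /* ... fill the local blocks from A_n, B_n, C_n, D_n, E_n, F_0, F_N as above ... */
        assemble_global(Aglob, T0, T, TN, D, E);
        printf("assembled a %d x %d matrix\n", GLODIM, GLODIM);
        return 0;
    }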

3.1.2 Right Hand Side

The expansion of the right hand side vector is given by [24]

\[
L(\phi_i^n) = \int_0^1 f(x)\,\phi_i^n(x)\,dx - a(x_0)\,(\phi_i^n)'(x_0) + \frac{\alpha_0}{h}\,\phi_i^n(x_0). \tag{3.32}
\]

Using the definition of the local basis functions $\phi_i^n$ and making a change of variable, we obtain

\[
\int_0^1 g(x)\,\phi_i^n(x)\,dx = \frac{h}{2}\int_{-1}^{1} g\!\left(\frac{h}{2}\,t + (n+1/2)h\right) t^i \,dt. \tag{3.33}
\]

The last integral is approximated using a Gauss quadrature rule; if the integrand is a polynomial of degree $2Q_G - 1$ the Gauss quadrature is exact. We obtain

\[
\int_0^1 f(x)\,\phi_i^n(x)\,dx \approx \frac{h}{2}\sum_{j=1}^{Q_G} w_j\, f\!\left(\frac{h}{2}\,s_j + (n+1/2)h\right) s_j^i, \tag{3.34}
\]

and finally the vector $b$ is given by [24]

\[
b = \big(b_0^0, b_1^0, b_2^0,\ b_0^1, b_1^1, b_2^1,\ b_0^2, b_1^2, b_2^2,\ \dots,\ b_0^{N-1}, b_1^{N-1}, b_2^{N-1}\big). \tag{3.35}
\]

3.2 DGFEMs in CUDA

With the results obtained above we can write a computer program to calculate the numerical solution of the elliptic equation. At the end of the program it is necessary to solve the Linear Algebra Problem (LAP) Ax = b, with A the global matrix, x the coefficients of the solution and b the right hand side. As we know from Computational Linear Algebra (CLA) [11], [28], [14], we can choose among different solvers to get faster results; this is important when we have a large amount of data, i.e. when we want to calculate the solution for a large number of elements. In this case we have to choose a fast and parallelizable solver, and one solver which is highly parallelizable is the Conjugate Gradient algorithm (CG) [15]. Also from CLA [11], [28], [14] we know that the part of the code which needs the most floating-point operations (FLOPs) is the solution of the LAP. Therefore in this work we will use the parallel paradigm, via CUDA, to compute the LAP together with the CG. In the next chapter we will explain what the CUDA language is and how to use it.
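Before turning to CUDA itself, the following plain-C sketch makes the structure of one unpreconditioned CG solve explicit: every operation inside the loop is a matrix-vector product, a dot product or an axpy update, each of which is independent over the vector entries and therefore maps naturally onto the GPU. This is an illustrative sketch under the assumption of a dense matrix; it is not the solver used in this work, which is the cuBLAS/cuSPARSE-based CG listed in Appendix B (the library routines named in the comments are the GPU counterparts of each step).

    /* Minimal unpreconditioned Conjugate Gradient sketch (dense matrix, plain C). */
    #include <math.h>
    #include <stdlib.h>

    static void matvec(int n, const float *A, const float *x, float *y) {
        for (int i = 0; i < n; ++i) {                 /* y = A*x */
            float s = 0.0f;
            for (int j = 0; j < n; ++j) s += A[i*n + j] * x[j];
            y[i] = s;
        }
    }
    static float dot(int n, const float *a, const float *b) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) s += a[i] * b[i];
        return s;
    }

    /* Solves A x = b; returns the number of iterations used. */
    int cg_solve(int n, const float *A, const float *b, float *x,
                 float tol, int max_iter)
    {
        float *r = malloc(n * sizeof *r), *p = malloc(n * sizeof *p),
              *Ap = malloc(n * sizeof *Ap);
        for (int i = 0; i < n; ++i) { x[i] = 0.0f; r[i] = b[i]; p[i] = b[i]; }
        float rr = dot(n, r, r);
        int k = 0;
        while (k < max_iter && sqrtf(rr) > tol) {
            matvec(n, A, p, Ap);                          /* sparse mat-vec (cuSPARSE)  */
            float alpha = rr / dot(n, p, Ap);             /* dot product (cublasSdot)   */
            for (int i = 0; i < n; ++i) x[i] += alpha * p[i];   /* axpy (cublasSaxpy)   */
            for (int i = 0; i < n; ++i) r[i] -= alpha * Ap[i];  /* axpy (cublasSaxpy)   */
            float rr_new = dot(n, r, r);
            float beta = rr_new / rr;
            for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];  /* scal + axpy      */
            rr = rr_new;
            ++k;
        }
        free(r); free(p); free(Ap);
        return k;
    }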

4 CUDA: The parallel computing language

The need for large-scale computing is well known in all areas of science, and especially in Scientific Computation, where almost all problems involve a Partial Differential Equation (PDE) that must be solved. The PDE leads to a linear algebra problem, which means solving Ax = b with matrices that sometimes have hundreds, thousands or millions of entries per matrix per time step. It is this area on which this work focuses.

4.1 Background

Less than 10 years ago (November 2006), Nvidia opened its language CUDA (Compute Unified Device Architecture) to the general community. The language is a general-purpose parallel programming model and computing platform used on Nvidia Graphics Processing Units (GPUs), which are highly parallel, many-core, multithreaded processors. These GPUs have enormous computational horsepower and great memory bandwidth in comparison with the classical CPU, as shown in Figure 1. To explain this great difference in performance, we have to explain the differences between the architecture of the CPU and the GPU. The main difference between the two is the way they are built. As Figure 2 shows, the GPU has many more transistors devoted to data processing than the CPU, which dedicates more of them to data caching and flow control.


Figure 1: Floating-point operations per second for the CPU and GPU (taken from [1]).

Figure 2: The GPU devotes more transistors to data processing (taken from [1]).

4.2 CUDA Architecture

To understand and learn the CUDA language it is important to have a clear knowledge of the hardware, because the software and the hardware have a very close relationship, and the way the code is written has a direct connection with the architecture of the GPU. A GPU has a certain number of arithmetic logic units which are divided into smaller groups called grids; these grids are divided into blocks, which are finally subdivided into threads. This idea is shown schematically in Figure 3. Every thread is able to run a function, which in CUDA programming is called a "kernel". After declaring a kernel, it is necessary to specify the number of threads per block and the number of blocks per grid within the "chevron" syntax <<< ... >>>; these arguments can be of type int or dim3.

Figure 3: Memory hierarchy (taken from [1]).


4.3 Programming model

The CUDA programming model, as we have described above, relies on the architecture of the GPU, which mainly consists of a processor chip with a parallel system. At the bottom of Figure 4 we can see a pair of physical devices, also called graphics cards. The graphics card is managed by the driver Application Programming Interface (API), which can be found on the Nvidia web page. As in many languages, CUDA has libraries and APIs to help the coding process; NVIDIA also provides the runtime API and the libraries, and the user or developer can install them (see Appendix A). These libraries are very similar to the libraries of other languages. CUDA itself was built on C, and in general the CUDA language and its statements are very similar to C. This gives CUDA a great advantage for C and C++ programmers, who can learn CUDA very quickly.

Figure 4: Components of the 'device' or software stack (taken from [13]).

In Figure 5 we can see the general structure of the device, which is organized in cores connected to the shared memory, and this in turn to the texture memory. This physical organization has a very close relationship with the structure of the language; Figure 5 is also called the software stack. The counterpart of the device is the CPU or "host", which runs the classical serial code. CUDA also has support for other languages such as Fortran and Python, as can be seen in [26]. We can mention that at the beginning of this dissertation we tried to work with CUDA Fortran, but it does not have much support for this kind of calculation and its bindings do not have enough documentation, so we decided to switch to CUDA C.

Figure 5: Schematic representation of the device (taken from [13]).


4.3.1 Memory hierarchy

There are three principal concepts to be learned about the CUDA programming model: the grouping of threads and its hierarchy, the concept of shared memory, and finally synchronization. All these concepts have a simple representation in the syntax of the C language. We can say that the aim behind the idea of parallelization is to find out which parts of the algorithm involve information that is independent of the rest of the calculations; in the code this is reflected in loops whose iterations are independent of one another. Every independent calculation can be done by a thread. The number of threads that can currently be used is 1024 per block, but there can be millions of blocks containing threads within the grids; Figure 6 shows how threads, blocks and grids are organized within the device. This architecture is easily scalable, which allows the GPU architecture to increase the number of processors available for calculations in future models of dedicated hardware. The model itself guides the programmer to partition the problem into blocks of threads, and then into finer pieces that can share information and cooperate within the same block. Each block of threads can do calculations sequentially or concurrently on any of the multiprocessors available in the GPU, and only the multiprocessor count is known by the runtime system.


Figure 6: Grid of thread blocks (taken from [1]).

A very important concept that is defined in the actual code is the "kernel", which is equivalent to a "function" in C but with the important characteristic that kernels are executed in parallel, as many times as the number of threads defined for that specific kernel, all of them at the same time. The syntax for a kernel begins with the keyword __global__, and the chevron syntax contains the number of threads and blocks used for that kernel. Each thread in the kernel is identified by a unique thread id given by the built-in threadIdx variable. Source Code 1 shows an explicit example of the kernel syntax:


Source Code 1: Addition of two vectors launching one single block of N threads (Taken from [1])

 1  // Kernel definition
 2  __global__ void VecAdd(float* a, float* b, float* c)
 3  {
 4      int i = threadIdx.x;
 5      c[i] = a[i] + b[i];
 6  }
 7  int main()
 8  {
 9      ...
10      // Kernel invocation with N threads
11      VecAdd<<<1, N>>>(a, b, c);
12      ...
13  }

In this example we can see how the kernel is declared (line 2) with the keyword __global__, with input parameters the vector pointers a and b, and output parameter c. This kernel calculates a simple sum of vectors. Note that if we did this in a serial code we would use a do or for loop with i as a dummy variable. In this CUDA code we still use the dummy variable i (line 4), but instead of a loop we use the built-in threadIdx.x variable, which means that all the i operations are done in parallel, i.e. at the same time. In the function main the kernel is called (line 11) like a normal function, but including the chevron syntax <<<1, N>>> with the number of blocks (1) and threads per block (N) inside the brackets. Figure 7 (taken from [18]) illustrates this kernel: every entry of vector a is added to the corresponding entry of vector b. It is important to note that every sum is performed by a different thread, and we usually have enough threads available for every entry of the vector (the maximum number of threads that can be launched in a CUDA kernel is of order 10^17), so we can use them to perform all the additions at the same time.

In Source Code 2 (taken from [1]) we can see the addition of two matrices A and B of size NxN, with the result stored in matrix C. The maximum number of threads per block that can be used is currently 1024 but, depending on the architecture of the GPU, the number of blocks can be millions.

Figure 7: Vector summation (taken from [18]).


Figure 8: A 2D hierarchy of blocks and threads (taken from [18]).

Blocks and threads can be organized in two or three dimensions, as shown schematically in Figure 8. The dimensions of the blocks can be specified with the CUDA types int or dim3 and are accessible through the built-in variable blockIdx. The dimensions of the threads can also be specified with int or dim3 and are accessible through the built-in variable threadIdx.


Source Code 2: Addition of two matrices (Taken from [1])

 1  // Kernel definition
 2  __global__ void MatAdd(float A[N][N], float B[N][N],
 3                         float C[N][N])
 4  {
 5      int i = threadIdx.x;
 6      int j = threadIdx.y;
 7      C[i][j] = A[i][j] + B[i][j];
 8  }
 9  int main()
10  {
11      ...
12      // Kernel invocation with one block of N * N * 1 threads
13      int numBlocks = 1;
14      dim3 threadsPerBlock(N, N);
15      MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
16      ...
17  }

In the same way as in Source Code 1 above, in Source Code 2 we define the kernel (line 2), but now with two dummy variables: one for the first index of the matrix and the other for the second. Here every pairwise addition is performed by one thread of the kernel MatAdd. The two variables are built-in as before; one uses .x and the other .y (lines 5 and 6) to distinguish between them. We call the kernel at line 15, having previously defined the int variable numBlocks (line 13) and the CUDA dim3 variable threadsPerBlock(N, N) (line 14); we assume N is already defined. The CUDA type dim3 can be used as a one-, two- or three-dimensional variable to specify the dimensions of a block, as blocks are composed of up to three dimensions. Finally, in line 15 the number of blocks per grid and the number of threads per block are passed to the kernel MatAdd via numBlocks and threadsPerBlock, respectively.


Source Code 3: Addition of two matrices (Taken from [1])

 1  // Kernel definition
 2  __global__ void MatAdd(float A[N][N], float B[N][N],
 3                         float C[N][N])
 4  {
 5      int i = blockIdx.x * blockDim.x + threadIdx.x;
 6      int j = blockIdx.y * blockDim.y + threadIdx.y;
 7      if (i < N && j < N)
 8          C[i][j] = A[i][j] + B[i][j];
 9  }
10  int main()
11  {
12      ...
13      // Kernel invocation
14      dim3 threadsPerBlock(16, 16);
15      dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
16      MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
17      ...
18  }

In Source Code 3 we can see the use of blockIdx. Here we perform the same matrix addition as before, but in the definition of i and j we use blockIdx, together with blockDim and threadIdx, as a two-dimensional variable, so that the work can be spread over several blocks of threads. These are the first examples of the parallelization of simple algorithms. In the next pages we explain more about the architecture and more concepts about the other levels of memory.

4.4 CUDA memory

CUDA memory is divided into three different levels with different levels of accessibility. The first one is "global memory", which all threads can access. The second one is "shared memory", through which threads from the same block can share information. The last one is "constant and texture memory", which can be accessed by all threads and is read-only.

Figure 7, together with Source Code 1, shows a simple example of the use of global memory. In this work shared and texture memory are not used explicitly; more about the topic can be found in [18].
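Although shared memory is not used in this work, a small example helps to illustrate this second memory level. The kernel below is an illustrative sketch, not part of the dissertation's code: each block stages its element-wise products in __shared__ memory, reduces them with a tree reduction, and writes one partial sum per block back to global memory (the host, or a second kernel, would then add the entries of blockSums).

    // Illustrative sketch of shared memory: block-wise partial dot product.
    #define BLOCK 256

    __global__ void partialDot(const float *a, const float *b, float *blockSums, int n)
    {
        __shared__ float cache[BLOCK];            // visible only to the threads of this block
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        cache[threadIdx.x] = (tid < n) ? a[tid] * b[tid] : 0.0f;
        __syncthreads();                          // synchronize before reading neighbours

        // Tree reduction within the block using the shared array.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                cache[threadIdx.x] += cache[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            blockSums[blockIdx.x] = cache[0];     // one result per block, in global memory
    }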


4.5 Data transfer

It is important to know the basics of the programming model described above in order to understand how to develop a program properly and what happens behind the scenes in more complex problems. In practice, however, it is very common to use APIs (Section 4.3) for these complex problems. In this work the API and its libraries are used extensively, and they handle the computation of the number of threads and blocks that should be used; it is still necessary, though, to allocate memory, pass data from the CPU to the GPU, and retrieve the results from the GPU back to the CPU. Source Code 4 shows a complete example of this data transfer. Lines 1 to 7 declare the kernel VecAdd. In lines 14 and 15 memory is allocated on the host for the vectors A and B. In lines 19 to 24 memory is allocated on the device. In lines 26 and 27 data are copied from host to device. In line 32 the kernel is invoked. In line 35 the result is copied from the device to the host, and finally in lines 37 to 39 the device memory is released. The same process is followed when we use libraries; some explicit examples of the corresponding implementations can be seen in the Nvidia CUDA Toolkit, and specifically for this work we used two libraries: CUBLAS [20] and CUSPARSE [21].


Source Code 4: Vector addition code sample (Taken from [1])

 1  // Device code
 2  __global__ void VecAdd(float* A, float* B, float* C, int N)
 3  {
 4      int i = blockDim.x * blockIdx.x + threadIdx.x;
 5      if (i < N)
 6          C[i] = A[i] + B[i];
 7  }
 8  // Host code
 9  int main()
10  {
11      int N = ...;
12      size_t size = N * sizeof(float);
13      // Allocate input vectors h_A and h_B in host memory
14      float* h_A = (float*)malloc(size);
15      float* h_B = (float*)malloc(size);
16      // Initialize input vectors
17      ...
18      // Allocate vectors in device memory
19      float* d_A;
20      cudaMalloc(&d_A, size);
21      float* d_B;
22      cudaMalloc(&d_B, size);
23      float* d_C;
24      cudaMalloc(&d_C, size);
25      // Copy vectors from host memory to device memory
26      cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
27      cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
28
29      // Invoke kernel
30      int threadsPerBlock = 256;
31      int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
32      VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
33      // Copy result from device memory to host memory
34      // h_C contains the result in host memory
35      cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
36      // Free device memory
37      cudaFree(d_A);
38      cudaFree(d_B);
39      cudaFree(d_C);
40      // Free host memory
41      ...
42  }
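When a library routine replaces a hand-written kernel, the allocation and transfer steps remain exactly the same; only the kernel launch is replaced by a library call. The fragment below is an illustrative sketch (not taken from Appendix B) that follows the pattern of Source Code 4 but uses CUBLAS to compute y = alpha*x + y on the device; error checking is omitted for brevity.

    // Illustrative sketch: the data-transfer pattern of Source Code 4 with a
    // CUBLAS call (cublasSaxpy) in place of a hand-written kernel.
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    void saxpy_on_gpu(int n, float alpha, const float *h_x, float *h_y)
    {
        size_t size = n * sizeof(float);
        float *d_x, *d_y;
        cudaMalloc(&d_x, size);
        cudaMalloc(&d_y, size);

        // Host -> device transfer
        cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // y = alpha*x + y on the GPU
        cublasDestroy(handle);

        // Device -> host transfer of the result
        cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost);

        cudaFree(d_x);
        cudaFree(d_y);
    }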


5 Results

5.1 Methodology

The main purpose of the results of this work is to show the performance of the parallel code versus the performance of the serial code. To accomplish this we took the code in [24] as a reference, which can be seen in Appendix C; the parallel code can be seen in Appendix B. For this work we used CUDA C, the graphics card was a GeForce GTX 765M and the processor an Intel Core i7. To solve the linear system in parallel we used the Conjugate Gradient algorithm that can be found in the Nvidia CUDA Toolkit 7.5. A benchmark was made to compare the performance of both codes with the same input parameters, and at the end the L2 norm of the error was calculated for the Matlab code. In the codes, ε (the symmetrization parameter) was called ss, which for this model (SIPG) equals minus one; the penalty parameter α0 was called penal, and nel is the number of elements.

5.2 Numerical results of the serial curve

The serial code given by [24] gives us the coefficients of the numerical solution of the linear system. We added a loop to calculate the numerical solution so that we could compare it directly with the analytical solution. As a result we get u_r, the solution at the right boundary of each element, and u_l, the solution at the left boundary of each element. With these two sets of points we get the numerical solution of the elliptic problem. In Figure 9 we have plotted u_r(x) with asterisks in red and u_l(x) with circles in blue. In Figure 10 we compare this result with the analytic solution of the elliptic problem, which is [24]:

\[
p(x) = (1-x)\,e^{-x^2}. \tag{5.1}
\]

We can see in this figure that the numerical result is in agreement with the analytical result.

Figure 9: Numerical solution of the DG method: the u_r(x) and u_l(x) curves.

5.2.1 Coefficients in serial and in parallel

Figure 12 shows the plot of the coefficients obtained from the codes in parallel and in serial. The coefficients obtained from the serial code [24] are given as red circles and the coefficients of the CUDA code as green crosses. This first comparison was made using the original solver (i.e. the direct method) for the serial code and CG for the CUDA code. As the plot shows, the results from both codes are the same, as we expected.


Figure 10: Numerical solution from the Matlab code: analytic, u_r and u_l curves.

Figure 11: Numerical results of the DGFEMs elliptic problem from the parallel code.


Figure 12: Coefficients given by the serial code and by the parallel code. Input: nel=10, ss=-1, penal=2.

5.3 Benchmarking

We also ran some benchmarks to compare the performance of the serial code versus the performance of the parallel code. The original serial code solved the linear system with the MLD solver. However, in order to make a fairer comparison, we changed this solver to the already implemented Matlab functions CGS, the Matlab implementation of the Conjugate Gradient Squared algorithm, and PCG, the Matlab implementation of the Conjugate Gradient algorithm. The function PCG was used without a preconditioner, which is exactly the same implementation used for the CUDA code.


Figure 13: Benchmark with the CGS function. Input: ss=-1, penal=3.0.

In Figure 13 we can see the comparison of the performance of the codes with the function CGS and the input parameters ss=-1, penal=3.0. The number of elements is plotted on the abscissa and the time in seconds on the ordinate. We ran the program once to obtain every point. The results obtained from the Matlab code are shown as pink crosses and the results from the CUDA code as red crosses. It can be seen from the graph that at the beginning, i.e. for few elements, the times are almost the same, but for many elements the Matlab code takes more than 200 times longer to obtain the same result.


Figure 14: Benchmark with the CGS function. Input: ss=-1, penal=5.0.

Figure 14 shows a graph with the same solver as in Figure 13 but with parameters ss=-1 and penal=5.0. It is interesting to note that in the first part of the graph the points from the CUDA code take more time than the points from the Matlab code. At around 2000 elements both codes have the same performance; from that point on, the Matlab code starts to take more time, up to the end. This behaviour at the beginning of the graph can be explained by the data transfer between host and device, i.e. the time the information takes to pass from the CPU to the GPU. Although we did not measure this transfer time, from the figure we can say that it is the reason the CUDA code takes more time than the Matlab code for few elements. This overhead becomes unimportant when the amount of information is larger, i.e. when the number of elements increases, and for more than 7000 elements the Matlab code takes much more time than the CUDA code.


Figure 15: Benchmark with the PCG function. Input: ss=-1, penal=5.0.


Figure 16: Benchmark with the PCG function. Input: ss=-1, penal=2000.

Figure 15 shows a benchmark using the Matlab PCG function as the solver for the serial code, which is a fairer comparison than the benchmark with CGS. For this graph, PCG was used without any preconditioner. We took as input parameters ss=-1 and penal=5.0. We can also see, at the beginning of the figure, the subtle behaviour described above regarding the data transfer between host and device. At the end of the figure we can see that the performance is more than 35 times better for the last point, which was for 7500 elements. Figure 16 shows a benchmark for the input ss=-1.0 and penal=2000; the behaviour is the same as in Figure 15, and it can be seen that the performance of the CUDA code is more than 70 times better than that of the Matlab code for 7500 elements.


5.4 Error

The L2 norm of the error [29], [16], defined in equations (5.3) to (5.6), was plotted against the mesh size h in order to verify the correctness of the DGFEMs solution. According to the theory [29], [16], for a piecewise polynomial basis of degree 2 a slope of 3 is expected in the convergence plot. This result was achieved using a quadrature rule for the integral of the error. Since the error function is not a polynomial (it involves the exponential function from the analytic solution of the problem), the integral is not exact; nevertheless a large number of quadrature points was used:

\[
\int_{x_i}^{x_{i+1}} (u - u_h)^2\,dx \approx \sum_{q=1}^{N_q} \big(u(x_q) - u_h(\hat{x}_q)\big)^2\, \hat{w}_q\, |J|, \tag{5.2}
\]

where $x_q$ is the value of $x$ mapped from the domain $(-1,1)$, $\hat{x}_q$ are the roots of the Legendre polynomials in the interval $(-1,1)$ and $\hat{w}_q$ are the weights. This quantity was computed for all elements and each contribution was added in a loop. The error code can be seen in Appendix D; the subroutine lgwt.m was taken from [30].

\[
\|u - u_h\|^2_{L^2(0,1)} = \sum_{i=1}^{N} \int_{x_{i-1}}^{x_i} (u - u_h)^2\,dx. \tag{5.3}
\]

Taking into account expression (5.2), we can compute the squared error using the following expression:

\[
\|u - u_h\|^2_{L^2(0,1)} \approx \sum_{i=1}^{N}\sum_{q=1}^{N_q} \big(u(x_q) - u_h(\hat{x}_q)\big)^2\, \hat{\omega}_q\, |J|. \tag{5.4}
\]

The variable $x$ mapped to the local domain has the following expression, which depends on the endpoints of each element and the quadrature points:

\[
x_q = \frac{1 - \hat{x}_q}{2}\, x_{i-1} + \frac{1 + \hat{x}_q}{2}\, x_i. \tag{5.5}
\]

The Jacobian is computed using the following expression:

\[
|J| = \frac{dx}{d\xi} = \frac{x_i - x_{i-1}}{2}. \tag{5.6}
\]

The DGFEMs approximation is given by

\[
u_h(\hat{x}_q) = \sum_{j=1}^{3} U_j\, (\hat{x}_q)^{j-1}, \tag{5.7}
\]

where $(\hat{x}_q)^{j-1}$ are the mapped basis polynomials $1, x, x^2$ on the interval $(-1,1)$ and $U_j$ are the coefficients.
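The same computation can be written compactly in C. The sketch below is illustrative only (the dissertation's error code in Appendix D is in Matlab and uses lgwt.m); it evaluates (5.2)-(5.7) with a fixed 3-point Gauss-Legendre rule, whereas this work used a much larger number of quadrature points, which would only change the two constant arrays.

    /* Sketch of the L2-error computation (5.2)-(5.7) with a 3-point Gauss rule. */
    #include <math.h>

    #define NQ 3
    static const double xq_hat[NQ] = { -0.7745966692414834, 0.0, 0.7745966692414834 }; /* -sqrt(3/5), 0, +sqrt(3/5) */
    static const double wq_hat[NQ] = { 5.0/9.0, 8.0/9.0, 5.0/9.0 };

    static double exact_u(double x) { return (1.0 - x) * exp(-x * x); }   /* p(x) of (5.1) */

    /* U holds 3 coefficients per element, as in (5.7). */
    double l2_error(int nel, const double *xnodes, const double *U)
    {
        double err2 = 0.0;
        for (int i = 0; i < nel; ++i) {
            double xl = xnodes[i], xr = xnodes[i + 1];
            double jac = 0.5 * (xr - xl);                                    /* |J| of (5.6) */
            for (int q = 0; q < NQ; ++q) {
                double xh = xq_hat[q];
                double x  = 0.5 * (1.0 - xh) * xl + 0.5 * (1.0 + xh) * xr;   /* (5.5) */
                double uh = U[3*i] + U[3*i + 1] * xh + U[3*i + 2] * xh * xh; /* (5.7) */
                double d  = exact_u(x) - uh;
                err2 += d * d * wq_hat[q] * jac;                             /* (5.4) */
            }
        }
        return sqrt(err2);
    }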

Finally, the expected convergence plot is obtained:

Figure 17: L2 norm of the error vs the mesh size h.

5.5 Discussion of the results

As we could see from Figure 12, the results from the CUDA code are in good agreement with the results from the Matlab code [24], and the numerical results coincide with the analytical solution shown in Figure 10, as we expected. From the second part of the results, the benchmarking, we can also see a much better performance of the CUDA code in comparison with the Matlab code, as expected. An important result is the fact that when we calculate the solution for more elements, the performance of the CUDA code becomes much better in comparison with the Matlab code. This can be explained because more floating-point operations are performed as we increase the number of elements, which is when the difference between the serial and the parallel code becomes more evident. This could be even more beneficial when we try to obtain the solution for multidimensional versions of the problem, where the number of elements grows faster. We could also see, qualitatively, that the data transfer between host and device affects the performance of the CUDA code, which explains the beginning of Figures 14, 15 and 16. Another important issue is the benchmarking technique: for these results we used the tic and toc Matlab timing functions and the basic shell command time to measure the performance of the CUDA code, but, as we know from [18], CUDA has its own specific functions to time different parts of the code; for future work we could improve this area to obtain better measurements. The last result is shown in Figure 17, which demonstrates the accuracy of the numerical results: we obtain a straight line in a log-log plot with slope 3, since we use polynomials of degree 2 for our basis functions, which is consistent with the theory.
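For reference, CUDA's event API provides this kind of device-side timing. The fragment below is a generic sketch of how a kernel launch or a solver call could be timed in a future version of the benchmarks; it is not a measurement used for the results above, and the wrapper name time_gpu_section is illustrative.

    // Sketch of GPU-side timing with CUDA events (not used for the results above).
    #include <cuda_runtime.h>

    float time_gpu_section(void (*work)(void))
    {
        cudaEvent_t start, stop;
        float elapsed_ms = 0.0f;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        work();                                   // e.g. the kernel launches or the CG solve
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);               // wait for the recorded work to finish

        cudaEventElapsedTime(&elapsed_ms, start, stop);   // elapsed time in milliseconds
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return elapsed_ms;
    }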

6 Conclusions

This research has attempted to show the performance of a CUDA code for the Discontinuous Galerkin Finite Element Method (DGFEMs) applied to an elliptic problem in 1D. A comparison was made between a serial code in Matlab and the same problem implemented in parallel in CUDA. Some benchmarks showed that the performance of the serial code was substantially slower than that of the parallel code, as we expected. Using exactly the same solver, we found that the CUDA code can be more than 70 times faster than the Matlab code. We also found that the numerical results of the parallel and the serial codes are in agreement. Finally, the L2 error norm of the serial code was calculated, obtaining a straight line of slope 3, as expected. Future studies should concentrate on the use of better profiling for fairer benchmarks; in particular, it could be used in the development of a multidimensional version of the problem. This work may well represent a step forward in the use of GPUs in the implementation of DGFEMs for elliptic problems. Application of these findings will make it easier to apply DGFEMs in more dimensions, where the linear algebra problem is numerically and computationally more demanding.


A Tutorial to install CUDA 7.5 on Ubuntu 15.04

First of all you must have an Nvidia CUDA-capable card; you can check this with:

$ lspci | grep NVIDIA

You can also check your Ubuntu version with:

$ lsb_release -a

It is recommended to have a brand new installation of Ubuntu, to avoid possible conflicts with other compilers or Nvidia drivers. Then you should download the Nvidia CUDA driver from:

https://developer.nvidia.com/cuda-downloads

Install all these libraries:

sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev libglu1-mesa glew-utils mesa-utils

Open the following file:

sudo gedit /etc/modprobe.d/blacklist.conf

and add these lines to the end of the file:

blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv

A very important step is to disable Nouveau with:

options nouveau modeset=0

and create a file at /etc/modprobe.d/blacklist-nouveau.conf with the following contents:

blacklist nouveau

and regenerate the kernel initramfs:

sudo update-initramfs -u

Reboot into text mode (runlevel 3). This can usually be accomplished by adding the number "3" to the end of the system's kernel boot parameters. Since the NVIDIA drivers are not yet installed, the text terminals may not display correctly; temporarily adding "nomodeset" to the system's kernel boot parameters may fix this issue. Consult https://wiki.ubuntu.com/Kernel/KernelBootParameters on how to make the above boot parameter changes. The reboot is required to completely disable the Nouveau drivers and prevent the graphical interface from loading. The CUDA driver cannot be installed while the Nouveau drivers are loaded or while the graphical interface is active. Then change the permissions of the executable with:

$ chmod 777 -R cuda__linux.run

Two steps that are not described in the official guide but worked for me are: eliminate the file .X0-lock in /temp, and after that reboot the computer. Another key step is that, while the installer is running, you have to write yes and accept all the options, but when it asks whether you want to install the OpenGL library you must say no; otherwise you will end up in an infinite loop at the log-on stage and will not be able to start the session. If your specific application uses the OpenGL library, this tutorial may not help you; otherwise it will work. Say yes and accept all the other options. If you get an error, or the installer skips something in the installation, read the error message it prints at the end; it will give you a hint to solve the problem. If everything is done properly, you should have your CUDA Toolkit installation complete. After this, make sure you add the following paths at the end of the .bashrc file in your home folder. For 32 bits include these lines:

export PATH=/usr/local/cuda-6.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib:$LD_LIBRARY_PATH

and for 64 bits:

export PATH=/usr/local/cuda-6.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib:/usr/local/cuda-6.5/lib64:$LD_LIBRARY_PATH

Then close the terminal, open another terminal and go to the Nvidia sample solvers. We recommend starting with the deviceQuery example, which is in the NVIDIA CUDA Toolkit; it prints all the information about your Nvidia card and its architecture. Finally, run make to get the executable and then execute the result of this step. Good luck with the installation!

B CUDA Code

This code was developed by the author of this work, inspired by [24], except for the implementation of the CG algorithm, which can be found in the CUDA Toolkit version 7.5. For more details please check https://developer.nvidia.com/cuda-toolkit

/*
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

/*
 * This sample implements a conjugate gradient solver on GPU
 * using CUBLAS and CUSPARSE
 *
 */

// includes, system
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Using updated (v2) interfaces to cublas */
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusparse.h>

// Utilities and system includes
#include <helper_functions.h>  // helper for shared functions common to CUDA Samples
#include <helper_cuda.h>       // helper function CUDA error checking and initialization

#include "common.h"
#include
#include

#define WIDTH 600
#define HEIGHT 600

float Aglobal [HEIGHT][WIDTH];
int n,m;

const char *sSDKname     = "conjugateGradient";

float sourcef(float xval)
{
    //source function for exact solution = (1-x)e^(-x^2)
    float yval;
    yval=-(2*xval-2*(1-2*xval)+4*xval*(xval-pow(xval,2)))*exp(-xval*xval);

    return yval;
}

// const int nel=500,mz=3;
// int glodim = nel * mz;

int main(int argc, char **argv)
{
    int M = 0, N = 0, nz = 0, *I = NULL, *J = NULL;
    float *val = NULL;
    const float tol = 1e-5f;
    const int max_iter = 10000;
    float *x;
    float *rhs;
    float a, b, na, r0, r1;
    int *d_col, *d_row;
    float *d_val, *d_x, dot;
    float *d_r, *d_p, *d_Ax;
    int k;
    float alpha, beta, alpham1;

    float *dA,*A;
    int *dNnzPerRow,*nnzh=NULL;
    int totalNnz;

    int nel=200,mz=3;
    int glodim = nel * mz;
    // const int glodim2=1500;
    // float Aglobal[1500][1400];
    float rhsglobal[glodim];

    // This will pick the best possible CUDA capable device
    cudaDeviceProp deviceProp;
    int devID = findCudaDevice(argc, (const char **)argv);

    if (devID < 0)
    {
        printf("exiting...\n");
        exit(EXIT_SUCCESS);
    }

    checkCudaErrors(cudaGetDeviceProperties(&deviceProp, devID));

    // Statistics about the GPU device
    // printf("> GPU device has %d Multi-Processors, SM %d.%d compute capabilities\n\n",
    //        deviceProp.multiProcessorCount, deviceProp.major, deviceProp.minor);

    // int version = (deviceProp.major * 0x10 + deviceProp.minor);

    // if (version < 0x11)
    // {
    //     printf("%s: requires a minimum CUDA compute 1.1 capability\n", sSDKname);
    //
    //     // cudaDeviceReset causes the driver to clean up all state. While
    //     // not mandatory in normal operation, it is good practice. It is also
    //     // needed to ensure correct operation when the application is being
    //     // profiled. Calling cudaDeviceReset causes all profile data to be
    //     // flushed before the application exits
    //     cudaDeviceReset();
    //     exit(EXIT_SUCCESS);
    // }

    ////////////////////////// Riviere code //////////////////////////////////////////

    int je,ie;
    float ss,penal;
    float Amat[3][3],Bmat[3][3],Cmat[3][3],Dmat[3][3],Emat[3][3],F0mat[3][3],FNmat[3][3];

    //symmetric interior penalty Galerkin (SIPG) method
    ss=-1.0;
    penal=5.0;

    //dimension of global matrix
    //glodim = nel * mz;

    // printf("\nglodim=%d\n",glodim);

    // float Aglobal[glodim][glodim];
    // float rhsglobal[glodim];

    /* for(int j=0;j