8 Parallelization

8.1 Wafl and Parallelization

Parallelization is the evaluation of program parts in several threads, i.e. in parallel. Since each Wafl program represents an expression, parallelization requires the program to be segmented into many sub-expressions that are evaluated in parallel.

Only pure functional expressions can be parallelized. If a non-pure expression is parallelized, its result or its side-effects could be different for different evaluations.

The current Wafl implementation does not check whether the object of parallelization is a pure functional expression or not. The developers must therefore take care of this.

Explicit Parallelization

Wafl supports explicit and implicit parallelization.

Explicit parallelization is based on the explicit use of parallel versions of some Wafl library functions. These functions are implemented to split the job into parts and to use many threads to complete the evaluation.

The explicit parallel functions include:

map_par - A parallel version of the map function.
mapIdx_par - A parallel version of the mapIdx function.
filter_par - A parallel version of the filter function.
fold_par - A parallel version of the fold function.
newArrayFn_par - A parallel version of the newArrayFn function.

Each of the parallel versions of the functions has the same type and returns the same result as the sequential version, but is evaluated in many parallel threads.

Implicit Parallelization

Implicit parallelization assumes that some parts of the programs are implicitly (automatically) parallelized by the interpreter. An important module of the Wafl interpreter is a parallelization engine. Its task is to analyze the available data about the expressions and decide whether an expression should be evaluated sequentially or in parallel. It also decides on many of the parallelization options.

Simplified, one can say that the parallelization engine tries to evaluate the expected benefits of parallelization, taking into account the following:

the specific parallelization function;
the size of the data to be processed;
the complexity of the processing function applied.

Based on the evaluated benefits, the engine compares them with a threshold and decides whether parallelization should be used or not. The threshold value can be configured.

By default, the implicit parallelization is disabled. The main reason for this is to avoid errors during the implicit parallelization of non-pure expressions.

In the future, after the interpreter is upgraded to check the purity of expressions, the implicit parallelization might be enabled by default.

The following functions can be implicitly parallelized:

map
mapIdx
fold
filter
newArrayFn.

Whether a particular instance of these functions is parallelized or not is decided by the parallelization engine. Currently, the parallelization engine decides based on the size of the data collection to be processed, and on the number of parallel jobs currently active.

Implicit Parallelization Functions

Sometimes it is useful to have implicit parallelization, but to suggest to the parallelization engine that a part of the code should be evaluated in parallel, if possible. Such a suggestion has the effect of lowering the threshold for the parallelization decision.

To suggest that a program segment should be parallelized, there is a function parallel. Its type is ('1 -> '1) and its semantics is almost like the identity function \x:x. The only difference is that the threshold for parallelization is lowered.

It can be applied to any expression, but has different effects when applied to functions.

In the general case, if the bound expression is not a simple function reference, as in parallel(map(lst,fn)), the following semantics are applied:

remember the current parallelization threshold;
lower the parallelization threshold;
evaluate the expression and
reset the parallelization threshold to the previous value.

So the main difference is whether the arguments are evaluated with parallelization enabled or not.

On the other hand, if the argument is a function, as in parallel(map)(lst,fn), the parallelization suggestion is strictly bound to the given function. The main difference is that the parallelization threshold for the function’s arguments is not lowered. We could describe the semantics as follows:

evaluate the arguments;
remember the current parallelization threshold;
lower the parallelization threshold;
evaluate the function and
reset the parallelization threshold to the previous value.

If the specified function is a library function like map, the result is almost the same as using a specialized parallel map implementation. On the other hand, if the specified function is a user-defined function, then the result is the same as defining the user-defined function with parallel.

For example, if we have:

f(x,y,z) = ...defexp...;

then we can suggest its parallelization in two ways:

if we want to parallelize only a selected application, then we should use parallel(f)(a,b,c), and
if we want to parallelize all function applications, we should use parallel for the entire expression of the function definition:
```
f(x,y,z) = parallel(...defexp...);
```

An alternative is to define a new function parallel_f as follows:

parallel_f = parallel(f);

Implicit Parallelization Command Line Options

The configuration of the parallelization engine can be set via command line options:

-pd, -parallel-disable
- Disables the implicit parallelization.
-pa, -parallel-auto
- Enables the implicit parallelization.
-ps, -parallel
- Enables and suggests the implicit parallelization. Similar to -pa, but this mode is biased towards parallelization by using a lower threshold.
-pf, -parallel-force
- Forces the parallelization. A parallel implementation will be used wherever possible.

Explicit Sequential Functions

When implicit parallelization is used, it is possible that the parallelization engine decides to parallelize a part of the program that is not allowed to be parallelized (usually because it is not purely functional). To prevent this, there are explicit sequential versions of functions that prevent implicit parallelization.

There are sequential versions of every implicitly parallelizable function:

map_seq - A sequential version of the map function.
mapIdx_seq - A sequential version of the mapIdx function.
filter_seq - A sequential version of the filter function.
fold_seq - A sequential version of the fold function.
newArrayFn_seq - A sequential version of the newArrayFn function.

The functions foldl and foldr are always sequential, therefore fold_seq corresponds to the function fold. It is a special case of foldl and foldr where an associative function is used.

8.2 Parallel Functions

Isolation

The key word in parallelization is isolation. The prerequisite for parallel evaluation of some tasks is that they are well isolated from each other. In some concurrent environments, we expect the parallel tasks to communicate with each other to ensure a synchronization. However, in a functional context, especially for purely functional tasks, synchronization is usually not required. All we need is isolation of the tasks.

The main problem with task isolation is data sharing. It is very inconvenient if one task changes some data that the other task is processing at the same time. In a purely functional environment, we do not have such problems. Even in mixed environments, as is the case with Wafl, we usually do not have such problems. As long as we process the data in memory and do not change it, we can split the data into chunks and process the chunks in multiple threads quite safely and in full isolation.

In a few cases, problems can arise, and these are the cases we need to watch out for:

When one or more tasks perform side effects of any kind that are observed by other tasks and can affect their computations:
- by file operations such as creating, deleting or modifying files, or
- by database operations such as creating, deleting or modifying the contents of the database, or
When the order of computation is important, i.e. the operations we perform are not fully associative with the specific data set.

We have already discussed the functions that can work in parallel, as well as theirs sequential versions and the versions that permit the implicit parallelization. Most of them, with the exception of the fold_par function (and of course fold), normally perform some independent operations on independent data. Only in the case of fold_par the order of computation can be important. In other cases, we can say that we have complete isolation and safe parallelization as long as we do not create side effects. For this reason, we pay additional attention to folding.

Folding

Not every sequence fold can be parallelized. For example, consider the following expression:

(1..20).foldr(
    \x,l: if x%2=0 then x:l else l,
    []
)

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

The expression folds an integer sequence by evaluating a filtered list of even integers. (Of course, we can do this more simply with filter.) This is an example of fold that cannot be implemented with foldl with the same arguments, since foldl expects its first argument to be an integer and the second argument to be a list of integers.

In order to be able to use both foldl and foldr with the same result in the general case, two important conditions must be met:

the fold function must have a type that is compatible with ('1 * '1 -> '1), i.e. it must represent a binary operator on a domain; and
it must be an associative function, i.e. for any arguments a, b and c the following must hold: f(a,f(b,c)) == f(f(a,b),c).

If we want the fold to be evaluated in parallel, there is an additional condition:

the specified value zero must really be a neutral value for the given operation, as it can be used several times.

The following functions are intended for such cases:

Function / Type and Description

fold

(Sequence['2]['1] * ('1 * '1 -> '1) * '1 -> '1)
Returns the sequence fold by an associative function:
fold == foldl == foldr

fold_seq

(Sequence['2]['1] * ('1 * '1 -> '1) * '1 -> '1)
Non-parallel fold of sequence by an associative function:
fold_seq == foldl == foldr

fold_par

(Sequence['2]['1] * ('1 * '1 -> '1) * '1 -> '1)
Folds the sequence elements in parallel. [‘fn’ must be associative]: [‘zero’ must be ZERO, because it can be used many times]:
fold_par(seq,fn,zero) == foldl(seq,fn,zero) == foldr(seq,fn,zero)

These functions assume that the fold is performed by a given binary operator of type ('1 * '1 -> '1), which is an associative operator, and return the same result as foldl. The function fold_par assumes that the zero value is a neutral value for the operator.

However, the interpreter does not check whether the operator is associative or not and if the specified zero value is really the zero for the operator. So, please use these functions with caution.

The base function here is fold. It can be implicitly parallelized. However, if parallelization is required, then fold_par should be used, and if parallelization is to be prevented, then fold_seq should be used.

In the following examples, the fold is performed with an operator that returns the sum of the arguments, but additionally outputs the arguments to the console:

(1..50).fold_seq(
    \x,y: x + y
      => echoTxt( (if x=0 then '0 ' else ' ') + '+ ' + y$ ),
    0
).echoTxt(' =\n')

0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 + 13 + 14 + 15 + 16 + 17 + 18 + 19 + 20 + 21 + 22 + 23 + 24 + 25 + 26 + 27 + 28 + 29 + 30 + 31 + 32 + 33 + 34 + 35 + 36 + 37 + 38 + 39 + 40 + 41 + 42 + 43 + 44 + 45 + 46 + 47 + 48 + 49 + 50 =
1275

(1..50).fold(
    \x,y: x + y
      => echoTxt( (if x=0 then '0 ' else ' ') + '+ ' + y$ ),
    0
).echoTxt(' =\n')

0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 + 13 + 14 + 15 + 16 + 17 + 18 + 19 + 20 + 21 + 22 + 23 + 24 + 25 + 26 + 27 + 28 + 29 + 30 + 31 + 32 + 33 + 34 + 35 + 36 + 37 + 38 + 39 + 40 + 41 + 42 + 43 + 44 + 45 + 46 + 47 + 48 + 49 + 50 =
1275

(1..50).fold_par(
    \x,y: x + y
      => echoTxt( (if x=0 then '0 ' else ' ') + '+ ' + y$ ),
    0
).echoTxt(' =\n')

0 + 10 + 11 + 2 + 12 + 3 + 13 + 14 + 15 + 16 + 17 + 18 + 19 + 200 + 21 + 22 + 23 + 240 + 41 + 42 + 43 + 44 + 45 + 25 + 26 + 27 + 28 + 29 + 30 + 46 + 4 + 47 + 48 + 49 + 500 + 31 + 32 + 33 + 34 + 35 + 36 + 37 + 38 + 39 + 40 + 5 + 6 + 7 + 8 + 9 + 100 + 55 + 155 + 255 + 355 + 455 =
1275

The result is always the same, but in the case of parallel evaluation, the order of operations may be different. After each thread has folded its part of the sequence, the results of the threads are also folded. The zero is used once more than the number of threads.

Suggesting and Preventing Parallelization

As already mentioned, we can use command line options to force (-pf) or prevent parallelization(-pd). We can also use the functions that always evaluate in parallel as map_par or sequentially as map_seq. There are two additional functions that control the implicit parallelization:

Function / Type and Description

parallel

('1 -> '1)
Marks an expression to be parallelized, if possible.

sequential

('1 -> '1)
Force the expression to run sequentially.

The function parallel(exp) suggests to the interpreter that the specified expression should be evaluated in parallel. It does not enforce parallel evaluation, but lowers the threshold for parallelization of the specified expression. There are two versions of this function:

When applied to a function, it returns a new function with the same functionality, but with the implicit parallelization threshold lowered.
When applied to an expression, it evaluates the given expression with the lowered implicit parallelization threshold.

In the following examples, the same expression is evaluated three times. The first example is as usual, with standard implicit parallelization:

(1..100).fold(
    \x,y: x + y
      => echoTxt( (if x=0 then '0 ' else ' ') + '+ ' + y$ ),
    0
).echoTxt(' =\n')

0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 + 13 + 14 + 15 + 16 + 17 + 18 + 19 + 20 + 21 + 22 + 23 + 24 + 25 + 26 + 27 + 28 + 29 + 30 + 31 + 32 + 33 + 34 + 35 + 36 + 37 + 38 + 39 + 40 + 41 + 42 + 43 + 44 + 45 + 46 + 47 + 48 + 49 + 50 + 51 + 52 + 53 + 54 + 55 + 56 + 57 + 58 + 59 + 60 + 61 + 62 + 63 + 64 + 65 + 66 + 67 + 68 + 69 + 70 + 71 + 72 + 73 + 74 + 75 + 76 + 77 + 78 + 79 + 80 + 81 + 82 + 83 + 84 + 85 + 86 + 87 + 88 + 89 + 90 + 91 + 92 + 93 + 94 + 95 + 96 + 97 + 98 + 99 + 100 =
5050

In the next example, we use parallel(fold) to suggest that the fold function should be evaluated in parallel:

(1..100).(parallel(fold))(
    \x,y: x + y
      => echoTxt( (if x=0 then '0 ' else ' ') + '+ ' + y$ ),
    0
).echoTxt(' =\n')

0 + 1 + 2 + 3 + 40 + 31 + 5 + 6 + 70 + 21 + 8 + 9 + 10 + 320 + 610 + 51 + 52 + 53 + 54 + 55 + 22 + 56 + 57 + 58 + 59 + 600 + 41 + 42 + 43 + 44 + 45 + 46 + 47 + 48 + 49 + 500 + 110 + 81 + 82 + 83 + 84 + 85 + 86 + 87 + 88 + 89 + 900 + 91 + 92 + 93 + 94 + 95 + 96 + 97 + 98 + 99 + 100 + 12 + 23 + 24 + 25 + 26 + 27 + 28 + 29 + 300 + 71 + 72 + 73 + 74 + 75 + 76 + 77 + 78 + 79 + 80 + 33 + 62 + 34 + 35 + 36 + 37 + 38 + 39 + 40 + 63 + 64 + 13 + 14 + 15 + 16 + 17 + 18 + 19 + 20 + 65 + 66 + 67 + 68 + 69 + 700 + 55 + 155 + 255 + 355 + 455 + 555 + 655 + 755 + 855 + 955 =
5050

And finally, by applying parallel to an expression, we suggest that the entire expression should be evaluated in parallel:

(1..100).fold(
    \x,y: x + y
      => echoTxt( (if x=0 then '0 ' else ' ') + '+ ' + y$ ),
    0
).echoTxt(' =\n').parallel()

0 + 11 + 12 + 13 + 14 + 15 + 16 + 17 + 18 + 19 + 200 + 31 + 32 + 33 + 34 + 35 + 36 + 37 + 38 + 39 + 400 + 710 + 21 + 72 + 73 + 74 + 75 + 76 + 77 + 78 + 79 + 800 + 910 + 81 + 92 + 93 + 94 + 95 + 96 + 97 + 98 + 99 + 1000 + 61 + 62 + 63 + 64 + 65 + 66 + 67 + 68 + 69 + 70 + 82 + 83 + 84 + 85 + 86 + 87 + 88 + 89 + 90 + 22 + 23 + 24 + 25 + 26 + 27 + 28 + 29 + 300 + 510 + 1 + 52 + 53 + 54 + 55 + 56 + 57 + 58 + 59 + 600 + 41 + 2 + 42 + 43 + 44 + 45 + 46 + 47 + 48 + 49 + 50 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 100 + 55 + 155 + 255 + 355 + 455 + 555 + 655 + 755 + 855 + 955 =
5050

Recall that <exp>.parallel() is equivalent to parallel(<exp>), so the parallel function in the previous example was applied to the entire previous expression.

Implicit parallelization depends on many properties of the specific interpreter implementation and the specific platform. Therefore, it is not always possible to repeat the given examples with the same degree of parallelization. The consequence is that it is not always possible to predict how an expression will be evaluated.

For this reason, we will not go into the results of the previous examples in detail. We expect that the first example was not evaluated in parallel and that the other two examples were evaluated in parallel, but we cannot be sure. Try changing the list size to see the different results.

The function sequential(exp) is another function for controlling the implicit parallelization. This function forces the sequential evaluation of the given expression. During the evaluation of the given expression, only the explicitly parallel functions such as map_par will work in parallel and all other computation will be performed sequentially.

As with parallel, there are two versions of sequential function:

When applied to a function, it returns a new function with the same functionality, but with forced sequential evaluation.
When applied to an expression, it forces the sequential evaluation of the given expression.

The sequential function is stronger than the parallel function. While parallel suggests the parallelization, sequential forces sequential evaluation. Any use of parallel in expressions bound by sequential is ignored. On the other hand, the use of sequential in expressions bound by the parallel has the expected effect.

In the following examples, we compute (almost) the same thing in different ways. Some of the expressions will always work sequentially, some will always work in parallel and the others will depend on the program options:

{#
    'Always sequential:',
    fold_seq.test(),
    fold.sequential().test(),   // sequential( fold )
    fold.test().sequential(),   // sequential(...exp...)
    fold.test().sequential().parallel(),
    fold.test().parallel().sequential(),
    fold.sequential().parallel().test(),
    fold.parallel().sequential().test(),
    'Depending on the options:',
    fold.test(),
    fold.parallel().test(),     // parallel( fold )
    fold.test().parallel(),     // parallel(...exp...)
    'Always parallel:',
    fold_par.test()
#}
where {
    test( foldFn ) = 
        if (1..1000).foldFn( operator+, 1 ) = 500501
        then 'seq' else 'par';
}

{# 'Always sequential:', 'seq', 'seq', 'seq', 'seq', 'seq', 'seq', 'seq', 'Depending on the options:', 'seq', 'par', 'par', 'Always parallel:', 'par' #}

We assume that the sequential fold works in a single thread and uses a single initial value 1. The result is therefore the sum of the integers in the range 1..1000 plus 1, i.e. 500501. If, on the other hand, the fold is evaluated in parallel, the initial value 1 is added many times and the result is greater than 500501.