Functional approach to distributed computations and Big Data with F# MBrace cloud monads. Part I.

This is the first in a series of blog posts about a functional approach to cloud computations and big data using F# and the MBrace framework.

Motivation

MBrace is a fully open source framework written in F# that offers a clear and simple programming model. It aims to make distributed, cloud-scale computation and data as simple as possible, and MBrace clusters are straightforward to set up. In this series you will learn the concept of "cloud monads", or cloud computation expressions, find out how to create a cluster and configure the environment, explore the features and opportunities of the new F# functional cloud programming model, and look at the code of MBrace examples.

F# in the cloud

First, have a look at the simple example:

Where the definition of getContentLength function looks like this:

This code finds the largest web page, by content length, among a list of given URLs. Going through it line by line, we first see the maxContentLength function, whose body is wrapped in a cloud block. Inside, we create a set of jobs, one per web page in the input array, each finding that page's content length. We then execute these jobs in parallel in the cloud and return the maximum of all the results. The example is deliberately simple, just enough to illustrate a small cloud workflow; in real life the list of inputs would be much longer.

This looks almost like the code we usually write, but it can run heavy, resource-hungry computations faster by doing them in the cloud. All the distribution work happens in the background: no matter how difficult or resource-consuming the task is, the computation is spread across the worker nodes of your cloud cluster behind the scenes. The tricky parts (the strategy for distributing work, shipping code to a machine in the cloud, balancing load between worker nodes, coordinating the return of results to the client) do not fall upon the user; they are implemented inside the MBrace framework.

How is it possible?

This is possible because of F# computation expressions (also known as monads). MBrace uses this feature to let you focus on your main code logic: it abstracts the software engineer away from handling side effects by handling them behind the scenes. As in the previous example, if a function performs its computation somewhere other than on your local machine, that is a side effect. Other examples of side effects are laziness, asynchronicity, and unpredictable or non-deterministic behavior. More generally, any execution of code that differs from the standard way and happens behind the scenes is a side effect.

MBrace is based on this concept: inside, it bundles the mechanisms for distributing code and spreading data across the cloud. If you remember the async computation expression, it runs an operation asynchronously without blocking the current thread, which is released back to the thread pool while the operation is pending, and all of this happens behind the scenes. The cloud computation expression works the same way, but it operates on cluster machines instead of threads and is defined inside the MBrace framework rather than in F# core.
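To make the analogy concrete, here is a local counterpart of the longest-page example written only with F# core async workflows. It is a sketch, not MBrace code: contentLength is a hypothetical stand-in that returns the length of the URL string instead of touching the network, and maxContentLengthLocal is an illustrative name.

```fsharp
// Local analogue using only F# core async workflows (no MBrace involved).
// `contentLength` is a hypothetical stand-in: it avoids real network access
// and pretends the "content length" is the length of the URL string.
let contentLength (url : string) : Async<int> = async {
    return url.Length
}

let maxContentLengthLocal (urls : string []) : Async<int> = async {
    // Build one async job per URL; nothing runs yet.
    let jobs = Array.map contentLength urls
    // let! waits for the fork/join of all jobs and unwraps Async<int []> to int [].
    let! lengths = Async.Parallel jobs
    return Array.max lengths
}

let longest =
    maxContentLengthLocal [| "http://a.io"; "http://example.org"; "http://fsharp.org" |]
    |> Async.RunSynchronously

printfn "%d" longest
```

The shape mirrors the cloud version described above: a block, a set of jobs, a parallel combinator bound with let!, and a final return. Only the builder and the combinator change.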

What happens inside?

In this part of the F# and cloud programming series, before diving into the features of MBrace and its many examples of what can be done in the cloud, it is worth lifting the veil on what happens inside, to get the full picture.

What the cloud block hides

Going back to the example, recall that the code in our function was wrapped in a cloud block. If we look inside the framework, we will see that the name of the block...

cloud { (* code logic *) }

... is the same as the name of an instance of the CloudBuilder class:

let cloud = new CloudBuilder()

A builder class usually consists of methods that contain the logic for dealing with a side effect. In our case, MBrace has the CloudBuilder class with the logic for distributing tasks across the cloud cluster.

Inside a computation expression, let statements form a structure similar to continuation-passing style. Each method of the builder class takes an operation and a continuation and uses them in its body to spread the code across the cloud cluster machines. For example, the Bind method of the CloudBuilder class is executed when we use let! within our cloud block.

In other words, when the compiler meets let!, it takes the operation and calls Bind, which distributes the operation through the cloud and returns the result of the deferred execution in a Cloud<'T> wrapper type. When the result of the deferred computation is ready, its value is unwrapped from the generic Cloud<'T> type to the concrete type 'T, ready for use.
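The translation from let! to Bind can be seen with a toy builder. This is a minimal sketch and not MBrace's CloudBuilder: ToyCloud, ToyCloudBuilder, toyCloud and runToy are hypothetical names, and the "deferred computation" here is just a local thunk with no distribution at all.

```fsharp
// Toy stand-ins, NOT MBrace's real types: a deferred computation and its builder.
type ToyCloud<'T> = ToyCloud of (unit -> 'T)

// Run the deferred computation, unwrapping ToyCloud<'T> to 'T.
let runToy (ToyCloud f) = f ()

type ToyCloudBuilder() =
    // Called for `return x`: wrap a plain value in the deferred type.
    member _.Return (x : 'T) : ToyCloud<'T> = ToyCloud (fun () -> x)
    // Called for `let! x = m`: run `m`, then feed its result to the continuation `k`.
    member _.Bind (m : ToyCloud<'T>, k : 'T -> ToyCloud<'U>) : ToyCloud<'U> =
        ToyCloud (fun () -> runToy (k (runToy m)))

let toyCloud = ToyCloudBuilder()

// Written with computation expression syntax...
let sugared = toyCloud {
    let! a = toyCloud.Return 20
    let! b = toyCloud.Return 22
    return a + b
}

// ...and roughly what the compiler turns it into: nested Bind calls,
// where everything after each let! becomes the continuation.
let desugared =
    toyCloud.Bind(toyCloud.Return 20, fun a ->
        toyCloud.Bind(toyCloud.Return 22, fun b ->
            toyCloud.Return (a + b)))

printfn "%d %d" (runToy sugared) (runToy desugared)
```

Both values compute the same result. A real builder such as the one in MBrace keeps this shape but puts its own logic inside Bind.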

Looking a little closer at how the distribution occurs, the logic of the Bind method inside MBrace is as follows. First, it obtains the execution context and the continuation. MBrace contains a distributed runtime provider definition, so for parallel execution of workflows their execution context should be passed an instance of ICloudRuntimeProvider. Next, the method checks whether cancellation has been requested and, if not, prepares the continuation object. Finally, it executes the specified Cloud<'T> workflow in the execution context with the given continuation. It also uses a trampolining mechanism that offloads the execution stack to the thread pool, for environments where tail calls are not optimized: it counts binds and hops to the thread pool once a threshold is reached.
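The steps above (check cancellation, prepare the continuation, run the workflow, trampoline after too many binds) can be sketched in a few lines of continuation-passing F#. This is a toy model under stated assumptions, not MBrace's internals: the Continuation, Context and Workflow types, the threshold of 100, and all function names are invented for illustration.

```fsharp
open System.Threading

// Toy model, NOT MBrace's internals: all names and the threshold are assumptions.
type Continuation<'T> =
    { Success : 'T -> unit
      Cancelled : unit -> unit }

type Context = { Token : CancellationToken }

// A workflow is a function of the execution context and a continuation.
type Workflow<'T> = Workflow of (Context -> Continuation<'T> -> unit)

let bindThreshold = 100
// Binds executed on the current thread since its last hop to the thread pool.
let bindCount = new ThreadLocal<int>()

let ret (x : 'T) : Workflow<'T> = Workflow (fun _ cont -> cont.Success x)

let bind (Workflow run) (f : 'T -> Workflow<'U>) : Workflow<'U> =
    Workflow (fun (ctx : Context) (cont : Continuation<'U>) ->
        // 1. Stop early if cancellation was requested.
        if ctx.Token.IsCancellationRequested then cont.Cancelled ()
        else
            // 2. Prepare the continuation that runs the rest of the workflow.
            let onSuccess (x : 'T) =
                let (Workflow rest) = f x
                bindCount.Value <- bindCount.Value + 1
                if bindCount.Value > bindThreshold then
                    // 3. Trampoline: continue on a fresh thread pool stack.
                    bindCount.Value <- 0
                    ThreadPool.QueueUserWorkItem(fun _ -> rest ctx cont) |> ignore
                else
                    rest ctx cont
            let next = { Success = onSuccess; Cancelled = cont.Cancelled }
            // 4. Execute the workflow with the prepared continuation.
            run ctx next)

// Run a workflow and block until it succeeds (Some result) or is cancelled (None).
let runSync (Workflow run) (token : CancellationToken) : 'T option =
    let finished = new ManualResetEventSlim(false)
    let result = ref None
    run { Token = token }
        { Success = (fun x -> result.Value <- Some x; finished.Set())
          Cancelled = (fun () -> finished.Set()) }
    finished.Wait()
    result.Value

// A deep chain of binds; the stack stays shallow thanks to the trampoline.
let rec countDown (n : int) : Workflow<int> =
    if n = 0 then ret 0 else bind (ret n) (fun _ -> countDown (n - 1))

let completed = runSync (countDown 100000) CancellationToken.None

let cts = new CancellationTokenSource()
cts.Cancel()
let cancelled = runSync (countDown 10) cts.Token
```

Without the hop in step 3, each bind in countDown would add frames to the same stack, so a long enough chain could overflow it on runtimes that do not eliminate tail calls.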

The Bind method is one interesting piece. Another is the Cloud.Parallel combinator used in the first example:

let! lengths = Cloud.Parallel lengthsJob

We pass a sequence of computations to Cloud.Parallel and expect it to parallelize the work somehow and hand back the finished results.

How does it parallelize?

How does it make that happen? First, it obtains the runtime object resolved from the resources registered in the execution context. Then it passes the computations to be performed in parallel to the ScheduleParallel method of the runtime object, which is implemented as a parallel fork/join. Finally, it runs the workflow and preserves exception information by appending a function information entry to the symbolic stack trace in the exception continuation.
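For readers unfamiliar with the fork/join pattern mentioned above, here is a minimal local sketch. It is not MBrace's ScheduleParallel: forkJoin is a hypothetical helper that runs plain thunks on local threads rather than on cluster workers, and it leaves out the exception handling described in the text.

```fsharp
open System.Threading

// Toy fork/join, NOT MBrace's ScheduleParallel: jobs are plain thunks
// executed on local threads instead of worker machines.
let forkJoin (jobs : (unit -> 'T) []) : 'T [] =
    let results = Array.zeroCreate<'T> jobs.Length
    // Fork: start one thread per job; each writes into its own result slot.
    let threads =
        jobs
        |> Array.mapi (fun i job ->
            let t = Thread(ThreadStart(fun () -> results.[i] <- job ()))
            t.Start()
            t)
    // Join: wait until every job has finished before returning.
    for t in threads do
        t.Join()
    // Results come back in the same order as the input jobs.
    results

let lengths = forkJoin [| (fun () -> 1 + 1); (fun () -> 2 * 3); (fun () -> 10) |]
printfn "%A" lengths
```

The caller sees only an array of results in input order, which is the same contract the text describes for Cloud.Parallel: hand over a set of jobs, get the finished values back.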

A short summary...

So, going back to our first example, you now know what the cloud block means and how it works. You know that special rules apply within the cloud block: when the compiler meets a keyword with an exclamation mark, like let!, it triggers the Bind method for the given operation. You also know what Cloud.Parallel does with the sequence of jobs it receives, and that because of the "exclamation mark" keywords, results are unwrapped from the Cloud<'T> type of deferred execution to the concrete type 'T.

Let's commit it to memory! Cloud workflows:

  • Contain actions for deferred execution in the cloud
  • Use a set of worker machines to perform jobs
  • Implemented inside MBrace Framework
  • Built as F# computation expressions/monads
  • Inspired by async workflows
  • Use CloudBuilder methods to handle side effects
  • Operate the Cloud<'T> wrapper type

What's next?

If you are curious how to start using MBrace, jump into the next part!


19.05.2015
|
fsharp cloud big data mbrace functional programming