# Getting Started with Parallel Processing in Rust

Rust helps mitigate concurrency hazards by design, but it's still up to the programmer to construct their program logic thoughtfully so they can take advantage of the power of concurrent and parallel processing.

Rust is a systems programming language that is quickly gaining traction at well-known companies including Amazon, Discord, Dropbox, Meta, Alphabet, and Microsoft. It is built for performance, reliability, and productivity and has been voted the most loved programming language according to Stack Overflow's Annual Developer Survey since 2016. Some large-scale commercial projects that have been built using Rust include:

- Mozilla's Servo parallel browser engine
- Discord's Read States service
- Polkadot’s Substrate blockchain engine
- Figma's Multiplayer service

All of these real-world use cases of Rust utilize and benefit from concurrent and parallel processing, which can be daunting to implement on a good day, and pretty terrifying when implemented badly. Rust helps mitigate concurrency hazards by design, but it's still up to the programmer to construct their program logic thoughtfully so they can take advantage of the power of concurrent and parallel processing.

**When should I use concurrent or parallel processing, instead of serial processing?**

Most modern processors have multiple cores to work with, which means you can use these cores to achieve significant performance gains:

- When you have a lot of independent computations to process, such as a giant for-loop.
- When some of your threads contain computations that are particularly lengthy to calculate. It's nice to run these on the "backburner" without blocking your program from performing other computations.
- When you have low parallel overhead

**How do I implement parallel processing in Rust?**

My favorite way to learn new programming languages is by combining it with my love for math and solving problems in Project Euler. To demonstrate parallelization in Rust, let's solve a simple problem that I tweaked slightly so we can focus on the implementation of our solution:

*If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23. Find the sum of all the multiples of 3 or 5 below 1,000,000.*

**Solution Methodology**

While the mathematically elegant solution would be to use an arithmetic series, let's just focus on the simple solution, which is to figure out if each number in the range is divisible by 3 or 5. If it is, let's add it to a running sum we're keeping track of.

**Example 1 - No parallelization**

```
fn euler1_unpar(input: i32) -> i64 {
let mut sum: i64 = 0;
for i in 1..input {
if i % 3 == 0 || i % 5 == 0 {
sum += i as i64;
}
}
sum
}
```

**Code walkthrough:** In this example, we accept an `input`

, and iterate on every number between `1..input`

to determine if it is divisible by 3 or 5 using the modulo `%`

operator. If it is, then add the value to a running `sum`

we're keeping track of. At the end, return the sum. In Rust, you can return a value by simply calling it without a semicolon after the expression. Since Rust is a strongly-typed language, we need to tell the compiler to add the original `i32`

input and convert the sum to an `i64`

, so that we have enough space to store the answer.

Let's calculate a performance benchmark for this function so we can compare it to our multithreaded optimization that we'll write next. We can calculate this benchmark with the easybench crate, an importable package in Rust.

```
use easybench::{bench};
let input = 1000000;
println!("euler1_unpar: {}", bench(|| euler1_unpar(input) ) );
>> euler1_unpar: 14.429298ms (R²=0.999, 70 iterations in 21 samples)
```

Our unparallelized function takes about `14.4`

milliseconds to execute.

**Example 2 - Parallelized (2 threads)**

```
fn euler1_par(input: i32) -> i64 {
use std::thread;
let handle1 = thread::spawn(move || {
let mut thread1_sum: i64 = 0;
for i in 1..input / 2 {
if i % 3 == 0 || i % 5 == 0 {
thread1_sum += i as i64;
}
}
thread1_sum
});
let handle2 = thread::spawn(move || {
let mut thread2_sum: i64 = 0;
for i in (input / 2)..input {
if i % 3 == 0 || i % 5 == 0 {
thread2_sum += i as i64;
}
}
thread2_sum
});
handle1.join().unwrap() + handle2.join().unwrap()
}
```

**Code walkthrough:** Here, we use the `thread`

module so that we can take advantage of the native multithreading available in Rust. A new thread is created by calling `thread::spawn`

, into which a `closure`

is passed. Closures are anonymous functions that allow you to access environment variables, such as the `input`

variable. This closure does the same mathematical computation as `euler1_unpar`

, except we only process one half of the total range in the thread. The other half is saved for the second thread. We also need to `move`

a copy of the input into the closure's scope so that the thread can take ownership of the data and use it. Writing code like this can seem tedious and time-consuming, but is required by Rust to help reduce the risk of concurrency errors.

`thread::spawn`

returns a `JoinHandle`

type, which contains some convenience methods that allow us to take back control over the threads and handle their results. In this case, `JoinHandle::join()`

halts execution of the function until our threads have finished their calculations. `.unwrap()`

exposes the answers we've calculated in each thread, and then finally we sum those answers up.

Let's see how long this function takes to run:

```
use easybench::{bench};
let input = 1000000;
println!("{}", euler1_par(input));
> euler1_par: 7.345441ms (R²=0.998, 133 iterations in 27 samples)
```

The parallelized function takes about `7.3`

milliseconds to execute.

**Conclusion**

The parallelized code runs almost twice as fast as our unparallelized code, and we seem to only lose a little performance due to the overhead of setting up the threads. Nice!

This example demonstrates a way to get started with parallel processing in Rust. You often need to design your program with parallelization in mind from the get-go, as you are forced to think about the flow of your code and determine what pieces of the code take the longest to run and would benefit from parallelization.