There are cases when a single process needs to be started multiple times in parallel, e.g. starting the same sequence on multiple sequencers of a given UVC. The most convenient way to do this is a for/foreach loop with the process wrapped in a fork-join_none block. However, this can yield unexpected results if not used with care. Here's a simple example that illustrates such a case:
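The original listing is not shown here; the following is a minimal reconstruction, assuming a five-iteration loop and a $display of the loop variable ind (both referenced later in the text):

```systemverilog
task example_task();
  for (int ind = 0; ind < 5; ind++) begin
    fork
      // spawned thread: intended to print this iteration's value of ind
      $display("ind = %0d", ind);
    join_none
  end
endtask
```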

The expected result after calling example_task is:
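The original log listing is missing; assuming a display format of "ind = %0d", the intended output would be one line per iteration:

```
ind = 0
ind = 1
ind = 2
ind = 3
ind = 4
```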

But the actual result is:
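The original log listing is missing; per the explanation that follows, all five threads display the final value of the iterator:

```
ind = 5
ind = 5
ind = 5
ind = 5
ind = 5
```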

The SystemVerilog LRM states the following about fork-join_none blocks (Page 97, Table 9-1, "fork-join_none" row):
“The parent process continues to execute concurrently with all the processes spawned by the fork. The spawned processes do not start executing until the parent thread executes a blocking statement.”
What this means is that the processes in a fork-join_none block start only once the block calling them (i.e. the parent process) either reaches its end or blocks on a delay/event/wait operator or any other blocking statement. In the example, the consequence is that the for loop first runs to completion (ind reaches 5) and only then do all the forked threads created in the loop actually start, which is why all $display calls show ind = 5. Adding another display statement that logs the current value of ind at the start of each loop iteration confirms this effect:
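A sketch of this diagnostic version (the message texts are illustrative):

```systemverilog
task example_task();
  for (int ind = 0; ind < 5; ind++) begin
    // executes immediately, in the parent process
    $display("loop: ind = %0d", ind);
    fork
      // deferred until the parent process blocks or finishes
      $display("fork: ind = %0d", ind);
    join_none
  end
endtask
```

The "loop" messages show ind incrementing from 0 to 4 as the loop runs, while all five "fork" messages print only afterwards, each showing ind = 5.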

From the LRM explanation, it follows that a workaround for this could be to just add a delay operator before the end of each loop iteration:
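A sketch of this workaround, assuming the same example task as before:

```systemverilog
task example_task();
  for (int ind = 0; ind < 5; ind++) begin
    fork
      $display("ind = %0d", ind);
    join_none
    // blocking statement: the spawned thread runs now,
    // while ind still holds this iteration's value
    #1;
  end
endtask
```

Because the parent blocks at the #1 before the loop increments ind, each spawned thread reads the correct value (0 through 4), but the task now consumes simulation time.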

This is close to the desired result, but the use of a time delay operator can be very inconvenient. The trick is to use an automatic variable that copies the current value of the iterator at the start of each thread. The following code gives the originally intended result:
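A sketch of this fix, using the automatic variable a_ind described in the explanation below:

```systemverilog
task example_task();
  for (int ind = 0; ind < 5; ind++) begin
    fork
      // a fresh copy of a_ind is created and initialized
      // each time the fork block is entered
      automatic int a_ind = ind;
      $display("ind = %0d", a_ind);
    join_none
  end
endtask
```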

This works because of the way declaring and initializing automatic variables works in SystemVerilog:
Every time a process block (begin-end or fork-join) is entered, separate memory is allocated for each automatic variable declared at the start of that block; every time the block finishes, the automatic variables local to it are deallocated. In this example, the fork-join_none block is entered 5 times, so 5 copies of a_ind are allocated, one per iteration. Each spawned thread gets its own copy of a_ind, assigns the current iterator value to it, and uses it in the display statement. This is in contrast to the first example, where all the threads share the same ind variable and therefore show the same value.
Here is an example of where this can be used in practice. Let's say we have a master SPI agent connected to multiple SPI slave ports of a DUT, and we want to send a read command to each of the slaves at the same time. For read commands, we use an array read_seq of spi_slave_read_seq sequences. Each slave port has a distinct id (slave_id). The following would be a correct way to instantiate, randomize and start each read sequence:
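The original listing is missing; the following is a hedged sketch using the names given in the text (read_seq, spi_slave_read_seq, slave_id). The sequencer handle and the assumption that slave_id is a rand field of the sequence are illustrative, not taken from the original:

```systemverilog
spi_slave_read_seq read_seq[];  // sized elsewhere, one entry per slave port

foreach (read_seq[i]) begin
  fork
    // per-thread copy of the loop index, initialized at fork entry
    automatic int a_i = i;
    begin
      read_seq[a_i] = spi_slave_read_seq::type_id::create(
                        $sformatf("read_seq_%0d", a_i));
      if (!read_seq[a_i].randomize() with { slave_id == a_i; })
        `uvm_error("READ_SEQ", "randomization failed")
      // sequencer handle name is hypothetical
      read_seq[a_i].start(env.spi_master_agent.sequencer);
    end
  join_none
end
wait fork;  // optionally block until all reads complete
```

Without the automatic copy a_i, all the spawned threads would randomize and start the same array entry with the final value of i, just as in the simple example.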

Extra note: even though the threads in the examples execute in parallel, their results appear in the log in some sequential order. This order does not necessarily follow the for loop and may depend on the simulation tool used. Running the example with Synopsys VCS showed the results in increasing order:
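The original log is not shown; assuming the display format used in the examples, an increasing-order log would look like:

```
ind = 0
ind = 1
ind = 2
ind = 3
ind = 4
```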

While Cadence xrun displayed the results in descending order:
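Again assuming the same display format, a descending-order log would look like:

```
ind = 4
ind = 3
ind = 2
ind = 1
ind = 0
```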
