Updated: Oct 13, 2021
Collaborators: Daniel Hallmark, Aaron Jackson & Ryan Johnston
Those of you who have read our previous blog entries might have gathered :) that we're big
proponents of open-source workflow and process automation engines like Flowable (https://flowable.com) and Camunda (https://camunda.com). One of the primary reasons for that is their ability to easily handle very large volumes of process instances without breaking a sweat. Although there are competitive engines that are specifically designed to handle truly massive volumes of transactions (e.g. Netflix Conductor (https://github.com/Netflix/conductor)), the reality is that open-source engines like Camunda and Flowable effectively combine extensive and superior capability sets with the ability to handle volumes of transactions that far exceed the needs of the vast majority of workflow and process automation users. That makes them an excellent starting point for anyone looking to incorporate process automation capabilities into their enterprise.
Given the competition from other platforms, however, and the need to be able to process larger and larger amounts of data in today's modern enterprises, the open-source platform vendors are always looking for opportunities to improve their performance at scale. The Flowable team seems to have done exactly that in recent versions with their new global acquire lock feature, which has been tested & added to recent versions of Flowable Enterprise (https://flowable.com/products/) and recent branches of the open-source code (the flowable-release-6.6.1 and flowable-release-6.6.2 branches at https://github.com/flowable/flowable-engine). Joram Barrez - from Flowable - covered the change in detail in a series of blog entries starting with this one: https://blog.flowable.org/2021/04/14/handling-asynchronous-operations-with-flowable-part-1-introducing-the-new-async-executor/. In those entries, Joram does an excellent job of explaining the functionality of the Async Executor. If you aren't familiar with it or its related concepts, we encourage you to read through at least the first entry in the series to become acquainted with its capabilities and the reasons behind its use.
How does their new global acquire lock improve performance? At a high level, it improves
efficiency in clustered setups by eliminating concurrency during the job acquisition cycle. Once jobs have been acquired by the Flowable/Async Executor instances in a cluster, those jobs are still processed concurrently by multiple Async Executors at a time. Here's a diagram from Joram's series of blog entries that depicts how this works:
The little process model in the upper left depicts a process instance creating asynchronous jobs or timer jobs in the appropriate tables, whereas the boxes on the right depict four separate Flowable/Async Executor instances. The lock image in the middle shows that each Async Executor instance has to acquire a global lock before pulling a set of jobs to execute. The theoretical bottom line? They're eliminating concurrency where it creates a bottleneck to maximize beneficial concurrency, thus increasing overall performance. Does it work? We decided to find out by running our own set of performance tests. :)
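The scheme in the diagram can be sketched in a few lines of code. The sketch below is purely conceptual - it is not Flowable's implementation, and the "job table" is just a Python list with threads standing in for cluster nodes - but it shows the key idea: the acquisition step is serialized behind one cluster-wide lock, while execution of the acquired jobs remains fully parallel:

```python
import threading

def run_cluster(jobs, num_workers, use_global_lock, page_size=4):
    """Toy model of clustered async executors pulling jobs from a shared table.

    With use_global_lock=True, only one worker at a time may run the
    acquisition query; job *execution* stays concurrent either way.
    (Conceptual sketch only -- not Flowable's actual implementation.)
    """
    pending = list(jobs)                 # stand-in for the shared job table
    table_guard = threading.Lock()       # guards access to the "table"
    global_acquire_lock = threading.Lock()
    processed, processed_guard = [], threading.Lock()

    def acquire_page():
        # Pull up to page_size jobs out of the shared table atomically.
        with table_guard:
            page = pending[:page_size]
            del pending[:page_size]
            return page

    def worker():
        while True:
            if use_global_lock:
                # Serialize acquisition cluster-wide...
                with global_acquire_lock:
                    page = acquire_page()
            else:
                page = acquire_page()
            if not page:
                return  # table is empty, this executor is done
            # ...but "execute" the acquired jobs concurrently across workers.
            with processed_guard:
                processed.extend(page)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed
```

Either way every job is processed exactly once; the difference the lock makes is in how much the workers contend with each other during acquisition, which is where the real-world performance gap comes from.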
Testing a change like this requires some serious horsepower, so we decided to leverage Amazon Web Services, which gave us the flexibility to stand up a cluster of up to 8 nodes. That 8-node cluster uses c5.2xlarge nodes, each of which has 8 virtual CPUs and 16 GB of RAM. The database is a single high-performance RDS PostgreSQL 13.1 instance running on a db.m6g.8xlarge node. Here is a high-level, logical diagram of the environment:
Using that environment, we installed two open-source versions of Flowable: (1)
version 6.6.0, which does not have the global acquire lock feature, and (2) a build from the flowable-release-6.6.1/6.6.2 branches mentioned above, which does have the global acquire lock feature. Using those two versions, we ran almost 50 individual tests with varying settings, including different numbers of nodes & threads. (As a quick side note, we wanted these tests to provide "real world" numbers, so we didn't take any steps to inflate them. For instance, we left the history level setting at "full", because most customers need that history data for reporting or compliance purposes.) We could obviously fill a fairly large book with our findings and takeaways, but we're going to focus on three important sets of observed results. The first batch of results illustrates our findings when the tests were run with no-op jobs, i.e. jobs that don't perform any specific operations, and a configured maximum of 8 Async Executor threads/node:
NOTE: All of the settings used in these three tests were consistent save for the job acquisition page size and the global locking setting.
Using version 6.6.0 - where the global acquire lock isn't available - and the default job
acquisition page size of 1, we were able to process 160 jobs per second, a good number that equates to almost 10,000 jobs per minute. Using the newer branch build with global acquire locking disabled and a job acquisition page size of 256, we were able to process 420 jobs per second. Finally, using the newer branch build with global acquire locking enabled and a job acquisition page size of 256, we were able to process 2,450 jobs per second, a 15-fold increase over 6.6.0 with its default page size of 1.
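For anyone who wants to double-check the arithmetic behind those headline figures, the numbers work out as follows:

```python
# Sanity-check the headline numbers from the no-op job tests above.
baseline_jps = 160   # jobs/sec: 6.6.0, acquisition page size 1
locked_jps = 2450    # jobs/sec: global acquire lock enabled, page size 256

jobs_per_minute = baseline_jps * 60
speedup = locked_jps / baseline_jps

print(jobs_per_minute)    # 9600  -> "almost 10,000 jobs per minute"
print(round(speedup, 1))  # 15.3  -> the "15-fold increase"
```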
Our next batch of results illustrates the relative performance with and without global locking
enabled in the newer branch build, using two different job execution time ranges to approximate the results that
could be observed in real-world environments:
NOTE: All of the settings used in these four tests were consistent save for the global locking
setting and the job duration.
In the tests above, which were also run with a maximum of 8 Async Executor threads/node, we notice that the throughput numbers are closer, meaning that the performance advantage of running with the global acquire lock enabled is smaller. We believe this is a result of less job acquisition contention when global locking is disabled, as the Async Executors aren't making as many acquisition attempts in any given time window due to the longer job duration. However, the tests that were run with global locking enabled still maintained a noticeable performance advantage, illustrating that the global acquire lock setting provides improved performance across a range of varying job profiles.
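A rough back-of-the-envelope model shows why longer jobs reduce acquisition pressure. The inputs below are illustrative assumptions (they are not measured values from our tests), but the shape of the relationship holds: acquisition traffic scales with the job completion rate, so jobs that run 100x longer generate roughly 100x fewer acquisition queries:

```python
def acquisition_queries_per_sec(nodes, threads_per_node, avg_job_secs, page_size):
    """Rough upper bound on acquisition queries/sec hitting the database.

    Illustrative model with made-up inputs, not measured values: each thread
    completes about 1/avg_job_secs jobs per second, and a new page is fetched
    once per page_size completed jobs.
    """
    jobs_per_sec = nodes * threads_per_node / avg_job_secs
    return jobs_per_sec / page_size

# 8 nodes x 8 threads, page size 256: no-op jobs (~1 ms) vs. ~100 ms jobs
print(acquisition_queries_per_sec(8, 8, 0.001, 256))  # ~250 queries/sec
print(acquisition_queries_per_sec(8, 8, 0.1, 256))    # ~2.5 queries/sec
```

With fewer acquisition attempts per second, there are simply fewer opportunities for two nodes to collide, which is consistent with the narrower gap we observed between the locked and unlocked configurations on longer-running jobs.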
Finally, we increased the maximum number of Async Executor threads per node to 64 and 128 for a final set of four tests, and here are the results we observed:
Interestingly, we achieved the same results with global locking enabled and disabled with 64
threads and this set of longer-running jobs. In this case, the reduced number of job acquisitions as a result of the longer job execution times seems to be below the limit where contention becomes an issue (with global locking disabled) at this thread count and with this number of nodes. However, when we bumped the thread pool up to 128 threads per node, we noticed a linear increase in the job throughput with global locking enabled and a significant reduction with global locking disabled. As you might guess, contention was the issue here; we saw numerous database deadlocks at 128 threads/node and with global locking disabled, apparently leading to that reduced performance. At this higher volume and with this higher thread count on our 8 nodes, global locking really shines.
We mentioned database deadlocks in the paragraph above. We did encounter those deadlocks fairly often in our testing, but they only occurred when running with global locking disabled and at higher job acquisition volumes. This suggests that our database server - which was running on very powerful hardware! - was having significant issues keeping up with the large volume of job acquisition attempts. We suspect that this likely happens often in very high volume environments in the real world, and global locking seems to have solved this problem in our testing.
A quick word on the job acquisition page size... As we increase that number with global locking disabled, we see more exceptions - called optimistic locking exceptions, or OLEs for short - in the log. These exceptions aren't inherently bad; in fact, the system is built to recover from them seamlessly. However, there is obviously some overhead associated with retrying transactions when OLEs are encountered. Since these exceptions do have an impact on the system, the job acquisition page size should generally be kept lower in environments without the global acquire lock enabled. A few of our tests listed in the tables above use higher job acquisition page sizes than we would normally recommend with global locking disabled, but those results give you an idea of the maximum throughput that might reasonably be achieved in that configuration.
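To make the OLE mechanics concrete, here is a minimal sketch of the optimistic-locking pattern (again purely conceptual, not Flowable's code): an executor reads a page of candidate jobs at a given revision, then claims them only if nothing has changed in the meantime. The larger the page, the greater the chance that another node has already touched one of its rows, forcing a retry:

```python
def try_acquire_page(job_table, candidate_ids, expected_rev):
    """Optimistically claim a page of jobs by bumping each row's revision.

    Fails the whole page if any row changed since it was read -- the
    analogue of an optimistic locking exception (OLE). Conceptual sketch
    of the optimistic-locking pattern, not Flowable's implementation.
    """
    if any(job_table[j] != expected_rev for j in candidate_ids):
        return False  # another node got there first -> retry the acquisition
    for j in candidate_ids:
        job_table[j] += 1  # the bumped revision marks the job as claimed
    return True

# Two executors read the same 4-job candidate page at revision 0...
job_table = {job_id: 0 for job_id in range(8)}
page = [0, 1, 2, 3]
print(try_acquire_page(job_table, page, expected_rev=0))  # True: executor A wins
print(try_acquire_page(job_table, page, expected_rev=0))  # False: executor B hits an "OLE"
```

The retry itself is cheap, but at high page sizes and high acquisition rates those wasted round trips add up - which is exactly the overhead the global acquire lock avoids by not letting two executors race for the same page in the first place.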
So, bottom line, does it work? Overall, our results were consistent with those gathered and reported by the Flowable team. Based on those test results and the multiple different scenarios that we ran, we can say that the answer is an emphatic yes and that this is a very good step forward in improving the performance of Flowable. Not only does it increase throughput, but it seems to do so in a way that reduces load on the back-end relational database and thus makes the overall environment more stable. We would recommend that anyone currently using Flowable in a high-volume environment consider upgrading either to an open-source Flowable release that includes the new global acquire lock feature or - if possible - to Flowable Enterprise 3.9.0 or higher, which also offers it.
If you need more information on the above or would like more details regarding the tests we ran and the results we gathered, please don't hesitate to contact us at firstname.lastname@example.org.
Additionally, we would be happy to help if you're interested in seeing if Flowable open-source or Flowable Enterprise is right for your organization.