>STM is an optimistic concurrency system. This means that threads never block waiting for locks. Instead, each concurrent operation proceeds, possibly in parallel, on their own independent transaction log. Each transaction tracks which pieces of data it has accessed or mutated and if at commit time it is detected that some other transaction has been committed and altered data which this transaction also accessed, then the latter transaction is rolled back and is simply retried.
I already foresaw the problem (and it gets mentioned later) that if you have many small, frequent operations, they will prevent a big, long operation from happening, because they will always change the state and cause conflicts before the big one can finish. You can easily code yourself an app that will softlock forever.
The post doesn't offer a good solution (it talks about one where there's a tradeoff, but you don't need tradeoffs for this)
The way this gets fixed is to make the lock acquisition (or in this case, the priority of the merge?) preemptive (Wound-Wait).
All transactions have a global, ever-incrementing number attached to them, their ID, which we call "seniority". If you try to get a lock and the lock is held by a transaction with lower seniority (=> a higher ID), you kill that transaction, take the lock, and once you're done the transaction you killed is allowed to retry.
In the meantime, if a transaction with lower seniority tries to get a lock held by a more senior one, it blocks and waits.
This ensures that your program will always finish.
In the case of "lots of frequent small transactions" + "one big, long report", the report will get killed a few times, until your report inevitably becomes the most senior transaction to ask for this resource, and is allowed to complete.
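A minimal sketch of that wound-wait decision rule (in Rust; the Txn type and the lock-table bookkeeping around it are hypothetical, this is just the rule itself):

    // Lower id = started earlier = more senior.
    struct Txn { id: u64 }

    enum LockDecision {
        Granted,      // nobody holds the lock
        WoundHolder,  // requester is more senior: kill the holder and take the lock
        Wait,         // requester is less senior: block until the lock is released
    }

    fn request_lock(requester: &Txn, holder: Option<&Txn>) -> LockDecision {
        match holder {
            None => LockDecision::Granted,
            Some(h) if requester.id < h.id => LockDecision::WoundHolder,
            Some(_) => LockDecision::Wait,
        }
    }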
databases also solve this problem by allowing the read-only transaction to see a consistent snapshot of the data at some point in time, so the reader can do its work without needing a logical mutex.
With default settings they let other processes' commits cut into the middle of your read (see READ_COMMITTED). I was flabbergasted when I read this.
When you turn MSSQL up to a higher consistency level you can get it to grind to a halt and throw Deadlock exceptions on toy code. (Moving money between accounts will do it).
I used to advertise STM as like ACID-databases for shared-memory systems, but I guess it's even more consistent than that.
Postgres's `set transaction isolation level serializable` doesn't result in deadlocks (unless you explicitly grab locks) but does require your application to retry transactions that fail due to a conflict.
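The retry part is mechanical enough to wrap once. A hedged Rust sketch, where the closure and the SerializationFailure type are stand-ins for whatever your database driver actually exposes (not a real client API):

    // Run one attempt of a serializable transaction until it commits.
    struct SerializationFailure;

    fn with_retries<T>(mut attempt: impl FnMut() -> Result<T, SerializationFailure>) -> T {
        loop {
            match attempt() {
                Ok(value) => return value,
                // Conflict detected at commit: just run the whole transaction again.
                // (A real version would cap the attempts and back off.)
                Err(SerializationFailure) => continue,
            }
        }
    }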
Yeah, STMs can use MVCC in the same way, but that doesn't solve the long-transaction problem for bulk update transactions. The first example in Gray and Reuter is adding monthly interest to every savings account, exactly once.
Long-running transactions are a really fundamentally bad idea if the length of the transaction scales with the number of accounts/users. These things need to be batched up in such a way that code with at-least-once semantics still solves the problem correctly. (In one TX, handle just a batch of accounts: calculate the interest, record that these accounts already got the interest, and move on to the next batch.)
The problem with STM (or any other concurrency control mechanism) is that it's not its job to solve these issues (they usually require conscious product design), so they are not going to be solved by "just use STM".
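An in-memory Rust sketch of that batching idea (plain maps stand in for real storage; in a real system each loop iteration would be one short database transaction committing the credits and the markers together):

    use std::collections::{BTreeMap, BTreeSet};

    // Each pass handles at most `batch_size` accounts, credits interest, and records
    // them as done, so re-running the whole job never credits an account twice.
    fn add_monthly_interest(
        balances: &mut BTreeMap<u64, f64>,
        credited: &mut BTreeSet<u64>,
        rate: f64,
        batch_size: usize,
    ) {
        loop {
            // "One TX": pick a batch of accounts that haven't been credited yet.
            let batch: Vec<u64> = balances
                .keys()
                .copied()
                .filter(|id| !credited.contains(id))
                .take(batch_size)
                .collect();
            if batch.is_empty() {
                break;
            }
            for id in batch {
                *balances.get_mut(&id).unwrap() *= 1.0 + rate;
                credited.insert(id); // the "already got interest" marker
            }
            // In a real system the batch update + markers would commit atomically here.
        }
    }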
There are some different approaches, but that's definitely one of them. However, it's not without its tradeoffs; you're sacrificing atomicity, in the sense that if updating some account throws an error, the already-updated accounts stay updated.
I think STM looks pretty cool in toy examples, but in practice it's pretty bad. Very difficult for me to make strong logical argument about this, just based on how it feels.
In Clojure, there are first-class STM primitives with retriable transactions in the standard lib, but every time I tried to use them, the code was slow, difficult to reason about, full of "colored" functions, and non-composable.
Also, unintuitively, the code starts immediately smelling slightly "wrong", as if I tried to put into code my first childish thoughts instead of actually thinking about how the system _should_ work. Like the notorious bank accounts example: yeah, cool, two accounts' balances are TVars — but what if I have 10M accounts? Is the list of accounts also a TVar since sometimes new accounts are created? Are indexes on the list also TVars? How do you persist the lists, how do you synchronize slow persistence and in-memory transactions? How do you synchronize all the deposits with a backup/failover machine? As you continue trying to untangle all this with STM, you start drowning in complexity, and encountering huge contention among transactions touching many parts of your systems, and spewing countless retry logs.
It's not only my own experience: I believe it's widely accepted in the Clojure community that nobody actually uses STM, and instead uses simple atomic updates or queues.
I have a similar experience, though it's not necessarily related to shortcomings in the STM implementation (which I always felt was quite competent in the space). To expand, I think a few things are true:
1. I've come to feel that actually managing concurrency is a "below the fold" competency for most engineers. I'm not saying it should be; people do need to understand what a data race is.
2. As you pointed out, most state in services that I work on is not in need of coordinated updates, so atoms are where you want to be, or single-writer queues. 100% agreed
3. Structured concurrency approaches, fibers, and other abstracted runtimes seem to be the best way to get concurrent and parallel work, given points 1 and 2 about how coordinated state really isn't that common for most work on my team
A lot of the time when we're updating state, we're really having to think about it in the database, employing either a last-write-wins strategy or, when coordination is needed, some form of pessimistic or optimistic locking.
I chalk this up to problems with Clojure's STM API specifically rather than STM generally, e.g. Haskell's STM implementation is considerably more useful.
I don't have any experience with STMs in other languages, but as far as I can see, the ideas are very similar in Haskell and Kotlin implementations, for example, so I guess the downsides are the same as well
I found that a lot of the problems I had been having with mutexes stem from the fact that traditionally the mutex and the data it protects are separate.
Bolting them together, like Rust's Mutex<T> does, solves a lot of these problems. It lets you write normal, synchronous code and leave the locking up to the caller, but without making it a nightmare. You can't even access the data without locking the mutex.
This isn't an attack on the (very well written) article though. Just wanted to add my two cents.
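For anyone who hasn't seen it, a minimal toy sketch of what that buys you (not from the article):

    use std::sync::{Arc, Mutex};
    use std::thread;

    // The data lives *inside* the Mutex: the only way to reach it is through the
    // guard returned by lock(), so "forgot to lock" simply doesn't compile.
    fn main() {
        let balance = Arc::new(Mutex::new(100_i64));

        let handles: Vec<_> = (0..4)
            .map(|_| {
                let balance = Arc::clone(&balance);
                thread::spawn(move || {
                    let mut guard = balance.lock().unwrap();
                    *guard += 25;
                    // guard dropped at end of scope -> mutex unlocked
                })
            })
            .collect();

        for h in handles {
            h.join().unwrap();
        }
        println!("final balance = {}", *balance.lock().unwrap());
    }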
Mutexes suffer from a host of problems, and imo are not a very good concurrency primitive - they were designed to turn single-threaded code into multi-threaded. With today's 8+ cores in most systems, a single point of contention usually becomes a problem quickly.
They're liable to deadlocks/livelocks, and sometimes not only with other explicitly Mutex-like things (it might happen that some library you use has a lock hidden deep inside).
They're also often backed by OS primitives (with big overheads) with inconsistent behaviors between platforms (spinlocks, waiting, etc.). We've run into an issue with .NET, where their version of Mutex didn't wake up the blocked thread on Linux as fast as on Windows, meaning we needed about 100x the time to serve a request as the thread was sleeping too long.
There are questions like when to spin and when to go to sleep and wait, which unfortunately the developer has to answer.
Not assigning blame here, just pointing out that threading primitives and behaviors don't translate perfectly between OSes.
Multi-threading is hard, other solutions like queues suffer from issues like backpressure.
That's why I'm skeptical about Rust's fearless concurrency promise - none of these bugs are solved by just figuring out data races - which are a huge issue, but not the only one.
Your view on mutex performance and overhead is outdated, at least for the major platforms: The Rust standard library mutex only requires 5 bytes, doesn't allocate, and only does a syscall on contention. The mutex implementation in the parking_lot library requires just 1 byte per mutex (and doesn't allocate and only does a syscall on contention). This enables very fine-grained, efficient locking and low contention.
These are OS primitives I'm talking about - I haven't checked out the standard library version, but the parking_lot version uses a spinlock with thread sleep when the wait times get too high - it has no way of getting notified when the mutex gets unblocked, nor does it do anything about priority inversion.
It seems it's optimized for scenarios with high performance compute heavy code, and short critical sections.
These assumptions may let it win benchmarks, but don't cover the use cases of all users. To illustrate why this is bad, imagine you have a Mutex-protected resource that becomes available after 10us on average. This lock spins 10 times checking if it has become available (likely <1us), then yields the thread. The OS (let's assume Linux) won't wake the thread up until the next scheduler tick, and it's under no obligation to do so even then (and has no idea it should). But even best-case, you're left waiting 10ms, which is a typical scheduler tick.
In contrast, OS-based solutions are expensive but not that expensive; let's say they add 1us to the wait. Then you would wait 11us for the resource.
A method call taking 10ms versus one taking 15us is a difference of several hundred times, which can potentially kill your performance.
You as the user of the library are implicitly buying into these assumptions which may not hold for your case.
There's also nothing in Rust that protects you from deadlocks with 100% certainty. You can fuzz them out, and use helpers, but you can do that in any language.
So you do need to be mindful of how your mutex works, if you want to build a system as good as the one it replaces.
The best practices I adopt for Rust avoid the use of mutex whenever possible, precisely because of how easy a deadlock is. It turns out it is always possible. There are entire languages that disallow any mutable state, much less shared mutable state. The question becomes how much performance are you willing to sacrifice to avoid the mutex. By starting with no shared mutable state and adding it when something is too slow, you end up with very few mutexes.
> avoid the use of mutex […] It turns out it is always possible
How would you handle the archetypical example of a money transfer between two bank accounts, in which 100 units of money need to be subtracted from one account and atomically added to another account, after checking that the first account contains at least 100 units?
The simplest pure functional way would be to copy the whole database, instantiating a new copy with the desired change if the condition was met. That obviously doesn't scale, which is where the performance thing comes in. A still pure way would be to use a persistent tree or hash array mapped trie that allows efficient reuse of the original db. There are times a purely functional approach doesn't perform well enough, but even with large scale entity component type systems in both rust and c++, the number of times I've had to use a mutex to be performant is small. Atomic is much more common, but still not common. Persistent data structures alleviate most of the need.
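The naive "copy the whole database" version is easy to picture in Rust (a toy map standing in for the db; a persistent map/HAMT makes the same shape cheap by sharing structure instead of cloning):

    use std::collections::HashMap;

    // The transfer returns a *new* map and leaves the old one untouched.
    fn transfer(db: &HashMap<String, i64>, from: &str, to: &str, amount: i64)
        -> Option<HashMap<String, i64>>
    {
        if db.get(from).copied().unwrap_or(0) < amount {
            return None; // insufficient funds: no new state produced
        }
        let mut next = db.clone();
        *next.get_mut(from)? -= amount;
        *next.entry(to.to_string()).or_insert(0) += amount;
        Some(next)
    }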
Pure or not, eventually this comes down to durability, no?
And the way to do it is to either have some kind of single point of control (a designated actor or single-threaded executor) or mark the data (i.e. use some concurrency control primitive, either wrapping the data or in some dedicated place where the executors check [like the JVM's safepoints]).
Using consistent hashing, these hypothetical accounts could be allocated to actors, and then each transaction is managed by the actor of the source (i.e. where the money is sent from, where the check needs to happen), each with its own durable WAL, and periodically these are aggregated.
(Of course, then the locking is hidden in the maintenance of the hash ring as dining philosophers are added/removed.)
Since the thread mentions Rust: in Rust, you often replace Mutexes with channels.
In your case, you could have a channel where the Receiver is the only part of the code that transfers anything. It'd receive a message Transfer { from: Account, to: Account, amount: Amount } and do the required work. Any other threads would therefore only have copies of the Sender handle. Concurrent sends would be serialized through the queue's buffering.
I'm not suggesting this is an ideal way of doing it
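A rough sketch of that in Rust, with a toy Transfer message and std's mpsc channel standing in for whatever queue you'd actually use:

    use std::collections::HashMap;
    use std::sync::mpsc;
    use std::thread;

    // One receiver thread owns all balances; everyone else only holds a Sender.
    struct Transfer {
        from: String,
        to: String,
        amount: i64,
    }

    fn main() {
        let (tx, rx) = mpsc::channel::<Transfer>();

        let owner = thread::spawn(move || {
            let mut balances: HashMap<String, i64> = HashMap::new();
            balances.insert("alice".into(), 100);
            balances.insert("bob".into(), 0);
            for msg in rx {
                // Only this thread touches the balances, so no locking is needed.
                if balances.get(&msg.from).copied().unwrap_or(0) >= msg.amount {
                    *balances.get_mut(&msg.from).unwrap() -= msg.amount;
                    *balances.entry(msg.to).or_insert(0) += msg.amount;
                }
            }
            balances
        });

        tx.send(Transfer { from: "alice".into(), to: "bob".into(), amount: 40 }).unwrap();
        drop(tx); // closing the last Sender ends the receiver's loop
        println!("{:?}", owner.join().unwrap());
    }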
What you're describing is called the "Actor model"; in your example, the receiver is an actor that has exclusive control over all bank accounts.
The actor model reaches its limits as soon as you need transactions involving two or more actors (for example, if you need to atomically operate on both the customers actor and the bank accounts actor). Then you can either pull all involved concerns into a single actor, effectively giving up on concurrency, or you can implement a locking protocol on top of the actor messages, which is just mutexes with extra steps.
No single concurrency primitive covers all use cases. I was addressing your misconceptions about mutex performance and overhead, not whether mutexes are the best solution to your particular problem.
> […] it has no way of getting notified when the mutex gets unblocked […] The OS (let's assume Linux) won't wake the thread up until the next scheduler tick, and it's under no obligation to do so even then (and has no idea it should).
You've misunderstood the parking_lot implementation. When thread B tries to lock a mutex that's currently locked by thread A, then, after spinning a few cycles, thread B "parks" itself, i.e., it asks the kernel to remove it from the Runnable task queue. On Linux, this is done using the futex syscall. When thread A unlocks the mutex, it detects that another thread is waiting on that mutex. Thread A takes one thread from the queue of waiting threads and "unparks" it, i.e., it asks the kernel to move it into the Runnable task queue. The kernel is notified immediately, and if there's a free CPU core available, will tend to dispatch the thread to that core. On a non-realtime OS, there's no guarantee how long it takes for an unblocked thread to be scheduled again, but that's the case for all concurrency primitives.
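The park/unpark handoff is easy to see with std::thread's own park/unpark (this isn't parking_lot's internals, just the same idea in miniature):

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;
    use std::thread;
    use std::time::Duration;

    // The waiting thread removes itself from the run queue; the other thread wakes it
    // explicitly as soon as the "resource" is released, instead of waiting for a tick.
    fn main() {
        let ready = Arc::new(AtomicBool::new(false));
        let ready2 = Arc::clone(&ready);

        let waiter = thread::spawn(move || {
            while !ready2.load(Ordering::Acquire) {
                thread::park(); // parked until unparked (may wake spuriously, hence the loop)
            }
            println!("woken up, resource is ready");
        });

        thread::sleep(Duration::from_millis(10)); // pretend to hold the resource
        ready.store(true, Ordering::Release);
        waiter.thread().unpark(); // immediate wakeup notification
        waiter.join().unwrap();
    }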
Unfortunately the standard library mutex is designed in such a way that condition variables can't use requeue, and so require unnecessary wakeups. I believe parking lot doesn't have this problem.
How does it avoid cache contention with just a few bytes per mutex? That is, multiple mutex instances sharing a cache line. Say I have a structure with multiple int32 counters protected by their own mutex.
Cache contention is (mostly) orthogonal to your locking strategy. If anything, fine-grained locking has the potential to improve cache contention, because
1) the mutex byte/word is more likely to be in the same cache line as the data you want to access anyway, and
2) different threads are more likely to write to mutex bytes/words in different cache lines, whereas in coarse-grained locking, different threads will fight for exclusive access over the cache line containing that one, global mutex.
@magicalhippo: Since I'm comment-rate-throttled, here's my answer to your question:
Typically, you'd artificially increase the size and alignment of the structure:
    use std::sync::Mutex;

    #[repr(align(64))]
    struct Status {
        counter: Mutex<u32>,
    }
This struct now has an alignment of 64, and is also 64 bytes in size (instead of just the 4+1 required for Mutex<u32>), which guarantees that it's alone in the cache line. This is wasteful from a memory perspective, but can be worth it from a performance perspective. As often when it comes to optimization, it very heavily depends on the specific case whether this makes your program faster or slower.
> different threads are more likely to write to mutex bytes/words in different cache lines
If you've got small objects and sequential allocation, that's not a given in my experience.
Like in my example, the ints could be allocated one per thread to indicate some per-thread status, and the main UI thread wants to read them every now and then, hence they're protected by a mutex.
If they're allocated sequentially, the mutexes end up sharing cache lines and hence lead to effective contention, even though there's almost no "actual" contention.
Yes yes, for a single int you might want to use an atomic variable but this is just for demonstration purposes. I've seen this play out in real code several times, where instead of ints it was a couple of pointers say.
The issue might be allocating the int contiguously in the first place. No language magic is going to help you avoid thinking about mechanical sympathy.
And allocating the int contiguously might actually be the right solution, if the cost of sporadic false sharing is less than the cost of wasting memory.
But the mutex encapsulates the int, so if the mutex ensured it occupied a multiple of cache lines, there would be no contention. At the very small cost of a few bytes of memory.
By not avoiding it. And a year later you get to write a blog post about how you discovered and fixed this phenomenon hitherto unknown to computer science.
Traditionally, monitors were declared together with the data they contained, and the compiler enforced that the data was not accessed outside the monitor. Per Brinch Hansen wrote a rather bitter broadside against Java's concurrency model when it came out.
> You can't even access the data without locking the mutex.
It's even nicer than that: you can actually access data without locking the mutex, because while you hold a mutable borrow to the mutex, Rust statically guarantees that no one else can acquire locks on the mutex.
Given a data item of a non-thread-safe type (i.e. not Mutex<T> etc.), the borrow checker checks that there's only ever one mutable reference to it. This doesn't so much solve concurrency as prevent multiple threads from even having the ability to access that data.
Mutex is for where you have that ability, and ensures at runtime that accesses get serialized.
The maybe unexpected point is that if you know you're the only one who has a reference to a Mutex (i.e. you have a &mut), you don't need to bother locking it; if no one else knows about the Mutex, there's no one else who could lock it. It comes up when you're setting things up and haven't shared the Mutex yet.
This means no atomic operations or syscalls or what have you.
Do you have an example? I don't program in Rust, but I imagine I'd rarely get into that situation. Either my variable is a local (in a function) in which case I can tell pretty easily whether I'm the only one accessing it. Or, the data is linked globally in a data structure and the only way to access it safely is by knowing exactly what you're doing and what the other threads are doing. How is Rust going to help here? I imagine it's only making the optimal thing harder to achieve.
I can see that there are some cases where you have heap-data that is only visible in the current thread, and the borrow checker might be able to see that. But I can imagine that there are at least as many cases where it would only get in the way and probably nudge me towards unnecessary ceremony, including run-time overhead.
It's relevant when you have more complex objects, such as ones that contain independent mutexes that lock different sections of data.
You want the object to present its valid operations, but the object could also be constructed in single or multithreaded situations.
So you'd offer two APIs; one which requires a shared reference, and internally locks, and a second which requires a mutable reference, but does no locking.
Internally the shared reference API would just lock the required mutexes, then forward to the mutable reference API.
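Something like this, as a sketch (names made up, obviously not the only way to slice it):

    use std::sync::Mutex;

    struct Stats {
        hits: u64,
    }

    impl Stats {
        // &mut API: no locking, exclusivity is guaranteed by the caller.
        fn record_hit(&mut self) {
            self.hits += 1;
        }
    }

    struct SharedStats {
        inner: Mutex<Stats>,
    }

    impl SharedStats {
        // Shared-reference API: locks internally, then forwards to the &mut API.
        fn record_hit(&self) {
            self.inner.lock().unwrap().record_hit();
        }

        // Exclusive-reference API: &mut proves nobody else can lock, so no locking.
        fn record_hit_mut(&mut self) {
            self.inner.get_mut().unwrap().record_hit();
        }
    }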
When you construct an object containing a mutex, you have exclusive access to it, so you can initialize it without locking the mutex. When you're done, you publish/share the object, thereby losing exclusive access.
    struct Entry {
        msg: Mutex<String>,
    }

    ...

    // Construct a new object on the stack:
    let mut object = Entry { msg: Mutex::new(String::new()) };
    // Exclusive access, so no locking needed here:
    let mutable_msg = object.msg.get_mut();
    format_message(mutable_msg, ...);

    ...

    // Publish the object by moving it somewhere else, possibly on the heap:
    global_data.add_entry(object);
    // From now on, accessing the msg field would require locking the mutex
Initialization is always special. A mutex can't protect that which doesn't exist yet. The right way to initialize your object would be to construct the message first, then construct the composite type that combines the message with a mutex. This doesn't require locking a mutex, even without any borrow checker or other cleverness.
Dude, it's a simplified example, of course you can poke holes in it. Here, let me help you fill in the gaps:
    let mut object = prepare_generic_entry(general_settings);
    let mutable_msg = object.msg.get_mut();
    do_specific_message_modification(mutable_msg, special_settings);
The point is that there are situations where you have exclusive access to a mutex, and in those situations you can safely access the protected data without having to lock the mutex.
Sorry, I don't find that convincing, but rather contrived. This still seems like "constructor" type code, so the final object is not ready and locking should not happen before all the protected fields are constructed.
There may be other situations where you have an object in a specific state that makes it effectively owned by a thread, which might make it possible to forgo locking it. These are all very ad-hoc situations, most of them would surely be very hard to model using the borrow checker, and avoiding a lock would most likely not be worth the hassle anyway.
Not sure how this can help me reduce complexity or improve performance of my software.
>I don't program in Rust, but I imagine I'd rarely get into that situation.
Are you sure? Isn't having data be local to a thread the most common situation, with data sharing being the exception?
>Or, the data is linked globally in a data structure and the only way to access it safely is by knowing exactly what you're doing and what the other threads are doing.
That's exactly what the borrow checker does. It tracks how many mutable references you have to your data structure at compile time. This means you can be sure what is local and what is shared.
Meanwhile without the borrow checker you always have to assume there is a remote probability that your mental model is wrong and that everything goes wrong anyways. That's mentally exhausting. If something goes wrong, it is better to only have to check the places where you know things can go wrong, rather than the entire code base.
I use lots of locals, but only to make my code very "local", i.e. fine-grained, editable and clear, using lots of temporary variables. No complicated expressions. That's all immutable data (after initialization). I rarely take the address of such data but make lots of copies. If I take its address, then as an immutable pointer, maybe not in the type system but at least in spirit.
I keep very little state on the stack -- mostly implicit stuff like mutex lock / mutex unlock. By "state" I mean object type things that get mutated or that need cleanup.
I always have a "database schema" of my global state in mind. I define lots of explicit struct types instead of hiding state as locals in functions. I've found this approach of minimizing local state to be the right pattern because it enables composability. I'm now free to factor functionality into separate functions. I can much more freely change and improve control flow. With this approach it's quite rare that I produce bugs while refactoring.
So yes, I have lots of locals but I share basically none of them with other threads. Also, I avoid writing any code that blocks on other threads (other than maybe locking a mutex), so there's another reason why I would not intentionally share a local with another thread. Anything that will be shared with another thread should be allocated on the heap just for the reason that we want to avoid blocking on other threads.
In that sense, the borrow checker is a tool that would allow me to write code more easily that I never wanted written in the first place.
I find it better to model that as an Actor than a mutex, but I guess it's inherently the same thing, except the actor also allows asynchronous operations.
That isn't a mutex; that's delegating work asynchronously and delegating something else to run when it is complete (the implicitly defined continuation through coroutines).
In systems programming parlance, a mutex is a resource which can be acquired and released, can be held by at most one owner at a time, and blocks on acquire if already held.
No. It’s not a property of the type, so you can have multiple items under a mutex and you’re not at the mercy of whoever wrote it; it works fine with POD types; it does not force a lock/unlock on each method call (instead the compiler essentially ensures you hold the lock before you can access the data); and the borrow checker is there to ensure you can not leak any sort of sub-state, even though you can call all sorts of helpers which have no requirement to be aware of the locking.
It’s what synchronized classes wish they had been, maybe.
This felt like a blast from the past. At a few times reading this article, I had to go back and check that, yes, it's actually a new article from the year 2025 on STM in Haskell and it's even using the old bank account example.
I remember 15 or 20 years ago (has it been that long?) when the Haskell people like dons were banging on about: 1) Moore's law being dead, 2) future CPUs will have tons of cores, and 3) good luck wrangling them in your stone age language! Check out the cool stuff we've got going on over in Haskell!
Yeah, remember when we used to care about making better programming languages that would perform faster and avoid errors, instead of just slapping blockchains or AI on everything to get VC money. Good times.
OOP is very useful/powerful, don't throw the good parts out. Java messed up by deciding everything must be an object when there are many other useful ways to program. (You can also argue that Smalltalk had a better object model, but even then all-objects isn't a good thing.) Functional programming is very powerful and a good solution to some problems. Procedural programming is very powerful and a good solution to some problems. You can do both in Java - but you have to wrap everything in an object anyway despite the object not adding any value.
Java was derived from C++, Smalltalk, and arguably Cedar, and one of its biggest differences from C++ and Smalltalk is that in Java things like integers, characters, and booleans aren't objects, as they are in C++ and Smalltalk. (Cedar didn't have objects.)
Right. Everything a user can define is an object, but there are a few non-object built-ins. (They are not objects in C++ either, but C++ doesn't make everything you write be an object.)
I feel this is a case of using the same word to mean something different. C++ “object” here seems to mean something more akin to “can be allocated and stuffed into an array” than a Smalltalk-type object.
i.e. C++ primitive types are defined to be objects but do not fit into a traditional object-oriented definition of “object”.
Yes, many people believe that C++ isn't really "object-oriented", including famously Alan Kay, the inventor of the term. Nevertheless, that is the definition of "object" in C++, and Java is based on C++, Smalltalk, and Cedar, and makes an "object"/"primitive" distinction that C++, Smalltalk, and Cedar do not, so "Java [did something] by deciding everything must be an object" is exactly backwards.
I'm not sure who invented "object oriented", but objects were invented by Simula in 1967 (or before, but first released then?) and that is where C++ takes the term from. Smalltalk-80 did some interesting things on top of objects that allow for object oriented programming.
In any case, Alan Kay is constantly clear that object oriented programming is about messages, which you can do in C++ in a number of ways. (I'm not sure exactly what Alan Kay means here, but it appears to exclude function calls, but would allow QT signal/slots)
The specific thing you can do in Smalltalk (or Ruby, Python, Objective-C, Erights E, or JS) that you can't do in C++ (even Qt C++, and not Simula either) is define a proxy class you can call arbitrary methods on, so that it can, for example, forward the method call to another object across the network, or deserialize an object stored on disk, or simply log all the methods called on a given object.
This is because, conceptually, the object has total freedom to handle the message it was sent however it sees fit. Even if it's never heard of the method name before.
You can do that in C++ too - it is just a lot of manual work. Those other languages just hide (or make easy) all the work needed to do that. There are trade offs though - just because you can in C++ doesn't mean you should: C++ is best where the performance cost of that is unacceptable.
No, in C++ it's literally impossible. The language provides no way to define a proxy class you can call arbitrary methods on. You have to generate a fresh proxy class every time you have a new abstract base class you want to interpose, either by hand, with a macro processor, or with run-time code generation. There's no language mechanism to compile code that calls .fhqwhgads() successfully on a class that doesn't have a .fhqwhgads() method declared.
You don't call fhqwhgads() on your proxy class though. You call runFunction("fhqwhgads") and it all compiles - the proxy class then string-matches on the arguments. Of course, depending on what you want to do, it can be a lot more complex. That is, you do manually what other languages do for you automatically under the hood.
Again, this is not something you should do, but you can.
That doesn't provide the same functionality, because it requires a global transformation of your program, changing every caller of .fhqwhgads() (possibly including code written by users of your class that they aren't willing to show you, example code in the docs on your website, code in Stack Overflow answers, code in printed books, etc.). By contrast, in OO languages, you typically just define a single method that's a few lines of code.
You're sinking into the Turing Tarpit where everything is possible but nothing of interest is easy. By morning, you'll be programming in Brainfuck.
To be clear, I’m not trying to pick at whether or not C++ is “really object oriented”.
What I’m saying is that the discrepancy between primitives in C++ and Java is entirely one of definition. Java didn’t actually change this. Java just admitted that “objects” that don’t behave like objects aren’t.
On the contrary, Java objects are very different from C++ objects, precisely because they lack a lot of the "primitive-like" features of C++ objects such as copying, embedding as fields, and embedding in arrays. (I'm tempted to mention operator overloading, but that's just syntactic sugar.)
Java differs from C++ in an endless number of ways.
What I’m saying is that in both C++ and Java, there are a set of primitive types that do not participate in the “object-orientedness”. C++ primitives do not have class definitions and cannot be the base of any class. This is very much like Java where primitives exist outside the object system.
If the C++ standard used the term “entities” instead of “objects” I don’t think this would even be a point of discussion.
The entire design of C++ is built around eliminating all distinctions between primitive "entities" and user-defined "entities" in a way that Java just isn't. It's true that you can't inherit from integers, but that's one of very few differences. User-defined "entities" don't (necessarily) have vtables, don't have to be heap-allocated, can overload operators, can prevent subclassing, don't necessarily inherit from a common base class, etc.
C++'s strange definition of "object" is a natural result of this pervasive design objective, but changing the terminology to "entity" wouldn't change it.
> The entire design of C++ is built around eliminating all distinctions between primitive "entities" and user-defined "entities"
If the intent was to erase all distinction between built-in and user-defined entities then making the primitive types unable to participate in object hierarchies was a pretty big oversight.
But at this point I think we’re talking past each other. Yes, in Java objects are more distinct from primitives than in C++. But also yes, in C++ there is a special group of “objects” that are special and are notably distinct from the rest of the object system, very much like Java.
You can read Stroustrup's books and interviews, if the language design itself doesn't convey that message clearly enough; you don't have to guess what his intentions and motivations were. And, while I strongly disagree with you on how "special and notably distinct" primitive types are in C++, neither of us is claiming that C++ is less adherent to the principle that "everything is an object" than Java. You think it's a little more, and I think it's a lot more.
But we agree on the direction, and that direction is not "Java [did something] by deciding everything must be an object," but its opposite.
I don’t actually think it’s any more adherent to that notion. This is exactly why I tried to point out the discrepancies in definitions. You have to define what an “object” is or the discussion is meaningless.
If the definition of object is something like “an instance of a class that has state, operations, and identity” then C++ primitives are fundamentally not objects. They have no identity and they are not defined by a class. If “participates in a class hierarchy” is part of the definition, then C++ is way less OO than Java.
I don’t quite understand what your definition is, but you seem to be arguing that user-defined entities are more like primitives in C++, so it’s more object-oriented. So maybe “consistency across types == object orientedness”? Except C++ isn’t really more consistent. Yes, you can create a user-defined type without a vtable, but this is really a statement that user-defined types are far more flexible than primitives. But also if “consistency across types” is what makes a language OO then C seems to be more OO than C++.
I don't think C++ is object-oriented, and it is certainly way less OO than Java in most ways. Its "classes" aren't the same kind of thing as classes in OO languages and its "objects" aren't OO objects.
In part this is because by default even C++ class instances don't have identity, or anyway they only have identity in the sense that ints do, that every (non-const) int has an address and mutable state. You have to define a destructor, a copy constructor, and an assignment operator to give identity to the instances of a class in C++.
With respect to "participates in a class hierarchy", that has not been part of the definition of OO since the Treaty of Orlando. But, in Java, all objects do participate in a class hierarchy, while no primitives do, while, in C++, you can also create class instances (or "class" "instances") that do not participate in a class hierarchy (without ever using inheritance, even implicitly). So, regardless of how OO or non-OO it may be, it's another distinction that Java draws between primitives and class instances ("objects" in Java) that C++ doesn't.
That's true, and in Smalltalk it's not true. In Cedar there is no inheritance. At any rate it's not a case of Java making more things objects than its forebears.
> At any rate it's not a case of Java making more things objects than its forebears.
There are other cases though. E.g. in C++ you could always have function pointers which were not objects and functions which were not methods. In Java you had to use instances of anonymous classes in place of function pointers, and all of your functions had to be (possibly static) methods.
Function pointers in C++ are "objects", and the difference between static methods and C-style functions which are not methods seems purely syntactic to me, or at best a question of namespacing. Regardless, static methods are not objects either in Java or in C++, so that is also not a case of something being an "object" in Java and not in C++.
Java is most of why we have a proliferation of VM-based languages and a big part of why WASM looks the way it does (though as I understand it, WASM is the shape it is in some measure because it rejects JVM design-for-the-language quirks).
I would also not blame Java for the worst of OO, all of that would have happened without it. There were so many OO culture languages pretending to that throne. Java got there first because of the aforementioned VM advantage, but the core concepts are things academia was offering and the industry wanted in non-Ada languages.
I would say Java also had a materially strong impact on the desire for server, database and client hardware agnosticism.
Some of this is positive reinforcement: Java demonstrates that it's better if you don't have to worry about what brand your server is, and JDBC arguably perfected ODBC.
Some of it is negative: a lot of the move to richer client experiences in the browser has to do with trying to remove client-side Java as a dependency, because it failed. It's not the only bridged dependency we removed, of course: Flash is equally important as a negative.
> I would also not blame Java for the worst of OO, all of that would have happened without it. There were so many OO culture languages pretending to that throne. Java got there first because of the aforementioned VM advantage, but the core concepts are things academia was offering and the industry wanted in non-Ada languages.
Really? Were there other major contenders that went as far as the public class MyApp { public static void main( nonsense?
And it's all still true, although I would offer the usual argument that concurrency!=parallelism, and if you reach for threads&STM to try to speed something up, you'll probably have a bad time. With the overhead of GC, STM-retries, false-sharing, pointer-chasing, etc you might have a better time rewriting it single-threaded in C/Rust.
STM shines in a concurrent setting, where you know you'll have multiple threads accessing your system and you want to keep everything correct. And nothing else comes close.
To be maximally fair to the Haskell people, they have been enormously influential. Haskell is like Canada: you grow up nicely there and then travel the world to bring your energy to it.
Disabled on most CPUs, plagued by security issues. I haven't used it but I assume debugging would be extremely painful, since any debug event would abort the transaction.
True. At some point in the now distant past, AMD had a proposal for a very restricted form of HTM that allowed CAS on up to 7 memory locations, as they had some very specific linked list algorithms that they wanted to optimize, and the 7-location restriction worked well with the number of ways of their memory.
I'd like to see what kind of hardware acceleration would help STMs without imposing severe limitations on their generality.
To me, the appealing things about STMs are the possibility of separating concerns of worst-case execution time and error handling, which are normally pervasive concerns that defeat modularity, from the majority of the system's code. I know this is not the mainstream view, which is mostly about manycore performance.
Not an expert, but my understanding is that HTM basically implements the fast path: you still need a fully fledged STM implementation as a fallback in case of interference, or even in the uncontended case if the working set doesn't fit in L1 (because of way collision for example) and the HTM always fails.
I've been excited about STMs since I read "Composable Memory Transactions" back in 02005, shortly before it was published, and I still believe in the glorious transactional future, but it's difficult to adopt an STM piecemeal; it kind of wants to own your entire program, the same way that garbage collection or nonblocking I/O do, and more so than multithreading with locks. You kind of have to commit entirely to an STM. The efforts in C# to adopt an STM ending around 02010 were a disaster as a result.
The article says a couple of things about STMs that are not true of STMs in general, just true of the Haskell STM the author is familiar with, like a small Brazilian child confidently telling you that almost everyone speaks Portuguese.
One of these is, "STM is an optimistic concurrency system." The possibility of making your concurrency 100% lock-free is one of the most appealing things about STMs, and I think it could be a key to solving the UI latency problem, which just keeps getting worse and worse. Actors and CSP don't normally help here; an Actor is just as "blocking" as a lock. But you can implement an STM with partly pessimistic concurrency, or purely pessimistic, and it might even be a good idea.
Another is, "One last benefit of STM which we haven't yet discussed is that it supports intelligent transaction retries based on conditions of the synchronized data itself." This was an innovation introduced by "Composable Memory Transactions", and many STMs do not support it, including Keir Fraser's awesomely fast version. I am even less certain that it is the correct tradeoff for all uses than I am about purely optimistic synchronization.
But all of this is why I'm rereading Gray and Reuter's Transaction Processing right now after 25 years. With the benefit of 35 years of hindsight, it's a frustrating mix of inspiring long-range vision and myopic boneheadedness. But it shares a lot of hard-won wisdom about problems like long transactions that pop up in a new guise in STMs.
I think writing concurrent programs will always be a hard problem, relative to the difficulty of writing non-concurrent programs, and the only "solution" is to isolate, minimize, and regulate contention. The implementation details of TM, locks, monitors, semaphores, actors, message queues, transactions, etc., are at best "distractions", at worst hindrances. I think a good model of a concurrent program, one that lends itself to writing the program simply, will be applicable across many different implementations. Anything that obscures the development of such a model is harmful. Worst of all is the sheer prevalence of shared resources (especially shared memory). Sharing brings contention, so control sharing.
I don't agree that whether you're using TM, shared-memory monitors, or actors with message queues is an implementation detail or that there is a better programming model that hides the differences between them. You can implement any of them on top of any of the others, but you're still programming to whatever model you put on top.
In the implementation that runs, yes, you have to choose something. However, I think the fundamental design is independent of those options, and probably should be developed independently.
There is surely some sense in which that's true; the choice of concurrency primitives, even among such radically different choices, won't change literally every design decision in your system. But it is very pervasive, and it regularly provokes failures that are visible to users.
On the problems with STM in C#, see https://joeduffyblog.com/2010/01/03/a-brief-retrospective-on... (I can't believe nobody else has posted this link yet). As with the Chris Penner article, there are a lot of things described as features of STMs in general which are actually just properties of the STM he worked on, which explains some of the things that sound like nonsense if you've only worked with Haskell's STM or Clojure's. (Duffy is much better about delineating the boundaries of the systems he's talking about, though, because he knows there are alternatives.)
Interesting, the notation also implies 2005 and 2010 A.D. and not B.C., or maybe the notation is only about A.D.? Either way, interesting choice if it was intentional. We say “year 515” without disambiguation, right?
I wish people would comment about transactions, optimistic synchronization, CSP, actors, priority inversion, Fraser's astounding code (https://www.cl.cam.ac.uk/research/srg/netos/projects/archive...) etc., but I guess we each do what we can, and you have to meet people where they are. I probably couldn't have posted a comment any better than yours when I was 12, and maybe you're 12, so maybe you can't do any better. Hopefully, eventually, you will.
> A data race occurs any time two threads access the same memory location concurrently and non-deterministically when at least one of the accesses is a write.
From what I understand of the C++ memory model (shared by C and Rust), this is not the definition of data race – a data race occurs when two or more threads access memory concurrently where at least one access is a write and the accesses are unsynchronized. However, synchronized accesses may not have a deterministic ordering, in which case a race condition occurs.
(Confusing as it may be, I believe this is standard terminology.)
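A tiny Rust example of the distinction: a race condition with no data race (the outcome depends on scheduling, but every access is synchronized, so no UB):

    use std::sync::atomic::{AtomicI64, Ordering};
    use std::thread;

    fn main() {
        static BALANCE: AtomicI64 = AtomicI64::new(100);

        let withdraw1 = thread::spawn(|| {
            // Check-then-act across two atomic operations: another thread can
            // interleave between the load and the fetch_sub.
            if BALANCE.load(Ordering::SeqCst) >= 100 {
                BALANCE.fetch_sub(100, Ordering::SeqCst);
            }
        });
        let withdraw2 = thread::spawn(|| {
            if BALANCE.load(Ordering::SeqCst) >= 100 {
                BALANCE.fetch_sub(100, Ordering::SeqCst);
            }
        });

        withdraw1.join().unwrap();
        withdraw2.join().unwrap();
        // Can print 0 or -100 depending on interleaving: a race condition,
        // but never a data race in the C++/Rust memory-model sense.
        println!("{}", BALANCE.load(Ordering::SeqCst));
    }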
Optimistic locking works great when what is excluded is effectively a calculation. The annoyance, though, is you have basically acknowledged that you can use a compare-and-swap at the end of your calculation to know that things worked.
This is not at all always the case, though. Sometimes, what you need to use mutual exclusion for is actively working with something that is, itself, active work. That is, sometimes you have to stop something even starting.
Think about how you would model a laundromat. It isn't enough to say you could use optimistic locks for access to the machines. You can't use the machine until you can use the machine.
This is not unheard of in computing, either. Sometimes, you can't process a buffer until you know it has been finished by the code before you.
> Think about how you would model a laundromat. It isn't enough to say you could use optimistic locks for access to the machines.
One machine? Too easy, I want multiple machines!
I want to reserve this machine and that machine. And this is happening while someone else wants to grab that machine and this machine. It costs money to grab a machine, so I'll insert a coin into this machine, and then insert a coin into that machine. The other guy grabbed that machine before me, so am I locked out? It turns out he only had one coin, not two: not only did this mean he untook that machine, but the coin he spent flew out of the machine and back into his pocket. While that was going on, I took a nap - instead of busily waiting for those machines - and the laundromat operator woke me up when it was time to try to grab both machines again.
    myActions :: Wallets -> Machines -> STM ()
    myActions wallets machines = do
        bookMachine "this machine"
        bookMachine "that machine"
      where
        bookMachine :: Machine -> STM ()
        bookMachine m = do
          reserveMachine machines m
          spendCoin

        spendCoin :: STM ()
        spendCoin = do
          decrement wallets
          newBalance <- getBalance wallets
          when (newBalance < 0)
               (throwSTM (userError "Ran out of money"))
    bool transfer(boost::synchronized<Account>& sync_from,
                  boost::synchronized<Account>& sync_to, int amount) {
      auto [from, to] = synchronize(sync_from, sync_to);
      if (from->withdraw(amount)) {
        to->deposit(amount);
        return true;
      } else {
        return false;
      }
    }
Hopefully synchronized will make it into the standard library at some point, in the meantime it is not terribly hard to write it yourself if you do not want a boost dependency.
Purely optimistic STM implementations that abort transactions early and don't permit other transactions to read uncommitted data can guarantee forward progress, and I believe that both Haskell's STM and Fraser and Harris's STM do, though I could easily be mistaken about that.
Probably you are right. I vaguely remembered the "Why Transactional Memory Should Not Be Obstruction-Free" paper, but I might have misunderstood or forgotten what it meant (the implementation can be non obstruction-free, but it doesn't mean it can live-lock).
I'm reading the Kuznetsov and Ravi paper https://www.researchgate.net/publication/272194871_Why_Trans... now; I assume that's the one you mean? Its definition of "obstruction-freedom" is that every transaction "not encountering steps of concurrent transactions" must commit. This seems to be neither necessary nor sufficient for avoiding livelock, but certainly very helpful. Their weaker "progressiveness" property seems almost as good.
They claim that their STM "LP" is not obstruction-free but is wait-free, which is a very strong claim! WP explains, "A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress. ‘Non-blocking’ was used as a synonym for ‘lock-free’ in the literature until the introduction of obstruction-freedom in 2003." Kuznetsov and Ravi say of LP, "every transactional operation completes in a wait-free manner."
Its normative force seems to be founded on claims about performance, but it would be very surprising if the performance cost of guaranteed forward progress or obstruction-freedom were too high for me to want to pay it, since what I'm most interested in is latency and fault-tolerance, not parallel speedups.
As far as I know, wait-free is a superset of lock-free and lock-free is a superset of obstruction-free. How can LP be wait-free but not obstruction free?
In any case a wait-free algorithm can't live-lock by definition (progress and fairness of all threads is guaranteed), but the catch is that while the STM runtime itself might have this property, it doesn't necessarily translate to an algorithm implemented in the runtime (which makes sense, you should be able to implement a lock with an STM).
So, yes, the paper is interesting, but probably not relevant for this discussion and I shouldn't have brought it up.
Now the question again remains: do concrete STM implementations actually provide the guarantee you mentioned earlier, i.e. does a transaction aborting guarantee that another succeeds? I would think it doesn't, as I think it would be very costly to implement: both transactions might end up aborting when detecting a conflict.
Maybe what concrete runtimes actually guarantee is that there is an upper bound on the spurious aborts and restarts as worst case you can fall back on a single global lock for serialization when the upper bound is reached.
You avoid livelock, as I understand the term in an STM, if the only thing that can prevent a transaction from committing when it tries to commit is some other transaction having committed. That way, forward progress is guaranteed; as long as some transaction commits, you're not livelocked, are you?
I'm not familiar with "obstruction-free"ness; should I be?
This is a nice article, and I appreciate the examples. The next problem to solve is how to persist state on disk across two different accounts after a transfer has been done.
>Moore's law is dying, beefy single cores are no longer keeping up.
On the other hand, there are many other things that could be done to avoid wasting all the extra power gained over the years which don't even require any parallelism boost.
I wonder what happened to hardware transactional memory. Your CPU caches already keep track of which core is keeping which line in its cache and whether they have modified it, via the MESI protocol.
The fundamental problem here is shared memory / shared ownership.
If you assign exclusive ownership of all accounting data to a single thread and use CSP to communicate transfers, all of these made up problems go away.
STM isn't really used in Go like it is in Haskell.
Here's the example from a Go STM package that's based on Haskell STM. It has gotchas that you won't encounter in Haskell though, due to the nature of these languages.
Indeed, “Software Transactional Memory” sounds like a magic black box. It feels a bit like reading “just use an sql database and transactions”. It’s not really telling me how the problem is solved, just to use someone else’s solution without understanding how it’s implemented.
Scala supports it with for-comprehensions which are equivalent to Haskell's do-notation but STM is not part of the Scala standard library. Zio and Cats Effect are two popular Scala effects systems with STM.
Verse only supports single-threaded transactional memory. Epic hasn't yet demonstrated that their approach can actually scale to be used from multiple threads in a useful manner, though they claim that it will.
I've found it interesting to think about trying to adapt data structures like CRDTs, designed for distributed systems, to the problem of local consistency between CPU-local data structures spread across cores for parallelism.
C# offers a very convenient way to pack data access and locking into one thing: the "lock" statement. It does however not let you lock more than one "resource" at a time.
Shared data is hard no matter what you do. Parallelism and shared data do not mix. STM makes some things easier, but you will still run into problems if you have a lot of it. You must design your code such that you spend a lot of CPU cycles doing single-threaded, no-shared-data calculations between every place where you need to share data. If you can't do that you can't do parallelism.
When there are only a few places where data needs to be shared, a mutex works, since you can put your best programmer on maintaining just that code, and with only a few places they can figure it out. You can also make a variable atomic, which sometimes works better than a mutex and sometimes worse. You can use STM. However, no matter what you use, the reality of synchronizing between cores means you can't do any of the above "very often".
> If the programmer forgets to lock the mutex the system won't stop them from accessing the data anyways, and even then there's no actual link between the data being locked and the lock itself, we need to trust the programmers to both understand and respect the agreement. A risky prospect on both counts.
Not every language is like that. I would not use a language (that allows this) for parallel programming. You can use other ways, but how can you guarantee everyone who will edit the code-base will not bypass an unenforced mutex?
>Ugh, a correct transfer function should conceptually just be the composition of our well encapsulated withdraw and a deposit functions, but defining it correctly has forced us to remove the locking from both withdraw and deposit, making both of them less safe to use.
I know OOP isn't cool any more, but the above is what OOP solves.
TFA's transfer() and withdraw() functions aren't compliant with double-entry accounting anyways, so you'd mark them private and only expose Transfer to callers. Let the class handle its own details.
OOP does not provide any help in solving the problem in question, and indeed encapsulation is a major obstacle to solving it. I think you haven't understood what problem is being discussed.
When working with high-throughput concurrent code like consumer-producer pipelines, it's better to avoid shared mutable state entirely. Something actor-like fits better, and C# or Go channels work wonders.
It's not necessarily shared. You could assign a single thread to own the account balances, reading requests from a message queue. That probably scales better than locking. A single thread can do several million transactions per second, more than the entire global financial system.
And in a green threading system like Go or Scala Cats, the balances thread isn’t a thread at all, and it will run in the same thread as the transfer caller when the call isn’t contended, so you don’t even have a context switch.
The account-owning thread has to accept messages for every atomic action you need it to do.
There are lots of standard shared queues you can pick from that have been fully regression tested. That's almost always better than mixing concurrency in with your business logic.
Sure, but what I meant is when there is some other thing that needs to happen atomically together with the balance handled by that one thread (i.e, both balance and other thing change or neither do). You'll need another thread for that other thing, then a method to synchronize the two and.. you're back at the mutex.
Isn't that a typical case where you don't have to share anything?
Divide the image into N chunks, let N threads handle each one, no sharing, just need a single sync point at the end to wait on completion.
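For what it's worth, that shape falls out almost directly with scoped threads; a minimal std-only Rust sketch (the brightening step is just a stand-in for real per-pixel work):

    // Split the buffer, hand each chunk to a thread, and let the end of the
    // scope be the single sync point.
    fn brighten(pixels: &mut [u8], n_threads: usize) {
        let chunk_len = (pixels.len() + n_threads - 1) / n_threads.max(1);
        std::thread::scope(|s| {
            for chunk in pixels.chunks_mut(chunk_len.max(1)) {
                s.spawn(move || {
                    for px in chunk {
                        *px = px.saturating_add(10); // stand-in for real work
                    }
                });
            }
        }); // the scope implicitly joins every spawned thread here
    }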
tl;dr mutexes are evil because they don't compose, STM is the solution because it does compose, otherwise just avoid shared state, or even state entirely.
Not anything that's not already covered in any undergraduate CS course.
You made me check the programme of many university courses.
Most parallel programming courses are at Masters level and require specialization, but those are much more advanced (sorting networks, distributed computing on supercomputers and GPU, consensus algorithms, parallelization of linear solvers...)
There are undergraduate courses that cover simple things like multithreading patterns in mainstream languages, but it seems they are only available if you go to an education institution with both a practical mindset and a firm grip of the fundamentals, which is unfortunately quite rare, as most institutions tend to specialize in one or the other.
>STM is an optimistic concurrency system. This means that threads never block waiting for locks. Instead, each concurrent operation proceeds, possibly in parallel, on their own independent transaction log. Each transaction tracks which pieces of data it has accessed or mutated and if at commit time it is detected that some other transaction has been committed and altered data which this transaction also accessed, then the latter transaction is rolled back and is simply retried.
I already foresaw (and it gets mentioned later), the problem that if you have many small, frequent operations, they will prevent a big, long operation from happening because they will always change the state and cause conflicts before the big one can finish. You can easily code yourself an app that will softlock forever.
The post doesn't offer a good solution (it talks about one where there's a tradeoff, but you don't need tradeoffs for this)
The way this gets fixed it is to make the lock acquisition (or in this case, the priority of the merge?) preemptive (Wound-Wait)
All transactions have a global, ever incrementing number attached to them, their ID, which we call "seniority", if you try to get a a lock, and the lock is held by a transaction with a lower seniority (=> a higher ID), you kill the transaction, take the lock, and once you're done the transaction you killed is allowed to retry
In the meantime, if a transaction with the lower seniority tries to get the lock, it gets blocked
This insures that your program will always finish.
In the case of "lots of frequent small transactions" + "one big, long report", the report will get killed a few times, until your report inevitably becomes the most senior transaction to ask for this resource, and is allowed to complete.
https://old.reddit.com/r/haskell/comments/1oujfmi/mutexes_su... apparently "optimistic" glosses over some details
databases also solve this problem by allowing the read only transaction to see a consistent snapshot of the data at some point in time. so the reader can do its work without needing a logical mutex.
> databases also solve this problem
They could, but are typically configured not to.
With default settings they let other processes' commits cut into the middle of your read. (See READ_COMMITTED.). I was flabbergasted when I read this.
When you turn MSSQL up to a higher consistency level you can get it to grind to a halt and throw Deadlock exceptions on toy code. (Moving money between accounts will do it).
I used to advertise STM as like ACID-databases for shared-memory systems, but I guess it's even more consistent than that.
Postgres's `set transaction isolation level serializable` doesn't result in deadlocks (unless you explicitly grab locks) but does require your application to retry transactions that fail due to a conflict.
Yeah, STMs can use MVCC in the same way, but that doesn't solve the long-transaction problem for bulk update transactions. The first example in Gray and Reuter is adding monthly interest to every savings account, exactly once.
long-running transactions are a really fundamentally bad idea if the length of the transaction scales with the number of accounts/users. these things need to be batched up in a way that code with at-least-once semantics correctly solves the problem. (in one TX handle just a bunch of accounts, calculating the interest, recording that these accounts already got the interest, and moving to the next batch.)
the problem with STM (or any other concurrency control mechanism) is that it's not their job to solve these issues (as usually they require conscious product design), so they are not going to solved by "just use STM".
There are some different approaches, but that's definitely one of them. However, it's not without its tradeoffs; you're sacrificing atomicity, in the sense that if updating some account throws an error, the already-updated accounts stay updated.
I think STM looks pretty cool in toy examples, but in practice it's pretty bad. It's very difficult for me to make a strong logical argument about this; it's just based on how it feels.
In Clojure, there are first-class STM primitives with retriable transactions in the standard lib, but every time I tried to use them, the code was slow, difficult to reason about, full of "colored" functions, and non-composable.
Also, unintuitively, the code starts immediately smelling slightly "wrong", as if I tried to put into code my first childish thoughts instead of actually thinking about how the system _should_ work. Like the notorious bank accounts example: yeah, cool, two accounts' balances are TVars — but what if I have 10M accounts? Is the list of accounts also a TVar since sometimes new accounts are created? Are indexes on the list also TVars? How do you persist the lists, how do you synchronize slow persistence and in-memory transactions? How do you synchronize all the deposits with a backup/failover machine? As you continue trying to untangle all this with STM, you start drowning in complexity, and encountering huge contention among transactions touching many parts of your systems, and spewing countless retry logs.
It's not only my own experience — I believe it's widely accepted in the Clojure community that nobody actually uses STM, and everyone instead uses simple atomic updates or queues.
I have a similar experience, though it's not necessarily related to shortcomings in the STM implementation (which I always felt was quite competent in the space). To expand, I think a few things are true:
1. I've come to feel that actually managing concurrency is a "below the fold" competency for most engineers. I'm not saying it should be, and people need to understand what a data race is.
2. As you pointed out, most state in services that I work on is not in need of coordinated updates, so atoms are where you want to be, or single-writer queues. 100% agreed.
3. Structured concurrency approaches, fibers, and other abstracted runtimes seem to be the best way to get concurrent and parallel work, given points 1 and 2 about how coordinated state really isn't that common for most work on my team.
A lot of the time when we're updating state, we mostly have to think about it in the database, employing either a last-write-wins strategy or, when coordination is needed, some form of pessimistic or optimistic locking.
Bummer, I thought for a second that we had found a magic bullet for all our concurrency problems!
I chalk this up to problems with Clojure's STM API specifically rather than STM generally, e.g. Haskell's STM implementation is considerably more useful.
I go into more detail here: https://news.ycombinator.com/item?id=35805138 and https://news.ycombinator.com/item?id=32759886
Thanks, this is very valuable! Have you tried other implementations of STM such as Haskell's?
I don't have any experience with STMs in other languages, but as far as I can see, the ideas are very similar in Haskell and Kotlin implementations, for example, so I guess the downsides are the same as well
What I've heard from Haskell programmers suggests that the downsides are very different, even though the ideas are very similar.
I found that a lot of the problems I had been having with mutexes stem from the fact that traditionally the mutex and the data it protects are separate. Bolting them together, like Rust's Mutex<T> does, solves a lot of these problems. It lets you write normal, synchronous code and leave the locking up to the caller, but without making it a nightmare. You can't even access the data without locking the mutex.
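To make that concrete, a small illustration (the Account type and deposit method are made up for the example):

    use std::sync::Mutex;

    struct Account {
        balance: Mutex<i64>,
    }

    impl Account {
        fn deposit(&self, amount: i64) {
            // `lock()` is the only way to get at the i64; "forgot to take the
            // lock" is simply not a program that compiles.
            let mut balance = self.balance.lock().unwrap();
            *balance += amount;
        }
    }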
This isn't an attack on the (very well written) article though. Just wanted to add my two cents.
Mutexes suffer from a host of problems, and imo are not a very good concurrency primitive - they were designed to turn single-threaded code into multi-threaded. With today's 8+ cores in most systems, a single point of contention usually becomes a problem quickly.
They're liable to deadlocks/livelocks, and sometimes not only with other explicitly Mutex-like things (it might happen that some library you use has a lock hidden deep inside).
They're also often backed by OS primitives (with big overheads) with inconsistent behaviors between platforms (spinlocks, waiting, etc.). We've run into an issue with .NET, where their version of Mutex didn't wake up the blocked thread on Linux as fast as on Windows, meaning we needed about 100x the time to serve a request because the thread was sleeping too long.
There are questions like when to use spinlocks and when to go to sleep and wait, which unfortunately the developer has to answer.
Not assigning blame here, just pointing out that threading primitives and behaviors don't translate perfectly between OSes.
Multi-threading is hard, other solutions like queues suffer from issues like backpressure.
That's why I'm skeptical about Rust's fearless concurrency promise - none of these bugs are solved by just figuring out data races - which are a huge issue, but not the only one.
this is an overly simplistic and somewhat reductive perspective on a pretty fundamental concept/primitive
> they were designed to turn single-threaded code into multi-threaded
not really
> usually a single point of contention quickly becomes a problem.
not generally, no
> They're liable to deadlocks/livelocks,
deadlocks/livelocks are orthogonal to any specific primitive
> They're also often backed by OS primitives (with big overheads) with inconsistent behaviors between platforms (spinlocks, waiting etc).
the mutex as a primitive is orthogonal to any specific implementation...
etc. etc.
Your view on mutex performance and overhead is outdated, at least for the major platforms: The Rust standard library mutex only requires 5 bytes, doesn't allocate, and only does a syscall on contention. The mutex implementation in the parking_lot library requires just 1 byte per mutex (and doesn't allocate and only does a syscall on contention). This enables very fine-grained, efficient locking and low contention.
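As a sketch of what that granularity buys you in practice (this assumes the parking_lot crate; the Histogram type is invented): one tiny mutex per bucket instead of one big lock around the whole table.

    use parking_lot::Mutex;

    struct Histogram {
        buckets: Vec<Mutex<u64>>, // roughly 1 byte of lock state per bucket, plus the u64
    }

    impl Histogram {
        fn new(n: usize) -> Self {
            Histogram { buckets: (0..n).map(|_| Mutex::new(0)).collect() }
        }

        fn record(&self, value: usize) {
            // Uncontended path: a couple of atomic operations, no syscall.
            *self.buckets[value % self.buckets.len()].lock() += 1;
        }
    }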
These are OS primitives I'm talking about - I haven't checked out the standard library version but the parking_lot version uses a spinlock with thread sleep when the wait times get too high - it has no way of getting notified when the mutex gets unblocked nor does it support priority inversion.
It seems it's optimized for scenarios with high performance compute heavy code, and short critical sections.
These assumptions may let it win benchmarks, but don't cover the use cases of all users. To illustrate why this is bad, imagine you have a Mutex-protected resource that becomes available after 10us on average. This lock spins 10 times checking if it has become available (likely <1us), then yields the thread. The OS (let's assume Linux) won't wake the thread up until the next scheduler tick, and it's under no obligation to do so even then (and has no idea it should). But even best-case, you're left waiting 10ms, which is a typical scheduler tick.
In contrast, OS-based solutions are expensive but not that expensive; let's say they add 1us to the wait. Then you would wait 11us for the resource.
A method call taking 10ms and one taking 15 us is a factor of 60x, which can potentially kill your performance.
You as the user of the library are implicitly buying into these assumptions which may not hold for your case.
There's also nothing in Rust that protects you from deadlocks with 100% certainty. You can fuzz them out, and use helpers, but you can do that in any language.
So you do need to be mindful of how your mutex works, if you want to build a system as good as the one it replaces.
The best practices I adopt for Rust avoid the use of mutexes whenever possible, precisely because of how easy a deadlock is. It turns out it is always possible. There are entire languages that disallow any mutable state, much less shared mutable state. The question becomes how much performance you are willing to sacrifice to avoid the mutex. By starting with no shared mutable state and adding it when something is too slow, you end up with very few mutexes.
> avoid the use of mutex […] It turns out it is always possible
How would you handle the archetypical example of a money transfer between two bank accounts, in which 100 units of money need to be subtracted from one account and atomically added to another account, after checking that the first account contains at least 100 units?
The simplest pure functional way would be to copy the whole database, instantiating a new copy with the desired change if the condition is met. That obviously doesn't scale, which is where the performance thing comes in. A still pure way would be to use a persistent tree or hash array mapped trie that allows efficient reuse of the original db. There are times a purely functional approach doesn't perform well enough, but even with large-scale entity-component-type systems in both Rust and C++, the number of times I've had to use a mutex to be performant is small. Atomics are much more common, but still not common. Persistent data structures alleviate most of the need.
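For concreteness, the naive "copy the whole database" version as a pure function; a minimal sketch over a plain std HashMap with invented names (a persistent map would share structure instead of cloning, but the shape is the same):

    use std::collections::HashMap;

    type Db = HashMap<String, i64>;

    // Returns a new Db with the transfer applied, or None if `from` can't cover it.
    fn transfer(db: &Db, from: &str, to: &str, amount: i64) -> Option<Db> {
        let balance = *db.get(from)?;
        if balance < amount {
            return None;
        }
        let mut next = db.clone();
        next.insert(from.to_string(), balance - amount);
        *next.entry(to.to_string()).or_insert(0) += amount;
        Some(next)
    }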
pure or not eventually this comes down to durability, no?
and the way to do it is to either have some kind of single point of control (a designated actor or single-threaded executor) or mark the data (i.e. use some concurrency control primitive either wrapping the data or in some dedicated place where the executors check [like the JVM's safepoints])
using consistent hashing, these hypothetical accounts could be allocated to actors, and then each transaction is managed by the actor of the source (i.e. where the money is sent from, where the check needs to happen), with their own durable WAL, and periodically these are aggregated
(of course then the locking is hidden in the maintenance of the hashring as dining philosophers are added/removed)
Eliminating the durability constraint doesn't make it any easier to program, just easier to get good performance on.
Distributing accounts among different actors, without two-phase commit or its moral equivalent, enables check kiting.
Since the thread mentions Rust: in Rust, you often replace Mutexes with channels.
In your case, you could have a channel where the Receiver is the only part of the code that transfers anything. It'd receive a message Transfer { from: Account, to: Account, amount: Amount } and do the required work. Any other threads would therefore only have copies of the Sender handle. Concurrent sends would be serialized through the queue's buffering.
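A rough std-only Rust sketch of that shape (the account ids, the starting balances, and the HashMap are invented; the Transfer message mirrors the one described above):

    use std::collections::HashMap;
    use std::sync::mpsc;
    use std::thread;

    struct Transfer { from: u32, to: u32, amount: i64 }

    fn main() {
        let (tx, rx) = mpsc::channel::<Transfer>();

        // The receiving thread is the only code that ever touches the balances.
        let ledger = thread::spawn(move || {
            let mut balances: HashMap<u32, i64> = HashMap::from([(1, 100), (2, 0)]);
            for t in rx {
                let available = balances.get(&t.from).copied().unwrap_or(0);
                if available >= t.amount {
                    balances.insert(t.from, available - t.amount);
                    *balances.entry(t.to).or_insert(0) += t.amount;
                }
            }
            balances
        });

        // Any number of threads can hold a clone of the Sender.
        let tx2 = tx.clone();
        thread::spawn(move || tx2.send(Transfer { from: 1, to: 2, amount: 40 }).unwrap())
            .join()
            .unwrap();

        drop(tx); // once every Sender is gone, the receiver's loop ends
        println!("{:?}", ledger.join().unwrap());
    }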
I'm not suggesting this is an ideal way of doing it
What you're describing is called the "Actor model"; in your example, the receiver is an actor that has exclusive control over all bank accounts.
The actor model reaches its limits as soon as you need transactions involving two or more actors (for example, if you need to atomically operate on both the customers actor and the bank accounts actor). Then you can either pull all involved concerns into a single actor, effectively giving up on concurrency, or you can implement a locking protocol on top of the actor messages, which is just mutexes with extra steps.
> […] but don't cover the use cases of all users.
No single concurrency primitive covers all use cases. I was addressing your misconceptions about mutex performance and overhead, not whether mutexes are the best solution to your particular problem.
> […] it has no way of getting notified when the mutex gets unblocked […] The OS (lets assume Linux) wont wake it up the thread until the next scheduler tick, and its under no obligation to do so even then (and has no idea it should).
You've misunderstood the parking_lot implementation. When thread B tries to lock a mutex that's currently locked by thread A, then, after spinning a few cycles, thread B "parks" itself, i.e., it asks the kernel to remove it from the Runnable task queue. On Linux, this is done using the futex syscall. When thread A unlocks the mutex, it detects that another thread is waiting on that mutex. Thread A takes one thread from the queue of waiting threads and "unparks" it, i.e., it asks the kernel to move it into the Runnable task queue. The kernel is notified immediately, and if there's a free CPU core available, will tend to dispatch the thread to that core. On a non-realtime OS, there's no guarantee how long it takes for an unblocked thread to be scheduled again, but that's the case for all concurrency primitives.
> A method call taking 10ms and one taking 15 us is a factor of 60x
667 (a thousand 15μs calls take 15ms)
Unfortunately the standard library mutex is designed in such a way that condition variables can't use requeue, and so require unnecessary wakeups. I believe parking lot doesn't have this problem.
It's called a futex and has been supported by both Linux and Windows for ages.
The 1-byte-per-mutex parking_lot implementation works even on systems that don't provide a futex syscall or equivalent.
How does it avoid cache contention with just a few bytes per mutex? That is, multiple mutex instances sharing a cache line. Say I have a structure with multiple int32 counters protected by their own mutex.
Cache contention is (mostly) orthogonal to your locking strategy. If anything, fine-grained locking has the potential to improve cache contention, because
1) the mutex byte/word is more likely to be in the same cache line as the data you want to access anyway, and
2) different threads are more likely to write to mutex bytes/words in different cache lines, whereas in coarse-grained locking, different threads will fight for exclusive access over the cache line containing that one, global mutex.
@magicalhippo: Since I'm comment-rate-throttled, here's my answer to your question:
Typically, you'd artificially increase the size and alignment of the structure:
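(The code block that originally appeared here didn't survive formatting; going by the "4+1 required for Mutex<u32>" remark below, it was presumably something along these lines, using parking_lot's 1-byte mutex.)

    // 64-byte alignment forces the whole struct (lock byte + u32 + padding)
    // to occupy its own cache line.
    #[repr(align(64))]
    struct PaddedCounter {
        value: parking_lot::Mutex<u32>,
    }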
This struct now has an alignment of 64, and is also 64 bytes in size (instead of just the 4+1 required for Mutex<u32>), which guarantees that it's alone in the cache line. This is wasteful from a memory perspective, but can be worth it from a performance perspective. As often when it comes to optimization, it very heavily depends on the specific case whether this makes your program faster or slower.
> different threads are more likely to write to mutex bytes/words in different cache lines
If you've got small objects and sequential allocation, that's not a given in my experience.
Like in my example, the ints could be allocated one per thread to indicate some per thread status, and the main UI thread wants to read them every now and then hence they're protected by a mutex.
If they're allocated sequentially, the mutexes end up sharing cache lines and hence lead to effective contention, even though there's almost no "actual" contention.
Yes yes, for a single int you might want to use an atomic variable but this is just for demonstration purposes. I've seen this play out in real code several times, where instead of ints it was a couple of pointers say.
I don't know Rust though, so just curious.
The issue might be allocating the int contiguously in the first place. No language magic is going to help you avoid thinking about mechanical sympathy.
And allocating the ints contiguously might actually be the right solution if the cost of sporadic false sharing is less than the cost of wasting memory.
There's no silver bullet.
But the mutex encapsulates the int, so if the mutex ensured it occupied a multiple of cache lines, there would be no contention. At the very small cost of a few bytes of memory.
the mutex forcing alignment would be extremely wasteful. FWIW, I have used 1-bit spin locks.
By not avoiding it. And a year later you get to write a blog post about how you discovered and fixed this phenomenon hitherto unknown to computer science.
Traditionally, monitors were declared together with the data they contained, and the compiler enforced that the data was not accessed outside the monitor. Per Brinch Hansen wrote a rather bitter broadside against Java's concurrency model when it came out.
Was this the article?
http://brinch-hansen.net/papers/1999b.pdf
This is a toned-down, but still scathing, version of what I remember reading.
> You can't even access the data without locking the mutex.
It's even nicer than that: you can actually access data without locking the mutex, because while you hold a mutable borrow to the mutex, Rust statically guarantees that no one else can acquire locks on the mutex.
https://doc.rust-lang.org/std/sync/struct.Mutex.html#method....
Given a data item of non-thread safe type (i.e. not Mutex<T> etc), the borrow checker checks that there's only ever one mutable reference to it. This doesn't solve concurrency as it prevents multiple threads from even having the ability to access that data.
Mutex is for where you have that ability, and ensures at runtime that accesses get serialized.
The maybe unexpected point is that if you know you're the only one who has a reference to a Mutex (i.e. you have a &mut), you don't need to bother locking it; if no one else knows about the Mutex, there's no one else who could lock it. It comes up when you're setting things up and haven't shared the Mutex yet.
This means no atomic operations or syscalls or what have you.
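A small std-only illustration of that; the Vec is just a stand-in for whatever the mutex protects:

    use std::sync::{Arc, Mutex};

    fn main() {
        let mut queue = Mutex::new(Vec::<u64>::new());

        // Setup phase: we hold the only reference, so `get_mut` hands back
        // `&mut Vec<u64>` directly, with no locking at all.
        queue.get_mut().unwrap().extend([1, 2, 3]);

        // Once it's shared, exclusive access can no longer be proven, so we lock.
        let queue = Arc::new(queue);
        let worker = {
            let queue = Arc::clone(&queue);
            std::thread::spawn(move || queue.lock().unwrap().push(4))
        };
        worker.join().unwrap();
        assert_eq!(queue.lock().unwrap().len(), 4);
    }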
Do you have an example? I don't program in Rust, but I imagine I'd rarely get into that situation. Either my variable is a local (in a function) in which case I can tell pretty easily whether I'm the only one accessing it. Or, the data is linked globally in a data structure and the only way to access it safely is by knowing exactly what you're doing and what the other threads are doing. How is Rust going to help here? I imagine it's only making the optimal thing harder to achieve.
I can see that there are some cases where you have heap-data that is only visible in the current thread, and the borrow checker might be able to see that. But I can imagine that there are at least as many cases where it would only get in the way and probably nudge me towards unnecessary ceremony, including run-time overhead.
It's relevant when you have more complex objects, such as ones that contain independent mutexes that lock different sections of data.
You want the object to present its valid operations, but the object could also be constructed in single or multithreaded situations.
So you'd offer two APIs; one which requires a shared reference, and internally locks, and a second which requires a mutable reference, but does no locking.
Internally the shared reference API would just lock the required mutexes, then forward to the mutable reference API.
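A minimal sketch of that two-API pattern (the Stats type and its fields are invented):

    use std::sync::Mutex;

    struct Stats {
        hits: Mutex<u64>,
        misses: Mutex<u64>, // would get the same pair of methods
    }

    impl Stats {
        // The actual operation, written once against plain &mut data.
        fn bump(counter: &mut u64) {
            *counter += 1;
        }

        // Shared-reference API: callers may be on any thread, so lock
        // internally, then forward to the &mut helper.
        fn record_hit(&self) {
            Self::bump(&mut *self.hits.lock().unwrap());
        }

        // Mutable-reference API: exclusive access is proven by &mut, no locking.
        fn record_hit_mut(&mut self) {
            Self::bump(self.hits.get_mut().unwrap());
        }
    }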
When you construct an object containing a mutex, you have exclusive access to it, so you can initialize it without locking the mutex. When you're done, you publish/share the object, thereby losing exclusive access.
Initialization is always special. A mutex can't protect that which doesn't exist yet. The right way to initialize your object would be to construct the message first, then construct the composite type that combines the message with a mutex. This doesn't require locking a mutex, even without any borrow checker or other cleverness.
Dude, it's a simplified example; of course you can poke holes in it. Here, let me help you fill in the gaps:
The point is that there are situations where you have exclusive access to a mutex, and in those situations you can safely access the protected data without having to lock the mutex.
Sorry, I don't find that convincing, but rather contrived. This still seems like "constructor" type code, so the final object is not ready and locking should not happen before all the protected fields are constructed.
There may be other situations where you have an object in a specific state that makes it effectively owned by a thread, which might make it possible to forgo locking it. These are all very ad-hoc situations, most of them would surely be very hard to model using the borrow checker, and avoiding a lock would most likely not be worth the hassle anyway.
Not sure how this can help me reduce complexity or improve performance of my software.
>I don't program in Rust, but I imagine I'd rarely get into that situation.
Are you sure? Isn't having data be local to a thread the most common situation, with data sharing being the exception?
>Or, the data is linked globally in a data structure and the only way to access it safely is by knowing exactly what you're doing and what the other threads are doing.
That's exactly what the borrow checker does. It tracks how many mutable references you have to your data structure at compile time. This means you can be sure what is local and what is shared.
Meanwhile without the borrow checker you always have to assume there is a remote probability that your mental model is wrong and that everything goes wrong anyways. That's mentally exhausting. If something goes wrong, it is better to only have to check the places where you know things can go wrong, rather than the entire code base.
I use lots of locals, but only to make my code very "local", i.e. fine-grained, editable and clear, using lots of temporary variables. No complicated expressions. That's all immutable data (after initialization). I rarely take the address of such data but make lots of copies. If I take its address, then it's as an immutable pointer, maybe not in the type system but at least in spirit.
I keep very little state on the stack -- mostly implicit stuff like mutex lock / mutex unlock. By "state" I mean object type things that get mutated or that need cleanup. I always have a "database schema" of my global state in mind. I define lots of explicit struct types instead of hiding state as locals in functions. I've found this approach of minimizing local state to be the right pattern because it enables composability. I'm now free to factor functionality into separate functions. I can much more freely change and improve control flow. With this approach it's quite rare that I produce bugs while refactoring.
So yes, I have lots of locals but I share basically none of them with other threads. Also, I avoid writing any code that blocks on other threads (other than maybe locking a mutex), so there's another reason why I would not intentionally share a local with another thread. Anything that will be shared with another thread should be allocated on the heap just for the reason that we want to avoid blocking on other threads.
In that sense, the borrow checker is a tool that would allow me to write code more easily that I never wanted written in the first place.
I find it better to model that as an Actor than a mutex, but I guess it's inherently the same thing, except the actor also allows asynchronous operations.
You can go full circle and also make operations on a mutex asynchronous. Hence the realization that message passing and shared memory are truly dual.
The very idea of a mutex is that it is synchronous. You wait until you can acquire the mutex.
If it's asynchronous, it's not a mutex anymore, or it's just used to synchronously setup some other asynchronous mechanism.
A mutex is a way to guarantee mutual exclusion, nothing more, nothing less; you can recover synchronous behaviour if you really want:
that isn't a mutex, that's delegating work asynchronously and delegating something else to run when it is complete (the implicitly defined continuation through coroutines).
In systems programming parlance, a mutex is a resource which can be acquired and released, acquired exactly once, and blocks on acquire if already acquired.
Do a CPS transform of your typical std::mutex critical section and you'll find they are exactly the same.
They're not, the interactions with the memory model are different, as are the guarantees.
CPS shouldn't be able to deadlock for example?
CPS can trivially deadlock for all meaningful definitions of deadlock.
Would you consider this a mutex?
What about `my_mutex mux;` where the code runs in a user-space fiber? Would you consider boost synchronized a mutex?
Don't confuse the semantics with the implementation details (yes async/await leaks implementation details).
This doesn't solve the deadlock problem, however.
Sounds like the Java synchronized class.
No. It's not a property of the type, so you can have multiple items under a mutex and you're not at the mercy of whoever wrote it; it works fine with POD types; it does not force a lock/unlock on each method call (instead the compiler essentially ensures you hold the lock before you can access the data); and the borrow checker is there to ensure you can not leak any sort of sub-state, even though you can call all sorts of helpers which have no requirement to be aware of the locking.
It’s what synchronized classes wish they had been, maybe.
Not at all. With rust you cannot accidentally leak a reference, and here's the killer: it guarantees these properties at compile time.
This felt like a blast from the past. At a few times reading this article, I had to go back and check that, yes, it's actually a new article from the year 2025 on STM in Haskell and it's even using the old bank account example.
I remember 15 or 20 years ago (has it been that long?) when the Haskell people like dons were banging on about: 1) Moore's law being dead, 2) future CPUs will have tons of cores, and 3) good luck wrangling them in your stone age language! Check out the cool stuff we've got going on over in Haskell!
Yeah, remember when we used to care about making better programming languages that would perform faster and avoid errors, instead of just slapping blockchains or AI on everything to get VC money. Good times.
Only half-joking: maybe Java was a mistake. I feel like so much was lost in programming language development because of OOP...
OOP is very useful/powerful; don't throw the good parts out. Java messed up by deciding everything must be an object when there are many other useful ways to program. (You can also argue that Smalltalk had a better object model, but even then, all-objects isn't a good thing.) Functional programming is very powerful and a good solution to some problems. Procedural programming is very powerful and a good solution to some problems. You can do both in Java - but you have to wrap everything in an object anyway, despite the object not adding any value.
> everything must be an object when there are many other useful way to program.
Perhaps you would prefer a multi-paradigm programming language?
http://mozart2.org/mozart-v1/doc-1.4.0/tutorial/index.html
Java was derived from C++, Smalltalk, and arguably Cedar, and one of its biggest differences from C++ and Smalltalk is that in Java things like integers, characters, and booleans aren't objects, as they are in C++ and Smalltalk. (Cedar didn't have objects.)
Right. Everything a user can define is an object, but there are a few non-object built-ins. (They are not objects in C++ either, but C++ doesn't make everything you write be an object.)
In C++ integers and characters are objects. See https://en.cppreference.com/w/cpp/language/objects.html, for example, which explicitly mentions "unsigned char objects", "a bit-field object", "objects of type char", etc.
I feel this is a case of using the same word to mean something different. C++ “object” here seems to mean something more akin to “can be allocated and stuffed into an array” than a Smalltalk-type object.
i.e. C++ primitive types are defined to be objects but do not fit into a traditional object-oriented definition of “object”.
Yes, many people believe that C++ isn't really "object-oriented", including famously Alan Kay, the inventor of the term. Nevertheless, that is the definition of "object" in C++, and Java is based on C++, Smalltalk, and Cedar, and makes an "object"/"primitive" distinction that C++, Smalltalk, and Cedar do not, so "Java [did something] by deciding everything must be an object" is exactly backwards.
I'm not sure who invented "object oriented", but objects were invented by Simula in 1967 (or before, but first released then?) and that is where C++ takes the term from. Smalltalk-80 did some interesting things on top of objects that allow for object oriented programming.
In any case, Alan Kay is constantly clear that object oriented programming is about messages, which you can do in C++ in a number of ways. (I'm not sure exactly what Alan Kay means here, but it appears to exclude function calls, but would allow QT signal/slots)
The specific thing you can do in Smalltalk (or Ruby, Python, Objective-C, Erights E, or JS) that you can't do in C++ (even Qt C++, and not Simula either) is define a proxy class you can call arbitrary methods on, so that it can, for example, forward the method call to another object across the network, or deserialize an object stored on disk, or simply log all the methods called on a given object.
This is because, conceptually, the object has total freedom to handle the message it was sent however it sees fit. Even if it's never heard of the method name before.
You can do that in C++ too - it is just a lot of manual work. Those other languages just hide (or make easy) all the work needed to do that. There are trade offs though - just because you can in C++ doesn't mean you should: C++ is best where the performance cost of that is unacceptable.
No, in C++ it's literally impossible. The language provides no way to define a proxy class you can call arbitrary methods on. You have to generate a fresh proxy class every time you have a new abstract base class you want to interpose, either by hand, with a macro processor, or with run-time code generation. There's no language mechanism to compile code that calls .fhqwhgads() successfully on a class that doesn't have a .fhqwhgads() method declared.
you don't call fhqwhgads() on your proxy class though. You call runFunction("fhqwhgads") and it all compiles - the proxy class then string-matches on the arguments. Of course, depending on what you want to do, it can be a lot more complex. That is, do manually what other languages do for you automatically under the hood.
Again, this is not something you should do, but you can.
That doesn't provide the same functionality, because it requires a global transformation of your program, changing every caller of .fhqwhgads() (possibly including code written by users of your class that they aren't willing to show you, example code in the docs on your website, code in Stack Overflow answers, code in printed books, etc.). By contrast, in OO languages, you typically just define a single method that's a few lines of code.
You're sinking into the Turing Tarpit where everything is possible but nothing of interest is easy. By morning, you'll be programming in Brainfuck.
To be clear, I’m not trying to pick at whether or not C++ is “really object oriented”.
What I’m saying is that the discrepancy between primitives in C++ and Java is entirely one of definition. Java didn’t actually change this. Java just admitted that “objects” that don’t behave like objects aren’t.
On the contrary, Java objects are very different from C++ objects, precisely because they lack a lot of the "primitive-like" features of C++ objects such as copying, embedding as fields, and embedding in arrays. (I'm tempted to mention operator overloading, but that's just syntactic sugar.)
Java differs from C++ in an endless number of ways.
What I’m saying is that in both C++ and Java, there are a set of primitive types that do not participate in the “object-orientedness”. C++ primitives do not have class definitions and cannot be the base of any class. This is very much like Java where primitives exist outside the object system.
If the C++ standard used the term “entities” instead of “objects” I don’t think this would even be a point of discussion.
It's not some minor point of terminology.
The entire design of C++ is built around eliminating all distinctions between primitive "entities" and user-defined "entities" in a way that Java just isn't. It's true that you can't inherit from integers, but that's one of very few differences. User-defined "entities" don't (necessarily) have vtables, don't have to be heap-allocated, can overload operators, can prevent subclassing, don't necessarily inherit from a common base class, etc.
C++'s strange definition of "object" is a natural result of this pervasive design objective, but changing the terminology to "entity" wouldn't change it.
> The entire design of C++ is built around eliminating all distinctions between primitive "entities" and user-defined "entities"
If the intent was to erase all distinction between built-in and user-defined entities then making the primitive types unable to participate in object hierarchies was a pretty big oversight.
But at this point I think we’re talking past each other. Yes, in Java objects are more distinct from primitives than in C++. But also yes, in C++ there is a special group of “objects” that are special and are notably distinct from the rest of the object system, very much like Java.
You can read Stroustrup's books and interviews, if the language design itself doesn't convey that message clearly enough; you don't have to guess what his intentions and motivations were. And, while I strongly disagree with you on how "special and notably distinct" primitive types are in C++, neither of us is claiming that C++ is less adherent to the principle that "everything is an object" than Java. You think it's a little more, and I think it's a lot more.
But we agree on the direction, and that direction is not "Java [did something] by deciding everything must be an object," but its opposite.
I don’t actually think it’s any more adherent to that notion. This is exactly why I tried to point out the discrepancies in definitions. You have to define what an “object” is or the discussion is meaningless.
If the definition of object is something like “an instance of a class that has state, operations, and identity” then C++ primitives are fundamentally not objects. They have no identity and they are not defined by a class. If “participates in a class hierarchy” is part of the definition, then C++ is way less OO than Java.
I don’t quite understand what your definition is, but you seem to be arguing that user-defined entities are more like primitives in C++, so it’s more object-oriented. So maybe “consistency across types == object orientedness”? Except C++ isn’t really more consistent. Yes, you can create a user-defined type without a vtable, but this is really a statement that user defined types a far more flexible than primitives. But also if “consistency across types” is what makes a language OO then C seems to be more OO than C++.
I don't think C++ is object-oriented, and it is certainly way less OO than Java in most ways. Its "classes" aren't the same kind of thing as classes in OO languages and its "objects" aren't OO objects.
In part this is because by default even C++ class instances don't have identity, or anyway they only have identity in the sense that ints do, that every (non-const) int has an address and mutable state. You have to define a destructor, a copy constructor, and an assignment operator to give identity to the instances of a class in C++.
With respect to "participates in a class hierarchy", that has not been part of the definition of OO since the Treaty of Orlando. But, in Java, all objects do participate in a class hierarchy, while no primitives do, while, in C++, you can also create class instances (or "class" "instances") that do not participate in a class hierarchy (without ever using inheritance, even implicitly). So, regardless of how OO or non-OO it may be, it's another distinction that Java draws between primitives and class instances ("objects" in Java) that C++ doesn't.
I think we are in agreement.
In C++, everything is an object as defined by the C++ spec, but a lot of things are not objects in an OO sense.
In Java, almost everything is an object in an OO sense, but some stuff is definitely not.
C++ standard gets the definition of object from the C standard, i.e. a bunch of memory with a type.
Nothing to do with the objects of OOP.
Just like Java, you cannot inherit from integers or characters. Depending on what you want to do with them that might or might not matter.
That's true, and in Smalltalk it's not true. In Cedar there is no inheritance. At any rate it's not a case of Java making more things objects than its forebears.
> At any rate it's not a case of Java making more things objects than its forebears.
There are other cases though. E.g. in C++ you could always have function pointers which were not objects and functions which were not methods. In Java you had to use instances of anonymous classes in place of function pointers, and all of your functions had to be (possibly static) methods.
Function pointers in C++ are "objects", and the difference between static methods and C-style functions which are not methods seems purely syntactic to me, or at best a question of namespacing. Regardless, static methods are not objects either in Java or in C++, so that is also not a case of something being an "object" in Java and not in C++.
> Function pointers in C++ are "objects"
In the C++ sense, but they don't have classes and don't participate in inheritance. Whereas the early-Java equivalents do.
Java is most of why we have a proliferation of VM-based languages and a big part of why WASM looks the way it does (though as I understand it, WASM is the shape it is in some measure because it rejects JVM design-for-the-language quirks).
I would also not blame Java for the worst of OO, all of that would have happened without it. There were so many OO culture languages pretending to that throne. Java got there first because of the aforementioned VM advantage, but the core concepts are things academia was offering and the industry wanted in non-Ada languages.
I would say Java also had a materially strong impact on the desire for server, database and client hardware agnosticism.
Some of this is positive reinforcement: Java demonstrates that it's better if you don't have to worry about what brand your server is, and JDBC arguably perfected ODBC.
Some of it is negative: a lot of the move to richer client experiences in the browser has to do with trying to remove client-side Java as a dependency, because it failed. It's not the only bridged dependency we removed, of course: Flash is equally important as a negative.
> I would also not blame Java for the worst of OO, all of that would have happened without it. There were so many OO culture languages pretending to that throne. Java got there first because of the aforementioned VM advantage, but the core concepts are things academia was offering and the industry wanted in non-Ada languages.
Really? Were there other major contenders that went as far as the public class MyApp { public static void main( nonsense?
And it's all still true, although I would offer the usual argument that concurrency!=parallelism, and if you reach for threads&STM to try to speed something up, you'll probably have a bad time. With the overhead of GC, STM-retries, false-sharing, pointer-chasing, etc you might have a better time rewriting it single-threaded in C/Rust.
STM shines in a concurrent setting, where you know you'll have multiple threads accessing your system and you want to keep everything correct. And nothing else comes close.
Yeah, I wrote an essay on STM in Haskell for a class back in 2005 I think.
To be maximally fair to the Haskell people, they have been enormously influential. Haskell is like Canada: you grow up nicely there and then travel the world to bring your energy to it.
I think Intel x86 had some hardware support for STM at some point. Not sure what's the status of that.
Disabled on most CPUs, plagued by security issues. I haven't used it but I assume debugging would be extremely painful, since any debug event would abort the transaction.
That's not software transactional memory, it's hardware transactional memory, and their design was not a good one.
Well, HTM was not useful per se, except accelerating an STM implementation.
It isn't very useful for that, but you can use it to implement other higher-level concurrency primitives like multi-word compare and swap efficiently.
True. At some point in the now distant past, AMD had a proposal for a very restricted form of HTM that allowed CAS up to 7 memory locations as they had some very specific linked list algorithms that they wanted optimize and the 7 location restrictions worked well with the number of ways of their memory.
Nothing came out of it unfortunately.
I'd like to see what kind of hardware acceleration would help STMs without imposing severe limitations on their generality.
To me, the appealing things about STMs are the possibility of separating concerns of worst-case execution time and error handling, which are normally pervasive concerns that defeat modularity, from the majority of the system's code. I know this is not the mainstream view, which is mostly about manycore performance.
Not an expert, but my understanding is that HTM basically implements the fast path: you still need a fully fledged STM implementation as a fallback in case of interference, or even in the uncontended case if the working set doesn't fit in L1 (because of way collision for example) and the HTM always fails.
I'm no expert either, but maybe some other hardware feature would be more helpful to STMs than hardware TM is.
I've been excited about STMs since I read "Composable Memory Transactions" back in 02005, shortly before it was published, and I still believe in the glorious transactional future, but it's difficult to adopt an STM piecemeal; it kind of wants to own your entire program, the same way that garbage collection or nonblocking I/O do, and more so than multithreading with locks. You kind of have to commit entirely to an STM. The efforts in C# to adopt an STM ending around 02010 were a disaster as a result.
The article says a couple of things about STMs that are not true of STMs in general, just true of the Haskell STM the author is familiar with, like a small Brazilian child confidently telling you that almost everyone speaks Portuguese.
One of these is, "STM is an optimistic concurrency system." The possibility of making your concurrency 100% lock-free is one of the most appealing things about STMs, and I think it could be a key to solving the UI latency problem, which just keeps getting worse and worse. Actors and CSP don't normally help here; an Actor is just as "blocking" as a lock. But you can implement an STM with partly pessimistic concurrency, or purely pessimistic, and it might even be a good idea.
Another is, "One last benefit of STM which we haven't yet discussed is that it supports intelligent transaction retries based on conditions of the synchronized data itself." This was an innovation introduced by "Composable Memory Transactions", and many STMs do not support it, including Keir Fraser's awesomely fast version. I am even less certain that it is the correct tradeoff for all uses than I am about purely optimistic synchronization.
But all of this is why I'm rereading Gray and Reuter's Transaction Processing right now after 25 years. With the benefit of 35 years of hindsight, it's a frustrating mix of inspiring long-range vision and myopic boneheadedness. But it shares a lot of hard-won wisdom about problems like long transactions that pop up in a new guise in STMs.
I think writing concurrent programs will always be a hard problem, relative to the difficulty of writing non-concurrent programs, and the only "solution" is to isolate, minimize, and regulate contention. The implementation details of TM, locks, monitors, semaphores, actors, message queues, transactions, etc., are at best "distractions", at worst hindrances. I think a good model of a concurrent program, one that lends itself to writing the program simply, will be applicable across many different implementations. Anything that obscures the development of such a model is harmful. Worst of all is the sheer prevalence of shared resources (especially shared memory). Sharing brings contention, so control sharing.
I don't agree that whether you're using TM, shared-memory monitors, or actors with message queues is an implementation detail or that there is a better programming model that hides the differences between them. You can implement any of them on top of any of the others, but you're still programming to whatever model you put on top.
In the implementation that runs, yes, you have to choose something. However, I think the fundamental design is independent of those options, and probably should be developed independently.
There is surely some sense in which that's true; the choice of concurrency primitives, even among such radically different choices, won't change literally every design decision in your system. But it is very pervasive, and it regularly provokes failures that are visible to users.
On the problems with STM in C#, see https://joeduffyblog.com/2010/01/03/a-brief-retrospective-on... (I can't believe nobody else has posted this link yet). As with the Chris Penner article, there are a lot of things described as features of STMs in general which are actually just properties of the STM he worked on, which explains some of the things that sound like nonsense if you've only worked with Haskell's STM or Clojure's. (Duffy is much better about delineating the boundaries of the systems he's talking about, though, because he knows there are alternatives.)
See also https://www.infoq.com/news/2010/05/STM-Dropped/.
> 02005, 02010
Are you planning for this comment to be relevant for long enough that the leading zeros will be helpful for disambiguation?
Clearly they are referring to the years 1029 and 1032 (decimal). I just want to know what calendar system they're using...
Since the reunification of China under the most glorious of all the dynasties, perhaps? Or the founding of Chichén Itzá?
I try not to post comments that won't be relevant 8000 years in the future.
interesting, the notation also implies 2005 and 2010 A.D and not B.C, or maybe the notation is about exactly A.D? either way, interesting choice if it was intentional. we say “year 515” without disambiguation right
He does it to get somebody to comment, and it worked this time.
I wish people would comment about transactions, optimistic synchronization, CSP, actors, priority inversion, Fraser's astounding code (https://www.cl.cam.ac.uk/research/srg/netos/projects/archive...) etc., but I guess we each do what we can, and you have to meet people where they are. I probably couldn't have posted a comment any better than yours when I was 12, and maybe you're 12, so maybe you can't do any better. Hopefully, eventually, you will.
Remember! Brawndo: It's What Plants Crave.
> A data race occurs any time two threads access the same memory location concurrently and non-deterministically when at least one of the accesses is a write.
From what I understand of the C++ memory model (shared by C and Rust), this is not the definition of data race – a data race occurs when two or more threads access memory concurrently where at least one access is a write and the accesses are unsynchronized. However, synchronized accesses may not have a deterministic ordering, in which case a race condition occurs.
(Confusing as it may be, I believe this is standard terminology.)
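To see the distinction in code, a small Rust sketch: every access below is synchronized (atomic), so there is no data race in the memory-model sense, yet the interleaving of load and store is nondeterministic, i.e. a race condition.

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::thread;

    static COUNTER: AtomicU64 = AtomicU64::new(0);

    fn main() {
        thread::scope(|s| {
            for _ in 0..2 {
                s.spawn(|| {
                    for _ in 0..100_000 {
                        // A read-modify-write built from two separate atomic accesses.
                        let current = COUNTER.load(Ordering::Relaxed);
                        COUNTER.store(current + 1, Ordering::Relaxed);
                    }
                });
            }
        });
        // Usually prints something well short of 200000.
        println!("{}", COUNTER.load(Ordering::Relaxed));
    }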
Optimistic locking works great when what is excluded is effectively a calculation. The annoyance, though, is you have basically acknowledged that you can use a compare-and-swap at the end of your calculation to know that things worked.
This is not at all always the case, though. Sometimes, what you need to use mutual exclusion for is actively working with something that is, itself, active work. That is, sometimes you have to stop something even starting.
Think about how you would model a laundromat. It isn't enough to say you could use optimistic locks for access to the machines. You can't use the machine until you can use the machine.
This is not unheard of in computing, either. Sometimes, you can't process a buffer until you know it has been finished by the code before you.
> Think about how you would model a laundromat. It isn't enough to say you could use optimistic locks for access to the machines.
One machine? Too easy, I want multiple machines!
I want to reserve this machine and that machine. And this is happening while someone else wants to grab that machine and this machine. It costs money to grab a machine, so I'll insert a coin into this machine, and then insert a coin into that machine. The other guy grabbed that machine before me, so am I locked out? It turns out he only had one coin, not two: not only did this mean he untook that machine, but the coin he spent flew out of the machine and back into his pocket. While that was going on, I took a nap - instead of busily waiting for that machines - and the laundromat operator woke me up when it was time to try to grab both machines again.
In C++ (with a bit of help from boost).
Hopefully synchronized will make it into the standard library at some point; in the meantime it is not terribly hard to write it yourself if you do not want a boost dependency.
Does this avoid the dining philosopher deadlock?
yes, 'synchronize' uses a try_lock/backoff algorithm, same as std::scoped_lock.
edit: it could theoretically livelock, but I believe most if not all STM implementations also do not guarantee forward progress.
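For reference, the try_lock/backoff idea looks roughly like this; a simplified std-only Rust sketch, not the boost implementation (and unlike the real algorithms it doesn't rotate which lock it blocks on):

    use std::sync::{Mutex, MutexGuard};

    fn lock_both<'a, T>(a: &'a Mutex<T>, b: &'a Mutex<T>) -> (MutexGuard<'a, T>, MutexGuard<'a, T>) {
        loop {
            // Block on the first lock, but only *try* the second; on failure,
            // release everything and back off, so we never sleep while holding
            // a lock -- which is what breaks the dining-philosophers deadlock.
            let ga = a.lock().unwrap();
            match b.try_lock() {
                Ok(gb) => return (ga, gb),
                Err(_) => {
                    drop(ga);
                    std::thread::yield_now();
                }
            }
        }
    }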
Purely optimistic STM implementations that abort transactions early and don't permit other transactions to read uncommitted data can guarantee forward progress, and I believe that both Haskell's STM and Fraser and Harris's STM do, though I could easily be mistaken about that.
Probably you are right. I vaguely remembered the "Why Transactional Memory Should Not Be Obstruction-Free" paper, but I might have misunderstood or forgotten what it meant (the implementation can be non obstruction-free, but it doesn't mean it can live-lock).
I'm reading the Kuznetsov and Ravi paper https://www.researchgate.net/publication/272194871_Why_Trans... now; I assume that's the one you mean? Its definition of "obstruction-freedom" is that every transaction "not encountering steps of concurrent transactions" must commit. This seems to be neither necessary nor sufficient for avoiding livelock, but certainly very helpful. Their weaker "progressiveness" property seems almost as good.
They claim that their STM "LP" is not obstruction-free but is wait-free, which is a very strong claim! WP explains, "A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress. ‘Non-blocking’ was used as a synonym for ‘lock-free’ in the literature until the introduction of obstruction-freedom in 2003." Kuznetsov and Ravi say of LP, "every transactional operation completes in a wait-free manner."
Its normative force seems to be founded on claims about performance, but it would be very surprising if the performance cost of guaranteed forward progress or obstruction-freedom were too high for me to want to pay it, since what I'm most interested in is latency and fault-tolerance, not parallel speedups.
I need to re-read the paper, but:
>"LP" is not obstruction-free but is wait-free
As far as I know, wait-free is a strictly stronger guarantee than lock-free, and lock-free is strictly stronger than obstruction-free. How can LP be wait-free but not obstruction-free?
In any case a wait-free algorithm can't live-lock by definition (progress and fairness of all threads is guaranteed), but the catch is that while the STM runtime itself might have this property, it doesn't necessarily translate to an algorithm implemented in the runtime (which makes sense, you should be able to implement a lock with an STM).
So, yes, the paper is interesting, but probably not relevant for this discussion and I shouldn't have brought it up.
Now the question again remains: do concrete STM implementations actually provide the guarantee you mentioned earlier, i.e. does a transaction aborting guarantee that another one succeeds? I would think it doesn't, as it would be very costly to implement: both transactions might end up aborting when they detect the conflict.
Maybe what concrete runtimes actually guarantee is an upper bound on spurious aborts and restarts; in the worst case you can fall back on a single global lock for serialization once that bound is reached.
You avoid livelock, as I understand the term in an STM, if the only thing that can prevent a transaction from committing when it tries to commit is some other transaction having committed. That way, forward progress is guaranteed; as long as some transaction commits, you're not livelocked, are you?
I'm not familiar with "obstruction-free"ness; should I be?
It kind of threw me off that we went from golang to Haskell, so I would love to see the bank example actually come back full circle to golang.
This is a nice article, and I appreciate the examples. The next problem to solve is how to persist state on disk across two different accounts after a transfer has been done.
>Moore's law is dying, beefy single cores are no longer keeping up.
On the other hand, there are many other things that could be done to avoid wasting all the extra power gained over the years, and they don't even require any parallelism.
Maybe, but on a 64-core machine, a single-threaded task can't even use 2% of the computer. Even interpreted Python wastes only 97% of the computer.
I wonder what happened to hardware transactional memory. Your CPU caches already keep track of which core is keeping which line in its cache and whether they have modified it via the MESI protocol:
https://en.wikipedia.org/wiki/MESI_protocol
So it has most of the hardware to support transactional memory, only it's not exposed to the user.
Intel had their own version (TSX), but it turned out to be buggy, so they disabled it with a microcode update, and it has since been dropped from newer CPUs, so that was that.
The fundamental problem here is shared memory / shared ownership.
If you assign exclusive ownership of all accounting data to a single thread and use CSP to communicate transfers, all of these made up problems go away.
Yes, multithreaded problems go away on a single thread.
Is there any way for an external thread to ask (via CSP) for the state, think about the state, then write back the new state (via CSP)?
If so, you're back to race conditions - with the additional constraints of a master thread and CSP.
That would be shared ownership again.
So then I would sell STM to you from the "other end".
Everyone else has multiple threads, and should replace their locks with STM for ease and safety.
You've got safe single-thread and CSP, you should try STM to gain multithreading and get/set.
CSP suffers from backpressure issues (which is not to say it's bad, but it's not a panacea either).
I would've enjoyed if the solution was proposed in Go as well.
STM isn't really used in Go like it is in Haskell.
Here's the example from a Go STM package that's based on Haskell STM. It has gotchas that you won't encounter in Haskell though, due to the nature of these languages.
https://github.com/anacrolix/stm/blob/master/cmd/santa-examp...
For the equivalent Haskell, check out the link at the top of the file.
Indeed, “Software Transactional Memory” sounds like a magic black box. It feels a bit like reading “just use an SQL database and transactions”. It’s not really telling me how the problem is solved, just to use someone else’s solution without understanding how it’s implemented.
https://www.microsoft.com/en-us/research/wp-content/uploads/...
So cool! Any languages support STM first class besides Haskell?
Scala supports it with for-comprehensions, which are equivalent to Haskell's do-notation, but STM is not part of the Scala standard library. ZIO and Cats Effect are two popular Scala effect systems with STM.
Not "first class" but pretty good in Kotlin
https://arrow-kt.io/learn/coroutines/stm/
I believe Clojure has first class support for STM.
Looks like somebody made a Rust experiment back when Rust was new: https://docs.rs/stm/latest/stm/
Scala has great STM in the same way (monad-based).
The new Verse lang by Epic Games & a core Haskell contributor has a lot of transaction features. I don’t know if it’s exactly the same as STM though.
Verse only supports single-threaded transactional memory. Epic hasn't yet demonstrated that their approach can actually scale to be used from multiple threads in a useful manner, though they claim that it will.
I think a decade ago or so, people started trying to integrate STM into PyPy.
There are C++ libraries that offer it.
I've found it interesting to think about trying to adapt data structures like CRDTs, designed for distributed systems, to the problem of local consistency between CPU-local data structures spread across cores for parallelism.
@OP perhaps there’s a comparison bug in withdraw(): `if (a.balance <= amount)`
I've caught unbalanced parens:
A nice trick in Go is embedding sync.Mutex into any type, e.g. account.Lock() / account.Unlock().
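For readers who haven't seen the trick, a minimal sketch (Account and Deposit are made-up names):

    package account

    import "sync"

    // Embedding sync.Mutex promotes Lock/Unlock onto the type itself,
    // so methods and callers can write a.Lock() / a.Unlock() directly.
    type Account struct {
        sync.Mutex
        Balance int
    }

    func (a *Account) Deposit(amount int) {
        a.Lock()
        defer a.Unlock()
        a.Balance += amount
    }

Of course this is exactly the kind of per-object lock the article says doesn't compose: locking two accounts for a transfer puts you right back at the ordering problem.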
C# offers a very convenient way to pack data access and locking into one thing: the "lock" instruction. It does however not let you lock more than one "resource" at a time.
All of the "my non-Haskell language does this" comments in the thread are the same (with maybe a Rust exception).
The "lock" instruction is what the article is telling you to ditch.
> If the programmer forgets to lock the mutex the system won't stop them from accessing the data anyways
If the programmer forgets to "lock"
> and even then there's no actual link between the data being locked and the lock itself
    lock (thing)
    {
        return thing.contents; // shared, mutable array given out freely to the world
    }
'contents' has no notion that it has anything to do with this "lock"ing thing.
OK, seems my English was not enough to see the point. After reading this clear explanation I have to agree with you 100%.
Shared data is hard no matter what you do. Parallelism and shared data do not mix. STM makes some things easier, but you will still run into problems if you have a lot of it. You must design your code so that you spend a lot of CPU cycles doing single-threaded, non-shared-data computation between every place where you need to share data. If you can't do that, you can't do parallelism.
When there are only a few places where data needs to be shared, a mutex works, since you can put your best programmer on maintaining just that code, and with only a few places they can figure it out. You can also make a variable atomic, which sometimes works better than a mutex and sometimes worse. You can use STM. However, no matter what you use, the reality of synchronizing between cores means you can't do any of the above "very often".
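As a concrete example of the "make a variable atomic" option, Go's sync/atomic covers the simple cases (atomic.Int64 needs Go 1.19+; the counter here is just illustrative):

    package counter

    import "sync/atomic"

    // requests can be bumped and read from many goroutines without a mutex,
    // but this only works for operations the hardware can do atomically
    // (add, load, store, compare-and-swap) on a single word.
    var requests atomic.Int64

    func handle() {
        requests.Add(1)
    }

    func served() int64 {
        return requests.Load()
    }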
It is rather long-winded, and ends with a donation request. I don't like that style of writing.
> If the programmer forgets to lock the mutex the system won't stop them from accessing the data anyways, and even then there's no actual link between the data being locked and the lock itself, we need to trust the programmers to both understand and respect the agreement. A risky prospect on both counts.
Not every language is like that. I would not use a language that allows this for parallel programming. You can use other approaches, but how can you guarantee that everyone who will ever edit the codebase will never use an unenforced mutex?
>Ugh, a correct transfer function should conceptually just be the composition of our well encapsulated withdraw and a deposit functions, but defining it correctly has forced us to remove the locking from both withdraw and deposit, making both of them less safe to use.
I know OOP isn't cool any more, but the above is what OOP solves.
TFA's transfer() and withdraw() functions aren't compliant with double-entry accounting anyways, so you'd mark them private and only expose Transfer to callers. Let the class handle its own details.
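A rough sketch of that encapsulation in Go terms (the names and the lock-ordering choice are mine, not TFA's): keep withdraw/deposit unexported and lock-free, and have the exported Transfer take both locks in a consistent order before calling them.

    package bank

    import "sync"

    type Account struct {
        id      int
        mu      sync.Mutex
        balance int
    }

    // Unexported helpers assume the caller already holds the locks.
    func (a *Account) withdraw(amount int) { a.balance -= amount }
    func (a *Account) deposit(amount int)  { a.balance += amount }

    // Transfer is the only exported entry point; it locks both accounts
    // in id order so two concurrent transfers can't deadlock.
    func Transfer(from, to *Account, amount int) {
        if from == to {
            return // avoid locking the same mutex twice
        }
        first, second := from, to
        if second.id < first.id {
            first, second = second, first
        }
        first.mu.Lock()
        defer first.mu.Unlock()
        second.mu.Lock()
        defer second.mu.Unlock()
        from.withdraw(amount)
        to.deposit(amount)
    }

That keeps the locking discipline inside the class, though, as the replies below note, nothing outside the class can compose Transfer with other operations atomically.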
OOP does not provide any help in solving the problem in question, and indeed encapsulation is a major obstacle to solving it. I think you haven't understood what problem is being discussed.
> I know OOP isn't cool any more, but the above is what OOP solves.
The article was a lengthy explanation of why the problem occurs in non-STM settings. In OO settings.
What do you propose as an OO solution?
When working with high-throughput concurrent code like producer-consumer pipelines, it's better to avoid shared mutable state entirely. Something actor-like fits better, and C# or Go channels work wonders.
TFA has a whole section praising actors for certain tasks and explaining why it doesn't fit here.
> it's better to avoid shared mutable state entirely.
Yes! I would even go so far as to have the type system separate the mutable from the non-mutable for this reason!
> Something actor like fits better and C# or Go channels works wonders.
Or even STM channels/queues:
Account balance is necessarily a shared mutable state.
It's not necessarily shared. You could assign a single thread to own the account balances, reading requests from a message queue. That probably scales better than locking. A single thread can do several million transactions per second, more than the entire global financial system.
And in a green threading system like Go or Scala Cats, the balances thread isn’t a thread at all, and it will run in the same thread as the transfer caller when the call isn’t contended, so you don’t even have a context switch.
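A minimal sketch of that single-owner pattern in Go; all the names (transferReq, runOwner, Transfer) are made up for illustration:

    package owner

    type transferReq struct {
        from, to string
        amount   int
        done     chan bool // receives true if the transfer was applied
    }

    // runOwner is the only goroutine that ever touches balances, so no
    // locks are needed; the request channel serializes all updates.
    func runOwner(requests <-chan transferReq) {
        balances := map[string]int{"alice": 100, "bob": 50}
        for req := range requests {
            ok := balances[req.from] >= req.amount
            if ok {
                balances[req.from] -= req.amount
                balances[req.to] += req.amount
            }
            req.done <- ok
        }
    }

    // Transfer is what every other goroutine calls.
    func Transfer(requests chan<- transferReq, from, to string, amount int) bool {
        done := make(chan bool)
        requests <- transferReq{from, to, amount, done}
        return <-done
    }

The catch, as pointed out below, is composition: anything that must change atomically together with a balance has to live inside the same owner, or you're back to coordinating owners.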
What if you want to compose an action on a balance with something else? (That is what the OP is about)
Also, with a queue, you've moved the shared state elsewhere, namely, into the queue.
The account-owning thread has to accept messages for every atomic action you need it to do.
There are lots of standard shared queues you can pick from that have been fully regression tested. That's almost always better than mixing concurrency in with your business logic.
Sure, but what I meant is when there is some other thing that needs to happen atomically together with the balance handled by that one thread (i.e., both the balance and the other thing change, or neither does). You'll need another thread for that other thing, then a mechanism to synchronize the two, and... you're back at the mutex.
Sure, but sometimes shared mutable state is better, especially from a performance point of view. For example blurring an image.
Isn't that a typical case where you don't have to share anything? Divide the image into N chunks, let N threads handle each one, no sharing, just need a single sync point at the end to wait on completion.
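A sketch of that chunking in Go, written as a 1-D horizontal box blur to keep it short: every worker reads the shared, immutable src but writes only its own rows of dst, so no locking is needed (a real 2-D blur would also read neighbouring rows, which is still fine because src is read-only).

    package blur

    import "sync"

    func BlurRows(src, dst [][]float64, workers int) {
        var wg sync.WaitGroup
        rowsPer := (len(src) + workers - 1) / workers
        for w := 0; w < workers; w++ {
            lo := w * rowsPer
            hi := lo + rowsPer
            if hi > len(src) {
                hi = len(src)
            }
            wg.Add(1)
            go func(lo, hi int) {
                defer wg.Done()
                for y := lo; y < hi; y++ {
                    for x := range src[y] {
                        sum, n := 0.0, 0
                        for dx := -1; dx <= 1; dx++ {
                            if x+dx >= 0 && x+dx < len(src[y]) {
                                sum += src[y][x+dx]
                                n++
                            }
                        }
                        dst[y][x] = sum / float64(n) // each worker writes disjoint rows
                    }
                }
            }(lo, hi)
        }
        wg.Wait()
    }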
tl;dr mutexes are evil because they don't compose, STM is the solution because it does compose, otherwise just avoid shared state, or even state entirely.
Not anything that's not already covered in any undergraduate CS course.
I've never taken an undergraduate CS course so I'm happy to have read this!
Which CS course did you go to?
You made me check the programmes of many university courses.
Most parallel programming courses are at the Master's level and require specialization, but those are much more advanced (sorting networks, distributed computing on supercomputers and GPUs, consensus algorithms, parallelization of linear solvers...).
There are undergraduate courses that cover simple things like multithreading patterns in mainstream languages, but it seems they are only available if you go to an institution with both a practical mindset and a firm grasp of the fundamentals, which is unfortunately quite rare, as most institutions tend to specialize in one or the other.
During my undergrad this stuff was heavily covered in my Intro to Operating Systems class. But that was 20 years ago, may be different now.