Given their estimates of how long a human would take to complete the same work, this fits a similar pattern to METR's results: at "humans would take 11.5 hours" (Figure 4, median) you're pushing your luck to get any success with all but the most recent models*, and METR tests software, where AI has the possibility of fully automating a lot of its own tests.
Even models more recent than the ones they tested, like Opus 4.5, only succeed 50% of the time at tasks that take humans 5h20m: https://metr.org/time-horizons/
Assuming the bubble doesn't pop/WW3 doesn't start first (IDK, 25% and 5% respectively?), and if trends continue (???), I expect a similar paper this time next year to show something like 50% success at automating similar tasks; rough arithmetic sketched below.
* which they didn't test; I don't blame them for that, since this field moves too fast
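To make the "if trends continue" arithmetic concrete, here's a back-of-envelope sketch in Python. It assumes METR's reported ~7-month doubling time for the 50%-success time horizon and takes the 5h20m Opus 4.5 figure above as the starting point; both numbers are inputs I'm plugging in, not outputs of this paper.

    # Back-of-envelope extrapolation (assumed inputs, not this paper's data):
    # - 50%-success time horizon doubles every ~7 months (METR's trend)
    # - current horizon ~5h20m (Opus 4.5, per the METR link above)
    doubling_months = 7.0
    horizon_hours = 5 + 20 / 60   # 5h20m as a decimal

    months_ahead = 12             # "this time next year"
    projected = horizon_hours * 2 ** (months_ahead / doubling_months)

    print(f"projected 50%-success horizon: {projected:.1f} hours")
    # ~17.5 hours, past the paper's 11.5-hour median task, which is why
    # ~50% success on similar tasks within a year seems plausible.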
Actual paper: https://www.remotelabor.ai/paper.pdf
Sounds about right.
Duplicate of this one: https://news.ycombinator.com/item?id=47011722
Also
https://news.ycombinator.com/item?id=46928172
https://news.ycombinator.com/item?id=47004754
Translation: "96% of people trying to replace workers with AI don't know how to prompt it effectively or supervise its output."
Or they've determined that micromanaging it is circuitous and increases their dependence on tech giants, so it's a bad deal given that they also need to know the work well enough to verify it anyway.
The 4% are using it to write posts about AI on LinkedIn.
So what you're saying is the interface fails the common case?
96% are "holding it wrong".
There's a saying that if everywhere you go it smells like shit, you might just have some shit smeared on your own nose.
96% is not "holding it wrong".