I’ve been working with LLMs a lot lately, and one consistent UX bottleneck is inference speed.
Many tasks follow this pattern: process small chunks → batch for inference → split results again. Parallelizing helps, but naive asyncio.gather approaches often backfire: every stage blocks until its slowest batch finishes, so downstream work sits idle and responsiveness suffers. Mixing fast per-item logic with slower batch steps needs smarter coordination.
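To make the failure mode concrete, here is a minimal sketch of the naive pattern. The names (embed_batch, naive_pipeline) are mine, purely for illustration: nothing reaches the caller until asyncio.gather has waited out the slowest batch.

```python
import asyncio
import random

async def embed_batch(batch: list[str]) -> list[str]:
    # Stand-in for a batched inference call with variable latency.
    await asyncio.sleep(random.uniform(0.1, 1.0))
    return [f"embedding({item})" for item in batch]

async def naive_pipeline(items: list[str], batch_size: int = 4) -> list[str]:
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    # gather blocks until the slowest batch finishes, so no result is
    # available downstream before everything is done.
    results = await asyncio.gather(*(embed_batch(b) for b in batches))
    return [r for batch in results for r in batch]

if __name__ == "__main__":
    docs = [f"doc-{i}" for i in range(16)]
    print(asyncio.run(naive_pipeline(docs))[:3])
```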
Technical approach:
I built a pipeline library that handles the streaming coordination automatically. It uses async generators throughout, with intelligent queuing for order preservation when needed.
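As a rough sketch of the idea (not the library's actual API), a batching stage built on an async generator might look like this: it groups items from an upstream async iterator, awaits a batch function, and streams per-item results back out as each batch completes.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")

async def batch_stage(
    source: AsyncIterator[T],
    fn: Callable[[list[T]], Awaitable[list[R]]],
    batch_size: int,
) -> AsyncIterator[R]:
    """Group items from an async source into batches, call a batch function,
    and stream per-item results back out as each batch completes."""
    batch: list[T] = []
    async for item in source:
        batch.append(item)
        if len(batch) == batch_size:
            for result in await fn(batch):
                yield result
            batch = []
    if batch:  # flush the final partial batch
        for result in await fn(batch):
            yield result

async def _demo() -> None:
    async def numbers() -> AsyncIterator[int]:
        for i in range(10):
            yield i

    async def double_all(batch: list[int]) -> list[int]:
        await asyncio.sleep(0.01)  # pretend this is a batched model call
        return [n * 2 for n in batch]

    async for value in batch_stage(numbers(), double_all, batch_size=4):
        print(value)  # values stream out batch by batch, not all at the end

if __name__ == "__main__":
    asyncio.run(_demo())
```

Because each stage both consumes and produces an async iterator, stages compose by chaining, and memory stays bounded by the batch currently in flight rather than the whole dataset.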
Architecture decisions:
Stream-first design: Results flow by default, with optional collection
Flexible ordering: Choose between speed (unordered) and sequence (ordered)
Memory efficiency: O(batch_size) memory usage, not O(dataset_size)
Backpressure handling: Automatic coordination between fast and slow stages
Error boundaries: Configurable failure strategies at the task level (see the usage sketch below)
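To tie these decisions together, here is a hypothetical end-to-end sketch; none of these names come from the library, they only illustrate how streaming output, bounded queues for backpressure, and a per-item error boundary fit together.

```python
import asyncio
from typing import AsyncIterator

async def produce(items: list[str]) -> AsyncIterator[str]:
    for item in items:
        yield item

async def worker(in_q: asyncio.Queue, out_q: asyncio.Queue) -> None:
    while True:
        item = await in_q.get()
        if item is None:               # sentinel: shut down
            await out_q.put(None)
            break
        try:
            await asyncio.sleep(0.05)  # stand-in for slow inference
            await out_q.put(f"result({item})")
        except Exception as exc:       # error boundary: record and keep going
            await out_q.put(f"error({item}: {exc})")

async def main() -> None:
    in_q: asyncio.Queue = asyncio.Queue(maxsize=8)   # bounded queues give backpressure
    out_q: asyncio.Queue = asyncio.Queue(maxsize=8)
    consumer_task = asyncio.create_task(worker(in_q, out_q))

    async def feed() -> None:
        async for item in produce([f"doc-{i}" for i in range(20)]):
            await in_q.put(item)       # blocks when the worker falls behind
        await in_q.put(None)

    feeder = asyncio.create_task(feed())
    while (result := await out_q.get()) is not None:
        print(result)                  # results stream out as they are ready
    await asyncio.gather(feeder, consumer_task)

if __name__ == "__main__":
    asyncio.run(main())
```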