Insights
Engineering Insights
Practical writing on software architecture, SaaS products, AI automation, legacy modernisation, and the business of building reliable systems.
Curated links from external sources — not 360Softy original articles.
Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries
CUDA 13.2 Introduces Enhanced CUDA Tile Support and New Python Features
CUDA 13.2 arrives with a major update: NVIDIA CUDA Tile is now supported on devices of compute capability 8.X architectures (NVIDIA Ampere and NVIDIA Ada), as... CUDA 13.2 arrives with a major update: NVIDIA CUDA Tile is now supported on devices of compute capability 8.X architectures (NVIDIA Ampere and NVIDIA Ada), as well as 10.X, 11.X and 12.X architectures (NVIDIA Blackwell). In an upcoming release of the CUDA Toolkit, all GPU architectures starting with Ampere will be fully supported. If yo
Implementing Falcon-H1 Hybrid Architecture in NVIDIA Megatron Core
In the rapidly evolving landscape of large language model (LLM) development, NVIDIA Megatron Core has emerged as the foundational framework for training massive... In the rapidly evolving landscape of large language model (LLM) development, NVIDIA Megatron Core has emerged as the foundational framework for training massive transformer models at scale. The open source library offers industry-leading parallelism and GPU-optimized performance. Now developed GitHub-first in the NVIDIA/Megatron-LM re
Announcing the AI Gateway Working Group
The community around Kubernetes includes a number of Special Interest Groups (SIGs) and Working Groups (WGs) facilitating discussions on important topics between interested contributors. Today, we're excited to announce the formation of the AI Gateway Working Group, a new initiative focused on developing standards and best practices for networking infrastructure that supports AI workloads in Kubernetes environments. What is an AI Gateway? In a Kubernetes context, an AI Gateway refers to network
Enhancing Distributed Inference Performance with the NVIDIA Inference Transfer Library
Deploying large language models (LLMs) requires large-scale distributed inference, which spreads model computation and request handling across many GPUs and... Deploying large language models (LLMs) requires large-scale distributed inference, which spreads model computation and request handling across many GPUs and nodes to scale to more users while reducing latency. Distributed inference frameworks use techniques such as disaggregated serving, KV cache loading, and wide expert parallelism. In d
Removing the Guesswork from Disaggregated Serving
Deploying and optimizing large language models (LLMs) for high-performance, cost-effective serving can be an overwhelming engineering problem. The ideal... Deploying and optimizing large language models (LLMs) for high-performance, cost-effective serving can be an overwhelming engineering problem. The ideal configuration for any given workload (such as hardware, parallelism, and prefill/decode split) resides in a massive, multi-dimensional search space that is impossible to explore manually or t
Persuasive Design: Ten Years Later
Many product teams still lean on usability improvements and isolated behavioral tweaks to address weak activation, drop-offs, and low retention – only to see results plateau or slip into shallow gamification. Anders Toxboe updates persuasive design for today’s reality, clarifying what has actually held up over the last decade.
Work with 360Softy
Building a SaaS product, AI system, or business platform?
Book a free consultation and we will tell you honestly whether we can help.

