One of the differentiating features of an IaaS cloud implementation is that you do not get access to a consolidated, scalable storage infrastructure, at least not in the way you might expect if you were simply scaling out compute nodes attached to the same SAN. You get remote block storage (Elastic Block Storage, EBS, in the case of Amazon) connected to a specific machine image, and you get REST-style object storage (Simple Storage Service, S3, in the case of Amazon), which is shared amongst images but does not speak the traditional filesystem APIs.
A lot of people have become dependent on EBS, as it seems closest to what they are used to. Amazon's recent outage was caused by the simultaneous failure of EBS in two Availability Zones; if you were dependent on one of them (or mirrored across the two), you lost access to the filesystem from your instances. It is also worth noting that EBS volumes are not like CIFS or NFS filesystems: each volume can be attached to only a single instance, so you are still left with a bunch of headaches if you have a replicated mid-tier that expects to see a shared filesystem (for example, to retrieve unstructured data). It may be sensible to move to the S3 mechanism (or some portable abstraction over it) for new applications, but an existing application that expects to see a filesystem in the traditional way would require you to rewrite your code. You are therefore left looking for a distributed, cloud-agnostic, shared filesystem with multi-way replication (including asynchronous replication), and this is where Gluster fits in.
Gluster is a filesystem (actually a user-space process rather than a kernel module) that typically runs inside Amazon or VMware machine images (usually dedicated ones, although they can run other workloads as well). It consolidates the disks attached to those images so that they appear as a single Network Attached Storage (NAS) resource with a unified global namespace, available to any number of client images. This allows standard POSIX APIs (i.e. unmodified codebases) to access large volumes of data, and large numbers of images to access the same data. Clients connect to the NAS via NFS or CIFS (aka Samba or SMB); there is also a native driver that offers parallel access to the underlying data, with correspondingly better performance. The underlying architecture typically runs on commodity hardware: there is no requirement for special-purpose storage. Many of Gluster's customers are online media companies that use it to store large numbers of media files, although there are customers in many sectors, including finance and government.
It should be noted that Gluster is not just for clouds, or even for virtualization: you can run it without an underlying hypervisor, and Gluster has even just announced an appliance. Some of these options may be useful in cloud-bursting scenarios, but our main focus in this post is the IaaS cloud.
We had a long conversation with Gluster about performance, and there is a comprehensive whitepaper. It seems counter-intuitive that a software-only virtual NAS system at a third of the price of a SAN could end up with three times the performance (as Gluster claims). There are many vested interests in the storage market, including the hardware vendors that supply or resell storage, who seek to maintain the mystique around storage; yet we note that Internet companies with their own datacenters (including Facebook, with its OpenCompute initiative) build them with commodity disks directly attached to servers. Also, since a version of the Gluster software is open source, it is quite hard for Gluster to lie about performance.
So, if we look at the performance and scalability issues in building a distributed filesystem, the first major problem is the "metadata" (the directory). This is typically some kind of index with a traversal time that is at best logarithmic in the number of files, and it must be maintained in a single location to avoid update anomalies. Gluster has addressed these scalability issues by using a hashing algorithm to partition the data, allowing access to be redirected to the correct server in a time that is essentially independent of the number of files, and without reference to a centralized metadata repository. The resulting scalability is (according to Gluster) limitless.
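The idea can be illustrated with a toy sketch (this is not Gluster's actual elastic-hash algorithm; the server names and the simple modulo placement are our illustrative assumptions): the file's name alone determines which server holds it, so locating a file costs one hash computation rather than a directory lookup that grows with the number of files.

```python
import hashlib

def brick_for(path: str, bricks: list) -> str:
    """Map a file path to a storage brick by hashing its name.

    No central directory is consulted: the lookup is O(1) and its
    cost is independent of how many files the cluster stores.
    """
    h = int(hashlib.md5(path.encode()).hexdigest(), 16)
    return bricks[h % len(bricks)]

# Hypothetical brick names, for illustration only.
bricks = ["server1:/brick", "server2:/brick", "server3:/brick"]
print(brick_for("/media/clip-0001.mp4", bricks))
```

Every client computes the same hash, so they all agree on a file's location without ever talking to a metadata server.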
The second major problem is getting data on and off the disks. Sustained data transfer rates and seek times can be improved, but they depend on the physical characteristics of the device, such as spin speed; local caches improve performance, and you can move to solid-state disks; and indeed you can do all of this whether or not you use Gluster. However, if you add more physical disks you will certainly get more storage capacity and (subject to other limitations discussed below) more "performance", where performance is measured in gigabytes per second.
Third, there is the code path between the application and the disk. In comparison to the alternative of a SAN, the code path followed by the data between the filesystem API and the disk is undoubtedly longer. Instead of passing via the VM filesystem driver through the hypervisor to the underlying SAN, it passes through the virtual and physical network drivers of two VM images, before passing through the VM filesystem driver on a remote virtual machine and then through the hypervisor to the commodity disk. The client side of this extended code path does not, in general, significantly impact performance, because it is distributed amongst the various clients. The problem arises on the server side, where data is being read or written through some number of file-server nodes (in a cloud these would be instances). However, given the partitioning of access, this bottleneck can be eased by adding more nodes to the cluster. Gluster claims the scaling is linear, and unless there are hot-spots in its hashing algorithm this is likely to be fairly close to the truth.
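The hot-spot question can be checked with a small simulation (the synthetic file names, MD5 and the choice of 8 nodes are our assumptions, not Gluster internals): if a hash spreads files nearly evenly across nodes, each added node carries its fair share of the load and throughput scales close to linearly.

```python
import hashlib
from collections import Counter

def brick_index(name: str, n_bricks: int) -> int:
    """Hash a file name to a node index, as in the sketch above."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % n_bricks

# Distribute 100,000 synthetic file names across 8 nodes and measure
# the skew between the busiest and the quietest node. A ratio near
# 1.0 means no hot-spots, so adding nodes adds usable throughput.
counts = Counter(brick_index(f"file-{i:06d}.dat", 8) for i in range(100_000))
skew = max(counts.values()) / min(counts.values())
print(f"max/min load ratio across nodes: {skew:.3f}")
```

With a well-mixed hash the ratio stays within a few percent of 1.0; a skewed hash (or pathologically similar file names) would show up here as a hot-spot, and that is exactly the failure mode that would break the linear-scaling claim.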
Finally, at a certain number of nodes, overall network bandwidth can become an issue, and you need to up the specification of the network controllers on the nodes.
Gluster in an IaaS cloud
When you put all of this in the cloud, the nodes become machine instances connected to their disks through, for example, EBS. Naturally this leads to a performance hit, because there are now two layers of indirection between the client and the disk hardware: access goes through the Gluster node and then on to the EBS node. Gluster admits to a performance degradation of more than 50% resulting from this extra layer; in other words, you need twice as many nodes in the cloud as you would on premises. On the positive side, the key benefit of the Gluster approach is that the elasticity of the compute architecture can be matched by the elasticity of the filesystem. If you run out of storage, you can add another node to the Gluster filesystem. As you scale down your storage requirements (admittedly a fairly rare scenario) you can remove nodes. More generally, you can add and remove nodes from the Gluster filesystem to balance capacity and data transfer rates. Quite how this works under the covers, with the distributed hashed metadata defining the partitioning, is unclear, but Gluster assures us that it does.
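One way such elastic rebalancing can work under the covers is consistent hashing: nodes own arcs of a hash ring, so adding a node relocates only the files that fall on its new arcs rather than reshuffling everything. The sketch below illustrates that idea under our own assumptions (virtual nodes, MD5, made-up node names); it is not a description of Gluster's actual implementation.

```python
import bisect
import hashlib

def ring(nodes, vnodes=64):
    """Build a consistent-hash ring. Each node is placed at several
    pseudo-random points ("virtual nodes") so load stays even."""
    points = []
    for n in nodes:
        for v in range(vnodes):
            h = int(hashlib.md5(f"{n}#{v}".encode()).hexdigest(), 16)
            points.append((h, n))
    return sorted(points)

def locate(points, key):
    """A key belongs to the first ring point at or after its hash."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    i = bisect.bisect(points, (h,)) % len(points)  # wrap around the ring
    return points[i][1]

files = [f"file-{i}.dat" for i in range(10_000)]
before = ring(["n1", "n2", "n3"])
after = ring(["n1", "n2", "n3", "n4"])   # scale out by one node
moved = sum(locate(before, f) != locate(after, f) for f in files)
print(f"files relocated after adding a node: {moved / len(files):.1%}")
```

Only roughly the new node's share of the files moves (about a quarter when going from three nodes to four); a naive `hash % n` placement would instead relocate most of the data on every resize, which is why some form of stable partitioning is essential to elastic scaling.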
The basic mechanism for providing availability is replication. First, Gluster suggests that the disks themselves be RAID-ed at the point they attach to the nodes, so that a node is resilient to an individual disk failure. Next, Gluster provides N-way synchronous replication across multiple nodes, and it is this piece that could have solved the problem in the EBS outage. Gluster can replicate across multiple Availability Zones, not just two; so if you were unlucky enough to be caught up in the two Availability Zones that failed in the Amazon outage, and prescient enough to use Gluster for three-way replication, you would have been OK.
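The principle behind N-way synchronous replication can be sketched in a toy model (the zone names and the in-memory "store" are our illustrative assumptions, not Gluster code): a write completes only once every replica has acknowledged it, so any single surviving zone still holds a current copy.

```python
class Zone:
    """Toy stand-in for a storage node in one availability zone."""
    def __init__(self, name):
        self.name, self.store, self.up = name, {}, True

    def write(self, path, data):
        if not self.up:
            raise IOError(f"{self.name} unavailable")
        self.store[path] = data

def replicated_write(path, data, zones):
    """Synchronous N-way replication: succeed only when every
    replica has the data, so no zone ever holds a stale copy."""
    for z in zones:
        z.write(path, data)

def read_any(path, zones):
    """Serve the read from any zone that is still up."""
    for z in zones:
        if z.up:
            return z.store[path]
    raise IOError("no zone available")

zones = [Zone("us-east-1a"), Zone("us-east-1b"), Zone("us-east-1c")]
replicated_write("/media/clip.mp4", b"...", zones)
zones[0].up = zones[1].up = False        # two zones fail simultaneously
print(read_any("/media/clip.mp4", zones))  # the third replica still serves
```

This is the three-way scenario described above: with replicas in three zones, the simultaneous loss of two still leaves the data readable, at the cost of every write waiting for all three acknowledgements.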
In addition, Gluster has just announced asynchronous replication that can be used across multiple Regions and/or even to provide a federated cloud. This is an area of great interest, and we hope to discuss it in more detail with Gluster in due course.
Gluster licensing and OpenStack
Of course, a lot of the overheads of Gluster in the cloud would go away if you ran Gluster natively instead of over EBS: you don't really need both. We had a very interesting conversation with Gluster about its emerging membership of OpenStack. OpenStack does not have an EBS layer, and there is no doubt that Gluster could contribute one; indeed it already has an open source version of its product. Gluster owns the IPR to its filesystem in its entirety, and runs a dual-license model: commercial and Affero GPL (AGPL). The commercial license includes packaging, indemnification, support and upgrades; the AGPL license allows free use and modification, subject to publication of any modification. In practice, enterprise customers with the filesystem scalability requirements that Gluster addresses will likely need the support of the commercial license.
The Affero GPL license (in contrast to the regular GPL) puts a duty to publish changes onto any consumer of the software, including an enterprise customer or a service provider (in fact, anyone apart from Gluster itself). It thus protects Gluster's interests against a cloud service provider enhancing the codebase and refusing to contribute its fixes back. Another user of this license is SugarCRM (as of 2010). Whilst AGPL and the Apache 2 license used by OpenStack are compatible, in the sense that you can mix software from both licenses, if Gluster were to donate code to OpenStack it would have to re-license it under Apache, and would lose some of the AGPL protections.
As the cloud matures, it is becoming clear that a range of different storage requirements and solutions will emerge, with increasing demand for access to the mass of unstructured data such as video and audio, in particular the "long tail" of infrequently accessed data in a media server or, for example, an image application for a healthcare provider. What Gluster is able to do is allow applications that expect huge volumes of this data to be available in a traditional filesystem (rather than an object store) to migrate to the cloud without modification of the codebase, and to scale their filesystem requirements along with their compute requirements. Gluster also adds extra resilience through multi-way replication, and has an emerging asynchronous replication story that may allow it to play a big role in the future of federated clouds.