That’s right: the Stones thought two’s a crowd. I wonder what they would think of public cloud vendors! Most CSOs dread the unknown, and how these public clouds are operated is largely unknown to the people who rely on them. We at Netskope decided to build our own cloud because, to optimize it and to guarantee performance, security, and availability, we should know everything about the cloud in which our customers put their trust.
Building your own cloud is not easy – which is why the majority of SaaS startups invariably choose a public cloud vendor. There are a lot of problems you have to solve if you build your own cloud:
- Where do you host your data centers?
- How do you build your network?
- With what vendors do you choose to work?
- How do you plan for growth?
- How do you secure your cloud?
- How do you make sure it is compliant with corporate and industry policies and regulations?
At Netskope, we started by creating a blueprint of how our eventual app deployment would look. We assumed we would have customers in different parts of the world and decided to build the app from the ground up to be globally distributed. It is easy to underestimate the planning and work that a globally distributed app entails, so we took great care to make sure our design was sound. At the same time, we did not over-engineer things. We followed the basic tenets of good cloud app design:
- Build to scale out
- Ensure all interfaces are stateless
- Anticipate failures in the design
- Create a true Service-Oriented Architecture (SOA) with SLAs for each service
- Account for global server load balancing (GSLB) and geographic residency requirements
Scale out versus scale up
There was great debate about this a few years ago, but the verdict is in: scale out wins in most situations. Scale up may work for you, but it has a definite breaking point. With a scale-out architecture, you can push the breaking point so far out that you mitigate much of the scalability risk.
Stateless architecture
In any cloud app, maintaining state is hard. You do not know what is going to fail. Your trusted ISP may go down, or there may be an outage in an interconnect. Recovering state after a failure is complex and rarely works cleanly. Netskope’s engineering team has done a great job of avoiding state, even in places where it would have been simpler to be stateful.
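One common way to keep an interface stateless is to have each request carry the context it needs, so any replica can serve it and nothing has to be recovered after a failure. The sketch below is only an illustration of that idea, not a description of Netskope’s implementation; the names `issue_token`, `read_token`, and `handle_request`, the shared demo key, and the `tenant` and `page` fields are all hypothetical.

```python
import base64
import hashlib
import hmac
import json

# Assumption: every replica is provisioned with the same signing key.
SECRET = b"demo-only-shared-key"


def issue_token(state: dict, key: bytes = SECRET) -> str:
    """Pack request context into a signed token that the client sends back."""
    payload = base64.urlsafe_b64encode(
        json.dumps(state, sort_keys=True).encode()
    ).decode()
    sig = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + sig


def read_token(token: str, key: bytes = SECRET) -> dict:
    """Verify the signature and return the context carried by the token."""
    payload, _, sig = token.rpartition(".")
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("token failed verification")
    return json.loads(base64.urlsafe_b64decode(payload))


def handle_request(token: str, replica: str) -> dict:
    """Serve a request using only what the token carries (no local session)."""
    state = read_token(token)
    return {"served_by": replica, "tenant": state["tenant"], "page": state["page"]}


token = issue_token({"tenant": "acme", "page": 2})
# Either replica can answer, because neither holds per-client state.
print(handle_request(token, "replica-a"))
print(handle_request(token, "replica-b"))
```

If a replica disappears mid-session in this sketch, the next request simply lands on another one; there is no session to rebuild.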
Resilient to failures, and able to recover when they happen
Our cloud is built to be resilient to failures. We have replicas and backups of data, multi-path links to all nodes in the network, and various other protections that help us resist those temporary fade-outs. In addition, our software is built for transparent retries with exponential back-off (so that we do not DDoS ourselves) and uses common-sense techniques to limit cascading failures.
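To make the retry and back-off idea concrete, here is a minimal sketch of one well-known variant: exponential back-off with a cap and random jitter. It is not Netskope’s code; `TransientError`, `backoff_delays`, `call_with_retries`, and every default value are assumptions chosen for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delays(base=0.1, factor=2.0, cap=5.0, attempts=5):
    """Return the un-jittered delay ceiling (seconds) before each retry."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def call_with_retries(operation, attempts=5, base=0.1, factor=2.0, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying up to `attempts` times on TransientError."""
    delays = backoff_delays(base, factor, cap, attempts)
    for attempt in range(attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # retry budget exhausted: surface the failure
            # Jitter: sleep a random fraction of the ceiling so that many
            # clients recovering at once do not retry in lockstep.
            sleep(rng() * delays[attempt])


if __name__ == "__main__":
    outcomes = iter([TransientError(), TransientError(), "ok"])

    def flaky():
        result = next(outcomes)
        if isinstance(result, Exception):
            raise result
        return result

    print(call_with_retries(flaky))
```

The cap keeps the wait bounded, and the jitter is what stops a fleet of clients from hammering a recovering service at the same instant, which is the self-inflicted DDoS the paragraph above warns about.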
True SOA with SLA guarantees
An age-old design pattern in software is to separate the interface from the implementation. This is especially true in the cloud. We have built very robust private APIs that allow us to build loosely coupled distributed systems. Each API has an SLA, much like what Amazon documented in their seminal Dynamo paper. This allows us to build features knowing full well what the expected performance will be.
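The Dynamo paper expresses its SLAs at a high percentile of the latency distribution rather than as an average. A per-API SLA of that kind can be checked roughly as follows. This is a sketch under assumptions: the `SLA` class, the nearest-rank percentile method, and the sample numbers are illustrative and are not taken from Netskope’s services.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SLA:
    """Hypothetical per-API target: `percentile` of calls within the limit."""
    percentile: float
    max_latency_ms: float


def percentile_latency(samples, percentile):
    """Nearest-rank percentile of a non-empty list of latencies (ms)."""
    ordered = sorted(samples)
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_sla(samples, sla):
    """True if the observed latency at the SLA percentile is within the limit."""
    return percentile_latency(samples, sla.percentile) <= sla.max_latency_ms


# Illustrative data: 99 fast calls and one slow outlier.
observed = [10.0] * 99 + [500.0]
print(meets_sla(observed, SLA(percentile=99, max_latency_ms=50.0)))
print(meets_sla(observed, SLA(percentile=100, max_latency_ms=50.0)))
```

Callers of a service can then plan around the published percentile target instead of the service’s internals, which is the point of separating interface from implementation.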
GSLB and geo-residency
Building an enterprise-class cloud application that has great performance requires many points of presence (PoPs). But a global enterprise application also needs to be built for the geo-residency and privacy requirements of the countries it serves. If you build your app from the ground up knowing you have to support these requirements, it becomes much easier to cater to international customers and to use your global scale for better performance.
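The interplay between performance and residency can be sketched as a selection rule: restrict the candidate PoPs to the regions a customer’s data may live in, drop unhealthy ones, then pick the closest. The `PoP` fields, region labels, and `pick_pop` function below are hypothetical, and a real GSLB would work from live health checks and measurements rather than a static list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoP:
    name: str
    region: str        # hypothetical residency region label, e.g. "eu"
    healthy: bool
    latency_ms: float  # assumed to be measured from the client's vantage point


def pick_pop(pops, allowed_regions=None):
    """Pick the lowest-latency healthy PoP that satisfies residency rules.

    allowed_regions=None means the customer has no residency constraint.
    """
    candidates = [p for p in pops if p.healthy]
    if allowed_regions is not None:
        candidates = [p for p in candidates if p.region in allowed_regions]
    if not candidates:
        raise LookupError("no healthy PoP satisfies the residency constraint")
    return min(candidates, key=lambda p: p.latency_ms)


pops = [
    PoP("sjc", "us", True, 12.0),
    PoP("fra", "eu", True, 95.0),
    PoP("ams", "eu", False, 80.0),
]
print(pick_pop(pops).name)                          # unconstrained customer
print(pick_pop(pops, allowed_regions={"eu"}).name)  # EU-resident customer
```

Note that the residency filter is applied before latency is considered: in this sketch a constrained customer is never routed out of region just because a faster PoP exists elsewhere.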
These are just a few of the basic tenets that we followed in our system design to allow us to scale. In future blog posts, we will talk about how we secured our cloud and share more details about our network design.