Premise
Google Spanner brings a new and exciting database for OLTP use, with greater scaling capabilities than any traditional SQL database. The database is full of innovative technologies. This is a new horse for new courses: it offers very large OLTP databases that are global in scale, where consistency is more important than availability. Wikibon believes Google Spanner will do well in solving emerging probabilistic problems driven by AI and advanced analytics. The potential market for this is large. However, Wikibon believes that Google will have a hard time replacing the vast majority of traditional deterministic systems of record.
Executive Summary
Google Spanner offers great scalability with consistency over distance. Spanner is different from traditional databases like Oracle, and cannot offer the same levels of availability. However, Spanner can address many new problems in the rapidly growing probabilistic application space, at a much lower cost.
Some key points customers should consider:
* Spanner does not offer an easy lift-and-shift path for existing SQL databases. AWS Aurora has a significant head-start in this regard.
* Converting existing database systems to Google Spanner just to save money should be strongly avoided. The business risks are very high, especially as the architecture is different. Applying resources to new digital transformation almost always results in better business returns.
* Senior technology executives should invest in evaluating Google Spanner, understanding where it fits, and trying it out on new projects. The operational aspects of Spanner, including availability, backup, and recovery, should be focus areas of any pilot project. Senior executives should also discuss Google's planned investments and roadmap with Google. A Google mug (see Figure 2) may be of significant help in Oracle negotiations.
Because of Google Spanner's different architecture and the high cost and business risk of migration, Wikibon projects that only about 5% of Oracle licenses will migrate to Google Spanner. However, Google Spanner can participate in a rapidly growing market of probabilistic workloads driven by advanced analytics and AI, which will feed into existing systems of record.
Database Fundamentals and CAP Theorem
This section covers database theory, and discusses three database models. Experts in database theory can skip this section.
SQL and ACID Properties
A database is an organized set of data stored on IT systems that supports a business mission. A database management system (DBMS) is software that captures, processes and stores that data in a database.
The dominant type of DBMS is the relational database, and most relational databases use SQL as the standard for accessing data. SQL defines the properties of transactions against the database, the most important of which are the ACID properties (Atomicity, Consistency, Isolation, Durability). Oracle, IBM DB2, and Microsoft SQL Server are considered the leading high-end, large-scale database systems. They provide strong consistency, very high availability and recoverability, and rich functionality. They are also the most expensive, and the database software is the largest single cost component in developing and deploying mission-critical database systems.
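To make the ACID guarantees concrete, the sketch below shows a transfer between two accounts that either commits in full or rolls back entirely. It is a minimal illustration only, using Python's standard sqlite3 module and an invented ledger schema, not a representation of any of the databases discussed here.

```python
import sqlite3

# Illustrative only: a tiny ledger where a transfer must be atomic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates commit, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            # Enforce a consistency rule: no negative balances.
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
    except ValueError:
        pass  # the rollback has already undone the debit

transfer(conn, "alice", "bob", 150)  # fails: rolled back, balances unchanged
transfer(conn, "alice", "bob", 60)   # succeeds: both rows updated together
print(conn.execute("SELECT * FROM accounts ORDER BY id").fetchall())
```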
CAP Theorem & CA Database Models
The CAP theorem (Figure 1) dictates that a system cannot simultaneously guarantee being Consistent, Available, and Partition Tolerant (scalable); only two of the three can be guaranteed at once. The Oracle, DB2, and SQL Server databases focus on guaranteeing availability and consistency (the CA model in Figure 1).
It should be noted that it is technically difficult to guarantee even one of these dimensions; many databases fail at all three.
NoSQL & AP Database Models
NoSQL databases became popular after about 2000, and especially with the emergence of Big Data over the last ten years. They do not require fixed table schemas, can use denormalized data, and often support horizontal and distributed partitioning. This makes them faster and able to support higher throughput. However, NoSQL databases guarantee availability and partition tolerance (the AP model in Figure 1), but cannot guarantee consistency. Many of them offer an eventually consistent model. Leaving consistency to be added by the programmer is almost never a viable alternative for mission-critical applications.
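As a simple illustration of what "eventually consistent" leaves to the programmer, the hypothetical sketch below reconciles two replicas that accepted conflicting writes using last-write-wins timestamps. The names and structure are ours for illustration, not any particular NoSQL product's behavior.

```python
import time

# Hypothetical last-write-wins (LWW) reconciliation between two replicas.
# Each replica stores key -> (value, timestamp); concurrent writes conflict.
replica_a = {}
replica_b = {}

def write(replica, key, value):
    replica[key] = (value, time.time())

def merge(r1, r2):
    """Eventual consistency via LWW: the newer timestamp wins on conflict.
    Note what is lost: the 'losing' update silently disappears, which is
    exactly the kind of anomaly a strongly consistent database prevents."""
    merged = dict(r1)
    for key, (value, ts) in r2.items():
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return merged

write(replica_a, "cart:42", ["book"])          # accepted by replica A
write(replica_b, "cart:42", ["book", "lamp"])  # concurrent write on replica B
converged = merge(replica_a, replica_b)
print(converged["cart:42"][0])  # only one of the two carts survives
```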
CP Database Models
There is a relatively new class of relational databases that combines scalable performance with the ACID properties of traditional relational databases (CP model in Figure 1). These include Citus, CockroachDB, Google Spanner, MemSQL, and others.
Google Spanner
Google Spanner was initiated in 2012 for internal Google use, and was announced as a Google Cloud product in February 2017. A good paper on the technology is "Spanner, TrueTime & The CAP Theorem" by Eric Brewer, VP Infrastructure at Google. Eric Brewer is the author of the CAP theorem (see the section above).
Google Spanner Scaling
As discussed above, Google Spanner is, at its core, a Consistent/Partition Tolerant (CP) model. Almost all traditional databases, such as AWS Aurora, IBM DB2, Microsoft SQL Server, and the MySQL variants, are designed around a Consistent/Available (CA) model. The ability to horizontally scale a CA model is limited to a relatively small number of nodes. Most CA database scaling is vertical, with faster processors, faster IO, and large DRAM memories. The greatest scaling of all was the IBM Parallel Sysplex, which allowed up to 32 systems to be geographically connected and deployed against the same application. There comes a point in scaling when techniques such as sharding need to be utilized, which is a manually intensive process of database redesign.
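To show why sharding is a design decision rather than a knob, here is a minimal, hypothetical hash-sharding router in Python (the node names and key format are invented for illustration). Once keys are mapped to shards this way, changing the shard count means re-mapping and physically moving data, which is the manually intensive redesign referred to above and the work Spanner's automatic splitting is intended to avoid.

```python
import hashlib

# Hypothetical hash-based shard router for a CA database split across N nodes.
SHARDS = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]

def shard_for(customer_id: str, shards=SHARDS) -> str:
    """Route a key to a shard by hashing it modulo the shard count."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return shards[int(digest, 16) % len(shards)]

print(shard_for("customer-1001"))  # e.g. 'db-node-2'
# The catch: adding a fifth node changes the modulus, so most existing keys
# now map to different shards and their rows must be moved, along with all
# the application and operational work that re-distribution implies.
```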
The major advantage of Spanner over traditional relational models is the ease of addressing performance issues. In theory and in practice you can just add more nodes to scale, while also maintaining consistency. Maintaining consistency means ease of programming, especially in update heavy environments. NoSQL databases with an Available/Partition Tolerant model can also scale the same way, but consistency falls by the wayside when there are updates.
The elephant in the Google Spanner room is how to achieve Availability!
Google Spanner Availability
To reiterate, Google Spanner is at heart a CP model. Google's strategy with Spanner is not to claim guaranteed availability, but to improve availability as much as possible.
Google has introduced many features that raise availability to a very high level. Spanner utilizes the Paxos algorithm, which solves for consensus in a network of unreliable processors. The Paxos algorithm is well understood; companies such as WANdisco have successfully used it to provide consistency across large distances for backup and other solutions, and have a partnership with IBM for cloud delivery.
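For readers unfamiliar with Paxos, the sketch below is a minimal, single-process illustration of the single-decree protocol (prepare/promise, then accept) over in-memory acceptors. It is our simplification for illustration only, not Spanner's or WANdisco's implementation, and it omits networking, failures, and multi-decree replication.

```python
# Minimal single-decree Paxos sketch (in-memory, single process, illustration only).

class Acceptor:
    def __init__(self):
        self.promised = -1      # highest proposal number promised
        self.accepted_n = -1    # proposal number of the accepted value, if any
        self.accepted_v = None

    def prepare(self, n):
        """Phase 1b: promise not to accept proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject",)

    def accept(self, n, v):
        """Phase 2b: accept the value if no higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = self.accepted_n = n
            self.accepted_v = v
            return ("accepted",)
        return ("reject",)

def propose(acceptors, n, value):
    """Phases 1a/2a: a value is chosen once a majority of acceptors accept it."""
    majority = len(acceptors) // 2 + 1

    promises = [a.prepare(n) for a in acceptors]
    granted = [p for p in promises if p[0] == "promise"]
    if len(granted) < majority:
        return None  # could not get a quorum of promises

    # If any acceptor already accepted a value, we must propose that value.
    prior = [(pn, pv) for (_, pn, pv) in granted if pn >= 0]
    if prior:
        value = max(prior)[1]

    accepts = [a.accept(n, value) for a in acceptors]
    if sum(1 for r in accepts if r[0] == "accepted") >= majority:
        return value  # chosen; later proposals will learn the same value
    return None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, value="commit-txn-17"))   # 'commit-txn-17' is chosen
print(propose(acceptors, n=2, value="something-else"))  # still 'commit-txn-17'
```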
Network availability for Google Spanner is helped enormously by Google's excellent dedicated private global network. There are multiple independent fibers connecting every data center, and there is significant redundancy within each data center. Latency is both low and consistent. High network availability is a major contributor to improving overall availability.
Spanner also utilizes "TrueTime", a globally synchronized atomic clock, which allows events to be ordered in real time. This technique was first introduced on the IBM mainframe with Parallel Sysplex, which allows up to 32 distributed mainframes to be closely connected; advanced databases such as IBM DB2 and Oracle are supported by Parallel Sysplex across regions. Spanner's TrueTime takes this further, enabling consistency across regions and even between continents, with many more nodes. TrueTime also enables taking snapshots across multiple independent systems and storing multiple snapshot versions in logs over time. This improves recovery time, and can improve overall availability.
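The idea behind TrueTime can be illustrated with a small Python sketch: the clock returns an uncertainty interval rather than a point, and a transaction "commit waits" until its chosen timestamp is guaranteed to be in the past on every node. This is a simplification based on the public Spanner paper, not Google's actual API; the uncertainty bound and function names are assumptions for illustration.

```python
import time
import random

# Illustration of the TrueTime idea from the public Spanner paper (simplified).
# tt_now() returns an interval (earliest, latest) that bounds the true time.
CLOCK_UNCERTAINTY_S = 0.004  # assume a ~4 ms bound from GPS/atomic-clock sync

def tt_now():
    t = time.time()
    skew = random.uniform(0, CLOCK_UNCERTAINTY_S)
    return (t - skew, t + (CLOCK_UNCERTAINTY_S - skew))  # (earliest, latest)

def commit_with_wait():
    """Pick a commit timestamp, then wait out the uncertainty ('commit wait').

    After the wait, every node's clock reads later than the commit timestamp,
    so the timestamp order of transactions matches their real-time order
    (external consistency), even across regions."""
    _, latest = tt_now()
    commit_ts = latest
    while tt_now()[0] <= commit_ts:   # wait until earliest > commit_ts
        time.sleep(CLOCK_UNCERTAINTY_S / 4)
    return commit_ts

ts = commit_with_wait()
print(f"transaction committed at timestamp {ts:.6f}")
```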
Google Spanner Outage Time vs. Recovery Time
Google claims five 9's (99.999%) availability with a 30-second outage time. This is very high availability. However, while many applications can be back online in 30 seconds, they cannot recover in 30 seconds. An outage leads to potential loss of data, and recovery means going back to a consistent point (the Recovery Point Objective, RPO) and applying updates from that point. A true Recovery Time Objective (RTO) is measured by when the application is fully available again.
Spanner will avoid some outages that can bring down traditional systems, especially performance and overload situations. Spanner can avoid these by being able to scale quickly, or by running the system at lower utilizations over many nodes.
More complex and functional databases take longer to recover, and the five 9's claim by Google assumes a 30-second recovery. If the recovery time is 10 minutes or more, then 99.99% is a more realistic estimate for the SLA. For very complex database applications with high-value data, recovery times are often measured in hours, and 99.9% would be a realistic SLA.
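The arithmetic behind these SLA estimates is straightforward. The short Python sketch below shows the annual downtime each availability level allows, and the availability implied by an assumed outage frequency and recovery time; the outage frequencies are illustrative assumptions, not Google figures.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(availability):
    """Annual downtime budget implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability / 100)

def implied_availability(outages_per_year, recovery_minutes):
    """Availability implied by an assumed outage rate and recovery time."""
    downtime = outages_per_year * recovery_minutes
    return 100 * (1 - downtime / MINUTES_PER_YEAR)

for a in (99.999, 99.99, 99.9):
    print(f"{a}% allows about {allowed_downtime_minutes(a):.1f} minutes/year")
# 99.999% allows about 5.3 minutes/year
# 99.99%  allows about 52.6 minutes/year
# 99.9%   allows about 525.6 minutes/year

# Example: ten outages a year with a 30-second restart stays within five 9's,
# but the same ten outages with a 60-minute recovery is roughly three 9's.
print(f"{implied_availability(10, 0.5):.4f}%")   # ~99.9990%
print(f"{implied_availability(10, 60):.3f}%")    # ~99.886%
```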
Google Spanner Functionality
Spanner is new as a general-purpose database. Wikibon's assessment is that Spanner's functionality is currently limited. OLAP capabilities and performance in particular are poor. Resource access controls are limited. Backup integration is very limited. View functionality is limited, and data modeling tools are limited. At the moment there are few Google resources available for problem resolution, workarounds, and day-to-day support for development and operations.
Where very high availability is required, traditional tools such as Oracle's RAC, Active Data Guard, and GoldenGate provide proven solutions. These are much more expensive, but offer much higher levels of availability.
Google can change this with heavy investments over a long period of time. In particular, Google will need to provide access to real people with real expertise, especially at the introduction phase of Spanner adoption.
Deterministic vs. Probabilistic Workloads
One useful classification of applications and workloads is as either deterministic or probabilistic. With most traditional computing workloads, if you have the same input and same code, you expect to generate exactly the same output. We can classify this as a deterministic workload. Examples of this are systems of record, such as finance, ERP, payroll, stock control, etc. Most compliance models assume deterministic outcomes.
There is also a broad set of problems that can be solved more efficiently with probabilistic methods, especially when the applications are driven by large amounts of data and heavy compute requirements. To meet constraints on elapsed time to solution, systems can dynamically choose algorithms designed for "good enough" output that can be computed more quickly. Examples of probabilistic systems include real-time price updates, or deciding the price to bid for delivering an advertisement in real time to an end-user. Speed to solution is more important than absolute accuracy and repeatability of the assessment.
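As a toy illustration of the "good enough under a time constraint" trade-off, the hypothetical sketch below estimates an aggregate by sampling rather than scanning every record, trading a small, bounded error for a large reduction in work. The data and numbers are invented for illustration.

```python
import random

# Toy illustration: estimate the average bid value over a large event stream
# by sampling, instead of scanning every record, to meet a latency budget.
random.seed(7)
events = [random.uniform(0.01, 2.50) for _ in range(1_000_000)]  # bid values in $

def exact_average(values):
    return sum(values) / len(values)             # deterministic, touches every record

def sampled_average(values, sample_size=10_000):
    sample = random.sample(values, sample_size)  # probabilistic, ~1% of the work
    return sum(sample) / len(sample)

print(f"exact   : {exact_average(events):.4f}")
print(f"sampled : {sampled_average(events):.4f}  (close enough to set a bid price)")
```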
In very general terms, CA is a better database model fit for classic deterministic system-of-record applications, and is currently the database architecture of choice for the majority of applications. Probabilistic systems that require greater scalability and lower latency have to give up guarantees on either consistency (AP database models) or availability (CP database models).
Google has built the largest distributed databases in the world to support its own digital services, such as consumer search, and has focused heavily on response times. Google has also built the most sophisticated enterprise digital services for selling advertisements to the consumers of Google's digital services; over 85% of Google revenue comes from these advertisements. Google's development of Spanner is a tribute to the technical inventiveness of Google engineers, striving to solve the challenges of emerging probabilistic systems.
Horses for Courses
There are fundamentally three database models, CA, AP, and CP. All are viable solutions, each has a solution space, and each is a fit for different workloads.
As an OLTP solution, Google Spanner is an ideal solution for systems that can tolerate an occasional loss of a data element early in the capture cycle. For example, rarely failing to register a single mobile call record for a user is a nuisance, but not normally a business exposure. Other examples of the solution space would include applications that provide probabilistic assessments, such as those based on AI and advanced analytics.
The early use of Spanner by Google is in support of Google's AdWords business, Gmail, and Google Photos digital services. The challenge in AdWords applications is keeping track of billions of clicks and rolling them up into advertisement placements and billing. Much of this application is probabilistic, spans very large countries, and has low-latency imperatives.
Facebook might well have selected Spanner early on, had it been available. Again, loss in the early stages of capture would probably be spotted and corrected by the user. It would have reduced the number of very expensive re-architectures of the MySQL databases that were needed, and provided a much easier platform to scale and maintain.
Large-scale public cloud email systems are another potential use case for Spanner. The current failed-delivery rate across cloud email systems is about one per thousand (a 99.9% success rate). Add on late deliveries, and you have a very poor cloud email SLA. There is plenty of room for improvement!
Google Spanner in Action
Optiva is a relatively new startup supporting Communications Service Providers (CSPs) with SaaS cloud solutions. The SaaS solutions are Optiva Charging Engine™ and Optiva Revenue Management Suite™.
Some of the traditional software vendors in the global revenue management system space are Amdocs, Ericsson, Huawei, Microsoft and Oracle. CSPs are experiencing significant change in the industry, as they pivot from traditional telecommunication architectures to supporting Network Function Virtualization (NFV) and Software Defined Networks (SDN). There is certainly pressure to provide cloud solutions in this space, especially for mobile devices where data is gathered and stored in Internet POPs. For some of the larger players, scalability is a crucial capability for solutions in this space.
In the Footnotes below is a series of video clips from Google Next 2018, in which Deepti Srivastava (Google Cloud Product Manager) and Danielle Royston (CEO of Optiva) discuss Google Spanner with John Furrier and Dave Vellante of SiliconANGLE. It is a little long but well worth the time; we've organized the discussion into clips to make it easier to consume.
Oracle Database
Oracle's database is a CA model. It is the most functionally rich CA-model database in the industry, with a strong ecosystem of support for developers and deployers of large-scale solutions. Oracle offers very good support services for development and operations. Oracle is also expensive, and many customers complain about the company's aggressive business practices, e.g., audits and legal tripwires in contracts. Oracle does not have a CP database model solution.
Oracle's technical approach to scaling is to provide hardware and software that is well integrated, with products such as Exadata and other converged solutions. Technologies such as low-latency IO, in-memory database support, and low-latency system networking all enable significant vertical scaling, especially when integrated into the database code.
Oracle's cloud strategy is to provide a seamless hybrid approach with Oracle Cloud and Oracle Cloud at Customer. This allows exactly the same software and infrastructure to be placed in the cloud and on premises. Oracle accepts responsibility for maintaining everything up to the database layer, or even the Oracle application layer. This significantly lowers the cost and the levels of expertise required inside enterprise IT. Availability and recovery times are also improved. In principle this also allows transporting the application to where the data is, avoiding the elapsed time and high cost of moving data.
As new hardware and software technologies become available, Oracle’s price umbrella is probably not sustainable. Wikibon expects Oracle will move slowly towards a higher volume and lower-cost business model.
Conversion to Google Spanner
Conversion of an existing Oracle database system that correctly uses Oracle functionality is very expensive and carries significant business risk. Conversion of large-scale database systems with the objective of saving money is almost always an illusion; the business risks are usually enormous. Wikibon has sophisticated models and experience to evaluate these business risks, and conversions are rarely recommended.
Wikibon strongly recommends that senior IT executives focus new technologies such as Google Spanner on new business opportunities, and link the new systems to existing systems. SaaS software that is deployed on Spanner technology is also a good potential initial deployment.
Conclusions & Questions
Conclusions on Google Spanner
Google Spanner is an exciting new technology offering great scalability with consistency. It can address many problems in the rapidly growing probabilistic application space. However, Google will need to invest in more functionality and better support to make an impact on the market.
Spanner does not offer an easy lift-and-shift solution for existing databases. AWS Aurora, which is a traditional CA database model built on MySQL, has a significant head-start.
Database system conversions to just save money should be strongly avoided. The business risks are very high. Putting the resources towards new digital transformation almost always creates a better business return.
Wikibon projects that only 5% of Oracle licenses by value will migrate to Google Spanner. The presence of Google Spanner and AWS Aurora will put pressure on Oracle pricing, which will decline over time. The cost and business risks of migration will be the greatest friction against change, together with development and maintenance productivity.
Google Spanner can participate strongly in a smaller but rapidly growing market of probabilistic workloads, which will feed into systems of record. Over the next decade, the probabilistic application market and its supporting software will grow to be many times the value of today's traditional high-end market.
Questions for Google Spanner
Wikibon suggests asking senior Google management three key questions about Spanner.
- Is Google committed to providing additional development functionality for Spanner, and robust operational functionality such as backup? What timescale?
- Is Google committed to providing Spanner availability functionality equivalent to mission critical Oracle (e.g., RAC, Active Data Guard, GoldenGate)? What timescale?
- Is Google committed to providing real people to help Spanner users troubleshoot in both the development and operational areas?
Action Item
Oracle customers should place Google Spanner mugs (see Figure 2), or AWS Aurora mugs, on every senior manager's desk. They should be used as a talking point with all Oracle personnel, especially when negotiating contracts.
Complex database systems using IBM DB2, Microsoft SQL Server, or Oracle, where very high availability is a business imperative, are very unlikely to be candidates for Google Spanner. Senior executives should avoid conversion projects from existing large-scale database systems like the plague, undertaking them only as a last resort.
They should instead invest in understanding Google Spanner, understanding where it fits, and trying it out on new projects. The operational aspects of Spanner, including availability, backup, and recovery should be focus areas of any pilot projects.
Footnotes:
The video below is composed of clips from theCUBE interview with Deepti Srivastava (Google Cloud Product Manager) and Danielle Royston (CEO of Optiva), discussing Google Spanner with John Furrier and Dave Vellante of SiliconANGLE. The event was Google Next 2018 in San Francisco.