
A Look at 2017’s Big Data & Machine Learning Database Up & Comers

Relentless invention is keeping the big data and machine learning markets from coalescing around end-to-end platforms. The door is open for new big data and machine learning players, but only those that successfully and consistently tie technology to use cases will make it through.

This report is one of Wikibon’s annual forecast reports.

Mainstream enterprises are still gaining experience with big data use cases; proprietary SQL data warehouse (DW) offload was the first “bowling pin” to drop. But collecting the data, whether in a NewSQL analytic data warehouse or in an application database supporting an operational application, lays the foundation for early data science work and, eventually, the ability to put predictive analytics to work in all applications. With ever faster analytics becoming more critical to applications, application databases are subsuming more analytic functionality and blurring the line between what used to be two distinct product categories.

Applications that rely on faster, more complex analytics continuously ingest, process, and analyze data, and then either inform or automate actions. New tools are making this possible: plentiful start-up funding and customer experimentation are yielding new technology from up-and-coming database vendors. Wikibon sees vendors making progress on customer challenges across all software categories tracked in our 2017 Big Data and Machine Learning forecast, and has identified several database vendors gaining traction with customers, funding, and partnerships. The most impactful of the new breed are building database technology to support the scale, speed, and scope requirements of modern applications and analytics. However, because their products carry much lower price points, these newcomers are still operating under the revenue radar of the incumbent vendors.

Scale, data variety, and ever more sophisticated analytics are driving database innovation.

The relational DBMS model was intended to support both OLTP and analytic workloads, but traditional hardware trade-offs made that impossible:

  • Disk storage was large but slow to access.
  • Memory was fast to access but small.

As a result, the market created a two-DBMS architecture to support applications. One DBMS was designed to support OLTP and one was designed to support analytics. However, new storage and memory trade-offs are catalyzing a rethinking of that architecture:

  • Solid-state storage such as flash offers large capacity and is now much faster to access.
  • Memory, while always fast, is now much larger relative to storage.

A number of new DBMS technologies are emerging that will store application data and support both operational- and analytic-style workloads on the same system.


The impact? The big data market is pursuing two technology enhancement vectors:

  • One vector focuses on new types of application databases, typically built around large semiconductor memories. These application databases are gaining richer analytic capabilities, starting with traditional OLAP queries and beginning to extend to machine learning. Machine learning support can range from scoring data offline, to managing the operational aspects of model deployment, all the way to online model training.
  • The other vector focuses on adding support in analytic databases for faster and richer data and ever more advanced analytics.

How the market ultimately coalesces around these alternatives will have a significant long-term impact on application architectures and business systems. Following are some of the leading entrants in each category (see Figure 1):

Figure 1: Summary of capabilities of the new generation of databases designed to support applications with orders of magnitude more data than traditional applications, at correspondingly lower price points, with advanced, low-latency analytics. (Note 1: In-memory support means the ability to use main memory natively for storage, not just as a buffer cache. Note 2: Flash storage support means database storage I/O is designed around the characteristics of flash, not mechanical HDDs.)

Noteworthy application database contenders

  • Splice Machine. Splice Machine started out as the first OLTP database implemented on Hadoop, integrating a SQL query engine on top of HBase. It graduated to supporting analytics in 2016, when it added Spark as an execution engine that could handle OLAP queries. In 2017 we should see the first customer applications that combine streaming, batch, and interactive workloads. Splice Machine still needs to build out its predictive analytics capability so that it can integrate with a data science pipeline. Splice Machine also announced support for AWS in 2016, making it available as a service, much like Snowflake.
  • SnappyData. SnappyData integrates Spark more deeply than any other vendor, allowing it to execute native operations not just in SQL but through Spark APIs such as stream processing or machine learning. The company spun out of Pivotal and builds on the well-regarded, in-memory GemFire database. SnappyData also supports approximate queries, which can speed up responses by as much as 100x in return for about a 1% reduction in accuracy; that capability means the database can handle roughly 100x the query workload of competitors that must scan full data sets (see the first sketch after this list). Approximate queries, combined with additional support for stream processing, enable SnappyData to handle streams with much greater velocity than more traditional transactional databases. For now, the database has a single head node with a failover replica; maximum scalability will come when the head node can scale out independently.
  • MemSQL. MemSQL delivers OLTP and OLAP performance competitive with individually specialized databases by building two engines: an in-memory row store for high-speed transactions and a disk-based column store for large-capacity OLAP queries. MemSQL’s support for predictive analytics offers options. Spark can process incoming streams of data and provide real-time predictions based on a trained model; MemSQL can then take the data and compare it with historical or other contextual data. Alternatively, MemSQL can look up the relevant data, call an external predictive model, and serve up the prediction along with the data (see the second sketch after this list).
  • Redis. Redis combines in-memory performance with multiple data types and integrated machine learning. Unlike many NoSQL key-value stores, Redis explicitly manages many different types of data, which allows it to perform actions specific to each data type. For example, it can ingest high-velocity time series data and report the average and maximum values for the last hour with a single command (see the third sketch after this list). Recent integration allows Spark developers to read and write all of Redis’s native data types with in-memory performance. In addition, the open source community is beginning to add native support for predictive analytics. Redis can serve a number of popular machine learning models that were trained externally, in Spark for example. A single database operation can look up relevant data, call a model that executes as if it were a native in-memory command, and serve up an answer at sub-millisecond speeds. Neural Redis is beginning to go beyond predictions alone, with the ability to manage and train multiple models with live data feedback loops, including neural networks that support deep learning.
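To make the approximate-query trade-off concrete, here is a minimal, self-contained sketch of the underlying idea: answering an aggregate from a small uniform sample instead of a full scan. This is a generic illustration of the technique, not SnappyData’s actual interface (which exposes error bounds through SQL); the table and numbers are made up.

```python
# Conceptual sketch of approximate query processing (AQP): answer an
# aggregate from a small uniform sample rather than scanning every row.
import random

random.seed(42)

# Hypothetical fact table: 1 million order amounts.
orders = [random.uniform(5.0, 500.0) for _ in range(1_000_000)]

# Exact answer: full scan of all rows.
exact_avg = sum(orders) / len(orders)

# Approximate answer: scan only a 1% uniform sample (100x fewer rows).
sample = random.sample(orders, k=len(orders) // 100)
approx_avg = sum(sample) / len(sample)

error_pct = abs(approx_avg - exact_avg) / exact_avg * 100
print(f"exact={exact_avg:.2f}  approx={approx_avg:.2f}  error={error_pct:.3f}%")
# On aggregates like this, the sample typically lands well under 1% off
# the exact answer while touching 100x fewer rows.
```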
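The MemSQL lookup-and-score pattern can likewise be sketched in a few lines. The sketch below is hypothetical: sqlite3 stands in for the database and a trivial rule stands in for an externally trained model; against MemSQL itself you would issue the same SQL through a MySQL-compatible driver.

```python
# Sketch of "look up data, call an external model, serve the prediction
# with its context." sqlite3 is a stand-in so the example is runnable.
import sqlite3

def external_model_score(features):
    # Hypothetical stand-in for a model trained elsewhere (e.g. in Spark).
    return 1.0 if features["total_spend"] > 1000 else 0.0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, total_spend REAL)")
conn.execute("INSERT INTO customers VALUES (1, 1500.0), (2, 200.0)")

def predict_for_customer(customer_id):
    # 1. Look up the relevant historical/contextual data.
    row = conn.execute(
        "SELECT total_spend FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    features = {"total_spend": row[0]}
    # 2. Call the external predictive model.
    score = external_model_score(features)
    # 3. Serve the prediction alongside the contextual data.
    return {"customer_id": customer_id, "features": features, "score": score}

print(predict_for_customer(1))  # {'customer_id': 1, ..., 'score': 1.0}
```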
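Finally, a hedged sketch of the Redis time-series pattern, using a sorted set scored by timestamp. The “single command” in the Redis bullet refers to native data-type (and module) commands; this sketch uses only core commands via the redis-py client and computes the hour’s statistics client-side, so treat it as conceptual. It assumes a Redis server on localhost:6379.

```python
# Time-series reads against Redis using a sorted set keyed by timestamp.
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def record(metric, value, ts=None):
    ts = ts if ts is not None else time.time()
    # The member encodes timestamp + value so duplicate values don't
    # collapse; the score is the timestamp, making range queries cheap.
    r.zadd(metric, {f"{ts}:{value}": ts})

def last_hour_stats(metric):
    now = time.time()
    members = r.zrangebyscore(metric, now - 3600, now)
    values = [float(m.decode().split(":")[1]) for m in members]
    return (sum(values) / len(values), max(values)) if values else (None, None)

record("sensor:temp", 21.5)
record("sensor:temp", 23.0)
print(last_hour_stats("sensor:temp"))  # e.g. (22.25, 23.0)
```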

 

Noteworthy analytic database up & comers

  • Snowflake is redefining the modern data warehouse and partially co-opting data lakes. Hadoop data lakes featured low-cost storage and schema-on-read data flexibility; Snowflake offers both, and it separates storage from compute completely so that idle compute clusters in the cloud can spin down (see the sketch below). Snowflake wouldn’t sit at the front of a data pipeline, where high-velocity ingest is better handled by Kafka or Cassandra; rather, it fits where there’s tolerance for somewhat more latency than an application database or a stream processor with built-in analytics. Where predictive models need scoring using data from highly complex queries, or where some models are scored offline against huge historical data sets, Snowflake fits really well. Finally, Snowflake is designed as a service with low administrative overhead, serving the new generation of business intelligence users who have to explore and visualize big data.
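Here is a brief sketch of what the storage/compute separation looks like in practice, using the snowflake-connector-python package. The account, credentials, warehouse name, and table below are placeholders; the key detail is AUTO_SUSPEND, which lets an idle compute cluster spin down while the data stays in cloud storage.

```python
# Sketch: a Snowflake virtual warehouse (compute) that suspends itself
# when idle and resumes on the next query. Placeholders throughout.
import snowflake.connector

conn = snowflake.connector.connect(
    user="ANALYST", password="...", account="my_account"  # placeholders
)
cur = conn.cursor()

# Compute that pauses after 5 idle minutes; while suspended, you pay
# for storage only.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS reporting_wh
      WAREHOUSE_SIZE = 'MEDIUM'
      AUTO_SUSPEND = 300
      AUTO_RESUME = TRUE
""")
cur.execute("USE WAREHOUSE reporting_wh")
cur.execute("SELECT COUNT(*) FROM sales.public.orders")  # hypothetical table
print(cur.fetchone())
```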

 

Action Item

Big data pros need to evaluate the emerging class of application databases carefully, despite the attraction of continuing to leverage existing administrative and developer skills on traditional databases. New applications need ever more sophisticated analytics, with ever faster answers, on orders of magnitude more data. Traditional databases haven’t kept up, or can’t keep up, with all three requirements: scale-out elasticity to accommodate larger data volumes, pricing that is orders of magnitude lower, and increasingly advanced analytics. However, where huge stores of data are still required for high-throughput advanced analytics with very low administrative overhead, new analytic databases expand the choices beyond traditional, incumbent vendors.
