The big data market is rapidly evolving. As we predicted, the focus on infrastructure is giving way to a focus on use cases, applications, and creating sustainable business value with big data capabilities.
Big Data Market Evolution Is Accelerating
Wikibon identifies three broad waves of usage scenarios driving big data and machine learning adoption. All require orders of magnitude greater scalability and lower price points than what preceded them. The usage scenarios are:
- Data Lake applications that greatly expand on prior generation data warehouses
- Massively scalable Web and mobile applications that anticipate and influence users’ digital experiences
- Autonomous applications that manage an ecosystem of smart, connected devices (IoT)
Each of these usage scenarios represents a wave of adoption that depends on the maturity of a set of underlying technologies, which we analyze independently. All categories except applications, the segments that collectively comprise infrastructure, slow from 27% growth in 2017 to single-digit growth in 2023. Open source options and integrated stacks of services from cloud providers combine to commoditize infrastructure technologies.
Wikibon forecasts the following categories:
- Application databases accrue the functionality of analytic databases. Analytics will increasingly inform human and machine decisions in real-time. The category totals $2.6bn in 2016; growth slowly tapers from 30%, and spend peaks at $8.5bn in 2024.
- Analytic databases evolve beyond data lakes. MPP SQL databases, the backbone of data lakes, continue to evolve and eventually will become the platform for large-scale, advanced offline analytics. The category totals $2.5bn in 2016, with growth at half the level of application databases, and its size peaks at $3.8bn in 2023.
- Application infrastructure for core batch compute slows. This category, which includes offerings like Spark, Splunk, and AWS EMR, totals $1.7bn in 2016 with 35% growth in 2017 but slows to single-digit growth in 2023 as continuous processing captures more growth.
- Continuous processing infrastructure is boosted by IoT applications. The category will be the basis for emerging microservice-based big data applications, including much of intelligent IoT. It totals $200M in 2016 with 45% growth in 2017 and does not slow below 30% growth until 2025.
- The data science tool chain is evolving into models with APIs. Today, data science tool chains require dedicated specialists to architect, administer, and operate. However, complex data science toolchains, including those for machine learning, are transforming into live, pre-trained models accessible through developer APIs. This cottage industry of tools totals $200M today, growing at 45% in 2017; growth only dips below 30% in 2025, when the category totals $1.8bn.
- Machine learning applications are mostly custom-built today. They will become more pervasive in existing enterprise applications, in addition to spawning new specialized vendors. The market today totals $900M, growing at 50% in 2017, and reaches almost $18bn in 2027.
Big Data Analytics Segment Analysis
As users gain experience with big data technology and use cases, the focus of the market will inevitably turn from hardware and infrastructure software to applications. That process is well underway in the big data market, as hardware accounts for an ever smaller share of total spend (see Figure 2 below). Infrastructure software faces twin challenges. First, open source options are undermining prices. Second, more vendors, especially those with open source products, have adopted pricing models built around helping customers operate their products on-prem or in the cloud. While there is room for vendors to deliver value with this approach, there is a major limitation: given the relatively immature state of the market, customers typically run products from many vendors, and it is the interactions between products from multiple vendors that create most of the operational complexity for customers. As a result, market growth for infrastructure software will slow to single digits in 2023 from 26% in 2017 (see Figure 1). Public cloud vendors will likely solve the multi-vendor management problem. Before more fully packaged applications can take off, customers will require a heavy mix of professional services, since the reusable building blocks are so low-level. By 2027 applications will account for 40% of software spend, up from 11% in 2016, and professional services will taper off to 32% of all big data spend in 2027, down from 40% in 2016. To summarize:
- Open source and public cloud continue to pressure pricing.
- Applications today require heavy professional services spend.
- Business objectives are driving software to support ever lower latency decisions.
Application Databases
Application databases are going to grow faster than analytic databases. Application databases are rapidly adding functionality in two dimensions. First, within the category there is a race to manage multiple data types, also called polyglot storage. Customers are driving this so that data can remain in one place with multiple types of processing brought to bear, thereby minimizing the lag caused by moving big data sets through a lengthy pipeline of different stores. Second, customers are requiring application databases to have more and more analytic functionality, again so that data doesn’t have to be moved across multiple stores in a pipeline for the analytics to inform a decision. Today, most analysis is done offline in a data warehouse, with technically sophisticated organizations beginning to use that data in predictive models using machine learning. Early examples include Splice Machine, Snappy Data, and Redis. Eventually, these same databases should be able to manage more of the process of continuously updating and training the predictive models. To summarize:
- In 2016, the market size for application databases was $2.6bn.
- In 2017, we predict the market size will be $3.3bn, which represents growth of 30%.
- In 2027, the market should be $7.9bn, with a CAGR of 11% (a worked CAGR calculation follows this list).
- We believe that the seminal inflection points will be:
- The addition of polyglot, or multi-data-type, storage, creating greater concentration.
- The steady accumulation of more analytic functionality to inform decisions in real-time.
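As a check on the figures above (and on the CAGR numbers quoted throughout this forecast), the compound annual growth rate follows directly from the 2016 base and the 2027 forecast over the 11 intervening years. For application databases:

```latex
\mathrm{CAGR} = \left(\frac{\text{2027 spend}}{\text{2016 spend}}\right)^{1/11} - 1
             = \left(\frac{\$7.9\text{bn}}{\$2.6\text{bn}}\right)^{1/11} - 1 \approx 0.11 = 11\%
```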
Analytic Databases
The Hadoop ecosystem has been substituting for traditional SQL data warehouses, in whole or in part, for the past several years. Besides Hadoop’s upfront cost advantage, there was its mix-and-match flexibility. But an open source community focused on delivering specialized tools has been gradually unbundling the Hadoop ecosystem into separate storage (HDFS), query (Impala, Drill, Hive), catalog (HCatalog), and other services. The result? More stifling complexity, as developers and admins struggle to gain experience with big data use cases and business capabilities while wrestling with increasingly basic tooling. Consequently, tools are starting to reintegrate. What’s emerging is a combination of Hadoop’s commodity, polyglot storage archive with scale-out analytic MPP SQL databases, such as Snowflake and Crate.io. At the same time, application databases are adding progressively more analytic functionality, starting with the OLAP-type queries that have been the bread and butter of data warehouses. To summarize:
- In 2016, the analytic database market size was $2.5bn.
- In 2017, we predict the market size will be $2.8bn, which represents growth of 15%.
- In 2027, the market should be $3.4bn, with a CAGR of 3%.
- We believe that the seminal inflection points will be:
- The open source community’s attempt to unbundle Hadoop-based analytic database functionality is going into reverse; we’re starting to see a reintegration of function, especially at the level of data pipelines.
- More integrated databases will enable lower latency analytics.
- Application databases will accrete ever more analytic functionality, eventually shrinking the category.
Big Data Application Infrastructure
While MapReduce recedes into the rear-view mirror, Spark is emerging as the de facto standard analytic application infrastructure, not only for its speed but also for its deeply integrated analytic functionality (a minimal Spark sketch follows the summary list below). But until we see deeper integration of storage functionality, building applications more sophisticated than analytic pipelines will be difficult. Some vendors, such as Snappy Data, Splice Machine, Iguaz.io, and Redis, have already taken matters into their own hands with innovative solutions, and we expect more to follow their lead. Meanwhile, Splunk is as big as all the Hadoop vendors put together and growing just as fast, if not faster. Splunk is the “anti-Hadoop” in its deeply integrated functionality, and that integration makes it far easier to build an expanding set of applications. Its sweet spot has been operational intelligence, but that will likely expand rapidly into new areas that increasingly overlap with Hadoop, such as cybersecurity and IoT. Cask is pioneering a different take on integrating application infrastructure by putting an abstraction layer on existing Apache ecosystem technologies. Collectively, ever greater integration and higher levels of abstraction will drive application infrastructure toward the big data generation’s equivalent of the J2EE application server, to borrow Cask’s positioning. What remains open to question, however, is which technology set gets chosen to handle application infrastructure for edge intelligence. To summarize:
- In 2016, the Application Infrastructure market size was $1.7bn.
- In 2017, we predict the market size will be $2.4bn, which represents growth of 35%.
- In 2027, the market should be $6.8bn, with a CAGR of 13%.
- We believe that the seminal inflection points will be:
- The Spark ecosystem will deliver deeper storage integration and open up a new class of operational applications that go beyond analytics.
- Splunk continues its rise as the “anti-Hadoop” for applications beyond operational intelligence by virtue of its integrated simplicity.
- Greater integration and higher layers of abstraction will continue to broaden appeal.
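To make the idea of deeply integrated analytic functionality concrete, the sketch below shows SQL-style aggregation and built-in machine learning running in a single Spark program. It is a minimal illustration under stated assumptions, not a reference implementation: the `events.parquet` file and its columns (`region`, `amount`, `label`) are hypothetical.

```python
# Minimal PySpark sketch: OLAP-style aggregation and machine learning in one framework.
# Assumes a local Spark installation and a hypothetical events.parquet dataset
# with columns: user_id, region, amount (numeric), label (0.0/1.0).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("analytic-pipeline-sketch").getOrCreate()

events = spark.read.parquet("events.parquet")  # batch load of the raw data

# OLAP-style aggregation with the DataFrame API.
by_region = (events.groupBy("region")
                   .agg(F.sum("amount").alias("total_amount"),
                        F.count("*").alias("event_count")))
by_region.show()

# The same DataFrame feeds Spark's built-in machine learning library.
assembler = VectorAssembler(inputCols=["amount"], outputCol="features")
model = LogisticRegression(labelCol="label").fit(assembler.transform(events))
print(model.coefficients)

spark.stop()
```

The point of the sketch is that the same data feeds both the aggregation and the model training, with no separate analytic store in between.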
Stream Processing Technologies
While stream processing workloads have existed for decades, they were confined to obscure applications like manufacturing shop-floor systems. Four factors are driving stream processing into the mainstream. First is the same need as with application databases for ever faster decisions: a decision has to be made as each new item of data arrives rather than after collecting a batch of data over time. Second is the rise of Kafka and its ability to unlock and deliver data from just about any source, at any level of scale, and at high speed. Third, developers don’t have to learn an entirely new paradigm: Spark, Flink, and others are making it possible to build stream processing workloads the same way developers build batch processing workloads (a minimal sketch of this symmetry follows the summary list below). Fourth, streaming is becoming the required infrastructure for continuously connecting the increasingly popular microservices style of building applications. Another major growth driver will kick in as IoT applications take off. Making decisions on each element of data streaming off physical devices will require a lightweight engine in places where a clustered, highly available solution may carry too much overhead. These edge conditions may open an opportunity for a new competitor. To summarize:
- In 2016, the market size for stream processing technology was $200M.
- In 2017, we predict the market size will be $400M, which represents growth of 100%.
- In 2027, the market should be $6.1bn, with a CAGR of 34%.
- We believe that the seminal inflection points will be:
- Making batch and streaming look the same to programmers enables the rise of continuous processing applications.
- Ever richer analytics driven by the same need in application databases for real-time decisions.
- Microservice architectures require continuous processing to connect services to each other.
- Edge intelligence for IoT applications potentially opens the door for new entrants.
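The sketch below illustrates the first bullet: with Spark Structured Streaming, the streaming job reads from Kafka and aggregates with essentially the same code as its batch counterpart. It assumes a local Kafka broker at localhost:9092, a hypothetical `events` topic, and the Spark Kafka connector on the classpath.

```python
# Minimal sketch: the streaming job mirrors the batch job almost line for line.
# Assumes a local Kafka broker on localhost:9092, a hypothetical "events" topic,
# and the spark-sql-kafka connector package available to Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-processing-sketch").getOrCreate()

# Batch version: read whatever is currently in the topic, then aggregate.
batch = (spark.read.format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "events")
              .load()
              .selectExpr("CAST(key AS STRING) AS key"))
batch.groupBy("key").count().show()

# Streaming version: same source, same aggregation, different read/write verbs.
stream = (spark.readStream.format("kafka")
               .option("kafka.bootstrap.servers", "localhost:9092")
               .option("subscribe", "events")
               .load()
               .selectExpr("CAST(key AS STRING) AS key"))
query = (stream.groupBy("key").count()
               .writeStream.outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()
```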
Machine Learning and Data Science Pipeline Technologies
Not even Microsoft, Google, Amazon, and IBM have the wherewithal yet to integrate all the functionality of the toolchains required to build and operate any and all combinations of analytic pipelines. As a result, this is currently the most fragmented category in the forecast. Microsoft and IBM appear to be the largest companies trying to deliver an integrated toolchain. In the meantime, these same vendors are offering horizontal and vertical templates in code form for functions such as marketing campaign optimization and fraud prevention. But more integrated, live, trained solutions for specific applications, ready for mainstream developers, are appearing first for conversational user interfaces (bots, speech, vision) from Amazon, Microsoft, IBM, and Google (a sketch of what consuming such a model looks like follows the summary list below). Over the next 5+ years we should see those same horizontal and vertical functions turn into similar live, trained solutions, at which point they will graduate to the application category. To summarize:
- In 2016, the market size for machine learning and data science pipeline technologies was $200M.
- In 2017, we predict the market size will be $300M, which represents growth of 50%.
- In 2027, the market should be $2.8bn, with a CAGR of 25%.
- We believe that the seminal inflection points will be:
- Evolution from fragmented, standalone tools to ever-more integrated tool chains, likely from developer platform vendors.
- Greater availability of models in ever more specialized categories in developer-ready API form, first for next-generation conversational UIs delivered as bots, speech, and vision.
- As models for functions such as fraud prevention grow from templates to fully trained and continuously learning models, they will become applications.
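To show what “models in developer-ready API form” means in practice, here is a minimal, hypothetical sketch of calling a hosted, pre-trained vision model over REST. The endpoint, header, and response fields are placeholders, not any specific vendor’s API; each cloud vendor defines its own equivalents.

```python
# Hypothetical sketch of consuming a hosted, pre-trained vision model through a
# simple REST API. The URL, credential, and response fields are placeholders.
import requests

API_URL = "https://api.example.com/v1/vision/label"   # hypothetical endpoint
API_KEY = "YOUR_API_KEY"                               # placeholder credential

with open("photo.jpg", "rb") as f:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"image": f},
    )

response.raise_for_status()
for label in response.json().get("labels", []):        # hypothetical response shape
    print(label["name"], label["confidence"])
```

No data science toolchain is involved on the developer’s side; the training and continuous retraining happen behind the API.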
Advanced Analytics Applications
A taxonomy of machine learning applications is still emerging, and for the most part they are highly specialized. Some of the first categories include micro-apps such as anti-money laundering, departmental apps such as cybersecurity, vertical applications such as CPG demand and replenishment planning, and multi-enterprise ecosystem applications such as ad-tech. Traditional enterprise application vendors such as SAP and Oracle have mostly been slow to add machine learning to their product lines, though that should eventually change. The emerging generation of enterprise application vendors that will compete with today’s incumbents is likely to be the public cloud vendors, including IBM. Their developer platform offerings are ideally positioned to evolve from data science toolkits to API-accessible models to fully trained and continuously learning models. Using IoT applications as an example, today it is extremely difficult to build a high-fidelity “digital twin” of a sophisticated physical asset. Over the next 5-10 years, more and more of the lifecycle of a building or an airplane should be digitally captured, simulated, and optimized. To summarize:
- In 2016, the market size was $900M.
- In 2017, we predict the market size will be $1.4bn, which represents growth of 55%.
- In 2027, the market should be $17.8bn, with a CAGR of 31%.
- We believe that the seminal inflection points will be:
- The rise of a new generation of vertical and departmental application vendors.
- Incumbent vendors will add machine learning capabilities, but their traditional focus on record keeping will make the shift challenging both technically and in selling.
- Over 5-10 years, physical products will be modeled in addition to business workflows.
Hardware Segment Analysis
The 10-year forecast period covers some major shifts in the hardware market that will impact how big data software is designed and used.
Storage is the obvious category in which to expect growth when dealing with ever bigger data sets. However, there is an accelerating shift from expensive shared SAN and NAS devices to direct-attached commodity hard disks and to flash storage attached to server CPUs. That shift to commodity storage is offsetting the accelerating growth in total storage capacity. As a result, while total storage spend will grow from $3.6bn in 2016 to $8.6bn in 2027, its share of total hardware spend dips slightly from 41% to 38% (see the blue shaded area in Figure 3).
New compute capabilities, driven by new storage, are in turn changing analytic software. Flash storage enables much faster connections to the CPU than traditional mechanical hard drives allow. With fast connections to CPU and memory, it is possible to add much more memory to compute because the time to save the volatile contents of memory to non-volatile storage shrinks. Within the CPU itself, the end of clock-speed increases has given way to specialized parallel processing. GPUs and, soon, FPGAs are greatly accelerating analytic tasks, from simple database OLAP processing to the most advanced neural networks involved in deep learning. As a result, despite the relative decline in importance of traditional x86 instruction-set CPUs, overall compute spend, including memory, grows from $4.5bn in 2016 to $12.5bn in 2027, which represents an increase in share of total hardware spend from 51% to 55% (see the orange shaded area in Figure 3; the implied hardware totals are worked out below).
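The spend and share figures above imply the overall hardware totals. A quick consistency check using only the reported numbers:

```latex
\text{2016: } \frac{\$3.6\text{bn}}{0.41} \approx \frac{\$4.5\text{bn}}{0.51} \approx \$8.8\text{bn}
\qquad
\text{2027: } \frac{\$8.6\text{bn}}{0.38} \approx \frac{\$12.5\text{bn}}{0.55} \approx \$22.7\text{bn}
```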
Action Item
In the near term, big data pros should look to their vendors for greater product integration in order to simplify development and administration of Data Lake applications. Fragmented and immature data science tools and the limited availability of packaged machine learning applications don’t have to prevent customers from getting value from these advanced applications. Customers can engage professional services firms for help in building custom, high-value applications.
Addendum
This 2017 version of the Wikibon big data and analytics forecast is just one of four related forecast reports. The others are:
- Usage Scenarios
- Definitions
- Market Share
- Continuous Applications