Computers were invented to process large amounts of data at speeds far beyond what the human brain can manage. In a sense, computers have always been about "big data". Although "big data" is probably more a marketing term than a technical one, we must acknowledge that we have reached a threshold where data mining can be done at such a scale that new approaches are emerging.
We have a principle at quasardb: "Whatever the amount of data you can imagine, there will be more".
In this blog post we will share with you our vision and give you an overview of what we call the right to data approach.
A technology that is often associated with "big data" is Hadoop.
The Hadoop stack was developed by Yahoo's engineers after Google published their Google File System paper. If you don't know what Hadoop is: put simply, Hadoop is an ecosystem, a collection of technologies for distributed storage and processing.
It's always interesting, when considering software, to know where it comes from. The number one job of Google and Yahoo was indexing the web, and the whole framework is built on the principle that "moving computation is much cheaper than moving data".
That's absolutely correct when you are indexing large amounts of text. Downloading terabytes of data to index them and uploading the index somewhere is much less efficient than indexing the data in place, especially 15 years ago, when 100 Mbit networks were the de facto standard.
The framework has enjoyed some success in a wide range of use cases, sometimes successfully replacing improperly used technologies. Today we feel it is plateauing for one main reason: Hadoop is heavily centered around the map/reduce programming model, and not every computing problem fits this model.
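To make that constraint concrete, here is a minimal sketch of the map/reduce model itself, written as plain Python rather than Hadoop code: every problem has to be expressed as a map phase that emits key/value pairs and a reduce phase that folds the values collected for each key.

```python
from collections import defaultdict

# Minimal illustration of the map/reduce programming model (plain Python,
# not Hadoop code): the problem is expressed as a "map" phase emitting
# key/value pairs and a "reduce" phase folding the values per key.

def map_phase(document):
    # Emit (word, 1) for every word in the document.
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group values by key, as the framework would between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Fold the values for one key into a single result.
    return key, sum(values)

documents = ["big data is big", "moving computation is cheaper than moving data"]
mapped = (pair for doc in documents for pair in map_phase(doc))
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # e.g. {'big': 2, 'data': 2, ...}
```

Word counting fits this mold naturally; iterative or multi-pass algorithms have to be contorted into chains of such jobs.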
Another obvious problem with map/reduce is a cognitive one: it requires engineers to completely pivot the way they build their software and infrastructure. That's often easier said than done: you can't just throw away decades of software engineering because of a shiny new model.
That's why, when you include training and re-engineering, moving computation instead of data isn't so cheap.
For certain classes of problems, such as finding correlations, map/reduce is inefficient, slow and cumbersome. Indeed, the usual way to work around this is to chain many transformation steps, which adds overhead and complexity.
Apache Spark is an exciting new technology that addresses some of the limitations of the Hadoop framework. First of all, it greatly extends the toolbox beyond map/reduce, which gives the engineer more freedom and flexibility. Secondly, it keeps intermediate results in memory instead of writing them to disk between steps, unlike Hadoop, which makes it much faster. Thirdly, it gives the implementer more options in terms of programming language, even though Apache Spark itself is written in Scala.
However, being a purely in-memory solution also means you have to solve the persistence problem yourself. Fortunately, you can plug Hadoop's persistence engines into Apache Spark, or use any database of your choice. You can even use a CSV file!
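As a minimal sketch of that flexibility, here is what reading persisted data and going beyond map/reduce looks like in PySpark; the file name and column names are assumptions made purely for illustration.

```python
from pyspark.sql import SparkSession

# Minimal PySpark sketch: "trades.csv" and its columns are hypothetical,
# used only to illustrate the programming model.
spark = SparkSession.builder.appName("csv-analytics").getOrCreate()

# Spark can read plain files (here a CSV) or any pluggable storage back end.
df = spark.read.csv("trades.csv", header=True, inferSchema=True)

# Operations beyond map/reduce: grouping, aggregation and statistics
# expressed directly on the DataFrame.
df.groupBy("symbol").avg("price").show()
print(df.stat.corr("price", "volume"))

spark.stop()
```

A grouping, an aggregation and a correlation are expressed directly, with no chain of map/reduce jobs in sight.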
At quasardb we also provide a connector for Apache Spark, because we believe analytics and storage are two different things that should be handled by different technologies.
Apache Spark isn't the only solution for in-memory analytics: exciting high-speed SQL databases like MemSQL or VoltDB also bring answers. There are also field-tested mixed-workload in-memory analytics engines such as QuartetFS' ActivePivot.
At quasardb we strongly believe that in-memory databases are the future of analytics and we are comfortable saying so because quasardb is not an analytics engine and doesn't aim to be. So which problem do we solve?
We've spent the last decade working on a technology that aims to turn the tables on big data with a simple but crazy idea: what if we made the pain of moving data go away?
In a world where accessing remote data is inexpensive, you don't have to worry about rewriting your algorithms. Everything you have built over the last decades stays the same; you just plug in an "infinite" data back end that feeds your engine at very high speed.
Quasardb is a highly optimized distributed transactional database that enables record-level (not file-level) bulk transfers. That lets us deliver transfer speeds several orders of magnitude faster than existing technologies, because we only transfer what you need, over a heavily optimized binary network protocol.
In other words, instead of downloading a whole file and then looking into that file for the information you need, you can queue a list of requests for as many records as you want and receive an aggregated answer.
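The pattern looks roughly like the following sketch. To be clear, the client class and method names here are hypothetical, invented only to illustrate the batching idea; they are not quasardb's actual API.

```python
# Hypothetical sketch of the record-level batching pattern described above.
# The RecordBatch class and get_many() call are invented for illustration
# and do not correspond to quasardb's real client API.

class RecordBatch:
    """Accumulates record-level requests and sends them as one round trip."""

    def __init__(self, client):
        self.client = client
        self.keys = []

    def request(self, key):
        # Queue a request for a single record instead of fetching a whole file.
        self.keys.append(key)
        return self

    def run(self):
        # One aggregated answer for all queued records.
        return self.client.get_many(self.keys)

# Usage (illustrative): ask for exactly the records you need, nothing more.
# prices = RecordBatch(client).request("EURUSD/2015-06-01") \
#                             .request("EURUSD/2015-06-02").run()
```

The point is that the unit of transfer is the record, not the file, so the network only carries the records you asked for.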
All of this with the comfort of transactions and consistency, like any enterprise-grade database.
Behind the scenes, dozens of unique innovations such as distributed secondary indexes, zero-knowledge distributed transactions and zero-copy bulk transfers make it possible to query a subset of data based on simple criteria at very high speed. We've invested heavily in both low-level systems programming and high-level innovative algorithms.
We've been working with industry leaders such as Cisco, DDN and Microsoft and have benchmarked transfer speeds as high as 100 GBit/s. That is, from a remote disk to hot in memory, at 100 GBit/s. We can do that on 10 GBit/s networks thanks to our link aggregation technology, and we will happily make use of your InfiniBand network if you have one.
You might retort: what's the point?
And you could be right. An incredible technical accomplishment, but without any practical application. "My infrastructure is fast enough for my needs".
Or is it?
We've talked about in-memory computing, and something is coming in that field: big memory machines, that is, machines with several terabytes of memory. They are not new in the sense that they didn't exist before, but new in the sense that you can now buy one without your CFO calling you names.
When all the data you need to analyze fits in memory, things that took minutes (if not hours) now take seconds and finding correlations on large data sets becomes feasible again.
The only downside is that loading all the data may take several hours, sometimes days...
...unless you have a data backend that can deliver record level bulk transfers at very high speed.
That opens up very interesting use cases where big memory machines become the perfect analytics engines and where you can completely decouple the power you need for storage from the power you need for analytics.
In finance, it's interesting to be able to store the output of a computation grid for future analysis. Think about CVA and FRTB and the challenges they represent. With our approach you protect your investment: you can keep working with a programming model you know and like while benefiting from the latest and greatest in technology.
With our friends at QuartetFS, we loaded financial data from remote storage into "ready to use" numerical data for a CVA application at a sustained rate of 4 GB/s. That took only 5 quasardb nodes running on commodity hardware, a 10 GBit/s network, and the combined use of quasardb's link aggregation technology and ActivePivot's high-performance data loader.
In other words, loading and converting 100 GB of historical financial data took 25 seconds, with 5 commodity servers. For comparison, doing the same from directly attached high-performance SSD storage would take at least 100 seconds, without the possibility of growing the storage indefinitely, and assuming you already have a "perfect" representation of your data on disk.
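For reference, here is the back-of-the-envelope arithmetic behind those figures; the roughly 1 GB/s single-SSD read rate is an assumption used only to reproduce the "at least 100 seconds" comparison.

```python
# Back-of-the-envelope check of the figures above.
data_gb = 100          # historical data set size, in GB
cluster_rate = 4.0     # sustained load rate of the 5-node setup, in GB/s
ssd_rate = 1.0         # assumed direct-attached SSD read rate, in GB/s

print(data_gb / cluster_rate)  # 25.0 seconds over the cluster
print(data_gb / ssd_rate)      # 100.0 seconds from local SSD
```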
That opens up a lot of possibilities: you can interactively work on the data set of your choosing, drastically reduce downtime and grow your infrastructure with your needs.
In our vision, this is just one step on the journey toward making data access constraint-free. We will continue to work with our partners to deliver fully integrated solutions that let programmers work in such a way that users no longer have to worry about the location or form of their data.
That's what we call the right to data approach. Direct access to the data you need, with one, transparent, straightforward technology.
It's the only way to take "big data" out of the niche where it currently resides.