This article describes the NoSQL databases that were developed after the mass adoption of relational databases. Strictly speaking, the term could also cover pre-relational databases that were developed independently of SQL systems, but those were never meant to contribute to modern NoSQL solutions.
As the amount of data on the Internet grew, the global IT community started to search for better data storage and access strategies. The concept of Big Data emerged to denote strategies for dealing with constantly growing data sets. It revealed the need for a completely new database model focused on access speed and scalability: a solution that would be simpler than a relational database and yet no less effective for tasks such as building cloud storage, where users primarily value access speed and large data volumes.
The most heated discussion about NoSQL databases broke out in 2009 and gave rise to many myths and debatable theories. Let’s review some of them to reconstruct how NoSQL developed and how people’s views of it changed.
NoSQL Is a Revolution
“A completely new system, a fresh idea that will change everything”: one could often hear something like that from NoSQL promoters. In reality, no global breakthrough happened. NoSQL systems developed as an extension of relational databases in response to new data storage and access requirements.
NoSQL solutions have little to offer in the way of fundamentally new approaches. For instance, the concept behind MongoDB, first released in 2009, is practically a modernized version of the Pick database, which dates back to 1965.
No Future for SQL Databases
This widely circulated opinion was first voiced around 20 years ago. Still, relational databases hold the top position on the market, well ahead of NoSQL solutions. This may be explained by the fact that NoSQL systems cannot perform SQL tasks better than SQL itself. The best NoSQL solutions cope with specific tasks and are usually created by leading IT organizations, such as Google, Amazon, Microsoft and the Apache Software Foundation, to meet their own needs. For example, Google AppEngine Data Store can be used only with Google web services, SQL Data Services is a part of the Microsoft Azure platform, and SimpleDB is a part of Amazon Web Services.
Non-cloud stores, which can simply be installed on a PC, are usually young open source projects developed for particular purposes. They aren’t designed to cope with the whole variety of tasks that SQL systems can handle. Moreover, the key problem isn’t the narrow specialization of NoSQL but its range of flaws. These might be minor for IT corporations, but they are critical for the majority of ordinary companies.
The Hype around NoSQL and Big Data – Just a Fake
Despite the drawbacks of NoSQL, one shouldn’t take all the criticism on trust either. The wave of non-relational databases has brought about many commercially successful solutions. Numerous projects can benefit from NoSQL systems, although one should always assess the long-term prospects of introducing a non-relational database. Most NoSQL solutions are young, and many companies have failed by thoughtlessly following the trend. Still, such bad examples do not prove that the product itself is of low quality.
We expect that, as data processing develops, there will be more and more combined solutions in which NoSQL systems cover the weak spots of SQL.
NoSQL Database Types
Overall, there are four main types of NoSQL stores. They differ in data model and in their approaches to distribution and replication; as a result, they suit different kinds of tasks.
Key-value storage is the simplest database type. It is in fact an associative array: each value has a corresponding unique key. Due to its simplicity, this storage type scales impressively well. For instance, it doesn’t require a database schema, and there are no connections between values. In fact, the number of values is limited only by the resources of the machine. That is why this storage type is so attractive for companies that provide cloud hosting services.
On the other hand, the simplicity of key-value storage makes the majority of ordinary operations on stored values complicated or even impossible. Access by key is fast and flexible, but a search by value might take several orders of magnitude longer than in a relational database. Combined with the limited range of operations on cell values, this makes database analysis very slow and compiling statistics impractical.
Consequently, this store type is used when the cells’ content is not important for analysis, that is, when there are no connections between the cells within the database. Key-value storage didn’t manage to replace relational databases, but it is widely used as an object cache, because the cached objects of different users aren’t connected either; what matters more is cache access speed and the ability to scale the system.
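The trade-off between access by key and search by value can be illustrated with a minimal sketch in which a plain Python dictionary stands in for a key-value store. The session keys and cached values are made up for illustration and do not come from any particular product.

```python
# A dict stands in for a key-value store; keys and values are illustrative.
cache = {
    "session:alice": {"theme": "dark", "cart_items": 3},
    "session:bob": {"theme": "light", "cart_items": 0},
    "session:carol": {"theme": "dark", "cart_items": 1},
}


def get_by_key(store, key):
    """Direct lookup: a single hash access, independent of store size."""
    return store.get(key)


def find_keys_by_value(store, predicate):
    """Search by value: every entry has to be inspected (linear scan)."""
    return [key for key, value in store.items() if predicate(value)]


alice = get_by_key(cache, "session:alice")
dark_sessions = find_keys_by_value(cache, lambda v: v["theme"] == "dark")
```

The second function is the reason analysis over values is slow: without a secondary index, the store has no way to answer the question other than reading everything.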
A document-oriented database stores hierarchical data structures (documents) organized as a tree or a forest. A tree has a root node and may have several internal and leaf nodes. Leaf nodes contain the actual data, which is added to the database indexes when it is inserted. Due to this, fast search is possible even if the overall store structure is complicated. In fact, a document-oriented database is a more complicated version of key-value storage: it still doesn’t fit systems where elements have numerous connections, but it allows selecting data on request without fully loading particular documents into RAM. Search algorithms work with full documents as well as document parts, and the tree structure, in turn, allows creating document collections by type or topic.
For instance, when building a music store you may create a collection of 80s music and then make “subcollections” by year within it, where documents contain the albums released that year. If a user needs the most popular songs of the decade, the query will take quite long, as the system will examine each document in the database. Thus, a document-oriented database is useful for storing data in an orderly way when elements aren’t connected and there is no need to compile statistics. Documents don’t have a fixed schema, which means that each document may contain an arbitrary number of unique fields, unlike a relational database, where storing heterogeneous data may cause empty fields to appear.
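The music example can be sketched with nested Python structures standing in for a document store. The album names, play counts and the extra "label" field are hypothetical; the point is only that a query inside one subcollection is cheap, while a decade-wide ranking has to walk every document.

```python
# Hypothetical "80s music" collection: subcollections by year, documents are albums.
# Documents are schemaless, so fields may differ from one album to the next.
eighties = {
    1982: [
        {"album": "Album A", "songs": [{"title": "Song 1", "plays": 120}]},
    ],
    1984: [
        {"album": "Album B", "label": "Example Records",
         "songs": [{"title": "Song 2", "plays": 300},
                   {"title": "Song 3", "plays": 50}]},
    ],
    1987: [
        {"album": "Album C", "songs": [{"title": "Song 4", "plays": 210}]},
    ],
}


def albums_of_year(collection, year):
    """Cheap: only one subcollection is touched."""
    return [doc["album"] for doc in collection.get(year, [])]


def top_songs(collection, n):
    """Expensive: every document in every subcollection is examined."""
    songs = [song
             for docs in collection.values()
             for doc in docs
             for song in doc["songs"]]
    ranked = sorted(songs, key=lambda s: s["plays"], reverse=True)
    return [s["title"] for s in ranked[:n]]
```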
A graph database is a generalization of the network data model and is distinguished by strong connections between nodes.
A graph database is the best choice for projects that imply a graph data structure, namely social networks and the semantic web. In this field it outperforms relational databases while adding simplicity and clarity. Some databases have special optimization algorithms for working with SSD storage devices. There are also algorithms for larger graphs that load only part of a graph into RAM.
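As a rough illustration of why graph-shaped data suits social networks, here is a minimal sketch that keeps a hypothetical friendship graph as adjacency sets and answers a typical "friends of friends" question by following links directly. The user names are invented, and a real graph database would add persistence, indexing and a query language on top of this idea.

```python
# Hypothetical social graph: each user maps to the set of users they are connected to.
friends = {
    "ann": {"bob", "cid"},
    "bob": {"ann", "dee"},
    "cid": {"ann", "dee", "eve"},
    "dee": {"bob", "cid"},
    "eve": {"cid"},
}


def friends_of_friends(graph, user):
    """People two hops away who are not already direct friends."""
    direct = graph[user]
    two_hops = set()
    for friend in direct:
        two_hops |= graph[friend]
    return two_hops - direct - {user}
```

In a relational schema the same question would typically require joining a friendship table with itself, once per hop.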
A Bigtable-style database, or column family store, holds data in the form of a sparse matrix in which rows and columns are used as keys. This kind of storage has much in common with document-oriented databases in its use cases: content management systems, event logging and blogs. At the same time, one shouldn’t confuse a Bigtable-style database with a column store, which is in fact a relational DB that stores each column separately.
As a rule, this kind of storage is used for web indexing and other tasks that require processing large amounts of data.
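The sparse-matrix idea can be sketched as a dictionary of rows, each holding only the columns that actually have values. This is a simplified model, not the API of any real column family store, and the row and column names are assumptions chosen to resemble a web index.

```python
from collections import defaultdict

# Sparse matrix keyed by row key, then column key; absent cells take no space.
table = defaultdict(dict)


def put(row, column, value):
    """Store a value in the cell addressed by (row, column)."""
    table[row][column] = value


def get(row, column, default=None):
    """Read a cell; missing rows or columns simply yield the default."""
    return table.get(row, {}).get(column, default)


put("example.com/index", "contents:html", "<html>...</html>")
put("example.com/index", "anchor:example.org", "Example")
put("example.com/about", "contents:html", "<html>about</html>")
```

Each row may carry a different set of columns, which is what keeps the matrix sparse.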
Strong and Weak Points of NoSQL
To explain the popularity of NoSQL and to outline where it is applicable, we would like to review it from several angles: outline the differences between NoSQL and relational databases, and discuss the projects where a NoSQL database has a stronger position due to its design features.
The first feature is pretty obvious: the lack of SQL (Structured Query Language), the universal query language that all relational systems use. Every NoSQL system has its own API or an embedded query language, which is usually just a stripped-down version of SQL. The positive sides of this solution are:
- Easy to use. Many NoSQL solutions, mostly key-value stores, have limited functionality in comparison with relational databases, but it is enough for particular tasks. Also, the user doesn’t need special qualifications, since there is no need to master the powerful and flexible machinery of SQL queries. This lowers the entry barrier for getting started with NoSQL storage.
- The simpler the query syntax is, the fewer errors occur. Some developers use ORM (Object-Relational Mapping) to make work with the database easier. This technology automatically translates operations on objects into database queries. Often it produces numerous bad or needless queries. It’s not that ORM developers do their job poorly; the task itself is too complicated. SQL is universal and very expressive, which means that the user needs specific knowledge to apply it. Modern NoSQL query languages, by contrast, are designed to perform simple operations on the database.
There are several drawbacks as well that might outweigh the benefits of NoSQL:
- The application is bound to a particular DBMS. SQL is common to all relational stores, so the user doesn’t need to rewrite the whole code if the DBMS is changed. Even if two NoSQL systems are conceptually alike, they still differ considerably in their APIs and query specifications.
- Limited capabilities of the embedded query language. SQL has a rich history and many standards. It is a very powerful and complicated tool for manipulating data and compiling reports. Almost every NoSQL query language and storage API was modeled on a subset of SQL functions; as a result, they have rather limited functionality.
- Scarce expertise and narrow specialization. It’s much easier to find an SQL specialist, because specialists in a particular NoSQL API are rare. This means that a database operator usually has to learn the specifics on the job.
NoSQL - Simple and New
While the benefits of NoSQL are obvious, the drawbacks are usually revealed only by bitter experience. First of all, the rigid structure of a relational DB guarantees data integrity to some extent: information that doesn’t fit the declared type will never be added to the database. When you use, for instance, a key-value store, data integrity is controlled entirely by the application. Secondly, creating a relational store includes developing a data model. At this step, one can take into account the weak points of the design and develop a truly stable and convenient system. NoSQL solutions don’t require defining a database schema before getting started. However, without preliminary testing and planning one may face unexpected difficulties at the development stage that rule out a particular way of using a NoSQL solution. Given the difficulties of moving from one non-relational database to another mentioned above, every error may result in a big loss.
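What "integrity controlled by the application" means in practice can be shown with a small sketch. A dictionary again stands in for a key-value store, and the record layout (a name and an age) is a made-up example; the checks in `validate_user` are the kind of constraint a relational schema would enforce on its own.

```python
def validate_user(record):
    """Application-side check that a relational schema would enforce itself."""
    if not isinstance(record.get("name"), str) or not record["name"]:
        raise ValueError("name must be a non-empty string")
    if not isinstance(record.get("age"), int) or record["age"] < 0:
        raise ValueError("age must be a non-negative integer")
    return record


store = {}


def put_user(key, record):
    # Validation runs first, so nothing is written if it fails.
    store[key] = validate_user(record)


put_user("user:1", {"name": "Alice", "age": 30})

try:
    put_user("user:2", {"name": "Bob", "age": "unknown"})
except ValueError:
    rejected = True
else:
    rejected = False
```

If any code path writes to the store without going through `put_user`, malformed data gets in unnoticed, which is exactly the risk described above.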
Another important feature of NoSQL worth mentioning is the “youth” of its solutions. Many of them are distributed under BSD-like licenses and are sustained by community efforts. Each company has particular database security requirements, so the majority of new NoSQL solutions go unnoticed. Some non-relational stores are still at the beta stage, and even those launched earlier don’t have a track record of successful deployments comparable to that of relational DBMSs. Apart from the high probability of bugs and other weak points in the code of a non-relational DBMS, there is another kind of error: choosing the wrong product for the company’s needs.
Strong Points of NoSQL
Having reviewed some arguable features of NoSQL solutions, we would like to discuss the area where NoSQL is more successful: distributed systems. All non-relational storage types have an advantage here except graph databases, since those imply many links between data nodes.
The avalanche-like growth of data on the Internet has exacerbated the main problem of vertical scalability: the computing power of a single machine can’t grow forever, and several commodity servers cost less than one high-performance server. In this situation the best option is horizontal scalability, where several separate machines work on one task, each processing a part of it. Such an architecture makes it quick to increase cluster capacity by adding a server. NoSQL stores designed for distributed systems are built from the start so that replication, data distribution and fault tolerance are handled by the NoSQL database itself.
The key advantages of NoSQL databases in distributed systems are their sharding and replication functions.
Replication is the process of creating copies of data on other servers as the data is updated. It improves fault tolerance and system scalability. Traditionally, two main types of replication are distinguished: master-slave and peer-to-peer.
The first type implies one “master” server and several child servers. All data is written to the master server, which subsequently propagates the changes to the child servers. This replication type scales reads well (it is possible to read from any network node), but writes cannot be scaled, because all writes go to the master server only. This kind of replication also runs into difficulties if the master server fails; in that case, a new master has to be chosen from the remaining servers.
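The master-based scheme can be sketched in a few lines. This is a toy in-memory model with hypothetical class names (`Master`, `Replica`), not the protocol of any real database: changes are pushed synchronously, and failure handling and the election of a new master are left out.

```python
class Replica:
    """A child server: receives changes from the master and serves reads."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Master:
    """Accepts all writes and forwards each change to every replica."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def write(self, key, value):
        self.data[key] = value
        for replica in self.replicas:
            replica.apply(key, value)  # synchronous push, for simplicity


replicas = [Replica(), Replica()]
master = Master(replicas)
master.write("user:1", "Alice")
```

Reads can be spread over any number of replicas, but every write still passes through the single `Master.write` call, which is why write throughput does not scale in this scheme.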
The second type (peer-to-peer) implies that all nodes handle read and write queries equally. Information about updates is passed from one server to the next in a ring.
Sharding is the process of dividing a data set between network nodes, so that each node handles a particular piece of the data and the read and write queries addressed to it. With relational databases this technique was used in a rather crude form: separate databases were set up, and user queries were routed between them at the application level.
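Dividing data between nodes can be illustrated with a minimal hash-based sketch. The node names are hypothetical, and taking a hash modulo the node count is only one possible placement rule, chosen here for brevity.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical node names


def shard_for(key, nodes=NODES):
    """Pick the node responsible for a key from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


# Each node stores only its own slice of the data.
storage = {node: {} for node in NODES}


def write(key, value):
    storage[shard_for(key)][key] = value


def read(key):
    return storage[shard_for(key)].get(key)


write("user:42", "Alice")
```

With plain modulo placement, adding a node changes the target of most keys, so production systems often prefer schemes such as consistent hashing that move less data when the cluster grows.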
A strong argument for NoSQL solutions is that social data isn’t relational and that implementing a big social network on SQL might cause trouble. Indeed, building a news feed with a relational database means joining several tables. The posts, likes and comments, avatars and other data needed for a news feed are usually kept in different places, so it takes time to bring everything together. Therefore, it seems quite reasonable to keep the whole news feed as a single denormalized structure. On the other hand, modern graph NoSQL databases have problems with scalability, which makes them unsuitable for big social networks. What is more, relational stores have other advantages, such as stability, guaranteed data integrity and personal data security. Overall, what matters for a complex multiuser project is stability rather than high speed or limitless scalability.
The peak of NoSQL popularity was largely driven by the ambitious claims of Twitter. The social network saw some flaws in working with the relational store MySQL, where tweets were kept, and decided to switch to the NoSQL DBMS Cassandra. However, this idea was never carried out because, as Twitter employees commented, the company set its priorities and decided the move was too risky. On the other hand, NoSQL storage has been successfully used by Instagram and Facebook as the main database, which is a big success for the NoSQL family.
Cloud stores and the NoSQL solutions created for them are usually based on the multi-tenancy principle, which implies that many separate users work with the same system simultaneously. To prevent overload, query limits are enforced. For instance, SimpleDB limits query time to 5 seconds, and Google AppEngine Datastore limits a query to 1000 results.
These limits don’t affect the application’s operation, but they narrow the opportunities for data analytics. Companies usually profit from the amount of personal data they collect: such information reveals preferences and makes it easy to build recommendation lists for particular user groups. Many NoSQL solutions don’t support such functions, or make them hard to implement.
This problem has one interesting solution: a separate store where data is duplicated for analysis. Admittedly, transferring millions of posts in batches of 1000 posts per query might take a long time, and this detail should be taken into account.
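The batch transfer can be sketched as a simple paging loop. Here `fetch_page` is a hypothetical stand-in for a store query capped at 1000 results, and Python lists play the role of both the source store and the analytics copy; no real cloud API is being modeled.

```python
def fetch_page(source, offset, limit):
    """Stand-in for a store query capped at `limit` results (hypothetical API)."""
    return source[offset:offset + limit]


def copy_in_batches(source, sink, batch_size=1000):
    """Copy all records to the analytics store, one capped query at a time."""
    offset = 0
    queries = 0
    while True:
        batch = fetch_page(source, offset, batch_size)
        queries += 1
        if not batch:
            break  # an empty page means everything has been copied
        sink.extend(batch)
        offset += len(batch)
    return queries


posts = [f"post-{i}" for i in range(2500)]
analytics_copy = []
queries_used = copy_in_batches(posts, analytics_copy)
```

The number of round trips grows linearly with the data: at 1000 results per query, a million posts need at least a thousand queries, each paying network latency and counting against the provider’s limits.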
The sharp rise in the popularity and use of non-relational DBMSs showed how important a realistic assessment of a company’s priorities is. Some vendors have successfully implemented NoSQL stores, which resulted in lower losses and better service quality. Some failed because they understood too late that the choice of solution was wrong. Some just didn’t try. The choice between relational and non-relational databases is not the only one that a company has to make. No less important is the choice between particular systems and working strategies.
In any case, the NoSQL revolution didn’t happen: relational databases still hold the leading position. They combine stability, functionality and universality. At the same time, NoSQL solutions address only individual flaws of SQL stores, primarily by improving horizontal scalability. Many non-relational databases perform the tasks they were designed for perfectly well, but they are not as universal as SQL. Companies rarely have the amount of data and other preconditions that rule out any solution except NoSQL. NoSQL storage shows good results in combination with a relational database, for instance in systems where SQL keeps the main data and NoSQL is responsible for the cache. Overall, non-relational systems still lack many basic qualities, such as universality, stability, integrity and predictability, to take a more important position on the market.