NoSQL

Is NoSQL a fad or here to stay? What does it even mean?

History

Starting in about the 1970s, the relational model became so dominant that the world seemed to forget there was any other way to store data. But as data stores became larger and larger, this one-size-fits-all idea began to be challenged.

Why this challenge? When data stores become huge you start seeing:

More and more connections between objects, which means queries have too many joins, queries are hard to write and understand, and performance degrades in an RDBMS; and
Nested, hierarchical data, for which queries are nearly impossible to express well in a SQL.

So alternatives started popping up.

What Does NoSQL Mean?

The term NoSQL was coined as an abbreviation for “Not Only SQL” and basically means “you can use other kinds of data stores in your enterprise in addition to traditional relational databases.”

Informally, people usually use the term to mean “non-relational big data.”

Scalability

The usual promise of (non-relational) NoSQL is better scaling, achieved through better distribution (often over commodity hardware) of data. But to do this, many NoSQL systems give up certain things. Some systems:

Do not allow joins
Promise only eventual consistency, not immediate consistency

Exercise: The claim that meassive scalability cannot be achieved via relational systems is controversial; many people have defended relational databases for large-scale operations. Research and discuss the arguments for relational systems over non-relational systems. How are the claims of NoSQL's superiority refuted?

Kinds of Non-Relational Systems

Key Value Databases
Document Databases
Column Databases
Graph Databases

Good Reads

The CAP Theorem

Most NoSQL databases claim, as a selling point, that they are distributed. So let’s say you have a distributed database. That means when you write to a node, the writes have to propagate to other nodes, so you can read from anywhere (or, that reads somehow go to the “correct” nodes that have the most recent data):

Now suppose a couple links go down so the network gets partitioned:

Now writes to the top partition can’t be seen by clients hitting the bottom partition. Let’s define:

Consistency (C): Everyone always reads the same data (the most recent write)
Availability (A): Everyone is always up
Partition Tolerance (P): After a partition, all the partitions still operate

Now notice:

If you have P and A (all partitons and all hosts running), then writes don’t propagate across partitions so you can’t have C.
If you have P and want C, then you must shut down the hosts in all partitions except one, so you can't have A.
If you want both A and C, the only way you can achive this is to never run anyone when there is a partition, so you can’t have P.

The CAP theorem says you can have at most two of three of C, A, and P.

Exercise: This is a trilemma. If you are philosophically inclined, read about the most famous trilemma of them all, the Problem of Evil.