Database Systems

Databases are designed to handle large amounts of information: entering, storing, retrieving, and managing it. Most databases organize that information in tables; Microsoft Access is one of the most widely used desktop database programs.

Databases consist of rows and columns. Each set of related information is entered as a row, which creates a “record,” and each column holds one field of that record. Databases are commonly used for saving addresses and other long lists of information. Once records are created in the database, they can be sorted and manipulated in a variety of ways that are limited primarily by the software being used.

The word data is normally defined as facts from which information can be derived. For example, “Fred Crouse lives at 2209 Maple Avenue” is a fact. A database may contain millions of such facts. From these facts the database management system (DBMS) can derive information in the form of answers to questions such as “How many people live on Maple Avenue?” The popularity of databases in business is a direct result of the power of DBMSs in deriving valuable business information from large collections of data.
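
As a minimal sketch of how a DBMS turns stored facts into answers, the following Python snippet uses the built-in sqlite3 module; the table layout and the extra residents are invented purely for this example:

import sqlite3

# In-memory database purely for illustration; a real DBMS would persist the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE residents (name TEXT, street TEXT, house_number INTEGER)")

# Each fact becomes one record (row).
conn.executemany(
    "INSERT INTO residents VALUES (?, ?, ?)",
    [
        ("Fred Crouse", "Maple Avenue", 2209),
        ("Ann Lee", "Maple Avenue", 2214),   # invented for the example
        ("Joe King", "Oak Street", 17),      # invented for the example
    ],
)

# The DBMS derives information (an answer) from the stored facts.
count = conn.execute(
    "SELECT COUNT(*) FROM residents WHERE street = 'Maple Avenue'"
).fetchone()[0]
print("People living on Maple Avenue:", count)   # -> 2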

Databases are somewhat similar to spreadsheets, but they are more powerful because of their ability to manipulate the data. A number of tasks that would be difficult with a spreadsheet are straightforward with a database (a small sketch of each follows the list below). Consider these actions that are possible with a database:

  • Perform a variety of cross-referencing activities
  • Complete complicated calculations
  • Bring existing records up to date
  • Retrieve large amounts of information that match certain criteria
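
A hedged sketch of those four actions, again with Python's sqlite3 and a pair of invented tables (customers and orders), might look like this:

import sqlite3

# Hypothetical tables, invented purely to illustrate the four actions above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Fred Crouse', 'Springfield'), (2, 'Ann Lee', 'Shelbyville');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Cross-referencing: join each customer with their orders.
joined = conn.execute(
    "SELECT c.name, o.amount FROM customers c JOIN orders o ON o.customer_id = c.id"
).fetchall()

# A calculation that would be fiddly in a long spreadsheet: total spend per customer.
totals = conn.execute(
    "SELECT c.name, SUM(o.amount) FROM customers c "
    "JOIN orders o ON o.customer_id = c.id GROUP BY c.name"
).fetchall()

# Bringing an existing record up to date.
conn.execute("UPDATE customers SET city = 'Capital City' WHERE name = 'Ann Lee'")

# Retrieving every record that matches certain criteria.
big_orders = conn.execute("SELECT * FROM orders WHERE amount > 20").fetchall()

print(joined, totals, big_orders, sep="\n")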

 

Database Design

List of Database Management Systems

List of Database Administration & Schema Tools

List of NoSQL Database Systems

A NoSQL (originally “non-SQL” or “non-relational,” now often read as “not only SQL” to emphasize that such systems may also support SQL-like query languages) database provides a mechanism for storing and retrieving data that is modeled by means other than the tabular relations used in relational databases. Such databases have existed since the late 1960s, but did not acquire the “NoSQL” moniker until a surge of popularity in the early twenty-first century, triggered by the needs of Web 2.0 companies such as Facebook, Google, and Amazon.com. NoSQL databases are increasingly used in big data and real-time web applications.

Motivations for this approach include: simplicity of design, simpler “horizontal” scaling to clusters of machines (which is a problem for relational databases), and finer control over availability. The data structures used by NoSQL databases (e.g. key-value, wide column, graph, or document) are different from those used by default in relational databases, making some operations faster in NoSQL. The particular suitability of a given NoSQL database depends on the problem it must solve. Sometimes the data structures used by NoSQL databases are also viewed as “more flexible” than relational database tables.
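
As a loose sketch, not tied to any particular NoSQL product, the same customer facts might be laid out as follows in each of the four data structures just mentioned (all names and values are invented):

# Plain-Python sketches (not any specific NoSQL product) of the four data structures.

# Key-value: an opaque value looked up by a single key.
kv_store = {"customer:42": '{"name": "Fred Crouse", "street": "Maple Avenue"}'}

# Document: a nested, schema-free structure the store can query into.
doc_store = {
    "customers": [
        {"_id": 42, "name": "Fred Crouse",
         "address": {"street": "Maple Avenue", "number": 2209},
         "orders": [{"id": 10, "amount": 25.0}]},
    ]
}

# Wide column: rows keyed by id, each row holding its own set of column/value pairs.
wide_column_store = {
    "42": {"name": "Fred Crouse", "street": "Maple Avenue"},
    "43": {"name": "Ann Lee", "loyalty_tier": "gold"},   # different columns per row
}

# Graph: explicit nodes and relationships (edges) between them.
graph_store = {
    "nodes": {"fred": {"type": "person"}, "maple_avenue": {"type": "street"}},
    "edges": [("fred", "LIVES_ON", "maple_avenue")],
}

print(kv_store, doc_store, wide_column_store, graph_store, sep="\n")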

Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability, partition tolerance, and speed. Barriers to wider adoption of NoSQL stores include their use of low-level query languages instead of SQL (for instance, the inability to perform ad hoc joins across tables), the lack of standardized interfaces, and the huge existing investments in relational databases. Most NoSQL stores lack true ACID transactions, although a few, such as MarkLogic, Aerospike, FairCom c-treeACE, Google Spanner (though technically a NewSQL database), Symas LMDB, and OrientDB, have made them central to their designs.

Instead, most NoSQL databases offer “eventual consistency,” in which changes are propagated to all nodes “eventually” (typically within milliseconds), so a query might not return the latest data or might read data that is no longer accurate, a problem known as stale reads. Additionally, some NoSQL systems may exhibit lost writes and other forms of data loss; some mitigate this with mechanisms such as write-ahead logging. Data consistency across multiple databases in distributed transaction processing is an even bigger challenge, and one that is difficult for both NoSQL and relational databases: even current relational databases “do not allow referential integrity constraints to span databases,” and few systems support both ACID transactions and the X/Open XA standard for distributed transaction processing.
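
The following toy Python simulation (invented for illustration, not the behaviour of any specific store) shows how asynchronous replication produces the stale reads described above:

import threading
import time

# Toy eventual-consistency simulation: a write lands on the primary first and reaches
# the replica only after a delay, so a read from the replica issued right after the
# write can return stale data.
primary = {}
replica = {}

def replicate(key, value, delay=0.05):
    """Propagate the write to the replica 'eventually' (after `delay` seconds)."""
    def apply_later():
        time.sleep(delay)
        replica[key] = value
    threading.Thread(target=apply_later).start()

def write(key, value):
    primary[key] = value          # acknowledged immediately
    replicate(key, value)         # replication happens in the background

def read_from_replica(key):
    return replica.get(key, "<stale: write not yet replicated>")

write("user:42:city", "Capital City")
print(read_from_replica("user:42:city"))   # likely stale right after the write
time.sleep(0.1)                            # give replication time to catch up
print(read_from_replica("user:42:city"))   # now returns 'Capital City'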

Wide Column Stores/Column Family Databases

Document Store Databases

Key Value/Tuple Store Databases

Graph Databases

Multimodel Databases

Object Databases

Grid & Cloud Databases

XML Databases

Multidimensional Databases

Network Model Databases

Hadoop

In the big data world, the sheer volume, velocity, and variety of data render most ordinary technologies ineffective. Companies like Google and Yahoo! therefore needed solutions to manage all the data their servers were gathering in an efficient, cost-effective way.

Hadoop was originally created by Doug Cutting, an engineer at Yahoo!, drawing on Google’s published designs for the Google File System and MapReduce. It was Yahoo!’s attempt to break the big data problem down into small pieces that could be processed in parallel. Hadoop is now an open-source project available under the Apache License 2.0 and is widely used by many companies to manage large amounts of data successfully. What follows is a short introduction to how it works.

At its core, Hadoop has two main systems:

Hadoop Distributed File System (HDFS): the storage layer for Hadoop, spread out over multiple machines to reduce cost and increase reliability.

MapReduce engine: the processing engine that filters and sorts the stored data and then applies a user-defined computation to it.

How does HDFS work?

In the Hadoop Distributed File System, data is written once and then read and re-used many times. This write-once, read-many pattern, in contrast to the repeated read/write cycles of most other file systems, explains part of the speed with which Hadoop operates. As we will see, this is why HDFS is an excellent choice for the high volumes and velocity of data required today.

HDFS works by having a main “NameNode” and multiple “data nodes” running on a cluster of commodity hardware. All the nodes are usually organized within the same physical rack in the data center. Data is broken down into separate “blocks” that are distributed among the various data nodes for storage. Blocks are also replicated across nodes to reduce the likelihood of failure.

The NameNode is the “smart” node in the cluster. It knows exactly which data node contains which blocks and where the data nodes are located within the machine cluster. The NameNode also manages access to the files, including reads, writes, creates, deletes, and replication of data blocks across different data nodes.
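
A rough sketch, in plain Python rather than HDFS’s actual Java internals, of the bookkeeping the NameNode performs; block size, node names, and the replication factor are invented for readability:

import random

# A file is split into fixed-size blocks and the "NameNode" records which data
# nodes hold a replica of each block.
BLOCK_SIZE = 4          # bytes; unrealistically small so the example stays readable
REPLICATION = 3
DATA_NODES = ["dn1", "dn2", "dn3", "dn4"]

block_map = {}          # block id -> data nodes holding a replica (NameNode metadata)

def store_file(name, data):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for i, _block in enumerate(blocks):
        # Pick REPLICATION distinct data nodes for this block.
        block_map[f"{name}#blk{i}"] = random.sample(DATA_NODES, REPLICATION)

store_file("events.log", b"abcdefghij")
for block_id, nodes in block_map.items():
    print(block_id, "->", nodes)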

The NameNode operates in a “loosely coupled” way with the data nodes. This means the elements of the cluster can dynamically adapt to the real-time demand of server capacity by adding or subtracting nodes as the system sees fit.

The data nodes constantly communicate with the NameNode to see whether they need to complete a certain task. This constant communication ensures that the NameNode is aware of each data node’s status at all times. Since the NameNode assigns tasks to the individual data nodes, if it detects that a data node is not functioning properly it can immediately re-assign that node’s task to a different node containing the same data block. Data nodes also communicate with each other so they can cooperate during normal file operations. Clearly the NameNode is critical to the whole system and should be replicated to prevent system failure.

Again, data blocks are replicated across multiple data nodes, and access is managed by the NameNode. When a data node stops sending a “life signal” (heartbeat) to the NameNode, the NameNode unmaps that data node from the cluster and keeps operating with the other data nodes as if nothing had happened. When the data node comes back to life, or a different (new) data node is detected, that node is (re-)added to the system. That is what makes HDFS resilient and self-healing. Since data blocks are replicated across several data nodes, the failure of one server will not corrupt a file. The degree of replication and the number of data nodes are set when the cluster is implemented, and they can be dynamically adjusted while the cluster is operating.
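
A toy continuation of the bookkeeping sketch above (state re-declared so the snippet runs on its own; again invented, not real HDFS code) shows the reaction to a missing heartbeat:

# When a data node stops sending heartbeats, the NameNode drops it and re-replicates
# the blocks it held onto the surviving nodes, so every block keeps REPLICATION copies.
REPLICATION = 3
live_nodes = {"dn1", "dn2", "dn3", "dn4"}
block_map = {
    "events.log#blk0": ["dn1", "dn2", "dn3"],
    "events.log#blk1": ["dn2", "dn3", "dn4"],
}

def handle_missing_heartbeat(dead_node):
    live_nodes.discard(dead_node)
    for holders in block_map.values():
        if dead_node in holders:
            holders.remove(dead_node)
            # Re-replicate onto any live node that does not already hold this block.
            candidates = sorted(live_nodes - set(holders))
            while candidates and len(holders) < REPLICATION:
                holders.append(candidates.pop(0))

handle_missing_heartbeat("dn3")
print(block_map)   # each block is back to 3 replicas, none of them on dn3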

Data integrity is also carefully monitored: HDFS uses transaction logs and checksum validation to ensure integrity across the cluster. Usually one server in the rack runs the NameNode (and possibly a data node as well), while all the other servers run data nodes only.

Hadoop MapReduce in action

Hadoop MapReduce is an implementation of the MapReduce algorithm developed and maintained by the Apache Hadoop project. The general idea of the MapReduce algorithm is to break down the data into smaller manageable pieces, process the data in parallel on your distributed cluster, and subsequently combine it into the desired result or output.

Hadoop MapReduce includes several stages, each with an important set of operations designed to handle big data. The first step is for the program to locate and read the “input file” containing the raw data. Since the file format is arbitrary, the data must be converted to something the program can process. This is the function of “InputFormat” and “RecordReader” (RR). InputFormat decides how to split the file into smaller pieces (using a function called InputSplit). Then the RecordReader transforms the raw data for processing by the map. The result is a sequence of “key” and “value” pairs.

Once the data is in a form acceptable to map, each key-value pair is processed by the mapping function. To keep track of and collect the output data, the program uses an “OutputCollector”. Another function, called “Reporter”, provides information that lets you know when the individual mapping tasks are complete.

Once all the mapping is done, the Reduce function performs its task on each output key-value pair. Finally an OutputFormat feature takes those key-value pairs and organizes the output for writing to HDFS, which is the last step of the program.
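
To make the pipeline concrete, here is a minimal in-process sketch of the same stages using the classic word-count example. It is plain Python rather than Hadoop’s Java API, and the helper names are invented to mirror the roles of RecordReader, map, and reduce described above:

from itertools import groupby

def record_reader(raw_text):
    """Turn the raw input into (key, value) pairs: here (line number, line text)."""
    return list(enumerate(raw_text.splitlines()))

def map_fn(_line_no, line):
    """Emit one (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_fn(word, counts):
    """Combine all counts emitted for one word."""
    return word, sum(counts)

raw_input = "Hadoop breaks data into pieces\npieces are processed in parallel"

# Map phase: on a real cluster this runs in parallel across many nodes.
mapped = [pair for record in record_reader(raw_input) for pair in map_fn(*record)]

# Shuffle/sort: group the intermediate pairs by key.
mapped.sort(key=lambda kv: kv[0])
grouped = ((key, [v for _, v in group]) for key, group in groupby(mapped, key=lambda kv: kv[0]))

# Reduce phase, then "write" the output (here, just print it).
for word, total in (reduce_fn(key, counts) for key, counts in grouped):
    print(word, total)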

Hadoop MapReduce is the heart of the Hadoop system. It is able to process the data in a highly resilient, fault-tolerant manner. Obviously this is just an overview of a larger and growing ecosystem with tools and technologies adapted to manage modern big data problems.

List of Hadoop Systems

Open Lectures:

  1. Readings in Database Systems
  2. Database Design

Further Reading:

  1. Connolly and Begg: Database Systems: A Practical Approach to Design, Implementation, and Management: Global Edition Paperback – 26 Sep 2014


This book is ideal for a one- or two-term course in database management or database design at the undergraduate or graduate level. With its comprehensive coverage, it can also be used as a reference by IT professionals.

This best-selling text introduces the theory behind databases in a concise yet comprehensive manner, providing database design methodology that can be used by both technical and non-technical readers. The methodology for relational Database Management Systems is presented in simple, step-by-step instructions in conjunction with a realistic worked example using three explicit phases—conceptual, logical, and physical database design.

Teaching and Learning Experience

This program presents a better teaching and learning experience–for you and your students. It provides:

  • Database Design Methodology that can be Used by Both Technical and Non-technical Readers
  • A Comprehensive Introduction to the Theory behind Databases
  • A Clear Presentation that Supports Learning

 

  • Paperback: 1440 pages
  • Publisher: Pearson; 6 edition (26 Sept. 2014)
  • Language: English
  • ISBN-10: 1292061189
  • ISBN-13: 978-1292061184
  • Product Dimensions: 19.1 x 4.3 x 23.2 cm

Buy at Amazon.co.uk

2. Silberschatz, Korth and Sudarshan: Database System Concepts (Int’l Ed) Paperback – 1 Jun 2010


Database System Concepts by Silberschatz is now in its 6th edition and is one of the cornerstone texts of database education. It presents the fundamental concepts of database management in an intuitive manner geared toward allowing students to begin working with databases as quickly as possible. Silberschatz is designed for a first course in databases at the junior/senior undergraduate level or the first year of graduate study. It also contains additional material that can be used as a supplement or as introductory material for an advanced course. Because the authors present concepts as intuitive descriptions, familiarity with basic data structures, computer organization, and a high-level programming language is the only prerequisite. Important theoretical results are covered, but formal proofs are omitted; in place of proofs, figures and examples are used to suggest why a result is true.

  • Paperback: 1152 pages
  • Publisher: McGraw-Hill Education / Asia; 6 edition (1 Jun. 2010)
  • Language: English
  • ISBN-10: 0071289593
  • ISBN-13: 978-0071289597
  • Product Dimensions: 18.8 x 4.8 x 23.3 cm

Buy at Amazon.co.uk

 

 

3. Linstedt and Olschimke: Building a Scalable Data Warehouse with Data Vault 2.0 Paperback – 15 Oct 2015


The Data Vault was invented by Dan Linstedt at the U.S. Department of Defense, and the standard has been successfully applied to data warehousing projects at organizations of different sizes, from small to large corporations. Due to its simplified design, which is adapted from nature, the Data Vault 2.0 standard helps prevent typical data warehousing failures. “Building a Scalable Data Warehouse” covers everything one needs to know to create a scalable data warehouse end to end, including a presentation of the Data Vault modeling technique, which provides the foundations to create a technical data warehouse layer. The book discusses how to build the data warehouse incrementally using the agile Data Vault 2.0 methodology. In addition, readers will learn how to create the input layer (the stage layer) and the presentation layer (data mart) of the Data Vault 2.0 architecture, including implementation best practices. Drawing upon years of practical experience and using numerous examples and an easy-to-understand framework, Dan Linstedt and Michael Olschimke discuss:

  • How to load each layer using SQL Server Integration Services (SSIS), including automation of the Data Vault loading processes
  • Important data warehouse technologies and practices
  • Data Quality Services (DQS) and Master Data Services (MDS) in the context of the Data Vault architecture

The book:

  • Provides a complete introduction to data warehousing, applications, and the business context so readers can get up and running fast
  • Explains theoretical concepts and provides hands-on instruction on how to build and implement a data warehouse
  • Demystifies Data Vault modeling with beginning, intermediate, and advanced techniques
  • Discusses the advantages of the Data Vault approach over other techniques, including the latest updates to Data Vault 2.0 and multiple improvements to Data Vault 1.0

  • Paperback: 684 pages
  • Publisher: Morgan Kaufmann Publishers In (15 Oct. 2015)
  • Language: English
  • ISBN-10: 0128025107
  • ISBN-13: 978-0128025109
  • Product Dimensions: 19 x 3.3 x 23.1 cm

Buy at Amazon.co.uk

4. Inmon and Linstedt: Data Architecture: A Primer for the Data Scientist: Big Data, Data Warehouse and Data Vault Paperback – 26 Nov 2014


Today, the world is trying to create and educate data scientists because of the phenomenon of Big Data, and everyone is looking deeply into this technology. But no one is looking at the larger architectural picture of how Big Data needs to fit within existing systems (data warehousing systems). Taking a look at the larger picture into which Big Data fits gives the data scientist the necessary context for how the pieces of the puzzle should fit together. Most references on Big Data look at only one tiny part of a much larger whole. Until the data gathered can be put into an existing framework or architecture, it can’t be used to its full potential. Data Architecture: A Primer for the Data Scientist addresses the larger architectural picture of how Big Data fits with the existing information infrastructure, an essential topic for the data scientist. Drawing upon years of practical experience and using numerous examples and an easy-to-understand framework, W.H. Inmon and Daniel Linstedt define the importance of data architecture and how it can be used effectively to harness Big Data within existing systems. You’ll be able to:

  • Turn textual information into a form that can be analyzed by standard tools
  • Make the connection between analytics and Big Data
  • Understand how Big Data fits within an existing systems environment
  • Conduct analytics on repetitive and non-repetitive data

  • Paperback: 378 pages
  • Publisher: Morgan Kaufmann (26 Nov. 2014)
  • Language: English
  • ISBN-10: 012802044X
  • ISBN-13: 978-0128020449
  • Product Dimensions: 19 x 2.2 x 23.4 cm

Buy at Amazon.co.uk