
Making Small Work of Big Data

A deeper look into Big Data Databases

In the past month a couple of projects came across my desk that required some further investigation into Big Data. Now, Big Data means exactly what it is called: data that is BIG. As in 500 million rows of complex data that need to be searched in milliseconds. That is definitely not a small feat for the typical single-server relational database schemas you may be used to. In this article we will discuss a little about what Big Data is and, furthermore, how to use tools in the marketplace to handle it.

First we must define specifically what constitutes Big Data. Here we look at the three V’s that can signify a Big Data project: Volume, Velocity and Variety, the metrics commonly used to classify big data.

Volume presents the most immediate challenge to conventional IT structures and mindsets. If you could run a production forecast across 300 factors instead of 6, do you think you could predict demand better? Or how about social data with a full record of comments, likes, dislikes, previously viewed products and actions? And then add on top of that historical logging that goes back to the beginning of that data stream? With the amount of input that is now possible from our always-connected world, the ability to capture, store and recall that vast amount of data is paramount to making your data work for you and not against you.

Velocity goes hand in hand with volume. It is the increasing rate at which data flows into an organization. Tracking more data points means by default you will have more data flowing in both directions. Tracking a user on-site, off-site, on multiple platforms, from their home and also on the go in the mobile universe means that data has the potential to never be turned off: a constant stream of metrics is always making a beeline into your database. If you do not yet have volume, then velocity is sure to get you there quickly with continued use of the application.

Variety is about creating order out of unordered chaos. Data can come from many different sources, such as geo-location, user history, image data, text strings and raw sensor data, so the reality is that big data is messy and hard to fit to a pre-defined schema. A common use of big data processing is extracting ordered meaning from sets of name-value pairs collected from all of those sources.
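To make the name-value pair idea a little more concrete, here is a minimal Python sketch. The sample triples, the field names (`geo`, `last_search`, `sensor_temp_c`) and both helper functions are made up for illustration and do not come from any particular tool. It folds loose pairs into one record per entity, then projects each record onto a fixed set of columns.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Fold (entity_id, name, value) triples into one dict per entity."""
    records = defaultdict(dict)
    for entity_id, name, value in pairs:
        records[entity_id][name] = value
    return dict(records)


def to_rows(records, columns):
    """Project each record onto a fixed column list; missing fields become None."""
    return [
        {"id": entity_id, **{col: fields.get(col) for col in columns}}
        for entity_id, fields in sorted(records.items())
    ]


# Hypothetical input: pairs arriving in no particular order from mixed sources.
raw_pairs = [
    ("u1", "geo", "43.65,-79.38"),
    ("u2", "last_search", "running shoes"),
    ("u1", "last_search", "rain jacket"),
    ("u2", "sensor_temp_c", 21.5),
]

rows = to_rows(group_pairs(raw_pairs), ["geo", "last_search", "sensor_temp_c"])
for row in rows:
    print(row)
```

The point of the sketch is that the schema is applied when the data is read rather than when it is written, which is why messy, schema-less input does not have to be rejected up front.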

Now, this is not the be-all and end-all list of how to identify big data, but it can definitely get you close. Once you take the leap into working with big data, there are further dimensions to consider over and above mere tool selection. The majority of big data solutions are now provided in three forms: software-only, as an appliance, or cloud-based. Which route to take will depend on issues of data locality, privacy, regulations, human resources and project requirements.

Apache Hadoop has been the driving force behind the growth of the big data industry. You’ll hear it referenced often in a multitude of articles on Stack Overflow. Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure. At the core of Hadoop is an implementation of MapReduce, the framework Google devised in response to the problem of creating web search indexes. The important innovation of MapReduce is the ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes. This distributed computation solves the issue of data that is too large to fit onto a single machine and be processed in a timely (almost instantaneous) manner.
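The divide, run in parallel and combine pattern can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Hadoop's API: the threads stand in for cluster nodes, and the sample events and function names are assumptions made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, n_chunks):
    """Divide the dataset into roughly equal slices, one per worker."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Map step: each worker counts events per country in its own slice."""
    return Counter(record["country"] for record in chunk)


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def count_by_country(records, n_workers=4):
    chunks = split_into_chunks(records, n_workers)
    # Threads play the role of nodes here; a real cluster would ship each
    # chunk to a different machine.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


# Hypothetical sample data.
events = [
    {"user": 1, "country": "US"},
    {"user": 2, "country": "CA"},
    {"user": 3, "country": "US"},
    {"user": 4, "country": "DE"},
    {"user": 5, "country": "CA"},
    {"user": 6, "country": "US"},
]

totals = count_by_country(events, n_workers=3)
print(totals.most_common())
```

Because each map step only ever sees its own slice and the reduce step only merges small partial results, no single worker needs the whole dataset in memory, which is the property that lets the same pattern scale out across machines.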

Outside the scope of this article are the associated frameworks that make working with a Hadoop architecture much easier and more application-specific, but they are worth knowing by name:

  • Ambari – Deployment, configuration and monitoring
  • Flume – Collection and import of log and event data
  • HBase – Column-oriented database scaling to billions of rows
  • HCatalog – Schema and data type sharing over Pig, Hive and MapReduce
  • HDFS – Distributed redundant filesystem for Hadoop
  • Hive – Data warehouse with SQL-like access
  • Mahout – Library of machine learning and data mining algorithms
  • MapReduce – Parallel computation on server clusters
  • Pig – High-level programming language for Hadoop computations
  • Oozie – Orchestration and workflow management
  • Sqoop – Imports data from relational databases
  • Whirr – Cloud-agnostic deployment of clusters
  • ZooKeeper – Configuration management and coordination

Choosing the proper layers or abstractions will largely depend upon what specifically you are trying to achieve. In my current Big Data project, I am using a hybrid approach: non-clustered relational databases for the small data and a Hadoop-based solution for the massive data library of approximately 2.35 billion separate points of data. This approach leverages the best of both worlds for my particular application needs. Your mileage may vary.

The data landscape is changing, and an increasing number of clients want to track every possible metric they can without thinking about what that massive amount of data costs in processing power and speed. They are getting used to a world where they can go to Facebook and search billions of records instantaneously, with results popping up after every keystroke.

Remember, Hadoop is not the end of the line; it is just the beginning of the architecture. Microsoft, Oracle, Google and others all have their own products based on the Hadoop architecture. But keep in mind that if you are savvy enough, you can just as easily set up and integrate your own Hadoop system using the plugs and hooks available in the Hadoop ecosystem, such as GlusterFS for a distributed file system.

For a deeper dive, there is a video from the 2014 DevNation Conference on the topic of Hadoop and Beyond.

The next time you run a search on your data, whether it’s your in-house data structure, mobile app or website, and it takes more than 30 seconds to load the results, challenge your IT/DBA and tell them that they can indeed do a better job! Change your programmers’ old-school methods and get hip to Big Data and speed. If they say that a 40ms query across half a billion records is impossible, then fire them and give my team and me at A3S Media a call :) We’ve been there, done that.

Published: October 9, 2015

Categories: Database, NoSQL, MySQL, Big Data

Tags: Web Development

Author: Nick Bagnoli