Big data is a term used when the data exceeds the processing capacity of a typical database. Big data analytics is needed when the data grows quickly and we need to uncover hidden patterns, unknown correlations, and other useful information.
There are three main features of big data:
Volume: Large amounts of data
Variety: Different types of structured, unstructured, and multi-structured data
Velocity: Needs to be analyzed quickly
The following figure shows the interaction between the three Vs:
Big data is the opportunity for any company to gain advantages from data aggregation, data exhaust, and metadata. This makes big data a useful business analytic tool, but there is a common misunderstanding about what big data is.
The most common architecture for big data processing is MapReduce, a programming model for processing large datasets in parallel on a distributed cluster. Apache Hadoop is the most popular implementation of MapReduce for solving large-scale distributed data storage, analysis, and retrieval tasks. However, MapReduce is just one of the three classes of technologies for storing and managing big data; the other two classes are NoSQL and massively parallel processing (MPP) data stores. In this book, we implement MapReduce functions and NoSQL storage through MongoDB; see Chapter 12, Data Processing and Aggregation with MongoDB, and Chapter 13, Working with MapReduce.
MongoDB provides us with document-oriented storage, high availability, and flexible aggregation with map/reduce for data processing.
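To make the MapReduce programming model concrete, the following is a minimal, framework-free sketch of a word count written in plain Python; it only illustrates the map and reduce phases and does not use Hadoop or MongoDB, and the sample documents are made up for the example:

from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Map: emit a (key, value) pair for every word in every document
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce: group the emitted pairs by key and sum their values
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

documents = ["big data needs big tools", "data about data"]
print(dict(reduce_phase(map_phase(documents))))
# prints: {'about': 1, 'big': 2, 'data': 3, 'needs': 1, 'tools': 1}

In a real cluster, the map and reduce phases run in parallel on many machines; this sketch only shows the shape of the computation.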
A paper published by the IEEE in 2009, The Unreasonable Effectiveness of Data, states:
But invariably, simple models and a lot of data trump more elaborate models based on less data.
This is a fundamental idea in big data (you can find the full paper at http://bit.ly/1dvHCom). The trouble with real world data is that the probability of finding false correlations is high and gets higher as the datasets grow. That's why, in this book, we focus on meaningful data instead of big data.
One of the main challenges of big data is how to store, protect, back up, organize, and catalog data at a petabyte scale. Another main challenge is the concept of data ubiquity: with the proliferation of smart devices carrying several sensors and cameras, the amount of data available for each person increases every minute, and big data must process all of this data in real time.
Interaction with the outside world is highly important in data analysis. Using sensors such as RFID (Radio-frequency identification) or a smartphone to scan a QR code (Quick Response Code) is an easy way to interact directly with the customer, make recommendations, and analyze consumer trends.
On the other hand, people are using their smartphones all the time, using their cameras as a tool. In Chapter 5, Similarity-based Image Retrieval, we will use these digital images to perform search by image. This can be used, for example, in face recognition or to find reviews of a restaurant just by taking a picture of the front door.
The interaction with the real world can give you a competitive advantage and a real-time data source directly from the customer.
Formally, social network analysis (SNA) is the analysis of social relationships in terms of network theory, with nodes representing individuals and ties representing relationships between the individuals, as we can see in the following figure. The social network creates groups of related individuals (friendships) based on different aspects of their interaction. We can find important information such as hobbies (for product recommendation) or who has the most influential opinion in the group (centrality). In Chapter 10, Working with Social Graphs, we will present a project to find who your closest friend is, and we'll show a solution for Twitter clustering.
Social networks are strongly connected, and these connections are often not symmetric. This makes SNA computationally expensive; it needs to be addressed with high-performance solutions that are less statistical and more algorithmic.
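As a toy illustration of nodes, ties, and centrality (this is only a sketch and not the code used in Chapter 10; it assumes the third-party NetworkX library, which is not among the tools listed in this book), the following computes which member of a small friendship network is the most central:

import networkx as nx  # third-party library, assumed installed (pip install networkx)

# A toy friendship network: nodes are people, ties are friendships
G = nx.Graph()
G.add_edges_from([
    ("Alice", "Bob"), ("Alice", "Carol"), ("Alice", "David"),
    ("Bob", "Carol"), ("David", "Eve"),
])

# Degree centrality: the fraction of the other nodes each person is connected to
centrality = nx.degree_centrality(G)
print(centrality)
print("Most central:", max(centrality, key=centrality.get))  # Alice in this toy graph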
The visualization of a social network can help us to get a good insight into how people are connected. The exploration of the graph is done through displaying nodes and ties in various colors, sizes, and distributions. The D3.js library has animation capabilities that enable us to visualize the social graph with an interactive animation. These help us to simulate behaviors such as information diffusion or distance between nodes.
Facebook processes more than 500 TB of data daily (images, text, video, likes, and relationships). This amount of data needs non-conventional treatment such as NoSQL databases and MapReduce frameworks. In this book, we work with MongoDB, a document-based NoSQL database that also has great features for aggregation and MapReduce processing.
Tools and toys for this book
The main goal of this book is to provide the reader with self-contained projects ready to deploy. To do this, as you go through the book you will use and implement tools such as Python, D3.js, and MongoDB. These tools will help you to program and deploy the projects. You can also download all the code from the author's GitHub repository at https://github.com/hmcuesta.
You can see a detailed installation and setup process of all the tools in Appendix, Setting Up the Infrastructure.
Python is a scripting language—an interpreted language with its own built-in memory management and good facilities for calling and cooperating with other programs. There are two popular versions, 2.7 and 3.x; in this book, we will focus on the 3.x version because it is under active development and has already seen over two years of stable releases.
Python is multi-platform: it runs on Windows, Linux/Unix, and Mac OS X, and has been ported to the Java and .NET virtual machines. Python has powerful standard libraries and a wealth of third-party packages for numerical computation and machine learning, such as NumPy, SciPy, pandas, SciKit, mlpy, and so on.
Python is excellent for beginners, yet great for experts, and is highly scalable—suitable for large projects as well as small ones. It is also easily extensible and object-oriented.
Python is widely used by organizations such as Google, Yahoo Maps, NASA, RedHat, Raspberry Pi, IBM, and so on.
A list of organizations using Python is available at http://wiki.python.org/moin/OrganizationsUsingPython.
Python has excellent documentation and examples at http://docs.python.org/3/.
Python is free to use, even for commercial products, and is available for download from http://python.org/.
mlpy (Machine Learning Python) is a Python module built on top of NumPy, SciPy, and the GNU Scientific Library. It is open source and supports Python 3.x. The mlpy module provides a large number of machine learning algorithms for supervised and unsupervised problems.
Some of the features of mlpy that will be used in this book are as follows:
We will perform a numeric regression with kernel ridge regression (KRR)
We will explore the dimensionality reduction through principal component analysis (PCA)
We will work with support vector machines (SVM) for classification
We will perform text classification with Naive Bayes
We will see how different two time series are with dynamic time warping (DTW) distance metric
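As a small taste of the mlpy interface, the following minimal sketch runs two of these features, PCA and the DTW distance, on toy data; the class and function names follow the mlpy 3.x documentation, so treat the exact signatures as assumptions to check against your installed version:

import numpy as np
import mlpy  # assumes mlpy 3.x is installed

# Dimensionality reduction with PCA on a small random dataset
x = np.random.rand(20, 5)   # 20 samples, 5 features
pca = mlpy.PCA()
pca.learn(x)                # fit the principal components
z = pca.transform(x, k=2)   # project onto the first two components
print(z.shape)              # (20, 2)

# Dynamic time warping distance between two short time series
s1 = np.array([1.0, 2.0, 3.0, 4.0, 3.0, 2.0])
s2 = np.array([1.0, 1.0, 2.0, 3.0, 4.0, 2.0])
print(mlpy.dtw_std(s1, s2, dist_only=True))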
We can download the latest version of mlpy from http://mlpy.sourceforge.net/.
For reference, you can refer to the paper mlpy: Machine Learning Python (http://arxiv.org/abs/1202.6548), submitted in 2012 by D. Albanese, R. Visintainer, S. Merler, S. Riccadonna, G. Jurman, and C. Furlanello.
D3.js (Data-Driven Documents) was developed by Mike Bostock. D3 is a JavaScript library for visualizing data and manipulating the document object model that runs in a browser without a plugin. In D3.js you can manipulate all the elements of the DOM (Document Object Model); it is as flexible as the client-side web technology stack (HTML, CSS, and SVG). D3.js supports large datasets and includes animation capabilities that make it a really good choice for web visualization.
D3 has excellent documentation, examples, and an active community at https://github.com/mbostock/d3/wiki/Gallery and https://github.com/mbostock/d3/wiki.
You can download the latest version of D3.js from http://d3js.org/d3.v3.zip.
NoSQL (Not only SQL) is a term that covers different types of data storage technologies, used when you can't fit your business model into a classical relational data model. NoSQL is mainly used in Web 2.0 and in social media applications.
MongoDB is a document-based database. This means that MongoDB stores and organizes data as a collection of documents, which gives you the possibility to store your view models almost exactly as you model them in the application. You can also perform complex searches for data and elementary data mining with MapReduce.
MongoDB is highly scalable, robust, and perfect to work with JavaScript-based web applications, because you can store your data in JSON (JavaScript Object Notation) documents and implement a flexible schema, which makes it perfect for unstructured data.
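A minimal sketch of this document-oriented model is shown below; it assumes the pymongo driver and a MongoDB server running locally on the default port, and the database, collection, and field names are only illustrative (Chapter 12 covers this in detail):

from pymongo import MongoClient  # assumes the pymongo driver is installed

client = MongoClient()                  # connects to localhost:27017 by default
collection = client["test"]["reviews"]  # database "test", collection "reviews"

# Documents are stored as JSON-like dictionaries with a flexible schema
collection.insert_one({"place": "Diner", "rating": 4, "tags": ["cheap", "breakfast"]})
collection.insert_one({"place": "Bistro", "rating": 5})  # no "tags" field, and that is fine

# Elementary aggregation: average rating per place
pipeline = [{"$group": {"_id": "$place", "avg_rating": {"$avg": "$rating"}}}]
for doc in collection.aggregate(pipeline):
    print(doc)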
MongoDB is used by highly recognized corporations such as Foursquare, Craigslist, Firebase, SAP, and Forbes. We can see a detailed list at http://www.mongodb.org/about/production-deployments/.
MongoDB has a big and active community and well-written documentation at http://docs.mongodb.org/manual/.
MongoDB is easy to learn and it's free; we can download it from http://www.mongodb.org/downloads.