Introduction to Big Data and MySQL 8

Today we are in the age of digitalization. We are producing enormous amounts of data in many ways--social networking, purchasing at grocery stores, bank/credit card transactions, emails, storing data on clouds, and so on. One of the first questions that comes to mind is: are you getting the utmost out of the collected data? For this data tsunami, we need to have appropriate tools to fetch data in an organized way that can be used in various fields such as scientific research, real-time traffic, fighting crime, fraud detection, digital personalization, and so on. All this data needs to be captured, stored, searched, shared, transferred, analyzed, and visualized.

Analysis of structured, unstructured, or semi-structured ubiquitous data helps us discover hidden patterns, market trends, correlations, personal preferences, and so on. With the help of the right tools to process and analyze, data organization can result in much better marketing plans, additional revenue opportunities, improved customer service, healthier operational efficiency, competitive benefits, and much more.

Every company collects data and uses it; however, to potentially flourish, a company needs to use data more effectively. Every company must carve out direct links to produced data, which can improve business either directly or indirectly.

Okay, now you have Big Data, which is generally being referred to as a large quantity of data, and you are doing analysis--is this what you need? Hold on! The other most critical factor is to successfully monetize the data. So, get ready and fasten your seatbelts to fly in understanding the importance of Big Data!

In this chapter we will learn about below points to find out Big Data's role in today's life and basic installation steps for MySQL 8:

Importance of Big Data
Life cycle of Big Data
What is structured database
MySQL's basics
New feature introduced in MySQL 8
Benefits of using MySQL 8
How to install MySQL 8
Evolution of MySQL for Big Data

The importance of Big Data

The importance of Big Data doesn't depend only on how much data you have, it's rather what you are going to do with the data. Data can be sourced and analyzed from unpredictable sources and can be used to address many things. Let's see use cases with real-life importance made on renowned scenarios with the help of Big Data.

The following image helps us understand a Big Data solution serving various industries. Though it's not an extensive list of industries where Big Data has been playing a prominent role in business decisions, let's discuss a few of the industries:

Social media

Social media content is information, and so are engagements such as views, likes, demographics, shares, follows, unique visitors, comments, and downloads. So, in regards to social media and Big Data, they are interrelated. At the end of the day, what matters is how your social media-related efforts contribute to business.

I came across one wonderful title: There's No Such Thing as Social Media ROI - It's Called Business ROI.

One notable example of Big Data possibilities on Facebook is providing insights about consumers lifestyles, search patterns, likes, demographics, purchasing habits, and so on. Facebook stores around 100PBs of data and piles up 500TB of data almost daily. Considering the number of subscribers and data collected, it is expected to be more than 60 zettabytes in the next three years. The more data you have, the more analysis you can have with sophisticated precision approaches for better Return on Investment (ROI). Information fetched from social media is also leveraged when targeting audiences for attractive and profitable ads.

Facebook has a service called Graph Search, which can help you do advanced searches with multiple criteria. For instance, you can search for people of male gender living in Ahmedabad who work with KNOWARTH Technologies. Google also helps you refine the search. Such searches and filters are not limited to these; it might also contain school, political views, age, and name. In the same way, you can also try for hotels, photos, songs, and more. So here, you have the business ROI of the Facebook company, which provides Facebook ad services which can be based on specific criteria such as regions, interests, or other specific features of user data. Google also provides a similar platform called Google AdWords.

Politics

The era of Big Data has been playing a significant role in politics too; political parties have been using various sources of data to target voters and better their election campaigns. Big Data analytics also made a significant contribution to the 2012 re-election of Barack Obama by enhancing engagement and speaking about the precise things that were significant for voters.

Narendra Modi is considered one of the most technology and social media-savvy politicians in the world! He has almost 500 million views on Google+, 30 million followers on Twitter, and 35 million likes on Facebook! Narendra Modi belongs to the Bhartiya Janta Party (BJP); Big Data analysis carried major responsibility for the BJP party and its associates for their successful Indian General Election in 2014, using open source tools that helped them get in direct touch with their voters. BJP reached their fluctuating voters and negative voters too, as they kept monitoring social media conversations and accordingly sent messages and used tactics to improve their vision for the election campaign.

Narendra Modi made a statement about prioritizing toilets before temples seven months earlier, after which the digital team closely monitored social media conversations around this. It was noticed that at least 50% of users were in line with the statement. This was when the opportunity to win the hearts of voters was converted to the mission of Swacch Bharat, which means hygienic India. The results were astonishing; BJP party support rose to around 30% in merely 50 hours.

Science and research

Did you know that with the help of Big Data, human genome decoding, which actually took 10 years to process, is now decoded in hardly a day, and there is almost a 100 times reduction in cost predicted by Moore's Law? Back in the year 2000, when the Sloan Digital Sky Survey (SDSS) started gathering astronomical data, it was with a rate of around 200 GB per night, which, at that time, was much higher than the data collected in astronomy history.

National Aeronautics and Space Administration (NASA) uses Big Data extensively considering the huge amount of science and research done. NASA gathers data from across the solar system to reveal unknown information about the universe; its massive collection of data is a prominent asset for science and research, and has been a benefit to humankind in diverse ways. The way NASA fetches data, stores it, and uses it in effective ways is enormous. There are so many use cases of NASA that it would be difficult to elaborate here!

Power and energy

One of the leading energy management companies that helps improve energy consumption with the help of Big Data predictive analysis, which helps build stronger relationships and retaining of customers. This company connects with more than 150 utilities and serves more than 35 million household customers to improve energy usage and reduce costs and carbon emissions. It also provides analytical reports to utility providers, from more than 10 million data points each day, for a holistic overview of usage for analysis. Household customers get these reports in invoices, which provide areas where energy usage can be reduced and directly helps consumers optimize energy costs.

Fraud detection

When it comes to security, fraud detection, or compliance, then Big Data is your soulmate, and precisely if your soulmate helps you in identifying and preventing issues before they strike, then it becomes a sweet spot for business. Most of the time, fraud detection happens a long time after the fraud has happened, when you might have already been damaged. The next steps would be obviously to minimize the impact and improve areas that could help you prevent this from being repeated.

Many companies who are into any type of transaction processing or claims are using fraud detection techniques extensively. Big Data platforms help them analyze transactions, claims, and so on in real-time, along with trends or anomalous behavior to prevent fraudulent activities.

The National Security Agency (NSA) also does Big Data analytics to foil terrorist plans. With the help of advanced Big Data fraudulent techniques, many security agencies use Big Data tools to predict criminal activity, credit card fraud, catch criminals, and prevent cyber attacks, among others. Day by day, as security, compliance, and fraud change their patterns, accordingly security agencies and fraud transaction techniques are becoming richer to keep a step ahead for such unwanted scenarios.

Healthcare

Nowadays, a wrist-based health tracker is a very common thing; however, with the help of Big Data, it not only shows your personal dashboard or changes over time, but also gives you relevant suggestions based on the medical data it collects to improve your diet, and analytic facts about people like you. So, from simple wrist-based health trackers, there are a lot of signs that can improve the healthcare of a patient. Companies providing these kinds of services also analyze how health is impacted by analyzing trends. Gradually, such wearables are also being used in Critical Care Units to quickly analyze the trend of doctors' immediate remediations.

By leveraging data accumulated from government agencies, social services files, accident reports, and clinical data, hospitals can help evaluate healthcare needs. Geographical statistics based on numerous factors, from population growth and disease rate to enhancing the quality of human life, are compared to determine the availability of medical services, ambulances, emergency services, pandemic plans, and other relevant health services. This can unbox probable environmental hazards, health risks, and trends that are being done by few agencies on a regular basis to forecast flu epidemics.

Business mapping

Netflix has millions of subscribers; it uses Big Data and analytics about a subscriber's habits based on age, gender, and geographical location to customize, which has proven to generate more business as per its expectations.

Amazon, back in 2011, started awarding $5 to its customers who use the Amazon Price Check Mobile App--scanning products in the store, grab a picture, and searching to find the lowest prices. It also had a feature to submit the in-store price for the products. It was then Big Data's role to have all the information on products could can be compared with Amazon products for price comparison and customer trends, and accordingly plan marketing campaigns and offers based on valuable data that was collected to dominate a rapidly developing e-commerce competitive market.

McDonalds has more than 35,000 local restaurants that cater to around 75 million customers in more than 120 countries. It uses Big Data to gain insights to improve customer experience and offers McDonalds key factors such as menu, queue timings, order size, and the pattern of orders by customers, which helps them optimize the effectiveness of their operations and customization based on geographical locations for lucrative business.

There are many real-world Big Data use cases that have changed humanity, technology, predictions, health, science and research, law and order, sports, customer experience, power and energy, financial trading, robotics, and many more fields. Big Data is an integral part of our daily routine, which is not evident all the time, but yes, it plays a significant role in the back to what we do in many ways. It's time to start looking in detail at how the life cycle of Big Data is structured, which would give an inside story of many areas that play a significant role in getting data to a place that might be used for processing.

The life cycle of Big Data

Many organizations are considering Big Data as not only just a buzzword, but a smart system to improve business and get relevant marked information and insights. Big Data is a term that refers to managing huge amounts of complex unprocessed data from diverse sources like databases, social media, images, sensor-driven equipment, log files, human sentiments, and so on. This data can be in a structured, semi-structured, or unstructured form. Thus, to process this data, Big Data tools are used to analyze, which is a difficult and time-intensive process using traditional processing procedures.

The life cycle of Big Data can be segmented into Volume, Variety, Velocity, and Veracity--commonly known as the FOUR V's OF BIG DATA. Let's look at them quickly and then move on to the four phases of the Big Data life cycle, that is, collecting data, storing data, analyzing data, and governing data.

The following illustrates a few real-world scenarios, which gives us a much better understanding of the four Vs defining Big Data:

Volume

Volume refers to the vast amount of data generated and stored every second. The size of data in enterprises is not in terabytes--it does an accrual of zettabytes or brontobytes. New Big Data tools are now generally using distributed systems that might be sometimes diversified across the world.

The amount of data generated across the globe by year 2008 is expected to be generated in just a minute by year 2020.

Variety

Variety refers to several types and natures of data such as click streams, text, sensors, images, voice, video, log files, social media conversations, and more. This helps people who scrutinize it to effectively use it for insights.

70% of the data in the world is unstructured such as text, images, voice, and so on. However, earlier structured data was popular for being analyzed, as it fits in files, databases, or such traditional data storing procedures.

Velocity

Velocity refers to the speed of the data generated, ingested, and processed to meet the demands and challenges that exist in the pathway towards evolution and expansion.

New age communication channels such as social media, emails, and mobiles have added velocity to the data in Big Data. To scrutinize around 1TB of trading event information every day for fraud detection is a time sensitive process, where sometimes every minute matters to prevent fraud. Just think of social media conversations going viral in a matter of seconds; analysis helps us get trends on such platforms.

Veracity

Veracity refers to the inconsistency of data that can be found; it can affect the way data is being managed and handled effectively. Managing such data and making it valuable is where Big Data can help.

Quality and accuracy has been a major challenge when we talk about Big Data, as that's what it's all about. The amount of Twitter feeds is an appropriate use case where hashtags, typos, informal text, and abbreviations abound; however, we daily come across scenarios where Big Data does its work in the backend and lets us work with this type of data.

Phases of the Big Data life cycle

The effective use of Big Data with exponential growth in data types and data volumes has the potential to transform economies useful business and marketing information and customer surplus. Big Data has become a key success mantra for current competitive markets for existing companies, and a game changer for new companies in the competition. This all can be proven true if VALUE FROM DATA is leveraged. Let's look at the following figure:

As this figure explains, the Big Data life cycle can be divided into four stages. Let's study them in detail.

Collect

This section is key in a Big Data life cycle; it defines which type of data is captured at the source. Some examples are gathering logs from the server, fetching user profiles, crawling reviews of organizations for sentiment analysis, and order information. Examples that we have mentioned might involve dealing with local language, text, unstructured data, and images, which will be taken care of as we move forward in the Big Data life cycle.

With an increased level of automating data collection streams, organizations that have been classically spending a lot of effort on gathering structured data to analyze and estimate key success data points for business are changing. Mature organizations now use data that was generally ignored because of either its size or format, which, in Big Data terminology, is often referred to as unstructured data. These organizations always try to use the maximum amount of information whether it is structured or unstructured, as for them, data is value.

You can use data to be transferred and consolidated into Big Data platform like HDFS (Hadoop Distributed File System). Once data is processed with the help of tools like Apache Spark, you can load it back to the MySQL database, which can help you populate relevant data to show which MySQL consists.

With the amount of data volume and velocity increasing, Oracle now has a NoSQL interface for the InnoDB storage engine and MySQL cluster. A MySQL cluster additionally bypasses the SQL layer entirely. Without SQL parsing and optimization, Key-value data can be directly inserted nine times faster into MySQL tables.

Store

In this section, we will discuss storing data that has been collected from various sources. Let's consider an example of crawling reviews of organizations for sentiment analysis, wherein each gathers data from different sites with each of them having data uniquely displayed.

Traditionally, data was processed using the ETL (Extract, Transform, and Load) procedure, which used to gather data from various sources, modify it according to the requirements, and upload it to the store for further processing or display. Tools that were every so often used for such scenarios were spreadsheets, relational databases, business intelligence tools, and so on, and sometimes manual effort was also a part of it.

The most common storage used in Big Data platform is HDFS. HDFS also provides HQL (Hive Query Language), which helps us do many analytical tasks that are traditionally done in business intelligence tools. A few other storage options that can be considered are Apache Spark, Redis, and MongoDB. Each storage option has their own way of working in the backend; however, most storage providers exposes SQL APIs which can be used to do further data analysis.

There might be a case where we need to gather real-time data and showcase in real time, which practically doesn't need the data to be stored for future purposes and can run real-time analytics to produce results based on the requests.

Analyze

In this section, we will discuss how these various data types are being analyzed with a common question starting with what if...? The way organizations have evolved with data also has impacted new metadata standards, organizing it for initial detection and reprocessing for structural approaches to be matured on the value of data being created.

Most mature organizations reliably provide accessibility, superiority, and value across business units with a constant automated process of structuring metadata and outcomes to be processed for analysis. A mature data-driven organization's analyzing engine generally works on multiple sources of data and data types, which also includes real-time data.

During the analysis phase, raw data is processed, for which MySQL has Map/Reduce jobs in Hadoop, to analyze and give the output. With MySQL data lying in HDFS, it can be accessed by the rest of the ecosystem of Big Data platform-related tools for further analysis.

Governance

Value for data cannot be expected for a business without an established governance policy in practice. In the absence of a mature data governance policy, businesses can experience misinterpreted information, which could ultimately cause unpredictable damages to the business. With the help of Big Data governance, an organization can achieve consistent, precise, and actionable awareness of data.

Data governance is all about managing data to meet compliance, privacy, regulatory, legal, and anything that is specifically obligatory as per business requirements. For data governance, continuous monitoring, studying, revising, and optimizing the quality of the process should also respect data security needs. So far, data governance has been taken with ease where Big Data is concerned; however, with data growing rapidly and being used in various places, this has drawn attention to data governance. It is gradually becoming a must-considerable factor for any Big Data project.

As we have now got a good understanding of the life cycle of Big Data, let's take a closer look at MySQL basics, benefits, and some of the excellent features introduced.

Structured databases

Many organizations use a structured database to store their data in an organized way with the formatted repository. Basically, data in a structured database has a fixed field, predefined data length, and defines what kind of data is to be stored such as numbers, date, time, address, currency, and so on. In short, the structure is already defined before data gets inserted, which gives a cleaner idea of what data can reside there. The key advantage of using a structured database is data being easily stored, queried, and analyzed.

An unstructured database is the opposite of this; it has no identifiable internal structure. It can have a massive unorganized agglomerate or various objects. Mainly, the source of structured data is machine-generated, which means information generated from the machine and without human intervention, whereas unstructured data is human-generated data. Organizations use structured databases for data such as ATM transactions, airline reservations, inventory systems, and so on. In the same way, some organizations use unstructured data such as emails, multimedia content, word processing documents, webpages, business documents, and so on.

Structured databases are traditional databases that used by many enterprises for more than 40 years. However, in the modern world, data volume is becoming bigger and bigger and a common need has taken its place--data analytics. Analytics is becoming difficult with structured databases as the volume and velocity of digital data grows faster by the day; we need to find a way to achieve such needs in an effective and efficient way. The most common database that is used as a structured database in the open source world is MySQL. You will learn how to achieve this structured database as Big Data that makes complex analysis easy. First, let's look into some insights of MySQL in the next section.

Basics of MySQL

MySQL is a well-known open source structured database because of its performance, easiness to use, and reliability. This is the most common choice of web applications for a relational database. In the current market, thousands of web-based applications rely on MySQL including giant industries like Facebook, Twitter, and Wikipedia. It has also proven the database choice for SaaS (Software as a Service) based applications like SugarCRM, Supply Dynamics, Workday, RightNow, Omniture, and Zimbra. MySQL was developed by MySQL AB, a Swedish company, and now it is distributed and supported by Oracle Corporation.

MySQL as a relational database management system

Data in a relational database is stored in an organized format so that information can be retrieved easily. Data will be stored in different tables made up of rows and columns. However, the relationship also can be built between different tables that efficiently store huge data and effectively retrieve the selected data. This provides database operations with tremendous speed and flexibility.

As a relational database, MySQL has capabilities to establish relationships with different tables such as one to many, many to one, and one to one by providing primary keys, foreign keys, and indexes. We can also perform joins between tables to retrieve exact information like inner joins and outer joins.

SQL (Structured Query Language) is used as an interface to interact with the relational data in MySQL. SQL is an ANSI (American National Standard Institute) standard language which we can operate with data like creation, deletion, updating, and retrieval.

Licensing

Many industries prefer open source technology for their flexibility and cost-saving features, while MySQL has put its footprint in the market by becoming the most popular relational database for web applications. Open source means that you can view the source of MySQL and customize it based on your needs without any cost. You can download the source or binary files from its site and use them accordingly.

The MySQL server is covered under the GNU (General Public License), which means that we can freely use it for web applications, study its source code, and modify accordingly. It also has the enterprise edition as well with advanced features included. Many enterprises purchase enterprise support from MySQL to get assistance on various issues.

Reliability and scalability

MySQL has great reliability to perform well without requiring extensive troubleshooting due to bottlenecks or other slowdowns. It also incorporates a number of performance-enhanced mechanisms such as index support, load utilities, and memory caches. MySQL uses InnoDB as a storage engine, which provides highly efficient ACID-compliant transactional capabilities that assure high performance and scalability. To handle the rapidly growing database, MySQL Replication and cluster helps scale out the database.

Platform compatibility

MySQL has great cross-platform availability that makes it more popular. It is flexible to run on major platforms such as RedHat, Fedora, Ubuntu, Debian, Solaris, Microsoft Windows, and Apple macOS. It also provides Application Programming Interface (APIs) to interconnect with various programming languages such as C, C++, C#, PHP, Java, Ruby, Python, and Perl.

Releases

Here is a list of major releases of MySQL so far:

Version 5.0 GA was released on 19th October, 2005
Version 5.1 GA was released on 14th November, 2008
Version 5.5 GA was released on 3rd December, 2010
Version 5.6 GA was released on 5th February, 2013
Version 5.7 GA was released on 21st October, 2015

Now it's time for the major version release--MySQL 8--which was announced on 12th September, 2016 and is still in the development milestone mode. Let's see what's new in this latest release.

New features in MySQL 8

The MySQL database development team has recently announced its major release as MySQL 8 Development Milestone Release (DMR) with significant updates and fixes for problems that were much needed in this change of Big Data.

You might be wondering why it's 8 after 5.7! Did the intermediate versions, that is, 6 and 7, missed out? Of course this was not the case; actually 6.0 was preserved as part of the changeover to a more frequent and timely release while 7.0 for the clustering version of MySQL.

Let's see some exciting features introduced in this latest version, as depicted in the following diagram:

It's time to look at MySQL 8 features in detail, which makes us excited and convinced about the reasons for a major version upgrade of MySQL.

Transactional data dictionary

Up until the previous version, the MySQL data dictionary was stored in different metadata files and non-transactional tables, but from this version, it will have a transactional data dictionary to store the information about the database. No more .frm, .trg, or .par files. All information will be stored in the database, which removes the cost of performing heavy file operations. There were numerous issues with filesystem metadata storage like the vulnerability of the filesystem, exorbitant file operations, difficult to handle crash recovery failures, or replication; it was also difficult to add new feature-related metadata. Now this upgrade has made it simple by storing information in a centralized manner, and will have improved performance as this data dictionary object can be cached in memory, similar to other database objects.

This data dictionary will have data that is needed for SQL query execution such as catalog information, character sets, collations, column types, indexes, database information, tables, store procedures, functions and triggers, and so on.

Roles

In MySQL 8, the privileges module has been improved by introducing roles, which means a collection of permissions. Now we can create roles with a number of privileges and assign to multiple users.

The problem with the previous version was that we were not able to define generic permissions for a group of users and each user has individual privileges. Suppose if there are 1,000 users already existing that have common privileges, and you want to remove the write permissions for all of these 1,000 users, what would you have done in the previous version? You would take the lengthy approach of updating each and every user, right? Arrgh... that's a time-consuming task.

Now with MySQL 8, it is easy to update any change in privileges. Roles will define all the required privileges and this role will be assigned to those 1,000 users. We just need to make any privilege changes in the role and all users will automatically inherit the respective privileges.

Roles can be created, deleted, grant or revoked permission, granted or revoked from the user account, and can specify the default role within the current session.

InnoDB auto increment

MySQL 8 has changed the auto-increment counter value store mechanism. Previously, it was stored in the memory, which was quite difficult to manage during server restarts or server crashes. However, now the auto-increment counter value is written into the redo log whenever the value gets changed and, on each checkpoint, it will be saved in the system table, which makes it persistent across the server restart.

With the previous version, updating the auto-increment value may have caused duplicate entry errors. Suppose if you updated the value of auto-increment in the middle of the sequence with a larger than the current maximum value, then but subsequent insert operations could not identify the unused values, which could cause a duplicate entry issue. This has been prevented by persisting the auto-increment value, hence subsequent insert operations can get the new value and allocate it properly.

If server restart happened, the auto-increment value was lost with the previous version as it was stored in memory and InnoDB needed to execute a query to find out the maximum used value. This has been changed, as the newer version has the capability to persist its value across the server restart. During the server restart, InnoDB initializes the counter value in memory using the maximum value stored in the data dictionary table. In case of server crashes, InnoDB initializes the auto-increment counter value that is bigger than the data dictionary table and the redo log.

Supporting invisible indexes

MySQL 8 provides you with a feature to make indexes invisible. These kind of indexes cannot be used by the optimizer. In case you want to test the query performance without indexes, using this feature you can do so by making them invisible rather than dropping and re-adding an index. This is a pretty handy feature when indexing is supposed to be dropped and recreated on huge datasets.

All indexes are visible by default. To make them invisible or visible, INVISIBLE and VISIBLE keywords are used respectively, as described in the following code snippet:

ALTER TABLE table1 ALTER INDEX ix_table1_col1 INVISIBLE;
ALTER TABLE table1 ALTER INDEX ix_table1_col1 VISIBLE;

Improving descending indexes

Descending indexes existed in version 5.7 too, but they were scanned in reverse order, which caused performance barriers. To improve performance, MySQL 8 has optimized this and scanned descending indexes in forward order, which has drastically improved performance. It also brings multiple column indexes for the optimizer when the most efficient scan order has ascending order for some columns, and descending order for other columns.

SET PERSIST

Server variables can be configured globally and dynamically while the server is running. There are numerous system variables that we can set using SET GLOBAL:

SET GLOBAL max_connections = 1000;

However, such settings will be lost after server restart. To avoid this, MySQL 8 has introduced the SET PERSIST variant, which preserves variables across a server restart:

SET PERSIST max_connections = 1000;

Expanded GIS support

Until the previous version, it supported only one coordinate system, a unitless 2D place that was not referenced to a position on earth. Now MySQL 8 has added support for a Spatial Reference System (SRS) with geo-referenced ellipsoids and 2D projections. SRS helps assign coordinates to a location and establishes relationships between sets of such coordinates. This spatial data can be managed in data dictionary storage as the ST_SPATIAL_REFERENCE_SYSTEMS table.

The default character set

The default character set has been changed from latin1 to UTF8. UTF8 is the dominating character set, though it hadn't been a default one in prior versions of MySQL. Along with the character set default, collation has been changed from latin1_swedish_ci to utf8mb4_800_ci_ai. With these changes globally accepted, character sets and collations are now based on UTF8; one of the common reasons is because there are around 21 different languages supported by UTF8, which makes systems provide multilingual support.

Extended bit-wise operations

In MySQL 5.7, bit-wise operations and functions were working for BigInt (64-bit integer) data types only. We needed to pass BIGINT as an argument and it would return the result as BIGINT. In short, it had maximum range up to 64 bits to perform operations. A user needs to do conversion to the BIGINT data type in case they want to perform it on other data types. This typecasting was not feasible for data types larger than 64 bits as it would truncate the actual value, which resulted in inaccuracy of data.

MySQL 8 has improved bit-wise operations by enabling support for other binary data types such as Binary, VarBinary, and BLOB. This makes it possible to perform bit-wise operations on larger than 64-bit data. No more typecasting needed! This allows the taking of arguments and returning results larger than 64 bits.

InnoDB Memcached

Multiple get operations are now possible with the InnoDB Memcached plugin, which will really help in improving the read performance. Now, multiple key value pairs can be fetched in a single Memcached query. Frequent communication traffic has also been minimized as we can get multiple data in a single shot. We will be referring the Memcached plugin in detail in Chapter 4, Using Memcached with MySQL 8.

Range queries are also supported by the InnoDB Memcached plugin. It simplifies range searches by specifying a particular range and retrieves values within this range.

NOWAIT and SKIP LOCKED

When rows are locked by other transactions that you are trying to access, then you need to wait for that transaction to release the lock on the same row so that you can access it accordingly. To avoid waiting for the other transaction, InnoDB has added support of the NOWAIT and SKIP LOCKED options. NOWAIT will return immediately with an error in case the requested row is locked rather than going into the waiting mode, and SKIP LOCKED will skip the locked row and never wait to acquire the row lock. Hence, SKIP LOCKED will not consider the locked row in the resulting set:

SELECT * FROM table1 WHERE id = 5 FOR UPDATE NOWAIT;
SELECT * FROM table1 FOR UPDATE SKIP LOCKED;

Benefits of using MySQL

Whether you are a developer or an enterprise, you would obviously choose the technology that provides good benefits and results when compared to other similar products. MySQL provides numerous advantages as the first choice in this competitive market. It has various powerful features available that make it a more comprehensive database. Let's now go through some benefits of using MySQL.

Security

The first thing that comes to mind is securing data, because nowadays data has become precious and can impact business continuity if legal obligations are not met; in fact, it can be so bad that it can close down your business in no time. MySQL is the most secure and reliable database management system used by many well-known enterprises such as Facebook, Twitter, and Wikipedia. It really provides a good security layer that protects sensitive information from intruders. MySQL gives access control management so that granting and revoking required access from the user is easy. Roles can also be defined with a list of permissions that can be granted or revoked for the user. All user passwords are stored in an encrypted format using plugin-specific algorithms.

Scalability

Day by day, the mountain of data is growing because of extensive use of technology in numerous ways. Due to this, load average is going through the roof. For some cases, it is unpredictable that data cannot exceed up to some limit or number of users will not go out of bounds. Scalable databases would be a preferable solution so that, at any point, we can meet unexpected demands to scale. MySQL is a rewarding database system for its scalability, which can scale horizontally and vertically; in terms of data and load of application queries across multiple MySQL servers is quite feasible. It is pretty easy to add horsepower to the MySQL cluster to handle the load.

An open source relational database management system

MySQL is an open source database management system that makes debugging, upgrading, and enhancing the functionality fast and easy. You can view the source and make the changes accordingly and use it in your own way. You can also distribute an extended version of MySQL but it requires a license for this.

High performance

MySQL gives high speed transaction processing with optimal speed. It can cache the results, which boosts read performance. Replication and clustering enables for better concurrency and manages the workload. Database indexes also accelerate the performance of SELECT query statements for large amounts of data. To enhance performance, MySQL 8 has included indexes in performance schema to speed up data retrieval.

High availability

Today, in the world of competitive marketing, an organization's key point is to have their system up and running. Any failure or downtime directly impacts business and revenue; hence, high availability is a factor that cannot be overlooked. MySQL is quite reliable and has constant availability using cluster and replication configurations. Cluster servers instantly handle failures and manage the failover part to keep your system available almost all the time. If one server gets down, it will redirect the user's request to another node and perform the requested operation.

Cross-platform capabilities

MySQL provides cross-platform flexibility that can run on various platforms such as Windows, Linux, Solaris, OS 2, and so on. It has great API support for the all major languages, which makes it very easy to integrate with languages like PHP, C++, Perl, Python, Java, and so on. It is also part of the LAMP (Linux Apache MySQL PHP) server that is used worldwide for web applications.

It's now time to get our hands dirty and take a look at MySQL 8; let's start with the installation of MySQL 8 on a Linux platform in our case. We prefer MySQL 8 on a Linux operating system as that has been a common use case across many organizations. You are able to use it on other platforms that MySQL supports, such as Windows, Solaris, HP-UNIX, and so on. Linux provides various ways to install the MySQL server, as follows:

RPM package
YUM repository
APT repository
SLES repository
Debian package
TAR package
Compiling and installing from the source code

We will install MySQL 8 with an RPM-based Linux distribution provided by Oracle; however, you can choose either of the approaches mentioned here. Let's see how to obtain and install it using the RPM package.

Installing MySQL 8

Let's first see from where to download MySQL 8 and a basic understanding of package structures to choose which one is appropriate. We start by obtaining MySQL 8 and will then quickly glance through installation and verification.

Obtaining MySQL 8

Download the RPM package of the MySQL Community Server from its download page (https://dev.mysql.com/downloads/mysql/8.0.html). There are various variants available based on the operating system and its architecture version. It comes with different packages that can be described by the package name. The following syntax is followed by the package name:

packageName-version-distribution-archType.rpm

Package Name: Name of the package, like myql-community-server, mysql-community-client, mysql-community-libs

Version: Describes the version-particular package

Distribution: This says the package is intended for which Linux distribution based on its abbreviation

Abbreviation	Linux distribution
el6, el7	Red Hat Enterprise Linux/Oracle Linux/CentOS 6, 7
fc24, fc25	Fedora 24 or 25
sles12	SUSE Linux Enterprise Server 12
solaris11	Oracle Solaris 11

Arch Type: Describes the processor type for which the package was built, like x86_64, i686, and so on

MySQL 8 installation

Once you have the RPM packages, just install it using the following command. This will place the required files and folder under the system directories:

rpm -vih <package-name>.rpm

For the standard installation, it requires only mysql-community-common, mysql-community-libs, mysql-community-client, and mysql-community-server packages. Look at the following screenshot for the installation process:

After the successful installation, just check its version to validate if it's installed properly and can be accessible:

The next step is to reset the temporary password during post-installation. MySQL will restrict you to use the database before changing this temporary password. A temporary password will be generated by MySQL and available in its log file. To retrieve a temporary password, you need to open the /var/log/mysql/mysqld.log file and search for the temporary password keyword.

Copy this and try to connect to MySQL with the following command:

As shown in the preceding screenshot, you will land on a MySQL command prompt, where you can execute the following query to reset the password:

ALTER USER 'root'@'localhost' IDENTIFIED BY '<NEW_PASSWORD>';

MySQL service commands

Let's look at a few MySQL basic commands:

To start the MySQL service, perform the following command:

service mysqld start

To stop the MySQL service, perform the following command:

service mysqld stop

To check the status of MySQL service whether it is running or not, perform the following command:

service mysqld status

Evolution of MySQL for Big Data

Most enterprises have used MySQL as a relational database for many decades. There is a large amount of data stored, which is used either for transactions or analysis on the data that is collected and generated, and this is where Big Data analytic tools need to be implemented. This is now possible with MySQL integration with Hadoop. Using Hadoop, data can be stored in a distributed storage engine and you can also implement the Hadoop cluster for the distributed analytical engine for Big Data analytics. Hadoop is most preferred for its massive parallel processing and powerful computation. With the combination of MySQL and Hadoop, it is now possible to have real-time analytics where Hadoop can store the data and work in parallel with MySQL to show the end results in real time; this helps address many use cases like GIS information, which has been explained in the Introducing MySQL 8 section of this chapter. We have seen the Big Data life cycle previously where data can be transformed to generate analytic results. Let's see how MySQL fits in to the life cycle.

The following diagram illustrates how MySQL 8 is mapped to each of the four stages of the Big Data life cycle:

Acquiring data in MySQL

With the volume and velocity of data, it becomes difficult to transfer data in MySQL with optimal performance. To avoid this, Oracle has developed the NoSQL API to store data in the InnoDB storage engine. This will not do any kind of SQL parsing and optimization, hence, key/value data can be directly written to the MySQL tables with high speed transaction responses without sacrificing ACID guarantees. The MySQL cluster also supports different NoSQL APIs for Node.js, Java, JPA, HTTP/REST, and C++. We will explore this in detail later in the book, however, we need to keep in mind that using the NoSQL API, we can enable the faster processing of data and transactions in MySQL.

Organizing data in Hadoop

The next step is to organize data in the Hadoop filesystem once the data has been acquired and loaded to MySQL. Big Data requires some processing to produce analysis results where Hadoop is used to perform highly parallel processing. Hadoop is also a highly scalable distributed framework and is powerful in terms of computation. Here, the data is consolidated from different sources to process the analysis. To transfer the data between MySQL tables to HDFS, Apache Sqoop will be leveraged.

Analyzing data

Now it's time for analyzing data! This is the phase where MySQL data will be processed using the map reduce algorithm of Hadoop. We can use other analysis tools such as Apache Hive or Apache Pig to do similar analytical results. We can also perform custom analysis that can be executed on Hadoop, which returns the results set with the data analyzed and processed.

Results of analysis

The results that were analyzed from our previous phases are loaded back into MySQL, which can be done with the help of Apache Sqoop. Now MySQL has the analysis result that can be consumed by business intelligence tools such as Oracle BI Solution, Jasper Soft, Talend, and so on or other traditional ways using web applications that can generate various analytical reports and, if required, do real-time processing.

This is how MySQL fits easily into a Big Data solution. This architecture makes structured databases handle the Big Data analysis. To understand how to achieve this, refer to Chapter 9, Case study: Part I - Apache Sqoop for Exchanging Data between MySQL and Hadoop, and Chapter 10, Case study: Part II - Realtime event processing using MySQL applier, which cover a couple of real-world use cases where we discuss using MySQL 8 extensively and solving business problems to generate value from data.

Summary

In this chapter, we discussed Big Data and its importance in various industries. We also looked into different Big Data segments and their life cycle phases including collecting, storing, analyzing, and governance. We discussed structured databases and why we require a structured database for Big Data. We looked at major features of the MySQL database and explored the newly added features in the MySQL 8 version. Then, we discussed the major benefits of using MySQL and learned how to install MySQL 8 in the Linux system using the RPM package. Finally, we understood how MySQL can be used in Big Data analysis and how it can fit within the Big Data life cycle phases with the help of Hadoop and Apache Sqoop.

In the next chapter, you will learn different query techniques with MySQL 8 including joins and aggregating datasets.