
Big Data - From Zero to Hero

by @Manu


Posted in February 2021

How to enter the fascinating world of Big Data and not die trying

Do you want to be part of what is coming? In this post I am going to tell you the basics you need to stop being a mere mortal and transform yourself into a Big Data Hero, focusing on which guidelines and tools to follow to get into this world.

Join the future !!!


I divided the content into three universes: what is it?, why?, and how? This is the journey we are going to take!



But don't worry: to become a Big Data Hero, you just have to follow a few steps.

First, I am going to tell you a little about what Big Data is and what it is used for. If you think this part may bore you, you can skip directly to "Shall we continue?"


You can also watch this video that summarizes all the theory in 5 minutes!

https://www.youtube.com/watch?v=bAyrObl7TYE

What is this Big Data that everyone talks about?

According to Gartner, Big Data is "high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation."


And what does it mean?

In other words, Big Data lets us look to the future: it allows us to store and process large volumes of structured, semi-structured, and unstructured data at high speed, with historical or real-time information, leading companies to improve their decision-making processes and build strategies based on data (data-driven). Good heavens!

How do I know when something is Big Data?

The data that makes up Big Data must meet certain characteristics, known as the 7 V's: volume, velocity, variety, veracity, value, viability, and visualization.


Where does this data come from?

The amount of digital data in the world reached one zettabyte (1,000,000,000,000,000,000,000 bytes) in 2012, marking a new era. We currently generate almost 2 zettabytes per year, and it is said that in a few years this will reach 10. Crazy, right?


Today, data is everywhere: we generate it by surfing the internet, using our cell phones, driving our cars, posting on social networks, shopping, and through sensors, machines, and much more.




According to an analysis by Jeff Schultz in 2019, Facebook registered an average of 510,000 comments, 136,000 photo uploads, and 4 million likes per minute worldwide. But today, who is still using Facebook?


Studies say that 71% of Instagram users are under 35 years old, so what is happening with the information from the social network of millennials and Generation Z?


Christina Newberry analyzed Instagram earlier this year and reports that more than 500 million people use it daily, for an average of 30 minutes per day, and that 130 million users view shopping ads each month.



What is the data used for?

This is the moment when we ask ourselves, what do companies do with data? After years of storing information, how do they use it?


The use of data is very wide; it spans many sectors, such as IoT, security, sales, medicine, finance, automotive, smart cities, marketing, sports, and more.


Some of the most common uses are:

Customer trend analysis

Customer segmentation

Creating smart strategies based on data

Customizing products and creating new ones based on customer needs

Predictions for improved health

Improved security

Improved transportation performance

City optimization


Why Big Data? 

You are probably wondering the same thing as me: what benefits does Big Data offer today?


The terms Big Data, Artificial Intelligence, and Data Science have been very fashionable for some time, and they are said to be the jobs of the future. But why?


The answer is simple: nowadays, information about everything that happens and everything we do on any device is stored, and this information grows bigger every day. What is needed now is to analyze and exploit this data with engineering, mathematics, and statistics, so that medicine can evolve and companies can predict movements, improve decision-making, and achieve their digital transformation.


This is where data analytics, Machine Learning algorithms, and more come in, allowing us to transform that information into value. But we are not going to get into that field; it is another world and another post. Better to leave it for later, as we say in theater: LESS IS MORE!

Some interesting cases...

Ok, we have a lot of info; now we want to know what companies do with it and how they exploit it.


The closest example that comes to mind is the fight against COVID-19. With Big Data and Artificial Intelligence, several systems have been developed that help communities predict different aspects of the pandemic and support decision-making, and it is believed that with these technologies future pandemics can be predicted, so we can be better prepared for them.

 

A classic example is the use of Big Data in Japan, where the goal is to improve citizens' quality of life, using information from security cameras (Smart City) to fight crime.



Netflix predicted the success of the series "House of Cards" based on the analysis of its data.


Facial recognition is also contributing to the fight against the disappearance and sex trafficking of children. Thanks to Amazon Rekognition, authorities in California ran the algorithm with the photo of a missing girl and found that she was being sexually exploited.


Something at a slightly more extreme level happens in China: of the 770 million video surveillance cameras that exist in the world, 52% are located in China, and of the 20 most monitored cities in the world, 18 are in China. Although the government assures that this is done to fight crime and monitor traffic in its cities, this type of data use generates fear and confusion, and makes us question to what extent the use of these technologies does not violate our privacy.


If you want to see more use cases across different areas and companies, read this blog article: Big Data & Analytics Case Studies.

Shall we continue? 

If you got this far, it's time to see how to define a Big Data architecture, and what technologies we can learn.


Big Data Architecture - Overview


When designing a Big Data architecture and choosing the tools we are going to use for each stage, we must ask ourselves several questions so that it meets our needs. This is where the 7 V's that we mentioned at the beginning of the post reappear, and it is time to prioritize them based on the use cases we are going to develop. Then we ask ourselves which data sources we are going to use, and finally whether our architecture needs to support streaming.


What technologies exist within each stage?


This answer is very broad. Although there are many technologies, which can be found in hundreds of blogs, I am going to focus on the open-source ones, since they are what we specialize in at netlabs.


Matt Turck took the trouble to put together this table with all the technologies that make up the ecosystem (follow this link to see a zoomable version):


Now, how to start? My process

I am going to tell you which tools I trained in when I decided to work with Big Data. I repeat, there are many more, but I am going to focus on those you need to start introducing yourself to this world.



Are you ready? First, let's start with the Data Platform

To get started, I read the Apache Hadoop, Hive, and Spark docs to get a feel for them before I started testing.


Yes, a lot of reading, now it’s time to install...


The first thing you need is a Sandbox with HDP (Hortonworks Data Platform). I used an Amazon VM and downloaded the Docker container (you can also use VMware or VirtualBox); this link from Cloudera explains how to do it.


Everything ready? Now we have to make something work!!


For that, the best thing is to follow the Cloudera tutorial Getting Started with HDP, which starts with the basic concepts of HDP and Apache Hadoop and offers a brief introduction to all its components.


Since Cloudera migrated HDP to CDP (Cloudera Data Platform), you can find and try the new stuff here.


Then we’ll move on to HDFS, MapReduce & YARN, Hive, and more.

The whole tutorial is extensive and takes time, but it is worth stopping at each concept and not skipping any steps. In general it is quite simple, and everything is very detailed.
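The MapReduce part of the tutorial is the conceptual heart of Hadoop, so it helps to see the idea stripped down. Here is a minimal, pure-Python sketch of the map, shuffle, and reduce phases; no Hadoop required, and the function names and sample input are just illustrative:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(values) for word, values in groups.items()}

lines = ["big data is big", "data is everywhere"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

The real framework does the same three steps, only distributed across many machines, with HDFS holding the input and output.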



Control your anxiety! Reading everything in full and doing the tutorial can take a few days, especially if you are curious and willing to investigate each technology a bit more.


Ok, we got this far and did some tests. Great, but how do we continue?


This is where the magician Milind Jagre appears with his incredible blog, which gives step-by-step instructions for everything needed for the Hortonworks (now Cloudera) Data Platform Certified Developer (HDPCD) certification.


The 52 posts have a very good level of detail, with definitions and images with examples; it is worth doing them ALL.

They are divided into 3 categories: Data ingestion, Data transformation and Data analysis.
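To give a feel for those three categories, here is a toy pipeline in plain Python. The data and field names are made up; in the real exercises the same stages are done with Hadoop-ecosystem tools rather than the standard library:

```python
import csv
import io

# Data ingestion: read raw CSV records from a source
# (here an in-memory string stands in for an external system).
raw = """city,temp_c
Montevideo,18
Madrid,25
Montevideo,22
"""
rows = list(csv.DictReader(io.StringIO(raw)))

# Data transformation: clean and reshape (cast types, normalize names).
records = [{"city": r["city"].strip(), "temp_c": float(r["temp_c"])}
           for r in rows]

# Data analysis: aggregate, e.g. average temperature per city.
cities = {r["city"] for r in records}
avg = {c: sum(r["temp_c"] for r in records if r["city"] == c)
          / sum(1 for r in records if r["city"] == c)
       for c in cities}
print(avg["Montevideo"])  # 20.0
```

The shape of the work is the same at cluster scale; only the tools and the data volumes change.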


If you prefer to watch videos, this Udemy course is ideal for getting started with Hadoop: it has 3.5 hours of videos and access to clusters for testing (it's also free).


Yes, I know, a lot up to here ...



But it's time to move on with Data Flow

To start, as we did with the Data Platform, it is advisable to read the documentation of:


Then, as with CDP, Cloudera has tutorials for Cloudera Data Flow (CDF): on the one hand, the configuration of the Sandbox and Ambari, and on the other, an entry into the world of Apache NiFi and Kafka.
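Before installing anything, it helps to have Kafka's core abstraction in your head: a topic is an append-only log, producers append records, and each consumer keeps track of its own offset into that log. A tiny pure-Python sketch of that mental model (this is not the real Kafka API, just the idea behind it):

```python
class Topic:
    """A Kafka topic reduced to its essence: an append-only log of records."""

    def __init__(self):
        self.log = []

    def produce(self, record):
        # Producers only ever append; existing records are never modified.
        self.log.append(record)

    def consume(self, offset):
        """Return records from `offset` onward, plus the offset to resume from."""
        return self.log[offset:], len(self.log)

events = Topic()
events.produce({"user": "manu", "action": "click"})
events.produce({"user": "ana", "action": "view"})

batch, offset = events.consume(0)       # a consumer reads from the beginning
print(len(batch))                       # 2
events.produce({"user": "manu", "action": "buy"})
batch, offset = events.consume(offset)  # and resumes where it left off
print(batch[0]["action"])               # buy
```

Real Kafka adds partitions, replication, and persistence on top, and NiFi is often the tool that routes data into and out of those topics.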


This gives us an overview of a bunch of technologies, so you're ready to start playing a bit on your own!


Lastly, I recommend getting into the Python and Scala programming languages and the Spark framework.
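Spark's API is functional at heart: you chain transformations like map and filter over a dataset, and nothing is computed until an action asks for a result. You can rehearse that style with Python's built-ins before touching a cluster (a sketch of the mental model, not the PySpark API itself):

```python
from functools import reduce

numbers = range(1, 11)

# Transformations: lazy, composable descriptions of work,
# in the spirit of Spark's rdd.filter(...).map(...).
squares_of_evens = map(lambda x: x * x,
                       filter(lambda x: x % 2 == 0, numbers))

# Action: forces evaluation and produces a concrete result,
# in the spirit of Spark's rdd.reduce(...).
total = reduce(lambda a, b: a + b, squares_of_evens)
print(total)  # 4 + 16 + 36 + 64 + 100 = 220
```

If this style of thinking feels natural, PySpark will feel familiar; the same chain maps almost one-to-one onto RDD or DataFrame operations.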

If you are interested in more:

  1. If you've never heard the word Python, this is a good time to start: Learning Python: From Zero to Hero
  2. If you know a little about programming, in Dataquest you will find a lot of free courses to advance with Python.
  3. And if you're already a functional programming kingpin, it's time to do your own research!


and let it flow !!!


NOW YEEEAH, YOU ARE ALREADY A BIG DATA HERO !!!






Do you want more?

Some courses

Udemy has many short and very specific courses. For example, if you are interested in knowing a little more about Hadoop with MapReduce, HDFS, Spark, Hive, and more, this Ultimate Hands On Hadoop - Big Data Training Course contains 14 hours of videos and some articles.


If you prefer to get more involved with Spark and Python, this short course of 6 hours of videos can be useful: Taming Big Data with Apache Spark 3 and Python


Do you want something more advanced?


Coursera has several interesting certifications. The most recommended is the Big Data Certification Course: if you are interested in learning a little more about Hadoop with MapReduce, Spark, and Hive, this certification is perfect. It contains 6 courses, with a duration of 5 months and a dedication of 7 hours/week.


If you are interested in mathematics and statistics and want something more advanced and specific, this course from Harvard University (although I did not take it) ranks among the best online courses in Data Science. It lasts a year and a half, with a dedication of 2 to 3 hours per week: Data Science Certification from Harvard University.


With all these courses you already have enough to entertain yourself for a long time !!


If you find that you are more interested in creating or understanding algorithms, Andrew Ng's Machine Learning course on Coursera is highly recommended, not only because I am a fan of Andrew, but because it holds the number 1 spot among courses in this branch. It focuses on Linear Regression, Neural Networks, and Machine Learning algorithms, takes about 54 hours, and is worth doing. It's free; you only pay for the certification.

If, instead of a course, you are interested in knowing a little more about ML, don't hesitate to read this "From Zero to Hero" on ML, it's tremendous!!!


Do you already have great knowledge and experience? If you want to get a certification:


Are you interested in reading? I recommend these books that I read and found them interesting:


You can find more recommended books according to the ranking of the most read in this link.  


Namasté

Thanks for reading this post!! This is as far as my recommendations for getting started with Big Data go. If you have any questions or suggestions, or if you find a better tutorial to start with, do not hesitate to write!


I want to thank all the people whose content I mentioned and shared along the way. I am still new to this world, and I would love to help others become part of it as well.

