Big Data Performance

by @NMartinez


Posted in May 2021

Introduction

When we talk about Big Data and about processing those large volumes of data, we often focus on the tools and platforms we will use to process and store the data, and we forget some very important aspects of the data itself.


It never hurts to remember that Big Data is often defined by "the 3 Vs": velocity, variety and volume. Depending on the author, you may find 4, 5 or many more "Vs", but the important thing is that all of them describe the data. Data is the center of Big Data, and how we handle it has a direct impact on our objectives, for example on performance. That is why in this article we will focus on three aspects of the data that improve the performance of our processes, regardless of the tools we use.


If you want to learn a little more about how to become a Big Data hero before reading this article, I recommend the following article: LINK

Results storage

One of the first things we do when we enter the world of Big Data is build our data lake, and with that goal we create data ingestion processes. We invite you to check this AWS blog post explaining what a data lake is.


These ingestion processes pull data from systems external to the platform, for example log files, user-related information, databases, APIs, etc.


Probably following some governance practice, an area has been defined in our storage for "raw" data that has just entered the platform. When we talk about raw data, we mean data exactly as it was at its origin, with no information or value added yet.


Even if we do not have plans for this data yet, at some point we will: we will want to do something with it. Whatever the use case, transformations will be applied to this data, and it will likely be combined with other data. The result of this process is enriched data, and the recommendation is to always store it.


Let's take an example.


We ingest product billing information that is used to build reports at the end of each day, which requires certain transformations on the data. It is advisable to store the result of those transformations, even if it feels like we are duplicating data by keeping it both raw and enriched. The reason is that this information will probably be needed again, and if we do not store it we have to apply the transformations again, which takes longer than reading results that are already sitting in storage.


That is exactly what happens if a new use case appears later and an audit control needs to be added, one that verifies the reports against other data and generates yet another output. Now we have a process that consumes enriched data and produces more enriched data.


We can also have another process that uses the billing data to predict next month's sales with machine learning techniques. This process will require its own enrichment, and we will store that result as well.
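To make the idea concrete, here is a minimal Apache Spark sketch (in Scala) of this store-and-reuse pattern. The paths, column names and aggregation are illustrative assumptions, not a real billing schema.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("daily-billing-report").getOrCreate()

    // 1. Read the raw billing data as ingested.
    val rawBilling = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/datalake/raw/billing/")

    // 2. Apply the report transformations once.
    val dailyReport = rawBilling
      .groupBy("billing_date", "product_id")
      .agg(sum("amount").alias("total_amount"))

    // 3. Persist the enriched result so later processes can reuse it.
    dailyReport.write.mode("overwrite").parquet("/datalake/enriched/daily_reports/")

    // The audit and sales-prediction jobs start from the stored result
    // instead of repeating the transformations on the raw data.
    val reportsForAudit = spark.read.parquet("/datalake/enriched/daily_reports/")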


At this point, we have our billing information in four places:

  1. Raw, from the original ingestion.
  2. Enriched for daily reports.
  3. Enriched by audit processes.
  4. Enriched by the sales prediction process.


This approach raises some questions:

  • If I have the information duplicated, am I not using more space?
    Indeed.
  • So why is it better?
    This is one of the typical cases where it is preferable to use more space than to waste time repeating computations that have already been done. Imagine the extra time it would take if we had to rebuild all the daily reports just to run the audit report.
  • Having the information duplicated can make it harder to organize the data in my data lake and to know what was done to each dataset.
    That is also true, and this is where good governance practices come in; tools such as Apache Atlas can help maintain the lineage of the data and its structures.


Although the conclusion of this section is that we prefer to use more space rather than spend more time on computation, we should not forget that storage space has a cost, whether it is the number of disks on our on-premise platform or the pay-per-use cost in the cloud. That is why it is important to define time-to-live rules for our data, that is, how long (in years) the data will exist in the data lake, so we can plan storage better.
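As an illustration, here is a minimal retention sketch (Scala, using the Hadoop FileSystem API) that removes partitions older than a given time-to-live. It assumes a hypothetical layout where the dataset is partitioned into directories named date=YYYY-MM-DD; adapt it to your own layout, and in the cloud consider the provider's native lifecycle rules (for example S3 lifecycle policies) instead.

    import java.time.LocalDate
    import java.time.format.DateTimeFormatter
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical retention job: delete partitions older than the TTL.
    val retentionDays = 365L
    val basePath      = new Path("/datalake/enriched/daily_reports") // assumed layout
    val cutoff        = LocalDate.now().minusDays(retentionDays)
    val fmt           = DateTimeFormatter.ISO_LOCAL_DATE

    val fs = FileSystem.get(new Configuration())
    fs.listStatus(basePath)
      .filter(_.isDirectory)
      .map(_.getPath)
      .filter { p =>
        p.getName.startsWith("date=") &&
          LocalDate.parse(p.getName.stripPrefix("date="), fmt).isBefore(cutoff)
      }
      .foreach(p => fs.delete(p, true)) // recursive delete of the expired partition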

Data formats

In the previous section we talked about the ingestion and enrichment processes for the data in the data lake, but we did not talk about how that data is stored, and this is one of the most important aspects.


When we talk about storage systems like Hadoop HDFS or AWS S3 in the context of Big Data, we are left with the idea that we can store anything we want, which is true, but it is important to consider how we store it.


Usually we classify the data into:

  • Structured: data with a fixed schema where each attribute has a data type. This is the case of data from relational databases and the like.
  • Semi-structured: there is a schema and there are types, but they are variable. For example, JSON or XML.
  • Unstructured: there is no defined schema or structure. This is usually the case for logs, images, audio, emails and many others.


Data formats apply to structured and semi-structured data, since they perform certain transformations and groupings on the attributes of the schema, producing files that are more complex and not so easy for a human to read, but that perform better when processed. Among the most used formats in Big Data we find:

  • Avro
  • Parquet
  • ORC


These data formats apply sophisticated techniques and structures that make the data more compact and faster to query. For example, Parquet is a column-oriented format that allows reading only the attributes a query needs and skipping the rest, which improves read and processing times (although writes become slower).
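As a small illustration, this Spark (Scala) sketch reads only two columns from a Parquet dataset; the path and column names are the hypothetical ones from the earlier billing example.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("report-totals").getOrCreate()

    // Because Parquet is columnar, only the selected column chunks are read
    // from disk; the other attributes are never deserialized.
    val totalsPerDay = spark.read
      .parquet("/datalake/enriched/daily_reports/")
      .select("billing_date", "total_amount")
      .groupBy("billing_date")
      .sum("total_amount")

    totalsPerDay.show()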


Although our data sources may be easy-to-read files such as JSON or CSV, ideally we do not keep them in plain text but convert them to a suitable format. This reduces the space used and the processing time whenever this data is queried.
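For example, a hypothetical ingestion of user data arriving as JSON could be converted once and kept in Parquet; a minimal Spark (Scala) sketch:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("json-to-parquet").getOrCreate()

    // Read the plain-text source (one JSON document per line is assumed)...
    val rawUsers = spark.read.json("/datalake/raw/users/")

    // ...and store it in a columnar format for every later consumer.
    rawUsers.write.mode("overwrite").parquet("/datalake/raw/users_parquet/")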


In this article we are not going to explain each of these formats in detail; we will cover them in a future, more specific article.

Data compression formats

Closely related to the previous section, not only the format of the data matters, but also its compression. There are several compression mechanisms, and some are more appropriate than others depending on the case: whether we are going to use the files in MapReduce processes and need to split them, whether we need to append information, among other considerations.


Some examples are: Snappy, BZip2, GZip, LZO, ZStd, LZ4.


As with data formats, choosing which one to use and comparing them deserves a separate article, so we will not go into it here. On Big Data platforms based on Apache Hadoop, we have found that ORC + Snappy or Parquet + Snappy are good combinations when we are working with data that we read and write from Apache Spark processes, or on top of which we build a data warehouse layer with Apache Hive.
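As a sketch of those combinations from Spark (Scala), here is the same DataFrame written as Parquet + Snappy and as ORC + Snappy; the DataFrame and paths are the hypothetical ones from the billing example. Note that Snappy is already Spark's default codec for Parquet; it is set explicitly here only to make the combination visible.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("compressed-writes").getOrCreate()
    val enriched = spark.read.parquet("/datalake/enriched/daily_reports/")

    // Parquet + Snappy.
    enriched.write
      .mode("overwrite")
      .option("compression", "snappy")
      .parquet("/datalake/enriched/daily_reports_parquet/")

    // ORC + Snappy, e.g. for a Hive data warehouse layer.
    enriched.write
      .mode("overwrite")
      .option("compression", "snappy")
      .orc("/datalake/enriched/daily_reports_orc/")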


More than one reader is surely already wondering: "By compressing and decompressing, am I not adding overhead to the processing?"

The reality is yes, but by reducing the size of our data we save on disk reads and writes and on the amount of information that travels across our network. Except in very particular cases, compressing ends up being the most efficient option: it reduces read and write times, and therefore the runtime of the processes we execute, while also reducing the space used in our storage.


These compression techniques can be applied not only to the "data at rest" in our data lake, but also in streaming systems such as Apache Kafka, where compression also gives good results.
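In Kafka, for example, it is enough to set the compression.type property on the producer. A minimal sketch (Scala, with the kafka-clients API), where the broker address, topic and payload are illustrative:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    // Batches are compressed on the producer side; consumers decompress transparently.
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy")

    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord[String, String]("billing-events", "invoice-123",
      """{"invoice_id": 123, "amount": 10.5}"""))
    producer.close()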

Conclusions

For any company already in the world of Big Data, it surely felt fantastic to reach the point of having the platform ready to be flooded with terabytes or petabytes of information, whether it is an EMR cluster on AWS or a platform based on Cloudera Hadoop. It is not so great when, a couple of months later and with a good amount of data already loaded, we run into performance problems because the data is stored as uncompressed plain text or because the same processing is performed multiple times. For many, getting to the root of the problem was not trivial, to the point of questioning whether the platform and the investment in hardware or cloud had been worth it.


These problems are solvable: the data and the processes can be reworked, obviously at the cost of the work hours and computing time involved. To avoid them, both for newcomers and for those who keep expanding their data lakes with new data sources and use cases, it is essential to consider these aspects from the beginning.

