I had a coffee with an imaginary friend. He didn't have a clear idea about Big data, which is why I'll call him Igni (I Got No Idea).
He is very curious and asked lots of smart questions, which is great.
Here is our chat:
Igni: Hi Alessandro, can you tell me a bit about Big data? I get the feeling it's not just a buzzword.
Me: Hi Igni, that's right! Big data is not just a catchy term. It refers to very large amounts of data, in the range of terabytes or petabytes, and to working efficiently with that much data and exploiting it to gain insights.
Igni: Even petabytes? That's a lot. I would need expensive hard drives and servers to store and process such an amount of data.
Me: Not necessarily, you can store the data in the Cloud. For example, on Google Cloud you can use either Cloud Storage or BigQuery to store even petabytes of data. Amazon AWS and Microsoft Azure provide similar services too.
Igni: That is for the storage, but what about querying, processing and transforming the data?
Me: You can do it in the Cloud as well. Many services provide managed versions of open source platforms. For example, Cloud Dataproc is a managed version of Apache Hadoop/Spark, which lets you process data stored in Cloud Storage, Bigtable or BigQuery.
Igni: That is going to be expensive, isn't it?
Me: Well, not really. You pay for the server cluster only while it is running. If you shut down the cluster whenever you do not need any processing, you'll save lots of money. Keep in mind that in the Cloud, contrary to on-premise servers, shutting down a cluster that is not in use and starting it again when you need it is good practice.
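A quick back-of-the-envelope sketch of that reasoning, in Python. The hourly rate, the cluster size and the job schedule are made-up numbers for illustration, not real prices, so check your provider's pricing before drawing conclusions.

```python
# All figures below are hypothetical and only illustrate the arithmetic.
HOURLY_RATE_PER_NODE = 0.20  # assumed price per node per hour, in USD
NODES = 10                   # assumed cluster size
HOURS_PER_MONTH = 730        # average number of hours in a month


def monthly_cost(hours_running, nodes=NODES, rate=HOURLY_RATE_PER_NODE):
    """Cost of keeping `nodes` machines running for `hours_running` hours."""
    return hours_running * nodes * rate


# Cluster left running all month.
always_on = monthly_cost(HOURS_PER_MONTH)

# Cluster started for a 2-hour job once a day and shut down afterwards.
on_demand = monthly_cost(2 * 30)

print(f"always on: ${always_on:.2f}")
print(f"on demand: ${on_demand:.2f}")
```

With these assumed numbers the on-demand cluster costs a small fraction of the always-on one, because the bill follows the hours the cluster is actually up.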
Igni: I see, but if you shut down the servers, you will not be able to access the data!
Me: That is true, but only if you store the data on the cluster's persistent disks. In the Cloud it is much cheaper to separate the storage from the processing. For example, if you use Dataproc you should store the data in Cloud Storage and not in HDFS (Hadoop Distributed File System).
Igni: I get the point, but I still have a concern about performance. I think that processing data stored in Cloud Storage is much slower than processing data stored on a local disk. Am I right?
Me: Not really. Cloud Storage is very fast, and the Regional storage class performs similarly to persistent disks, so speed is not a real concern.
Igni: So you should store the data in Cloud Storage, process it with Cloud Dataproc, and shut down the cluster when your processing is done, is that correct?
Me: That's right! Just remember that Dataproc is only one of the possible choices to process your data.
Igni: Can you tell me more about that? When should I use Dataproc and what about the other choices?
Me: Sure. Dataproc is based on Hadoop and Spark, so it is a good choice if you are familiar with them. If you are not, a good alternative is Cloud Dataflow, which is a service based on Apache Beam. It supports both stream processing and batch processing and, contrary to Dataproc, it does not require you to set up a cluster (it is no-ops) and it scales automatically.
Igni: It looks like Dataflow has some advantages. How does the stream processing work?
Me: Well, it is probably easier to explain it with an example. Imagine you have installed thousands of IoT devices in thousands of cars. Those devices send their speed and position to the Cloud every few seconds. How are you going to store that data?
Igni: I am not sure, I would just store it in a database.
Me: Well, storing the data from the IoT devices directly in a database has many drawbacks.
To understand why, we need to talk a bit about stream processing.
Igni: Why don't we directly store the data?
Me: There are many reasons for that. For example:
- The messages coming from the devices could arrive out of order. IoT devices often use the UDP protocol, and UDP does not guarantee the order of the messages.
- There could be duplicates, in case IoT devices send the same message multiple times.
- There could be outliers that you want to filter out.
- You may also want some control over the data. For example, you may not need to store all the messages but just one every 10 seconds, containing the average speed of the cars in the last 30 seconds.
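The first three problems in the list can be handled with ordinary code before anything is stored. Here is a minimal Python sketch of the idea, not a description of how Dataflow does it internally. The `Reading` fields, the `message_id` used to spot duplicates and the 300 km/h threshold are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    message_id: str    # assumed to be unique per message
    car_id: str
    event_time: float  # seconds, as stamped by the device
    speed_kmh: float


MAX_PLAUSIBLE_SPEED_KMH = 300.0  # assumed outlier threshold


def clean(readings):
    """Drop duplicates and outliers, then restore event-time order."""
    seen_ids = set()
    kept = []
    for reading in readings:
        if reading.message_id in seen_ids:
            continue  # same message sent more than once
        seen_ids.add(reading.message_id)
        if not 0.0 <= reading.speed_kmh <= MAX_PLAUSIBLE_SPEED_KMH:
            continue  # implausible value, treated as an outlier
        kept.append(reading)
    # Messages may arrive in any order, so sort by the device timestamp.
    return sorted(kept, key=lambda reading: reading.event_time)


raw = [
    Reading("m2", "car-1", 10.0, 52.0),
    Reading("m1", "car-1", 5.0, 50.0),    # arrived late
    Reading("m2", "car-1", 10.0, 52.0),   # duplicate
    Reading("m3", "car-1", 15.0, 900.0),  # outlier
]
cleaned = clean(raw)
print([reading.message_id for reading in cleaned])
```

Note that this only works on a finite batch of messages: you cannot sort or deduplicate a stream that never ends, which is the reason stream processing needs something more.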
Igni: I think I got it. Dataflow collects a batch of events, processes them and stores them, then collects another batch of events, processes them, stores them, and so on...
Me: Yes, that's exactly what happens! Since a stream is an unbounded collection of events, Apache Beam divides it into windows, which are finite collections that can be processed. This operation is called windowing. To learn more you should read the Apache Beam documentation, which clarifies many of the concepts relevant to stream processing.
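To make windowing concrete, here is a plain-Python sketch of sliding windows for the car example: each window covers 30 seconds and a new one starts every 10 seconds. It only illustrates the idea. Apache Beam has its own windowing primitives and also deals with late data, which this sketch ignores. The function name and the `(event_time, speed)` tuple layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean

WINDOW_SIZE = 30  # seconds covered by each window
PERIOD = 10       # a new window starts every 10 seconds


def sliding_window_averages(events, size=WINDOW_SIZE, period=PERIOD):
    """Average speed per sliding window.

    `events` is an iterable of (event_time_seconds, speed) pairs.
    Returns {window_start: average} for windows [start, start + size).
    """
    windows = defaultdict(list)
    for event_time, speed in events:
        # Latest window that starts at or before this event.
        start = int(event_time // period) * period
        # Walk back through every earlier window that still contains it.
        while start > event_time - size:
            windows[start].append(speed)
            start -= period
    return {start: mean(speeds) for start, speeds in sorted(windows.items())}


events = [(5, 40.0), (15, 50.0), (25, 60.0), (35, 70.0)]
averages = sliding_window_averages(events)
for start, average in averages.items():
    print(f"window [{start}, {start + WINDOW_SIZE}): {average}")
```

Because the windows overlap, each event contributes to three of them, so the stored averages change smoothly instead of jumping from one block of 30 seconds to the next.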
Igni: That is so interesting! But there are so many new concepts.
Me: Yes. Also consider that Dataproc and Dataflow only cover data processing. There are many other services to ingest, store and visualise the data.
Igni: So much to learn! Where could I start to learn more about the topic?
Me: A good way to start is taking online courses. I found some pretty good ones on Coursera. I took the Data Engineering, Big Data, and Machine Learning on GCP Specialisation, which is made up of 5 courses. But there are many other valid courses on different platforms, like Linux Academy, A Cloud Guru and Udemy. Even on YouTube you can find good educational videos, made by Google itself.
Igni: Thanks Alessandro, this chat was really helpful!
Me: No problem, you can call me anytime. Ciao.