Some time ago I had an imaginary coffee with a friend called IGNI (which stands for "I Got No Idea what I am doing"). He does not have a clear idea of what Big data is, but he is very curious, which is great, and he asked lots of good questions.
Here is our chat:
Igni: Hi Alessandro, can you tell me a bit about Big data? I get the feeling it's not just a buzzword.
Me: Hi Igni, that's right! Big data is a catchy way to refer to large amounts of data, on the order of terabytes or petabytes. It is about working efficiently with that much data and exploiting it to gain insights.
Igni: Even petabytes? That's a lot. I would need expensive servers and hard drives to store and process that much data.
Me: Not necessarily. You can store the data in the Cloud. For example, with Google you could use Cloud Storage, BigQuery, or other SQL and NoSQL solutions.
Igni: And what about querying, processing and transforming the data?
Me: You can do that in the Cloud as well. Many services provide a managed version of open source platforms. For example, Cloud Dataproc provides a managed version of Apache Hadoop and Spark, and you can process any data stored in Cloud Storage, Bigtable or BigQuery.
Igni: That's gonna be expensive, isn't it?
Me: Well, not really. You pay for the server cluster only while it is running. If you shut down the cluster whenever you do not need any processing, you'll save lots of money. In the Cloud, unlike on premises, powering down clusters that are not in use is common and advisable.
Igni: I see, but if you shut down the servers, you will not be able to access the data!
Me: That is true only if you store the data on the persistent disks of the servers, but that is not good practice. In the Cloud it is much better to separate storage from processing. If you use Dataproc you should store the data in Cloud Storage, not in HDFS (Hadoop Distributed File System).
Igni: I get the point, but I have some concerns about performance. I think that processing data stored in Cloud Storage, instead of on local disk, might be quite slow!
Me: Cloud Storage is highly performant and the Regional Storage class has performance that is comparable to persistent disks. Speed is not a real concern and, again, storing data outside the cluster is a best practice.
Igni: So you store the data in Cloud Storage, process it with Cloud Dataproc, and shut down the cluster when your processing is done, is that correct?
Me: That's right! Just remember that Dataproc is only one of the possible choices for processing your data.
Igni: Can you tell me more about that? When should I use Dataproc, and what about the other choices?
Me: Sure. Dataproc is based on Hadoop and Spark, so it is a good choice if you are familiar with the Hadoop ecosystem. Alternatively, you may want to use Cloud Dataflow, which is a managed service based on Apache Beam. It supports both stream processing and batch processing and, unlike Dataproc, it does not require you to set up a cluster (no-ops) and it auto-scales.
Igni: It looks like Dataflow has some advantages. How does stream processing work?
Me: Well, it is probably easier to explain with an example. Imagine you have installed thousands of IoT devices in thousands of cars. Those devices send speed and position to the Cloud every few seconds. How would you store that data?
Igni: I don't know, maybe I would use a NoSQL database?
Me: Where you store the data is not so important, at least for understanding the concept of stream processing. What I want to point out is that you should not store the data coming from the IoT devices directly in a database.
Igni: I am not sure why not.
Me: There are many reasons not to do it. The data coming from the devices could arrive out of order or duplicated, there could be outliers, or you might want to store just one record every 10 seconds, containing the average speed over the last 30 seconds of events.
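(A small aside for readers who like code. The sketch below shows the kind of clean-up I have in mind, in plain Python rather than with any Dataflow or Beam API. The `Reading` class, the `clean` function and the 300 km/h outlier threshold are all made up for illustration.)

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    device_id: str
    timestamp: float   # event time in seconds, as set by the device
    speed_kmh: float


# Assumed threshold: anything faster is treated as a sensor glitch.
MAX_PLAUSIBLE_SPEED_KMH = 300.0


def clean(readings):
    """Drop duplicates and outliers, then restore event-time order."""
    # Keep the first reading seen for each (device, timestamp) pair.
    unique = {}
    for r in readings:
        unique.setdefault((r.device_id, r.timestamp), r)

    # Discard readings with an implausible speed.
    plausible = [
        r for r in unique.values()
        if 0.0 <= r.speed_kmh <= MAX_PLAUSIBLE_SPEED_KMH
    ]

    # Events may arrive out of order, so sort by device and event time.
    return sorted(plausible, key=lambda r: (r.device_id, r.timestamp))
```

This works on a finite list, which a real stream is not. That is exactly the problem the rest of the chat deals with.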
Igni: I think I got it. Maybe Dataflow collects a block of events and processes it, then collects another block and processes it, and so on...
Me: Yes, that's exactly what happens! Since a stream is an unbounded collection of events, Apache Beam needs to divide it into windows, which are finite collections, so they can be processed. This operation is called windowing. There are other concepts apart from windowing. You should have a look at the Apache Beam documentation, which explains the concepts relating to stream processing well.
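(Another aside, to make windowing concrete. This is a minimal sketch in plain Python, not the Apache Beam API, of the "one record every 10 seconds with the average speed of the last 30 seconds" idea from the car example. The function name `sliding_average` and its parameters are my own invention, and I assume window ends are aligned to multiples of the period. Beam calls overlapping windows like these sliding windows.)

```python
import math


def sliding_average(events, size=30.0, period=10.0):
    """Average speed per sliding window.

    events: iterable of (timestamp_seconds, speed) pairs, in any order.
    Returns a list of (window_end, average_speed) pairs, one per period,
    where each window covers the interval [window_end - size, window_end).
    """
    events = sorted(events)
    if not events:
        return []

    first_ts, last_ts = events[0][0], events[-1][0]
    results = []

    # First window end: the first multiple of `period` after the first event.
    k = math.floor(first_ts / period) + 1
    # Keep going while the window can still contain the last event.
    while k * period - size <= last_ts:
        end = k * period
        speeds = [s for t, s in events if end - size <= t < end]
        if speeds:  # skip windows with no events at all
            results.append((end, sum(speeds) / len(speeds)))
        k += 1
    return results
```

Because each window is 30 seconds long but a new one starts every 10 seconds, every event contributes to three windows. A real streaming engine also has to decide when a window is complete enough to emit, which is one of the "other concepts" covered in the Beam documentation.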
Igni: That is so interesting! And I can see that things start to get complex.
Me: Also consider that Dataproc and Dataflow only serve to process your data. There are many other services to ingest, store, and visualise it.
Igni: So much to learn! Where could I start to learn more about the topic?
Me: A good way to start is to take online courses. I found some pretty good ones on Coursera. These days I am taking the Data Engineering, Big Data, and Machine Learning on GCP Specialisation, which is made up of 5 courses. But there are many other valid courses on different platforms, like Linux Academy, A Cloud Guru and Udemy. Even on YouTube you can find good educational videos, made by Google itself.
Igni: Thanks Alessandro, this chat was really helpful!
Me: No problem, you can call me anytime. Ciao.