How to Use Microservices Architecture to Build a Data Lake on AWS

May 30th, 2018

“Data is the new oil.” Just as we have to drill to get oil, we also need to mine data to get information out of it. Google, Facebook, Netflix and other titans of the digital era use data to build great products that touch every part of human life.

Regardless of scale, building a managed data lake on AWS requires a robust and scalable technical architecture, and teams often use microservices to build one. A microservice architecture centers on building a suite of small services that are focused on business capabilities and are independently deployable. Each service uses lightweight protocols and runs in its own process, which makes a microservice architecture ideal for building decoupled, agile, and automatable data lake applications on AWS.

Join this session with Sabyasachi Gupta, Software Architect at Zaloni, to learn more about:
– The what and why of a microservices architecture
– The different layers of a data lake stack
– Why is metadata important and how to capture it in AWS
– The relationship between Serverless and Microservices and available options on AWS
– How to build a data lake using microservice architecture on AWS


See how Zaloni leverages microservices architecture in Zaloni Arena DataOps platform by requesting your custom demo!

Webinar transcription:

[Brett Carpenter] Hello everyone, and thank you for joining today’s webinar, “How to Use Microservices to Build a Data Lake on AWS.” My name is Brett Carpenter. I’m the marketing strategist here at Zaloni, and I’ll be your emcee for this webcast. Our speaker today will be Sabby Gupta, a software architect here at Zaloni.

Before we begin, I’d like to introduce you to Zaloni. We help customers modernize their data architecture and operationalize their data lakes to incorporate data into everyday business practices. We supply the Zaloni Data Platform, which provides comprehensive data management, governance, and self-service capabilities. We also provide professional services and solutions that help you get your big data projects up and running fast. Now, with that, I’ll turn it over to Sabby to discuss how microservices play a role in building a data lake on AWS.

[Sabby Gupta] Today we’re going to talk about how we can leverage microservices to build a data lake on AWS. There are a lot of other cloud providers, but we’ll focus on AWS today. The agenda is as follows. First, the microservices part: we’ll try to understand what microservices are and what their principles are, and go through a reference architecture. Microservices can be a big topic in itself, so we’ll just go over the principles at a high level and see how they can be leveraged to build a data lake on AWS. Then we’ll talk about the logical components of a data lake. By the end of this presentation, you should understand which logical components you need so that you can build one on your own if you want to. Eventually, we want to build our data lake application on a microservices architecture and on AWS, so I’ll go through a couple of approaches. In the first, you use AWS infrastructure but not all of the AWS components: you just leverage the compute power and storage of AWS and bring your own tech stack. That is a more generalized design. Then I’ll go into a more specialized design that leverages other AWS components to build the data lake solution. Finally, we’ll tie everything together with a reference system architecture and see how it looks when everything is working on the platform. At the end, we’ll cover some questions and answers. So, what are microservices?

Actually, the answer is “all of the above”: microservices architecture encompasses all of these different principles of development and design. Microservices will often use containers. They can be used in the cloud, obviously. APIs are definitely one of the biggest tenets of microservices design, so that makes sense. And domain-driven design is also one of the big components; we will see in the principles what that means. So thanks for the responses. Now, let’s get deeper. Here’s a definition of what microservices architecture really is. The fun part is that if you bring five people into a room and ask what microservices architecture is, you will get 10 different answers. Everybody has an opinion as to what microservices architecture is all about. The one I chose is from Martin Fowler, a pioneer in proposing and evangelizing microservices architecture. What he says in this paragraph is pretty apt, and it covers what you need if you have to practice microservices architecture. Microservices are definitely a suite of small services. There’s always a comparison to monolithic applications, and microservices are kind of the opposite. These are small services that are self-contained and run in their own processes. If you have multiple components running in the same process, you get dependencies: they impact each other in terms of compute, and if you have to change one, you impact the others, because you have to bring down the container, or the node if you’re running on plain nodes rather than containers. So the definition prescribes running each service in its own process.
And because they’re running in their own processes, they’re distributed, so they need to communicate among themselves through a lightweight mechanism such as an HTTP endpoint. Leveraging HTTP makes things simpler when you have to make remote procedure calls between processes. One of the key things is that a service has to be built around business capabilities, because building a service is about solving a business problem. This is one area where it becomes more of an art than a science: you need to know what constitutes a business capability that can be put in a microservice. Should a service cover a couple of business capabilities, or just one? There is no right or wrong answer, and once we get to the principles you will see some of the rules of thumb we can apply to check that the granularity of a microservice is correct. Then, once you’ve built a service, you need to make sure it is independently deployable. As I alluded to before on the point of monolithic applications and architectures, in a monolith you deploy the whole unit together, which sometimes creates problems because the application is made up of multiple smaller units. If you change even a tiny part of it, you have to deploy the whole thing, which increases risk. Microservices architecture helps with that. Deployment should be automated, because we want to take the manual part out of the whole process to make it more agile. And last but not least is leveraging different programming languages and different storage technologies. Because the surface area of a microservice is small, teams can use their own languages and storage technologies to implement their microservices.
Now, it’s not like one size fits all. Depending on the use case, what the team’s strengths are, and what the team wants to do, they can pick and choose their own destiny, kind of.
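To make the definition concrete, here is a minimal sketch of one small service that owns a single business capability and is reachable only through an HTTP endpoint. It uses only the Python standard library. The service name, the `/datasets/<name>` route, and the sample data are assumptions made for illustration; they do not come from the talk.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory data; a real service would own its own data store.
DATASETS = {"orders": {"format": "csv", "owner": "sales"}}


def route(path):
    """Map a request path to (status, payload). Handles GET /datasets/<name>."""
    parts = path.strip("/").split("/")
    if len(parts) == 2 and parts[0] == "datasets" and parts[1] in DATASETS:
        return 200, DATASETS[parts[1]]
    return 404, {"error": "not found"}


class CatalogHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper; other services see only this endpoint."""

    def do_GET(self):
        status, payload = route(self.path)
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the service in its own process until interrupted."""
    HTTPServer(("127.0.0.1", port), CatalogHandler).serve_forever()
```

Calling `serve()` starts the service; any other service, written in any language, could then fetch `http://127.0.0.1:8080/datasets/orders` without sharing code or memory with it.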

So, now that we understand the definition of a microservice, let’s go a bit deeper. As I said, this could be a whole discussion in itself, but I’ll just go over some of the principles you need to follow to apply microservices in your design. Before I get to the points, I would recommend reading this book, Building Microservices by Sam Newman, who has been preaching microservices for a long time. It’s a brilliant book, so feel free to give it a read.

So the first principle is modeling around the business domain. Let’s take Amazon as an example. If you’re building an e-commerce application, you have customers. A customer can be part of billing, where you need a billing address, and part of delivery, where you need a delivery address. In the past, you might have had one big customer object with everything in it, but with microservices it is okay to have multiple types of customer. Their properties may not overlap, and that’s okay. The reason is that in the monolith paradigm, if you change one part you impact another part of the application or another service, and that couples the teams if two different teams own them. So use your judgment when you build that. A good way of seeing how to do it is through understanding domain-driven design; there are quite a few books about it. At the end of the day, as I said, this is an area that is a bit more art than science, but following domain-driven design will help you. Automation is definitely important. When I say automation, it’s not just about development. It’s about testing too: you automate your testing with things like SoapUI. You automate your build pipeline and your deployment. AWS has tools to automate that, and if you’re doing it on your own, you can use Jenkins, for example, to automate your development pipeline. You also want to encapsulate and aggregate your microservices.
What that means is this: in a monolith architecture, you will most likely have one database and multiple services talking to that single database to get data and perform operations on it. The problem is that the schema becomes coupled to the different services, so if you change the schema, it impacts the others. As you can see, the mantra here is loose coupling, so that the dependencies are broken and changing one part doesn’t impact another. Why doesn’t it impact the other? Because you’re following something like API design. So encapsulate your microservices to reduce the surface area of change. You want self-service and autonomy. As a team, I want to make choices about what I do with my service and what tools and technology I use. That gives you autonomy; as I said, it’s not one size fits all, which is good. You want to deploy independently and control your own destiny. If a big application is built on multiple microservices and I control one part of it, then when I change it I should be able to deploy it on my own. Microservices give you that. Then API design: APIs should be the entry point for other services and for external clients, if you’re exposing an API publicly. Many of you are probably aware of the best practices here, like versioning your APIs and keeping them backward compatible wherever possible. Following those best practices will help you build a good microservice application. Then failure isolation: your applications need to be fault tolerant.
There are various approaches. One of the basic techniques is the circuit breaker, which helps when one of the nodes is not working, since you’re running your application on multiple nodes. There is also a concept called bulkheading. The concept is taken from ships: ships have different compartments, so that even if there is a hole in one part, the damage can be contained and the ship doesn’t sink. There are different applications and tools that will help you take the same paradigm and implement it. And DevOps, at the end of the day.
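As an illustration of the failure-isolation idea, here is a minimal circuit breaker sketch. It is a simplified version of the pattern, not the implementation of any particular library; the class name, thresholds, and the injectable `clock` parameter are assumptions chosen for this example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is short-circuited instead of reaching the service."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            # One more failure will reopen the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_catalog_entry, "orders")` where `fetch_catalog_entry` is a hypothetical client function. While the circuit is open, callers fail fast instead of piling requests onto a node that is already down.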

So that should give you a flavor of what the microservices principles are, and the tenets you need to follow to build a good microservice application.

(16:07: Microservice Architecture)

This is one of the frameworks which I found pretty useful. It’s called JHipster, so feel free to take a look at it. It is an opinionated way of applying microservices architecture. It is built completely on open source, like Netflix OSS and Spring Cloud, and optionally Docker. Docker containers are a first-class citizen there, but you don’t have to use containers. JHipster helps you get kick-started and onboarded into microservices architecture very quickly: once you install the framework, you will be up and running within minutes. As for the major components, the gateway design pattern is at the top. As you can see, the browser talks to the gateway. The gateway has your UI code, and it acts as a kind of reverse proxy to the microservices. You can have any number of microservices. The gateway finds them dynamically with the help of the JHipster Registry, which is built on Spring Cloud and has a Eureka server and a Config Server. It helps us discover which microservices are up and running, because you can have a cluster of microservices. Using that, the components can discover where they need to talk to. As you can see, that makes it completely decoupled, and you can scale elastically. The framework gives you that ease of use. The gateway also acts as a load balancer: because there are multiple microservice instances running, it will try to pick one that is available. So that is the role of the gateway. The microservices are the ones that have the business logic implemented for the application. And there is also an option called the JHipster Console, which captures the logs and so on.
It’s an optional component if you have your own logging mechanism. It basically uses the ELK stack (Elasticsearch, Logstash, and Kibana) to do its job. So where what I said before was the principles, this is a way of realizing those principles. I’m pretty sure there are other frameworks, but this is one that is pretty common and very useful. Using this technology stack, we can get onboarded to the microservices paradigm very quickly.
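The discovery and load-balancing roles described here can be sketched with a toy registry and gateway. This is a conceptual stand-in written in plain Python, not the Eureka or JHipster API; the class names, method names, and addresses are all hypothetical.

```python
class ServiceRegistry:
    """Toy stand-in for a discovery server: service name -> live instances."""

    def __init__(self):
        self._instances = {}

    def register(self, service, address):
        self._instances.setdefault(service, []).append(address)

    def deregister(self, service, address):
        self._instances.get(service, []).remove(address)

    def lookup(self, service):
        # Return a copy so callers cannot mutate the registry's state.
        return list(self._instances.get(service, []))


class Gateway:
    """Resolves a service name to one registered instance, round-robin."""

    def __init__(self, registry):
        self.registry = registry
        self._next = {}  # per-service request counter

    def pick(self, service):
        instances = self.registry.lookup(service)
        if not instances:
            raise LookupError(f"no instances registered for {service!r}")
        index = self._next.get(service, 0)
        self._next[service] = index + 1
        return instances[index % len(instances)]
```

Because the gateway asks the registry on every request, instances can be added or removed at runtime without reconfiguring callers, which is what makes elastic scaling possible in this design.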

(19:00: Data Lake Services)

Now that we know about microservices, what their principles are, and how we can start building an application on microservices, let’s get into the data lake part of it and see what a data lake means if I have to build a data lake application. It is kind of a buzzword, and everybody is trying to do it because of data proliferation, whether from IoT or other places. We need a data lake to handle all of that. First, you need to enable files and data coming into the system. Then you want governance, because otherwise it will be a wild, wild west. You need governance around what people can do and what people can see, privacy, and so on; nowadays data privacy is key. So how do you govern? And then how do you engage? Now you have the data and governance around it, so how do you engage people to use it? Covering these three major pillars makes for a good data lake application. So what does each of these mean? When I say enable, I can start thinking about microservices around it. Going back to business capabilities, which was one of the principles, you need to apply that here and ask what business capabilities I’m trying to get out of it. One might be data ingestion, batch and streaming, so that can be a candidate for a microservice. Another is capturing metadata, for instance discovering all the metadata when a file comes in. These could be two different microservices; they could be one, but let’s start with two. Once I capture all of that, I might have to do some data quality checks to make sure the incoming data makes sense, do some joins here and there to enrich the data, and then maintain the lifecycle of the data, that is, how long it lives on the servers, the storage, and so on.
Then you obviously apply security around it, with authentication mechanisms. Then comes the catalog, so that people can get into your application and see what entities are available, operate on them, or pull some data out of the system into something else, like a spreadsheet or a CSV file. And eventually you make it self-serve: people may want to explore the catalog and do things on top of it, like exporting some data. Say they want to run machine learning models on a snapshot of data. They should be able to search the catalog to find the data sets they need and pull a snapshot of a data set into their machine learning model. So that is self-serve, and all of these can be good candidates for microservices.

(22:30: Data Lake Platform Components)

So, peeling the onion and getting a bit deeper: we need sources, because we need data coming into the data lake. The sources can be databases, files, batch files, or streams, Kafka streams for example. A source can even be an API, like the Salesforce API, where I call the API to get campaign data or updates, for example. So that’s data coming in. Once the data is there, we need to ingest it. There are different ways of ingesting: for streams, you can use something like Kinesis or Kafka, and for batch, you land the files in storage. Once the data is ingested you need to store it somewhere, so you need a storage layer, using something like S3 or Glacier, or some other file storage mechanism. Once the data is stored, most people want to process it: transform it, do some joins, draw inferences from the data, build a predictive model. For that you need execution engines. You can use Apache Spark, for example, or MapReduce, and leverage EMR or Lambda; we’ll see how they fit into each of the microservices in upcoming slides. All of this needs to be exposed on a web layer for people to get to. If there is a browser experience, you might use Tomcat, for example, with Spring Boot to get you started pretty quickly; it’s an integrated way of doing things, which is good. If you’re building a pure API, you might use something like Amazon API Gateway. And if you’re doing API development, one good tool is Swagger, because it gives you a top-down, design-first approach. It’s more of an API-first style of design, and it helps you. There are other tools, but Swagger is kind of the market leader. Then there is the consumption layer: who, or what applications, are trying to consume the data.
If it is a data science application, it might be a Python script or something similar. If you’re just trying to visualize, it might be Tableau or Kibana. If you’re running machine learning algorithms, it might be some other application, or you might want a notebook or something like that. And finally, you want governance and data management. For that you can use AWS Glue, for example. Not everything may go through your data lake application, so your data lake should be able to discover what is getting into it, and Glue helps you catalog that. You also want to monitor what is happening and make sure the people using the data are doing it in the right shape and form. You can use AWS CloudTrail to mine the logs and make sure the access patterns make sense and people are not trying to misuse the data. That covers the data management and governance part of building a data lake.

(26:09: Cloud Agnostic Architecture – Data Lake)

So, in the next couple of slides, we’ll see what it means to build this whole design of a data lake in the cloud. I’ll give you the first approach to the design. We are building an application in the cloud, but we will be using JHipster, for example, or whatever tech stack you want to use, without relying on the AWS-specific components. I’ll deploy the application on AWS, obviously, so I have an EC2 instance or multiple instances to build the data lake on. First, I need a gateway service acting as a reverse proxy; that is the entry point. You need an API so that people can call into it or get into a UI. You will have the JHipster framework there, and your UI will be built on Angular, consuming your APIs as REST APIs, for example. Besides the gateway, you need an ingestion service, because data is coming in as files or streams. Your ingestion microservice, which is running on its own on some server, is able to ingest data coming in from outside. There might be different agents pushing data into your cloud infrastructure, in this case to your ingestion microservice. As you can see, this microservice is not dependent on the others. It is decoupled from the other microservices: it is running in your cloud space and just looking for files and streams. It will get the files and streams whichever way they arrive, say through Kafka, and store them in AWS S3, or in Azure Blob Storage if you use Azure as your storage layer. So that microservice is just looking for data coming in so that it can push it into the data lake. Once it does, it will also try to record some inventory around it, writing some audit messages to Elasticsearch saying, okay, I got this data file.
And these are the details of it, so that you can later go through that Elasticsearch log and see what came into the data lake. So you have different audit trails. Then you will have an inventory service. With data coming in from outside through ingestion, you know what is getting in. But what if somebody created some entities and objects in the data lake outside of this application? How do you make sure your data lake is aware of them? That is where inventory comes in: it makes sure that whatever data inventory is out there in your data lake, you have a handle on it. Your inventory microservice will be responsible for that. For that you can use things like Apache Spark, which is pretty good at doing it. The challenge here is that data sets can sometimes be unstructured. They can just be files, so how do you know the structure of a file? With CSV it is easier, because I know how to parse the structure. But there is still a challenge with CSV: I might know what the different fields are, but I don’t know what the types of the fields are, and so on. So there are challenges when you inventory files rather than databases, because databases give you a lot of metadata, but files out of the box don’t. Or say it is a JSON file: how do you get its structure? Spark will describe that pretty well. It has good features like that, so you can leverage them in your inventory microservice to inventory your data sets. Then the catalog is the one that builds on the inventory. I want users to use my catalog to discover data sets and collaborate on them.
They might rate data sets so that the popular ones trickle up, and people can start discovering which data sets make sense. If I’m doing data science modeling, I want to know which data set to use, because there might be different variants of the same data set. So the catalog makes it collaborative from a data catalog point of view. Then, obviously, you need a security microservice to ensure that the right people are getting access to the system and that whatever authorization levels exist are maintained. You can use LDAP for that, for example. There is so much data getting into the system, whether operational data, metadata, or just pure data, and you want to search through it. So you want a dedicated search service that aggregates across all of this. There might be a crawler that crawls across all of these things and gets them indexed, for example in Elasticsearch. When you have data, you want people to run transformations, complex big data transformations. So you need a microservice for that, so that people can submit their jobs. A good way of executing a big data job is through Apache Spark. You can use other tools too, such as MapReduce, but the key is that you need a microservice that receives the request, takes the job details, and then runs the job, working with the cluster. Then you want an admin microservice, whose job is to make sure new end users get onboarded and that there is easier management around that, among other responsibilities. And last but not least, you need logging, which is pretty important, because you want your application to log things for troubleshooting when something goes wrong.
There are a lot of advanced tools like Splunk that can make a lot of sense of logs. So you want a microservice dedicated to capturing logs from across all of these services. That logging service can capture logs in real time, through streams; if you’re using Logstash, you might be streaming your log data. It can also aggregate logs from clients. So it depends on how you implement it. I hope this gives you a good picture of the kinds of microservices you need to build a good data lake application.
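The inventory challenge mentioned for CSV files, where the field names are known but the field types are not, can be illustrated with a small schema-inference sketch. This is plain Python for illustration, not Spark; the function names, the three-type model (`int`, `float`, `string`), and the sampling limit are assumptions made for this example.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of 'int', 'float', 'string' that fits every value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue  # at least one value did not parse; try a wider type
    return "string"


def infer_schema(csv_text, sample_rows=100):
    """Read the header for field names, then sample rows to guess each type."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames or []}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        for name in columns:
            columns[name].append((row.get(name) or "").strip())
    return {name: infer_type(vals) for name, vals in columns.items()}
```

An inventory microservice could store the resulting mapping as metadata for each discovered file. Sampling keeps the cost bounded, at the price of occasionally guessing a type that later rows contradict.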

So moving on: a similar view, but now with more focus on AWS and the AWS components you can use to build all of these microservices. To begin with, I have a router, Amazon Route 53, because I have so many microservices and I want to make sure the load is routed to the right one. For load balancing I’m using Elastic Load Balancing, and depending on which microservice I need to scale, I will add more nodes. So as you can see, I’m very selective about the load. For example, an admin microservice might not be heavily loaded, but a transformation microservice might be. So I’m putting my money where it really matters, and that’s the power of using a microservices architecture. Again I build a gateway for the APIs to be exposed, and there is a UI if people want to leverage it. It is Angular, and I host my static pages on S3 and CloudFront. For the ingestion microservice, we’ll be using, say, S3 and Kinesis Streams for capturing files and streams, respectively. For the inventory there is Elasticsearch, and Amazon has its own Elasticsearch service. You can also use ElastiCache if you want to, or Elastic Cloud. There are different choices for doing it, and AWS has all of those options out there. You can inventory with Glue, which is one of the latest products. It does the heavy lifting for you: if you point it at, say, an S3 bucket, it will go through it and, based on the files out there, try to infer the structure. It will tell you that this file looks like this, or that this file is very similar to that one, so it gives you all of that metadata. If you want to be a bit more creative, you can define custom classifiers or patterns, and Glue will use those patterns to identify your objects.
So Glue kind of automates your cataloguing. If you don’t want to write that application on your own, as on the previous page where I showed using Apache Spark, that’s fine. Glue will do the job for you. Then I’ve kept the catalog service as is. As you can see, the component uses Redis as a backing store for faster retrieval of the catalog entries. I wanted to show that you can use whatever stack you want to use. That is the power of microservices architecture: in some places you use AWS components because you think it makes sense, but for cataloguing you might still use a JHipster component to build the catalog experience for the user. Microservices give you that benefit. If this were one big application, you would most likely have been stuck with the particular technology chosen at the beginning, but with microservices that’s not the case. Then, obviously, there is the security microservice, where we use some of Amazon’s services. Next, if you have a search microservice, you can leverage things like AWS Lambda. The reason I have Lambda in some of the places and not all is that there are limitations on the size of the request payload it can take and the response it can send back. Based on those best practices for AWS Lambda, I think the search microservice is a good use case. For transformation, I’m not using Lambda; I’m writing my own service using Apache Spark. You can use EMR or something like that if you want to. I’m just showing that this is how I’m building my microservice. You might not build it the same way, but the choice is there because you’re applying microservices to build the data lake application.
The admin UI uses Lambda, for example, and logging again uses Lambda, because for logging Lambda is a very good component to use, since you might be doing some log analytics. If you’re trying to aggregate logs or get counts for some of the metrics, AWS Lambda is pretty good at that, so you can process all those logs and derive metrics from them. With that, let’s get into a working application; not an actual working application, but a demo view of how it would look. First, security: I would most likely sign in with Active Directory, using the directory service in AWS. Next, I need some data flowing in. Streams or files come in through Kinesis and Kinesis Firehose and land in my S3 buckets. You can see here a best practice of using zones in a data lake. We should be able to break the lake down into different zones, because when data comes in, it most likely goes into a first zone where it is kept in a raw format, and maybe only a very limited set of people can see it. Once you have mastered the data, you might move it into trusted zones, where more people are able to see it and can start taking those entities and doing transformation and processing on top of them. So that is the best practice of zones. If you’re interested in knowing more, you can find our material on the Learning Portal, under best practices for using zones. Once the data is in a zone, the microservices we have built come in. As you can see, there are different ones: ingestion, metadata capture, cataloging, rules and DQ (data quality), and security. All of these microservices do the job of populating your data lake. Once the data lake is populated, you need to process the data on top of it.
So, for example, you can use EMR to process the data stored in S3 buckets. You can query it through Athena. You can also provision it: if consuming the data requires moving it to some other place, like Redshift or Elasticsearch, you can do that too, because that is driven by the kind of user experience you’re dealing with. A BI use case might need its own BI tool, which is why you need to provision the data out from the data lake into that tool. But if you’re just using Tableau and want to do some slicing and dicing of data, you can connect directly to the data lake through Tableau connectors. So this gives a complete picture: starting from authentication, bringing in the data, dropping the data into the right zones, using the microservices paradigm to do different things like ingestion, cataloging, and running transformations, and eventually visualizing all the transformed data through Tableau, Kibana, or QuickSight. That brings us to the end of this presentation. I hope it was useful.
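The zone practice described in the walkthrough can be sketched as a small promotion rule over object keys. The zone names (`raw`, `trusted`, `refined`), the `zone/dataset/filename` key layout, and the quality gate are assumptions for illustration; the talk only states that data lands in a raw form and moves to zones with wider access once it has been mastered. A real implementation would copy objects between S3 prefixes or buckets, which this sketch leaves out.

```python
from dataclasses import dataclass

# Assumed zone order; data moves left to right as it is validated.
ZONES = ("raw", "trusted", "refined")


@dataclass(frozen=True)
class DataObject:
    zone: str
    dataset: str
    filename: str

    @property
    def key(self):
        """Object key in a hypothetical zone/dataset/filename layout."""
        return f"{self.zone}/{self.dataset}/{self.filename}"


def promote(obj, quality_passed):
    """Return the object as it would appear in the next zone.

    Promotion is refused unless the data quality check passed, so that
    unvalidated data never becomes visible to the wider audience.
    """
    if not quality_passed:
        raise ValueError(f"{obj.key} failed data quality; staying in {obj.zone}")
    position = ZONES.index(obj.zone)
    if position == len(ZONES) - 1:
        raise ValueError(f"{obj.key} is already in the final zone")
    return DataObject(ZONES[position + 1], obj.dataset, obj.filename)
```

Keeping the zone in the key prefix makes it straightforward to attach different access policies per zone, which matches the idea that only a limited set of people can see the raw data.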