I am trying to build a "Data Lake" from scratch. I understand how a data lake works and what it's for; that much is all over the internet. But when it comes to actually building one from scratch, there is very little material. I want to understand whether:

Data warehouse + Hadoop = Data Lake

I know how to run Hadoop and bring data into it. I want to build a sample on-premise data lake to demo to my manager. Any help is appreciated.


Best Answer


You'd have to have both structured and unstructured data to make a Hadoop cluster into a data lake.

So you'd need an ETL pipeline that takes the unstructured data and converts it to structured data. Product reviews or something similar would provide your unstructured data; converting those into something usable by Hive (as an example) would give you your structured data.
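A minimal PySpark sketch of that idea (the file path, line format, and `reviews` table name are all made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, trim, col

# Hive support lets the structured output land as a queryable Hive table.
spark = (SparkSession.builder
         .appName("reviews-etl")
         .enableHiveSupport()
         .getOrCreate())

# Raw, semi-structured input: one free-text review per line, e.g.
#   "2019-03-01|widget-42|5|Great product, arrived on time"
raw = spark.read.text("hdfs:///landing/product_reviews/*.txt")

# Parse each line into typed columns; lines that don't match are dropped.
pattern = r"^(\S+)\|(\S+)\|(\d)\|(.*)$"
reviews = (raw
    .select(
        regexp_extract("value", pattern, 1).alias("review_date"),
        regexp_extract("value", pattern, 2).alias("product_id"),
        regexp_extract("value", pattern, 3).cast("int").alias("rating"),
        trim(regexp_extract("value", pattern, 4)).alias("review_text"))
    .where(col("product_id") != ""))

# Persist the structured result as a Hive table, queryable from Hive or Spark SQL.
reviews.write.mode("overwrite").format("parquet").saveAsTable("reviews")
```

The same shape works for any text-heavy source: land the raw files as-is, then let the pipeline impose a schema on the way into the structured zone.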

I would look at https://opendata.stackexchange.com/ for data, and search for "Hadoop ETL" for ideas on how to cleanse it. How you write your pipeline (Spark or MapReduce) is up to you.

You can also build a data lake using AWS services. A simple way to do so is to use an AWS CloudFormation template to configure the solution, including AWS services such as Amazon S3 for unlimited data storage, Amazon Cognito for authentication, Amazon Elasticsearch Service for search capabilities, AWS Lambda for microservices, AWS Glue for data transformation, and Amazon Athena for data analytics. The article linked below includes a figure showing the complete architecture of a data lake built this way on AWS.
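For a quick demo of the S3 + Athena part of that stack, here is a hedged boto3 sketch; the bucket name, database, and query are placeholders, and the CloudFormation template in the linked article does the real provisioning:

```python
import boto3

# Placeholder names for illustration only.
BUCKET = "my-demo-data-lake"
DATABASE = "demo_db"

s3 = boto3.client("s3")
athena = boto3.client("athena")

# S3 acts as the lake's storage layer; raw and curated files land here.
# (Outside us-east-1 you'd also pass a CreateBucketConfiguration.)
s3.create_bucket(Bucket=BUCKET)
s3.upload_file("reviews.parquet", BUCKET, "raw/reviews/reviews.parquet")

# Athena queries the files in place, once a table has been defined over
# them (e.g. by an AWS Glue crawler or a CREATE EXTERNAL TABLE statement).
response = athena.start_query_execution(
    QueryString="SELECT product_id, avg(rating) FROM reviews GROUP BY product_id",
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": f"s3://{BUCKET}/athena-results/"},
)
print("Query started:", response["QueryExecutionId"])
```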

See this article for reference: https://medium.com/@pmahmoudzadeh/building-a-data-lake-on-aws-3f02f66a079e