According to research by Aberdeen, business data is growing exponentially, with the average company seeing its data volume grow by more than 50% per year. A customer of ours that provides specialized healthcare services approached us to help with its data growth challenges. Facing rapid growth in both the volume and the complexity of its data, the firm sought to implement a data lake solution on AWS.
What is a Data Lake?
Already an AWS customer, the firm sought a data lake in the cloud that would serve as a centralized repository for both its structured and unstructured data, at scale. The lure of the data lake is that enterprises can store data as-is, without having to structure it first. For our healthcare company, this was especially helpful because its data lake has diverse inputs, with data flowing in from partners and newly acquired business units. Once in the data lake, the data can support multiple types of analytics, including big data processing, real-time analytics, and even machine learning.
Building an AWS Data Lake Solution
As Gartner analyst Debra Logan recently noted in a webinar, “Data & Analytics Leadership and Vision for 2019,” the infrastructure journey should progress from data collection to collection and connection.* To facilitate this crucial data connection, the AWS consulting team at Flux7 was brought in. The goal: to collect and connect data from four different data channels so that the organization could quickly and efficiently run analytics with meaningful business impact.
To do so, we partnered with the customer’s technology teams to create the infrastructure for the data store. Specifically, we:
- Developed and deployed Infrastructure as Code (IaC) for a data pipeline pattern, inclusive of an S3 data lake, AWS Glue, and Amazon Redshift clusters, through a pipeline in Jenkins.
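To make the idea of a reusable IaC pattern concrete, here is a minimal sketch that generates one template per data channel. It assumes CloudFormation as the IaC format, which the engagement description does not specify. The function name, logical IDs, naming scheme, and the `CrawlerRoleArn` parameter are all hypothetical, and the Amazon Redshift cluster and Jenkins stages are left out for brevity.

```python
import json


def build_pipeline_template(source_name: str) -> dict:
    """Return a CloudFormation-style template for one data channel.

    Assumption: one stack per data source. All logical IDs and names
    below are illustrative, not taken from the actual engagement.
    """
    database_name = f"{source_name}_lake"
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Data pipeline pattern for source '{source_name}'",
        "Parameters": {
            # IAM role that the Glue crawler assumes (hypothetical parameter).
            "CrawlerRoleArn": {"Type": "String"},
        },
        "Resources": {
            # Encrypted, non-public bucket that acts as the data lake store.
            "LakeBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "BucketEncryption": {
                        "ServerSideEncryptionConfiguration": [
                            {"ServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
                        ]
                    },
                    "PublicAccessBlockConfiguration": {
                        "BlockPublicAcls": True,
                        "BlockPublicPolicy": True,
                        "IgnorePublicAcls": True,
                        "RestrictPublicBuckets": True,
                    },
                },
            },
            # Glue Data Catalog database that holds the discovered metadata.
            "LakeDatabase": {
                "Type": "AWS::Glue::Database",
                "Properties": {
                    "CatalogId": {"Ref": "AWS::AccountId"},
                    "DatabaseInput": {"Name": database_name},
                },
            },
            # Crawler that scans the bucket and populates the database.
            "LakeCrawler": {
                "Type": "AWS::Glue::Crawler",
                "Properties": {
                    "Role": {"Ref": "CrawlerRoleArn"},
                    "DatabaseName": {"Ref": "LakeDatabase"},
                    "Targets": {
                        "S3Targets": [{"Path": {"Fn::Sub": "s3://${LakeBucket}/"}}]
                    },
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_pipeline_template("partner_a"), indent=2))
```

Generating every stack from one function is what keeps the pattern single-sourced: a Jenkins job could render and deploy the same template for each of the four channels, varying only the source name.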
We created a pattern (a common method for extracting, transforming, and loading data) for the customer. As seen in the diagram below, data was extracted from each of the company’s four data channels and moved into a file drop. Each file drop triggered a Lambda function that moved the files into the AWS data lake solution.
The Lambda function also triggered an AWS Glue crawler. AWS Glue is a fully managed extract, transform, and load (ETL) service, which we used to prepare and load the data for analytics. We pointed AWS Glue at the data stored in the AWS data lake where Glue discovers the data and stores the associated metadata in an AWS Glue Data Catalog. Once cataloged, the company’s data is immediately searchable, queryable, and available for ETL.
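The file-drop step described above could look roughly like the following. This is a sketch under stated assumptions rather than the function built for the customer: it assumes the file drop is an S3 bucket that sends object-created notifications to Lambda, and the `LAKE_BUCKET` and `CRAWLER_NAME` environment variables and the `move_and_crawl` helper are hypothetical names. The boto3 calls used (`copy_object`, `delete_object`, `start_crawler`) are standard client methods.

```python
from urllib.parse import unquote_plus


def move_and_crawl(event, s3, glue, lake_bucket, crawler_name):
    """Move each dropped file into the lake bucket, then start the crawler.

    `s3` and `glue` are boto3 clients (or test doubles with the same methods).
    Returns the list of object keys that were moved.
    """
    moved = []
    for record in event.get("Records", []):
        src_bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        s3.copy_object(
            Bucket=lake_bucket,
            Key=key,
            CopySource={"Bucket": src_bucket, "Key": key},
        )
        # Remove the original so the file drop only holds unprocessed files.
        s3.delete_object(Bucket=src_bucket, Key=key)
        moved.append(key)
    if moved:
        # Ask Glue to re-scan the lake and refresh the Data Catalog.
        glue.start_crawler(Name=crawler_name)
    return moved


def lambda_handler(event, context):
    # Imported here so move_and_crawl can be exercised without AWS access.
    import os

    import boto3

    return move_and_crawl(
        event,
        boto3.client("s3"),
        boto3.client("glue"),
        os.environ["LAKE_BUCKET"],    # hypothetical configuration
        os.environ["CRAWLER_NAME"],   # hypothetical configuration
    )
```

Two caveats apply to a sketch like this. `copy_object` handles objects up to 5 GB, so larger files would need a multipart copy. `start_crawler` raises an error if the crawler is already running, so a production version would likely catch that case or schedule the crawl separately.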
The catalogs that Glue created were in turn forwarded to Amazon Redshift, a data warehouse that makes it simple to analyze data across both the data warehouse and the data lake. Analytics are then run through the Amazon Redshift cluster, which was made available to users via Tableau’s data visualization solution.
AWS Data Lake Security
Flux7’s AWS security best practices were also a strong factor in this customer engagement as data security and HIPAA compliance are of the utmost importance. Specifically, we built security into the data lake solution in several ways:
- By creating a single pattern, we limit the opportunity for human error, such as misconfigurations that could introduce vulnerabilities in future data lakes.
- IAM roles were established for each data source, as were specific policies across CloudWatch Logs, AWS Glue and S3 buckets.
- We bootstrapped and hardened all accounts.
- We integrated the solution with the company’s LDAP to ensure access privileges were appropriately assigned and controlled.
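As an illustration of the per-source IAM scoping listed above, the sketch below builds a least-privilege policy document for a single data source. It assumes each source writes under its own prefix in the lake bucket, which is a hypothetical layout. The function name, bucket name, and log group ARN are placeholders, and the AWS Glue permissions mentioned above are omitted to keep the example short.

```python
import json


def source_policy(lake_bucket: str, source_prefix: str, log_group_arn: str) -> dict:
    """Build an IAM policy document scoped to one data source.

    Assumption: each source owns exactly one prefix in the lake bucket.
    """
    prefix = source_prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is limited to the source's own prefix.
                "Sid": "ListOwnPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{lake_bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}/*"}},
            },
            {
                # Object access is limited to keys under that prefix.
                "Sid": "ReadWriteOwnObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{lake_bucket}/{prefix}/*",
            },
            {
                # Logging is limited to the source's own CloudWatch log group.
                "Sid": "WriteLogs",
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": f"{log_group_arn}:*",
            },
        ],
    }


if __name__ == "__main__":
    policy = source_policy(
        "example-lake-bucket",  # placeholder bucket name
        "partner-a",            # placeholder source prefix
        "arn:aws:logs:us-east-1:123456789012:log-group:/etl/partner-a",
    )
    print(json.dumps(policy, indent=2))
```

Because the policy is generated rather than hand-written, every new data source gets the same shape of permissions, which supports the goal of limiting misconfiguration.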
Deploying a data lake solution can be a time-consuming process, especially for organizations without prior experience. By bringing in the AWS experts at Flux7, our customer quickly gained a scalable data lake that collects and connects data from four disparate data sources. Data is the lifeblood of this healthcare organization, helping it equip its customers with critical health information to prepare for their lives. Now, with data at its fingertips, the team can quickly analyze information and make important data connections that will impact time to market and the customer experience.
To learn more about effective data management and analysis in the cloud for business advantage, check out the following AWS case studies:
- AWS Case Study: QSR Adopts Amazon RedShift for Cost-Effective Analysis
- Data Lake to Support In-Field IoT Applications for Autonomous Tractor Innovation
- Data Analytics Startup Matures Approach with Microservices Architecture
- AWS Case Study: Pharmaceutical Migrates R&D Analytics to AWS
* Gartner Webinar, Data & Analytics Leadership and Vision for 2019, Debra Logan, slide 4, Jan 24 2019