Designed and developed scalable AWS infrastructure to store and process large datasets using PySpark, S3, EMR, and Redshift
Implemented ETL pipelines using AWS services and Python to extract data from multiple sources, apply transformations, and load the results into the data warehouse
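A minimal, AWS-free sketch of the extract/transform/load pattern behind these pipelines; the field names and sample rows are hypothetical, and in the real pipeline the extract step read from S3 while the load step wrote to Redshift:

```python
# Hedged sketch: hypothetical columns ('id', 'name', 'quantity', 'price');
# the real pipeline read from S3 and loaded into Redshift instead of a list.

def extract(rows):
    """Extract: drop malformed records (here: rows missing 'id')."""
    return [r for r in rows if r.get("id") is not None]

def transform(rows):
    """Transform: normalize text fields and derive a 'total' column."""
    out = []
    for r in rows:
        out.append({
            "id": r["id"],
            "name": r.get("name", "").strip().lower(),
            "total": r.get("quantity", 0) * r.get("price", 0.0),
        })
    return out

def load(rows, warehouse):
    """Load: append transformed rows to an in-memory 'warehouse' table."""
    warehouse.extend(rows)
    return len(rows)

if __name__ == "__main__":
    source = [
        {"id": 1, "name": " Widget ", "quantity": 2, "price": 3.5},
        {"id": None, "name": "bad row"},
    ]
    warehouse = []
    loaded = load(transform(extract(source)), warehouse)
    print(loaded)                  # 1
    print(warehouse[0]["total"])   # 7.0
```

Keeping each stage a separate function makes the transform logic unit-testable without any AWS connectivity.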
Provisioned and managed EMR clusters on EC2 instances for data ingestion, tuning instance types and cluster sizing for optimal performance and resource utilization
Implemented Spark jobs on AWS EMR to process large datasets, improving data ingestion efficiency and reducing processing time
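A sketch of submitting a Spark job to EMR as a cluster step. The step name, script path, and cluster id are hypothetical; the boto3 call is shown commented out so the step definition itself stays runnable without AWS credentials:

```python
# Hedged sketch: job name, S3 script path, and cluster id are hypothetical.

def build_spark_step(name, script_s3_path, extra_args=()):
    """Build the step-definition dict that EMR's AddJobFlowSteps expects."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar runs spark-submit on the cluster's master node
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     script_s3_path, *extra_args],
        },
    }

step = build_spark_step("daily-ingest", "s3://my-bucket/jobs/ingest.py",
                        ["--date", "2024-01-01"])

# Submitting the step would look like (requires AWS credentials):
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

Building the step dict separately from the API call keeps the job configuration easy to inspect and test.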
Developed custom scripts to automate data ingestion, data quality checks, and data reconciliation processes
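A minimal sketch of the kinds of automated checks such scripts perform: null checks on required columns, duplicate-key detection, and row-count reconciliation between source and target. Column names and sample data are hypothetical:

```python
# Hedged sketch: 'id' and 'amount' are hypothetical column names.

def check_no_nulls(rows, required_cols):
    """Return the sorted names of required columns that contain nulls."""
    return sorted({c for r in rows for c in required_cols if r.get(c) is None})

def check_unique_key(rows, key):
    """Return True if the key column has no duplicate values."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))

def reconcile_counts(source_count, target_count):
    """Reconciliation: every extracted row must have been loaded."""
    return source_count == target_count

rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
print(check_no_nulls(rows, ["id", "amount"]))  # ['amount']
print(check_unique_key(rows, "id"))            # True
```

In practice these checks would run after each load and fail the pipeline (or raise an alert) when any of them returns a problem.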
Utilized AWS Glue for scheduling and running ETL jobs, reducing manual intervention and increasing reliability
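A sketch of scheduling a Glue ETL job on a cron trigger. The trigger name, job name, and schedule are hypothetical; the boto3 call is commented out so the trigger definition stays runnable without AWS access:

```python
# Hedged sketch: trigger name, job name, and cron schedule are hypothetical.

def build_schedule_trigger(trigger_name, job_name, cron):
    """Build the arguments for Glue's CreateTrigger API (SCHEDULED type)."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        # Glue uses AWS cron syntax: minute hour day-of-month month day-of-week year
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }

trigger = build_schedule_trigger("nightly-etl-trigger", "nightly-etl-job",
                                 "0 2 * * ? *")  # 02:00 UTC daily

# Creating the trigger would look like (requires AWS credentials):
# import boto3
# glue = boto3.client("glue")
# glue.create_trigger(**trigger)
```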
Collaborated with cross-functional teams to gather requirements, design solutions, and implement end-to-end data pipelines
Monitored performance and optimized AWS resource usage to reduce costs and increase efficiency
Automated deployment of Lambda functions using AWS CloudFormation
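A minimal CloudFormation template sketch for this deployment pattern; the function name, handler, bucket, and key are all hypothetical placeholders, not values from the original project:

```yaml
# Hedged sketch: resource names, bucket, and key are hypothetical.
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  IngestFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: data-ingest-handler      # hypothetical name
      Runtime: python3.12
      Handler: handler.lambda_handler
      Role: !GetAtt IngestFunctionRole.Arn
      Code:
        S3Bucket: my-deploy-bucket           # hypothetical bucket
        S3Key: lambdas/ingest.zip
  IngestFunctionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal: {Service: lambda.amazonaws.com}
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
```

Deployment can then be automated with a single CLI call, e.g. `aws cloudformation deploy --template-file template.yaml --stack-name ingest-stack --capabilities CAPABILITY_IAM`, so function updates ship with the rest of the infrastructure.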
Designed, developed, and deployed custom software applications in Python and Java