Just giving a small update/opinion:
The more I use AWS-based tools for data, the more annoying it gets. Our data lake is Athena and we use AWS Lambda to run our Python scripts. Both have hard limitations and we're starting to hit them in various spots. AWS Lambda has a 15-minute time limit and upper limits on memory and ephemeral storage. That isn't so bad, since it forces you to keep your code a bit cleaner and modularize it, but it gets a little silly at times. But it's cheap as hell. I built a Power BI report to get an idea of the cost associated with our Lambdas, and it's like $20 a month total, with all the jobs we run daily and the ones we run hourly, lol.
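For context, what I mean by modularizing is basically watching the clock and handing the rest of the work off before the 15 minutes run out. Rough sketch of that pattern, not our actual code (the event/payload shape and `process_one` are made up for illustration):

```
import json
import boto3

lambda_client = boto3.client("lambda")

def handler(event, context):
    # Hypothetical work queue passed in through the event payload
    items = event.get("items", [])

    while items:
        # Leave a couple minutes of headroom before the 15-minute cap
        if context.get_remaining_time_in_millis() < 120_000:
            # Re-invoke this same function asynchronously with whatever is left
            lambda_client.invoke(
                FunctionName=context.function_name,
                InvocationType="Event",
                Payload=json.dumps({"items": items}),
            )
            return {"status": "continued", "remaining": len(items)}

        process_one(items.pop(0))  # stand-in for the real job step

    return {"status": "done"}

def process_one(item):
    # Placeholder for whatever the actual script does per item
    print(f"processing {item}")
```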
Athena, omg, Athena is goofy as hell. Query timeout limit of 30 minutes... I can run a SELECT statement that returns 600M records, but I can't insert them into a table because of the damn time limit. What happened to the parallel processing and shyt? We might be bringing on Databricks, but converting the jobs I'm having issues with over to Databricks is so far down the line.
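The workaround I've been leaning on in the meantime is splitting the big INSERT into per-partition statements and running them one at a time through boto3, so no single query gets near the 30-minute wall. Something like this sketch (the database, table, workgroup, and partition column names are placeholders, not our real ones):

```
import time
import boto3

athena = boto3.client("athena")

DATABASE = "my_datalake"   # placeholder names
WORKGROUP = "primary"

def run_query(sql):
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        WorkGroup=WORKGROUP,
    )["QueryExecutionId"]

    # Poll until the query finishes (or fails / gets cancelled)
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return qid, state
        time.sleep(10)

# Insert one month at a time so each statement stays well under 30 minutes
for month in ("2024-01", "2024-02", "2024-03"):
    sql = f"""
        INSERT INTO target_table
        SELECT * FROM source_table
        WHERE event_month = '{month}'
    """
    qid, state = run_query(sql)
    print(month, state)
```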
AWS Glue, I admit I don't really understand it yet, and I dunno if it's my favorite Spark environment. We'll see.
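For anyone who hasn't touched it, a Glue job is really just Spark with some boilerplate wrapped around it. This is roughly the skeleton I've been poking at, with the catalog, table, and bucket names made up (not our real environment):

```
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job boilerplate; JOB_NAME is passed in by the Glue service
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read straight from the Glue Data Catalog (placeholder database/table)
df = glue_context.create_dynamic_frame.from_catalog(
    database="my_datalake", table_name="source_table"
).toDF()

# Do the heavy lifting in Spark instead of an Athena SQL statement
result = df.filter(df["event_month"] == "2024-01")

# Write the output back to S3 as Parquet (path is a placeholder)
result.write.mode("overwrite").parquet("s3://my-bucket/output/")

job.commit()
```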
Edit: Met with my manager and realized that the query I'm converting over to Athena/Glue takes like an hour to run, lol. So now that I'm a little more patient, Glue is the right way to go in our environment. I've just been doing this by hand (not through our standard structured job methodology) to get an idea of what path works. So now I gotta standardize it and deploy it.