Modern data stack case study

2022. 8. 19. 16:23 · Living as a Data Analyst


https://outerbounds.com/blog/modern-data-stack-mlops

- Data is stored and transformed in Snowflake, which provides the underlying compute for SQL, including data transformations managed by a tool like dbt;
- Training happens on AWS Batch, leveraging the abstractions provided by Metaflow;
- Serving is on SageMaker, leveraging the PaaS offering by AWS;
- Scheduling is on AWS Step Functions, leveraging once again Metaflow (not shown in the repo, but straightforward to achieve).
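
A minimal sketch of how such a pipeline might be laid out as a Metaflow flow; the step names and placeholder bodies below are illustrative, not the actual code in the linked repository:

```python
# Illustrative Metaflow skeleton mirroring the stack above; step names and
# placeholder bodies are hypothetical, not the linked repository's code.
from metaflow import FlowSpec, step


class ShoppingSessionFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.prepare_dataset)

    @step
    def prepare_dataset(self):
        # Feature preparation happens in Snowflake via dbt; this step would
        # only read the resulting table back with a SQL query.
        self.dataset = []  # placeholder for rows fetched from Snowflake
        self.next(self.train_model)

    @step
    def train_model(self):
        # In the real pipeline this step carries a GPU / AWS Batch declaration
        # (see the decorator sketch further below).
        self.model_artifact = None  # placeholder for the trained model
        self.next(self.test_model)

    @step
    def test_model(self):
        # Behavioral ("black-box") tests on unseen data, e.g. with RecList.
        self.next(self.deploy)

    @step
    def deploy(self):
        # Ship the validated artifact to a SageMaker endpoint.
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ShoppingSessionFlow()
```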


A crucial point in our design is the abstraction level we chose to operate at: the entire pipeline does not need any special DevOps person, infrastructure work, or yaml files. SQL and Python are the only languages in the repository: infrastructure is either invisible (as in, Snowflake runs our dbt queries transparently) or declarative (for example, you specify the type of computation you need, such as GPUs, and Metaflow makes it happen). Is this really possible!?
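
As a concrete illustration of that declarative style (a sketch only; the decorator arguments are examples, not the repo's actual settings), requesting a GPU in Metaflow is a one-line annotation rather than an infrastructure ticket:

```python
# The training step is written once; where and on what it runs is a
# declaration. The resource figures below are illustrative.
from metaflow import FlowSpec, step, resources


class DeclarativeDemo(FlowSpec):

    @resources(gpu=1, memory=16000)  # "I need a GPU": Metaflow + the cloud do the rest
    @step
    def start(self):
        self.trained = True  # placeholder for real GPU training code
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    DeclarativeDemo()
```

Running the flow with Metaflow's `--with batch` option sends the annotated steps to AWS Batch, and a single `step-functions create` command deploys the same flow to AWS Step Functions for scheduling, all without writing yaml by hand.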

The exciting part is that a simple SQL query, easy to read and maintain, is all that is needed to connect feature preparation and deep learning training on a GPU in the cloud. Training a model produces an artifact (that is, a ready-to-use model!), which can now generate predictions: as it’s good practice to test the model on unseen data before deploying it on real traffic, we showcase in test_model a draft of an advanced technique, behavioral testing.
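
To make that hand-off concrete, here is a sketch of what "a simple SQL query is all you need" can look like in practice; the table and column names are hypothetical, not taken from the repository:

```python
# Sketch of the SQL-to-training hand-off: a SELECT against the dbt-produced
# table in Snowflake feeds the training step directly. Names are hypothetical.
import os
import snowflake.connector

FEATURE_QUERY = """
    SELECT session_id, product_sku, event_timestamp
    FROM analytics.session_features           -- table produced by a dbt model
    WHERE event_date >= DATEADD(day, -30, CURRENT_DATE)
"""

def load_training_sessions():
    conn = snowflake.connector.connect(
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        account=os.environ["SNOWFLAKE_ACCOUNT"],
    )
    try:
        cursor = conn.cursor()
        cursor.execute(FEATURE_QUERY)
        return cursor.fetchall()  # rows handed straight to the GPU training step
    finally:
        conn.close()
```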

https://github.com/jacopotagliabue/reclist

GitHub - jacopotagliabue/reclist: Behavioral "black-box" testing for recommender systems

Behavioral "black-box" testing for recommender systems - GitHub - jacopotagliabue/reclist: Behavioral "black-box" testing for recommender systems

github.com
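To give the test_model idea some shape, here is a minimal, library-agnostic sketch of behavioral checks; it does not use RecList's actual API, only the same "black-box" idea of probing model outputs on crafted inputs:

```python
# Library-agnostic behavioral ("black-box") test sketch; RecList implements
# richer versions of these checks, this only shows the shape of the idea.
def test_coverage(predict, test_sessions, catalog, min_coverage=0.1):
    """The model should recommend a reasonable share of the catalog,
    not just a handful of popular items."""
    recommended = {item for session in test_sessions for item in predict(session)}
    coverage = len(recommended) / len(catalog)
    assert coverage >= min_coverage, f"catalog coverage too low: {coverage:.2%}"


def test_perturbation_stability(predict, test_sessions, perturb):
    """Slightly perturbed sessions (e.g. one item dropped) should not flip
    the top recommendation most of the time."""
    stable = sum(
        predict(session)[0] == predict(perturb(session))[0]
        for session in test_sessions
    )
    assert stable / len(test_sessions) >= 0.5
```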


Finally, our pipeline ends with deployment, that is, the process of shipping the artifact produced by training and validated by testing to a public endpoint that can be reached like any other API; by supplying a shopping session, the endpoint will respond with the most likely continuation, according to the model we trained.
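
Once the model sits behind a SageMaker endpoint, calling it really is a plain API request. A sketch of such a client call, where the endpoint name and request/response schema are hypothetical:

```python
# Hypothetical client call to the deployed endpoint; the endpoint name and
# payload schema are illustrative, not the repository's actual contract.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def predict_next_items(session_skus):
    response = runtime.invoke_endpoint(
        EndpointName="rec-model-endpoint",     # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps({"session": session_skus}),
    )
    return json.loads(response["Body"].read())  # most likely continuation(s)


if __name__ == "__main__":
    print(predict_next_items(["sku-123", "sku-456"]))
```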

This stack can also be run in increasingly complex configurations, depending on how many tools/functionalities you want to include: even at “full complexity” it is a remarkably simple and “hands-off” stack for terabyte-scale processing. Everything is also fairly decoupled, so if you wish to swap SageMaker for Seldon, or Comet for Weights & Biases, you can do it in a breeze.