Meesho Data Engineer Interview Questions
Round 1
Design a data platform.
- Multiple sources and sinks.
- Connectors for reading/writing data.
Purpose:
1. Read data -> perform ETL (transformations applied as a sequence of rules, expressed as Spark SQL string commands) -> load into a sink.
The platform should be configurable.
Define a pipeline:
A pipeline is a group of similar events; an event can be supplier data, etc.
Events of the same type come from the same source, e.g. supplier data from Kafka.
Transformations are tied to events, not to the pipeline:
event 1: T1 -> T2
event 2: T1 -> T4 -> T5
Transformations are event-specific; the source and sink are pipeline-specific.
Also, what will its base classes and functions be, and what is the purpose of each function?
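One way to answer the base-class question is sketched below. The class names (`Source`, `Sink`, `Transformation`, `Pipeline`) and the dict-based event shape are assumptions for illustration; in the real platform `Transformation` would likely wrap a Spark SQL string rather than a Python function.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class Source(ABC):
    """Connector that reads raw events (e.g. a Kafka topic)."""
    @abstractmethod
    def read(self) -> List[Dict[str, Any]]: ...

class Sink(ABC):
    """Connector that writes transformed events (e.g. a warehouse table)."""
    @abstractmethod
    def write(self, records: List[Dict[str, Any]]) -> None: ...

class Transformation(ABC):
    """A single transformation rule applied to one event record."""
    @abstractmethod
    def apply(self, record: Dict[str, Any]) -> Dict[str, Any]: ...

class Pipeline:
    """Source and sink are pipeline-specific; transformations are event-specific,
    so the pipeline keeps a mapping from event type to its transformation chain."""
    def __init__(self, source: Source, sink: Sink,
                 transforms_by_event: Dict[str, List[Transformation]]):
        self.source = source
        self.sink = sink
        self.transforms_by_event = transforms_by_event

    def run(self) -> None:
        records = self.source.read()
        out = []
        for rec in records:
            # Apply only the chain registered for this record's event type.
            for t in self.transforms_by_event.get(rec["event_type"], []):
                rec = t.apply(rec)
            out.append(rec)
        self.sink.write(out)
```

A config file could then declare one source, one sink, and a per-event list of rule names, which the platform resolves into this object graph.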
Round 2
Consider an e-commerce website with clickstream event data generated at a rate of 500k events/second.
These events can be add_to_cart, view, order, wishlisted, etc.
The storage is Cloud Storage.
Requirements - build a data platform with the following characteristics:
- Self-serve platform to provide ETL (hourly, daily, weekly, etc.)
  e.g. product view count per hour per product, product view count per day per product
- Attribution query: has the user ordered a product after clicking on an ad in the last 3 days?
- Ad-hoc queries / notebook interface for analysts (300 DAU)
- ML use cases (feature engineering, training, etc.; needs time-travel queries and historical data) -> Hudi or Delta Lake
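In production the "product view count per hour per product" job would be a Spark SQL aggregation over the clickstream table. A minimal pure-Python sketch of the same rollup logic is below; the field names (`event_type`, `product_id`, `ts` as a Unix timestamp) are assumptions.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, List

def hourly_view_counts(events: List[Dict]) -> Counter:
    """Count 'view' events per (product_id, UTC hour bucket).

    Roughly equivalent Spark SQL:
      SELECT product_id, date_trunc('hour', event_time) AS hr, count(*)
      FROM clickstream WHERE event_type = 'view'
      GROUP BY product_id, date_trunc('hour', event_time)
    """
    counts: Counter = Counter()
    for e in events:
        if e["event_type"] != "view":
            continue
        hour = datetime.fromtimestamp(e["ts"], tz=timezone.utc).strftime("%Y-%m-%d %H:00")
        counts[(e["product_id"], hour)] += 1
    return counts
```

The daily rollup is the same aggregation with a coarser time bucket, which is why a self-serve platform can expose both as one parameterized job.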
Existing: 1500-2000 SQL jobs, which:
1. might contain duplicates
2. are non-optimized (no predicates/filters)
What will you take care of for the new jobs/SQL?
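Two of the hygiene checks implied above, deduplicating events and filtering on the partition column before any heavy work, can be sketched in pure Python as follows. The event shape and the `event_id`/`dt` field names are assumptions; in Spark the second function corresponds to putting a partition predicate in the WHERE clause so less data is scanned.

```python
from typing import Dict, Iterable, Iterator

def dedupe(events: Iterable[Dict], key: str = "event_id") -> Iterator[Dict]:
    """Drop duplicate events by id; at-least-once clickstream delivery
    means the same event can arrive more than once."""
    seen = set()
    for e in events:
        if e[key] in seen:
            continue
        seen.add(e[key])
        yield e

def filter_partition(events: Iterable[Dict], date: str) -> Iterator[Dict]:
    """Apply the partition predicate first so every downstream step
    processes only the relevant day's data."""
    return (e for e in events if e.get("dt") == date)
```

For new jobs the order matters: filter to the target partition, then dedupe, then aggregate, rather than scanning the full history the way the legacy unfiltered jobs do.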