Real-time Stream Analytics and Scoring Using Apache Flink, Druid & Cassandra - Ciesielczyk & Zontek

Flink Forward
One of the hardest challenges we are trying to solve is how to deliver customizable insights based on billions of data points in real time, scaling from the perspective of a single individual up to millions of users.

At Deep.BI we track user habits, engagement, and product and content performance, processing terabytes of data (billions of events) daily. Our goal is to provide real-time insights based on custom metrics computed from a variety of self-created dimensions. The platform supports tasks from various domains, such as adjusting websites using real-time analytics, running AI-optimized marketing campaigns, providing a dynamic paywall based on user engagement and AI scoring, and detecting fraud based on data anomalies and adaptive patterns.

To accomplish this, our system collects every user interaction. We use Apache Flink for event enrichment, custom transformations, aggregations, and serving machine learning models. The processed data is then indexed by Apache Druid for real-time analytics and by Apache Cassandra for delivery of the scores. Historical data is also stored on Apache Hadoop for building machine learning models. Using the low-level DataStream API, custom process functions, and broadcast state, we have built an abstract feature engineering framework that provides reusable templates for data transformations. This allowed us to easily define domain-specific features for analytics and machine learning, and to migrate our batch data preprocessing pipeline from Python jobs deployed on Apache Spark to Flink, resulting in a significant performance boost.
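As a minimal sketch of this pattern (not the actual Deep.BI framework), feature definitions can be broadcast to a KeyedBroadcastProcessFunction and applied to events as they arrive; the Event, FeatureDef, and FeatureValue types below are hypothetical placeholders:

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class FeatureEngine
        extends KeyedBroadcastProcessFunction<String, Event, FeatureDef, FeatureValue> {

    // Broadcast state holding the currently active feature definitions by name.
    public static final MapStateDescriptor<String, FeatureDef> FEATURES =
            new MapStateDescriptor<>(
                    "features",
                    BasicTypeInfo.STRING_TYPE_INFO,
                    TypeInformation.of(FeatureDef.class));

    @Override
    public void processElement(Event event, ReadOnlyContext ctx,
                               Collector<FeatureValue> out) throws Exception {
        // Apply every registered feature template to the incoming event.
        for (java.util.Map.Entry<String, FeatureDef> entry :
                ctx.getBroadcastState(FEATURES).immutableEntries()) {
            out.collect(entry.getValue().apply(event));
        }
    }

    @Override
    public void processBroadcastElement(FeatureDef def, Context ctx,
                                        Collector<FeatureValue> out) throws Exception {
        // A new or updated definition replaces the old one at runtime,
        // without redeploying the job.
        ctx.getBroadcastState(FEATURES).put(def.name(), def);
    }
}

/** Hypothetical abstractions standing in for the framework's templates. */
interface Event { String userId(); }

interface FeatureValue {}

interface FeatureDef extends java.io.Serializable {
    String name();
    FeatureValue apply(Event event);
}
```

Wiring it up follows the standard broadcast pattern: events.keyBy(Event::userId).connect(featureDefs.broadcast(FeatureEngine.FEATURES)).process(new FeatureEngine()).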

This talk covers the challenges of building and maintaining our platform and the lessons learned along the way, namely how to:

- evolve a continuous application processing an unbounded data stream,

- provide an API for defining, updating and reusing features for machine learning,

- handle late events and state TTL (see the sketch after this list),

- serve machine learning models with the lowest latency possible,

- dynamically update the business logic at runtime without needing to redeploy (the broadcast state sketch above shows this pattern), and

- automate the data pipeline deployment.
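Two of these items can be sketched with standard Flink APIs: allowed lateness with a side output for events that arrive after the watermark, and a TTL on keyed state so per-user state does not grow without bound. The Event type, key, window size, and durations below are illustrative assumptions, not Deep.BI's settings:

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.util.OutputTag;

public class LatenessAndTtl {

    // Tag for events arriving even after the allowed lateness; they go to a
    // side output instead of being silently dropped.
    public static final OutputTag<Event> LATE = new OutputTag<Event>("late-events") {};

    public static SingleOutputStreamOperator<Long> hourlyCounts(DataStream<Event> events) {
        return events
                .keyBy(e -> e.userId)
                .window(TumblingEventTimeWindows.of(
                        org.apache.flink.streaming.api.windowing.time.Time.hours(1)))
                // Events up to 10 minutes behind the watermark still update
                // the already-fired window result.
                .allowedLateness(
                        org.apache.flink.streaming.api.windowing.time.Time.minutes(10))
                .sideOutputLateData(LATE)
                .aggregate(new CountPerWindow());
    }

    public static ValueStateDescriptor<Long> scoreState() {
        // Expire per-user keyed state 30 days after the last write and never
        // return already-expired values.
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.days(30))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();
        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("engagement-score", Long.class);
        descriptor.enableTimeToLive(ttl);
        return descriptor;
    }

    /** Hypothetical user-event type. */
    public static class Event {
        public String userId;
        public long timestamp;
    }

    /** Counts events per key and window. */
    public static class CountPerWindow implements AggregateFunction<Event, Long, Long> {
        @Override public Long createAccumulator() { return 0L; }
        @Override public Long add(Event e, Long acc) { return acc + 1; }
        @Override public Long getResult(Long acc) { return acc; }
        @Override public Long merge(Long a, Long b) { return a + b; }
    }
}
```

The late stragglers can then be inspected via hourlyCounts(events).getSideOutput(LATE), for example to log them or reconcile them in a correction job.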