What is data streaming?
Data streams are continuous flows of data, typically delivered in real time or near real time. They are generated by a wide range of sources, including IoT devices, sensors, logs, social media feeds, financial transactions, and user activity on websites and applications.
Key characteristics of data streaming:
Continuous and Real-time: Data streams are ongoing and unbounded, meaning they keep generating data as long as the source is active. This is different from finite static datasets that do not change once created.
High Velocity: Data streams often come at high speeds, requiring quick processing and analysis to derive meaningful insights promptly.
Real-time processing typically means response latencies below roughly 50 ms, combined with high reliability, for applications such as fraud detection, monitoring, and anomaly detection.
Varied Data & Signal Sources: Data streams can originate from a variety of sources, including but not limited to:
IoT sensors generating environmental or machine-status data
Social media platforms providing real-time updates on posts and user interactions
Financial systems producing transaction data
Web logs capturing real user activity on websites
Dynamic and Transient: Data points can be short-lived, meaning they may only be relevant for a brief period, so systems processing data streams need to handle transient state efficiently.
Processing Models: There are specific processing models and frameworks designed to handle data streams, such as:
Stream Processing: Tools like Apache Kafka, Apache Flink, and Apache Storm are designed to process data streams by analyzing and acting on data as it arrives.
Complex Event Processing (CEP): Engines like Esper enable the detection of patterns and relationships within streams, often used in scenarios requiring the detection of specific events or conditions across multiple data points; a minimal sketch follows.
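To make the distinction concrete, here is a minimal, framework-free Python sketch of CEP-style pattern detection: it scans a price stream for three consecutive rising ticks, the kind of multi-event condition a CEP engine would let you express declaratively. The stream values and the streak rule are illustrative assumptions, not taken from any particular engine.

```python
from collections import deque

def detect_rising_streak(prices, streak_len=3):
    """Yield an alert whenever `streak_len` consecutive rising prices occur.

    A toy stand-in for the multi-event patterns a CEP engine matches declaratively.
    """
    window = deque(maxlen=streak_len)
    for price in prices:
        window.append(price)
        if len(window) == streak_len and all(
            a < b for a, b in zip(window, list(window)[1:])
        ):
            yield f"rising streak detected, ending at {price}"

# Illustrative tick stream (hypothetical values).
for alert in detect_rising_streak([101.5, 101.4, 101.6, 101.9, 102.2, 102.0]):
    print(alert)
```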
Applications: Data streams power a variety of real-world applications, such as:
Real-time analytics: Providing insights and dashboards that reflect current conditions.
Monitoring and Alerting: Detecting anomalies or specific conditions that require immediate attention.
Real-time recommendations: Offering personalized recommendations based on current user interactions.
Internet of Things (IoT): Monitoring and controlling devices in real-time.
How does data streaming work?
Data streaming works by continuously ingesting data from various sources (sensors, logs, transactions) and transporting it through platforms such as Apache Kafka, Amazon Kinesis, or a hosted PaaS. Stream processing frameworks like Apache Flink or Google Cloud Dataflow analyze and process the data in real time. Processed data can be stored in databases or data lakes for further analysis and visualization, and real-time analytics products like PubNub Illuminate can trigger custom automated actions based on the insights derived from the data stream.
Data Stream Process Workflow
Connecting Data Sources: Data is generated from various sources such as IoT devices, applications, databases, social media, and more.
Data Ingestion/Collection: Managed cloud services collect the incoming data, for example:
Amazon Kinesis Data Streams: Allows real-time data streaming.
Google Cloud Pub/Sub: Messaging service for event-driven systems.
Azure Event Hubs: Big data streaming platform and event ingestion service.
Data Transportation:
Ingested data is transported through the streaming platform using a publish-subscribe model, the pattern implemented by platforms like PubNub; a minimal sketch follows the two bullets below.
Producers publish data to specific topics or streams.
Consumers subscribe to these topics to receive the data.
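As a concrete sketch of this producer/consumer pattern, the snippet below uses the kafka-python client against a hypothetical local broker and a made-up sensor-readings topic; any pub/sub platform follows the same publish-to-a-topic, subscribe-to-a-topic shape.

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer side: publish a reading to a topic (broker address is illustrative).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("sensor-readings", {"sensor_id": "s-42", "temp_c": 21.7})
producer.flush()

# Consumer side (typically a separate process): subscribe to the same topic.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # {'sensor_id': 's-42', 'temp_c': 21.7}
```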
Data Processing: Cloud providers offer managed services for real-time stream processing. Examples include:
Amazon Kinesis Data Analytics: Processes streaming data using SQL queries.
Google Cloud Dataflow: Unified stream and batch data processing.
Azure Stream Analytics: Real-time stream processing service.
These services allow users to write processing logic using SQL-like queries, code, or graphical interfaces; the sketch below shows the windowed aggregation such queries boil down to.
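Conceptually, a streaming query such as "average vehicle speed per 60-second window" reduces to grouping events by time bucket and aggregating each bucket. Here is a plain-Python sketch of that tumbling-window logic, with illustrative event fields and window size; a real engine emits each window's result as it closes rather than returning them in one batch.

```python
from collections import defaultdict

def tumbling_window_avg(events, window_s=60):
    """Group (timestamp, value) events into fixed windows and average each one,
    which is what a streaming SQL GROUP BY time window expresses."""
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[ts // window_s].append(value)
    return {w * window_s: sum(v) / len(v) for w, v in sorted(buckets.items())}

# Illustrative (timestamp_seconds, speed_kmh) events.
events = [(3, 48.0), (35, 52.0), (61, 40.0), (90, 44.0)]
print(tumbling_window_avg(events))  # {0: 50.0, 60: 42.0}
```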
Complex Event Processing (CEP): Managed engines detect patterns and complex events in the data stream. These can often be integrated into stream processing services.
Data Storage: Processed data can be stored in various cloud storage solutions:
AWS S3, Google Cloud Storage, Azure Blob Storage: Object storage for large-scale data storage.
AWS Redshift, Google BigQuery, Azure Synapse: Data warehouses for structured data analysis.
Time-series databases: For time-dependent data, e.g., Amazon Timestream.
Real-time Analytics and Dashboards: Live dashboards visualize the processed data as it arrives.
Output and Actions: Processed data and insights trigger notifications, automated responses, and downstream workflows.
Smart City Data Stream Example Workflow on a PaaS
Data Generation: Sensors in a smart city infrastructure generate real-time data on traffic, weather, and energy usage.
Data Ingestion: Sensor data is sent to Google Cloud Pub/Sub.
Data Transportation: Pub/Sub manages the distribution of data to various consumers.
Data Processing: Google Cloud Dataflow processes the data, filtering and aggregating it to compute metrics like average traffic speed and energy consumption.
Complex Event Processing: Patterns indicating traffic congestion or power outages are detected.
Data Storage: Aggregated data is stored in a database for further analysis.
Real-time Analytics and Dashboards: Dashboards visualize real-time traffic, energy usage, and related statistics.
Output and Actions: If anomalies are detected, alerts are sent via Google Cloud Functions to city management systems, triggering automated responses such as adjusting traffic light timings or notifying maintenance teams; a minimal sketch of this step follows.
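Here is a minimal sketch of that last step, with a hypothetical congestion threshold and a stubbed-out notification call standing in for the Cloud Functions integration:

```python
# Hypothetical rule for the smart-city example; the threshold, metric
# names, and the notify stub are illustrative assumptions.
CONGESTION_SPEED_KMH = 15.0

def notify_city_ops(message: str) -> None:
    # Stand-in for a real action, e.g., an HTTP call to a Cloud Function.
    print(f"ALERT -> city ops: {message}")

def check_traffic_metric(segment_id: str, avg_speed_kmh: float) -> None:
    """Trigger an automated response when a processed metric crosses a threshold."""
    if avg_speed_kmh < CONGESTION_SPEED_KMH:
        notify_city_ops(
            f"congestion on segment {segment_id}: avg speed {avg_speed_kmh:.1f} km/h"
        )

check_traffic_metric("A-12", 9.8)   # fires an alert
check_traffic_metric("B-07", 42.5)  # no action
```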
Data streaming example - Real-Time Trading App
Context: In high-frequency trading, financial firms execute thousands of trades per second to capitalize on minor price fluctuations in the stock market. The data streams involved are vast and require robust processing capabilities.
Data Sources:
Market Data Feeds: Exchanges such as NYSE and NASDAQ provide real-time quote and trade data.
News Feeds: Real-time news services like Bloomberg or Reuters provide financial news that can move stock prices.
Data Characteristics:
Volume: Thousands of trades per second, resulting in gigabytes of data per day.
Velocity: Data needs to be processed with latencies often in the microsecond range.
Variety: Structured data (trade prices, volumes) and unstructured data (news articles).
Data Stream Processing Pipeline:
Data Ingestion:
Apache Kafka: A distributed event streaming platform that handles real-time data ingestion.
Protocol Buffers (Protobuf): Used for efficient data serialization.
Real-Time Processing:
Apache Flink: A stream processing framework that processes the data with sub-millisecond latency.
Sliding Windows: Compute moving averages and other statistics on the data stream; for instance, a 1-second sliding window may compute the average trade price and volume.
Analytics and Algorithms:
Machine Learning Models: Deployed using libraries like TensorFlow or PyTorch to predict stock price movements based on incoming data.
Statistical Analysis: Real-time computation of statistics such as:
Moving Averages: Simple Moving Average (SMA) and Exponential Moving Average (EMA).
Volatility Measures: Standard deviation of price changes over specified intervals.
Storage and Retrieval:
In-Memory Databases: Redis or Memcached for storing interim computational results.
Time-Series Databases: InfluxDB or TimescaleDB for storing historical data and enabling back-testing of trading strategies. A condensed sketch of this pipeline follows.
Example Statistics:
Tick Rate (event rate): 5,000 ticks/events per second.
Latency Requirements: Processing latency under 500 microseconds.
Moving Average Calculation: 10-second moving average of stock prices, updated every second.
Example of Real-Time Calculation:
Assume a simple moving average (SMA) calculation for the last 10 seconds on a data stream of stock prices:
Prices over the last 10 seconds: [101.5, 102.0, 101.8, 101.7, 102.2, 101.9, 102.3, 102.1, 101.6, 102.4]
SMA = (101.5 + 102.0 + 101.8 + 101.7 + 102.2 + 101.9 + 102.3 + 102.1 + 101.6 + 102.4) / 10
SMA = 101.95
This calculation needs to be updated every second as new prices stream in, maintaining real-time analytics.
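A rolling version of this calculation in Python, using a fixed-size deque so the oldest price is evicted automatically on each tick; the prices and window size mirror the worked example above, and the extra tick at the end is hypothetical:

```python
from collections import deque

class RollingSMA:
    """Simple moving average over the last `size` observations, updated per tick."""

    def __init__(self, size: int = 10):
        self.window = deque(maxlen=size)

    def update(self, price: float) -> float:
        self.window.append(price)  # the oldest price drops out automatically
        return sum(self.window) / len(self.window)

sma = RollingSMA(size=10)
prices = [101.5, 102.0, 101.8, 101.7, 102.2,
          101.9, 102.3, 102.1, 101.6, 102.4]
for p in prices:
    latest = sma.update(p)
print(round(latest, 2))  # 101.95, matching the hand calculation above

# The next (hypothetical) tick evicts 101.5 and shifts the average:
print(round(sma.update(102.6), 2))  # 102.06
```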
Tech Stack Summary:
Data Ingestion: Apache Kafka
Serialization: Protocol Buffers
Stream Processing: Apache Flink
Machine Learning: TensorFlow/PyTorch
Databases: Redis, InfluxDB/TimescaleDB
This data streaming setup ensures that the stock trading platform can handle high-speed data streams, process them with minimal latency, and make informed trading decisions in real time.