KSQL — Getting Started (Part 2)

Rasiksuhail
6 min readFeb 25, 2024
Photo by Henry & Co. on Unsplash

In the ever-evolving landscape of data processing, stream processing has emerged as a critical component for real-time analytics and insights. Among the myriad of tools available, KSQL stands out as a powerful and intuitive stream processing engine that simplifies stream processing tasks for developers and data engineers.

In this blog post, we’ll delve little more into what KSQL is, its key features, and how you can leverage it to streamline your stream processing workflows.

Lets start from where we left before.

Streams

A stream is a partitioned, immutable, append-only collection that represents a series of historical facts.

Example

  • “Alice sent $100 to Bob”
  • “Charlie sent $50 to Bob”

Immutable & Append only

  • Once a row is inserted into a stream, it can never change.
  • New rows can be appended at the end of the stream, but existing rows can never be updated or deleted.

Partition

  • Each row is stored in a particular partition.
  • Every row, implicitly or explicitly, has a key that represents its identity.
  • All rows with the same key reside in the same partition.

Tables

A table is a mutable, partitioned collection that models change over time.

A table represents what is true as of “now”.

Key is crucial.

  • Tables work by leveraging the keys of each row
  • If a sequence of rows shares a key, the last row for a given key represents the most up-to-date information for that key’s identity
  • A background process periodically runs and deletes all but the newest rows for each key.

Queries

Persistent — Server side

  • Issued when deriving new streams and tables from existing streams or tables
  • Runs indefinitely processing rows of events

Push Query — Client side

  • Subscribes to a result as it changes in real-time
  • Long lived connection
  • Good fit for asynchronous control flow
  • Can be used to query either streams or tables for a particular key
  • Execute a push query by sending an HTTP request to the ksqlDB REST API, and the API sends back a chunked response of indefinite length.
  • Persistence of result

— The result of a push query isn’t persisted to a backing Kafka topic

— For persistence, use a CREATE TABLE AS SELECT or CREATE STREAM AS SELECT statement

  • Example : Subscribing to a particular user’s geographic location

Pull Query — Client side

  • Form of query that retrieves a result as of “now”, like a query against a traditional RDBMS
  • Returns finite result immediately and closes its connection
  • Good fit for synchronous control flow (request/response)
  • Execute a pull query by sending an HTTP request to the ksqlDB REST API, and the API responds with a single response.
  • Enable you to fetch the current state of a materialized view.
  • Example : Ask for the current map coordinates of a particular user

Joins

A ksqlDB join and a relational database join are similar in that they both combine data from two or more sources based on common values

  • Stream + Stream = Stream
  • Table + Table = Table
  • Stream + Table = Stream

Currently, the join expression must be a single column equal comparison. Non-equi joins, or joins on multiple columns are not supported.

CREATE STREAM shipped_orders AS
SELECT
o.id as orderId,
o.itemid as itemId,
s.id as shipmentId,
p.id as paymentId
FROM orders o
INNER JOIN payments p WITHIN 1 HOURS ON p.id = o.id
INNER JOIN shipments s WITHIN 2 HOURS ON s.id = o.id;SQL

Stream + Stream -> must specify WITHIN clause

Join capabilities

Stream — Stream Joins

Stream-stream joins involve merging two or more continuous data streams based on common attributes or temporal windows. In this type of join, events from different streams are matched in real-time as they arrive, allowing organizations to correlate related events and detect patterns across multiple data sources simultaneously. Stream-stream joins are particularly useful in scenarios such as complex event processing, fraud detection, and IoT sensor data analytics, where real-time correlation of streaming events is critical for actionable insights.

  • ksqlDB supports INNER, LEFT OUTER, RIGHT OUTER, and FULL OUTER joins between streams.
  • All of these operations support out-of-order records.
  • To join two streams, you must specify a windowing scheme by using the WITHIN clause.
  • A new input record on one side produces a join output for each matching record on the other side, and there can be multiple such matching records within a join window.

JOIN Results

  • LEFT OUTER joins will contain leftRecord-NULL records in the result stream, which means that the join contains NULL values for fields selected from the right-hand stream where no match is made.
  • RIGHT OUTER joins will contain rightRecord-NULL records in the result stream, which means that the join contains NULL values for fields selected from the left-hand stream where no match is made.
  • FULL OUTER joins will contain leftRecord-NULL or NULL-rightRecord records in the result stream, which means that the join contains NULL values for fields coming from a stream where no match is made.

Stream — Table Joins

Stream-table joins combine an unbounded stream of data with a static dataset (table) to enrich the streaming data with additional attributes or context. In this type of join, each incoming event from the stream is matched against the corresponding entries in the table based on specified key attributes. Stream-table joins enable real-time enrichment of streaming data with reference data, allowing organizations to augment streaming events with relevant metadata or lookup information. Common use cases for stream-table joins include real-time personalization, dynamic pricing, and anomaly detection, where enriched streaming data enhances decision-making capabilities.

  • ksqlDB supports INNER and LEFT joins between a stream and a table.
  • An OUTER join is not available because it can’t be defined with sound semantics.
  • Stream-table joins are always non-windowed joins.
  • Only events arriving on the stream side trigger downstream updates and produce join output.
  • Only input records for the left-side stream trigger the join
  • Updates on the table side don’t produce updated join output.

Table — Table Joins

Table-table joins involve joining two or more static datasets (tables) together based on common attributes to generate aggregated or enriched datasets. Unlike stream-stream and stream-table joins, table-table joins operate on bounded datasets, making them suitable for batch processing or offline analytics tasks. In this type of join, rows from different tables are matched based on specified join keys, allowing organizations to perform complex data transformations, aggregations, and analysis across multiple datasets. Table-table joins are commonly used in data warehousing, business intelligence, and reporting applications, where historical data analysis and cross-dataset correlation are essential for generating actionable insights.

  • ksqlDB supports primary-key (1:1) as well as foreign-key (1:N) joins between tables.
  • Many-to-many (N:M) joins are not supported currently.
  • For a foreign-key join, you can use any left table column in the join condition to join it with the primary-key of the right table.
  • For primary-key joins INNER, LEFT OUTER, and FULL OUTER joins are supported. For foreign-key joins INNER and LEFT OUTER joins are supported.
  • Table-table joins are always non-windowed joins.
  • Table-table joins are eventually consistent.

Great, you are starting to know more on KSQL.

Stream-stream, stream-table, and table-table joins are fundamental operations in stream processing and data analytics, enabling organizations to correlate, enrich, and analyze streaming and static datasets in real-time and batch processing scenarios. By understanding the characteristics and use cases of each type of join, organizations can leverage the power of stream processing to unlock valuable insights from their data streams and drive informed decision-making in dynamic and fast-paced environments.

Stay tuned for the next blog to learn more on KSQL

Happy Learning !

--

--