Posted on May 22, 2020
I've recently been working on an Apache Kafka/Confluent data pipeline to analyse event streams. I decided to use Google Cloud BigQuery for the data analysis as it seemed to be easy to get set up with and extremely powerful. But to get up and running I'd need to backfill all my existing data. I also decided to add it to a time-partitioned table to increase performance and reduce costs.
Posted on Mar 14, 2019
Odd one this, and one that took me a little while to debug. I recently set up a Confluent/Kafka data pipeline with transformations being handled by KSQL and data being produced by an application written in Go. As part of the test process I persisted data using a MongoDB Sink connector. The command line producers had no problems and producing a large file would persist the expected data to MongoDB. However, I ran into issues when producing from Golang, I would notice that somewhere between 7% and 12% of the messages were being persisted to MongoDB, the others were lost somewhere in the processing.