Data Issues & Incidents

Monitoring of missing data

One of the use cases of data streaming pipelines is to monitor and share information about missing data in data sets as fast as possible.

The potential pipeline is pretty straightforward: get expected data, check if the condition is true or false, and take action. In this particular example, the Condition block checks if the data set is empty or not.
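Outside Ylem, the same check can be sketched in plain Python. This is a minimal illustration, not Ylem's implementation: the `orders` table and the query are hypothetical, and an in-memory SQLite database stands in for your real data source.

```python
import sqlite3

# In-memory demo database standing in for your real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, created_at TEXT)")

def dataset_is_empty(connection, query):
    """Run the query and report whether it returned no rows at all,
    which is the check the Condition task performs in the pipeline."""
    return connection.execute(query).fetchone() is None

# The expected data is missing, so the condition fires and we alert.
if dataset_is_empty(conn, "SELECT id FROM orders WHERE created_at >= date('now')"):
    print("ALERT: no orders received today")
```

In a real pipeline, the "take action" step would be a Notification task instead of a `print`.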

Monitoring of data consistency

We all know that when an application has a multistep user onboarding process, some of the steps might be skipped or completed manually by supporting sales teams.

However, it is important to keep all the data consistent to avoid parts of it being missed in the future.

As an example, the following pipeline monitors that for every user in the state "complete", the questionnaire is in the state "filled".
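The consistency check behind such a pipeline boils down to one query. The sketch below uses hypothetical `users` and `questionnaires` tables in an in-memory SQLite database; every returned row is an inconsistency to report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, state TEXT);
CREATE TABLE questionnaires (user_id INTEGER, state TEXT);
INSERT INTO users VALUES (1, 'complete'), (2, 'complete');
INSERT INTO questionnaires VALUES (1, 'filled');  -- user 2 has none
""")

# A row is returned for every 'complete' user whose questionnaire
# is either missing or not yet in the state 'filled'.
INCONSISTENT = """
SELECT u.id
FROM users u
LEFT JOIN questionnaires q ON q.user_id = u.id
WHERE u.state = 'complete'
  AND (q.state IS NULL OR q.state != 'filled')
"""

rows = conn.execute(INCONSISTENT).fetchall()
if rows:
    print(f"ALERT: {len(rows)} complete user(s) without a filled questionnaire")
```

The `LEFT JOIN` matters here: an inner join would silently hide users who have no questionnaire row at all.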

Detecting NULLs instead of data

Indeed, while creating a table you can specify that a column must not contain NULLs. But it is often necessary to relax such a strict requirement: for example, when data in a column may be NULL until a certain moment but must not be NULL after a certain action.

For example, let's take a look at user onboarding in a typical FinTech company. When a user is created in the database, fields like scoring or risk class can be NULL, but after some time, when a scoring report is pulled from the third-party scoring provider, these fields should be filled, unless the scoring provider does not return them. Such situations can be monitored with Ylem.
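A minimal sketch of that check, under the assumption that a hypothetical `users` table records when the scoring report was pulled: any user whose report has arrived but whose `risk_class` is still NULL is a violation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id INTEGER,
    scoring_pulled_at TEXT,  -- when the scoring report arrived
    risk_class TEXT          -- may be NULL only before that moment
);
INSERT INTO users VALUES
    (1, NULL, NULL),           -- fine: report not pulled yet
    (2, '2024-01-10', 'low'),  -- fine: report pulled, class filled
    (3, '2024-01-11', NULL);   -- violation: pulled but still NULL
""")

VIOLATIONS = """
SELECT id FROM users
WHERE scoring_pulled_at IS NOT NULL
  AND risk_class IS NULL
"""
bad = [row[0] for row in conn.execute(VIOLATIONS)]
print(bad)
```

The same pattern works for any "NULL allowed before event X, forbidden after" rule: encode the event as a timestamp or status column and filter on it.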

Detecting bugs in code with monitoring of missing data

Missing or broken data can be caused by various reasons:

  • Broken data pipelines

  • Broken 3rd-party or internal APIs

  • Bugs in the code that produces such data

With Ylem you can easily detect all three root causes. To do that, run the following simple pipeline, for example, once per minute:

  • Retrieve expected data

  • Compare it with what you expect to get; for example, check that the number of items is higher than 0

  • If it doesn't match, send a notification alert or take any other necessary action
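The three steps above can be sketched as a single reusable function. The table, the expectation, and the alert callback are all placeholders; scheduling it once per minute would be handled by Ylem's scheduler rather than by this code.

```python
import sqlite3

def run_check(connection, query, expect_min_rows, alert):
    """One pass of the pipeline: retrieve the data, compare the row
    count against the expectation, and alert if it does not match."""
    rows = connection.execute(query).fetchall()
    if len(rows) < expect_min_rows:
        alert(f"expected at least {expect_min_rows} row(s), got {len(rows)}")
        return False
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER)")

# The table is empty, so this pass fails and the alert fires.
run_check(conn, "SELECT id FROM events", expect_min_rows=1, alert=print)
```

Whether the root cause turns out to be a broken pipeline, a broken API, or a bug in the producing code, the detection step is the same; only the follow-up investigation differs.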

Detection of broken data pipelines

In many cases, the root cause of missing data is not a bug in the code but a broken data pipeline. To control this, you can define an expectation for how much time may pass between the creation of items in a certain table of your data storage. If this threshold is exceeded, Ylem notifies your data engineers.
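The freshness check itself is small: take the latest creation timestamp, compute its age in seconds, and compare it against the threshold. The `items` table, its Unix-timestamp `created_at` column, and the 24-hour threshold below are illustrative assumptions.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, created_at INTEGER)")  # unix seconds
# Simulate a stalled pipeline: the newest item is 30 hours old.
conn.execute("INSERT INTO items VALUES (1, ?)", (int(time.time()) - 30 * 3600,))

MAX_AGE_SECONDS = 24 * 3600  # the expectation: at least one new item per day

# Difference in seconds between the last item's creation and now.
(last_created,) = conn.execute("SELECT MAX(created_at) FROM items").fetchone()
age = int(time.time()) - last_created

if age > MAX_AGE_SECONDS:
    print(f"ALERT for data engineers: no new items for {age // 3600} hours")
```

In Ylem the same logic maps onto a Query task (the `MAX(created_at)` query), a Condition task (the threshold comparison), and a Notification task (the alert).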

[Screenshots: a possible pipeline and a possible SQL query; a query example; an example of a pipeline with an SQL query that calculates the difference in seconds between the last item's creation and now, an expression that checks whether that value is higher than 24 hours, and a notification message for data engineers]