Data Issues & Incidents
One use case for data streaming pipelines is detecting missing data in data sets and reporting it as quickly as possible.
A typical pipeline is straightforward: get the expected data, check whether a condition is true or false, and take action. In this particular example, the Condition block checks whether the data set is empty.
In an application with a multistep user onboarding process, some of the steps might be skipped or shortcut manually by supporting sales teams.
Even so, it is important to keep all data consistent so that nothing goes missing later.
As an example, the following pipeline checks that every user in the state "complete" has a questionnaire in the state "filled".
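The check behind such a pipeline can be sketched in plain Python. This is only an illustration, not Ylem's own API: the `users` and `questionnaires` tables, their columns, and the in-memory SQLite database are assumptions made for the example.

```python
import sqlite3


def find_inconsistent_users(conn):
    """Return ids of users in state 'complete' whose questionnaire is not 'filled'.

    Table and column names are hypothetical; adapt them to your own schema.
    """
    rows = conn.execute(
        """
        SELECT u.id
        FROM users AS u
        LEFT JOIN questionnaires AS q ON q.user_id = u.id
        WHERE u.state = 'complete'
          AND (q.state IS NULL OR q.state <> 'filled')
        ORDER BY u.id
        """
    ).fetchall()
    return [row[0] for row in rows]


def build_demo_db():
    """Create a small in-memory database with made-up rows for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, state TEXT NOT NULL);
        CREATE TABLE questionnaires (user_id INTEGER, state TEXT NOT NULL);
        INSERT INTO users VALUES
            (1, 'complete'), (2, 'complete'), (3, 'pending'), (4, 'complete');
        INSERT INTO questionnaires VALUES
            (1, 'filled'), (2, 'draft'), (3, 'draft');
        """
    )
    return conn


if __name__ == "__main__":
    # User 2 has a draft questionnaire, user 4 has none at all.
    print(find_inconsistent_users(build_demo_db()))
```

The `LEFT JOIN` is a deliberate choice: it also catches completed users who have no questionnaire row at all, which an inner join would silently skip. A non-empty result is the condition on which the pipeline would raise an alert.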
When creating a table, you can specify that a column must not contain NULLs. In practice, such a strict requirement often has to be relaxed, for example when data in a column can be NULL up to a certain moment but should not be NULL after a certain action.
For example, consider user onboarding in a typical FinTech company. When a user is created in the database, fields like scoring or risk class can be NULL, but once a scoring report has been pulled from the 3rd-party scoring provider, these fields should be filled, unless the scoring provider does not provide them. Such situations can be monitored with Ylem.
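A minimal sketch of such a "NULL only until a deadline" check follows. It assumes a hypothetical `users` table with `scoring`, `risk_class`, and a `created_at` column stored as a Unix timestamp, plus a grace period chosen by you; none of these names come from Ylem itself.

```python
import sqlite3


def find_unscored_users(conn, now_ts, grace_seconds):
    """Return ids of users still missing scoring data after the grace period.

    A user is flagged when scoring or risk_class is NULL and the user was
    created more than grace_seconds before now_ts (both in Unix seconds).
    """
    cutoff = now_ts - grace_seconds
    rows = conn.execute(
        """
        SELECT id
        FROM users
        WHERE (scoring IS NULL OR risk_class IS NULL)
          AND created_at < ?
        ORDER BY id
        """,
        (cutoff,),
    ).fetchall()
    return [row[0] for row in rows]


def build_demo_db(now_ts):
    """Create an in-memory table with made-up users for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        "id INTEGER PRIMARY KEY, scoring INTEGER, risk_class TEXT, "
        "created_at INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?, ?)",
        [
            (1, 710, "A", now_ts - 7200),    # scored, fine
            (2, None, None, now_ts - 7200),  # old and still unscored
            (3, None, None, now_ts - 60),    # new, still within the grace period
            (4, 650, None, now_ts - 7200),   # old, risk class missing
        ],
    )
    return conn


if __name__ == "__main__":
    now = 1_700_000_000  # fixed timestamp so the demo is reproducible
    print(find_unscored_users(build_demo_db(now), now, grace_seconds=3600))
```

The sketch does not cover the case where the provider legitimately returns nothing. One way to handle it, again as an assumption, would be an extra column recording that the report was pulled and came back empty, and excluding those rows in the `WHERE` clause.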
Missing or broken data can have various causes:
Broken data pipelines
Broken 3rd-party or internal APIs
Bugs in the code that produces such data
With Ylem you can detect missing data arising from all three causes. To do that, run the following simple pipeline, for example, once per minute:
Retrieve expected data
Compare it with what you expect to get, for example, that the number of items is higher than 0.
If it doesn't match, send a notification alert or take any other necessary action.
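The three steps above map onto a very small function. The sketch below is plain Python rather than a Ylem pipeline definition; `run_check` and its three callables are hypothetical names, and the scheduling (for example, once per minute) is left to whatever runs the check.

```python
from typing import Any, Callable


def run_check(
    retrieve: Callable[[], Any],
    expectation: Callable[[Any], bool],
    on_failure: Callable[[Any], None],
) -> bool:
    """Run one monitoring cycle and return True if the expectation held.

    retrieve    -- step 1: fetch the data to inspect
    expectation -- step 2: return True when the data looks as expected
    on_failure  -- step 3: notify or act when the expectation is not met
    """
    data = retrieve()
    ok = expectation(data)
    if not ok:
        on_failure(data)
    return ok


if __name__ == "__main__":
    alerts = []

    # Simulate a query that unexpectedly returned no rows.
    run_check(
        retrieve=lambda: [],
        expectation=lambda items: len(items) > 0,
        on_failure=lambda items: alerts.append("expected at least one item"),
    )
    print(alerts)
```

Keeping retrieval, expectation, and action as separate callables means the same runner can serve an emptiness check, a consistency check, or a NULL check by swapping only the first two arguments.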
In many cases, the root cause of missing data is not a bug in the code but a broken data pipeline. To catch this, you can define an expectation for how much time may pass between the creation of items in a certain table of your data storage. If this threshold is exceeded, Ylem will notify your data engineers.
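Such a freshness expectation comes down to comparing the age of the newest item with a threshold. The following is a minimal sketch in plain Python; `is_stale`, its parameters, and the 15-minute threshold are illustrative assumptions, and in practice `last_created_at` would come from something like a `MAX(created_at)` query on the monitored table.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    last_created_at: Optional[datetime],
    now: datetime,
    max_gap: timedelta,
) -> bool:
    """Return True when no item has been created within max_gap before now.

    A table with no items at all (last_created_at is None) counts as stale.
    """
    if last_created_at is None:
        return True
    return now - last_created_at > max_gap


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    newest_item = datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc)

    # The newest item is 30 minutes old, which exceeds a 15-minute threshold.
    if is_stale(newest_item, now, max_gap=timedelta(minutes=15)):
        print("No new items for more than 15 minutes; notify data engineers")
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the check easy to test. Both timestamps should be timezone-aware (or both naive), since Python refuses to subtract one kind from the other.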