Task processing architecture

Ylem uses Apache Kafka topics to exchange pipeline task messages during processing. Several services are connected to Kafka, either producing messages to these topics or consuming messages from them.
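
The exact message schema is defined inside the backend services; purely as an illustration, a task message envelope could look like the following Go sketch (all field names here are assumptions, not the actual Ylem schema):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaskRunMessage is a hypothetical envelope for one task run exchanged
// over topics such as task_runs. The real schema lives in the backend code.
type TaskRunMessage struct {
	PipelineRunUUID string          `json:"pipeline_run_uuid"` // groups all tasks of one pipeline run
	PipelineUUID    string          `json:"pipeline_uuid"`
	TaskUUID        string          `json:"task_uuid"`
	TaskType        string          `json:"task_type"` // e.g. "query", "condition", "notification"
	Input           json.RawMessage `json:"input"`     // output of the previous task, if any
}

func main() {
	msg := TaskRunMessage{PipelineUUID: "p-1", TaskUUID: "t-1", TaskType: "query"}
	b, _ := json.Marshal(msg)
	fmt.Println(string(b))
}
```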

The high-level architecture looks like this:

[Diagram: Pipeline task processing architecture]

The processing cycle for every task consists of 10 steps:

Step 1

The service ylem_pipelines_schedule_generator (./backend/pipelines) reads pipeline schedules for the next 48 hours, prepares scheduled runs, and adds the ones that are not yet planned to the database table pipelines.scheduled_runs.

This service is unaware of Apache Kafka and only works with the database.
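
As an illustration only, the core loop of such a generator could look like the sketch below; apart from the table pipelines.scheduled_runs mentioned above, every table, column, DSN, and library choice here (the MySQL driver, robfig/cron) is an assumption:

```go
// Sketch of a schedule generator: read cron-style pipeline schedules,
// materialize runs for the next 48 hours, and insert those not yet
// planned into pipelines.scheduled_runs. Columns, the DSN, and the
// cron library are assumptions for illustration.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
	"github.com/robfig/cron/v3"
)

func generate(db *sql.DB) error {
	rows, err := db.Query(`SELECT uuid, schedule FROM pipelines.pipelines WHERE schedule <> ''`)
	if err != nil {
		return err
	}
	defer rows.Close()

	horizon := time.Now().Add(48 * time.Hour)
	for rows.Next() {
		var uuid, spec string
		if err := rows.Scan(&uuid, &spec); err != nil {
			return err
		}
		sched, err := cron.ParseStandard(spec)
		if err != nil {
			log.Printf("pipeline %s: invalid schedule %q: %v", uuid, spec, err)
			continue
		}
		// Walk the schedule forward; INSERT IGNORE skips runs already planned.
		for t := sched.Next(time.Now()); t.Before(horizon); t = sched.Next(t) {
			if _, err := db.Exec(
				`INSERT IGNORE INTO pipelines.scheduled_runs (pipeline_uuid, execute_at) VALUES (?, ?)`,
				uuid, t,
			); err != nil {
				return err
			}
		}
	}
	return rows.Err()
}

func main() {
	db, err := sql.Open("mysql", "user:password@tcp(localhost:3306)/pipelines?parseTime=true")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := generate(db); err != nil {
		log.Fatal(err)
	}
}
```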

Steps 2-3

When the time comes to run a pipeline, the next service ylem_pipelines_schedule_publisher (./backend/pipelines) reads schedules from the table pipelines.scheduled_runs, defines the initial task(s) for each pipeline, and sends these tasks to the Kafka topic task_runs.

These first three steps apply only to pipelines run on a schedule. In all other cases, such as a manual run, an API call, or an external trigger, the initial task(s) are published directly to the Kafka topic task_runs.
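
A minimal sketch of the publishing side, using the segmentio/kafka-go client; the client choice and the payload fields are assumptions, only the topic name task_runs comes from the architecture above:

```go
// Sketch of publishing a pipeline's initial task to the task_runs topic.
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/segmentio/kafka-go"
)

func publishInitialTask(ctx context.Context, w *kafka.Writer, pipelineUUID, taskUUID string) error {
	payload, err := json.Marshal(map[string]string{
		"pipeline_uuid": pipelineUUID,
		"task_uuid":     taskUUID,
	})
	if err != nil {
		return err
	}
	return w.WriteMessages(ctx, kafka.Message{
		Key:   []byte(pipelineUUID), // keeps tasks of one pipeline together
		Value: payload,
	})
}

func main() {
	w := &kafka.Writer{Addr: kafka.TCP("localhost:9092"), Topic: "task_runs"}
	defer w.Close()
	if err := publishInitialTask(context.Background(), w, "pipeline-uuid", "task-uuid"); err != nil {
		log.Fatal(err)
	}
}
```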

Steps 4-5

ylem_loadbalancer is a service that reads tasks from the topic task_runs and determines which partition each task should be sent to for the most effective processing, "most effective" here meaning the least loaded at the moment.

This service has access to the statistics of each pipeline's previous runs and therefore knows which pipelines are expected to be "slow" and which are most likely not. Typically, 60% of the partitions are reserved for the "slow" pipelines and 40% for the fast ones. The decision on which partition will run the task is made by ylem_loadbalancer in these two steps.

When the partition is determined, ylem_loadbalancer forwards the task to the next topic, task_runs_load_balanced, for the actual processing.
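
The exact balancing logic lives in ylem_loadbalancer; the sketch below only illustrates the idea of splitting partitions into a "slow" and a "fast" pool and picking the least-loaded one within it (the load slice is an assumed stand-in for the real run statistics):

```go
package main

import "fmt"

// pickPartition chooses the least-loaded partition within the "slow" or
// "fast" pool. The 60/40 split mirrors the description above; the load
// slice (messages per partition) is an illustrative assumption.
func pickPartition(load []int, slow bool) int {
	split := len(load) * 6 / 10 // partitions [0, split) form the "slow" pool
	lo, hi := 0, split
	if !slow {
		lo, hi = split, len(load)
	}
	best := lo
	for p := lo + 1; p < hi; p++ {
		if load[p] < load[best] {
			best = p
		}
	}
	return best
}

func main() {
	load := []int{4, 0, 7, 2, 9, 1, 3, 5, 8, 6} // current load of 10 partitions
	fmt.Println(pickPartition(load, true))      // least-loaded "slow" partition: 1
	fmt.Println(pickPartition(load, false))     // least-loaded "fast" partition: 6
}
```

With a client like kafka-go, the chosen partition could then be applied through a custom Balancer on the writer that publishes to task_runs_load_balanced.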

Steps 6-7

ylem_taskrunner is the service that does the actual processing of tasks. In step 6, it reads them from the task_runs_load_balanced topic, processes each one, and sends the result to three other topics (a minimal sketch of this loop follows the list):

  • task_run_results

  • query_task_run_results

  • notification_task_run_results
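
A minimal consume-process-produce sketch of this loop with kafka-go; processTask is a placeholder, and the extra routing to query_task_run_results and notification_task_run_results is omitted for brevity:

```go
// Sketch of a task runner: consume from task_runs_load_balanced,
// process each task, and publish the result to task_run_results.
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func processTask(task []byte) []byte { return task } // placeholder for the real task logic

func main() {
	ctx := context.Background()
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "ylem_taskrunner",
		Topic:   "task_runs_load_balanced",
	})
	defer r.Close()
	w := &kafka.Writer{Addr: kafka.TCP("localhost:9092"), Topic: "task_run_results"}
	defer w.Close()

	for {
		m, err := r.ReadMessage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		out := processTask(m.Value)
		if err := w.WriteMessages(ctx, kafka.Message{Key: m.Key, Value: out}); err != nil {
			log.Fatal(err)
		}
	}
}
```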

Steps 8-9

The next service, ylem_pipelines_trigger_listener, listens for task run results on the task_run_results topic.

For each new result, it checks whether the task ran successfully and whether there is a task to be triggered next. If there is, it sends that task to the task_runs topic again, with the output of the previous task as the input for the next one.
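
Conceptually, the listener is another consume-and-republish loop. In this sketch the result fields (is_successful, output) and the nextTasks lookup are assumptions; only the topic names come from the steps above:

```go
// Sketch of the trigger listener: consume task results, look up the next
// task(s) in the pipeline, and republish them to task_runs with the
// previous output as input.
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/segmentio/kafka-go"
)

// taskResult mirrors only the fields this sketch needs; the real result
// schema in Ylem may differ.
type taskResult struct {
	PipelineUUID string          `json:"pipeline_uuid"`
	TaskUUID     string          `json:"task_uuid"`
	IsSuccessful bool            `json:"is_successful"`
	Output       json.RawMessage `json:"output"`
}

// nextTasks would consult the pipeline definition in the database; stubbed here.
func nextTasks(pipelineUUID, taskUUID string) []string { return nil }

func main() {
	ctx := context.Background()
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "ylem_pipelines_trigger_listener",
		Topic:   "task_run_results",
	})
	defer r.Close()
	w := &kafka.Writer{Addr: kafka.TCP("localhost:9092"), Topic: "task_runs"}
	defer w.Close()

	for {
		m, err := r.ReadMessage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		var res taskResult
		if err := json.Unmarshal(m.Value, &res); err != nil || !res.IsSuccessful {
			continue // malformed or failed: nothing to trigger
		}
		for _, next := range nextTasks(res.PipelineUUID, res.TaskUUID) {
			task, _ := json.Marshal(map[string]any{
				"pipeline_uuid": res.PipelineUUID,
				"task_uuid":     next,
				"input":         res.Output, // previous output becomes the next input
			})
			if err := w.WriteMessages(ctx, kafka.Message{Key: m.Key, Value: task}); err != nil {
				log.Fatal(err)
			}
		}
	}
}
```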

Step 10

ylem_pipelines_trigger_listener is not the only service that listens to the task_run_results topic.

Read more about various ways to trigger pipelines here.

The last two topics, query_task_run_results and notification_task_run_results, contain additional information about the results of the Query and Notification tasks. Currently, this information can only be used for logging and debugging purposes.

At the same time, the task_run_results topic is listened to by ylem_statistics_result_listener, which saves the results to its own ClickHouse database, from where they are served to the UI dashboard, statistics, or profiling logs.
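
A sketch of that statistics side with the clickhouse-go v2 client; the table task_run_stats and its columns are assumptions for illustration:

```go
// Sketch of the statistics listener's write path: store one row per
// finished task run in ClickHouse for the statistics and profiling UI.
package main

import (
	"context"
	"log"
	"time"

	"github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
	conn, err := clickhouse.Open(&clickhouse.Options{
		Addr: []string{"localhost:9000"},
	})
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()
	// Hypothetical table: one row per task run, queried later by the dashboards.
	err = conn.Exec(ctx,
		`INSERT INTO task_run_stats (pipeline_uuid, task_uuid, is_successful, duration_ms, finished_at)
		 VALUES (?, ?, ?, ?, ?)`,
		"pipeline-uuid", "task-uuid", true, 125, time.Now(),
	)
	if err != nil {
		log.Fatal(err)
	}
}
```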
