Data Pipelines & APIs

Engineering Blog · Konrad Chmieliński

How I design, build, and maintain end-to-end data flows — from raw source to actionable output.

RPA / AI Engineering · Python (5+ yrs) · Azure · REST · ETL · Katowice, PL · Remote OK

Data pipelines are the connective tissue of every automation project I’ve built. Whether extracting chemical research data from PDFs at ArcelorMittal or routing five years of invoice transactions through a duplicate-detection engine — the engineering challenge is always the same: move data reliably, transform it cleanly, and deliver it where decisions happen.

This post walks through the exact stack I use at each layer of a pipeline — with real-world context from production deployments.

Pipeline architecture overview

pattern I use
Source → Extract → Transform → Validate → Load → Orchestrate → Monitor

Extraction layer

where data is born

Every pipeline starts with pulling data from a heterogeneous mix of sources. In my day-to-day work these include SAP GUI interfaces, REST APIs, Excel files, PDF reports, and web portals — often within the same pipeline.

Python requests · REST APIs · Microsoft Graph API · BeautifulSoup · Selenium / XPath · SAP GUI scripting · PyAutoGUI · UIAutomation · JSON / XML parsing · openpyxl / xlrd
# Typical multi-source extract pattern (ArcelorMittal invoice dedup pipeline)
import requests
import pandas as pd
from pydantic import BaseModel

class InvoiceSchema(BaseModel):  # illustrative fields
    invoice_id: str
    amount: float

response = requests.get("https://api.internal/invoices", headers=auth_headers)  # auth_headers built earlier in the flow
response.raise_for_status()  # fail fast on a bad fetch
df = pd.DataFrame(response.json()["data"])
validated = [InvoiceSchema.parse_obj(row) for row in df.to_dict("records")]
At ING (GKYC) I built two RPA bots handling secure data exchange and portal report retrieval — these required coordinating REST API calls, session management, and filesystem writes in a single orchestrated flow using Automation Anywhere A360.

OCR & document intelligence

unstructured → structured

A large share of enterprise data lives in unstructured documents — scanned PDFs, identity documents, research reports, contracts. I’ve deployed production-grade document intelligence pipelines using both on-premise OCR and cloud-native services.

Document Intelligence (Cloud · Azure): Form Recognizer + custom models for structured field extraction
ABBYY FlexiCapture (On-premise): Batch OCR with trained document classifiers
ABBYY Vantage (Cloud): Intelligent document processing with ML skill training
Azure OCR (Cloud · Azure): Computer Vision API for raw text extraction from scanned pages
At ArcelorMittal I maintain an Azure-based pipeline extracting and consolidating data from diverse chemical research reports — overseeing Document Intelligence workflows and ensuring accurate integration with centralized databases.
At ING I led development of an automated pipeline for power of attorney and identity documents across NL/FR/BE/DE markets — multi-format OCR + signature presence detection — achieving 70% processing accuracy and substantially reducing manual verification effort.
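
In practice the Document Intelligence step itself is compact. Below is a minimal sketch using the azure-ai-formrecognizer SDK; the endpoint, key, and custom model ID are placeholders, not the production configuration.

# Minimal Document Intelligence sketch; endpoint, key, and model ID are placeholders.
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<key>"),
)

with open("report.pdf", "rb") as f:
    poller = client.begin_analyze_document("chem-report-model", document=f)
result = poller.result()

for doc in result.documents:
    for name, field in doc.fields.items():
        # Each field carries a confidence score, which is what lets you
        # route low-confidence extractions to manual review.
        print(name, field.value, field.confidence)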

Transform & validate

clean before load

Raw extracted data is never load-ready. I use Pandas for tabular wrangling, NumPy/SciPy for numerical transformations, and Pydantic for schema validation — catching bad records before they corrupt downstream databases.

Pandas · NumPy · Pydantic · SciPy · Pillow / OpenCV · Scikit-Learn · regex / string normalization · deduplication logic
Global invoice dedup (ArcelorMittal India): developed an automated solution analysing five years of transactional data to identify potential duplicates — custom hashing logic + Pandas merge operations across millions of rows, with Pydantic models enforcing schema contracts at every stage.
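
A simplified version of that hashing pass looks like this; the column names and file path are illustrative, not the production schema.

# Simplified duplicate-detection pass: hash the identifying fields,
# then flag every hash that occurs more than once.
import hashlib
import pandas as pd

def invoice_hash(row: pd.Series) -> str:
    key = f"{row['vendor_id']}|{row['invoice_number']}|{row['amount']:.2f}"
    return hashlib.sha256(key.encode()).hexdigest()

df = pd.read_parquet("invoices_5yr.parquet")  # five years of transactions
df["dedup_hash"] = df.apply(invoice_hash, axis=1)

# Any hash seen more than once is a duplicate candidate for review.
duplicates = df[df.duplicated("dedup_hash", keep=False)].sort_values("dedup_hash")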

Load & storage layer

persistence strategy

Output targets vary by use case: SQL databases for structured reporting, MongoDB for flexible document storage, Azure Blob/cloud storage for file-based pipelines, Excel or SharePoint for business-user-facing deliverables, and web applications via REST.

SQL (Relational): Primary store for structured reporting and analytics
MongoDB (Document store): Flexible schemas for semi-structured pipeline outputs
Solr · Vector DBs (Search): Full-text search; Chroma / Pinecone for RAG pipelines
Azure Blob · SharePoint (File / cloud): Report delivery and file-based integration targets
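
As a sketch of the load step, here is one validated frame written to two of those targets; the connection strings, table, and collection names are placeholders.

# One validated DataFrame, two typical targets.
import pandas as pd
from sqlalchemy import create_engine
from pymongo import MongoClient

df = pd.DataFrame(validated_records)  # output of the validate stage

# Relational target for structured reporting.
engine = create_engine("postgresql://user:pass@reporting-db/analytics")
df.to_sql("invoice_duplicates", engine, if_exists="append", index=False)

# Document target for semi-structured outputs.
mongo = MongoClient("mongodb://pipeline-store:27017")
mongo["pipelines"]["invoice_duplicates"].insert_many(df.to_dict("records"))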

Orchestration & CI/CD

keeping it running

Pipelines that aren’t monitored aren’t pipelines — they’re one-time scripts. I use Jenkins for CI/CD and scheduled automation, Automation Anywhere (A11 & A360) as an enterprise orchestration layer, and Python-native schedulers for lightweight flows. Version control through Git keeps all pipeline code reviewable and deployable.

Jenkins CI/CD · Automation Anywhere A360 · SAIO (ING proprietary) · Git / branching strategy · LangGraph (agentic flows) · LangChain · Azure Functions (conceptual) · Cron / APScheduler
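
For the lightweight flows, a Python-native scheduler can be as small as this APScheduler sketch; the pipeline function and the 06:00 daily trigger are illustrative.

# Lightweight Python-native scheduling with APScheduler.
from apscheduler.schedulers.blocking import BlockingScheduler

def run_pipeline() -> None:
    # extract -> transform -> validate -> load, as described above
    ...

scheduler = BlockingScheduler()
scheduler.add_job(run_pipeline, "cron", hour=6, minute=0)  # daily at 06:00
scheduler.start()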

AI-enhanced pipeline patterns

LLM-in-the-loop

The most interesting pipelines I’ve worked on recently embed LLMs as processing steps — not as chatbots, but as transformation engines within a larger orchestrated flow. This includes prompt engineering for structured output, RAG over internal knowledge bases, and tool-calling agents that decide which pipeline branches to execute.

GPT-4 / GPT-4o (Models): Azure OpenAI for structured output, entity extraction, classification
Meta Llama (Open-source): Self-hosted inference for on-premise data scenarios
LangChain · LangGraph (Frameworks): Agentic pipeline composition with memory and tool calling
RAG Architecture (Retrieval): Chunking, embeddings, semantic search over enterprise docs
AI Client Offer Generation (ArcelorMittal Luxembourg, 2024): built an LLM-driven pipeline using Azure OpenAI to generate personalised client proposals from historical order data + new inquiries. The pipeline fetched data via REST, ran GPT-4o inference with structured prompts, formatted the output, and automated email dispatch — achieving ~60% accuracy and significantly reducing sales team workload.
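
A stripped-down sketch of that inference step, using the openai SDK against Azure OpenAI; the endpoint, deployment name, and prompt fields are illustrative, not the production prompt.

# GPT-4o as a transformation step, constrained to JSON output.
import json
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<resource>.openai.azure.com/",
    api_key="<key>",
    api_version="2024-02-01",
)

inquiry_text = open("inquiry.txt").read()  # new client inquiry, already fetched via REST

completion = client.chat.completions.create(
    model="gpt-4o",  # Azure deployment name
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Extract the offer fields as JSON: product, quantity, target_price."},
        {"role": "user", "content": inquiry_text},
    ],
)
offer = json.loads(completion.choices[0].message.content)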

Reporting & visualisation output

end of pipeline

The final stage of any pipeline is making data consumable. I deliver results via Streamlit dashboards, Matplotlib/Seaborn charts embedded in reports, or direct integration with web application portals — depending on the audience.

Streamlit · Matplotlib · Seaborn · Excel report generation · PDF report output · REST API write-back · SharePoint integration
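
A minimal Streamlit sketch of the kind of dashboard that sits at the end of the dedup pipeline; the file path and column names are assumptions.

# Minimal dashboard over a pipeline output table.
import pandas as pd
import streamlit as st

st.title("Invoice duplicate candidates")

df = pd.read_parquet("output/duplicates.parquet")  # pipeline output, path illustrative
vendor = st.selectbox("Vendor", sorted(df["vendor_id"].unique()))

subset = df[df["vendor_id"] == vendor]
st.metric("Duplicate candidates", len(subset))
st.dataframe(subset)

Run it with streamlit run app.py; Streamlit re-executes the script on every interaction, which suits read-only views over pipeline outputs.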

Where I’ve built this in production

employer context
ArcelorMittal BCoE Poland · Oct 2023 – present
Global market forecasting pipeline (Excel ETL → SQL → web app), Azure Document Intelligence for chemical reports, invoice dedup over a five-year transaction history, HR contract generation automation, compliance report pipeline from multimedia sources.

ArcelorMittal Luxembourg · Mar – Jul 2024 (secondment)
End-to-end LLM pipeline: REST data fetch → GPT-4o inference → structured output → automated email dispatch. Full agentic flow using LangChain with tool calling.

ING Hubs Poland (GKYC) · Nov 2021 – Oct 2023
Risk classifier pipeline (ML model training + inference), OCR document intelligence pipeline for four Western European markets, RPA bots for secure data exchange across portals. SAIO-based GUI automation for internal process flows.
