Data Pipelines & APIs

Engineering Blog · Konrad Chmieliński

How I design, build, and maintain end-to-end data flows — from raw source to actionable output.

RPA / AI Engineering · Python (5+ yrs) · Azure · REST · ETL · Katowice, PL · Remote OK

Data pipelines are the connective tissue of every automation project I’ve built. Whether extracting chemical research data from PDFs at ArcelorMittal or routing five years of invoice transactions through a duplicate-detection engine — the engineering challenge is always the same: move data reliably, transform it cleanly, and deliver it where decisions happen.

This post walks through the exact stack I use at each layer of a pipeline — with real-world context from production deployments.

Pipeline architecture overview

pattern I use
Source → Extract → Transform → Validate → Load → Orchestrate → Monitor

Extraction layer

where data is born

Every pipeline starts with pulling data from a heterogeneous mix of sources. In my day-to-day work these include SAP GUI interfaces, REST APIs, Excel files, PDF reports, and web portals — often within the same pipeline.

Python requests · REST APIs · Microsoft Graph API · BeautifulSoup · Selenium / XPath · SAP GUI scripting · PyAutoGUI · UIAutomation · JSON / XML parsing · openpyxl / xlrd
# Typical multi-source extract pattern (ArcelorMittal invoice dedup pipeline)
import requests
import pandas as pd
from pydantic import BaseModel

class InvoiceSchema(BaseModel):  # illustrative fields
    invoice_id: str
    amount: float

response = requests.get("https://api.internal/invoices", headers=auth_headers)  # auth_headers built earlier in the flow
response.raise_for_status()  # fail fast on a bad fetch
df = pd.DataFrame(response.json()["data"])
validated = [InvoiceSchema.parse_obj(row) for row in df.to_dict("records")]
At ING (GKYC) I built two RPA bots handling secure data exchange and portal report retrieval — these required coordinating REST API calls, session management, and filesystem writes in a single orchestrated flow using Automation Anywhere A360.

OCR & document intelligence

unstructured → structured

A large share of enterprise data lives in unstructured documents — scanned PDFs, identity documents, research reports, contracts. I’ve deployed production-grade document intelligence pipelines using both on-premise OCR and cloud-native services.

Document Intelligence (Cloud · Azure): Form Recognizer + custom models for structured field extraction
ABBYY FlexiCapture (On-premise): Batch OCR with trained document classifiers
ABBYY Vantage (Cloud): Intelligent document processing with ML skill training
Azure OCR (Cloud · Azure): Computer Vision API for raw text extraction from scanned pages
At ArcelorMittal I maintain an Azure-based pipeline extracting and consolidating data from diverse chemical research reports — overseeing Document Intelligence workflows and ensuring accurate integration with centralized databases.
At ING I led development of an automated pipeline for power of attorney and identity documents across NL/FR/BE/DE markets — multi-format OCR + signature presence detection — achieving 70% processing accuracy and substantially reducing manual verification effort.
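
In practice the Document Intelligence step itself is compact. Below is a minimal sketch using the azure-ai-formrecognizer SDK; the endpoint, key, and custom model ID are placeholders, not the production configuration.

# Minimal Document Intelligence sketch; endpoint, key, and model ID are placeholders.
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<key>"),
)

with open("report.pdf", "rb") as f:
    poller = client.begin_analyze_document("chem-report-model", document=f)
result = poller.result()

for doc in result.documents:
    for name, field in doc.fields.items():
        # Each field carries a confidence score, which is what lets you
        # route low-confidence extractions to manual review.
        print(name, field.value, field.confidence)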

Transform & validate

clean before load

Raw extracted data is never load-ready. I use Pandas for tabular wrangling, NumPy/SciPy for numerical transformations, and Pydantic for schema validation — catching bad records before they corrupt downstream databases.

Pandas · NumPy · Pydantic · SciPy · Pillow / OpenCV · Scikit-Learn · regex / string normalization · deduplication logic
Global invoice dedup (ArcelorMittal India): developed an automated solution analysing five years of transactional data to identify potential duplicates — custom hashing logic + Pandas merge operations across millions of rows, with Pydantic models enforcing schema contracts at every stage.
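
A simplified version of that hashing pass looks like this; the column names and file path are illustrative, not the production schema.

# Simplified duplicate-detection pass: hash the identifying fields,
# then flag every hash that occurs more than once.
import hashlib
import pandas as pd

def invoice_hash(row: pd.Series) -> str:
    key = f"{row['vendor_id']}|{row['invoice_number']}|{row['amount']:.2f}"
    return hashlib.sha256(key.encode()).hexdigest()

df = pd.read_parquet("invoices_5yr.parquet")  # five years of transactions
df["dedup_hash"] = df.apply(invoice_hash, axis=1)

# Any hash seen more than once is a duplicate candidate for review.
duplicates = df[df.duplicated("dedup_hash", keep=False)].sort_values("dedup_hash")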

Load & storage layer

persistence strategy

Output targets vary by use case: SQL databases for structured reporting, MongoDB for flexible document storage, Azure Blob/cloud storage for file-based pipelines, Excel or SharePoint for business-user-facing deliverables, and web applications via REST.

SQL (Relational): Primary store for structured reporting and analytics
MongoDB (Document store): Flexible schemas for semi-structured pipeline outputs
Solr · Vector DBs (Search): Full-text search; Chroma / Pinecone for RAG pipelines
Azure Blob · SharePoint (File / cloud): Report delivery and file-based integration targets
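
As a sketch of the load step, here is one validated frame written to two of those targets; the connection strings, table, and collection names are placeholders.

# One validated DataFrame, two typical targets.
import pandas as pd
from sqlalchemy import create_engine
from pymongo import MongoClient

df = pd.DataFrame(validated_records)  # output of the validate stage

# Relational target for structured reporting.
engine = create_engine("postgresql://user:pass@reporting-db/analytics")
df.to_sql("invoice_duplicates", engine, if_exists="append", index=False)

# Document target for semi-structured outputs.
mongo = MongoClient("mongodb://pipeline-store:27017")
mongo["pipelines"]["invoice_duplicates"].insert_many(df.to_dict("records"))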

Orchestration & CI/CD

keeping it running

Pipelines that aren’t monitored aren’t pipelines — they’re one-time scripts. I use Jenkins for CI/CD and scheduled automation, Automation Anywhere (A11 & A360) as an enterprise orchestration layer, and Python-native schedulers for lightweight flows. Version control through Git keeps all pipeline code reviewable and deployable.

Jenkins CI/CD · Automation Anywhere A360 · SAIO (ING proprietary) · Git / branching strategy · LangGraph (agentic flows) · LangChain · Azure Functions (conceptual) · Cron / APScheduler
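
For the lightweight flows, a Python-native scheduler can be as small as this APScheduler sketch; the pipeline function and the 06:00 daily trigger are illustrative.

# Lightweight Python-native scheduling with APScheduler.
from apscheduler.schedulers.blocking import BlockingScheduler

def run_pipeline() -> None:
    # extract -> transform -> validate -> load, as described above
    ...

scheduler = BlockingScheduler()
scheduler.add_job(run_pipeline, "cron", hour=6, minute=0)  # daily at 06:00
scheduler.start()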

AI-enhanced pipeline patterns

LLM-in-the-loop

The most interesting pipelines I’ve worked on recently embed LLMs as processing steps — not as chatbots, but as transformation engines within a larger orchestrated flow. This includes prompt engineering for structured output, RAG over internal knowledge bases, and tool-calling agents that decide which pipeline branches to execute.

GPT-4 / GPT-4o (Models): Azure OpenAI for structured output, entity extraction, classification
Meta Llama (Open-source): Self-hosted inference for on-premise data scenarios
LangChain · LangGraph (Frameworks): Agentic pipeline composition with memory and tool calling
RAG Architecture (Retrieval): Chunking, embeddings, semantic search over enterprise docs
AI Client Offer Generation (ArcelorMittal Luxembourg, 2024): built an LLM-driven pipeline using Azure OpenAI to generate personalised client proposals from historical order data + new inquiries. The pipeline fetched data via REST, ran GPT-4o inference with structured prompts, formatted the output, and automated email dispatch — achieving ~60% accuracy and significantly reducing sales team workload.
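
A stripped-down sketch of that inference step, using the openai SDK against Azure OpenAI; the endpoint, deployment name, and prompt fields are illustrative, not the production prompt.

# GPT-4o as a transformation step, constrained to JSON output.
import json
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<resource>.openai.azure.com/",
    api_key="<key>",
    api_version="2024-02-01",
)

inquiry_text = open("inquiry.txt").read()  # new client inquiry, already fetched via REST

completion = client.chat.completions.create(
    model="gpt-4o",  # Azure deployment name
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Extract the offer fields as JSON: product, quantity, target_price."},
        {"role": "user", "content": inquiry_text},
    ],
)
offer = json.loads(completion.choices[0].message.content)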

Reporting & visualisation output

end of pipeline

The final stage of any pipeline is making data consumable. I deliver results via Streamlit dashboards, Matplotlib/Seaborn charts embedded in reports, or direct integration with web application portals — depending on the audience.

Streamlit · Matplotlib · Seaborn · Excel report generation · PDF report output · REST API write-back · SharePoint integration
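
A minimal Streamlit sketch of the kind of dashboard that sits at the end of the dedup pipeline; the file path and column names are assumptions.

# Minimal dashboard over a pipeline output table.
import pandas as pd
import streamlit as st

st.title("Invoice duplicate candidates")

df = pd.read_parquet("output/duplicates.parquet")  # pipeline output, path illustrative
vendor = st.selectbox("Vendor", sorted(df["vendor_id"].unique()))

subset = df[df["vendor_id"] == vendor]
st.metric("Duplicate candidates", len(subset))
st.dataframe(subset)

Run it with streamlit run app.py; Streamlit re-executes the script on every interaction, which suits read-only views over pipeline outputs.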

Where I’ve built this in production

employer context
ArcelorMittal BCoE Poland · Oct 2023 – present
Global market forecasting pipeline (Excel ETL → SQL → web app), Azure Document Intelligence for chemical reports, invoice dedup over a five-year transaction history, HR contract generation automation, compliance report pipeline from multimedia sources.

ArcelorMittal Luxembourg · Mar – Jul 2024 (secondment)
End-to-end LLM pipeline: REST data fetch → GPT-4o inference → structured output → automated email dispatch. Full agentic flow using LangChain with tool calling.

ING Hubs Poland (GKYC) · Nov 2021 – Oct 2023
Risk classifier pipeline (ML model training + inference), OCR document intelligence pipeline for four Western European markets, RPA bots for secure data exchange across portals. SAIO-based GUI automation for internal process flows.
