NetShield — Unified Pipeline Architecture Plan

From three isolated local scripts to a single automated daily pipeline
Date: March 14, 2026
Project: NetShield Intelligence Pipeline
Budget constraint: $0 / month (free tiers only)
Team size: 3–5 people
Run cadence: Once per day
Target scale: 100,000 rows / load

Contents

  Glossary — Technical Players & Their Roles
  1. Current State — The Problem
  2. Target State — What We Are Building
  3. Architecture Diagram (Before → After)
  4. Component Breakdown & Free-Tier Resources
  5. GitHub Actions as the Orchestrator (Replacing Airflow)
  6. Cloud PostgreSQL — Neon
  7. Data Handoff Strategy Between Teams
  8. Migration Roadmap — Step by Step
  9. Cost Summary
  10. Known Limitations & Future Upgrade Path

Glossary — Technical Players & Their Roles

Before diving into the architecture, here is a plain-language explanation of every tool and service used in this plan — what it is, what problem it solves, and what role it plays specifically in the NetShield pipeline.

GitHub
Code hosting & collaboration platform
What is it: A website where code is stored in version-controlled repositories. Think of it as Google Drive but for code, with a full history of every change ever made.
Role here: Central hub — holds the ELT pipeline code, receives the incoming NDJSON data (as Release assets, see Section 7), hosts the HTML report, and triggers the automation.
Free tier: Unlimited public & private repositories, up to 2,000 automation minutes/month.
FREE
GitHub Actions
Automation & CI/CD engine (Airflow replacement)
What is it: A built-in GitHub feature that runs scripts automatically when something happens (a push, a schedule, a manual click). It spins up a temporary Linux machine in the cloud, runs your steps, then shuts down.
Role here: The brain of the automation. Replaces Airflow: triggers the pipeline every morning at 06:00 UTC and whenever the data team pushes new files. Runs load → dbt → report in sequence.
vs Airflow: Airflow needs a 24/7 server ($6+/mo). Actions is serverless — you only consume minutes while your code actually runs. For one daily run, it is completely free.
FREE
Neon
Serverless cloud PostgreSQL database
What is it: A fully managed PostgreSQL database hosted in the cloud. It is identical to the local PostgreSQL already running on your machine — same SQL, same drivers — but accessible from anywhere on the internet.
Role here: Replaces your local PostgreSQL. The pipeline writes raw_data and social_data_alfa tables here. The DS team reads from here. No more "the DB is on my laptop."
Serverless: "Serverless" means it automatically pauses when nobody is using it (saving compute cost) and wakes in ~1 second when a connection arrives.
Free tier: 0.5 GB storage, 1 project, unlimited connections, never expires. No credit card required. Note: at the 100k rows/load target the free tier is too small; the paid Pro tier is required (Section 6).
FREE
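Because a suspended serverless endpoint can reject the very first connection while it wakes, pipeline code should retry briefly rather than fail the whole run. A minimal, database-agnostic sketch — the psycopg2 call in the docstring is an illustrative example, not the pipeline's actual code:

```python
import time


def connect_with_retry(connect, attempts=3, delay=1.5):
    """Retry a zero-argument connection factory to absorb a cold start.

    Example (illustrative): connect_with_retry(
        lambda: psycopg2.connect(**params))
    The first attempt after an idle period can fail while the endpoint
    wakes (~1 s); a short retry loop makes the run immune to that.
    """
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception:
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(delay)
```

Neon Pro is always-on, so this only matters if the project stays on a pausing (free) tier.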
dbt (data build tool)
SQL transformation layer
What is it: An open-source tool that takes raw data loaded into a database and transforms it into clean, analysis-ready tables using SQL. It handles dependency ordering, testing, and documentation automatically.
Role here: After NDJSON files are loaded into raw_data, dbt runs SQL models that produce the final social_data_alfa tables used by the DS team.
Free tier: dbt Core is fully open source and runs inside GitHub Actions at no cost.
FREE
GitHub Pages
Static website hosting
What is it: A GitHub feature that takes an HTML file from your repository and publishes it as a public website with an HTTPS URL — no server needed.
Role here: After each pipeline run, the generated HTML report is automatically published to a URL like trendact.github.io/-elt/. Anyone with the link can view it — no Flask server or Cloudflare tunnel needed.
Free tier: 1 GB storage, 100 GB bandwidth/month, always free on public repos.
FREE
PostgreSQL
Relational database engine
What is it: One of the world's most widely used open-source relational databases. Data is stored in tables with rows and columns, and queried with standard SQL.
Role here: The central data store. Raw NDJSON data is parsed and inserted as rows. dbt then reads and transforms those rows. The DS team runs SELECT queries to pull data for their models.
Local vs cloud: Currently running on your laptop (local). Migration moves it to Neon (cloud) — all code stays the same, only the connection string changes.
FREE (open source)
NDJSON
Data file format (Newline-Delimited JSON)
What is it: A text file where each line is a valid JSON object. It is a standard format for streaming or exporting large datasets because it can be read one line at a time without loading the whole file into memory.
Role here: The output of the data collection scraper. Each NDJSON file contains TikTok posts, comments, users, or WhatsApp links. The ELT pipeline reads these files and inserts them into PostgreSQL.
Format only — no cost
Apache Airflow
Workflow orchestration platform (future option)
What is it: An open-source platform designed specifically for scheduling and monitoring complex data pipelines. Pipelines are defined as DAGs (Directed Acyclic Graphs) — a visual diagram of tasks and their dependencies.
Role here: Not used yet — GitHub Actions covers all current needs at $0. Airflow becomes relevant if the pipeline grows to dozens of tasks requiring visual monitoring, automatic retries, and SLA alerts.
Cost: Requires a 24/7 server ($6+/month minimum on DigitalOcean or Render).

1. Current State — The Problem

The pipeline is split across three machines with no shared infrastructure. Every daily run requires manual coordination between three people.

| Pipeline | Owner | What it does | Output | Problem |
|---|---|---|---|---|
| Data Collection | Machine A | Scrapes TikTok (by keyword + WhatsApp links), produces NDJSON files | NDJSON files on local disk | Manual — files must be physically sent / committed |
| ELT Pipeline (this repo) | Machine B (yours) | Parses NDJSON → loads to local PostgreSQL → dbt → HTML report | Tables in social_data_alfa schema, HTML report | Manual — DB is local, nobody else can read it |
| DS Pipeline | Machine C | Reads processed tables, trains / runs models | Model outputs, analysis | Blocked — cannot access local DB on Machine B. Repo: Trendact/-ai-pipeline |
Root cause: There is no shared storage layer. Each team works in a silo. Solving this requires (1) a cloud database every team can reach, and (2) an automated trigger so no human has to start anything.

2. Target State — What We Are Building

The goal is a single automated daily pipeline.

End state: the Data Collection team publishes new NDJSON data → everything else runs automatically → the DS team queries the cloud DB with fresh data by morning.

3. Architecture Diagram

BEFORE (current)

┌─────────────────┐  manual file transfer  ┌─────────────────┐   cannot reach   ┌─────────────────┐
│   MACHINE A     │ ─────────────────────▶ │   MACHINE B     │ ───────────────▶ │   MACHINE C     │
│ Data Collection │                        │  ELT Pipeline   │                  │  DS Pipeline    │
│ - TikTok scrape │                        │ - Load NDJSON   │                  │ - ML models     │
│ - WhatsApp link │                        │ - dbt           │                  │ - Analysis      │
│ → NDJSON files  │                        │ - Local PG DB   │                  │                 │
└─────────────────┘                        └─────────────────┘                  └─────────────────┘
  ⚠ manual                                  ⚠ local DB, no sharing               ⚠ blocked

AFTER (target)

MACHINE A                      GITHUB (free)                       NEON (cloud PG)
─────────────────────────────────────────────────────────────────────────────────
[Data Collection]              ┌──────────────────────────┐
scrapes TikTok & WhatsApp      │ shared repo              │
saves NDJSON files             │ Trendact/-elt            │
publishes NDJSON ─────────────▶│ (or dedicated data repo) │
                               │                          │
                               │ GitHub Actions (cron)    │       ┌────────────────────────┐
                               │ runs daily at 06:00 UTC  │       │ Neon PostgreSQL        │
                               │ ───────────────────────  │       │ (cloud, always on)     │
                               │ Step 1: checkout data    │       │                        │
                               │ Step 2: load NDJSON      │──────▶│ raw_data schema        │
                               │ Step 3: dbt run          │──────▶│ social_data_alfa       │
                               │ Step 4: gen report       │       │                        │
                               │ Step 5: publish pages    │       └───────────┬────────────┘
                               └──────────────────────────┘                   │
                               ✓ HTML report on GitHub Pages                  │
                               ✓ Logs stored in Actions                       ▼
                                              [DS Pipeline — Trendact/-ai-pipeline]
                                              connects via psycopg2 / SQLAlchemy
                                              reads social_data_alfa
                                              ✓ always fresh, no manual steps

4. Component Breakdown & Free-Tier Resources

| Component | Service | Free Limit | Cost | Status |
|---|---|---|---|---|
| Orchestration (Airflow replacement) | GitHub Actions | 2,000 min/month (private repo) | FREE | Daily run ≈ 10–23 min → ~300–690 min/month, well within the free tier |
| Cloud PostgreSQL | Neon Pro | 10 GB storage, unlimited connections, always-on option | $19/mo | Free tier is not viable at 100k rows/load (~60 MB/day → free tier exhausted in ~8 days). Neon Pro (10 GB) required from day one; 10 GB lasts ~5.5 months before needing archival. See Section 6. |
| Report hosting | GitHub Pages | 1 GB storage, 100 GB bandwidth/month | FREE | HTML report committed and served as a static page |
| Source code | GitHub Repos | Unlimited repositories | FREE | ELT pipeline code only. NDJSON data files must NOT be committed to git at this scale — see Section 7. |
| NDJSON data file transfer | GitHub Releases (assets) | 2 GB per file, unlimited releases on public repos | FREE | 100k rows ≈ 50–100 MB per NDJSON file; GitHub Releases handle this cleanly. Data team uploads a release asset, the Actions workflow downloads it. See Section 7. |
| Secrets management | GitHub Actions Secrets | Unlimited encrypted secrets per repo/org | FREE | DB connection string and passwords stored as secrets |
| dbt | dbt Core (self-hosted) | Open source, runs inside the Actions runner | FREE | Already in this repo |
| Python environment | GitHub Actions runner (ubuntu-latest) | Python 3.x pre-installed on every runner | FREE | pip install from requirements.txt |
| Actual Apache Airflow | Astro (managed Airflow) | Free tier: 1 deployment, limited tasks | — | Optional upgrade if DAG complexity grows |
Why not real Airflow? Apache Airflow requires a persistent server (web server + scheduler + worker running 24/7). The cheapest option is a $6/month DigitalOcean droplet. At $0 budget, GitHub Actions is the correct replacement — it provides the same trigger + DAG-like step execution for free. If budget becomes available later, migrating the Actions workflow to an Airflow DAG requires minimal rewrites.

5. GitHub Actions as the Orchestrator

GitHub Actions runs a YAML workflow file on a schedule. Each job = one Airflow task. Jobs can depend on each other, pass data between steps, and send notifications on failure — everything Airflow does for this use case.

Workflow Design

# .github/workflows/daily_pipeline.yml
name: NetShield Daily Pipeline

on:
  schedule:
    - cron: '0 6 * * *'          # runs every day at 06:00 UTC
  workflow_dispatch:              # also allows manual trigger from GitHub UI
  repository_dispatch:            # triggered by data collection team's push
    types: [data-ready]

jobs:

  load-and-transform:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout ELT repo
        uses: actions/checkout@v4

      - name: Checkout data collection repo
        uses: actions/checkout@v4
        with:
          repository: Trendact/data-collection   # data team's repo
          path: data_intake

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.13'
          cache: pip

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run pipeline (load + dbt + report)
        env:
          DB_HOST:     ${{ secrets.NEON_HOST }}
          DB_USER:     ${{ secrets.NEON_USER }}
          DB_PASSWORD: ${{ secrets.NEON_PASSWORD }}
          DB_NAME:     ${{ secrets.NEON_DB }}
        run: python run_pipeline.py

      - name: Publish report to GitHub Pages
        uses: peaceiris/actions-gh-pages@v4
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./reports

This gives you scheduled runs, a manual trigger from the GitHub UI, automatic triggering on the data team's push, environment-variable secrets, and a published HTML report on GitHub Pages — all for $0.

Action Minutes Usage Estimate (at 100k rows/load)

| Step | Estimated time |
|---|---|
| Checkout + install deps | ~2 min |
| Download NDJSON from GitHub Release (~100 MB) | ~1 min |
| Load 100k rows to PostgreSQL | ~10–15 min (with bulk COPY: ~2–5 min) |
| dbt run (more rows = more model compute) | ~2–3 min |
| Generate report | ~1 min |
| Publish to Pages | <1 min |
| Total per run | ~17–23 min (optimised: ~10 min with bulk COPY) |
| Total per month (30 runs) | ~510–690 min (of 2,000 free) ✓ |
Actions minutes are still free even at 100k rows/load — worst case ~690 min/month, well within the 2,000 free minutes. The key optimisation is using PostgreSQL COPY (bulk insert) instead of row-by-row INSERT for the loading step. This can cut load time from ~15 min to ~2–5 min for 100k rows.
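A sketch of that COPY-based loader, assuming a psycopg2 connection (already a pipeline dependency); the table and column names passed in by the caller are placeholders for the real raw_data schema, not the pipeline's actual loader:

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Serialise an iterable of row tuples into an in-memory CSV buffer."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    return buf


def bulk_load(conn, rows, table, columns):
    """COPY rows into `table` in one streamed round trip.

    `conn` is an open psycopg2 connection. Example call (names
    illustrative): bulk_load(conn, rows, "raw_data.tiktok_posts_meta",
    ["post_id", "payload"]). COPY avoids the per-statement overhead of
    100,000 individual INSERTs.
    """
    sql = f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)"
    with conn.cursor() as cur:
        cur.copy_expert(sql, rows_to_csv_buffer(rows))
    conn.commit()
```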

6. Cloud PostgreSQL — Neon

Neon is a serverless PostgreSQL provider, fully compatible with psycopg2, dbt-postgres, and SQLAlchemy. Its free tier requires no credit card to sign up, but as the projection below shows, the free tier is too small at target scale.

Actual Current Data Size (measured March 14, 2026)

The local PostgreSQL database was measured before recommending a storage tier. Here are the real numbers:

| Schema | Table | Size (total incl. indexes) | Notes |
|---|---|---|---|
| raw_data | tiktok_posts_meta | 2.3 MB | Largest table — 2,967 rows |
| raw_data | tiktok_script_out | 968 kB | |
| social_data_alfa | hashtags | 1.1 MB | |
| social_data_alfa | posts | 1.0 MB | 2,898 rows |
| social_data_alfa | post_metadata | 960 kB | |
| (all other tables combined) | | ~1.5 MB | |
| Total user data (raw_data + social_data_alfa) | | ~8 MB | Full DB including system tables: 18 MB |

Storage Growth Projection

Average storage per row across all tables: ~0.6 KB. Target load: 100,000 rows per load = ~60 MB per daily run.

| Rows per load | Daily DB growth | Free tier (0.5 GB) lasts | Neon Pro (10 GB) lasts | Verdict |
|---|---|---|---|---|
| 500 | ~0.3 MB/day | ~4.5 years | N/A | ✓ Free tier OK |
| 10,000 | ~6 MB/day | ~3 months | ~4.5 years | |
| 100,000 ← TARGET | ~60 MB/day | ✗ ~8 days | ~5.5 months | Neon Pro required |
| 100,000 + archiving old data | net ~10–20 MB/day | N/A | 1–2 years | ✓ With archival strategy |
Free tier is not viable at target scale. 100k rows/load = ~60 MB/day. Neon's 0.5 GB free tier would be exhausted in approximately 8 days.

Required: Neon Pro at $19/month (10 GB storage). This lasts ~5.5 months before an archival strategy is needed — old raw data should be moved to cold storage (e.g. compressed GitHub Release archives) while keeping only the transformed social_data_alfa tables live. This extends Neon Pro to 1–2 years of operation.
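The arithmetic behind these projections, as a quick sanity check (same ~0.6 KB/row measurement; decimal units):

```python
ROW_KB = 0.6              # measured average row size across all tables
ROWS_PER_LOAD = 100_000   # target daily load

daily_mb = ROWS_PER_LOAD * ROW_KB / 1000   # 60 MB of new data per day
free_tier_days = 500 / daily_mb            # 0.5 GB free tier → ~8 days
neon_pro_months = 10_000 / daily_mb / 30   # 10 GB Neon Pro → ~5.5 months
```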

Archival Strategy (keeps DB lean)

Rather than storing every raw NDJSON row forever, the pipeline should periodically archive old raw_data tables to compressed files in GitHub Releases and truncate the source tables. The social_data_alfa transformed tables are much smaller and can be kept live indefinitely.

Why Neon vs alternatives

| Provider | Storage | Connections | Expiry | Credit card? |
|---|---|---|---|---|
| Neon Pro ✓ recommended | 10 GB ($19/mo) | Unlimited | Never | No |
| Neon Free | 0.5 GB | Unlimited | Never | No (but exhausted in ~8 days at 100k rows) |
| Supabase | 500 MB | Unlimited | Never (with activity) | No |
| Railway | 1 GB | Unlimited | 500 hrs/month (sleeps) | No |
| Render | 1 GB | 97 connections | 90 days, then deleted | No |

Required code change — .env update

After creating the Neon project, replace your local connection string:

# .env  (local machine — never committed)
DB_HOST=ep-xxxxx.us-east-2.aws.neon.tech
DB_PORT=5432
DB_NAME=neondb
DB_USER=netshield_user
DB_PASSWORD=xxxxxxxxxxxxxxxx
DB_SSLMODE=require                          # required for Neon

# GitHub Actions secrets (set in repo Settings → Secrets)
# NEON_HOST, NEON_USER, NEON_PASSWORD, NEON_DB

The db_config.py and profiles.yml files already read from environment variables — only the values change, not the code logic.
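For reference, a minimal sketch of what that environment-driven config looks like — the actual db_config.py may differ, this is only the shape of the pattern:

```python
import os


def connection_params():
    """Assemble PostgreSQL connection parameters from the environment.

    Works unchanged on the laptop (with .env loaded beforehand) and in
    the Actions runner (values injected from repository secrets).
    """
    return {
        "host": os.environ["DB_HOST"],
        "port": int(os.environ.get("DB_PORT", "5432")),
        "dbname": os.environ["DB_NAME"],
        "user": os.environ["DB_USER"],
        "password": os.environ["DB_PASSWORD"],
        "sslmode": os.environ.get("DB_SSLMODE", "require"),  # Neon needs SSL
    }
```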

7. Data Handoff Strategy Between Teams

The weakest link in the current flow is how NDJSON files get from the data collection machine to the ELT pipeline. There are two options:

Critical: NDJSON files cannot go in a git commit at this scale. 100k rows ≈ 50–100 MB per file. GitHub's hard limit per file is 100 MB; anything above 50 MB triggers warnings and slows clone/push significantly. Storing data files in git is an anti-pattern at this volume.

Option A — GitHub Releases as Data Drop Zone (Recommended)

The data collection team uploads their NDJSON file as a GitHub Release asset to the Trendact/data-intake repo (up to 2 GB per file, free on public repos). Publishing the release fires a release event in that repo; a small notify workflow there forwards it to the ELT repo as a repository_dispatch event, triggering the pipeline immediately — no polling, no waiting for cron.

# In data-intake repo — .github/workflows/notify.yml
on:
  release:
    types: [published]   # fires when data team publishes a new release
jobs:
  notify-elt:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger ELT pipeline with release URL
        run: |
          curl -X POST \
            -H "Authorization: token ${{ secrets.PAT_TOKEN }}" \
            -H "Accept: application/vnd.github.v3+json" \
            https://api.github.com/repos/Trendact/-elt/dispatches \
            -d '{"event_type":"data-ready","client_payload":{"asset_url":"${{ github.event.release.assets[0].browser_download_url }}"}}'

The ELT workflow receives the direct download URL of the NDJSON file as a payload, downloads it with curl inside the Actions runner, and processes it. No large files ever enter a git commit.

Option B — Single Monorepo

All teams commit to the same repo under separate folders (data_intake/, elt/). Simpler for a small team, but mixes concerns and makes the repo large over time as NDJSON files accumulate.

Recommendation: Use Option A (two repos). This mirrors real-world Data Engineering patterns, keeps the ELT repo clean, and allows the data collection team to push without needing write access to the ELT code.

DS Team Access (Trendact/-ai-pipeline)

After migration, the DS team updates their connection string in the Trendact/-ai-pipeline repo to point to Neon. No other changes are required on their side:

# DS pipeline — Python
import psycopg2
conn = psycopg2.connect(
    host="ep-xxxxx.us-east-2.aws.neon.tech",
    dbname="neondb", user="netshield_user",
    password="xxxxxx", sslmode="require"
)

No VPN, no IP whitelisting, no file transfers needed. Neon supports connections from any IP by default.

8. Migration Roadmap — Step by Step

Estimated total setup time: 4–6 hours (one person, one sitting).
Step 1 — Create Neon PostgreSQL project

Go to neon.tech → sign up free → create project → copy connection string. Create two roles: one for the pipeline (read + write), one for the DS team (read-only).

CREATE ROLE ds_reader LOGIN PASSWORD 'xxx';
GRANT USAGE ON SCHEMA social_data_alfa TO ds_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA social_data_alfa TO ds_reader;
-- dbt recreates tables on each run, so grant on future tables too:
ALTER DEFAULT PRIVILEGES IN SCHEMA social_data_alfa GRANT SELECT ON TABLES TO ds_reader;

Step 2 — Run initial migration to Neon

Dump the local PostgreSQL schema + data and restore to Neon:
pg_dump -h localhost -U postgres postgres | psql "postgres://netshield_user:xxx@ep-xxx.neon.tech/neondb?sslmode=require"

Step 3 — Update .env and profiles.yml to point to Neon

Update your local .env and ~/.dbt/profiles.yml with the Neon connection details. Run run_pipeline.py locally once to verify that the load and dbt both work against the cloud DB.

Step 4 — Add GitHub Actions secrets

In the GitHub repo → Settings → Secrets and variables → Actions, add: NEON_HOST, NEON_USER, NEON_PASSWORD, NEON_DB. These replace the .env file inside the Actions runner.

Step 5 — Create requirements.txt

The Actions runner has no .venv. Export your environment with
.venv\Scripts\pip freeze > requirements.txt
and commit the file. The workflow will install from it.

Step 6 — Create the GitHub Actions workflow file

Create .github/workflows/daily_pipeline.yml using the template in Section 5. Commit and push. The Actions tab in GitHub will show the workflow — trigger it manually once to verify end-to-end in the cloud.

Step 7 — Set up GitHub Pages for the report

In repo Settings → Pages → Source: gh-pages branch. After the first successful workflow run, the HTML report is published at https://trendact.github.io/-elt/ — accessible to anyone with the URL.

Step 8 — Set up data intake repo + GitHub Releases workflow

Create the Trendact/data-intake repo. The data collection team clones it and adds the notify workflow (Section 7). When they finish a scrape, they create a GitHub Release and upload the NDJSON file as a release asset (via the gh release create CLI or the GitHub UI). This fires the ELT pipeline automatically — no files ever committed to git.

Step 9 — Share Neon connection string with DS team

Give the DS team (Trendact/-ai-pipeline) the read-only role credentials. They update the connection string in their own repo. Verify they can SELECT from social_data_alfa.

Step 10 — Monitor first automated run + fix edge cases

Let the cron run automatically. Check the Actions log for errors. Common issues: a missing Python package in requirements.txt, dbt profile path differences between the local Windows machine and the Linux runner. Patch and commit.

9. Cost Summary

| Item | Service | Monthly Cost | Notes |
|---|---|---|---|
| Orchestration (pipeline scheduling + running) | GitHub Actions | $0 | ~510–690 of 2,000 free minutes used |
| Cloud PostgreSQL | Neon Pro | $19 | 10 GB storage; always-on, no auto-suspend |
| Report hosting | GitHub Pages | $0 | Static HTML, no Flask server needed |
| Code storage | GitHub Repos | $0 | Code only — NDJSON data ships via GitHub Releases, not git commits (Section 7) |
| Secrets management | GitHub Actions Secrets | $0 | Encrypted at rest, injected at runtime |
| Domain / SSL for report | GitHub Pages (included) | $0 | HTTPS enforced by default on github.io URLs |
| TOTAL | | $19/month | Only Neon Pro requires payment; everything else remains free |
The only paid component is Neon Pro ($19/mo). This is non-negotiable at 100k rows/load — the free tier is exhausted within one week. All other services (GitHub Actions, Pages, Releases, Secrets) remain completely free within the usage limits for this workload.

Future upgrade triggers

| Bottleneck | Trigger | Upgrade | Cost |
|---|---|---|---|
| DB storage | >10 GB on Neon Pro (after ~5.5 months with no archiving) | Add archival job to pipeline, or Neon Business plan | $0 (archival) or $69/mo (50 GB) |
| Pipeline complexity grows | Need real DAG retries, branching, visual monitoring | Astro (managed Airflow) | $0–$30/mo |
| Actions minutes | >2,000 min/month (loading 100k rows may take 15–20 min/run) | Self-hosted runner on a cheap VM | $4–$6/mo |
| Private report URL | Need password protection | Re-enable Flask + Cloudflare (already built) | $0 |

10. Known Limitations & Future Upgrade Path

| Limitation | Impact | Severity | Solution when ready |
|---|---|---|---|
| GitHub Actions is not real Airflow | No visual DAG UI, no automatic retry with backoff, no SLA alerting | | Migrate to Astro or self-hosted Airflow when DAG complexity grows |
| Neon auto-suspends after 5 min idle | Only on free tier — not an issue on Neon Pro (always-on) | N/A on Pro | Neon Pro is always-on; no cold-start delay |
| Report on GitHub Pages is public | Anyone with the URL can read the report | | Keep Flask + Cloudflare for sensitive sharing; or use private GitHub Pages (requires GitHub Team plan) |
| NDJSON files in git repo | 100k rows ≈ 50–100 MB per file — hits GitHub's 100 MB hard limit, breaks git push | BLOCKER — do not commit data files | Use GitHub Releases as the data drop zone (Section 7); the ELT workflow downloads the asset URL directly, so no data files ever enter a git commit |
| dbt profile uses Linux path in Actions | profiles.yml path differs between Windows and the Linux runner | | Pass the profiles dir via the --profiles-dir . flag in the workflow; keep a profiles.yml in the ELT folder for CI |
Summary: At $19/month (Neon Pro only), this architecture eliminates all manual coordination between teams at full 100k rows/load scale. The data collection team publishes a GitHub Release with their NDJSON file → the entire pipeline triggers automatically → the DS team has fresh data in the cloud DB by morning. All other infrastructure (orchestration, report hosting, file transfer, secrets) is free. This is the minimum viable spend to run this pipeline at target scale.