NetShield — Unified Pipeline Architecture Plan

From three isolated local scripts to a single automated daily pipeline
Date: March 14, 2026
Project: NetShield Intelligence Pipeline
Budget constraint: $0 / month (free tiers only)
Team size: 3–5 people
Run cadence: Once per day
Target scale: 100,000 rows / load

Contents

  Glossary — Technical Players & Their Roles
  1. Current State — The Problem
  2. Target State — What We Are Building
  3. Architecture Diagram (Before → After)
  4. Component Breakdown & Free-Tier Resources
  5. GitHub Actions as the Orchestrator (Replacing Airflow)
  6. Cloud PostgreSQL — Neon
  7. Data Handoff Strategy Between Teams
  8. Migration Roadmap — Step by Step
  9. Cost Summary
  10. Known Limitations & Future Upgrade Path

Glossary — Technical Players & Their Roles

Before diving into the architecture, here is a plain-language explanation of every tool and service used in this plan — what it is, what problem it solves, and what role it plays specifically in the NetShield pipeline.

GitHub
Code hosting & collaboration platform
What is it: A website where code is stored in version-controlled repositories. Think of it as Google Drive but for code, with a full history of every change ever made.
Role here: Central hub — holds the ELT pipeline code, receives the incoming NDJSON data (as Release assets, see Section 7), hosts the HTML report, and triggers the automation.
Free tier: Unlimited public & private repositories, up to 2,000 automation minutes/month.
FREE
GitHub Actions
Automation & CI/CD engine (Airflow replacement)
What is it: A built-in GitHub feature that runs scripts automatically when something happens (a push, a schedule, a manual click). It spins up a temporary Linux machine in the cloud, runs your steps, then shuts down.
Role here: The brain of the automation. Replaces Airflow: triggers the pipeline every morning at 06:00 UTC and whenever the data team pushes new files. Runs load → dbt → report in sequence.
vs Airflow: Airflow needs a 24/7 server ($6+/mo). Actions is serverless — you only consume minutes while your code actually runs. For one daily run, it is completely free.
FREE
Neon
Serverless cloud PostgreSQL database
What is it: A fully managed PostgreSQL database hosted in the cloud. It is identical to the local PostgreSQL already running on your machine — same SQL, same drivers — but accessible from anywhere on the internet.
Role here: Replaces your local PostgreSQL. The pipeline writes raw_data and social_data_alfa tables here. The DS team reads from here. No more "the DB is on my laptop."
Serverless: "Serverless" means it automatically pauses when nobody is using it (saving compute cost) and wakes in ~1 second when a connection arrives.
Free tier: 0.5 GB storage, 1 project, unlimited connections, never expires. No credit card required. Note: at the 100k rows/load target the free tier is too small; the paid Pro tier is required (Section 6).
FREE
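Because a suspended serverless endpoint can reject the very first connection while it wakes, pipeline code should retry briefly rather than fail the whole run. A minimal, database-agnostic sketch — the psycopg2 call in the docstring is an illustrative example, not the pipeline's actual code:

```python
import time


def connect_with_retry(connect, attempts=3, delay=1.5):
    """Retry a zero-argument connection factory to absorb a cold start.

    Example (illustrative): connect_with_retry(
        lambda: psycopg2.connect(**params))
    The first attempt after an idle period can fail while the endpoint
    wakes (~1 s); a short retry loop makes the run immune to that.
    """
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception:
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(delay)
```

Neon Pro is always-on, so this only matters if the project stays on a pausing (free) tier.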
dbt (data build tool)
SQL transformation layer
What is it: An open-source tool that takes raw data loaded into a database and transforms it into clean, analysis-ready tables using SQL. It handles dependency ordering, testing, and documentation automatically.
Role here: After NDJSON files are loaded into raw_data, dbt runs SQL models that produce the final social_data_alfa tables used by the DS team.
Free tier: dbt Core is fully open source and runs inside GitHub Actions at no cost.
FREE
GitHub Pages
Static website hosting
What is it: A GitHub feature that takes an HTML file from your repository and publishes it as a public website with an HTTPS URL — no server needed.
Role here: After each pipeline run, the generated HTML report is automatically published to a URL like trendact.github.io/-elt/. Anyone with the link can view it — no Flask server or Cloudflare tunnel needed.
Free tier: 1 GB storage, 100 GB bandwidth/month, always free on public repos.
FREE
PostgreSQL
Relational database engine
What is it: One of the world's most widely used open-source relational databases. Data is stored in tables with rows and columns, and queried with standard SQL.
Role here: The central data store. Raw NDJSON data is parsed and inserted as rows. dbt then reads and transforms those rows. The DS team runs SELECT queries to pull data for their models.
Local vs cloud: Currently running on your laptop (local). Migration moves it to Neon (cloud) — all code stays the same, only the connection string changes.
FREE (open source)
NDJSON
Data file format (Newline-Delimited JSON)
What is it: A text file where each line is a valid JSON object. It is a standard format for streaming or exporting large datasets because it can be read one line at a time without loading the whole file into memory.
Role here: The output of the data collection scraper. Each NDJSON file contains TikTok posts, comments, users, or WhatsApp links. The ELT pipeline reads these files and inserts them into PostgreSQL.
Format only — no cost
Apache Airflow
Workflow orchestration platform (future option)
What is it: An open-source platform designed specifically for scheduling and monitoring complex data pipelines. Pipelines are defined as DAGs (Directed Acyclic Graphs) — a visual diagram of tasks and their dependencies.
Role here: Not used yet — GitHub Actions covers all current needs at $0. Airflow becomes relevant if the pipeline grows to dozens of tasks requiring visual monitoring, automatic retries, and SLA alerts.
Cost: Requires a 24/7 server ($6+/month minimum on DigitalOcean or Render).

1. Current State — The Problem

The pipeline is split across three machines with no shared infrastructure. Every daily run requires manual coordination between three people.

| Pipeline | Owner | What it does | Output | Problem |
|---|---|---|---|---|
| Data Collection | Machine A | Scrapes TikTok (by keyword + WhatsApp links), produces NDJSON files | NDJSON files on local disk | Manual — files must be physically sent / committed |
| ELT Pipeline (this repo) | Machine B (yours) | Parses NDJSON → loads to local PostgreSQL → dbt → HTML report | Tables in social_data_alfa schema, HTML report | Manual — DB is local, nobody else can read it |
| DS Pipeline | Machine C | Reads processed tables, trains / runs models | Model outputs, analysis | Blocked — cannot access local DB on Machine B. Repo: Trendact/-ai-pipeline |
Root cause: There is no shared storage layer. Each team works in a silo. Solving this requires (1) a cloud database every team can reach, and (2) an automated trigger so no human has to start anything.

2. Target State — What We Are Building

The goal is a single automated daily pipeline.

End state: the Data Collection team publishes new NDJSON data → everything else runs automatically → the DS team queries the cloud DB with fresh data by morning.

3. Architecture Diagram

BEFORE (current)

┌─────────────────┐  manual file transfer  ┌─────────────────┐   cannot reach   ┌─────────────────┐
│   MACHINE A     │ ─────────────────────▶ │   MACHINE B     │ ───────────────▶ │   MACHINE C     │
│ Data Collection │                        │  ELT Pipeline   │                  │  DS Pipeline    │
│ - TikTok scrape │                        │ - Load NDJSON   │                  │ - ML models     │
│ - WhatsApp link │                        │ - dbt           │                  │ - Analysis      │
│ → NDJSON files  │                        │ - Local PG DB   │                  │                 │
└─────────────────┘                        └─────────────────┘                  └─────────────────┘
  ⚠ manual                                  ⚠ local DB, no sharing               ⚠ blocked

AFTER (target)

MACHINE A                      GITHUB (free)                       NEON (cloud PG)
─────────────────────────────────────────────────────────────────────────────────
[Data Collection]              ┌──────────────────────────┐
scrapes TikTok & WhatsApp      │ shared repo              │
saves NDJSON files             │ Trendact/-elt            │
publishes NDJSON ─────────────▶│ (or dedicated data repo) │
                               │                          │
                               │ GitHub Actions (cron)    │       ┌────────────────────────┐
                               │ runs daily at 06:00 UTC  │       │ Neon PostgreSQL        │
                               │ ───────────────────────  │       │ (cloud, always on)     │
                               │ Step 1: checkout data    │       │                        │
                               │ Step 2: load NDJSON      │──────▶│ raw_data schema        │
                               │ Step 3: dbt run          │──────▶│ social_data_alfa       │
                               │ Step 4: gen report       │       │                        │
                               │ Step 5: publish pages    │       └───────────┬────────────┘
                               └──────────────────────────┘                   │
                               ✓ HTML report on GitHub Pages                  │
                               ✓ Logs stored in Actions                       ▼
                                              [DS Pipeline — Trendact/-ai-pipeline]
                                              connects via psycopg2 / SQLAlchemy
                                              reads social_data_alfa
                                              ✓ always fresh, no manual steps

4. Component Breakdown & Free-Tier Resources

| Component | Service | Free Limit | Cost | Status |
|---|---|---|---|---|
| Orchestration (Airflow replacement) | GitHub Actions | 2,000 min/month (private repo) | FREE | Daily run ≈ 10–23 min → ~300–690 min/month, well within the free tier |
| Cloud PostgreSQL | Neon Pro | 10 GB storage, unlimited connections, always-on option | $19/mo | Free tier is not viable at 100k rows/load (~60 MB/day → free tier exhausted in ~8 days). Neon Pro (10 GB) required from day one; 10 GB lasts ~5.5 months before needing archival. See Section 6. |
| Report hosting | GitHub Pages | 1 GB storage, 100 GB bandwidth/month | FREE | HTML report committed and served as a static page |
| Source code | GitHub Repos | Unlimited repositories | FREE | ELT pipeline code only. NDJSON data files must NOT be committed to git at this scale — see Section 7. |
| NDJSON data file transfer | GitHub Releases (assets) | 2 GB per file, unlimited releases on public repos | FREE | 100k rows ≈ 50–100 MB per NDJSON file; GitHub Releases handle this cleanly. Data team uploads a release asset, the Actions workflow downloads it. See Section 7. |
| Secrets management | GitHub Actions Secrets | Unlimited encrypted secrets per repo/org | FREE | DB connection string and passwords stored as secrets |
| dbt | dbt Core (self-hosted) | Open source, runs inside the Actions runner | FREE | Already in this repo |
| Python environment | GitHub Actions runner (ubuntu-latest) | Python 3.x pre-installed on every runner | FREE | pip install from requirements.txt |
| Actual Apache Airflow | Astro (managed Airflow) | Free tier: 1 deployment, limited tasks | — | Optional upgrade if DAG complexity grows |
Why not real Airflow? Apache Airflow requires a persistent server (web server + scheduler + worker running 24/7). The cheapest option is a $6/month DigitalOcean droplet. At $0 budget, GitHub Actions is the correct replacement — it provides the same trigger + DAG-like step execution for free. If budget becomes available later, migrating the Actions workflow to an Airflow DAG requires minimal rewrites.

5. GitHub Actions as the Orchestrator

GitHub Actions runs a YAML workflow file on a schedule. Each job = one Airflow task. Jobs can depend on each other, pass data between steps, and send notifications on failure — everything Airflow does for this use case.

Workflow Design

# .github/workflows/daily_pipeline.yml
name: NetShield Daily Pipeline

on:
  schedule:
    - cron: '0 6 * * *'          # runs every day at 06:00 UTC
  workflow_dispatch:              # also allows manual trigger from GitHub UI
  repository_dispatch:            # triggered by data collection team's push
    types: [data-ready]

jobs:

  load-and-transform:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout ELT repo
        uses: actions/checkout@v4

      - name: Checkout data collection repo
        uses: actions/checkout@v4
        with:
          repository: Trendact/data-collection   # data team's repo
          path: data_intake

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.13'
          cache: pip

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run pipeline (load + dbt + report)
        env:
          DB_HOST:     ${{ secrets.NEON_HOST }}
          DB_USER:     ${{ secrets.NEON_USER }}
          DB_PASSWORD: ${{ secrets.NEON_PASSWORD }}
          DB_NAME:     ${{ secrets.NEON_DB }}
        run: python run_pipeline.py

      - name: Publish report to GitHub Pages
        uses: peaceiris/actions-gh-pages@v4
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./reports

This gives you scheduled runs, a manual trigger from the GitHub UI, automatic triggering on the data team's push, environment-variable secrets, and a published HTML report on GitHub Pages — all for $0.

Action Minutes Usage Estimate (at 100k rows/load)

| Step | Estimated time |
|---|---|
| Checkout + install deps | ~2 min |
| Download NDJSON from GitHub Release (~100 MB) | ~1 min |
| Load 100k rows to PostgreSQL | ~10–15 min (with bulk COPY: ~2–5 min) |
| dbt run (more rows = more model compute) | ~2–3 min |
| Generate report | ~1 min |
| Publish to Pages | <1 min |
| Total per run | ~17–23 min (optimised: ~10 min with bulk COPY) |
| Total per month (30 runs) | ~510–690 min (of 2,000 free) ✓ |
Actions minutes are still free even at 100k rows/load — worst case ~690 min/month, well within the 2,000 free minutes. The key optimisation is using PostgreSQL COPY (bulk insert) instead of row-by-row INSERT for the loading step. This can cut load time from ~15 min to ~2–5 min for 100k rows.
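A sketch of that COPY-based loader, assuming a psycopg2 connection (already a pipeline dependency); the table and column names passed in by the caller are placeholders for the real raw_data schema, not the pipeline's actual loader:

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Serialise an iterable of row tuples into an in-memory CSV buffer."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    return buf


def bulk_load(conn, rows, table, columns):
    """COPY rows into `table` in one streamed round trip.

    `conn` is an open psycopg2 connection. Example call (names
    illustrative): bulk_load(conn, rows, "raw_data.tiktok_posts_meta",
    ["post_id", "payload"]). COPY avoids the per-statement overhead of
    100,000 individual INSERTs.
    """
    sql = f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)"
    with conn.cursor() as cur:
        cur.copy_expert(sql, rows_to_csv_buffer(rows))
    conn.commit()
```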

6. Cloud PostgreSQL — Neon

Neon is a serverless PostgreSQL provider, fully compatible with psycopg2, dbt-postgres, and SQLAlchemy. Its free tier requires no credit card to sign up, but as the projection below shows, the free tier is too small at target scale.

Actual Current Data Size (measured March 14, 2026)

The local PostgreSQL database was measured before recommending a storage tier. Here are the real numbers:

| Schema | Table | Size (total incl. indexes) | Notes |
|---|---|---|---|
| raw_data | tiktok_posts_meta | 2.3 MB | Largest table — 2,967 rows |
| raw_data | tiktok_script_out | 968 kB | |
| social_data_alfa | hashtags | 1.1 MB | |
| social_data_alfa | posts | 1.0 MB | 2,898 rows |
| social_data_alfa | post_metadata | 960 kB | |
| (all other tables combined) | | ~1.5 MB | |
| Total user data (raw_data + social_data_alfa) | | ~8 MB | Full DB including system tables: 18 MB |

Storage Growth Projection

Average storage per row across all tables: ~0.6 KB. Target load: 100,000 rows per load = ~60 MB per daily run.

| Rows per load | Daily DB growth | Free tier (0.5 GB) lasts | Neon Pro (10 GB) lasts | Verdict |
|---|---|---|---|---|
| 500 | ~0.3 MB/day | ~4.5 years | N/A | ✓ Free tier OK |
| 10,000 | ~6 MB/day | ~3 months | ~4.5 years | |
| 100,000 ← TARGET | ~60 MB/day | ✗ ~8 days | ~5.5 months | Neon Pro required |
| 100,000 + archiving old data | net ~10–20 MB/day | N/A | 1–2 years | ✓ With archival strategy |
Free tier is not viable at target scale. 100k rows/load = ~60 MB/day. Neon's 0.5 GB free tier would be exhausted in approximately 8 days.

Required: Neon Pro at $19/month (10 GB storage). This lasts ~5.5 months before an archival strategy is needed — old raw data should be moved to cold storage (e.g. compressed GitHub Release archives) while keeping only the transformed social_data_alfa tables live. This extends Neon Pro to 1–2 years of operation.
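The arithmetic behind these projections, as a quick sanity check (same ~0.6 KB/row measurement; decimal units):

```python
ROW_KB = 0.6              # measured average row size across all tables
ROWS_PER_LOAD = 100_000   # target daily load

daily_mb = ROWS_PER_LOAD * ROW_KB / 1000   # 60 MB of new data per day
free_tier_days = 500 / daily_mb            # 0.5 GB free tier → ~8 days
neon_pro_months = 10_000 / daily_mb / 30   # 10 GB Neon Pro → ~5.5 months
```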

Archival Strategy (keeps DB lean)

Rather than storing every raw NDJSON row forever, the pipeline should periodically archive old raw_data tables to compressed files in GitHub Releases and truncate the source tables. The social_data_alfa transformed tables are much smaller and can be kept live indefinitely.

Why Neon vs alternatives

| Provider | Storage | Connections | Expiry | Credit card? |
|---|---|---|---|---|
| Neon Pro ✓ recommended | 10 GB ($19/mo) | Unlimited | Never | No |
| Neon Free | 0.5 GB | Unlimited | Never | No (but exhausted in ~8 days at 100k rows) |
| Supabase | 500 MB | Unlimited | Never (with activity) | No |
| Railway | 1 GB | Unlimited | 500 hrs/month (sleeps) | No |
| Render | 1 GB | 97 connections | 90 days, then deleted | No |

Required code change — .env update

After creating the Neon project, replace your local connection string:

# .env  (local machine — never committed)
DB_HOST=ep-xxxxx.us-east-2.aws.neon.tech
DB_PORT=5432
DB_NAME=neondb
DB_USER=netshield_user
DB_PASSWORD=xxxxxxxxxxxxxxxx
DB_SSLMODE=require                          # required for Neon

# GitHub Actions secrets (set in repo Settings → Secrets)
# NEON_HOST, NEON_USER, NEON_PASSWORD, NEON_DB

The db_config.py and profiles.yml files already read from environment variables — only the values change, not the code logic.
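For reference, a minimal sketch of what that environment-driven config looks like — the actual db_config.py may differ, this is only the shape of the pattern:

```python
import os


def connection_params():
    """Assemble PostgreSQL connection parameters from the environment.

    Works unchanged on the laptop (with .env loaded beforehand) and in
    the Actions runner (values injected from repository secrets).
    """
    return {
        "host": os.environ["DB_HOST"],
        "port": int(os.environ.get("DB_PORT", "5432")),
        "dbname": os.environ["DB_NAME"],
        "user": os.environ["DB_USER"],
        "password": os.environ["DB_PASSWORD"],
        "sslmode": os.environ.get("DB_SSLMODE", "require"),  # Neon needs SSL
    }
```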

7. Data Handoff Strategy Between Teams

The weakest link in the current flow is how NDJSON files get from the data collection machine to the ELT pipeline. There are two options:

Critical: NDJSON files cannot go in a git commit at this scale. 100k rows ≈ 50–100 MB per file. GitHub's hard limit per file is 100 MB; anything above 50 MB triggers warnings and slows clone/push significantly. Storing data files in git is an anti-pattern at this volume.

Option A — GitHub Releases as Data Drop Zone (Recommended)

The data collection team uploads their NDJSON file as a GitHub Release asset to the Trendact/data-intake repo (up to 2 GB per file, free on public repos). Publishing the release fires a release event in that repo; a small notify workflow there forwards it to the ELT repo as a repository_dispatch event, triggering the pipeline immediately — no polling, no waiting for cron.

# In data-intake repo — .github/workflows/notify.yml
on:
  release:
    types: [published]   # fires when data team publishes a new release
jobs:
  notify-elt:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger ELT pipeline with release URL
        run: |
          curl -X POST \
            -H "Authorization: token ${{ secrets.PAT_TOKEN }}" \
            -H "Accept: application/vnd.github.v3+json" \
            https://api.github.com/repos/Trendact/-elt/dispatches \
            -d '{"event_type":"data-ready","client_payload":{"asset_url":"${{ github.event.release.assets[0].browser_download_url }}"}}'

The ELT workflow receives the direct download URL of the NDJSON file as a payload, downloads it with curl inside the Actions runner, and processes it. No large files ever enter a git commit.

Option B — Single Monorepo

All teams commit to the same repo under separate folders (data_intake/, elt/). Simpler for a small team, but mixes concerns and makes the repo large over time as NDJSON files accumulate.

Recommendation: Use Option A (two repos). This mirrors real-world Data Engineering patterns, keeps the ELT repo clean, and allows the data collection team to push without needing write access to the ELT code.

DS Team Access (Trendact/-ai-pipeline)

After migration, the DS team updates their connection string in the Trendact/-ai-pipeline repo to point to Neon. No other changes are required on their side:

# DS pipeline — Python
import psycopg2
conn = psycopg2.connect(
    host="ep-xxxxx.us-east-2.aws.neon.tech",
    dbname="neondb", user="netshield_user",
    password="xxxxxx", sslmode="require"
)

No VPN, no IP whitelisting, no file transfers needed. Neon supports connections from any IP by default.

8. Migration Roadmap — Step by Step

Estimated total setup time: 4–6 hours (one person, one sitting).
Step 1 — Create Neon PostgreSQL project

Go to neon.tech → sign up free → create project → copy connection string. Create two roles: one for the pipeline (read + write), one for the DS team (read-only).

CREATE ROLE ds_reader LOGIN PASSWORD 'xxx';
GRANT USAGE ON SCHEMA social_data_alfa TO ds_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA social_data_alfa TO ds_reader;
-- dbt recreates tables on each run, so grant on future tables too:
ALTER DEFAULT PRIVILEGES IN SCHEMA social_data_alfa GRANT SELECT ON TABLES TO ds_reader;

Step 2 — Run initial migration to Neon

Dump the local PostgreSQL schema + data and restore to Neon:
pg_dump -h localhost -U postgres postgres | psql "postgres://netshield_user:xxx@ep-xxx.neon.tech/neondb?sslmode=require"

Step 3 — Update .env and profiles.yml to point to Neon

Update your local .env and ~/.dbt/profiles.yml with the Neon connection details. Run run_pipeline.py locally once to verify that the load and dbt both work against the cloud DB.

Step 4 — Add GitHub Actions secrets

In the GitHub repo → Settings → Secrets and variables → Actions, add: NEON_HOST, NEON_USER, NEON_PASSWORD, NEON_DB. These replace the .env file inside the Actions runner.

Step 5 — Create requirements.txt

The Actions runner has no .venv. Export your environment with
.venv\Scripts\pip freeze > requirements.txt
and commit the file. The workflow will install from it.

Step 6 — Create the GitHub Actions workflow file

Create .github/workflows/daily_pipeline.yml using the template in Section 5. Commit and push. The Actions tab in GitHub will show the workflow — trigger it manually once to verify end-to-end in the cloud.

Step 7 — Set up GitHub Pages for the report

In repo Settings → Pages → Source: gh-pages branch. After the first successful workflow run, the HTML report is published at https://trendact.github.io/-elt/ — accessible to anyone with the URL.

Step 8 — Set up data intake repo + GitHub Releases workflow

Create the Trendact/data-intake repo. The data collection team clones it and adds the notify workflow (Section 7). When they finish a scrape, they create a GitHub Release and upload the NDJSON file as a release asset (via the gh release create CLI or the GitHub UI). This fires the ELT pipeline automatically — no files ever committed to git.

Step 9 — Share Neon connection string with DS team

Give the DS team (Trendact/-ai-pipeline) the read-only role credentials. They update the connection string in their own repo. Verify they can SELECT from social_data_alfa.

Step 10 — Monitor first automated run + fix edge cases

Let the cron run automatically. Check the Actions log for errors. Common issues: a missing Python package in requirements.txt, dbt profile path differences between the local Windows machine and the Linux runner. Patch and commit.

9. Cost Summary

| Item | Service | Monthly Cost | Notes |
|---|---|---|---|
| Orchestration (pipeline scheduling + running) | GitHub Actions | $0 | ~510–690 of 2,000 free minutes used |
| Cloud PostgreSQL | Neon Pro | $19 | 10 GB storage; always-on, no auto-suspend |
| Report hosting | GitHub Pages | $0 | Static HTML, no Flask server needed |
| Code storage | GitHub Repos | $0 | Code only — NDJSON data ships via GitHub Releases, not git commits (Section 7) |
| Secrets management | GitHub Actions Secrets | $0 | Encrypted at rest, injected at runtime |
| Domain / SSL for report | GitHub Pages (included) | $0 | HTTPS enforced by default on github.io URLs |
| TOTAL | | $19/month | Only Neon Pro requires payment; everything else remains free |
The only paid component is Neon Pro ($19/mo). This is non-negotiable at 100k rows/load — the free tier is exhausted within one week. All other services (GitHub Actions, Pages, Releases, Secrets) remain completely free within the usage limits for this workload.

Future upgrade triggers

| Bottleneck | Trigger | Upgrade | Cost |
|---|---|---|---|
| DB storage | >10 GB on Neon Pro (after ~5.5 months with no archiving) | Add archival job to pipeline, or Neon Business plan | $0 (archival) or $69/mo (50 GB) |
| Pipeline complexity grows | Need real DAG retries, branching, visual monitoring | Astro (managed Airflow) | $0–$30/mo |
| Actions minutes | >2,000 min/month (loading 100k rows may take 15–20 min/run) | Self-hosted runner on a cheap VM | $4–$6/mo |
| Private report URL | Need password protection | Re-enable Flask + Cloudflare (already built) | $0 |

10. Known Limitations & Future Upgrade Path

| Limitation | Impact | Severity | Solution when ready |
|---|---|---|---|
| GitHub Actions is not real Airflow | No visual DAG UI, no automatic retry with backoff, no SLA alerting | | Migrate to Astro or self-hosted Airflow when DAG complexity grows |
| Neon auto-suspends after 5 min idle | Only on free tier — not an issue on Neon Pro (always-on) | N/A on Pro | Neon Pro is always-on; no cold-start delay |
| Report on GitHub Pages is public | Anyone with the URL can read the report | | Keep Flask + Cloudflare for sensitive sharing; or use private GitHub Pages (requires GitHub Team plan) |
| NDJSON files in git repo | 100k rows ≈ 50–100 MB per file — hits GitHub's 100 MB hard limit, breaks git push | BLOCKER — do not commit data files | Use GitHub Releases as the data drop zone (Section 7); the ELT workflow downloads the asset URL directly, so no data files ever enter a git commit |
| dbt profile uses Linux path in Actions | profiles.yml path differs between Windows and the Linux runner | | Pass the profiles dir via the --profiles-dir . flag in the workflow; keep a profiles.yml in the ELT folder for CI |
Summary: At $19/month (Neon Pro only), this architecture eliminates all manual coordination between teams at full 100k rows/load scale. The data collection team publishes a GitHub Release with their NDJSON file → the entire pipeline triggers automatically → the DS team has fresh data in the cloud DB by morning. All other infrastructure (orchestration, report hosting, file transfer, secrets) is free. This is the minimum viable spend to run this pipeline at target scale.