twitterapi.io is an independent third-party service. Not affiliated with X Corp.

Twitter (X) Scraping — A Developer's Guide

By Michael Park · 6 min read

Twitter (X) scraping covers a wide span of workflows: pulling user timelines for OSINT research, fetching hashtag conversations for brand monitoring, archiving historical posts for academic study, and building real-time monitoring dashboards. The implementation choice matters more than most dev teams realize. The wrong path means recurring maintenance whenever X changes its HTML; the right path means stable, structured JSON from a few lines of code.

This guide walks through the four paths with runnable Python, per-path cost taken from each provider's published pricing, and a practical decision rule for choosing among them. Pricing references cite the provider's URL; cost ratios are derived from those cited prices.

01 — Section

The four paths — at a glance

Path 1 — twitterapi.io API: structured JSON returned directly, $0.00015 per tweet (twitterapi.io/pricing), no Developer Console required, no HTML parsing, no UI-change breakage.

Path 2 — X official API: structured JSON, $0.005 per post (docs.x.com), requires X Developer Console + bearer token. 7-day window on recent-search; full-archive search is enterprise tier.

Path 3 — Browser-automation scrapers (Playwright / Puppeteer): free to run, you write the HTML parsing yourself, breaks on every X UI change, ToS-risk if X detects automation patterns. Reasonable for one-off small jobs; painful as a production dependency.

Path 4 — Third-party scraper SaaS (Apify / scrapfly / similar): abstracts the browser path, hosts the maintenance, priced per actor run or per credit. Pay-as-you-go without the dev-overhead of building your own Playwright stack.

02 — Section

Path 1 — twitterapi.io API

Auth is a single X-API-Key header — no OAuth, no X account required (the API authenticates by key, not by user-login). Sign up at twitterapi.io with email, receive the key, start calling.

Pricing per twitterapi.io/pricing: $0.00015 per returned tweet, $0.00018 per profile lookup, no monthly minimums.

python
import os, requests

HEADERS = {"X-API-Key": os.environ["TWITTERAPI_IO_KEY"]}
BASE = "https://api.twitterapi.io"

def scrape_user_timeline(handle: str, max_pages: int = 10):
    """Scrape a user's timeline — pure API, no HTML parsing."""
    tweets, cursor = [], None
    for _ in range(max_pages):
        params = {"userName": handle}
        if cursor: params["cursor"] = cursor
        r = requests.get(
            f"{BASE}/twitter/user/last_tweets",
            headers=HEADERS, params=params, timeout=15,
        )
        r.raise_for_status()
        resp = r.json()
        tweets.extend(resp.get("data", []))
        cursor = resp.get("next_cursor")
        if not cursor: break
    return tweets

def scrape_hashtag(tag: str, max_pages: int = 10):
    """Scrape hashtag conversations — full advanced-search operators."""
    tweets, cursor = [], None
    for _ in range(max_pages):
        params = {"query": f"#{tag} -is:retweet lang:en"}
        if cursor: params["cursor"] = cursor
        r = requests.get(
            f"{BASE}/twitter/tweet/advanced_search",
            headers=HEADERS, params=params, timeout=15,
        )
        r.raise_for_status()
        resp = r.json()
        tweets.extend(resp.get("tweets", []))
        cursor = resp.get("next_cursor")
        if not cursor: break
    return tweets

# Both functions return structured JSON — no HTML parsing
for t in scrape_user_timeline("nasa")[:5]:
    print(f"  {t['id']}: {t.get('text', '')[:80]}")
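
The two functions above call r.raise_for_status() and give up on the first failed page. A minimal sketch of a retry wrapper follows, assuming (the source does not confirm this) that transient failures surface as HTTP 429 or 5xx responses. The name call_with_retry, its parameters, and the RETRYABLE set are hypothetical, chosen for illustration.

```python
import time

# Assumption: these status codes indicate a transient failure worth retrying.
RETRYABLE = {429, 500, 502, 503, 504}

def call_with_retry(fetch, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-retryable status or retries run out.

    fetch is any zero-argument callable returning an object with .status_code,
    e.g. lambda: requests.get(url, headers=HEADERS, params=params, timeout=15).
    """
    for attempt in range(retries + 1):
        resp = fetch()
        if resp.status_code not in RETRYABLE or attempt == retries:
            return resp  # caller still runs resp.raise_for_status()
        sleep(backoff * 2 ** attempt)  # waits 1s, 2s, 4s, ... when backoff=1.0
```

Wrapping the request in a callable keeps the helper independent of the HTTP library, so the same wrapper could sit in front of either endpoint used above.
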
03 — Section

Path 2 — X official API

Requires X Developer Console onboarding (developer.x.com), which requires an X account in good standing. Auth via OAuth bearer token. Recent-search returns last 7 days; full-archive (/2/tweets/search/all) is academic/enterprise tier.

Pricing per docs.x.com/x-api/getting-started/pricing: $0.005 per post read.

python
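# pip install tweepy
import os
import tweepy

# Read the token from the environment rather than hardcoding it.
# The variable name X_BEARER_TOKEN is an assumption; use whatever your deployment defines.
client = tweepy.Client(bearer_token=os.environ["X_BEARER_TOKEN"])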

def scrape_x_official(query: str, max_results: int = 100):
    tweets = []
    for page in tweepy.Paginator(
        client.search_recent_tweets,
        query=query,
        max_results=max_results,
        tweet_fields=["created_at", "public_metrics", "author_id"],
        limit=10,
    ):
        tweets.extend(page.data or [])
    return tweets

for t in scrape_x_official("#machinelearning -is:retweet lang:en")[:5]:
    print(f"  {t.id}: {t.text[:80]}")
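
Because this path bills per post read, the Paginator arguments above set a spending ceiling: limit=10 pages of max_results=100 is at most 1,000 posts. The sketch below works that arithmetic from the $0.005 price cited above. The helper names are hypothetical, and the figures are an upper bound that assumes every page comes back full.

```python
from decimal import Decimal

X_PRICE_PER_POST = Decimal("0.005")  # USD per post read, from the pricing page cited above

def worst_case_cost(max_results, limit, price=X_PRICE_PER_POST):
    """Upper bound on spend if every one of `limit` pages returns `max_results` posts."""
    return max_results * limit * price

def pages_within_budget(budget_usd, max_results, price=X_PRICE_PER_POST):
    """Largest Paginator `limit` whose worst case stays within budget_usd."""
    return int(Decimal(str(budget_usd)) // (max_results * price))

print(worst_case_cost(max_results=100, limit=10))
print(pages_within_budget(2, max_results=100))
```

Decimal is used instead of float so that small per-post prices multiply without rounding drift.
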
04 — Section

Path 3 — Browser-automation (Playwright)

The classic 'scrape it ourselves' path. Spin up a headless browser, navigate to a tweet URL or search page, parse the rendered HTML. Works without any API key but has three structural problems:

Problem 1 — Maintenance: X changes their HTML structure regularly. Every change breaks your selectors. Dev teams running Playwright scrapers typically spend half a day per month patching parsers.

Problem 2 — Login required for most content: anonymous browsers see limited tweets; full content requires login. Login automation triggers X's detection (captcha, account lock, anti-bot flags). Maintaining a pool of working accounts is its own engineering project.

Problem 3 — ToS-risk: Playwright + automated login patterns violate X's terms. Accounts get suspended, IPs get rate-limited, your scraper stops working at the worst possible moment (right before a deadline).

Reasonable for: a one-off small scrape, learning the platform, or a workflow that genuinely needs DOM-level data not exposed via API. Not reasonable for: a production dependency.

python
# pip install playwright
# python -m playwright install chromium
from playwright.sync_api import sync_playwright

def scrape_via_browser(tweet_url: str):
    """Illustrative only — production use should prefer API path."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(tweet_url, wait_until="networkidle")
        # Selectors here break whenever X redesigns. Treat as fragile.
        text = page.locator("[data-testid='tweetText']").first.text_content()
        likes = page.locator("[data-testid='like']").first.text_content()
        browser.close()
        return {"text": text, "likes": likes}

# Real-world: this works today, breaks next month, requires constant maintenance
# Most production teams move off this path within 6 months
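
One practical consequence of reading the DOM: text_content() returns display strings, not numbers, so the likes value above still has to be normalized. The sketch below assumes counts are rendered with comma separators or K/M/B suffixes (for example "1,234" or "1.2K"). That format is an assumption about X's current UI, and parse_count is a hypothetical helper.

```python
import re

# Assumption: abbreviated counts use these suffixes.
_SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def parse_count(text):
    """Convert a rendered count such as '1,234' or '1.2K' to an int; None if unparseable."""
    if not text:
        return None
    m = re.fullmatch(r"\s*([\d,]*\.?\d+)\s*([KMB]?)\s*", text, flags=re.IGNORECASE)
    if not m:
        return None
    number = float(m.group(1).replace(",", ""))
    return round(number * _SUFFIX[m.group(2).upper()])
```

Abbreviated counts lose precision ("1.2K" could be anywhere from 1,150 to 1,249), which is one more reason DOM-scraped metrics are less reliable than the integer fields an API returns.
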
05 — Section

Path 4 — Third-party scraper SaaS

Tools like Apify, scrapfly, and ScraperAPI host the browser-automation maintenance for you. You call their API; they handle the proxy rotation, the parser updates, and the anti-bot evasion. Pricing varies by provider: Apify's Twitter scraper actor is roughly $0.40-$2.00 per 1,000 tweets depending on plan, while scrapfly charges credits per scrape, with the cost depending on render mode and proxy tier.

Reasonable for: workflows that need the abstraction over Playwright. Less reasonable for: most workflows, because if you're paying anyway, paying per structured tweet (paths 1 and 2) is operationally simpler than paying per actor run with parsing edge cases.

06 — Section

Side-by-side — 4-path matrix

Per-tweet cost is derived from each provider's published pricing page. Apify and scrapfly figures are approximate, based on their public plan tiers at apify.com/pricing and scrapfly.io/pricing.

| Dimension | twitterapi.io API | X official API | Playwright DIY | Apify / scrapfly |
|---|---|---|---|---|
| Per-tweet cost | $0.00015 (twitterapi.io/pricing) | $0.005 (docs.x.com) | $0 + dev time | $0.0004-$0.002 per tweet (~typical actor pricing) |
| Setup friction | API key (email signup) | X Developer Console (X account required) | code + accounts + proxies | API key (provider signup) |
| Maintenance | none (provider handles) | none (provider handles) | weekly to monthly | none (provider handles) |
| HTML parsing | none (JSON returned) | none (JSON returned) | full (your code) | none (provider returns JSON) |
| ToS risk | low (read-only public data) | low (official) | medium-high | medium |
| Best for | most workloads, default choice | already on X bill | one-off learning | when you specifically want the Apify ecosystem |

Two practical observations: (a) for any sustained workload, the dev-time cost of DIY scraping outweighs the dollar cost of the API paths; (b) at roughly 33× cheaper per tweet than X official ($0.005 / $0.00015 ≈ 33), twitterapi.io's pricing lets you scrape at scale without revisiting the budget every month.
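
The ratio in (b) and the per-path dollar figures can be reproduced directly from the per-tweet prices in the matrix. This is a sketch of that arithmetic only. The Apify entries use the approximate range quoted above, and the dictionary keys are labels made up for this example.

```python
from decimal import Decimal

# Per-tweet prices in USD, copied from the matrix above (Apify values are approximate).
PRICE_PER_TWEET = {
    "twitterapi.io": Decimal("0.00015"),
    "x_official": Decimal("0.005"),
    "apify_low": Decimal("0.0004"),
    "apify_high": Decimal("0.002"),
}

def cost_table(n_tweets):
    """Dollar cost of reading n_tweets on each paid path."""
    return {path: price * n_tweets for path, price in PRICE_PER_TWEET.items()}

ratio = PRICE_PER_TWEET["x_official"] / PRICE_PER_TWEET["twitterapi.io"]
print(f"X official / twitterapi.io = {ratio:.1f}x")
for path, usd in cost_table(1_000_000).items():
    print(f"{path:>14}: ${usd:,.2f}")
```

Playwright is left out because its cost is engineering time, which does not scale per tweet.
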

07 — Section

Common scraping workloads — which path fits

Brand monitoring (track mentions of your brand): twitterapi.io advanced_search with "your brand" query, hourly cron, ~500 tweets/run = $0.075/run × 24/day = $1.80/day. Stable, no maintenance.

OSINT / journalism (research a person or topic): same advanced_search pattern, ad-hoc queries. Per-query cost is single-digit cents.

Academic research (full-archive multi-year pull): twitterapi.io archive depth + cheap per-tweet cost. A 5-year archive of a moderate-activity account costs single-digit dollars.

Real-time monitoring (instant alerts): WebSocket streaming via twitterapi.io's stream endpoint, or polling the search endpoint every 30s. Stream is the cleaner pattern for high-volume.

Analytics product (build a dashboard for users): twitterapi.io for the read layer + your warehouse + your dashboard frontend. Per-call cost stays linear with usage.
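
For the polling variant of real-time monitoring, consecutive polls of a search endpoint will usually return overlapping results, so each tweet should be alerted on only once. A minimal sketch of that dedupe loop follows, assuming each tweet is a dict with an "id" key as in the earlier examples. poll_new_tweets and its parameters are hypothetical names, and fetch stands in for whatever function performs the search request.

```python
import time

def poll_new_tweets(fetch, interval=30.0, max_polls=None, sleep=time.sleep):
    """Yield tweets not seen in earlier polls.

    fetch is a zero-argument callable returning a list of dicts with an 'id' key.
    max_polls=None polls forever; the `seen` set grows without bound in that case,
    so a long-running job would want to expire old ids.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for tweet in fetch():
            if tweet["id"] not in seen:
                seen.add(tweet["id"])
                yield tweet
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval)  # no wait after the final poll
```

Because every poll is billed for the tweets it returns, overlap between polls is paid for even though it is discarded here. That is part of why the stream is the cleaner pattern at high volume.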

08 — Section

Picking the path — decision rule

Default: twitterapi.io API. Lowest setup friction, lowest per-call cost, zero maintenance burden. No Developer Console gating.

Already on X official for other workflows: X official; marginal cost rides on the same auth.

Genuinely need DOM-level data (visible-rendering details, paid-tier-only fields): Playwright for the specific case, kept small and isolated from your main pipeline.

Want the Apify ecosystem (actor marketplace + workflow integration): Apify, knowing the per-tweet cost is higher than API paths.

Most teams that end up at twitterapi.io get there by trying Playwright first (and hitting fragility), then X official (and hitting cost). Starting at twitterapi.io saves that iteration time.

python
# Practical example: stable brand-mention scraper running on cron.
import os, requests, json
from datetime import datetime, timezone

HEADERS = {"X-API-Key": os.environ["TWITTERAPI_IO_KEY"]}
BASE = "https://api.twitterapi.io"

def scrape_brand_mentions(brand: str, hours_back: int = 1, out_path: str = "mentions.jsonl"):
    query = f'"{brand}" -is:retweet within_time:{hours_back}h'
    tweets, cursor = [], None
    for _ in range(20):  # cap pages so a runaway query doesn't drain credit
        params = {"query": query}
        if cursor: params["cursor"] = cursor
        r = requests.get(
            f"{BASE}/twitter/tweet/advanced_search",
            headers=HEADERS, params=params, timeout=15,
        )
        r.raise_for_status()
        resp = r.json()
        tweets.extend(resp.get("tweets", []))
        cursor = resp.get("next_cursor")
        if not cursor: break
    snapshot = {
        "brand": brand,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "count": len(tweets),
        "tweets": tweets[:200],  # cap for storage
    }
    with open(out_path, "a") as f:
        f.write(json.dumps(snapshot) + "\n")
    return snapshot

snap = scrape_brand_mentions("twitterapi.io")
print(f"captured {snap['count']} mentions in last hour at {snap['captured_at']}")
for t in snap["tweets"][:3]:
    print(f"  @{t.get('author', {}).get('userName')}: {t.get('text', '')[:80]}")

# Cost framing (math from cited pricing pages):
#   ~500 tweets per hourly run × $0.00015 = $0.075 per run
#   Hourly × 24 × 30 = $54/month for continuous brand monitoring
#   Same workload via Playwright: 0 dollars + 4 hours/month patching selectors after X redesigns
# The dev-time cost makes the dollar cost negligible — API path wins on total cost
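
Each cron run appends one snapshot line to mentions.jsonl, and adjacent runs can capture the same tweet twice if their time windows overlap. A small sketch for reading the file back and merging snapshots by tweet id follows. It assumes each stored tweet dict carries an "id" key, as in the first example, and load_unique_mentions is a hypothetical helper name.

```python
import json

def load_unique_mentions(path="mentions.jsonl"):
    """Merge all snapshots in a JSONL file, keeping the first copy of each tweet id."""
    unique = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            snapshot = json.loads(line)
            for tweet in snapshot.get("tweets", []):
                unique.setdefault(tweet.get("id"), tweet)
    unique.pop(None, None)  # drop any entry that had no id
    return list(unique.values())
```

Since dicts preserve insertion order, the result stays in capture order, which is convenient for feeding a dashboard or a daily digest.
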
09 — Questions

Questions readers ask

Is scraping Twitter (X) against the terms of service?

It depends on the path. Using the official X API or third-party API providers like twitterapi.io is the standard developer workflow within terms. Browser-automation + automated-login at scale is the path where ToS-violation risk lives. Read-only data via documented APIs is generally fine; review docs.x.com developer terms for your specific commercial use case.

Can I scrape tweets without an X account?

Yes, via twitterapi.io — API-key auth doesn't require an X account. X official requires an account for the Developer Console. Playwright + anonymous browsing returns limited content (login-walled features hidden). See /blog/twitter-no-account-api-read-only for the no-X-account path detail.

How fast can I scrape — what are the rate limits?

twitterapi.io rate limits are per-account, with defaults in the thousands of requests/hour. X official's rate limits vary by tier (50/15-min on the basic search). Playwright is limited by your browser instance + X's anti-bot detection (slow + flaky). For high-volume, the API path always wins.

What about scraping deleted tweets?

Deleted tweets are typically filtered out by the search endpoints. For deleted-tweet research, see /blog/deleted-tweet-search and /blog/twitter-archive-tweet-finder-guide — different specialty workflows.

Can I scrape and store the data commercially?

Public tweet data + profile data storage is generally allowed by both X's developer terms and most third-party API providers. Restrictions vary by use case (resale of data, training AI models, advertising) — review the specific terms before commercial use. Personal data + DMs are not accessible via public scraping regardless.

What's the cost difference at scale (1M tweets / month)?

Math from cited pricing pages: 1M tweets at twitterapi.io = $150/month (1,000,000 × $0.00015); at X official = $5,000/month (1,000,000 × $0.005); via an Apify actor the cost varies by plan tier per apify.com/pricing. Playwright is $0 in dollars, but the dev-maintenance time compounds each time X changes its HTML. The API paths' total cost (dollars plus near-zero dev time) usually wins for sustained workloads.
