Twitter (X) Scraping — A Developer's Guide
Twitter (X) scraping covers a wide span of workflows: pulling user timelines for OSINT research, fetching hashtag conversations for brand monitoring, archiving historical posts for academic study, and building real-time monitoring dashboards. The implementation choice matters more than most dev teams realize. The wrong path means recurring maintenance whenever X tweaks its HTML; the right path means stable structured JSON with a few lines of code.
This guide walks through the four paths with runnable Python, per-path cost from each provider's published pricing, and a practical decision rule for which to pick. Pricing references cite their source URLs, and the cost ratios are derived from those published prices.
The four paths — at a glance
Path 1 — twitterapi.io API: structured JSON returned directly, $0.00015 per tweet (twitterapi.io/pricing), no Developer Console required, no HTML parsing, no UI-change breakage.
Path 2 — X official API: structured JSON, $0.005 per post (docs.x.com), requires X Developer Console + bearer token. 7-day window on recent-search; full-archive search is enterprise tier.
Path 3 — Browser-automation scrapers (Playwright / Puppeteer): free to run, you write the HTML parsing yourself, breaks on every X UI change, ToS-risk if X detects automation patterns. Reasonable for one-off small jobs; painful as a production dependency.
Path 4 — Third-party scraper SaaS (Apify / scrapfly / similar): abstracts the browser path, hosts the maintenance, priced per actor run or per credit. Pay-as-you-go without the dev-overhead of building your own Playwright stack.
Path 1 — twitterapi.io (recommended default)
Auth is a single X-API-Key header — no OAuth, no X account required (the API authenticates by key, not by user-login). Sign up at twitterapi.io with email, receive the key, start calling.
Pricing per twitterapi.io/pricing: $0.00015 per returned tweet, $0.00018 per profile lookup, no monthly minimums.
import os
import requests

HEADERS = {"X-API-Key": os.environ["TWITTERAPI_IO_KEY"]}
BASE = "https://api.twitterapi.io"


def scrape_user_timeline(handle: str, max_pages: int = 10):
    """Scrape a user's timeline: pure API, no HTML parsing."""
    tweets, cursor = [], None
    for _ in range(max_pages):
        params = {"userName": handle}
        if cursor:
            params["cursor"] = cursor
        r = requests.get(
            f"{BASE}/twitter/user/last_tweets",
            headers=HEADERS, params=params, timeout=15,
        )
        r.raise_for_status()
        resp = r.json()
        tweets.extend(resp.get("data", []))
        cursor = resp.get("next_cursor")
        if not cursor:
            break
    return tweets


def scrape_hashtag(tag: str, max_pages: int = 10):
    """Scrape hashtag conversations: full advanced-search operators."""
    tweets, cursor = [], None
    for _ in range(max_pages):
        params = {"query": f"#{tag} -is:retweet lang:en"}
        if cursor:
            params["cursor"] = cursor
        r = requests.get(
            f"{BASE}/twitter/tweet/advanced_search",
            headers=HEADERS, params=params, timeout=15,
        )
        r.raise_for_status()
        resp = r.json()
        tweets.extend(resp.get("tweets", []))
        cursor = resp.get("next_cursor")
        if not cursor:
            break
    return tweets


# Both functions return structured JSON, with no HTML parsing
for t in scrape_user_timeline("nasa")[:5]:
    print(f"  {t['id']}: {t.get('text', '')[:80]}")
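The two functions above repeat the same cursor loop. A minimal sketch of how that loop could be factored out, assuming each response carries a `next_cursor` field as in the code above. The names `paginate`, `fetch_page`, and `items_key` are hypothetical, and the fake fetcher stands in for a real HTTP call:

```python
from typing import Callable, Iterator, Optional


def paginate(
    fetch_page: Callable[[Optional[str]], dict],
    items_key: str,
    max_pages: int = 10,
) -> Iterator[dict]:
    """Yield items page by page until the cursor runs out or max_pages is hit.

    fetch_page takes the current cursor (None for the first page) and returns
    the decoded JSON response as a dict.
    """
    cursor = None
    for _ in range(max_pages):
        resp = fetch_page(cursor)
        yield from resp.get(items_key, [])
        cursor = resp.get("next_cursor")
        if not cursor:
            break


# Stand-in for a real HTTP call: three canned pages keyed by cursor.
_FAKE_PAGES = {
    None: {"tweets": [{"id": "1"}, {"id": "2"}], "next_cursor": "a"},
    "a": {"tweets": [{"id": "3"}], "next_cursor": "b"},
    "b": {"tweets": [{"id": "4"}], "next_cursor": ""},
}


def fake_fetch(cursor: Optional[str]) -> dict:
    return _FAKE_PAGES[cursor]


all_ids = [t["id"] for t in paginate(fake_fetch, "tweets")]
first_page_only = [t["id"] for t in paginate(fake_fetch, "tweets", max_pages=1)]
print(all_ids, first_page_only)
```

In real use, `fetch_page` would wrap the `requests.get` call for one endpoint, so the page cap and cursor handling live in a single place.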
Path 2 — X official API
Requires X Developer Console onboarding (developer.x.com), which requires an X account in good standing. Auth via OAuth bearer token. Recent-search returns last 7 days; full-archive (/2/tweets/search/all) is academic/enterprise tier.
Pricing per docs.x.com/x-api/getting-started/pricing: $0.005 per post read.
# pip install tweepy
import tweepy

client = tweepy.Client(bearer_token="YOUR_X_BEARER")


def scrape_x_official(query: str, max_results: int = 100):
    tweets = []
    for page in tweepy.Paginator(
        client.search_recent_tweets,
        query=query,
        max_results=max_results,  # per page; recent search accepts 10-100
        tweet_fields=["created_at", "public_metrics", "author_id"],
        limit=10,  # stop after 10 pages
    ):
        tweets.extend(page.data or [])  # page.data is None on an empty page
    return tweets


for t in scrape_x_official("#machinelearning -is:retweet lang:en")[:5]:
    print(f"  {t.id}: {t.text[:80]}")
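Because the official API bills per post read, the Paginator's `max_results` and `limit` together set a spending ceiling for each call. A small sketch of that arithmetic, using the $0.005 per-post price cited above; `worst_case_cost` is a hypothetical helper, not part of tweepy:

```python
X_OFFICIAL_PRICE_PER_POST = 0.005  # USD, from the pricing cited above


def worst_case_cost(max_results: int, page_limit: int,
                    price_per_post: float = X_OFFICIAL_PRICE_PER_POST) -> float:
    """Upper bound in USD, assuming every page comes back full."""
    return max_results * page_limit * price_per_post


# The values used in scrape_x_official: 100 results per page, 10 pages.
ceiling = worst_case_cost(max_results=100, page_limit=10)
print(f"at most ${ceiling:.2f} per call to scrape_x_official")
```

Lowering `limit` is the simplest way to cap spend while testing a new query.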
Path 3 — Browser-automation (Playwright)
The classic 'scrape it ourselves' path. Spin up a headless browser, navigate to a tweet URL or search page, parse the rendered HTML. Works without any API key but has three structural problems:
Problem 1 — Maintenance: X changes their HTML structure regularly. Every change breaks your selectors. Dev teams running Playwright scrapers typically spend half a day per month patching parsers.
Problem 2 — Login required for most content: anonymous browsers see limited tweets; full content requires login. Login automation triggers X's detection (captcha, account lock, anti-bot flags). Maintaining a pool of working accounts is its own engineering project.
Problem 3 — ToS-risk: Playwright + automated login patterns violate X's terms. Accounts get suspended, IPs get rate-limited, your scraper stops working at the worst possible moment (right before a deadline).
Reasonable for: a one-off small scrape, learning the platform, or a workflow that genuinely needs DOM-level data not exposed via API. Not reasonable for: a production dependency.
# pip install playwright
# python -m playwright install chromium
from playwright.sync_api import sync_playwright


def scrape_via_browser(tweet_url: str):
    """Illustrative only; production use should prefer the API path."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(tweet_url, wait_until="networkidle")
        # Selectors here break whenever X redesigns. Treat as fragile.
        text = page.locator("[data-testid='tweetText']").first.text_content()
        likes = page.locator("[data-testid='like']").first.text_content()
        browser.close()
        return {"text": text, "likes": likes}


# Real-world: this works today, breaks next month, requires constant maintenance
# Most production teams move off this path within 6 months
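The browser path returns whatever text is rendered, so `likes` above is a string (or `None`) rather than a number. A minimal sketch of normalizing such strings, assuming counts are rendered in an abbreviated form like `1.2K` or `3M`. That is an assumption about the page's formatting, which may differ or change, and `parse_count` is a hypothetical helper:

```python
from typing import Optional

_SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(raw: Optional[str]) -> Optional[int]:
    """Convert rendered count text such as '1.2K' or '4,310' to an int.

    Returns None when the text is missing or not recognizably a number.
    """
    if raw is None:
        return None
    text = raw.strip().replace(",", "").upper()
    if not text:
        return None
    multiplier = 1
    if text[-1] in _SUFFIXES:
        multiplier = _SUFFIXES[text[-1]]
        text = text[:-1]
    try:
        return round(float(text) * multiplier)
    except (ValueError, OverflowError):
        return None


examples = ["1.2K", "4,310", "3M", "", None, "n/a"]
parsed = [parse_count(e) for e in examples]
print(parsed)
```

This kind of cleanup is part of the maintenance cost of the browser path; the API paths return numeric fields directly.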
Path 4 — Third-party scraper SaaS
Tools like Apify, scrapfly, and ScraperAPI host the browser-automation maintenance for you. You call their API; they handle the proxy rotation, the parser updates, and the anti-bot evasion. Pricing varies by provider: Apify's Twitter scraper actor is ~$0.40-$2.00 per 1,000 tweets depending on plan, while scrapfly charges per credit per scrape, with cost dependent on render mode and proxy tier.
Reasonable for: workflows that need the abstraction over Playwright. Less reasonable for: most workflows, because if you're paying anyway, paying per structured tweet (paths 1 and 2) is operationally simpler than paying per actor run with parsing edge cases.
Side-by-side — 4-path matrix
Per-tweet cost derived from each provider's published pricing page. Apify and scrapfly pricing approximate based on their public plan tiers at apify.com/pricing and scrapfly.io/pricing.
Two practical observations: (a) the dev-time cost of DIY scraping dominates the dollar cost of API paths for any sustained workload; (b) at roughly 33× cheaper per tweet than X official ($0.00015 vs. $0.005), twitterapi.io's economics let you scrape at scale without re-deciding budget every month.
Common scraping workloads — which path fits
Brand monitoring (track mentions of your brand): twitterapi.io advanced_search with "your brand" query, hourly cron, ~500 tweets/run = $0.075/run × 24/day = $1.80/day. Stable, no maintenance.
OSINT / journalism (research a person or topic): same advanced_search pattern, ad-hoc queries. Per-query cost is single-digit cents.
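Academic research (full-archive multi-year pull): twitterapi.io archive depth + cheap per-tweet cost. A 5-year archive of a moderate-activity account is single-digit dollars: assuming roughly 10 posts a day, that is about 18,250 tweets, or about $2.74 at $0.00015 per tweet.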
Real-time monitoring (instant alerts): WebSocket streaming via twitterapi.io's stream endpoint, or polling the search endpoint every 30s. Stream is the cleaner pattern for high-volume.
Analytics product (build a dashboard for users): twitterapi.io for the read layer + your warehouse + your dashboard frontend. Per-call cost stays linear with usage.
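For the polling variant of real-time monitoring, consecutive polls tend to return tweets that were already seen. A minimal sketch of filtering those out by tweet ID, assuming each tweet dict carries an `id` field as in the examples above. `NewTweetFilter` and the canned poll results are hypothetical:

```python
from typing import Iterable, List


class NewTweetFilter:
    """Remember tweet IDs across polls and pass through only unseen tweets."""

    def __init__(self):
        self._seen = set()

    def filter(self, tweets: Iterable[dict]) -> List[dict]:
        fresh = []
        for tweet in tweets:
            tweet_id = tweet.get("id")
            # Skip tweets without an ID and tweets returned by an earlier poll.
            if tweet_id is None or tweet_id in self._seen:
                continue
            self._seen.add(tweet_id)
            fresh.append(tweet)
        return fresh


# Canned results from three consecutive polls; the second overlaps the first.
polls = [
    [{"id": "10", "text": "first"}, {"id": "11", "text": "second"}],
    [{"id": "11", "text": "second"}, {"id": "12", "text": "third"}],
    [],
]
new_filter = NewTweetFilter()
new_per_poll = [[t["id"] for t in new_filter.filter(batch)] for batch in polls]
print(new_per_poll)
```

In a real loop, each batch would come from a search call followed by a sleep of about 30 seconds. The seen-ID set grows without bound in this sketch, so a long-running process would need to prune it.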
Picking the path — decision rule
Default: twitterapi.io API. Lowest setup friction, lowest per-call cost, zero maintenance burden. No Developer Console gating.
Already on X official for other workflows: X official; marginal cost rides on the same auth.
Genuinely need DOM-level data (visible-rendering details, paid-tier-only fields): Playwright for the specific case, kept small and isolated from your main pipeline.
Want the Apify ecosystem (actor marketplace + workflow integration): Apify, knowing the per-tweet cost is higher than API paths.
Most teams that end up at twitterapi.io get there by trying Playwright first (too fragile), then X official (too costly). Save the iteration time by starting there.
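The decision rule above can be written down as a small function, which is handy as a checklist in a design doc or a project template. This is a sketch only: `pick_path` and its parameter names are hypothetical, and the order in which the special cases are checked is an assumption, since the rule itself does not rank them.

```python
def pick_path(needs_dom_level_data: bool = False,
              already_on_x_official: bool = False,
              wants_apify_ecosystem: bool = False) -> str:
    """Return the path suggested by the decision rule above.

    The precedence of the special cases is an assumption of this sketch.
    """
    if needs_dom_level_data:
        return "Playwright (small, isolated from the main pipeline)"
    if already_on_x_official:
        return "X official API"
    if wants_apify_ecosystem:
        return "Apify"
    # No special requirement: fall through to the default.
    return "twitterapi.io API"


default_choice = pick_path()
dom_choice = pick_path(needs_dom_level_data=True)
print(default_choice)
print(dom_choice)
```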
# Practical example: stable brand-mention scraper running on cron.
import os
import json
import requests
from datetime import datetime, timezone

HEADERS = {"X-API-Key": os.environ["TWITTERAPI_IO_KEY"]}
BASE = "https://api.twitterapi.io"


def scrape_brand_mentions(brand: str, hours_back: int = 1, out_path: str = "mentions.jsonl"):
    query = f'"{brand}" -is:retweet within_time:{hours_back}h'
    tweets, cursor = [], None
    for _ in range(20):  # cap pages so a runaway query doesn't drain credit
        params = {"query": query}
        if cursor:
            params["cursor"] = cursor
        r = requests.get(
            f"{BASE}/twitter/tweet/advanced_search",
            headers=HEADERS, params=params, timeout=15,
        )
        r.raise_for_status()
        resp = r.json()
        tweets.extend(resp.get("tweets", []))
        cursor = resp.get("next_cursor")
        if not cursor:
            break
    snapshot = {
        "brand": brand,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "count": len(tweets),
        "tweets": tweets[:200],  # cap for storage
    }
    with open(out_path, "a") as f:
        f.write(json.dumps(snapshot) + "\n")
    return snapshot


snap = scrape_brand_mentions("twitterapi.io")
print(f"captured {snap['count']} mentions in last hour at {snap['captured_at']}")
for t in snap["tweets"][:3]:
    print(f"  @{t.get('author', {}).get('userName')}: {t.get('text', '')[:80]}")

# Cost framing (math from cited pricing pages):
# ~500 tweets per hourly run × $0.00015 = $0.075 per run
# Hourly × 24 × 30 = $54/month for continuous brand monitoring
# Same workload via Playwright: 0 dollars + 4 hours/month patching selectors after X redesigns
# The dev-time cost makes the dollar cost negligible; the API path wins on total cost
Questions readers ask
Is scraping Twitter (X) against the terms of service?
It depends on the path. Using the official X API or third-party API providers like twitterapi.io is the standard developer workflow within terms. Browser-automation + automated-login at scale is the path where ToS-violation risk lives. Read-only data via documented APIs is generally fine; review docs.x.com developer terms for your specific commercial use case.
Can I scrape tweets without an X account?
Yes, via twitterapi.io — API-key auth doesn't require an X account. X official requires an account for the Developer Console. Playwright + anonymous browsing returns limited content (login-walled features hidden). See /blog/twitter-no-account-api-read-only for the no-X-account path detail.
How fast can I scrape — what are the rate limits?
twitterapi.io rate limits are per-account, with defaults in the thousands of requests/hour. X official's rate limits vary by tier (50/15-min on the basic search). Playwright is limited by your browser instance + X's anti-bot detection (slow + flaky). For high-volume, the API path always wins.
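Whatever the provider's limits are, a client will sooner or later receive a rate-limit response. A common way to handle it (a general technique, not something the providers above prescribe) is exponential backoff on HTTP 429. A minimal sketch: `call_with_backoff` and the fake `send` callable are hypothetical and stand in for a real `requests.get` call.

```python
import time
from typing import Callable, Tuple


def call_with_backoff(send: Callable[[], Tuple[int, dict]],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call send() until it returns a non-429 status or attempts run out.

    send returns (status_code, decoded_body). The delay doubles after each
    429: base_delay, 2 * base_delay, 4 * base_delay, and so on.
    Only 429 is retried; any other status is returned to the caller as-is.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")


# Fake transport: rate-limited twice, then succeeds.
_responses = iter([(429, {}), (429, {}), (200, {"tweets": []})])
delays = []  # record the requested delays instead of actually sleeping
result = call_with_backoff(lambda: next(_responses), sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. If a provider returns a `Retry-After` header, honoring it is usually preferable to a fixed schedule.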
What about scraping deleted tweets?
Deleted tweets are typically filtered out by the search endpoints. For deleted-tweet research, see /blog/deleted-tweet-search and /blog/twitter-archive-tweet-finder-guide — different specialty workflows.
Can I scrape and store the data commercially?
Public tweet data + profile data storage is generally allowed by both X's developer terms and most third-party API providers. Restrictions vary by use case (resale of data, training AI models, advertising) — review the specific terms before commercial use. Personal data + DMs are not accessible via public scraping regardless.
What's the cost difference at scale (1M tweets / month)?
Math from cited pricing pages: 1M tweets at twitterapi.io = $150/month; at X official = $5,000/month; via Apify actor, the cost varies by plan tier per apify.com/pricing. Playwright is $0 in dollars, but the dev-maintenance time compounds quickly whenever X changes its HTML. The API paths' total cost (dollars plus near-zero dev time) usually wins for sustained workloads.
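The same arithmetic as a short script, so the volume can be swapped for your own. This is a sketch using only the two per-tweet prices cited in this guide; `monthly_cost` is a hypothetical helper, and Apify is left out because its cost depends on plan tier.

```python
# Per-tweet prices in USD, as cited from each provider's pricing page above.
PRICE_PER_TWEET_USD = {
    "twitterapi.io": 0.00015,
    "X official": 0.005,
}


def monthly_cost(tweets_per_month: int) -> dict:
    """Dollar cost per provider for a given monthly tweet volume."""
    return {name: round(tweets_per_month * price, 2)
            for name, price in PRICE_PER_TWEET_USD.items()}


costs = monthly_cost(1_000_000)
ratio = PRICE_PER_TWEET_USD["X official"] / PRICE_PER_TWEET_USD["twitterapi.io"]
for name, usd in costs.items():
    print(f"{name}: ${usd:,.2f}/month")
print(f"price ratio: about {ratio:.0f}x")
```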