twitterapi.io is an independent third-party service. Not affiliated with X Corp.

Blogtwitter data collection

Twitter (X) Data Collection — A Developer's API Guide

By Alex Chen4 min read

Twitter (X) data collection is a compound problem: pick the right endpoint(s), paginate at scale, persist for later query. Different downstream use cases (research corpus, brand monitoring, longitudinal study, competitor tracking) share the same collection primitives; only the query shape and volume vary.

This guide walks the four endpoint families in the collection surface, runnable Python for each, and the honest per-record cost across providers. Pricing references URL-cited.

01 — Section

The four collection endpoint families

Tweet-by-query: advanced_search — retrieve tweets matching a rule (keyword, hashtag, date range, engagement filters). The workhorse for any content-based collection.

Profile lookup: user/info — retrieve profile metadata (bio, follower/following counts, verified status). One call per profile.

Timeline pull: user/last_tweets — retrieve a specific user's recent post history. Combine with pagination for full-history archival.

Network extraction: user/followers and user/followings — retrieve social-graph edges. Cursor-paginated; large accounts require many calls.

Any real collection project combines 2-4 of these depending on the target signal. A discourse-research corpus uses advanced_search only. A brand-monitoring dashboard uses advanced_search + user/info to enrich author context. A network study uses followers + user/info to build the graph. An influencer-tracking product uses last_tweets + user/info + advanced_search for full context.

02 — Section

Path 1 — twitterapi.io endpoint set

Auth via X-API-Key. All four endpoints share the same auth surface + cursor pagination pattern.

Pricing per twitterapi.io/pricing: $0.00015 per returned tweet, $0.00018 per profile lookup, $0.00001-$0.00003 per follower entry (tiered by page size). No monthly minimum.

python
import os, requests

HEADERS = {"X-API-Key": os.environ["TWITTERAPI_IO_KEY"]}
BASE = "https://api.twitterapi.io"

def collect_tweets(query: str, max_pages: int = 20):
    tweets, cursor = [], None
    for _ in range(max_pages):
        params = {"query": query}
        if cursor: params["cursor"] = cursor
        r = requests.get(f"{BASE}/twitter/tweet/advanced_search", headers=HEADERS, params=params, timeout=15)
        r.raise_for_status()
        resp = r.json()
        tweets.extend(resp.get("tweets", []))
        cursor = resp.get("next_cursor")
        if not cursor: break
    return tweets

def collect_profile(handle: str):
    r = requests.get(f"{BASE}/twitter/user/info", headers=HEADERS, params={"userName": handle}, timeout=10)
    r.raise_for_status()
    return r.json()

def collect_timeline(handle: str, max_pages: int = 20):
    tweets, cursor = [], None
    for _ in range(max_pages):
        params = {"userName": handle}
        if cursor: params["cursor"] = cursor
        r = requests.get(f"{BASE}/twitter/user/last_tweets", headers=HEADERS, params=params, timeout=15)
        r.raise_for_status()
        resp = r.json()
        tweets.extend(resp.get("data", []))
        cursor = resp.get("next_cursor")
        if not cursor: break
    return tweets

def collect_followers(handle: str, max_pages: int = 5):
    followers, cursor = [], None
    for _ in range(max_pages):
        params = {"userName": handle, "count": 5000}
        if cursor: params["cursor"] = cursor
        r = requests.get(f"{BASE}/twitter/user/followers", headers=HEADERS, params=params, timeout=20)
        r.raise_for_status()
        resp = r.json()
        followers.extend(resp.get("data", []))
        cursor = resp.get("next_cursor")
        if not cursor: break
    return followers

# Compose a collection workflow
query_tweets = collect_tweets('"machine learning" min_faves:100 lang:en')
profile = collect_profile("twitterapi_io")
timeline = collect_timeline("twitterapi_io")
followers = collect_followers("twitterapi_io", max_pages=2)
print(f"collected: {len(query_tweets)} tweets, 1 profile, {len(timeline)} timeline, {len(followers)} followers")
03 — Section

Path 2 — X official equivalents

Same endpoint families under different names: /2/tweets/search/recent (7d) or /2/tweets/search/all (enterprise) for tweets-by-query; /2/users/by/username/{name} for profile; /2/users/{id}/tweets for timeline; /2/users/{id}/followers for network.

Pricing per docs.x.com/x-api/getting-started/pricing: $0.005 per post read, $0.010 per profile. Full-archive tweets require enterprise contract.

04 — Section

Pagination + persistence patterns

Cursor pagination: both providers use cursor tokens. Loop until next_cursor is null OR you hit a page cap. Always cap pages for budget safety (a runaway query can drain credit).

JSONL for archives: write each returned record as one line of JSON. Streamable, appendable, minimal parsing. Standard format for research corpora and audit trails.

Postgres / relational for query: model as tweets(id, author_id, text, created_at, public_metrics jsonb) + authors(id, handle, follower_count, ...). Index on author_id, created_at. Rich querying via SQL.

Columnar warehouse (BigQuery / Snowflake / Redshift) for analytics: batch-load JSONL from your archive into columnar tables. Best for large-scale aggregation queries (top-authors-by-month, engagement-distribution-over-year).

Elasticsearch / OpenSearch for full-text: index tweet text for keyword-in-content search. Best when your users do free-text search over the collected corpus.

05 — Section

Side-by-side — 2 collection paths at scale

Costs derived from each provider's published pricing pages. Sample workload: collect 1M tweets + 10K profiles + 100K follower edges.

Dimensiontwitterapi.ioX official
Per-tweet cost$0.00015 (twitterapi.io/pricing)$0.005 (docs.x.com)
Per-profile cost$0.00018$0.010
Full-archive depthincludedEnterprise tier only
Auth setupemail + API keyX Developer Console + bearer
1M-tweet corpus~$150~$5,000
10K profiles~$1.80~$100
100K follower edges~$1-3 (tiered)~$500 (extrapolated)
Sample-total workload~$155~$5,600

Two practical observations: (a) at collection scale, the per-record ratio compounds — the difference is measured in thousands of dollars per project; (b) full-archive depth on twitterapi.io means historical collection doesn't need enterprise contract negotiation.

06 — Section

Common data-collection projects — patterns to reuse

Research corpus (100K-1M tweets): advanced_search with topic + date range → JSONL archive → columnar warehouse for analysis. Cost: $15-150.

Brand monitoring dashboard (continuous): advanced_search every N minutes for brand mention + user/info to enrich top mentioners → Postgres for query → dashboard reads. Cost: $50-200/month.

Longitudinal user study (per-user timeline archive): last_tweets paginated through history + user/info snapshot → JSONL per user. Sub-$1 per user for years of history.

Network / graph study (medium-account network): followers + user/info per node → graph database (Neo4j / networkx pickle) → analysis. Cost: $50-500 depending on network size.

Influencer discovery + tracking: advanced_search by topic + engagement floor → user/info for authors → last_tweets for context. Rolling monthly refresh. Cost: $100-500/month for a working watch-list.

07 — Section

Picking the path

Building a data product or product-adjacent research pipeline → twitterapi.io. Cost economics at collection scale are the deciding factor.

Occasional / one-off small collection → either works; X official if already authenticated.

Large institutional research + existing X Enterprise contract → X official — marginal cost rides on existing contract.

Historical archive spanning years → twitterapi.io. Enterprise-tier X full-archive works but the setup + cost delta is significant.

python
# Practical example: multi-endpoint collection pipeline for a monitoring project.
import os, requests, json
from pathlib import Path
from datetime import datetime, timezone

HEADERS = {"X-API-Key": os.environ["TWITTERAPI_IO_KEY"]}
BASE = "https://api.twitterapi.io"
OUT_DIR = Path("collection")
OUT_DIR.mkdir(exist_ok=True)

TARGETS = {
    "brand_mentions": '"twitterapi.io" -is:retweet lang:en',
    "topic_signal": '"machine learning API" min_faves:50 lang:en',
}
HANDLES_TO_ENRICH = ["twitterapi_io", "nasa", "github"]

def collect_query_to_jsonl(query: str, out_path: Path, max_pages: int = 20):
    tweets, cursor = 0, None
    with open(out_path, "a") as f:
        for _ in range(max_pages):
            params = {"query": query}
            if cursor: params["cursor"] = cursor
            r = requests.get(f"{BASE}/twitter/tweet/advanced_search", headers=HEADERS, params=params, timeout=15)
            r.raise_for_status()
            resp = r.json()
            for t in resp.get("tweets", []):
                f.write(json.dumps({"captured_at": datetime.now(timezone.utc).isoformat(), **t}) + "\n")
                tweets += 1
            cursor = resp.get("next_cursor")
            if not cursor: break
    return tweets

def collect_profiles(handles: list, out_path: Path):
    profiles = []
    with open(out_path, "w") as f:
        for h in handles:
            r = requests.get(f"{BASE}/twitter/user/info", headers=HEADERS, params={"userName": h}, timeout=10)
            r.raise_for_status()
            profile = r.json()
            f.write(json.dumps({"captured_at": datetime.now(timezone.utc).isoformat(), **profile}) + "\n")
            profiles.append(profile)
    return profiles

for label, query in TARGETS.items():
    n = collect_query_to_jsonl(query, OUT_DIR / f"{label}.jsonl")
    print(f"  {label}: {n} tweets")

profiles = collect_profiles(HANDLES_TO_ENRICH, OUT_DIR / "profiles.jsonl")
print(f"  profiles: {len(profiles)}")

# Cost framing per twitterapi.io/pricing:
#   ~1000 tweets across queries × $0.00015 = $0.15
#   3 profiles × $0.00018 = $0.00054
#   Total: $0.15 for this collection cycle
# Same via X official: $5.03 (~33x)
08 — Questions

Questions readers ask

How do I handle rate limits during large collection?

Both providers publish rate limits. For twitterapi.io: per-API-key throughput lands in the thousands of requests/hour for most tiers. Pace batch pulls across multiple hours rather than bursting. For X official: rate limits are tier-based and generally tighter — plan accordingly. If you hit a 429, back off exponentially per your client library's convention.

What's the best format to store collected data?

JSONL for archives (append-friendly, streamable, minimal parsing overhead). Load into Postgres or columnar warehouse for query-heavy access. Elasticsearch for full-text search. Choice depends on downstream access pattern, not on the collection API.

Can I resume a partial collection if my script crashes?

Yes — persist the last-known next_cursor per query. On restart, load the cursor and continue from there. For time-bounded pulls, also persist the last successfully-processed date to avoid re-fetching.

How do I avoid collecting the same tweet twice?

Deduplicate on tweet.id at your storage layer. If using Postgres, unique constraint on id. If using JSONL archives, a separate 'seen ids' set kept in memory (or on disk) during collection.

What about tweet deletions after collection?

A tweet that was public at collection time and later deleted stays in your JSONL / warehouse (you have a copy). Whether to purge from your storage on notification of deletion depends on your data policy + jurisdiction. Public-data research typically retains; commercial products may need deletion cascades.

Can I redistribute the collected data?

Provider terms + jurisdiction terms both apply. Public tweet data is generally usable for research + analytics + product-internal use. Redistribution as a data product (selling tweets) usually requires explicit provider agreement + terms compliance. Review provider terms before any commercial redistribution.

09 — Further reading

Continue

Sources & further reading
More from this series
Build it

Stop reading. Start building.

Starter credits cover real testing on real data. Google sign-in, no card, no application queue.

Get an API key
    Twitter (X) Data Collection — API Guide | TwitterAPI.io