# Google Search Console

## Description
Connect to Google Search Console via OAuth2, inspect URLs, fetch search analytics, manage sitemaps, diagnose indexing/crawl errors, and bulk-export performance data. Uses credentials from `.env` and caches the OAuth token locally.

## When to Activate
- User wants to check Google Search Console data
- User asks about indexing errors, crawl issues, or coverage problems
- User wants to see search analytics (clicks, impressions, queries, pages)
- User asks about sitemap status or wants to submit a sitemap
- User wants to inspect specific URLs for indexing status
- User mentions "GSC", "Search Console", or "search performance"
- User wants to export or paginate through all search data
- User wants to batch multiple GSC API calls

## Prerequisites

### 1. Google Cloud Project Setup
Before first use, the user needs:
1. A Google Cloud project with **Search Console API** enabled
2. An **OAuth 2.0 Client ID** (Desktop app type) created in APIs & Services > Credentials
3. The user's Google account added as a **test user** in OAuth consent screen (if app is in testing mode)

### 2. Environment Variables
Add to `.env` in the project root:
```
GOOGLE_CLIENT_ID=<client-id>.apps.googleusercontent.com
GOOGLE_CLIENT_SECRET=GOCSPX-<secret>
```

### 3. Python Dependencies
```bash
pip3 install google-api-python-client google-auth-oauthlib google-auth-httplib2 python-dotenv
```

## OAuth2 Scopes

| Scope | Access |
|-------|--------|
| `https://www.googleapis.com/auth/webmasters` | Read/write (sites, sitemaps, analytics) |
| `https://www.googleapis.com/auth/webmasters.readonly` | Read-only |
| `https://www.googleapis.com/auth/indexing` | Indexing API (URL update/remove notifications) |

URL Inspection is covered by the `webmasters` scopes; the `indexing` scope is only needed for the separate Indexing API. Use `webmasters` (not `readonly`) if you need to submit sitemaps or add/remove sites. If a token was created with `readonly` scope and you need write access, delete `.gsc_token.json` and re-authenticate with the full scope.
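
If you're unsure which scopes a cached token carries, inspect the token file directly (a minimal sketch; the `scopes` key is written by `creds.to_json()` in the flow below):

```python
import json
from pathlib import Path

token_path = Path('.gsc_token.json')
if token_path.exists():
    scopes = json.loads(token_path.read_text()).get('scopes', [])
    print('\n'.join(scopes))  # re-authenticate if the scope you need is missing
```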

## Authentication Flow

First run opens a browser for OAuth consent. The token is cached to `.gsc_token.json` for subsequent runs. If the token expires, it auto-refreshes using the stored refresh token.

```python
import os
import json
from pathlib import Path
from dotenv import load_dotenv
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

load_dotenv()

SCOPES = [
    'https://www.googleapis.com/auth/webmasters',
    'https://www.googleapis.com/auth/indexing',  # only needed for the separate Indexing API
]
TOKEN_FILE = Path('.gsc_token.json')

def get_credentials():
    creds = None
    if TOKEN_FILE.exists():
        creds = Credentials.from_authorized_user_file(str(TOKEN_FILE), SCOPES)

    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            client_config = {
                "installed": {
                    "client_id": os.getenv("GOOGLE_CLIENT_ID"),
                    "client_secret": os.getenv("GOOGLE_CLIENT_SECRET"),
                    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
                    "token_uri": "https://oauth2.googleapis.com/token",
                    "redirect_uris": ["http://localhost"]
                }
            }
            flow = InstalledAppFlow.from_client_config(client_config, SCOPES)
            creds = flow.run_local_server(port=8085)

        with open(TOKEN_FILE, 'w') as f:
            f.write(creds.to_json())

    return creds

def get_service():
    return build('searchconsole', 'v1', credentials=get_credentials())
```
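
All snippets below assume a `service` object built this way:

```python
service = get_service()
```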

## REST API Base

Sites, sitemaps, and search analytics endpoints follow: `https://www.googleapis.com/webmasters/v3/resourcePath?parameters`. URL Inspection lives under the newer service: `https://searchconsole.googleapis.com/v1/urlInspection/index:inspect`.

The Python client abstracts this away, but the URL patterns are useful when debugging raw requests.
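
For example, a raw call against the sites endpoint (a sketch using the third-party `requests` library and the cached credentials from above):

```python
import requests

creds = get_credentials()
resp = requests.get(
    'https://www.googleapis.com/webmasters/v3/sites',
    headers={'Authorization': f'Bearer {creds.token}'},
)
print(resp.status_code, resp.json())
```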

---

## API Operations

### List Verified Sites
```python
sites = service.sites().list().execute()
for site in sites.get('siteEntry', []):
    print(f"{site['siteUrl']} ({site['permissionLevel']})")
```

Permission levels: `siteOwner`, `siteFullUser`, `siteRestrictedUser`, `siteUnverifiedUser`.

### Add / Remove Sites
```python
# Add a site (requires verification separately)
service.sites().add(siteUrl='https://example.com/').execute()

# Remove a site
service.sites().delete(siteUrl='https://example.com/').execute()
```

---

### URL Inspection
Inspect individual URLs for indexing status, crawl info, mobile usability, and rich results.

```python
result = service.urlInspection().index().inspect(
    body={
        'inspectionUrl': 'https://example.com/page/',
        'siteUrl': 'https://example.com/'
    }
).execute()

ir = result['inspectionResult']
idx = ir['indexStatusResult']
# Key fields: verdict, coverageState, indexingState, lastCrawlTime,
#             crawledAs, robotsTxtState, pageFetchState, referringUrls

mobile = ir.get('mobileUsabilityResult', {})
# Key fields: verdict, issues[].issueType, issues[].severity

rich = ir.get('richResultsResult', {})
# Key fields: verdict, detectedItems[].richResultType, detectedItems[].issues[]
```

#### Common Coverage States (errors to watch for)
- `"Page with redirect"` — URL redirects, won't be indexed at this URL
- `"URL is unknown to Google"` — never discovered/crawled
- `"Crawled - currently not indexed"` — crawled but Google chose not to index
- `"Discovered - currently not indexed"` — known but not yet crawled
- `"Excluded by 'noindex' tag"` — page has noindex directive
- `"Blocked by robots.txt"` — robots.txt prevents crawling
- `"Soft 404"` — page exists but Google treats as 404
- `"Server error (5xx)"` — server returned error during crawl
- `"Submitted URL marked 'noindex'"` — sitemap URL has noindex
- `"Page indexed without content"` — indexed but body appears empty

---

### Search Analytics

Query search performance data. Data has a **2-3 day lag**; queries covering the most recent days return nothing unless `dataState` is set to `all`.

#### Request Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| `startDate` | Yes | Start of date range (YYYY-MM-DD) |
| `endDate` | Yes | End of date range (YYYY-MM-DD) |
| `dimensions` | No | Array: `query`, `page`, `country`, `device`, `searchAppearance`, `date` |
| `dimensionFilterGroups` | No | Filter conditions (see below) |
| `searchType` | No | `web` (default), `image`, `video`, `news`, `discover`, `googleNews` |
| `rowLimit` | No | Max rows per request (default 1000, max 25,000) |
| `startRow` | No | Zero-based offset for pagination |
| `aggregationType` | No | `auto` (default), `byPage`, `byProperty` |
| `dataState` | No | `final` (default) or `all` (includes fresh/unfinished data) |

#### Response Metrics (automatically returned per row)
- `clicks` — total clicks
- `impressions` — total impressions
- `ctr` — click-through rate (0.0–1.0)
- `position` — average ranking position

#### Basic Queries
```python
# By page
response = service.searchanalytics().query(
    siteUrl='https://example.com/',
    body={
        'startDate': '2026-02-19',
        'endDate': '2026-03-17',
        'dimensions': ['page'],
        'rowLimit': 25000
    }
).execute()

# By query
response = service.searchanalytics().query(
    siteUrl='https://example.com/',
    body={
        'startDate': '2026-02-19',
        'endDate': '2026-03-17',
        'dimensions': ['query'],
        'rowLimit': 25000
    }
).execute()

# Cross-tabulation (query × page)
response = service.searchanalytics().query(
    siteUrl='https://example.com/',
    body={
        'startDate': '2026-02-19',
        'endDate': '2026-03-17',
        'dimensions': ['query', 'page'],
        'rowLimit': 25000
    }
).execute()

# Each row: keys[] (one per dimension), clicks, impressions, ctr, position
```
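
A quick way to eyeball the top rows from any of these responses (rows are returned sorted by clicks, descending):

```python
for row in response.get('rows', [])[:10]:
    print(' | '.join(row['keys']),
          f"clicks={row['clicks']}",
          f"impressions={row['impressions']}",
          f"ctr={row['ctr']:.1%}",
          f"pos={row['position']:.1f}")
```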

#### Filtering
```python
response = service.searchanalytics().query(
    siteUrl='https://example.com/',
    body={
        'startDate': '2026-02-19',
        'endDate': '2026-03-17',
        'dimensions': ['query'],
        'dimensionFilterGroups': [{
            'filters': [
                {
                    'dimension': 'country',
                    'expression': 'ind'       # ISO 3166-1 alpha-3
                },
                {
                    'dimension': 'device',
                    'expression': 'MOBILE'     # MOBILE, DESKTOP, TABLET
                }
            ]
        }],
        'rowLimit': 25000
    }
).execute()
```

Filter operators (set via the `operator` field): `equals` (default), `notEquals`, `contains`, `notContains`, `includingRegex`, `excludingRegex`. Filters within a group are ANDed together (`groupType` supports only `and`).
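
For example, restricting to queries matching a pattern (the regex operators use RE2 syntax; the pattern here is illustrative):

```python
response = service.searchanalytics().query(
    siteUrl='https://example.com/',
    body={
        'startDate': '2026-02-19',
        'endDate': '2026-03-17',
        'dimensions': ['query'],
        'dimensionFilterGroups': [{
            'filters': [{
                'dimension': 'query',
                'operator': 'includingRegex',
                'expression': '^how (to|do)\\b',
            }]
        }],
    }
).execute()
```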

#### Pagination for Large Datasets (>25K rows)

The API returns max 25,000 rows per request. Max available data is **50,000 rows per day per search type**.

```python
def fetch_all_rows(service, site_url, body):
    """Paginate through all available search analytics rows."""
    body = dict(body)  # copy so we don't mutate the caller's request body
    all_rows = []
    max_rows = 25000
    start_row = 0

    while True:
        body['rowLimit'] = max_rows
        body['startRow'] = start_row
        response = service.searchanalytics().query(
            siteUrl=site_url, body=body
        ).execute()

        rows = response.get('rows', [])
        if not rows:
            break

        all_rows.extend(rows)
        start_row += max_rows

        if len(rows) < max_rows:
            break  # Last page

    return all_rows
```
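
Typical usage, mirroring the request bodies above:

```python
rows = fetch_all_rows(service, 'https://example.com/', {
    'startDate': '2026-02-19',
    'endDate': '2026-03-17',
    'dimensions': ['query', 'page'],
})
print(f"Fetched {len(rows)} rows")
```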

#### Bulk Data Export Strategy

For complete data export, query **one day at a time** to stay within quotas:

```python
from datetime import date, timedelta

def export_all_data(service, site_url, start, end):
    """Export all search analytics data, one day at a time."""
    current = start
    all_data = []

    while current <= end:
        day_str = current.strftime('%Y-%m-%d')
        rows = fetch_all_rows(service, site_url, {
            'startDate': day_str,
            'endDate': day_str,
            'dimensions': ['query', 'page', 'country', 'device'],
        })
        for row in rows:
            row['date'] = day_str
        all_data.extend(rows)
        current += timedelta(days=1)

    return all_data
```
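
And a sketch for dumping the export to CSV (column order assumes the four dimensions used above, since `keys` follows the `dimensions` array):

```python
import csv
from datetime import date

data = export_all_data(service, 'https://example.com/',
                       date(2026, 2, 19), date(2026, 3, 17))

with open('gsc_export.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'query', 'page', 'country', 'device',
                     'clicks', 'impressions', 'ctr', 'position'])
    for row in data:
        writer.writerow([row['date'], *row['keys'],
                         row['clicks'], row['impressions'],
                         row['ctr'], row['position']])
```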

#### Verifying Data Freshness

Before querying, check which dates have data:
```python
response = service.searchanalytics().query(
    siteUrl=site_url,
    body={
        'startDate': (date.today() - timedelta(days=10)).isoformat(),
        'endDate': date.today().isoformat(),
        'dimensions': ['date'],
    }
).execute()
# The most recent date with data tells you the current lag
```
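
Pulling the freshest date out of that response (a sketch):

```python
dates = [row['keys'][0] for row in response.get('rows', [])]
latest = max(dates) if dates else None
print(f"Freshest available date: {latest}")
```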

#### Search Appearance Data

Two-step process:
1. Query with `dimensions: ['searchAppearance']` to discover available types
2. Filter by specific type for detailed analysis

```python
# Step 1: Discover appearance types
response = service.searchanalytics().query(
    siteUrl=site_url,
    body={
        'startDate': start, 'endDate': end,
        'dimensions': ['searchAppearance'],
    }
).execute()
# Returns types like: AMP_ARTICLE, RICH_RESULT, VIDEO, etc.

# Step 2: Filter for a specific type
response = service.searchanalytics().query(
    siteUrl=site_url,
    body={
        'startDate': start, 'endDate': end,
        'dimensions': ['query'],
        'dimensionFilterGroups': [{
            'filters': [{
                'dimension': 'searchAppearance',
                'expression': 'RICH_RESULT'
            }]
        }],
    }
).execute()
```

#### Query Cost Awareness

Queries have different computational costs:
- **Cheapest**: Group by date only, no filters
- **Medium**: Group by country or device
- **Expensive**: Group by page or query
- **Most expensive**: Group by page AND query combined
- Longer date ranges cost more than shorter ones
- Repeated identical queries within a short window cost more

---

### Sitemaps

```python
# List sitemaps
sitemaps = service.sitemaps().list(siteUrl='https://example.com/').execute()
# Each sitemap: path, lastSubmitted, lastDownloaded, warnings, errors,
#               contents[].type, contents[].submitted, contents[].indexed

# Get a specific sitemap
sitemap = service.sitemaps().get(
    siteUrl='https://example.com/',
    feedpath='https://example.com/sitemap.xml'
).execute()

# Submit a sitemap (requires webmasters scope, not readonly)
service.sitemaps().submit(
    siteUrl='https://example.com/',
    feedpath='https://example.com/sitemap.xml'
).execute()

# Delete a sitemap
service.sitemaps().delete(
    siteUrl='https://example.com/',
    feedpath='https://example.com/sitemap.xml'
).execute()
```
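
A small sketch that summarizes sitemap health from the list response (field names match those noted above; the count values come back as strings, hence the `int()` conversions):

```python
for sm in sitemaps.get('sitemap', []):
    submitted = sum(int(c.get('submitted', 0)) for c in sm.get('contents', []))
    indexed = sum(int(c.get('indexed', 0)) for c in sm.get('contents', []))
    print(f"{sm['path']}: errors={sm.get('errors', 0)} "
          f"warnings={sm.get('warnings', 0)} "
          f"submitted={submitted} indexed={indexed}")
```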

---

## Batch Requests

Combine multiple API calls into a single HTTP request. Max **1,000 requests per batch**. Each sub-request counts individually toward quotas.

```python
def handle_response(request_id, response, exception):
    if exception:
        print(f"Request {request_id} failed: {exception}")
    else:
        print(f"Request {request_id}: {response}")

batch = service.new_batch_http_request(callback=handle_response)

# Add multiple URL inspections to the batch
urls = ['https://example.com/', 'https://example.com/about/']
for i, url in enumerate(urls):
    batch.add(
        service.urlInspection().index().inspect(
            body={'inspectionUrl': url, 'siteUrl': 'https://example.com/'}
        ),
        request_id=str(i)
    )

batch.execute()
```

Batch execution order is **not guaranteed** — the server may process calls in any order.
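
To stay under the 1,000-call cap with larger URL lists, chunk before batching (a sketch; `chunked` is a hypothetical helper, and `urls`/`handle_response` are reused from above):

```python
def chunked(items, size=1000):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

for chunk in chunked(urls, size=1000):
    batch = service.new_batch_http_request(callback=handle_response)
    for i, url in enumerate(chunk):
        batch.add(
            service.urlInspection().index().inspect(
                body={'inspectionUrl': url, 'siteUrl': 'https://example.com/'}
            ),
            request_id=str(i)
        )
    batch.execute()
```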

---

## Performance Optimization

### Gzip Compression
The Python client (via httplib2) requests gzip-encoded responses by default, so no extra code is needed. For hand-rolled REST requests, set the `Accept-Encoding` header and include the string `gzip` in the User-Agent:
```python
# Only needed for raw REST calls; the Python client handles gzip automatically.
headers = {
    'Accept-Encoding': 'gzip',
    'User-Agent': 'my-program (gzip)',  # "gzip" must appear in the User-Agent
}
```

### Partial Responses
Request only needed fields to reduce payload size:
```python
# Use the fields parameter to request specific fields only
response = service.searchanalytics().query(
    siteUrl=site_url,
    body={...},
    fields='rows(keys,clicks,impressions)'  # Only return these fields
).execute()

# For sites list
sites = service.sites().list(fields='siteEntry(siteUrl,permissionLevel)').execute()
```

Field syntax:
- Comma-separated: `field1,field2`
- Nested: `a/b` (field b inside a)
- Sub-selection: `items(title,length)` (only title and length from each item)

---

## Site URL Format

GSC uses two property types:
- **URL-prefix**: `https://example.com/` (trailing slash required)
- **Domain**: `sc-domain:example.com` (covers all protocols and subdomains)

Always check which format the user's property uses via `sites().list()` before making API calls.
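
A sketch that resolves a bare domain to whichever property form the account actually has (`resolve_property` is a hypothetical helper):

```python
def resolve_property(service, domain):
    """Return the matching GSC property string for a bare domain, or None."""
    sites = service.sites().list().execute()
    candidates = {
        f'sc-domain:{domain}',
        f'https://{domain}/',
        f'https://www.{domain}/',
        f'http://{domain}/',
    }
    for site in sites.get('siteEntry', []):
        if site['siteUrl'] in candidates:
            return site['siteUrl']
    return None
```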

---

## Rate Limits & Quotas

### Search Analytics
| Scope | Limit |
|-------|-------|
| Per-site | 1,200 queries/minute |
| Per-user | 1,200 queries/minute |
| Per-project | 40,000 queries/minute, 30,000,000 queries/day |

### URL Inspection
| Scope | Limit |
|-------|-------|
| Per-site | 2,000 queries/day, 600 queries/minute |
| Per-project | 15,000 queries/minute, 10,000,000 queries/day |

### All Other Resources (sites, sitemaps)
| Scope | Limit |
|-------|-------|
| Per-user | 20 queries/second, 200 queries/minute |
| Per-project | 100,000,000 queries/day |

### Load Quota
Measured in 10-minute and 1-day chunks. If you hit the load quota, **wait 15 minutes** and retry.

Monitor usage at: Google Cloud Console > APIs & Services > Quotas.

---

## Gotchas
- **Port conflicts**: If `run_local_server(port=8085)` fails with "Address in use", kill the process on that port (`lsof -ti:8085 | xargs kill -9`) or use a different port.
- **BlogVault / WAF blocking**: When applying fixes on a WordPress site fronted by a WAF (BlogVault, Sucuri, Cloudflare), WordPress REST API POST requests may be blocked. Use FTP or WP-CLI for write operations instead.
- **OAuth consent screen**: The app must be in "Testing" mode with the user added as a test user, or fully verified. Otherwise Google blocks the auth flow with "access_denied".
- **Scope mismatch**: If you authenticated with `readonly` scope but need write access (submit sitemap, add site), delete `.gsc_token.json` and re-authenticate with `webmasters` scope.
- **Token expiry**: Access tokens last ~1 hour. The refresh token is used automatically. If refresh fails, delete `.gsc_token.json` and re-authenticate.
- **Search analytics lag**: Data is 2-3 days behind. Querying today or yesterday typically returns no rows unless `dataState` is set to `all`. Verify freshness by querying the `date` dimension over the past 10 days.
- **50K row cap**: The API exposes max 50,000 rows per day per search type, sorted by clicks. Low-traffic long-tail queries may be omitted.
- **Expensive queries**: Grouping by page+query is the most expensive combination. Use simpler dimensions when possible, and avoid repeating identical queries in quick succession.
- **Never commit `.gsc_token.json`** — add it to `.gitignore`.
