# Google Search Console

## Description
Connect to Google Search Console via OAuth2, inspect URLs, fetch search analytics, manage sitemaps, diagnose indexing/crawl errors, and bulk-export performance data. Uses credentials from `.env` and caches the OAuth token locally.

## When to Activate
- User wants to check Google Search Console data
- User asks about indexing errors, crawl issues, or coverage problems
- User wants to see search analytics (clicks, impressions, queries, pages)
- User asks about sitemap status or wants to submit a sitemap
- User wants to inspect specific URLs for indexing status
- User mentions "GSC", "Search Console", or "search performance"
- User wants to export or paginate through all search data
- User wants to batch multiple GSC API calls

## Prerequisites

### 1. Google Cloud Project Setup
Before first use, the user needs:
1. A Google Cloud project with **Search Console API** enabled
2. An **OAuth 2.0 Client ID** (Desktop app type) created in APIs & Services > Credentials
3. The user's Google account added as a **test user** in OAuth consent screen (if app is in testing mode)

### 2. Environment Variables
Add to `.env` in the project root:
```
GOOGLE_CLIENT_ID=<client-id>.apps.googleusercontent.com
GOOGLE_CLIENT_SECRET=GOCSPX-<secret>
```

### 3. Python Dependencies
```bash
pip3 install google-api-python-client google-auth-oauthlib google-auth-httplib2 python-dotenv
```

## OAuth2 Scopes

| Scope | Access |
|-------|--------|
| `https://www.googleapis.com/auth/webmasters` | Read/write (sites, sitemaps, analytics) |
| `https://www.googleapis.com/auth/webmasters.readonly` | Read-only |
| `https://www.googleapis.com/auth/indexing` | Indexing API (URL update/remove notifications) |

URL Inspection is covered by the `webmasters` scopes; the `indexing` scope is only needed for the separate Indexing API. Use `webmasters` (not `readonly`) if you need to submit sitemaps or add/remove sites. If a token was created with `readonly` scope and you need write access, delete `.gsc_token.json` and re-authenticate with the full scope.
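
If you're unsure which scopes a cached token carries, inspect the token file directly (a minimal sketch; the `scopes` key is written by `creds.to_json()` in the flow below):

```python
import json
from pathlib import Path

token_path = Path('.gsc_token.json')
if token_path.exists():
    scopes = json.loads(token_path.read_text()).get('scopes', [])
    print('\n'.join(scopes))  # re-authenticate if the scope you need is missing
```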

## Authentication Flow

First run opens a browser for OAuth consent. The token is cached to `.gsc_token.json` for subsequent runs. If the token expires, it auto-refreshes using the stored refresh token.

```python
import os
import json
from pathlib import Path
from dotenv import load_dotenv
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

load_dotenv()

SCOPES = [
    'https://www.googleapis.com/auth/webmasters',
    'https://www.googleapis.com/auth/indexing',  # only needed for the separate Indexing API
]
TOKEN_FILE = Path('.gsc_token.json')

def get_credentials():
    creds = None
    if TOKEN_FILE.exists():
        creds = Credentials.from_authorized_user_file(str(TOKEN_FILE), SCOPES)

    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            client_config = {
                "installed": {
                    "client_id": os.getenv("GOOGLE_CLIENT_ID"),
                    "client_secret": os.getenv("GOOGLE_CLIENT_SECRET"),
                    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
                    "token_uri": "https://oauth2.googleapis.com/token",
                    "redirect_uris": ["http://localhost"]
                }
            }
            flow = InstalledAppFlow.from_client_config(client_config, SCOPES)
            creds = flow.run_local_server(port=8085)

        with open(TOKEN_FILE, 'w') as f:
            f.write(creds.to_json())

    return creds

def get_service():
    return build('searchconsole', 'v1', credentials=get_credentials())
```
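
All snippets below assume a `service` object built this way:

```python
service = get_service()
```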

## REST API Base

Sites, sitemaps, and search analytics endpoints follow: `https://www.googleapis.com/webmasters/v3/resourcePath?parameters`. URL Inspection lives under the newer service: `https://searchconsole.googleapis.com/v1/urlInspection/index:inspect`.

The Python client abstracts this away, but the URL patterns are useful when debugging raw requests.
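
For example, a raw call against the sites endpoint (a sketch using the third-party `requests` library and the cached credentials from above):

```python
import requests

creds = get_credentials()
resp = requests.get(
    'https://www.googleapis.com/webmasters/v3/sites',
    headers={'Authorization': f'Bearer {creds.token}'},
)
print(resp.status_code, resp.json())
```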

---

## API Operations

### List Verified Sites
```python
sites = service.sites().list().execute()
for site in sites.get('siteEntry', []):
    print(f"{site['siteUrl']} ({site['permissionLevel']})")
```

Permission levels: `siteOwner`, `siteFullUser`, `siteRestrictedUser`, `siteUnverifiedUser`.

### Add / Remove Sites
```python
# Add a site (requires verification separately)
service.sites().add(siteUrl='https://example.com/').execute()

# Remove a site
service.sites().delete(siteUrl='https://example.com/').execute()
```

---

### URL Inspection
Inspect individual URLs for indexing status, crawl info, mobile usability, and rich results.

```python
result = service.urlInspection().index().inspect(
    body={
        'inspectionUrl': 'https://example.com/page/',
        'siteUrl': 'https://example.com/'
    }
).execute()

ir = result['inspectionResult']
idx = ir['indexStatusResult']
# Key fields: verdict, coverageState, indexingState, lastCrawlTime,
#             crawledAs, robotsTxtState, pageFetchState, referringUrls

mobile = ir.get('mobileUsabilityResult', {})
# Key fields: verdict, issues[].issueType, issues[].severity

rich = ir.get('richResultsResult', {})
# Key fields: verdict, detectedItems[].richResultType, detectedItems[].issues[]
```

#### Common Coverage States (errors to watch for)
- `"Page with redirect"` — URL redirects, won't be indexed at this URL
- `"URL is unknown to Google"` — never discovered/crawled
- `"Crawled - currently not indexed"` — crawled but Google chose not to index
- `"Discovered - currently not indexed"` — known but not yet crawled
- `"Excluded by 'noindex' tag"` — page has noindex directive
- `"Blocked by robots.txt"` — robots.txt prevents crawling
- `"Soft 404"` — page exists but Google treats as 404
- `"Server error (5xx)"` — server returned error during crawl
- `"Submitted URL marked 'noindex'"` — sitemap URL has noindex
- `"Page indexed without content"` — indexed but body appears empty

---

### Search Analytics

Query search performance data. Data has a **2-3 day lag**; queries covering the most recent days return nothing unless `dataState` is set to `all`.

#### Request Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| `startDate` | Yes | Start of date range (YYYY-MM-DD) |
| `endDate` | Yes | End of date range (YYYY-MM-DD) |
| `dimensions` | No | Array: `query`, `page`, `country`, `device`, `searchAppearance`, `date` |
| `dimensionFilterGroups` | No | Filter conditions (see below) |
| `searchType` | No | `web` (default), `image`, `video`, `news`, `discover`, `googleNews` |
| `rowLimit` | No | Max rows per request (default 1000, max 25,000) |
| `startRow` | No | Zero-based offset for pagination |
| `aggregationType` | No | `auto` (default), `byPage`, `byProperty` |
| `dataState` | No | `final` (default) or `all` (includes fresh/unfinished data) |

#### Response Metrics (automatically returned per row)
- `clicks` — total clicks
- `impressions` — total impressions
- `ctr` — click-through rate (0.0–1.0)
- `position` — average ranking position

#### Basic Queries
```python
# By page
response = service.searchanalytics().query(
    siteUrl='https://example.com/',
    body={
        'startDate': '2026-02-19',
        'endDate': '2026-03-17',
        'dimensions': ['page'],
        'rowLimit': 25000
    }
).execute()

# By query
response = service.searchanalytics().query(
    siteUrl='https://example.com/',
    body={
        'startDate': '2026-02-19',
        'endDate': '2026-03-17',
        'dimensions': ['query'],
        'rowLimit': 25000
    }
).execute()

# Cross-tabulation (query × page)
response = service.searchanalytics().query(
    siteUrl='https://example.com/',
    body={
        'startDate': '2026-02-19',
        'endDate': '2026-03-17',
        'dimensions': ['query', 'page'],
        'rowLimit': 25000
    }
).execute()

# Each row: keys[] (one per dimension), clicks, impressions, ctr, position
```
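
A quick way to eyeball the top rows from any of these responses (rows are returned sorted by clicks, descending):

```python
for row in response.get('rows', [])[:10]:
    print(' | '.join(row['keys']),
          f"clicks={row['clicks']}",
          f"impressions={row['impressions']}",
          f"ctr={row['ctr']:.1%}",
          f"pos={row['position']:.1f}")
```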

#### Filtering
```python
response = service.searchanalytics().query(
    siteUrl='https://example.com/',
    body={
        'startDate': '2026-02-19',
        'endDate': '2026-03-17',
        'dimensions': ['query'],
        'dimensionFilterGroups': [{
            'filters': [
                {
                    'dimension': 'country',
                    'expression': 'ind'       # ISO 3166-1 alpha-3
                },
                {
                    'dimension': 'device',
                    'expression': 'MOBILE'     # MOBILE, DESKTOP, TABLET
                }
            ]
        }],
        'rowLimit': 25000
    }
).execute()
```

Filter operators (set via the `operator` field): `equals` (default), `notEquals`, `contains`, `notContains`, `includingRegex`, `excludingRegex`. Filters within a group are ANDed together (`groupType` supports only `and`).
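
For example, restricting to queries matching a pattern (the regex operators use RE2 syntax; the pattern here is illustrative):

```python
response = service.searchanalytics().query(
    siteUrl='https://example.com/',
    body={
        'startDate': '2026-02-19',
        'endDate': '2026-03-17',
        'dimensions': ['query'],
        'dimensionFilterGroups': [{
            'filters': [{
                'dimension': 'query',
                'operator': 'includingRegex',
                'expression': '^how (to|do)\\b',
            }]
        }],
    }
).execute()
```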

#### Pagination for Large Datasets (>25K rows)

The API returns max 25,000 rows per request. Max available data is **50,000 rows per day per search type**.

```python
def fetch_all_rows(service, site_url, body):
    """Paginate through all available search analytics rows."""
    body = dict(body)  # copy so we don't mutate the caller's request body
    all_rows = []
    max_rows = 25000
    start_row = 0

    while True:
        body['rowLimit'] = max_rows
        body['startRow'] = start_row
        response = service.searchanalytics().query(
            siteUrl=site_url, body=body
        ).execute()

        rows = response.get('rows', [])
        if not rows:
            break

        all_rows.extend(rows)
        start_row += max_rows

        if len(rows) < max_rows:
            break  # Last page

    return all_rows
```
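
Typical usage, mirroring the request bodies above:

```python
rows = fetch_all_rows(service, 'https://example.com/', {
    'startDate': '2026-02-19',
    'endDate': '2026-03-17',
    'dimensions': ['query', 'page'],
})
print(f"Fetched {len(rows)} rows")
```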

#### Bulk Data Export Strategy

For complete data export, query **one day at a time** to stay within quotas:

```python
from datetime import date, timedelta

def export_all_data(service, site_url, start, end):
    """Export all search analytics data, one day at a time."""
    current = start
    all_data = []

    while current <= end:
        day_str = current.strftime('%Y-%m-%d')
        rows = fetch_all_rows(service, site_url, {
            'startDate': day_str,
            'endDate': day_str,
            'dimensions': ['query', 'page', 'country', 'device'],
        })
        for row in rows:
            row['date'] = day_str
        all_data.extend(rows)
        current += timedelta(days=1)

    return all_data
```
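
And a sketch for dumping the export to CSV (column order assumes the four dimensions used above, since `keys` follows the `dimensions` array):

```python
import csv
from datetime import date

data = export_all_data(service, 'https://example.com/',
                       date(2026, 2, 19), date(2026, 3, 17))

with open('gsc_export.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'query', 'page', 'country', 'device',
                     'clicks', 'impressions', 'ctr', 'position'])
    for row in data:
        writer.writerow([row['date'], *row['keys'],
                         row['clicks'], row['impressions'],
                         row['ctr'], row['position']])
```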

#### Verifying Data Freshness

Before querying, check which dates have data:
```python
response = service.searchanalytics().query(
    siteUrl=site_url,
    body={
        'startDate': (date.today() - timedelta(days=10)).isoformat(),
        'endDate': date.today().isoformat(),
        'dimensions': ['date'],
    }
).execute()
# The most recent date with data tells you the current lag
```
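
Pulling the freshest date out of that response (a sketch):

```python
dates = [row['keys'][0] for row in response.get('rows', [])]
latest = max(dates) if dates else None
print(f"Freshest available date: {latest}")
```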

#### Search Appearance Data

Two-step process:
1. Query with `dimensions: ['searchAppearance']` to discover available types
2. Filter by specific type for detailed analysis

```python
# Step 1: Discover appearance types
response = service.searchanalytics().query(
    siteUrl=site_url,
    body={
        'startDate': start, 'endDate': end,
        'dimensions': ['searchAppearance'],
    }
).execute()
# Returns types like: AMP_ARTICLE, RICH_RESULT, VIDEO, etc.

# Step 2: Filter for a specific type
response = service.searchanalytics().query(
    siteUrl=site_url,
    body={
        'startDate': start, 'endDate': end,
        'dimensions': ['query'],
        'dimensionFilterGroups': [{
            'filters': [{
                'dimension': 'searchAppearance',
                'expression': 'RICH_RESULT'
            }]
        }],
    }
).execute()
```

#### Query Cost Awareness

Queries have different computational costs:
- **Cheapest**: Group by date only, no filters
- **Medium**: Group by country or device
- **Expensive**: Group by page or query
- **Most expensive**: Group by page AND query combined
- Longer date ranges cost more than shorter ones
- Repeated identical queries within a short window cost more

---

### Sitemaps

```python
# List sitemaps
sitemaps = service.sitemaps().list(siteUrl='https://example.com/').execute()
# Each sitemap: path, lastSubmitted, lastDownloaded, warnings, errors,
#               contents[].type, contents[].submitted, contents[].indexed

# Get a specific sitemap
sitemap = service.sitemaps().get(
    siteUrl='https://example.com/',
    feedpath='https://example.com/sitemap.xml'
).execute()

# Submit a sitemap (requires webmasters scope, not readonly)
service.sitemaps().submit(
    siteUrl='https://example.com/',
    feedpath='https://example.com/sitemap.xml'
).execute()

# Delete a sitemap
service.sitemaps().delete(
    siteUrl='https://example.com/',
    feedpath='https://example.com/sitemap.xml'
).execute()
```
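
A small sketch that summarizes sitemap health from the list response (field names match those noted above; the count values come back as strings, hence the `int()` conversions):

```python
for sm in sitemaps.get('sitemap', []):
    submitted = sum(int(c.get('submitted', 0)) for c in sm.get('contents', []))
    indexed = sum(int(c.get('indexed', 0)) for c in sm.get('contents', []))
    print(f"{sm['path']}: errors={sm.get('errors', 0)} "
          f"warnings={sm.get('warnings', 0)} "
          f"submitted={submitted} indexed={indexed}")
```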

---

## Batch Requests

Combine multiple API calls into a single HTTP request. Max **1,000 requests per batch**. Each sub-request counts individually toward quotas.

```python
def handle_response(request_id, response, exception):
    if exception:
        print(f"Request {request_id} failed: {exception}")
    else:
        print(f"Request {request_id}: {response}")

batch = service.new_batch_http_request(callback=handle_response)

# Add multiple URL inspections to the batch
urls = ['https://example.com/', 'https://example.com/about/']
for i, url in enumerate(urls):
    batch.add(
        service.urlInspection().index().inspect(
            body={'inspectionUrl': url, 'siteUrl': 'https://example.com/'}
        ),
        request_id=str(i)
    )

batch.execute()
```

Batch execution order is **not guaranteed** — the server may process calls in any order.
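
To stay under the 1,000-call cap with larger URL lists, chunk before batching (a sketch; `chunked` is a hypothetical helper, and `urls`/`handle_response` are reused from above):

```python
def chunked(items, size=1000):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

for chunk in chunked(urls, size=1000):
    batch = service.new_batch_http_request(callback=handle_response)
    for i, url in enumerate(chunk):
        batch.add(
            service.urlInspection().index().inspect(
                body={'inspectionUrl': url, 'siteUrl': 'https://example.com/'}
            ),
            request_id=str(i)
        )
    batch.execute()
```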

---

## Performance Optimization

### Gzip Compression
The Python client (via httplib2) requests gzip-encoded responses by default, so no extra code is needed. For hand-rolled REST requests, set the `Accept-Encoding` header and include the string `gzip` in the User-Agent:
```python
# Only needed for raw REST calls; the Python client handles gzip automatically.
headers = {
    'Accept-Encoding': 'gzip',
    'User-Agent': 'my-program (gzip)',  # "gzip" must appear in the User-Agent
}
```

### Partial Responses
Request only needed fields to reduce payload size:
```python
# Use the fields parameter to request specific fields only
response = service.searchanalytics().query(
    siteUrl=site_url,
    body={...},
    fields='rows(keys,clicks,impressions)'  # Only return these fields
).execute()

# For sites list
sites = service.sites().list(fields='siteEntry(siteUrl,permissionLevel)').execute()
```

Field syntax:
- Comma-separated: `field1,field2`
- Nested: `a/b` (field b inside a)
- Sub-selection: `items(title,length)` (only title and length from each item)

---

## Site URL Format

GSC uses two property types:
- **URL-prefix**: `https://example.com/` (trailing slash required)
- **Domain**: `sc-domain:example.com` (covers all protocols and subdomains)

Always check which format the user's property uses via `sites().list()` before making API calls.
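
A sketch that resolves a bare domain to whichever property form the account actually has (`resolve_property` is a hypothetical helper):

```python
def resolve_property(service, domain):
    """Return the matching GSC property string for a bare domain, or None."""
    sites = service.sites().list().execute()
    candidates = {
        f'sc-domain:{domain}',
        f'https://{domain}/',
        f'https://www.{domain}/',
        f'http://{domain}/',
    }
    for site in sites.get('siteEntry', []):
        if site['siteUrl'] in candidates:
            return site['siteUrl']
    return None
```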

---

## Rate Limits & Quotas

### Search Analytics
| Scope | Limit |
|-------|-------|
| Per-site | 1,200 queries/minute |
| Per-user | 1,200 queries/minute |
| Per-project | 40,000 queries/minute, 30,000,000 queries/day |

### URL Inspection
| Scope | Limit |
|-------|-------|
| Per-site | 2,000 queries/day, 600 queries/minute |
| Per-project | 15,000 queries/minute, 10,000,000 queries/day |

### All Other Resources (sites, sitemaps)
| Scope | Limit |
|-------|-------|
| Per-user | 20 queries/second, 200 queries/minute |
| Per-project | 100,000,000 queries/day |

### Load Quota
Measured in 10-minute and 1-day chunks. If you hit the load quota, **wait 15 minutes** and retry.

Monitor usage at: Google Cloud Console > APIs & Services > Quotas.

---

## Gotchas
- **Port conflicts**: If `run_local_server(port=8085)` fails with "Address in use", kill the process on that port (`lsof -ti:8085 | xargs kill -9`) or use a different port.
- **BlogVault / WAF blocking**: When applying fixes on a WordPress site fronted by a WAF (BlogVault, Sucuri, Cloudflare), WordPress REST API POST requests may be blocked. Use FTP or WP-CLI for write operations instead.
- **OAuth consent screen**: The app must be in "Testing" mode with the user added as a test user, or fully verified. Otherwise Google blocks the auth flow with "access_denied".
- **Scope mismatch**: If you authenticated with `readonly` scope but need write access (submit sitemap, add site), delete `.gsc_token.json` and re-authenticate with `webmasters` scope.
- **Token expiry**: Access tokens last ~1 hour. The refresh token is used automatically. If refresh fails, delete `.gsc_token.json` and re-authenticate.
- **Search analytics lag**: Data is 2-3 days behind. Querying today or yesterday typically returns no rows unless `dataState` is set to `all`. Verify freshness by querying the `date` dimension over the past 10 days.
- **50K row cap**: The API exposes max 50,000 rows per day per search type, sorted by clicks. Low-traffic long-tail queries may be omitted.
- **Expensive queries**: Grouping by page+query is the most expensive combination. Use simpler dimensions when possible, and avoid repeating identical queries in quick succession.
- **Never commit `.gsc_token.json`** — add it to `.gitignore`.
