Extractor

The LLM-powered data extractor is best suited to one-time extraction jobs and to unstructured data.

`extract`

Put your API key in the `X-API-KEY` header and send the request to the `extract` endpoint:

curl https://api.parsera.org/v1/extract \
--header 'Content-Type: application/json' \
--header 'X-API-KEY: <YOUR_API_KEY>' \
--data '{
    "url": "https://news.ycombinator.com/",
    "prompt": "Extract news metadata",
    "attributes": [
        {
            "name": "title",
            "description": "News title"
        },
        {
            "name": "points",
            "description": "Number of points"
        }
    ],
    "proxy_country": "Germany"
}'

Parameters:

A minimal payload requires `url` and either `prompt` or `attributes`.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `url` | `string` | - | URL of the webpage to extract data from |
| `prompt` | `string` | `""` | Prompt for initial scraping |
| `attributes` | `array` | `[]` | A list of attribute objects, each with `name` and `description` fields, to extract from the webpage. You can also specify Output Types |
| `mode` | `string` | `standard` | Mode of the extractor, `standard` or `precision`. For details, see Precision mode |
| `proxy_country` | `string` | `UnitedStates` | Proxy country, see Proxy Countries |
| `cookies` | `array` | Empty | Cookies to use during extraction, see Cookies |
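
The payload rules above can be sketched in Python using only the standard library. The helper name `build_extract_request` is hypothetical and not part of any Parsera client. It only builds the request so the payload can be inspected before anything is sent.

```python
import json
import urllib.request

EXTRACT_URL = "https://api.parsera.org/v1/extract"


def build_extract_request(api_key, url, prompt="", attributes=None,
                          proxy_country="UnitedStates"):
    """Build (but do not send) a POST request for the extract endpoint.

    Hypothetical helper for illustration only.
    """
    attributes = attributes or []
    # A minimal payload needs `url` plus at least one of `prompt` / `attributes`.
    if not url:
        raise ValueError("url is required")
    if not prompt and not attributes:
        raise ValueError("provide either prompt or attributes")

    payload = {"url": url, "proxy_country": proxy_country}
    if prompt:
        payload["prompt"] = prompt
    if attributes:
        payload["attributes"] = attributes

    # urllib normalizes header-name casing (e.g. "X-api-key"); HTTP header
    # names are case-insensitive, so this should not matter to the server.
    return urllib.request.Request(
        EXTRACT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-KEY": api_key},
        method="POST",
    )
```

To send the request, pass the returned object to `urllib.request.urlopen` and read the response body. The response format is not described in this section, so it is left out of the sketch.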

We recommend setting the `proxy_country` parameter to a specific country, since a page may be unavailable from some locations.

If some data is missing, retry the request in `precision` mode, which looks into data hidden in HTML tags. For details, see Precision mode.
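
One way to automate this retry is sketched below. It assumes the response can be reduced to a list of dicts keyed by attribute name; that shape is an assumption, since the response format is not documented in this section. The `send` callable is supplied by the caller, and all function names are hypothetical.

```python
def with_precision(payload):
    """Return a copy of the payload with `mode` set to `precision`."""
    retry = dict(payload)
    retry["mode"] = "precision"
    return retry


def missing_attributes(rows, attribute_names):
    """List the attributes that have no non-empty value in any row.

    ASSUMPTION: `rows` is a list of dicts keyed by attribute name.
    """
    return [
        name for name in attribute_names
        if not any(row.get(name) not in (None, "") for row in rows)
    ]


def extract_with_fallback(send, payload, attribute_names):
    """Call `send(payload)`; retry once in precision mode if data is missing.

    `send` is a caller-supplied function that posts the payload to the
    extract endpoint and returns the extracted rows.
    """
    rows = send(payload)
    if missing_attributes(rows, attribute_names):
        rows = send(with_precision(payload))
    return rows
```

The retry runs only when an attribute is empty in every row, so pages that already return complete data cost a single request.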

`parse`

In addition to `extract`, there is a `parse` endpoint for processing content that you supply yourself instead of content fetched from a URL.
Pass the data in the `content` field, which accepts either raw HTML or plain text:

curl https://api.parsera.org/v1/parse \
--header 'Content-Type: application/json' \
--header 'X-API-KEY: <YOUR_API_KEY>' \
--data '{
    "content": "<HTML_OR_TEXT_HERE>",
    "prompt": "Extract news metadata",
    "attributes": [
        {
            "name": "title",
            "description": "News title"
        },
        {
            "name": "points",
            "description": "Number of points"
        }
    ]
}'

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `content` | `string` | - | Raw HTML or text content to extract data from |
| `prompt` | `string` | `""` | Prompt for initial scraping |
| `attributes` | `array` | - | A list of attribute objects, each with `name` and `description` fields, to extract from the content. You can also specify Output Types |
| `mode` | `string` | `standard` | Mode of the extractor, `standard` or `precision`. For details, see Precision mode |
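
Raw HTML usually contains quotes and newlines, which must be escaped before it can sit inside a JSON string. The sketch below lets `json.dumps` do that escaping when building a `parse` request. The helper name `build_parse_request` is hypothetical.

```python
import json
import urllib.request

PARSE_URL = "https://api.parsera.org/v1/parse"


def build_parse_request(api_key, content, attributes, prompt="", mode="standard"):
    """Build (but do not send) a POST request for the parse endpoint.

    Hypothetical helper for illustration only.
    """
    payload = {"content": content, "attributes": attributes, "mode": mode}
    if prompt:
        payload["prompt"] = prompt
    # json.dumps escapes quotes, backslashes, and newlines in `content`,
    # so arbitrary HTML becomes a valid JSON string.
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        PARSE_URL,
        data=body,
        headers={"Content-Type": "application/json", "X-API-KEY": api_key},
        method="POST",
    )
```

Building the body this way avoids the hand-escaping that an inline `--data` string in a shell command would need.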

`extract_markdown`

You can get the Markdown version of a page with the `extract_markdown` endpoint:

curl https://api.parsera.org/v1/extract_markdown \
--header 'Content-Type: application/json' \
--header 'X-API-KEY: <YOUR_API_KEY>' \
--data '{
    "url": "https://news.ycombinator.com/",
    "proxy_country": "UnitedStates"
}'

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `url` | `string` | - | URL of the webpage to extract data from |
| `proxy_country` | `string` | `UnitedStates` | Proxy country, see Proxy Countries |
| `cookies` | `array` | Empty | Cookies to use during extraction, see Cookies |
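
As a rough Python counterpart to the curl call, the sketch below posts to `extract_markdown` and returns the raw response body as text. The response format is not described here, so the body is returned unparsed. The function name, the `opener` parameter, and the 60-second timeout are illustrative assumptions.

```python
import json
import urllib.request

MARKDOWN_URL = "https://api.parsera.org/v1/extract_markdown"


def fetch_markdown_response(api_key, url, proxy_country="UnitedStates",
                            opener=urllib.request.urlopen):
    """POST to extract_markdown and return the raw response body as text.

    Hypothetical helper; `opener` is injectable so the function can be
    exercised without network access.
    """
    payload = {"url": url, "proxy_country": proxy_country}
    req = urllib.request.Request(
        MARKDOWN_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-KEY": api_key},
        method="POST",
    )
    # The timeout value is arbitrary; tune it for slow pages.
    with opener(req, timeout=60) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_markdown_response("<YOUR_API_KEY>", "https://news.ycombinator.com/")` performs the same request as the curl example above.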

More features

Check out further documentation to explore more features: