Extractor

The LLM-powered data extractor is ideal for one-time data extraction and for unstructured data.

extract

Pass your API key in the X-API-KEY header when sending a request to the extract endpoint:

curl https://api.parsera.org/v1/extract \
--header 'Content-Type: application/json' \
--header 'X-API-KEY: <YOUR_API_KEY>' \
--data '{
    "url": "https://news.ycombinator.com/",
    "prompt": "Extract news metadata",
    "attributes": [
        {
            "name": "title",
            "description": "News title"
        },
        {
            "name": "points",
            "description": "Number of points"
        }
    ],
    "proxy_country": "Germany"
}'

Parameters:

A minimal payload requires url and either prompt or attributes.

| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | - | URL of the webpage to extract data from |
| prompt | string | "" | Prompt for initial scraping |
| attributes | array | [] | A list of attribute objects with name and description fields to extract from the webpage. Also, you can specify Output Types |
| mode | string | standard | Mode of the extractor, standard or precision. For details, see Precision mode |
| proxy_country | string | UnitedStates | Proxy country, see Proxy Countries |
| cookies | array | Empty | Cookies to use during extraction, see Cookies |
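
For callers who prefer Python over curl, the same request could be assembled with the standard library. This is a minimal sketch, not an official client: the helper name build_extract_request is hypothetical, and the function only builds the request and checks the minimal-payload rule locally, without sending anything.

```python
import json
import urllib.request

EXTRACT_URL = "https://api.parsera.org/v1/extract"


def build_extract_request(api_key, url, prompt="", attributes=None, **options):
    """Build (but do not send) a POST request for the extract endpoint.

    Hypothetical helper for illustration; not part of any Parsera SDK.
    """
    attributes = attributes or []
    # The minimal payload needs url plus either prompt or attributes.
    if not url or not (prompt or attributes):
        raise ValueError("url and either prompt or attributes are required")
    # Extra keyword arguments (mode, proxy_country, cookies) pass through as-is.
    payload = {"url": url, **options}
    if prompt:
        payload["prompt"] = prompt
    if attributes:
        payload["attributes"] = attributes
    return urllib.request.Request(
        EXTRACT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-KEY": api_key},
        method="POST",
    )


request = build_extract_request(
    "<YOUR_API_KEY>",
    "https://news.ycombinator.com/",
    attributes=[{"name": "title", "description": "News title"}],
    proxy_country="Germany",
)
# To send it: urllib.request.urlopen(request).read()
```

Validating the payload before sending avoids spending a request on a call that lacks both prompt and attributes.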

It's recommended to set the proxy_country parameter to a specific country since a page could be unavailable from some locations.

If some data is missing, you can retry with precision mode, which looks into data hidden in HTML tags. For details, see Precision mode.
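
The retry described above could be wrapped in a small helper. This is a sketch under stated assumptions: send and is_complete are caller-supplied callables (hypothetical names), and the list-of-rows response used by the stand-in sender is invented for illustration, since the response format is not described here.

```python
def extract_with_fallback(send, payload, is_complete):
    """Try standard mode first, then retry once in precision mode.

    send: caller-supplied callable that POSTs a payload dict to the extract
          endpoint and returns the decoded response (hypothetical helper).
    is_complete: caller-supplied check that decides whether data is missing.
    """
    result = send({**payload, "mode": "standard"})
    if is_complete(result):
        return result
    # Precision mode also looks into data hidden in HTML tags.
    return send({**payload, "mode": "precision"})


# Stand-in sender for illustration only; a real one would call the API.
calls = []


def fake_send(payload):
    calls.append(payload["mode"])
    # Pretend standard mode misses the "points" field (assumed response shape).
    if payload["mode"] == "standard":
        return [{"title": "Example"}]
    return [{"title": "Example", "points": 42}]


rows = extract_with_fallback(
    fake_send,
    {"url": "https://news.ycombinator.com/", "prompt": "Extract news metadata"},
    is_complete=lambda rows: all("points" in row for row in rows),
)
```

Injecting the sender keeps the fallback logic separate from the HTTP code, so it can be exercised without network access.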

parse

In addition to extract, there is a parse endpoint for parsing content that you supply yourself instead of content fetched from a url.
Pass the data in the content attribute, which accepts either raw HTML or plain text:

curl https://api.parsera.org/v1/parse \
--header 'Content-Type: application/json' \
--header 'X-API-KEY: <YOUR_API_KEY>' \
--data '{
    "content": "<HTML_OR_TEXT_HERE>",
    "prompt": "Extract news metadata",
    "attributes": [
        {
            "name": "title",
            "description": "News title"
        },
        {
            "name": "points",
            "description": "Number of points"
        }
    ]
}'

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| content | string | - | Raw HTML or text content to extract data from |
| prompt | string | "" | Prompt for initial scraping |
| attributes | array | - | A list of attribute objects with name and description fields to extract from the webpage. Also, you can specify Output Types |
| mode | string | standard | Mode of the extractor, standard or precision. For details, see Precision mode |
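
Because content is sent as a JSON string, any quotes or newlines in the HTML must be escaped. The sketch below lets Python's json module handle that and builds the parse request without sending it. The sample HTML is made up for illustration.

```python
import json
import urllib.request

PARSE_URL = "https://api.parsera.org/v1/parse"

# Sample HTML with embedded double quotes (illustrative only).
html = '<ul><li class="story">First "quoted" title</li></ul>'

payload = {
    "content": html,  # raw HTML or plain text, passed as a JSON string
    "prompt": "Extract news metadata",
    "attributes": [{"name": "title", "description": "News title"}],
}
# json.dumps escapes the quotes inside the HTML so the body stays valid JSON.
body = json.dumps(payload).encode("utf-8")

request = urllib.request.Request(
    PARSE_URL,
    data=body,
    headers={"Content-Type": "application/json", "X-API-KEY": "<YOUR_API_KEY>"},
    method="POST",
)
# To send it: urllib.request.urlopen(request).read()
```

Serializing the payload this way is safer than pasting HTML into a hand-written JSON string, where an unescaped quote would break the request body.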

extract_markdown

You can get a markdown version of a page from its URL with the extract_markdown endpoint:

curl https://api.parsera.org/v1/extract_markdown \
--header 'Content-Type: application/json' \
--header 'X-API-KEY: <YOUR_API_KEY>' \
--data '{
    "url": "https://news.ycombinator.com/",
    "proxy_country": "UnitedStates"
}'

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | - | URL of the webpage to extract data from |
| proxy_country | string | UnitedStates | Proxy country, see Proxy Countries |
| cookies | array | Empty | Cookies to use during extraction, see Cookies |

More features

Check out further documentation to explore more features:
