1. Overview
Before using the Octoparse API, you need a Standard or Professional account with at least one runnable task set up. (Haven’t got an account? Sign up here.) By connecting to the Octoparse API, you can retrieve extracted data and task information and, with the advanced API, control tasks, so that data extraction can be coordinated with your own application.
1.2. Contact
Contact: Octoparse support team
Email: support@octoparse.com
1.3. URI Standard
All requests should be URL encoded with the base URL:
https://dataapi.octoparse.com/
For example, a request for 'Get Data by Offset' should be: GET https://dataapi.octoparse.com/api/alldata/GetDataOfTaskByOffset?taskId={taskId}&offset={offset}&size={size}
Note: {xxxx} in this document represents a placeholder that you need to replace with a real value. For example, if your task ID is abc, offset is 1 and size is 10, then the URL should be https://dataapi.octoparse.com/api/alldata/GetDataOfTaskByOffset?taskId=abc&offset=1&size=10.
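The placeholder substitution and URL encoding described above can be sketched in Python as follows. This is a minimal illustration using only the standard library; the helper name `build_url` is hypothetical and not part of the Octoparse API.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://dataapi.octoparse.com/"


def build_url(path: str, **params) -> str:
    """Join the base URL with an endpoint path and URL-encode the query parameters.

    `build_url` is a hypothetical helper for illustration only.
    """
    url = urljoin(BASE_URL, path.lstrip("/"))
    if params:
        url += "?" + urlencode(params)
    return url


# Reproduces the example URL from the note above.
print(build_url("api/alldata/GetDataOfTaskByOffset", taskId="abc", offset=1, size=10))
```

Passing the parameters through `urlencode` means values containing spaces or reserved characters are escaped rather than pasted into the URL by hand.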
1.4. Obtain OAuth2.0 Token
Before getting access to the Octoparse API, you need to obtain an Access Token based on OAuth2.0.
1.4.1. Obtain a New Token
You will need your username and password to get a new Access Token.
Request
POST https://dataapi.octoparse.com/token
Parameters
username={userName}&password={password}&grant_type=password
Request Content Type
application/x-www-form-urlencoded
Response
Access Token
Response Content Type
application/json, text/json
Example
{
  "access_token": "ABCD1234", // Access permission
  "token_type": "bearer", // Token type
  "expires_in": 86399, // Access Token expiration time in seconds (it is recommended to reuse the same token within this time frame)
  "refresh_token": "refresh_token" // Used to refresh the Access Token
}
The ‘access_token’ is required for every API method you invoke. Add it to the HTTP header in the following format.
HeaderName: Authorization Value: bearer {access_token}
Note: There is a space between ‘bearer’ and the Access Token. For example, if the Access Token is AA11BB22...CC33, the header should be ‘Authorization: bearer AA11BB22...CC33’. The Access Token has an expiration time; it is recommended to reuse the same token until it expires.
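As a rough sketch of the token request and header format described above, the Python below builds the form-encoded POST and the Authorization header with the standard library. The function names (`build_token_request`, `auth_header`, `fetch_token`) are hypothetical helpers, and `fetch_token` is defined but not called here because it needs real credentials.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

TOKEN_URL = "https://dataapi.octoparse.com/token"


def build_token_request(username: str, password: str) -> Request:
    """Build the POST request for a new Access Token (password grant)."""
    body = urlencode(
        {"username": username, "password": password, "grant_type": "password"}
    ).encode("utf-8")
    return Request(
        TOKEN_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def auth_header(access_token: str) -> dict:
    """Header to attach to every API call; note the single space after 'bearer'."""
    return {"Authorization": f"bearer {access_token}"}


def fetch_token(username: str, password: str) -> dict:
    """Send the request and parse the JSON token response (requires network access)."""
    with urlopen(build_token_request(username, password), timeout=30) as resp:
        return json.load(resp)


print(auth_header("AA11BB22"))
```

Because `urlencode` escapes the username and password, credentials containing characters such as `&` or `+` are transmitted intact.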
1.4.2. Refresh Token
Once an Access Token expires, you can obtain a new one with the 'refresh_token'. Using the refresh token is more secure than sending a new request with your username and password.
Note: Each 'refresh_token' can only be used once. Use the new 'refresh_token' returned by the current request for the next refresh request.
Request
POST https://dataapi.octoparse.com/token
Parameters
refresh_token={refresh_token}&grant_type=refresh_token
Request Content Type
application/x-www-form-urlencoded
Response
Access Token
Note: The response HTTP status code should be ‘200’. If not, please refer to HTTP Status Code to solve the problem.
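One way to keep track of the single-use refresh token and the expiry time is a small token store, sketched below. It assumes the refresh response has the same JSON shape as the new-token response shown in 1.4.1; the class name `TokenStore` and the `safety_margin` parameter are hypothetical.

```python
import time
from urllib.parse import urlencode


class TokenStore:
    """Keeps the current token pair and knows when the Access Token has expired."""

    def __init__(self, clock=time.monotonic, safety_margin=60):
        self._clock = clock
        self._margin = safety_margin  # hypothetical: treat the token as expired slightly early
        self.access_token = None
        self.refresh_token = None
        self._expires_at = 0.0

    def update(self, token_response: dict) -> None:
        """Store a token response; the previous refresh_token is replaced (single use)."""
        self.access_token = token_response["access_token"]
        self.refresh_token = token_response["refresh_token"]
        self._expires_at = self._clock() + token_response["expires_in"] - self._margin

    def is_valid(self) -> bool:
        """True while a token is stored and its (shortened) lifetime has not elapsed."""
        return self.access_token is not None and self._clock() < self._expires_at

    def refresh_body(self) -> bytes:
        """Form-encoded body for POST https://dataapi.octoparse.com/token."""
        return urlencode(
            {"refresh_token": self.refresh_token, "grant_type": "refresh_token"}
        ).encode("utf-8")
```

A caller would check `is_valid()` before each API request, post `refresh_body()` to the token endpoint when it returns False, and pass the JSON response to `update()` so the next refresh uses the newly issued refresh token.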
2. Instruction
Octoparse limits API usage to 20 requests/second. Please reduce your request frequency if you receive status code ‘429’.
Note: Octoparse uses a leaky bucket algorithm to limit API access frequency. At most 100 requests are accepted within any five-second interval (an average of 20 per second); once that limit is reached, further requests are rejected until the next 5-second interval.
Unusual Status
The response HTTP status code should be ‘200’. If not, please refer to 3.1. HTTP Status Code to solve the problem.
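To stay under the limit from the client side, requests can be throttled before they are sent. The sketch below is a simple sliding-window limiter, not a reproduction of the server's leaky bucket; the class name `SlidingWindowLimiter` and its parameters are hypothetical, with defaults taken from the 100-requests-per-5-seconds figure above.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Blocks so that at most `max_requests` calls start within any `window_seconds`."""

    def __init__(self, max_requests=100, window_seconds=5.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()  # start times of recent requests, oldest first

    def _evict(self, now: float) -> None:
        # Forget requests that have left the window.
        while self._stamps and now - self._stamps[0] >= self._window:
            self._stamps.popleft()

    def acquire(self) -> None:
        """Call immediately before each API request."""
        now = self._clock()
        self._evict(now)
        if len(self._stamps) >= self._max:
            # Wait until the oldest request in the window expires.
            self._sleep(self._window - (now - self._stamps[0]))
            now = self._clock()
            self._evict(now)
        self._stamps.append(now)
```

Calling `limiter.acquire()` before every request keeps a single-threaded client within the quota; a status code ‘429’ should still be handled, since other clients using the same account count toward the same limit.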
2.1. Get Task Group Information
2.1.1. List All Task Groups
Request
Response
JSON-formatted text containing task group information and the request status.
Response Content Type
application/json, text/json
Example
{ "data": [ { "taskGroupId": 1, "taskGroupName": "Example Task Group 1" }, { "taskGroupId": 2, "taskGroupName": "Example Task Group 2" } ], "error": "success", "error_Description": "Operation successes." }
2.2. Manage Task
2.2.1. List All Tasks in a Group
Request
Parameters
Parameter | Description | Remark |
---|---|---|
taskGroupId | Task Group ID | Please define the parameter in request URL. |
Response
JSON-formatted text including the task ID (taskId), task name (taskName), user ID, and request status.
Response Content Type
application/json, text/json
Example
{ "data": [ { "taskId": "337fd7d7-aded-4081-9104-2b551161ccc8", "taskName": "Example Task 1", "creationUserId": "5d1e4b3c-645c-44ab-ac0e-bfa9ad600ece" }, { "taskId": "4adf489b-f883-43fa-b958-0cfde945ddb7", "taskName": "Example Task 2", "creationUserId": "5d1e4b3c-645c-44ab-ac0e-bfa9ad600ece" } ], "error": "success", "error_Description": "Operation successes." }
2.2.2. Clear Data
Request
Parameters
Parameter | Description | Remark |
---|---|---|
taskId | Task ID | Please define the parameter in request URL. |
Response
Whether the data has been cleared successfully.
Response Content Type
application/json, text/json
Example
{ "error": "success", "error_Description": "Operation successes." }
2.3. Export Data
2.3.1. Export Non-exported Data
This returns non-exported data. After the export, the data is tagged status = exporting (rather than status = exported), so the same set of data can be exported multiple times using this method. Once you have confirmed receipt of the data and want to update its status to ‘exported’, follow 2.3.2 to update the status.
Note: If the export is interrupted (e.g. by a network failure), re-export the data set using this method.
Request
Parameters
Parameter | Description | Remark |
---|---|---|
taskId | Task ID | Please define the parameter in request URL. |
size | The number of data rows (range from 1 to 1000) | Please define the parameter in request URL. |
Response
Data and request status
Response Content Type
application/json, text/json
Example
{ "data": { "total": 100000, "currentTotal": 4, "dataList": [ { "State": "Texas", "City": "Plano" }, { "State": "Texas", "City": "Houston" }, { "State": "Texas", "City": "Austin" }, { "State": "Texas", "City": "Arlington" } ] }, "error": "success", "error_Description": "Operation successes." }
2.3.2. Update Data Status
This updates the data status from ‘exporting’ to ‘exported’.
Note: Before using this method, please confirm that the data exported via the ‘Export Task Data’ API (api/notexportdata/gettop) has been retrieved successfully.
Request
Parameters
Parameter | Description | Remark |
---|---|---|
taskId | Task ID | Please define the parameter in request URL. |
Response
Whether the data status has been updated successfully.
Response Content Type
application/json, text/json
Example
{ "error": "success", "error_Description": "Operation successes." }
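The two-step flow in 2.3.1 and 2.3.2 (export a batch, confirm receipt, then mark it exported) can be sketched as a loop. Because the request URLs are not spelled out here, the HTTP calls are passed in as callables: `fetch_batch`, `mark_exported` and `handle_rows` are hypothetical names, and stopping when `dataList` comes back empty is an assumption about the API's behaviour.

```python
def export_all(fetch_batch, mark_exported, handle_rows, size=1000):
    """Export every non-exported row in batches and return the number of rows handled.

    fetch_batch(size) -> parsed JSON of 'Export Non-exported Data' (2.3.1)
    mark_exported()   -> calls 'Update Data Status' (2.3.2)
    handle_rows(rows) -> stores the rows; should raise if storing fails
    All three callables are hypothetical and supplied by the caller.
    """
    if not 1 <= size <= 1000:
        raise ValueError("size must be between 1 and 1000")
    exported = 0
    while True:
        rows = fetch_batch(size)["data"]["dataList"]
        if not rows:  # assumption: an empty list means nothing is left to export
            break
        handle_rows(rows)   # if this raises, the batch stays 'exporting' and can be re-exported
        mark_exported()     # only after the rows are safely stored
        exported += len(rows)
    return exported
```

Marking the batch as exported only after `handle_rows` returns mirrors the note in 2.3.1: an interrupted export leaves the data re-exportable.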
2.4. Get Data
2.4.1. Get Data by Offset
To get data, the request must include the task ID, offset and size. For the initial request, set offset to 0 (offset=0) and size to a value in [1, 1000]. Each response returns an offset (which can be any value greater than 0) that should be used for the next request. For example, if a task has 1000 data rows, a request with offset = 0 and size = 100 returns the first 100 rows of data and an offset X (X can be any number greater than or equal to 100). The second request should use the offset returned by the first request (offset = X, size = 100) to get the next 100 rows of data (rows 101 to 200), along with a new offset to use for the request that follows.
Note: This method only retrieves data and does not affect the data status. (Non-exported data remains non-exported.)
Request
Parameters
Parameter | Description | Remark |
---|---|---|
taskId | Task ID | Please define the parameter in request URL. |
offset | If offset is less than or equal to 0, data will be returned starting from the first row. | Please define the parameter in request URL. |
size | The number of data rows that will be returned (range from 1 to 1000) | Please define the parameter in request URL. |
Response
Data and request status
Response Content Type
application/json, text/json
Example
{ "data": { "offset": 4, "total": 100000, "restTotal": 99996, "dataList": [ { "State": "Texas", "City": "Plano" }, { "State": "Texas", "City": "Houston" }, { "State": "Texas", "City": "Austin" }, { "State": "Texas", "City": "Arlington" } ] }, "error": "success", "error_Description": "Operation successes." }
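The offset handling described in this section can be sketched as a generator that always feeds the returned offset into the next request. The HTTP call is passed in as `fetch_page`, a hypothetical callable returning the parsed JSON of 'Get Data by Offset'; stopping when `restTotal` reaches 0 or `dataList` is empty is an assumption based on the example response.

```python
def iter_rows(fetch_page, size=100):
    """Yield every data row of a task, page by page.

    fetch_page(offset, size) -> parsed JSON response (hypothetical callable).
    """
    offset = 0  # the initial request starts from the first row
    while True:
        data = fetch_page(offset, size)["data"]
        rows = data["dataList"]
        yield from rows
        # Assumption: restTotal counts the rows still to be fetched.
        if not rows or data["restTotal"] <= 0:
            break
        # Treat the returned offset as opaque; do not compute offset + size yourself.
        offset = data["offset"]
```

Since the returned offset may be any number, the loop never derives the next offset arithmetically and only passes back what the previous response contained.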
3. Reference
3.1. HTTP Status Code
Whenever an error code is returned, please refer to the following status codes to solve the problem.
HTTP Status Code | Inner Status Code | Description |
---|---|---|
200 | ok | Operation successful. |
400 | invalid_grant | Incorrect username or password. |
400 | unsupported_grant_type | Incorrect POST format. The correct format should be username={username}&password={password}&grant_type=password. |
401 | unauthorized | Access Token is invalid because it is expired or unauthorized. Please get a new token. |
403 | user_not_allowed | Permission denied. Please upgrade to Standard Plan to use Data API; upgrade to Professional Plan to use Advanced API. |
404 | not_found | The HTTP request is not recognized. Please request with the correct URL. |
405 | method_not_allowed | The HTTP method is not supported. Please use the method supported by the interface. |
429 | quota_exceeded | The request frequency has exceeded the limit. Please reduce access frequency to less than 20 times per second. |
503 | service_unavailable | The server is temporarily unavailable. Please try again later. |
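A client might turn the status codes above into a small decision helper, as in the sketch below. The function name `next_action` and the action labels are hypothetical; the grouping simply follows the descriptions in the table (get a new token on 401, slow down or retry later on 429 and 503, and treat the remaining errors as problems to fix before retrying).

```python
def next_action(status_code: int) -> str:
    """Map an HTTP status code to a suggested client action (labels are illustrative)."""
    if status_code == 200:
        return "ok"
    if status_code == 401:
        return "refresh_token"  # token expired or unauthorized: get a new token
    if status_code in (429, 503):
        return "retry_later"    # quota exceeded or server temporarily unavailable
    return "fail"               # 400, 403, 404, 405 and anything unlisted: fix the request or plan


for code in (200, 401, 429, 404):
    print(code, next_action(code))
```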
3.2. Example Code
Example Code:
Python: ApiSamples/Code/Python/
Java: ApiSamples/Code/Java/
PHP: ApiSamples/Code/Php/
(Other languages will be coming out soon.)