Web Content Extractor Documentation

Settings

You can open the Settings window by clicking "Tools->Settings" or by pressing Alt+F7. This window allows you to change the following settings:

General Tab

"Reload last project at startup" - if this option is enabled, the program will automatically open the last project at startup.

"Restart the program when memory reached" - if this option is enabled and memory usage reaches the specified limit, the program will restart automatically to release the memory.

"Hide the results view after" - if this option is enabled, then the program will hide the results view after x minutes.

"Enable logging" - enable this option if you want to log web scraper events. The logging directory is the directory where Web Content Extractor stores log files. The default logging directory is the current directory.

"Internet Connection Settings" - the program uses the Internet connection settings from Internet Explorer. You can change these settings by clicking the "Change" button.

"Enable to open JSON documents in Internet Explorer browser" - to change this option you have to run Web Content Extractor as an administrator.

Browser Tab

"Browser Name":

  • "Internet Explorer" - the program will use Internet Explorer to download webpages.
  • "HTTP downloader" - the program will use HTTP requests to download webpages.
  • "Google Chrome" - the program will use Google Chrome to download webpages.

"User Agent String" - the string sent in the User-Agent request header (if you use Google Chrome, you need to restart the program for this change to take effect). This is a global setting and applies to all projects.

"Delay between download and parsing data" - the delay that is necessary to execute all scripts on a page.

"Time-out to receive a response to a request" - the maximum time the program will wait for a response from the server after requesting a page.

"Time-out to execute a javascript" - the maximum time the program will wait for a JavaScript script to finish executing.

"Enable Javascript" - enable this option if you want to allow scripts in the web browser.

"Enable images" - enable this option if you want to see images in the web browser.

"Convert json content to html" - if this option is enabled, the program will convert JSON data to HTML.

"Enable authentication dialog" - if this option is enabled, the program will display a Basic Authentication dialog to ask the user for a username and password.

"Enable a pop-up window" - if this option is enabled, the program will open pop-up windows in the main window.

"Split merged table cells" - if this option is enabled, the program will separate all merged cells in the webpage table into individual cells.
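
The idea of splitting merged cells can be illustrated with a short Python sketch. This is not the program's own code: the function name split_merged_cells and the input format (rows of (text, rowspan, colspan) tuples) are hypothetical, and the sketch assumes that the value of a merged cell is copied into every cell it covered, which this documentation does not confirm.

```python
def split_merged_cells(rows):
    """Expand merged cells into a rectangular grid of individual cells.

    rows: list of table rows; each row is a list of
          (text, rowspan, colspan) tuples (hypothetical input format).
    Assumption: the text of a merged cell is copied into every cell it spans.
    """
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a rowspan from a previous row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan

    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions never covered by any cell become empty strings.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


table = [
    [("A", 2, 1), ("B", 1, 2)],   # "A" spans two rows, "B" spans two columns
    [("C", 1, 1), ("D", 1, 1)],
]
for line in split_merged_cells(table):
    print(line)
```

With merged cells expanded this way, every row has the same number of columns, which makes the table easier to export to a flat format such as CSV.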

Scraper Tab

"Delay between requests" - the delay necessary to prevent the server from being overloaded by multiple requests from the program. We recommend that you set the delay to at least 1-2 seconds.
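
To show what a delay between requests does in practice, here is a minimal Python sketch of a request throttle. The class name RequestThrottle and its parameters are hypothetical, and the sketch assumes the delay is measured from the start of one request to the start of the next; the program may measure it differently.

```python
import time


class RequestThrottle:
    """Enforce a minimum delay between consecutive requests (illustrative only)."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self._delay = delay_seconds
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, None before the first

    def wait(self):
        """Block until at least delay_seconds have passed since the previous call."""
        if self._last is not None:
            remaining = self._delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


throttle = RequestThrottle(delay_seconds=0.2)
for url in ["https://example.com/1", "https://example.com/2"]:
    throttle.wait()              # the first call returns immediately
    print("requesting", url)     # a real scraper would download the page here
```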

"Maximum number of download threads" - the maximum number of simultaneous connections to a server.

"Maximum crawling time" - limits the maximum crawling time. Set the number of minutes a project is allowed to run. If the limit is reached, the program stops the project. If set to zero, no time limit is imposed.
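
The rule above, including the special meaning of zero, can be written as a small check. This is a Python sketch only; the function name time_limit_reached and its parameters are hypothetical and do not come from the program.

```python
def time_limit_reached(started_at, now, limit_minutes):
    """Return True when a project has run for limit_minutes or longer.

    started_at, now: timestamps in seconds (for example from time.monotonic()).
    limit_minutes:   the "Maximum crawling time" value; 0 means no limit.
    """
    if limit_minutes == 0:
        return False
    return now - started_at >= limit_minutes * 60


print(time_limit_reached(0.0, 600.0, 10))     # True: 10 minutes have elapsed
print(time_limit_reached(0.0, 100000.0, 0))   # False: zero disables the limit
```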

"Crawl only unique URLs" - if this option is enabled, the program will add only new links to the project, i.e. links that are not in the task list yet.
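
A common way to keep only new links is to remember every URL that has already been added. The Python sketch below illustrates that idea; the TaskList class is hypothetical, and the sketch assumes URLs are compared as exact strings, which may differ from how the program compares them.

```python
from collections import deque


class TaskList:
    """A queue of URLs that silently ignores URLs it has already seen."""

    def __init__(self):
        self._seen = set()       # every URL ever added
        self._queue = deque()    # URLs still waiting to be crawled

    def add(self, url):
        """Add url if it is new. Return True if it was added, False otherwise."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def __len__(self):
        return len(self._queue)


tasks = TaskList()
print(tasks.add("https://example.com/a"))   # True: new link
print(tasks.add("https://example.com/a"))   # False: already in the task list
print(len(tasks))                           # 1
```

With exact string comparison, two links that differ only in the part after '#' count as different URLs unless the hash is removed first.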

"Extract only unique data" - if this option is enabled, the program will add only new data to the project, i.e. data that are not in the database yet.

"Reload a page before run Javascript task" - enable this option if you want the program to reload the page before running JavaScript code in the web browser control.

"Resolve redirect URLs" - if this option is enabled, the program will update the URLs of redirected links.

"Remove hash from URLs" - if this option is enabled, the program will remove the hash string from URLs. A hash string is the part of the URL that appears after the '#' sign.
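
As an illustration of what removing the hash string means, the following Python sketch uses the standard library function urllib.parse.urldefrag. The helper name remove_hash is hypothetical, and the program itself may implement this differently.

```python
from urllib.parse import urldefrag


def remove_hash(url):
    """Return the URL without the '#' sign and everything after it."""
    return urldefrag(url).url


print(remove_hash("https://example.com/page?id=1#section-2"))
# prints: https://example.com/page?id=1
```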

"Use separate thread for parsing" - if this option is enabled, the program will use a separate thread for the parsing process.

Proxy Servers Tab

"Use Proxy Server" - if this option is enabled, the program will connect to the Internet through a proxy server. Use the following syntax for the proxy address: <ip_address>:<port>, where <ip_address> is the IP address of the proxy server and <port> is the port number assigned to the proxy server. If your proxy server requires authentication, use: <username>:<password>@<ip_address>:<port>
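
The two address forms can be checked with a short parser before you enter them. This Python sketch is illustrative only: the names ProxyAddress and parse_proxy are hypothetical, and the validation rules (a numeric port, a non-empty host) are assumptions rather than the program's documented behavior.

```python
from typing import NamedTuple, Optional


class ProxyAddress(NamedTuple):
    host: str
    port: int
    username: Optional[str] = None
    password: Optional[str] = None


def parse_proxy(address):
    """Parse '<ip_address>:<port>' or '<username>:<password>@<ip_address>:<port>'."""
    # Split on the last '@' so that the endpoint is always the final part.
    credentials, at_sign, endpoint = address.rpartition("@")
    username = password = None
    if at_sign:
        username, colon, password = credentials.partition(":")
        if not colon:
            raise ValueError("expected <username>:<password> before '@'")

    host, colon, port = endpoint.rpartition(":")
    if not colon or not host or not port.isdigit():
        raise ValueError("expected <ip_address>:<port>")
    return ProxyAddress(host, int(port), username, password)


print(parse_proxy("192.168.0.10:8080"))
print(parse_proxy("user:secret@10.0.0.1:3128"))
```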

"Change browser proxy every x requests" - the program will change the browser proxy every x requests.
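
The effect of this option can be sketched in Python as follows. The ProxyRotator class is hypothetical, and the sketch assumes the proxies are used in list order and that the list wraps around at the end; the order the program actually uses is not stated here.

```python
from itertools import cycle


class ProxyRotator:
    """Return the same proxy for `every` requests, then move to the next one."""

    def __init__(self, proxies, every):
        if not proxies:
            raise ValueError("at least one proxy is required")
        if every < 1:
            raise ValueError("every must be 1 or greater")
        self._pool = cycle(proxies)   # assumption: round-robin over the list
        self._every = every
        self._count = 0               # number of requests served so far
        self._current = next(self._pool)

    def next_proxy(self):
        """Return the proxy to use for the next request."""
        if self._count and self._count % self._every == 0:
            self._current = next(self._pool)
        self._count += 1
        return self._current


rotator = ProxyRotator(["10.0.0.1:3128", "10.0.0.2:3128"], every=2)
for _ in range(5):
    print(rotator.next_proxy())
```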

Captcha Page Detection Tab

"Detect Captcha Page" - if this option is enabled, the program will scan pages for captcha page patterns. You can specify five types of patterns (each pattern is matched with a "contains" test, not an exact "equals" comparison):

  • "Captcha page text contains" - the program will scan page text for a pattern.
  • "Captcha page URL contains" - the program will scan page URL for a pattern.
  • "Captcha image URL contains" - the program will scan all URLs of images for a pattern.
  • "Captcha image tag contains" - the program will scan all IMG tags for a pattern.
  • "Captcha input tag contains" - the program will scan all INPUT tags for a pattern.

If the program detects a captcha page, it stops the extraction process and shows the browser window so that you can enter the captcha text.
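
The "contains" matching described in this section can be sketched in Python as follows. The Page and CaptchaPatterns classes and the is_captcha_page function are hypothetical. The sketch assumes that empty patterns are ignored, that a single matching pattern is enough to flag the page, and that matching is case-sensitive; none of these details are confirmed by this documentation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    url: str
    text: str
    image_urls: List[str] = field(default_factory=list)
    img_tags: List[str] = field(default_factory=list)     # raw <img ...> tags
    input_tags: List[str] = field(default_factory=list)   # raw <input ...> tags


@dataclass
class CaptchaPatterns:
    page_text: str = ""    # "Captcha page text contains"
    page_url: str = ""     # "Captcha page URL contains"
    image_url: str = ""    # "Captcha image URL contains"
    image_tag: str = ""    # "Captcha image tag contains"
    input_tag: str = ""    # "Captcha input tag contains"


def is_captcha_page(page, patterns):
    """Return True if any non-empty pattern occurs as a substring of its target."""
    checks = [
        (patterns.page_text, [page.text]),
        (patterns.page_url, [page.url]),
        (patterns.image_url, page.image_urls),
        (patterns.image_tag, page.img_tags),
        (patterns.input_tag, page.input_tags),
    ]
    return any(
        bool(pattern) and any(pattern in candidate for candidate in candidates)
        for pattern, candidates in checks
    )


patterns = CaptchaPatterns(page_text="verify you are human")
page = Page(url="https://example.com/", text="Please verify you are human to continue")
print(is_captcha_page(page, patterns))   # True: the page text contains the pattern
```

Because the test is "contains" rather than "equals", a short pattern such as "captcha" matches any URL or tag that includes that word, so choose patterns specific enough to avoid flagging ordinary pages.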