Before running this extension, read these steps carefully and configure it (see Administration); otherwise it will be very difficult to run the extension correctly!
Define your first engine
Within your sysfolder assigned to the FEEDER create your first site. The following example concerns the configuration parameters for the engine:
As stated before, this manual is intended for administrators only (see Users Manual). The best way to get this extension working is to follow the instructions below step by step. Further documentation will be published in the future to explain how to configure a new site, study its HTML, and so on.
If you are admin you can load an existing engine or define a new one. To start as quickly as possible, run News Feeder and select the last option from the drop-down menu: 'Load sites definition'. This option allows you to create a new engine; the definitions are stored in a file you received with this extension. News Feeder will check and create a new engine for you:
Google (test mode) news.google.it
This engine setup works fine and was tested for a long time. The tag definitions inside are for the Google News engine in Italian (http://www.google.it). google.com recently changed the HTML output of its news pages; google.com news was tested on Jan 04, 2007 and works fine. I can currently connect and read the pages, so contact me only if the preloaded site definitions do not work correctly.
Now open your FEEDER folder (in the BE interface: List -> select your folder) and you will see what happened. Edit the 'Google (test mode) news.google.it' record and you will see the page with the parameters needed to fetch the news.
Warning: This extension works with GET vars, the PHP file function to fetch the pages, and the PHP eregi function to accept or exclude sites/titles. If you are not familiar with these, please refer to http://www.php.net. The extension does not use navigators (it may in the future) and is therefore unable to send POST data.
Brief explanation of used fields:
Hide: if the engine is hidden, it will not be processed by ttnews_feeder.
Search engine name: the site/engine name.
Scheme: default http://, alternative https://. Trick: for testing, save the remote page (using Mozilla, Explorer, etc.) to your hard disk and transfer it to your server. This avoids stressing the remote server during tests.
Url: the URL for the connection. Here you can use some markers:
###RECORDSTOVIEW### how many records to retrieve (e.g. 10, 20, 50, 100); content is defined in the keywords table
###SEARCHKW### replaced with the search keywords; content is defined in the keywords table
###EXCLUDEKW### replaced with the keywords to exclude; content is defined in the keywords table
Charset: you can select one of the listed items. All strings (title, subtitle, font) will be converted to this charset. If you don't know what to choose, try cp1252. If you see unwanted characters, change this parameter until the problem disappears.
Content unwrap: two tags (or pieces of tags) that delimit what the extension fetches from the page. Content means the whole block of the page containing all the news.
Section unwrap: two tags (or pieces of tags) that delimit what the extension fetches from the Content (above) to extract each news item (title, subtitle, font, etc.).
Title unwrap: two tags (or pieces of tags) that delimit what the extension fetches from the Section (above) to extract the title.
Subtitle, Font and Link unwrap: as above.
Subtitle extraction method: if the title of the news and its subtitle are located on the same page, select 'from search page (url above)': the URL field will be used to fetch the subtitle, that is, from the same page. Otherwise you must select 'from target page, news link'. This second option can slow down the extraction process because News Feeder loads another page to examine and fetch the subtitle. The page depends on the link extracted (see Link unwrap below). If the text is long it will be truncated to the first 255 chars found, preserving the last word found (this is not a crude crop!).
Image unwrap, if any found in the section: if the extracted section (captured with Section unwrap) contains an image and you have set the parameter fetchImages = 1 (bool), News Feeder will download the images whose type is allowed by the TYPO3 configuration parameters defined during the installation process. The images will be stored in the /uploads/pics/ folder of your site. Images larger than the maxImageByteSize parameter will not be written and are thus ignored.
All tags for extraction are separated by the marker ###SEP###; use this marker and the URL markers to design a new engine/site. If you need to define a new site, study the page carefully and define these unwraps correctly, then use TEST MODE to check that the site works correctly, and finally switch the site to production mode (MANUAL CHECK or CRON MODE).
Link unwrap: used to fetch the link that points to the site where the full news item is published (see also Subtitle extraction method).
Url to add to the extracted link: sometimes a site (especially a static one) points to internal news using only relative references (e.g. /index.php?id=28). If this site is indexed by ttnews_feeder, we cannot publish the relative path on our TYPO3 site, so the extension adds this URL to reconstruct the full (absolute) path. Note: if you are configuring a static/dynamic site and the image unwrap is set, this URL will also be used to fetch the images. When News Feeder analyzes the extracted URL, it checks whether it starts with 'http://' or 'https://' (absolute paths); if not, it prepends this parameter to what was fetched.
Autoclean (interactive or CRON mode): if enabled, you can delete (not remove!) expired records, as defined in the next box.
Autoclean backdays: all news related to this site will be marked as deleted after the number of days defined here. Deleted news will still be present in the database, where it is used for title/URL exclusion, but will not be available to visitors.
Mode: running mode. The first time, please select Test mode.
Check every n days: check frequency under Cron/Manual check mode. '0' means every day; otherwise enter the number of days between one check and the next. Note: if you leave this field empty, News Feeder will use 0.
Notes: internal notes. When you proceed with an UPDATE this field will be preserved and News Feeder will add the UPDATE date and hour.
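The unwrap fields above can be illustrated with a small sketch. It is written in Python purely for illustration (the extension itself is PHP) and it assumes, without the source confirming it, that each unwrap is a start part and an end part joined by ###SEP###, that truncation drops a partially cut last word, and that the "Url to add" value is simply prepended to relative links. The function names, the sample page and www.example.org are all hypothetical.

```python
SEP = "###SEP###"

def unwrap(text, unwrap_def):
    """Return every fragment found between the start and end parts of unwrap_def."""
    start, end = unwrap_def.split(SEP, 1)
    fragments = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break
        begin += len(start)
        stop = text.find(end, begin)
        if stop == -1:
            break
        fragments.append(text[begin:stop])
        pos = stop + len(end)
    return fragments

def truncate_on_word(text, limit=255):
    """Cut text to at most `limit` chars; assumption: a partially cut last word is dropped."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    if not text[limit].isspace() and " " in cut:
        cut = cut[:cut.rfind(" ")]
    return cut.rstrip()

def absolutize(link, url_to_add):
    """Leave absolute links alone; prepend url_to_add to relative ones."""
    if link.startswith(("http://", "https://")):
        return link
    return url_to_add.rstrip("/") + "/" + link.lstrip("/")

# Hypothetical page: Content -> Section -> Title / Link
page = ('<div id="news"><li><a href="/index.php?id=28">First title</a></li>'
        '<li><a href="http://example.org/x">Second title</a></li></div>')
content = unwrap(page, '<div id="news">###SEP###</div>')[0]
items = []
for section in unwrap(content, "<li>###SEP###</li>"):
    title = unwrap(section, '">###SEP###</a>')[0]
    link = unwrap(section, 'href="###SEP###"')[0]
    items.append((truncate_on_word(title), absolutize(link, "http://www.example.org")))
print(items)
```

Working from the outside in (Content, then Section, then Title and Link) mirrors the order of the fields, which is why a wrong Content unwrap usually makes every later unwrap come back empty.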
These tables are used for excluded or accredited sites, and their use is intuitive and easy. A 'Title excluded' field needs the URL related to this title to be specified; you can use REGEXP. As stated before, please refer to the PHP site for REGEXP syntax.
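As a rough illustration of a title exclusion paired with a URL, here is a minimal Python sketch. It assumes that a news item is excluded when both the title pattern and the URL pattern match, case-insensitively as eregi does; the rule format and the is_excluded name are hypothetical, and Python's regex syntax is not identical to the POSIX syntax used by eregi.

```python
import re

def is_excluded(title, url, rules):
    """Return True if any (title_pattern, url_pattern) rule matches both title and url."""
    for title_pattern, url_pattern in rules:
        title_hit = re.search(title_pattern, title, re.IGNORECASE)
        url_hit = re.search(url_pattern, url, re.IGNORECASE)
        if title_hit and url_hit:
            return True
    return False

# Hypothetical rule: drop horoscope or lottery titles coming from example.com
rules = [(r"horoscope|lottery", r"example\.com")]
print(is_excluded("Daily HOROSCOPE for today", "http://www.example.com/a", rules))
print(is_excluded("Daily horoscope for today", "http://www.other.org/a", rules))
```

Keep patterns as narrow as possible: a short pattern such as "flu" would also match "influence".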
Define your keywords - Within your system folder assigned to the FEEDER create your keywords. The following example concerns the configuration parameters for the keywords. Here you can define several keywords and configure them individually to obtain different results. Each keyword can be related to one or more sites:
Hide: if the keyword is hidden, it will not be processed by ttnews_feeder.
keyword: the search keyword; you must use the syntax of the desired search engine, e.g. for Google you can fill this field with: antivirus+security (use '+' as separator)
but not...: keyword (or list of keywords) to exclude; typically for Google: +-microsoft+-HIV+-flu
search engines: select from the right-hand box the search engine you want to explore using the keyword. Note that Google, Yahoo and Excite use the same keyword syntax. For sites that use a different syntax for keyword definition and exclusion, you must create a new keyword.
Category: here you can select one or more categories to relate to the extracted and approved news. This is very useful if you need to aggregate news on your site using the tt_news plugin. Refer to the tt_news documentation to learn how to create categories.
Notes: internal notes. Put here whatever you want to remember.
I suggest you define one or more search engines first and then define the keywords. You can associate (relate) each keyword to one or more search engines, but each configured keyword must respect the syntax of the selected search engine(s): Google, AltaVista and Excite use the same syntax. If the syntax is different, you must define another keyword for the desired search engine.
How to define a keyword correctly – To avoid errors, please follow the steps below:
using your preferred browser, connect to the desired engine (e.g. http://news.google.it)
fill in the search box and run a search, e.g. using the following keywords: bush -powell (this means: search for bush news but exclude 'powell' content)
click on the search button
note that the URL box has changed; for the example above you will see: http://news.google.it/news?hl=it&ned=it&q=bush+-powell&btnG=Cerca+nelle+notizie
now you can see how Google passes the GET vars.
fill the field keyword (see previous paragraph Define your keywords) with: bush
fill the field but not... (see previous paragraph Define your keywords) with: +-powell
finally, associate your keyword with the search engine and run a test.
when everything is OK, change your search engine properties, switching to production mode
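The steps above can be sketched as a simple marker substitution. This Python snippet is only an illustration of the idea, assuming the Scheme and Url fields are concatenated and the three markers are replaced by plain string substitution; the URL template shown (including the num parameter) is hypothetical and is not the definition shipped with the extension.

```python
def build_search_url(scheme, url_template, keyword, exclude, records):
    """Replace the URL markers with the values taken from the keyword record."""
    url = url_template.replace("###SEARCHKW###", keyword)
    url = url.replace("###EXCLUDEKW###", exclude)
    url = url.replace("###RECORDSTOVIEW###", str(records))
    return scheme + url

# Hypothetical Url field for an engine with Google-like GET vars
template = "news.google.it/news?hl=it&ned=it&q=###SEARCHKW######EXCLUDEKW###&num=###RECORDSTOVIEW###"
url = build_search_url("http://", template, "bush", "+-powell", 10)
print(url)
```

Because the substitution is literal, the keyword and 'but not...' fields must already be written in the engine's own GET syntax (here '+' as separator and '+-' for exclusions).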
Once you have configured the extension and defined a keyword and a search engine, you can run a test.
Test mode doesn't write any record to your DB and it is a great way to check whether your engine configuration is working well. To run test mode, open the News Feeder module and then, in the right frame, select the menu item:
Test news engine/sites
read the text, select the name of the site to test (or All) and click on the button:
Run site/engine test
Note: if you see nothing, you probably have not defined a keyword related to the site yet. Test mode is very similar to production mode, only the modifications are not written. Test mode does check the DB for records already stored. The records displayed have an icon on the left; on the right there is a brief explanation (this is called the 'news status'). Images will not be written to your server, only displayed through a link to the remote site.
Once you have configured the extension and tested the site/engine as explained, you can switch the site/engine to production mode (refer to the engine configuration to do this).
When a site is in production mode, records will be written to the DB. To run a check, open the News Feeder module and then, in the right frame, select the menu item:
Run Manual Check
read the text and click on the button:
Run Manual Check
please wait a few seconds for the check to finish and read what was fetched.
Note: if a record has been deleted (manually, or automatically refused), it is only hidden and is still stored in the DB. It will be removed permanently only by using the menu item: Clean DB. It is very important to keep in mind that if you remove records permanently, using ttnews_feeder or other utilities, the extension can no longer check whether a given news item is already stored, and when you run a new manual (or CRON) check that news will be loaded again as fresh. Images, if any, will be written to your server in the folder uploads/pics, according to the parameters given; images larger than maxImageByteSize will be skipped.
Available since v.2.0.0.
First add to your site a new BE user with the name:
_cli_ttnewsfeeder
Set the parameter (see reference):
mod.web_txttnewsfeederM1.newsBEOwner = <uid>
If you want to edit/display the fetched news, remember to set the uid above to '1' (usually the uid of the Admin user); otherwise use another BE user uid or, if you prefer, the uid of the user:
_cli_ttnewsfeeder
It's your own choice, depending on security concerns and the privileges assigned to the various BE users.
Now you can easily schedule a CRON task (ask your system administrator) by adding the following line to the CRONTAB ('crontab -e' to edit, 'crontab -l' to list, on Linux systems). The example below defines a daily check scheduled for 8:30 AM:
30 8 * * * php -q /var/www/vhosts/< ...>/typo3conf/ext/ttnews_feeder/cli/ttnews_feeder_cli.phpsh
(The whole instruction must be written on one line only!) However, if you have shell access you can run and test the news feeder interactively by typing:
php -q /var/www/vhosts/< ...>/typo3conf/ext/ttnews_feeder/cli/ttnews_feeder_cli.phpsh
Under some circumstances you will need to change the access permissions of ttnews_feeder_cli.phpsh:
chmod 0755 ttnews_feeder_cli.phpsh
Warning! The News Feeder behaviour will be the same as in the BE, so I suggest you try it in the BE first. Using CRON, News Feeder will fetch news and, for the accredited sites, the news will be published immediately!!! This is a good way to automate your site but it carries some risks, so I suggest you select carefully the sites to define as 'accredited'. The other news, coming from non-accredited sites, will be stored in your database and you must approve it manually. Don't forget that you must define at least a keyword and/or an engine and select the MODE: CRON MODE or CRON MODE+MANUAL MODE
Suspend CRON mode
You can suspend CRON (e.g. when you are on vacation...) by setting
suspendFlag = 1
Set this parameter:
autoSuspendLimit = <value>
with a proper value; when CRON detects that the unapproved news items exceed the limit, it will not fetch and store news.
How to receive a report via email
If you are admin, set CRON as above and append the following to the end of the line:
(...) ttnews_feeder_cli.phpsh | mail admin@your-domain.com
If you are admin and several people are responsible for news approval, each for a different section, you will receive the same report you see in interactive mode (BE) for all activated sections. Otherwise, if you want each person responsible for a section to receive an email with a report, configure the parameter newsResponsibleEmail in the modTSConfig (see Reference).
At each CRON run, each responsible person will receive an email with their own report.
Store only the accredited site records
If you set cronWriteOnlyAccredited to '1' and the CRON task is active, News Feeder will store in the DB only the records coming from accredited sites. This can be very useful if you need to automate the approval process completely, avoiding manual approval. Valid records that would normally be held for manual approval are stored in the DB and marked as deleted, so that News Feeder can recognize them and reject them again on the next check.
Cron keeps your DB clean!
If you set suspendFlag to 1 and the CRON task is active, News Feeder will still be launched and will keep your DB clean, checking for records to delete and erase.
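The CRON rules described in this section can be summarized in a short decision sketch. It is Python pseudologic written for illustration only, not the extension's code; it assumes that cleaning always runs, that fetching stops once the number of unapproved news items reaches the limit, and the function names and returned labels are hypothetical.

```python
def cron_plan(suspend_flag, unapproved_count, auto_suspend_limit):
    """Return the actions a CRON run would perform, in order."""
    actions = ["clean"]              # assumption: DB cleaning always runs
    if suspend_flag:
        return actions               # suspended: clean only, no fetching
    if unapproved_count >= auto_suspend_limit:
        return actions               # too many news items waiting for approval
    actions.append("fetch")
    return actions

def store_decision(is_accredited, cron_write_only_accredited):
    """Decide what happens to one valid record fetched under CRON."""
    if is_accredited:
        return "publish"             # accredited sites are published immediately
    if cron_write_only_accredited:
        return "store_as_deleted"    # kept only so it is rejected on the next check
    return "store_for_approval"      # waits for manual approval

print(cron_plan(suspend_flag=True, unapproved_count=0, auto_suspend_limit=100))
print(cron_plan(suspend_flag=False, unapproved_count=5, auto_suspend_limit=100))
print(store_decision(is_accredited=False, cron_write_only_accredited=True))
```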
Image download is available only if you set the fetchImages parameter to true (1). If you also want downloaded images to be resized to a certain width (e.g. 100 px), you must set the autoresizeImages parameter too.
If you set autoresizeImages to true (1), the images will first be resized, and only after resizing will they be measured and accepted according to the maxImageByteSize, maxImagePxWidth and maxImagePxHeight parameters.
Check for extensions allowed – News Feeder first accepts the image extensions allowed by the TYPO3 general configuration. Note that the autoresize option is available only for the JPEG, JPG, GIF and PNG image formats. If autoresize is on and an image is in none of these formats, it will be measured as described above and refused if it is oversized.
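The order of operations (resize first, then measure) can be sketched as follows. This is an illustrative Python sketch, not the extension's code: it assumes the height is scaled proportionally to the new width, it uses the parameter names from the text with the default values from the reference table, and it leaves out the TYPO3 allowed-extensions check. The function names are hypothetical.

```python
RESIZABLE_EXTS = {"jpeg", "jpg", "gif", "png"}

def final_dimensions(ext, width, height, cfg):
    """Return the image size after the optional autoresize step."""
    if cfg["autoresizeImages"] and ext.lower() in RESIZABLE_EXTS:
        target = cfg["resizedImagePxWidth"]
        # assumption: height follows the width proportionally
        return target, max(1, round(height * target / width))
    return width, height

def within_limits(width, height, byte_size, cfg):
    """Measure the (possibly resized) image against the three max parameters."""
    return (byte_size <= cfg["maxImageByteSize"]
            and width <= cfg["maxImagePxWidth"]
            and height <= cfg["maxImagePxHeight"])

cfg = {
    "autoresizeImages": True,
    "resizedImagePxWidth": 80,
    "maxImageByteSize": 15000,
    "maxImagePxWidth": 300,
    "maxImagePxHeight": 300,
}
print(final_dimensions("jpg", 640, 480, cfg))   # resized before measuring
print(final_dimensions("bmp", 640, 480, cfg))   # not resizable, measured as downloaded
```

With these values a 640x480 JPG passes after resizing, while a 640x480 image in a non-resizable format is refused because it is wider than maxImagePxWidth.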
Autoresize images – I suggest keeping it on, because you save disk space on your server and you will have many more images for your news, since images will rarely be refused.
Images quality – The first release with image support (v 1.1.16) was not tested with the PNG format and could be improved. Please contact me if images are not displayed as expected so I can introduce new code for resizing.
Images and tt_news – If you tell News Feeder to resize images, please note that all images will be resized again by the tt_news extension to create thumbnails in news listings and elsewhere. The best way to avoid low quality is to set the tt_news parameters (max image width and max image height) greater than or equal to resizedImagePxWidth. The height will be calculated automatically by News Feeder.
Why can't I see anything under test mode? Check that your configuration is OK (header unwrap etc.), then verify that your site is related to a keyword. A common mistake is forgetting that engines need to be related to a keyword definition. If you have not loaded a keyword related to your (new) site, your site will not be visited.
I have just loaded a new definition and run a manual test, and I can't see anything. Why? You can define several sites/engines, but to run them you must create at least one keyword and associate (relate) it to your engine. So, if you have just loaded a new engine (e.g. Google), please create a new keyword and select the engine from the menu.
Parsing 'news.google.it', sometimes a subtitle disappears. Why? The extension extracts the text using the 'unwrap' parameters given in the search engine definition. Some Google records are formatted differently and the extension cannot extract them correctly. However, the title is always available.
I'm Italian and I have loaded the news.google.COM site definition. Nothing works, why? The extension connects to news.google.com but Google redirects to the Italian service, news.google.it. The pages are formatted differently and the extension cannot fetch records if the site is redirected.
Most important configuration in order to guarantee the correct implementation:
Define the pid of the ttnews_feeder system folder
Define the uid of the news owner (BE user)
- Reference (TSconfig): ttnews_feeder – News Feeder
| Property: | Data type: | Description: | Default: |
|---|---|---|---|
| clearCachePages | int+/string | List of the pids of all pages whose cache must be cleared. Clearing runs at the end of the process so that fresh news from accredited sites is immediately available on BE (since v.2.1.1 you can also use: pages,all,temp_CACHED) | - |
| useSubIfTitleIsEmpty | boolean | 1 (true), 0 (false). If set to 1 and the news Title is not extracted (for whatever reason), it is replaced by the subtitle, limited to 60 chars | 1 |
| useTitleIfSubIsEmpty | boolean | 1 (true), 0 (false). If set to 1 and the news Subtitle is not extracted (for whatever reason), it is replaced by the Title, limited to 250 chars | 1 |
| BackDays | int+ | Under evaluation; currently not used | 7 |
| suspendFlag | boolean | Set to '1' if you are on vacation: this suspends all fetching through CRON | 0 |
| autosuspendLimit | int+ | Works only in CRON mode. If this limit is reached (e.g. no operator is available to approve fresh news because of vacation...) no more news is accepted and stored in the DB. The counter keeps track only of approved news. This prevents DB overload. | 100 |
| maxRecordsPerSession | int+ | Works only in MANUAL CHECK mode. If this limit is reached no more news is accepted and stored in the DB. The counter keeps track only of approved news. | 30 |
| feederSysFolderPID | int+ | The PID of the page where your configuration tables (keywords, sites/engines to visit, etc.) are stored. | required |
| newsSysFolderPID | int+ | The PID of the page where your EXTERNAL NEWS is stored. I suggest keeping your internal and external news separate so that they are easier to inspect. | ul |
| newsBEOwner | int+ | Use this parameter only if you want the same user id to be written into the tt_news table; otherwise the UID of the BE user running News Feeder is used. | 1 |
| removeExternalOldNews | int+ | Days back. When this limit is reached, CRON (if used) removes the expired news; if you work in MANUAL CHECK, the news is removed manually | 50 |
| removeMyOldNews | string | Days back. When this limit is reached, CRON (if used) removes the expired news; if you work in MANUAL CHECK, the news is removed manually. | 920 |
| charSet | string | Charset for HTML conversion; accepts the same values as the PHP htmlentities function | cp1252 |
| maxImageByteSize | int+ | Maximum size in bytes of fetched images | 15000 |
| fetchImages | boolean | Whether to fetch images from the site/engine; default: disabled | 0 |
| maxImagePxWidth | int+ | If the width of the captured image is over this limit, the image is refused | 300 |
| maxImagePxHeight | int+ | If the height of the captured image is over this limit, the image is refused | 300 |
| resizeImages | boolean | Autoresize of downloaded images; if set, all images are resized according to the resizedImagePxWidth parameter | 0 |
| resizedImagePxWidth | int+ | Works only if fetchImages and resizeImages are both set to 1 (true). Images narrower or wider than this value are resized to it; e.g. if the width of the downloaded image is 120 pixels, the resulting image will be 80 pixels wide; if it is 60 pixels, the new width will also be 80 pixels. | 80 |
| resizedJpgCompression | int+ | Compression of the output image if its extension is JPG or JPEG; use 100 for no compression. | 70 |
| useRandomTime | boolean | Whether the date and hour set for fetched news are calculated randomly. You can disable this by setting it to '0'; this can be useful to fetch news in the order of importance given by the search engine visited | 1 |
| newsResponsibleEmail | string | A valid email address. Each time CRON runs, an email containing a report is sent to this address. | - |
| cronWriteOnlyAccredited | boolean | If set to '1' and News Feeder is running under CRON, only the records of accredited sites are written to the DB. | |
| apacheOwner | string | CRON mode: downloaded images are assigned this owner. Default: owner of uploads/pics. | Same as uploads/pics |
| apacheGroup | string | CRON mode: downloaded images are assigned this group. Default: group of uploads/pics. | Same as uploads/pics |
[tsref:(cObject).web_txttnewsfeederM1 ]
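As a small illustration of how useSubIfTitleIsEmpty and useTitleIfSubIsEmpty might interact, here is a hedged Python sketch. It assumes a plain character cut at the 60 and 250 char limits given in the table; the apply_fallbacks name and the keyword arguments are hypothetical and only mirror the two parameters.

```python
def apply_fallbacks(title, subtitle, use_sub_if_title_empty=True, use_title_if_sub_empty=True):
    """Fill an empty title from the subtitle (60 chars) or an empty subtitle from the title (250 chars)."""
    if not title and subtitle and use_sub_if_title_empty:
        title = subtitle[:60]
    elif not subtitle and title and use_title_if_sub_empty:
        subtitle = title[:250]
    return title, subtitle

print(apply_fallbacks("", "A subtitle that stands in for the missing title"))
print(apply_fallbacks("Only a title", ""))
```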