Details
- Type: Bug
- Status: Closed
- Priority: Priority Two: Fix in current sprint
- Resolution: Fixed
- Affects Version/s: STABLE 3.5.2c
- Fix Version/s: STABLE 3.5.4
- Component/s: None
- Description:
The scheduler appears to be running successfully, but it doesn't actually try to index any files, and the dashlet does not indicate that an indexing task has run recently.
The only time I can get the application to attempt indexing is when I manually run the script from the command line or enter a modified URL (http://XXXXXXXXXXXX/search2/indexing/bin/cronIndex). Compare the modified URL to the one in the log file, which includes var/www/knowledgeTree/.
E.g. of log file:
2008-07-14 06:01:01 () DEBUG: Scheduler: starting
2008-07-14 06:01:01 () DEBUG: call_page: calling http://XXXXXXXXXXXX/var/www/knowledgeTree/search2/indexing/bin/cronIndexer.php
2008-07-14 06:01:01 () DEBUG: Scheduler - Task: Indexing completed in 0.43s.
2008-07-14 06:01:02 () DEBUG: call_page: calling http://XXXXXXXXXXXX/var/www/knowledgeTree/search2/indexing/bin/cronMigration.php
2008-07-14 06:01:02 () DEBUG: Scheduler - Task: Index Migration completed in 0.43s.
2008-07-14 06:01:02 () DEBUG: Scheduler - Task: Open Office test completed in 0.42s.
2008-07-14 06:01:03 () DEBUG: Scheduler - Task: Cleanup Temporary Directory completed in 0.42s.
2008-07-14 06:01:03 () DEBUG: Scheduler: stopping
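The check described above (spotting a filesystem path inside the called URL) can be sketched as a small log scan. This is only an illustration: the function name `suspicious_calls`, the `fs_prefix` default, and the sample lines are assumptions made for this sketch, and it is not part of KnowledgeTree.

```python
import re
from urllib.parse import urlparse

# Matches the "call_page: calling <url>" debug lines seen in the scheduler log.
CALL_PAGE = re.compile(r"call_page: calling (\S+)")


def suspicious_calls(log_lines, fs_prefix="/var/www/"):
    """Return called URLs whose path starts with a filesystem prefix.

    fs_prefix is an assumption for this sketch; adjust it to the install path.
    """
    hits = []
    for line in log_lines:
        match = CALL_PAGE.search(line)
        if match and urlparse(match.group(1)).path.startswith(fs_prefix):
            hits.append(match.group(1))
    return hits


# Sample lines modelled on the log excerpt; the second host is a placeholder.
sample = [
    "2008-07-14 06:01:01 () DEBUG: Scheduler: starting",
    "2008-07-14 06:01:01 () DEBUG: call_page: calling "
    "http://XXXXXXXXXXXX/var/www/knowledgeTree/search2/indexing/bin/cronIndexer.php",
    "2008-07-14 06:01:02 () DEBUG: call_page: calling "
    "http://kt.example.org/search2/indexing/bin/cronIndexer.php",
]

for url in suspicious_calls(sample):
    print("filesystem path leaked into URL:", url)
```

Only the first `call_page` line is reported, because its URL path begins with the install directory rather than with the web-relative `/search2/...` path.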
- Environment: Debian Linux. KT Source install.
Attachments
- 20080805-1.log (5 kB), Ziphozakhe Mashologu, 05/Aug/08 01:51 PM
- 20080805-2.log (7 kB), Rakesh Mistry, 05/Aug/08 02:38 PM
- dmsDefaults.php (30 kB), Megan Watson, 14/Jul/08 03:43 PM
- ktutil.inc (36 kB), Megan Watson, 17/Jul/08 09:12 AM
- ktutil.inc (36 kB), Megan Watson, 16/Jul/08 01:55 PM
- log-2008-07-16.www-data.txt (7 kB), Rakesh Mistry, 16/Jul/08 02:12 PM
- log-2008-07-16.www-data.txt-scheduler (10 kB), Rakesh Mistry, 16/Jul/08 03:14 PM
- log-2008-07-17.www-data.txt (33 kB), Rakesh Mistry, 17/Jul/08 12:54 PM
- log-2008-08-06.www-data.txt (11 kB), Rakesh Mistry, 06/Aug/08 07:19 AM
- log.txt (5 kB), Rakesh Mistry, 15/Jul/08 07:15 AM
- OpenOfficeTextExtractor.inc.php (4 kB), Megan Watson, 05/Aug/08 01:57 PM
- php_error_log (62 kB), Rakesh Mistry, 23/Jul/08 02:09 PM
- search2.tgz (74 kB), Megan Watson, 25/Aug/08 02:00 PM
- search2.tgz (21 kB), Megan Watson, 20/Aug/08 08:35 AM
- search2.tgz (18 kB), Megan Watson, 08/Aug/08 03:20 PM
- Screenshot-Task Scheduler.png (48 kB)
Activity
Rakesh Mistry added a comment - 14/Jul/08 12:26 PM I can confirm that rootUrl = default for the installation.
Rakesh Mistry added a comment - 14/Jul/08 12:31 PM This is the apache config for the KT application:
#<Directory /knowledgeTree>
# Options FollowSymLinks
# AllowOverride All
#</Directory>
<VirtualHost *>
ServerAdmin webmaster@xxxxxxxxxxxx
ServerName xxxxxxxxxxxxxx
DocumentRoot /var/www/knowledgeTree
<Directory /var/www/knowledgeTree>
Options FollowSymLinks
AllowOverride All
Order allow,deny
allow from all
</Directory>
ErrorLog /var/log/apache2/error.log
# Possible values include: debug, info, notice, warn, error, crit,
# alert, emerg.
LogLevel warn
CustomLog /var/log/apache2/access.log combined
ServerSignature On
</VirtualHost>
Megan Watson added a comment - 14/Jul/08 01:47 PM The indexing task gets the server name from a file that is created on login. Could you check whether this file exists? The file, serverName.txt, should be in your knowledgeTree/var/cache directory. If it does exist, what server name does it contain?
Rakesh Mistry added a comment - 14/Jul/08 01:52 PM The file does exist; it contains the same value as represented by xxxxxx above.
http://kt.xxxxxxx.org.za
Obviously masked for security reasons.
Megan Watson added a comment - 14/Jul/08 03:19 PM The only other place that I can think of where this might be going wrong is in the initialisation of the root url in the dmsDefaults. I'll attach a copy of the dmsDefaults with a few debugging statements. I notice your logLevel is already set to debug. Could you do the following:
- Open the knowledgeTree/config directory
- Rename your existing dmsDefaults.php file
- Copy the new one in
- Open the knowledgeTree/cache directory and delete its contents
- Log in to knowledgeTree and navigate to the Browse Documents tab.
- Send me the tail end of today's log file
Thanks.
Rakesh Mistry added a comment - 14/Jul/08 03:34 PM I don't see the attachment... where is the link located?
I will probably get to this tomorrow, as it's the end of the day for me. Thanks!
Megan Watson added a comment - 14/Jul/08 03:43 PM Sorry, I forgot to add the attachment. End of the day for me too.
Rakesh Mistry added a comment - 15/Jul/08 07:16 AM Morning Megan. Thanks, I've uploaded the tail end of the log file.
Megan Watson added a comment - 16/Jul/08 01:54 PM Yes, the log was fine, thank you. Sorry for the delay, I needed some advice on this issue. I haven't seen it anywhere else and can't reproduce it myself, so the best way of tracking it down and resolving it is on your machine. The problem is that the directory path (/var/www/knowledgeTree) is added to the url for the indexer. The parts of the url come from 2 places: the serverName.txt file, which you confirmed does not have the path name in it, and the root url, which is set to default in the config file. The dmsDefaults file I attached on Monday contains a function which resolves the rootUrl when it is set to default. However, from the log it appears to be resolving it correctly. The next place to look is at the point where it puts these together and creates the url.
Could you repeat the process and replace your ktutil.inc file with the one I'm about to attach?
- The file is in: knowledgeTree/lib/util/
Thanks
Megan Watson added a comment - 16/Jul/08 02:35 PM The new log file looks a lot more promising! I need the debug logs from when the scheduler runs the indexing task. Could you let it run for a few minutes and upload the new log?
Thanks
Rakesh Mistry added a comment - 16/Jul/08 03:14 PM Thanks Megan! I've attached the new log, which includes debug output from the scheduler running as well.
Megan Watson added a comment - 17/Jul/08 09:34 AM Morning. I've finally figured out what has been happening with the url for the indexer. The rootUrl is resolved using the path to the current script. When the script runs in the browser, that path is relative to the server root; when it runs in the background via a cron job, the path is the full (absolute) path. I'm not sure if this is specific to Debian or because you're running a source install, but it hasn't appeared previously. We have had a similar problem with the server name, which is why the serverName.txt file came into existence.
I've attached a new version of the ktutil.inc file (the old one should be greyed out) that should hopefully fix the problem. It now saves the full url to the serverName.txt file and doesn't rely on resolving the rootUrl.
- Replace the ktutil.inc file in "knowledgeTree/lib/util" with the new one.
- Delete the contents of the "knowledgeTree/var/cache" directory.
- Open / refresh the login screen in the browser. This should regenerate the serverName.txt file in the cache directory (you may need to clear your browser cache as well).
- Check that the serverName.txt file still contains: http://kt.xxxxxxx.org.za
- Let the scheduler run for a while
You might need to reschedule the indexing on some of your documents. You can do that through the Admin -> Search and Indexing interface.
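The difference between the two contexts can be illustrated with a minimal sketch. It is not KnowledgeTree's actual code (which is PHP, in ktutil.inc and dmsDefaults.php); the function names, the `script_rel` parameter, and the host `kt.example.org` are all hypothetical and only mimic the behaviour described in this comment.

```python
INDEXER = "search2/indexing/bin/cronIndexer.php"


def resolve_root_url(script_path, script_rel):
    """Derive a root URL by stripping the script's own relative path.

    Hypothetical stand-in for the rootUrl = default resolution.
    """
    if script_path.endswith(script_rel):
        return script_path[: -len(script_rel)].rstrip("/")
    return ""


def indexer_url_old(server_name, script_path, script_rel):
    """Old behaviour: server name plus a root URL resolved from the script path."""
    return f"{server_name}{resolve_root_url(script_path, script_rel)}/{INDEXER}"


def indexer_url_fixed(cached_base_url):
    """Fixed behaviour: use the full base URL cached at login (serverName.txt)."""
    return f"{cached_base_url.rstrip('/')}/{INDEXER}"


host = "http://kt.example.org"  # placeholder host, not from the thread

# In the browser the script path is relative to the server root.
print(indexer_url_old(host, "/login.php", "login.php"))

# Under cron the script path is the absolute filesystem path.
print(indexer_url_old(host, "/var/www/knowledgeTree/bin/scheduler.php", "bin/scheduler.php"))

# With the fix, the cached full URL is used regardless of context.
print(indexer_url_fixed(host))
```

The second call reproduces the kind of URL seen in the original log, with `/var/www/knowledgeTree` wedged between the host and `search2/`; the third does not depend on the script path at all.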
Rakesh Mistry added a comment - 17/Jul/08 12:54 PM I've attached the latest log file with the new ktutil.inc.
Sorry, it's a bit long. I had to log in an extra time, because the first time it complained that the user had not logged in for the first time.
Megan Watson added a comment - 17/Jul/08 01:21 PM 2008-07-17 13:01:02 () DEBUG: kt_url: base url - http://kt.XXXXXX.org.za
2008-07-17 13:01:02 () DEBUG: call_page: calling http://kt.XXXXXX.org.za/search2/indexing/bin/cronIndexer.php
2008-07-17 13:01:02 () DEBUG: Scheduler - Task: Indexing completed in 0.43s.
Looks like it's resolving the url correctly now. I'll put the fix into our next release. You can replace the dmsDefaults.php file with the original; otherwise your log files will fill up quickly in debug mode.
Thanks for your patience and help with the debugging!
Rakesh Mistry added a comment - 17/Jul/08 01:34 PM A pleasure..
Would I be able to get the fixed file, or will I have to install the new release?
I'm a bit limited with re-installation at the moment because our sysadmin is on holiday and I only have access to the knowledgetree directories.
Rakesh Mistry added a comment - 17/Jul/08 01:43 PM Oops.. can I move the default ktutil.inc back as well?
Megan Watson added a comment - 17/Jul/08 01:52 PM You already have the fix :) The ktutil.inc file contains the fix. It'll be the same as what I put in for the next release.
Rakesh Mistry added a comment - 17/Jul/08 02:17 PM Ok.. Thanks!
One thing is worrying me....
If it is working, surely when the scheduler runs I should be getting the error I see in my other bug post:
http://issues.knowledgetree.com/browse/KTS-3490 ?
And shouldn't the documents in the queue then move to the "problem queue"...
Megan Watson added a comment - 17/Jul/08 02:55 PM Not necessarily. It depends on how many documents you have and what their mime types are. The indexer batch indexes a set number of documents (50, I think) each time it runs, and it should run every minute. If you've been running the indexer from the command line then it's possible that all your documents have been indexed, or at least that indexing has been attempted. Also, Open Office is only used for a subset of the documents, so your PDFs, etc. will have been indexed without a problem.
The DMS Administration -> Search and Indexing interface should provide you with some answers.
Under "Extractor Information" you can see which document mime types are indexed by Open Office.
Under "Pending Documents Indexing Queue" you can see if there are any documents left in the queue.
And "Document Indexing Diagnostics" should give you the errors in your other issue.
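As a rough illustration of the batching described above, the following sketch drains a pending queue in fixed-size batches, one batch per scheduler run. The function `run_indexer_once`, the queue representation, and the document ids are hypothetical; the batch size of 50 is the figure recalled in this comment ("50, I think"), not a confirmed setting.

```python
from collections import deque


def run_indexer_once(pending, batch_size=50):
    """Remove up to batch_size document ids from the pending queue and return them."""
    batch = []
    while pending and len(batch) < batch_size:
        batch.append(pending.popleft())
    return batch


# 120 hypothetical document ids waiting to be indexed.
pending = deque(range(1, 121))

runs = 0
while pending:
    run_indexer_once(pending)
    runs += 1

print("scheduler runs needed to drain the queue:", runs)
```

With a small queue, such as the three documents mentioned later in this thread, a single successful run would be enough to empty it, which is why a queue that never shrinks points at the indexing task not actually executing.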
Rakesh Mistry added a comment - 21/Jul/08 08:12 AM Sorry for the late reply.. I was not in on Friday.
I only have three documents in the queue. One of them is a PDF; the other two are ODTs. The extractor information (mime types) shows the document types listed.
Document Indexing Diagnostics shows no issues.
However, the document indexing statistics dashlet shows the last indexing time as a little over 3 days ago, which is probably the last time I kicked off the scheduler manually!
Rakesh Mistry added a comment - 21/Jul/08 08:14 AM And I'm pretty sure if I kick off the scheduler via a browser url it will index the PDF document.
Megan Watson added a comment - 21/Jul/08 08:37 AM The scheduler gets added as a service during the installation process, but on a source install this won't have happened. You'll need to add it to a cron job for it to run regularly.
http://wiki.knowledgetree.com/Scheduler
Rakesh Mistry added a comment - 21/Jul/08 08:44 AM The cronjob was already added by our sysadmin:
1 * * * * www-data php -Cq /var/www/knowledgeTree/bin/scheduler.php
Megan Watson added a comment - 21/Jul/08 08:59 AM Okay, the first thing to check then is whether the scheduler is running and when each task was last run. You can see this under DMS Administration -> Miscellaneous -> Manage Task Scheduler.
Rakesh Mistry added a comment - 21/Jul/08 09:08 AM As you can see from the screenshot, the scheduler appears to be running.
Megan Watson added a comment - 23/Jul/08 10:02 AM Right, correct me if I'm wrong here, the status at the moment is:
- the scheduler is running correctly - all tasks are being updated as run
- there are 3 documents sitting in the Pending Documents Queue
- there are no issues in the Document indexing diagnostics
The cronIndexer should be using the correct url to call the indexing task after the fix. Your issue with open office shouldn't affect the PDF being indexed.
What do your logs say?
Rakesh Mistry added a comment - 23/Jul/08 10:12 AM Hi Megan,
Correct! Only the document indexing dashlet shows that indexing hasn't run correctly in 4 days.
Strangely the log file only indicates the scheduler running every hour (not sure if this is normal).
And I'm pretty sure if I run http://kt.xxxxxxxxxxx.org.za/search2/indexing/bin/scheduler.php in my browser it will index the PDF at least.
<log>
2008-07-23 10:01:02 () DEBUG: Scheduler: starting
2008-07-23 10:01:02 () DEBUG: kt_url: base url - http://kt.xxxxxxxxxxx.org.za
2008-07-23 10:01:02 () DEBUG: call_page: calling http://kt.xxxxxxxxxxx.org.za/search2/indexing/bin/cronIndexer.php
2008-07-23 10:01:02 () DEBUG: Scheduler - Task: Indexing completed in 0.43s.
2008-07-23 10:01:02 () DEBUG: kt_url: base url - http://kt.xxxxxxxxxxx.org.za
2008-07-23 10:01:02 () DEBUG: call_page: calling http://kt.xxxxxxxxxxx.org.za/search2/indexing/bin/cronMigration.php
2008-07-23 10:01:03 () DEBUG: Scheduler - Task: Index Migration completed in 0.43s.
2008-07-23 10:01:03 () DEBUG: Scheduler - Task: Open Office test completed in 0.42s.
2008-07-23 10:01:03 () DEBUG: Scheduler - Task: Cleanup Temporary Directory completed in 0.42s.
2008-07-23 10:01:03 () DEBUG: Scheduler: stopping
2008-07-23 11:01:01 () DEBUG: Scheduler: starting
2008-07-23 11:01:02 () DEBUG: kt_url: base url - http://kt.xxxxxxxxxxx.org.za
2008-07-23 11:01:02 () DEBUG: call_page: calling http://kt.xxxxxxxxxxx.org.za/search2/indexing/bin/cronIndexer.php
2008-07-23 11:01:02 () DEBUG: Scheduler - Task: Indexing completed in 0.43s.
2008-07-23 11:01:02 () DEBUG: kt_url: base url - http://kt.xxxxxxxxxxx.org.za
2008-07-23 11:01:02 () DEBUG: call_page: calling http://kt.xxxxxxxxxxx.org.za/search2/indexing/bin/cronMigration.php
2008-07-23 11:01:02 () DEBUG: Scheduler - Task: Index Migration completed in 0.43s.
2008-07-23 11:01:03 () DEBUG: Scheduler - Task: Open Office test completed in 0.43s.
2008-07-23 11:01:03 () DEBUG: Scheduler - Task: Cleanup Temporary Directory completed in 0.42s.
2008-07-23 11:01:03 () DEBUG: Scheduler: stopping
</log>
Megan Watson added a comment - 23/Jul/08 10:43 AM
The cronjob has it set to run every hour:
1 * * * * www-data php -Cq /var/www/knowledgeTree/bin/scheduler.php
To run it every minute use:
*/1 * * * * www-data php -Cq /var/www/knowledgeTree/bin/scheduler.php
The snippet from your log file doesn't show the cronIndexer to be running. What often happens is the scheduler creates one log file as the root user and the indexer creates another log file as the apache user (www-data). Do you have multiple log files?
Rakesh Mistry added a comment - 23/Jul/08 10:49 AM
Thanks.. I'll get our sysadmin to update the cron job.
Nope .. we only have log files for www-data user.
Megan Watson added a comment - 23/Jul/08 01:19 PM
I'm not sure what the problem is. It looks like everything is in place, but the file isn't being run. It's unlikely to be a permissions / access problem since it's being run as the apache user. It's possible that there's a PHP error that we're not picking up. In your config.ini, set phpErrorLogFile to true; it should log additional information to the php_error_log file. Leave it to run for a bit, then check your logs.
In addition, what happens when you run the indexer directly:
php -Cq /var/www/knowledgeTree/search2/indexing/bin/cronIndexer.php
Rakesh Mistry added a comment - 23/Jul/08 02:09 PM
PHP Error log attached.
If I execute the command from the command line I get the following:
rmistry@xxxxxxxx:/var/www/knowledgeTree/var/log$ sudo php -Cq /var/www/knowledgeTree/search2/indexing/bin/cronIndexer.php
Fatal error: Allowed memory size of 16777216 bytes exhausted (tried to allocate 1124253 bytes) in /var/www/knowledgeTree/search2/indexing/indexerCore.inc.php on line 593
And in the log file I get:
2008-07-23 15:03:08 () DEBUG: indexDocuments: start
2008-07-23 15:03:08 () DEBUG: Indexer::clearoutDeleted: removed documents from indexing queue that have been deleted
2008-07-23 15:03:08 () DEBUG: Indexing docid: 7 extension: 'pdf' mimetype: 'application/pdf' extractor: 'PDFExtractor'
2008-07-23 15:03:08 () INFO: Processing docid: 7.
2008-07-23 15:03:08 () DEBUG: Extra Info docid: 7 Source File: '/home/Documents/00/10' Target File: '/var/www/knowledgeTree/var/tmp/ktindexerTZTNuB'
2008-07-23 15:03:08 () DEBUG: PDFExtractor: '/usr/bin/pdftotext' -nopgbrk -enc UTF-8 "/home/Documents/00/10" "/var/www/knowledgeTree/var/tmp/ktindexerTZTNuB"
Which at least looks like it's attempting to do the indexing! Unlike the normal scheduling.
Crikt added a comment - 25/Jul/08 11:04 AM
Many thanks for fixing this issue.
Anyway IMHO scripts run by cron (i.e. locally and without user interaction) shouldn't rely on the serverName.txt file (which in turn relies on browser-server interaction) to initialise their variables.
This leads to problems in the following scenario: an instance of KT running in a private LAN, accessible both from the internal LAN and from the external internet, though with different URLs (server names and possibly ports).
If the URL (server name/port) used for external access is not DNS-resolvable from the host running KT (and this is not necessarily required!), and the last connection to KT has come from the outside (so that serverName.txt contains the external URL), the indexing scripts run by cron won't work.
Such scripts should rely on "static" configuration files for their functioning.
(Incidentally, this may not be the only issue that prevents using KT in the kind of scenario I described above; it's just the first one I ran into.)
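The failure mode described above (a stored server name that cannot be resolved from the KT host itself) can be checked with a few lines of Python. This is only a sketch: the helper name `host_resolves` is made up for illustration, and it assumes serverName.txt holds either a bare host[:port] or a full URL, which the thread does not confirm.

```python
import socket
from urllib.parse import urlparse


def host_resolves(server_name: str) -> bool:
    """Return True if the host part of server_name resolves from this machine."""
    # Accept either a bare "host:port" or a full URL (assumed formats).
    parsed = urlparse(server_name if "//" in server_name else "//" + server_name)
    host = parsed.hostname
    if not host:
        return False
    try:
        socket.getaddrinfo(host, None)
    except socket.gaierror:
        # The name could not be resolved from this machine.
        return False
    return True
```

Run on the KT server against the value currently stored in serverName.txt, a `False` result would be consistent with the cron-driven scripts being unable to reach the indexer URL.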
Rakesh Mistry added a comment - 28/Jul/08 07:09 AM
Any more ideas Megan? I really need to get this DMS implemented soon..
Thanks.
Megan Watson added a comment - 28/Jul/08 01:26 PM
Fatal error: Allowed memory size of 16777216 bytes exhausted (tried to allocate 1124253 bytes) in /var/www/knowledgeTree/search2/indexing/indexerCore.inc.php on line 593
The line in the error is basically running through the content of the file and replacing all the tabs and new lines with spaces. What is the memory_limit set to in your php.ini file?
Mine looks as follows:
max_execution_time = 200 ; Maximum execution time of each script, in seconds
max_input_time = 200 ; Maximum amount of time each script may spend parsing request data
memory_limit = 500000000 ; Maximum amount of memory a script may consume (500MB)
I set the memory_limit in Bytes because it wasn't recognising it in MB or GB.
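For reference, php.ini sizes use single-letter shorthand suffixes (K, M, G) rather than "MB" or "GB", which may be why the longer forms were not recognised. The sketch below converts such a value to bytes so it can be compared with the figure in the fatal error; the function name is hypothetical, and it deliberately rejects anything that is not a plain integer with an optional K/M/G suffix.

```python
def php_shorthand_to_bytes(value: str) -> int:
    """Convert a php.ini size such as '16M' or '500000000' to bytes."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = value.strip()
    suffix = text[-1:].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    # No recognised suffix: the value is taken as a plain byte count.
    return int(text)


print(php_shorthand_to_bytes("16M"))  # 16777216, the limit quoted in the fatal error
```

So a memory_limit of 16M corresponds exactly to the 16777216 bytes in the error message, while 500000000 bytes is a little under 477 MiB.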
Rakesh Mistry added a comment - 28/Jul/08 01:37 PM
max_execution_time = 30 ; Maximum execution time of each script, in seconds
max_input_time = 60 ; Maximum amount of time each script may spend parsing request data
memory_limit = 16M ; Maximum amount of memory a script may consume (16MB)
I'll get our sysadmin to change these values.. but I'm not sure it's the root cause. Because, today when I ran it I got:
File "/var/www/knowledgeTree/bin/openoffice/DocumentConverter.py", line 152, in ?
converter.convert(argv[1], argv[2])
File "/var/www/knowledgeTree/bin/openoffice/DocumentConverter.py", line 116, in convert
document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, _unoProps(Hidden=True, ReadOnly=True))
__main__.com.sun.star.lang.IllegalArgumentException: URL seems to be an unsupported one.
</output>
2008-07-28 14:35:32 () DEBUG: Document docid: 3 was not removed from the queue as it looks like there was a problem with the extraction process
2008-07-28 14:35:32 () DEBUG: Indexing docid: 6 extension: 'odt' mimetype: 'application/vnd.oasis.opendocument.text' extractor: 'OOTextExtractor'
2008-07-28 14:35:32 () INFO: Processing docid: 6.
2008-07-28 14:35:32 () DEBUG: Extra Info docid: 6 Source File: '/var/www/knowledgeTree/var/tmp/6.odt' Target File: '/var/www/knowledgeTree/var/tmp/ktindexerrgQMfk'
2008-07-28 14:35:32 () DEBUG: OOTextExtractor: "/usr/bin/python" "/var/www/knowledgeTree/bin/openoffice/DocumentConverter.py" "/var/www/knowledgeTree/var/tmp/6.odt" "/var/www/knowledgeTree/var/tmp/ktindexerrgQMfk.html" 127.0.0.1 8100
2008-07-28 14:35:33 () ERROR: Could not extract contents from document 6
2008-07-28 14:35:33 () ERROR: <output>Traceback (most recent call last):
File "/var/www/knowledgeTree/bin/openoffice/DocumentConverter.py", line 152, in ?
converter.convert(argv[1], argv[2])
File "/var/www/knowledgeTree/bin/openoffice/DocumentConverter.py", line 116, in convert
document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, _unoProps(Hidden=True, ReadOnly=True))
__main__.com.sun.star.lang.IllegalArgumentException: URL seems to be an unsupported one.
</output>
2008-07-28 14:35:33 () DEBUG: Document docid: 6 was not removed from the queue as it looks like there was a problem with the extraction process
2008-07-28 14:35:33 () DEBUG: indexDocuments: done
Rakesh Mistry added a comment - 28/Jul/08 01:39 PM
BTW, do you know exactly which OpenOffice packages need to be installed for KT?
Like I mentioned in the other bug, these are what we have installed at the moment:
i A openoffice.org-base-core - OpenOffice.org office suite -- libdba
i A openoffice.org-common - OpenOffice.org office suite architecture i
i A openoffice.org-core - OpenOffice.org office suite architecture d
i openoffice.org-dtd-officedocume - OfficeDocument 1.0 DTD (OpenOffice.org 1.x
i A openoffice.org-filter-binfilter - Legacy filters (e.g. StarOffice 5.2) for O
i openoffice.org-headless - Headless VCL plugin for OpenOffice.org
i openoffice.org-java-common - OpenOffice.org office suite Java support a
i A openoffice.org-style-andromeda - Default symbol style for OpenOffice.org
i A openoffice.org-style-crystal - Crystal symbol style for OpenOffice.org
i A openoffice.org-style-tango - Tango symbol style for OpenOffice.org
i openoffice.org-writer - OpenOffice.org office suite - word process
Megan Watson added a comment - 28/Jul/08 01:59 PM
Actually that does kind of make sense. The error comes from OpenOffice being unable to extract the contents of the document, which happens before the point in the code where the memory limit is reached. I wasn't able to run my installation with a memory_limit of 16MB. Reschedule the PDF document once the values have been changed and see if that works.
Rakesh Mistry added a comment - 30/Jul/08 07:33 AM
Hi Megan
I just noticed that there are two cronIndexer.php scripts:
/var/www/knowledgeTree/search2/bin/cronIndexer.php
/var/www/knowledgeTree/search2/indexing/bin/cronIndexer.php
It appears that the first script is supposed to call the second script.
When I execute the first script from the CLI, indexing does NOT occur. i.e. I just see the following in the log:
2008-07-30 08:25:37 () DEBUG: call_page: calling http://kt.xxxxxxxxxxx.org.za/search2/indexing/bin/cronIndexer.php
2008-07-30 08:25:54 (192.168.1.217) DEBUG: kt_url: base url - http://kt.xxxxxxxxxx.org.za
However if I execute the second script from the CLI then I get the OpenOffice error - "unsupported URL" above.
It is only when I execute the second script that the dashlet in KT says that indexing has occurred, and the two Oo documents go into error status.
If I'm not mistaken it is the first script that is called from scheduler.php.
Rakesh Mistry added a comment - 30/Jul/08 07:40 AM
Actually, maybe I'm talking rubbish because here the correct URL is displayed (even though indexing doesn't actually occur):
2008-07-30 07:01:04 () DEBUG: Scheduler: stopping
2008-07-30 08:01:02 () DEBUG: Scheduler: starting
2008-07-30 08:01:02 () DEBUG: kt_url: base url - http://kt.xxxxxxxxxxx.org.za
2008-07-30 08:01:02 () DEBUG: call_page: calling http://kt.xxxxxxxxxxx.org.za/search2/indexing/bin/cronIndexer.php
2008-07-30 08:01:02 () DEBUG: Scheduler - Task: Indexing completed in 0.43s.
2008-07-30 08:01:03 () DEBUG: kt_url: base url - http://kt.xxxxxxxxxxx.org.za
2008-07-30 08:01:03 () DEBUG: call_page: calling http://kt.xxxxxxxxxxx.org.za/search2/indexing/bin/cronMigration.php
2008-07-30 08:01:03 () DEBUG: Scheduler - Task: Index Migration completed in 0.42s.
2008-07-30 08:01:03 () DEBUG: Scheduler - Task: Open Office test completed in 0.42s.
2008-07-30 08:01:03 () DEBUG: Scheduler - Task: Cleanup Temporary Directory completed in 0.43s.
2008-07-30 08:01:04 () DEBUG: Scheduler: stopping
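Every task in this log finishes in well under half a second, which hints that the scheduler's HTTP call returns almost immediately rather than waiting for real indexing work. A small sketch like the following can pull the task durations out of a scheduler log for a quick look; the function name and regular expressions are illustrative, and the log format is assumed to match the excerpts in this thread (including lines wrapped in the middle of a word).

```python
import re

TASK_RE = re.compile(r"Scheduler - Task: (?P<name>.+?) completed in (?P<secs>[\d.]+)s")


def task_durations(log_text: str):
    """Return (task name, seconds) pairs found in a scheduler log."""
    # Undo hard wrapping: drop any line break not followed by a timestamp.
    joined = re.sub(r"\n(?!\d{4}-\d{2}-\d{2} )", "", log_text)
    return [(m["name"], float(m["secs"])) for m in TASK_RE.finditer(joined)]
```

Durations that never vary between runs, regardless of how many documents are queued, would support the suspicion that the indexer script is not actually being executed.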
Megan Watson added a comment - 30/Jul/08 07:54 AM
The scheduler calls the first cronIndexer.php script which uses a curl function to call the second. I can't recall the reason for this right now. Do you have curl installed and working for your www-data user?
Has the memory_limit been changed on your installation yet?
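Since the first script reaches the second over HTTP, one way to see what that request actually gets back is to fetch the same URL by hand from the KT host. The sketch below uses Python's standard library rather than PHP's curl bindings, so it is an independent check and not a reproduction of KT's own call; `probe` is a made-up helper name.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 10.0):
    """Fetch url; return (status, body). status is None if no HTTP response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (404, 500, ...).
        return err.code, err.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection or timeout: the script was never reached.
        return None, str(err)
```

Calling `probe()` with the cronIndexer.php URL from the log, as the www-data user, should show whether the request reaches the script at all (a status of `None` means it does not) and whether PHP returned an error page instead of running the indexer.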
Rakesh Mistry added a comment - 30/Jul/08 08:46 AM
Nope.. sysadmin has been a bit busy, but he should do it soon I hope.
I have the following installed for curl:
i php5-curl - CURL module for php5
As for the Openoffice indexing issue, I've asked the sysadmin to add the following to the startup command:
-nofirststartwizard
As ours is a server with no X or GUI, I have a feeling the startup wizard is trying to run. I got this hint from a post on the Oo forums.
Rakesh Mistry added a comment - 05/Aug/08 11:47 AM
Attached is the most recent log.
We changed the PHP memory settings, added the '-nofirststartwizard' option, and installed the additional OpenOffice packages.
Unfortunately none of this is making a difference!
As you can see in the first part of the log (attachment: 20080805.log), the scheduler looks like it's running. However, the last bit of the log is where I run:
sudo php -Cq /var/www/knowledgeTree/search2/indexing/bin/cronIndexer.php
from the command line.
I'm losing hope here! :(
Megan Watson added a comment - 05/Aug/08 02:03 PM
I've attached a new file, OpenOfficeTextExtractor, which will extract the text from the odt file without OpenOffice. The only thing it needs is unzip, which should be standard on Debian.
- Copy the file into "knowledgeTree/search2/indexing/extractors/"
- Reschedule all documents
- Run the cronIndexer
Hopefully that works.
We're releasing 3.5.3 soon, which will have support for catdoc and catppt, thereby removing our reliance on OpenOffice.
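For context on why unzip is the only dependency: an .odt file is a ZIP archive whose body text is stored in a member named content.xml. The shell sketch below illustrates that general idea only. It is not the attached extractor, and the function names strip_tags and odt_to_text are made up for this example.

```shell
#!/bin/sh
# Hypothetical helpers, not part of KnowledgeTree.

# Read XML on stdin and print roughly plain text:
# every tag becomes a space, then whitespace runs are collapsed.
strip_tags() {
    sed -e 's/<[^>]*>/ /g' | tr '\n\t' '  ' | tr -s ' '
}

# Print the text of an OpenDocument file given as $1.
# "unzip -p" writes the named archive member to stdout.
odt_to_text() {
    unzip -p "$1" content.xml | strip_tags
}
```

Calling `odt_to_text /path/to/file.odt > /path/to/output.txt` would then produce text suitable for indexing. XML entities such as `&amp;` are left undecoded in this sketch.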
Rakesh Mistry added a comment - 05/Aug/08 02:13 PM
Should that file be called OOTextExtractor.inc.php instead of OpenOfficeTextExtractor.inc.php?
Rakesh Mistry added a comment - 05/Aug/08 02:29 PM
I get this when trying to reschedule all documents:
Fatal error: Call to a member function debug() on a non-object in /var/www/knowledgeTree/search2/indexing/indexerCore.inc.php on line 523
Rakesh Mistry added a comment - 05/Aug/08 02:31 PM
Sorry, ignore the last comment. I had added a debug statement while I was trying to figure things out.
Rakesh Mistry added a comment - 05/Aug/08 02:38 PM
See 20080805-2.log
I still get the same "unsupported url" problem.
How does it know to call the new extractor?
Megan Watson added a comment - 05/Aug/08 03:10 PM
Sorry, I missed that part out.
Run the following SQL on your database:
update mime_types set extractor_id = null;
delete from mime_extractors;
delete from system_settings where name='mimeTypesRegistered';
This deletes the current extractor associations. The indexer will regenerate them from the contents of the extractors directory the next time it runs.
FYI: The issue that removes the dependence on OpenOffice is: KTS-3456
Rakesh Mistry added a comment - 06/Aug/08 07:19 AM
Morning Megan.
Unfortunately it's still giving the URL issue. Please see the log-2008-08-06.www-data.txt attachment.
Initially it said that the extractors were disabled. I then re-ran the script and it gave the same URL issue.
Did I miss something?
Thanks
Rakesh Mistry added a comment - 06/Aug/08 12:32 PM
Should the original OOTextExtractor.inc.php be removed from the directory so KT doesn't get confused?
Megan Watson added a comment - 06/Aug/08 01:33 PM
Yes, delete all three of the OO extractors. Then run the SQL again to clear out the mime_extractor mappings. Sorry about dragging this out; it isn't my area.
Rakesh Mistry added a comment - 06/Aug/08 02:49 PM
No worries. As long as we eventually sort it out, I'll be happy.
I tried this but get:
Warning: require_once(OOTextExtractor.inc.php) [function.require-once]: failed to open stream: No such file or directory in /var/www/knowledgeTree/search2/indexing/extractors/RTFExtractor.inc.php on line 39
Fatal error: require_once() [function.require]: Failed opening required 'OOTextExtractor.inc.php' (include_path='/var/www/knowledgeTree/search2:/var/www/knowledgeTree/ktapi:/var/www/knowledgeTree/thirdparty/xmlrpc-2.2/lib:/var/www/knowledgeTree/thirdparty/simpletest:/var/www/knowledgeTree/thirdparty/Smarty:/var/www/knowledgeTree/thirdparty/pear:/var/www/knowledgeTree/thirdparty/ZendFramework/library:.:/usr/share/php:/usr/share/pear') in /var/www/knowledgeTree/search2/indexing/extractors/RTFExtractor.inc.php on line 39
Do I need to edit RTFExtractor and change the require to OpenOfficeTextExtractor?
Megan Watson added a comment - 07/Aug/08 10:54 AM
There appear to be a number of dependencies involved. It may be better to upgrade to 3.5.3, which includes this, than to try to patch it. We're releasing the Community Edition next week.
Another option might be to manually change the mime_extractor mappings. Did you delete those extractor files or just move them out?
Rakesh Mistry added a comment - 07/Aug/08 11:25 AM
I just moved them out.
How would I manually change the mime_extractor mappings?
I might be able to give 3.5.3 a go, but I have to get our sysadmin to do it, and I don't know if he'll have time to do another install for me.
Megan Watson added a comment - 08/Aug/08 03:20 PM
I've attached my extractor files. I'm not sure if they'll work, as there have been a number of changes in 3.5.3. The files go into the knowledgeTree/search2/ directory.
My mime_extractors table is below:
INSERT INTO `mime_extractors` VALUES
(23, 'ExcelExtractor', 1),
(24, 'ExifExtractor', 1),
(25, 'OpenOfficeTextExtractor', 1),
(26, 'OpenXmlTextExtractor', 1),
(27, 'PDFExtractor', 1),
(28, 'PlainTextExtractor', 1),
(29, 'PowerpointExtractor', 1),
(30, 'ScriptExtractor', 1),
(31, 'StarOfficeExtractor', 1),
(32, 'WordExtractor', 1),
(33, 'XMLExtractor', 1);
Rakesh Mistry added a comment - 11/Aug/08 07:37 AM
Good news Megan!
That worked, in that all my documents are now indexed!!!!
However, when running from the command line I get the following:
me@xxxxxxxxx:/var/www/knowledgeTree/var/log$ sudo php -Cq /var/www/knowledgeTree/search2/indexing/bin/cronIndexer.php
Warning: filesize(): stat failed for /var/www/knowledgeTree/var/tmp/ktindexernyPIDx in /var/www/knowledgeTree/search2/indexing/extractors/PDFExtractor.inc.php on line 94
Warning: file_get_contents(/var/www/knowledgeTree/var/tmp/ktindexernyPIDx): failed to open stream: No such file or directory in /var/www/knowledgeTree/search2/indexing/indexerCore.inc.php on line 584
The actual scheduler still thinks there is nothing to do, even though I add a new document:
2008-08-11 08:36:03 () DEBUG: Scheduler: starting
2008-08-11 08:36:03 () DEBUG: Scheduler: stopping - nothing to do
Would it be a problem for me to add the cronIndexer.php script to the cron job? That way I know that indexing is occurring.
Rax
Rakesh Mistry added a comment - 11/Aug/08 08:09 AM
Oops.. I also get the following when trying to index an xls:
Fatal error: Call to undefined method JavaXMLRPCLuceneIndexer::restartBatch() in /var/www/knowledgeTree/search2/indexing/extractors/StarOfficeExtractor.inc.php on line 168
LOG:
2008-08-11 09:07:29 () DEBUG: Indexing docid: 10 extension: 'xls' mimetype: 'application/vnd.ms-excel' extractor: 'ExcelExtractor'
2008-08-11 09:07:29 () INFO: Processing docid: 10.
2008-08-11 09:07:29 () DEBUG: Extra Info docid: 10 Source File: '/var/www/knowledgeTree/var/tmp/10.xls' Target File: '/var/www/knowledgeTree/var/tmp/ktindexerOugVAX'
2008-08-11 09:07:29 () DEBUG: ExcelExtractor: "/usr/bin/xls2csv" -d UTF-8 -q 0 -c " " "/var/www/knowledgeTree/var/tmp/10.xls" > "/var/www/knowledgeTree/var/tmp/ktindexerOugVAX"
2008-08-11 09:07:30 () DEBUG: StarOfficeExtractor: "/usr/bin/python" "/var/www/knowledgeTree/bin/openoffice/DocumentConverter.py" "/var/www/knowledgeTree/var/tmp/10.xls" "/var/www/knowledgeTree/var/tmp/ktindexerOugVAX.html" 127.0.0.1 8100
2008-08-11 09:07:31 () INFO: DocumentId: 10 - Suspect the file cannot be indexed by Open Office.
Megan Watson added a comment - 11/Aug/08 08:51 AM
I'm so glad it's working! It shouldn't be a problem if you add the cronIndexer to the cron job.
Add the following lines to your config.ini:
[indexer]
useOpenOffice = false
That will stop the indexer from trying to use OpenOffice to extract text. Check your web server's write permissions on the tmp directory to ensure the temp files can be created.
I'm not sure about the excel extractor. I'll have to check with the developer.
Rakesh Mistry added a comment - 11/Aug/08 09:13 AM
Ok, adding the config option removed the OO error, but I noticed the following:
2008-08-11 10:06:49 () DEBUG: Indexing docid: 10 extension: 'xls' mimetype: 'application/vnd.ms-excel' extractor: 'ExcelExtractor'
2008-08-11 10:06:49 () INFO: Processing docid: 10.
2008-08-11 10:06:49 () DEBUG: Extra Info docid: 10 Source File: '/home/Documents/00/13' Target File: '/var/www/knowledgeTree/var/tmp/ktindexeryZIcM1'
2008-08-11 10:06:49 () DEBUG: ExcelExtractor: "/usr/bin/xls2csv" -d UTF-8 -q 0 -c " " "/home/Documents/00/13" > "/var/www/knowledgeTree/var/tmp/ktindexeryZIcM1"
2008-08-11 10:06:49 () INFO: The document 10 cannot be indexed as /usr/bin/xls2csv is not available and OpenOffice is not in use.
2008-08-11 10:06:49 () DEBUG: Indexer: removing document 10 from the queue - Done indexing docid: 10
xls2csv is at /usr/bin/xls2csv, and the binary is listed in config.ini. So why does it think it's not available?
Also, KT reports the document as successfully indexed, even though it didn't really index it.
Rakesh Mistry added a comment - 11/Aug/08 09:26 AM
If I run the command from the CLI I get a "permission denied" error:
drwxr-xr-x 9 www-data www-data 4096 2008-05-23 13:24 .
drwxr-xr-x 25 www-data www-data 4096 2008-05-23 13:24 ..
drwxr-xr-x 3 www-data www-data 4096 2008-08-05 10:14 cache
drwxr-xr-x 2 www-data www-data 4096 2008-05-23 13:24 Documents
drwxr-xr-x 2 www-data www-data 4096 2008-08-11 10:06 indexes
drwxrwxr-x 2 www-data www-data 8192 2008-08-11 00:00 log
drwxr-xr-x 4 www-data www-data 4096 2008-07-07 11:38 proxies
drwxr-xr-x 4 www-data www-data 12288 2008-08-11 10:06 tmp
drwxr-xr-x 2 www-data www-data 4096 2008-05-23 13:24 uploads
me@xxxxxxxx:/var/www/knowledgeTree/var$ sudo "/usr/bin/xls2csv" -d UTF-8 -q 0 -c " " "/home/Documents/00/13" > "/var/www/knowledgeTree/var/tmp/ktindexeryZIcM1"
-bash: /var/www/knowledgeTree/var/tmp/ktindexeryZIcM1: Permission denied
This happens even though, as you can see, www-data has full permissions on tmp. The web server is also running as www-data.
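A note on the command above: with `sudo command > file`, the redirection is opened by the calling shell (here the "me" account) before sudo starts, which is why the error comes from bash rather than from xls2csv. One way to test as the web server user is to let a shell running as that user perform the redirection. The helper below is a hypothetical sketch, not part of KnowledgeTree, and the output file name in the usage comment is made up.

```shell
#!/bin/sh
# Hypothetical helper: run COMMAND with stdout written to OUTFILE, where the
# file is opened by a shell running as USER rather than by the caller.
# usage: run_as_with_output USER OUTFILE COMMAND [ARGS...]
run_as_with_output() {
    user=$1
    outfile=$2
    shift 2
    if [ "$user" = "$(id -un)" ]; then
        # Already the target user: no sudo needed.
        sh -c 'out=$1; shift; exec "$@" > "$out"' sh "$outfile" "$@"
    else
        # The inner shell runs as $user, so it is $user who opens $outfile.
        sudo -u "$user" sh -c 'out=$1; shift; exec "$@" > "$out"' sh "$outfile" "$@"
    fi
}

# Example (assumed output file name):
# run_as_with_output www-data /var/www/knowledgeTree/var/tmp/xls-test.txt \
#     /usr/bin/xls2csv -d UTF-8 -q 0 -c " " /home/Documents/00/13
```

If this succeeds where the original command failed, the permissions on tmp are fine for www-data and the earlier error was only an artefact of how the test was run.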
Rakesh Mistry added a comment - 11/Aug/08 10:02 AM
One more question:
At the moment, our install feels slightly "hacked together" with all the manual changes we've made.
Do you think this will be an issue if we decide to upgrade at some point?
Hide
=> INFO: The document 10 cannot be indexed as /usr/bin/xls2csv is not available and OpenOffice is not in use.
This should be reworded, it just means that the xls2csv binary wasn't able to index the document.
=> DEBUG: Indexer: removing document 10 from the queue - Done indexing docid: 10
It removes it from the queue to prevent it blocking further documents from being indexed, even though the document wasn't indexed successfully.
There are similar issues being worked on at the moment with xls2csv, catdoc, etc not working on a source install on ubuntu. I'll let you know when a solution is found.
=> our install feels slightly "hacked together" with all the manual changes we've made. Do you think this will be an issue if we decided to upgrade at some point?
Most of the changes we've made are all implemented in our 3.5.3 version so they shouldn't cause any major problems with an upgrade.
This should be reworded, it just means that the xls2csv binary wasn't able to index the document.
=> DEBUG: Indexer: removing document 10 from the queue - Done indexing docid: 10
It removes it from the queue to prevent it blocking further documents from being indexed, even though the document wasn't indexed successfully.
There are similar issues being worked on at the moment with xls2csv, catdoc, etc not working on a source install on ubuntu. I'll let you know when a solution is found.
=> our install feels slightly "hacked together" with all the manual changes we've made. Do you think this will be an issue if we decided to upgrade at some point?
Most of the changes we've made are all implemented in our 3.5.3 version so they shouldn't cause any major problems with an upgrade.
Show
Megan Watson added a comment - 11/Aug/08 10:41 AM
=> INFO: The document 10 cannot be indexed as /usr/bin/xls2csv is not available and OpenOffice is not in use.
This message should be reworded; it just means that the xls2csv binary wasn't able to index the document.
=> DEBUG: Indexer: removing document 10 from the queue - Done indexing docid: 10
The indexer removes the document from the queue so that it doesn't block further documents from being indexed, even though it wasn't indexed successfully.
Similar issues are being worked on at the moment, with xls2csv, catdoc, etc. not working on a source install on Ubuntu. I'll let you know when a solution is found.
=> our install feels slightly "hacked together" with all the manual changes we've made. Do you think this will be an issue if we decided to upgrade at some point?
Most of the changes we've made are implemented in our 3.5.3 version, so they shouldn't cause any major problems with an upgrade.
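Since the INFO message does not distinguish "binary not installed" from "binary ran but failed", a quick manual check can separate the two cases. The function below is a hypothetical diagnostic sketch, not KnowledgeTree code. It only tests whether the file is executable by the current user and whether the command exits successfully.

```shell
#!/bin/sh
# Hypothetical diagnostic: report whether an extractor binary is missing,
# ran successfully, or ran and failed.
# usage: check_extractor /path/to/binary [ARGS...]
check_extractor() {
    bin=$1
    shift
    if [ ! -x "$bin" ]; then
        # Not present, or not executable by the current user.
        echo "missing"
        return 2
    fi
    if "$bin" "$@" > /dev/null 2>&1; then
        echo "ok"
    else
        echo "failed"
        return 1
    fi
}

# Example, run as the web server user:
# check_extractor /usr/bin/xls2csv -d UTF-8 -q 0 -c " " /home/Documents/00/13
```

The command's output is discarded, so a result of "ok" shows only that the binary ran without error, not that it extracted any useful text.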
Rakesh Mistry added a comment - 11/Aug/08 11:25 AM
=> It removes it from the queue to prevent it blocking further documents from being indexed, even though the document wasn't indexed successfully.
Fair enough, but I think it's currently moving the document to the "success" queue, rather than the "failed" queue.
Before we start using this in anger, I'll have to upload our entire repo and see how it works. If all goes well then we'll stick with it the way it is.
Thanks!
Rakesh Mistry added a comment - 12/Aug/08 09:51 AM
Thought you would like to know that we are no longer receiving the error for the xls2csv command.
However, it doesn't look like it's actually indexing the file, i.e. I can't find anything from the contents of the xls.
Rakesh Mistry added a comment - 12/Aug/08 01:20 PM
Megan,
I noticed that Outlook files (.msg, .oft) are being recognised as Word documents and catdoc is used to try and index these files.
This, however, is not working, resulting in the following:
2008-08-12 14:15:07 () INFO: Processing docid: 653.
2008-08-12 14:15:07 () DEBUG: Extra Info docid: 653 Source File: '/home/Documents/06/656' Target File: '/var/www/knowledgeTree/var/tmp/ktindexerNmo5JP'
2008-08-12 14:15:07 () DEBUG: WordExtractor: "/usr/bin/catdoc" -w -d UTF-8 "/home/Documents/06/656" > "/var/www/knowledgeTree/var/tmp/ktindexerNmo5JP"
2008-08-12 14:15:07 () INFO: The document 653 cannot be indexed as /usr/bin/catdoc is not available and OpenOffice is not in use.
2008-08-12 14:15:07 () DEBUG: Indexer: removing document 653 from the queue - Done indexing docid: 653
Note that Word documents ARE being indexed correctly.
cheers
Rax
Megan Watson added a comment - 20/Aug/08 08:35 AM
I've attached an updated set of extractors. The commands calling xls2csv, etc. were incorrect, which caused the documents not to be indexed. Hopefully the new code will help correct your problems.
I'm glad to hear that your Word documents at least are being indexed. I'm not sure about the Outlook documents; I don't know which extractor would be used for them.
=> it's currently moving the document to the "success" queue, rather than the "failed" queue.
There is no success or failed queue. Once a document has been indexed, it is moved out of the indexing queue. If indexing fails, it is also moved out of the queue, as I explained before. The move to a failed queue is something that needs to be addressed in future.
Rakesh Mistry added a comment - 25/Aug/08 11:27 AM
Hi Megan.
Sorry for the slow reply (I was away last week), and thanks for the updated code.
I will implement the new code today and see what happens.
Cheers
Rax
Rakesh Mistry added a comment - 25/Aug/08 11:30 AM
Do I follow the same instructions as before for implementing these changes?
Thanks
Megan Watson added a comment - 25/Aug/08 12:54 PM
Yes.
Copy the files into the <knowledgeTree>/search2/indexing directory.
Then run the SQL:
update mime_types set extractor_id = null;
delete from mime_extractors;
delete from system_settings where name='mimeTypesRegistered';
Reschedule some of your documents and hold thumbs.
Rakesh Mistry added a comment - 25/Aug/08 01:23 PM
I made the changes.
When I login and try to get to the dashboard I get the following error in the browser:
Fatal error: Class JavaXMLRPCLuceneIndexer contains 1 abstract method and must therefore be declared abstract or implement the remaining methods (Indexer::isDocumentIndexed) in /var/www/knowledgeTree/search2/indexing/indexers/JavaXMLRPCLuceneIndexer.inc.php on line 281
This is in the log file:
2008-08-25 14:07:30 (192.168.1.217) INFO: control.php: about to redirect to /login.php?errorMessage=You need to login to access this page&redirect=http%3A%2F%2Fkt.xxxxxxxxxxx.org.za%2Fdashboard.php
2008-08-25 14:07:37 (192.168.1.217) DEBUG: diagnose: extractor 'OOPresentationExtractor' is disabled.
2008-08-25 14:07:37 (192.168.1.217) DEBUG: diagnose: extractor 'RTFExtractor' is disabled.
2008-08-25 14:07:37 (192.168.1.217) DEBUG: diagnose: extractor 'OOTextExtractor' is disabled.
2008-08-25 14:07:37 (192.168.1.217) DEBUG: diagnose: extractor 'PSExtractor' is disabled.
2008-08-25 14:07:37 (192.168.1.217) DEBUG: diagnose: extractor 'OOSpreadsheetExtractor' is disabled.
2008-08-25 14:07:37 (192.168.1.217) DEBUG: diagnose: class 'StarOfficeExtractor' does not support any types.
2008-08-25 14:07:37 (192.168.1.217) DEBUG: kt_url: base url - http://kt.xxxxxxxxxxx.org.za
2008-08-25 14:10:02 () DEBUG: indexDocuments: start
2008-08-25 14:10:02 () DEBUG: diagnose: extractor 'OOPresentationExtractor' is disabled.
2008-08-25 14:10:02 () DEBUG: diagnose: extractor 'RTFExtractor' is disabled.
2008-08-25 14:10:02 () DEBUG: diagnose: extractor 'OOTextExtractor' is disabled.
2008-08-25 14:10:02 () DEBUG: diagnose: extractor 'PSExtractor' is disabled.
2008-08-25 14:10:02 () DEBUG: diagnose: extractor 'OOSpreadsheetExtractor' is disabled.
2008-08-25 14:10:02 () DEBUG: diagnose: class 'StarOfficeExtractor' does not support any types.
2008-08-25 14:10:02 () DEBUG: Indexer::clearoutDeleted: removed documents from indexing queue that have been deleted
2008-08-25 14:10:02 () DEBUG: indexDocuments: stopping - no work to be done
2008-08-25 14:15:01 () DEBUG: indexDocuments: start
2008-08-25 14:15:01 () DEBUG: diagnose: extractor 'OOPresentationExtractor' is disabled.
2008-08-25 14:15:01 () DEBUG: diagnose: extractor 'RTFExtractor' is disabled.
2008-08-25 14:15:01 () DEBUG: diagnose: extractor 'OOTextExtractor' is disabled.
2008-08-25 14:15:01 () DEBUG: diagnose: extractor 'PSExtractor' is disabled.
2008-08-25 14:15:01 () DEBUG: diagnose: extractor 'OOSpreadsheetExtractor' is disabled.
2008-08-25 14:15:01 () DEBUG: diagnose: class 'StarOfficeExtractor' does not support any types.
2008-08-25 14:15:01 () DEBUG: Indexer::clearoutDeleted: removed documents from indexing queue that have been deleted
What to do?
I can browse to documents if I put in the right URL though.
Thanks
Megan Watson added a comment - 25/Aug/08 02:00 PM
The dashboard issue looks like a discrepancy between versions of the JavaXMLRPCLuceneIndexer class. I've attached the full search2 directory, which should contain all the correct class versions. Back up your <knowledgeTree>/search2 directory and drop this one in.
For the second issue, it looks like the extractors haven't been regenerated in the database. Run the SQL in my previous post again. The most important statement is the last query:
delete from system_settings where name='mimeTypesRegistered';
Check that the setting mimeTypesRegistered has been deleted from system_settings. If the setting exists then the system doesn't regenerate the extractors.
Clear your cache: <knowledgeTree>/var/cache
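The reset-and-check step above can be sketched as follows. This is only an illustration, not KnowledgeTree code: it uses an in-memory SQLite database as a stand-in for the real database, and apart from the table names, `extractor_id` and `name` (which appear in the SQL above), every column, sample row and helper name (`setting_exists`) is an assumption made for the example.

```python
import sqlite3

# Stand-in database: a hypothetical, minimal schema (assumption, not the real one).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table mime_extractors (id integer primary key, name text);
    create table mime_types (id integer primary key, extractor_id integer);
    create table system_settings (name text, value text);
    insert into mime_extractors values (1, 'WordExtractor');
    insert into mime_types values (1, 1);
    insert into system_settings values ('mimeTypesRegistered', '1');
""")

# The three statements quoted in the thread, run in the same order.
RESET_SQL = [
    "update mime_types set extractor_id = null",
    "delete from mime_extractors",
    "delete from system_settings where name='mimeTypesRegistered'",
]
for statement in RESET_SQL:
    conn.execute(statement)
conn.commit()


def setting_exists(connection, name):
    """Return True if a row with this name is still present in system_settings."""
    row = connection.execute(
        "select count(*) from system_settings where name = ?", (name,)
    ).fetchone()
    return row[0] > 0


# If this is still True, the extractors would not be regenerated.
still_registered = setting_exists(conn, "mimeTypesRegistered")
print("mimeTypesRegistered still present:", still_registered)
```

On a real installation the equivalent check would simply be a select on system_settings for name='mimeTypesRegistered', run against the KnowledgeTree database, returning no rows.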
Rakesh Mistry added a comment - 26/Aug/08 07:01 AM
That seems to have worked, Megan.
Do I need to be concerned about the disabled extractors?
2008-08-25 16:35:01 () DEBUG: Indexer::clearoutDeleted: removed documents from indexing queue that have been deleted
2008-08-25 16:35:01 () DEBUG: Indexing docid: 969 extension: 'doc' mimetype: 'application/msword' extractor: 'WordExtractor'
2008-08-25 16:35:01 () INFO: Processing docid: 969.
2008-08-25 16:35:01 () DEBUG: Extra Info docid: 969 Source File: '/home/Documents/09/972' Target File: '/var/www/knowledgeTree/var/tmp/ktindexer1mPePn'
2008-08-25 16:35:01 () DEBUG: WordExtractor: "/usr/bin/catdoc" -w -d UTF-8 "/home/Documents/09/972" > "/var/www/knowledgeTree/var/tmp/ktindexer1mPePn"
2008-08-25 16:35:01 () DEBUG: Indexer: removing document 969 from the queue - Done indexing docid: 969
2008-08-25 16:35:01 () DEBUG: indexDocuments: done
2008-08-25 16:35:01 () DEBUG: diagnose: extractor 'OOPresentationExtractor' is disabled.
2008-08-25 16:35:01 () DEBUG: diagnose: extractor 'RTFExtractor' is disabled.
2008-08-25 16:35:01 () DEBUG: diagnose: extractor 'OOTextExtractor' is disabled.
2008-08-25 16:35:01 () DEBUG: diagnose: extractor 'PSExtractor' is disabled.
2008-08-25 16:35:01 () DEBUG: diagnose: extractor 'OOSpreadsheetExtractor' is disabled.
2008-08-25 16:35:01 () DEBUG: diagnose: extractor 'StarOfficeExtractor' is disabled.
2008-08-25 16:40:02 () DEBUG: indexDocuments: start
2008-08-25 16:40:02 () DEBUG: diagnose: extractor 'OOPresentationExtractor' is disabled.
2008-08-25 16:40:02 () DEBUG: diagnose: extractor 'RTFExtractor' is disabled.
2008-08-25 16:40:02 () DEBUG: diagnose: extractor 'OOTextExtractor' is disabled.
2008-08-25 16:40:02 () DEBUG: diagnose: extractor 'PSExtractor' is disabled.
2008-08-25 16:40:02 () DEBUG: diagnose: extractor 'OOSpreadsheetExtractor' is disabled.
Megan Watson added a comment - 26/Aug/08 07:18 AM
Good to hear!
I don't think you need to be concerned about the disabled extractors. Some of the original extractors are disabled in the code because the new extractors override them.
Rakesh Mistry added a comment - 26/Aug/08 07:41 AM
Thanks for all the help Megan. Took a while, but I'm glad it's sorted.
I will basically be putting the DMS live today.. hold thumbs! ;)
Megan Watson added a comment - 26/Aug/08 07:48 AM
Good luck! I hope all goes well :)
I'm going to close the issue now.
Megan Watson added a comment - 26/Aug/08 07:51 AM
This issue was fixed by adding the rootUrl to the serverName.txt file and by installing the updated (version 3.5.3) search2 extractors.
Rakesh Mistry added a comment - 26/Aug/08 10:07 AM
And I also had to add the following to the cron job:
sudo php -Cq /var/www/knowledgeTree/search2/indexing/bin/cronIndexer.php
2008-07-14 06:01:01 () DEBUG: call_page: calling http://XXXXXXXXXXXX/var/www/knowledgeTree/search2/indexing/bin/cronIndex
is made up of two parts: the path to the indexer (search2/indexing/bin/cronIndex) and the server name (http://XXXXXXXXXXXX/var/www/knowledgeTree).
The server name in turn consists of the domain (http://XXXXXXXXXXXX) and the rootUrl (/var/www/knowledgeTree).
I assume that your domain is pointing to the knowledgeTree directory (/var/www/knowledgeTree). Check that the rootUrl in your config.ini is set to default.
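To make the URL breakdown above concrete, here is a minimal sketch of how a scheduler URL could be assembled from the domain, the rootUrl and the script path. It is not the actual KnowledgeTree scheduler code; `build_task_url` is a hypothetical helper, and the masked domain and the cronIndexer.php script name are taken from the logs and the cron command in this thread.

```python
def build_task_url(domain: str, root_url: str, script_path: str) -> str:
    """Join domain, rootUrl and script path using single slashes.

    Hypothetical helper for illustration only.
    """
    parts = [domain.rstrip("/")]
    if root_url.strip("/"):
        # A non-empty rootUrl becomes part of the URL path.
        parts.append(root_url.strip("/"))
    parts.append(script_path.lstrip("/"))
    return "/".join(parts)


DOMAIN = "http://XXXXXXXXXXXX"  # masked domain, as in the log
SCRIPT = "search2/indexing/bin/cronIndexer.php"

# rootUrl wrongly set to the filesystem path: reproduces the URL seen in the log.
broken = build_task_url(DOMAIN, "/var/www/knowledgeTree", SCRIPT)

# Empty rootUrl (domain already points at the knowledgeTree directory).
fixed = build_task_url(DOMAIN, "", SCRIPT)

print(broken)
print(fixed)
```

With the filesystem path as rootUrl, the result matches the URL the scheduler was calling; with an empty rootUrl, it matches the modified URL that worked when entered manually.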