I can confirm that rootUrl = default for the installation.
This is the apache config for the KT application:
#<Directory /knowledgeTree>
#    Options FollowSymLinks
#    AllowOverride All
#</Directory>
<VirtualHost *>
    ServerAdmin webmaster@xxxxxxxxxxxx
    ServerName xxxxxxxxxxxxxx
    DocumentRoot /var/www/knowledgeTree
    <Directory /var/www/knowledgeTree>
        Options FollowSymLinks
        AllowOverride All
        Order allow,deny
        allow from all
    </Directory>
    ErrorLog /var/log/apache2/error.log
    # Possible values include: debug, info, notice, warn, error, crit,
    # alert, emerg.
    LogLevel warn
    CustomLog /var/log/apache2/access.log combined
    ServerSignature On
</VirtualHost>
The indexing task gets the server name from a file that is created on login, could you check if this file exists? The file, serverName.txt should be in your knowledgeTree/var/cache directory. If it does exist, what is the server name that it contains?
The file does exist; it contains the same value as represented by xxxxxx above:
http://kt.xxxxxxx.org.za
Obviously masked for security reasons.
The only other place that I can think of where this might be going wrong is in the initialisation of the root url in the dmsDefaults. I'll attach a copy of the dmsDefaults with a few debugging statements. I notice your logLevel is already set to debug. Could you do the following:
- Open the knowledgeTree/config directory
- Rename your existing dmsDefaults.php file
- Copy the new one in
- Open the knowledgeTree/cache directory and delete its contents
- Log in to knowledgeTree and navigate to the Browse Documents tab
- Send me the tail end of today's log file
Thanks.
I don't see the attachment... where is the link located?
I will probably get to this tomorrow, as it's the end of the day for me. Thanks!
Sorry, I forgot to add the attachment. End of the day for me too.
Morning Megan. Thanks, I've uploaded the tail end of the log file.
Yes, the log was fine, thank you. Sorry for the delay, I needed some advice on this issue. I haven't seen it anywhere else and can't reproduce it myself, so the best way of tracking it down and resolving it is on your machine. The problem is that the directory path (/var/www/knowledgeTree) is added to the url for the indexer. The parts of the url come from two places: the serverName.txt file, which you confirmed did not have the path name in it, and the root url, which is set to default in the config file. The dmsDefaults file I attached on Monday contains a function which resolves the rootUrl when it is set to default. However, from the log it appears to be resolving it correctly. The next place to look is at the point where it puts these together and creates the url.
Could you repeat the process and replace your ktutil.inc file with the one I'm about to attach.
- The file is in: knowledgeTree/lib/util/
Thanks
Attached is the log after replacing ktutil.inc.
The new log file looks a lot more promising! I need the debug logs from when the scheduler runs the indexing task. Could you let it run for a few minutes and upload the new log?
Thanks
Thanks Megan! I've attached the new log with debugs of the scheduler running as well.
Morning. I've finally figured out what has been happening with the url for the indexer. The rootUrl is resolved using the path to the current script. When the script is running in the browser, the path is relative to the server root; however, when it is running in the background via a cron job, the path is the full (absolute) path. I'm not sure if this is specific to Debian or because you're running a source install, but it hasn't appeared previously. We have had a similar problem with the server name, which is why the serverName.txt file came into existence.
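To make the behaviour described above concrete, here is a minimal Python sketch. It is not KnowledgeTree's actual PHP code: the function names, the way the root URL is derived from the script path, the relative script location bin/scheduler.php as seen from the browser, and the host kt.example.org are all assumptions made for illustration. Only the install path /var/www/knowledgeTree and the indexer path come from this thread.

```python
# Path of the indexer page, as it appears in the logs in this thread.
INDEXER_SCRIPT = "search2/indexing/bin/cronIndexer.php"


def resolve_root_url(script_path: str, script_rel: str) -> str:
    """Hypothetical resolver: strip the script's own relative location
    from the path the script reports for itself; what is left is the root."""
    if script_path.endswith(script_rel):
        script_path = script_path[: -len(script_rel)]
    return script_path.rstrip("/")


def build_url(server_name: str, root_url: str, page: str) -> str:
    """Join server name, resolved root and page into the URL to call."""
    return f"{server_name}{root_url}/{page}"


if __name__ == "__main__":
    server = "http://kt.example.org"  # placeholder host, not the real one

    # In the browser the script path is relative to the server root.
    web_root = resolve_root_url("/bin/scheduler.php", "bin/scheduler.php")
    # Under cron the script path is the absolute filesystem path.
    cron_root = resolve_root_url(
        "/var/www/knowledgeTree/bin/scheduler.php", "bin/scheduler.php"
    )

    print(build_url(server, web_root, INDEXER_SCRIPT))
    print(build_url(server, cron_root, INDEXER_SCRIPT))
```

Under these assumptions the second URL picks up the filesystem path /var/www/knowledgeTree, which is the symptom reported earlier in the thread.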
I've attached a new version of the ktutil.inc file (the old one should be greyed out), that should hopefully fix the problem. What it does now is save the full url to the serverName.txt file and doesn't rely on resolving the rootUrl.
- Replace the ktutil.inc file in "knowledgeTree/lib/util" with the new one.
- Delete the contents of the "knowledgeTree/var/cache" directory.
- Open / refresh the login screen in the browser, this should regenerate the serverName.txt file in the cache directory (you may need to clear your browser cache as well)
- Check that the serverName.txt file still contains: http://kt.xxxxxxx.org.za
- Let the scheduler run for a while
You might need to reschedule the indexing on some of your documents. You can do that through the Admin -> Search and Indexing interface.
I've attached the latest log file with the new ktutils.inc.
Sorry, it's a bit long, I had to relogin an extra time, because the first time it complained that the user had not logged in for the first time.
2008-07-17 13:01:02 () DEBUG: kt_url: base url - http://kt.XXXXXX.org.za
2008-07-17 13:01:02 () DEBUG: call_page: calling http://kt.XXXXXX.org.za/search2/indexing/bin/cronIndexer.php
2008-07-17 13:01:02 () DEBUG: Scheduler - Task: Indexing completed in 0.43s.
Looks like it's resolving the url correctly now. I'll put the fix into our next release. You can replace the dmsDefaults.php file with the original, otherwise your log files will quickly fill up in debug mode. Thanks for your patience and help with the debugging!
A pleasure..
Would I be able to get the fixed file, or will I have to install the new release? I'm a bit limited with re-installation at the moment because our sysadmin is on holiday and I only have access to the knowledgetree directories.
Oops.. can I move the default ktutil.inc back as well?
You already have the fix :) The ktutil.inc file contains the fix. It'll be the same as what I put in for the next release.
Ok.. Thanks!
One thing is worrying me.... If it is working, surely when the scheduler runs I should be getting the error I see in my other bug post: http://issues.knowledgetree.com/browse/KTS-3490 ? And shouldn't the documents in the queue then move to the "problem queue"...
Not necessarily. It depends on how many documents you have and what their mime types are. The indexer batch indexes a set number of documents (50, I think) each time it runs and it should run every minute. If you've been running the indexer from the command line then it's possible that all your documents have been indexed or at least attempted to be indexed. Also, Open Office is only used for a subset of the documents, so your PDFs, etc. will have been indexed without a problem.
The DMS Administration -> Search and Indexing interface should provide you with some answers. Under "Extractor Information" you can see which document mime types are indexed by Open Office. Under "Pending Documents Indexing Queue" you can see if there are any documents left in the queue. And "Document Indexing Diagnostics" should give you the errors in your other issue.
Sorry for the late reply.. I was not in on Friday.
I only have three documents in the queue. One of them is a PDF, the other two are ODTs. The extractor information (mime types) shows the document types listed. Document Indexing Diagnostics shows no issues. However, the dashlet on the document indexing statistics shows the last indexing time as 3 days and somewhat ago. Which is probably the last time I kicked off the scheduler manually! And I'm pretty sure if I kick off the scheduler via a browser url it will index the PDF document.
The scheduler gets added as a service during the installation process, on a source install this won't have happened. You'll need to add it to a cron job for it to run regularly.
http://wiki.knowledgetree.com/Scheduler
The cronjob was already added by our sysadmin:
1 * * * * www-data php -Cq /var/www/knowledgeTree/bin/scheduler.php
Okay, the first thing to check then is whether the scheduler is running and when each task was last run. You can see this under DMS Administration -> Miscellaneous -> Manage Task Scheduler.
As you can see from the screenshot, the scheduler appears to be running.
Right, correct me if I'm wrong here, the status at the moment is:
- the scheduler is running correctly
- all tasks are being updated as run
- there are 3 documents sitting in the Pending Documents Queue
- there are no issues in the Document indexing diagnostics
The cronIndexer should be using the correct url to call the indexing task after the fix. Your issue with open office shouldn't affect the PDF being indexed. What do your logs say?
Hi Megan,
Correct! Only the document indexing dashlet shows that indexing hasn't run correctly in 4 days. Strangely the log file only indicates the scheduler running every hour (not sure if this is normal). And I'm pretty sure if I run http://kt.xxxxxxxxxxx.org.za/search2/indexing/bin/scheduler.php in my browser it will index the PDF at least.
<log>
2008-07-23 10:01:02 () DEBUG: Scheduler: starting
2008-07-23 10:01:02 () DEBUG: kt_url: base url - http://kt.xxxxxxxxxxx.org.za
2008-07-23 10:01:02 () DEBUG: call_page: calling http://kt.xxxxxxxxxxx.org.za/search2/indexing/bin/cronIndexer.php
2008-07-23 10:01:02 () DEBUG: Scheduler - Task: Indexing completed in 0.43s.
2008-07-23 10:01:02 () DEBUG: kt_url: base url - http://kt.xxxxxxxxxxx.org.za
2008-07-23 10:01:02 () DEBUG: call_page: calling http://kt.xxxxxxxxxxx.org.za/search2/indexing/bin/cronMigration.php
2008-07-23 10:01:03 () DEBUG: Scheduler - Task: Index Migration completed in 0.43s.
2008-07-23 10:01:03 () DEBUG: Scheduler - Task: Open Office test completed in 0.42s.
2008-07-23 10:01:03 () DEBUG: Scheduler - Task: Cleanup Temporary Directory completed in 0.42s.
2008-07-23 10:01:03 () DEBUG: Scheduler: stopping
2008-07-23 11:01:01 () DEBUG: Scheduler: starting
2008-07-23 11:01:02 () DEBUG: kt_url: base url - http://kt.xxxxxxxxxxx.org.za
2008-07-23 11:01:02 () DEBUG: call_page: calling http://kt.xxxxxxxxxxx.org.za/search2/indexing/bin/cronIndexer.php
2008-07-23 11:01:02 () DEBUG: Scheduler - Task: Indexing completed in 0.43s.
2008-07-23 11:01:02 () DEBUG: kt_url: base url - http://kt.xxxxxxxxxxx.org.za
2008-07-23 11:01:02 () DEBUG: call_page: calling http://kt.xxxxxxxxxxx.org.za/search2/indexing/bin/cronMigration.php
2008-07-23 11:01:02 () DEBUG: Scheduler - Task: Index Migration completed in 0.43s.
2008-07-23 11:01:03 () DEBUG: Scheduler - Task: Open Office test completed in 0.43s.
2008-07-23 11:01:03 () DEBUG: Scheduler - Task: Cleanup Temporary Directory completed in 0.42s.
2008-07-23 11:01:03 () DEBUG: Scheduler: stopping
</log>
The cronjob has it set to run every hour:
1 * * * * www-data php -Cq /var/www/knowledgeTree/bin/scheduler.php
To run it every minute use:
*/1 * * * * www-data php -Cq /var/www/knowledgeTree/bin/scheduler.php
The snippet from your log file doesn't show the cronIndexer to be running. What often happens is the scheduler creates one log file as the root user and the indexer creates another log file as the apache user (www-data). Do you have multiple log files?
Thanks.. I'll get our sysadmin to update the cron job.
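The difference between the two cron lines above comes down to the first (minute) field. As a rough illustration, the following Python sketch counts how many minutes of an hour a minute field selects. It is a simplified matcher written for this thread, not cron's real parser: it only understands `*`, `*/N` and comma-separated numbers, and the function names are made up.

```python
def minute_matches(field: str, minute: int) -> bool:
    """Return True if a cron minute field selects the given minute (0-59).

    Simplified: supports '*', '*/N' and comma-separated plain numbers only.
    """
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            # Step value: fires on every minute divisible by N.
            if minute % int(part[2:]) == 0:
                return True
        elif int(part) == minute:
            return True
    return False


def runs_per_hour(field: str) -> int:
    """Count how many minutes in one hour the field selects."""
    return sum(minute_matches(field, m) for m in range(60))


if __name__ == "__main__":
    print(runs_per_hour("1"))    # minute field of the original cron line
    print(runs_per_hour("*/1"))  # minute field of the suggested cron line
```

A minute field of `1` fires once an hour, at one minute past, which matches the 10:01 and 11:01 timestamps in the log above; `*/1` (or plain `*`) fires every minute.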
Nope .. we only have log files for www-data user.
I'm not sure what the problem is, it looks like everything is in place but the file isn't being run. It's unlikely to be a permissions / access problem since it's being run as the apache user. It's possible that there's a php error that we're not picking up. In your config.ini, set phpErrorLogFile to true, it should log additional information to the php_error_log file. Leave it to run for a bit then check your logs.
In addition, what happens when you run the indexer directly:
php -Cq /var/www/knowledgeTree/search2/indexing/bin/cronIndexer.php
PHP Error log attached.
If I execute the command from the command line I got the following:
rmistry@xxxxxxxx:/var/www/knowledgeTree/var/log$ sudo php -Cq /var/www/knowledgeTree/search2/indexing/bin/cronIndexer.php
Fatal error: Allowed memory size of 16777216 bytes exhausted (tried to allocate 1124253 bytes) in /var/www/knowledgeTree/search2/indexing/indexerCore.inc.php on line 593
And in the log file I get:
2008-07-23 15:03:08 () DEBUG: indexDocuments: start
2008-07-23 15:03:08 () DEBUG: Indexer::clearoutDeleted: removed documents from indexing queue that have been deleted
2008-07-23 15:03:08 () DEBUG: Indexing docid: 7 extension: 'pdf' mimetype: 'application/pdf' extractor: 'PDFExtractor'
2008-07-23 15:03:08 () INFO: Processing docid: 7.
2008-07-23 15:03:08 () DEBUG: Extra Info docid: 7 Source File: '/home/Documents/00/10' Target File: '/var/www/knowledgeTree/var/tmp/ktindexerTZTNuB'
2008-07-23 15:03:08 () DEBUG: PDFExtractor: '/usr/bin/pdftotext' -nopgbrk -enc UTF-8 "/home/Documents/00/10" "/var/www/knowledgeTree/var/tmp/ktindexerTZTNuB"
Which at least looks like it's attempting to do the indexing! Unlike the normal scheduling.
What about the php_error_log, was it created?
Many thanks for fixing this issue.
Anyway IMHO scripts run by cron (i.e. locally and without user interaction) shouldn't rely on the serverName.txt file (which in turn relies on browser-server interaction) to initialise their variables. This leads to problems in the following scenario: an instance of KT running in a private LAN, accessible both from the internal LAN and from the external internet, though with different URLs (server names and possibly ports). If the URL (server name/port) used for external access is not DNS-resolvable from the host running KT (and this is not necessarily required!), and the last connection to KT has come from the outside (so that serverName.txt contains the external URL), the indexing scripts run by cron won't work. Such scripts should rely on "static" configuration files for their functioning. (Incidentally, this may not be the only issue that prevents using KT in the kind of scenario I described above; it's just the first I ran into.)
Any more ideas Megan? I really need to get this DMS implemented soon..
Thanks.
Fatal error: Allowed memory size of 16777216 bytes exhausted (tried to allocate 1124253 bytes) in /var/www/knowledgeTree/search2/indexing/indexerCore.inc.php on line 593
The line in the error is basically running through the content of the file and replacing all the tabs and new lines with spaces. What is the memory_limit set to in your php.ini file? Mine looks as follows:
max_execution_time = 200 ; Maximum execution time of each script, in seconds
max_input_time = 200 ; Maximum amount of time each script may spend parsing request data
memory_limit = 500000000 ; Maximum amount of memory a script may consume (500MB)
I set the memory_limit in Bytes because it wasn't recognising it in MB or GB.
max_execution_time = 30 ; Maximum execution time of each script, in seconds
max_input_time = 60 ; Maximum amount of time each script may spend parsing request data
memory_limit = 16M ; Maximum amount of memory a script may consume (16MB)
I'll get our sysadmin to change these values.. but I'm not sure it's the root cause. Because, today when I ran it I got:
  File "/var/www/knowledgeTree/bin/openoffice/DocumentConverter.py", line 152, in ?
    converter.convert(argv[1], argv[2])
  File "/var/www/knowledgeTree/bin/openoffice/DocumentConverter.py", line 116, in convert
    document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, _unoProps(Hidden=True, ReadOnly=True))
__main__.com.sun.star.lang.IllegalArgumentException: URL seems to be an unsupported one.
</output>
2008-07-28 14:35:32 () DEBUG: Document docid: 3 was not removed from the queue as it looks like there was a problem with the extraction process
2008-07-28 14:35:32 () DEBUG: Indexing docid: 6 extension: 'odt' mimetype: 'application/vnd.oasis.opendocument.text' extractor: 'OOTextExtractor'
2008-07-28 14:35:32 () INFO: Processing docid: 6.
2008-07-28 14:35:32 () DEBUG: Extra Info docid: 6 Source File: '/var/www/knowledgeTree/var/tmp/6.odt' Target File: '/var/www/knowledgeTree/var/tmp/ktindexerrgQMfk'
2008-07-28 14:35:32 () DEBUG: OOTextExtractor: "/usr/bin/python" "/var/www/knowledgeTree/bin/openoffice/DocumentConverter.py" "/var/www/knowledgeTree/var/tmp/6.odt" "/var/www/knowledgeTree/var/tmp/ktindexerrgQMfk.html" 127.0.0.1 8100
2008-07-28 14:35:33 () ERROR: Could not extract contents from document 6
2008-07-28 14:35:33 () ERROR: <output>Traceback (most recent call last):
  File "/var/www/knowledgeTree/bin/openoffice/DocumentConverter.py", line 152, in ?
    converter.convert(argv[1], argv[2])
  File "/var/www/knowledgeTree/bin/openoffice/DocumentConverter.py", line 116, in convert
    document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, _unoProps(Hidden=True, ReadOnly=True))
__main__.com.sun.star.lang.IllegalArgumentException: URL seems to be an unsupported one.
</output>
2008-07-28 14:35:33 () DEBUG: Document docid: 6 was not removed from the queue as it looks like there was a problem with the extraction process
2008-07-28 14:35:33 () DEBUG: indexDocuments: done
BTW Do you know exactly which Openoffice packages need to be installed for KT?
Like I mentioned in the other bug, these are what we have installed at the moment:
i A openoffice.org-base-core - OpenOffice.org office suite -- libdba
i A openoffice.org-common - OpenOffice.org office suite architecture i
i A openoffice.org-core - OpenOffice.org office suite architecture d
i openoffice.org-dtd-officedocume - OfficeDocument 1.0 DTD (OpenOffice.org 1.x
i A openoffice.org-filter-binfilter - Legacy filters (e.g. StarOffice 5.2) for O
i openoffice.org-headless - Headless VCL plugin for OpenOffice.org
i openoffice.org-java-common - OpenOffice.org office suite Java support a
i A openoffice.org-style-andromeda - Default symbol style for OpenOffice.org
i A openoffice.org-style-crystal - Crystal symbol style for OpenOffice.org
i A openoffice.org-style-tango - Tango symbol style for OpenOffice.org
i openoffice.org-writer - OpenOffice.org office suite - word process
Actually that does kind of make sense. The error is from open office being unable to extract the contents from the document, this happens before the point in the code where the memory limit is reached. I wasn't able to run my installation with a memory_limit of 16MB. Reschedule the pdf document once the values have been changed and see if that works.
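As a side note on the numbers in the memory discussion above: the 16777216 bytes in the fatal error is exactly the `16M` from php.ini, since PHP's shorthand suffixes K, M and G are powers of 1024. The Python sketch below only reproduces that arithmetic; the helper name is made up, and it is not a claim about why "MB" or "GB" values were not recognised on the other installation.

```python
def php_size_to_bytes(value: str) -> int:
    """Convert a php.ini size such as '16M' or '500000000' to bytes.

    Mirrors PHP's documented shorthand: K, M and G are powers of 1024.
    A value without a suffix is taken as a plain byte count.
    """
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in units:
        return int(value[:-1]) * units[suffix]
    return int(value)


if __name__ == "__main__":
    limit = php_size_to_bytes("16M")
    print(limit)                              # the figure in the fatal error
    print(php_size_to_bytes("500000000"))     # the byte value quoted above
    print(php_size_to_bytes("500000000") / limit)  # ratio between the two limits
```

The 500000000-byte setting is roughly thirty times the 16M limit, so the two installations were running under very different ceilings.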
Hi Megan
I just noticed that there are two cronIndexer.php scripts:
/var/www/knowledgeTree/search2/bin/cronIndexer.php
/var/www/knowledgeTree/search2/indexing/bin/cronIndexer.php
It appears that the first script is supposed to call the second script. When I execute the first script from the CLI, indexing does NOT occur. i.e. I just see the following in the log:
2008-07-30 08:25:37 () DEBUG: call_page: calling http://kt.xxxxxxxxxxx.org.za/search2/indexing/bin/cronIndexer.php
2008-07-30 08:25:54 (192.168.1.217) DEBUG: kt_url: base url - http://kt.xxxxxxxxxx.org.za
However if I execute the second script from the CLI then I get the OpenOffice error - "unsupported URL" above. It is only when I execute the second script that the dashlet in KT says that indexing has occurred, and the two Oo documents go into error status. If I'm not mistaken it is the first script that is called from scheduler.php. Actually, maybe I'm talking rubbish because here the correct URL is displayed (even though indexing doesn't actually occur):
2008-07-30 07:01:04 () DEBUG: Scheduler: stopping
2008-07-30 08:01:02 () DEBUG: Scheduler: starting
2008-07-30 08:01:02 () DEBUG: kt_url: base url - http://kt.xxxxxxxxxxx.org.za
2008-07-30 08:01:02 () DEBUG: call_page: calling http://kt.xxxxxxxxxxx.org.za/search2/indexing/bin/cronIndexer.php
2008-07-30 08:01:02 () DEBUG: Scheduler - Task: Indexing completed in 0.43s.
2008-07-30 08:01:03 () DEBUG: kt_url: base url - http://kt.xxxxxxxxxxx.org.za
2008-07-30 08:01:03 () DEBUG: call_page: calling http://kt.xxxxxxxxxxx.org.za/search2/indexing/bin/cronMigration.php
2008-07-30 08:01:03 () DEBUG: Scheduler - Task: Index Migration completed in 0.42s.
2008-07-30 08:01:03 () DEBUG: Scheduler - Task: Open Office test completed in 0.42s.
2008-07-30 08:01:03 () DEBUG: Scheduler - Task: Cleanup Temporary Directory completed in 0.43s.
2008-07-30 08:01:04 () DEBUG: Scheduler: stopping
The scheduler calls the first cronIndexer.php script which uses a curl function to call the second. I can't recall the reason for this right now. Do you have curl installed and working for your www-data user?
Has the memory_limit been changed on your installation yet?
Nope.. sysadmin has been a bit busy, but he should do it soon I hope.
I have the following installed for curl:
i php5-curl - CURL module for php5
As for the Openoffice indexing issue, I've asked the sysadmin to add the following to the startup command:
-nofirststartwizard
As ours is a server with no X or GUI, I have a feeling the startup wizard is trying to run. I got this hint from a post on the Oo forums.
Attached is the most recent log.
We changed the php memory settings and added the '-nofirststartwizard' option as well as installed the additional Oo packages. Unfortunately none of this is making a difference! As you can see in the first part of the log (attachment: 20080805.log) the scheduler looks like it's running. However the last bit of the log is where I run:
sudo php -Cq /var/www/knowledgeTree/search2/indexing/bin/cronIndexer.php
from the command line. I'm losing hope here! :(
I've attached a new file OpenOfficeTextExtractor which will extract the text from the odt file without open office. The only thing it needs is unzip which should be standard on Debian.
- Copy the file into "knowledgeTree/search2/indexing/extractors/"
- Reschedule all documents
- Run the cronIndexer
Hopefully that works. We're releasing 3.5.3 soon which will have support for catdoc and catppt thereby removing our reliance on OpenOffice.
Should that file be called OOTextExtractor.inc.php instead of OpenOfficeTextExtractor.inc.php?
No, it should be OpenOfficeTextExtractor.inc.php.
I get this when trying to reschedule all documents:
Fatal error: Call to a member function debug() on a non-object in /var/www/knowledgeTree/search2/indexing/indexerCore.inc.php on line 523
Sorry.. ignore the last comment.. I had tried to add in a debug statement when I was trying to figure things out.
See 20080805-2.log
I still get the same "unsupported url" problem.. How does it know to call the new extractor?
Sorry, I missed that part out.
Run the following sql on your database:
update mime_types set extractor_id = null;
delete from mime_extractors;
delete from system_settings where name='mimeTypesRegistered';
It deletes the current associated extractors. The indexer will regenerate these based on the contents of the directory the next time it runs.
FYI: The issue that removes the dependence on OpenOffice is:
Morning Megan.
Unfortunately it's still giving the url issue. Please see the log-2008-08-06.www-data.txt attachment. Initially it said that the extractors are disabled. So I then re-ran the script and it then gave the same URL issue. Did I miss something? Thanks
Should the original OOTextExtractor.inc.php be removed from the directory so KT doesn't get confused?
Yes, delete all three of the OO extractors. Then run the sql again to clear out the mime_extractor mappings. Sorry about dragging this out, it isn't my area.
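For readers who want to see what the three SQL statements above do before running them against a live database, here is a small Python sketch that applies them to a throwaway in-memory SQLite database. The toy schema is an assumption: only the table names, `extractor_id`, and the `name` column of `system_settings` come from the thread; the other columns, the sample rows and both function names are invented. The real KnowledgeTree database is MySQL, so treat this purely as an illustration of the effect.

```python
import sqlite3

# The three statements quoted in the thread, unchanged.
RESET_SQL = """
update mime_types set extractor_id = null;
delete from mime_extractors;
delete from system_settings where name='mimeTypesRegistered';
"""


def make_toy_db() -> sqlite3.Connection:
    """Build a hypothetical, minimal stand-in for the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table mime_extractors (id integer primary key, name text, active integer);
        create table mime_types (id integer primary key, mimetypes text, extractor_id integer);
        create table system_settings (name text, value text);
        insert into mime_extractors values (27, 'PDFExtractor', 1);
        insert into mime_types values (1, 'application/pdf', 27);
        insert into system_settings values ('mimeTypesRegistered', '1');
        insert into system_settings values ('someOtherSetting', 'x');
        """
    )
    return conn


def reset_extractor_mappings(conn: sqlite3.Connection) -> None:
    """Clear the extractor mappings; the mime types themselves are kept."""
    conn.executescript(RESET_SQL)


if __name__ == "__main__":
    db = make_toy_db()
    reset_extractor_mappings(db)
    print(db.execute("select * from mime_types").fetchall())
    print(db.execute("select count(*) from mime_extractors").fetchone())
    print(db.execute("select name from system_settings").fetchall())
```

In this toy run the mime type row survives with its extractor link cleared, the extractor table is emptied, and only the `mimeTypesRegistered` setting is removed, which is consistent with the description that the indexer regenerates the mappings on its next run.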
No worries.. as long as we eventually sort it out I'll be happy.
I tried this but get:
Warning: require_once(OOTextExtractor.inc.php) [function.require-once]: failed to open stream: No such file or directory in /var/www/knowledgeTree/search2/indexing/extractors/RTFExtractor.inc.php on line 39
Fatal error: require_once() [function.require]: Failed opening required 'OOTextExtractor.inc.php' (include_path='/var/www/knowledgeTree/search2:/var/www/knowledgeTree/ktapi:/var/www/knowledgeTree/thirdparty/xmlrpc-2.2/lib:/var/www/knowledgeTree/thirdparty/simpletest:/var/www/knowledgeTree/thirdparty/Smarty:/var/www/knowledgeTree/thirdparty/pear:/var/www/knowledgeTree/thirdparty/ZendFramework/library:.:/usr/share/php:/usr/share/pear') in /var/www/knowledgeTree/search2/indexing/extractors/RTFExtractor.inc.php on line 39
Do I need to edit RTFExtractor and change the require to OpenOfficeTextExtractor?
There appear to be a number of dependencies involved. It may be better to upgrade to 3.5.3 which includes this than to try and patch it. We're releasing the Community Edition next week.
Another option might be to manually change the mime_extractor mappings. Did you delete those extractor files or just move them out?
I just moved them out..
How would I manually change the mime_extractor mappings? I might be able to give 3.5.3 a go.... but I have to get our sysadmin to do it.. and I don't know if he'll have time to do another install for me.
I've attached my extractor files, I'm not sure if they'll work, as there have been a number of changes in 3.5.3. The files go into the knowledgeTree/Search2/ directory.
My mime_extractors table is below:
INSERT INTO `mime_extractors` VALUES
(23, 'ExcelExtractor', 1),
(24, 'ExifExtractor', 1),
(25, 'OpenOfficeTextExtractor', 1),
(26, 'OpenXmlTextExtractor', 1),
(27, 'PDFExtractor', 1),
(28, 'PlainTextExtractor', 1),
(29, 'PowerpointExtractor', 1),
(30, 'ScriptExtractor', 1),
(31, 'StarOfficeExtractor', 1),
(32, 'WordExtractor', 1),
(33, 'XMLExtractor', 1);
Good news Megan!
That worked, in that all my documents are now indexed!!!! However, when running from the command line I get the following:
me@xxxxxxxxx:/var/www/knowledgeTree/var/log$ sudo php -Cq /var/www/knowledgeTree/search2/indexing/bin/cronIndexer.php
Warning: filesize(): stat failed for /var/www/knowledgeTree/var/tmp/ktindexernyPIDx in /var/www/knowledgeTree/search2/indexing/extractors/PDFExtractor.inc.php on line 94
Warning: file_get_contents(/var/www/knowledgeTree/var/tmp/ktindexernyPIDx): failed to open stream: No such file or directory in /var/www/knowledgeTree/search2/indexing/indexerCore.inc.php on line 584
The actual scheduler still thinks there is nothing to do, even though I add a new document:
2008-08-11 08:36:03 () DEBUG: Scheduler: starting
2008-08-11 08:36:03 () DEBUG: Scheduler: stopping - nothing to do
Would it be a problem for me to add the cronIndexer.php script into the cron job? That way I know that indexing is occurring..
Rax
Oops.. I also get the following when trying to index an xls:
Fatal error: Call to undefined method JavaXMLRPCLuceneIndexer::restartBatch() in /var/www/knowledgeTree/search2/indexing/extractors/StarOfficeExtractor.inc.php on line 168
LOG:
2008-08-11 09:07:29 () DEBUG: Indexing docid: 10 extension: 'xls' mimetype: 'application/vnd.ms-excel' extractor: 'ExcelExtractor'
2008-08-11 09:07:29 () INFO: Processing docid: 10.
2008-08-11 09:07:29 () DEBUG: Extra Info docid: 10 Source File: '/var/www/knowledgeTree/var/tmp/10.xls' Target File: '/var/www/knowledgeTree/var/tmp/ktindexerOugVAX'
2008-08-11 09:07:29 () DEBUG: ExcelExtractor: "/usr/bin/xls2csv" -d UTF-8 -q 0 -c " " "/var/www/knowledgeTree/var/tmp/10.xls" > "/var/www/knowledgeTree/var/tmp/ktindexerOugVAX"
2008-08-11 09:07:30 () DEBUG: StarOfficeExtractor: "/usr/bin/python" "/var/www/knowledgeTree/bin/openoffice/DocumentConverter.py" "/var/www/knowledgeTree/var/tmp/10.xls" "/var/www/knowledgeTree/var/tmp/ktindexerOugVAX.html" 127.0.0.1 8100
2008-08-11 09:07:31 () INFO: DocumentId: 10 - Suspect the file cannot be indexed by Open Office.
I'm so glad it's working! It shouldn't be a problem if you add the cronIndexer to the cron job.
Add the following line to your config.ini:
[indexer]
useOpenOffice = false
That will stop the indexer trying to use open office for extracting text. Check your web server's write permissions on the tmp directory, to ensure the temp files can be created. I'm not sure about the excel extractor. I'll have to check with the developer.
Ok, adding the config option removed the OO error, but I noticed the following:
2008-08-11 10:06:49 () DEBUG: Indexing docid: 10 extension: 'xls' mimetype: 'application/vnd.ms-excel' extractor: 'ExcelExtractor'
2008-08-11 10:06:49 () INFO: Processing docid: 10.
2008-08-11 10:06:49 () DEBUG: Extra Info docid: 10 Source File: '/home/Documents/00/13' Target File: '/var/www/knowledgeTree/var/tmp/ktindexeryZIcM1'
2008-08-11 10:06:49 () DEBUG: ExcelExtractor: "/usr/bin/xls2csv" -d UTF-8 -q 0 -c " " "/home/Documents/00/13" > "/var/www/knowledgeTree/var/tmp/ktindexeryZIcM1"
2008-08-11 10:06:49 () INFO: The document 10 cannot be indexed as /usr/bin/xls2csv is not available and OpenOffice is not in use.
2008-08-11 10:06:49 () DEBUG: Indexer: removing document 10 from the queue - Done indexing docid: 10
xls2csv is in the path /usr/bin/xls2csv. The binary is mentioned in the config.ini. So why does it think it's not available? Also, KT reports the document as successfully indexed, even though it didn't really index it. If I run the command from the CLI I get a "permission denied" error:
drwxr-xr-x 9 www-data www-data 4096 2008-05-23 13:24 .
drwxr-xr-x 25 www-data www-data 4096 2008-05-23 13:24 ..
drwxr-xr-x 3 www-data www-data 4096 2008-08-05 10:14 cache
drwxr-xr-x 2 www-data www-data 4096 2008-05-23 13:24 Documents
drwxr-xr-x 2 www-data www-data 4096 2008-08-11 10:06 indexes
drwxrwxr-x 2 www-data www-data 8192 2008-08-11 00:00 log
drwxr-xr-x 4 www-data www-data 4096 2008-07-07 11:38 proxies
drwxr-xr-x 4 www-data www-data 12288 2008-08-11 10:06 tmp
drwxr-xr-x 2 www-data www-data 4096 2008-05-23 13:24 uploads
me@xxxxxxxx:/var/www/knowledgeTree/var$ sudo "/usr/bin/xls2csv" -d UTF-8 -q 0 -c " " "/home/Documents/00/13" > "/var/www/knowledgeTree/var/tmp/ktindexeryZIcM1"
-bash: /var/www/knowledgeTree/var/tmp/ktindexeryZIcM1: Permission denied
Even though you can see www-data has full perms to tmp. The webserver is also running as www-data.
One more question:
At the moment, our install feels slightly "hacked together" with all the manual changes we've made. Do you think this will be an issue if we decided to upgrade at some point?
=> INFO: The document 10 cannot be indexed as /usr/bin/xls2csv is not available and OpenOffice is not in use.
This should be reworded, it just means that the xls2csv binary wasn't able to index the document.
=> DEBUG: Indexer: removing document 10 from the queue - Done indexing docid: 10
It removes it from the queue to prevent it blocking further documents from being indexed, even though the document wasn't indexed successfully. There are similar issues being worked on at the moment with xls2csv, catdoc, etc not working on a source install on ubuntu. I'll let you know when a solution is found.
=> our install feels slightly "hacked together" with all the manual changes we've made. Do you think this will be an issue if we decided to upgrade at some point?
Most of the changes we've made are all implemented in our 3.5.3 version so they shouldn't cause any major problems with an upgrade.
=> It removes it from the queue to prevent it blocking further documents from being indexed, even though the document wasn't indexed successfully.
Fair enough, but I think it's currently moving the document to the "success" queue, rather than the "failed" queue. Before we start using this in anger, I'll have to upload our entire repo and see how it works. If all goes well then we'll stick with it the way it is. Thanks!
Thought you would like to know that we are no longer receiving the error for the xls2csv command..
however, it doesn't look like it's actually indexing the file .. i.e. I can't find anything from the contents of the xls.
Megan,
I noticed that Outlook files (.msg, .oft) are being recognised as Word documents and catdoc is used to try and index these files. This however is not working, resulting in the following:
2008-08-12 14:15:07 () INFO: Processing docid: 653.
2008-08-12 14:15:07 () DEBUG: Extra Info docid: 653 Source File: '/home/Documents/06/656' Target File: '/var/www/knowledgeTree/var/tmp/ktindexerNmo5JP'
2008-08-12 14:15:07 () DEBUG: WordExtractor: "/usr/bin/catdoc" -w -d UTF-8 "/home/Documents/06/656" > "/var/www/knowledgeTree/var/tmp/ktindexerNmo5JP"
2008-08-12 14:15:07 () INFO: The document 653 cannot be indexed as /usr/bin/catdoc is not available and OpenOffice is not in use.
2008-08-12 14:15:07 () DEBUG: Indexer: removing document 653 from the queue - Done indexing docid: 653
Note that Word documents ARE being indexed correctly.
cheers
Rax
I've attached an updated set of extractors. The commands calling xls2csv, etc were incorrect and therefore causing the document not to be indexed. Hopefully the new code will help correct your problems.
I'm glad to hear that your word documents at least are being indexed. I'm not sure about the Outlook documents, I'm not sure what extractor would be used for them.
=> it's currently moving the document to the "success" queue, rather than the "failed" queue.
There is no success or failed queue, once the document has been indexed it is moved out of the indexing queue. If it fails on the indexing it is also moved out of the queue as I explained before. The move to a failed queue is something that needs to be addressed in future.
Hi Megan.
Sorry for the slow reply (I was away last week), and thanks for the updated code. I will implement the new code today and see what happens.
Cheers
Rax

Do I follow the same instructions as before for implementing these changes?
Thanks

Yes.
Copy the files into the <knowledgeTree>/search2/indexing directory. Then run the SQL:
update mime_types set extractor_id = null;
delete from mime_extractors;
delete from system_settings where name='mimeTypesRegistered';
Reschedule some of your documents and hold thumbs.

I made the changes.
When I log in and try to get to the dashboard I get the following error in the browser:
Fatal error: Class JavaXMLRPCLuceneIndexer contains 1 abstract method and must therefore be declared abstract or implement the remaining methods (Indexer::isDocumentIndexed) in /var/www/knowledgeTree/search2/indexing/indexers/JavaXMLRPCLuceneIndexer.inc.php on line 281
This is in the log file:
2008-08-25 14:07:30 (192.168.1.217) INFO: control.php: about to redirect to /login.php?errorMessage=You need to login to access this page&redirect=http%3A%2F%2Fkt.xxxxxxxxxxx.org.za%2Fdashboard.php
2008-08-25 14:07:37 (192.168.1.217) DEBUG: diagnose: extractor 'OOPresentationExtractor' is disabled.
2008-08-25 14:07:37 (192.168.1.217) DEBUG: diagnose: extractor 'RTFExtractor' is disabled.
2008-08-25 14:07:37 (192.168.1.217) DEBUG: diagnose: extractor 'OOTextExtractor' is disabled.
2008-08-25 14:07:37 (192.168.1.217) DEBUG: diagnose: extractor 'PSExtractor' is disabled.
2008-08-25 14:07:37 (192.168.1.217) DEBUG: diagnose: extractor 'OOSpreadsheetExtractor' is disabled.
2008-08-25 14:07:37 (192.168.1.217) DEBUG: diagnose: class 'StarOfficeExtractor' does not support any types.
2008-08-25 14:07:37 (192.168.1.217) DEBUG: kt_url: base url - http://kt.xxxxxxxxxxx.org.za
2008-08-25 14:10:02 () DEBUG: indexDocuments: start
2008-08-25 14:10:02 () DEBUG: diagnose: extractor 'OOPresentationExtractor' is disabled.
2008-08-25 14:10:02 () DEBUG: diagnose: extractor 'RTFExtractor' is disabled.
2008-08-25 14:10:02 () DEBUG: diagnose: extractor 'OOTextExtractor' is disabled.
2008-08-25 14:10:02 () DEBUG: diagnose: extractor 'PSExtractor' is disabled.
2008-08-25 14:10:02 () DEBUG: diagnose: extractor 'OOSpreadsheetExtractor' is disabled.
2008-08-25 14:10:02 () DEBUG: diagnose: class 'StarOfficeExtractor' does not support any types.
2008-08-25 14:10:02 () DEBUG: Indexer::clearoutDeleted: removed documents from indexing queue that have been deleted
2008-08-25 14:10:02 () DEBUG: indexDocuments: stopping - no work to be done
2008-08-25 14:15:01 () DEBUG: indexDocuments: start
2008-08-25 14:15:01 () DEBUG: diagnose: extractor 'OOPresentationExtractor' is disabled.
2008-08-25 14:15:01 () DEBUG: diagnose: extractor 'RTFExtractor' is disabled.
2008-08-25 14:15:01 () DEBUG: diagnose: extractor 'OOTextExtractor' is disabled.
2008-08-25 14:15:01 () DEBUG: diagnose: extractor 'PSExtractor' is disabled.
2008-08-25 14:15:01 () DEBUG: diagnose: extractor 'OOSpreadsheetExtractor' is disabled.
2008-08-25 14:15:01 () DEBUG: diagnose: class 'StarOfficeExtractor' does not support any types.
2008-08-25 14:15:01 () DEBUG: Indexer::clearoutDeleted: removed documents from indexing queue that have been deleted
What to do? I can browse to documents if I put in the right URL though.
Thanks

The dashboard issue looks like a discrepancy between versions of the JavaXMLRPCLuceneIndexer class. I've attached the full search2 directory, which should contain all the correct class versions. Back up your <knowledgeTree>/search2 directory and drop this one in.
For the second issue, it looks like the extractors haven't been regenerated in the database. Run the SQL in my previous post again. The most important statement is the last one:
delete from system_settings where name='mimeTypesRegistered';
Check that the setting mimeTypesRegistered has been deleted from system_settings. If the setting exists then the system doesn't regenerate the extractors.
Clear your cache: <knowledgeTree>/var/cache

That seems to have worked, Megan.
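For anyone repeating the extractor reset described above, here is a minimal sketch of the three SQL statements followed by the check that mimeTypesRegistered is really gone. It is an illustration only: it runs against an in-memory SQLite database rather than the real KnowledgeTree MySQL database, the table layouts are hypothetical and reduced to the columns the statements mention, and the helper names reset_extractors and make_demo_db are made up for this example.

```python
import sqlite3

# The three statements quoted in the thread, in the same order.
RESET_STATEMENTS = [
    "update mime_types set extractor_id = null",
    "delete from mime_extractors",
    "delete from system_settings where name='mimeTypesRegistered'",
]


def reset_extractors(conn):
    """Run the reset statements; return True if the flag row is gone."""
    for statement in RESET_STATEMENTS:
        conn.execute(statement)
    conn.commit()
    remaining = conn.execute(
        "select count(*) from system_settings where name='mimeTypesRegistered'"
    ).fetchone()[0]
    return remaining == 0


def make_demo_db():
    """Build a throwaway database with hypothetical, minimal tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table mime_types (id integer primary key, extractor_id integer);
        create table mime_extractors (id integer primary key, name text);
        create table system_settings (name text, value text);
        insert into mime_types values (1, 7);
        insert into mime_extractors values (7, 'WordExtractor');
        insert into system_settings values ('mimeTypesRegistered', '1');
        """
    )
    return conn


conn = make_demo_db()
flag_cleared = reset_extractors(conn)
print("mimeTypesRegistered cleared:", flag_cleared)
```

If the final check reports that the row still exists, the thread suggests the extractors will not be regenerated, so the delete should be run again before clearing the cache.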
Do I need to be concerned about the disabled extractors?
2008-08-25 16:35:01 () DEBUG: Indexer::clearoutDeleted: removed documents from indexing queue that have been deleted
2008-08-25 16:35:01 () DEBUG: Indexing docid: 969 extension: 'doc' mimetype: 'application/msword' extractor: 'WordExtractor'
2008-08-25 16:35:01 () INFO: Processing docid: 969.
2008-08-25 16:35:01 () DEBUG: Extra Info docid: 969 Source File: '/home/Documents/09/972' Target File: '/var/www/knowledgeTree/var/tmp/ktindexer1mPePn'
2008-08-25 16:35:01 () DEBUG: WordExtractor: "/usr/bin/catdoc" -w -d UTF-8 "/home/Documents/09/972" > "/var/www/knowledgeTree/var/tmp/ktindexer1mPePn"
2008-08-25 16:35:01 () DEBUG: Indexer: removing document 969 from the queue - Done indexing docid: 969
2008-08-25 16:35:01 () DEBUG: indexDocuments: done
2008-08-25 16:35:01 () DEBUG: diagnose: extractor 'OOPresentationExtractor' is disabled.
2008-08-25 16:35:01 () DEBUG: diagnose: extractor 'RTFExtractor' is disabled.
2008-08-25 16:35:01 () DEBUG: diagnose: extractor 'OOTextExtractor' is disabled.
2008-08-25 16:35:01 () DEBUG: diagnose: extractor 'PSExtractor' is disabled.
2008-08-25 16:35:01 () DEBUG: diagnose: extractor 'OOSpreadsheetExtractor' is disabled.
2008-08-25 16:35:01 () DEBUG: diagnose: extractor 'StarOfficeExtractor' is disabled.
2008-08-25 16:40:02 () DEBUG: indexDocuments: start
2008-08-25 16:40:02 () DEBUG: diagnose: extractor 'OOPresentationExtractor' is disabled.
2008-08-25 16:40:02 () DEBUG: diagnose: extractor 'RTFExtractor' is disabled.
2008-08-25 16:40:02 () DEBUG: diagnose: extractor 'OOTextExtractor' is disabled.
2008-08-25 16:40:02 () DEBUG: diagnose: extractor 'PSExtractor' is disabled.
2008-08-25 16:40:02 () DEBUG: diagnose: extractor 'OOSpreadsheetExtractor' is disabled.

Good to hear!
I don't think you need to be concerned about the disabled extractors; some of the original extractors are disabled in the code because the new extractors override them.

Thanks for all the help, Megan. Took a while, but I'm glad it's sorted.
I will basically be putting the DMS live today. Hold thumbs! ;)

Good luck! I hope all goes well :)
I'm going to close the issue now. This issue was fixed by adding the rootUrl to the serverName.txt file and installing the updated (version 3.5.3) search2 extractors.
And I also had to add the following to the cron job:
sudo php -Cq /var/www/knowledgeTree/search2/indexing/bin/cronIndexer.php
Closing this issue. It relates to
2008-07-14 06:01:01 () DEBUG: call_page: calling http://XXXXXXXXXXXX/var/www/knowledgeTree/search2/indexing/bin/cronIndex
This URL is made up of 2 parts: the path to the indexer (search2/indexing/bin/cronIndex) and the server name (http://XXXXXXXXXXXX/var/www/knowledgeTree).
The server name in turn consists of the domain (http://XXXXXXXXXXXX) and the rootUrl (/var/www/knowledgeTree).
I assume that your domain is pointing to the knowledgeTree directory (/var/www/knowledgeTree). Check that the rootUrl in your config.ini is set to default.
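To make the breakdown above concrete, here is a rough sketch of how the three pieces combine and why a filesystem path in the rootUrl produces the broken URL from the log. This is not KnowledgeTree's actual code: the function names build_indexer_url and root_url_is_filesystem_path are hypothetical, the indexer path is kept truncated exactly as it appears in the log line, and treating the resolved "default" rootUrl as an empty string assumes the domain points straight at the knowledgeTree directory.

```python
# Values taken from the masked log line in this thread.
DOMAIN = "http://XXXXXXXXXXXX"
INDEXER_PATH = "search2/indexing/bin/cronIndex"  # truncated as in the log
DOCUMENT_ROOT = "/var/www/knowledgeTree"


def build_indexer_url(domain, root_url, indexer_path):
    """Join domain, rootUrl and indexer path with single slashes."""
    parts = [domain.rstrip("/")]
    root = root_url.strip("/")
    if root:
        parts.append(root)
    parts.append(indexer_path.lstrip("/"))
    return "/".join(parts)


def root_url_is_filesystem_path(root_url, document_root):
    """True when the rootUrl is really the on-disk document root."""
    return bool(root_url.strip("/")) and (
        root_url.rstrip("/") == document_root.rstrip("/")
    )


# What the log shows: the directory path ended up in the rootUrl slot.
broken = build_indexer_url(DOMAIN, DOCUMENT_ROOT, INDEXER_PATH)

# What one would expect if the domain points at the knowledgeTree
# directory, so that the rootUrl resolves to nothing (an assumption).
expected = build_indexer_url(DOMAIN, "", INDEXER_PATH)

print(broken)
print(expected)
print(root_url_is_filesystem_path(DOCUMENT_ROOT, DOCUMENT_ROOT))
```

A check like root_url_is_filesystem_path is only a quick sanity test for this particular symptom; it does not say where in the real code the path gets added.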