once again: indexing of pdf documents


once again: indexing of pdf documents

by Gerrit Kühn

Hi folks,

when I installed my first tiki earlier this year I posted a problem here
that my tiki was indexing every kind of file (text, ps, doc, ppt,
whatever I set up), but no pdfs. Back in February I found a bug in
refresh-functions.php, which was fixed (and maybe did some other things),
and after that indexing of pdfs magically worked for me.
Now I have updated from 1.9.9 to 1.9.11, and indexing is broken again for
pdfs. :-(
I already looked into refresh-functions.php to make sure it was the
correct version. As far as I can tell it is (at least it contains the fix
mentioned above). However, it is not the version I edited in February
anymore, but a new one from the 1.9.11 install.

Can anyone here confirm that indexing of pdf documents is working for
her/him (or not)? I am using

application/pdf /usr/local/bin/pdftotext %1 -

as Mime Type filter, if that should matter.
Any other hints/thoughts would be greatly appreciated, too.


cu
  Gerrit

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
_______________________________________________
Tikiwiki-users mailing list
Tikiwiki-users@...
https://lists.sourceforge.net/lists/listinfo/tikiwiki-users

Re: once again: indexing of pdf documents

by Gerrit Kühn

On Fri, Jun 27, 2008 at 03:54:08PM +0200, Gerrit Kühn wrote:

> Can anyone here confirm that indexing of pdf documents is working for
> her/him (or not)? I am using
>
> application/pdf /usr/local/bin/pdftotext %1 -
>
> as Mime Type filter, if that should matter.
> Any other hints/thoughts would be greatly appreciated, too.

One addition to this:
Meanwhile I replaced pdftotext with a self-written script that calls
pdftotext and then tees the output into a file (and to stdout, of course).
This way I can make sure that an uploaded pdf file is actually processed
during indexing and that the output of pdftotext is correct. The problem has
to be somewhere later in the processing, because the pdf files are not found
by the search module, although pdftotext is obviously working as expected.
PostScript and other file types are working fine, too.
I'm somewhat puzzled here...
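
A wrapper of the kind described above could look roughly like this. It is
only a sketch, not the script actually used: the function name
pdf_filter_debug and the variables PDFTOTEXT and PDF_FILTER_LOG are made up
for illustration, and the default log path is an arbitrary choice.

```shell
#!/bin/sh
# Debug wrapper sketch: run pdftotext and keep a copy of what the indexer receives.
# PDFTOTEXT and PDF_FILTER_LOG are hypothetical override variables, not tiki settings.
pdf_filter_debug() {
    input="$1"
    converter="${PDFTOTEXT:-/usr/local/bin/pdftotext}"
    log="${PDF_FILTER_LOG:-/tmp/pdf-filter-debug.log}"
    # The trailing "-" makes pdftotext write the extracted text to stdout;
    # tee -a appends a copy to the log and passes the text on unchanged.
    "$converter" "$input" - | tee -a "$log"
}
```

Saved as an executable script that ends with a call such as
pdf_filter_debug "$1", it could stand in for /usr/local/bin/pdftotext in the
Mime Type filter. If the log fills up on reindexing but the search still
finds nothing, the conversion step itself is unlikely to be the culprit.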


cu
  Gerrit
--


Re: once again: indexing of pdf documents

by Gerrit Kühn

On Fri, Jun 27, 2008 at 11:24:47PM +0200, Gerrit Kühn wrote:

> One addition to this:

[...]

And one more thing:

I have enabled auto-indexing on upload. However, after uploading a file, the
search function does not find its contents. I have to reindex all files for
search manually to make this work. I guess that is not quite the way it
should be...


cu
  Gerrit
--


Mostly solved (bugfix included): Re: once again: indexing of pdf documents

by Gerrit Kühn

On Sat, 28 Jun 2008 15:11:32 +0200 Gerrit Kühn
<gerrit@...> wrote about Re: [Tikiwiki-users] once again:
indexing of pdf documents:

Hi,

I have made the following changes:

mclane# diff -u refresh-functions.php refresh-functions.php.orig
--- refresh-functions.php       2008-07-02 15:25:20.000000000 +0200
+++ refresh-functions.php.orig  2008-07-02 15:26:28.000000000 +0200
@@ -387,16 +387,16 @@
      $query="select * from `tiki_files`";
      $result=$tikilib->query($query,array(),1,rand(0,$cant-1));
      $info=$result->fetchRow();
-     $words=&search_index($info["description"]." ".$info["name"]." ".$info["search_data"]." ".$info["filename"]);
+     $words=&search_index($info["data"]." ".$info["description"]." ".$info["name"]);
      insert_index($words,"file",$info["fileId"]);
    }
 }
 
 function refresh_index_files() {
   global $tikilib;
-  $result = $tikilib->query("select * from `tiki_files`");
+  $result = $tikilib->query("select * from `tiki_files`", array());
   while ($info = $result->fetchRow()) {
-      $words=&search_index($info["description"]." ".$info["name"]." ".$info["search_data"]." ".$info["filename"]);
+      $words=&search_index($info['data'].' '.$info['description'].' '.$info['name']. ' '.$info['search_data']);
       insert_index($words,"file",$info["fileId"]);
   }
 }


Now indexing sort of works for me, even for pdf files. I removed 'data'
from the search words, because it is binary anyway and it does not seem
to make sense to include it here.
The most important change, the one that fixed my problems, was removing
the array() argument from the query in refresh_index_files().
I still do not know much about php and mysql, so I do not really know what
I changed there, but the other functions in this file show the same
structure (the array() is only there in the random_ versions of the
functions).
Maybe someone here can review this and commit it if it is the right
solution.

I have one problem left, which I have only known about since yesterday,
when I installed an sql browser to look into the actual database:
Some of my uploaded pdf files appear to have the wrong mime type. They are
unrecognized and stored under filetype application/unknown. Therefore they
are not converted to ascii and subsequently not indexed.
As I said, only some pdf files are affected, others are ok. I
cannot see any scheme or structure behind this. Does anyone here have an
idea why that happens and how to fix it?


cu
  Gerrit

-------------------------------------------------------------------------
Sponsored by: SourceForge.net Community Choice Awards: VOTE NOW!
Studies have shown that voting for your favorite open source project,
along with a healthy diet, reduces your potential for chronic lameness
and boredom. Vote Now at http://www.sourceforge.net/community/cca08
_______________________________________________
Tikiwiki-users mailing list
Tikiwiki-users@...
https://lists.sourceforge.net/lists/listinfo/tikiwiki-users

Re: Mostly solved (bugfix included): Re: once again: indexing of pdf documents

by Gerrit Kühn

On Wed, 2 Jul 2008 15:35:47 +0200 Gerrit Kühn <gerrit@...>
wrote about [Tikiwiki-users] Mostly solved (bugfix included): Re: once
again: indexing of pdf documents:

GK> mclane# diff -u refresh-functions.php refresh-functions.php.orig
GK> --- refresh-functions.php       2008-07-02 15:25:20.000000000 +0200
GK> +++ refresh-functions.php.orig  2008-07-02 15:26:28.000000000 +0200

And for your convenience, here is the diff in the correct direction. :-)

mclane# diff -u refresh-functions.php.orig refresh-functions.php
--- refresh-functions.php.orig  2008-07-02 15:26:28.000000000 +0200
+++ refresh-functions.php       2008-07-02 16:03:58.000000000 +0200
@@ -387,16 +387,16 @@
      $query="select * from `tiki_files`";
      $result=$tikilib->query($query,array(),1,rand(0,$cant-1));
      $info=$result->fetchRow();
-     $words=&search_index($info["data"]." ".$info["description"]." ".$info["name"]);
+     $words=&search_index($info["description"]." ".$info["name"]." ".$info["search_data"]." ".$info["filename"]);
      insert_index($words,"file",$info["fileId"]);
    }
 }
 
 function refresh_index_files() {
   global $tikilib;
-  $result = $tikilib->query("select * from `tiki_files`", array());
+  $result = $tikilib->query("select * from `tiki_files`");
   while ($info = $result->fetchRow()) {
-      $words=&search_index($info['data'].' '.$info['description'].' '.$info['name']. ' '.$info['search_data']);
+      $words=&search_index($info["description"]." ".$info["name"]." ".$info["search_data"]." ".$info["filename"]);
       insert_index($words,"file",$info["fileId"]);
   }
 }
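
For anyone unsure about the argument order: diff -u OLD NEW marks lines
that exist only in OLD with "-" and lines that exist only in NEW with "+",
which is the direction a patch is expected in. A small throwaway
demonstration follows; the function name, file names and file contents are
invented for the example and are not taken from tiki.

```shell
#!/bin/sh
# Sketch: show which side of `diff -u OLD NEW` gets "-" and which gets "+".
# show_patch_direction and the two sample files are hypothetical.
show_patch_direction() {
    dir=$(mktemp -d)
    printf 'query("select", array());\n' > "$dir/file.php.orig"
    printf 'query("select");\n' > "$dir/file.php"
    # diff exits with status 1 when the files differ, so that is not an error here.
    (cd "$dir" && diff -u file.php.orig file.php) || true
    rm -rf "$dir"
}
```

Calling show_patch_direction prints a unified diff in which the line from
file.php.orig is prefixed with "-" and the line from file.php with "+".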


GK> I have one problem left which I know of only since yesterday when I
GK> installed an sql browser to look into the actual database:
GK> Some of my uploaded pdf-files appear to have the wrong mime type. They
GK> are unrecognized and stored under filetype application/unknown.
GK> Therefore they are not converted to ascii and subsequently not
GK> indexed.
GK> As I said, there are only some pdf files concerned, others are ok. I
GK> cannot see any scheme or structure behind this. Does anyone here have
GK> an idea why that happens and how to fix it?

For the record: I solved this with the help of Google and found out that
the upload mime type depends on the setup of the browser used, so I have to
make the changes there. However, I think that file-type detection
independent of the browser would be desirable. There must already be
something like this in tiki, because the file icons shown in the
file gallery were always the right ones, independent of the mime type
in the database, which could be pdf, application/pdf, x-application/pdf,
application/unknown or something else...
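
As an illustration of what browser-independent detection could mean, the
file content can be inspected directly: a PDF file normally begins with the
bytes "%PDF-". The following is a minimal sketch under that assumption, not
a description of what tiki does internally; looks_like_pdf is a made-up
helper name.

```shell
#!/bin/sh
# Sketch: decide whether a file is a PDF from its first bytes instead of
# trusting the mime type sent by the browser. looks_like_pdf is hypothetical.
looks_like_pdf() {
    # Read the first five bytes; drop NUL bytes so binary files cause no warnings.
    magic=$(head -c 5 "$1" 2>/dev/null | tr -d '\000')
    [ "$magic" = "%PDF-" ]
}
```

A file stored as application/unknown could then be tested with
looks_like_pdf /path/to/file and, if the check succeeds, treated as
application/pdf before the Mime Type filter is chosen.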


cu
  Gerrit


Re: Mostly solved (bugfix included): Re: once again: indexing of pdf documents

by Xavier de Pedro Puente-2

Gerrit, I would suggest sending this thread to the Tikiwiki
developers <tikiwiki-devel@...> instead, for a better chance of
getting appropriate feedback (the coders and the coding
discussions are mostly on that list).

Cheers

Xavi


