4 03 2009

How to Query PDF Files Using Full-Text Search (FTS)

SQL Server Full-Text Search (FTS) uses IFilters to index specific types of documents. This way, not only plain text but also files such as pdf, mp3 and pptx can be indexed. The query below lists the document types that FTS can currently index:
select * from sys.fulltext_document_types

If the query result does not contain the .pdf extension, you have to install a PDF IFilter. If you are working on your desktop and have already installed Adobe Acrobat PDF Reader, the filter is probably registered with the operating system already; executing the stored procedure below makes SQL Server load the IFilters that are installed on the operating system.
exec sp_fulltext_service 'load_os_resources', 1
Many IFilter vendors do not sign their components. The 'verify_signature' setting of sp_fulltext_service controls whether SQL Server loads only signed IFilters; setting it to 0 allows unsigned ones to be loaded as well.
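The settings above can be sketched as follows. This is only a minimal example: whether signature verification needs to be turned off depends on the IFilter you installed, and turning it off means unsigned binaries are loaded, so treat that step as optional. A restart of the SQL Server service may be needed before a newly installed filter shows up.

```sql
-- Load filters, word breakers and stemmers registered with the operating system
exec sp_fulltext_service 'load_os_resources', 1
go

-- Optional: allow unsigned IFilters (0 = do not verify signatures, 1 = verify)
exec sp_fulltext_service 'verify_signature', 0
go

-- Check whether a filter for .pdf is now visible to full-text search
select document_type, path
from sys.fulltext_document_types
where document_type = '.pdf'
go
```

If the last query returns a row, FTS has a filter it can use for PDF content.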

FTS needs the file extension in order to index different kinds of documents. In other words, it chooses the IFilter for each row based on the document type you provide. So we have to design our table or view with a document column of type varbinary(max) and an extra column that stores the type (extension) of the document.
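-- Drop the table if it already exists
if exists (select * from sys.objects where name = 'tbl_documents')
    drop table tbl_documents
go
create table tbl_documents
(
    DocumentId int not null primary key identity(1,1),  -- unique key, required by the full-text index
    [Document] varbinary(max) not null,                 -- binary content of the file
    [Type] varchar(5) not null default('.pdf')          -- file extension, used by FTS to pick the IFilter
)
go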
To demonstrate, I inserted a couple of PDF documents into tbl_documents, created a full-text index on the table and started crawling. After the crawl on tbl_documents finished, I searched for some sentences and keywords that appear in the books, and the results were perfect : ).
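The post does not show the statements used for this step, so the following is only a sketch of how it could look. The file path, the index name ux_tbl_documents_id, the catalog name ft_documents and the search phrase are all assumptions made for illustration.

```sql
-- Load one PDF from disk (hypothetical path; the SQL Server service
-- account must be able to read the file)
insert into tbl_documents ([Document], [Type])
select BulkColumn, '.pdf'
from openrowset(bulk 'C:\docs\sample.pdf', single_blob) as pdf
go

-- The primary key above has a system-generated name, so create a named
-- unique index to serve as the full-text key (name is an assumption)
create unique index ux_tbl_documents_id on tbl_documents (DocumentId)
go

-- Hypothetical catalog name
create fulltext catalog ft_documents
go

-- TYPE COLUMN tells FTS which IFilter to use for each row
create fulltext index on tbl_documents
(
    [Document] type column [Type]
)
key index ux_tbl_documents_id
on ft_documents
with change_tracking auto
go

-- Population (crawling) runs in the background; 0 means idle
select fulltextcatalogproperty('ft_documents', 'PopulateStatus') as PopulateStatus
go

-- Search the indexed PDF content for a phrase (example phrase only)
select DocumentId
from tbl_documents
where contains([Document], '"full-text search"')
go
```

With change tracking set to auto, the crawl starts as soon as the index is created, so the CONTAINS query returns rows only once the population status has gone back to idle.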
