public abstract class AbstractDocumentBaseOAIHarvester extends AbstractOAIHarvester implements DocumentBaseOAIHarvester
Nested classes/interfaces inherited: DocumentBaseOAIHarvester.ConfigurationNode, OAIObject.Node

| Modifier and Type | Field and Description |
|---|---|
| protected Database | _database - Underlying database to store any info |
| protected org.apache.cocoon.serialization.XMLSerializer | cBytes |
| protected java.lang.String | defaultTransformerFactory |
| protected java.lang.String | defaultTransformerIndent |
| protected java.util.ArrayList | deletedDocs |
| protected DocumentBase | docbase - The underlying document base |
| protected java.lang.String | docbaseId - Id of the underlying document base |
| protected static java.lang.String | ERROR_CODE |
| protected java.io.FileOutputStream | fileOs |
| protected java.util.Hashtable | filesProperties - List of OAI files with their OAI properties |
| protected boolean | forceIndexOnHarvestError - Force indexation on harvest error option. Default: false. |
| protected static java.lang.String | FORCEINDEXONHARVESTERROR |
| protected java.io.File | harvestDoc |
| protected IDGenerator | harvesterIdGen - IDGenerator for this object |
| protected boolean | indexAtHarvestEnd - Indexation at the end of harvesting option. Default: true. |
| protected static java.lang.String | INDEXATHARVESTEND |
| protected boolean | keepDeletedRecords |
| protected boolean | keepHarvestedRecords - Forces the harvester to keep harvested records (XML files) on the server file system. Default: false. |
| protected java.util.Set | m_docsaddedIds |
| protected java.util.Set | m_docsdeletedids |
| protected java.util.Set | m_docsToDeleteIds |
| protected static java.lang.String | NO_DOCS_DELETED |
| protected static java.lang.String | NO_DOCS_HARVESTED |
| protected int | noDocsDeleted |
| protected int | noHarvestedDocs |
| protected int | noRecordsPerBatch |
| protected static java.lang.String | OAI_FAILED_HARVEST |
| protected static java.lang.String | OAI_FROM |
| protected static java.lang.String | OAI_HARVEST_ID |
| protected static java.lang.String | OAI_HARVESTER_LAST_UPDATED |
| protected static java.lang.String | OAI_HARVESTER_RESUMPTION_TOKEN |
| protected static java.lang.String | OAI_IDENTIFIER |
| protected static java.lang.String | OAI_METADATA_PREFIX |
| protected static java.lang.String | OAI_SET |
| protected static java.lang.String | OAI_UNTIL |
| protected static java.lang.String | OAI_VERB |
| protected org.apache.cocoon.xml.XMLPipe | oaiStripper |
| protected Pipeline | pipe - Pre-indexation pipeline |
| protected TimeScheduler | scheduler - Time scheduler for stored requests |
| protected java.util.Hashtable | storedRequests - Requests in application.xconf |
| protected java.util.Hashtable | storeRepositoriesRefs - References to the underlying documentbase's/application's repositories |
| protected java.io.File | tempDir |
| protected java.io.File | tempDirBatch |
| protected java.lang.String | tempDirPath - Path of the temporary directory where the harvested documents will be stored. |
| protected java.lang.String | TEMPFILE_SUFFIX |
| protected static java.lang.String | TRANSFORMER_FACTORY |
| protected static java.lang.String | TRANSFORMER_INDENT |
| protected java.lang.String | transformerFactory - XML Transformer factory class name. |
| protected java.lang.String | transformerIndent - XML Transformer indent option. |
| protected XMLDocument | urlResource |
Inherited fields (one group per declaring class or interface):
adminEmails, captureElemContent, captureRecord, currentDatestamp, currentMetadtaUrlIdentifier, currentOaiIdentifier, currentOaiStatus, cursor, deleteRecord, errorCode, firstXmlConsumer, identifierName, manager, newRequestUrl, OAI_REPOSITORY_URL, OAI_REQUEST_URL, repoUrl, requestParams, requestUrl, responseDate, resumptionToken, sBuff, userAgent
_context, logger
synchronizedXmlConsumer
HTTP_HEADER_NAME_FROM, HTTP_HEADER_NAME_USER_AGENT, NUMBER_RECORDS_PER_RESPONSE, STRING_DATEFORMAT_GRANULARITY_DAY, STRING_DATEFORMAT_GRANULARITY_SECOND
ONE_CENTURY, ONE_DAY, ONE_HOUR, ONE_MINUTE, ONE_SECOND, ONE_WEEK, ONE_YEAR
ALL_SAVE_ATTRIB, PATH_ATTRIB, SAVE_DIRECTORY_PARAM

| Constructor and Description |
|---|
| AbstractDocumentBaseOAIHarvester(DocumentBase base) - Basic constructor |

| Modifier and Type | Method and Description |
|---|---|
| void | backup(SaveParameters save_config) - Saves the timestamp of the harvester |
| protected void | captureRecord() - Ends the capture of an OAI record. |
| protected void | captureResourceFromUrlIdentifier() - Captures the XML from a URL taken from an OAI record and adds it to the oai-record as a sibling of the |
| boolean | checkGranularity(java.lang.String granularity) - Checks the granularity of an OAI provider: YYYY-MM-DDThh:mm:ssZ or YYYY-MM-DD |
| void | close() - Closes the OAI harvester. |
| void | configure(org.apache.avalon.framework.configuration.Configuration configuration) - OAI harvester configuration. Configures the OAI harvester by reading the application.xconf file, which may contain an <sdx:oai-harvester> section inside <sdx:documentBase>. |
| protected void | configureAdminEmails(org.apache.avalon.framework.configuration.Configuration configuration) - Configures a list of admin emails, which can be sub-elements, a single attribute, or both |
| protected void | configureDatabase(org.apache.avalon.framework.configuration.Configuration configuration) - Configures the internal database |
| protected void | configureDataProviders(org.apache.avalon.framework.configuration.Configuration configuration) - Configures data providers info that can be reused and from which requests can be automatically executed |
| protected void | configureHarvestIDGenerator(org.apache.avalon.framework.configuration.Configuration configuration) - Configures the id generator for harvests |
| protected void | configurePipeline(org.apache.avalon.framework.configuration.Configuration configuration) - Configures the pre-indexation pipeline |
| protected void | configureStoreRepositories(java.lang.String repoUrl, org.apache.avalon.framework.configuration.Configuration oaiRepoConf) - Configures the repositories to which data will be stored, based upon their repository url |
| protected void | configureTempDir(org.apache.avalon.framework.configuration.Configuration conf) - Configures the temporary directory where harvested documents will be stored in sub-directories. |
| protected void | configureUpdateTriggers(java.lang.String requestUrl, org.apache.avalon.framework.configuration.Configuration updateConf) - Configures time triggers for stored requests |
| protected void | deleteOAIDocuments() - Deletes OAI documents from the current document base. |
| protected void | deleteTempDir() - Deletes the directory represented by the tempDir class field |
| protected void | deleteTempDirBatch() - Deletes the directory represented by the tempDirBatch class field |
| void | endElement(java.lang.String s, java.lang.String s1, java.lang.String s2) - Receive notification of the end of an element. |
| protected void | endHarvest() - Ends the harvest |
| protected java.lang.String | generateNewHarvestId() - Generates an id to associate with a harvest |
| protected java.lang.String | getHarvesterId() - Returns an id for this harvester based upon the underlying document base id |
| protected IndexParameters | getIndexParameters() - Builds simple index parameters for indexation of OAI records into the underlying document base |
| protected java.lang.String | getIsoDate() - Gets the current date in ISO 8601 format |
| protected java.io.File | getNewTempDirBatch() - Creates a new temporary directory for writing harvested records before they are indexed |
| protected void | handleResumptionToken() - Handles the resumption token by issuing another request based upon the request from which the resumption token was received. |
| protected void | initTempDir() - Establishes the tempDirBatch class field |
| protected boolean | isStartsIndexation() |
| java.util.Date | lastUpdated() - Retrieves the time when the harvester was last updated |
| protected void | prepareRecordCapture() - Sets up resources to capture an OAI record |
| protected void | prepareRecordForDeletion() - Sets up resources to delete an OAI record. Adds the record to the list of records to be removed |
| protected void | prepareResourceFromUrlIdentifierCapture() - Prepares to read a URL value from an OAI record and retrieve the XML behind it. |
| void | purgePastHarvestsData() - Destroys all summary data pertaining to past harvests, but not the actual OAI records harvested |
| protected void | resetAllFields() - Resets necessary class fields |
| protected void | resetRecordCaptureFields(boolean deleteDoc) - Resets the class fields for record capture, possibly deleting the file underlying the current harvestDoc object |
| void | restore(SaveParameters save_config) - Restores the timestamp of the harvester |
| protected void | saveCriticalFields(boolean dataHarvested) - Saves critical data about a harvest |
| void | sendPastHarvestsSummary() - Sends SAX events to the current consumer with summary details of all the past harvests |
| void | sendStoredHarvestingRequests() - Sends the details of stored harvesting requests to the current consumer |
| protected boolean | shouldHarvestDocument() - Queries the underlying data structures, based upon the current SAX flow position and the class fields set, and determines whether an OAI record should be harvested |
| void | startElement(java.lang.String s, java.lang.String s1, java.lang.String s2, org.xml.sax.Attributes attributes) - Receive notification of the beginning of an element. |
| protected void | storeFailedHarvestData(java.lang.Exception e) - Stores data about harvesting failures caused by problems other than OAI errors sent from a queried repository |
| protected boolean | storeHarvestedData() - Reads the documents from tempDirBatch and indexes them in the corresponding document base; any marked deletions will be carried out as well |
| void | targetTriggered(java.lang.String triggerName) - Triggers an OAI request to a repository based upon a trigger name (also a request url) |

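Two of the methods summarized above, checkGranularity and handleResumptionToken, follow OAI-PMH conventions that a short sketch can make concrete. The code below is not the SDX implementation: the class OaiRequestSketch and its method names are hypothetical, and it only assumes what the summary states (two legal granularity strings, and a follow-up request built from the request that returned the token) plus the OAI-PMH rule that resumptionToken is sent as the only argument besides the verb.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, for illustration only; not part of SDX.
public class OaiRequestSketch {

    static final String DAY = "YYYY-MM-DD";
    static final String SECOND = "YYYY-MM-DDThh:mm:ssZ";

    /** True when the granularity string is one of the two legal OAI-PMH values. */
    public static boolean isKnownGranularity(String granularity) {
        return DAY.equals(granularity) || SECOND.equals(granularity);
    }

    /** Formats an instant as a UTC datestamp in the given granularity. */
    public static String formatDate(Instant instant, String granularity) {
        if (!isKnownGranularity(granularity)) {
            throw new IllegalArgumentException("Unknown granularity: " + granularity);
        }
        String pattern = DAY.equals(granularity)
                ? "yyyy-MM-dd"
                : "yyyy-MM-dd'T'HH:mm:ss'Z'";
        return DateTimeFormatter.ofPattern(pattern)
                .withZone(ZoneOffset.UTC)
                .format(instant);
    }

    /** Builds the follow-up request: same verb, with the token as sole argument. */
    public static String resumptionUrl(String baseUrl, String verb, String token) {
        return baseUrl + "?verb=" + verb
                + "&resumptionToken=" + URLEncoder.encode(token, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(formatDate(Instant.now(), SECOND));
        // The base URL and token below are invented sample values.
        System.out.println(resumptionUrl("http://example.org/oai", "ListRecords", "a b/1"));
    }
}
```

The token is URL-encoded because repositories may return tokens containing reserved characters.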
Inherited methods (one group per declaring class or interface):
abortRecordCapture, characters, getAdminEmails, getHarvestParameters, handleErrors, receiveRequest, receiveSynchronizedRequest, receiveSynchronizedRequest, recycle, resetResumptionToken, service, setAdminEmails, setConsumer, setIdentifierName, toSAX
contextualize, enableLogging, getContext, sendElement, sendElementContent
acquireSynchronizedXMLConsumer, comment, endCDATA, endDocument, endDTD, endEntity, endPrefixMapping, ignorableWhitespace, processingInstruction, releaseSynchronizedXMLConsumer, setDocumentLocator, skippedEntity, startCDATA, startDocument, startDTD, startEntity, startPrefixMapping
setConsumer
acquired, isAcquired
acquire, attempt, getTokens, release
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
getAdminEmails, receiveRequest, receiveSynchronizedRequest, receiveSynchronizedRequest, setAdminEmails, setIdentifierName
characters, endDocument, endPrefixMapping, ignorableWhitespace, processingInstruction, setDocumentLocator, skippedEntity, startDocument, startPrefixMapping
comment, endCDATA, endDTD, endEntity, startCDATA, startDTD, startEntity
acquired, setConsumer
acquired

protected DocumentBase docbase
protected java.lang.String docbaseId
protected Pipeline pipe
protected Database _database
protected java.util.Hashtable storedRequests
protected java.util.Hashtable storeRepositoriesRefs
protected TimeScheduler scheduler
protected IDGenerator harvesterIdGen
protected java.lang.String TEMPFILE_SUFFIX
protected java.io.File tempDir
protected java.io.File tempDirBatch
protected java.io.File harvestDoc
protected java.io.FileOutputStream fileOs
protected XMLDocument urlResource
protected java.util.ArrayList deletedDocs
protected int noHarvestedDocs
protected int noDocsDeleted
protected java.util.Set m_docsaddedIds
protected java.util.Set m_docsToDeleteIds
protected java.util.Set m_docsdeletedids
protected boolean keepDeletedRecords
protected int noRecordsPerBatch
protected boolean keepHarvestedRecords
Forces the harvester to keep harvested records (XML files) on the server file system.
Default is false.
This can be changed in the document base configuration file:
<oai-harvester keepHarvestedRecords="{true|false}" [...]>
protected java.lang.String tempDirPath
Path of the temporary directory where the harvested documents will be stored.
Default is the servlet context temp dir (e.g., $TOMCAT/work/...). If the
directory is not writable, the harvester will use the temporary directory
of the JVM (i.e., the java.io.tmpdir system property).
This can be changed in the document base configuration file:
<oai-harvester tempDirPath="{/path/to/directory}" [...]>
To resolve the path, the harvester uses Utilities.resolveFile(org.apache.avalon.framework.logger.Logger, String, Context, String, boolean).
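The fallback described for tempDirPath can be pictured with a minimal sketch. It is an assumption-laden illustration, not the SDX code: the class TempDirFallbackSketch and the method chooseTempDir are hypothetical, and the real harvester resolves the path through Utilities.resolveFile rather than through plain java.io.File checks.

```java
import java.io.File;

// Hypothetical helper, for illustration only; not part of SDX.
public class TempDirFallbackSketch {

    /**
     * Returns the first candidate that is an existing, writable directory,
     * or the JVM temporary directory when none of them qualifies.
     */
    public static File chooseTempDir(File... candidates) {
        for (File candidate : candidates) {
            if (candidate != null && candidate.isDirectory() && candidate.canWrite()) {
                return candidate;
            }
        }
        return new File(System.getProperty("java.io.tmpdir"));
    }

    public static void main(String[] args) {
        // First argument stands in for the configured tempDirPath,
        // second for the servlet context temp dir.
        File configured = args.length > 0 ? new File(args[0]) : null;
        File contextTemp = args.length > 1 ? new File(args[1]) : null;
        System.out.println(chooseTempDir(configured, contextTemp));
    }
}
```

Passing the candidates in priority order keeps the rule in one place: configured path first, servlet context directory next, JVM temporary directory last.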
protected java.lang.String transformerFactory
Default: Xalan, "org.apache.xalan.processor.TransformerFactoryImpl". This can be changed in the configuration file: <oai-harvester transformer-factory="{class name}" [...]>
protected java.lang.String defaultTransformerFactory
protected java.lang.String transformerIndent
Default: no. This can be changed in the configuration file: <oai-harvester transformer-indent="yes|no" [...]>
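How a factory class name and an indent option typically reach the JAXP API can be sketched as follows. This is not the SDX implementation: the class TransformerConfigSketch is hypothetical, and falling back to the platform default factory when the named class cannot be loaded is an assumption made for the example (the defaultTransformerFactory field suggests some fallback exists, but its behavior is not documented here).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, for illustration only; not part of SDX.
public class TransformerConfigSketch {

    /** Loads the named factory, or the platform default when it is unavailable. */
    public static TransformerFactory loadFactory(String className) {
        try {
            return TransformerFactory.newInstance(className, null);
        } catch (TransformerFactoryConfigurationError e) {
            // Assumed fallback: use whatever factory the JVM provides.
            return TransformerFactory.newInstance();
        }
    }

    /** Copies the XML through an identity transformer with the given indent option. */
    public static String serialize(String xml, String factoryClass, String indent) {
        try {
            Transformer transformer = loadFactory(factoryClass).newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, indent); // "yes" or "no"
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        String xalan = "org.apache.xalan.processor.TransformerFactoryImpl";
        System.out.println(serialize("<record><title>t</title></record>", xalan, "yes"));
    }
}
```

The Xalan class is only used if it is on the classpath; otherwise the sketch silently uses the JDK's built-in factory.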
protected java.lang.String defaultTransformerIndent
protected boolean indexAtHarvestEnd
Default: true. This can be changed in the configuration file: <oai-harvester index-at-index-end="yes|no" [...]>
protected boolean forceIndexOnHarvestError
Default: false. This can be changed in the configuration file: <oai-harvester force-index-on-harvest-error="yes|no" [...]>
protected static final java.lang.String TRANSFORMER_FACTORY
protected static final java.lang.String TRANSFORMER_INDENT
protected static final java.lang.String INDEXATHARVESTEND
protected static final java.lang.String FORCEINDEXONHARVESTERROR
protected static final java.lang.String OAI_HARVEST_ID
protected static final java.lang.String OAI_FAILED_HARVEST
protected static final java.lang.String OAI_HARVESTER_LAST_UPDATED
protected static final java.lang.String OAI_HARVESTER_RESUMPTION_TOKEN
protected static final java.lang.String OAI_VERB
protected static final java.lang.String OAI_IDENTIFIER
protected static final java.lang.String OAI_METADATA_PREFIX
protected static final java.lang.String OAI_FROM
protected static final java.lang.String OAI_UNTIL
protected static final java.lang.String OAI_SET
protected static final java.lang.String NO_DOCS_DELETED
protected static final java.lang.String NO_DOCS_HARVESTED
protected static final java.lang.String ERROR_CODE
protected java.util.Hashtable filesProperties
protected org.apache.cocoon.serialization.XMLSerializer cBytes
protected org.apache.cocoon.xml.XMLPipe oaiStripper
public AbstractDocumentBaseOAIHarvester(DocumentBase base)
public void configure(org.apache.avalon.framework.configuration.Configuration configuration)
throws org.apache.avalon.framework.configuration.ConfigurationException
Configures the OAI harvester by reading the application.xconf file,
which may contain a section such as:
<sdx:documentBase [...]>
<sdx:oai-harvester
adminEmail="{some.body@some.where}"
keepDeletedRecords="{true|false}"
noRecordsPerBatch="{number}"
transformer-factory="{Transformer factory class name}"
transformer-indent="{yes|no}"
keepHarvestedRecords="{true|false}"
tempDirPath="{directory path}">
<sdx:oai-data-providers>
<sdx:oai-repository [...]>[...]</sdx:oai-repository>
[...]
</sdx:oai-data-providers>
</sdx:oai-harvester>
</sdx:documentBase>
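As an illustration of how such a section might be read, the sketch below parses a sample with the JDK DOM parser and applies the defaults documented for keepHarvestedRecords and transformer-indent. It is not SDX code: the class HarvesterConfigSketch and its helpers are hypothetical, the sample attribute values are invented, and the real harvester receives an Avalon Configuration object rather than raw XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, for illustration only; not part of SDX.
public class HarvesterConfigSketch {

    /** Returns the attribute value, or the fallback when the attribute is absent. */
    public static String attr(Element element, String name, String fallback) {
        return element.hasAttribute(name) ? element.getAttribute(name) : fallback;
    }

    /** Parses the XML and returns its first sdx:oai-harvester element. */
    public static Element harvesterElement(String xml) {
        try {
            // The default factory is not namespace-aware, so the sdx prefix
            // is matched as part of the plain element name.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return (Element) doc.getElementsByTagName("sdx:oai-harvester").item(0);
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse configuration", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample values; only the attribute names come from the documentation.
        String xml = "<sdx:documentBase>"
                + "<sdx:oai-harvester adminEmail=\"some.body@some.where\""
                + " keepDeletedRecords=\"true\" noRecordsPerBatch=\"500\"/>"
                + "</sdx:documentBase>";
        Element harvester = harvesterElement(xml);
        System.out.println("noRecordsPerBatch = " + attr(harvester, "noRecordsPerBatch", ""));
        System.out.println("keepHarvestedRecords = " + attr(harvester, "keepHarvestedRecords", "false"));
        System.out.println("transformer-indent = " + attr(harvester, "transformer-indent", "no"));
    }
}
```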
Specified by:
configure in interface org.apache.avalon.framework.configuration.Configurable
Parameters:
Configuration -
Throws:
org.apache.avalon.framework.configuration.ConfigurationException
See Also:
keepDeletedRecords, noRecordsPerBatch, transformerFactory, transformerIndent, keepHarvestedRecords, tempDirPath
protected void configureTempDir(org.apache.avalon.framework.configuration.Configuration conf)
throws org.apache.avalon.framework.configuration.ConfigurationException
Configures the temporary directory where harvested documents will be
stored in sub-directories. There will be one sub-directory per batch of
the harvest. This directory will be deleted after the harvest. This can be
changed with the keepHarvestedRecords configuration attribute.
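The directory lifecycle described here (one sub-directory per batch, removal after the harvest unless records are kept) could look roughly like the sketch below. The class BatchDirSketch, its method names, and the "batch-N" naming scheme are hypothetical choices made for the example; the actual naming and cleanup logic of the harvester are not documented in this page.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only; not part of SDX.
public class BatchDirSketch {

    /** Creates a sub-directory of tempDir for one batch of harvested records. */
    public static Path newBatchDir(Path tempDir, int batchNumber) {
        try {
            // "batch-N" is an invented naming scheme.
            return Files.createDirectories(tempDir.resolve("batch-" + batchNumber));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes the whole directory tree unless harvested records must be kept. */
    public static void cleanUp(Path tempDir, boolean keepHarvestedRecords) {
        if (keepHarvestedRecords || !Files.exists(tempDir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(tempDir)) {
            // Reverse order visits children before their parent directories.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("oai-harvest-sketch");
        System.out.println(newBatchDir(root, 1));
        System.out.println(newBatchDir(root, 2));
        cleanUp(root, false);
        System.out.println("still there: " + Files.exists(root));
    }
}
```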
Parameters:
Configuration -
Throws:
org.apache.avalon.framework.configuration.ConfigurationException
See Also:
keepHarvestedRecords, tempDirPath
protected void configureDatabase(org.apache.avalon.framework.configuration.Configuration configuration)
throws org.apache.avalon.framework.configuration.ConfigurationException
Throws:
org.apache.avalon.framework.configuration.ConfigurationException
protected void configureHarvestIDGenerator(org.apache.avalon.framework.configuration.Configuration configuration)
throws org.apache.avalon.framework.configuration.ConfigurationException
Throws:
org.apache.avalon.framework.configuration.ConfigurationException
protected java.lang.String getHarvesterId()
protected void configureAdminEmails(org.apache.avalon.framework.configuration.Configuration configuration)
throws org.apache.avalon.framework.configuration.ConfigurationException
Parameters:
configuration -
Throws:
org.apache.avalon.framework.configuration.ConfigurationException
protected void configureDataProviders(org.apache.avalon.framework.configuration.Configuration configuration)
throws org.apache.avalon.framework.configuration.ConfigurationException
Parameters:
configuration -
Throws:
org.apache.avalon.framework.configuration.ConfigurationException
See Also:
storedRequests
protected void configureUpdateTriggers(java.lang.String requestUrl,
org.apache.avalon.framework.configuration.Configuration updateConf)
throws org.apache.avalon.framework.configuration.ConfigurationException
Parameters:
requestUrl - The request url
updateConf - The configuration for updates
Throws:
org.apache.avalon.framework.configuration.ConfigurationException
See Also:
scheduler, storedRequests
protected void configureStoreRepositories(java.lang.String repoUrl,
org.apache.avalon.framework.configuration.Configuration oaiRepoConf)
throws org.apache.avalon.framework.configuration.ConfigurationException
Parameters:
repoUrl - The repository/data provider url
oaiRepoConf - The configuration
Throws:
org.apache.avalon.framework.configuration.ConfigurationException
public boolean checkGranularity(java.lang.String granularity)
throws org.apache.avalon.framework.configuration.ConfigurationException
Parameters:
granularity -
Throws:
org.apache.avalon.framework.configuration.ConfigurationException
protected void configurePipeline(org.apache.avalon.framework.configuration.Configuration configuration)
throws org.apache.avalon.framework.configuration.ConfigurationException
Parameters:
configuration -
Throws:
org.apache.avalon.framework.configuration.ConfigurationException
See Also:
pipe
protected java.io.File getNewTempDirBatch()
throws SDXException,
java.io.IOException
Throws:
SDXException
java.io.IOException
protected void deleteTempDirBatch()
protected void deleteTempDir()
protected void initTempDir()
throws SDXException,
java.io.IOException
Throws:
SDXException
java.io.IOException
protected java.lang.String getIsoDate()
protected void prepareRecordCapture()
throws org.xml.sax.SAXException
prepareRecordCapture in class AbstractOAIHarvester
Throws:
org.xml.sax.SAXException
protected void captureRecord()
throws java.lang.Exception
captureRecord in class AbstractOAIHarvester
Throws:
java.lang.Exception
protected void resetRecordCaptureFields(boolean deleteDoc)
Resets the class fields for record capture, possibly deleting the file underlying the current harvestDoc object
resetRecordCaptureFields in class AbstractOAIHarvester
Parameters:
deleteDoc - flag for deletion of actual file
protected void prepareRecordForDeletion()
prepareRecordForDeletion in class AbstractOAIHarvester
protected boolean isStartsIndexation()
protected boolean storeHarvestedData()
throws org.apache.cocoon.ProcessingException,
java.io.IOException,
SDXException,
org.xml.sax.SAXException
Reads the documents from tempDirBatch
and indexes them in the corresponding document
base; any marked deletions will be carried out
as well
storeHarvestedData in class AbstractOAIHarvester
Throws:
SDXException
org.xml.sax.SAXException
org.apache.cocoon.ProcessingException
java.io.IOException
See Also:
AbstractOAIHarvester.storeHarvestedData()
protected void deleteOAIDocuments()
throws java.io.IOException,
org.apache.cocoon.ProcessingException,
SDXException,
org.xml.sax.SAXException
Throws:
java.io.IOException
org.apache.cocoon.ProcessingException
SDXException
org.xml.sax.SAXException
protected void handleResumptionToken()
handleResumptionToken in class AbstractOAIHarvester
protected void prepareResourceFromUrlIdentifierCapture()
prepareResourceFromUrlIdentifierCapture in class AbstractOAIHarvester
See Also:
AbstractOAIHarvester.identifierName, AbstractOAIHarvester.currentMetadtaUrlIdentifier
protected void captureResourceFromUrlIdentifier()
captureResourceFromUrlIdentifier in class AbstractOAIHarvester
See Also:
AbstractOAIHarvester.currentMetadtaUrlIdentifier, AbstractOAIHarvester.identifierName
protected void resetAllFields()
resetAllFields in class AbstractOAIHarvester
protected void endHarvest()
protected IndexParameters getIndexParameters()
public void sendStoredHarvestingRequests()
throws org.xml.sax.SAXException
Specified by:
sendStoredHarvestingRequests in interface OAIHarvester
Throws:
org.xml.sax.SAXException
public void targetTriggered(java.lang.String triggerName)
Specified by:
targetTriggered in interface Target
Parameters:
triggerName -
public void startElement(java.lang.String s,
java.lang.String s1,
java.lang.String s2,
org.xml.sax.Attributes attributes)
throws org.xml.sax.SAXException
Description copied from class: AbstractSynchronizedXMLPipe
Receive notification of the beginning of an element.
Specified by:
startElement in interface org.xml.sax.ContentHandler
startElement in class AbstractOAIHarvester
Parameters:
s - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
s1 - The local name (without prefix), or the empty string if Namespace processing is not being performed.
s2 - The raw XML 1.0 name (with prefix), or the empty string if raw names are not available.
attributes - The attributes attached to the element. If there are no attributes, it shall be an empty Attributes object.
Throws:
org.xml.sax.SAXException
public void endElement(java.lang.String s,
java.lang.String s1,
java.lang.String s2)
throws org.xml.sax.SAXException
Description copied from class: AbstractSynchronizedXMLPipe
Receive notification of the end of an element.
Specified by:
endElement in interface org.xml.sax.ContentHandler
endElement in class AbstractOAIHarvester
Parameters:
s - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
s1 - The local name (without prefix), or the empty string if Namespace processing is not being performed.
s2 - The raw XML 1.0 name (with prefix), or the empty string if raw names are not available.
Throws:
org.xml.sax.SAXException
protected boolean shouldHarvestDocument()
shouldHarvestDocument in class AbstractOAIHarvester
protected void saveCriticalFields(boolean dataHarvested)
throws org.xml.sax.SAXException
saveCriticalFields in class AbstractOAIHarvester
Parameters:
dataHarvested -
Throws:
org.xml.sax.SAXException
protected java.lang.String generateNewHarvestId()
public void sendPastHarvestsSummary()
throws org.xml.sax.SAXException
Specified by:
sendPastHarvestsSummary in interface OAIHarvester
Throws:
org.xml.sax.SAXException
public java.util.Date lastUpdated()
public void purgePastHarvestsData()
Specified by:
purgePastHarvestsData in interface OAIHarvester
protected void storeFailedHarvestData(java.lang.Exception e)
storeFailedHarvestData in class AbstractOAIHarvester
Parameters:
e -
public void backup(SaveParameters save_config) throws SDXException
Specified by:
backup in interface Saveable
Throws:
SDXException
See Also:
Saveable.backup(fr.gouv.culture.sdx.utils.save.SaveParameters)
public void restore(SaveParameters save_config) throws SDXException
Specified by:
restore in interface Saveable
Throws:
SDXException
See Also:
Saveable.restore(fr.gouv.culture.sdx.utils.save.SaveParameters)
public void close()
close in class AbstractOAIHarvester

Copyright © 2000-2010 Ministere de la culture et de la communication / AJLSM. All Rights Reserved.