Developing MarkLogic Applications I - XQuery

This is a continuation of my notes from the MarkLogic training course Developing MarkLogic Applications I - XQuery.

Day 2: Deeper into MarkLogic and XQuery APIs

Using XQuery Functions and Operators

Different ways to structure code.

Function

  • Mechanism to modularize code
  • Recursive actions
  • Strong data typing
(: Days remaining until a fixed date, never negative :)
declare function local:daysleft() as xs:integer {
  let $duration := xs:date("2018-06-14") - fn:current-date()
  let $daysleft := fn:days-from-duration($duration)
  return if ($daysleft > 0)
    then $daysleft
    else 0
};
<p><span class='daysleft'>{local:daysleft()}</span></p>
(: Count the winner elements that match any of the given team names :)
declare function local:countwins($team as xs:string+) as xs:integer {
  let $seq := //winner[. = $team]
  return fn:count($seq)
};
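
The "recursive actions" point can be illustrated with a small sketch. The function name local:sum and the sample input are assumptions made for illustration, not part of the course material.

```xquery
xquery version "1.0-ml";

(: Recursively sum a sequence of integers: the head plus the sum of the tail :)
declare function local:sum($values as xs:integer*) as xs:integer {
  if (fn:empty($values))
  then 0
  else $values[1] + local:sum(fn:subsequence($values, 2))
};

local:sum((1, 2, 3, 4))
```

The typed signature (xs:integer* in, xs:integer out) is where the strong data typing shows up: passing a string raises a type error instead of silently producing a wrong result.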

Library Modules

Modules are more flexible than local functions.

Main vs Library
Main Modules
  • All functions are scoped directly to the module
  • Executed directly
Library Modules
  • Are not executed directly
  • Imported into a main module
Organization
  • Use folders for grouping, e.g. /project/modules/
  • Add a -lib suffix to the filename
Structure
  1. Declare the module namespace
  2. Add declarations for any other namespaces the functions use
  3. Define the functions
Usage

In the main module prolog,

  • import the namespace defined in the library module
  • include the module's relative location
Example
(: Library module: ./modules/common-lib.xqy :)
xquery version "1.0-ml";
module namespace co = "http://marklogic.com/mlu/world-leaders/common";
declare namespace wl = "http://marklogic.com/mlu/world-leaders";
(: Count the leaders whose first listed position has an end date of "present" :)
declare function co:in-office() as xs:string
{
    let $incount := fn:count(/wl:leader/wl:positions/wl:position[1]/wl:enddate/text()[. = "present"])
    return fn:concat(" (in office: ", $incount, ")")
};
(: Main module: import the library, then call its function :)
import module namespace co = "http://marklogic.com/mlu/world-leaders/common" at "./modules/common-lib.xqy";
<div>{co:in-office()}</div>

Setting Up a Database and Application

MarkLogic Components
  • Database
  • Forest (unit of replication)
  • Stand (In memory / On Disk)
  • Universal Index
  • Application Server (HTTP / XDBC / ODBC / WebDAV)
  • Hosts | Groups
  • List Cache | Compressed Tree Cache | Expanded Tree Cache
  • Node/Host/Machine Types
    E Node
    evaluator node - e.g. app server instance
    D Node
    data manager node - e.g. database manager
Single Node Architecture

A single node has 4 levels:

  1. Application server
    • Handles requests / responses
    • Defined on a port
    • Evaluates code
      • HTTP (REST)
      • XDBC
      • ODBC
      • WebDAV
  2. Database
    • Transaction controller
    • Logical Configuration
  3. Memory Caches
    • List Cache (db, in forest)
    • Compressed Tree Cache (db, in forest)
    • Expanded Tree Cache (app server)
    • Triple Data and Triple Value Caches (db)
  4. Forests
    • Physical storage
    • Attached to db
    • Stands
      • Memory
      • Disk
      • Documents
      • Indexes (eg caches)
      • Compression
    • Journal

By default, MarkLogic forests are stored in: %ProgramFiles%/MarkLogic/Data/Forests

Committing data to a forest

  1. Transaction committed
    1. Data committed to in-memory stand
    2. Journal entry created
  2. Record then written to on-disk stand
Clustered Architecture
  • Minimum of 3 nodes required for failover: 1 E-node, 2 D-nodes

Group one or many machines in a cluster.

Cache sizes are managed at the group level, e.g. a larger Expanded Tree Cache for E-nodes.

The default (common, shared) databases can be shared across projects. These include:

  • Security (must be replicated)
  • Schemas
  • Triggers
  • Modules
Rebalancing
  • MarkLogic 7+ rebalances data between forests when forests are added.
  • Set an assignment policy and the rebalancer will move data between different forests.
  • Manage data based on its lifecycle with Tiered Storage.
    • E.g. move data to different storage over time: load to SSD->SAN->NAS
Tiered Storage
Use partitions (groups of forests) with partition key definitions to move data to different places
Partition key
is just some piece of information, e.g. a date in a document.
Security 101

Authentication occurs against the Security database.

  • Roles/Users in Security database.
  • Documents contain their own permission metadata

Role Based

  • Authentication
    • Performed on the Application Server
    • LDAP/Kerberos external authentication protocol
  • Database Level
  • Code Level
    • The execute permission says which user/role can run code.
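
Since documents carry their own permission metadata, permissions can be attached at insert time. The following is a minimal sketch; the URI and the role names "app-reader" and "app-writer" are hypothetical and would have to exist in the Security database.

```xquery
xquery version "1.0-ml";

(: "app-reader" and "app-writer" are hypothetical roles; they must already
   exist in the Security database or the insert fails :)
xdmp:document-insert(
  "/secure/example.xml",
  <note>restricted</note>,
  (
    xdmp:permission("app-reader", "read"),
    xdmp:permission("app-writer", "update")
  )
)
```

A user who holds neither role (and is not an admin) would not see this document in query results at all, which is how role based security reaches down to the document level.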
Architecture Summary
  1. Interface
    • Java, REST, XQuery, ODBC
  2. Evaluation Layer
    1. Evaluator
      • XSLT, XPath, XQuery, SQL
    2. Cache
    3. Broadcaster, Aggregator
  3. Data Layer
    1. Transaction Controller
    2. Cache, Transaction Journal
    3. Indexes
      • Value, Structure, Text, Scalar, Metadata, Security, Geospatial, Reverse
    4. Compressed Storage
      • XML, JSON, Binary, Text

Loading Data

Document Types
  • XML/JSON
  • Text
  • Binary
Document Metadata
  1. Collections. Documents can be stored in collections.
  2. Permissions. By default, the admin role is required
  3. Properties. Common with binary documents
  4. Quality. Affects the relevance ranking
Loading data methods
  • MLCP (MarkLogic Content Pump)
  • REST API (curl)
  • Java API
  • XQuery/Javascript APIs
    • xdmp:document-insert(uri, <xml>)
    • xdmp:document-load(file-on-disk, <options><uri/><collections/><permissions/><repair/></options>)
    • xdmp:filesystem-directory (bulk loading helper)
  • Others
    • load() (higher level, information studio)
    • XCC
    • WebDAV
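
As a rough sketch of the XQuery loading functions listed above: one document is inserted directly, then every file in a folder is loaded. The folder path, the URIs and the "socialmedia" collection are assumptions chosen to match the mlcp example, and the folder must be readable by the MarkLogic host.

```xquery
xquery version "1.0-ml";
declare namespace dir = "http://marklogic.com/xdmp/directory";

(: Hypothetical source folder on the MarkLogic host :)
let $folder := "C:/mlcp-import/content"
return (
  (: Insert a single in-memory node with default permissions and one collection :)
  xdmp:document-insert(
    "/socialmedia/manual/example.xml",
    <tweet><text>Hello MarkLogic</text></tweet>,
    xdmp:default-permissions(),
    "socialmedia"
  ),
  (: Bulk load: list the folder, then load each file from disk :)
  for $entry in xdmp:filesystem-directory($folder)/dir:entry[dir:type = "file"]
  return
    xdmp:document-load(
      $entry/dir:pathname/fn:string(),
      <options xmlns="xdmp:document-load">
        <uri>{fn:concat("/socialmedia/", $entry/dir:filename)}</uri>
        <collections>
          <collection>socialmedia</collection>
        </collections>
        <repair>full</repair>
      </options>
    )
)
```

For anything beyond a handful of files, mlcp is likely the better fit, because this sketch loads everything in a single transaction.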
MarkLogic Content Pump
  • 3 functions, not just import:
    1. Loading data
    2. Exporting content
    3. Copying databases
  • 2 Modes
    1. Local
    2. Distributed (parallelized across a cluster)
    REM There are other params in the docs; mlcp can transform some data as well
    mlcp.bat import -host localhost ^
      -port 8012 ^
      -username admin -password admin ^
      -input_file_path C:/mlcp-import/content ^
      -mode local ^
      -input_file_pattern "twitter.*\.xml" ^
      -output_uri_replace "C:/mlcp-import/content, 'socialmedia'"
    

Developing Search

The Search Algorithm
  • An XPath expression returns results in document order, which is arbitrary, versus
  • search results, which are returned in relevance order
  • Impacting Relevancy Ranked Results
  • Relevance Order Algorithm
    Score = Log ( TermFrequency ) * InverseDocumentFrequency
    
    • Term Frequency is normalized to total words to yield term density.
    • Inverse Document Frequency
      • 1/DocumentFrequency
      • How uncommon the term is in the database (higher is rarer)
Impacting Relevance
  • (Document) Quality is:
    • A factor to increase a document's relevance score relative to other matching docs
    • Set on ingestion
    • Default is 0
    • log(tf) * idf + (QueryWeight * DocQuality)
  • Query Weight
    • Component that can be set at run time.
    • Default is 1
  • Word Query
    • Database configuration
    • Weight different properties, e.g. title over description
cts:score()  (: number: higher means more relevant :)
cts:confidence()  (: 0.0 <--> 1.0: how relevant compared to other documents :)
cts:fitness()  (: 0.0 <--> 1.0: how well the returned document satisfied the query issued (ignoring other docs) :)
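
A minimal sketch of how these three functions can be read off search results. The search term "president" and the query weight of 2.0 are arbitrary assumptions for illustration.

```xquery
xquery version "1.0-ml";

(: "president" is an arbitrary term; 2.0 is an assumed query weight that
   raises this query's contribution to the score (the default weight is 1) :)
let $query := cts:word-query("president", (), 2.0)
for $doc in cts:search(fn:doc(), $query)[1 to 5]
return
  <result uri="{xdmp:node-uri($doc)}"
          score="{cts:score($doc)}"
          confidence="{cts:confidence($doc)}"
          fitness="{cts:fitness($doc)}"/>
```

Document quality is the stored counterpart to the run-time weight. It can be passed when the document is inserted or changed later with xdmp:document-set-quality.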
Out of the Box Searching Features
  • Search Phrasing
  • Stemming
Filtered vs Unfiltered Search
  • Filtered
    • 2 step process
      1. Get candidates based on the indexes
      2. Confirm or reject each candidate by checking the content of the document for a match
    • XQuery (cts, search api) default
    • Focus on accuracy
  • Unfiltered
    • 1 step process, just looks at candidates from the index.
    • Java/REST API default

Unfiltered searches can be fast & accurate when indexed properly.
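
The difference can be seen by running the same query both ways with cts:search, which accepts "filtered" or "unfiltered" as an option. This is only a sketch, and the search term is an arbitrary assumption.

```xquery
xquery version "1.0-ml";

let $query := cts:word-query("president")
return (
  (: Filtered (the cts:search default): index candidates are opened and verified :)
  fn:count(cts:search(fn:doc(), $query, "filtered")),
  (: Unfiltered: the candidates from the indexes are returned as-is :)
  fn:count(cts:search(fn:doc(), $query, "unfiltered"))
)
```

If the two counts differ, the indexes alone cannot fully resolve the query, and the unfiltered results include false positives.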

Constraints
  • Types
    • Values
    • Collection
    • Range
    • Properties
    • Geospatial
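
In the Search API, constraints are declared in the options node and then used as prefixes in the query text. The sketch below assumes a hypothetical country element in the world-leaders namespace and a "socialmedia" collection; neither comes from the course data.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <!-- Value constraint on a hypothetical element -->
    <constraint name="country">
      <value>
        <element ns="http://marklogic.com/mlu/world-leaders" name="country"/>
      </value>
    </constraint>
    <!-- Collection constraint; facet="false" avoids needing the collection lexicon -->
    <constraint name="coll">
      <collection prefix="" facet="false"/>
    </constraint>
  </options>
return search:search("country:France coll:socialmedia", $options)
```

Range and geospatial constraints follow the same pattern, but they need the matching range or geospatial index to be configured on the database first.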
Search Methods
  1. Language APIs
  2. REST API
  3. Search API (search)
  4. Built-in APIs (cts namespace)
  • Search API Response
    • returns snippeted results list
    • automatically paginated
    import module namespace search = "http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";
    
    (: Return whole matching documents instead of snippets :)
    let $options :=
      <options xmlns="http://marklogic.com/appservices/search">
        <transform-results apply="raw" />
      </options>
    return search:search("q", $options)
    
  • CTS Search
    • Constructors (element-value, word-query…)
    • Composability (and-query, or-query…)
    cts:word-query()
    cts:element-word-query()
    cts:or-query()
    
  • Query Serialization
    • Search can be modeled as a document
    • Can be stored/searched
    • Can reverse query (check a document against stored queries on ingest)
      • e.g. save a query that triggers an alert when a new document matches it.
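
Composability and serialization can be sketched together: build a query from constructors, turn it into XML, then rebuild and run it. The wl:country element and the alert wrapper element are hypothetical names used only for illustration.

```xquery
xquery version "1.0-ml";
declare namespace wl = "http://marklogic.com/mlu/world-leaders";

(: Compose constructors: "president" anywhere AND "France" inside the
   hypothetical wl:country element :)
let $query :=
  cts:and-query((
    cts:word-query("president"),
    cts:element-word-query(xs:QName("wl:country"), "France")
  ))
(: Serialize: a cts:query placed inside an element becomes plain XML :)
let $serialized := <alert>{$query}</alert>
(: Rebuild a cts:query from the XML and run it :)
let $restored := cts:query($serialized/*)
return cts:search(fn:doc(), $restored)
```

If the serialized form is stored as a document, cts:reverse-query can be given a newly ingested node to find the stored queries that match it, which is the basis for alerting.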

Indexing

Turn on the indexes needed to support the searches users want.

Index Concepts

Term list index

Map values –> Documents

  • Filtering
  • Inverted Index
  • Stemming
  • Phrase Index
    • AND queries - term list intersections
    • NOT queries - term list subtractions
    • OR queries - term list unions
  • Proximity Index
  • Structure (speeds up xpath)
    • Parent child
    • Element value

Under the hood it's all hashes of the term list key.
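
The set operations on term lists can be mimicked with plain XQuery sequences. This is a conceptual sketch only: the two "term lists" are hand-written sequences of made-up URIs, not how MarkLogic stores its indexes.

```xquery
xquery version "1.0-ml";

(: Hypothetical term lists: the URIs of documents containing each term :)
let $cat := ("/a.xml", "/b.xml", "/c.xml")
let $dog := ("/b.xml", "/c.xml", "/d.xml")
return (
  (: AND: intersection, documents on both lists :)
  <and>{$cat[. = $dog]}</and>,
  (: OR: union, documents on either list :)
  <or>{fn:distinct-values(($cat, $dog))}</or>,
  (: NOT: subtraction, documents with "cat" but without "dog" :)
  <not>{$cat[fn:not(. = $dog)]}</not>
)
```

Here the AND result holds /b.xml and /c.xml, and the NOT result holds only /a.xml.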

Range Indexes

Value (typed) based indexes. Map values <-> documents. Loaded into memory on startup.

  • Values, not textual types
  • Faster Range Queries
  • Fast Sorting
  • Fast Value Extractions
  • Faceting
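
A sketch of what a range index enables, assuming an xs:date element range index has been configured on a hypothetical wl:startdate element. Without that index both calls raise an error.

```xquery
xquery version "1.0-ml";
declare namespace wl = "http://marklogic.com/mlu/world-leaders";

(: Assumes an xs:date element range index on wl:startdate (hypothetical element) :)
let $since-2000 :=
  cts:element-range-query(xs:QName("wl:startdate"), ">=", xs:date("2000-01-01"))
return (
  (: Range query resolved from the index :)
  xdmp:estimate(cts:search(fn:doc(), $since-2000)),
  (: Value extraction with counts, sorted descending: the basis for facets :)
  for $value in cts:element-values(xs:QName("wl:startdate"), (), ("descending"), $since-2000)
  return <facet value="{$value}" count="{cts:frequency($value)}"/>
)
```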
Range vs Term List

Term lists are tokenized and split on punctuation, e.g. "1.0" is indexed as "0", "1", "10"

String Range Indexes

Have collation. eg. "The Beatles" can be collated to be equal to "the beatles".

Path Range Indexes

More control over what the range index should contain. Only create indexes on parts of the path.

Word query

Technically an index configuration. Allows setting up word weighting.

Field

  • Collapse pieces of documents into a single field for searching.
  • Options normally set globally on the database can be set on individual fields (i.e. they don't have to be set on the whole database).
    • E.g. a Performer field combining
    • <artist>|<singer>|<group>|<band>
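
Once a field exists, it can be queried as a single unit. The sketch assumes a field named "performer" has been configured on the database to include the elements above; the field name and search term are illustrative.

```xquery
xquery version "1.0-ml";

(: Assumes a field named "performer" has been configured on the database to
   include the artist, singer, group and band elements :)
cts:search(fn:doc(), cts:field-word-query("performer", "beatles"))
```

Without the field, the same search would need an or-query with one element-word-query per element.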

Tuning functions

fn:count()
xdmp:estimate()  (: based on index only :)
(: if the estimate is the same as fn:count, the indexes are good and the query can remain unfiltered :)
xdmp:query-meters()  (: performance stats, includes cache details :)
xdmp:plan()  (: shows the query plan :)
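
A sketch of how these functions can be combined to check one query; xdmp:plan is the built-in that reports the query plan. The search term is an arbitrary assumption, and the tuning wrapper element is just a convenient way to return everything at once.

```xquery
xquery version "1.0-ml";

let $query := cts:word-query("president")
(: Index-only answer, no documents opened :)
let $estimate := xdmp:estimate(cts:search(fn:doc(), $query))
(: Filtered answer, every candidate verified :)
let $count := fn:count(cts:search(fn:doc(), $query))
return
  <tuning>
    <estimate>{$estimate}</estimate>
    <count>{$count}</count>
    <index-resolvable>{$estimate eq $count}</index-resolvable>
    {xdmp:query-meters()}
    {xdmp:plan(cts:search(fn:doc(), $query))}
  </tuning>
```

When index-resolvable comes back false, the plan output is the place to look for which part of the query the indexes could not cover.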

Summary

  • Approaches to query resolution
    1. Look at the query
    2. Decide what indexes can help
    3. Use indexes to narrow down the result set
    4. Filter the result set to confirm the match
  • Tradeoffs
    • More indexes
      • longer ingestion
      • more size required
    • Fewer indexes
      • more filtering
      • slower search
    • Range indexes cost RAM

Day 3: Working In MarkLogic

Working with Indexes

Implementing Geospatial Search

Snippets, Highlighting, Sorting and Pagination

Creating Faceted Navigation

Updating Content and Understanding Transactions

Day 4: External Access

Setting up Application Security

Creating an Advanced Search Interface

Using the REST API

Working with Semantic Data

A: Accessing Log Files

B: Information Studio

C: Application Builder