Developing MarkLogic Applications I

Developing MarkLogic Applications I - XQuery

This is a continuation of my notes from the MarkLogic training course Developing MarkLogic Applications I - XQuery.

Day 2: Deeper into MarkLogic and XQuery APIs

Using XQuery Functions and Operators

Different ways to structure code.

Function

Mechanism to modularize code
Recursive actions
Strong data typing

declare function local:daysleft() as xs:integer {
  (: subtracting two xs:date values yields an xs:dayTimeDuration :)
  let $duration := xs:date("2018-06-14") - fn:current-date()
  let $daysleft := fn:days-from-duration($duration)
  return if ($daysleft > 0)
    then $daysleft
    else 0
};

<p><span class='daysleft'>{local:daysleft()}</span></p>

(: counts the winner elements whose value matches any of the given team names :)
declare function local:countwins($team as xs:string+) as xs:integer {
  let $seq := //winner[. = $team]
  return fn:count($seq)
};

Library Modules

Modules are more flexible than local functions.

Main vs Library

Main Modules

All functions are scoped to the module itself
Executed directly

Library Modules

Not executed directly
Imported into a main module

Organization

Use folders for grouping, e.g. /project/modules/
Add a -lib suffix to the filename

Structure

Declare the module namespace
Add any other namespace declarations
Define the functions

Usage

In the main module prolog:

import the namespace defined in the library module
include the module's relative location

Example

xquery version "1.0-ml";
(: Library module: its functions must be declared in the module namespace (co) :)
module namespace co = "http://marklogic.com/mlu/world-leaders/common";
declare namespace wl = "http://marklogic.com/mlu/world-leaders";

(: Counts leaders whose first listed position has an end date of "present" :)
declare function co:in-office() as xs:string
{
    let $incount := fn:count(/wl:leader/wl:positions/wl:position[1]/wl:enddate/text()[. = "present"])
    return fn:concat(" (in office: ",  $incount, ")")
};

(: Main module: import the library, then call its function :)
import module namespace co = "http://marklogic.com/mlu/world-leaders/common" at "./modules/common-lib.xqy";
<div>{co:in-office()}</div>

Setting Up a Database and Application

MarkLogic Components

Database
Forest (unit of replication)
Stand (In memory / On Disk)
Universal Index
Application Server (HTTP / XDBC / ODBC / WebDAV)
Hosts | Groups
List Cache | Compressed Tree Cache | Expanded Tree Cache

Node/Host/Machine Types

E Node
evaluator node, e.g. an app server instance

D Node
data manager node, e.g. a database manager

Single Node Architecture

Has 4 levels

Application server
- Handles request / responses
- Defined on a port
- Evaluates code
  - REST –> HTTP
  - WebDAV
  - ODBC
  - XDBC
Database
- Transaction controller
- Logical Configuration
Memory Caches
- List Cache (db, in forest)
- Compressed Tree Cache (db, in forest)
- Expanded Tree Cache (app server)
- Triple Data and Triple Value Caches (db)
Forests
- Physical storage
- Attached to db
- Stands
  - Memory
  - Disk
  - Documents
  - Indexes (eg caches)
  - Compression
- Journal

On Windows, MarkLogic forests are stored by default in: %ProgramFiles%/MarkLogic/Data/Forests

Committing data to a forest

Transaction committed

Data committed to in-memory stand

Journal entry created

Record then written to on-disk stand
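
The commit sequence above can be pictured with a toy Python model. This is only an illustration that follows the order in these notes, not MarkLogic code: the `ToyForest` class, its method names and the flush threshold are all made up for the sketch.

```python
class ToyForest:
    """Toy model of a forest: an in-memory stand, a journal and on-disk stands."""

    def __init__(self, flush_threshold=2):
        self.in_memory_stand = {}   # uri -> document
        self.journal = []           # journal entries, kept for recovery
        self.on_disk_stands = []    # each flushed stand is a dict
        self.flush_threshold = flush_threshold  # hypothetical size limit

    def commit(self, uri, document):
        # 1. committed data lands in the in-memory stand
        self.in_memory_stand[uri] = document
        # 2. a journal entry is created
        self.journal.append(("insert", uri))
        # 3. once the in-memory stand is full, it is written out to disk
        if len(self.in_memory_stand) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # the in-memory stand becomes a new on-disk stand
        self.on_disk_stands.append(dict(self.in_memory_stand))
        self.in_memory_stand.clear()


forest = ToyForest()
forest.commit("/a.xml", "<a/>")
forest.commit("/b.xml", "<b/>")
```

After the second commit the in-memory stand is empty, one on-disk stand holds both documents, and the journal has two entries.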

Clustered Architecture

A minimum of 3 nodes is required for failover: 1 E-node, 2 D-nodes.

Group one or many machines in a cluster.

Cache sizes are managed at the group level, e.g. a larger Expanded Tree Cache for E-nodes.

The default (common, shared) databases can be shared across projects. These include:

Security (must be replicated)
Schemas
Triggers
Modules

Rebalancing

MarkLogic 7+ rebalances data between forests when forests are added.
Set an assignment policy and the rebalancer will move data between the forests.
Manage data based on its lifecycle with Tiered Storage.
- E.g. move data to different storage over time: load to SSD -> SAN -> NAS

Tiered Storage
Use partitions (groups of forests) with partition key definitions to move data to different places

Partition key
A piece of information in the document, e.g. a date.
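
A rough Python sketch of the idea, assuming a date is the partition key. The partition names and date ranges below are invented for illustration; real partitions are configured in MarkLogic, not in code like this.

```python
from datetime import date

# Hypothetical partitions: (name, inclusive lower bound, exclusive upper bound)
PARTITIONS = [
    ("ssd", date(2018, 1, 1), date(2019, 1, 1)),
    ("san", date(2016, 1, 1), date(2018, 1, 1)),
    ("nas", date(2000, 1, 1), date(2016, 1, 1)),
]


def assign_partition(partition_key):
    """Return the name of the partition whose range contains the key."""
    for name, lower, upper in PARTITIONS:
        if lower <= partition_key < upper:
            return name
    return None  # no partition covers this key
```

As a document's date falls into an older range, it belongs to a partition on slower storage.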

Security 101

Authentication occurs against the Security database.

Roles/Users in Security database.
Documents contain their own permission metadata

Role Based

Authentication
- Performed on the Application Server
- LDAP/Kerberos external authentication protocols
Database Level
Code Level
- execute permission says which user/role can run code.
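
To make the role-based idea concrete, here is a toy Python check. The users, roles, URI and the `can` helper are all hypothetical; it only mirrors the notes above (users hold roles, documents carry their own permissions).

```python
# Hypothetical users and the roles they hold
USER_ROLES = {"alice": {"editor", "reader"}, "bob": {"reader"}}

# Each document carries its own permissions: capability -> roles allowed
DOC_PERMISSIONS = {
    "/leaders/1.xml": {"read": {"reader"}, "update": {"editor"}},
}


def can(user, capability, uri):
    """True if any of the user's roles grants the capability on the document."""
    roles = USER_ROLES.get(user, set())
    allowed = DOC_PERMISSIONS.get(uri, {}).get(capability, set())
    return bool(roles & allowed)
```

Here `can("bob", "read", "/leaders/1.xml")` is true, but `can("bob", "update", "/leaders/1.xml")` is false because bob has no role with the update capability.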

Architecture Summary

Interface
- Java, REST, XQuery, ODBC
Evaluation Layer
1. Evaluator
  - XSLT, XPath, XQuery, SQL
2. Cache
3. Broadcaster, Aggregator
Data Layer
1. Transaction Controller
2. Cache, Transaction Journal
3. Indexes
  - Value, Structure, Text, Scalar, Metadata, Security, Geospatial, Reverse
4. Compressed Storage
  - XML, JSON, Binary, Text

Loading Data

Document Types

XML/JSON
Text
Binary

Document Metadata

Collections. Documents can be stored in collections.
Permissions. By default only admin has access
Properties. Common with binary documents
Quality. Affects the relevance ranking

Loading data methods

MLCP (MarkLogic Content Pump)
REST API (curl)
Java API
XQuery/Javascript APIs
- xdmp:document-insert(uri, <xml/>)
- xdmp:document-load(file-on-disk, <options><uri/><collections/><permissions/><repair/></options>)
- xdmp:filesystem-directory (bulk loading helper)
Others
- load() (higher level, information studio)
- XCC
- WebDAV

MarkLogic Content Pump

3 functions, not just import:
1. Loading data
2. Exporting content
3. Copying databases

2 Modes

Local
Distributed (parallelized across a cluster)

REM There are other params in the docs; mlcp can transform some data as well
mlcp.bat import -host localhost ^
  -port 8012 ^
  -username admin -password admin ^
  -input_file_path C:/mlcp-import/content ^
  -mode local ^
  -input_file_pattern "twitter.*\.xml" ^
  -output_uri_replace "C:/mlcp-import/content, 'socialmedia'"
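
The effect of the -output_uri_replace option in the command above can be approximated in Python. This is a sketch under assumptions: it treats the option as a list of (regex, replacement) pairs applied to each document URI, and the function name and the file name `twitter1.xml` are made up. Check the mlcp docs for the exact URI that mlcp generates on Windows.

```python
import re


def output_uri_replace(uri, pairs):
    """Apply each (regex, replacement) pair to the URI in turn."""
    for pattern, replacement in pairs:
        uri = re.sub(pattern, replacement, uri)
    return uri


# Hypothetical input file picked up from the import folder
new_uri = output_uri_replace(
    "C:/mlcp-import/content/twitter1.xml",
    [("C:/mlcp-import/content", "socialmedia")],
)
```

The idea is that the local folder prefix is swapped for a shorter, database-friendly prefix.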

Developing Search

The Searching Algorithm

An XPath expression returns results in document order, which is arbitrary,
whereas search results are returned in relevance order.
Impacting Relevancy Ranked Results

Relevance Order Algorithm
```
Score = Log ( TermFrequency ) * InverseDocumentFrequency
```
- Term Frequency is normalized to total words to yield term density.
- Inverse Document Frequency
  - 1/DocumentFrequency
  - How uncommon the term is in the database (higher is rarer)

Impacting Relevance

(Document) Quality is:
- A factor to increase a document's relevance score relative to other matching docs
- Set on ingestion
- Default is 0
- log(tf) * idf + (QueryWeight * DocQuality)
Query Weight
- Component that can be set at run time.
- Default is 1
Word Query
- Database configuration
- Weight different properties, eg title over description

cts:score()  (: number: higher means more relevant :)
cts:confidence()  (: 0.0 <--> 1.0: how relevant the document is compared to other documents :)
cts:fitness()  (: 0.0 <--> 1.0: how well the returned document satisfied the query issued (ignoring other docs) :)

Out of the Box Searching Features

Search Phrasing
Stemming

Filtered vs Unfiltered Search

Filtered
- 2 step process
  1. Get candidates from the indexes
  2. Confirm or reject each candidate by checking the document content for a match
- XQuery (cts, Search API) default
- Focus on accuracy
Unfiltered
- 1 step process, just looks at candidates from the index.
- Java/REST API default

Unfiltered searches can be fast & accurate when indexed properly.
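
The difference can be shown with a toy Python model, assuming a simple word index with no phrase information. The documents, index and function names are invented; this is not how MarkLogic is implemented, just the two-step idea.

```python
DOCS = {
    "/1.xml": "the quick brown fox",
    "/2.xml": "brown bread, quick to bake",
    "/3.xml": "a quick brown fox jumps",
}

# Word index: term -> set of document URIs (no phrase/position information)
INDEX = {}
for uri, text in DOCS.items():
    for word in text.replace(",", " ").split():
        INDEX.setdefault(word, set()).add(uri)


def unfiltered_phrase_search(phrase):
    """Step 1 only: intersect the term lists of the words in the phrase."""
    term_lists = [INDEX.get(word, set()) for word in phrase.split()]
    return set.intersection(*term_lists) if term_lists else set()


def filtered_phrase_search(phrase):
    """Step 1 plus step 2: open each candidate and confirm the phrase is there."""
    return {uri for uri in unfiltered_phrase_search(phrase) if phrase in DOCS[uri]}
```

For the phrase "quick brown" the unfiltered search returns all three documents (each contains both words), while the filtered search drops /2.xml because the words are not adjacent there. An index that also recorded phrases would make the unfiltered candidates exact.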

Constraints

Types
- Values
- Collection
- Range
- Properties
- Geospatial

Search Methods

Language APIs
REST API
Search API (search namespace)
Built-in APIs (cts namespace)

Search API Response

returns snippeted results list
automatically paginated

import module namespace search = "http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";

(: transform-results apply="raw" returns whole documents instead of snippets :)
let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <transform-results apply="raw" />
  </options>
return search:search("q", $options)

CTS Search
- Constructors (element-value, word-query…)
- Composability (and-query, or-query…)
```
cts:word-query()
cts:element-word-query()
cts:or-query()
```
Query Serialization
- A search can be modeled as a document
- Can be stored/searched
- Can reverse query (check a document against stored queries on ingest)
  - e.g. save a query that triggers an alert when a new document matches it.
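
A toy Python sketch of the reverse query idea: the queries are stored, and each new document is checked against them. The stored queries here are just sets of required words (a stand-in for a serialized and-query); the names are hypothetical.

```python
# Stored queries: name -> set of words that must all appear (a toy "and-query")
STORED_QUERIES = {
    "election-alert": {"election", "president"},
    "sports-alert": {"football"},
}


def matching_queries(document_text):
    """Return the names of stored queries that the new document satisfies."""
    words = set(document_text.lower().split())
    return sorted(
        name for name, required in STORED_QUERIES.items() if required <= words
    )
```

On ingest, `matching_queries("The president called an election")` would return only the election alert, which is the point at which an alert could be raised.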

Indexing

Turn on the indexes that support the queries users want.

Index Concepts

Term list index

Map values –> Documents

Filtering
Inverted Index
Stemming
Phrase Index
- AND queries - term list intersections
- NOT queries - term list subtractions
- OR queries - term list unions
Proximity Index
Structure (speeds up xpath)
- Parent child
- Element value

Under the hood it's all hashes of the term list key.
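
The AND / OR / NOT resolution above maps directly onto set operations. A minimal Python sketch, with made-up terms and document ids:

```python
# Term lists: each term maps to the set of document ids containing it
TERM_LISTS = {
    "beatles": {1, 2, 3},
    "lennon": {2, 3, 4},
    "stones": {3, 5},
}


def and_query(a, b):
    return TERM_LISTS[a] & TERM_LISTS[b]  # intersection


def or_query(a, b):
    return TERM_LISTS[a] | TERM_LISTS[b]  # union


def not_query(a, b):
    return TERM_LISTS[a] - TERM_LISTS[b]  # subtraction: a but not b
```

For example, "beatles AND lennon" resolves to documents {2, 3} without opening any document.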

Range Indexes

Value (typed) based indexes. Map Values <–> Documents. Loaded into memory on startup.

Values, not textual types
Faster Range Queries
Fast Sorting
Fast Value Extractions
Faceting
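
Why a sorted, typed index makes range queries and sorting cheap can be sketched in Python with a binary search. The values and document ids are invented, and this says nothing about MarkLogic's actual storage format.

```python
from bisect import bisect_left, bisect_right

# Toy range index: (value, document id) pairs kept sorted by value
RANGE_INDEX = sorted([(1999, "d3"), (2005, "d1"), (2012, "d2"), (2018, "d4")])
VALUES = [value for value, _ in RANGE_INDEX]


def range_query(low, high):
    """Document ids whose value lies in [low, high], already in value order."""
    start = bisect_left(VALUES, low)
    end = bisect_right(VALUES, high)
    return [doc for _, doc in RANGE_INDEX[start:end]]
```

The matching slice is found without scanning every entry, and the results come back already sorted by value.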

Range vs Term List

A term list is tokenized and split on punctuation, e.g. "1.0" is indexed as "0", "1", "10"

String Range Indexes

Have a collation, e.g. "The Beatles" can be collated to be equal to "the beatles".
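
A rough Python analogy for a case-insensitive collation: compare and sort on a normalized key rather than on the raw string. Real collations are richer than case folding; `collation_key` is a made-up helper.

```python
def collation_key(value):
    """Toy case-insensitive collation: compare on the case-folded string."""
    return value.casefold()


same = collation_key("The Beatles") == collation_key("the beatles")
ordered = sorted(["the beatles", "ABBA", "The Who"], key=collation_key)
```

Under this key the two spellings compare equal, and sorting ignores case.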

Path Range Indexes

More control over what the range index should contain. Only create indexes on parts of the path.

Word query

Technically an index configuration. Allows setting up word weighting.

Field

Collapse pieces of documents into a single field for searching.
Settings that are normally global database options can be set on individual fields (i.e. they don't have to be set on the whole database).
- Performer
- <artist>|<singer>|<group>|<band>
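
The Performer example can be sketched in Python: several element names feed one logical field, and a search on the field covers them all. Documents are modelled as lists of (element, value) pairs, and the element names and helper functions are assumptions for illustration.

```python
# Hypothetical field definition: several element names feed one "performer" field
PERFORMER_FIELD = {"artist", "singer", "group", "band"}


def field_values(document):
    """Collect the values of every element that belongs to the field."""
    return [value for element, value in document if element in PERFORMER_FIELD]


def field_search(documents, word):
    """URIs of documents where the word appears in any element of the field."""
    word = word.lower()
    return sorted(
        uri
        for uri, document in documents.items()
        if any(word in value.lower().split() for value in field_values(document))
    )


SONGS = {
    "/help.xml": [("title", "Help!"), ("band", "The Beatles")],
    "/imagine.xml": [("title", "Imagine"), ("artist", "John Lennon")],
    "/beatles-book.xml": [("title", "The Beatles Story"), ("author", "Someone")],
}
```

Searching the field for "beatles" matches /help.xml only: the book mentions the word in its title, which is not part of the field.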

Tuning functions

fn:count()
xdmp:estimate()  (: based on index only :)
(: if estimate is the same as fn:count, the indexes are good and the query can remain unfiltered :)
xdmp:query-meters()  (: performance stats, includes cache details :)
xdmp:plan()  (: shows the query plan :)

Summary

Approaches to query resolution
1. Look at the query
2. Decide what indexes can help
3. Use indexes to narrow down the result set
4. Filter the result set to confirm the match
Tradeoffs
- More indexes
  - longer ingestion
  - more size required
- Fewer indexes
  - more filtering
  - slower search
- Range indexes cost RAM