Search using Solr query language

Data WA

Updated 13 January 2022 10:10

The Data WA catalogue is powered by CKAN, an open source data portal platform. CKAN uses Apache Solr as its search engine.

Basic CKAN search allows limited use of Solr query language. To get the full benefit of Solr we've implemented an advanced Use query language to search option.

To access this option:

Select the Data heading from the top of any page.
Tick the Advanced search checkbox.
Tick the Use query language to search checkbox.

Screenshot showing both the Advanced search checkbox and the Use query language to search checkbox are ticked.

Solr's query syntax helps you find specific datasets among the thousands on Data WA. Basic queries are simple, and you can build more complex ones using boolean operators, which we'll cover later. For now, we will start from the very beginning.

Getting started with Solr

To perform a free text search, simply enter a text string. For example, the following query will search all fields for the word fishing.

fishing

➜ Try this query

Of course, you can search by multiple words. Solr has a complicated analyser system, but we only need to know the basics. First of all, by default Solr will break the query at whitespace, so the query park marine will be split into two “tokens”: park and marine.

park marine

➜ Try this query

Datasets containing both words appear at the top of the results, followed by related matches. We will talk about priorities later.

To search for multiple words in a particular sequence, wrap them in double quotes " ". This is called a phrase.

"active schools"

➜ Try this query

To search within a specific field, use the following syntax.

title:marine

➜ Try this query

You can specify multiple fields in a search query.

title:marine notes:republished attributes

➜ Try this query

You can combine word and phrase searches too. Here we are searching for the word marine in title and the phrase spatial cadastral database in description.

title:marine notes:"spatial cadastral database"

➜ Try this query

When we combine fields in this way, there is an implicit AND operator between each field because AND is the default operator in CKAN.
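The same kind of query can also be assembled in a script. The sketch below is only an illustration: it assumes the catalogue exposes CKAN's standard package_search action under the same base URL as the rest of the CKAN API, and the helper names field_clause and build_search_url are made up for this example. It builds the query text and a URL but does not send any request.

```python
from urllib.parse import urlencode

# Assumption: CKAN's standard package_search action is available at this URL.
SEARCH_URL = "https://catalogue.data.wa.gov.au/api/action/package_search"


def field_clause(field, value):
    """Return field:value, quoting the value as a phrase if it contains whitespace."""
    if any(ch.isspace() for ch in value):
        value = f'"{value}"'
    return f"{field}:{value}"


def build_search_url(clauses, rows=10):
    """Join clauses with spaces (operator left implicit) and URL-encode the result."""
    query = " ".join(clauses)
    return f"{SEARCH_URL}?{urlencode({'q': query, 'rows': rows})}"


clauses = [
    field_clause("title", "marine"),
    field_clause("notes", "spatial cadastral database"),
]
print(" ".join(clauses))
print(build_search_url(clauses))
```

Quoting only values that contain whitespace keeps single words as plain terms and turns multi-word values into phrases, which mirrors the two styles shown above.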

Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem or root form. It allows you to make your search return more relevant results.

A stemmer in Solr is basically a set of mapping rules that maps the various forms of a word back to the base, or stem, word from which they derive. For example, in English the words "fishes" and "fishing" are both forms of the stem word "fish". The stemmer will replace both of these terms with "fish", which is what will be indexed.

So the query fishing will find datasets with the word fish mentioned. You don’t have to write something special to make it work.

➜ Try this query
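To get a feel for what a stemmer does, here is a deliberately tiny sketch. It is not the stemmer Solr uses; the three suffix rules and the function name toy_stem are assumptions chosen only so that the example words from this section reduce to the same stem.

```python
def toy_stem(word):
    """Strip a few common English suffixes. A toy rule set, not Solr's stemmer."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Only strip the suffix when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for form in ("fish", "fishes", "fishing"):
    print(form, "->", toy_stem(form))
```

Because the indexed text and the query text both go through the same reduction, a query for any of these forms can match a dataset that mentions any other.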

How do I find out the name of a field?

CKAN provides an API where you can grab metadata related to a particular dataset. For example, we have a link for a dataset—https://catalogue.data.wa.gov.au/dataset/fish-habitat-protection-areas.

In this link, we are using the dataset’s name (the part after dataset/) to build a human-readable URL. We can use this name to fetch metadata about the dataset via the CKAN API. The API call we are making is https://catalogue.data.wa.gov.au/api/action/package_show?id=, where id is the id or name of the dataset. They are interchangeable, so you can use either with this API call.

To check the metadata of our example dataset, append its name to that URL: https://catalogue.data.wa.gov.au/api/action/package_show?id=fish-habitat-protection-areas. As you can see, the API returns a JSON response. The result key contains an object with all of the dataset’s fields (such as title, notes, access_level, and so on). This is where you can check how to refer to a field in a Solr query.
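If you prefer to list the field names programmatically, something along these lines could work. It is a minimal sketch: the helper names are made up, and the sample response is a hypothetical, heavily trimmed stand-in for what the API returns, so the values in it are not real metadata. No network request is made here.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://catalogue.data.wa.gov.au/api/action/package_show"


def package_show_url(name_or_id):
    """Build the package_show URL for a dataset name or id."""
    return f"{API_BASE}?{urlencode({'id': name_or_id})}"


def field_names(response_text):
    """Return the sorted keys of the "result" object in a package_show response."""
    payload = json.loads(response_text)
    return sorted(payload["result"])


# Hypothetical sample response, used only for illustration.
sample = '{"success": true, "result": {"title": "Example", "notes": "...", "access_level": "open"}}'

print(package_show_url("fish-habitat-protection-areas"))
print(field_names(sample))
```

To use it against the live API you would fetch the URL with the HTTP client of your choice and pass the response body to field_names.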

Case sensitivity and exact search

By default, the CKAN fields title, notes, author, author_email, maintainer, maintainer_email, res_name, res_description and a few others are case insensitive because their field type is text. This means that the next two queries return the same results.

title:parks

➜ Try this query

title:PARKS

➜ Try this query

While the next two queries are different.

geospatial_theme:Boundaries

➜ Try this query

geospatial_theme:boundaries

➜ Try this query

As you can see, the second query returns no results. This happens because the geospatial_theme field type is string, and string-type fields only match when the value is exactly the same, including case. Of course, it’s impossible to remember the type of every field, so just try to remember the basic text fields mentioned at the beginning of this section.

Info: The WhitespaceTokenizerFactory doesn’t apply to string fields, so a search on one of these fields looks for an exact match on the whole value.
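The difference between the two field types can be pictured with a small simulation. This is only an analogy written for this guide, not how Solr is implemented: it ignores stemming and other analysis steps, and the stored value "Marine Parks and Reserves" is invented for the example.

```python
def matches_text_field(stored, term):
    """Text-style match: split on whitespace and compare tokens case-insensitively."""
    tokens = stored.lower().split()
    return term.lower() in tokens


def matches_string_field(stored, term):
    """String-style match: the whole value must be identical, including case."""
    return stored == term


print(matches_text_field("Marine Parks and Reserves", "PARKS"))
print(matches_string_field("Boundaries", "Boundaries"))
print(matches_string_field("Boundaries", "boundaries"))
```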

Escaping Special Characters

Solr gives the following characters special meaning when they appear in a query:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : /

To make Solr interpret any of these characters literally, rather than as a special character, precede the character with a backslash character \. For example, to search for (1+1):2 without having Solr interpret the plus sign and parentheses as special characters for formulating a sub-query with two terms, escape the characters by preceding each one with a backslash:

\(1\+1\)\:2
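If you build queries from user input, the escaping can be automated. The helper below is a sketch with a made-up name. It escapes every character from the list above one at a time, which means single & and | characters are escaped too, and it also escapes the backslash itself; both of those choices are assumptions that go slightly beyond the list.

```python
# Characters from the list above, plus the backslash itself.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:/\\')


def escape_solr_term(text):
    """Prefix every special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


print(escape_solr_term("(1+1):2"))
```

Apply the helper to the search term only, not to the whole query, otherwise the colon between the field name and the term would be escaped as well.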

Boolean Operators

Since we have already touched on them, let's talk about operators.

Boolean Operator	Alternative symbol	Description
AND	&&	Requires both terms on either side of the Boolean operator to be present for a match.
OR	||	Requires that either term (or both terms) be present for a match.
NOT	!	Requires that the following term not be present.
	+	Requires that the following term be present.
	-	Prohibits the following term. The - operator is functionally similar to the Boolean operator !. Because it's used by popular search engines such as Google, it may be more familiar to some user communities.

Tip: When specifying Boolean operators with keywords such as AND or NOT, the keywords must appear in all uppercase.

AND (&&)

The AND operator is simple. As mentioned earlier CKAN uses AND as the default operator, but you can also specify it. For example, this query is looking for datasets with the word marine in the title and the phrase spatial cadastral database in the description.

title:marine AND notes:"spatial cadastral database"

➜ Try this query

We can use an alias (&&) but the result will be the same. It’s up to you which one to use.

title:marine && notes:"spatial cadastral database"

➜ Try this query

OR (||)

The OR operator links two terms or phrases and finds a matching dataset if either of the terms exists in a dataset. The symbol || can be used in place of the word OR.

title:parks OR title:water

➜ Try this query

NOT ( ! )

The NOT operator excludes datasets that contain the term after NOT. The symbol ! can be used in place of the word NOT. The following query searches for datasets that contain the word parks and don’t contain the word marine in their title.

title:parks NOT title:marine

➜ Try this query

+ and -

The + symbol (also known as the "required" operator) requires that the term after the + symbol be present in a dataset for that dataset to match.

The - symbol, or "prohibit" operator, excludes datasets that contain the term after the - symbol.

For example, to search for dataset titles that must contain water and that must not contain islands, use the following query.

+title:water -title:islands

➜ Try this query

Grouping Terms to Form Sub-Queries

Solr supports using parentheses to group clauses to form sub-queries. This can be very useful if you want to control the Boolean logic for a query. For example:

(title:water OR title:parks) AND NOT title:islands AND num_resources:7

➜ Try this query

You are free to combine multiple conditions as long as you follow the Solr syntax.
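When queries get longer, small helper functions can keep the parentheses and operators consistent. This is a sketch only; the function names any_of, all_of and none_of are invented for this example and simply produce the query text shown above.

```python
def any_of(*clauses):
    """Group clauses with OR inside parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses):
    """Join clauses with AND."""
    return " AND ".join(clauses)


def none_of(clause):
    """Negate a single clause."""
    return "NOT " + clause


query = all_of(
    any_of("title:water", "title:parks"),
    none_of("title:islands"),
    "num_resources:7",
)
print(query)
```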

Term Modifiers

Wildcard Searches

Solr supports multiple character wildcard searches within single terms. Wildcard characters can be applied to single terms, but not to search phrases.

The asterisk * symbol matches multiple characters (zero or more sequential characters) in a single term. The wildcard search title:net* would match any word that starts with net.

➜ Try this query

Info: By default Solr does not support left truncation (e.g. title:*net) so CKAN doesn’t support it either.

Also, you can use a wildcard in the middle of a term.

➜ Try this query

Warning: Avoid using wildcards in this way where possible. Such searches can be significantly slower than the default search by words (tokens), and the results may be unpredictable or irrelevant.

Fuzzy Searches

Solr supports fuzzy searches based on the Damerau-Levenshtein Distance or Edit Distance algorithm. Fuzzy searches discover terms that are similar to a specified term without necessarily being an exact match.

Basic CKAN search is fuzzy by default. To perform a fuzzy advanced search, use the tilde ~ symbol at the end of a single-word term. For example, to search for a term similar in spelling to rest, use the fuzzy search:

title:rest~

➜ Try this query

This search will match terms like test, west, forest, etc. It will also match the word rest itself.

An optional distance parameter specifies the maximum number of edits allowed, between 0 and 2, defaulting to 2.

title:rest~1

➜ Try this query

This will match terms like test and west, but not forest since it has an edit distance of "2".
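The edit distances quoted above can be checked by hand or with a few lines of code. The function below is a sketch of the optimal string alignment variant of Damerau-Levenshtein distance, written for this guide; Solr's own implementation works differently internally, so treat it as a way to reason about distances rather than a copy of what the search engine runs.

```python
def edit_distance(a, b):
    """Count inserts, deletes, substitutions and adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


for word in ("rest", "test", "west", "forest"):
    print(word, edit_distance("rest", word))
```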

In many cases, stemming (reducing terms to a common stem) can produce similar effects to fuzzy searches and wildcard searches.

Proximity Searches

A proximity search looks for terms that are within a specific distance from one another.

To perform a proximity search, add the tilde character ~ and a numeric value to the end of a search phrase. For example, to search for contaminated and restricted within 2 words of each other in a title, use the search:

title:"contaminated restricted"~2

➜ Try this query

The distance referred to here is the number of term movements needed to match the specified phrase.

Field Specific Queries

Existence and Non-Existence Searches

Sometimes you want to search for all datasets where a specific field isn’t empty. An existence search uses a wildcard with a field in place of a term. It matches all datasets where the specified field has any value. Either of the following forms can be used.

data_temporal_extent_begin:[* TO *] 
data_temporal_extent_begin:*

➜ Try this query

Matching NaN values with wildcards

For most fields, unbounded range queries, field:[* TO *], are equivalent to existence queries, field:*. However, for float/double types that support NaN values, these two queries perform differently.

field:* matches all existing values, including NaN
field:[* TO *] matches all real values, excluding NaN

Help: NaN stands for Not A Number and is one of the common ways to represent a missing value in data.
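One way to picture why the two forms differ is that NaN never compares as greater than, less than, or equal to anything, so it cannot fall inside a range even when both ends are open. The Python analogy below illustrates that behaviour of floating point numbers; it is not Solr code, and the function names are made up.

```python
import math


def matches_existence(value):
    """field:* style check: any stored value counts, including NaN."""
    return value is not None


def matches_unbounded_range(value):
    """field:[* TO *] style check: the value must sit between the two open bounds."""
    return value is not None and -math.inf <= value <= math.inf


for value in (1.5, math.nan, None):
    print(value, matches_existence(value), matches_unbounded_range(value))
```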

If you want to find all datasets where a field is empty, you can negate the existence query. Either of the following forms can be used.

NOT data_temporal_extent_begin:* 
-data_temporal_extent_begin:*

➜ Try this query

Warning: A query like this can be significantly slower, because Solr has to do a full index scan on this field.

Range Searches

A range search specifies a range of values for a field (a range with an upper bound and a lower bound). The query matches datasets whose values for the specified field or fields fall within the range. Range queries can be inclusive or exclusive of the upper and lower bounds.

Sorting is done lexicographically (in alphabetical order), except on numeric fields. For example, the range query below matches all datasets whose num_resources field has a value between 11 and 13, inclusive.

num_resources:[11 TO 13]

➜ Try this query

Range queries are not limited to date fields or even numerical fields. You can also use them with other fields, such as the theme of a dataset.

theme:{"Business" TO "Defence"}

➜ Try this query

This will find all datasets whose theme is between Business and Defence, but not including these terms.

The brackets around a query determine its inclusiveness.

Square brackets [ and ] denote an inclusive range query that matches values including the upper and lower bound.
Curly brackets { and } denote an exclusive range query that matches values between the upper and lower bounds, but excluding the upper and lower bounds themselves.
You can mix these types so one end of the range is inclusive and the other is exclusive.

theme:{"Business" TO "Education and Training"]

➜ Try this query
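The bracket rules are easy to get wrong when typing queries by hand, so a small builder can help. The sketch below uses invented names (format_bound, range_query) and makes two assumptions: string bounds are always wrapped in double quotes, and None stands for an open end written as *.

```python
def format_bound(value):
    """Render one end of a range: * for open, quoted for strings, plain otherwise."""
    if value is None:
        return "*"
    if isinstance(value, str):
        return f'"{value}"'
    return str(value)


def range_query(field, lower, upper, include_lower=True, include_upper=True):
    """Build field:[lower TO upper], switching to { or } for exclusive ends."""
    open_bracket = "[" if include_lower else "{"
    close_bracket = "]" if include_upper else "}"
    return f"{field}:{open_bracket}{format_bound(lower)} TO {format_bound(upper)}{close_bracket}"


print(range_query("num_resources", 11, 13))
print(range_query("theme", "Business", "Defence", include_lower=False, include_upper=False))
print(range_query("theme", "Business", "Education and Training", include_lower=False))
```

Date bounds would need different handling, because this sketch would wrap them in quotes like any other string.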

Specifying Dates and Times

Queries against fields using the date type (typically range queries) should use the appropriate date syntax:

metadata_created:[* TO NOW]
metadata_created:[1976-03-06T23:59:59.999Z TO *]
metadata_created:[1999-12-31T23:59:59.999Z TO 2017-03-06T00:00:00Z]
metadata_created:[NOW-1YEAR/DAY TO NOW/DAY+1DAY]
metadata_created:[2019-03-06T23:59:59.999Z TO 2020-03-06T23:59:59.999Z+1YEAR]
metadata_created:[2019-03-06T23:59:59.999Z/YEAR TO 2020-03-06T23:59:59.999Z]

Date Formatting

Solr’s date fields represent "dates" as a point in time with millisecond precision. The format used is a restricted form of the canonical representation of DateTime in the XML Schema specification—a restricted subset of ISO-8601.

YYYY-MM-DDThh:mm:ssZ

YYYY is the year.
MM is the month.
DD is the day of the month.
hh is the hour of the day as on a 24-hour clock.
mm is minutes.
ss is seconds.
Z is a literal "Z" character, indicating that this string representation of the date is in UTC.
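If your dates start out in local time, they need to be converted to UTC before they go into a query. The following sketch shows one way to do that with Python's standard library. The function name to_solr_date is made up, and the fixed UTC+8 offset for Perth is used only to make the example concrete.

```python
from datetime import datetime, timedelta, timezone


def to_solr_date(moment):
    """Convert an aware datetime to UTC and format it as YYYY-MM-DDThh:mm:ssZ."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


perth = timezone(timedelta(hours=8))  # assumed fixed UTC+8 offset, for illustration
local = datetime(2022, 1, 13, 10, 10, 0, tzinfo=perth)
print(to_solr_date(local))
```

Rejecting naive datetimes avoids silently treating a local time as if it were UTC.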

Date Math Syntax

Date math expressions consist of either adding some quantity of time in a specified unit, or rounding the current time by a specified unit. Expressions can be chained and are evaluated left to right.

For example this represents a point in time two months from now:

NOW+2MONTHS

This is one day ago:

NOW-1DAY

A slash is used to indicate rounding. This represents the beginning of the current hour:

NOW/HOUR

The following example computes (with millisecond precision) the point in time six months and three days into the future and then rounds that time to the beginning of that day:

NOW+6MONTHS+3DAYS/DAY

Note that while date math is most commonly used relative to NOW it can be applied to any fixed moment in time as well:

1972-05-20T17:33:18.772Z+6MONTHS+3DAYS/DAY

For example, if you want to find datasets that were created in the last 2 months you can use the query:

metadata_created:[NOW-2MONTHS TO *]

➜ Try this query
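To see how chained expressions are evaluated left to right, it can help to work through a simplified evaluator. The sketch below is not Solr's parser. It only understands expressions that start with NOW and use the HOUR and DAY units (MONTH and YEAR are left out because they need calendar arithmetic), and it rounds in whatever timezone the supplied datetime carries. The names TOKEN, UNIT_DELTAS and evaluate are invented for this example.

```python
import re
from datetime import datetime, timedelta, timezone

UNIT_DELTAS = {"HOUR": timedelta(hours=1), "DAY": timedelta(days=1)}
# Either an addition/subtraction such as +3DAYS, or a rounding such as /DAY.
TOKEN = re.compile(r"([+-])(\d+)(HOUR|DAY)S?|/(HOUR|DAY)")


def evaluate(expression, now):
    """Apply each operation in order, starting from the supplied 'now'."""
    if not expression.startswith("NOW"):
        raise ValueError("this sketch only handles expressions starting with NOW")
    moment, rest, position = now, expression[3:], 0
    while position < len(rest):
        match = TOKEN.match(rest, position)
        if match is None:
            raise ValueError(f"unsupported date math near {rest[position:]!r}")
        sign, amount, unit, round_unit = match.groups()
        if round_unit:
            # Rounding truncates to the start of the hour or day.
            moment = moment.replace(minute=0, second=0, microsecond=0)
            if round_unit == "DAY":
                moment = moment.replace(hour=0)
        else:
            delta = int(amount) * UNIT_DELTAS[unit]
            moment = moment + delta if sign == "+" else moment - delta
        position = match.end()
    return moment


fixed_now = datetime(2022, 1, 13, 10, 10, 30, tzinfo=timezone.utc)
for expr in ("NOW-1DAY", "NOW/HOUR", "NOW+3DAYS/DAY"):
    print(expr, evaluate(expr, fixed_now).isoformat())
```

Note that the order matters: NOW+3DAYS/DAY adds three days first and then rounds down to midnight, as described above.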

Term Priorities

Term Boosting with ^

Query-time boosts allow one to specify which terms/clauses are "more important". The higher the boost factor, the more relevant the term will be, and therefore the higher the corresponding dataset scores.

When you apply a boost to a specific field, you promote the matching datasets in the results list by increasing their score. For example:

title:water^2 OR title:parks

➜ Try this query

This returns datasets matching either term, with those whose title contains water promoted towards the top.

Usually, the score is based on many factors that we simply cannot be aware of when executing a query. We also cannot know what the initial score is, before giving it a boost.

Constant Score with ^=

Constant score queries are created with ^=, which sets the entire clause to the specified score for any datasets matching that clause. This is desirable when you only care about matches for a particular clause and don't want other relevancy factors, such as term frequency (the number of times the term appears in the field) or inverse dataset frequency (a measure across the whole index for how rare a term is in a field), to influence the score.

For example, in the next query every title match on water scores exactly 1 and every title match on parks scores exactly 0.9, so a dataset that matches only parks can never rank above one that matches water.

title:water^=1 || title:parks^=0.9

➜ Try this query
