The Data WA catalogue is powered by CKAN, an open source data portal platform. CKAN uses Apache Solr as its search engine.
Basic CKAN search allows limited use of the Solr query language. To get the full benefit of Solr, we've implemented an advanced "Use query language to search" option.
To access this option:
- Select the Data heading from the top of any page.
- Tick the Advanced search checkbox.
- Tick the Use query language to search checkbox.
Solr provides a rich query syntax that will help us find specific datasets among the thousands on Data WA. Although basic queries are fairly simple, you can create complex queries using boolean operators. We'll learn how to use boolean operators later. For now, we will start from the very beginning.
Getting started with Solr
To perform a free text search, simply enter a text string. For example, the following query will search all fields for the word fishing.
fishing
Of course, you can search by multiple words. Solr has a complicated analyser system, but we only need to know the basics. First of all, by default Solr will break the query by whitespace, so the query marine park will be split into two “tokens”: marine and park.
park marine
The dataset with both words will be on top of the results while related search matches will follow after it. We will talk about priorities later.
To search for multiple words in a particular sequence, wrap them in double quotes " ". This is called a phrase.
"active schools"
If you want to search within a specific field, use the following syntax.
title:marine
You can specify multiple fields in a search query.
title:marine notes:republished attributes
You can combine word and phrase searches too. Here we are searching for the word marine in title and the phrase spatial cadastral database in description.
title:marine notes:"spatial cadastral database"
When we combine fields in this way, there is an implicit AND operator between the clauses because AND is the default operator in CKAN.
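The fielded queries above can also be assembled in code. The following is a minimal Python sketch, not an official client: the helper names are hypothetical, the base URL is taken from the catalogue links in this guide, and it assumes the catalogue exposes CKAN's standard package_search action and that queries sent through the API are parsed the same way as in the advanced search form. It only builds the URL and sends nothing.

```python
from urllib.parse import urlencode

# Assumed base URL, taken from the catalogue links used in this guide.
BASE_URL = "https://catalogue.data.wa.gov.au"


def field_clause(field: str, value: str) -> str:
    """Return field:value, wrapping the value in quotes when it is a phrase."""
    if any(ch.isspace() for ch in value):
        value = '"' + value + '"'
    return f"{field}:{value}"


def build_query(clauses: dict[str, str]) -> str:
    """Join field clauses with spaces, relying on the default operator."""
    return " ".join(field_clause(f, v) for f, v in clauses.items())


def package_search_url(query: str, rows: int = 10) -> str:
    """Build a package_search URL; nothing is sent over the network."""
    params = urlencode({"q": query, "rows": rows})
    return f"{BASE_URL}/api/3/action/package_search?{params}"


query = build_query({"title": "marine", "notes": "spatial cadastral database"})
print(query)
print(package_search_url(query))
```

Quoting only values that contain whitespace keeps single words as plain terms and turns multi-word values into phrases, matching the two styles shown above.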
Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem or root form. It allows you to make your search return more relevant results.
A stemmer in Solr is basically a set of mapping rules that maps the various forms of a word back to the base, or stem, word from which they derive. For example, in English the words "fishes" and "fishing" are both forms of the stem word "fish". The stemmer will replace both of these terms with "fish", which is what will be indexed.
So the query fishing will find datasets with the word fish mentioned. You don’t have to write anything special to make it work.
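To make the idea concrete, here is a deliberately naive Python sketch of suffix stripping. It is not the stemmer that Solr or CKAN uses (real stemmers apply far more careful rules); the function name and the suffix list are assumptions chosen only so that the three forms from the example collapse to the same stem.

```python
def toy_stem(word: str) -> str:
    """Strip a few common English suffixes. This is NOT Solr's stemmer."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for form in ("fish", "fishes", "fishing"):
    print(form, "->", toy_stem(form))
```

Because both the indexed text and the query pass through the same reduction, a query for one form can match datasets that contain another.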
How do I find out the name of a field?
CKAN provides an API where you can grab metadata related to a particular dataset. For example, here is a link to a dataset: https://catalogue.data.wa.gov.au/dataset/fish-habitat-protection-areas
In this link, we are using the dataset’s name (the part after dataset/) to build a human-readable URL. We can use this name to fetch metadata about the dataset via the CKAN API. The API call we are making is https://catalogue.data.wa.gov.au/api/action/package_show?id= followed by the id or name of the dataset. The two are interchangeable, so you can use either with this API call.
Try visiting https://catalogue.data.wa.gov.au/api/action/package_show?id=fish-habitat-protection-areas to check the metadata of our example dataset. As you can see, the API returns a JSON response. The result key contains an object with all of the dataset’s fields (such as title, notes, access_level, and so on). This is where you can check how to refer to a field in a Solr query.
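The lookup described above can be sketched in Python. The function names are hypothetical, and the sample response is a trimmed, made-up stand-in with the same overall shape as a package_show reply (its field values are placeholders, not real catalogue data), so the example runs without network access.

```python
import json

# Base URL taken from the API call shown in this guide.
BASE_URL = "https://catalogue.data.wa.gov.au"


def package_show_url(id_or_name: str) -> str:
    """Build the package_show URL for a dataset id or name."""
    return f"{BASE_URL}/api/action/package_show?id={id_or_name}"


# Trimmed, made-up response; a real reply contains many more fields.
sample_response = json.loads("""
{
  "success": true,
  "result": {
    "name": "fish-habitat-protection-areas",
    "title": "Example title",
    "notes": "Example description",
    "access_level": "placeholder"
  }
}
""")


def field_names(response: dict) -> list[str]:
    """Return the field names found under the result key, sorted."""
    return sorted(response["result"].keys())


print(package_show_url("fish-habitat-protection-areas"))
print(field_names(sample_response))
```

The names returned by field_names are the ones to use on the left of the colon in a query such as title:marine.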
Case sensitivity and exact search
By default, in CKAN, the fields title, notes, author, author_email, maintainer, maintainer_email, res_name, res_description and a few others are case insensitive because their field type is text. This means that the next two queries will return the same results.
title:parks
title:PARKS
The next two queries, however, return different results.
geospatial_theme:Boundaries
geospatial_theme:boundaries
As you can see, the second query returns no results. This happens because the geospatial_theme field type is string. All string-type fields require an exact match. Of course, it’s impossible to remember the types of all fields, so just try to remember the basic text fields mentioned at the beginning of this section.
Info: The WhitespaceTokenizerFactory doesn’t apply to string fields, so a search on these fields looks for an exact match.
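The difference between the two field types can be modelled with a small Python sketch. This is a simplification under stated assumptions: a text field is treated as whitespace-split, lowercased tokens (real text fields also apply stemming and other filters), and a string field is treated as a single value that must match exactly. The function names and sample values are hypothetical.

```python
def matches_text_field(field_value: str, term: str) -> bool:
    """Simplified text field: split on whitespace, compare lowercased tokens."""
    tokens = field_value.lower().split()
    return term.lower() in tokens


def matches_string_field(field_value: str, term: str) -> bool:
    """Simplified string field: the whole value must match exactly."""
    return field_value == term


print(matches_text_field("Marine Parks and Reserves", "PARKS"))
print(matches_string_field("Boundaries", "boundaries"))
print(matches_string_field("Boundaries", "Boundaries"))
```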
Escaping Special Characters
Solr gives the following characters special meaning when they appear in a query:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : /
To make Solr interpret any of these characters literally, rather than as a special character, precede the character with a backslash character (\). For example, to search for (1+1):2 without having Solr interpret the plus sign and parentheses as special characters for formulating a sub-query with two terms, escape the characters by preceding each one with a backslash:
\(1\+1\)\:2
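When queries are built from user input, the escaping rule above can be applied automatically. The following Python sketch uses a hypothetical helper name. It escapes every character from the list above, treats && and || by escaping each & and | individually, and also escapes the backslash itself, which is an assumption added here because the backslash is the escape character.

```python
# Characters from the list above, plus the backslash itself.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:/\\')


def escape_solr(text: str) -> str:
    """Put a backslash in front of every special character in text."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


print(escape_solr("(1+1):2"))  # prints \(1\+1\)\:2
```

Apply the helper only to the value part of a clause; escaping a whole query would also neutralise the colon after the field name.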
Boolean Operators
Since we have already touched on them, let's talk about operators.
Operator | Symbol | Description
AND | && | Requires both terms on either side of the Boolean operator to be present for a match.
OR | || | Requires that either term (or both terms) be present for a match.
NOT | ! | Requires that the following term not be present.
(none) | + | Requires that the following term be present.
(none) | - | Prohibits the following term. The
Tip: When specifying Boolean operators with keywords such as AND or NOT, the keywords must appear in all uppercase.
AND (&&)
The AND operator is simple. As mentioned earlier, CKAN uses AND as the default operator, but you can also specify it explicitly. For example, this query is looking for datasets with the word marine in the title and the phrase spatial cadastral database in the description.
title:marine AND notes:"spatial cadastral database"
We can use an alias (&&) but the result will be the same. It’s up to you which one to use.
title:marine && notes:"spatial cadastral database"
OR (||)
The OR operator links two terms or phrases and finds a matching dataset if either of the terms exists in a dataset. The symbol || can be used in place of the word OR.
title:parks OR title:water
NOT ( ! )
The NOT operator excludes datasets that contain the term after NOT. The symbol ! can be used in place of the word NOT. The following query searches for datasets that contain the word parks and don’t contain the word marine in their title.
title:parks NOT title:marine
+ and -
The + symbol (also known as the "required" operator) requires that the term after it exist somewhere in a field of a dataset for that dataset to match.
The - symbol (or "prohibit" operator) excludes datasets that contain the term after the - symbol.
For example, to search for dataset titles that must contain water and that must not contain islands, use the following query.
+title:water -title:islands
Grouping Terms to Form Sub-Queries
Solr supports using parentheses to group clauses to form sub-queries. This can be very useful if you want to control the Boolean logic for a query. For example:
(title:water OR title:parks) AND NOT title:islands AND num_resources:7
You are free to combine multiple conditions as long as you follow the Solr syntax.
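To see how the grouping in the example query shapes the logic, the same condition can be written as a Python filter over a few made-up datasets. The records, field values, and helper names below are hypothetical and exist only to mirror the structure of (title:water OR title:parks) AND NOT title:islands AND num_resources:7.

```python
# Made-up records, not real catalogue data.
datasets = [
    {"title": "water quality sites", "num_resources": 7},
    {"title": "parks and islands", "num_resources": 7},
    {"title": "regional parks", "num_resources": 3},
    {"title": "regional parks boundaries", "num_resources": 7},
]


def has_word(dataset: dict, word: str) -> bool:
    """True when the title contains the word as a whitespace-separated token."""
    return word in dataset["title"].split()


def matches(dataset: dict) -> bool:
    """Mirror the grouped query: (water OR parks) AND NOT islands AND 7 resources."""
    return (
        (has_word(dataset, "water") or has_word(dataset, "parks"))
        and not has_word(dataset, "islands")
        and dataset["num_resources"] == 7
    )


hits = [d["title"] for d in datasets if matches(d)]
print(hits)
```

The parentheses matter: the OR group is evaluated as one unit before the remaining conditions are applied.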
Term Modifiers
Wildcard Searches
Solr supports multiple character wildcard searches within single terms. Wildcard characters can be applied to single terms, but not to search phrases.
The asterisk * symbol matches multiple characters (zero or more sequential characters) in a single term. The wildcard search title:net* would match any word that starts with net.
Info: By default Solr does not support left truncation (e.g. title:*net) so CKAN doesn’t support it either.
Also, you can use a wildcard in the middle of a term.
Warning: Try to avoid using wildcards in the middle of a term, because such a search can be significantly slower than the default search by words (tokens), and the results may be unpredictable and irrelevant.
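The behaviour of the asterisk can be illustrated with Python's fnmatch module, whose * also matches zero or more characters. This is only an analogy for single tokens, not how Solr evaluates wildcards internally: fnmatch gives special meaning to ?, [ and ] as well, and the token list is made up.

```python
from fnmatch import fnmatchcase

# Made-up tokens standing in for words found in dataset titles.
tokens = ["net", "network", "internet", "nets", "nugget"]


def wildcard_match(pattern: str, candidates: list[str]) -> list[str]:
    """Return the candidates that match a pattern containing * wildcards."""
    return [t for t in candidates if fnmatchcase(t, pattern)]


print(wildcard_match("net*", tokens))  # trailing wildcard
print(wildcard_match("n*t", tokens))   # wildcard in the middle of a term
```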
Fuzzy Searches
Solr supports fuzzy searches based on the Damerau-Levenshtein Distance or Edit Distance algorithm. Fuzzy searches discover terms that are similar to a specified term without necessarily being an exact match.
Basic CKAN search is fuzzy by default. To perform a fuzzy advanced search, use the tilde ~ symbol at the end of a single-word term. For example, to search for a term similar in spelling to rest, use the fuzzy search:
title:rest~
This search will match terms like test, west, forest, etc. It will also match the word rest itself.
An optional distance parameter specifies the maximum number of edits allowed, between 0 and 2, defaulting to 2.
title:rest~1
This will match terms like test and west, but not forest, since it has an edit distance of 2.
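The edit distances quoted above can be checked with a short Python sketch. It implements the optimal string alignment variant of the Damerau-Levenshtein distance (insertions, deletions, substitutions, and swaps of adjacent characters); whether Solr uses exactly this variant internally is not claimed here, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def fuzzy_match(term: str, candidates: list[str], max_edits: int = 2) -> list[str]:
    """Return the candidates within max_edits of term."""
    return [c for c in candidates if edit_distance(term, c) <= max_edits]


words = ["rest", "test", "west", "forest"]
print(fuzzy_match("rest", words))               # like title:rest~
print(fuzzy_match("rest", words, max_edits=1))  # like title:rest~1
```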
In many cases, stemming (reducing terms to a common stem) can produce similar effects to fuzzy searches and wildcard searches.
Proximity Searches
A proximity search looks for terms that are within a specific distance from one another.
To perform a proximity search, add the tilde character ~ and a numeric value to the end of a search phrase. For example, to search for contaminated and restricted within 2 words of each other in a title, use the search:
title:"contaminated restricted"~2
The distance referred to here is the number of term movements needed to match the specified phrase.
Field Specific Queries
Existence and Non-Existence Searches
Sometimes you want to search for all datasets where a specific field isn’t empty. An existence search uses a wildcard in place of a term after the field name. It matches all datasets where the specified field has any value.
data_temporal_extent_begin:[* TO *]
data_temporal_extent_begin:*
Matching NaN values with wildcards
For most fields, unbounded range queries, field:[* TO *], are equivalent to existence queries, field:*. However, for float/double types that support NaN values, these two queries behave differently.
- field:* matches all existing values, including NaN.
- field:[* TO *] matches all real values, excluding NaN.
Help: NaN stands for Not A Number and is one of the common ways to represent a missing value in data.
If you want to find all datasets where this field is empty, you can reverse this query.
NOT data_temporal_extent_begin:*
-data_temporal_extent_begin:*
Warning: A query like this can be significantly slower, because Solr has to do a full index scan on this field.
Range Searches
A range search specifies a range of values for a field (a range with an upper bound and a lower bound). The query matches datasets whose values for the specified field or fields fall within the range. Range queries can be inclusive or exclusive of the upper and lower bounds.
Sorting is done lexicographically (in alphabetical order), except on numeric fields. For example, the range query below matches all datasets whose num_resources field has a value between 11 and 13, inclusive.
num_resources:[11 TO 13]
Range queries are not limited to date fields or even numerical fields. You can also use range queries on fields that hold text values, where the bounds are compared lexicographically.
theme:{"Business" TO "Defence"}
This will find all datasets whose theme is between Business and Defence, but not including these terms.
The brackets around a query determine its inclusiveness.
- Square brackets [ and ] denote an inclusive range query that matches values including the upper and lower bound.
- Curly brackets { and } denote an exclusive range query that matches values between the upper and lower bounds, but excluding the upper and lower bounds themselves.
- You can mix these types so one end of the range is inclusive and the other is exclusive.
theme:{"Business" TO "Education and Training"]
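The bracket rules above can be expressed as a small Python function. This is an illustrative model only: the function name and the theme value Community are hypothetical, and Python's string comparison (by code point) stands in for the lexicographic ordering described earlier.

```python
def in_range(value, low, high, include_low=True, include_high=True) -> bool:
    """Check a value against a range with inclusive or exclusive bounds."""
    above = value >= low if include_low else value > low
    below = value <= high if include_high else value < high
    return above and below


# num_resources:[11 TO 13] includes both bounds.
print(in_range(13, 11, 13))
# theme:{"Business" TO "Defence"} excludes both bounds.
print(in_range("Business", "Business", "Defence", False, False))
print(in_range("Community", "Business", "Defence", False, False))
# theme:{"Business" TO "Education and Training"] excludes only the lower bound.
print(in_range("Education and Training", "Business", "Education and Training", False, True))
```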
Specifying Dates and Times
Queries against fields using the date type (typically range queries) should use the appropriate date syntax:
metadata_created:[* TO NOW]
metadata_created:[1976-03-06T23:59:59.999Z TO *]
metadata_created:[1999-12-31T23:59:59.999Z TO 2017-03-06T00:00:00Z]
metadata_created:[NOW-1YEAR/DAY TO NOW/DAY+1DAY]
metadata_created:[2019-03-06T23:59:59.999Z TO 2020-03-06T23:59:59.999Z+1YEAR]
metadata_created:[2019-03-06T23:59:59.999Z/YEAR TO 2020-03-06T23:59:59.999Z]
Date Formatting
Solr’s date fields represent "dates" as a point in time with millisecond precision. The format used is a restricted form of the canonical representation of DateTime in the XML Schema specification—a restricted subset of ISO-8601.
YYYY-MM-DDThh:mm:ssZ
- YYYY is the year.
- MM is the month.
- DD is the day of the month.
- hh is the hour of the day as on a 24-hour clock.
- mm is minutes.
- ss is seconds.
- Z is a literal "Z" character, indicating that this string representation of the date is in UTC.
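Because the format is always in UTC, local times need to be converted before they are placed in a query. The Python sketch below shows one way to do that with the standard library. The function name is hypothetical, and the UTC+8 offset in the usage example is an assumption chosen to represent local time in Western Australia.

```python
from datetime import datetime, timedelta, timezone


def to_solr_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as YYYY-MM-DDThh:mm:ssZ in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# 8:00 am at UTC+8 is midnight UTC on the same day.
local_time = datetime(2017, 3, 6, 8, 0, 0, tzinfo=timezone(timedelta(hours=8)))
print(to_solr_date(local_time))  # prints 2017-03-06T00:00:00Z
```

Pass timezone-aware values: for a naive datetime, astimezone assumes the computer's local time zone, which may not be the one you intended.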
Date Math Syntax
Date math expressions consist of either adding some quantity of time in a specified unit, or rounding the current time by a specified unit. Expressions can be chained and are evaluated left to right.
For example, this represents a point in time two months from now:
NOW+2MONTHS
This is one day ago:
NOW-1DAY
A slash is used to indicate rounding. This represents the beginning of the current hour:
NOW/HOUR
The following example computes (with millisecond precision) the point in time six months and three days into the future and then rounds that time to the beginning of that day:
NOW+6MONTHS+3DAYS/DAY
Note that while date math is most commonly used relative to NOW, it can be applied to any fixed moment in time as well:
1972-05-20T17:33:18.772Z+6MONTHS+3DAYS/DAY
For example, if you want to find datasets that were created in the last 2 months you can use the query:
metadata_created:[NOW-2MONTHS TO *]
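The left-to-right evaluation of date math can be illustrated with a small Python evaluator. This is a sketch with deliberate limits, not a reimplementation of Solr: it only understands expressions that start with NOW and the HOUR and DAY units (so NOW-2MONTHS is rejected), and all names in it are hypothetical. A fixed value stands in for NOW so the results are repeatable.

```python
import re
from datetime import datetime, timedelta, timezone

# Only two units are handled here; Solr accepts more (for example MONTHS).
UNITS = {"HOUR": timedelta(hours=1), "DAY": timedelta(days=1)}
STEP = re.compile(r"([+-]\d+|/)(HOUR|DAY)S?")


def round_down(moment: datetime, unit: str) -> datetime:
    """Round a moment down to the start of the given unit."""
    if unit == "DAY":
        return moment.replace(hour=0, minute=0, second=0, microsecond=0)
    return moment.replace(minute=0, second=0, microsecond=0)


def date_math(expression: str, now: datetime) -> datetime:
    """Evaluate a NOW-based expression one step at a time, left to right."""
    if not expression.startswith("NOW"):
        raise ValueError("this sketch only handles expressions that start with NOW")
    steps = expression[len("NOW"):]
    moment, position = now, 0
    for match in STEP.finditer(steps):
        if match.start() != position:
            break  # unsupported text between steps
        operation, unit = match.groups()
        if operation == "/":
            moment = round_down(moment, unit)
        else:
            moment += int(operation) * UNITS[unit]
        position = match.end()
    if position != len(steps):
        raise ValueError(f"unsupported expression: {expression}")
    return moment


fixed_now = datetime(2020, 3, 6, 17, 33, 18, tzinfo=timezone.utc)
for expression in ("NOW-1DAY", "NOW/HOUR", "NOW+3DAYS/DAY"):
    print(expression, "->", date_math(expression, fixed_now).isoformat())
```

In the last expression the three days are added first and the result is then rounded to the start of that day, which is why the order of the steps matters.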
Term Priorities
Term Boosting with ^
Query-time boosts allow one to specify which terms/clauses are "more important". The higher the boost factor, the more relevant the term will be, and therefore the higher the corresponding dataset scores.
When you apply a boost to a specific term or clause, you promote the matching datasets in the results list by increasing their scores. For example:
title:water^2 OR title:parks
This will return results for both terms, with the datasets that match water promoted to the top.
Usually, the score is based on many factors that we simply cannot be aware of when executing a query. We also cannot know what the initial score is, before giving it a boost.
Constant Score with ^=
Constant score queries are created with ^=, which sets the entire clause to the specified score for any datasets matching that clause. This is desirable when you only care about matches for a particular clause and don't want other relevancy factors to affect the score, such as term frequency (the number of times the term appears in the field) or inverse document frequency (a measure across the whole index of how rare a term is in a field).
For example, the next query assigns fixed scores, so a dataset that matches only parks can never rank above one that matches water.
title:water^=1 || title:parks^=0.9
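The effect of constant scores on ranking can be modelled in a few lines of Python. This is a simplified model under stated assumptions: each matching clause contributes exactly its constant, the contributions of matching clauses are added together, and any other normalisation Solr may apply is ignored. The titles and function name are made up.

```python
def constant_score(title: str, clauses: dict[str, float]) -> float:
    """Add up the fixed score of every clause whose term appears in the title."""
    words = title.lower().split()
    return sum(score for term, score in clauses.items() if term in words)


# Mirrors title:water^=1 || title:parks^=0.9
clauses = {"water": 1.0, "parks": 0.9}
titles = ["regional parks", "water quality", "water in parks"]

ranked = sorted(titles, key=lambda t: constant_score(t, clauses), reverse=True)
print(ranked)
```

Under this model a dataset that matches both clauses ranks first, followed by water-only matches and then parks-only matches, regardless of how often each word appears.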