Result Set Paging

Paging changed in the new version of the PatentsView API and R package. This vignette explains the subtleties that the R package handles for you and shows how to do custom paging.

Without custom paging, the R package either makes a single request of up to 1000 rows or retrieves all rows, with nothing in between. Custom paging is useful if you want to retrieve a few thousand utility patents without retrieving all 8 million of them, or if you want to iterate through search results page by page rather than retrieving the entire result set first. If you do your own paging, you’ll need to choose your sort parameter(s) carefully, as shown in Example 2 and Example 3.

Example 1

Here we’ll retrieve 5000 utility patents using custom paging.

library(patentsview)

# Let's get 5000 utility patents in descending patent_id order, 1000 at a time
# (the API's maximum rows per request). patent_id is a character field, so
# this ordering is not necessarily most-recent-first
requested_rows <- 1000
sort <- c("patent_id" = "desc")
query <- qry_funs$eq("patent_type" = "utility")
fields <- c("patent_id", "patent_date", "patent_title")

# The "after" parameter is explained a bit lower. It's an Elasticsearch thing,
# and is the attribute the new version of the API uses to page. For now, just be
# thankful that the R package handles this for you when you set all_pages = TRUE
after <- NULL
results <- list()

for (n in seq_len(5)) {
  print(paste("page", n, "after is", ifelse(is.null(after), "NULL", after)))

  page_n <- search_pv(query, fields = fields, sort = sort,
    all_pages = FALSE, size = requested_rows, after = after)

  # Set "after" to the last value of the sort field on this page
  s <- names(sort)
  last_index <- nrow(page_n$data[[1]])
  after <- page_n$data[[1]][[s]][[last_index]]
  results[[n]] <- page_n$data
}
#> [1] "page 1 after is NULL"
#> [1] "page 2 after is 9999018"
#> [1] "page 3 after is 9998008"
#> [1] "page 4 after is 9996992"
#> [1] "page 5 after is 9995985"

utility_sample <- as.data.frame(do.call(rbind, lapply(results, as.data.frame)))

str(utility_sample)
#> 'data.frame':    5000 obs. of  3 variables:
#>  $ patents.patent_id   : chr  "RE34947" "RE34357" "RE31704" "RE31701" ...
#>  $ patents.patent_title: chr  "Light-to-light conversion element provided with wavelength selecting reflection layer and imaging device provid"| __truncated__ "Loose-lay and adhered surface coverings" "Transformer having novel multiple winding and support structure and method of making same" "Portable label printing and applying machine" ...
#>  $ patents.patent_date : chr  "1995-05-23" "1993-08-24" "1984-10-09" "1984-10-09" ...

Example 2

Here we’ll execute a query two different ways: first we’ll have the R package do the paging, then we’ll make our own misguided attempt to do the paging ourselves.

fields <- c("patent_id", "patent_date", "patent_title")
sort <- c("patent_date" = "asc")
query <- '{"_and":[{"_gte":{"patent_date":"1976-01-01"}},{"_lte":{"patent_date":"1976-01-31"}}]}'
r_pkg <- search_pv(query, sort = sort, fields = fields, all_pages = TRUE)

# note the number of rows returned
r_pkg$query_results$total_hits
#> [1] 5352

Quick piece of trivia: with a handful of exceptions, US patents are issued on Tuesdays. Here are the counts by issue date for January 1976, which help illustrate why what we’re about to do will lead to trouble. (The single patent dated 1976-01-25 is a mistake in the PatentsView database, but we need it here so the sum matches the r_pkg$query_results$total_hits value just displayed.)

issue_dates <- unique(r_pkg$data$patents$patent_date)
counts <- lapply(issue_dates, function(issue_date) {
  query <- qry_funs$eq(patent_date = issue_date)
  res <- search_pv(query)
  weekday <- weekdays(as.Date(issue_date))
  print(paste(res$query_results$total_hits, issue_date, weekday))
  res$query_results$total_hits
})
#> [1] "1379 1976-01-06 Tuesday"
#> [1] "1257 1976-01-13 Tuesday"
#> [1] "1383 1976-01-20 Tuesday"
#> [1] "1332 1976-01-27 Tuesday"
#> [1] "1 1976-01-25 Sunday"
sum(unlist(counts))
#> [1] 5352

Now we’ll try to do our own paging but, as you might notice, we’ll run into trouble using patent_date as the sort field (and thus as the source of the ‘after’ parameter we send to the API).

after <- NULL
combined_data <- NULL
count <- 0
requested_rows <- 1000  # API's maximum rows per request

# these variables remain the same
sort <- c("patent_date" = "asc")
query <- '{"_and":[{"_gte":{"patent_date":"1976-01-01"}},{"_lte":{"patent_date":"1976-01-31"}}]}'
fields <- c("patent_id", "patent_date", "patent_title")

# We'll continue to make requests until we get back an empty or partial
# response from the API
page <- 1

repeat {
  print(paste("page", page, "after is", ifelse(is.null(after), "NULL", after)))

  subsequent <- search_pv(query, sort = sort, all_pages = FALSE,
    fields = fields, size = requested_rows, after = after)

  # subsequent$data$patents is an empty list if we page too far
  returned_rows <- if (length(subsequent$data$patents) == 0) 0 else nrow(subsequent$data$patents)

  if (returned_rows > 0) {
    combined_data <- rbind(combined_data, subsequent$data$patents)
    count <- count + returned_rows
    page <- page + 1
  }

  # We're done if we got an empty or partial reply from the API
  if (returned_rows < requested_rows) {
    break
  }

  # Now to page we need to set the "after" attribute to where the
  # current results ended.  Its value is the last row's [[sort field]]. 
  # It would need to be a vector of values if there are multiple sort fields
  s <- names(sort)[[1]]
  after <- subsequent$data[[1]][[s]][[returned_rows]]

}
#> [1] "page 1 after is NULL"
#> [1] "page 2 after is 1976-01-06"
#> [1] "page 3 after is 1976-01-13"
#> [1] "page 4 after is 1976-01-20"
#> [1] "page 5 after is 1976-01-27"

print(paste("count is", count))
#> [1] "count is 4000"
print(nrow(combined_data))
#> [1] 4000

We retrieved only 4000 of the 5352 rows. We ran into trouble because we chose patent_date as the sort field, and patent_date does not uniquely identify a row in our result set the way patent_id did in Example 1. In most cases, when we set ‘after’ to the last patent_date of a page of results, we weren’t done retrieving all of that date’s patents, so the next request skipped the rest of them. (The page breaks aren’t guaranteed to align with the patent_date changes in the result set. See the page boundary shown in the next example if this isn’t clear yet.)
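To see where the missing rows went without calling the API, here is a small offline simulation. It is only a sketch: fetch_page() is a hypothetical stand-in for the API, and it assumes the API returns rows whose sort value is strictly greater than ‘after’, which is consistent with the output above. The per-date counts are the ones printed earlier in this example.

```r
# Simulated result set: one row per patent, in patent_date order,
# using the per-date counts printed earlier in this example
dates <- rep(
  c("1976-01-06", "1976-01-13", "1976-01-20", "1976-01-25", "1976-01-27"),
  times = c(1379, 1257, 1383, 1, 1332)
)
rows <- data.frame(patent_date = dates, stringsAsFactors = FALSE)

# Hypothetical stand-in for the API (an assumption, not the real service):
# return up to `size` rows whose sort value is strictly greater than `after`
fetch_page <- function(rows, size, after = NULL) {
  if (!is.null(after)) {
    rows <- rows[rows$patent_date > after, , drop = FALSE]
  }
  head(rows, size)
}

after <- NULL
retrieved <- 0
repeat {
  page <- fetch_page(rows, size = 1000, after = after)
  retrieved <- retrieved + nrow(page)
  if (nrow(page) < 1000) break
  after <- page$patent_date[nrow(page)] # last value of the sort field
}

retrieved
#> [1] 4000
nrow(rows) - retrieved # rows that were silently skipped
#> [1] 1352
```

Each of the first four pages ends partway through a date, and the rest of that date’s patents are skipped by the next request.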

The R package uses appropriate sort key(s) when all_pages = TRUE. get_ok_pk(endpoint) has been changed to also return a secondary sort key when needed. The next example shows that a secondary sort is sometimes required to guarantee row uniqueness, which is vital for paging via the ‘after’ parameter: it is what lets the API pick up exactly where the previous page of data left off.
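One rough way to check a candidate set of sort fields is to look for duplicated key combinations in data you have already retrieved. The helper below, sort_is_unique(), is hypothetical (it is not part of the R package) and the data frame is made up for illustration. A check on one page can reveal a problem, but it can’t prove uniqueness across the whole result set.

```r
# Hypothetical helper: TRUE if no two rows share the same values
# in the sort fields
sort_is_unique <- function(df, sort) {
  anyDuplicated(df[, names(sort), drop = FALSE]) == 0
}

# Made-up rows in which the primary key repeats
toy <- data.frame(
  patent_id = c("1000001", "1000001", "1000002"),
  citation_sequence = c(0L, 1L, 0L),
  stringsAsFactors = FALSE
)

sort_is_unique(toy, c("patent_id" = "asc"))
#> [1] FALSE
sort_is_unique(toy, c("patent_id" = "asc", "citation_sequence" = "asc"))
#> [1] TRUE
```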

Example 3

Here we’ll demonstrate that custom paging sometimes requires a secondary sort. Normally the R package handles this for its users when all_pages = TRUE. It all has to do with the way the API now handles paging: the values of the sort key(s) in the last row of a page become the ‘after’ parameter’s value(s), which tell the API where the next page of results begins.

Some of the new endpoints can return more than one row of data per primary key value. These are the endpoints that have a sequence field. Sorting, and thus paging, by only the primary key at these endpoints can lead to trouble, just as sorting by patent_date did in the second half of Example 2.

library(dplyr)

sequence_eps <- fieldsdf[grepl("^[^.]*sequence", fieldsdf$field), "endpoint"]
seq_pks <- lapply(sequence_eps, function(endpoint) {
  c(endpoint, get_ok_pk(endpoint))
})

sequences_df <- as.data.frame(do.call("rbind", seq_pks))
colnames(sequences_df) <- c("endpoint", "primary", "secondary")
sequences_df
#>                         endpoint         primary          secondary
#> 1                        g_claim       patent_id     claim_sequence
#> 2               g_draw_desc_text       patent_id draw_desc_sequence
#> 3        patent/foreign_citation       patent_id  citation_sequence
#> 4         patent/other_reference       patent_id reference_sequence
#> 5 patent/us_application_citation       patent_id  citation_sequence
#> 6      patent/us_patent_citation       patent_id  citation_sequence
#> 7                       pg_claim document_number     claim_sequence
#> 8              pg_draw_desc_text document_number draw_desc_sequence

Here we’ll do minimal custom paging with a secondary sort, using the patent/us_patent_citation endpoint. The code is similar to what’s in the citation network vignette, where we first learned that a primary sort is not always sufficient (which required changes to the R package).

# Write a query to pull patents assigned to the CPC code of "Y10S707/933"
query <- qry_funs$contains(cpc_current.cpc_group_id = "Y10S707/933")
pv_out <- search_pv(query = query, fields = c("patent_id"))
patent_ids <- pv_out$data$patents$patent_id

# We have to use the patent citation endpoint now, since these fields
# are no longer available from the patent endpoint

citing_query <- qry_funs$eq(patent_id = patent_ids)
cited_query <- qry_funs$eq(citation_patent_id = patent_ids)

# Create a list of fields to pull from the API
fields <- c(
  "patent_id",
  "citation_patent_id",
  "citation_sequence"
)

sort <- c("patent_id" = "asc", "citation_sequence" = "asc")

# Request the first page of results
res <- search_pv(citing_query,
  fields = fields, all_pages = FALSE,
  sort = sort,
  endpoint = "patent/us_patent_citation", method = "POST", size = 1000
)

last_row <- nrow(res$data$us_patent_citations)
last_patent_id <- res$data$us_patent_citations$patent_id[[last_row]]
last_citation_sequence <- res$data$us_patent_citations$citation_sequence[[last_row]]
after <- c(last_patent_id, last_citation_sequence)
print(after)
#> [1] "8818996" "6"

# Make our own request for the second page of results, knowing it's the last one (1066 total rows)
remaining <- search_pv(citing_query,
  fields = fields, all_pages = FALSE,
  sort = sort,
  after = after,
  endpoint = "patent/us_patent_citation", method = "POST", size = 1000
)

blend <- list(res$data[[1]], remaining$data[[1]])
blended <- list(do.call("rbind", c(blend, make.row.names = FALSE)))
names(blended) <- names(res$data)
str(blended)
#> List of 1
#>  $ us_patent_citations:'data.frame': 1066 obs. of  3 variables:
#>   ..$ patent_id         : chr [1:1066] "10095778" "10095778" "10095778" "10095778" ...
#>   ..$ citation_sequence : int [1:1066] 0 1 2 3 4 5 6 7 8 9 ...
#>   ..$ citation_patent_id: chr [1:1066] "4991087" "5175681" "5392390" "5461699" ...

Here’s a quick look at the data around the page boundary to show why we needed a secondary sort. It’s not that we necessarily wanted a secondary sort, but it lets us use a second column’s value in the ‘after’ parameter.

# ending of the first page of results
tail(res$data$us_patent_citations, n=3)
#>      patent_id citation_sequence citation_patent_id
#> 998    8818996                 4            5680305
#> 999    8818996                 5            5694592
#> 1000   8818996                 6            5721910

#  ***********  data page boundary ***********  

# start of the second page of results
head(remaining$data$us_patent_citations, n=3)
#>   patent_id citation_sequence citation_patent_id
#> 1   8818996                 7            5754840
#> 2   8818996                 8            5774833
#> 3   8818996                 9            5799325

# end of the second page of results
tail(remaining$data$us_patent_citations, n=3)
#>    patent_id citation_sequence citation_patent_id
#> 64   9075849                34            7451388
#> 65   9075849                35            7912842
#> 66   9075849                36            7962511

# If the sort were only by patent_id, 'after' would be just "8818996" and the
# second request would skip the rest of that patent's citations
# (citation_sequence 7 and up), as happened with patent_date in Example 2
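The same idea can be tried offline. In this sketch, rows_after() is a hypothetical stand-in for the API’s handling of ‘after’, assuming it returns rows that sort strictly after the given value(s), compared field by field. The toy data is made up.

```r
# Made-up citation rows, already sorted by patent_id then citation_sequence
cites <- data.frame(
  patent_id = c("A", "A", "A", "B", "B"),
  citation_sequence = c(0L, 1L, 2L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the API (an assumption, not the real service):
# keep rows that sort strictly after `after`, which holds one value per
# sort field as a character vector
rows_after <- function(df, after) {
  keep <- df$patent_id > after[[1]]
  if (length(after) == 2) {
    same_id <- df$patent_id == after[[1]]
    keep <- keep | (same_id & df$citation_sequence > as.integer(after[[2]]))
  }
  df[keep, , drop = FALSE]
}

# Pretend the page size is 2, so the first page ends in the middle of patent "A"
page_1 <- head(cites, 2)
last <- page_1[nrow(page_1), ]

# Sorting by both fields: the next page resumes at A's third citation
nrow(rows_after(cites, c(last$patent_id, last$citation_sequence)))
#> [1] 3

# Sorting by patent_id only: A's third citation is skipped
nrow(rows_after(cites, last$patent_id))
#> [1] 2
```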

Takeaways

Again, when all_pages = TRUE the R package handles all of this for you! You’d only need code like this if your use case requires custom paging. The custom paging takeaways are:

The original version of the API and R package used per_page and page to allow users to page through result sets. Those attributes are replaced by size and after in the new version of the API and R package.
Your sort field(s) must uniquely identify each row. At a minimum, include the fields returned by get_ok_pk(endpoint) as sort fields; others can be added.
If there is a single sort field, the after parameter’s value is the sort field’s last value in the most recently retrieved page of data (as in Example 1). If there are multiple sort fields, the after parameter’s value is a vector of the sort fields’ last values (as in Example 3).
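The last takeaway can be wrapped in a small helper. next_after() below is hypothetical (it is not part of the R package). It assumes a page of results is a data frame that contains every sort field, and it returns the values as character, matching what c() produced in Example 3. The sample rows are the last two rows of the first page in Example 3.

```r
# Hypothetical helper: build the "after" value from the last row of a page.
# `sort` is a named vector like the ones passed to search_pv() above
next_after <- function(page_df, sort) {
  # An empty page (NULL, empty list, or zero-row data frame) has no next value
  if (NROW(page_df) == 0) return(NULL)
  last_row <- page_df[nrow(page_df), names(sort), drop = FALSE]
  vapply(last_row, as.character, character(1), USE.NAMES = FALSE)
}

page <- data.frame(
  patent_id = c("8818996", "8818996"),
  citation_sequence = c(5L, 6L),
  stringsAsFactors = FALSE
)

# Multiple sort fields: a vector of the last row's values
next_after(page, c("patent_id" = "asc", "citation_sequence" = "asc"))
#> [1] "8818996" "6"

# Single sort field: a single value
next_after(page, c("patent_id" = "asc"))
#> [1] "8818996"
```

Returning NULL for an empty page gives a custom paging loop a simple stopping signal, though whether to stop on an empty or a partial page is a choice left to the caller.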