Paging changed in the new version of the PatentsView API and R package. This vignette explains the subtleties that the R package handles for you and shows how to do custom paging.
Out of the box, the R package either makes a single request of up to 1000 rows or retrieves all rows; anything in between requires your own paging. That matters if you want to retrieve a few thousand utility patents without retrieving all 8 million of them, or if you want to iterate through search results page by page rather than retrieving the entire result set first. If you do your own paging, you’ll need to choose your sort parameter(s) carefully, as shown in Example 2 and Example 3.
Example 1
Here we’ll retrieve 5000 utility patents using custom paging.
library(patentsview)
# Let's get 5000 of the most recent utility patents (when this was written),
# 1000 at a time (the API's maximum rows per request)
requested_rows <- 1000
sort <- c("patent_id" = "desc")
query <- qry_funs$eq("patent_type" = "utility")
fields <- c("patent_id", "patent_date", "patent_title")
# The "after" parameter is explained a bit lower. It's an Elasticsearch thing,
# and is the attribute the new version of the API uses to page. For now, just be
# thankful that the R package handles this for you when you set all_pages = TRUE
after <- NULL
results <- list()
for (n in seq_len(5)) {
  print(paste("page", n, "after is", ifelse(is.null(after), "NULL", after)))
  page_n <- search_pv(query, fields = fields, sort = sort,
                      all_pages = FALSE, size = requested_rows, after = after)
  s <- names(sort)
  last_index <- nrow(page_n$data[[1]])
  after <- page_n$data[[1]][[s]][[last_index]] # the last value of the sort field
  results[[n]] <- page_n$data
}
#> [1] "page 1 after is NULL"
#> [1] "page 2 after is 9999018"
#> [1] "page 3 after is 9998008"
#> [1] "page 4 after is 9996992"
#> [1] "page 5 after is 9995985"
utility_sample <- as.data.frame(do.call(rbind, lapply(results, as.data.frame)))
str(utility_sample)
#> 'data.frame': 5000 obs. of 3 variables:
#> $ patents.patent_id : chr "RE34947" "RE34357" "RE31704" "RE31701" ...
#> $ patents.patent_title: chr "Light-to-light conversion element provided with wavelength selecting reflection layer and imaging device provid"| __truncated__ "Loose-lay and adhered surface coverings" "Transformer having novel multiple winding and support structure and method of making same" "Portable label printing and applying machine" ...
#> $ patents.patent_date : chr "1995-05-23" "1993-08-24" "1984-10-09" "1984-10-09" ...
Example 2
Here we will execute a query two different ways: first the R package will do the paging, then we’ll make a misguided attempt to do the paging ourselves.
fields <- c("patent_id", "patent_date", "patent_title")
sort <- c("patent_date" = "asc")
query <- '{"_and":[{"_gte":{"patent_date":"1976-01-01"}},{"_lte":{"patent_date":"1976-01-31"}}]}'
r_pkg <- search_pv(query, sort = sort, fields = fields, all_pages = TRUE)
# note the number of rows returned
r_pkg$query_results$total_hits
#> [1] 5352
Quick piece of trivia: with a handful of exceptions, US patents have been issued on Tuesdays. Here are the counts by issue date for January 1976, to help illustrate why what we’re about to do will lead to trouble. (Note that the single patent issued on 1976-01-25 is a mistake in the PatentsView database, but we need it here so the sum matches the r_pkg$query_results$total_hits just displayed.)
issue_dates <- unique(r_pkg$data$patents$patent_date)
counts <- lapply(issue_dates, function(issue_date) {
  query <- qry_funs$eq(patent_date = issue_date)
  res <- search_pv(query)
  weekday <- weekdays(as.Date(issue_date))
  print(paste(res$query_results$total_hits, issue_date, weekday))
  res$query_results$total_hits
})
#> [1] "1379 1976-01-06 Tuesday"
#> [1] "1257 1976-01-13 Tuesday"
#> [1] "1383 1976-01-20 Tuesday"
#> [1] "1332 1976-01-27 Tuesday"
#> [1] "1 1976-01-25 Sunday"
sum(unlist(counts))
#> [1] 5352
Now we’ll try to do our own paging but, as you might notice, we’ll run into trouble using patent_date as the sort field (and thus as the source of the ‘after’ parameter we’ll send to the API).
after <- NULL
combined_data <- NULL
count <- 0
requested_rows <- 1000 # API's maximum rows per request
# these variables remain the same
sort <- c("patent_date" = "asc")
query <- '{"_and":[{"_gte":{"patent_date":"1976-01-01"}},{"_lte":{"patent_date":"1976-01-31"}}]}'
fields <- c("patent_id", "patent_date", "patent_title")
# We'll continue to make requests until we get back an empty or partial
# response from the API
page <- 1
repeat {
  print(paste("page", page, "after is", ifelse(is.null(after), "NULL", after)))
  subsequent <- search_pv(query, sort = sort, all_pages = FALSE,
                          fields = fields, size = requested_rows, after = after)
  # subsequent$data$patents is an empty list if we page too far
  returned_rows <- ifelse(length(subsequent$data$patents) == 0, 0,
                          nrow(subsequent$data$patents))
  if (returned_rows > 0) {
    combined_data <- rbind(combined_data, subsequent$data$patents)
    count <- count + returned_rows
    page <- page + 1
  }
  # We're done if we got an empty or partial reply from the API
  if (returned_rows < requested_rows) {
    break
  }
  # Now to page we need to set the "after" attribute to where the
  # current results ended. Its value is the last row's [[sort field]].
  # It would need to be a vector of values if there are multiple sort fields
  s <- names(sort)[[1]]
  after <- subsequent$data[[1]][[s]][[returned_rows]]
}
#> [1] "page 1 after is NULL"
#> [1] "page 2 after is 1976-01-06"
#> [1] "page 3 after is 1976-01-13"
#> [1] "page 4 after is 1976-01-20"
#> [1] "page 5 after is 1976-01-27"
print(paste("count is", count))
#> [1] "count is 4000"
print(nrow(combined_data))
#> [1] 4000
We ran into trouble because we chose patent_date as the sort field, and it isn’t unique row-wise in our result set the way patent_id was in Example 1. In this run, each time we set ‘after’ to the last patent_date on a page, we hadn’t finished retrieving that date’s patents, so the next request skipped the rest of them and we ended up with only 4000 of the 5352 rows. (Page breaks aren’t guaranteed to align with changes in patent_date. See the page boundary shown in the next example if this isn’t clear yet.)
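The row loss can be reproduced without calling the API. The sketch below is a local simulation, not the API's implementation: fetch_page() and page_all() are hypothetical helpers, fetch_page() assumes a request returns up to size rows whose sort value is strictly greater than ‘after’, and the fake data frame simply reuses the January 1976 per-date counts printed earlier.

```r
# Local simulation of paging with an 'after' cursor; no API calls are made.
# fetch_page() is a hypothetical stand-in for a single request: it returns up
# to `size` rows whose sort value is strictly greater than `after`.
fetch_page <- function(df, sort_field, size, after = NULL) {
  df <- df[order(df[[sort_field]]), , drop = FALSE]
  if (!is.null(after)) {
    df <- df[df[[sort_field]] > after, , drop = FALSE]
  }
  head(df, size)
}

# Keep requesting pages until an empty or partial page comes back,
# and report how many rows were collected in total
page_all <- function(df, sort_field, size) {
  after <- NULL
  total <- 0
  repeat {
    page <- fetch_page(df, sort_field, size, after)
    total <- total + nrow(page)
    if (nrow(page) < size) break
    after <- page[[sort_field]][[nrow(page)]]
  }
  total
}

# Fake result set with the same per-date counts as January 1976
dates <- c("1976-01-06", "1976-01-13", "1976-01-20", "1976-01-25", "1976-01-27")
counts <- c(1379, 1257, 1383, 1, 1332)
fake <- data.frame(
  patent_id = sprintf("%07d", seq_len(sum(counts))), # made-up unique ids
  patent_date = rep(dates, times = counts),
  stringsAsFactors = FALSE
)

rows_by_date <- page_all(fake, "patent_date", 1000) # non-unique sort field
rows_by_id <- page_all(fake, "patent_id", 1000)     # unique sort field
print(c(by_date = rows_by_date, by_id = rows_by_id))
```

Under these assumptions the date-sorted run collects 4000 rows and the id-sorted run collects all 5352, the same shortfall we saw against the real API.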
The R package uses appropriate key(s) when all_pages = TRUE. get_ok_pk(endpoint) has been changed to also return a secondary sort key when needed. The next example shows that sometimes a secondary sort is required to guarantee row uniqueness (vital for paging via the ‘after’ parameter, so the API properly picks up exactly where the previous page of data left off).
Example 3
Here we’ll demonstrate that custom paging sometimes requires a secondary sort. Normally the R package handles this for its users when all_pages = TRUE. It all has to do with the way the API now handles paging: the values of the sort key(s) in the last row of a page become the ‘after’ parameter’s value(s), which tell the API where the next page of results begins.
Some of the new endpoints can return more than one row of data for their primary key. They are the endpoints that have a sequence parameter. Sorting and thus paging by only the primary key at these endpoints can lead to trouble, like sorting by patent_date did in the second half of Example 2.
library(dplyr)
sequence_eps <- fieldsdf[grepl("^[^.]*sequence", fieldsdf$field), "endpoint"]
seq_pks <- lapply(sequence_eps, function(endpoint) {
  c(endpoint, get_ok_pk(endpoint))
})
sequences_df <- as.data.frame(do.call("rbind", seq_pks))
colnames(sequences_df) <- c("endpoint", "primary", "secondary")
sequences_df
#>                         endpoint         primary          secondary
#> 1                        g_claim       patent_id     claim_sequence
#> 2               g_draw_desc_text       patent_id draw_desc_sequence
#> 3        patent/foreign_citation       patent_id  citation_sequence
#> 4         patent/other_reference       patent_id reference_sequence
#> 5 patent/us_application_citation       patent_id  citation_sequence
#> 6      patent/us_patent_citation       patent_id  citation_sequence
#> 7                       pg_claim document_number     claim_sequence
#> 8              pg_draw_desc_text document_number draw_desc_sequence
OK, so here we’ll do minimal custom paging with a secondary sort, using the patent/us_patent_citation endpoint. The code is similar to what’s in the citation network vignette, where we first learned that a primary sort is not always sufficient (which required changes to the R package).
# Write a query to pull patents assigned to the CPC code of "Y10S707/933"
query <- qry_funs$contains(cpc_current.cpc_group_id = "Y10S707/933")
pv_out <- search_pv(query = query, fields = c("patent_id"))
patent_ids <- pv_out$data$patents$patent_id
# We have to go against the patent citation endpoint now, since these fields
# are no longer available from the patent endpoint
citing_query <- qry_funs$eq(patent_id = patent_ids)
cited_query <- qry_funs$eq(citation_patent_id = patent_ids)
# Create a list of fields to pull from the API
fields <- c(
  "patent_id",
  "citation_patent_id",
  "citation_sequence"
)
sort <- c("patent_id" = "asc", "citation_sequence" = "asc")
# Request the first page of results
res <- search_pv(citing_query,
  fields = fields, all_pages = FALSE,
  sort = sort,
  endpoint = "patent/us_patent_citation", method = "POST", size = 1000
)
last_row <- nrow(res$data$us_patent_citations)
last_patent_id <- res$data$us_patent_citations$patent_id[[last_row]]
last_citation_sequence <- res$data$us_patent_citations$citation_sequence[[last_row]]
after <- c(last_patent_id, last_citation_sequence)
print(after)
#> [1] "8818996" "6"
# make our own request to get the second page of results, knowing that's it (1066 total rows)
remaining <- search_pv(citing_query,
  fields = fields, all_pages = FALSE,
  sort = sort,
  after = after,
  endpoint = "patent/us_patent_citation", method = "POST", size = 1000
)
blend <- list(res$data[[1]], remaining$data[[1]])
blended <- list(do.call("rbind", c(blend, make.row.names = FALSE)))
names(blended) <- names(res$data)
str(blended)
#> List of 1
#> $ us_patent_citations:'data.frame': 1066 obs. of 3 variables:
#> ..$ patent_id : chr [1:1066] "10095778" "10095778" "10095778" "10095778" ...
#> ..$ citation_sequence : int [1:1066] 0 1 2 3 4 5 6 7 8 9 ...
#> ..$ citation_patent_id: chr [1:1066] "4991087" "5175681" "5392390" "5461699" ...
Here’s a quick look at the data around the page boundary to try to show why we needed a secondary sort. It’s not that we necessarily wanted a secondary sort, but it gives us the ability to use a second column’s value in the ‘after’ parameter.
# ending of the first page of results
tail(res$data$us_patent_citations, n = 3)
#>      patent_id citation_sequence citation_patent_id
#> 998    8818996                 4            5680305
#> 999    8818996                 5            5694592
#> 1000   8818996                 6            5721910
# *********** data page boundary ***********
# start of the second page of results
head(remaining$data$us_patent_citations, n = 3)
#>   patent_id citation_sequence citation_patent_id
#> 1   8818996                 7            5754840
#> 2   8818996                 8            5774833
#> 3   8818996                 9            5799325
# end of the second page of results
tail(remaining$data$us_patent_citations, n = 3)
#>    patent_id citation_sequence citation_patent_id
#> 64   9075849                34            7451388
#> 65   9075849                35            7912842
#> 66   9075849                36            7962511
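# If the sort was only by patent_id, 'after' would have been "8818996" and,
# judging by what happened in Example 2, the second request would have started
# at the next patent_id, skipping the rest of 8818996's citations
# (citation_sequence 7 and up)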
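How a two-value ‘after’ picks up in the middle of a patent can be sketched locally. This assumes the cursor is compared field by field in sort order, which is what the page boundary above suggests but is not taken from the API's documentation. rows_after() is a hypothetical helper, not part of the R package, and the rows are made up. The cursor is kept in a list here so the integer sequence stays an integer for the comparison.

```r
# Hypothetical helper: keep rows that sort strictly after a two-field cursor.
# A row qualifies if its first key is greater than the cursor's, or if the
# first keys tie and its second key is greater.
rows_after <- function(df, keys, after) {
  k1 <- df[[keys[1]]]
  k2 <- df[[keys[2]]]
  keep <- k1 > after[[1]] | (k1 == after[[1]] & k2 > after[[2]])
  df[keep, , drop = FALSE]
}

# Made-up citations: patent "8818996" straddles the page boundary
citations <- data.frame(
  patent_id = c(rep("8818996", 10), rep("9075849", 3)),
  citation_sequence = c(0:9, 0:2),
  stringsAsFactors = FALSE
)

# Cursor taken from the last row of an imaginary first page
after <- list("8818996", 6L)

# Two-field cursor: resumes at citation_sequence 7 of the same patent
composite <- rows_after(citations, c("patent_id", "citation_sequence"), after)

# Primary key only: everything left for "8818996" is skipped
primary_only <- citations[citations$patent_id > after[[1]], , drop = FALSE]

print(c(composite = nrow(composite), primary_only = nrow(primary_only)))
```

With the two-field cursor the remaining three citations of 8818996 are returned along with the next patent's rows; with the primary key alone only the next patent's rows come back.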
Takeaways
Again, when all_pages = TRUE the R package handles all of this for you! You’d only need code like this if your use case requires custom paging. The custom paging takeaways are:
- The original version of the API and R package used per_page and page to allow users to page through result sets. Those attributes are replaced by size and after in the new version of the API and R package.
- Your sort field(s) need to create row-wise uniqueness. At a minimum, the get_ok_pk(endpoint) fields need to be included as sort fields, though others can be added.
- If there is a single sort field, the after parameter’s value is the sort field’s last value in the most recently retrieved page of data (as in Example 1). If there are multiple sort fields, the after parameter’s value is a vector of the sort fields’ last values (as in Example 3).
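The last takeaway can be wrapped in a small helper. This is a sketch rather than anything exported by the R package: next_after() is a hypothetical name, and it assumes the page is a data frame that contains every sort field. The sample page reuses the last three rows of the first page from Example 3.

```r
# Hypothetical helper: build the next 'after' value from a page of results.
# `page` is a data frame of results; `sort` is a named vector like the one
# passed to search_pv(), e.g. c("patent_id" = "asc", "citation_sequence" = "asc").
next_after <- function(page, sort) {
  sort_fields <- names(sort)
  missing_fields <- setdiff(sort_fields, names(page))
  if (length(missing_fields) > 0) {
    stop("sort fields missing from page: ", paste(missing_fields, collapse = ", "))
  }
  if (nrow(page) == 0) {
    return(NULL) # nothing to page from
  }
  last_row <- page[nrow(page), sort_fields, drop = FALSE]
  # One value per sort field, in sort order. Values are converted to character,
  # which is what c(last_patent_id, last_citation_sequence) did in Example 3.
  unname(unlist(lapply(last_row, as.character)))
}

page <- data.frame(
  patent_id = c("8818996", "8818996", "8818996"),
  citation_sequence = 4:6,
  citation_patent_id = c("5680305", "5694592", "5721910"),
  stringsAsFactors = FALSE
)

two_keys <- next_after(page, c("patent_id" = "asc", "citation_sequence" = "asc"))
one_key <- next_after(page, c("patent_id" = "asc"))
print(two_keys)
print(one_key)
```

The two-key call yields the same pair of values printed in Example 3 ("8818996" and "6"), and the result could be passed as the after argument of the next request.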