```{julia}
#| label: julia-pkg-install
#| output: false
#| eval: false
#| code-overflow: wrap
import Pkg
"StatsBase", "DataFrames", "DataFramesMeta", "Tidier", "TidierStrings", "CSV", "Pipe", "Chain"])
Pkg.add([```
People always say that it must be nice that Hanukkah has 8 days of presents. This year, with everyone doing Advent of Code for 3+ weeks, I’m grateful that the Hanukkah of Data challenge only has eight days.
The Hanukkah of Data is a fun puzzle that gives folks a chance to practice coding and data analysis skills on a fun, low-stakes group challenge. I’m using this as an opportunity to explore the use of Julia and Python. I still find R to be a much more intuitive data analysis tool, but I want to stretch my legs and build my skills with these other languages, too. So this blog post documents the approach I used to solve these puzzles with Julia and R. This post focuses more on the programming than on the problem-solving; if you’re looking for a good post that talks about the reasoning for each night’s puzzle, I’d highly recommend this post from Ella Kaye: https://ellakaye.co.uk/posts/2023-12-07_hanukkah-of-data-5784/.
In general, I found that Julia syntax for data manipulation was less readable than the Tidyverse syntax that’s become popular in R. The Julia data manipulation syntax for the foundational package DataFrames often feels less like a user-centered design and more like an intellectual exercise for the designers. The extension package DataFramesMeta makes things much more readable, but there’s still a lot more cruft than in the tidyverse: for example, each pipe involves more annoying bookkeeping syntax like @chain mydata begin ... end
. So my main takeaway of the foundational Julia data manipulation packages is that recommending it to the average data analyst would be like recommending a custom Linux distribution to your grandmother. The other big challenge I found with using Julia was that the error messages were often quite difficult to understand: the stacktraces are usually quite long and heavily laden with jargon, so that they’re helpful for package developers but inscrutable to a typical data analyst.
With that said, let’s solve these puzzles!
Setup: Configuring a Project Folder and Installing Packages
As a little bit of housekeeping, I set up a project folder. To keep track of the Julia packages and their versions, I created a TOML file for the project. In the Julia REPL, I opened the package manager by typing ]
; then, from within the package manager, I created a TOML file by simply running activate .
.
Then I added a few Julia packages:
Here’s what this looked like as I waited for these to install.
Precompiling project...
Progress [================================> ] 123/155
◐ TiffImages
◓ ImageCore
✓ Distributions → DistributionsChainRulesCoreExt
◐ Polynomials → PolynomialsMakieCoreExt
◐ Polynomials → PolynomialsChainRulesCoreExt
Maaaany dependency installations and roughly 15 minutes later, I have all of these packages installed and ready to use. With that out of the way, I closed the package manager by typing the keyboard shortcut “CTRL+C”. Then I loaded these packages like so.
using DataFrames, DataFramesMeta, CSV, Chain
import Tidier as TD
import Dates as DATE
Next, I downloaded the data from the Hanukkah of Data site. I unzipped the four CSV files into a subdirectory of the project named data
. Here’s the list of files in the data folder:
readdir("data")
4-element Vector{String}:
"noahs-customers.csv"
"noahs-orders.csv"
"noahs-orders_items.csv"
"noahs-products.csv"
Let’s go ahead and load all of the datasets.
= DataFrame(CSV.File("data/noahs-customers.csv"))
customers = DataFrame(CSV.File("data/noahs-orders.csv"))
orders = DataFrame(CSV.File("data/noahs-orders_items.csv"))
orders_items = DataFrame(CSV.File("data/noahs-products.csv")) products
library(tidyverse)
<- read_csv("data/noahs-customers.csv")
customers <- read_csv("data/noahs-orders.csv")
orders <- read_csv("data/noahs-orders_items.csv")
orders_items <- read_csv("data/noahs-products.csv") products
First Night of Hanukkah
Julia Solution
We know that we need to identify a customer whose last name can be spelled out by a phone’s keypad. So we’ll create two helper functions: the first will extract a person’s surname, and the second will translate a given surname into a phone number.
# Define function to extract the surname from the full name
function get_surname(full_name)
= match(r"[A-z]+$", full_name)
regex_match if isnothing(regex_match)
""
else
uppercase(regex_match.match)
end
end
get_surname (generic function with 1 method)
# Define a function to convert a name to a phone number
# First create a dictionary giving the letters included on each key in a phone keypad
= Dict([ ('2', "ABC"), ('3', "DEF"),
phone_keypad '4', "GHI"), ('5', "JKL"), ('6', "MNO"),
('7', "PQRS"), ('8', "TUV"), ('9', "WXYZ")]) (
Dict{Char, String} with 8 entries:
'2' => "ABC"
'3' => "DEF"
'4' => "GHI"
'7' => "PQRS"
'6' => "MNO"
'9' => "WXYZ"
'8' => "TUV"
'5' => "JKL"
# Initialize the dictionary we're really interested in
= Dict{Char, Char}() phone_keypad_lookup
Dict{Char, Char}()
# Assemble the dictionary
for key in keys(phone_keypad)
for letter in phone_keypad[key]
= key
phone_keypad_lookup[letter] end
end
function name_to_phone_number(surname)
reduce(string, [phone_keypad_lookup[letter] for letter in uppercase(surname)]; init = "")
end
name_to_phone_number (generic function with 1 method)
Now we’ll use these functions to filter the customers data to just customers whose names can be spelled by their phone number. Note that I’ve loaded the Tidier package using an import
statement. This is because the Tidier package includes macros whose names coincide with macros from another package I’ve loaded (DataFramesMeta), and I want it to make sure that I always clearly refer to Tidier package macros with the prefix TD.
.
import Tidier as TD
import TidierStrings as TDS
# Identify the person whose last name matches their phone number
@chain customers begin
@TD.mutate(phone_number_digits = TDS.str_remove_all(phone, "-"),
= get_surname(name))
surname @TD.mutate(name_as_number = name_to_phone_number(surname))
@TD.filter(phone_number_digits == name_as_number)
@TD.select(name, surname, phone)
end
1×3 DataFrame
Row │ name surname phone
│ String31 String String15
─────┼──────────────────────────────────────────
1 │ Sam Tannenbaum TANNENBAUM 826-636-2286
R Solution
For comparison, here’s my R solution:
# Create a lookup key for letters to numbers
<- list(
keypad `2` = c("A", "B", "C"), `3` = c("D", "E", "F"),
`4` = c("G", "H", "I"), `5` = c("J", "K", "L"), `6` = c("M", "N", "O"),
`7` = c("P", "Q", "R", "S"), `8` = c("T", "U", "V"), `9` = c("W", "X", "Y", "Z")
)
<- vector()
keypad_letter_lookup for (key in names(keypad)) {
for (letter in keypad[[key]]) {
<- key
keypad_letter_lookup[letter]
}
}
# Create a function to convert names to numbers
<- function(name) {
name_to_number
<- name |> toupper() |> str_split(pattern = "")
letters_of_names
map_chr(letters_of_names, function(letters) {
|> paste(collapse = "")
keypad_letter_lookup[letters]
})
}
# Identify the person of interest
|>
customers mutate(
surname = str_extract(name, "[A-z]+$"),
surname_as_number = name_to_number(surname)
|>
) filter(
== str_remove_all(phone, "-")
surname_as_number |>
) select(name, surname, phone)
# A tibble: 1 × 3
name surname phone
<chr> <chr> <chr>
1 Sam Tannenbaum Tannenbaum 826-636-2286
Miscellaneous Observations
The biggest difference between these R and Julia solutions is that for R we had to split the names into their component characters, whereas in Julia, strings act like arrays of characters and so they don’t need to be split in order to iterate over the characters. This feels very similar to Python and is a nice feature. Of course, the data manipulation at the end is almost identical to the equivalent R code, since I used the Tidier package (which is a partial copy of the Tidyverse). That’s nice for useRs like me, although it’s troubling that the Julia package completely borrowed the user interface from the R packages without crediting the R package authors as co-authors of the user interface. I’ll keep an eye on the Tidier package, since it’s a promising tool for writing reader-friendly data manipulation code in Julia.
Second Night of Hanukkah
For the second night of Hanukkah, instead of relying on Tidier to provide a familiar user interface, I’ll use the DataFramesMeta and DataFrames packages directly. The strategy for solving tonight’s puzzle is to identify any orders that include bagels, and then–among those orders–filter for customers whose names have the initials “JP”.
Julia Solution
# Filter for orders of bagels
@chain products begin
@rsubset occursin("bagel", lowercase(:desc))
innerjoin(orders_items, on = :sku)
innerjoin(orders, on = :orderid)
# Filter for orders in 2017
@rtransform :ordered = DATE.DateTime(:ordered, "yyyy-mm-dd HH:MM:SS")
@rsubset(DATE.year(:ordered) == 2017)
# Find customers whose initials are J.P.
innerjoin(customers, on = :customerid)
@rsubset occursin(r"^J[A-z]+ P[A-z]+", :name)
@select :customerid :name :phone
end
1×3 DataFrame
Row │ customerid name phone
│ Int64 String31 String15
─────┼───────────────────────────────────────────
1 │ 1475 Joshua Peterson 332-274-4185
R Solution
For comparison, here’s my R solution.
library(lubridate)
# Filter for orders containing bagels
|>
products filter(str_detect(tolower(desc), "bagel")) |>
inner_join(orders_items, by = "sku") |>
# Only look at orders in 2017
inner_join(orders, by = "orderid") |>
filter(year(ordered) == 2017) |>
# Look for a customer with the initials J.P.
inner_join(customers, by = "customerid") |>
filter(str_detect(name, "^J[A-z]+ P[A-z]+")) |>
select(customerid, name, phone)
# A tibble: 1 × 3
customerid name phone
<dbl> <chr> <chr>
1 1475 Joshua Peterson 332-274-4185
Third Night of Hanukkah
Based on tonight’s clue, we want to find someone whose birthsign is a Cancer, and who was born in the Year of the Rabbit.
Julia Solution
First, we’ll record the date ranges of all the rabbit years over the last century.
= hcat(
rabbit_years Date.(["1927-02-02", "1939-02-19", "1951-02-06", "1963-01-25", "1975-02-11",
DATE."1987-01-29", "1999-02-16", "2011-02-03", "2023-01-22"], "yyyy-mm-dd"),
Date.(["1928-01-22", "1940-02-07", "1952-01-26", "1964-02-12", "1976-01-30",
DATE."1988-02-16", "2000-02-04", "2012-01-22", "2024-02-09"], "yyyy-mm-dd")
)
9×2 Matrix{Dates.Date}:
1927-02-02 1928-01-22
1939-02-19 1940-02-07
1951-02-06 1952-01-26
1963-01-25 1964-02-12
1975-02-11 1976-01-30
1987-01-29 1988-02-16
1999-02-16 2000-02-04
2011-02-03 2012-01-22
2023-01-22 2024-02-09
Then we’ll add a variable :born_in_rabbit_year
to identify customers whose date of birth falls in a rabbit year.
# For each rabbit year, figure out who was born in that year
= @rtransform customers :born_in_rabbit_year = false;
customers
for i in 1:size(rabbit_years, 1) # Loop over rows
global customers = @chain customers begin
@rtransform :born_in_rabbit_year = :born_in_rabbit_year | (rabbit_years[i,1] <= :birthdate <= rabbit_years[i,2])
end
end
Now we can filter the customers for those born in a rabbit year, who live in the same neighborhood as the customer from Night Two, and whose astrological sign is Cancer.
@chain semijoin(
@rsubset customers :born_in_rabbit_year),
(@rsubset customers :customerid == 1475),
(= :citystatezip
on begin
) @rsubset (DATE.month(:birthdate) in [6,7])
@select :customerid :name :birthdate :phone
end
2×4 DataFrame
Row │ customerid name birthdate phone
│ Int64 String31 Date String15
─────┼───────────────────────────────────────────────────────
1 │ 2550 Robert Morton 1999-07-08 917-288-9635
2 │ 4304 Stephanie Flynn 1963-06-12 838-477-9033
R Solution
For comparison, here’s my R solution. The key difference is that we were able to use the Tidyverse functions interval()
and %within%
to simplify the process of filtering the data for customers born in a rabbit year.
library(lubridate)
# Enumerate which years are a Year of the Rabbit
<- interval(
rabbit_years start = ymd(c("1927-02-02", "1939-02-19", "1951-02-06",
"1963-01-25", "1975-02-11", "1987-01-29",
"1999-02-16", "2011-02-03", "2023-01-22")),
end = ymd(c("1928-01-22", "1940-02-07", "1952-01-26",
"1964-02-12", "1976-01-30", "1988-02-16",
"2000-02-04", "2012-01-22", "2024-02-09"))
)
# Filter for customers with a birthdate in the range of interest
<- customers |>
suspects # Identify customers born in the Year of the Rabbit
filter(map_lgl(birthdate, ~ any(.x %within% rabbit_years))) |>
# Identify customers whose birthsign is Cancer
filter((month(birthdate) == 6 & day(birthdate) >= 22) |
month(birthdate) == 7 & day(birthdate) < 22))
(
# Narrow down to customers who share the same neighborhood as the contractor
semi_join(
x = suspects,
y = customers |> filter(customerid == 1475),
by = "citystatezip"
|> select(customerid, name, phone, birthdate, citystatezip) )
# A tibble: 1 × 5
customerid name phone birthdate citystatezip
<dbl> <chr> <chr> <date> <chr>
1 2550 Robert Morton 917-288-9635 1999-07-08 Jamaica, NY 11435
Fourth Night of Hanukkah
For tonight, we want to find a female customer who is often the first person to place a bakery order at Noah’s each day.
Julia Solution
I’ll start by ensuring that the datetime variables in the data are encoded as DateTime variables rather than strings.
# Ensure that datetime variables are encoded as such
= @chain orders begin
orders @rtransform :ordered = DATE.DateTime(:ordered, "yyyy-mm-dd HH:MM:SS")
@rtransform :shipped = DATE.DateTime(:shipped, "yyyy-mm-dd HH:MM:SS")
end
213232×6 DataFrame
Row │ orderid customerid ordered shipped items ⋯
│ Int64 Int64 DateTime DateTime Missi ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 1001 6878 2017-01-31T02:56:45 2017-01-31T09:00:00 missi ⋯
2 │ 1002 6375 2017-01-31T04:13:35 2017-01-31T12:15:00 missi
3 │ 1003 8045 2017-01-31T04:45:12 2017-01-31T10:45:00 missi
4 │ 1004 5385 2017-01-31T05:49:19 2017-01-31T09:00:00 missi
5 │ 1005 9235 2017-01-31T06:28:11 2017-01-31T16:00:00 missi ⋯
6 │ 1006 8414 2017-01-31T07:30:08 2017-01-31T19:15:00 missi
7 │ 1007 4285 2017-01-31T08:01:48 2017-01-31T19:15:00 missi
8 │ 1008 5725 2017-01-31T08:03:11 2017-01-31T08:03:11 missi
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
213226 │ 214226 2598 2022-12-16T22:53:21 2022-12-18T15:30:00 missi ⋯
213227 │ 214227 8352 2022-12-16T22:58:00 2022-12-23T11:00:00 missi
213228 │ 214228 2886 2022-12-16T23:11:08 2022-12-20T18:15:00 missi
213229 │ 214229 8735 2022-12-16T23:16:16 2022-12-21T14:15:00 missi
213230 │ 214230 4043 2022-12-16T23:23:33 2022-12-17T15:00:00 missi ⋯
213231 │ 214231 3894 2022-12-16T23:48:43 2022-12-23T18:15:00 missi
213232 │ 214232 1368 2022-12-16T23:49:15 2022-12-17T18:30:00 missi
2 columns and 213217 rows omitted
Then I’ll look at the customers who tended to be the bakery’s first customer in the morning.
# Identify bakery orders
@chain products begin
@rsubset startswith(:sku, "BKY")
semijoin(orders_items, _, on = :sku)
semijoin(orders, _, on = :orderid)
# Subset to customers ordering before 5AM
@rsubset DATE.hour(:ordered) < 5
# Find the first customer of each day
@rtransform :date_of_order = DATE.Date(:ordered)
@orderby :ordered;
groupby(:date_of_order)
combine(first)
# Count how often each customer placed a day's first bakery order
groupby(:customerid)
combine(nrow => :n_first_orders)
# Pull customer information
leftjoin(_, customers, on = :customerid)
@select :customerid :n_first_orders :phone :name
# View top five customers
@orderby -:n_first_orders
first(1)
end
1×4 DataFrame
Row │ customerid n_first_orders phone name
│ Int64 Int64 String15? String31?
─────┼────────────────────────────────────────────────────────
1 │ 2749 4 607-231-3605 Renee Harmon
R Solution
We’ll use the same approach for our R solution.
# First identify orders from the bakery
|>
orders semi_join(orders_items |> filter(startsWith(sku, "BKY"))) |>
# Subset to orders before 5AM
filter(hour(ordered) < 5) |>
# Identify customers who place the first bakery order in a day
group_by(date(ordered)) |>
slice(1) |>
# Count how often the customer placed the first bakery order of the day
group_by(customerid) |>
summarize(n_first_orders = n()) |>
ungroup() |>
# Find the customer with the most first-orders
slice_max(order_by = n_first_orders) |>
# Get customer information
inner_join(customers, by = "customerid") |>
select(customerid, n_first_orders, phone, name)
# A tibble: 1 × 4
customerid n_first_orders phone name
<dbl> <int> <chr> <chr>
1 2749 4 607-231-3605 Renee Harmon
Fifth Night of Hanukkah
Let’s find this cat lady. We’ll look at orders of products for senior cats, then we’ll see which customer has the largest average quantity of such products in their orders.
Julia Solution
import TidierStrings as TDS
using StatsBase: mean
@chain products begin
@rsubset TDS.str_detect(:desc, "Senior Cat")
semijoin(orders_items, _, on = :sku)
innerjoin(orders, on = :orderid)
groupby([:customerid, :orderid])
@combine :total_qty = sum(:qty)
groupby(:customerid)
@combine :avg_order_qty = mean(:total_qty)
innerjoin(customers, _, on = :customerid)
@select :customerid :name :phone :avg_order_qty
@orderby -:avg_order_qty
first(1)
end
1×4 DataFrame
Row │ customerid name phone avg_order_qty
│ Int64 String31 String15 Float64
─────┼────────────────────────────────────────────────────────
1 │ 3068 Nicole Wilson 631-507-6048 8.04762
R solution
We’ll use the exact same approach for the R solution.
# Filter for order items related to senior cats
|>
orders_items left_join(y = products, by = "sku") |>
filter(str_detect(desc, "Senior Cat")) |>
inner_join(orders, by = "orderid") |>
# For orders, calculate the total quantity of senior cat products
group_by(customerid, orderid) |>
summarize(total_qty = sum(qty)) |>
# For the customers, calculate average quantity of senior cat products
group_by(customerid) |>
summarize(avg_order_qty = mean(total_qty)) |>
# Find the custoemr with the largest average quantity
slice_max(n = 1, order_by = avg_order_qty) |>
inner_join(customers, by = "customerid") |>
select(customerid, name, phone, avg_order_qty)
# A tibble: 1 × 4
customerid name phone avg_order_qty
<dbl> <chr> <chr> <dbl>
1 3068 Nicole Wilson 631-507-6048 8.05
And there we have it.
Sixth Night of Hanukkah
The strategy for tonight is to look for orders where every item is discounted, and then see which customer has the most such orders.
Julia Solution
This night was the most annoying to do in Julia. I wanted to filter for groups where every element in the group satisfies some condition. Specifically, I wanted to filter orders to those where every item in the order is discounted.
For this kind of filtering operation, StackOverflow search suggested that we need to use some kind of ugly syntax with the filter()
function, which you can see in the solution below.
@chain orders begin
innerjoin(orders_items, on = :orderid)
innerjoin(products , on = :sku)
# Filter for orders where every item is discounted
groupby(:orderid)
@rtransform :is_discounted = :unit_price < :wholesale_cost
filter(:is_discounted => d -> length(d) == sum(d), _)
# Identify customer with the most discounted orders_items
groupby(:customerid)
combine(nrow => :n_disc_orders)
@orderby -:n_disc_orders
first(1)
# Get their information
innerjoin(customers , on = :customerid)
@select :customerid :name :phone :n_disc_orders
end
1×4 DataFrame
Row │ customerid name phone n_disc_orders
│ Int64 String31 String15 Int64
─────┼──────────────────────────────────────────────────────
1 │ 4167 Sherri Long 585-838-9161 54
The filter()
solution is dissatisfying for two reasons. First, the order of the arguments is annoying when working in a chain: it has the structure filter(my_filter_operation, things_to_filter)
, so in a pipe you always have to do filter(my_filter_operation, _)
. Second, as far as I know, we can’t just write something clean like all(:is_discounted)
to filter for groups where every record meets the condition. Instead, we have to use this weird key => value
where the key is a variable of interest (:is_discounted
) and the value is an anonymous function d -> ...
. It works but it’s hard to read. Like a lot of Julia operations in DataFrames, DataFramesMeta, etc. there’s a theoretical elegance to it if you want to think through an intellectual exercise. But if you’re just a normal analyst trying to work with data, it’s utterly bizarre and makes the code hard to communicate to other people.
R Solution
In contrast, the R code for filtering orders that are entirely discounted is much more readable thanks to the use of the all()
function.
# Identify orders where every item is discounted
<- orders_items |>
fully_discounted_orders left_join(y = products, by = "sku") |>
left_join(y = orders, by = "orderid") |>
group_by(orderid) |>
filter(all(unit_price < wholesale_cost)) |>
ungroup()
# Find the customer with the most fully-discounted orders
|>
fully_discounted_orders count(customerid, name = "n_disc_orders") |>
slice_max(n = 1, order_by = n_disc_orders) |>
left_join(y = customers, by ="customerid") |>
select(n_disc_orders, customerid, name, phone, citystatezip)
# A tibble: 1 × 5
n_disc_orders customerid name phone citystatezip
<int> <dbl> <chr> <chr> <chr>
1 35 4167 Sherri Long 585-838-9161 Bronx, NY 10473
Seventh Night of Hanukkah
Fr this night, we’ll look at orders of items with colors, placed in the same hour as the person from Night 6 (with customer ID 4167), where the ordered item is the same product as that person’s but in a different color.
Julia Solution
function get_color(product_desc)
= match(r"\([A-z]+\)", product_desc)
color_regex_match if isnothing(color_regex_match)
""
else
color_regex_match.matchend
end
get_color (generic function with 1 method)
= @chain orders begin
orders_with_color innerjoin(orders_items, on = :orderid)
innerjoin(products , on = :sku)
@rtransform :color = get_color(:desc) :item = replace(:desc, r" \([A-z]+\)" => "" )
@rsubset :color != ""
@select :customerid :ordered :item :color
end
28012×4 DataFrame
Row │ customerid ordered item color
│ Int64 DateTime String Abstract…
───────┼──────────────────────────────────────────────────────────────────
1 │ 4716 2017-01-31T10:23:50 Noah's Poster (yellow)
2 │ 3808 2017-01-31T10:29:22 Noah's Lunchbox (mauve)
3 │ 2645 2017-01-31T10:53:59 Noah's Gift Box (magenta)
4 │ 2520 2017-01-31T13:45:58 Noah's Jersey (mauve)
5 │ 8385 2017-01-31T13:47:53 Noah's Bobblehead (puce)
6 │ 3779 2017-01-31T15:32:25 Noah's Poster (yellow)
7 │ 8267 2017-01-31T16:11:59 Noah's Jewelry (puce)
8 │ 4721 2017-01-31T17:50:31 Noah's Jersey (white)
⋮ │ ⋮ ⋮ ⋮ ⋮
28006 │ 9105 2022-12-16T17:45:34 Noah's Jewelry (puce)
28007 │ 2824 2022-12-16T17:46:36 Noah's Jewelry (yellow)
28008 │ 6222 2022-12-16T21:57:11 Noah's Action Figure (orange)
28009 │ 5950 2022-12-16T22:00:02 Noah's Poster (mauve)
28010 │ 8352 2022-12-16T22:58:00 Noah's Action Figure (yellow)
28011 │ 3894 2022-12-16T23:48:43 Noah's Lunchbox (green)
28012 │ 1368 2022-12-16T23:49:15 Noah's Gift Box (magenta)
27997 rows omitted
@chain orders_with_color begin
@rtransform :date_ordered = DATE.Date(:ordered) :hour_ordered = DATE.hour(:ordered)
semijoin(_, filter(:customerid => isequal(4167), _), on = [:item, :date_ordered, :hour_ordered])
antijoin(_, filter(:customerid => isequal(4167), _), on = [:item, :color, :date_ordered])
end
1×6 DataFrame
Row │ customerid ordered item color date_ordered ⋯
│ Int64 DateTime String Abstract… Dates.Date ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 5783 2018-12-31T12:26:39 Noah's Poster (orange) 2018-12-31 ⋯
1 column omitted
R Solution
# Identify orders of items that come in different colors
<- orders |>
orders_with_color full_join(orders_items) |>
full_join(products |> select(sku, desc)) |>
mutate(color = str_extract(desc, "\\([A-z]+\\)"),
item = str_remove(desc, fixed(color)) |> str_trim()) |>
filter(!is.na(color))
# Identify orders of the same item but different colors,
# on the same day but preceding Sherri's order
|>
orders_with_color group_by(date(ordered)) |>
filter(any(customerid == 4167)) |>
filter(ordered <= (ordered[customerid == 4167])) |>
filter(item %in% unique(item[customerid == 4167]),
!color %in% unique(item[customerid == 4167])) |>
filter(n_distinct(customerid) > 1) |>
mutate(lead_time = ordered[customerid == 4167] - ordered) |>
ungroup() |>
filter(customerid != 4167) |>
slice_min(order_by = lead_time) |>
left_join(customers |> select(customerid, name, phone)) |>
select(customerid, name, phone)
# A tibble: 1 × 3
customerid name phone
<dbl> <chr> <chr>
1 5783 Carlos Myers 838-335-7157
Of the three people who bought a different-colored poster before Sherri did, one of them bought it just before Sherri did: Carlos.
Eighth Night of Hanukkah
For this night, we want to find someone who owns all of the different collectible items. The hiccup here is that collectibles (identified by the product code with COL
in the name) come in different colors, which we don’t care about. So we’ll remove the color from the product name, and then identify someone who owns each collectible item.
Julia Solution
= @chain products begin
collectibles @rsubset startswith(:sku, "COL")
@rtransform :collectible = replace(:desc, r" \([A-z]+\)" => "" )
end
85×5 DataFrame
Row │ sku desc wholesale_cost dims_cm ⋯
│ String7 String Float64 String15 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ COL0037 Noah's Jewelry (green) 28.32 17.4|11.2|5.7 ⋯
2 │ COL0041 Noah's Ark Model (HO Scale) 2487.35 7.2|4.3|0.4
3 │ COL0065 Noah's Jewelry (mauve) 33.52 19.0|12.2|10.5
4 │ COL0166 Noah's Action Figure (blue) 13.98 12.1|7.7|7.2
5 │ COL0167 Noah's Bobblehead (blue) 5.36 8.9|5.6|0.6 ⋯
6 │ COL0483 Noah's Action Figure (mauve) 13.95 10.3|2.5|0.2
7 │ COL0837 Noah's Poster (mauve) 3.86 13.0|3.3|0.8
8 │ COL0891 Noah's Bobblehead (orange) 5.65 19.9|14.8|0.3
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱
79 │ COL9192 Noah's Poster (amber) 4.76 19.6|16.2|11.5 ⋯
80 │ COL9241 Noah's Gift Box (orange) 2.43 17.4|13.1|2.2
81 │ COL9349 Noah's Action Figure (orange) 15.47 16.6|12.9|11.9
82 │ COL9420 Noah's Jewelry (amber) 30.01 13.8|9.1|4.9
83 │ COL9773 Noah's Poster (magenta) 4.13 7.2|4.9|1.9 ⋯
84 │ COL9819 Noah's Lunchbox (blue) 9.87 14.2|4.1|3.3
85 │ COL9948 Noah's Jewelry (magenta) 32.97 18.0|17.5|6.2
1 column and 70 rows omitted
= @chain collectibles begin
n_distinct_collectibles @distinct :collectible
nrowend
8
@chain orders_items begin
innerjoin(collectibles, _, on = :sku)
innerjoin(orders , _, on = :orderid)
@distinct :customerid :collectible
groupby(:customerid)
combine(nrow => :n_collectibles)
@rsubset(:n_collectibles == n_distinct_collectibles)
innerjoin(customers , _, on = :customerid)
@select :customerid :name :phone :n_collectibles
end
1×4 DataFrame
Row │ customerid name phone n_collectibles
│ Int64 String31 String15 Int64
─────┼───────────────────────────────────────────────────────
1 │ 3808 James Smith 212-547-3518 8
R Solution
# Identify all of the item identifiers of collectibles
# and create a "collectible" variable that deduplicates
# the same collectible available in different colors
<- products |>
collectibles filter(startsWith(sku, "COL")) |>
mutate(collectible = str_remove(desc, " \\(.+")) |>
distinct(sku, collectible)
# Count the number of distinct collectibles (ignoring color)
<- collectibles |>
n_distinct_collectibles distinct(collectible) |>
nrow()
# Find a customer with every distinct collectible
# and get his name and phone number
|>
orders_items inner_join(collectibles) |>
inner_join(orders) |>
distinct(customerid, collectible) |>
group_by(customerid) |>
summarize(n_collectibles = n()) |>
filter(n_collectibles == n_distinct_collectibles) |>
left_join(customers) |>
select(customerid, name, phone, n_collectibles)
# A tibble: 1 × 4
customerid name phone n_collectibles
<dbl> <chr> <chr> <int>
1 3808 James Smith 212-547-3518 8