Hierarchical Poisson Factorization
This is the documentation page for the Python package hpfrec. For more details, see the project’s GitHub page:
Documentation

class hpfrec.HPF(k=30, a=0.3, a_prime=0.3, b_prime=1.0, c=0.3, c_prime=0.3, d_prime=1.0, ncores=-1, stop_crit='train-llk', check_every=10, stop_thr=0.001, users_per_batch=None, items_per_batch=None, step_size=<function HPF.<lambda>>, maxiter=100, use_float=True, reindex=True, verbose=True, random_seed=None, allow_inconsistent_math=False, full_llk=False, alloc_full_phi=False, keep_data=True, save_folder=None, produce_dicts=True, keep_all_objs=True, sum_exp_trick=False)
Bases: object
Hierarchical Poisson Factorization
Model for recommending items based on probabilistic Poisson factorization on sparse count data (e.g. number of times a user played different songs), using mean-field variational inference with coordinate ascent. Can also use stochastic variational inference (using mini-batches of data).
Can use different stopping criteria for the optimization procedure:
 Run for a fixed number of iterations (stop_crit=‘maxiter’).
 Calculate the Poisson log-likelihood every N iterations (stop_crit=‘train-llk’ and check_every) and stop once {1 - curr/prev} is below a certain threshold (stop_thr).
 Calculate the Poisson log-likelihood in a user-provided validation set (stop_crit=‘val-llk’, val_set and check_every) and stop once {1 - curr/prev} is below a certain threshold. For this criterion, you might want to lower the default threshold (see Note).
 Check the difference in the user-factor matrix after every N iterations (stop_crit=‘diff-norm’, check_every) and stop once the l2-norm of this difference is below a certain threshold (stop_thr). Note that this is not a percent difference as it is for the log-likelihood criteria, so you should put a larger value than the default here. This is a much faster criterion to calculate and is recommended for larger datasets.
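The two threshold-based checks above can be sketched in plain Python. This is a literal reading of the formulas in the list, not the package's internal code: the function names should_stop and diff_norm are hypothetical, and how the library handles edge cases (such as a previous value of zero) is an assumption.

```python
import math

def should_stop(prev_llk, curr_llk, stop_thr=0.001):
    """Literal reading of the criterion: stop once 1 - curr/prev < stop_thr."""
    if prev_llk == 0:
        # Relative change is undefined here; this sketch simply keeps iterating.
        return False
    return (1.0 - curr_llk / prev_llk) < stop_thr

def diff_norm(theta_prev, theta_curr):
    """l2-norm of the element-wise difference between two factor matrices
    given as lists of rows (an absolute difference, not a relative one)."""
    return math.sqrt(sum(
        (a - b) ** 2
        for row_prev, row_curr in zip(theta_prev, theta_curr)
        for a, b in zip(row_prev, row_curr)
    ))
```

With negative log-likelihoods, an improvement from -1000 to -900 gives a relative change of 0.1 and the loop continues. If the values are positive, an improvement makes the ratio exceed one, so this literal check stops right away, which is consistent with the warning about positive log-likelihood values.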
If passing reindex=True, it will internally reindex all user and item IDs. Your data will not require reindexing if the IDs for users and items in counts_df meet the following criteria:
 Are all integers.
 Start at zero.
 Don’t have any enumeration gaps, i.e. if there is a user ‘4’, user ‘3’ must also be there.
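As a rough illustration of the three criteria above, the following sketch checks whether a collection of IDs could be used without reindexing. The helper name ids_are_contiguous is hypothetical and is not part of hpfrec; it only restates the listed conditions.

```python
import numbers

def ids_are_contiguous(ids):
    """Return True if the IDs are all integers, start at zero and have no gaps."""
    unique_ids = set(ids)
    if not unique_ids:
        return False
    # bool is technically an integer type in Python, so exclude it explicitly.
    if not all(isinstance(i, numbers.Integral) and not isinstance(i, bool)
               for i in unique_ids):
        return False
    # With distinct integers, min == 0 and max == count - 1 implies no gaps.
    return min(unique_ids) == 0 and max(unique_ids) == len(unique_ids) - 1
```

If this check fails for either the user or the item IDs, leaving reindex=True is the safer choice.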
If you only want to obtain the fitted parameters and use your own API later for recommendations, you can pass produce_dicts=False and pass a folder in which to save them in csv format (they are also available as numpy arrays in this object’s Theta and Beta attributes). Otherwise, the model will create Python dictionaries with entries for each user and item, which can take quite a bit of RAM. These can speed up predictions later through this package’s API.
Passing verbose=True will also print RMSE (root mean squared error) at each iteration. For slightly better speed, pass verbose=False once you know what a good threshold should be for your data.
Note
DataFrames and arrays passed to ‘.fit’ might be modified in-place; if this is a problem, you’ll need to pass a copy to them, e.g. ‘counts_df=counts_df.copy()’.
Note
If ‘check_every’ is not None and stop_crit is not ‘diff-norm’, it will, every N iterations, calculate the log-likelihood of the data. By default, this is NOT the full likelihood (it does not include a constant that depends on the data but not on the parameters and which is quite slow to compute). It is calculated this way by default because otherwise it can result in overflow (the number is too big for the data type), but be aware that without this constant, the number can turn positive, which will interfere with the likelihood-based stopping criterion.
Note
If you pass a validation set, it will calculate the Poisson log-likelihood of the non-zero observations only, rather than the complete likelihood that also includes the combinations of users and items not present in the data (assumed to be zero), so it’s more likely that you will see positive numbers here.
Note
Compared to ALS, iterations from this algorithm are a lot faster to compute, so don’t be scared about passing large numbers for maxiter.
Note
In some unlucky cases, the parameters will become NA in the first iteration, in which case you should see weird values for log-likelihood and RMSE. If this happens, try again with a different random seed.
Note
Fitting in mini-batches is more prone to numerical instability, and compared to full-batch variational inference, it is more likely that all your parameters will turn to NaNs (which means the optimization procedure failed).
Parameters:  k (int) – Number of latent factors to use.
 a (float) – Shape parameter for the user-factor matrix.
 a_prime (float) – Shape parameter and dividend of the rate parameter for the user activity vector.
 b_prime (float) – Divisor of the rate parameter for the user activity vector.
 c (float) – Shape parameter for the item-factor matrix.
 c_prime (float) – Shape parameter and dividend of the rate parameter for the item popularity vector.
 d_prime (float) – Divisor of the rate parameter for the item popularity vector.
 ncores (int) – Number of cores to use to parallelize computations. If set to -1, will use the maximum available on the computer.
 stop_crit (str, one of ‘maxiter’, ‘train-llk’, ‘val-llk’, ‘diff-norm’) – Stopping criterion for the optimization procedure.
 check_every (None or int) – Calculate log-likelihood every N iterations.
 stop_thr (float) – Threshold for proportion increase in log-likelihood or l2-norm for difference between matrices.
 users_per_batch (None or int) – Number of users to take for each batch update in stochastic variational inference. If passing None both here and for ‘items_per_batch’, will perform full-batch variational inference, which leads to better results but on larger datasets takes longer to converge. If passing a number for both ‘users_per_batch’ and ‘items_per_batch’, it will alternate between epochs in which it samples by user and epochs in which it samples by item; this leads to faster convergence and is recommended, but using only one type leads to lower memory requirements and might have a use case if memory is limited.
 items_per_batch (None or int) – Number of items to take for each batch update in stochastic variational inference. If passing None both here and for ‘users_per_batch’, will perform full-batch variational inference, which leads to better results but on larger datasets takes longer to converge. If passing a number for both ‘users_per_batch’ and ‘items_per_batch’, it will alternate between epochs in which it samples by user and epochs in which it samples by item; this leads to faster convergence and is recommended, but using only one type leads to lower memory requirements and might have a use case if memory is limited.
 step_size (function(int) -> float in (0, 1)) – Function that takes the iteration/epoch number as input (starting at zero) and produces the step size for the global parameters as output (only used when fitting with stochastic variational inference). The step size must be a number between zero and one, and should decrease as the iteration number grows. Ignored when passing users_per_batch=None.
 maxiter (int or None) – Maximum number of iterations for which to run the optimization procedure. This corresponds to epochs when fitting in batches of users. Recommended to use a lower number when passing a batch size.
 use_float (bool) – Whether to use the C float type (typically np.float32). Using float types (as compared to double) results in less memory usage and faster operations, but it has less numeric precision and the results will be slightly worse compared to using double. If passing False, will use C double (typically np.float64).
 reindex (bool) – Whether to reindex data internally.
 verbose (bool) – Whether to print convergence messages.
 random_seed (int or None) – Random seed to use when starting the parameters.
 allow_inconsistent_math (bool) – Whether to allow inconsistent floating-point math (producing slightly different results on each run) which would allow parallelization of the updates for the shape parameters of Lambda and Gamma. Ignored (forced to True) in stochastic optimization mode.
 full_llk (bool) – Whether to calculate the full Poisson log-likelihood, including terms that don’t depend on the model parameters (thus are constant for a given dataset).
 alloc_full_phi (bool) – Whether to allocate the full Phi matrix (size n_samples * k) when using stochastic optimization. Doing so will make it a bit faster, but it will use more memory. Ignored when passing both ‘users_per_batch=None’ and ‘items_per_batch=None’.
 keep_data (bool) – Whether to keep information about which user was associated with each item in the training set, so as to exclude those items later when making Top-N recommendations.
 save_folder (str or None) – Folder where to save all model parameters as csv files.
 produce_dicts (bool) – Whether to produce Python dictionaries for users and items, which are used to speed up the prediction API of this package. You can still predict without them, but it might take some additional milliseconds (or more depending on the number of users and items).
 keep_all_objs (bool) – Whether to keep intermediate objects/variables in the object that are not necessary for predictions; these are: Gamma_shp, Gamma_rte, Lambda_shp, Lambda_rte, k_rte, t_rte (when passing True here, the model object will have these extra attributes too). Without these objects, it’s not possible to call functions that alter the model parameters given new information after it’s already fit.
 sum_exp_trick (bool) – Whether to use the sum-exp trick when scaling the multinomial parameters, that is, calculating them as exp(val - maxval)/sum_{val}(exp(val - maxval)) in order to avoid numerical overflow if there are very large numbers. For this kind of model, it is unlikely that it will be required, and it adds a small overhead, but if you notice NaNs in the results or in the likelihood, you might give this option a try.
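The formula given for sum_exp_trick can be illustrated with a short standalone sketch. The function name normalize_exp is hypothetical; it only shows the arithmetic and is not the package's implementation.

```python
import math

def normalize_exp(vals):
    """Compute exp(val - maxval) / sum(exp(val - maxval)) for a list of values.

    Subtracting the maximum leaves the result mathematically unchanged but keeps
    every exponent at or below zero, so exp() cannot overflow.
    """
    max_val = max(vals)
    exps = [math.exp(v - max_val) for v in vals]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, normalize_exp([1000.0, 1000.0]) works, whereas calling math.exp(1000.0) directly raises OverflowError in Python.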
Variables:  Theta (array (nusers, k)) – User-factor matrix.
 Beta (array (nitems, k)) – Item-factor matrix.
 user_mapping (array (nusers,)) – ID of the user (as passed to .fit) corresponding to each row of Theta.
 item_mapping (array (nitems,)) – ID of the item (as passed to .fit) corresponding to each row of Beta.
 user_dict (dict (nusers)) – Dictionary with the mapping between user IDs (as passed to .fit) and rows of Theta.
 item_dict (dict (nitems)) – Dictionary with the mapping between item IDs (as passed to .fit) and rows of Beta.
 is_fitted (bool) – Whether the model has been fit to some data.
 niter (int) – Number of iterations for which the fitting procedure was run.
 train_llk (int) – Final training likelihood calculated when the model was fit (only when passing ‘verbose=True’).
References
[1] Scalable Recommendation with Hierarchical Poisson Factorization (Gopalan, P., Hofman, J.M. and Blei, D.M., 2015)
[2] Stochastic variational inference (Hoffman, M.D., Blei, D.M., Wang, C. and Paisley, J., 2013)
add_user(user_id, counts_df, update_existing=False, maxiter=10, ncores=1, random_seed=1, stop_thr=0.001, update_all_params=None)
Add a new user to the model or update parameters for a user according to new data
Note
This function will NOT modify any of the item parameters.
Note
This function only works with one user at a time. For updating many users at the same time, use ‘partial_fit’ instead.
Note
For better results, refit the model from scratch.
Note
This function is prone to producing all-NaN values.
Parameters:  user_id (obj) – ID to give to the user (when adding a new one) or ID of the existing user whose parameters are to be updated according to the data in ‘counts_df’. Make sure that the data type is the same as the one passed in the training data, so if you have integer IDs, don’t pass a string as ID.
 counts_df (data frame or array (nsamples, 2)) – Data Frame with columns ‘ItemId’ and ‘Count’. If passing a numpy array, will take the first two columns in that order. Data containing user/item interactions from one user only for which to add or update parameters. Note that you need to pass all the user-item interactions for this user when making an update, not just the new ones.
 update_existing (bool) – Whether this should be an update of the parameters for an existing user (when passing True), or an addition of a new user that was not in the model before (when passing False).
 maxiter (int) – Maximum number of iterations to run.
 ncores (int) – Number of threads/cores to use. With data for only one user, it’s unlikely that using multiple threads would give a significant speedup, and it might even end up making the function slower due to the overhead.
 random_seed (int) – Random seed used to initialize parameters.
 stop_thr (float) – If the l2-norm of the difference between values of Theta_{u} between iterations is less than this, it will stop. Smaller values of ‘k’ should require smaller thresholds.
 update_all_params (bool) – Whether to update also the item parameters in each iteration. If passing True, will update them with a step size determined by the number of iterations already taken and the step_size function given as input in the model constructor call.
Returns: True – Will return True if the process finishes successfully.
Return type: bool

eval_llk(input_df, full_llk=False)
Evaluate Poisson log-likelihood (plus constant) for a given dataset
Note
This Poisson log-likelihood is calculated only for the combinations of users and items provided here, so it’s not a complete likelihood, and it might sometimes turn out to be a positive number because of this. The input data is filtered to keep only combinations of users and items that were present in the training set.
Parameters:  input_df (pandas data frame (nobs, 3)) – Input data on which to calculate log-likelihood, consisting of IDs and counts. Must contain one row per non-zero observation, with columns ‘UserId’, ‘ItemId’, ‘Count’. If a numpy array is provided, will assume the first 3 columns contain that info.
 full_llk (bool) – Whether to calculate terms of the likelihood that depend on the data but not on the parameters. Omitting them is faster, but it’s more likely to result in positive values.
Returns: llk – Dictionary containing the calculated log-likelihood and the number of observations that were used to calculate it.
Return type: dict
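To see why omitting the data-dependent constant can produce positive values, here is a minimal sketch of the Poisson log-likelihood over a set of observed counts. The function poisson_llk and its arguments are hypothetical; it returns a plain float rather than the dictionary that eval_llk returns, and it assumes the predicted rates are already available.

```python
import math

def poisson_llk(counts, rates, full_llk=False):
    """Sum of Poisson log-probabilities for observed counts given predicted rates.

    Without full_llk, the term log(count!) is dropped; it depends only on the
    data, not on the model parameters.
    """
    llk = 0.0
    for y, lam in zip(counts, rates):
        llk += y * math.log(lam) - lam
        if full_llk:
            llk -= math.lgamma(y + 1)  # lgamma(y + 1) equals log(y!)
    return llk
```

With a single observation of count 5 and rate 5.0, the partial value is positive while the full value is negative, since the two differ exactly by log(5!).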

fit(counts_df, val_set=None)
Fit Hierarchical Poisson Model to sparse count data
Fits a hierarchical Poisson model to count data using mean-field approximation with either full-batch coordinate ascent or mini-batch stochastic coordinate ascent.
Note
DataFrames and arrays passed to ‘.fit’ might be modified in-place; if this is a problem, you’ll need to pass a copy to them, e.g. ‘counts_df=counts_df.copy()’.
Note
Forcibly terminating the procedure should still keep the last calculated shape and rate parameter values, but is not recommended. If you need to make predictions on a forcibly terminated object, set the attribute ‘is_fitted’ to ‘True’.
Note
Fitting in mini-batches is more prone to numerical instability, and compared to full-batch variational inference, it is more likely that all your parameters will turn to NaNs (which means the optimization procedure failed).
Parameters:  counts_df (pandas data frame (nobs, 3) or coo_matrix) – Input data with one row per non-zero observation, consisting of triplets (‘UserId’, ‘ItemId’, ‘Count’). Must contain columns ‘UserId’, ‘ItemId’, and ‘Count’. Combinations of users and items not present are implicitly assumed to be zero by the model. Can also pass a sparse coo_matrix, in which case ‘reindex’ will be forced to ‘False’.
 val_set (pandas data frame (nobs, 3)) – Validation set on which to monitor log-likelihood. Same format as counts_df.
Returns: self – Copy of this object
Return type: obj

partial_fit(counts_df, batch_type='users', step_size=None, nusers=None, nitems=None, users_in_batch=None, items_in_batch=None, new_users=False, new_items=False, random_seed=None)
Updates the model with batches of data from a subset of users or items
Note
You must pass either the full set of user-item interactions that are non-zero for some subset of users, or the full set of item-user interactions that are non-zero for some subset of items. Otherwise, if passing a random sample of triplets, the model will not converge to reasonable results.
Note
All user and item IDs must be integers starting at one, without gaps in the numeration.
Note
For better results, fit the model with full-batch iterations (using the ‘fit’ method). Adding new users and/or items without refitting the model might result in worsened results for existing users/items. For adding users without altering the parameters for items or for other users, see the method ‘add_user’.
Note
Fitting in mini-batches is more prone to numerical instability, and compared to full-batch variational inference, it is more likely that all your parameters will turn to NaNs (which means the optimization procedure failed).
Parameters:  counts_df (data frame (n_samples, 3)) – Data frame with the user-item interactions for some subset of users. Must have columns ‘UserId’, ‘ItemId’, ‘Count’.
 batch_type (str, one of ‘users’ or ‘items’) – Whether ‘counts_df’ contains a sample of users with all their item counts (‘users’), or a sample of items with all their user counts (‘items’).
 step_size (None or float in (0, 1)) – Step size with which to update the global variables in the model. Must be a number between zero and one. If passing None, will determine it according to the step size function with which the model was initialized and the number of iterations or calls to partial fit that have been performed. If no valid function was passed at the initialization, it will use 1/sqrt(i+1).
 nusers (int) – Total number of users (not just in this batch!). Only required if calling partial_fit for the first time on a model object that hasn’t been fit.
 nitems (int) – Total number of items (not just in this batch!). Only required if calling partial_fit for the first time on a model object that hasn’t been fit.
 users_in_batch (None or array (n_users_sample,)) – Users that are present in counts_df. If passing None, will determine the unique elements in counts_df.UserId, but passing them if you already have them will skip this step.
 items_in_batch (None or array (n_items_sample,)) – Items that are present in counts_df. If passing None, will determine the unique elements in counts_df.ItemId, but passing them if you already have them will skip this step.
 new_users (bool) – Whether the data contains new users with numeration greater than the number of users with which the model was initially fit. For better results refit the model including all users/items instead of adding them afterwards.
 new_items (bool) – Whether the data contains new items with numeration greater than the number of items with which the model was initially fit. For better results refit the model including all users/items instead of adding them afterwards.
 random_seed (int) – Random seed to be used for the initialization of new user/item parameters. Ignored when new_users=False and new_items=False.
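The role of the step size can be sketched as follows. The schedule 1/sqrt(i + 2) is a hypothetical choice that stays strictly between zero and one and decreases with the iteration number. The blending rule is the usual stochastic variational inference update from reference [2], and it is an assumption that the package applies it in exactly this form.

```python
import math

def example_step_size(i):
    """Hypothetical decreasing schedule; i is the iteration number, starting at zero."""
    return 1.0 / math.sqrt(i + 2)

def blend_global(old_params, batch_estimate, rho):
    """Move global parameters towards the batch estimate by a fraction rho."""
    return [(1.0 - rho) * old + rho * new
            for old, new in zip(old_params, batch_estimate)]
```

Larger step sizes make each batch count for more, which speeds up early progress but makes the parameters noisier; a decreasing schedule lets later batches make smaller corrections.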
Returns: self – Copy of this object.
Return type: obj

predict(user, item)
Predict count for combinations of users and items
Note
You can either pass an individual user and item, or arrays representing tuples (UserId, ItemId) with the combinations of users and items for which to predict (one row per prediction).
Parameters:  user (array-like (npred,) or obj) – User(s) for which to predict each item.
 item (array-like (npred,) or obj) – Item(s) for which to predict for each user.
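For readers who export the fitted parameters and predict outside this package, the following sketch shows the computation under the Poisson factorization model of reference [1], where the expected count for a user-item pair is the inner product of the user's row of Theta and the item's row of Beta. The function predict_count is hypothetical, and plain lists and dicts stand in for the Theta, Beta, user_dict and item_dict attributes.

```python
def predict_count(theta, beta, user_dict, item_dict, user, item):
    """Expected count for one (user, item) pair as a dot product of factor rows."""
    u = user_dict[user]  # row of theta for this user ID
    i = item_dict[item]  # row of beta for this item ID
    return sum(t * b for t, b in zip(theta[u], beta[i]))
```

Looking up rows through the dictionaries is what lets the original IDs (strings, non-contiguous integers) be used at prediction time.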

predict_factors(counts_df, maxiter=10, ncores=1, random_seed=1, stop_thr=0.001, return_all=False)
Gets latent factors for a user given her item counts
This is similar to obtaining topics for a document in LDA.
Note
This function will NOT modify any of the item parameters.
Note
This function only works with one user at a time.
Note
This function is prone to producing all-NaN values.
Parameters:  counts_df (DataFrame or array (nsamples, 2)) – Data Frame with columns ‘ItemId’ and ‘Count’, indicating the non-zero item counts for a user for whom it’s desired to obtain latent factors.
 maxiter (int) – Maximum number of iterations to run.
 ncores (int) – Number of threads/cores to use. With data for only one user, it’s unlikely that using multiple threads would give a significant speedup, and it might even end up making the function slower due to the overhead. If passing -1, it will determine the maximum number of cores in the system and use that.
 random_seed (int) – Random seed used to initialize parameters.
 stop_thr (float) – If the l2-norm of the difference between values of Theta_{u} between iterations is less than this, it will stop. Smaller values of ‘k’ should require smaller thresholds.
 return_all (bool) – Whether to return also the intermediate calculations (Gamma_shp, Gamma_rte). When passing True here, the output will be a tuple containing (Theta, Gamma_shp, Gamma_rte, Phi)
Returns: latent_factors – Calculated latent factors for the user, given the input data
Return type: array (k,)

topN(user, n=10, exclude_seen=True, items_pool=None)
Recommend Top-N items for a user
Outputs the Top-N items according to score predicted by the model. Can exclude the items for the user that were associated to her in the training set, and can also recommend from only a subset of user-provided items.
Parameters:  user (obj) – User for which to recommend.
 n (int) – Number of top items to recommend.
 exclude_seen (bool) – Whether to exclude items that were associated to the user in the training set.
 items_pool (None or array) – Items to consider for recommending to the user.
Returns: rec – Top-N recommended items.
Return type: array (n,)
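The ranking described above can be sketched in plain Python for one user. The function top_n and its arguments are hypothetical: items are identified by their row index in the item-factor matrix, seen stands in for the training-set items kept when keep_data=True, and the score is assumed to be the dot product of user and item factors.

```python
def top_n(theta_u, beta, n=10, seen=(), items_pool=None):
    """Rank candidate items by predicted score and return the indices of the best n."""
    candidates = range(len(beta)) if items_pool is None else items_pool
    seen = set(seen)
    scored = [
        (sum(t * b for t, b in zip(theta_u, beta[i])), i)
        for i in candidates
        if i not in seen  # mirrors exclude_seen=True
    ]
    # Highest score first; ties keep their original candidate order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:n]]
```

Restricting candidates through items_pool also reduces the number of scores computed, which matters when the catalogue is large.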