Hierarchical Poisson Factorization

This is the documentation page for the Python package hpfrec. For more details, see the project’s GitHub page:

https://www.github.com/david-cortes/hpfrec

Installation

The package is available on PyPI and can be installed with:

pip install hpfrec
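
A minimal usage sketch on toy data (the column names follow the format expected by ‘.fit’, documented below; real data would of course be much larger):

import pandas as pd
from hpfrec import HPF

# toy data: one row per non-zero (user, item) count
counts_df = pd.DataFrame({
    'UserId': [0, 0, 1, 2, 2],
    'ItemId': [0, 1, 1, 0, 2],
    'Count':  [3, 1, 2, 5, 1]
})

model = HPF(k=3, verbose=False)
model.fit(counts_df)
model.predict(user=0, item=2)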

Documentation

class hpfrec.HPF(k=30, a=0.3, a_prime=0.3, b_prime=1.0, c=0.3, c_prime=0.3, d_prime=1.0, ncores=-1, stop_crit='maxiter', check_every=10, stop_thr=0.001, users_per_batch=None, items_per_batch=None, step_size=<function HPF.<lambda>>, maxiter=100, use_float=True, reindex=True, verbose=True, random_seed=None, allow_inconsistent_math=False, full_llk=False, alloc_full_phi=False, keep_data=True, save_folder=None, produce_dicts=True, keep_all_objs=True, sum_exp_trick=False)[source]

Bases: object

Hierarchical Poisson Factorization

Model for recommending items based on probabilistic Poisson factorization on sparse count data (e.g. number of times a user played different songs), using mean-field variational inference with coordinate-ascent. Can also use stochastic variational inference (using mini batches of data).

Can use different stopping criteria for the optimization procedure:

  1. Run for a fixed number of iterations (stop_crit=’maxiter’).

  2. Calculate the Poisson log-likelihood every N iterations (stop_crit=’train-llk’ and check_every) and stop once {1 - curr/prev} is below a certain threshold (stop_thr)

  3. Calculate the Poisson log-likelihood in a user-provided validation set (stop_crit=’val-llk’, val_set and check_every) and stop once {1 - curr/prev} is below a certain threshold. For this criterion, you might want to lower the default threshold (see Note).

  4. Check the difference in the user-factor matrix after every N iterations (stop_crit=’diff-norm’, check_every) and stop once the l2-norm of this difference is below a certain threshold (stop_thr). Note that this is not a percent difference as it is for the log-likelihood criteria, so you should use a larger value than the default here. This is a much faster criterion to calculate and is recommended for larger datasets.
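
As an illustration, criteria 3 and 4 above could be requested as follows (the parameter values are placeholders, not recommendations):

from hpfrec import HPF

# criterion 4: stop when the change in the user-factor matrix becomes small
model = HPF(k=50, stop_crit='diff-norm', check_every=10, stop_thr=1e-2, maxiter=200)

# criterion 3: monitor the Poisson log-likelihood on a validation set;
# the validation data is then passed as 'val_set' to '.fit'
model = HPF(k=50, stop_crit='val-llk', check_every=10, stop_thr=1e-5)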

If passing reindex=True, it will internally reindex all user and item IDs. Your data will not require reindexing if the IDs for users and items in counts_df meet the following criteria:

  1. Are all integers.

  2. Start at zero.

  3. Don’t have any enumeration gaps, i.e. if there is a user ‘4’, user ‘3’ must also be there.
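
For illustration, with hypothetical data frames:

import pandas as pd

# already meets the criteria: integer IDs, starting at zero, no gaps
ready_df = pd.DataFrame({'UserId': [0, 0, 1, 2],
                         'ItemId': [0, 1, 1, 0],
                         'Count':  [1, 2, 1, 3]})

# would require reindex=True: string user IDs and gaps in the item numbering
raw_df = pd.DataFrame({'UserId': ['u1', 'u1', 'u9'],
                       'ItemId': [10, 42, 42],
                       'Count':  [1, 2, 1]})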

If you only want to obtain the fitted parameters and use your own API later for recommendations, you can pass produce_dicts=False and pass a folder in which to save them in CSV format (they are also available as numpy arrays in this object’s Theta and Beta attributes). Otherwise, the model will create Python dictionaries with entries for each user and item, which can take up quite a bit of RAM. These dictionaries speed up predictions made later through this package’s API.
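
For example, a minimal sketch of keeping only the fitted matrices (toy data; the save_folder value mentioned in the comment is a placeholder):

import pandas as pd
from hpfrec import HPF

counts_df = pd.DataFrame({'UserId': [0, 0, 1], 'ItemId': [0, 1, 1], 'Count': [2, 1, 3]})

# pass e.g. save_folder='some_folder' to also write the parameters as csv files
model = HPF(k=3, produce_dicts=False, verbose=False)
model.fit(counts_df)
user_factors = model.Theta   # numpy array of shape (nusers, k)
item_factors = model.Beta    # numpy array of shape (nitems, k)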

Passing verbose=True will also print the RMSE (root mean squared error) at each iteration. For slightly better speed, pass verbose=False once you know what a good threshold should be for your data.

Note

DataFrames and arrays passed to ‘.fit’ might be modified inplace - if this is a problem, pass a copy instead, e.g. ‘counts_df=counts_df.copy()’.

Note

If ‘check_every’ is not None and stop_crit is not ‘diff-norm’, the model will calculate the log-likelihood of the data every N iterations. By default, this is NOT the full likelihood: it omits a constant term that depends on the data but not on the parameters and that is quite slow to compute. It is calculated this way by default because the full likelihood can otherwise cause overflow (a number too big for the data type), but be aware that without this constant the log-likelihood can turn positive, which can interfere with the likelihood-based stopping criteria.

Note

If you pass a validation set, the Poisson log-likelihood will be calculated for the non-zero observations only, rather than the complete likelihood that also includes the combinations of users and items not present in the data (assumed to be zero), so you might see positive numbers here.

Note

Compared to ALS, iterations from this algorithm are a lot faster to compute, so don’t be scared about passing large numbers for maxiter.

Note

In some unlucky cases, the parameters will become NaN in the first iteration, in which case you will see odd values for the log-likelihood and RMSE. If this happens, try again with a different random seed.

Note

Fitting in mini-batches is more prone to numerical instability than full-batch variational inference, and it is more likely that all your parameters will turn to NaN (which means the optimization procedure failed).

Parameters:
  • k (int) – Number of latent factors to use.

  • a (float) – Shape parameter for the user-factor matrix.

  • a_prime (float) – Shape parameter and dividend of the rate parameter for the user activity vector.

  • b_prime (float) – Divisor of the rate parameter for the user activity vector.

  • c (float) – Shape parameter for the item-factor matrix.

  • c_prime (float) – Shape parameter and dividend of the rate parameter for the item popularity vector.

  • d_prime (float) – Divisor of the rate parameter for the item popularity vector.

  • ncores (int) – Number of cores to use to parallelize computations. If set to -1, will use the maximum available on the computer.

  • stop_crit (str, one of ‘maxiter’, ‘train-llk’, ‘val-llk’, ‘diff-norm’) – Stopping criterion for the optimization procedure.

  • check_every (None or int) – Calculate log-likelihood every N iterations.

  • stop_thr (float) – Threshold for proportion increase in log-likelihood or l2-norm for difference between matrices.

  • users_per_batch (None or int) – Number of users to take for each batch update in stochastic variational inference. If passing None both here and for ‘items_per_batch’, will perform full-batch variational inference, which leads to better results but on larger datasets takes longer to converge. If passing a number for both ‘users_per_batch’ and ‘items_per_batch’, it will alternate between epochs in which it samples by user and epochs in which it samples by item - this leads to faster convergence and is recommended, but using only one type leads to lower memory requirements and might have a use case if memory is limited.

  • items_per_batch (None or int) – Number of items to take for each batch update in stochastic variational inference. If passing None both here and for ‘users_per_batch’, will perform full-batch variational inference, which leads to better results but on larger datasets takes longer to converge. If passing a number for both ‘users_per_batch’ and ‘items_per_batch’, it will alternate between epochs in which it samples by user and epochs in which it samples by item - this leads to faster convergence and is recommended, but using only one type leads to lower memory requirements and might have a use case if memory is limited.

  • step_size (function(int) -> float in (0, 1)) – Function that takes the iteration/epoch number as input (starting at zero) and produces the step size for the global parameters as output (only used when fitting with stochastic variational inference). The step size must be a number between zero and one, and should be decreasing with bigger iteration numbers. Ignored when passing users_per_batch=None. See the sketch after this parameter list for an example.

  • maxiter (int or None) – Maximum number of iterations for which to run the optimization procedure. This corresponds to epochs when fitting in batches of users. Recommended to use a lower number when passing a batch size.

  • use_float (bool) – Whether to use the C float type (typically np.float32). Using float types (as compared to double) results in less memory usage and faster operations, but it has less numeric precision and the results will be slightly worse compared to using double. If passing False, will use C double (typically np.float64).

  • reindex (bool) – Whether to reindex data internally.

  • verbose (bool) – Whether to print convergence messages.

  • random_seed (int or None) – Random seed to use when starting the parameters.

  • allow_inconsistent_math (bool) – Whether to allow inconsistent floating-point math (producing slightly different results on each run) which would allow parallelization of the updates for the shape parameters of Lambda and Gamma. Ignored (forced to True) in stochastic optimization mode.

  • full_llk (bool) – Whether to calculate the full Poisson log-likelihood, including terms that don’t depend on the model parameters (and are thus constant for a given dataset).

  • alloc_full_phi (bool) – Whether to allocate the full Phi matrix (size n_samples * k) when using stochastic optimization. Doing so will make it a bit faster, but it will use more memory. Ignored when passing both ‘users_per_batch=None’ and ‘items_per_batch=None’.

  • keep_data (bool) – Whether to keep information about which user was associated with each item in the training set, so as to exclude those items later when making Top-N recommendations.

  • save_folder (str or None) – Folder where to save all model parameters as csv files.

  • produce_dicts (bool) – Whether to produce Python dictionaries for users and items, which are used to speed up the prediction API of this package. You can still predict without them, but it might take some additional milliseconds (or more, depending on the number of users and items).

  • keep_all_objs (bool) – Whether to keep intermediate objects/variables in the object that are not necessary for predictions - these are: Gamma_shp, Gamma_rte, Lambda_shp, Lambda_rte, k_rte, t_rte (when passing True here, the model object will have these extra attributes too). Without these objects, it’s not possible to call functions that alter the model parameters given new information after it’s already fit.

  • sum_exp_trick (bool) – Whether to use the sum-exp trick when scaling the multinomial parameters - that is, calculating them as exp(val - maxval)/sum_{val}(exp(val - maxval)) in order to avoid numerical overflow when the numbers are too large. For this kind of model it is unlikely to be required, and it adds a small overhead, but if you notice NaNs in the results or in the likelihood, you might give this option a try.
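
As referenced in the ‘step_size’ entry, a minimal sketch of setting up stochastic variational inference with a custom step size schedule (the batch sizes and the schedule here are illustrative assumptions, not recommendations):

from hpfrec import HPF

# step size decreasing with the epoch number, always in (0, 1)
model = HPF(k=40,
            users_per_batch=10000,
            items_per_batch=10000,
            step_size=lambda i: 1.0 / ((i + 1) ** 0.7),
            maxiter=30)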

Variables:
  • Theta (array (nusers, k)) – User-factor matrix.

  • Beta (array (nitems, k)) – Item-factor matrix.

  • user_mapping (array (nusers,)) – ID of the user (as passed to .fit) corresponding to each row of Theta.

  • item_mapping (array (nitems,)) – ID of the item (as passed to .fit) corresponding to each row of Beta.

  • user_dict (dict (nusers)) – Dictionary with the mapping between user IDs (as passed to .fit) and rows of Theta.

  • item_dict (dict (nitems)) – Dictionary with the mapping between item IDs (as passed to .fit) and rows of Beta.

  • is_fitted (bool) – Whether the model has been fit to some data.

  • niter (int) – Number of iterations for which the fitting procedure was run.

  • train_llk (int) – Final training likelihood calculated when the model was fit (only when passing ‘verbose=True’).

add_user(user_id, counts_df, update_existing=False, maxiter=10, ncores=1, random_seed=1, stop_thr=0.001, update_all_params=None)[source]

Add a new user to the model or update parameters for a user according to new data

Note

This function will NOT modify any of the item parameters.

Note

This function only works with one user at a time. For updating many users at the same time, use ‘partial_fit’ instead.

Note

For better results, refit the model from scratch.

Parameters:
  • user_id (obj) – ID to give to the user (when adding a new one) or ID of the existing user whose parameters are to be updated according to the data in ‘counts_df’. Make sure the data type is the same as the one passed in the training data; if you trained with integer IDs, don’t pass a string as the ID.

  • counts_df (data frame or array (nsamples, 2)) – Data Frame with columns ‘ItemId’ and ‘Count’. If passing a numpy array, will take the first two columns in that order. Data containing user/item interactions from one user only for which to add or update parameters. Note that you need to pass all the user-item interactions for this user when making an update, not just the new ones.

  • update_existing (bool) – Whether this should be an update of the parameters for an existing user (when passing True), or an addition of a new user that was not in the model before (when passing False).

  • maxiter (int) – Maximum number of iterations to run.

  • ncores (int) – Number of threads/cores to use. With data for only one user, it’s unlikely that using multiple threads would give a significant speed-up, and it might even end up making the function slower due to the overhead.

  • random_seed (int) – Random seed used to initialize parameters.

  • stop_thr (float) – If the l2-norm of the difference between values of Theta_{u} between iterations is less than this, it will stop. Smaller values of ‘k’ should require smaller thresholds.

  • update_all_params (bool) – Whether to also update the item parameters in each iteration. If passing True, will update them with a step size determined by the number of iterations already taken and the step_size function given as input in the model constructor call.

Returns:

True – Will return True if the process finishes successfully.

Return type:

bool
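
A short sketch of both usages, assuming ‘model’ is an already-fitted HPF object and the IDs shown are hypothetical:

import pandas as pd

# all of this user's non-zero counts, not just the new ones
user_counts = pd.DataFrame({'ItemId': [3, 10, 25], 'Count': [1, 4, 2]})

# add a user that was not in the training data
model.add_user(user_id=10001, counts_df=user_counts, update_existing=False)

# or refresh the parameters of a user that is already in the model
model.add_user(user_id=57, counts_df=user_counts, update_existing=True)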

eval_llk(input_df, full_llk=False)[source]

Evaluate Poisson log-likelihood (plus constant) for a given dataset

Note

This Poisson log-likelihood is calculated only for the combinations of users and items provided here, so it’s not a complete likelihood, and it might sometimes turn out to be a positive number because of this. Will filter out the input data by taking only combinations of users and items that were present in the training set.

Parameters:
  • input_df (pandas data frame (nobs, 3)) – Input data on which to calculate the log-likelihood, consisting of IDs and counts. Must contain one row per non-zero observation, with columns ‘UserId’, ‘ItemId’, ‘Count’. If a numpy array is provided, will assume the first 3 columns contain that info.

  • full_llk (bool) – Whether to calculate the terms of the likelihood that depend on the data but not on the parameters. Omitting them is faster, but it’s more likely to result in positive values.

Returns:

llk – Dictionary containing the calculated log-likelihood and the number of observations that were used to calculate it.

Return type:

dict
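
For example, assuming ‘model’ is an already-fitted HPF object and ‘test_df’ holds held-out (‘UserId’, ‘ItemId’, ‘Count’) triplets:

llk = model.eval_llk(test_df, full_llk=False)
print(llk)   # dictionary with the log-likelihood and the number of observations used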

fit(counts_df, val_set=None)[source]

Fit Hierarchical Poisson Model to sparse count data

Fits a hierarchical Poisson model to count data using mean-field approximation with either full-batch coordinate-ascent or mini-batch stochastic coordinate-ascent.

Note

DataFrames and arrays passed to ‘.fit’ might be modified inplace - if this is a problem, pass a copy instead, e.g. ‘counts_df=counts_df.copy()’.

Note

Forcibly terminating the procedure should still keep the last calculated shape and rate parameter values, but is not recommended. If you need to make predictions on a forced-terminated object, set the attribute ‘is_fitted’ to ‘True’.

Note

Fitting in mini-batches is more prone to numerical instability than full-batch variational inference, and it is more likely that all your parameters will turn to NaN (which means the optimization procedure failed).

Parameters:
  • counts_df (pandas data frame (nobs, 3) or coo_array) – Input data with one row per non-zero observation, consisting of triplets (‘UserId’, ‘ItemId’, ‘Count’). Must contain columns ‘UserId’, ‘ItemId’, and ‘Count’. Combinations of users and items not present are implicitly assumed to be zero by the model. Can also pass a sparse coo_array, in which case ‘reindex’ will be forced to ‘False’.

  • val_set (pandas data frame (nobs, 3)) – Validation set on which to monitor log-likelihood. Same format as counts_df.

Returns:

self – Copy of this object

Return type:

obj
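
A slightly larger sketch, with synthetic data standing in for real counts and a validation set used as the stopping criterion:

import numpy as np, pandas as pd
from hpfrec import HPF

rng = np.random.default_rng(123)
counts_df = pd.DataFrame({
    'UserId': rng.integers(0, 100, size=2000),
    'ItemId': rng.integers(0, 50,  size=2000),
    'Count':  rng.integers(1, 10,  size=2000)
}).drop_duplicates(['UserId', 'ItemId'])

val_df   = counts_df.sample(frac=0.1, random_state=123)
train_df = counts_df.drop(val_df.index)
# keep only validation rows whose user and item also appear in the training data
val_df   = val_df[val_df.UserId.isin(train_df.UserId) & val_df.ItemId.isin(train_df.ItemId)]

model = HPF(k=20, stop_crit='val-llk', check_every=10, stop_thr=1e-5, random_seed=123)
model.fit(train_df.copy(), val_set=val_df.copy())   # .copy() since .fit may modify inplace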

partial_fit(counts_df, batch_type='users', step_size=None, nusers=None, nitems=None, users_in_batch=None, items_in_batch=None, new_users=False, new_items=False, random_seed=None)[source]

Updates the model with batches of data from a subset of users or items

Note

You must pass either the full set of user-item interactions that are non-zero for some subset of users, or the full set of item-user interactions that are non-zero for some subset of items. Otherwise, if passing a random sample of triplets, the model will not converge to reasonable results.

Note

All user and item IDs must be integers starting at one, with no gaps in the numbering.

Note

For better results, fit the model with full-batch iterations (using the ‘fit’ method). Adding new users and/or items without refitting the model might result in worsened results for existing users/items. For adding users without altering the parameters for items or for other users, see the method ‘add_user’.

Note

Fitting in mini-batches is more prone to numerical instability than full-batch variational inference, and it is more likely that all your parameters will turn to NaN (which means the optimization procedure failed).

Parameters:
  • counts_df (data frame (n_samples, 3)) – Data frame with the user-item interactions for some subset of users. Must have columns ‘UserId’, ‘ItemId’, ‘Count’.

  • batch_type (str, one of ‘users’ or ‘items’) – Whether ‘counts_df’ contains a sample of users with all their item counts (‘users’), or a sample of items with all their user counts (‘items’).

  • step_size (None or float in (0, 1)) – Step size with which to update the global variables in the model. Must be a number between zero and one. If passing None, will determine it according to the step size function with which the model was initialized and the number of iterations or calls to partial_fit that have been performed. If no valid function was passed at initialization, it will use 1/sqrt(i+1).

  • nusers (int) – Total number of users (not just in this batch!). Only required if calling partial_fit for the first time on a model object that hasn’t been fit.

  • nitems (int) – Total number of items (not just in this batch!). Only required if calling partial_fit for the first time on a model object that hasn’t been fit.

  • users_in_batch (None or array (n_users_sample,)) – Users that are present in counts_df. If passing None, will determine the unique elements in counts_df.UserId, but passing them if you already have them will skip this step.

  • items_in_batch (None or array (n_items_sample,)) – Items that are present in counts_df. If passing None, will determine the unique elements in counts_df.ItemId, but passing them if you already have them will skip this step.

  • new_users (bool) – Whether the data contains new users with numeration greater than the number of users with which the model was initially fit. For better results refit the model including all users/items instead of adding them afterwards.

  • new_items (bool) – Whether the data contains new items with numeration greater than the number of items with which the model was initially fit. For better results refit the model including all users/items instead of adding them afterwards.

  • random_seed (int) – Random seed to be used for the initialization of new user/item parameters. Ignored when new_users=False and new_items=False.

Returns:

self – Copy of this object.

Return type:

obj
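
A rough sketch of user-batched updates, assuming ‘model’ is an HPF object and that ‘full_df’, ‘nusers’ and ‘nitems’ are defined as described in the comments:

import numpy as np

# 'full_df' is assumed to hold all non-zero counts with columns 'UserId', 'ItemId', 'Count',
# with IDs already numbered as the note above requires; 'nusers'/'nitems' are the totals
user_ids = full_df.UserId.unique()

for epoch in range(10):
    for batch_ids in np.array_split(user_ids, 20):
        batch = full_df[full_df.UserId.isin(batch_ids)]   # all counts for these users
        model.partial_fit(batch, batch_type='users',
                          nusers=nusers, nitems=nitems,
                          users_in_batch=batch_ids)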

predict(user, item)[source]

Predict count for combinations of users and items

Note

You can either pass an individual user and item, or arrays representing tuples (UserId, ItemId) with the combinations of users and items for which to predict (one row per prediction).

Parameters:
  • user (array-like (npred,) or obj) – User(s) for which to predict each item.

  • item (array-like (npred,) or obj) – Item(s) for which to predict for each user.
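
For example, assuming ‘model’ is an already-fitted HPF object and these IDs exist in the training data:

# single prediction
model.predict(user=10, item=25)

# one prediction per (user, item) pair
model.predict(user=[10, 10, 3], item=[1, 2, 3])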

predict_factors(counts_df, maxiter=10, ncores=1, random_seed=1, stop_thr=0.001, return_all=False)[source]

Gets latent factors for a user given her item counts

This is similar to obtaining topics for a document in LDA.

Note

This function will NOT modify any of the item parameters.

Note

This function only works with one user at a time.

Parameters:
  • counts_df (DataFrame or array (nsamples, 2)) – Data Frame with columns ‘ItemId’ and ‘Count’, indicating the non-zero item counts for a user for whom it’s desired to obtain latent factors.

  • maxiter (int) – Maximum number of iterations to run.

  • ncores (int) – Number of threads/cores to use. With data for only one user, it’s unlikely that using multiple threads would give a significant speed-up, and it might even end up making the function slower due to the overhead. If passing -1, it will determine the maximum number of cores in the system and use that.

  • random_seed (int) – Random seed used to initialize parameters.

  • stop_thr (float) – If the l2-norm of the difference between values of Theta_{u} between iterations is less than this, it will stop. Smaller values of ‘k’ should require smaller thresholds.

  • return_all (bool) – Whether to return also the intermediate calculations (Gamma_shp, Gamma_rte). When passing True here, the output will be a tuple containing (Theta, Gamma_shp, Gamma_rte, Phi)

Returns:

latent_factors – Calculated latent factors for the user, given the input data

Return type:

array (k,)
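
A short sketch, assuming ‘model’ is an already-fitted HPF object and the item IDs are hypothetical:

import pandas as pd

# non-zero item counts for the user whose factors are desired
new_user_counts = pd.DataFrame({'ItemId': [4, 7, 19], 'Count': [2, 1, 5]})
theta_u = model.predict_factors(new_user_counts)   # array of shape (k,)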

topN(user, n=10, exclude_seen=True, items_pool=None)[source]

Recommend Top-N items for a user

Outputs the Top-N items according to the score predicted by the model. Can exclude the items that were associated with the user in the training set, and can also recommend from only a user-provided subset of items.

Parameters:
  • user (obj) – User for which to recommend.

  • n (int) – Number of top items to recommend.

  • exclude_seen (bool) – Whether to exclude items that were associated to the user in the training set.

  • items_pool (None or array) – Items to consider for recommending to the user.

Returns:

rec – Top-N recommended items.

Return type:

array (n,)
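
For example, assuming ‘model’ is an already-fitted HPF object and the IDs are hypothetical:

# ten highest-scored items the user has not interacted with in the training set
recs = model.topN(user=10, n=10, exclude_seen=True)

# recommend from a restricted pool of candidate items only
recs = model.topN(user=10, n=5, items_pool=[1, 2, 3, 4, 5, 6, 7])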
