import pandas as pd, numpy as np
import multiprocessing, os, warnings
from . import cython_loops_float, cython_loops_double, _check_openmp
import ctypes, types, inspect
from scipy.sparse import coo_array, issparse
# TODO: don't rely on pandas DFs internally for keeping track of a COO matrix.
# Should keep arrays directly and only invoke pd.Categorical / pd.factorize
# when needed.
class HPF:
"""
Hierarchical Poisson Factorization
Model for recommending items based on probabilistic Poisson factorization
on sparse count data (e.g. number of times a user played different songs),
using mean-field variational inference with coordinate-ascent.
Can also use stochastic variational inference (using mini batches of data).
Can use different stopping criteria for the optimization procedure:
1) Run for a fixed number of iterations (stop_crit='maxiter').
2) Calculate the Poisson log-likelihood every N iterations (stop_crit='train-llk' and check_every)
and stop once {1 - curr/prev} is below a certain threshold (stop_thr)
3) Calculate the Poisson log-likelihood in a user-provided validation set (stop_crit='val-llk', val_set and check_every)
and stop once {1 - curr/prev} is below a certain threshold. For this criterion, you might want to lower the
default threshold (see Note).
4) Check the difference in the user-factor matrix after every N iterations (stop_crit='diff-norm', check_every)
and stop once the *l2-norm* of this difference is below a certain threshold (stop_thr).
Note that this is **not a percent** difference as it is for log-likelihood criteria, so you should put a larger
value than the default here.
This is a much faster criterion to calculate and is recommended for larger datasets.
If passing reindex=True, it will internally reindex all user and item IDs. Your data will not require
reindexing if the IDs for users and items in counts_df meet the following criteria:
1) Are all integers.
2) Start at zero.
3) Don't have any enumeration gaps, i.e. if there is a user '4', user '3' must also be there.
If you only want to obtain the fitted parameters and use your own API later for recommendations,
you can pass produce_dicts=False and a folder in which to save them in CSV format (they are also
available as numpy arrays in this object's Theta and Beta attributes). Otherwise, the model
will create Python dictionaries with entries for each user and item, which can take quite a bit of
RAM. These dictionaries can speed up predictions later through this package's API.
Passing verbose=True will also print RMSE (root mean squared error) at each iteration.
For slightly better speed pass verbose=False once you know what a good threshold should be
for your data.
Note
----
DataFrames and arrays passed to '.fit' might be modified in-place. If this is a problem,
pass a copy instead, e.g. 'counts_df=counts_df.copy()'.
Note
----
If 'check_every' is not None and stop_crit is not 'diff-norm', the log-likelihood of the data
will be calculated every N iterations. By default this is NOT the full likelihood: it omits a constant
that depends on the data but not on the parameters and which is quite slow to compute. It is calculated
this way by default because including that constant can result in overflow (the number becomes too big
for the data type), but be aware that without this constant the number can turn positive
and mess with the likelihood-based stopping criteria.
Note
----
If you pass a validation set, it will calculate the Poisson log-likelihood **of the non-zero observations
only**, rather than the complete likelihood that includes also the combinations of users and items
not present in the data (assumed to be zero), thus it's more likely that you might see positive numbers here.
Note
----
Compared to ALS, iterations from this algorithm are a lot faster to compute, so don't be scared about passing
large numbers for maxiter.
Note
----
In some unlucky cases, the parameters will become NA in the first iteration, in which case you should see
weird values for log-likelihood and RMSE. If this happens, try again with a different random seed.
Note
----
Fitting in mini-batches is more prone to numerical instability than full-batch
variational inference, and it is more likely that all your parameters will turn to NaN
(which means the optimization procedure failed).
Parameters
----------
k : int
Number of latent factors to use.
a : float
Shape parameter for the user-factor matrix.
a_prime : float
Shape parameter and dividend of the rate parameter for the user activity vector.
b_prime : float
Divisor of the rate parameter for the user activity vector.
c : float
Shape parameter for the item-factor matrix.
c_prime : float
Shape parameter and dividend of the rate parameter for the item popularity vector.
d_prime : float
Divisor of the rate parameter for the item popularity vector.
ncores : int
Number of cores to use to parallelize computations.
If set to -1, will use the maximum available on the computer.
stop_crit : str, one of 'maxiter', 'train-llk', 'val-llk', 'diff-norm'
Stopping criterion for the optimization procedure.
check_every : None or int
Calculate log-likelihood every N iterations.
stop_thr : float
Threshold for proportion increase in log-likelihood or l2-norm for difference between matrices.
users_per_batch : None or int
Number of users to take for each batch update in stochastic variational inference. If passing None both here
and for 'items_per_batch', will perform full-batch variational inference, which leads to better results but on
larger datasets takes longer to converge.
If passing a number for both 'users_per_batch' and 'items_per_batch', it will alternate between epochs in which
it samples by user and epochs in which it samples by item - this leads to faster convergence and is recommended,
but using only one type leads to lower memory requirements and might have a use case if memory is limited.
items_per_batch : None or int
Number of items to take for each batch update in stochastic variational inference. If passing None both here
and for 'users_per_batch', will perform full-batch variational inference, which leads to better results but on
larger datasets takes longer to converge.
If passing a number for both 'users_per_batch' and 'items_per_batch', it will alternate between epochs in which
it samples by user and epochs in which it samples by item - this leads to faster convergence and is recommended,
but using only one type leads to lower memory requirements and might have a use case if memory is limited.
step_size : function(int) -> float in (0, 1)
Function that takes the iteration/epoch number as input (starting at zero) and produces the step size
for the global parameters as output (only used when fitting with stochastic variational inference).
The step size must be a number between zero and one, and should be decreasing with bigger iteration numbers.
Ignored when passing users_per_batch=None.
maxiter : int or None
Maximum number of iterations for which to run the optimization procedure. This corresponds to epochs when
fitting in batches of users. Recommended to use a lower number when passing a batch size.
use_float : bool
Whether to use the C float type (typically ``np.float32``). Using float types (as compared to double)
results in less memory usage and faster operations, but it has less numeric precision and the results
will be slightly worse compared to using double.
If passing ``False``, will use C double (typically ``np.float64``).
reindex : bool
Whether to reindex data internally.
verbose : bool
Whether to print convergence messages.
random_seed : int or None
Random seed to use when starting the parameters.
allow_inconsistent_math : bool
Whether to allow inconsistent floating-point math (producing slightly different results on each run)
which would allow parallelization of the updates for the shape parameters of Lambda and Gamma.
Ignored (forced to True) in stochastic optimization mode.
full_llk : bool
Whether to calculate the full Poisson log-likelihood, including terms that don't depend on the model parameters
(thus are constant for a given dataset).
alloc_full_phi : bool
Whether to allocate the full Phi matrix (size n_samples * k) when using stochastic optimization. Doing so
will make it a bit faster, but it will use more memory.
Ignored when passing both 'users_per_batch=None' and 'items_per_batch=None'.
keep_data : bool
Whether to keep information about which user was associated with each item
in the training set, so as to exclude those items later when making Top-N
recommendations.
save_folder : str or None
Folder where to save all model parameters as csv files.
produce_dicts : bool
Whether to produce Python dictionaries for users and items, which
are used to speed-up the prediction API of this package. You can still predict without
them, but it might take some additional milliseconds (or more depending on the
number of users and items).
keep_all_objs : bool
Whether to keep intermediate objects/variables in the object that are not necessary for
predictions - these are: Gamma_shp, Gamma_rte, Lambda_shp, Lambda_rte, k_rte, t_rte
(when passing True here, the model object will have these extra attributes too).
Without these objects, it's not possible to call functions that alter the model parameters
given new information after it's already fit.
sum_exp_trick : bool
Whether to use the sum-exp trick when scaling the multinomial parameters - that is, calculating them as
exp(val - maxval)/sum_{val}(exp(val - maxval)) in order to avoid numerical overflow when the numbers
are very large. For this kind of model it is unlikely to be required, and it adds a
small overhead, but if you notice NaNs in the results or in the likelihood, you might give this option a try.
Attributes
----------
Theta : array (nusers, k)
User-factor matrix.
Beta : array (nitems, k)
Item-factor matrix.
user_mapping_ : array (nusers,)
ID of the user (as passed to .fit) corresponding to each row of Theta.
item_mapping_ : array (nitems,)
ID of the item (as passed to .fit) corresponding to each row of Beta.
user_dict_ : dict (nusers)
Dictionary with the mapping between user IDs (as passed to .fit) and rows of Theta.
item_dict_ : dict (nitems)
Dictionary with the mapping between item IDs (as passed to .fit) and rows of Beta.
is_fitted : bool
Whether the model has been fit to some data.
niter : int
Number of iterations for which the fitting procedure was run.
train_llk : int
Final training likelihood calculated when the model was fit (only when passing 'verbose=True').
References
----------
.. [1] Scalable Recommendation with Hierarchical Poisson Factorization (Gopalan, P., Hofman, J.M. and Blei, D.M., 2015)
.. [2] Stochastic variational inference (Hoffman, M.D., Blei, D.M., Wang, C. and Paisley, J., 2013)
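Examples
--------
A minimal usage sketch with a small hypothetical dataset. The import path is an
assumption here (the class is assumed to be importable as 'hpfrec.HPF'):

>>> import pandas as pd
>>> from hpfrec import HPF
>>> counts_df = pd.DataFrame({
...     'UserId' : [10, 10, 20, 20, 30],
...     'ItemId' : ['a', 'b', 'a', 'c', 'b'],
...     'Count'  : [2, 1, 3, 1, 5]
... })
>>> model = HPF(k=5, maxiter=30, verbose=False)
>>> model = model.fit(counts_df.copy())
>>> recommended = model.topN(user=10, n=2)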
"""
def __init__(self, k=30, a=0.3, a_prime=0.3, b_prime=1.0,
c=0.3, c_prime=0.3, d_prime=1.0, ncores=-1,
stop_crit='maxiter', check_every=10, stop_thr=1e-3,
users_per_batch=None, items_per_batch=None, step_size=lambda x: 1/np.sqrt(x+2),
maxiter=100, use_float=True, reindex=True, verbose=True,
random_seed=None, allow_inconsistent_math=False, full_llk=False,
alloc_full_phi=False, keep_data=True, save_folder=None,
produce_dicts=True, keep_all_objs=True, sum_exp_trick=False):
## checking input
assert isinstance(k, int)
if isinstance(a, int):
a = float(a)
if isinstance(a_prime, int):
a_prime = float(a_prime)
if isinstance(b_prime, int):
b_prime = float(b_prime)
if isinstance(c, int):
c = float(c)
if isinstance(c_prime, int):
c_prime = float(c_prime)
if isinstance(d_prime, int):
d_prime = float(d_prime)
assert isinstance(a, float)
assert isinstance(a_prime, float)
assert isinstance(b_prime, float)
assert isinstance(c, float)
assert isinstance(c_prime, float)
assert isinstance(d_prime, float)
assert a>0
assert a_prime>0
assert b_prime>0
assert c>0
assert c_prime>0
assert d_prime>0
assert k>0
if ncores is None:
ncores = 1
if ncores < 1:
ncores = multiprocessing.cpu_count()
assert ncores>0
assert isinstance(ncores, int)
if (ncores > 1) and not (_check_openmp.get()):
msg_omp = "Attempting to use more than 1 thread, but "
msg_omp += "package was built without multi-threading "
msg_omp += "support - see the project's GitHub page for "
msg_omp += "more information."
warnings.warn(msg_omp)
if random_seed is not None:
assert isinstance(random_seed, int)
assert stop_crit in ['maxiter', 'train-llk', 'val-llk', 'diff-norm']
if maxiter is not None:
assert maxiter>0
assert isinstance(maxiter, int)
else:
if stop_crit == 'maxiter':
raise ValueError("If 'stop_crit' is set to 'maxiter', must provide a maximum number of iterations.")
maxiter = 10**10
if check_every is not None:
assert isinstance(check_every, int)
assert check_every>0
assert check_every<=maxiter
else:
if stop_crit != 'maxiter':
raise ValueError("If 'stop_crit' is not 'maxiter', must input after how many iterations to calculate it.")
check_every = 0
if isinstance(stop_thr, int):
stop_thr = float(stop_thr)
if stop_thr is not None:
assert stop_thr>0
assert isinstance(stop_thr, float)
if save_folder is not None:
save_folder = os.path.expanduser(save_folder)
assert os.path.exists(save_folder)
verbose = bool(verbose)
if (stop_crit == 'maxiter') and (not verbose):
check_every = 0
if not isinstance(step_size, types.FunctionType):
raise ValueError("'step_size' must be a function.")
if len(inspect.getfullargspec(step_size).args) < 1:
raise ValueError("'step_size' must be able to take the iteration number as input.")
assert (step_size(0) >= 0) and (step_size(0) <= 1)
assert (step_size(1) >= 0) and (step_size(1) <= 1)
if users_per_batch is not None:
if isinstance(users_per_batch, float):
users_per_batch = int(users_per_batch)
assert isinstance(users_per_batch, int)
assert users_per_batch > 0
else:
users_per_batch = 0
if items_per_batch is not None:
if isinstance(items_per_batch, float):
items_per_batch = int(items_per_batch)
assert isinstance(items_per_batch, int)
assert items_per_batch > 0
else:
items_per_batch = 0
## storing these parameters
self.k = k
self.a = a
self.a_prime = a_prime
self.b_prime = b_prime
self.c = c
self.c_prime = c_prime
self.d_prime = d_prime
self.ncores = ncores
self.allow_inconsistent_math = bool(allow_inconsistent_math)
self.use_float = bool(use_float)
self.random_seed = random_seed
self.stop_crit = stop_crit
self.reindex = bool(reindex)
self.keep_data = bool(keep_data)
self.maxiter = maxiter
self.check_every = check_every
self.stop_thr = stop_thr
self.save_folder = save_folder
self.verbose = verbose
self.produce_dicts = bool(produce_dicts)
self.full_llk = bool(full_llk)
self.alloc_full_phi = bool(alloc_full_phi)
self.keep_all_objs = bool(keep_all_objs)
self.sum_exp_trick = bool(sum_exp_trick)
self.step_size = step_size
self.users_per_batch = users_per_batch
self.items_per_batch = items_per_batch
if not self.reindex:
self.produce_dicts = False
## initializing other attributes
self.Theta = None
self.Beta = None
self.user_mapping_ = None
self.item_mapping_ = None
self.user_dict_ = None
self.item_dict_ = None
self.is_fitted = False
self.niter = None
self.train_llk = None
def fit(self, counts_df, val_set=None):
"""
Fit Hierarchical Poisson Model to sparse count data
Fits a hierarchical Poisson model to count data using mean-field approximation with either
full-batch coordinate-ascent or mini-batch stochastic coordinate-ascent.
Note
----
DataFrames and arrays passed to '.fit' might be modified in-place. If this is a problem,
pass a copy instead, e.g. 'counts_df=counts_df.copy()'.
Note
----
Forcibly terminating the procedure should still keep the last calculated shape and rate
parameter values, but is not recommended. If you need to make predictions on a forced-terminated
object, set the attribute 'is_fitted' to 'True'.
Note
----
Fitting in mini-batches is more prone to numerical instability than full-batch
variational inference, and it is more likely that all your parameters will turn to NaN
(which means the optimization procedure failed).
Parameters
----------
counts_df : pandas data frame (nobs, 3) or coo_array
Input data with one row per non-zero observation, consisting of triplets ('UserId', 'ItemId', 'Count').
Must contain the columns 'UserId', 'ItemId', and 'Count'.
Combinations of users and items not present are implicitly assumed to be zero by the model.
Can also pass a sparse coo_array, in which case 'reindex' will be forced to 'False'.
val_set : pandas data frame (nobs, 3)
Validation set on which to monitor log-likelihood. Same format as counts_df.
Returns
-------
self : obj
Copy of this object
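Examples
--------
A sketch of a typical call on hypothetical data, monitoring the log-likelihood on a
validation set (the 'hpfrec' import path is an assumption):

>>> import pandas as pd
>>> from hpfrec import HPF
>>> counts_df = pd.DataFrame({'UserId' : [1, 1, 2], 'ItemId' : [1, 2, 2], 'Count' : [3, 1, 2]})
>>> val_set = pd.DataFrame({'UserId' : [1, 2], 'ItemId' : [2, 1], 'Count' : [1, 1]})
>>> model = HPF(k=3, stop_crit='val-llk', check_every=5, maxiter=30, verbose=False)
>>> model = model.fit(counts_df.copy(), val_set=val_set.copy())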
"""
## a basic check
if self.stop_crit == 'val-llk':
if val_set is None:
raise ValueError("If 'stop_crit' is set to 'val-llk', must provide a validation set.")
## running each sub-process
if self.verbose:
self._print_st_msg()
self._process_data(counts_df)
if self.verbose:
self._print_data_info()
if (val_set is not None) and (self.stop_crit not in ["diff-norm", "train-llk"]):
self._process_valset(val_set)
else:
self.val_set = None
self._cast_before_fit()
self._fit()
## after terminating optimization
if self.keep_data:
if self.users_per_batch == 0:
self._store_metadata()
else:
self._st_ix_user = self._st_ix_user[:-1]
if self.produce_dicts and self.reindex:
self.user_dict_ = {self.user_mapping_[i]:i for i in range(self.user_mapping_.shape[0])}
self.item_dict_ = {self.item_mapping_[i]:i for i in range(self.item_mapping_.shape[0])}
self.is_fitted = True
del self.input_df
del self.val_set
return self
def _process_data(self, input_df):
calc_n = True
if isinstance(input_df, np.ndarray):
assert len(input_df.shape) > 1
assert input_df.shape[1] >= 3
input_df = pd.DataFrame(input_df[:, :3], copy=True, columns=["UserId", "ItemId", "Count"])
elif isinstance(input_df, pd.DataFrame):
assert input_df.shape[0] > 0
assert 'UserId' in input_df.columns
assert 'ItemId' in input_df.columns
assert 'Count' in input_df.columns
input_df = input_df[["UserId", "ItemId", "Count"]].copy()
elif issparse(input_df) and (input_df.format == "coo"):
self.nusers = input_df.shape[0]
self.nitems = input_df.shape[1]
input_df = pd.DataFrame({
'UserId' : input_df.row,
'ItemId' : input_df.col,
'Count' : input_df.data
}, copy=False)
self.reindex = False
calc_n = False
else:
raise ValueError("'input_df' must be a pandas data frame, numpy array, or scipy sparse coo_array.")
if self.stop_crit in ['maxiter', 'diff-norm']:
thr = 0
else:
thr = 0.9
self.input_df = input_df
obs_zero = self.input_df["Count"] <= thr
if obs_zero.sum() > 0:
warnings.warn(
"'counts_df' contains observations with a count value less than 1, these will be ignored."
" Any user or item associated exclusively with zero-value observations will be excluded."
" If using 'reindex=False', make sure that your data still meets the necessary criteria."
" If you still want to use these observations, set 'stop_crit' to 'diff-norm' or 'maxiter'."
)
self.input_df = self.input_df.loc[~obs_zero]
if self.reindex:
self.input_df["UserId"], self.user_mapping_ = pd.factorize(self.input_df["UserId"])
self.input_df["ItemId"], self.item_mapping_ = pd.factorize(self.input_df["ItemId"])
self.user_mapping_ = np.require(self.user_mapping_, requirements=["ENSUREARRAY"]).reshape(-1)
self.item_mapping_ = np.require(self.item_mapping_, requirements=["ENSUREARRAY"]).reshape(-1)
self.nusers = self.user_mapping_.shape[0]
self.nitems = self.item_mapping_.shape[0]
if (self.save_folder is not None) and self.reindex:
if self.verbose:
print("\nSaving user and item mappings...\n")
pd.Series(self.user_mapping_).to_csv(os.path.join(self.save_folder, 'users.csv'), index=False)
pd.Series(self.item_mapping_).to_csv(os.path.join(self.save_folder, 'items.csv'), index=False)
else:
if calc_n:
self.nusers = self.input_df["UserId"].max() + 1
self.nitems = self.input_df["ItemId"].max() + 1
if self.save_folder is not None:
with open(os.path.join(self.save_folder, "hyperparameters.txt"), "w") as pf:
pf.write("a: %.3f\n" % self.a)
pf.write("a_prime: %.3f\n" % self.a_prime)
pf.write("b_prime: %.3f\n" % self.b_prime)
pf.write("c: %.3f\n" % self.c)
pf.write("c_prime: %.3f\n" % self.c_prime)
pf.write("d_prime: %.3f\n" % self.d_prime)
pf.write("k: %d\n" % self.k)
if self.random_seed is not None:
pf.write("random seed: %d\n" % self.random_seed)
else:
pf.write("random seed: None\n")
cython_loops = cython_loops_float if self.use_float else cython_loops_double
if self.input_df['Count'].dtype != cython_loops.c_real_t:
self.input_df['Count'] = self.input_df["Count"].astype(cython_loops.c_real_t)
if self.input_df['UserId'].dtype != cython_loops.obj_ind_type:
self.input_df['UserId'] = self.input_df["UserId"].astype(cython_loops.obj_ind_type)
if self.input_df['ItemId'].dtype != cython_loops.obj_ind_type:
self.input_df['ItemId'] = self.input_df["ItemId"].astype(cython_loops.obj_ind_type)
if self.users_per_batch != 0:
if self.nusers < self.users_per_batch:
warnings.warn("Batch size passed is larger than number of users. Will set it to nusers/10.")
self.users_per_batch = ctypes.c_int(np.ceil(self.nusers/10))
self.input_df.sort_values('UserId', inplace=True)
self._store_metadata(for_partial_fit=True)
return None
def _process_valset(self, val_set, valset=True):
if isinstance(val_set, np.ndarray):
assert len(val_set.shape) > 1
assert val_set.shape[1] >= 3
self.val_set = pd.DataFrame(val_set[:, :3], copy=True, columns = ["UserId", "ItemId", "Count"])
elif isinstance(val_set, pd.DataFrame):
assert val_set.shape[0] > 0
assert 'UserId' in val_set.columns
assert 'ItemId' in val_set.columns
assert 'Count' in val_set.columns
self.val_set = val_set[["UserId", "ItemId", "Count"]].copy()
elif issparse(val_set) and (val_set.format == "coo"):
assert val_set.shape[0] <= self.nusers
assert val_set.shape[1] <= self.nitems
self.val_set = pd.DataFrame({
'UserId' : val_set.row,
'ItemId' : val_set.col,
'Count' : val_set.data
}, copy=False)
else:
raise ValueError("'val_set' must be a pandas data frame, numpy array, or sparse coo_array.")
if self.stop_crit == 'val-llk':
thr = 0
else:
thr = 0.9
obs_zero = self.val_set["Count"] <= thr
if obs_zero.sum() > 0:
warnings.warn(
"'val_set' contains observations with a count value less than 1, these will be ignored."
)
self.val_set = self.val_set.loc[~obs_zero]
if self.reindex:
self.val_set['UserId'] = pd.Categorical(self.val_set["UserId"], self.user_mapping_).codes
self.val_set['ItemId'] = pd.Categorical(self.val_set["ItemId"], self.item_mapping_).codes
self.val_set = self.val_set.loc[(self.val_set["UserId"] != (-1)) & (self.val_set["ItemId"] != (-1))]
if self.val_set.shape[0] == 0:
if valset:
warnings.warn("Validation set has no combinations of users and items"+
" in common with training set. If 'stop_crit' was set"+
" to 'val-llk', will now be switched to 'train-llk'.")
if self.stop_crit == 'val-llk':
self.stop_crit = 'train-llk'
self.val_set = None
else:
raise ValueError("'input_df' has no combinations of users and items"+
"in common with the training set.")
else:
self.val_set.reset_index(drop=True, inplace=True)
cython_loops = cython_loops_float if self.use_float else cython_loops_double
if self.val_set['Count'].dtype != cython_loops.c_real_t:
self.val_set['Count'] = self.val_set["Count"].astype(cython_loops.c_real_t)
if self.val_set['UserId'].dtype != cython_loops.obj_ind_type:
self.val_set['UserId'] = self.val_set["UserId"].astype(cython_loops.obj_ind_type)
if self.val_set['ItemId'].dtype != cython_loops.obj_ind_type:
self.val_set['ItemId'] = self.val_set["ItemId"].astype(cython_loops.obj_ind_type)
return None
def _store_metadata(self, for_partial_fit=False):
cython_loops = cython_loops_float if self.use_float else cython_loops_double
if self.verbose and for_partial_fit:
print("Creating user indices for stochastic optimization...")
X = coo_array(
(
self.input_df["Count"].to_numpy(copy=False),
(self.input_df["UserId"].to_numpy(copy=False), self.input_df["ItemId"].to_numpy(copy=False))
),
shape=(self.nusers, self.nitems),
dtype=ctypes.c_float if self.use_float else ctypes.c_double
).tocsr()
self._n_seen_by_user = X.indptr[1:] - X.indptr[:-1]
if for_partial_fit:
self._st_ix_user = np.require(X.indptr, dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"])
self.input_df.sort_values('UserId', inplace=True)
else:
self._st_ix_user = X.indptr[:-1]
self.seen = X.indices
return None
def _cast_before_fit(self):
## setting all parameters and data to the right type
cython_loops = cython_loops_float if self.use_float else cython_loops_double
self.Theta = np.empty((self.nusers, self.k), dtype=cython_loops.c_real_t)
self.Beta = np.empty((self.nitems, self.k), dtype=cython_loops.c_real_t)
self.k = cython_loops.cast_ind_type(self.k)
self.nusers = cython_loops.cast_ind_type(self.nusers)
self.nitems = cython_loops.cast_ind_type(self.nitems)
self.ncores = cython_loops.cast_int(self.ncores)
self.maxiter = cython_loops.cast_int(self.maxiter)
self.verbose = cython_loops.cast_int(self.verbose)
if self.random_seed is None:
self.random_seed = 0
self.random_seed = cython_loops.cast_int(self.random_seed)
self.check_every = cython_loops.cast_int(self.check_every)
self.stop_thr = cython_loops.cast_real_t(self.stop_thr)
self.a = cython_loops.cast_real_t(self.a)
self.a_prime = cython_loops.cast_real_t(self.a_prime)
self.b_prime = cython_loops.cast_real_t(self.b_prime)
self.c = cython_loops.cast_real_t(self.c)
self.c_prime = cython_loops.cast_real_t(self.c_prime)
self.d_prime = cython_loops.cast_real_t(self.d_prime)
if self.save_folder is None:
self.save_folder = ""
def _fit(self):
cython_loops = cython_loops_float if self.use_float else cython_loops_double
if self.val_set is None:
use_valset = cython_loops.cast_int(0)
self.val_set = pd.DataFrame(np.empty((0,3)), columns=['UserId','ItemId','Count'])
self.val_set['UserId'] = self.val_set["UserId"].to_numpy(copy=False, dtype=cython_loops.obj_ind_type)
self.val_set['ItemId'] = self.val_set["ItemId"].to_numpy(copy=False, dtype=cython_loops.obj_ind_type)
self.val_set['Count'] = self.val_set["Count"].to_numpy(copy=False, dtype=cython_loops.c_real_t)
else:
use_valset = cython_loops.cast_int(1)
if self.users_per_batch == 0:
self._st_ix_user = np.arange(1).astype(cython_loops.obj_ind_type)
self.niter, temp, self.train_llk = cython_loops.fit_hpf(
self.a, self.a_prime, self.b_prime,
self.c, self.c_prime, self.d_prime,
np.require(self.input_df["Count"].to_numpy(copy=False), dtype=cython_loops.c_real_t, requirements=["ENSUREARRAY", "C_CONTIGUOUS"]),
np.require(self.input_df["UserId"].to_numpy(copy=False), dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"]),
np.require(self.input_df["ItemId"].to_numpy(copy=False), dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"]),
self.Theta, self.Beta,
self.maxiter, self.stop_crit, self.check_every, self.stop_thr,
self.users_per_batch, self.items_per_batch,
self.step_size, cython_loops.cast_int(self.sum_exp_trick),
self._st_ix_user.astype(cython_loops.obj_ind_type),
self.save_folder, self.random_seed, self.verbose,
self.ncores, cython_loops.cast_int(self.allow_inconsistent_math),
use_valset,
np.require(self.val_set["Count"].to_numpy(copy=False), dtype=cython_loops.c_real_t, requirements=["ENSUREARRAY", "C_CONTIGUOUS"]),
np.require(self.val_set["UserId"].to_numpy(copy=False), dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"]),
np.require(self.val_set["ItemId"].to_numpy(copy=False), dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"]),
cython_loops.cast_int(self.full_llk), cython_loops.cast_int(self.keep_all_objs),
cython_loops.cast_int(self.alloc_full_phi)
)
if self.users_per_batch == 0:
del self._st_ix_user
if self.keep_all_objs:
self.Gamma_shp = temp[0]
self.Gamma_rte = temp[1]
self.Lambda_shp = temp[2]
self.Lambda_rte = temp[3]
self.k_rte = temp[4]
self.t_rte = temp[5]
def _process_data_single(self, counts_df):
assert self.is_fitted
assert self.keep_all_objs
if isinstance(counts_df, np.ndarray):
assert len(counts_df.shape) > 1
assert counts_df.shape[1] >= 2
counts_df = pd.DataFrame(counts_df[:,:2], columns=["ItemId", "Count"], copy=True)
elif isinstance(counts_df, pd.DataFrame):
assert counts_df.shape[0] > 0
assert "ItemId" in counts_df.columns
assert "Count" in counts_df.columns
counts_df = counts_df[["ItemId", "Count"]].copy()
else:
raise ValueError("'counts_df' must be a pandas data frame or a numpy array")
if self.reindex:
if self.produce_dicts:
try:
counts_df["ItemId"] = counts_df["ItemId"].map(lambda x: self.item_dict_[x])
except Exception:
raise ValueError("Can only make calculations for items that were in the training set.")
else:
counts_df["ItemId"] = pd.Categorical(counts_df["ItemId"].to_numpy(copy=False), self.item_mapping_).codes
if (counts_df["ItemId"] == -1).sum() > 0:
raise ValueError("Can only make calculations for items that were in the training set.")
cython_loops = cython_loops_float if self.use_float else cython_loops_double
counts_df["ItemId"] = np.require(counts_df["ItemId"], dtype=cython_loops.obj_ind_type)
counts_df["Count"] = np.require(counts_df["Count"], dtype=cython_loops.c_real_t)
return counts_df
def partial_fit(self, counts_df, batch_type='users', step_size=None,
nusers=None, nitems=None, users_in_batch=None, items_in_batch=None,
new_users=False, new_items=False, random_seed=None):
"""
Updates the model with batches of data from a subset of users or items
Note
----
You must pass either the **full set of user-item interactions** that are non-zero for some
subset of users, or the **full set of item-user interactions** that are non-zero for some
subset of items.
Otherwise, if passing a random sample of triplets, the model will not converge to reasonable results.
Note
----
All user and item IDs must be integers starting at zero, without gaps in the numeration.
Note
----
For better results, fit the model with full-batch iterations (using the 'fit' method).
Adding new users and/or items without refitting the model might result in worsened results
for existing users/items. For adding users without altering the parameters for items or for
other users, see the method 'add_user'.
Note
----
Fitting in mini-batches is more prone to numerical instability than full-batch
variational inference, and it is more likely that all your parameters will turn to NaN
(which means the optimization procedure failed).
Parameters
----------
counts_df : data frame (n_samples, 3)
Data frame with the user-item interactions for some subset of users. Must have columns
'UserId', 'ItemId', 'Count'.
batch_type : str, one of 'users' or 'items'
Whether 'counts_df' contains a sample of users with all their item counts ('users'), or a
sample of items with all their user counts ('items').
step_size : None or float in (0, 1)
Step size with which to update the global variables in the model. Must be a number between
zero and one. If passing None, will determine it according to the step size function with which
the model was initialized and the number of iterations or calls to partial fit that have been
performed. If no valid function was passed at initialization, it will fall back to 1/sqrt(niter + 2).
nusers : int
Total number of users (not just in this batch!). Only required if calling partial_fit for the
first time on a model object that hasn't been fit.
nitems : int
Total number of items (not just in this batch!). Only required if calling partial_fit for the
first time on a model object that hasn't been fit.
users_in_batch : None or array (n_users_sample,)
Users that are present in counts_df. If passing None, will determine the unique elements in
counts_df.UserId, but passing them if you already have them will skip this step.
items_in_batch : None or array (n_items_sample,)
Items that are present in counts_df. If passing None, will determine the unique elements in
counts_df.ItemId, but passing them if you already have them will skip this step.
new_users : bool
Whether the data contains new users with numeration greater than the number of users with which
the model was initially fit. **For better results refit the model including all users/items instead
of adding them afterwards**.
new_items : bool
Whether the data contains new items with numeration greater than the number of items with which
the model was initially fit. **For better results refit the model including all users/items instead
of adding them afterwards**.
random_seed : int
Random seed to be used for the initialization of new user/item parameters. Ignored when
new_users=False and new_items=False.
Returns
-------
self : obj
Copy of this object.
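Examples
--------
A rough sketch of mini-batch fitting by users, assuming zero-based integer IDs
without gaps (as required when 'reindex=False') and a hypothetical 'counts_df'
holding the full data; each batch contains all the interactions of its users:

>>> import pandas as pd
>>> from hpfrec import HPF
>>> counts_df = pd.DataFrame({'UserId' : [0, 0, 1, 1, 2],
...                           'ItemId' : [0, 1, 0, 2, 1],
...                           'Count'  : [1., 2., 1., 3., 1.]})
>>> model = HPF(k=3, reindex=False, keep_all_objs=True, verbose=False)
>>> for epoch in range(5):
...     for _, user_batch in counts_df.groupby(counts_df["UserId"] // 2):
...         model = model.partial_fit(user_batch, batch_type='users', nusers=3, nitems=3)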
"""
if self.reindex:
raise ValueError("'partial_fit' can only be called when using reindex=False.")
if not self.keep_all_objs:
raise ValueError("'partial_fit' can only be called when using keep_all_objs=True.")
if self.keep_data:
try:
self.seen
warnings.warn(
"When using 'partial_fit', the list of items seen by each user is not updated "
"with the data passed here."
)
except Exception:
warnings.warn(
"When fitting the model through 'partial_fit' without calling 'fit' beforehand, "
"'keep_data' will be forced to False."
)
self.keep_data = False
assert batch_type in ['users', 'items']
if batch_type == 'users':
user_batch = True
else:
user_batch = False
if nusers is None:
try:
nusers = self.nusers
except Exception:
raise ValueError("Must specify total number of users when calling 'partial_fit' for the first time.")
if nitems is None:
try:
nitems = self.nitems
except Exception:
raise ValueError("Must specify total number of items when calling 'partial_fit' for the first time.")
try:
if self.nusers is None:
self.nusers = nusers
except Exception:
self.nusers = nusers
try:
if self.nitems is None:
self.nitems = nitems
except Exception:
self.nitems = nitems
if step_size is None:
try:
self.step_size(0)
try:
step_size = self.step_size(self.niter)
except Exception:
self.niter = 0
step_size = 1.0
except Exception:
try:
step_size = 1 / np.sqrt(self.niter + 2)
except Exception:
self.niter = 0
step_size = 1.0
assert step_size >= 0
assert step_size <= 1
if random_seed is not None:
if isinstance(random_seed, float):
random_seed = int(random_seed)
assert isinstance(random_seed, int)
if isinstance(counts_df, np.ndarray):
counts_df = pd.DataFrame(counts_df[:,:3], copy=False, columns=["UserId", "ItemId", "Count"])
assert isinstance(counts_df, pd.DataFrame)
assert 'UserId' in counts_df.columns
assert 'ItemId' in counts_df.columns
assert 'Count' in counts_df.columns
assert counts_df.shape[0] > 0
cython_loops = cython_loops_float if self.use_float else cython_loops_double
Y_batch = np.require(counts_df["Count"], dtype=cython_loops.c_real_t, requirements=["ENSUREARRAY", "C_CONTIGUOUS"])
ix_u_batch = np.require(counts_df["UserId"], dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"])
ix_i_batch = np.require(counts_df["ItemId"], dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"])
if users_in_batch is None:
users_in_batch = np.unique(ix_u_batch)
else:
users_in_batch = np.require(users_in_batch, dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"])
if items_in_batch is None:
items_in_batch = np.unique(ix_i_batch)
else:
items_in_batch = np.require(items_in_batch, dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"])
if (self.Theta is None) or (self.Beta is None):
self._cast_before_fit()
self.Gamma_shp, self.Gamma_rte, self.Lambda_shp, self.Lambda_rte, \
self.k_rte, self.t_rte = cython_loops.initialize_parameters(
self.Theta, self.Beta, self.random_seed, self.a, self.a_prime,
self.b_prime, self.c, self.c_prime, self.d_prime)
self.Theta = self.Gamma_shp / self.Gamma_rte
self.Beta = self.Lambda_shp / self.Lambda_rte
if new_users:
if not self.keep_all_objs:
raise ValueError("Can only add users without refitting when using keep_all_objs=True")
nusers_now = ix_u_batch.max() + 1
nusers_add = nusers_now - self.nusers
if nusers_add < 1:
raise ValueError("There are no new users in the data passed to 'partial_fit'.")
self._initialize_extra_users(nusers_add, random_seed)
self.nusers += nusers_add
if new_items:
if not self.keep_all_objs:
raise ValueError("Can only add items without refitting when using keep_all_objs=True")
nitems_now = ix_i_batch.max() + 1
nitems_add = nitems_now - self.nitems
if nitems_add < 1:
raise ValueError("There are no new items in the data passed to 'partial_fit'.")
self._initialize_extra_items(nitems_add, random_seed)
self.nitems += nitems_add
k_shp = cython_loops.cast_real_t(self.a_prime + self.k * self.a)
t_shp = cython_loops.cast_real_t(self.c_prime + self.k * self.c)
add_k_rte = cython_loops.cast_real_t(self.a_prime / self.b_prime)
add_t_rte = cython_loops.cast_real_t(self.c_prime / self.d_prime)
multiplier_batch = float(nusers) / users_in_batch.shape[0]
cython_loops.partial_fit(
Y_batch,
ix_u_batch, ix_i_batch,
self.Theta, self.Beta,
self.Gamma_shp, self.Gamma_rte,
self.Lambda_shp, self.Lambda_rte,
self.k_rte, self.t_rte,
add_k_rte, add_t_rte, self.a, self.c,
k_shp, t_shp, cython_loops.cast_ind_type(self.k),
users_in_batch, items_in_batch,
cython_loops.cast_int(self.allow_inconsistent_math),
cython_loops.cast_real_t(step_size), cython_loops.cast_real_t(multiplier_batch),
self.ncores, user_batch
)
self.niter += 1
self.is_fitted = True
return self
def _initialize_extra_users(self, n, seed):
cython_loops = cython_loops_float if self.use_float else cython_loops_double
c_real_t = ctypes.c_float if self.use_float else ctypes.c_double
rng = np.random.default_rng(seed = seed if (seed is not None and seed > 0) else None)
new_Gamma_shp = self.a_prime + 0.01 * rng.random(size=(n, self.k), dtype=c_real_t)
new_Gamma_rte = self.a_prime + 0.01 * rng.random(size=(n, self.k), dtype=c_real_t)
new_Theta = new_Gamma_shp / new_Gamma_rte
new_k_rte = np.empty((n,1), dtype=c_real_t)
new_k_rte[:,:] = self.b_prime
self.k_rte = np.r_[self.k_rte, new_k_rte]
self.Theta = np.r_[self.Theta, new_Theta]
self.Gamma_rte = np.r_[self.Gamma_rte, new_Gamma_rte]
self.Gamma_shp = np.r_[self.Gamma_shp, new_Gamma_shp]
def _initialize_extra_items(self, n, seed):
cython_loops = cython_loops_float if self.use_float else cython_loops_double
c_real_t = ctypes.c_float if self.use_float else ctypes.c_double
rng = np.random.default_rng(seed = seed if (seed is not None and seed > 0) else None)
new_Lambda_shp = self.c_prime + 0.01 * rng.random(size=(n, self.k), dtype=c_real_t)
new_Lambda_rte = self.c_prime + 0.01 * rng.random(size=(n, self.k), dtype=c_real_t)
new_Beta = new_Lambda_shp / new_Lambda_rte
new_t_rte = np.empty((n,1), dtype=c_real_t)
new_t_rte[:,:] = self.d_prime
self.t_rte = np.r_[self.t_rte, new_t_rte]
self.Beta = np.r_[self.Beta, new_Beta]
self.Lambda_rte = np.r_[self.Lambda_rte, new_Lambda_rte]
self.Lambda_shp = np.r_[self.Lambda_shp, new_Lambda_shp]
def _check_input_predict_factors(self, ncores, random_seed, stop_thr, maxiter):
if ncores is None:
ncores = 1
if ncores < 1:
ncores = multiprocessing.cpu_count()
assert ncores>0
assert isinstance(ncores, int)
assert isinstance(random_seed, int)
assert random_seed > 0
if isinstance(stop_thr, int):
stop_thr = float(stop_thr)
assert stop_thr>0
assert isinstance(stop_thr, float)
if isinstance(maxiter, float):
maxiter = int(maxiter)
assert isinstance(maxiter, int)
assert maxiter > 0
return ncores, random_seed, stop_thr, maxiter
def predict_factors(self, counts_df, maxiter=10, ncores=1, random_seed=1, stop_thr=1e-3, return_all=False):
"""
Gets latent factors for a user given her item counts
This is similar to obtaining topics for a document in LDA.
Note
----
This function will NOT modify any of the item parameters.
Note
----
This function only works with one user at a time.
Parameters
----------
counts_df : DataFrame or array (nsamples, 2)
Data Frame with columns 'ItemId' and 'Count', indicating the non-zero item counts for a
user for whom it's desired to obtain latent factors.
maxiter : int
Maximum number of iterations to run.
ncores : int
Number of threads/cores to use. With data for only one user, it's unlikely that using
multiple threads would give a significant speed-up, and it might even end up making
the function slower due to the overhead.
If passing -1, it will determine the maximum number of cores in the system and use that.
random_seed : int
Random seed used to initialize parameters.
stop_thr : float
If the l2-norm of the difference between values of Theta_{u} between iterations is less
than this, it will stop. Smaller values of 'k' should require smaller thresholds.
return_all : bool
Whether to return also the intermediate calculations (Gamma_shp, Gamma_rte). When
passing True here, the output will be a tuple containing (Theta, Gamma_shp, Gamma_rte, Phi)
Returns
-------
latent_factors : array (k,)
Calculated latent factors for the user, given the input data
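Examples
--------
A hypothetical call for a new user, assuming 'model' is an HPF object already fitted
with 'keep_all_objs=True' (the default) and that items 'a' and 'b' were in the
training data:

>>> import pandas as pd
>>> new_user_counts = pd.DataFrame({'ItemId' : ['a', 'b'], 'Count' : [2, 1]})
>>> theta_new = model.predict_factors(new_user_counts)  # array of shape (k,)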
"""
ncores, random_seed, stop_thr, maxiter = self._check_input_predict_factors(ncores, random_seed, stop_thr, maxiter)
## processing the data
counts_df = self._process_data_single(counts_df)
## calculating the latent factors
cython_loops = cython_loops_float if self.use_float else cython_loops_double
Theta = np.empty(self.k, dtype = cython_loops.c_real_t)
temp = cython_loops.calc_user_factors(
self.a, self.a_prime, self.b_prime,
self.c, self.c_prime, self.d_prime,
np.require(counts_df["Count"].to_numpy(copy=False), dtype=cython_loops.c_real_t, requirements=["ENSUREARRAY", "C_CONTIGUOUS"]),
np.require(counts_df["ItemId"].to_numpy(copy=False), dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"]),
Theta, self.Beta,
self.Lambda_shp,
self.Lambda_rte,
cython_loops.cast_ind_type(counts_df.shape[0]), cython_loops.cast_ind_type(self.k),
cython_loops.cast_int(int(maxiter)), cython_loops.cast_int(ncores),
cython_loops.cast_int(int(random_seed)), cython_loops.cast_real_t(stop_thr),
cython_loops.cast_int(bool(return_all))
)
if np.isnan(Theta).sum() > 0:
raise ValueError("NaNs encountered in the result. Failed to produce latent factors.")
if return_all:
return (Theta, temp[0], temp[1], temp[2])
else:
return Theta
def add_user(self, user_id, counts_df, update_existing=False, maxiter=10, ncores=1,
random_seed=1, stop_thr=1e-3, update_all_params=None):
"""
Add a new user to the model or update parameters for a user according to new data
Note
----
This function will NOT modify any of the item parameters.
Note
----
This function only works with one user at a time. For updating many users at the same time,
use 'partial_fit' instead.
Note
----
For better results, refit the model again from scratch.
Parameters
----------
user_id : obj
ID to give to the user (when adding a new one) or ID of the existing user whose parameters are to be
updated according to the data in 'counts_df'. **Make sure that the data type is the same as was passed
in the training data, so if you have integer IDs, don't pass a string as ID**.
counts_df : data frame or array (nsamples, 2)
Data Frame with columns 'ItemId' and 'Count'. If passing a numpy array, will take the first two columns
in that order. Data containing user/item interactions **from one user only** for which to add or update
parameters. Note that you need to pass *all* the user-item interactions for this user when making an update,
not just the new ones.
update_existing : bool
Whether this should be an update of the parameters for an existing user (when passing True), or
an addition of a new user that was not in the model before (when passing False).
maxiter : int
Maximum number of iterations to run.
ncores : int
Number of threads/cores to use. With data for only one user, it's unlikely that using
multiple threads would give a significant speed-up, and it might even end up making
the function slower due to the overhead.
random_seed : int
Random seed used to initialize parameters.
stop_thr : float
If the l2-norm of the difference between values of Theta_{u} between iterations is less
than this, it will stop. Smaller values of 'k' should require smaller thresholds.
update_all_params : bool
Whether to update also the item parameters in each iteration. If passing True, will update them
with a step size determined by the number of iterations already taken and the step_size function
given as input in the model constructor call.
Returns
-------
True : bool
Will return True if the process finishes successfully.
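Examples
--------
A sketch of adding a hypothetical new user, assuming 'model' is an already-fitted HPF
object whose training data used integer user IDs and contained items 'a' and 'c':

>>> import pandas as pd
>>> new_counts = pd.DataFrame({'ItemId' : ['a', 'c'], 'Count' : [1, 4]})
>>> ok = model.add_user(user_id=40, counts_df=new_counts, update_existing=False)
>>> rec = model.topN(40, n=3)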
"""
ncores, random_seed, stop_thr, maxiter = self._check_input_predict_factors(ncores, random_seed, stop_thr, maxiter)
if update_existing:
## checking that the user already exists
if self.produce_dicts and self.reindex:
user_id = self.user_dict_[user_id]
else:
if self.reindex:
user_id = pd.Categorical(np.array([user_id]), self.user_mapping_).codes[0]
if user_id == -1:
raise ValueError("User was not present in the training data.")
## processing the data
counts_df = self._process_data_single(counts_df)
cython_loops = cython_loops_float if self.use_float else cython_loops_double
if update_all_params:
counts_df['UserId'] = user_id
counts_df['UserId'] = np.require(counts_df["UserId"], dtype=cython_loops.obj_ind_type)
self.partial_fit(counts_df, new_users=(not update_existing))
Theta_prev = self.Theta[-1].copy()
for i in range(maxiter - 1):
self.partial_fit(counts_df)
new_Theta = self.Theta[-1]
if np.linalg.norm(new_Theta - Theta_prev) <= stop_thr:
break
else:
Theta_prev = self.Theta[-1].copy()
else:
## calculating the latent factors
Theta = np.empty(self.k, dtype = cython_loops.c_real_t)
temp = cython_loops.calc_user_factors(
self.a, self.a_prime, self.b_prime,
self.c, self.c_prime, self.d_prime,
np.require(counts_df["Count"].to_numpy(copy=False), dtype=cython_loops.c_real_t, requirements=["ENSUREARRAY", "C_CONTIGUOUS"]),
np.require(counts_df["ItemId"].to_numpy(copy=False), dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"]),
Theta, self.Beta,
self.Lambda_shp,
self.Lambda_rte,
cython_loops.cast_ind_type(counts_df.shape[0]), cython_loops.cast_ind_type(self.k),
cython_loops.cast_int(maxiter), cython_loops.cast_int(ncores),
cython_loops.cast_int(random_seed), cython_loops.cast_real_t(stop_thr),
cython_loops.cast_int(self.keep_all_objs)
)
if np.isnan(Theta).sum() > 0:
raise ValueError("NaNs encountered in the result. Failed to produce latent factors.")
## adding the data to the model
if update_existing:
self.Theta[user_id] = Theta
if self.keep_all_objs:
self.Gamma_shp[user_id] = temp[0]
self.Gamma_rte[user_id] = temp[1]
self.k_rte[user_id] = self.a_prime/self.b_prime + \
(temp[0].reshape((1,-1))/temp[1].reshape((1,-1))).sum(axis=1, keepdims=True)
else:
if self.reindex:
new_id = self.user_mapping_.shape[0]
self.user_mapping_ = np.r_[self.user_mapping_, np.array(user_id)]
if self.produce_dicts:
self.user_dict_[user_id] = new_id
self.Theta = np.r_[self.Theta, Theta.reshape((1, self.k))]
if self.keep_all_objs:
self.Gamma_shp = np.r_[self.Gamma_shp, temp[0].reshape((1, self.k))]
self.Gamma_rte = np.r_[self.Gamma_rte, temp[1].reshape((1, self.k))]
self.k_rte = np.r_[self.k_rte, self.a_prime/self.b_prime + \
(temp[0].reshape((1,-1))/temp[1].reshape((1,-1))).sum(axis=1, keepdims=True)]
self.nusers += 1
## updating the list of seen items for this user
if self.keep_data:
if update_existing:
n_seen_by_user_before = self._n_seen_by_user[user_id]
self._n_seen_by_user[user_id] = counts_df.shape[0]
self.seen = np.r_[self.seen[:self._st_ix_user[user_id]], counts_df["ItemId"].to_numpy(copy=False), self.seen[(self._st_ix_user[user_id] + n_seen_by_user_before):]]
self._st_ix_user[(user_id + 1):] += self._n_seen_by_user[user_id] - n_seen_by_user_before
else:
self._n_seen_by_user = np.r_[self._n_seen_by_user, np.array(counts_df.shape[0])]
self._st_ix_user = np.r_[self._st_ix_user, self.seen.shape[0]]
self.seen = np.r_[self.seen, counts_df["ItemId"].to_numpy(copy=False)]
return True
def predict(self, user, item):
"""
Predict count for combinations of users and items
Note
----
You can either pass an individual user and item, or arrays representing
tuples (UserId, ItemId) with the combinations of users and items for which
to predict (one row per prediction).
Parameters
----------
user : array-like (npred,) or obj
User(s) for which to predict each item.
item : array-like (npred,) or obj
Item(s) for which to predict for each user.
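Examples
--------
A sketch of predicting several user-item combinations at once, assuming 'model' is an
already-fitted HPF object and the IDs below exist in the training data:

>>> import numpy as np
>>> users = np.array([10, 10, 20])
>>> items = np.array(['a', 'c', 'b'])
>>> pred = model.predict(user=users, item=items)  # array of 3 predicted counts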
"""
assert self.is_fitted
if not np.isscalar(user):
user = np.require(user, requirements=["ENSUREARRAY"]).reshape(-1)
if not np.isscalar(item):
item = np.require(item, requirements=["ENSUREARRAY"]).reshape(-1)
if isinstance(user, np.ndarray):
assert user.shape[0] > 0
if self.reindex:
if user.shape[0] > 1:
user = pd.Categorical(user, self.user_mapping_).codes
user = np.require(user, requirements=["ENSUREARRAY"])
else:
if self.user_dict_ is not None:
try:
user = self.user_dict_[user]
except Exception:
user = -1
else:
user = pd.Categorical(user, self.user_mapping_).codes[0]
else:
if self.reindex:
if self.user_dict_ is not None:
try:
user = self.user_dict_[user]
except Exception:
user = -1
else:
user = pd.Categorical(np.array([user]), self.user_mapping_).codes[0]
user = np.array([user])
if isinstance(item, np.ndarray):
assert item.shape[0] > 0
if self.reindex:
if item.shape[0] > 1:
item = pd.Categorical(item, self.item_mapping_).codes
item = np.require(item, requirements=["ENSUREARRAY"])
else:
if self.item_dict_ is not None:
try:
item = self.item_dict_[item]
except Exception:
item = -1
else:
item = pd.Categorical(item, self.item_mapping_).codes[0]
else:
if self.reindex:
if self.item_dict_ is not None:
try:
item = self.item_dict_[item]
except Exception:
item = -1
else:
item = pd.Categorical(np.array([item]), self.item_mapping_).codes[0]
item = np.array([item])
assert user.shape[0] == item.shape[0]
if user.shape[0] == 1:
if (user[0] == -1) or (item[0] == -1):
return np.nan
else:
return self.Theta[user].dot(self.Beta[item].T).reshape(-1)[0]
else:
cython_loops = cython_loops_float if self.use_float else cython_loops_double
nan_entries = (user == -1) | (item == -1)
if nan_entries.sum() == 0:
user = np.require(user, dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"])
item = np.require(item, dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"])
return cython_loops.predict_arr(self.Theta, self.Beta, user, item, self.ncores)
else:
non_na_user = user[~nan_entries]
non_na_item = item[~nan_entries]
non_na_user = np.require(non_na_user, dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"])
non_na_item = np.require(non_na_item, dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"])
out = np.empty(user.shape[0], dtype=self.Theta.dtype)
out[~nan_entries] = cython_loops.predict_arr(self.Theta, self.Beta, non_na_user, non_na_item, self.ncores)
out[nan_entries] = np.nan
return out
def topN(self, user, n=10, exclude_seen=True, items_pool=None):
"""
Recommend Top-N items for a user
Outputs the Top-N items according to score predicted by the model.
Can exclude the items for the user that were associated to her in the
training set, and can also recommend from only a subset of user-provided items.
Parameters
----------
user : obj
User for which to recommend.
n : int
Number of top items to recommend.
exclude_seen : bool
Whether to exclude items that were associated to the user in the training set.
items_pool : None or array
Items to consider for recommending to the user.
Returns
-------
rec : array (n,)
Top-N recommended items.
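Examples
--------
A sketch assuming 'model' was fitted with 'keep_data=True' (the default), so that items
already seen by the user can be excluded, and that user 10 and the items below were in
the training data:

>>> import numpy as np
>>> rec = model.topN(user=10, n=5)
>>> rec_any = model.topN(user=10, n=5, exclude_seen=False)
>>> rec_pool = model.topN(user=10, n=5, items_pool=np.array(['a', 'b', 'c']))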
"""
if isinstance(n, float):
n = int(n)
assert isinstance(n, int)
if self.reindex:
if self.produce_dicts:
try:
user = self.user_dict_[user]
except Exception:
raise ValueError("Can only predict for users who were in the training set.")
else:
user = pd.Categorical(np.array([user]), self.user_mapping_).codes[0]
if user == -1:
raise ValueError("Can only predict for users who were in the training set.")
if exclude_seen and not self.keep_data:
raise Exception("Can only exclude seen items when passing 'keep_data=True' to .fit")
if items_pool is None:
allpreds = - (self.Theta[user].dot(self.Beta.T))
if exclude_seen:
n_ext = np.min([n + self._n_seen_by_user[user], self.Beta.shape[0]])
rec = np.argpartition(allpreds, n_ext-1)[:n_ext]
seen = self.seen[self._st_ix_user[user] : self._st_ix_user[user] + self._n_seen_by_user[user]]
rec = np.setdiff1d(rec, seen)
rec = rec[np.argsort(allpreds[rec])[:n]]
if self.reindex:
return self.item_mapping_[rec]
else:
return rec
else:
n = np.min([n, self.Beta.shape[0]])
rec = np.argpartition(allpreds, n-1)[:n]
rec = rec[np.argsort(allpreds[rec])]
if self.reindex:
return self.item_mapping_[rec]
else:
return rec
else:
items_pool = np.require(items_pool, requirements=["ENSUREARRAY"]).reshape(-1)
if self.reindex:
items_pool_reind = pd.Categorical(items_pool, self.item_mapping_).codes
items_pool_reind = np.require(items_pool_reind, requirements=["ENSUREARRAY"])
nan_ix = (items_pool_reind == -1)
if nan_ix.sum() > 0:
items_pool_reind = items_pool_reind[~nan_ix]
msg = "There were " + ("%d" % int(nan_ix.sum())) + " entries from 'item_pool'"
msg += "that were not in the training data and will be exluded."
warnings.warn(msg)
del nan_ix
if items_pool_reind.shape[0] == 0:
raise ValueError("No items to recommend.")
elif items_pool_reind.shape[0] == 1:
raise ValueError("Only 1 item to recommend.")
else:
pass
if self.reindex:
allpreds = - self.Theta[user].dot(self.Beta[items_pool_reind].T)
else:
allpreds = - self.Theta[user].dot(self.Beta[items_pool].T)
n = np.min([n, items_pool.shape[0]])
if exclude_seen:
n_ext = np.min([n + self._n_seen_by_user[user], items_pool.shape[0]])
rec = np.argpartition(allpreds, n_ext-1)[:n_ext]
seen = self.seen[self._st_ix_user[user] : self._st_ix_user[user] + self._n_seen_by_user[user]]
if self.reindex:
rec = np.setdiff1d(items_pool_reind[rec], seen)
allpreds = - self.Theta[user].dot(self.Beta[rec].T)
return self.item_mapping_[rec[np.argsort(allpreds)[:n]]]
else:
rec = np.setdiff1d(items_pool[rec], seen)
allpreds = - self.Theta[user].dot(self.Beta[rec].T)
return rec[np.argsort(allpreds)[:n]]
else:
rec = np.argpartition(allpreds, n-1)[:n]
return items_pool[rec[np.argsort(allpreds[rec])]]
def eval_llk(self, input_df, full_llk=False):
"""
Evaluate Poisson log-likelihood (plus constant) for a given dataset
Note
----
This Poisson log-likelihood is calculated only for the combinations of users and items
provided here, so it's not a complete likelihood, and it might sometimes turn out to
be a positive number because of this.
Will filter out the input data by taking only combinations of users
and items that were present in the training set.
Parameters
----------
input_df : pandas data frame (nobs, 3)
Input data on which to calculate log-likelihood, consisting of IDs and counts.
Must contain one row per non-zero observation, with columns 'UserId', 'ItemId', 'Count'.
If a numpy array is provided, will assume the first 3 columns
contain that info.
full_llk : bool
Whether to calculate terms of the likelihood that depend on the data but not on the
parameters. Omitting them is faster, but it's more likely to result in positive values.
Returns
-------
llk : dict
Dictionary containing the calculated log-likelihood and the number of
observations that were used to calculate it.
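Examples
--------
A sketch of evaluating the likelihood on held-out data, assuming 'model' is an
already-fitted HPF object and 'test_df' follows the same format as the training data:

>>> import pandas as pd
>>> test_df = pd.DataFrame({'UserId' : [10, 20], 'ItemId' : ['a', 'b'], 'Count' : [1, 2]})
>>> llk_info = model.eval_llk(test_df, full_llk=False)  # dict with 'llk' and 'nobs'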
"""
assert self.is_fitted
self._process_valset(input_df, valset=False)
cython_loops = cython_loops_float if self.use_float else cython_loops_double
self.ncores = cython_loops.cast_int(self.ncores)
out = {
'llk': cython_loops.calc_llk(
np.require(self.val_set["Count"].to_numpy(copy=False), dtype=cython_loops.c_real_t, requirements=["ENSUREARRAY", "C_CONTIGUOUS"]),
np.require(self.val_set["UserId"].to_numpy(copy=False), dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"]),
np.require(self.val_set["ItemId"].to_numpy(copy=False), dtype=cython_loops.obj_ind_type, requirements=["ENSUREARRAY", "C_CONTIGUOUS"]),
self.Theta,
self.Beta,
self.k,
self.ncores,
cython_loops.cast_int(bool(full_llk))
),
'nobs':self.val_set.shape[0]
}
del self.val_set
return out
def _print_st_msg(self):
print("**********************************")
print("Hierarchical Poisson Factorization")
print("**********************************")
print("")
def _print_data_info(self):
print("Number of users: %d" % self.nusers)
print("Number of items: %d" % self.nitems)
print("Latent factors to use: %d" % self.k)
print("")