EMSingleLatentVariable is producing random error at random times #189

Open
ianchlee opened this issue Dec 6, 2022 · 1 comment

ianchlee commented Dec 6, 2022

Description

I was trying to fit a single latent variable in my model. When I run the EM algorithm using fit_latent_cpds, it sometimes throws an error and sometimes produces a result, with no change to the code or data in between.

Steps to Reproduce

I have created the following test data to try the model:

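import numpy as np
import pandas as pd

data = pd.DataFrame({'node1': np.repeat(1, 50), 'node2': np.repeat(1, 50)})
# Set the zeros with .loc rather than chained indexing (data['node1'][i] = 0).
data.loc[[0, 3, 5, 13, 17, 29, 30, 31, 32], 'node1'] = 0
data.loc[[4, 5, 11, 15, 17, 25, 27, 34, 41, 47], 'node2'] = 0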

The structure is very simple: a latent variable, latent1, that affects node1 and node2.

from causalnex.network import BayesianNetwork
from causalnex.structure import StructureModel

sm = StructureModel()
sm.add_edges_from([('latent1', 'node1'), ('latent1', 'node2')])
bn = BayesianNetwork(sm)
bn.node_states = {'latent1': {0, 1}, 'node1': {0, 1}, 'node2': {0, 1}}
bn.fit_latent_cpds(lv_name="latent1", lv_states=[0, 1], data=data[["node1", "node2"]], n_runs=30)

Sometimes I get a sensible result like the following:

{'latent1':                  
latent1          
0        0.283705
1        0.716295,

'node1': latent1        0         1
node1                     
0        0.21017  0.168051
1        0.78983  0.831949,

'node2': latent1         0         1
node2                      
0        0.253754  0.178709
1        0.746246  0.821291}

However, sometimes I get one of two different errors instead. The first is:

Traceback (most recent call last):
  File "test_2.py", line 28, in <module>
    bn.fit_latent_cpds(lv_name="latent1", lv_states=[0, 1], data=data[["node1", "node2"]], n_runs=30)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/network/network.py", line 553, in fit_latent_cpds
    estimator = EMSingleLatentVariable(
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/estimator/em.py", line 144, in __init__
    self._mb_data, self._mb_partitions = self._get_markov_blanket_data(data)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/estimator/em.py", line 585, in _get_markov_blanket_data
    mb_product = cpd_multiplication([self.cpds[node] for node in self.valid_nodes])
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/utils/pgmpy_utils.py", line 122, in cpd_multiplication
    product_pgmpy = factor_product(*cpds_pgmpy)  # type: TabularCPD
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/base.py", line 76, in factor_product
    return reduce(lambda phi1, phi2: phi1 * phi2, args)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/base.py", line 76, in <lambda>
    return reduce(lambda phi1, phi2: phi1 * phi2, args)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/discrete/DiscreteFactor.py", line 930, in __mul__
    return self.product(other, inplace=False)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/discrete/DiscreteFactor.py", line 697, in product
    phi = self if inplace else self.copy()
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/discrete/CPD.py", line 299, in copy
    return TabularCPD(
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/discrete/CPD.py", line 142, in __init__
    super(TabularCPD, self).__init__(
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/pgmpy/factors/discrete/DiscreteFactor.py", line 99, in __init__
    raise ValueError("Variable names cannot be same")
ValueError: Variable names cannot be same

And sometimes I receive this error:

Traceback (most recent call last):
  File "test_2.py", line 28, in <module>
    bn.fit_latent_cpds(lv_name="latent1", lv_states=[0, 1], data=data[["node1", "node2"]], n_runs=30)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/network/network.py", line 563, in fit_latent_cpds
    estimator.run(n_runs=n_runs, stopping_delta=stopping_delta)
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/estimator/em.py", line 181, in run
    self.e_step()  # Expectation step
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/estimator/em.py", line 233, in e_step
    results = self._update_sufficient_stats(node_mb_data["_lookup_"])
  File "/Users/user/opt/anaconda3/envs/py38/lib/python3.8/site-packages/causalnex/estimator/em.py", line 448, in _update_sufficient_stats
    prob_lv_given_mb = self._mb_product[mb_cols]
KeyError: (nan, 0.0)

My original code also included boundaries and priors; however, I realised that these two errors just pop up randomly at different times.
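
One possible source of this run-to-run variation is Python's per-process hash randomisation, which changes the iteration order of a set of strings between interpreter launches. That this is what triggers the errors here is an assumption, suggested by the set-based line quoted in the comment below. The following minimal sketch uses only the standard library; the helper name set_order_for_seed and the three node names in SNIPPET are illustrative. It launches a fresh interpreter under several fixed PYTHONHASHSEED values and prints the order in which the same set is listed:

```python
import os
import subprocess
import sys

# One-liner executed in a fresh interpreter: list a set of the node names.
SNIPPET = "print(list(set(['latent1', 'node1', 'node2'])))"


def set_order_for_seed(seed: int) -> str:
    """Return the printed set order when PYTHONHASHSEED is fixed to `seed`."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Same set contents for every seed; only the listed order may differ.
orders = {seed: set_order_for_seed(seed) for seed in range(1, 9)}
for seed, order in orders.items():
    print(seed, order)
```

If the failure does depend on this ordering, running test_2.py with a fixed PYTHONHASHSEED should make each run behave the same way (always failing or always succeeding for that seed), which would at least make the problem reproducible.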

Please let me know if I have done something wrong in setting up the network.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • CausalNex version used (pip show causalnex): 0.11.0
  • Python version used (python -V): 3.8.15 (via conda)
  • Operating system and version: macOS on Apple M1

ngkaching commented Aug 8, 2023

For reference: pgmpy/pgmpy#1582

In line 702 of DiscreteFactor.py in the pgmpy library, the merged variable list is built from a set, which does not preserve the order of the variables.

Change from
new_variables = list(set(phi.variables).union(phi1.variables))
to
new_variables = phi.variables + [var for var in phi1.variables if var not in phi.variables]
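
To illustrate the difference between the two lines, here is a small standalone sketch. It does not import pgmpy; the function names union_via_set and union_preserving_order and the example variable lists are hypothetical stand-ins for phi.variables and phi1.variables.

```python
def union_via_set(variables, other_variables):
    """Original approach: the order of the result is not guaranteed."""
    return list(set(variables).union(other_variables))


def union_preserving_order(variables, other_variables):
    """Suggested approach: keep `variables` as-is, then append unseen names."""
    return variables + [var for var in other_variables if var not in variables]


# Hypothetical variable lists for two factors that share 'latent1'.
phi_variables = ["node1", "latent1"]
phi1_variables = ["node2", "latent1"]

unordered = union_via_set(phi_variables, phi1_variables)
ordered = union_preserving_order(phi_variables, phi1_variables)

print(unordered)  # same three names, order may vary between runs
print(ordered)    # ['node1', 'latent1', 'node2']
```

Both versions return the same names without duplicates. Only the second guarantees that the first factor's variables stay in their original positions, which is presumably what the rest of the product computation relies on.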
