Housing Prices
This data is taken from Kaggle and concerns predicting housing prices. The dataset is tagged as a classification problem, but the price field is clearly continuous, so the data is more naturally suited to regression. We have discretized the features as follows.
price: [low, medium, high]
area: [very_low, low, medium, high, very_high]
bedrooms: [low, medium, high]
bathrooms: [low, high]
stories: [low, medium, high]
parking: [low, medium, high]
The price and area features were discretized using univariate k-means clustering; the other features were binned by manual inspection (work omitted for brevity).
[1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans


def discretize(column, df, n_clusters=3):
    X = df[[column]]
    kmeans = KMeans(n_clusters=n_clusters, random_state=37)
    kmeans.fit(X)
    # KMeans assigns cluster labels arbitrarily; rank the labels by cluster
    # center so that 0 maps to the lowest-valued cluster, 1 to the next, etc.
    c2v = {
        c: v
        for v, (c, _) in enumerate(
            sorted(enumerate(np.ravel(kmeans.cluster_centers_)), key=lambda tup: tup[1])
        )
    }
    y = pd.Series(kmeans.predict(X)).map(c2v)
    return y


df = pd.read_csv("./data/Housing.csv").assign(
    price=lambda d: discretize("price", d).map({0: "low", 1: "medium", 2: "high"}),
    area=lambda d: discretize("area", d, 5).map(
        {0: "very_low", 1: "low", 2: "medium", 3: "high", 4: "very_high"}
    ),
    bedrooms=lambda d: pd.cut(
        d["bedrooms"], [0, 2, 3, 10], include_lowest=True, labels=["low", "medium", "high"]
    ),
    bathrooms=lambda d: pd.cut(
        d["bathrooms"], [0, 1, 5], include_lowest=True, labels=["low", "high"]
    ),
    stories=lambda d: pd.cut(
        d["stories"], [0, 1, 2, 5], include_lowest=True, labels=["low", "medium", "high"]
    ),
    parking=lambda d: d["parking"].map({0: "low", 1: "medium", 2: "high", 3: "high"}),
)
df.shape
[1]:
(545, 13)
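The c2v dictionary inside discretize deserves a note: KMeans assigns cluster labels arbitrarily, and sorting the cluster centers converts those labels into ordinal values. Here is a minimal standalone sketch of that trick, with toy values invented purely for illustration:

import numpy as np
from sklearn.cluster import KMeans

# toy 1D data: three well-separated groups
X = np.array([[1.0], [1.1], [5.0], [5.2], [9.0], [9.3]])
kmeans = KMeans(n_clusters=3, random_state=37, n_init=10).fit(X)

# rank cluster labels by center value: 0 = lowest, 2 = highest
centers = np.ravel(kmeans.cluster_centers_)
c2v = {c: v for v, (c, _) in enumerate(sorted(enumerate(centers), key=lambda t: t[1]))}
print([c2v[c] for c in kmeans.predict(X)])  # [0, 0, 1, 1, 2, 2]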
[2]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   price             545 non-null    object
 1   area              545 non-null    object
 2   bedrooms          545 non-null    category
 3   bathrooms         545 non-null    category
 4   stories           545 non-null    category
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    object
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: category(3), object(10)
memory usage: 44.7+ KB
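Before handing the data to Spark, a quick sanity check (not part of the original run) is to eyeball the discretized distributions; output omitted here:

for c in ["price", "area", "bedrooms", "bathrooms", "stories", "parking"]:
    print(df[c].value_counts())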
Spark
Now let’s load the data into a Spark DataFrame.
[3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("housing").master("local[*]").getOrCreate()
sdf = spark.createDataFrame(df).cache()
sdf.count()
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/10 17:42:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[3]:
545
[4]:
sdf.show(5)
+-----+----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+
|price|area|bedrooms|bathrooms|stories|mainroad|guestroom|basement|hotwaterheating|airconditioning|parking|prefarea|furnishingstatus|
+-----+----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+
| high|high| high| high| high| yes| no| no| no| yes| high| yes| furnished|
| high|high| high| high| high| yes| no| no| no| yes| high| no| furnished|
| high|high| medium| high| medium| yes| no| yes| no| no| high| yes| semi-furnished|
| high|high| high| high| medium| yes| no| yes| no| yes| high| yes| furnished|
| high|high| high| low| medium| yes| yes| yes| no| yes| high| no| furnished|
+-----+----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+
only showing top 5 rows
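It can also be worth confirming how the pandas dtypes came across; a quick check (not in the original run) is to print the Spark schema, where the discretized columns should all show up as strings:

sdf.printSchema()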
Learning
We will learn two models, one with Three-Phase Dependency Analysis (TPDA) and the other with the Maximum Weight Spanning Tree (MWST) algorithm.
[ ]:
from pyspark_bbn.discrete.data import DiscreteData
from pyspark_bbn.discrete.plearn import ParamLearner
from pyspark_bbn.discrete.scblearn import Tpda, Mwst
from pyspark_bbn.discrete.bbn import get_bbn
from pybbn.pptc.inferencecontroller import InferenceController

data = DiscreteData(sdf)

# TPDA: learn the structure, then the parameters, then build a join tree for inference
g_tpda = Tpda(data, cmi_threshold=0.05).get_network()
p_tpda = ParamLearner(data, g_tpda).get_params()
t_tpda = InferenceController.apply(get_bbn(g_tpda, p_tpda, data.get_profile()))

# MWST: the same pipeline, but with a tree-structured network
g_mwst = Mwst(data).get_network()
p_mwst = ParamLearner(data, g_mwst).get_params()
t_mwst = InferenceController.apply(get_bbn(g_mwst, p_mwst, data.get_profile()))
23/09/10 17:42:10 WARN CacheManager: Asked to cache already cached data.
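Before plotting, it can be useful to list the learned edges directly (a quick check, not in the original run; both structures are networkx digraphs, as the plotting code below assumes):

print(sorted(g_tpda.edges()))
print(sorted(g_mwst.edges()))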
Here’s a plot of the structures learned.
[9]:
import networkx as nx
import matplotlib.pyplot as plt

pos_tpda = nx.nx_pydot.graphviz_layout(g_tpda, prog="dot")
pos_mwst = nx.nx_pydot.graphviz_layout(g_mwst, prog="dot")

fig, ax = plt.subplots(2, 1, figsize=(7, 7))
ax = np.ravel(ax)

# draw only non-isolated nodes; variables with no learned edges are skipped
nx.draw(
    g_tpda,
    pos_tpda,
    ax=ax[0],
    with_labels=True,
    node_size=10,
    node_color="#2eb82e",
    edge_color="#4da6ff",
    arrowsize=10,
    min_target_margin=5,
    nodelist=[n for n in g_tpda.nodes() if len(list(nx.to_undirected(g_tpda).neighbors(n))) > 0],
)
nx.draw(
    g_mwst,
    pos_mwst,
    ax=ax[1],
    with_labels=True,
    node_size=10,
    node_color="#2eb82e",
    edge_color="#4da6ff",
    arrowsize=10,
    min_target_margin=10,
    nodelist=[n for n in g_mwst.nodes() if len(list(nx.to_undirected(g_mwst).neighbors(n))) > 0],
)

ax[0].set_title("TPDA")
ax[1].set_title("MWST")
fig.tight_layout()
[Figure: the learned structures, TPDA (top) and MWST (bottom).]
Lift
Now let’s see how observing each variable at its highest value lifts the probability of the housing price being high. Lift here is the ratio of the posterior to the marginal, e.g. P(price=high | area=very_high) / P(price=high); values above 1 mean the observation raises the probability of a high price.
[7]:
from pybbn.graph.jointree import EvidenceBuilder


def get_sensitivity(name, value, tree):
    # clear any previous evidence, then observe the variable at the given value
    tree.unobserve_all()
    ev = (
        EvidenceBuilder()
        .with_node(tree.get_bbn_node_by_name(name))
        .with_evidence(value, 1.0)
        .build()
    )
    tree.set_observation(ev)
    meta = {"name": name, "value": value}
    post = tree.get_posteriors()["price"]
    return {**meta, **post}


n2v = {
    "bedrooms": "high",
    "hotwaterheating": "yes",
    "area": "very_high",
    "stories": "high",
    "bathrooms": "high",
    "airconditioning": "yes",
    "guestroom": "yes",
    "basement": "yes",
    "mainroad": "yes",
    "parking": "high",
    "prefarea": "yes",
    "furnishingstatus": "furnished",
}

# baseline (marginal) posteriors of price with no evidence set
t_tpda.unobserve_all()
h = t_tpda.get_posteriors()["price"]["high"]
m = t_tpda.get_posteriors()["price"]["medium"]
l = t_tpda.get_posteriors()["price"]["low"]

lift_df = pd.DataFrame([get_sensitivity(name, value, t_tpda) for name, value in n2v.items()])[
    ["name", "value", "low", "medium", "high"]
].assign(
    low_lift=lambda d: d["low"] / l,
    medium_lift=lambda d: d["medium"] / m,
    high_lift=lambda d: d["high"] / h,
)

lift_df.sort_values(["high_lift", "medium_lift", "low_lift"], ascending=False).rename(
    columns={"name": "variable", "high_lift": "price_lift"}
)[["variable", "value", "price_lift"]]
[7]:
|    | variable         | value     | price_lift |
|----|------------------|-----------|------------|
| 2  | area             | very_high | 3.945727   |
| 4  | bathrooms        | high      | 2.184265   |
| 5  | airconditioning  | yes       | 2.095619   |
| 3  | stories          | high      | 1.992604   |
| 0  | bedrooms         | high      | 1.389292   |
| 6  | guestroom        | yes       | 1.219085   |
| 7  | basement         | yes       | 1.043853   |
| 8  | mainroad         | yes       | 1.000000   |
| 9  | parking          | high      | 1.000000   |
| 10 | prefarea         | yes       | 1.000000   |
| 11 | furnishingstatus | furnished | 1.000000   |
| 1  | hotwaterheating  | yes       | 0.567470   |
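The same sensitivity analysis can be repeated against the MWST model by swapping in t_mwst. A sketch (results not shown; they will differ from the TPDA model):

# baseline posterior of price=high under the MWST model, no evidence
t_mwst.unobserve_all()
base_high = t_mwst.get_posteriors()["price"]["high"]

mwst_lift_df = pd.DataFrame(
    [get_sensitivity(name, value, t_mwst) for name, value in n2v.items()]
).assign(price_lift=lambda d: d["high"] / base_high)
mwst_lift_df[["name", "value", "price_lift"]].sort_values("price_lift", ascending=False)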