Housing Prices
This data is taken from Kaggle and concerns predicting housing prices. The dataset is tagged as a classification problem, but the price field is clearly continuous, so the data is more naturally suited to regression. We have discretized the features as follows.
price: [low, medium, high]
area: [very_low, low, medium, high, very_high]
bedrooms: [low, medium, high]
bathrooms: [low, high]
stories: [low, medium, high]
parking: [low, medium, high]
The price and area features were discretized using univariate k-means clustering; the other features were binned by manual inspection (work omitted for brevity).
[1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans


def discretize(column, df, n_clusters=3):
    X = df[[column]]
    kmeans = KMeans(n_clusters=n_clusters, random_state=37)
    kmeans.fit(X)
    # KMeans assigns cluster labels arbitrarily; rank the labels by cluster
    # center so that 0 maps to the lowest-valued cluster, 1 to the next, etc.
    c2v = {
        c: v
        for v, (c, _) in enumerate(
            sorted(enumerate(np.ravel(kmeans.cluster_centers_)), key=lambda tup: tup[1])
        )
    }
    y = pd.Series(kmeans.predict(X)).map(c2v)
    return y


df = pd.read_csv("./data/Housing.csv").assign(
    price=lambda d: discretize("price", d).map({0: "low", 1: "medium", 2: "high"}),
    area=lambda d: discretize("area", d, 5).map(
        {0: "very_low", 1: "low", 2: "medium", 3: "high", 4: "very_high"}
    ),
    bedrooms=lambda d: pd.cut(
        d["bedrooms"], [0, 2, 3, 10], include_lowest=True, labels=["low", "medium", "high"]
    ),
    bathrooms=lambda d: pd.cut(
        d["bathrooms"], [0, 1, 5], include_lowest=True, labels=["low", "high"]
    ),
    stories=lambda d: pd.cut(
        d["stories"], [0, 1, 2, 5], include_lowest=True, labels=["low", "medium", "high"]
    ),
    parking=lambda d: d["parking"].map({0: "low", 1: "medium", 2: "high", 3: "high"}),
)
df.shape
[1]:
(545, 13)
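The c2v dictionary inside discretize deserves a note: KMeans assigns cluster labels arbitrarily, and sorting the cluster centers converts those labels into ordinal values. Here is a minimal standalone sketch of that trick, with toy values invented purely for illustration:

import numpy as np
from sklearn.cluster import KMeans

# toy 1D data: three well-separated groups
X = np.array([[1.0], [1.1], [5.0], [5.2], [9.0], [9.3]])
kmeans = KMeans(n_clusters=3, random_state=37, n_init=10).fit(X)

# rank cluster labels by center value: 0 = lowest, 2 = highest
centers = np.ravel(kmeans.cluster_centers_)
c2v = {c: v for v, (c, _) in enumerate(sorted(enumerate(centers), key=lambda t: t[1]))}
print([c2v[c] for c in kmeans.predict(X)])  # [0, 0, 1, 1, 2, 2]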
[2]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   price             545 non-null    object
 1   area              545 non-null    object
 2   bedrooms          545 non-null    category
 3   bathrooms         545 non-null    category
 4   stories           545 non-null    category
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    object
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: category(3), object(10)
memory usage: 44.7+ KB
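Before handing the data to Spark, a quick sanity check (not part of the original run) is to eyeball the discretized distributions; output omitted here:

for c in ["price", "area", "bedrooms", "bathrooms", "stories", "parking"]:
    print(df[c].value_counts())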
Spark
Now let’s load the data into a Spark DataFrame.
[3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("housing").master("local[*]").getOrCreate()
sdf = spark.createDataFrame(df).cache()
sdf.count()
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/10 17:42:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[3]:
545
[4]:
sdf.show(5)
+-----+----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+
|price|area|bedrooms|bathrooms|stories|mainroad|guestroom|basement|hotwaterheating|airconditioning|parking|prefarea|furnishingstatus|
+-----+----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+
| high|high| high| high| high| yes| no| no| no| yes| high| yes| furnished|
| high|high| high| high| high| yes| no| no| no| yes| high| no| furnished|
| high|high| medium| high| medium| yes| no| yes| no| no| high| yes| semi-furnished|
| high|high| high| high| medium| yes| no| yes| no| yes| high| yes| furnished|
| high|high| high| low| medium| yes| yes| yes| no| yes| high| no| furnished|
+-----+----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+
only showing top 5 rows
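It can also be worth confirming how the pandas dtypes came across; a quick check (not in the original run) is to print the Spark schema, where the discretized columns should all show up as strings:

sdf.printSchema()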
Learning
We will learn two models, one with Three-Phase Dependency Analysis (TPDA) and the other with the Maximum Weight Spanning Tree (MWST) algorithm.
[ ]:
from pyspark_bbn.discrete.data import DiscreteData
from pyspark_bbn.discrete.plearn import ParamLearner
from pyspark_bbn.discrete.scblearn import Tpda, Mwst
from pyspark_bbn.discrete.bbn import get_bbn
from pybbn.pptc.inferencecontroller import InferenceController

data = DiscreteData(sdf)

# TPDA: learn the structure, then the parameters, then build a join tree for inference
g_tpda = Tpda(data, cmi_threshold=0.05).get_network()
p_tpda = ParamLearner(data, g_tpda).get_params()
t_tpda = InferenceController.apply(get_bbn(g_tpda, p_tpda, data.get_profile()))

# MWST: the same pipeline, but with a tree-structured network
g_mwst = Mwst(data).get_network()
p_mwst = ParamLearner(data, g_mwst).get_params()
t_mwst = InferenceController.apply(get_bbn(g_mwst, p_mwst, data.get_profile()))
23/09/10 17:42:10 WARN CacheManager: Asked to cache already cached data.
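Before plotting, it can be useful to list the learned edges directly (a quick check, not in the original run; both structures are networkx digraphs, as the plotting code below assumes):

print(sorted(g_tpda.edges()))
print(sorted(g_mwst.edges()))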
Here’s a plot of the structures learned.
[9]:
import networkx as nx
import matplotlib.pyplot as plt

pos_tpda = nx.nx_pydot.graphviz_layout(g_tpda, prog="dot")
pos_mwst = nx.nx_pydot.graphviz_layout(g_mwst, prog="dot")

fig, ax = plt.subplots(2, 1, figsize=(7, 7))
ax = np.ravel(ax)

# draw only non-isolated nodes; variables with no learned edges are skipped
nx.draw(
    g_tpda,
    pos_tpda,
    ax=ax[0],
    with_labels=True,
    node_size=10,
    node_color="#2eb82e",
    edge_color="#4da6ff",
    arrowsize=10,
    min_target_margin=5,
    nodelist=[n for n in g_tpda.nodes() if len(list(nx.to_undirected(g_tpda).neighbors(n))) > 0],
)
nx.draw(
    g_mwst,
    pos_mwst,
    ax=ax[1],
    with_labels=True,
    node_size=10,
    node_color="#2eb82e",
    edge_color="#4da6ff",
    arrowsize=10,
    min_target_margin=10,
    nodelist=[n for n in g_mwst.nodes() if len(list(nx.to_undirected(g_mwst).neighbors(n))) > 0],
)

ax[0].set_title("TPDA")
ax[1].set_title("MWST")
fig.tight_layout()
[Figure: the learned structures, TPDA (top) and MWST (bottom).]
Lift
Now let’s see how observing each variable at its highest value lifts the probability of the housing price being high. Lift here is the ratio of the posterior to the marginal, e.g. P(price=high | area=very_high) / P(price=high); values above 1 mean the observation raises the probability of a high price.
[7]:
from pybbn.graph.jointree import EvidenceBuilder


def get_sensitivity(name, value, tree):
    # clear any previous evidence, then observe the variable at the given value
    tree.unobserve_all()
    ev = (
        EvidenceBuilder()
        .with_node(tree.get_bbn_node_by_name(name))
        .with_evidence(value, 1.0)
        .build()
    )
    tree.set_observation(ev)
    meta = {"name": name, "value": value}
    post = tree.get_posteriors()["price"]
    return {**meta, **post}


n2v = {
    "bedrooms": "high",
    "hotwaterheating": "yes",
    "area": "very_high",
    "stories": "high",
    "bathrooms": "high",
    "airconditioning": "yes",
    "guestroom": "yes",
    "basement": "yes",
    "mainroad": "yes",
    "parking": "high",
    "prefarea": "yes",
    "furnishingstatus": "furnished",
}

# baseline (marginal) posteriors of price with no evidence set
t_tpda.unobserve_all()
h = t_tpda.get_posteriors()["price"]["high"]
m = t_tpda.get_posteriors()["price"]["medium"]
l = t_tpda.get_posteriors()["price"]["low"]

lift_df = pd.DataFrame([get_sensitivity(name, value, t_tpda) for name, value in n2v.items()])[
    ["name", "value", "low", "medium", "high"]
].assign(
    low_lift=lambda d: d["low"] / l,
    medium_lift=lambda d: d["medium"] / m,
    high_lift=lambda d: d["high"] / h,
)

lift_df.sort_values(["high_lift", "medium_lift", "low_lift"], ascending=False).rename(
    columns={"name": "variable", "high_lift": "price_lift"}
)[["variable", "value", "price_lift"]]
[7]:
|    | variable         | value     | price_lift |
|----|------------------|-----------|------------|
| 2  | area             | very_high | 3.945727   |
| 4  | bathrooms        | high      | 2.184265   |
| 5  | airconditioning  | yes       | 2.095619   |
| 3  | stories          | high      | 1.992604   |
| 0  | bedrooms         | high      | 1.389292   |
| 6  | guestroom        | yes       | 1.219085   |
| 7  | basement         | yes       | 1.043853   |
| 8  | mainroad         | yes       | 1.000000   |
| 9  | parking          | high      | 1.000000   |
| 10 | prefarea         | yes       | 1.000000   |
| 11 | furnishingstatus | furnished | 1.000000   |
| 1  | hotwaterheating  | yes       | 0.567470   |
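The same sensitivity analysis can be repeated against the MWST model by swapping in t_mwst. A sketch (results not shown; they will differ from the TPDA model):

# baseline posterior of price=high under the MWST model, no evidence
t_mwst.unobserve_all()
base_high = t_mwst.get_posteriors()["price"]["high"]

mwst_lift_df = pd.DataFrame(
    [get_sensitivity(name, value, t_mwst) for name, value in n2v.items()]
).assign(price_lift=lambda d: d["high"] / base_high)
mwst_lift_df[["name", "value", "price_lift"]].sort_values("price_lift", ascending=False)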