Mingchin Chen
Fu Jen Catholic University, Taiwan
E-mail: 081438@mail.fju.edu.tw
Pei-De Wang
Fu Jen Catholic University, Taiwan
E-mail: kmpeterwang@gmail.com
Submission: 08/06/2017
Revision: 29/06/2017
Accept: 22/07/2017
ABSTRACT
While many studies have applied data
mining techniques to estimate housing prices, few have identified the important
attributes and prioritized them at the same time. This paper aims to utilize five
data mining techniques to discover the important attributes for three major
types of real estate in Taipei city. The datasets, involving a total of 22,480
transactions, were publicly available from the Taiwan Actual Price Registration
from July 2013 to August 2015. The five models are decision trees, random
forests, model trees, artificial neural networks and multiple regression. The
criteria used to measure the forecasting accuracy are MAPE, R², RMSE, MAE and
COR. The model with the best performance for all houses is the Model Tree with
a MAPE value of 27.59. As for apartments, the best is Random Forests.
Artificial Neural Networks perform best for suites and buildings with
elevators. Different housing types thus need different models. Furthermore, the
attribute importance values help us to identify the critical attributes,
which include the floor area, administrative district, parking area and land
area, and to rank them. The variable ranking and selection procedure
proposed in this research can also be adopted to improve prediction
efficiency in big data applications other than housing transactions.
Keywords: data mining; housing pricing;
forecasting accuracy; variables ranking; variables selection
1. INTRODUCTION
Buying a house in Taipei is relatively unaffordable, so evaluating housing
prices has become an important issue. Even though the Taiwan authorities have
acted to make transactions more transparent, Taipei remains one of the most
expensive cities in the world in which to buy a house. Taipei’s house
price-to-income ratio stood between 15 and 17 in 2015, higher than that of
London (8.5x), New York (5.9x), or Sydney (12.2x) (DELMENDO, 2016).
Housing affordability remains a major problem in Taipei city, and lower
housing affordability goes hand in hand with relatively higher housing prices.
There must be some inherent factors giving rise to these high prices. Those
factors determine housing prices and also reflect the preferences of people
when they buy or sell a house in Taipei.
Actual Price Registration (APR)
refers to a national system for registering the actual prices of property
transactions, an initiative created to boost transparency in Taiwan’s real estate
market. This regulation came into effect
on August 1, 2011. This study intends to determine what those factors are from
the real transactions in that open system by utilizing five data mining techniques.
There are three major housing types to which this paper pays particular
attention. According to statistics from the Department of Urban Development,
Taipei City Government, for 2013 to 2015, condominiums in buildings of five
storeys or less without an elevator (apartments, Type_APT) accounted on average
for 21% of housing transactions, condominiums with elevators (buildings,
Type_BLD) for 58% and suites (Type_SUT) for 19%, as shown in Figure 1.
The curve corresponding to the right coordinate axis represents the volume of
transactions in each season. Even though the volumes have changed over time,
the percentages of those three types remain relatively stable. Therefore, those
three types are the targets of this study.
Figure 1:
Volumes of transactions from 2013 to 2015
The hedonic-based regression approach has been utilized extensively to investigate
the relationship between house prices and housing characteristics (FAN; ONG; KOH,
2006). For example, Goodman (1978) extended hedonic price analysis to the
formation of housing price indices measuring variations within a metropolitan
area (GOODMAN, 1978).
Fik et al. (2003) developed several hedonic specifications that attempt to more
fully capture the interactive components of location values (FIK; LING;
MULLIGAN, 2003). Welch et al. (2016) estimated a hedonic
spatial panel model to determine the long-term impact of improved network
access to bike and public transit facilities on housing sales prices (WELCH; GEHRKE;
WANG, 2016). However, this approach is subject to criticisms arising from
potential problems related to fundamental model assumptions and estimation (FAN;
ONG; KOH, 2006).
Nowadays, more and more studies focus on real estate by using data mining
techniques. Acciani et al. (2011) applied model trees and multivariate adaptive
regression splines to real estate appraisal (ACCIANI; FUCILLI; SARDARO, 2011).
Fong and Wah (2013) utilized feature selection techniques to screen important
attributes and applied those attributes to build a predictive model by using
different kinds of data mining techniques. Gan et al. (2015) built decision
trees and neural networks and compared their results.
While these authors all used different data mining techniques to estimate
housing prices, few of them attempted to find out which attributes were
important or to rank them by importance at the same time. Moreover, none of
them identified the attributes according to the types of houses.
This paper utilizes five models and evaluates them with five measures.
The five models are decision trees, random forests, model trees, artificial
neural networks and multiple regression. The criteria used to measure the
forecasting accuracy are MAPE, R², RMSE, MAE and COR. The final result is a
roadmap for evaluating housing prices more reasonably.
2. RESEARCH METHODOLOGY
The research flow is shown in Figure 2. All the data used in this paper are
downloaded from APR. Five data mining techniques are applied to the three major
housing types and compared by MAPE, R², RMSE, MAE and COR. The comparison shows
that different housing types need different data mining models.
Each type has its own preferred attributes with higher importance values. This
paper therefore ranks the attributes according to the averages of these
importance values and then counts the number of models that select the same
attributes. This ranking and selection process helps us to figure out the
relatively important attributes in each housing type.
Finally, according to the statistics on the rankings and votes of attributes,
this paper classifies the attributes and builds a roadmap to depict the
diversity of the attributes.
Figure 2: Research flow
3. DATA MINING SKILLS
This section introduces the data mining techniques
used in this paper.
3.1. Decision Tree (DT)
A DT algorithm works by splitting a dataset in order to build a model that
successfully classifies each record in terms of a target field or variable
(WOODS; KYRAL, 1997). There are two types of DT, the classification tree and
the regression tree, which can be implemented using the four most popular
algorithms: the chi-squared automatic interaction detection (CHAID) (KASS,
1980; MAGIDSON, 1994), the iterative dichotomiser (ID3) (QUINLAN, 1986), the
classification and regression trees (CART) (BREIMAN; FRIEDMAN; OLSHEN; STONE,
1984) and C4.5 (QUINLAN, 1992).
CHAID and ID3 can only be used for classification trees, while the other two
can be used for both classification and regression trees. When the response
variable consists of classes or categories, a classification tree is used;
when the response is numeric or continuous, a regression tree is used instead.
Two main processes used to construct a tree are tree growing and pruning. The
tree growing process searches for independent variables as splitters, starting
from the root node with all the instances, and keeps partitioning on those with
the greatest differences until no significant differences can be identified. In
this process, a purity or impurity criterion is used to split a node so that
the instances within each resulting node are more alike. In the case of a
classification tree, splitting the data is based on homogeneity. A regression
tree splits on the independent variable whose inclusion decreases the error
measure the most. The best criterion should produce the greatest purity or
reduce the impurity the most.
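The split search for a regression tree can be illustrated with a minimal Python sketch. It assumes a single numeric predictor and uses the sum of squared errors (SSE) around the node mean as the error measure; the function names and the toy floor-area and price values are hypothetical and are not taken from the APR data.

```python
from statistics import mean

def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, sse_reduction) for the best binary split x <= t."""
    parent = sse(ys)
    best_t, best_gain = None, 0.0
    candidates = sorted(set(xs))
    # Try the midpoint between each pair of neighbouring distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        gain = parent - (sse(left) + sse(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical toy data: floor area (m²) and price (million NTD).
floor_area = [20, 25, 30, 80, 90, 100]
price = [5.0, 6.0, 5.5, 20.0, 22.0, 21.0]
threshold, gain = best_split(floor_area, price)
print(threshold, round(gain, 2))
```

Growing a full tree would repeat this search recursively over all predictors in each child node until a stopping rule is met.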
3.2. Random Forests (RF)
The
pros and cons of DT are as follows (JAMES; WITTEN; HASTIE; TIBSHIRANI, 2013).
The advantages are that they are easy to explain, more closely mirror human
decision-making, may be displayed graphically and can easily handle qualitative
predictors. Unfortunately, DT generally do not have precise predictive power.
However, the performance of the predictive power can be substantially improved
by RF.
In fact, RF are an example of an ensemble method that combines a series of k
base models (or trees) with the aim of creating an improved composite model.
Each tree depends on the values of a random vector sampled independently and
with the same distribution for all trees in the forest (BREIMAN, 2001). After a
large number of trees are generated, they are combined to yield a single
consensus prediction by voting for classification trees or averaging for
regression trees. In addition, RF are characterized by significant improvements
in accuracy and greater robustness to errors and outliers.
There are two basic beliefs regarding RF: most trees can provide correct
predictions, and the trees make mistakes in different places. Breiman (2001)
stated that, by the Strong Law of Large Numbers, RF always converge so that
overfitting is not a problem, and that the limiting value of the
generalization error depends on how accurate the individual classifiers
are (strength) and on the dependence between them (correlation) (BREIMAN, 2001).
The idea is to maintain the strength without increasing the correlation.
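The averaging step can be sketched in Python as follows. This is a simplified illustration rather than the full RF algorithm: each base model is a one-split regression stump fitted on a bootstrap sample, and the random selection of candidate attributes at each split is omitted because the toy data have only one predictor. All names and values are hypothetical.

```python
import random
from statistics import mean

def fit_stump(sample):
    """Fit a one-split regression stump on (x, y) pairs; return a predictor."""
    xs = sorted({x for x, _ in sample})
    overall = mean(y for _, y in sample)
    if len(xs) < 2:
        # Degenerate bootstrap sample: fall back to the sample mean.
        return lambda x: overall
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in sample if x <= t]
        right = [y for x, y in sample if x > t]
        ml, mr = mean(left), mean(right)
        err = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def fit_forest(data, k=25, seed=0):
    """Fit k stumps, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(k)]

def predict(forest, x):
    """Average the predictions of all base models (regression case)."""
    return mean(tree(x) for tree in forest)

# Hypothetical toy data: (floor area in m², price in million NTD).
data = [(20, 5.0), (25, 6.0), (30, 5.5), (80, 20.0), (90, 22.0), (100, 21.0)]
forest = fit_forest(data)
small, large = predict(forest, 22), predict(forest, 95)
print(small, large)
```

Because each stump sees a different bootstrap sample, individual errors fall in different places and partly cancel out in the average.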
3.3. Model Tree (MT)
The MT is based on a divide-and-conquer approach through which it is possible
to learn from a set of instances (WITTEN; FRANK, 2005). The output of a MT is
represented by a tree-like structure in which it is possible to distinguish a
root node, parent and child nodes, arcs (or branches) and leaves (ACCIANI;
FUCILLI; SARDARO, 2011).
The greatest difference when compared with a decision tree is the content of
the leaf node. In the model tree, each terminal node delivers more information:
a linear regression model is fitted to the instances that the node contains,
instead of the single averaged value used in a regression tree. As a result, it
may provide a more precise estimation. This paper uses a rule-based model that
is an extension of Quinlan's M5 MT (KUHN;
WESTON; DEEFER; COUTLER, 2016).
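The difference between the two kinds of leaf can be shown with a small Python sketch. It assumes one numeric predictor and an ordinary least-squares line in the leaf; the instances are hypothetical, and the sketch does not reproduce the rule-based model used in this paper.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical instances that reached one leaf:
# floor area (m²) and price (million NTD).
leaf_area = [60, 70, 80, 90]
leaf_price = [12.0, 14.0, 16.0, 18.0]

# Regression tree leaf: one averaged value for every instance in the leaf.
regression_tree_prediction = mean(leaf_price)

# Model tree leaf: a linear model evaluated at the new instance.
slope, intercept = fit_line(leaf_area, leaf_price)
model_tree_prediction = slope * 85 + intercept

print(regression_tree_prediction, model_tree_prediction)
```

For a new house of 85 m², the regression tree leaf returns the leaf average, while the model tree leaf follows the trend within the leaf.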
3.4. Artificial Neural Network (ANN)
ANN is an artificial intelligence model originally designed to replicate the
human nervous system (BAHIA, 2013). Once the nervous system is alerted by
outside stimulations, neurons work and react. Accordingly, an ANN consists of
three main layers: the input data layer (stimulations), the hidden layer(s) and
the output layer. Each artificial neuron has a set of input connections that
receive signals from other neurons and a bias adjustment, as well as a transfer
function that transforms the sum of the weighted inputs and bias to decide the
value of the output (COAKLEY; BROWN, 2000).
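The computation of a single artificial neuron and of a small network can be sketched in Python. The logistic transfer function, the linear output unit and all weights below are assumptions chosen for illustration; they are not the configuration used in this paper.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through a logistic transfer function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden, output):
    """One hidden layer of logistic neurons followed by a linear output unit."""
    activations = [neuron(inputs, weights, bias) for weights, bias in hidden]
    out_weights, out_bias = output
    return sum(w * a for w, a in zip(out_weights, activations)) + out_bias

# Hypothetical weights for two inputs, two hidden neurons and one output.
hidden_layer = [([0.5, -0.2], 0.1), ([0.3, 0.8], -0.4)]
output_unit = ([1.2, -0.7], 0.05)
prediction = forward([1.0, 2.0], hidden_layer, output_unit)
print(prediction)
```

Training consists of adjusting the weights and biases so that the output approaches the observed prices; that step is not shown here.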
3.5. Multiple Regression (MR)
The hedonic-based regression approach belongs to MR. In MR there are several
independent variables and one dependent variable, and the model describes the
relationship between them. Given fixed values of the independent variables, the
model yields the conditional expectation of the dependent variable, that is, an
averaged value. Therefore, MR is widely used for prediction.
4. DATA SOURCE AND PREPARATION
The data used in this research are downloaded from APR. The raw data amount to
48,658 observations from July 2013 to August 2015. After deleting all records
with empty column(s) or unreasonable values, the total number of observations
is 22,480, encompassing the three most popular housing types, all of which are
for home use only.
To facilitate further inspections and comparisons, this paper also combines
these three types into an overall group (Type_ALL). Generally speaking,
Type_APT and Type_BLD are both suitable for a family and Type_SUT might be more
suitable for singles.
There are 20 attributes used in this paper, which are listed in Table 1. Since
this research partitions the houses into three types, the housing type
attribute (hs_tp) is not needed within a single type, and the total number of
attributes used in each of Type_APT, Type_BLD and Type_SUT is therefore 19. The
housing price is naturally chosen as the dependent variable, while the other
housing attributes are treated as independent variables.
There are two types of attributes: C stands for categorical and N for numeric.
The numbers of observations used in Type_APT, Type_BLD and Type_SUT are 6,115,
13,039 and 3,326, respectively. Two-thirds of the sample data are used in
building the model, and the remaining one-third is used as an external holdout
for measurement purposes.
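The two-thirds and one-third partition can be sketched in Python as follows. The sketch assumes a simple random shuffle with a fixed seed; the exact sampling scheme used in this paper is not stated, and the function name and the seed are hypothetical.

```python
import random

def holdout_split(records, train_fraction=2 / 3, seed=42):
    """Shuffle the records and split them into a training and a holdout part."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Numbers of observations per housing type as reported in this section;
# row indices stand in for the actual records.
sizes = {"Type_APT": 6115, "Type_BLD": 13039, "Type_SUT": 3326}
splits = {}
for name, n in sizes.items():
    train, test = holdout_split(list(range(n)))
    splits[name] = (len(train), len(test))
    print(name, len(train), len(test))
```

The models are fitted on the training part only, and the measures reported later are computed on the holdout part.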
Table 1: Data attributes
No. | Attributes | Type | Description |
1 | target_dst | C | Administrative districts: Songshan(1), Sinyi(2), Da-an(3), Jhongshan(4), Jhongjheng(5), Datong(6), Wanhua(7), Wunshan(8), Nangang(9), Neihu(10), Shihlin(11) and Beitou(12) |
2 | target_tp | C | With(1) or without(2) parking place |
3 | lnd_area | N | Occupied land area of the house (m²) |
4 | lndusg_tp | C | Type of land usage: Residential(1), Commercial(2), Industrial(3), Others(4), Agricultural(5) |
5 | ym_sold | C | Year and month when the house was sold |
6 | prk_sold | N | Number of parking places sold |
7 | flat_type | C | Floor numbering |
8 | total_flat | N | Total floor level of a building |
9 | hs_tp | C | Housing types: APT(1), BLD(2) and SUT(6) |
10 | cstrct_tp | C | Types of construction methods: Reinforced concrete (1), Reinforced brick structure (2), Referring to building occupation permit (3), Brick structure (4), Steel reinforced concrete (5), Referring to other registrations (6), Steel concrete (7), Precast reinforced concrete (8) |
11 | flr_area | N | Area of the house (m²) |
12 | room | N | Number of rooms |
13 | sit_room | N | Number of living and/or dining rooms |
14 | bathroom | N | Number of bathrooms |
15 | cmptmt | C | Compartment (1) or not (2) |
16 | mgt_cmt | C | Having (1) or not having (2) a management committee |
17 | pk_type | C | Parking type: No parking space (0), On the ground floor (1), Lifting plane (2), Lifting machinery (3), Ramp (4), Ramp machinery (5), Tower (6), Others (7) |
18 | pk_area | N | Parking area (m²) |
19 | flat_age | N | Housing age (year) |
20 | price | N | Total price (NTD) |
5. RESULTS AND DISCUSSION
The
purpose of this section is to ascertain
the predominant attributes of housing prices. Five models are utilized in the
prediction. There are many criteria used to measure the forecasting accuracy (MUNUSAMY;
MUTHUVEERAPPAN; BABA; ABDULLAH; ASMONI, 2015).
In
this paper, the measures used for comparison purposes are the MAPE (Mean
Absolute Percentage Error), R² (Coefficient of determination), RMSE (Root mean
squared error), MAE (Mean absolute error) and COR (Correlation).
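For reference, the five measures can be computed as in the following Python sketch. The formulas are the standard textbook definitions; in particular, R² is computed here as 1 - SSE/SST, which is an assumption, since the definition implemented in rminer may differ. The actual and predicted values are hypothetical.

```python
import math

def forecast_metrics(actual, predicted):
    """Return MAPE (in percent), R2, RMSE, MAE and COR for two equal-length lists."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100.0 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    sst = sum((a - mean_a) ** 2 for a in actual)
    # Assumed definition: R2 = 1 - SSE / SST.
    r2 = 1.0 - sum(e * e for e in errors) / sst
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    cor = cov / math.sqrt(sst * var_p)  # Pearson correlation
    return {"MAPE": mape, "R2": r2, "RMSE": rmse, "MAE": mae, "COR": cor}

# Hypothetical actual and predicted prices.
metrics = forecast_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(metrics)
```

Lower values are better for MAPE, RMSE and MAE, whereas higher values are better for R² and COR.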
The results are derived from the package ‘rminer’ (CORTEZ, 2016) and displayed
in Table 2. The notation "<" means that a lower value is better, and
">" means that a higher value is better. The notation
"¹" marks the best performance on the specific measure for
each housing type.
For all houses, the MT’s MAE is larger than the RF’s, whereas the MT’s RMSE is
smaller than the RF’s. Because RMSE penalizes large errors more heavily than
MAE, this means that the RF forecasts are on average closer to the real prices
than the MT forecasts, but the RF produce more large errors (outliers) than the
MT. In short, the MT has the best forecasting performance for all houses
because it has the best value on four of the five measures.
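The way MAE and RMSE can rank two models differently is illustrated by the following Python sketch. The two error vectors are hypothetical and only mimic the pattern described above: model A is usually close but has one large miss, while model B is uniformly moderate.

```python
import math

def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error; squaring gives large errors more weight."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical forecast errors of two models on the same five houses.
errors_a = [1, 1, 1, 1, 16]   # mostly accurate, one outlier
errors_b = [5, 5, 5, 5, 5]    # uniformly moderate

print(mae(errors_a), mae(errors_b))
print(rmse(errors_a), rmse(errors_b))
```

Model A wins on MAE and loses on RMSE, which is the same pattern as RF versus MT for all houses.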
For apartments, RF win on all the measures: the smallest MAPE, RMSE and MAE,
and the largest R² and COR. For condominiums with elevators and for suites, ANN
does better than the other models because it has the best value on more than
half of the measures. Evidently, due to the distinct characteristics of the
different housing types, different algorithms need to be adopted.
Table 2: Measurement results for all types
Model | Measurement | Type_ALL | Type_APT | Type_BLD | Type_SUT |
DT | MAPE | 54.73 | 62.11 | 48.75 | 31.78 |
ANN | MAPE | 34.38 | 46.83 | 32.00 | 20.81 |
MR | MAPE | 50.58 | 50.18 | 40.93 | 23.94 |
MT | MAPE | 27.59¹ | 41.23 | 27.97¹ | 16.90¹ |
RF | MAPE | 27.95 | 39.71¹ | 29.33 | 23.38 |
DT | R² | 0.73 | 0.49 | 0.68 | 0.61 |
ANN | R² | 0.81 | 0.53 | 0.87¹ | 0.80¹ |
MR | R² | 0.73 | 0.57 | 0.74 | 0.76 |
MT | R² | 0.84¹ | 0.56 | 0.84 | 0.78 |
RF | R² | 0.78 | 0.59¹ | 0.80 | 0.79 |
DT | RMSE | 12447320 | 5003089 | 15617442 | 2789666 |
ANN | RMSE | 10374326 | 4794231 | 9971407¹ | 1996698¹ |
MR | RMSE | 12349125 | 4565800 | 14159109 | 2196543 |
MT | RMSE | 9648036¹ | 4628516 | 11232151 | 2010710 |
RF | RMSE | 11181599 | 4479439¹ | 12480842 | 2032913 |
DT | MAE | 6638674 | 3642247 | 8678978 | 1986994 |
ANN | MAE | 4531948 | 3436007 | 5762095 | 1335743¹ |
MR | MAE | 6051855 | 3321844 | 7791333 | 1536106 |
MT | MAE | 4307236 | 3289040 | 5485538 | 1372310 |
RF | MAE | 4115843¹ | 3169619¹ | 5467660¹ | 1384466 |
DT | COR | 0.85 | 0.70 | 0.83 | 0.78 |
ANN | COR | 0.90 | 0.74 | 0.93¹ | 0.89 |
MR | COR | 0.86 | 0.76 | 0.86 | 0.87 |
MT | COR | 0.92¹ | 0.75 | 0.92 | 0.89 |
RF | COR | 0.90 | 0.78¹ | 0.91 | 0.90¹ |
Each type has its own ranking of attributes. Insights may be gained by
utilizing the importance values of each attribute in a model, which can be
derived from rminer. Different models have different importance values for the
same attributes. Inspired by ensemble models, this paper averages the
importance values from the five models outlined above and ranks the attributes
according to these averages.
The attributes appearing in the bold frames in Table 3 constitute 95 percent of
the importance in each type. We then count the number of models (#) that share
the same attributes, that is, attributes that are among the top 10 attributes
of Type_ALL and at the same time appear in the bold frames for each method. The
results can be seen as the voting results based on the five models.
For Type_BLD, for instance, there are nine attributes that account for 95
percent of the importance with respect to housing prices. From the most
important to the least important, these are floor area, land area, number of
rooms, number of parking places sold, and so on. The floor area attribute in
Type_BLD receives the votes of all five models, land area gets four, number of
rooms gets two, and so on.
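The ranking and voting procedure can be sketched in Python as follows. The importance values are hypothetical and cover only three models and four attributes, and the voting rule (one vote from every model whose own 95 percent importance set contains the attribute) is an assumed reading of the procedure described above rather than a confirmed reproduction of it.

```python
def rank_by_average(importances):
    """Average each attribute's importance over the models and sort descending."""
    attributes = next(iter(importances.values())).keys()
    averages = {
        a: sum(model[a] for model in importances.values()) / len(importances)
        for a in attributes
    }
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)

def top_share(ranked, share=0.95):
    """Keep the leading attributes until they cover `share` of total importance."""
    total = sum(value for _, value in ranked)
    kept, running = [], 0.0
    for attribute, value in ranked:
        if running >= share * total:
            break
        kept.append(attribute)
        running += value
    return kept

def count_votes(importances, share=0.95):
    """Assumed rule: one vote per model whose own top-share set holds the attribute."""
    votes = {}
    for model in importances.values():
        ranked = sorted(model.items(), key=lambda kv: kv[1], reverse=True)
        for attribute in top_share(ranked, share):
            votes[attribute] = votes.get(attribute, 0) + 1
    return votes

# Hypothetical importance values (each model sums to 1).
importances = {
    "DT": {"flr_area": 0.60, "target_dst": 0.25, "pk_area": 0.12, "ym_sold": 0.03},
    "RF": {"flr_area": 0.50, "target_dst": 0.30, "pk_area": 0.17, "ym_sold": 0.03},
    "MR": {"flr_area": 0.70, "target_dst": 0.26, "pk_area": 0.03, "ym_sold": 0.01},
}
ranked = rank_by_average(importances)
selected = top_share(ranked)
votes = count_votes(importances)
print(ranked, selected, votes)
```

In this toy case the averaged ranking keeps three attributes, and pk_area receives only two votes because one model already reaches 95 percent without it.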
Table 3: Rankings and votes of attributes
Ranking | Type_ALL Attributes | # | Type_APT Attributes | # | Type_BLD Attributes | # | Type_SUT Attributes | # |
1 | flr_area | 5 | flr_area | 5 | flr_area | 5 | flr_area | 5 |
2 | target_dst | 4 | target_dst | 5 | lnd_area | 4 | target_dst | 5 |
3 | pk_area | 3 | bathroom | 4 | room | 2 | flat_age | 5 |
4 | lnd_area | 2 | lnd_area | 4 | prk_sold | 3 | lnd_area | 4 |
5 | room | 2 | pk_area | 3 | target_dst | 5 | cstrct_tp | 3 |
6 | prk_sold | 3 | sit_room | 4 | pk_area | 3 | pk_type | 4 |
7 | total_flat | 3 | prk_sold | 2 | total_flat | 2 | total_flat | 2 |
8 | bathroom | 1 | flat_age | 3 | bathroom | 3 | pk_area | 4 |
9 | cstrct_tp | 3 | target_tp | 2 | cstrct_tp | 2 | prk_sold | 1 |
10 | cmptmt | 1 | room | 4 | flat_age | 2 | lndusg_tp | 2 |
11 | sit_room | 2 | cstrct_tp | 2 | pk_type | 1 | bathroom | 3 |
12 | flat_age | 2 | ym_sold | 1 | sit_room | 1 | room | 3 |
13 | target_tp | 3 | cmptmt | 3 | target_tp | 1 | cmptmt | 2 |
14 | pk_type | 1 | pk_type | 2 | cmptmt | 0 | flat_type | 2 |
15 | lndusg_tp | 2 | lndusg_tp | 2 | flat_type | 0 | target_tp | 1 |
16 | ym_sold | 1 | flat_type | 1 | lndusg_tp | 0 | sit_room | 0 |
17 | flat_type | 1 | mgt_cmt | 1 | mgt_cmt | 0 | ym_sold | 0 |
18 | hs_tp | 0 | total_flat | 0 | ym_sold | 0 | mgt_cmt | 0 |
19 | mgt_cmt | 0 | | | | | | |
Floor area is the most important and most robust attribute: all five models
agree on it for all three types. There are many studies whose findings are in
line with this point of view. Sirmans et al. (2006) stated that floor area is
perhaps the most important structural attribute in determining house prices
(SIRMANS; MACDONALD; MACPHERSON; ZIETZ, 2006). In addition, Bracke (2015)
showed that the contribution of floor area is positive for housing prices. Xiao
et al. (2016) also found that property prices increase as floor area increases.
Moreover, to discover the characteristics of these attributes, this paper
extracts the 10 most important attributes of Type_ALL in Table 3 and uses them
as the baseline. For each of these attributes, we sum the models’ votes (#)
over the three housing types, sum its rankings over the three types before
averaging them, and, finally, calculate the variances of the rankings, as shown
in Table 4.
Table 4: Statistics on rankings and votes
Ranking | Attributes | Sum of Votes | Sum of Rankings | Averaged Rankings | Variances of Rankings |
1 | flr_area | 15 | 3 | 1.00 | 0.00 |
2 | target_dst | 15 | 9 | 3.00 | 4.50 |
3 | pk_area | 10 | 19 | 6.33 | 0.50 |
4 | lnd_area | 12 | 10 | 3.33 | 2.00 |
5 | room | 9 | 25 | 8.33 | 24.50 |
6 | prk_sold | 6 | 20 | 6.67 | 4.50 |
7 | total_flat | 4 | 32 | 10.67 | 60.50 |
8 | bathroom | 10 | 22 | 7.33 | 12.50 |
9 | cstrct_tp | 7 | 25 | 8.33 | 2.00 |
10 | cmptmt | 5 | 40 | 13.33 | 0.50 |
Further inferences can be drawn by classifying these attributes. The attributes
in Table 4 that receive over 50 percent of the total possible votes (15) are
referred to as major. Meanwhile, those that have relatively small variances of
rankings (less than 5) are referred to as stable.
Thus the major-stable attributes, namely floor area, administrative district,
parking area and land area, are identified with red shading, due to their high
importance and relatively small variances. Similarly, the major-unstable
attributes appear with orange shading, the minor-stable ones with yellow and
the minor-unstable ones with green.
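The classification rule can be applied to the Table 4 values with a short Python sketch. The votes and variances are copied from Table 4 and the two thresholds are those stated above; the function and variable names are hypothetical.

```python
# (sum of votes, variance of rankings) per attribute, copied from Table 4.
TABLE_4 = {
    "flr_area": (15, 0.00),
    "target_dst": (15, 4.50),
    "pk_area": (10, 0.50),
    "lnd_area": (12, 2.00),
    "room": (9, 24.50),
    "prk_sold": (6, 4.50),
    "total_flat": (4, 60.50),
    "bathroom": (10, 12.50),
    "cstrct_tp": (7, 2.00),
    "cmptmt": (5, 0.50),
}

def classify(votes, variance, max_votes=15, variance_limit=5.0):
    """Major: over 50 percent of the possible votes. Stable: variance below the limit."""
    importance = "major" if votes > 0.5 * max_votes else "minor"
    stability = "stable" if variance < variance_limit else "unstable"
    return f"{importance}-{stability}"

groups = {}
for attribute, (votes, variance) in TABLE_4.items():
    groups.setdefault(classify(votes, variance), []).append(attribute)

for label, attributes in groups.items():
    print(label, attributes)
```

The major-stable group produced by this rule contains flr_area, target_dst, pk_area and lnd_area, which matches the four attributes named above.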
First,
del Cacho (2010) stated that location is a factor of paramount importance when
determining the pricing of a property. Second, in downtown areas and inner
cities, parking requirements could profoundly alter the housing stock
(MANVILLE, 2013). Therefore, parking requirements can increase the price of
real estate (SHOUP, 2014). Finally, a larger land area leads to more floor area
in each of those three types of housing. Therefore, land area is also an
indicator.
Furthermore, the attributes that show up in the bold frames for Type_APT and
Type_SUT but do not appear in the bold frames for Type_ALL in Table 3 are
referred to as type-dependent attributes. This indicates that different types
also have their own preferred attributes. Finally, the attributes outside the
bold area for each type of housing are referred to as others; those attributes
are less important.
By
identifying the attributes, the roadmap of importance as shown in Figure 3 is
constructed. This could serve as a reference when people appraise a house in
Taipei. For example, when people want to buy a condominium with an elevator,
the first considerations will be floor area, target district, parking area and
land area, all of which are major-stable attributes. Next, major-unstable
attributes, such as the numbers of rooms and bathrooms, followed by minor
attributes, will be taken into account. Finally, other attributes will be
considered.
The roadmap depicts the diversity of the attributes. The same major-unstable
attributes, for example the numbers of rooms and bathrooms, appear in different
ranking positions for different types. The numbers of rooms and bathrooms rank
higher for apartments and condominiums with an elevator than for suites. This
roadmap helps us to price houses.
Figure 3: Roadmap of important
attributes
The attributes in the bold area may or may not always be important. To check
this, we kept only the attributes in the bold frames in Table 3 and reran the
five models. The numbers of independent attributes used in Type_APT’,
Type_BLD’ and Type_SUT’ were thereby reduced to 13, 9 and 12, respectively.
Those attributes were considered to account for the most important 95 percent
according to the appraisals of the five models for each housing type. The
results are listed in Table 5. The yellow shading marks performances that are
better than in the previous experiment, which adopted all 19 attributes in the
evaluation. The green parts are worse and the white parts are equal.
Type_SUT’ performs better when only the 12 important attributes are used. This
indicates that most of the important attributes for Type_SUT’ were found in
this paper. However, the attributes for Type_APT’ and Type_BLD’ did not work as
well as those for Type_SUT’. This suggests that some attributes that matter for
these two types were not captured among those retained.
Table 5: Measurement results for 3 major types
Model | Measurement | Type_APT’ | Type_BLD’ | Type_SUT’ |
DT | MAPE | 62.11 | 48.57 | 31.03 |
ANN | MAPE | 45.76 | 33.66 | 16.59 |
MR | MAPE | 50.18 | 44.07 | 23.76 |
MT | MAPE | 41.52 | 28.36 | 16.99 |
RF | MAPE | 40.05 | 27.13 | 16.90 |
DT | R² | 0.49 | 0.71 | 0.66 |
ANN | R² | 0.56 | 0.83 | 0.81 |
MR | R² | 0.56 | 0.73 | 0.76 |
MT | R² | 0.56 | 0.83 | 0.78 |
RF | R² | 0.58 | 0.83 | 0.83 |
DT | RMSE | 5003089 | 15051426 | 2581766 |
ANN | RMSE | 4621929 | 11397272 | 1918152 |
MR | RMSE | 4620246 | 14371934 | 2196793 |
MT | RMSE | 4612869 | 11513048 | 2090838 |
RF | RMSE | 4530046 | 11428551 | 1840169 |
DT | MAE | 3642247 | 8492579 | 189946 |
ANN | MAE | 3278997 | 6381590 | 1254615 |
MR | MAE | 3324592 | 7992023 | 1530882 |
MT | MAE | 3298646 | 5671468 | 1378868 |
RF | MAE | 3205514 | 5095230 | 1190809 |
DT | COR | 0.70 | 0.84 | 0.82 |
ANN | COR | 0.75 | 0.91 | 0.90 |
MR | COR | 0.75 | 0.86 | 0.87 |
MT | COR | 0.75 | 0.91 | 0.89 |
RF | COR | 0.77 | 0.92 | 0.91 |
6. CONCLUSION
In this study, five data mining models were built on data from the Actual Price
Registration of Taiwan to examine the models’ prediction performance and to
identify which attributes are relatively more important for each type of house.
In the big data era, with its huge volumes of data, variables and methods, this
paper delineates a roadmap for the selection of variables in relation to house
prices.
First, this paper used five measures, namely MAPE, R², RMSE, MAE and COR, to
evaluate the five models’ prediction performance. In general, there was no
single best model that suited all three types of houses. While random forests
were more suitable for apartments, ANN was more reliable for condominiums with
elevator(s) and for suites. The reason is that the patterns of each housing
type are not completely similar. Therefore, the model selected depends on the
housing type.
Second, Figure 3 helps us to identify which attributes are important and how
they rank. Through this process of identification, influential factors are
shown in sequence, which supports decisions to buy or to set prices.
Future studies should take vicinity issues into account, such as the distances
to schools, department stores and parks. This research lacks this kind of
information. However, the models used could be revalidated once such data are
available, and more findings about the neighborhoods of the houses could then
be obtained.
REFERENCES
ACCIANI, C.; FUCILLI, V.; SARDARO, R. (2011) Data
Mining in Real Estate Appraisal: A Model Tree and Multivariate Adaptive
Regression Spline Approach. Aestimum,
v. 58, p. 27-45.
BAHIA, I. S. H. (2013) A Data Mining Model by Using
ANN for Predicting Real Estate Market: Comparative Study. International Journal
of Intelligence Science, v. 3, n. 4. p. 162-169.
BREIMAN, L.; FRIEDMAN, J. H.; OLSHEN, R. A.; STONE, C.
J. (1984) Classification and Regression Trees, Belmont, CA: Wadsworth.
BREIMAN, L. (2001) Random Forests. Machine Learning,
v. 45, n. 1, p. 5-32.
BRACKE, P. (2015) House Prices and Rents:
Microevidence From A Matched Data Set in Central London. Real Estate Economics,
v. 43, n. 2, p. 403-431.
COAKLEY, J. R.; BROWN, C. E. (2000) Artificial Neural
Networks in Accounting and Finance: Modeling Issues. International Journal of Intelligent Systems in Accounting, Finance and
Management, v. 9, n. 2. p. 119-144.
CORTEZ, P. (2016) Package ‘rminer’. Available: https://cran.r-project.org/web/packages/rminer/rminer.pdf . Access:
2nd September, 2016.
DEL CACHO, C. (2010) A Comparison of Data Mining
Methods for Mass Real Estate Appraisal (No. 27378). Munich Personal RePEc Archive.
DELMENDO, L. C. (2016) Taiwanese House Prices Continue
to Fall Due to Harsh Taxes. Retrieved on September 16, 2016, from
http://www.globalpropertyguide.com/Asia/Taiwan/Price-History
FAN, G. Z.; ONG, S. E.; KOH, H. C. (2006) Determinants
of House Price: A Decision Tree Approach. Urban
Studies, v. 43, n. 12, p. 2301-2315.
FIK, T. J.; LING, D. C.; MULLIGAN, G. F. (2003)
Modeling Spatial Variation in Housing Prices: A Variable Interaction Approach. Real Estate Economics, v. 31, n. 4, p.
623-646.
FONG, S.; WAH, Y. B. (2013) A Prediction Model for
Forecasting the Trend of Macau Property Price Movements and Understanding the
Influential Factors. Journal of Emerging
Technologies in Web Intelligence, v.5, n. 2, p. 122-131.
GAN, V.; AGARWAL, V.; KIM, B. (2015) Data Mining
Analysis and Predictions of Real Estate Prices. Issues in Information Systems, v. 16, n. 4, p. 30-36.
GOODMAN, A. C. (1978) Hedonic Prices, Price Indices
and Housing Markets. Journal of Urban
Economics, v. 5, n. 4, p. 471-484.
JAMES, G.; WITTEN, D.; HASTIE, T.; TIBSHIRANI, R. (2013)
An Introduction to Statistical Learning,
New York: Springer.
KASS, G. V. (1980) An Exploratory Technique for
Investigating Large Quantities of Categorical Data. Applied Statistics, v. 29, n. 2, p. 119-127.
KUHN, M.; WESTON, S.; DEEFER, C.; COUTLER, N. (2016)
Cubist Models for Regression, Available: https://cran.r-project.org/web/packages/Cubist/vignettes/cubist.pdf . Access:
10th December, 2016.
MAGIDSON, J. (1994) The CHAID Approach to Segmentation
Modeling: Chi-squared Automatic Interaction Detection, in: BAGOZZI, R. P.
(Ed.), Advanced Methods of Marketing
Research. Malden (Mass. US): Blackwell Business, p. 118-159.
MANVILLE, M. (2013) Parking Requirements and Housing
Development: Regulation and Reform in Los Angeles. Journal of the American Planning Association, v. 79, n. 1, p. 49-66.
MUNUSAMY, M.; MUTHUVEERAPPAN, C.; BABA, M.; ABDULLAH, M.
N.; ASMONI, M. (2015) An Overview of the Forecasting Methods Used in Real
Estate Housing Price Modelling. Jurnal
Teknologi, v. 73, n. 5, p. 189-193.
QUINLAN, J. R. (1986) Induction of Decision Trees. Machine Learning, v. 1, p. 81-106.
QUINLAN, J. R. (1992) C4.5: Programs for
Machine Learning, San Mateo, CA: Morgan Kaufmann.
SHOUP, D. (2014) The High Cost of Minimum Parking
Requirements, in: ISON, S.; MULLEY, C. (Ed.),
Parking: Issues and Policies. United Kingdom: Emerald Publishing, p.
87-113.
SIRMANS, G. S.; MACDONALD, L.; MACPHERSON, D. A.;
ZIETZ, E. N. (2006) The Value of Housing Characteristics: A Meta Analysis. The Journal of Real Estate Finance and
Economics, v. 33, n. 3, p. 215-240.
WELCH, T. F.; GEHRKE, S. R.; WANG, F. (2016) Long-term
Impact of Network Access to Bike Facilities and Public Transit Stations on
Housing Sales Prices in Portland, Oregon. Journal
of Transport Geography, v. 54, p. 264-272.
WITTEN, I. H.; FRANK, E. (2005) Data Mining: Practical Machine Learning Tools and Techniques, 5 ed.
Boston, MA: Morgan Kaufmann.
WOODS, E.; KYRAL, E. (1997) Ovum Evaluates Data Mining, London: Ovum.
XIAO, Y.; ORFORD, S.; WEBSTER, C. J. (2016) Urban
Configuration, Accessibility, and Property Prices: A Case Study of Cardiff,
Wales. Environment and Planning B:
Planning and Design, v. 43, n. 1, p. 108-129.