In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [ ]:
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (13,8)
In [ ]:
df = pd.read_csv("./diamonds.csv")
df.head()
The carat weight measures the mass of a diamond. One carat is defined as 200 milligrams (about 0.007 ounce avoirdupois). The point unit—equal to one one-hundredth of a carat (0.01 carat, or 2 mg)—is commonly used for diamonds of less than one carat. All else being equal, the price per carat increases with carat weight, since larger diamonds are both rarer and more desirable for use as gemstones.
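Since one carat is 200 mg, the carat column converts to mass directly; a quick sketch (mass_mg is a hypothetical column name, not part of the dataset):
In [ ]:
# 1 carat = 200 mg
df["mass_mg"] = df.carat * 200
df[["carat", "mass_mg"]].head()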
In [ ]:
df.carat.describe()
Diamond cutting is the art and science of creating a gem-quality diamond out of mined rough. The cut of a diamond describes how it has been shaped and polished from its rough beginnings to its final gem proportions: the quality of the workmanship and the angles to which the stone is cut. Diamond cut is often confused with "shape".
In [ ]:
df.cut.unique()
Diamond clarity is a quality of diamonds relating to the existence and visual appearance of internal characteristics of a diamond called inclusions, and surface defects called blemishes. Inclusions may be crystals of a foreign material or another diamond crystal, or structural imperfections such as tiny cracks that can appear whitish or cloudy. The number, size, color, relative location, orientation, and visibility of inclusions can all affect the relative clarity of a diamond. A clarity grade is assigned based on the overall appearance of the stone under ten times magnification.
In [ ]:
df.clarity.unique()
The finest quality as per color grading is totally colorless, which is graded as a "D" color diamond across the globe, meaning it is absolutely free of any color. The next grades have a very slight trace of color, observable only by an expert diamond valuer or grading laboratory; when set in jewellery, these very lightly colored diamonds show no discernible color shade. These are graded as E or F color diamonds.
In [ ]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df["cut_num"] = encoder.fit_transform(df.cut)
In [ ]:
df.head()
In [ ]:
df.color.unique()
We need to convert the values in the color column to numeric values. Conveniently, alphabetical order matches the grading order here (D = 0, the best, through J = 6).
In [ ]:
encoder = LabelEncoder()
df["color_num"] = encoder.fit_transform(df.color)
df.head(20)
In [ ]:
What is the highest price of D colour diamonds?
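One way to answer this, as a sketch using boolean indexing:
In [ ]:
# highest price among D-color diamonds
df[df.color == "D"].price.max()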
We start with histograms; for histograms, the number of bins is an important parameter.
In [ ]:
df.price.hist(bins=100)
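To see how sensitive the picture is to the bin count, the same histogram can be drawn a few ways; a sketch:
In [ ]:
# too few bins hide structure; very many bins add noise
fig, axes = plt.subplots(1, 3, figsize=(13, 4))
for ax, bins in zip(axes, [10, 100, 500]):
    df.price.hist(bins=bins, ax=ax)
    ax.set_title("bins = %d" % bins)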
In [ ]:
df["price_log"] = np.log10(df.price)
In [ ]:
df.price_log.hist(bins=100)
In [ ]:
df.plot(x="carat", y="price", kind="scatter")
In [ ]:
# price is already on a log scale; put carat on a log scale too for the plots below
df["carat_log"] = np.log10(df.carat)
In [ ]:
df[df.cut_num == 0].plot(x="carat_log", y="price_log", kind="scatter", color="red", label="Fair")
In [ ]:
ax = df[df.cut_num == 0].plot(x="carat_log", y="price_log", kind="scatter", color="red", label="Fair")
df[df.cut_num == 1].plot(x="carat_log", y="price_log", kind="scatter", color="green", label="Good", ax=ax)
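The same overlay extends to all five cut grades with a loop; a sketch (the color choices are arbitrary):
In [ ]:
# cut_num codes follow LabelEncoder's alphabetical order
labels = ["Fair", "Good", "Ideal", "Premium", "Very Good"]
colors = ["red", "green", "blue", "orange", "purple"]
ax = None
for code, (label, color) in enumerate(zip(labels, colors)):
    ax = df[df.cut_num == code].plot(x="carat_log", y="price_log", kind="scatter",
                                     color=color, label=label, ax=ax)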
Simple linear regression is an approach for predicting a quantitative response using a single feature (or "predictor" or "input variable"). It takes the following form:
$y = \beta_0 + \beta_1 x$
What does each term represent?
- $y$ is the response (the value we want to predict)
- $x$ is the feature
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x$ (the slope)
Together, $\beta_0$ and $\beta_1$ are called the model coefficients. To create the model, we must "learn" the values of these coefficients; once we have learned them, we can use the model to predict price!
Generally speaking, coefficients are estimated using the least squares criterion, which means we find the line that (mathematically) minimizes the sum of squared residuals (or "sum of squared errors").
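Written out for $n$ observations, the criterion is:

$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$$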
In a scatter plot of the data, the least squares line is the one whose squared vertical distances (residuals) to the observed points have the smallest possible sum; $\beta_0$ is that line's intercept and $\beta_1$ is its slope.
In [ ]:
from sklearn.linear_model import LinearRegression
In [ ]:
linear_model = LinearRegression()
scikit-learn expects the features to be a 2D array of shape (n_samples, n_features).
In [ ]:
X_train = df["carat_log"]
X_train.shape
In [ ]:
X_train = X_train.values.reshape(-1, 1)
X_train.shape
In [ ]:
y_train = df["price_log"]
Train the model
In [ ]:
linear_model.fit(X_train, y_train)
Coefficient: $\beta_1$
In [ ]:
linear_model.coef_
Intercept: $\beta_0$
In [ ]:
linear_model.intercept_
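Because both variables are log10-transformed, the fitted line implies a power law on the original scale: $\text{price} \approx 10^{\beta_0} \cdot \text{carat}^{\beta_1}$. A sketch of the implied price for a hypothetical 0.5-carat stone:
In [ ]:
# back-transform the log-log fit (illustration only)
b0, b1 = linear_model.intercept_, linear_model.coef_[0]
10 ** b0 * 0.5 ** b1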
In [ ]:
df[['carat_log','price_log']].head()
In [ ]:
df[['carat_log','price_log']].tail()
In [ ]:
X_test = pd.Series([df.carat_log.min(), df.carat_log.max()])
X_test = X_test.values.reshape(-1, 1)
In [ ]:
predicted = linear_model.predict(X_test)
print(predicted)
In [ ]:
# first, plot the observed data
df.plot(kind='scatter', x='carat_log', y='price_log')
# then, plot the least squares line
plt.plot(X_test, predicted, c='red', linewidth=2)
The most common way to evaluate the overall fit of a linear model is by the R-squared value. R-squared is the proportion of variance explained, meaning the proportion of variance in the observed data that is explained by the model, or the reduction in error over the null model. (The null model just predicts the mean of the observed response, and thus it has an intercept and no slope.)
R-squared is between 0 and 1, and higher is better because it means that more variance is explained by the model.
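As a sanity check, R-squared can be computed directly from its definition with the model fitted above; a sketch:
In [ ]:
# R² = 1 - SS_res / SS_tot, which should match linear_model.score(X_train, y_train)
y_true = df["price_log"]
y_pred = linear_model.predict(X_train)
ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
1 - ss_res / ss_tot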
In [ ]:
test_samples = pd.Series(predicted)
test_samples = test_samples.values.reshape(-1, 1)
In [ ]:
true_values = pd.Series([3.440437, 2.513218])
true_values = true_values.values.reshape(-1, 1)
In [ ]:
linear_model.score(X_train, df["price_log"])
In [ ]:
linear_model.coef_
In [ ]:
linear_model.intercept_
In [ ]:
df.head()
We need to do one-hot encoding on the cut column.
In [ ]:
df_encoded = pd.get_dummies(df["cut_num"])
df_encoded.columns = ['cut_num0', 'cut_num1', 'cut_num2', 'cut_num3', 'cut_num4']
df_encoded.head()
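Equivalently, get_dummies can work on the string column directly and name the columns itself; a sketch:
In [ ]:
# dummies straight from the cut strings; prefix controls the column names
pd.get_dummies(df.cut, prefix="cut").head()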
In [ ]:
df.head()
Concatenate the two dataframes
In [ ]:
frames = [df, df_encoded]
df2 = pd.concat(frames, axis=1)
df2.head()
To fit the model, we drop one of the dummy variables; keeping all five would make them perfectly collinear with the intercept (the "dummy variable trap").
In [ ]:
X4_train = df2[["carat_log", "cut_num0", "cut_num1", "cut_num2", "cut_num3"]]
In [ ]:
X4_train.head()
In [ ]:
y4_train = df2["price_log"]
In [ ]:
y4_train.head()
In [ ]:
X4_train.shape
In [ ]:
linear_model2 = LinearRegression()
In [ ]:
linear_model2.fit(X4_train, y4_train)
In [ ]:
linear_model2.coef_
In [ ]:
pd.set_option("display.precision", 4)
In [ ]:
pd.DataFrame(linear_model2.coef_, index=X4_train.columns, columns=["coef"])
In [ ]:
linear_model2.intercept_
In [ ]:
# row 4's features: carat_log = log10(0.31) ≈ -0.5086, cut = Good (so cut_num1 = 1)
X4_test = pd.Series([-0.508638, 0, 1, 0, 0])
In [ ]:
X4_test
In [ ]:
X4_test.shape
In [ ]:
linear_model2.predict(X4_test.values.reshape(1, 5))
In [ ]:
predicted = linear_model2.predict(X4_test.values.reshape(1, 5))
print(predicted)
In [ ]:
y_train[4]
In [ ]:
true_values = pd.Series([y_train[4]])
In [ ]:
X4_test
In [ ]:
X4_test.shape
In [ ]:
X4_test = X4_test.values.reshape(1, 5)
In [ ]:
X4_test
In [ ]:
true_values
In [ ]:
# note: R² computed on a single sample is not meaningful (the total variance is zero)
linear_model2.score(X4_test, true_values)
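A more informative check is the R-squared over the full training data; a sketch:
In [ ]:
# R² of the carat_log + cut-dummies model on the training set
linear_model2.score(X4_train, y4_train)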
In [ ]:
linear_model2.coef_
In [ ]:
linear_model2.intercept_