Diamond Quality Analysis

Frame

If you want to buy one of the best diamonds in the world, what are the different aspects you want to look at? Let's find out how a stone is turned into a precious gem.

It is the four C's that differentiate each stone:

  • Carat
  • Cut
  • Clarity
  • Colour

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [ ]:
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (13,8)

Acquire


In [ ]:
df = pd.read_csv("./diamonds.csv")
df.head()

Carat

The carat weight measures the mass of a diamond. One carat is defined as 200 milligrams (about 0.007 ounce avoirdupois). The point unit—equal to one one-hundredth of a carat (0.01 carat, or 2 mg)—is commonly used for diamonds of less than one carat. All else being equal, the price per carat increases with carat weight, since larger diamonds are both rarer and more desirable for use as gemstones.
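
As a quick arithmetic check (using the definitions above: 200 mg per carat, 100 points per carat):


In [ ]:
# Convert a carat weight to milligrams and points.
carat = 0.5
print(carat * 200)  # mass in milligrams -> 100.0
print(carat * 100)  # size in points -> 50.0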


In [ ]:
df.carat.describe()

Cut

Diamond cutting is the art and science of creating a gem-quality diamond out of mined rough. The cut of a diamond describes how it has been shaped and polished from its beginning form as a rough stone to its final gem proportions, covering both the quality of workmanship and the angles to which the stone is cut. Diamond cut is often confused with "shape".


In [ ]:
df.cut.unique()

Clarity

Diamond clarity is a quality of diamonds relating to the existence and visual appearance of internal characteristics of a diamond called inclusions, and surface defects called blemishes. Inclusions may be crystals of a foreign material or another diamond crystal, or structural imperfections such as tiny cracks that can appear whitish or cloudy. The number, size, color, relative location, orientation, and visibility of inclusions can all affect the relative clarity of a diamond. A clarity grade is assigned based on the overall appearance of the stone under ten times magnification.


In [ ]:
df.clarity.unique()

Colour

The finest quality as per color grading is totally colorless, graded worldwide as a "D" color diamond, meaning it is absolutely free from any color. The next grade has a very slight trace of color that can be detected by an expert diamond valuer or grading laboratory; however, when set in jewellery, these very lightly colored diamonds show no discernible color. These are graded as E or F color diamonds.

Refine

To perform any kind of visual exploration, we will need numeric values for the cut categories, so we create a new column that encodes each cut category as a number.


In [ ]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df["cut_num"] = encoder.fit_transform(df.cut)

In [ ]:
df.head()

In [ ]:
df.color.unique()

We also need to convert the values in the color column to numeric values.


In [ ]:
encoder = LabelEncoder()
df["color_num"] = encoder.fit_transform(df.color)

df.head(20)

Exercise

Create a column clarity_num with clarity as numeric value


In [ ]:
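
One possible solution, following the same LabelEncoder pattern used above for cut and color (later exercises reference clarity_num, so we include one here):


In [ ]:
encoder = LabelEncoder()
df["clarity_num"] = encoder.fit_transform(df.clarity)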

What is the highest price of D colour diamonds?


In [ ]:


In [ ]:
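
One possible solution, assuming the color column uses the standard D-J letter grades:


In [ ]:
# Highest price among D colour diamonds.
df[df.color == "D"].price.max()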

Visual Exploration

We start with histograms; in a histogram, the number of bins is an important parameter.


In [ ]:
df.price.hist(bins=100)

Plotting the price after log transformation

Transformations are done to address skewness in the distribution of a variable.


In [ ]:
df["price_log"] = np.log10(df.price)

In [ ]:
df.price_log.hist(bins=100)

Exercise

Plot bar charts for color_num, cut_num, clarity_num


In [ ]:


In [ ]:


In [ ]:
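
One possible approach: bar-plot the value counts of each encoded column, shown here for cut_num (color_num and clarity_num work the same way):


In [ ]:
# Bar chart of category frequencies for the encoded cut column.
df.cut_num.value_counts().sort_index().plot(kind="bar")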

Two variable exploration


In [ ]:
df.plot(x="carat", y="price", kind="scatter")

Exercise

Plot the price against carat after log transformation


In [ ]:


In [ ]:


In [ ]:
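
One possible solution; note that later cells in this notebook reference a carat_log column, so we create it here:


In [ ]:
# Log-transform carat and plot it against the log-transformed price.
df["carat_log"] = np.log10(df.carat)
df.plot(x="carat_log", y="price_log", kind="scatter")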

Exercise

Plot the log transforms for the other C's against price


In [ ]:


In [ ]:


In [ ]:
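
One possible reading of this exercise: since the other C's are categorical codes (and log10 of a zero code is undefined), plot price_log against each encoded column directly, e.g.:


In [ ]:
# Scatter of log price against the encoded cut categories.
df.plot(x="cut_num", y="price_log", kind="scatter")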

Compare how the price changes with respect to carat and cut of a diamond


In [ ]:
df[df.cut_num == 0].plot(x="carat_log", y="price_log", kind="scatter", color="red", label="Fair")

In [ ]:
ax = df[df.cut_num == 0].plot(x="carat_log", y="price_log", kind="scatter", color="red", label="Fair")
df[df.cut_num == 1].plot(x="carat_log", y="price_log", kind="scatter", color="green", label="Good", ax=ax)
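
Extending the same idea, a sketch that overlays all five cut categories in one loop (the labels assume LabelEncoder's alphabetical coding noted earlier):


In [ ]:
# One scatter layer per cut category, all sharing the same axes.
cut_labels = ["Fair", "Good", "Ideal", "Premium", "Very Good"]
cut_colors = ["red", "green", "blue", "orange", "purple"]
ax = None
for code, (label, colr) in enumerate(zip(cut_labels, cut_colors)):
    ax = df[df.cut_num == code].plot(x="carat_log", y="price_log",
                                     kind="scatter", color=colr,
                                     label=label, ax=ax)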

Linear Regression

Simple Linear Regression

Simple linear regression is an approach for predicting a quantitative response using a single feature (or "predictor" or "input variable"). It takes the following form:

y=β0+β1x

What does each term represent?

  • y is the response
  • x is the feature
  • β0 is the intercept
  • β1 is the coefficient for x

Together, β0 and β1 are called the model coefficients. To create a model, we must "learn" the values of these coefficients. Once we have learned them, we can use the model to predict price!

Estimating ("Learning") Model Coefficients

Generally speaking, coefficients are estimated using the least squares criterion, which means we find the line (mathematically) that minimizes the sum of squared residuals (or "sum of squared errors"):

What elements are present in the diagram?

  • The black dots are the observed values of x and y.
  • The blue line is our least squares line.
  • The red lines are the residuals, which are the distances between the observed values and the least squares line.

How do the model coefficients relate to the least squares line?

  • β0 is the intercept (the value of y when x=0)
  • β1 is the slope (the change in y divided by the change in x)
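
To make the least squares criterion concrete, here is a minimal sketch that computes β1 and β0 in closed form with numpy on a toy dataset; LinearRegression arrives at the same answer for a single feature:


In [ ]:
# Closed-form ordinary least squares for one feature:
#   beta1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
#   beta0 = mean(y) - beta1 * mean(x)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)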

Building the linear model


In [ ]:
from sklearn.linear_model import LinearRegression

In [ ]:
linear_model = LinearRegression()

sklearn expects the features to be a 2-d array, with one row per sample and one column per feature.


In [ ]:
X_train = df["carat_log"]
X_train.shape

In [ ]:
X_train = X_train.values.reshape(-1, 1)
X_train.shape

In [ ]:
y_train = df["price_log"]

Train the model


In [ ]:
linear_model.fit(X_train, y_train)

Coefficient - β1


In [ ]:
linear_model.coef_

Intercept - β0


In [ ]:
linear_model.intercept_

Interpreting Model Coefficients

How do we interpret the carat coefficient (β1)? A one-unit increase in carat_log is associated with a 1.67581673 increase in price_log. Because both variables are log-transformed, this means a 1% increase in carat is associated with roughly a 1.68% increase in price.

Note that if an increase in carat was associated with a decrease in price, β1 would be negative.

Using the Model for Prediction

y=β0+β1x
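
For example, we can apply this formula by hand with the learned coefficients and check it against the model (a quick sketch; x_sample is an arbitrary illustrative value):


In [ ]:
# Manual prediction: y = beta0 + beta1 * x.
x_sample = 0.0  # carat_log of 0.0 corresponds to a 1-carat diamond
manual = linear_model.intercept_ + linear_model.coef_[0] * x_sample
print(manual)
print(linear_model.predict(np.array([[x_sample]])))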

In [ ]:


In [ ]:


In [ ]:


In [ ]:
df[['carat_log','price_log']].head()

In [ ]:
df[['carat_log','price_log']].tail()

In [ ]:
X_test = pd.Series([df.carat_log.min(), df.carat_log.max()])
X_test = X_test.values.reshape(-1, 1)

In [ ]:
predicted = linear_model.predict(X_test)
print(predicted)

In [ ]:
# first, plot the observed data
df.plot(kind='scatter', x='carat_log', y='price_log')

# then, plot the least squares line
plt.plot(X_test, predicted, c='red', linewidth=2)

How Well Does the Model Fit the data?

The most common way to evaluate the overall fit of a linear model is by the R-squared value. R-squared is the proportion of variance explained, meaning the proportion of variance in the observed data that is explained by the model, or the reduction in error over the null model. (The null model just predicts the mean of the observed response, and thus it has an intercept and no slope.)

R-squared is between 0 and 1, and higher is better because it means that more variance is explained by the model.

Goodness of fit - R2 score


In [ ]:
test_samples = pd.Series(predicted)
test_samples = test_samples.values.reshape(-1, 1)

In [ ]:
true_values = pd.Series([3.440437, 2.513218])
true_values = true_values.values.reshape(-1, 1)

In [ ]:
linear_model.score(X_train, y_train)
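
As a sanity check, here is a minimal sketch computing R-squared by hand from its definition; it should match the score above:


In [ ]:
# R-squared = 1 - SS_residual / SS_total
y_pred = linear_model.predict(X_train)
ss_res = np.sum((y_train - y_pred) ** 2)
ss_tot = np.sum((y_train - y_train.mean()) ** 2)
print(1 - ss_res / ss_tot)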


Using multiple variables for regression


In [ ]:
df.head()

We need to do one-hot encoding on the cut categories.


In [ ]:
df_encoded = pd.get_dummies(df["cut_num"])
df_encoded.columns = ['cut_num0', 'cut_num1', 'cut_num2', 'cut_num3', 'cut_num4']
df_encoded.head()

In [ ]:
df.head()

Concatenate the two dataframes


In [ ]:
frames = [df, df_encoded]

df2 = pd.concat(frames, axis=1)
df2.head()

To run the model, we drop one of the dummy variables; keeping all five together with the intercept would make the columns perfectly collinear (the "dummy variable trap").
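
In recent pandas versions, get_dummies can drop a level for you via drop_first=True; a sketch of that alternative (it drops the first level rather than the last, but avoids the trap just the same):


In [ ]:
# Alternative: let pandas drop the first dummy column directly.
pd.get_dummies(df["cut_num"], prefix="cut_num", drop_first=True).head()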


In [ ]:
X4_train = df2[["carat_log", "cut_num0", "cut_num1", "cut_num2", "cut_num3"]]

In [ ]:
X4_train.head()

In [ ]:
y4_train = df2["price_log"]

In [ ]:
y4_train.head()

In [ ]:
X4_train.shape

In [ ]:
linear_model2 = LinearRegression()

In [ ]:
linear_model2.fit(X4_train, y4_train)

In [ ]:
linear_model2.coef_

In [ ]:
pd.set_option("display.precision", 4)

In [ ]:
pd.DataFrame(linear_model2.coef_)

In [ ]:
linear_model2.intercept_

In [ ]:
X4_test = np.array([-0.508638, 0, 1, 0, 0]).reshape(1, -1)  # carat_log plus the four retained cut dummies

In [ ]:
X4_test

In [ ]:
X4_test.shape

In [ ]:
predicted = linear_model2.predict(X4_test)
print(predicted)

In [ ]:
y_train[4]

In [ ]:
true_values = pd.Series([y_train[4]])

In [ ]:
true_values

In [ ]:
# Caution: R-squared is not meaningful for a single test sample,
# since the total sum of squares in that case is zero.
linear_model2.score(X4_test, true_values)


In [ ]: