In [0]:
#@title Copyright 2020 Google LLC. Double-click here for license information.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
This Colab introduces DataFrames, the central data structure in the pandas API. It is not a comprehensive DataFrames tutorial. Rather, it provides a very quick introduction to the parts of DataFrames required to do the other Colab exercises in Machine Learning Crash Course.
A DataFrame is similar to an in-memory spreadsheet. Like a spreadsheet, a DataFrame stores data in cells and has named columns (usually) and numbered rows.
In [0]:
import numpy as np
import pandas as pd
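The following code cell creates a simple DataFrame containing 10 cells organized as 5 rows and 2 columns, one column named temperature and the other named activity.
The code cell instantiates the pd.DataFrame class to generate a DataFrame. The class takes two arguments here. The first argument provides the data to populate the 10 cells; the code cell calls np.array to generate the 5x2 NumPy array. The second argument identifies the names of the two columns.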
In [0]:
# Create and populate a 5x2 NumPy array.
my_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]])
# Create a Python list that holds the names of the two columns.
my_column_names = ['temperature', 'activity']
# Create a DataFrame.
my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names)
# Print the entire DataFrame
print(my_dataframe)
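As a quick check on the layout described above, the following minimal sketch rebuilds the same 5x2 DataFrame and inspects its dimensions, column labels, and row index. The variable name example_df is an illustrative choice, not part of the original Colab.

```python
import numpy as np
import pandas as pd

# Rebuild the same 5x2 DataFrame used in the cell above.
# (example_df is an illustrative name, not from the original Colab.)
data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]])
example_df = pd.DataFrame(data=data, columns=['temperature', 'activity'])

# shape is (rows, columns); size is the total number of cells.
n_rows, n_cols = example_df.shape
print("rows:", n_rows, "columns:", n_cols, "cells:", example_df.size)

# The column labels come from the `columns` argument.
print(list(example_df.columns))

# When no index is supplied, rows are numbered from 0.
print(list(example_df.index))
```

The numbered rows are what the later cells rely on when they refer to "row #1" or "row #2".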
In [0]:
# Create a new column named adjusted.
my_dataframe["adjusted"] = my_dataframe["activity"] + 2
# Print the entire DataFrame
print(my_dataframe)
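The cell above adds a column by applying scalar arithmetic to an existing column. As a small sketch of the same idea, the following code shows that the arithmetic is applied to every row, and that a new column can also be built from two existing columns. The names example_df and total are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Same data as my_dataframe; example_df is an illustrative name.
example_df = pd.DataFrame(
    data=np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]]),
    columns=['temperature', 'activity'])

# Adding a scalar to a column adds it to the value in every row.
example_df['adjusted'] = example_df['activity'] + 2

# Adding two columns pairs up the values row by row.
# ('total' is a hypothetical column name used only for this sketch.)
example_df['total'] = example_df['temperature'] + example_df['activity']

print(example_df)
```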
In [0]:
print("Rows #0, #1, and #2:")
print(my_dataframe.head(3), '\n')
print("Row #2:")
print(my_dataframe.iloc[[2]], '\n')
print("Rows #1, #2, and #3:")
print(my_dataframe[1:4], '\n')
print("Column 'temperature':")
print(my_dataframe['temperature'])
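The cell above uses double brackets in iloc[[2]]. As a hedged illustration of why, the following sketch compares double and single brackets and reads one cell with the at accessor. The variable names are illustrative, not from the original Colab.

```python
import numpy as np
import pandas as pd

# Same data as my_dataframe; example_df is an illustrative name.
example_df = pd.DataFrame(
    data=np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]]),
    columns=['temperature', 'activity'])

# Double brackets return row #2 as a one-row DataFrame.
row_as_frame = example_df.iloc[[2]]

# Single brackets return the same row as a Series.
row_as_series = example_df.iloc[2]

print(type(row_as_frame).__name__, type(row_as_series).__name__)

# .at reads a single cell by row label and column name.
cell = example_df.at[2, 'activity']
print("activity in row #2:", cell)
```

Keeping the result as a DataFrame preserves the column headers when printing, which is why the tutorial cell uses the double-bracket form.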
Do the following:
Create a 3x4 (3 rows x 4 columns) pandas DataFrame in which the columns are named Eleanor, Chidi, Tahani, and Jason. Populate each of the 12 cells in the DataFrame with a random integer between 0 and 100, inclusive.
Output the entire DataFrame and the value in the cell of row #1 of the Eleanor column.
Create a fifth column named Janet, which is populated with the row-by-row sums of Tahani and Jason.
To complete this task, it helps to know the NumPy basics covered in the NumPy UltraQuick Tutorial.
In [0]:
# Write your code here.
In [0]:
#@title Double-click for a solution to Task 1.
# Create a Python list that holds the names of the four columns.
my_column_names = ['Eleanor', 'Chidi', 'Tahani', 'Jason']
# Create a 3x4 numpy array, each cell populated with a random integer.
my_data = np.random.randint(low=0, high=101, size=(3, 4))
# Create a DataFrame.
df = pd.DataFrame(data=my_data, columns=my_column_names)
# Print the entire DataFrame
print(df)
# Print the value in row #1 of the Eleanor column.
print("\nSecond row of the Eleanor column: %d\n" % df['Eleanor'][1])
# Create a column named Janet whose contents are the sum
# of two other columns.
df['Janet'] = df['Tahani'] + df['Jason']
# Print the enhanced DataFrame
print(df)
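The solution passes high=101 because np.random.randint excludes the upper bound. As an alternative sketch, NumPy's Generator API can include the upper bound directly via endpoint=True, and a seed makes the output repeatable. The seed value 42 and the name seeded_df are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# A seeded generator produces the same numbers on every run.
# (The seed 42 is an arbitrary, illustrative choice.)
rng = np.random.default_rng(seed=42)

names = ['Eleanor', 'Chidi', 'Tahani', 'Jason']

# endpoint=True makes `high` inclusive, so values fall in 0..100.
values = rng.integers(low=0, high=100, size=(3, 4), endpoint=True)

seeded_df = pd.DataFrame(data=values, columns=names)
seeded_df['Janet'] = seeded_df['Tahani'] + seeded_df['Jason']
print(seeded_df)
```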
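Pandas provides two different ways to duplicate a DataFrame:
Referencing. If you assign a DataFrame to a new variable, any change to the DataFrame or to the new variable will be reflected in the other.
Copying. If you call the pd.DataFrame.copy method, you create a true independent copy. Changes to the original DataFrame or to the copy will not be reflected in the other.
The difference is subtle, but important.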
In [0]:
# Create a reference by assigning df to a new variable.
print("Experiment with a reference:")
reference_to_df = df
# Print the starting value of a particular cell.
print(" Starting value of df: %d" % df['Jason'][1])
print(" Starting value of reference_to_df: %d\n" % reference_to_df['Jason'][1])
# Modify a cell in df.
df.at[1, 'Jason'] = df['Jason'][1] + 5
print(" Updated df: %d" % df['Jason'][1])
print(" Updated reference_to_df: %d\n\n" % reference_to_df['Jason'][1])
# Create a true copy of my_dataframe
print("Experiment with a true copy:")
copy_of_my_dataframe = my_dataframe.copy()
# Print the starting value of a particular cell.
print(" Starting value of my_dataframe: %d" % my_dataframe['activity'][1])
print(" Starting value of copy_of_my_dataframe: %d\n" % copy_of_my_dataframe['activity'][1])
# Modify a cell in my_dataframe.
my_dataframe.at[1, 'activity'] = my_dataframe['activity'][1] + 3
print(" Updated my_dataframe: %d" % my_dataframe['activity'][1])
print(" copy_of_my_dataframe does not get updated: %d" % copy_of_my_dataframe['activity'][1])
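The difference between a reference and a copy can also be checked directly with Python's is operator, which tests whether two names point to the same object. The following is a minimal sketch on a small, made-up DataFrame; the variable names and values are illustrative only.

```python
import pandas as pd

# A small, made-up DataFrame for illustration.
original = pd.DataFrame({'activity': [3, 7, 9]})

# Plain assignment creates a second name for the same object.
reference = original

# copy() creates a separate DataFrame with its own data.
independent = original.copy()

print("reference is original:  ", reference is original)
print("independent is original:", independent is original)

# Change one cell through the original name.
original.at[1, 'activity'] = 100

# The reference sees the change; the copy keeps the old value.
print("reference:  ", reference.at[1, 'activity'])
print("independent:", independent.at[1, 'activity'])
```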