Source of data: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh
In [1]:
import pandas as pd
import pandas_profiling
import numpy as np
In [2]:
df = pd.read_csv("examples/Meteorite_Landings.csv", parse_dates=['year'], encoding='UTF-8')
# Note: pandas Timestamps (nanosecond resolution) only cover roughly 1677-09-21 to 2262-04-11,
# so years outside that range are coerced to NaT and ignored in this analysis
df['year'] = pd.to_datetime(df['year'], errors='coerce')
# Example: Constant variable
df['source'] = "NASA"
# Example: Boolean variable
df['boolean'] = np.random.choice([True, False], df.shape[0])
# Example: Mixed base types
# (NumPy coerces the list [1, "A"] to a string array, so the stored values are '1' and 'A')
df['mixed'] = np.random.choice([1, "A"], df.shape[0])
# Example: Highly correlated variables
df['reclat_city'] = df['reclat'] + np.random.normal(scale=5, size=len(df))
# Example: Duplicate observations
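# Copy the first 10 rows so that editing them does not touch df
duplicates_to_add = df.iloc[0:10].copy()
duplicates_to_add['name'] = duplicates_to_add['name'] + " copy"
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df = pd.concat([df, duplicates_to_add], ignore_index=True)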
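The "duplicate observations" step above can be checked on a small scale. The following is a minimal sketch on a made-up three-row frame (the frame `toy` and the column subset are assumptions for illustration, not part of the original notebook): because the copied rows get a new name, they are not exact duplicates, but they do repeat every other column.

```python
import pandas as pd

# Hypothetical toy frame standing in for the meteorite data.
toy = pd.DataFrame({
    "name": ["Aachen", "Aarhus", "Abee"],
    "id": [1, 2, 6],
    "recclass": ["L5", "H6", "EH4"],
})

# Copy the first two rows and rename them, as the notebook does for ten rows.
copies = toy.iloc[0:2].copy()
copies["name"] = copies["name"] + " copy"
combined = pd.concat([toy, copies], ignore_index=True)

# Rows are only duplicated once the renamed column is left out of the comparison.
exact_dupes = combined.duplicated()
dupes_ignoring_name = combined.duplicated(subset=["id", "recclass"], keep=False)
print(combined[dupes_ignoring_name])
```

This is consistent with the report further down, where the copied `id` values appear with a count of 2 while `name` stays unique.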
In [3]:
pandas_profiling.ProfileReport(df)
Out[3]:
Overview
Dataset info
Number of variables: 14
Number of observations: 45726
Total Missing (%): 3.5%
Total size in memory: 4.6 MiB
Average record size in memory: 105.0 B
Variables types
Numeric: 4
Categorical: 5
Boolean: 1
Date: 1
Text (Unique): 1
Rejected: 2
Unsupported: 0
Warnings
GeoLocation has 7315 / 16.0% missing values (Missing)
GeoLocation has a high cardinality: 17101 distinct values (Warning)
mass (g) is highly skewed (γ1 = 76.918) (Skewed)
recclass has a high cardinality: 466 distinct values (Warning)
reclat has 6438 / 14.1% zeros (Zeros)
reclat has 7315 / 16.0% missing values (Missing)
reclat_city is highly correlated with reclat (ρ = 0.99424) (Rejected)
reclong has 6214 / 13.6% zeros (Zeros)
reclong has 7315 / 16.0% missing values (Missing)
source has constant value NASA (Rejected)
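The kinds of checks behind the warnings above (missing values, zeros, cardinality, constant columns) can be approximated with plain pandas. The sketch below runs on a made-up toy frame; `toy` and `basic_warnings` are hypothetical names, and this is an illustration of the idea rather than pandas-profiling's actual implementation or thresholds.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({
    "reclat": [0.0, -71.5, np.nan, 50.775, 0.0],
    "source": ["NASA"] * 5,
    "recclass": ["L5", "H6", "L6", "L5", "EH4"],
})

def basic_warnings(frame):
    """Return simple per-column checks (assumed, simplified logic)."""
    out = {}
    for col in frame.columns:
        s = frame[col]
        n_distinct = int(s.nunique(dropna=True))
        # Zeros are only meaningful for numeric columns.
        is_numeric = pd.api.types.is_numeric_dtype(s)
        out[col] = {
            "n_missing": int(s.isna().sum()),
            "n_zeros": int((s == 0).sum()) if is_numeric else 0,
            "n_distinct": n_distinct,
            "is_constant": n_distinct == 1,
        }
    return out

checks = basic_warnings(toy)
for col, result in checks.items():
    print(col, result)
```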
Variables
GeoLocation
Categorical
Distinct count: 17101
Unique (%): 37.4%
Missing (%): 16.0%
Missing (n): 7315
Value                       Count   Frequency (%)
(0.000000, 0.000000)        6214    13.6%
(-71.500000, 35.666670)     4761    10.4%
(-84.000000, 168.000000)    3040    6.6%
(-72.000000, 26.000000)     1505    3.3%
(-79.683330, 159.750000)    657     1.4%
(-76.716670, 159.666670)    637     1.4%
(-76.183330, 157.166670)    539     1.2%
(-79.683330, 155.750000)    473     1.0%
(-84.216670, 160.500000)    263     0.6%
(-86.366670, -70.000000)    226     0.5%
Other values (17090)        20096   43.9%
(Missing)                   7315    16.0%
boolean
Boolean
Distinct count: 2
Unique (%): 0.0%
Missing (%): 0.0%
Missing (n): 0
Mean: 0.49821
Value       Count   Frequency (%)
True        22781   49.8%
(Missing)   22945   50.2%
fall
Categorical
Distinct count: 2
Unique (%): 0.0%
Missing (%): 0.0%
Missing (n): 0
Value   Count   Frequency (%)
Found   44609   97.6%
Fell    1117    2.4%
id
Numeric
Distinct count: 45716
Unique (%): 100.0%
Missing (%): 0.0%
Missing (n): 0
Infinite (%): 0.0%
Infinite (n): 0
Mean: 26884
Minimum: 1
Maximum: 57458
Zeros (%): 0.0%
Quantile statistics
Minimum: 1
5-th percentile: 2388.8
Q1: 12681
Median: 24256
Q3: 40654
95-th percentile: 54891
Maximum: 57458
Range: 57457
Interquartile range: 27972
Descriptive statistics
Standard deviation: 16863
Coef of variation: 0.62727
Kurtosis: -1.1601
Mean: 26884
MAD: 14490
Skewness: 0.26653
Sum: 1229293495
Variance: 284380000
Memory size: 357.3 KiB
Value                  Count   Frequency (%)
417                    2       0.0%
398                    2       0.0%
1                      2       0.0%
6                      2       0.0%
392                    2       0.0%
370                    2       0.0%
379                    2       0.0%
2                      2       0.0%
390                    2       0.0%
10                     2       0.0%
Other values (45706)   45706   100.0%
Minimum 5 values
Value   Count   Frequency (%)
1       2       0.0%
2       2       0.0%
4       1       0.0%
5       1       0.0%
6       2       0.0%
Maximum 5 values
Value   Count   Frequency (%)
57454   1       0.0%
57455   1       0.0%
57456   1       0.0%
57457   1       0.0%
57458   1       0.0%
mass (g)
Numeric
Distinct count: 12577
Unique (%): 27.5%
Missing (%): 0.3%
Missing (n): 131
Infinite (%): 0.0%
Infinite (n): 0
Mean: 13278
Minimum: 0
Maximum: 60000000
Zeros (%): 0.0%
Quantile statistics
Minimum: 0
5-th percentile: 1.1
Q1: 7.2
Median: 32.61
Q3: 202.9
95-th percentile: 4000
Maximum: 60000000
Range: 60000000
Interquartile range: 195.7
Descriptive statistics
Standard deviation: 574930
Coef of variation: 43.298
Kurtosis: 6798.4
Mean: 13278
MAD: 25113
Skewness: 76.918
Sum: 605430000
Variance: 330540000000
Memory size: 357.3 KiB
Value                  Count   Frequency (%)
1.3                    171     0.4%
1.2                    140     0.3%
1.4                    138     0.3%
2.1                    130     0.3%
2.4                    126     0.3%
1.6                    120     0.3%
0.5                    119     0.3%
1.1                    116     0.3%
3.8                    114     0.2%
1.5                    111     0.2%
Other values (12566)   44310   96.9%
(Missing)              131     0.3%
Minimum 5 values
Value                  Count   Frequency (%)
0.0                    19      0.0%
0.01                   2       0.0%
0.013000000000000001   1       0.0%
0.02                   1       0.0%
0.03                   1       0.0%
Maximum 5 values
Value        Count   Frequency (%)
28000000.0   1       0.0%
30000000.0   1       0.0%
50000000.0   1       0.0%
58200000.0   1       0.0%
60000000.0   1       0.0%
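The skewness of 76.918 reported for mass (g) reflects a few very large masses against a bulk of small ones (median 32.61, maximum 60000000). As a rough illustration of how a single outlier drives this statistic, the sketch below computes the population skewness γ1 = m3 / m2^1.5 on made-up values. The function name and data are hypothetical, and this plain moment formula is not necessarily the exact estimator pandas-profiling uses.

```python
def population_skewness(values):
    """Population skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical toy masses: four small values and one large outlier.
toy_masses = [1.0, 1.0, 1.0, 1.0, 100.0]
g1 = population_skewness(toy_masses)
print(g1)

# A symmetric sample has zero skewness under the same formula.
g1_symmetric = population_skewness([1.0, 2.0, 3.0])
print(g1_symmetric)
```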
mixed
Categorical
Distinct count: 2
Unique (%): 0.0%
Missing (%): 0.0%
Missing (n): 0
Value   Count   Frequency (%)
1       22987   50.3%
A       22739   49.7%
name
Categorical, Unique
First 3 values: Dominion Range 10049, Yamato 74391, Miller Range 090500
Last 3 values: Roberts Massif 04129, Lewis Cliff 87087, Northwest Africa 6079
First 10 values
Value         Count   Frequency (%)
Aachen        1       0.0%
Aachen copy   1       0.0%
Aarhus        1       0.0%
Aarhus copy   1       0.0%
Abajo         1       0.0%
Last 10 values
Value            Count   Frequency (%)
Österplana 062   1       0.0%
Österplana 063   1       0.0%
Österplana 064   1       0.0%
Łowicz           1       0.0%
Święcany         1       0.0%
nametype
Categorical
Distinct count: 2
Unique (%): 0.0%
Missing (%): 0.0%
Missing (n): 0
Value    Count   Frequency (%)
Valid    45651   99.8%
Relict   75      0.2%
recclass
Categorical
Distinct count: 466
Unique (%): 1.0%
Missing (%): 0.0%
Missing (n): 0
Value                Count   Frequency (%)
L6                   8287    18.1%
H5                   7143    15.6%
L5                   4797    10.5%
H6                   4529    9.9%
H4                   4211    9.2%
LL5                  2766    6.0%
LL6                  2043    4.5%
L4                   1253    2.7%
H4/5                 428     0.9%
CM2                  416     0.9%
Other values (456)   9853    21.5%
reclat
Numeric
Distinct count: 12739
Unique (%): 27.9%
Missing (%): 16.0%
Missing (n): 7315
Infinite (%): 0.0%
Infinite (n): 0
Mean: -39.107
Minimum: -87.367
Maximum: 81.167
Zeros (%): 14.1%
Quantile statistics
Minimum: -87.367
5-th percentile: -84.355
Q1: -76.714
Median: -71.5
Q3: 0
95-th percentile: 34.494
Maximum: 81.167
Range: 168.53
Interquartile range: 76.714
Descriptive statistics
Standard deviation: 46.386
Coef of variation: -1.1861
Kurtosis: -1.4769
Mean: -39.107
MAD: 43.937
Skewness: 0.49132
Sum: -1502100
Variance: 2151.7
Memory size: 357.3 KiB
Value                  Count   Frequency (%)
0.0                    6438    14.1%
-71.5                  4761    10.4%
-84.0                  3040    6.6%
-72.0                  1506    3.3%
-79.68333              1130    2.5%
-76.71667              680     1.5%
-76.18333              539     1.2%
-84.21667              263     0.6%
-86.36667              226     0.5%
-86.71667              217     0.5%
Other values (12728)   19611   42.9%
(Missing)              7315    16.0%
Minimum 5 values
Value       Count   Frequency (%)
-87.36667   4       0.0%
-87.03333   3       0.0%
-86.93333   3       0.0%
-86.71667   217     0.5%
-86.56667   17      0.0%
Maximum 5 values
Value      Count   Frequency (%)
72.68333   1       0.0%
72.88333   1       0.0%
76.13333   1       0.0%
76.53333   1       0.0%
81.16667   1       0.0%
reclat_city
Highly correlated
This variable is highly correlated with reclat and should be ignored for analysis
Correlation: 0.99424
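The reported correlation is what one would expect from how reclat_city was built: reclat plus independent normal noise with scale 5. For X and independent noise e, corr(X, X + e) = σ_X / sqrt(σ_X² + σ_e²). The sketch below plugs in the reclat variance from this report (2151.7); the variable names are hypothetical, and the formula assumes the noise is independent of reclat, so an actual sample correlation will vary slightly from run to run.

```python
import math

# Variance of reclat as reported in this profile (Descriptive statistics).
reclat_variance = 2151.7
# Noise scale used when constructing reclat_city in the notebook.
noise_scale = 5.0

# Theoretical Pearson correlation between X and X + independent noise.
expected_rho = math.sqrt(reclat_variance / (reclat_variance + noise_scale ** 2))
print(round(expected_rho, 5))
```

The result agrees with the reported ρ = 0.99424 to the displayed precision.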
reclong
Numeric
Distinct count: 14641
Unique (%): 32.0%
Missing (%): 16.0%
Missing (n): 7315
Infinite (%): 0.0%
Infinite (n): 0
Mean: 61.053
Minimum: -165.43
Maximum: 354.47
Zeros (%): 13.6%
Quantile statistics
Minimum: -165.43
5-th percentile: -90.427
Q1: 0
Median: 35.667
Q3: 157.17
95-th percentile: 168
Maximum: 354.47
Range: 519.91
Interquartile range: 157.17
Descriptive statistics
Standard deviation: 80.655
Coef of variation: 1.3211
Kurtosis: -0.73139
Mean: 61.053
MAD: 67.606
Skewness: -0.17438
Sum: 2345100
Variance: 6505.3
Memory size: 357.3 KiB
Value                  Count   Frequency (%)
0.0                    6214    13.6%
35.66667               4985    10.9%
168.0                  3040    6.6%
26.0                   1506    3.3%
159.75                 657     1.4%
159.66666999999998     637     1.4%
157.16666999999998     542     1.2%
155.75                 473     1.0%
160.5                  263     0.6%
-70.0                  228     0.5%
Other values (14630)   19866   43.4%
(Missing)              7315    16.0%
Minimum 5 values
Value                 Count   Frequency (%)
-165.43333            9       0.0%
-165.11667            17      0.0%
-163.16666999999998   1       0.0%
-162.55               1       0.0%
-157.86667            1       0.0%
Maximum 5 values
Value                Count   Frequency (%)
175.13333            1       0.0%
175.73028            1       0.0%
178.08333000000002   1       0.0%
178.2                1       0.0%
354.47333            1       0.0%
source
Constant
This variable is constant and should be ignored for analysis
Constant value: NASA
year
Date
Distinct count: 246
Unique (%): 0.5%
Missing (%): 0.7%
Missing (n): 312
Infinite (%): 0.0%
Infinite (n): 0
Minimum: 1688-01-01 00:00:00
Maximum: 2101-01-01 00:00:00
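The minimum of 1688 shows that dates well before 1880 are representable. The practical limit comes from nanosecond-resolution Timestamps, whose bounds pandas exposes as `pd.Timestamp.min` and `pd.Timestamp.max`. The sketch below is a small illustration on made-up date strings (the series `years` is hypothetical); how the out-of-range entry is handled under `errors="coerce"` may depend on the pandas version and the datetime resolution it infers, so the code only prints that value rather than asserting on it.

```python
import pandas as pd

# Bounds of nanosecond-resolution Timestamps.
lower, upper = pd.Timestamp.min, pd.Timestamp.max
print(lower, upper)

# Hypothetical year strings: one before the nanosecond range, three inside it.
years = pd.Series(["1600-01-01", "1688-01-01", "1880-01-01", "2101-01-01"])
parsed = pd.to_datetime(years, errors="coerce")

# In-range dates parse normally; the 1600 entry is typically NaT
# when nanosecond resolution is used.
print(parsed)
in_range = parsed.iloc[1:]
```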
Correlations
Sample
   name      id   nametype  recclass     mass (g)  fall  year        reclat     reclong     GeoLocation               source  boolean  mixed  reclat_city
0  Aachen    1    Valid     L5           21.0      Fell  1880-01-01  50.77500   6.08333     (50.775000, 6.083330)     NASA    True     A      53.104124
1  Aarhus    2    Valid     H6           720.0     Fell  1951-01-01  56.18333   10.23333    (56.183330, 10.233330)    NASA    True     1      58.838867
2  Abee      6    Valid     EH4          107000.0  Fell  1952-01-01  54.21667   -113.00000  (54.216670, -113.000000)  NASA    True     1      59.307067
3  Acapulco  10   Valid     Acapulcoite  1914.0    Fell  1976-01-01  16.88333   -99.90000   (16.883330, -99.900000)   NASA    False    A      23.087539
4  Achiras   370  Valid     L6           780.0     Fell  1902-01-01  -33.16667  -64.95000   (-33.166670, -64.950000)  NASA    True     1      -34.589431
In [4]:
pfr = pandas_profiling.ProfileReport(df)
pfr.to_file("/tmp/example.html")
In [5]:
pfr
Out[5]:
(Report output identical to Out[3].)
Content source: JosPolfliet/pandas-profiling