Tables
- class prob140.Table(labels=None, _deprecated=None, *, formatter=<datascience.formats.Formatter object>)[source]
  A sequence of string-labeled columns.
- class Rows(table)[source]
  An iterable view over the rows in a table.
- append(row_or_table)[source]
  Append a row or all rows of a table. An appended table must have all columns of self.
- append_column(label, values)[source]
  Appends a column to the table or replaces a column.
  __setitem__ is aliased to this method: table.append_column('new_col', make_array(1, 2, 3)) is equivalent to table['new_col'] = make_array(1, 2, 3).
  - Args:
    - label (str): The label of the new column.
    - values (single value or list/array): If a single value, every value in the new column is values. If a list or array, the new column contains the values in values, which must be the same length as the table.
  - Returns:
    - Original table with new or replaced column
  - Raises:
    - ValueError: If label is not a string.
    - ValueError: If values is a list/array and does not have the same length as the number of rows in the table.
>>> table = Table().with_columns( ... 'letter', make_array('a', 'b', 'c', 'z'), ... 'count', make_array(9, 3, 3, 1), ... 'points', make_array(1, 2, 2, 10)) >>> table letter | count | points a | 9 | 1 b | 3 | 2 c | 3 | 2 z | 1 | 10 >>> table.append_column('new_col1', make_array(10, 20, 30, 40)) >>> table letter | count | points | new_col1 a | 9 | 1 | 10 b | 3 | 2 | 20 c | 3 | 2 | 30 z | 1 | 10 | 40 >>> table.append_column('new_col2', 'hello') >>> table letter | count | points | new_col1 | new_col2 a | 9 | 1 | 10 | hello b | 3 | 2 | 20 | hello c | 3 | 2 | 30 | hello z | 1 | 10 | 40 | hello >>> table.append_column(123, make_array(1, 2, 3, 4)) Traceback (most recent call last): ... ValueError: The column label must be a string, but a int was given >>> table.append_column('bad_col', [1, 2]) Traceback (most recent call last): ... ValueError: Column length mismatch. New column does not have the same number of rows as table.
- apply(fn, *column_or_columns)[source]
  Apply fn to each element or elements of column_or_columns. If no column_or_columns are provided, fn is applied to each row.
  - Args:
    - fn (function): The function to apply.
    - column_or_columns: Columns containing the arguments to fn as either column labels (str) or column indices (int). The number of columns must match the number of arguments that fn expects.
  - Raises:
    - ValueError: if column_label is not an existing column in the table.
    - TypeError: if an insufficient number of column_label are passed to fn.
  - Returns:
    - An array consisting of results of applying fn to elements specified by column_label in each row.
>>> t = Table().with_columns( ... 'letter', make_array('a', 'b', 'c', 'z'), ... 'count', make_array(9, 3, 3, 1), ... 'points', make_array(1, 2, 2, 10)) >>> t letter | count | points a | 9 | 1 b | 3 | 2 c | 3 | 2 z | 1 | 10 >>> t.apply(lambda x: x - 1, 'points') array([0, 1, 1, 9]) >>> t.apply(lambda x, y: x * y, 'count', 'points') array([ 9, 6, 6, 10]) >>> t.apply(lambda x: x - 1, 'count', 'points') Traceback (most recent call last): ... TypeError: <lambda>() takes 1 positional argument but 2 were given >>> t.apply(lambda x: x - 1, 'counts') Traceback (most recent call last): ... ValueError: The column "counts" is not in the table. The table contains these columns: letter, count, points
Whole rows are passed to the function if no columns are specified.
>>> t.apply(lambda row: row[1] * 2) array([18, 6, 6, 2])
- as_html(max_rows=0)[source]
  Format table as HTML.
- as_text(max_rows=0, sep=' | ')[source]
  Format table as text.
- bar(column_for_categories=None, select=None, overlay=True, width=6, height=4, **vargs)[source]
  Plot bar charts for the table.
  Each plot is labeled using the values in column_for_categories and one plot is produced for every other column (or for the columns designated by select).
  Every selected column except column_for_categories must be numerical.
  - Args:
    - column_for_categories (str): A column containing x-axis categories
  - Kwargs:
    - overlay (bool): create a chart with one color per data column; if False, each will be displayed separately.
    - vargs: Additional arguments that get passed into plt.bar. See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.bar for additional arguments that can be passed into vargs.
- barh(column_for_categories=None, select=None, overlay=True, width=6, **vargs)[source]
  Plot horizontal bar charts for the table.
  - Args:
    - column_for_categories (str): A column containing y-axis categories used to create buckets for the bar chart.
  - Kwargs:
    - overlay (bool): create a chart with one color per data column; if False, each will be displayed separately.
    - vargs: Additional arguments that get passed into plt.barh. See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.barh for additional arguments that can be passed into vargs.
  - Raises:
    - ValueError: Every selected column except column_for_categories must be numerical.
  - Returns:
    - Horizontal bar graph with buckets specified by column_for_categories. Each plot is labeled using the values in column_for_categories and one plot is produced for every other column (or for the columns designated by select).
>>> t = Table().with_columns( ... 'Furniture', make_array('chairs', 'tables', 'desks'), ... 'Count', make_array(6, 1, 2), ... 'Price', make_array(10, 20, 30) ... ) >>> t Furniture | Count | Price chairs | 6 | 10 tables | 1 | 20 desks | 2 | 30 >>> t.barh('Furniture') <bar graph with furniture as categories and bars for count and price> >>> t.barh('Furniture', 'Price') <bar graph with furniture as categories and bars for price> >>> t.barh('Furniture', make_array(1, 2)) <bar graph with furniture as categories and bars for count and price>
- bin(*columns, **vargs)[source]
  Group values by bin and compute counts per bin by column.
  By default, bins are chosen to contain all values in all columns. The following named arguments from numpy.histogram can be applied to specialize bin widths:
  If the original table has n columns, the resulting binned table has n+1 columns, where column 0 contains the lower bound of each bin.
  - Args:
    - columns (str or int): Labels or indices of columns to be binned. If empty, all columns are binned.
    - bins (int or sequence of scalars): If bins is an int, it defines the number of equal-width bins in the given range (10, by default). If bins is a sequence, it defines the bin edges, including the rightmost edge, allowing for non-uniform bin widths.
    - range ((float, float)): The lower and upper range of the bins. If not provided, range contains all values in the table. Values outside the range are ignored.
    - density (bool): If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.
- boxplot(**vargs)[source]
  Plots a boxplot for the table.
  Every column must be numerical.
  - Kwargs:
    - vargs: Additional arguments that get passed into plt.boxplot. See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.boxplot for additional arguments that can be passed into vargs. These include vert and showmeans.
  - Returns:
    - None
  - Raises:
    - ValueError: The Table contains columns with non-numerical values.
>>> table = Table().with_columns( ... 'test1', make_array(92.5, 88, 72, 71, 99, 100, 95, 83, 94, 93), ... 'test2', make_array(89, 84, 74, 66, 92, 99, 88, 81, 95, 94)) >>> table test1 | test2 92.5 | 89 88 | 84 72 | 74 71 | 66 99 | 92 100 | 99 95 | 88 83 | 81 94 | 95 93 | 94 >>> table.boxplot() <boxplot of test1 and boxplot of test2 side-by-side on the same figure>
- cdf(x)
  Finds the cdf of the distribution.
  Parameters:
    - x : float
      Value in distribution.
  Returns:
    - float
      P(X <= x).
  Examples
>>> dist = Table().with_columns( ... 'Value', make_array(2, 3, 4), ... 'Probability', make_array(0.25, 0.5, 0.25)) >>> dist.cdf(0) 0 >>> dist.cdf(2) 0.25 >>> dist.cdf(3.5) 0.75 >>> dist.cdf(1000) 1.0
- column(index_or_label)[source]
  Return the values of a column as an array.
  table.column(label) is equivalent to table[label].
>>> tiles = Table().with_columns( ... 'letter', make_array('c', 'd'), ... 'count', make_array(2, 4), ... )
>>> list(tiles.column('letter')) ['c', 'd'] >>> tiles.column(1) array([2, 4])
  - Args:
    - label (int or str): The index or label of a column
  - Returns:
    - An instance of numpy.array.
  - Raises:
    - ValueError: When the index_or_label is not in the table.
- column_index(label)[source]
  Return the index of a column by looking up its label.
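A minimal usage sketch (not from the original docs), reusing the tiles table from the column example above; the zero-based position shown is an illustration:
>>> tiles = Table().with_columns('letter', make_array('c', 'd'), 'count', make_array(2, 4))
>>> tiles.column_index('count')
1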
- column_labels
  Return a tuple of column labels. [Deprecated]
- copy(*, shallow=False)[source]
  Return a copy of a table.
- drop(*column_or_columns)[source]
  Return a Table with only columns other than selected label or labels.
  - Args:
    - column_or_columns (string or list of strings): The header names or indices of the columns to be dropped. column_or_columns must be an existing header name, or a valid column index.
  - Returns:
    - An instance of Table with given columns removed.
>>> t = Table().with_columns( ... 'burgers', make_array('cheeseburger', 'hamburger', 'veggie burger'), ... 'prices', make_array(6, 5, 5), ... 'calories', make_array(743, 651, 582)) >>> t burgers | prices | calories cheeseburger | 6 | 743 hamburger | 5 | 651 veggie burger | 5 | 582 >>> t.drop('prices') burgers | calories cheeseburger | 743 hamburger | 651 veggie burger | 582 >>> t.drop(['burgers', 'calories']) prices 6 5 5 >>> t.drop('burgers', 'calories') prices 6 5 5 >>> t.drop([0, 2]) prices 6 5 5 >>> t.drop(0, 2) prices 6 5 5 >>> t.drop(1) burgers | calories cheeseburger | 743 hamburger | 651 veggie burger | 582
- classmethod empty(labels=None)[source]
  Creates an empty table. Column labels are optional. [Deprecated]
  - Args:
    - labels (None or list): If None, a table with 0 columns is created. If a list, each element is a column label in a table with 0 rows.
  - Returns:
    - A new instance of Table.
- ev()
  Finds the expected value of the distribution.
  Returns:
    - float
      Expected value.
  Examples
>>> dist = Table().values([1, 2, 4]).probability([0.5, 0.4, 0.1]) >>> dist.ev() 1.7 >>> 1 * 0.5 + 2 * 0.4 + 4 * 0.1 1.7
- event(x)
  Shows the probability that the distribution takes on value x or list of values x.
  Parameters:
    - x : float or Iterable
      An event represented either as a specific value in the domain or a subset of the domain.
  Returns:
    - Table
      Shows the probabilities of each value in the event.
  Examples
>>> dist = Table().values([1, 2, 3, 4]).probability([1/4, 1/4, 1/4, 1/4]) >>> dist.event(2) Domain | Probability 2 | 0.25 >>> dist.event([2,3]) Domain | Probability 2 | 0.25 3 | 0.25
- exclude()[source]
  Return a new Table without a sequence of rows excluded by number.
  - Args:
    - row_indices_or_slice (integer or list of integers or slice): The row index, list of row indices or a slice of row indices to be excluded.
  - Returns:
    - A new instance of Table.
>>> t = Table().with_columns( ... 'letter grade', make_array('A+', 'A', 'A-', 'B+', 'B', 'B-'), ... 'gpa', make_array(4, 4, 3.7, 3.3, 3, 2.7)) >>> t letter grade | gpa A+ | 4 A | 4 A- | 3.7 B+ | 3.3 B | 3 B- | 2.7 >>> t.exclude(4) letter grade | gpa A+ | 4 A | 4 A- | 3.7 B+ | 3.3 B- | 2.7 >>> t.exclude(-1) letter grade | gpa A+ | 4 A | 4 A- | 3.7 B+ | 3.3 B | 3 >>> t.exclude(make_array(1, 3, 4)) letter grade | gpa A+ | 4 A- | 3.7 B- | 2.7 >>> t.exclude(range(3)) letter grade | gpa B+ | 3.3 B | 3 B- | 2.7
Note that exclude also supports NumPy-like indexing and slicing: >>> t.exclude[:3] letter grade | gpa B+ | 3.3 B | 3 B- | 2.7
>>> t.exclude[1, 3, 4] letter grade | gpa A+ | 4 A- | 3.7 B- | 2.7
- classmethod from_array(arr)[source]
  Convert a structured NumPy array into a Table.
- classmethod from_columns_dict(columns)[source]
  Create a table from a mapping of column labels to column values. [Deprecated]
- classmethod from_df(df)[source]
  Convert a Pandas DataFrame into a Table.
- classmethod from_records(records)[source]
  Create a table from a sequence of records (dicts with fixed keys).
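A minimal sketch of from_records (not from the original docs), assuming, as the description implies, that column labels come from the record keys; the output column order shown is an assumption:
>>> records = [{'letter': 'c', 'count': 2}, {'letter': 'd', 'count': 4}]
>>> Table.from_records(records)
letter | count
c | 2
d | 4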
- classmethod from_rows(rows, labels)[source]
  Create a table from a sequence of rows (fixed-length sequences). [Deprecated]
- group(column_or_label, collect=None)[source]
  Group rows by unique values in a column; count or aggregate others.
  - Args:
    - column_or_label: values to group (column label or index, or array)
    - collect: a function applied to values in other columns for each group
  - Returns:
    - A Table with each row corresponding to a unique value in column_or_label, where the first column contains the unique values from column_or_label, and the second contains counts for each of the unique values. If collect is provided, a Table is returned with all original columns, each containing values calculated by first grouping rows according to column_or_label, then applying collect to each set of grouped values in the other columns.
  - Note:
    - The grouped column will appear first in the result table. If collect does not accept arguments with one of the column types, that column will be empty in the resulting table.
>>> marbles = Table().with_columns( ... "Color", make_array("Red", "Green", "Blue", "Red", "Green", "Green"), ... "Shape", make_array("Round", "Rectangular", "Rectangular", "Round", "Rectangular", "Round"), ... "Amount", make_array(4, 6, 12, 7, 9, 2), ... "Price", make_array(1.30, 1.30, 2.00, 1.75, 1.40, 1.00)) >>> marbles Color | Shape | Amount | Price Red | Round | 4 | 1.3 Green | Rectangular | 6 | 1.3 Blue | Rectangular | 12 | 2 Red | Round | 7 | 1.75 Green | Rectangular | 9 | 1.4 Green | Round | 2 | 1 >>> marbles.group("Color") # just gives counts Color | count Blue | 1 Green | 3 Red | 2 >>> marbles.group("Color", max) # takes the max of each grouping, in each column Color | Shape max | Amount max | Price max Blue | Rectangular | 12 | 2 Green | Round | 9 | 1.4 Red | Round | 7 | 1.75 >>> marbles.group("Shape", sum) # sum doesn't make sense for strings Shape | Color sum | Amount sum | Price sum Rectangular | | 27 | 4.7 Round | | 13 | 4.05
- group_bar(column_label, **vargs)[source]
  Plot a bar chart for the table.
  The values of the specified column are grouped and counted, and one bar is produced for each group.
  Note: This differs from bar in that there is no need to specify bar heights; the height of a category's bar is the number of copies of that category in the given column. This method behaves more like hist in that regard, while bar behaves more like plot or scatter (which require the height of each point to be specified).
  - Args:
    - column_label (str or int): The name or index of a column
  - Kwargs:
    - overlay (bool): create a chart with one color per data column; if False, each will be displayed separately.
    - width (float): The width of the plot, in inches
    - height (float): The height of the plot, in inches
    - vargs: Additional arguments that get passed into plt.bar. See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.bar for additional arguments that can be passed into vargs.
- group_barh(column_label, **vargs)[source]
  Plot a horizontal bar chart for the table.
  The values of the specified column are grouped and counted, and one bar is produced for each group.
  Note: This differs from barh in that there is no need to specify bar heights; the size of a category's bar is the number of copies of that category in the given column. This method behaves more like hist in that regard, while barh behaves more like plot or scatter (which require the second coordinate of each point to be specified in another column).
  - Args:
    - column_label (str or int): The name or index of a column
  - Kwargs:
    - overlay (bool): create a chart with one color per data column; if False, each will be displayed separately.
    - width (float): The width of the plot, in inches
    - height (float): The height of the plot, in inches
    - vargs: Additional arguments that get passed into plt.barh. See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.barh for additional arguments that can be passed into vargs.
- groups(labels, collect=None)[source]
  Group rows by multiple columns, count or aggregate others.
  - Args:
    - labels: list of column names (or indices) to group on
    - collect: a function applied to values in other columns for each group
  - Returns:
    - A Table with each row corresponding to a unique combination of values in the columns specified in labels, where the first columns are those specified in labels, followed by a column of counts for each of the unique combinations. If collect is provided, a Table is returned with all original columns, each containing values calculated by first grouping rows according to the values in the labels columns, then applying collect to each set of grouped values in the other columns.
  - Note:
    - The grouped columns will appear first in the result table. If collect does not accept arguments with one of the column types, that column will be empty in the resulting table.
>>> marbles = Table().with_columns( ... "Color", make_array("Red", "Green", "Blue", "Red", "Green", "Green"), ... "Shape", make_array("Round", "Rectangular", "Rectangular", "Round", "Rectangular", "Round"), ... "Amount", make_array(4, 6, 12, 7, 9, 2), ... "Price", make_array(1.30, 1.30, 2.00, 1.75, 1.40, 1.00)) >>> marbles Color | Shape | Amount | Price Red | Round | 4 | 1.3 Green | Rectangular | 6 | 1.3 Blue | Rectangular | 12 | 2 Red | Round | 7 | 1.75 Green | Rectangular | 9 | 1.4 Green | Round | 2 | 1 >>> marbles.groups(["Color", "Shape"]) Color | Shape | count Blue | Rectangular | 1 Green | Rectangular | 2 Green | Round | 1 Red | Round | 2 >>> marbles.groups(["Color", "Shape"], sum) Color | Shape | Amount sum | Price sum Blue | Rectangular | 12 | 2 Green | Rectangular | 15 | 2.7 Green | Round | 2 | 1 Red | Round | 11 | 3.05
- hist(*columns, overlay=True, bins=None, bin_column=None, unit=None, counts=None, group=None, side_by_side=False, width=6, height=4, **vargs)[source]
  Plots one histogram for each column in columns. If no column is specified, plot all columns.
  - Kwargs:
    - overlay (bool): If True, plots 1 chart with all the histograms overlaid on top of each other (instead of the default behavior of one histogram for each column in the table). Also adds a legend that matches each bar color to its column. Note that if the histograms are not overlaid, they are not forced to the same scale.
    - bins (list or int): Lower bound for each bin in the histogram or number of bins. If None, bins will be chosen automatically.
    - bin_column (column name or index): A column of bin lower bounds. All other columns are treated as counts of these bins. If None, each value in each row is assigned a count of 1.
    - counts (column name or index): Deprecated name for bin_column.
    - unit (string): A name for the units of the plotted column (e.g. 'kg'), to be used in the plot.
    - group (column name or index): A column of categories. The rows are grouped by the values in this column, and a separate histogram is generated for each group. The histograms are overlaid or plotted separately depending on the overlay argument. If None, no such grouping is done.
    - side_by_side (bool): Whether histogram bins should be plotted side by side (instead of directly overlaid). Makes sense only when plotting multiple histograms, either by passing several columns or by using the group option.
    - vargs: Additional arguments that get passed into plt.hist. See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.hist for additional arguments that can be passed into vargs. These include: range, normed, cumulative, and orientation, to name a few.
>>> t = Table().with_columns( ... 'count', make_array(9, 3, 3, 1), ... 'points', make_array(1, 2, 2, 10)) >>> t count | points 9 | 1 3 | 2 3 | 2 1 | 10 >>> t.hist() <histogram of values in count> <histogram of values in points>
>>> t = Table().with_columns( ... 'value', make_array(101, 102, 103), ... 'proportion', make_array(0.25, 0.5, 0.25)) >>> t.hist(bin_column='value') <histogram of values weighted by corresponding proportions>
>>> t = Table().with_columns( ... 'value', make_array(1, 2, 3, 2, 5 ), ... 'category', make_array('a', 'a', 'a', 'b', 'b')) >>> t.hist('value', group='category') <two overlaid histograms of the data [1, 2, 3] and [2, 5]>
- index_by(column_or_label)[source]
  Return a dict keyed by values in a column that contains lists of rows corresponding to each value.
- join(column_label, other, other_label=None)[source]
  Creates a new table with the columns of self and other, containing rows for all values of a column that appear in both tables.
  - Args:
    - column_label (str): label of column in self that is used to join rows of other.
    - other: Table object to join with self on matching values of column_label.
  - Kwargs:
    - other_label (str): default None, which assumes column_label; otherwise, the label in other used to join rows.
  - Returns:
    - New table self joined with other by matching values in column_label and other_label. If the resulting join is empty, returns None.
>>> table = Table().with_columns('a', make_array(9, 3, 3, 1), ... 'b', make_array(1, 2, 2, 10), ... 'c', make_array(3, 4, 5, 6)) >>> table a | b | c 9 | 1 | 3 3 | 2 | 4 3 | 2 | 5 1 | 10 | 6 >>> table2 = Table().with_columns( 'a', make_array(9, 1, 1, 1), ... 'd', make_array(1, 2, 2, 10), ... 'e', make_array(3, 4, 5, 6)) >>> table2 a | d | e 9 | 1 | 3 1 | 2 | 4 1 | 2 | 5 1 | 10 | 6 >>> table.join('a', table2) a | b | c | d | e 1 | 10 | 6 | 2 | 4 1 | 10 | 6 | 2 | 5 1 | 10 | 6 | 10 | 6 9 | 1 | 3 | 1 | 3 >>> table.join('a', table2, 'a') # Equivalent to previous join a | b | c | d | e 1 | 10 | 6 | 2 | 4 1 | 10 | 6 | 2 | 5 1 | 10 | 6 | 10 | 6 9 | 1 | 3 | 1 | 3 >>> table.join('a', table2, 'd') # Repeat column labels relabeled a | b | c | a_2 | e 1 | 10 | 6 | 9 | 3 >>> table2 #table2 has three rows with a = 1 a | d | e 9 | 1 | 3 1 | 2 | 4 1 | 2 | 5 1 | 10 | 6 >>> table #table has only one row with a = 1 a | b | c 9 | 1 | 3 3 | 2 | 4 3 | 2 | 5 1 | 10 | 6
- labels
  Return a tuple of column labels.
- move_to_end(column_label)[source]
  Move a column to be last in the column order.
- move_to_start(column_label)[source]
  Move a column to be first in the column order.
- normalized()
  Returns the distribution with the probabilities normalized to sum to 1.
  Returns:
    - Table
      A distribution with the probabilities normalized.
  Examples
>>> Table().values([1, 2, 3]).probability([1, 1, 1]) Value | Probability 1 | 1 2 | 1 3 | 1 >>> Table().values([1, 2, 3]).probability([1, 1, 1]).normalized() Value | Probability 1 | 0.333333 2 | 0.333333 3 | 0.333333
- num_columns
  Number of columns.
- num_rows
  Number of rows.
- percentile(p)[source]
  Return a new table with one row containing the pth percentile for each column.
  Assumes that each column only contains one type of value.
  Returns a new table with one row and the same column labels. The row contains the pth percentile of the original column, where the pth percentile of a column is the smallest value that is at least as large as p% of the numbers in the column.
>>> table = Table().with_columns( ... 'count', make_array(9, 3, 3, 1), ... 'points', make_array(1, 2, 2, 10)) >>> table count | points 9 | 1 3 | 2 3 | 2 1 | 10 >>> table.percentile(80) count | points 9 | 10
- pivot(columns, rows, values=None, collect=None, zero=None)[source]
  Generate a table with a column for each unique value in columns, with rows for each unique value in rows. Each row counts/aggregates the values that match both row and column based on collect.
  - Args:
    - columns: a single column label or index (str or int), used to create new columns, based on its unique values.
    - rows: row labels or indices (str or int or list), used to create new rows based on its unique values.
    - values: column label in table for use in aggregation. Default None.
    - collect: aggregation function, used to group values over row-column combinations. Default None.
    - zero: zero value for non-existent row-column combinations.
  - Raises:
    - TypeError: if collect is passed in and values is not, or vice versa.
  - Returns:
    - New pivot table, with row-column combinations, as specified, with aggregated values by collect across the intersection of columns and rows. Simple counts provided if values and collect are None, as default.
>>> titanic = Table().with_columns('age', make_array(21, 44, 56, 89, 95 ... , 40, 80, 45), 'survival', make_array(0,0,0,1, 1, 1, 0, 1), ... 'gender', make_array('M', 'M', 'M', 'M', 'F', 'F', 'F', 'F'), ... 'prediction', make_array(0, 0, 1, 1, 0, 1, 0, 1)) >>> titanic age | survival | gender | prediction 21 | 0 | M | 0 44 | 0 | M | 0 56 | 0 | M | 1 89 | 1 | M | 1 95 | 1 | F | 0 40 | 1 | F | 1 80 | 0 | F | 0 45 | 1 | F | 1 >>> titanic.pivot('survival', 'gender') gender | 0 | 1 F | 1 | 3 M | 3 | 1 >>> titanic.pivot('prediction', 'gender') gender | 0 | 1 F | 2 | 2 M | 2 | 2 >>> titanic.pivot('survival', 'gender', values='age', collect = np.mean) gender | 0 | 1 F | 80 | 60 M | 40.3333 | 89 >>> titanic.pivot('survival', make_array('prediction', 'gender')) prediction | gender | 0 | 1 0 | F | 1 | 1 0 | M | 2 | 0 1 | F | 0 | 2 1 | M | 1 | 1 >>> titanic.pivot('survival', 'gender', values = 'age') Traceback (most recent call last): ... TypeError: values requires collect to be specified >>> titanic.pivot('survival', 'gender', collect = np.mean) Traceback (most recent call last): ... TypeError: collect requires values to be specified
- pivot_bin(pivot_columns, value_column, bins=None, **vargs)[source]
  Form a table with columns formed by the unique tuples in pivot_columns containing counts per bin of the values associated with each tuple in the value_column.
  By default, bins are chosen to contain all values in the value_column. The following named arguments from numpy.histogram can be applied to specialize bin widths:
  - Args:
    - bins (int or sequence of scalars): If bins is an int, it defines the number of equal-width bins in the given range (10, by default). If bins is a sequence, it defines the bin edges, including the rightmost edge, allowing for non-uniform bin widths.
    - range ((float, float)): The lower and upper range of the bins. If not provided, range contains all values in the table. Values outside the range are ignored.
    - normed (bool): If False, the result will contain the number of samples in each bin. If True, the result is normalized such that the integral over the range is 1.
- pivot_hist(pivot_column_label, value_column_label, overlay=True, width=6, height=4, **vargs)[source]
  Draw histograms of each category in a column.
- plot(column_for_xticks=None, select=None, overlay=True, width=6, height=4, **vargs)[source]
  Plot line charts for the table.
  - Args:
    - column_for_xticks (str/array): A column containing x-axis labels
  - Kwargs:
    - overlay (bool): create a chart with one color per data column; if False, each plot will be displayed separately.
    - vargs: Additional arguments that get passed into plt.plot. See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot for additional arguments that can be passed into vargs.
  - Raises:
    - ValueError: Every selected column must be numerical.
  - Returns:
    - A line plot (connected scatter). Each plot is labeled using the values in column_for_xticks and one plot is produced for all other columns in self (or for the columns designated by select).
>>> table = Table().with_columns( ... 'days', make_array(0, 1, 2, 3, 4, 5), ... 'price', make_array(90.5, 90.00, 83.00, 95.50, 82.00, 82.00), ... 'projection', make_array(90.75, 82.00, 82.50, 82.50, 83.00, 82.50)) >>> table days | price | projection 0 | 90.5 | 90.75 1 | 90 | 82 2 | 83 | 82.5 3 | 95.5 | 82.5 4 | 82 | 83 5 | 82 | 82.5 >>> table.plot('days') <line graph with days as x-axis and lines for price and projection> >>> table.plot('days', overlay=False) <line graph with days as x-axis and line for price> <line graph with days as x-axis and line for projection> >>> table.plot('days', 'price') <line graph with days as x-axis and line for price>
- prob_event(x)
  Finds the probability of an event x.
  Parameters:
    - x : float or Iterable
      An event represented either as a specific value in the domain or a subset of the domain.
  Returns:
    - float
      Probability of the event.
  Examples
>>> dist = Table().values([1, 2, 3, 4]).probability([1/4, 1/4, 1/4, 1/4]) >>> dist.prob_event(2) 0.25 >>> dist.prob_event([2, 3]) 0.5 >>> dist.prob_event(np.arange(1, 5)) 1.0
- probability(values)
  Assigns probabilities to domain values.
  Parameters:
    - values : List or Array
      Values that must correspond to the domain in the same order.
  Returns:
    - Table
      A probability distribution with those probabilities.
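A short usage sketch (not from the original docs), following the Value/Probability layout shown in the normalized and cdf examples on this page:
>>> dist = Table().values(make_array(2, 3, 4)).probability(make_array(0.25, 0.5, 0.25))
>>> dist
Value | Probability
2 | 0.25
3 | 0.5
4 | 0.25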
- probability_function(pfunc)
  Assigns probabilities to a Distribution via a probability function. The probability function is applied to each value of the domain. The domain values must be in the first columns.
  Parameters:
    - pfunc : func
      Probability function of the distribution.
  Returns:
    - Table
      Table with probabilities in its last column.
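A hedged sketch (not from the original docs); the Value/Probability layout is assumed to match the probability examples above, and the particular function is only illustrative:
>>> dist = Table().values(make_array(1, 2, 3, 4)).probability_function(lambda x: x / 10)
>>> dist
Value | Probability
1 | 0.1
2 | 0.2
3 | 0.3
4 | 0.4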
- classmethod read_table(filepath_or_buffer, *args, **vargs)[source]
  Read a table from a file or web address.
  - Args:
    - filepath_or_buffer: string or file handle / StringIO; the string could be a URL. Valid URL schemes include http, ftp, s3, and file.
- relabel(column_label, new_label)[source]
  Changes the label(s) of column(s) specified by column_label to labels in new_label.
  - Args:
    - column_label (single str or array of str): The label(s) of columns to be changed to new_label.
    - new_label (single str or array of str): The label name(s) of columns to replace column_label.
  - Raises:
    - ValueError: if column_label is not in the table, or if column_label and new_label are not of equal length.
    - TypeError: if column_label and/or new_label is not str.
  - Returns:
    - Original table with new_label in place of column_label.
>>> table = Table().with_columns( ... 'points', make_array(1, 2, 3), ... 'id', make_array(12345, 123, 5123)) >>> table.relabel('id', 'yolo') points | yolo 1 | 12345 2 | 123 3 | 5123 >>> table.relabel(make_array('points', 'yolo'), ... make_array('red', 'blue')) red | blue 1 | 12345 2 | 123 3 | 5123 >>> table.relabel(make_array('red', 'green', 'blue'), ... make_array('cyan', 'magenta', 'yellow', 'key')) Traceback (most recent call last): ... ValueError: Invalid arguments. column_label and new_label must be of equal length.
- relabeled(label, new_label)[source]
  Return a new table with label specifying column label(s) replaced by corresponding new_label.
  - Args:
    - label (str or array of str): The label(s) of columns to be changed.
    - new_label (str or array of str): The new label(s) of columns to be changed. Same number of elements as label.
  - Raises:
    - ValueError: if label does not exist in the table, or if label and new_label are not of equal length. Also raised if label and/or new_label are not str.
  - Returns:
    - New table with new_label in place of label.
>>> tiles = Table().with_columns('letter', make_array('c', 'd'), ... 'count', make_array(2, 4)) >>> tiles letter | count c | 2 d | 4 >>> tiles.relabeled('count', 'number') letter | number c | 2 d | 4 >>> tiles # original table unmodified letter | count c | 2 d | 4 >>> tiles.relabeled(make_array('letter', 'count'), ... make_array('column1', 'column2')) column1 | column2 c | 2 d | 4 >>> tiles.relabeled(make_array('letter', 'number'), ... make_array('column1', 'column2')) Traceback (most recent call last): ... ValueError: Invalid labels. Column labels must already exist in table in order to be replaced.
- remove(row_or_row_indices)[source]
  Removes a row or multiple rows of a table in place.
- remove_zeros()
  Removes all values with zero probability from the Distribution.
  Returns:
    - Distribution
  Examples
>>> dist = Table().values([2, 3, 4, 5]).probability([0.5, 0, 0.5, 0]) >>> dist Value | Probability 2 | 0.5 3 | 0 4 | 0.5 5 | 0 >>> dist.remove_zeros() Value | Probability 2 | 0.5 4 | 0.5
- row(index)[source]
  Return a row.
- rows
  Return a view of all rows.
- sample(k=None, with_replacement=True, weights=None)[source]
  Return a new table where k rows are randomly sampled from the original table.
  - Args:
    - k: specifies the number of rows (int) to be sampled from the table. Default is k equal to the number of rows in the table.
    - with_replacement (bool): By default True; samples k rows with replacement from the table, else samples k rows without replacement.
    - weights: Array specifying the probability that the ith row of the table is sampled. Defaults to None, which samples each row with equal probability. weights must be a valid probability distribution, i.e. an array the length of the number of rows, summing to 1.
  - Raises:
    - ValueError: if weights is not of length equal to the number of rows in the table, or if weights does not sum to 1.
  - Returns:
    - A new instance of Table with k rows resampled.
>>> jobs = Table().with_columns( ... 'job', make_array('a', 'b', 'c', 'd'), ... 'wage', make_array(10, 20, 15, 8)) >>> jobs job | wage a | 10 b | 20 c | 15 d | 8 >>> jobs.sample() job | wage b | 20 b | 20 a | 10 d | 8 >>> jobs.sample(with_replacement=True) job | wage d | 8 b | 20 c | 15 a | 10 >>> jobs.sample(k = 2) job | wage b | 20 c | 15 >>> ws = make_array(0.5, 0.5, 0, 0) >>> jobs.sample(k=2, with_replacement=True, weights=ws) job | wage a | 10 a | 10 >>> jobs.sample(k=2, weights=make_array(1, 0, 1, 0)) Traceback (most recent call last): ... ValueError: probabilities do not sum to 1
# Weights must be length of table. >>> jobs.sample(k=2, weights=make_array(1, 0, 0)) Traceback (most recent call last): ... ValueError: a and p must have same size
- sample_from_dist(n=1)
  Randomly samples from the distribution.
  Note that this function was previously named sample but was renamed due to naming conflicts with the datascience library.
  Parameters:
    - n : int
      Number of times to sample from the distribution (default: 1).
  Returns:
    - float or array
      Samples from the distribution.
  Examples
>>> dist = Table().with_columns( ... 'Value', make_array(2, 3, 4), ... 'Probability', make_array(0.25, 0.5, 0.25)) >>> dist.sample_from_dist() 3 >>> dist.sample_from_dist() 2 >>> dist.sample_from_dist(10) array([3, 2, 2, 4, 3, 4, 3, 4, 3, 3])
- sample_from_distribution(distribution, k, proportions=False)[source]
  Return a new table with the same number of rows and a new column. The values in the distribution column define a multinomial. They are replaced by sample counts/proportions in the output.
>>> sizes = Table(['size', 'count']).with_rows([ ... ['small', 50], ... ['medium', 100], ... ['big', 50], ... ]) >>> sizes.sample_from_distribution('count', 1000) size | count | count sample small | 50 | 239 medium | 100 | 496 big | 50 | 265 >>> sizes.sample_from_distribution('count', 1000, True) size | count | count sample small | 50 | 0.24 medium | 100 | 0.51 big | 50 | 0.25
- scatter(column_for_x, select=None, overlay=True, fit_line=False, colors=None, labels=None, sizes=None, width=5, height=5, s=20, **vargs)[source]
  Creates scatterplots, optionally adding a line of best fit.
  - Args:
    - column_for_x (str): The column to use for the x-axis values and label of the scatter plots.
  - Kwargs:
    - overlay (bool): If True, creates a chart with one color per data column; if False, each plot will be displayed separately.
    - fit_line (bool): draw a line of best fit for each set of points.
    - vargs: Additional arguments that get passed into plt.scatter. See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter for additional arguments that can be passed into vargs. These include: marker and norm, to name a couple.
    - colors: A column of categories to be used for coloring dots.
    - labels: A column of text labels to annotate dots.
    - sizes: A column of values to set the relative areas of dots.
    - s: Size of dots. If sizes is also provided, then dots will be in the range 0 to 2 * s.
  - Raises:
    - ValueError: Every column, column_for_x or select, must be numerical.
  - Returns:
    - Scatter plot of values of column_for_x plotted against values for all other columns in self. Each plot uses the values in column_for_x for horizontal positions. One plot is produced for all other columns in self as y (or for the columns designated by select).
>>> table = Table().with_columns( ... 'x', make_array(9, 3, 3, 1), ... 'y', make_array(1, 2, 2, 10), ... 'z', make_array(3, 4, 5, 6)) >>> table x | y | z 9 | 1 | 3 3 | 2 | 4 3 | 2 | 5 1 | 10 | 6 >>> table.scatter('x') <scatterplot of values in y and z on x>
>>> table.scatter('x', overlay=False) <scatterplot of values in y on x> <scatterplot of values in z on x>
>>> table.scatter('x', fit_line=True) <scatterplot of values in y and z on x with lines of best fit>
- sd()
  Finds the standard deviation of the Distribution.
  Returns:
    - float
      Standard deviation.
  Examples
>>> dist = Table().values([1, 2, 4]).probability([0.5, 0.4, 0.1]) >>> dist.sd() 0.9
- select(*column_or_columns)[source]
  Return a table with only the columns in column_or_columns.
  - Args:
    - column_or_columns: Columns to select from the Table as either column labels (str) or column indices (int).
  - Returns:
    - A new instance of Table containing only selected columns. The columns of the new Table are in the order given in column_or_columns.
  - Raises:
    - KeyError: if any of column_or_columns are not in the table.
>>> flowers = Table().with_columns( ... 'Number of petals', make_array(8, 34, 5), ... 'Name', make_array('lotus', 'sunflower', 'rose'), ... 'Weight', make_array(10, 5, 6) ... )
>>> flowers Number of petals | Name | Weight 8 | lotus | 10 34 | sunflower | 5 5 | rose | 6
>>> flowers.select('Number of petals', 'Weight') Number of petals | Weight 8 | 10 34 | 5 5 | 6
>>> flowers # original table unchanged Number of petals | Name | Weight 8 | lotus | 10 34 | sunflower | 5 5 | rose | 6
>>> flowers.select(0, 2) Number of petals | Weight 8 | 10 34 | 5 5 | 6
- set_format(column_or_columns, formatter)[source]
  Set the format of a column.
- show(max_rows=0)[source]
  Display the table.
- sort(column_or_label, descending=False, distinct=False)[source]
  Return a Table of rows sorted according to the values in a column.
  - Args:
    - column_or_label: the column whose values are used for sorting.
    - descending: if True, sorting will be in descending, rather than ascending, order.
    - distinct: if True, repeated values in column_or_label will be omitted.
  - Returns:
    - An instance of Table containing rows sorted based on the values in column_or_label.
>>> marbles = Table().with_columns( ... "Color", make_array("Red", "Green", "Blue", "Red", "Green", "Green"), ... "Shape", make_array("Round", "Rectangular", "Rectangular", "Round", "Rectangular", "Round"), ... "Amount", make_array(4, 6, 12, 7, 9, 2), ... "Price", make_array(1.30, 1.30, 2.00, 1.75, 1.40, 1.00)) >>> marbles Color | Shape | Amount | Price Red | Round | 4 | 1.3 Green | Rectangular | 6 | 1.3 Blue | Rectangular | 12 | 2 Red | Round | 7 | 1.75 Green | Rectangular | 9 | 1.4 Green | Round | 2 | 1 >>> marbles.sort("Amount") Color | Shape | Amount | Price Green | Round | 2 | 1 Red | Round | 4 | 1.3 Green | Rectangular | 6 | 1.3 Red | Round | 7 | 1.75 Green | Rectangular | 9 | 1.4 Blue | Rectangular | 12 | 2 >>> marbles.sort("Amount", descending = True) Color | Shape | Amount | Price Blue | Rectangular | 12 | 2 Green | Rectangular | 9 | 1.4 Red | Round | 7 | 1.75 Green | Rectangular | 6 | 1.3 Red | Round | 4 | 1.3 Green | Round | 2 | 1 >>> marbles.sort(3) # the Price column Color | Shape | Amount | Price Green | Round | 2 | 1 Red | Round | 4 | 1.3 Green | Rectangular | 6 | 1.3 Green | Rectangular | 9 | 1.4 Red | Round | 7 | 1.75 Blue | Rectangular | 12 | 2 >>> marbles.sort(3, distinct = True) Color | Shape | Amount | Price Green | Round | 2 | 1 Red | Round | 4 | 1.3 Green | Rectangular | 9 | 1.4 Red | Round | 7 | 1.75 Blue | Rectangular | 12 | 2
- split(k)[source]
  Return a tuple of two tables where the first table contains k rows randomly sampled and the second contains the remaining rows.
  - Args:
    - k (int): The number of rows randomly sampled into the first table. k must be between 1 and num_rows - 1.
  - Raises:
    - ValueError: k is not between 1 and num_rows - 1.
  - Returns:
    - A tuple containing two instances of Table.
>>> jobs = Table().with_columns( ... 'job', make_array('a', 'b', 'c', 'd'), ... 'wage', make_array(10, 20, 15, 8)) >>> jobs job | wage a | 10 b | 20 c | 15 d | 8 >>> sample, rest = jobs.split(3) >>> sample job | wage c | 15 a | 10 b | 20 >>> rest job | wage d | 8
- stack(key, labels=None)[source]
  Takes k original columns and returns two columns, with col. 1 of all column names and col. 2 of all associated data.
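A hedged sketch (not from the original docs) on the letter/count/points table used elsewhere on this page; the output column labels 'column' and 'value' are assumptions about the default layout:
>>> t = Table().with_columns(
...     'letter', make_array('c', 'd'),
...     'count', make_array(2, 4),
...     'points', make_array(3, 2))
>>> t.stack('letter')
letter | column | value
c | count | 2
c | points | 3
d | count | 4
d | points | 2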
- stats(ops=(<built-in function min>, <built-in function max>, <function median>, <built-in function sum>))[source]
  Compute statistics for each column and place them in a table.
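A hedged sketch (not from the original docs) with the default ops (min, max, median, sum); the 'statistic' label and the exact number formatting shown are assumptions:
>>> t = Table().with_columns(
...     'count', make_array(9, 3, 3, 1),
...     'points', make_array(1, 2, 2, 10))
>>> t.stats()
statistic | count | points
min | 1 | 1
max | 9 | 10
median | 3 | 2
sum | 16 | 15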
- take()[source]
  Return a new Table with selected rows taken by index.
  - Args:
    - row_indices_or_slice (integer or array of integers): The row index, list of row indices or a slice of row indices to be selected.
  - Returns:
    - A new instance of Table with selected rows in order corresponding to row_indices_or_slice.
  - Raises:
    - IndexError: if any of row_indices_or_slice is out of bounds with respect to column length.
>>> grades = Table().with_columns('letter grade', ... make_array('A+', 'A', 'A-', 'B+', 'B', 'B-'), ... 'gpa', make_array(4, 4, 3.7, 3.3, 3, 2.7)) >>> grades letter grade | gpa A+ | 4 A | 4 A- | 3.7 B+ | 3.3 B | 3 B- | 2.7 >>> grades.take(0) letter grade | gpa A+ | 4 >>> grades.take(-1) letter grade | gpa B- | 2.7 >>> grades.take(make_array(2, 1, 0)) letter grade | gpa A- | 3.7 A | 4 A+ | 4 >>> grades.take[:3] letter grade | gpa A+ | 4 A | 4 A- | 3.7 >>> grades.take(np.arange(0,3)) letter grade | gpa A+ | 4 A | 4 A- | 3.7 >>> grades.take(10) Traceback (most recent call last): ... IndexError: index 10 is out of bounds for axis 0 with size 6
- to_array()[source]
  Convert the table to a structured NumPy array.
- to_csv(filename)[source]
  Creates a CSV file with the provided filename.
  The CSV is created in such a way that if we run table.to_csv('my_table.csv') we can recreate the same table with Table.read_table('my_table.csv').
  - Args:
    - filename (str): The filename of the output CSV file.
  - Returns:
    - None, outputs a file with name filename.
>>> jobs = Table().with_columns( ... 'job', make_array('a', 'b', 'c', 'd'), ... 'wage', make_array(10, 20, 15, 8)) >>> jobs job | wage a | 10 b | 20 c | 15 d | 8 >>> jobs.to_csv('my_table.csv') <outputs a file called my_table.csv in the current directory>
- to_df()[source]
  Convert the table to a Pandas DataFrame.
- to_joint(X_column_label=None, Y_column_label=None, probability_column_label=None, reverse=True)
  Converts a table of probabilities associated with two variables into a JointDistribution object.
  Parameters:
    - table : Table
      You can either pass in a Table directly or call this method on that Table. See examples.
    - X_column_label (optional) : str
      Label for the first variable. Defaults to the label of the first variable of the Table.
    - Y_column_label (optional) : str
      Label for the second variable. Defaults to the label of the second variable of the Table.
    - probability_column_label (optional) : str
      Label for probabilities.
    - reverse (optional) : bool
      If True, the vertical values will be reversed.
  Returns:
    - JointDistribution
      A JointDistribution object.
Examples
>>> dist1 = Table().values([0,1],[2,3]) >>> dist1['Probability'] = make_array(0.1, 0.2, 0.3, 0.4) >>> dist1.to_joint() X=0 X=1 Y=3 0.2 0.4 Y=2 0.1 0.3 >>> dist2 = Table().values('Coin1',['H','T'], 'Coin2', ['H','T']) >>> dist2['Probability'] = np.array([0.4*0.6, 0.6*0.6, 0.4*0.4, 0.6*0.4]) >>> dist2.toJoint() Coin1=H Coin1=T Coin2=T 0.36 0.24 Coin2=H 0.24 0.16
- to_markov_chain()
  Constructs a Markov Chain from the Table.
  Returns:
    - MarkovChain
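A brief sketch (not from the original docs), assuming to_markov_chain builds the same chain as MarkovChain.from_table shown later on this page; the transition table is copied from that example:
>>> table = Table().states(make_array('A', 'B')).transition_probability(make_array(0.5, 0.5, 0.3, 0.7))
>>> table.to_markov_chain()
A B
A 0.5 0.5
B 0.3 0.7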
- transition_function(pfunc)
  Assigns transition probabilities to a Distribution via a probability function. The probability function is applied to each value of the domain. The domain values must be in the first columns.
  Parameters:
    - pfunc : bivariate function
      Conditional probability function of the distribution (P(Y | X)).
  Returns:
    - Table
      Table with those probabilities in its final column.
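A hedged sketch (not from the original docs); the two-argument function and the Source/Target/Probability layout follow the transition_probability example used with MarkovChain.from_table below, but this exact output is an assumption:
>>> def flip(source, target):
...     if source == target:
...         return 0.7
...     return 0.3
>>> Table().states(make_array('A', 'B')).transition_function(flip)
Source | Target | Probability
A | A | 0.7
A | B | 0.3
B | A | 0.3
B | B | 0.7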
- transition_probability(values)
  For a multivariate probability distribution, assigns transition probabilities, i.e. P(Y | X).
  Parameters:
    - values : List or Array
      Values that must correspond to the domain in the same order.
  Returns:
    - Table
      A probability distribution with those probabilities.
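A short sketch, reusing the transition table that appears in the MarkovChain.from_table example below:
>>> Table().states(make_array('A', 'B')).transition_probability(make_array(0.5, 0.5, 0.3, 0.7))
Source | Target | Probability
A | A | 0.5
A | B | 0.5
B | A | 0.3
B | B | 0.7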
- var()
  Finds the variance of the distribution.
  Returns:
    - float
      Variance.
  Examples
>>> dist = Table().values([1, 2, 4]).probability([0.5, 0.4, 0.1]) >>> dist.var() 0.81 >>> (1 * 0.5 + 4 * 0.4 + 16 * 0.1) - (1.7) ** 2 0.81
- where(column_or_label, value_or_predicate=None, other=None)[source]
  Return a new Table containing rows where value_or_predicate returns True for values in column_or_label.
  - Args:
    - column_or_label: A column of the Table either as a label (str) or an index (int). Can also be an array of booleans; only the rows where the array value is True are kept.
    - value_or_predicate: If a function, it is applied to every value in column_or_label. Only the rows where value_or_predicate returns True are kept. If a single value, only the rows where the values in column_or_label are equal to value_or_predicate are kept.
    - other: Optional additional column label for value_or_predicate to make pairwise comparisons. See the examples below for usage. When other is supplied, value_or_predicate must be a callable function.
  - Returns:
    - If value_or_predicate is a function, returns a new Table containing only the rows where value_or_predicate(val) is True for the values val in column_or_label.
    - If value_or_predicate is a value, returns a new Table containing only the rows where the values in column_or_label are equal to value_or_predicate.
    - If column_or_label is an array of booleans, returns a new Table containing only the rows where column_or_label is True.
>>> marbles = Table().with_columns( ... "Color", make_array("Red", "Green", "Blue", ... "Red", "Green", "Green"), ... "Shape", make_array("Round", "Rectangular", "Rectangular", ... "Round", "Rectangular", "Round"), ... "Amount", make_array(4, 6, 12, 7, 9, 2), ... "Price", make_array(1.30, 1.20, 2.00, 1.75, 0, 3.00))
>>> marbles Color | Shape | Amount | Price Red | Round | 4 | 1.3 Green | Rectangular | 6 | 1.2 Blue | Rectangular | 12 | 2 Red | Round | 7 | 1.75 Green | Rectangular | 9 | 0 Green | Round | 2 | 3
Use a value to select matching rows
>>> marbles.where("Price", 1.3) Color | Shape | Amount | Price Red | Round | 4 | 1.3
In general, a higher-order predicate function such as the functions in datascience.predicates.are can be used. >>> from datascience.predicates import are >>> # equivalent to previous example >>> marbles.where("Price", are.equal_to(1.3)) Color | Shape | Amount | Price Red | Round | 4 | 1.3
>>> marbles.where("Price", are.above(1.5)) Color | Shape | Amount | Price Blue | Rectangular | 12 | 2 Red | Round | 7 | 1.75 Green | Round | 2 | 3
Use the optional argument other to apply predicates that compare columns. >>> marbles.where("Price", are.above, "Amount") Color | Shape | Amount | Price Green | Round | 2 | 3
>>> marbles.where("Price", are.equal_to, "Amount") # empty table Color | Shape | Amount | Price
- with_column(label, values, *rest)[source]
  Return a new table with an additional or replaced column.
  - Args:
    - label (str): The column label. If an existing label is used, the existing column will be replaced in the new table.
    - values (single value or sequence): If a single value, every value in the new column is values. If a sequence of values, the new column takes on the values in values.
    - rest: An alternating list of labels and values describing additional columns. See with_columns for a full description.
  - Raises:
    - ValueError: If label is not a valid column name, i.e. if label is not of type (str).
    - ValueError: If values is a list/array that does not have the same length as the number of rows in the table.
  - Returns:
    - Copy of original table with new or replaced column.
>>> alphabet = Table().with_column('letter', make_array('c','d')) >>> alphabet = alphabet.with_column('count', make_array(2, 4)) >>> alphabet letter | count c | 2 d | 4 >>> alphabet.with_column('permutes', make_array('a', 'g')) letter | count | permutes c | 2 | a d | 4 | g >>> alphabet letter | count c | 2 d | 4 >>> alphabet.with_column('count', 1) letter | count c | 1 d | 1 >>> alphabet.with_column(1, make_array(1, 2)) Traceback (most recent call last): ... ValueError: The column label must be a string, but a int was given >>> alphabet.with_column('bad_col', make_array(1)) Traceback (most recent call last): ... ValueError: Column length mismatch. New column does not have the same number of rows as table.
- with_columns(*labels_and_values)[source]
  Return a table with additional or replaced columns.
  - Args:
    - labels_and_values: An alternating list of labels and values or a list of label-value pairs. If one of the labels is in the existing table, then every value in the corresponding column is set to that value. If a label has only a single value (int), every row of the corresponding column takes on that value.
  - Raises:
    - ValueError: If any label in labels_and_values is not a valid column name, i.e. if a label is not of type (str).
    - ValueError: If any value in labels_and_values is a list/array and does not have the same length as the number of rows in the table.
    - AssertionError: 'incorrect columns format', if passed more than one sequence (iterables) for labels_and_values.
    - AssertionError: 'even length sequence required', if missing a pair in label-value pairs.
  - Returns:
    - Copy of original table with new or replaced columns. Columns added in order of labels. Equivalent to with_column(label, value) when passed only one label-value pair.
>>> players = Table().with_columns('player_id', ... make_array(110234, 110235), 'wOBA', make_array(.354, .236)) >>> players player_id | wOBA 110234 | 0.354 110235 | 0.236 >>> players = players.with_columns('salaries', 'N/A', 'season', 2016) >>> players player_id | wOBA | salaries | season 110234 | 0.354 | N/A | 2016 110235 | 0.236 | N/A | 2016 >>> salaries = Table().with_column('salary', ... make_array('$500,000', '$15,500,000')) >>> players.with_columns('salaries', salaries.column('salary'), ... 'years', make_array(6, 1)) player_id | wOBA | salaries | season | years 110234 | 0.354 | $500,000 | 2016 | 6 110235 | 0.236 | $15,500,000 | 2016 | 1 >>> players.with_columns(2, make_array('$600,000', '$20,000,000')) Traceback (most recent call last): ... ValueError: The column label must be a string, but a int was given >>> players.with_columns('salaries', make_array('$600,000')) Traceback (most recent call last): ... ValueError: Column length mismatch. New column does not have the same number of rows as table.
- with_row(row)[source]
  Return a table with an additional row.
  - Args:
    - row (sequence): A value for each column.
  - Raises:
    - ValueError: If the row length differs from the column count.
>>> tiles = Table(make_array('letter', 'count', 'points')) >>> tiles.with_row(['c', 2, 3]).with_row(['d', 4, 2]) letter | count | points c | 2 | 3 d | 4 | 2
- with_rows(rows)[source]
  Return a table with additional rows.
  - Args:
    - rows (sequence of sequences): Each row has a value per column. If rows is a 2-d array, its shape must be (_, n) for n columns.
  - Raises:
    - ValueError: If a row length differs from the column count.
>>> tiles = Table(make_array('letter', 'count', 'points')) >>> tiles.with_rows(make_array(make_array('c', 2, 3), ... make_array('d', 4, 2))) letter | count | points c | 2 | 3 d | 4 | 2
JointDistribution
- class prob140.JointDistribution(data=None, index=None, columns=None, dtype=None, copy=False)[source]
- both_marginals()[source]
  Finds the marginal distribution of both variables.
Returns: - JointDistribution Table.
Examples
>>> dist1 = Table().values([0, 1], [2, 3]).probability([0.1, 0.2, 0.3, 0.4]).to_joint() >>> dist1.both_marginals() X=0 X=1 Sum: Marginal of Y Y=3 0.2 0.4 0.6 Y=2 0.1 0.3 0.4 Sum: Marginal of X 0.3 0.7 1.0
- conditional_dist(label, given='', show_ev=False)[source]
  Given the random variable label, finds the conditional distribution of the other variable.
  Parameters:
    - label : String
      Variable given.
  Returns:
    - JointDistribution Table
  Examples
>>> coins = Table().values('Coin1', ['H', 'T'], 'Coin2', ['H','T']).probability(np.array([0.24, 0.36, 0.16,0.24])).to_joint() >>> coins.conditional_dist('Coin1', 'Coin2') Coin1=H Coin1=T Sum Dist. of Coin1 | Coin2=H 0.6 0.4 1.0 Dist. of Coin1 | Coin2=T 0.6 0.4 1.0 Marginal of Coin1 0.6 0.4 1.0 >>> coins.conditional_dist('Coin2', 'Coin1') Dist. of Coin2 | Coin1=H Dist. of Coin2 | Coin1=T Marginal of Coin2 Coin2=H 0.4 0.4 0.4 Coin2=T 0.6 0.6 0.6 Sum 1.0 1.0 1.0
- classmethod from_table(table, reverse=True)[source]
  Constructs a JointDistribution from a Table.
  Parameters:
    - table : Table
      3-column table with RV1, RV2, and joint probability.
    - reverse : bool (optional)
      If True, vertical random variables are reversed. (Default: True)
  Returns:
    - JointDistribution
- get_possible_values(label='')[source]
  Returns the possible values. If a label is given, returns the values for that random variable. Automatically converts to float/int if relevant.
  Parameters:
    - label : str
      Name of random variable.
  Returns:
    - List of values.
- marginal(label)[source]
  Returns the marginal distribution of label.
  Parameters:
    - label : String
      The label of the variable of which we want to find the marginal distribution.
  Returns:
    - JointDistribution Table
  Examples
>>> dist2 = Table().values('Coin1', ['H', 'T'], 'Coin2', ['H', 'T']).probability(np.array([0.24, 0.36, 0.16, 0.24])).to_joint() >>> dist2.marginal('Coin1') Coin1=H Coin1=T Coin2=T 0.36 0.24 Coin2=H 0.24 0.16 Sum: Marginal of Coin1 0.60 0.40 >>> dist2.marginal('Coin2') Coin1=H Coin1=T Sum: Marginal of Coin2 Coin2=T 0.36 0.24 0.6 Coin2=H 0.24 0.16 0.4
- marginal_dist(label)[source]
  Finds the marginal distribution of label and returns it as a single-variable distribution.
  Parameters:
    - label
      The label of the variable of which we want to find the marginal distribution.
  Returns:
    - Table
      Single variable distribution of label.
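A hedged sketch (not from the original docs) built on the coins joint distribution from the conditional_dist and marginal examples; the probabilities follow the marginal of Coin1 shown above, while the output column labels are an assumption:
>>> coins = Table().values('Coin1', ['H', 'T'], 'Coin2', ['H', 'T']).probability(np.array([0.24, 0.36, 0.16, 0.24])).to_joint()
>>> coins.marginal_dist('Coin1')
Coin1 | Probability
H | 0.6
T | 0.4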
MarkovChain
- class prob140.MarkovChain(states, transition_matrix)[source]
  A class for representing, simulating, and computing Markov Chains.
- distribution(starting_condition, steps=1)[source]
  Finds the distribution of states after a number of steps, given a starting condition.
  Parameters:
    - starting_condition : state or Table
      The initial distribution or the original state.
    - steps : integer
      Number of transition steps.
  Returns:
    - Table
      Shows the distribution after the given number of steps.
  Examples
>>> states = make_array('A', 'B') >>> transition_matrix = np.array([[0.1, 0.9], ... [0.8, 0.2]]) >>> mc = MarkovChain.from_matrix(states, transition_matrix) >>> start = Table().states(make_array('A', 'B')).probability(make_array(0.8, 0.2)) >>> mc.distribution(start) State | Probability A | 0.24 B | 0.76 >>> mc.distribution(start, 0) State | Probability A | 0.8 B | 0.2 >>> mc.distribution(start, 3) State | Probability A | 0.3576 B | 0.6424
- expected_return_time()[source]
  Finds the expected return time of the Markov Chain (1 / steady state).
  Returns:
    - Table
      Expected return time.
  Examples
>>> states = ['A', 'B'] >>> transition_matrix = np.array([[0.1, 0.9], ... [0.8, 0.2]]) >>> mc = MarkovChain.from_matrix(states, transition_matrix) >>> mc.expected_return_time() Value | Expected Return Time A | 1.5 B | 3
- classmethod from_matrix(states, transition_matrix)[source]
  Constructs a MarkovChain from a transition matrix.
  Parameters:
    - states : iterable
      List of states.
    - transition_matrix : ndarray
      Square transition matrix.
  Returns:
    - MarkovChain
  Examples
>>> states = [1, 2] >>> transition_matrix = np.array([[0.1, 0.9], ... [0.8, 0.2]]) >>> MarkovChain.from_matrix(states, transition_matrix) 1 2 1 0.1 0.9 2 0.8 0.2
- classmethod from_table(table)[source]
  Constructs a Markov Chain from a Table.
  Parameters:
    - table : Table
      A table with three columns for source state, target state, and probability.
  Returns:
    - MarkovChain
  Examples
>>> table = Table().states(make_array('A', 'B')) ... .transition_probability(make_array(0.5, 0.5, 0.3, 0.7)) >>> table Source | Target | Probability A | A | 0.5 A | B | 0.5 B | A | 0.3 B | B | 0.7 >>> MarkovChain.from_table(table) A B A 0.5 0.5 B 0.3 0.7
- classmethod from_transition_function(states, transition_function)[source]
  Constructs a MarkovChain from a transition function.
  Parameters:
    - states : iterable
      List of states.
    - transition_function : function
      Bivariate transition function that maps two states to a probability.
  Returns:
    - MarkovChain
  Examples
>>> states = make_array(1, 2) >>> def transition(s1, s2): ... if s1 == s2: ... return 0.7 ... else: ... return 0.3 >>> MarkovChain.from_transition_function(states, transition) 1 2 1 0.7 0.3 2 0.3 0.7
- get_transition_matrix(steps=1)[source]
  Returns the transition matrix after n steps as a numpy matrix.
  Parameters:
    - steps : int (optional)
      Number of steps. (default: 1)
  Returns:
    - Transition matrix
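A sketch (not from the original docs) using the two-state chain that recurs in these examples; the step-2 matrix is the one-step matrix squared, and the exact numpy formatting may differ:
>>> states = make_array('A', 'B')
>>> transition_matrix = np.array([[0.1, 0.9],
...                               [0.8, 0.2]])
>>> mc = MarkovChain.from_matrix(states, transition_matrix)
>>> mc.get_transition_matrix()
array([[ 0.1,  0.9],
       [ 0.8,  0.2]])
>>> mc.get_transition_matrix(steps=2)
array([[ 0.73,  0.27],
       [ 0.24,  0.76]])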
- log_prob_of_path(starting_condition, path)[source]
  Finds the log-probability of a path given a starting condition.
  May have better precision than prob_of_path.
  Parameters:
    - starting_condition : state or Distribution
      If a state, finds the log-probability of the path starting at that state. If a Distribution, finds the probability of the path with the first element sampled from the Distribution.
    - path : ndarray
      Array of states.
  Returns:
    - float
      Log of the probability.
  Examples
>>> states = make_array('A', 'B') >>> transition_matrix = np.array([[0.1, 0.9], ... [0.8, 0.2]]) >>> mc = MarkovChain.from_matrix(states, transition_matrix) >>> mc.log_prob_of_path('A', ['A', 'B', 'A']) -2.6310891599660815 >>> start = Table().states(['A', 'B']).probability([0.8, 0.2]) >>> mc.log_prob_of_path(start, ['A', 'B', 'A']) -0.55164761828624576
- plot_path(starting_condition, path)[source]
  Plots a Markov Chain's path.
  Parameters:
    - starting_condition : state
      State to start at.
    - path : iterable
      List of valid states.
  Examples
>>> states = ['A', 'B'] # Works with all state data types! >>> transition_matrix = np.array([[0.1, 0.9], ... [0.8, 0.2]]) >>> mc = MarkovChain.from_matrix(states, transition_matrix) >>> mc.plot_path('B', mc.simulate_path('B', 20)) <Plot of a Markov Chain that starts at 'B' and takes 20 steps>
- prob_of_path(starting_condition, path)[source]
  Finds the probability of a path given a starting condition.
  Parameters:
    - starting_condition : state or Distribution
      If a state, finds the probability of the path starting at that state. If a Distribution, finds the probability of the path with the first element sampled from the Distribution.
    - path : ndarray
      Array of states.
  Returns:
    - float
      Probability of the path.
  Examples
>>> states = ['A', 'B'] >>> transition_matrix = np.array([[0.1, 0.9], ... [0.8, 0.2]]) >>> mc = MarkovChain.from_matrix(states, transition_matrix) >>> mc.prob_of_path('A', ['A', 'B', 'A']) 0.072 >>> 0.1 * 0.9 * 0.8 0.072 >>> start = Table().states(['A', 'B']).probability([0.8, 0.2]) >>> mc.prob_of_path(start, ['A', 'B', 'A']) 0.576 >>> 0.8 * 0.9 * 0.8 0.576
- simulate_path(starting_condition, steps, plot_path=False)[source]
  Simulates a path of n steps with a specific starting condition.
  Parameters:
    - starting_condition : state or Distribution
      If a state, simulates n steps starting at that state. If a Distribution, samples from that distribution to find the starting state.
    - steps : int
      Number of steps to take.
    - plot_path : bool
      If True, plots the simulated path.
  Returns:
    - ndarray
      Array of sampled states.
  Examples
>>> states = ['A', 'B'] >>> transition_matrix = np.array([[0.1, 0.9], ... [0.8, 0.2]]) >>> mc = MarkovChain.from_matrix(states, transition_matrix) >>> mc.simulate_path('A', 10) array(['A', 'A', 'B', 'A', 'B', 'A', 'B', 'B', 'A', 'B', 'B'])
- steady_state()[source]
  Finds the stationary distribution of the Markov Chain.
  Returns:
    - Table
      Distribution.
  Examples
>>> states = ['A', 'B'] >>> transition_matrix = np.array([[0.1, 0.9], ... [0.8, 0.2]]) >>> mc = MarkovChain.from_matrix(states, transition_matrix) >>> mc.steady_state() Value | Probability A | 0.666667 B | 0.333333
- to_pandas()[source]
  Returns the Pandas DataFrame representation of the MarkovChain.
- transition_matrix(steps=1)[source]
  Returns the transition matrix after n steps visually as a Pandas DataFrame.
  Parameters:
    - steps : int (optional)
      Number of steps. (default: 1)
  Returns:
    - Pandas DataFrame