Conversation
Pull request overview
This PR updates deprecated pandas.DataFrame.append() usage to supported pandas patterns to keep transcript profiling and assay ordering utilities working on newer pandas versions.
Changes:
- Replaced `DataFrame.append(Series)` with `pd.concat(...)`-based row additions in SAM transcript profile generation.
- Replaced repeated `append()` calls in IDT bulk input generation with a list-accumulation approach and a single DataFrame construction.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| trxtools/sam/SAMtranscripts.py | Swaps deprecated `DataFrame.append()` calls for `pd.concat()` when accumulating per-transcript profiles. |
| trxtools/folding/assays.py | Reworks bulk IDT output construction to avoid deprecated `append()` by collecting rows and building the DataFrame once. |
```diff
+ # output_df = output_df.append(s1.rename(n))
+ output_df = pd.concat([output_df, s1.rename(n).to_frame().T])

  # deletions
  s2 = transcript2profileDeletions(l, length=length)
- outputDel_df = outputDel_df.append(s2.rename(n))
+ outputDel_df = pd.concat([outputDel_df, s2.rename(n).to_frame().T])

  # deletions ext
  s3 = transcript2profileDeletions(l, length=length, expand=expand)
- outputDelExt_df = outputDelExt_df.append(s3.rename(n))
+ outputDelExt_df = pd.concat([outputDelExt_df, s3.rename(n).to_frame().T])
```
`pd.concat(...)` is executed once per transcript inside the loop for `output_df`/`outputDel_df`/`outputDelExt_df`, which repeatedly copies and reallocates the growing DataFrames, giving quadratic scaling overall. To keep this efficient on large inputs, accumulate the per-transcript rows in lists and concatenate once after the loop (one concatenation per output DataFrame).
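A minimal sketch of the suggested pattern, using made-up per-transcript profile data in place of the real `transcript2profile*` outputs (the names `profiles`, `txA`, `txB` are illustrative, not from the PR):

```python
import pandas as pd

# Stand-ins for the per-transcript Series that transcript2profile* would return.
profiles = {
    "txA": pd.Series([1, 2, 3]),
    "txB": pd.Series([4, 5, 6]),
}

rows = []  # accumulate one labelled Series per transcript...
for n, s1 in profiles.items():
    rows.append(s1.rename(n))  # rename keeps the transcript name as the row label

# ...then build the DataFrame with a single concat after the loop,
# avoiding the per-iteration copy of the whole growing DataFrame.
output_df = pd.concat(rows, axis=1).T
```

The same list-then-concat step would be repeated for each of the three output DataFrames.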
```diff
- output = pd.DataFrame()
+ rows = []
  for i, row in working.iterrows():
-     output = output.append(pd.Series(index=['sequence'], data=row["template"], name=str(i) + "_template"))
-     output = output.append(pd.Series(index=['sequence'], data=row["non-template"], name=str(i) + "_non-temp"))
+     rows.append(pd.Series(index=['sequence'], data=row["template"], name=str(i) + "_template"))
+     rows.append(pd.Series(index=['sequence'], data=row["non-template"], name=str(i) + "_non-temp"))
+ output = pd.DataFrame(rows)
```
`output = pd.DataFrame()` is never read before being overwritten by `output = pd.DataFrame(rows)` after the loop, so the initial assignment is redundant. Consider removing the first initialization and constructing `output` from `rows` directly to reduce clutter.
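The tidied pattern can be sketched as follows; the `working` DataFrame here is dummy data standing in for the real IDT input, so the sequences are illustrative only:

```python
import pandas as pd

# Dummy stand-in for the real `working` table of template/non-template sequences.
working = pd.DataFrame({
    "template": ["ACGT", "TTAA"],
    "non-template": ["TGCA", "AATT"],
})

rows = []  # no empty-DataFrame placeholder needed up front
for i, row in working.iterrows():
    rows.append(pd.Series(index=["sequence"], data=row["template"], name=str(i) + "_template"))
    rows.append(pd.Series(index=["sequence"], data=row["non-template"], name=str(i) + "_non-temp"))

output = pd.DataFrame(rows)  # one construction from the accumulated rows
```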
```diff
  s1 = transcript2profile(l, length=length)
- output_df = output_df.append(s1.rename(n))
+ # output_df = output_df.append(s1.rename(n))
+ output_df = pd.concat([output_df, s1.rename(n).to_frame().T])
```
Building output_df by calling pd.concat(...) on every groupby iteration repeatedly reallocates/copies the whole DataFrame, which can get very slow for many transcripts (same scaling problem as df.append). Collect the per-transcript rows/frames in a list and pd.concat once after the loop (keeping n as the row index) to avoid O(n²) behavior.
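A sketch of the fix using the `to_frame().T` idiom already present in the PR: collect the one-row frames in a list and concatenate once after the loop. `group2profile` and the `data` dict are hypothetical stand-ins for the real `transcript2profile` and the groupby iteration:

```python
import pandas as pd

def group2profile(values):
    """Hypothetical stand-in for transcript2profile: returns a per-position Series."""
    return pd.Series(values)

data = {"tx1": [0, 1, 0], "tx2": [2, 2, 1]}  # stand-in for the groupby over transcripts

frames = []
for n, l in data.items():
    s1 = group2profile(l)
    frames.append(s1.rename(n).to_frame().T)  # one-row frame labelled by n

# single concat keeps n as the row index and avoids O(n^2) copying
output_df = pd.concat(frames)
```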
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>